【Coursera GenAI with LLM】 Week 3 LLM-powered applications Class Notes

Model optimizations to improve application performance

  1. Distillation: uses a larger model, the teacher, to train a smaller model, the student. You freeze the teacher's weights and generate completions with both models; the difference between the teacher's and student's output distributions is the distillation loss (see the sketch after this list). Minimizing it updates the student's weights, in its final prediction layer or hidden layers. You then use the smaller model for inference to lower your storage and compute budget.

  2. Quantization: post-training quantization transforms a model's weights to a lower-precision representation, such as 16-bit floating point or 8-bit integers, which reduces the memory footprint of your model (see the sketch after this list).

  3. Pruning: removes redundant model parameters that contribute little to the model's performance, such as weights with values close to zero (see the sketch after this list).
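A minimal PyTorch sketch of the distillation loss, assuming teacher and student produce logits over the same vocabulary; the temperature of 2.0 is illustrative. In practice this soft-label loss is combined with the ordinary cross-entropy loss on the ground-truth labels (the student loss).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label loss: KL divergence between softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Example: a batch of 4 positions over a 10-token vocabulary
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # teacher weights stay frozen
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                              # gradients flow only into the student
```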
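A minimal sketch of symmetric 8-bit post-training quantization on a single weight tensor; production toolchains add per-channel scales, calibration data, and quantized kernels, but the storage saving comes from the same idea.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric post-training quantization: float32 weights -> int8 plus one scale."""
    scale = w.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale                 # approximate weights used at inference

w = torch.randn(256, 256)                              # float32: 4 bytes per value
q, scale = quantize_int8(w)                            # int8: 1 byte per value
print("max quantization error:", (w - dequantize(q, scale)).abs().max().item())
```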
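A minimal sketch of one common pruning criterion, magnitude pruning, which zeroes the weights with the smallest absolute values; the `sparsity=0.5` setting is illustrative.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

w = torch.randn(4, 4)
pruned = magnitude_prune(w, sparsity=0.5)
print((pruned == 0).float().mean())                    # ~0.5 of the weights are now zero
```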

Cheat Sheet

RAG (Retrieval Augmented Generation): retrieves documents relevant to the user's query from an external knowledge source and adds them to the prompt, so the LLM can ground its completion in up-to-date or domain-specific information it was never trained on.
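A toy end-to-end sketch of the RAG pattern. The bag-of-words `embed` and the in-memory `documents` list are hypothetical stand-ins for a real dense encoder and vector database; only the retrieve-then-augment flow is the point.

```python
import math
from collections import Counter

documents = [
    "Distillation trains a small student model to match a frozen teacher.",
    "Quantization stores model weights in int8 to cut the memory footprint.",
    "Pruning removes weights that contribute little to model performance.",
]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses a trained encoder
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

question = "How does quantization reduce memory?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)   # this augmented prompt is what gets sent to the LLM
```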

Chain-of-thought prompting: include intermediate reasoning steps in the prompt's worked examples, so the model reasons through a problem step by step instead of jumping straight to an answer.
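An illustrative one-shot chain-of-thought prompt; the worked example is the well-known tennis-ball problem from Wei et al. (2022). Because the exemplar spells out its reasoning, the model tends to spell out its own before giving the final answer.

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
   5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought
   6 more, how many apples do they have?
A:
```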

Program-Aided Language Model (PAL)

  • LLM + code interpreter --> works around LLMs' unreliable arithmetic: the model writes the calculation as code, and the interpreter executes it (sketched below)
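A minimal sketch of the PAL pattern: the string below stands in for a model completion (prompted with PAL-style examples to answer in Python), and the surrounding code executes it, so the interpreter, not the LLM, does the math.

```python
# Hypothetical stand-in for an LLM completion that answers in code
llm_generated_code = """
tennis_balls = 5
bought_balls = 2 * 3   # 2 cans of 3 balls each
answer = tennis_balls + bought_balls
"""

namespace = {}
exec(llm_generated_code, namespace)   # the interpreter, not the LLM, does the arithmetic
print(namespace["answer"])            # 11
```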

Orchestrator: manages the flow of information between the LLM, external applications, and external data sources, e.g., LangChain (a toy loop is sketched below).
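A toy orchestration loop, not LangChain's actual API: `call_llm` is a hypothetical stub for a real model call, and `TOOLS` is a toy registry of external apps. The shape of the loop, routing model requests to tools and feeding results back, is the point.

```python
TOOLS = {"calculator": lambda expr: str(eval(expr))}   # toy tool registry

def call_llm(prompt: str) -> str:
    # Stub: a real LLM would decide whether to answer or request a tool.
    # This one asks for the calculator once, then answers from the result.
    if "TOOL RESULT" in prompt:
        return "The answer is " + prompt.rsplit("TOOL RESULT: ", 1)[1]
    return "TOOL calculator 6 * 7"

def orchestrate(user_input: str, max_steps: int = 5) -> str:
    prompt = user_input
    for _ in range(max_steps):                 # cap the number of tool round-trips
        reply = call_llm(prompt)
        if reply.startswith("TOOL "):
            _, name, arg = reply.split(" ", 2)
            prompt += f"\nTOOL RESULT: {TOOLS[name](arg)}"   # feed result back
        else:
            return reply
    return reply

print(orchestrate("What is 6 * 7?"))   # -> The answer is 42
```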

ReAct: a prompting framework that synergizes reasoning and acting in LLMs (Yao et al., 2022). Each step of the prompt interleaves:

  • Thought: reason about the current situation
  • Action: an external task the model can carry out, chosen from an allowed set of actions--search, lookup, finish
  • Observation: the result of the action, fed back into the context so the model can reason about it in the next thought (an example trace follows this list)
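An abbreviated Thought/Action/Observation trace in the ReAct style (adapted from an example in Yao et al., 2022); the observation snippets are illustrative.

```
Question: Which magazine was started first, Arthur's Magazine or First for Women?
Thought: I need to search both magazines and compare their start dates.
Action: search[Arthur's Magazine]
Observation: Arthur's Magazine (1844-1846) was an American literary periodical ...
Thought: Arthur's Magazine started in 1844. Next I need First for Women.
Action: search[First for Women]
Observation: First for Women is a women's magazine launched in 1989 ...
Thought: 1844 is earlier than 1989, so Arthur's Magazine was started first.
Action: finish[Arthur's Magazine]
```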
