### Embeddings in the LLM + RAG world

| What they are | How they’re made | Why we care |
|---------------|------------------|-------------|
| **Vector representations of text** – a numeric “fingerprint” that captures meaning, syntax, and context. | • Feed‑forward or transformer models (BERT, GPT, Sentence‑Transformers) produce hidden states.<br>• Take the final layer (or an average/CLS token) → a dense vector (e.g., 768‑dim); see the encoding sketch below the table.<br>• Optional dimensionality reduction (PCA, UMAP) for speed. | • **Similarity search**: cosine similarity lets us find passages that “mean” the same thing.<br>• **Scalability**: vector indexes with approximate nearest‑neighbor search scale to billions of passages.<br>• **Privacy / obfuscation**: the index stores numeric embeddings rather than raw text (a layer of obfuscation, not strong encryption). |
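As a concrete illustration of the encoding step, here is a minimal sketch using the `sentence-transformers` library (the checkpoint `all-MiniLM-L6-v2` is an arbitrary choice for the example, not something the text above prescribes):

```python
from sentence_transformers import SentenceTransformer

# Any sentence-transformers checkpoint works; this one is small and fast
# and produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Embeddings map text to dense vectors.",
    "Dense vectors let us compare meaning numerically.",
]

# encode() returns one dense vector per input sentence.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384)
```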
#### Key steps in a typical RAG pipeline

1. **Encode documents** → embeddings → store in a vector‑DB (FAISS, Pinecone, Milvus); steps 1–3 are sketched in code after this list.
2. **Query time** – encode the user question → get top‑k similar document vectors via nearest‑neighbor search.
3. **Retrieve** the corresponding text passages.
4. **Fuse** with the LLM: feed the retrieved context + original query into the model to generate an answer.
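A minimal, self-contained sketch of steps 1–3 with FAISS as the vector store (the toy `docs` list and the model choice are illustrative assumptions; step 4’s prompt assembly is left out):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary example encoder

# Step 1: encode documents and store them in a vector index.
docs = [
    "FAISS is a library for efficient similarity search.",
    "RAG pipelines retrieve context before generation.",
    "Cosine similarity compares the angle between two vectors.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on unit vectors
index.add(doc_vecs)

# Step 2: encode the user question and run a top-k nearest-neighbor search.
query_vec = model.encode(["How do I search by meaning?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)

# Step 3: retrieve the matching passages; these would be concatenated with
# the original query and sent to the LLM (step 4, omitted here).
context = [docs[i] for i in ids[0]]
print(context)
```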
#### Practical tips

| Tip | Why it matters |
|-----|----------------|
| **Choose a domain‑specific encoder** (e.g., BioBERT for biomedical) | Improves relevance of retrieved passages. |
| **Normalize vectors** (unit length) before similarity search | Cosine similarity becomes a plain dot product, which is fast and numerically stable (see the check below the table). |
| **Keep embeddings up to date** when documents change | Avoids stale context that can mislead the LLM. |
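To see why unit-length normalization turns cosine similarity into a plain dot product, here is a small NumPy check (the vectors are made up for the example):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity on the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After scaling each vector to unit length, the dot product is identical.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.isclose(cosine, a_hat @ b_hat))  # True
```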
#### Common pitfalls

- **Embedding drift**: using a model version that changes the embedding space can break retrieval (a simple guard is sketched below).
- **Over‑compression**: too many dimensions removed → loss of nuance.
- **Bias propagation**: embedding models encode biases present in training data, which can surface in retrieved content.
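One cheap guard against embedding drift is to store the encoder’s identity alongside the index and refuse to mix spaces. A sketch, assuming index metadata lives in a sidecar JSON file (the file name and keys are hypothetical):

```python
import json

META_PATH = "index_meta.json"  # hypothetical sidecar file next to the index

def save_meta(model_name: str, dim: int) -> None:
    # Record which encoder (and dimensionality) produced the stored vectors.
    with open(META_PATH, "w") as f:
        json.dump({"model": model_name, "dim": dim}, f)

def check_meta(model_name: str, dim: int) -> None:
    # Fail loudly if the query-time encoder differs from the index-time one.
    with open(META_PATH) as f:
        meta = json.load(f)
    if meta["model"] != model_name or meta["dim"] != dim:
        raise RuntimeError(
            f"Embedding space mismatch: index built with {meta['model']} "
            f"({meta['dim']}-dim), queried with {model_name} ({dim}-dim)."
        )
```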
---

In short, embeddings are the *bridge* between raw text and the LLM’s reasoning engine: they turn unstructured documents into a searchable, vectorized space that lets the model pull out just the right pieces of information for each question.