Embeddings in the LLM + RAG world
| What they are | How they’re made | Why we care |
|---|---|---|
| Vector representations of text – a numeric “fingerprint” that captures meaning, syntax, and context. | • Transformer encoders (BERT, Sentence‑Transformers) or decoder models (GPT) produce per‑token hidden states. • Pool the final layer (mean pooling or the CLS token) → a dense vector (e.g., 768‑dim). • Optional dimensionality reduction (PCA, UMAP) for speed and storage. | • Similarity search: cosine similarity finds passages that “mean” the same thing as the query. • Scalability: approximate nearest‑neighbor indexes keep search fast even over millions or billions of vectors. • Privacy / obfuscation: only numeric embeddings are stored in the index, though embeddings can still leak content and are not a substitute for access controls. |
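As a concrete illustration of the “how they’re made” and “similarity search” columns, here is a minimal sketch. The sentence-transformers package and the all-MiniLM-L6-v2 model (384‑dim vectors) are assumptions for the example, not requirements; any encoder that returns dense vectors works the same way.

```python
# Minimal sketch: encode two sentences and compare them with cosine similarity.
# Assumes the sentence-transformers package and the all-MiniLM-L6-v2 model
# (384-dimensional vectors); these are illustrative choices, not requirements.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "How do I reset my password?"
b = "Steps to recover a lost account password"
vec_a, vec_b = model.encode([a, b])  # two dense vectors, shape (384,)

# Cosine similarity: dot product of the vectors divided by their norms.
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {cosine:.3f}")  # semantically close sentences score high
```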
Key steps in a typical RAG pipeline
- Encode documents → embeddings → store in a vector‑DB (FAISS, Pinecone, Milvus).
- Query time – encode the user question → get top‑k similar document vectors via nearest‑neighbor search.
- Retrieve the corresponding text passages.
- Fuse with the LLM: feed the retrieved context + original query into the model to generate an answer.
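The sketch below strings these four steps together end to end. It assumes faiss-cpu and sentence-transformers are installed; the model name, sample documents, and k=2 are illustrative placeholders rather than recommendations.

```python
# Minimal RAG retrieval sketch, assuming faiss-cpu and sentence-transformers.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Reset your password from the account settings page.",
    "Our refund policy allows returns within 30 days.",
    "Two-factor authentication can be enabled under security settings.",
]

# 1. Encode documents and store them in a FAISS index.
#    Inner product on unit-length vectors is equivalent to cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# 2. Encode the user question and fetch the top-k nearest document vectors.
query = "How can I change my password?"
q_vec = model.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q_vec, 2)

# 3. Retrieve the corresponding text passages.
context = "\n".join(docs[i] for i in ids[0])

# 4. Fuse with the LLM: pass the retrieved context plus the original query.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # send this prompt to your LLM of choice to generate the answer
```

A managed vector DB such as Pinecone or Milvus would replace the in-memory FAISS index here, but the encode → search → prompt flow stays the same.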
Practical tips
| Tip | Why it matters |
|---|---|
| Choose a domain‑specific encoder (e.g., BioBERT for biomedical) | Improves relevance of retrieved passages. |
| Normalize vectors (unit length) before similarity search | Cosine distance becomes dot product, which is fast and numerically stable. |
| Keep embeddings up to date when documents change | Avoid stale context that can mislead the LLM. |
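To illustrate the normalization tip above: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product. The snippet below uses random placeholder vectors and assumes only NumPy.

```python
# Sketch of the normalization tip: cosine similarity on unit-length vectors
# is just a dot product. The vectors here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 768)).astype("float32")    # pretend document embeddings
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # scale each row to unit length

q = rng.normal(size=768).astype("float32")
q /= np.linalg.norm(q)

scores = vecs @ q                 # dot product == cosine similarity on normalized vectors
print(scores.argsort()[::-1])     # document ranking, best match first
```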
Common pitfalls
- Embedding drift: switching encoder models or versions changes the embedding space; query and document vectors must come from the same model, so re‑encode the corpus after any change or retrieval quality silently degrades.
- Over‑compression: Too many dimensions removed → loss of nuance.
- Bias propagation: Embedding models encode biases present in training data, which can surface in retrieved content.
In short, embeddings are the bridge between raw text and the LLM’s reasoning engine—turning unstructured documents into a searchable, vectorized space that lets the model pull out just the right pieces of information for each question.