Embeddings in the LLM + RAG world

What they are
Vector representations of text – a numeric “fingerprint” that captures meaning, syntax, and context.

How they’re made
• Feed-forward or transformer models (BERT, GPT, Sentence-Transformers) produce hidden states.
• Pool the final layer (a mean over token states, or the CLS token) → a dense vector (e.g., 768-dim).
• Optionally reduce dimensionality (PCA, UMAP) for speed.
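
In code, the encoding step is only a few lines. The sketch below assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (a 384-dimensional model); both choices are illustrative, and any bi-encoder checkpoint works the same way.

```python
# Minimal encoding sketch: texts in, dense vectors out.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative 384-dim model

passages = [
    "Embeddings map text to dense vectors.",
    "The cat sat on the mat.",
]

# encode() runs the transformer and mean-pools the final hidden states.
vectors = model.encode(passages)     # NumPy array, shape (2, 384)
print(vectors.shape, vectors.dtype)  # (2, 384) float32
```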

Why we care
• Similarity search: cosine similarity lets us find passages that “mean” the same thing.
• Scalability: vector indexes scale to billions of entries, and approximate nearest-neighbor search keeps queries fast at that size.
• Privacy / obfuscation: the index stores only numeric embeddings rather than the raw text – useful as obfuscation, though not a strong privacy guarantee, since embeddings can sometimes be partially inverted.
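
To make the similarity-search point concrete: cosine similarity is the dot product of two vectors divided by the product of their norms, and retrieval is just ranking stored vectors by that score against the query vector. The 4-dimensional vectors and passage names below are toy stand-ins for real embeddings.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product scaled by both vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" standing in for real model output.
query = np.array([0.9, 0.1, 0.0, 0.2])
passages = {
    "refund policy":  np.array([0.8, 0.2, 0.1, 0.3]),
    "shipping times": np.array([0.1, 0.9, 0.4, 0.0]),
    "return address": np.array([0.7, 0.0, 0.2, 0.4]),
}

# Rank passages by similarity to the query; the top hits are what RAG retrieves.
for name, vec in sorted(passages.items(), key=lambda kv: cosine_sim(query, kv[1]), reverse=True):
    print(f"{name}: {cosine_sim(query, vec):.3f}")
```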

Key steps in a typical RAG pipeline

  1. Encode documents → embeddings → store in a vector‑DB (FAISS, Pinecone, Milvus).
  2. Query time – encode the user question → get top‑k similar document vectors via nearest‑neighbor search.
  3. Retrieve the corresponding text passages.
  4. Fuse with the LLM: feed the retrieved context + original query into the model to generate an answer (the whole loop is sketched below).
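
Wired together, the four steps look roughly like the sketch below. It assumes the same sentence-transformers encoder as above, a FAISS flat index for exact inner-product search, and a placeholder llm_generate() function standing in for whatever LLM client is actually used; none of these choices are mandated by the pipeline itself.

```python
# Minimal retrieve-then-generate loop: sentence-transformers + FAISS + placeholder LLM call.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund window is 30 days from delivery.",
    "Standard shipping takes 3-5 business days.",
    "Returns must include the original packaging.",
]

# 1. Encode documents and store them in a vector index.
#    normalize_embeddings=True makes inner product equal to cosine similarity.
doc_vecs = encoder.encode(documents, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product search
index.add(doc_vecs)

def answer(question: str, k: int = 2) -> str:
    # 2. Encode the question and find the top-k nearest document vectors.
    q_vec = encoder.encode([question], normalize_embeddings=True).astype(np.float32)
    scores, ids = index.search(q_vec, k)
    # 3. Retrieve the corresponding text passages.
    context = "\n".join(documents[i] for i in ids[0])
    # 4. Fuse: retrieved context + original query go into the LLM prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)  # placeholder -- swap in your actual LLM client
```

At larger scales the flat index would normally be swapped for an approximate one (e.g., FAISS’s IVF or HNSW variants), trading a little recall for much faster search.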

Practical tips

• Choose a domain-specific encoder (e.g., BioBERT for biomedical text) – it improves the relevance of retrieved passages.
• Normalize vectors to unit length before similarity search – cosine similarity then reduces to a dot product, which is fast and numerically stable (see the check below).
• Re-embed documents whenever they change – stale context can mislead the LLM.
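
The normalization tip is easy to verify numerically: once both vectors are scaled to unit length, their plain dot product equals the cosine similarity, so the index only needs the cheaper operation.

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Scale each vector to unit length; a plain dot product now gives the same value.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(np.isclose(cosine, np.dot(a_unit, b_unit)))  # True
```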

Common pitfalls

• Encoding queries and documents with different models (or different model versions) – the vectors land in incompatible spaces and the similarity scores become meaningless.
• Using a general-purpose encoder on specialist text, so retrieval surfaces superficially similar but irrelevant passages.
• Letting the index drift out of sync with the source documents, so the LLM is fed stale context.

In short, embeddings are the bridge between raw text and the LLM’s reasoning engine—turning unstructured documents into a searchable, vectorized space that lets the model pull out just the right pieces of information for each question.