Large Language Models (LLMs) are powerful tools for language understanding and generation, but they suffer from a fundamental limitation: the context window. An LLM can only process a limited number of tokens at a time; even Claude, for example, cannot ingest a very long document in full. This makes tasks such as long-document analysis and question answering particularly challenging.

To overcome this limitation, many systems use a technique called Retrieval-Augmented Generation (RAG). Rather than feeding the entire document to the model, RAG dynamically retrieves only the portions relevant to the user’s query and places them in the model’s context window.

1. Vector-Based RAG: The Traditional Approach

Vector-based RAG relies on semantic embeddings and vector databases to identify relevant text chunks.

Preprocessing Stage

  1. Split the document into smaller chunks.
  2. Embed each chunk into a vector space using an embedding model.
  3. Store the resulting vectors in a vector database (e.g., FAISS, Pinecone).
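Below is a minimal sketch of this preprocessing stage, assuming the `sentence-transformers` and `faiss-cpu` packages; the model name, chunk size, and input file are illustrative choices, not part of the approach itself.

```python
# Preprocessing sketch: chunk -> embed -> index.
import faiss
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunking by characters ("hard chunking")."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here

document = open("long_document.txt").read()        # hypothetical input document
chunks = chunk_text(document)

# Embed every chunk into the same vector space.
embeddings = model.encode(chunks, normalize_embeddings=True)

# Store the vectors in an in-memory FAISS index. Inner product equals cosine
# similarity here because the embeddings are normalized.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```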

Query Stage

  1. Embed the user query using the same embedding model.
  2. Search the database for semantically similar chunks.
  3. Retrieve the top-k results and use them to form the model’s input context.
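Continuing the sketch above, the query stage embeds the question with the same model, searches the index, and assembles the retrieved chunks into the prompt; `top_k` and the prompt template are illustrative assumptions.

```python
# Query stage, reusing the model, chunks, and index built above.
def retrieve(query: str, top_k: int = 5) -> list[str]:
    # 1. Embed the user query with the same embedding model.
    query_vec = model.encode([query], normalize_embeddings=True)
    # 2. Search the vector index for the most similar chunks.
    scores, ids = index.search(query_vec, top_k)
    # 3. Return the top-k chunks to place in the model's input context.
    return [chunks[i] for i in ids[0]]

context = retrieve("What does the contract say about termination?")
prompt = "Answer using only the context below.\n\n" + "\n---\n".join(context)
# `prompt` (plus the user's question) is what is actually sent to the LLM.
```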

Limitations

While simple and effective for short texts, vector-based RAG faces several major challenges:

  1. Query and knowledge space mismatch

Vector retrieval assumes that the most semantically similar text to the query is also the most relevant. But this isn’t always true — queries often express intent, not content.

  2. Semantic similarity is not equivalent to relevance

This is especially problematic in technical or legal documents, where many passages share near-identical semantics but differ in relevance.

  3. Hard chunking breaks semantic integrity

Documents are split into fixed-size chunks (e.g., 512 or 1000 tokens) for embedding. This “hard chunking” often cuts through sentences, paragraphs, or sections, fragmenting context.
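A tiny, self-contained illustration of the problem (the text and chunk size are made up for the example):

```python
# Fixed-size splitting cuts through the middle of the first sentence, so
# neither chunk embeds the complete statement.
text = ("The warranty is void if the device is opened by anyone other than "
        "an authorized technician. Repairs must be requested within 30 days.")

chunk_size = 80  # deliberately small so the cut is easy to see
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for c in chunks:
    print(repr(c))
# The boundary falls inside the first sentence; a query about warranty
# conditions may match neither fragment well, even though the full
# sentence answers it.
```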

  4. Cannot integrate chat history

Each query is treated independently. The retriever doesn’t know what’s been asked or answered before.