Large Language Models (LLMs) are powerful tools for language understanding and generation, but they suffer from a fundamental limitation: the context window. Most LLMs can only process a limited number of tokens at a time. For instance, Claude cannot ingest a document that exceeds its context window in full, which makes tasks like long-document analysis and question answering particularly challenging.
To overcome this limitation, many systems use a technique called Retrieval-Augmented Generation (RAG). Rather than feeding the entire document to the model, RAG dynamically retrieves only the portions relevant to the user's query and uses them to build the model's context.
Vector-based RAG relies on semantic embeddings and vector databases to identify relevant text chunks.
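The core loop looks roughly like the sketch below. It is a minimal illustration rather than any particular product's pipeline: it assumes the sentence-transformers package for embeddings, and the chunk size, top-k value, and prompt template are arbitrary choices.

```python
# Minimal vector-based RAG sketch: hard-chunk a document, embed the chunks,
# retrieve the most similar ones by cosine similarity, and build a prompt.
# Assumes the sentence-transformers package; chunk size, k, and the prompt
# template are illustrative, not a specific system's defaults.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> np.ndarray:
    vectors = model.encode(texts)                       # one vector per text
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def build_index(document: str, chunk_size: int = 500) -> tuple[list[str], np.ndarray]:
    # Fixed-size slices that ignore sentence boundaries (the "hard chunking" discussed below).
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return chunks, embed(chunks)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    scores = vectors @ embed([query])[0]                # cosine similarity to every chunk
    top = np.argsort(scores)[::-1][:k]                  # k most similar chunks
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # Only the retrieved chunks, not the whole document, go into the context window.
    context = "\n\n".join(retrieved)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```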
While simple and effective for short texts, vector-based RAG faces several major challenges:
1. Semantic similarity is not relevance
Vector retrieval assumes that the text most semantically similar to the query is also the most relevant. But this isn't always true: queries often express intent, not content. A query like "Which clause allows early termination?" may land on the chunks that mention termination most often rather than the clause that actually grants the right.
2. Semantic redundancy in specialized documents
This mismatch is especially problematic in technical or legal documents, where many passages share near-identical semantics but differ sharply in relevance, so similarity scores can barely separate them.
3. Hard chunking breaks semantic integrity
Documents are split into fixed-size chunks (e.g., 512 or 1000 tokens) for embedding. This “hard chunking” often cuts through sentences, paragraphs, or sections, fragmenting context.
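The failure is easy to reproduce with a toy splitter; the text and the 60-character chunk size below are made up for illustration.

```python
# Hard chunking: fixed-size slices that ignore sentence and section boundaries.
def hard_chunk(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

text = (
    "Section 4.2: The warranty covers manufacturing defects for 24 months. "
    "It does not cover damage caused by unauthorized repairs."
)
for i, chunk in enumerate(hard_chunk(text, 60)):
    print(i, repr(chunk))
# The 60-character boundaries fall mid-sentence, so no single chunk (and no
# single embedding) contains the complete warranty rule.
```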
4. Retrieval cannot integrate chat history
Each query is treated independently. The retriever doesn’t know what’s been asked or answered before.
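A small sketch makes the statelessness concrete. It reuses build_index() and retrieve() from the first example; the file name and the dialogue are invented.

```python
# The retriever sees only the current query string; conversation state lives
# entirely outside it. build_index() and retrieve() come from the first sketch;
# "contract.txt" and the dialogue are made up.
chunks, vectors = build_index(open("contract.txt").read())
history: list[str] = []

turn_1 = "What does the contract say about late-delivery penalties?"
context_1 = retrieve(turn_1, chunks, vectors)   # matches the penalty clause
history.append(turn_1)

turn_2 = "And how are they calculated?"
context_2 = retrieve(turn_2, chunks, vectors)   # "they" means nothing on its own:
                                                # the retriever never sees turn_1,
                                                # so it matches on "calculated" alone.
history.append(turn_2)
```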