[GenAI] Indexing overview
For a RAG system, indexing is not just about putting chunks into a vector database. It usually includes three major parts:
1. Chunk Splitting
This step focuses on how to split raw documents into smaller units that can be embedded and retrieved.
Common strategies:
- Semantic Chunking
- Recursive Chunking
2. Chunk Enhancement
This step focuses on making each chunk more self-contained and easier to retrieve.
Common techniques:
- Contextual Headers
- Document Augmentation
- Metadata Extraction
3. Index Management
This step focuses on how to maintain the quality, structure, and lifecycle of the index.
Common mechanisms:
- Deduplication and Cleaning
- Hierarchical Indexing
- Incremental Updates
Chunk Splitting
This section mainly covers how to split data into chunks before indexing or retrieval.
Semantic Chunking
Semantic chunking splits content based on meaning, instead of only using token size.
The core idea is to preserve semantic completeness and avoid breaking a sentence, paragraph, or logical section in the middle.
Common approaches:
- Split by sentence
- Split by paragraph
- Split by heading hierarchy
- Split at points where semantic similarity changes significantly
Example:
Bad:
'The elevator maintenance report shows that the door sensor failed because'
'of repeated signal loss during peak operating hours.'
Why it is bad: the sentence is split in the middle, so each chunk loses context.
Good:
The elevator maintenance report shows that the door sensor failed because of repeated signal loss during peak operating hours.
Why it is good: the sentence keeps its complete meaning.
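The "split by sentence" approach can be sketched as below. This is a minimal illustration, not a production splitter: the function names `splitIntoSentences` and `chunkBySentence` are hypothetical, the sentence detection is a naive regular expression, and character count stands in for token count.

```typescript
// Naive sentence detection: a run of text followed by ., !, or ?.
// Abbreviations and decimals will confuse it; a real tokenizer handles those.
function splitIntoSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Packs whole sentences into chunks of at most maxChars characters.
// A sentence is never cut in the middle; an oversized sentence becomes its own chunk.
function chunkBySentence(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitIntoSentences(text)) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const maintenanceReport =
  "The elevator maintenance report shows that the door sensor failed because of " +
  "repeated signal loss during peak operating hours. The technician replaced the sensor.";
const sentenceChunks = chunkBySentence(maintenanceReport, 130);
```

With a limit of 130 characters, the two sentences do not fit together, so each one ends up as a complete chunk instead of being cut at the limit.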
Recursive Chunking
Recursive chunking uses multiple splitting strategies in priority order.
A common flow is:
- First split by paragraph or heading.
- If a chunk is still too long, split it by sentence.
- If a sentence is still too long, split it by punctuation.
- As a final fallback, split by token length.
The benefit is that it tries to keep the original structure and meaning as much as possible, while still ensuring each chunk fits within the target size limit.
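The flow above can be sketched roughly as follows. This is a simplified illustration under a few assumptions: the separator list and the name `recursiveSplit` are hypothetical, characters stand in for tokens, `maxChars` is assumed to be positive, and the merge step that production splitters use to pack small pieces back together is omitted.

```typescript
// Separators in priority order: paragraph, line, sentence end, clause, word.
const SEPARATORS = ["\n\n", "\n", ". ", ", ", " "];

// Splits text until every piece is at most maxChars characters long,
// moving to a finer separator only when a piece is still too long.
function recursiveSplit(text: string, maxChars: number, level = 0): string[] {
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (trimmed.length <= maxChars) return [trimmed];

  if (level >= SEPARATORS.length) {
    // Final fallback: hard cut by length.
    const pieces: string[] = [];
    for (let i = 0; i < trimmed.length; i += maxChars) {
      pieces.push(trimmed.slice(i, i + maxChars));
    }
    return pieces;
  }

  const sep = SEPARATORS[level];
  const parts = trimmed.split(sep);
  const result: string[] = [];
  parts.forEach((part, i) => {
    // Re-attach punctuation (for example ".") that split() removed.
    const piece = i < parts.length - 1 ? part + sep.trim() : part;
    for (const sub of recursiveSplit(piece, maxChars, level + 1)) {
      result.push(sub);
    }
  });
  return result;
}

const recursiveChunks = recursiveSplit(
  "First paragraph is short.\n\nSecond paragraph has two sentences. Here is the other one.",
  40
);
```

Here the first paragraph fits as-is, while the second exceeds the limit and is split again by sentence.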
How to Design the Indexing API in a RAG SDK?
For a RAG SDK, the indexing API should be flexible enough for common use cases, while still allowing advanced users to customize the pipeline.
A good design can include three layers:
1. Provide Built-in Splitters
The SDK should provide several built-in splitters for common scenarios, for example:
- TokenTextSplitter
- RecursiveTextSplitter
- MarkdownHeaderSplitter
- SemanticTextSplitter
- CodeSplitter
const index = await rag.index({
  documents,
  splitter: new RecursiveTextSplitter({
    chunkSize: 800,
    chunkOverlap: 100,
  }),
})
2. Allow Third-party Splitters Through Adapters
Users may already use splitters from LangChain, LlamaIndex, or internal company libraries.
Instead of forcing users to rewrite everything, the SDK can provide adapter support.
const splitter = fromLangChainSplitter(langChainSplitter)
await rag.index({
  documents,
  splitter,
})
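One possible shape for such an adapter is sketched below. The text does not show how `fromLangChainSplitter` is implemented, so everything here is an assumption: `SourceDocument`, `TextChunk`, and `SdkSplitter` are hypothetical SDK types, and `LangChainLikeSplitter` is a structural stand-in that only assumes the external splitter exposes an async `splitText(text)` method returning strings.

```typescript
// Hypothetical SDK types; field names are assumptions for illustration.
interface SourceDocument {
  id: string;
  text: string;
}
interface TextChunk {
  id: string;
  documentId: string;
  content: string;
}
interface SdkSplitter {
  split(document: SourceDocument): Promise<TextChunk[]>;
}

// Structural stand-in for a third-party splitter (assumed method shape).
interface LangChainLikeSplitter {
  splitText(text: string): Promise<string[]>;
}

// Wraps the external splitter so it satisfies the SDK's splitter interface.
function fromLangChainSplitter(external: LangChainLikeSplitter): SdkSplitter {
  return {
    async split(document: SourceDocument): Promise<TextChunk[]> {
      const texts = await external.splitText(document.text);
      return texts.map((content, i) => ({
        id: `${document.id}#${i}`, // stable id derived from document id and position
        documentId: document.id,
        content,
      }));
    },
  };
}
```

The adapter only translates types; chunk sizes and overlap remain whatever the wrapped splitter was configured with.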
3. Provide a Splitter Interface for Custom Implementations
The SDK should expose a clear interface, so users can create their own splitters.
interface DocumentSplitter {
  split(document: Document): Promise<Chunk[]>
}
class CustomBusinessSplitter implements DocumentSplitter {
  async split(document: Document): Promise<Chunk[]> {
    const chunks: Chunk[] = []
    // Custom splitting logic: fill `chunks` from `document`
    return chunks
  }
}
Chunk Enhancement
Once we have chunks, we can further improve them for better search, retrieval, and indexing.
The goal is not just to split the document, but to make each chunk more self-contained and easier to retrieve correctly.
Contextual Headers
Contextual headers add metadata or structural information to each chunk, so that even after chunking, the chunk can still preserve its original context.
This is especially useful when a chunk is retrieved independently from the original document.
Common metadata includes:
- Document title
- Paragraph title
- Section title
- Module name
- Product name
- Source file name
- Page number
- Hierarchy path
Example:
Chunk without contextual header:
## Pricing
The Pro plan supports up to 100 users and includes advanced analytics.
This is understandable, but the context is weak. Nothing in the chunk says which document or product this pricing belongs to.
Chunk with contextual header:
Document: Product Plan Guide
Section: Pricing
Module: Pro Plan
The Pro plan supports up to 100 users and includes advanced analytics.
This is better because the chunk carries its context even when retrieved alone.
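Building such a header can be as simple as the sketch below. The `ChunkContext` type and the `withContextualHeader` function are hypothetical names chosen for illustration; the three fields mirror the example above, and a real pipeline would likely carry more of the metadata listed earlier.

```typescript
// Hypothetical context fields; all are optional.
interface ChunkContext {
  documentTitle?: string;
  section?: string;
  module?: string;
}

// Prepends one "Label: value" line per available field to the chunk text.
function withContextualHeader(content: string, context: ChunkContext): string {
  const headerLines: string[] = [];
  if (context.documentTitle) headerLines.push(`Document: ${context.documentTitle}`);
  if (context.section) headerLines.push(`Section: ${context.section}`);
  if (context.module) headerLines.push(`Module: ${context.module}`);
  return headerLines.length > 0 ? `${headerLines.join("\n")}\n${content}` : content;
}

const enrichedChunk = withContextualHeader(
  "The Pro plan supports up to 100 users and includes advanced analytics.",
  { documentTitle: "Product Plan Guide", section: "Pricing", module: "Pro Plan" }
);
```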
Document Augmentation
Document augmentation means adding extra information before indexing to improve retrieval quality.
The augmentation does not necessarily need to modify the original document. It can be added only to the indexed version, metadata, or embedding input.
The goal is to make the content easier to search, match, and understand.
Common approaches:
- Add missing titles
- Add paragraph summaries
- Add standard questions for FAQ content
- Add keywords for important concepts
- Convert tables into natural language descriptions
- Normalize terminology
- Add synonyms or business-specific aliases
Example:
Original content:
You can cancel your subscription from the billing page.
Augmented content for indexing:
Standard questions:
- How can I cancel my subscription?
- Where do I cancel my plan?
- How do I stop auto-renewal?
Answer:
You can cancel your subscription from the billing page.
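A minimal sketch of this idea, keeping the original text separate from the text that gets embedded, might look like the following. The `AugmentedChunk` type and `augmentWithQuestions` function are hypothetical, and the questions are assumed to be supplied by the caller (written by hand or generated elsewhere).

```typescript
interface AugmentedChunk {
  original: string; // unchanged source text, shown to users and passed to the LLM
  embeddingInput: string; // augmented text sent to the embedding model
}

// Prepends standard questions to the embedding input without touching the original.
function augmentWithQuestions(original: string, questions: string[]): AugmentedChunk {
  if (questions.length === 0) return { original, embeddingInput: original };
  const questionBlock = questions.map((q) => `- ${q}`).join("\n");
  return {
    original,
    embeddingInput: `Standard questions:\n${questionBlock}\nAnswer:\n${original}`,
  };
}

const augmentedFaq = augmentWithQuestions(
  "You can cancel your subscription from the billing page.",
  ["How can I cancel my subscription?", "Where do I cancel my plan?", "How do I stop auto-renewal?"]
);
```

Keeping both fields means the index can match user-style questions while the answer returned to the user stays exactly as written in the source.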
Metadata Extraction
Metadata extraction means extracting structured information from documents during the indexing stage and storing it as metadata. This makes later search, filtering, sorting, and display easier.
This metadata is often not part of the main document content, but the system relies on it for filtering, permission checks, and display.
Common metadata includes:
- Document title
- Source path
- Author
- Last updated time
- Knowledge base
- Tags
- Permission level
- Chapter title
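As a rough illustration, the sketch below extracts a title and tags from a Markdown-like document. The conventions it relies on are assumptions, not something the text prescribes: the first `# ` line is treated as the title, and an optional `Tags: a, b` line lists the tags. The names `ExtractedMetadata` and `extractMetadata` are hypothetical.

```typescript
interface ExtractedMetadata {
  source: string;
  title?: string;
  tags: string[];
}

// Pulls structured fields out of the raw text at indexing time.
function extractMetadata(source: string, text: string): ExtractedMetadata {
  const titleMatch = text.match(/^#\s+(.+)$/m); // first level-1 heading
  const tagsMatch = text.match(/^Tags:\s*(.+)$/m); // assumed "Tags: a, b" line
  const tags = tagsMatch
    ? tagsMatch[1].split(",").map((t) => t.trim()).filter((t) => t.length > 0)
    : [];
  return { source, title: titleMatch ? titleMatch[1].trim() : undefined, tags };
}

const planGuideMetadata = extractMetadata(
  "docs/pricing.md",
  "# Product Plan Guide\nTags: pricing, plans\n\nThe Pro plan supports up to 100 users."
);
```

Fields such as author, last updated time, or permission level usually come from the source system (file system, CMS, wiki API) rather than from the text itself.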
Index Management
An index item in a RAG application typically looks like this:
type IndexItem = {
  id: string
  content: string
  embedding: number[]
  metadata: {
    source: string
    title?: string
    tags?: string[]
  }
}
In short, each index item combines a chunk, its embedding, and its metadata.
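To make the combination concrete, here is a small sketch that assembles index items from chunks. The `EmbedFn` type and `buildIndexItems` function are hypothetical; the embedding function is passed in by the caller because the text does not name a specific embedding model or client.

```typescript
type IndexItem = {
  id: string;
  content: string;
  embedding: number[];
  metadata: { source: string; title?: string; tags?: string[] };
};

// Any function that turns text into a vector (assumed to be provided by the caller).
type EmbedFn = (text: string) => Promise<number[]>;

// Combines each chunk with its embedding and shared document metadata.
async function buildIndexItems(
  chunks: string[],
  metadata: IndexItem["metadata"],
  embed: EmbedFn
): Promise<IndexItem[]> {
  const items: IndexItem[] = [];
  for (let i = 0; i < chunks.length; i++) {
    items.push({
      id: `${metadata.source}#${i}`, // id derived from source and chunk position
      content: chunks[i],
      embedding: await embed(chunks[i]),
      metadata,
    });
  }
  return items;
}
```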
Deduplication and Cleaning
Deduplication and cleaning are part of pre-indexing data processing.
This means that before indexing, we first remove duplicate content, noise, and dirty data, so they do not enter the vector database.
This is a very common and important step. If the raw data quality is poor, even good embeddings and retrieval methods will struggle to produce ideal results.
Common operations include:
- Remove duplicate documents
- Remove duplicate chunks
- Remove empty content
- Remove template headers and footers
- Remove garbled text and meaningless characters
- Standardize the format
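A few of these operations can be sketched as below. This only covers empty content and exact duplicates after whitespace and case normalization; near-duplicate detection, boilerplate removal, and garbled-text filtering need more than this. The function names are hypothetical.

```typescript
// Collapses whitespace and lowercases so trivially different copies compare equal.
function normalizeForComparison(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Drops empty chunks and duplicates (after normalization), keeping first occurrences.
function cleanAndDeduplicate(chunks: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const chunk of chunks) {
    const key = normalizeForComparison(chunk);
    if (key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    result.push(chunk.trim());
  }
  return result;
}

const cleanedChunks = cleanAndDeduplicate([
  "Reset the door sensor.",
  "   ",
  "reset the  door sensor.",
  "Check the signal cable.",
]);
```

For large corpora, storing a hash of the normalized text instead of the text itself keeps the `seen` set small.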
Hierarchical Indexing
Hierarchical indexing is an indexing strategy that builds knowledge representations at multiple levels.
Instead of creating only one flat index at a single granularity, hierarchical indexing creates indexes at different levels, such as document, section, and chunk.
The goal is to allow the retrieval system to first locate the relevant area at a coarse level, and then drill down into more specific content. It can also choose different levels of detail depending on the complexity of the user’s question.
Comparison between normal indexing and hierarchical indexing:
Normal Indexing
In normal indexing, all chunks are usually stored at the same level:
chunk1
chunk2
chunk3
The retriever directly searches across all chunks.
Hierarchical Indexing
In hierarchical indexing, the content keeps its original structure:
Document A
├── Section 1
│   ├── chunk1
│   └── chunk2
└── Section 2
    └── chunk3
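The coarse-to-fine lookup described above can be sketched as a two-step search: rank sections first, then rank only the chunks inside the best sections. This is an in-memory illustration with hypothetical types (`ChunkNode`, `SectionNode`) and plain cosine similarity; a real system would query a vector database at each level.

```typescript
interface ChunkNode {
  id: string;
  embedding: number[];
}
interface SectionNode {
  title: string;
  embedding: number[]; // for example, the embedding of a section summary
  chunks: ChunkNode[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns a copy of items sorted from most to least similar to the query.
function rankBySimilarity<T extends { embedding: number[] }>(query: number[], items: T[]): T[] {
  return items
    .slice()
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding));
}

// Step 1: keep the topSections best sections. Step 2: rank only their chunks.
function hierarchicalSearch(
  query: number[],
  sections: SectionNode[],
  topSections: number,
  topChunks: number
): ChunkNode[] {
  const candidates: ChunkNode[] = [];
  for (const section of rankBySimilarity(query, sections).slice(0, topSections)) {
    candidates.push(...section.chunks);
  }
  return rankBySimilarity(query, candidates).slice(0, topChunks);
}
```

Compared with a flat search over all chunks, the second step only scores chunks from the selected sections, which narrows the search space and keeps results from unrelated sections out.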
