[GenAI] Indexing overview

For a RAG system, indexing is not just about putting chunks into a vector database. It usually includes three major parts:

1. Chunk Splitting

This step focuses on how to split raw documents into smaller units that can be embedded and retrieved.

Common strategies:

Semantic Chunking
Recursive Chunking

2. Chunk Enhancement

This step focuses on making each chunk more self-contained and easier to retrieve.

Common techniques:

Contextual Headers
Document Augmentation
Metadata Extraction

3. Index Management

This step focuses on how to maintain the quality, structure, and lifecycle of the index.

Common mechanisms:

Deduplication and Cleaning
Hierarchical Indexing
Incremental Updates

Chunk Splitting

This section mainly covers how to split data into chunks before indexing or retrieval.

Semantic chunking

Semantic chunking splits content based on meaning, instead of only using token size.

The core idea is to preserve semantic completeness and avoid breaking a sentence, paragraph, or logical section in the middle.

Common approaches:

Split by sentence
Split by paragraph
Split by heading hierarchy
Split at points where semantic similarity changes significantly

Example:

Bad:

'The elevator maintenance report shows that the door sensor failed because'

'of repeated signal loss during peak operating hours.'

Why it is bad: the sentence is split in the middle, so each chunk loses context.

Good:

The elevator maintenance report shows that the door sensor failed because of repeated signal loss during peak operating hours.

Why it is good: the sentence keeps its complete meaning.
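
The last approach in the list above (splitting where semantic similarity changes) could be sketched roughly as follows. This is only an illustration under stated assumptions: `splitSentences`, `cosineSimilarity`, `semanticChunks`, the `0.5` threshold, and the toy `toyEmbed` function are all hypothetical names and values, and a real pipeline would call an embedding model instead of counting words.

```typescript
// Hypothetical helper: naive sentence splitter on ., !, ? followed by whitespace.
function splitSentences(text: string): string[] {
	return text
		.split(/(?<=[.!?])\s+/)
		.map((s) => s.trim())
		.filter((s) => s.length > 0)
}

function cosineSimilarity(a: number[], b: number[]): number {
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if (normA === 0 || normB === 0) return 0
	return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Start a new chunk whenever two adjacent sentences are less similar than `threshold`.
// Sentences are never cut in the middle, so each chunk keeps complete sentences.
function semanticChunks(
	sentences: string[],
	embed: (sentence: string) => number[],
	threshold = 0.5,
): string[] {
	const chunks: string[] = []
	let current: string[] = []
	let previous: number[] | null = null
	for (const sentence of sentences) {
		const vector = embed(sentence)
		if (previous !== null && cosineSimilarity(previous, vector) < threshold) {
			chunks.push(current.join(" "))
			current = []
		}
		current.push(sentence)
		previous = vector
	}
	if (current.length > 0) chunks.push(current.join(" "))
	return chunks
}

// Toy stand-in for an embedding model (assumption): counts a few known words.
const toyVocabulary = ["sensor", "door", "price", "plan"]
function toyEmbed(sentence: string): number[] {
	const words = sentence.toLowerCase().split(/[^a-z]+/)
	return toyVocabulary.map((term) => words.filter((w) => w === term).length)
}

const semanticDemo = semanticChunks(
	splitSentences(
		"The door sensor failed. The sensor lost signal near the door. The Pro plan price is fixed.",
	),
	toyEmbed,
)
console.log(semanticDemo)
```

With the toy embedding, the two sensor sentences stay together and the pricing sentence starts a new chunk. The threshold usually needs tuning per corpus and embedding model.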

Recursive Chunking

Recursive chunking uses multiple splitting strategies in priority order.

A common flow is:

First split by paragraph or heading.
If a chunk is still too long, split it by sentence.
If a sentence is still too long, split it by punctuation.
As a final fallback, split by token length.

The benefit is that it tries to keep the original structure and meaning as much as possible, while still ensuring each chunk fits within the target size limit.
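
The flow above can be sketched as a small recursive function. This is a minimal sketch under assumptions, not the implementation of any particular library: `recursiveSplit` and `defaultSeparators` are hypothetical names, size is measured in characters as a rough proxy for tokens, and for brevity the sketch drops the separators and does not merge small neighbouring pieces back up to the size limit, which production splitters usually do.

```typescript
// Separators in priority order: paragraph, sentence, clause punctuation.
const defaultSeparators = ["\n\n", ". ", ", "]

function recursiveSplit(
	text: string,
	maxLength: number,
	separators: string[] = defaultSeparators,
): string[] {
	// Small enough already: keep the piece intact (and skip empty pieces).
	if (text.length <= maxLength) return text.length > 0 ? [text] : []

	// Final fallback: no separators left, so cut purely by length.
	if (separators.length === 0) {
		const pieces: string[] = []
		for (let i = 0; i < text.length; i += maxLength) {
			pieces.push(text.slice(i, i + maxLength))
		}
		return pieces
	}

	// Try the highest-priority separator, then recurse with the remaining ones
	// on any part that is still too long.
	const [separator, ...rest] = separators
	const result: string[] = []
	for (const part of text.split(separator)) {
		for (const piece of recursiveSplit(part, maxLength, rest)) {
			result.push(piece)
		}
	}
	return result
}

const recursiveDemo = recursiveSplit(
	"Short paragraph.\n\nThis is a much longer paragraph. It has two sentences, and one has a comma.",
	40,
)
console.log(recursiveDemo)
```

The short paragraph survives intact, while the long one is broken down only as far as needed to fit the limit.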

How to Design the Indexing API in a RAG SDK?

For a RAG SDK, the indexing API should be flexible enough for common use cases, while still allowing advanced users to customize the pipeline.

A good design can include three layers:

1. Provide Built-in Splitters

The SDK should provide several built-in splitters for common scenarios, for example:

TokenTextSplitter
RecursiveTextSplitter
MarkdownHeaderSplitter
SemanticTextSplitter
CodeSplitter

const index = await rag.index({
	documents,
	splitter: new RecursiveTextSplitter({
		chunkSize: 800,
		chunkOverlap: 100,
	}),
})

2. Allow Third-party Splitters Through Adapters

Users may already use splitters from LangChain, LlamaIndex, or internal company libraries.

Instead of forcing users to rewrite everything, the SDK can provide adapter support.

const splitter = fromLangChainSplitter(langChainSplitter)

await rag.index({
	documents,
	splitter,
})
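
One way such an adapter might work internally is sketched below. Everything here is an assumption made for illustration: the post does not define `Document` or `Chunk`, so `SourceDocument` and `TextChunk` stand in for them, and `TextSplitterLike` is a structural type for any third-party splitter that exposes an async `splitText(text)` method returning strings. Check the method name and return type of the library version you actually use.

```typescript
// Assumed shapes; the SDK's real Document and Chunk types are not shown in this post.
interface SourceDocument {
	id: string
	text: string
}

interface TextChunk {
	documentId: string
	index: number
	content: string
}

interface DocumentSplitter {
	split(document: SourceDocument): Promise<TextChunk[]>
}

// Structural type (assumption): any object with an async text-to-strings method.
interface TextSplitterLike {
	splitText(text: string): Promise<string[]>
}

// Wrap the external splitter so it satisfies the SDK-style interface.
function fromLangChainSplitter(external: TextSplitterLike): DocumentSplitter {
	return {
		async split(doc: SourceDocument): Promise<TextChunk[]> {
			const pieces = await external.splitText(doc.text)
			return pieces.map((content, index) => ({
				documentId: doc.id,
				index,
				content,
			}))
		},
	}
}

// Stand-in for a third-party splitter, used only to exercise the adapter.
const lineSplitter: TextSplitterLike = {
	async splitText(text: string): Promise<string[]> {
		return text.split("\n")
	},
}

const adaptedSplitter = fromLangChainSplitter(lineSplitter)
```

Because the adapter depends only on a structural type, the SDK does not need a hard dependency on the third-party package.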

3. Provide a Splitter Interface for Custom Implementations

The SDK should expose a clear interface, so users can create their own splitters.

interface DocumentSplitter {
	split(document: Document): Promise<Chunk[]>
}

class CustomBusinessSplitter implements DocumentSplitter {
	async split(document: Document): Promise<Chunk[]> {
		// Custom splitting logic: build `chunks` from `document` here
		const chunks: Chunk[] = []
		return chunks
	}
}

Chunk Enhancement

Once we have chunks, we can enrich them further to improve search, retrieval, and indexing.

The goal is not just to split the document, but to make each chunk more self-contained and easier to retrieve correctly.

Contextual Headers

Contextual headers add metadata or structural information to each chunk, so that even after chunking, the chunk can still preserve its original context.

This is especially useful when a chunk is retrieved independently from the original document.

Common metadata includes:

Document title
Paragraph title
Section title
Module name
Product name
Source file name
Page number
Hierarchy path

Example:

Chunk without contextual header:

## Pricing

The Pro plan supports up to 100 users and includes advanced analytics.

This is understandable, but the context is weak. The retriever cannot tell which document or product this pricing section comes from.

Chunk with contextual header:

Document: Product Plan Guide
Section: Pricing
Module: Pro Plan

The Pro plan supports up to 100 users and includes advanced analytics.

This is better because the chunk carries its context even when retrieved alone.
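
A small helper can build such a header before the chunk is embedded. The sketch below is an assumption about how this might look: `ChunkContext` and `withContextualHeader` are hypothetical names, and only three of the metadata fields listed above are included.

```typescript
interface ChunkContext {
	documentTitle: string
	section?: string
	module?: string
}

// Prepend the available context lines to the chunk text, separated by a blank line.
function withContextualHeader(content: string, context: ChunkContext): string {
	const headerLines = [`Document: ${context.documentTitle}`]
	if (context.section) headerLines.push(`Section: ${context.section}`)
	if (context.module) headerLines.push(`Module: ${context.module}`)
	return `${headerLines.join("\n")}\n\n${content}`
}

const enrichedChunk = withContextualHeader(
	"The Pro plan supports up to 100 users and includes advanced analytics.",
	{ documentTitle: "Product Plan Guide", section: "Pricing", module: "Pro Plan" },
)
console.log(enrichedChunk)
```

Whether the header goes into the embedded text, the stored text, or both is a design choice; embedding it tends to help matching, while storing it helps the generator cite the source.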

Document Augmentation

Document augmentation means adding extra information before indexing to improve retrieval quality.

The augmentation does not necessarily need to modify the original document. It can be added only to the indexed version, metadata, or embedding input.

The goal is to make the content easier to search, match, and understand.

Common approaches:

Add missing titles
Add paragraph summaries
Add standard questions for FAQ content
Add keywords for important concepts
Convert tables into natural language descriptions
Normalize terminology
Add synonyms or business-specific aliases

Example:

Original content:

You can cancel your subscription from the billing page.

Augmented content for indexing:

Standard questions:
- How can I cancel my subscription?
- Where do I cancel my plan?
- How do I stop auto-renewal?

Answer:
You can cancel your subscription from the billing page.
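
The FAQ example above could be produced by a helper like the one below. This is a sketch under assumptions: `FaqEntry` and `buildEmbeddingInput` are hypothetical names, and the standard questions are supplied by the caller (in practice they might be written by hand or generated by an LLM). The original answer is left untouched; only the text sent to the embedding step is augmented.

```typescript
interface FaqEntry {
	answer: string
	standardQuestions: string[]
}

// Build the augmented text used for indexing, leaving `entry.answer` unchanged.
function buildEmbeddingInput(entry: FaqEntry): string {
	const questions = entry.standardQuestions.map((q) => `- ${q}`).join("\n")
	return `Standard questions:\n${questions}\n\nAnswer:\n${entry.answer}`
}

const cancelEntry: FaqEntry = {
	answer: "You can cancel your subscription from the billing page.",
	standardQuestions: [
		"How can I cancel my subscription?",
		"Where do I cancel my plan?",
		"How do I stop auto-renewal?",
	],
}

const augmentedText = buildEmbeddingInput(cancelEntry)
console.log(augmentedText)
```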

Metadata Extraction

Metadata extraction means extracting structured information from documents during the indexing stage and storing it as metadata. This makes later search, filtering, sorting, and display easier.

This metadata is often not part of the main document content, but it is very important for the system.

Common metadata includes:

Document title
Source path
Author
Last updated time
Knowledge base
Tags
Permission level
Chapter title
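
As a rough illustration, the sketch below extracts a few of these fields from a Markdown document and then uses them for filtering. The conventions are assumptions made for the example, not something the post prescribes: the title is taken from the first level-1 heading, tags from a line such as `Tags: billing, pricing`, and `extractMetadata` and `filterByTag` are hypothetical names.

```typescript
interface ExtractedMetadata {
	title?: string
	sourcePath: string
	lastUpdated: string // ISO 8601 timestamp supplied by the caller
	tags: string[]
}

function extractMetadata(
	markdown: string,
	sourcePath: string,
	lastUpdated: string,
): ExtractedMetadata {
	// Assumption: the title is the first level-1 Markdown heading.
	const titleMatch = markdown.match(/^#[ \t]+(.+)$/m)
	// Assumption: tags appear on a line such as "Tags: billing, pricing".
	const tagsMatch = markdown.match(/^Tags:[ \t]*(.+)$/im)
	const tags = tagsMatch
		? tagsMatch[1]
				.split(",")
				.map((t) => t.trim())
				.filter((t) => t.length > 0)
		: []
	return {
		title: titleMatch ? titleMatch[1].trim() : undefined,
		sourcePath,
		lastUpdated,
		tags,
	}
}

// Metadata makes pre-filtering possible before any vector search runs.
function filterByTag<T extends { metadata: ExtractedMetadata }>(items: T[], tag: string): T[] {
	return items.filter((item) => item.metadata.tags.indexOf(tag) !== -1)
}

const planMetadata = extractMetadata(
	"# Product Plan Guide\nTags: billing, pricing\n\nThe Pro plan supports up to 100 users.",
	"docs/plans.md",
	"2026-05-19T13:48:00Z",
)
console.log(planMetadata)
```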

Index Management

An index item in a RAG application typically looks like this:

type IndexItem = {
    id: string
    content: string
    embedding: number[]
    metadata: {
        source: string
        title?: string
        tags?: string[]
    }
}

In short, an index item is a combination of chunk, embedding, and metadata.

Deduplication and Cleaning

Deduplication and cleaning are part of pre-indexing data processing.

This means that before indexing, we first remove duplicate content, noise, and dirty data, so they do not enter the vector database.

This is a very common and important step. If the raw data quality is poor, even good embeddings and retrieval methods will struggle to produce ideal results.

Common operations include:

Remove duplicate documents
Remove duplicate chunks
Remove empty content
Remove template headers and footers
Remove garbled text and meaningless characters
Standardize the format
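
A few of these operations can be sketched in a single pass over the chunks. This is a minimal sketch, assuming exact (case-insensitive) duplicates only: `cleanText` and `cleanAndDeduplicate` are hypothetical names, and near-duplicate detection or template header/footer removal would need additional logic that is not shown here.

```typescript
// Strip control characters and collapse whitespace so that text which differs
// only in spacing compares as equal.
function cleanText(raw: string): string {
	return raw
		.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "")
		.replace(/\s+/g, " ")
		.trim()
}

function cleanAndDeduplicate(chunks: string[]): string[] {
	const seen = new Set<string>()
	const kept: string[] = []
	for (const chunk of chunks) {
		const cleaned = cleanText(chunk)
		if (cleaned.length === 0) continue // remove empty content
		const key = cleaned.toLowerCase() // case-insensitive exact match
		if (seen.has(key)) continue // remove duplicate chunks
		seen.add(key)
		kept.push(cleaned)
	}
	return kept
}

const cleanedChunks = cleanAndDeduplicate([
	"Hello   world",
	"   ",
	"hello world",
	"Another chunk",
])
console.log(cleanedChunks)
```

The first occurrence of each chunk wins, so the input order determines which copy is kept.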

Hierarchical Indexing

Hierarchical indexing is an indexing strategy that builds knowledge representations at multiple levels.

Instead of creating only one flat index at a single granularity, hierarchical indexing creates indexes at different levels, such as document, section, and chunk.

The goal is to allow the retrieval system to first locate the relevant area at a coarse level, and then drill down into more specific content. It can also choose different levels of detail depending on the complexity of the user’s question.

Comparison between normal indexing and hierarchical indexing:

Normal Indexing

In normal indexing, all chunks are usually stored at the same level:

chunk1
chunk2
chunk3

The retriever directly searches across all chunks.

Hierarchical Indexing

In hierarchical indexing, the content keeps its original structure:

Document A
├── Section 1
│   ├── chunk1
│   └── chunk2
└── Section 2
    └── chunk3

During retrieval, the system can:

First find the relevant document
Then find the relevant section
Finally find the most relevant chunk

This makes retrieval more structured and often improves accuracy, especially for long documents or complex knowledge bases.
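
The three retrieval steps above could look roughly like this. The sketch rests on several assumptions: `HierDocument`, `HierSection`, `pickBest`, and `retrieveHierarchically` are hypothetical names, and `overlapScore` is a toy word-overlap score standing in for vector similarity at each level.

```typescript
interface HierSection {
	title: string
	chunks: string[]
}

interface HierDocument {
	title: string
	sections: HierSection[]
}

// Toy relevance score (assumption): number of query words found in the text.
function overlapScore(query: string, text: string): number {
	const haystack = text.toLowerCase()
	return query
		.toLowerCase()
		.split(/\W+/)
		.filter((w) => w.length > 0 && haystack.indexOf(w) !== -1).length
}

// Return the highest-scoring item; ties keep the earliest item.
function pickBest<T>(items: T[], score: (item: T) => number): T | undefined {
	let winner: T | undefined
	let bestScore = -Infinity
	for (const item of items) {
		const s = score(item)
		if (s > bestScore) {
			bestScore = s
			winner = item
		}
	}
	return winner
}

function retrieveHierarchically(query: string, documents: HierDocument[]): string | undefined {
	// 1. Coarse level: pick the document using its title and section titles.
	const doc = pickBest(documents, (d) =>
		overlapScore(query, d.title + " " + d.sections.map((s) => s.title).join(" ")),
	)
	if (!doc) return undefined
	// 2. Middle level: pick the section within that document.
	const section = pickBest(doc.sections, (s) =>
		overlapScore(query, s.title + " " + s.chunks.join(" ")),
	)
	if (!section) return undefined
	// 3. Fine level: pick the chunk within that section.
	return pickBest(section.chunks, (c) => overlapScore(query, c))
}

const hierCorpus: HierDocument[] = [
	{
		title: "Billing Guide",
		sections: [
			{ title: "Refunds", chunks: ["Refunds take five days.", "Refunds require a receipt."] },
			{ title: "Invoices", chunks: ["Invoices are sent monthly."] },
		],
	},
	{
		title: "Elevator Manual",
		sections: [
			{
				title: "Door sensor",
				chunks: ["The door sensor failed during peak hours.", "Replace the sensor yearly."],
			},
		],
	},
]

console.log(retrieveHierarchically("door sensor failed", hierCorpus))
```

A real system would typically keep the top few candidates at each level rather than a single winner, so that one wrong choice at the document level does not hide the right chunk.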

Incremental Updates

Incremental updates are part of index lifecycle management.

This means that when updating an index, we do not rebuild everything from scratch each time. Instead, we only process the data that has been added, modified, or deleted.

The problem it solves:

When the knowledge base becomes large, a full rebuild can be too expensive and inefficient.

Incremental updates usually rely on mechanisms such as:

Document version number
Last updated time
Content hash
File change log
Chunk-level diff

Example:

Document A v1
├── chunk1
├── chunk2
└── chunk3

After the document is updated:

Document A v2
├── chunk1 unchanged
├── chunk2 modified
└── chunk3 unchanged

With incremental updates, the system only needs to re-process and re-index chunk2, instead of rebuilding the whole document.

This makes indexing faster, cheaper, and easier to maintain at scale.
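
The content-hash mechanism can be sketched as a comparison between stored hashes and the incoming chunks. Several assumptions apply: `contentHash`, `planIncrementalUpdate`, and the interfaces are hypothetical names, chunk IDs are assumed to be stable across document versions, and the FNV-1a hash (computed here over UTF-16 code units) is only a dependency-free stand-in for a stronger hash such as SHA-256.

```typescript
// FNV-1a 32-bit hash; a stand-in for a stronger content hash such as SHA-256.
function contentHash(text: string): string {
	let hash = 0x811c9dc5
	for (let i = 0; i < text.length; i++) {
		hash ^= text.charCodeAt(i)
		hash = Math.imul(hash, 0x01000193)
	}
	return (hash >>> 0).toString(16)
}

interface StoredChunk {
	id: string
	hash: string
}

interface IncomingChunk {
	id: string
	content: string
}

interface UpdatePlan {
	toUpsert: string[] // new or modified chunks: re-embed and re-index
	toDelete: string[] // chunks that no longer exist in the document
	unchanged: string[] // chunks that can be skipped
}

function planIncrementalUpdate(stored: StoredChunk[], incoming: IncomingChunk[]): UpdatePlan {
	const storedHashes = new Map<string, string>()
	for (const chunk of stored) storedHashes.set(chunk.id, chunk.hash)

	const plan: UpdatePlan = { toUpsert: [], toDelete: [], unchanged: [] }
	const incomingIds = new Set<string>()
	for (const chunk of incoming) {
		incomingIds.add(chunk.id)
		if (storedHashes.get(chunk.id) === contentHash(chunk.content)) {
			plan.unchanged.push(chunk.id)
		} else {
			plan.toUpsert.push(chunk.id)
		}
	}
	for (const chunk of stored) {
		if (!incomingIds.has(chunk.id)) plan.toDelete.push(chunk.id)
	}
	return plan
}

// Document A v1 as stored in the index, then v2 with chunk2 modified.
const storedV1: StoredChunk[] = [
	{ id: "chunk1", hash: contentHash("alpha") },
	{ id: "chunk2", hash: contentHash("beta") },
	{ id: "chunk3", hash: contentHash("gamma") },
]
const incomingV2: IncomingChunk[] = [
	{ id: "chunk1", content: "alpha" },
	{ id: "chunk2", content: "beta revised" },
	{ id: "chunk3", content: "gamma" },
]

const updatePlan = planIncrementalUpdate(storedV1, incomingV2)
console.log(updatePlan)
```

Only the IDs in `toUpsert` need new embeddings, which is where most of the cost of a rebuild usually lies.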

Posted 2026-05-19 13:48 by Zhentiw