[GenAI] Indexing overview

For a RAG system, indexing is not just about putting chunks into a vector database. It usually includes three major parts:

1. Chunk Splitting

This step focuses on how to split raw documents into smaller units that can be embedded and retrieved.

Common strategies:

  • Semantic Chunking
  • Recursive Chunking

2. Chunk Enhancement

This step focuses on making each chunk more self-contained and easier to retrieve.

Common techniques:

  • Contextual Headers
  • Document Augmentation
  • Metadata Extraction

3. Index Management

This step focuses on how to maintain the quality, structure, and lifecycle of the index.

Common mechanisms:

  • Deduplication and Cleaning
  • Hierarchical Indexing
  • Incremental Updates

 

Chunk Splitting

This section covers how to split raw data into chunks before it is embedded and indexed.

 

Semantic Chunking

Semantic chunking splits content based on meaning, instead of only using token size.

The core idea is to preserve semantic completeness and avoid breaking a sentence, paragraph, or logical section in the middle.

Common approaches:

  • Split by sentence
  • Split by paragraph
  • Split by heading hierarchy
  • Split at points where semantic similarity changes significantly

Example:

Bad:

'The elevator maintenance report shows that the door sensor failed because'

'of repeated signal loss during peak operating hours.'

Why it is bad: the sentence is split in the middle, so each chunk loses context.

 

Good:

The elevator maintenance report shows that the door sensor failed because of repeated signal loss during peak operating hours.

Why it is good: the sentence keeps its complete meaning.
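The last approach in the list above (splitting where semantic similarity changes) can be sketched roughly as follows. This is a minimal illustration, not a production chunker: it assumes a toy word-count vector in place of a real embedding model, and the names `bagOfWords`, `cosine`, `splitSentences`, `semanticChunks`, and the `threshold` value are all hypothetical.

```typescript
// Toy stand-in for an embedding model: a word-count vector (assumption for illustration only).
function bagOfWords(text: string): Record<string, number> {
	const counts: Record<string, number> = Object.create(null)
	for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
		counts[word] = (counts[word] ?? 0) + 1
	}
	return counts
}

// Cosine similarity between two sparse word-count vectors.
function cosine(a: Record<string, number>, b: Record<string, number>): number {
	let dot = 0
	let normA = 0
	let normB = 0
	for (const word of Object.keys(a)) {
		dot += a[word] * (b[word] ?? 0)
		normA += a[word] * a[word]
	}
	for (const word of Object.keys(b)) normB += b[word] * b[word]
	return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Split on sentence-ending punctuation so that no sentence is cut in the middle.
function splitSentences(text: string): string[] {
	return (text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [])
		.map((sentence) => sentence.trim())
		.filter((sentence) => sentence.length > 0)
}

// Start a new chunk whenever the similarity between adjacent sentences drops below the threshold.
function semanticChunks(text: string, threshold = 0.2): string[] {
	const sentences = splitSentences(text)
	const chunks: string[] = []
	let current: string[] = []
	for (let i = 0; i < sentences.length; i++) {
		const similarity = i > 0 ? cosine(bagOfWords(sentences[i - 1]), bagOfWords(sentences[i])) : 1
		if (similarity < threshold) {
			chunks.push(current.join(" "))
			current = []
		}
		current.push(sentences[i])
	}
	if (current.length > 0) chunks.push(current.join(" "))
	return chunks
}
```

With a real embedding model, only `bagOfWords` and `cosine` would need to be swapped out; the breakpoint logic stays the same.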

 

Recursive Chunking

Recursive chunking uses multiple splitting strategies in priority order.

A common flow is:

  1. First split by paragraph or heading.
  2. If a chunk is still too long, split it by sentence.
  3. If a sentence is still too long, split it by punctuation.
  4. As a final fallback, split by token length.

The benefit is that it tries to keep the original structure and meaning as much as possible, while still ensuring each chunk fits within the target size limit.
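The flow above could look something like the sketch below. It is written under a few assumptions: character length stands in for token count (a real SDK would likely use a tokenizer), the separator list is only one possible choice, and `recursiveSplit` is a hypothetical name. Note that this simple version drops the separator at chunk boundaries.

```typescript
// Separators in priority order: paragraph, sentence, clause punctuation, word.
const SEPARATORS = ["\n\n", ". ", ", ", " "]

// Character length stands in for token count here (assumption for illustration).
function recursiveSplit(text: string, maxLength: number, level = 0): string[] {
	if (text.length <= maxLength) {
		const trimmed = text.trim()
		return trimmed.length > 0 ? [trimmed] : []
	}
	if (level >= SEPARATORS.length) {
		// Final fallback: hard cut by length.
		const pieces: string[] = []
		for (let i = 0; i < text.length; i += maxLength) {
			pieces.push(text.slice(i, i + maxLength))
		}
		return pieces
	}
	const separator = SEPARATORS[level]
	const chunks: string[] = []
	let buffer = ""
	for (const part of text.split(separator)) {
		const candidate = buffer ? buffer + separator + part : part
		if (candidate.length <= maxLength) {
			// Keep merging small neighbouring parts while they still fit.
			buffer = candidate
			continue
		}
		// The buffer is full (or is a single oversized part): emit it, splitting further if needed.
		if (buffer) chunks.push(...recursiveSplit(buffer, maxLength, level + 1))
		buffer = part
	}
	if (buffer) chunks.push(...recursiveSplit(buffer, maxLength, level + 1))
	return chunks
}
```

Each level only falls through to the next separator for parts that are still too long, so short paragraphs survive intact while long ones are broken down step by step.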

 

How to Design the Indexing API in a RAG SDK?

For a RAG SDK, the indexing API should be flexible enough for common use cases, while still allowing advanced users to customize the pipeline.

A good design can include three layers:

1. Provide Built-in Splitters

The SDK should provide several built-in splitters for common scenarios, for example:

  • TokenTextSplitter
  • RecursiveTextSplitter
  • MarkdownHeaderSplitter
  • SemanticTextSplitter
  • CodeSplitter

Example:

const index = await rag.index({
	documents,
	splitter: new RecursiveTextSplitter({
		chunkSize: 800,
		chunkOverlap: 100,
	}),
})

2. Allow Third-party Splitters Through Adapters

Users may already use splitters from LangChain, LlamaIndex, or internal company libraries.

Instead of forcing users to rewrite everything, the SDK can provide adapter support.

const splitter = fromLangChainSplitter(langChainSplitter)

await rag.index({
	documents,
	splitter,
})
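One way such an adapter might be written is sketched below. The `RagDocument` and `Chunk` shapes are assumptions made for this example, not the SDK's actual types. `TextSplitterLike` is a hypothetical structural type that relies on the wrapped splitter exposing a `splitText(text)` method returning a promise of strings, which is how LangChain JS text splitters are understood to behave.

```typescript
// Minimal shapes assumed for this sketch; the SDK's real types may differ.
type RagDocument = { id: string; text: string }
type Chunk = { documentId: string; index: number; text: string }

interface DocumentSplitter {
	split(document: RagDocument): Promise<Chunk[]>
}

// Structural type: anything with an async splitText(text) method can be adapted.
interface TextSplitterLike {
	splitText(text: string): Promise<string[]>
}

// Wrap a third-party splitter so that it satisfies the SDK-side DocumentSplitter interface.
function fromLangChainSplitter(splitter: TextSplitterLike): DocumentSplitter {
	return {
		async split(document: RagDocument): Promise<Chunk[]> {
			const pieces = await splitter.splitText(document.text)
			return pieces.map((text, index) => ({ documentId: document.id, index, text }))
		},
	}
}
```

Because the adapter depends only on a structural type, the SDK would not need LangChain as a direct dependency.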

3. Provide a Splitter Interface for Custom Implementations

The SDK should expose a clear interface, so users can create their own splitters.

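// Minimal shapes assumed here; the SDK's actual Document and Chunk types may differ.
type Document = { id: string; text: string }
type Chunk = { documentId: string; text: string }

interface DocumentSplitter {
	split(document: Document): Promise<Chunk[]>
}

class CustomBusinessSplitter implements DocumentSplitter {
	async split(document: Document): Promise<Chunk[]> {
		// Custom splitting logic goes here; as a placeholder, split on blank lines.
		const chunks = document.text
			.split(/\n\s*\n/)
			.filter((text) => text.trim().length > 0)
			.map((text) => ({ documentId: document.id, text: text.trim() }))
		return chunks
	}
}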

 

Chunk Enhancement

Once the document has been split into chunks, we can enrich them to improve indexing, search, and retrieval.

The goal is not just to split the document, but to make each chunk more self-contained and easier to retrieve correctly.

 

Contextual Headers

Contextual headers add metadata or structural information to each chunk, so that even after chunking, the chunk can still preserve its original context.

This is especially useful when a chunk is retrieved independently from the original document.

Common metadata includes:

  • Document title
  • Paragraph title
  • Section title
  • Module name
  • Product name
  • Source file name
  • Page number
  • Hierarchy path

Example:

Chunk without contextual header:

## Pricing

The Pro plan supports up to 100 users and includes advanced analytics.

This is understandable, but the context is weak: the heading says "Pricing", yet nothing tells the retriever which document or product the chunk belongs to.

Chunk with contextual header:

Document: Product Plan Guide
Section: Pricing
Module: Pro Plan

The Pro plan supports up to 100 users and includes advanced analytics.

This is better because the chunk carries its context even when retrieved alone.
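A small helper for building such headers might look like the sketch below. The `ChunkContext` type and the `withContextualHeader` function are hypothetical names chosen for this example; the header fields mirror the ones shown above.

```typescript
type ChunkContext = { document: string; section?: string; module?: string }

// Prepend a short header so the chunk stays interpretable when retrieved on its own.
function withContextualHeader(text: string, context: ChunkContext): string {
	const lines = [`Document: ${context.document}`]
	if (context.section) lines.push(`Section: ${context.section}`)
	if (context.module) lines.push(`Module: ${context.module}`)
	return lines.join("\n") + "\n\n" + text
}
```

Calling it with the Pro plan sentence and `{ document: "Product Plan Guide", section: "Pricing", module: "Pro Plan" }` reproduces the chunk shown above.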

 

Document Augmentation

Document augmentation means adding extra information before indexing to improve retrieval quality.

The augmentation does not necessarily need to modify the original document. It can be added only to the indexed version, metadata, or embedding input.

The goal is to make the content easier to search, match, and understand.

Common approaches:

  • Add missing titles
  • Add paragraph summaries
  • Add standard questions for FAQ content
  • Add keywords for important concepts
  • Convert tables into natural language descriptions
  • Normalize terminology
  • Add synonyms or business-specific aliases

Example:

Original content:

You can cancel your subscription from the billing page.

Augmented content for indexing:

Standard questions:
- How can I cancel my subscription?
- Where do I cancel my plan?
- How do I stop auto-renewal?

Answer:
You can cancel your subscription from the billing page.
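To illustrate the point that augmentation can live only in the embedding input, here is a rough sketch. The `AugmentedChunk` shape and the `augmentWithQuestions` function are hypothetical; in practice the standard questions might come from an LLM or from an existing FAQ list.

```typescript
type AugmentedChunk = {
	content: string // original text, shown to the user unchanged
	embeddingInput: string // augmented text sent to the embedding model
}

// Attach standard questions to the embedding input while leaving the original content untouched.
function augmentWithQuestions(content: string, questions: string[]): AugmentedChunk {
	if (questions.length === 0) return { content, embeddingInput: content }
	const questionBlock = questions.map((question) => `- ${question}`).join("\n")
	return {
		content,
		embeddingInput: `Standard questions:\n${questionBlock}\n\nAnswer:\n${content}`,
	}
}
```

Keeping `content` and `embeddingInput` separate means the answer displayed to the user is still the original sentence, while the vector is computed from the richer text.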

 

 

Metadata Extraction

Metadata extraction means extracting structured information from documents during the indexing stage and storing it as metadata. This makes later search, filtering, sorting, and display easier.

This metadata is often not part of the main document content, but it is very important for the system.

Common metadata includes:

  • Document title
  • Source path
  • Author
  • Last updated time
  • Knowledge base
  • Tags
  • Permission level
  • Chapter title
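As a rough sketch of what extraction can look like, the example below pulls a few fields out of a document header. It assumes a made-up input format in which a document starts with optional "Key: value" lines followed by a blank line; the `ChunkMetadata` type and the `extractMetadata` function are hypothetical, and real sources (PDF properties, front matter, a CMS API) would need their own extractors.

```typescript
type ChunkMetadata = {
	sourcePath: string
	title?: string
	tags: string[]
	lastUpdated?: string
}

// Assumes leading "Key: value" lines followed by a blank line (hypothetical format).
function extractMetadata(raw: string, sourcePath: string): { metadata: ChunkMetadata; body: string } {
	const metadata: ChunkMetadata = { sourcePath, tags: [] }
	const lines = raw.split("\n")
	let i = 0
	for (; i < lines.length; i++) {
		const match = /^(\w[\w ]*):\s*(.+)$/.exec(lines[i])
		if (!match) break // first non-header line ends the metadata block
		const key = match[1].trim().toLowerCase()
		const value = match[2].trim()
		if (key === "title") metadata.title = value
		else if (key === "tags") {
			metadata.tags = value.split(",").map((tag) => tag.trim()).filter((tag) => tag.length > 0)
		} else if (key === "last updated") metadata.lastUpdated = value
		// Unknown keys are ignored in this sketch.
	}
	const body = lines.slice(i).join("\n").trim()
	return { metadata, body }
}
```

The returned `body` is what would go on to chunking, while `metadata` is stored alongside each chunk for filtering and display.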

 

 

Index Management

An index item in a RAG application typically looks like this:

type IndexItem = {
    id: string
    content: string
    embedding: number[]
    metadata: {
        source: string
        title?: string
        tags?: string[]
    }
}

In short, each index item is a combination of a chunk, its embedding, and its metadata.
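To show how the three parts work together at query time, here is a minimal in-memory sketch. It is an assumption-laden illustration, not how a vector database is implemented: `cosineSimilarity` and `searchIndex` are hypothetical names, the tag filter is just one possible metadata filter, and all embeddings are assumed to have the same dimension.

```typescript
type IndexItem = {
	id: string
	content: string
	embedding: number[]
	metadata: { source: string; title?: string; tags?: string[] }
}

// Assumes both vectors have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Filter by metadata first, then rank the remaining items by embedding similarity.
function searchIndex(items: IndexItem[], queryEmbedding: number[], topK: number, requiredTag?: string): IndexItem[] {
	return items
		.filter((item) => requiredTag === undefined || (item.metadata.tags ?? []).indexOf(requiredTag) !== -1)
		.map((item) => ({ item, score: cosineSimilarity(item.embedding, queryEmbedding) }))
		.sort((x, y) => y.score - x.score)
		.slice(0, topK)
		.map((scored) => scored.item)
}
```

The embedding drives the ranking, the metadata drives the filtering, and the content is what is finally returned to the caller.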

 

Deduplication and Cleaning

Deduplication and cleaning are part of pre-indexing data processing.

This means that before indexing, we first remove duplicate content, noise, and dirty data, so they do not enter the vector database.

This is a very common and important step. If the raw data quality is poor, even good embeddings and retrieval methods will struggle to produce ideal results.

Common operations include:

  • Remove duplicate documents
  • Remove duplicate chunks
  • Remove empty content
  • Remove template headers and footers
  • Remove garbled text and meaningless characters
  • Standardize the format
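A few of these operations can be sketched as follows. This is a simplified example under stated assumptions: it only catches exact duplicates after whitespace and case normalization (near-duplicate detection would need something like similarity hashing), and `cleanText` and `dedupeChunks` are hypothetical names.

```typescript
// Strip control characters and U+FFFD replacement characters, then collapse whitespace.
function cleanText(text: string): string {
	return text
		.replace(/[\u0000-\u0008\u000E-\u001F\uFFFD]/g, "")
		.replace(/\s+/g, " ")
		.trim()
}

// Drop empty chunks and exact duplicates (compared after cleaning and lowercasing).
function dedupeChunks(chunks: string[]): string[] {
	const seen = new Set<string>()
	const result: string[] = []
	for (const chunk of chunks) {
		const cleaned = cleanText(chunk)
		const key = cleaned.toLowerCase()
		if (cleaned.length === 0 || seen.has(key)) continue
		seen.add(key)
		result.push(cleaned)
	}
	return result
}
```

Note that collapsing all whitespace also removes paragraph breaks, so in a real pipeline this kind of normalization might be applied only to the comparison key rather than to the stored content.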

 

Hierarchical Indexing

Hierarchical indexing is an indexing strategy that builds knowledge representations at multiple levels.

Instead of creating only one flat index at a single granularity, hierarchical indexing creates indexes at different levels, such as document, section, and chunk.

The goal is to allow the retrieval system to first locate the relevant area at a coarse level, and then drill down into more specific content. It can also choose different levels of detail depending on the complexity of the user’s question.

Comparison between normal indexing and hierarchical indexing:

Normal Indexing

In normal indexing, all chunks are usually stored at the same level:

chunk1
chunk2
chunk3

The retriever directly searches across all chunks.

 

Hierarchical Indexing

In hierarchical indexing, the content keeps its original structure:

Document A
├── Section 1
│   ├── chunk1
│   └── chunk2
└── Section 2
    └── chunk3

During retrieval, the system can:

  1. First find the relevant document
  2. Then find the relevant section
  3. Finally find the most relevant chunk

This makes retrieval more structured and often improves accuracy, especially for long documents or complex knowledge bases.
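The three retrieval steps above can be sketched as a coarse-to-fine search. The example below is a toy version under clear assumptions: relevance is scored by simple word overlap instead of embedding similarity, each document and section carries a hypothetical `summary` field, and only the single best match is followed at each level (a real system would likely keep the top few).

```typescript
type ChunkNode = { id: string; text: string }
type SectionNode = { title: string; summary: string; chunks: ChunkNode[] }
type DocumentNode = { title: string; summary: string; sections: SectionNode[] }

// Toy relevance score: number of query words found in the text (stand-in for embedding similarity).
function overlapScore(query: string, text: string): number {
	const haystack = text.toLowerCase()
	return query
		.toLowerCase()
		.split(/\s+/)
		.filter((word) => word.length > 0 && haystack.indexOf(word) !== -1).length
}

// Return the highest-scoring item, or undefined for an empty list.
function best<T>(items: T[], score: (item: T) => number): T | undefined {
	let winner: T | undefined
	let top = -Infinity
	for (const item of items) {
		const current = score(item)
		if (current > top) {
			top = current
			winner = item
		}
	}
	return winner
}

// Coarse to fine: pick the best document, then its best section, then its best chunk.
function hierarchicalRetrieve(docs: DocumentNode[], query: string): ChunkNode | undefined {
	const doc = best(docs, (d) => overlapScore(query, d.title + " " + d.summary))
	if (!doc) return undefined
	const section = best(doc.sections, (s) => overlapScore(query, s.title + " " + s.summary))
	if (!section) return undefined
	return best(section.chunks, (c) => overlapScore(query, c.text))
}
```

The trade-off of following only one branch is that a wrong choice at the document level cannot be recovered later, which is one reason to keep several candidates per level in practice.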

 
 

Incremental Updates

Incremental updates are part of index lifecycle management.

The idea is that when updating an index, we do not need to rebuild everything from scratch every time. Instead, we only process the data that has been added, modified, or deleted.

The problem it solves:

When the knowledge base becomes large, a full rebuild can be too expensive and inefficient.

Incremental updates usually rely on mechanisms such as:

  • Document version number
  • Last updated time
  • Content hash
  • File change log
  • Chunk-level diff

Example:

 
Document A v1
├── chunk1
├── chunk2
└── chunk3

After the document is updated:

Document A v2
├── chunk1 unchanged
├── chunk2 modified
└── chunk3 unchanged
 

With incremental updates, the system only needs to re-process and re-index chunk2, instead of rebuilding the whole document.

This makes indexing faster, cheaper, and easier to maintain at scale.
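A content-hash comparison, one of the mechanisms listed above, might be sketched like this. The example is illustrative only: it assumes chunks keep stable ids between versions, it uses a small FNV-1a-style hash as a stand-in (a real system would more likely use a cryptographic hash such as SHA-256), and `contentHash`, `planIncrementalUpdate`, and the type names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: a small non-cryptographic stand-in.
function contentHash(text: string): string {
	let hash = 0x811c9dc5
	for (let i = 0; i < text.length; i++) {
		hash ^= text.charCodeAt(i)
		hash = Math.imul(hash, 0x01000193)
	}
	return (hash >>> 0).toString(16)
}

type StoredChunk = { id: string; hash: string }
type IncomingChunk = { id: string; text: string }
type UpdatePlan = { toUpsert: string[]; toDelete: string[]; unchanged: string[] }

// Compare the stored hashes with the new content to decide what needs re-indexing.
function planIncrementalUpdate(stored: StoredChunk[], incoming: IncomingChunk[]): UpdatePlan {
	const storedHashes = new Map<string, string>()
	for (const chunk of stored) storedHashes.set(chunk.id, chunk.hash)

	const incomingIds = new Set<string>()
	const plan: UpdatePlan = { toUpsert: [], toDelete: [], unchanged: [] }
	for (const chunk of incoming) {
		incomingIds.add(chunk.id)
		if (storedHashes.get(chunk.id) === contentHash(chunk.text)) plan.unchanged.push(chunk.id)
		else plan.toUpsert.push(chunk.id) // new or modified chunk
	}
	for (const chunk of stored) {
		// Anything stored but no longer present in the new version is removed.
		if (!incomingIds.has(chunk.id)) plan.toDelete.push(chunk.id)
	}
	return plan
}
```

For the Document A example above, only chunk2 would land in `toUpsert`, so only that chunk needs a new embedding.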

 

 

posted @ 2026-05-19 13:48  Zhentiw