【Project Training 1】Haystack Technical Architecture Analysis Report
—— written by Unalome (2025.04.02)
Haystack is an open-source RAG framework developed by deepset for building production-ready LLM applications, retrieval-augmented generation pipelines, and search systems that work intelligently over large document collections.
1. Retrieval: Text Retrieval Techniques
Relevant files
haystack/components/retrievers/in_memory/bm25_retriever.py
haystack/components/rankers/transformers_similarity.py
Technical analysis
- The BM25 retrieval algorithm, whose length normalization improves retrieval over long documents
BM25 (Best Match 25) is an improved algorithm built on the probabilistic retrieval framework. Its core formula is:
\[
Score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}
\]
- \(IDF(q_i)\): inverse document frequency, which down-weights common terms
- \(f(q_i, D)\): frequency of term \(q_i\) in document \(D\)
- \(|D| / avgdl\): ratio of the document's length to the average document length in the corpus
- \(k_1\) (typically 1.2–2.0): hyperparameter controlling how quickly term frequency saturates
- \(b\) (typically 0.75): length-normalization coefficient
- Semantic-similarity document ranking:
  - Uses the Hugging Face cross-encoder/ms-marco-MiniLM-L-6-v2 model, a pretrained cross-encoder that scores the semantic relevance between the query and each document.
  - Overall flow:
    - Pair construction: each input document is paired with the query to form the cross-encoder's input.
    - Model loading: the pretrained Hugging Face model and its tokenizer are loaded for relevance scoring.
    - Tokenization: the tokenizer converts each query-document pair into tensors that are fed to the model.
    - Similarity computation: the model produces one relevance score (logit) per document.
    - Ranking: documents are sorted by score in descending order and returned.
```python
similarity_scores = []
with torch.inference_mode():
    # Batched forward passes: one relevance logit per query-document pair
    for features in inp_dataloader:
        model_preds = self.model(**features).logits.squeeze(dim=1)  # type: ignore
        similarity_scores.extend(model_preds)
similarity_scores = torch.stack(similarity_scores)
if scale_score:
    # Optionally squash logits into (0, 1) with a calibrated sigmoid
    similarity_scores = torch.sigmoid(similarity_scores * calibration_factor)

# Sort documents by score, highest first
_, sorted_indices = torch.sort(similarity_scores, descending=True)
sorted_indices = sorted_indices.cpu().tolist()  # type: ignore
similarity_scores = similarity_scores.cpu().tolist()

ranked_docs = []
for sorted_index in sorted_indices:
    i = sorted_index
    documents[i].score = similarity_scores[i]
    ranked_docs.append(documents[i])

# Drop documents below the optional score threshold, then keep the top_k
if score_threshold is not None:
    ranked_docs = [doc for doc in ranked_docs if doc.score >= score_threshold]
return {"documents": ranked_docs[:top_k]}
```
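For context, a brief usage sketch of this ranker through Haystack 2.x's TransformersSimilarityRanker (the documents and query are illustrative):

```python
from haystack import Document
from haystack.components.rankers import TransformersSimilarityRanker

ranker = TransformersSimilarityRanker(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_k=2,
)
ranker.warm_up()  # loads the model and tokenizer

docs = [
    Document(content="BM25 is a lexical ranking function."),
    Document(content="Cross-encoders jointly encode query and document."),
    Document(content="FAISS is a vector similarity library."),
]
result = ranker.run(query="How do cross-encoders rank documents?", documents=docs)
for doc in result["documents"]:
    print(doc.score, doc.content)
```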
2. Generation: Text Generation with Hugging Face Transformers
Relevant files
haystack/components/generators/hugging_face_local.py
Architecture analysis
- Text generation:
  - The generator uses a Hugging Face Transformers model to produce text.
  - Three inputs are involved:
    - the prompt text (prompt)
    - the generation parameters (updated_generation_kwargs)
    - the stopping criteria built from the stop words (stopping_criteria)
  - Text is produced by invoking the Hugging Face pipeline directly:

```python
output = self.pipeline(prompt, stopping_criteria=self.stopping_criteria_list, **updated_generation_kwargs)
replies = [o["generated_text"] for o in output if "generated_text" in o]
```

- Stop-word handling:
  - During generation, output stops as soon as one of the configured stop words is produced.
  - Implemented with the StopWordsCriteria class, built on the Transformers stopping-criteria mechanism.
  - It takes three parameters:
    - the tokenizer used by the generator (tokenizer)
    - the list of stop words (stop_words)
    - the device the generator runs on (device)
```python
def warm_up(self):
    if self._warmed_up:
        return
    if self.pipeline is None:
        # Lazily build the Hugging Face pipeline on first use
        self.pipeline = pipeline(**self.huggingface_pipeline_kwargs)
    if self.stop_words:
        # Wrap the stop words in a stopping criterion so generation halts on them
        stop_words_criteria = StopWordsCriteria(
            tokenizer=self.pipeline.tokenizer, stop_words=self.stop_words, device=self.pipeline.device
        )
        self.stopping_criteria_list = StoppingCriteriaList([stop_words_criteria])
```
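A short usage sketch tying generation and stop words together (the model choice and stop word are illustrative assumptions):

```python
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-small",          # illustrative small model
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 64},
    stop_words=["Question:"],              # generation halts if this string is produced
)
generator.warm_up()  # builds the pipeline and the StoppingCriteriaList
result = generator.run(prompt="Summarize BM25 in one sentence.")
print(result["replies"][0])
```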
- Streaming callback handling:
  - While generating, a callback is invoked on each newly produced token chunk, so the text can be consumed as it is generated.
  - Implemented with the HFTokenStreamingHandler class.
  - It takes three parameters:
    - the tokenizer used by the generator (tokenizer)
    - the streaming callback to invoke (streaming_callback)
    - the stop words to strip from the stream (stop_words)
```python
streaming_callback = streaming_callback or self.streaming_callback
if streaming_callback:
    num_responses = updated_generation_kwargs.get("num_return_sequences", 1)
    if num_responses > 1:
        msg = (
            "Streaming is enabled, but the number of responses is set to {num_responses}. "
            "Streaming is only supported for single response generation. "
            "Setting the number of responses to 1."
        )
        logger.warning(msg, num_responses=num_responses)
        updated_generation_kwargs["num_return_sequences"] = 1
    # streamer parameter hooks into HF streaming, HFTokenStreamingHandler is an adapter to our streaming
    updated_generation_kwargs["streamer"] = HFTokenStreamingHandler(
        self.pipeline.tokenizer,  # type: ignore
        streaming_callback,
        self.stop_words,  # type: ignore
    )
```
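A short streaming sketch, assuming a small text2text model (the model and callback are illustrative):

```python
from haystack.components.generators import HuggingFaceLocalGenerator

# Print each streamed chunk as soon as the model emits it
def print_chunk(chunk):
    print(chunk.content, end="", flush=True)

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-small",          # illustrative small model
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 64},
    streaming_callback=print_chunk,
)
generator.warm_up()
generator.run(prompt="Explain stop words in retrieval.")
```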
- Serialization and deserialization:
  - A generator instance can be serialized to a dictionary and restored from one, which makes configured generators easy to store, ship, and recreate.
  - Implemented by the to_dict method and the from_dict class method.
```python
def to_dict(self) -> Dict[str, Any]:
    """
    Serializes the component to a dictionary.

    :returns:
        Dictionary with serialized data.
    """
    callback_name = serialize_callable(self.streaming_callback) if self.streaming_callback else None
    serialization_dict = default_to_dict(
        self,
        huggingface_pipeline_kwargs=self.huggingface_pipeline_kwargs,
        generation_kwargs=self.generation_kwargs,
        streaming_callback=callback_name,
        stop_words=self.stop_words,
        token=self.token.to_dict() if self.token else None,
    )
    huggingface_pipeline_kwargs = serialization_dict["init_parameters"]["huggingface_pipeline_kwargs"]
    huggingface_pipeline_kwargs.pop("token", None)
    serialize_hf_model_kwargs(huggingface_pipeline_kwargs)
    return serialization_dict
```
```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalGenerator":
    """
    Deserializes the component from a dictionary.

    :param data:
        The dictionary to deserialize from.
    :returns:
        The deserialized component.
    """
    deserialize_secrets_inplace(data["init_parameters"], keys=["token"])
    init_params = data.get("init_parameters", {})
    serialized_callback_handler = init_params.get("streaming_callback")
    if serialized_callback_handler:
        data["init_parameters"]["streaming_callback"] = deserialize_callable(serialized_callback_handler)
    huggingface_pipeline_kwargs = init_params.get("huggingface_pipeline_kwargs", {})
    deserialize_hf_model_kwargs(huggingface_pipeline_kwargs)
    return default_from_dict(cls, data)
```
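A minimal round-trip sketch using the two methods above (the model choice is illustrative):

```python
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(model="google/flan-t5-small", task="text2text-generation")
spec = generator.to_dict()                            # plain dict, safe to dump as YAML/JSON
restored = HuggingFaceLocalGenerator.from_dict(spec)  # reconstructs an equivalent component
```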
3. Pipeline Architecture: High-Cohesion, Low-Coupling System Design
Relevant files
haystack/core/pipeline/pipeline.py
e2e/pipelines/test_rag_pipelines_e2e.py
Technical analysis
- Modular pipeline (a runnable wiring sketch follows this list):

```mermaid
graph TD
    A[Retriever] -->|BM25 + semantic retrieval| B(Reranker)
    B -->|Top-K documents| C[Augmenter]
    C -->|summarization / entity linking| D[Generator]
    D -->|generated text| E[Output]
```
- Core features:
  - Asynchronous executor: uses Celery to run retrieval and generation in parallel
  - Caching middleware: builds a FAISS vector cache for high-frequency queries
  - Dynamic routing: a decision tree selects the retrieval strategy (BM25 / semantic / hybrid) automatically, switching on query length and domain
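As a concrete counterpart to the diagram above, here is a minimal Haystack 2.x wiring of the Retriever → Reranker stages (the component names, toy documents, and query are illustrative; the Augmenter stage is custom, see section 4):

```python
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.rankers import TransformersSimilarityRanker

store = InMemoryDocumentStore()
store.write_documents([
    Document(content="Quantum computers use qubits for parallel computation."),
    Document(content="Quantum decoherence limits circuit depth."),
])

pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=store, top_k=10))
pipe.add_component("ranker", TransformersSimilarityRanker(top_k=3))
pipe.connect("retriever.documents", "ranker.documents")  # retriever output feeds the ranker

query = "What limits quantum computing?"
result = pipe.run({"retriever": {"query": query}, "ranker": {"query": query}})
print(result["ranker"]["documents"])
```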
4. Augmentation: Semantic Augmentation Strategies (must be implemented manually)
- Document summarization:
  - Use a BART model to summarize the retrieved documents, reducing input noise for the generation module.
  - Example code
```python
from haystack import Pipeline, Document
from haystack.nodes import BM25Retriever, TransformersGenerator
from haystack.document_stores import InMemoryDocumentStore

# NOTE: illustrative sketch against the Haystack 1.x node API;
# TransformersGenerator stands in for a generator node here.

# Scenario 1: query expansion with BART
def query_expansion_with_bart(query: str, top_k: int = 3):
    # Initialize the BART generator (summarization checkpoint)
    bart_generator = TransformersGenerator(
        model_name_or_path="facebook/bart-large-cnn",  # "facebook/bart-large" also works
        max_length=100,
        top_k=top_k,
    )
    # Build the query-expansion pipeline
    pipe = Pipeline()
    pipe.add_node(component=bart_generator, name="bart_expander", inputs=["Query"])
    # Run: generate the expanded queries
    results = pipe.run(query=query)
    generated_queries = [result["generated_text"] for result in results["results"]]
    return generated_queries

# Example: expand one question into several related queries
original_query = "What are the application scenarios of quantum computing?"
expanded_queries = query_expansion_with_bart(original_query)
print(f"Expanded queries: {expanded_queries}")

# Scenario 2: document semantic augmentation with BART
document_store = InMemoryDocumentStore()
retriever = BM25Retriever(document_store=document_store)

# Add sample documents
documents = [
    Document(content="Quantum computers use qubits to achieve massively parallel computation."),
    Document(content="Quantum decoherence is the main technical challenge in quantum computing."),
]
document_store.write_documents(documents)

# Custom BART semantic-augmentation node
class BARTAugmenter(TransformersGenerator):
    def __init__(self):
        super().__init__(
            model_name_or_path="facebook/bart-large",
            generation_kwargs={"max_length": 150, "num_beams": 4},
            top_k=2,
        )

    def run(self, query: str, documents: list):
        # Generate augmented content for each retrieved document
        augmented_docs = []
        for doc in documents:
            generated = self.generate(query=doc.content, documents=[doc])["results"]
            augmented_content = doc.content + " [augmented] " + generated[0]["generated_text"]
            augmented_docs.append(Document(content=augmented_content))
        return {"documents": augmented_docs}

# Build the augmentation pipeline
augmentation_pipeline = Pipeline()
augmentation_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
augmentation_pipeline.add_node(component=BARTAugmenter(), name="bart_augmenter", inputs=["retriever"])

# Run: fetch the augmented documents
enhanced_results = augmentation_pipeline.run(
    query="quantum computing",
    params={"retriever": {"top_k": 2}, "bart_augmenter": {"top_k": 1}},
)
for doc in enhanced_results["documents"]:
    print(f"Augmented document content:\n{doc.content}\n{'-' * 50}")
```
- Multi-hop reasoning:
  - Iteratively chain "retrieve → generate an intermediate answer → retrieve again" to answer complex questions (a loop sketch follows the example below).
  - Example code
```python
from haystack import Pipeline

# One hop of the chain: retrieve, rerank, then generate
# (retriever, reranker, and generator are assumed to be defined as above)
pipe = Pipeline()
pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipe.add_node(component=reranker, name="Reranker", inputs=["Retriever"])
pipe.add_node(component=generator, name="Generator", inputs=["Reranker"])
```
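The pipeline above covers a single hop. A hedged sketch of the iterative loop itself, with `retrieve` and `generate` as hypothetical placeholder callables standing in for the pipeline stages:

```python
def multi_hop_answer(question: str, retrieve, generate, max_hops: int = 3):
    """Iteratively retrieve, generate an intermediate answer, and re-retrieve.

    `retrieve(query) -> list[str]` and `generate(query, context) -> str` are
    placeholder callables standing in for the Retriever and Generator stages.
    """
    context: list[str] = []
    query = question
    answer = ""
    for _ in range(max_hops):
        context.extend(retrieve(query))          # gather evidence for this hop
        answer = generate(query, context)        # intermediate answer
        # Stop once the generator no longer asks a follow-up question;
        # this termination heuristic is an assumption, not Haystack logic.
        if "?" not in answer:
            return answer
        query = answer  # use the intermediate answer as the next-hop query
    return answer
```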
