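RAG Study Notes: Reading and Splitting PDFs

The RAG pipeline:

Offline:

1. Load the documents

2. Split the documents

3. Embed the chunks as vectors

4. Load the vectors into a vector database

Online:

1. Receive the user's question

2. Embed the question as a vector

3. Search the vector database

4. Fill the retrieved results and the question into a prompt template

5. Call the LLM with the resulting prompt

6. The LLM generates the final reply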
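
The two phases above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: the bag-of-words embed function stands in for a real embedding model, a Python list stands in for the vector database, the sample chunks and the prompt template are made up, and the LLM call (online steps 5 and 6) is left out because it depends on the provider.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Offline steps 1-2 (loading and splitting) are assumed done; these chunks are made-up samples.
chunks = [
    "Llama 2 is a collection of pretrained and fine-tuned large language models.",
    "The models range in scale from 7 billion to 70 billion parameters.",
    "Bananas are rich in potassium.",
]

# Offline steps 3-4: embed every chunk and store it (a list plays the vector database).
index = [(embed(chunk), chunk) for chunk in chunks]


def retrieve(question, top_k=2):
    """Online steps 2-3: embed the question and return the top_k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]


# Hypothetical prompt template, for illustration only.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, top_k=2):
    """Online step 4: fill the retrieved chunks and the question into the template."""
    context = "\n".join(retrieve(question, top_k))
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt("How many parameters does Llama 2 have?")
print(prompt)
# Online steps 5-6 would send `prompt` to an LLM; that call is omitted here.
```

Swapping embed for a real embedding model and the list for a vector database changes the quality of retrieval but not the shape of the flow.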

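This post covers document loading and splitting (loading and splitting a PDF).

1. Document loading

Load the PDF:

llama2.pdf

Install the PDF-reading package:

pip install pdfminer.six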

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

# Extract text from a PDF and regroup it into paragraphs
def extract_text_from_pdf(pdf_path, page_numbers=None, min_line_length=1):
    paragraphs = []
    buff = ''
    full_text = ''
    # Collect the text of every text container on the selected pages
    for i, page_layout in enumerate(extract_pages(pdf_path)):
        # page_numbers holds 0-based page indices; None means all pages
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'
    # Lines shorter than min_line_length (including blank lines) end a paragraph
    lines = full_text.split('\n')
    for line in lines:
        if len(line) >= min_line_length:
            if buff.endswith('-'):
                # The previous line broke a word with a hyphen: rejoin the word
                buff = buff[:-1] + line
            else:
                buff += (' ' + line) if buff else line
        elif buff:
            paragraphs.append(buff)
            buff = ''
    if buff:
        paragraphs.append(buff)
    return paragraphs

# Call the function and print the first four paragraphs
paragraphs = extract_text_from_pdf('llama2.pdf', min_line_length=4)
for para in paragraphs[:4]:
    print(para + '\n')
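
The paragraph-regrouping step can be tried without a PDF. The sketch below is a standalone variant of that step, not the code above verbatim: lines_to_paragraphs is a hypothetical helper name, the sample text is made up, and the hyphen handling (dropping a trailing hyphen and joining the next line directly) is one assumed way to treat words broken across lines. It would also merge genuine compounds such as "fine-" / "tuned" into one word.

```python
def lines_to_paragraphs(text, min_line_length=1):
    """Group consecutive long-enough lines into paragraphs (hypothetical helper)."""
    paragraphs = []
    buff = ''
    for line in text.split('\n'):
        if len(line) >= min_line_length:
            if buff.endswith('-'):
                # Previous line ended mid-word: drop the hyphen and join directly
                buff = buff[:-1] + line
            else:
                # Otherwise join lines with a single space
                buff += (' ' + line) if buff else line
        elif buff:
            # A blank or too-short line closes the current paragraph
            paragraphs.append(buff)
            buff = ''
    if buff:
        paragraphs.append(buff)
    return paragraphs


# Made-up sample: a hyphenated line break, a blank line, and a stray page number
sample = (
    "In this work, we develop and re-\n"
    "lease a family of models.\n"
    "\n"
    "12\n"
    "A second paragraph that spans\n"
    "two lines.\n"
)

result = lines_to_paragraphs(sample, min_line_length=4)
for para in result:
    print(para)
```

With min_line_length=4 the stray "12" is too short to count as content, so it is dropped rather than glued onto the following paragraph.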

Run the script in a terminal with py .\pdfread.py to display the result.

This completes PDF loading and splitting.

posted @ 2024-05-12 22:17 kin2022
