unstructured

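unstructured is an open-source Python library for processing unstructured data: it extracts text from PDF files, Word documents, HTML files, and similar sources, and converts it into a structured format.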

(1) Install the library

 pip install unstructured

Usage: text

from unstructured.partition.auto import partition

filename = "a.txt"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)
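
partition returns a list of element objects rather than one large string, so the result can be organized before printing. The following is a minimal sketch, assuming each element exposes a `category` attribute (such as "Title" or "NarrativeText") alongside `text`; the helper names `group_text_by_category` and `summarize` are hypothetical and not part of the library.

```python
from collections import defaultdict


def group_text_by_category(elements):
    """Map each element category to the list of texts found under it."""
    grouped = defaultdict(list)
    for element in elements:
        # Assumption: elements carry a `category` attribute; fall back to
        # the class name if a particular element type does not.
        category = getattr(element, "category", type(element).__name__)
        grouped[category].append(element.text)
    return dict(grouped)


def summarize(filename):
    """Partition a file and group its text by element category."""
    # Imported lazily so the grouping helper can be used on its own.
    from unstructured.partition.auto import partition

    return group_text_by_category(partition(filename=filename))
```

Calling `summarize("a.txt")` should give a dictionary keyed by category, which makes it easy to pull out only the titles or only the body text.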

docx

from unstructured.partition.auto import partition

filename = "c.docx"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)

Requires installing the docx extra:

pip install "unstructured[docx]"

markdown

from unstructured.partition.auto import partition

filename = "README.md"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)

pdf

from unstructured.partition.auto import partition

filename = "aa.pdf"
docs = partition(filename=filename)
for doc in docs:
    print(doc.text)

Or call the PDF-specific partitioner directly:

from unstructured.partition.pdf import partition_pdf

filename = "aa.pdf"
docs = partition_pdf(filename=filename)
for doc in docs:
    print(doc.text)

 

Requires installing the pdf extra:

pip install "unstructured[pdf]"

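Note:

  cmake is required during installation. Install it with your distribution's package manager:

# Debian/Ubuntu
sudo apt install cmake

# CentOS/RHEL
sudo yum install cmake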

(2) Deploy the service locally

https://github.com/Unstructured-IO/unstructured-api 

docker pull downloads.unstructured.io/unstructured-io/unstructured-api:latest

Start the container

docker run -p 9500:9500 -d --rm --name unstructured-api -e PORT=9500 downloads.unstructured.io/unstructured-io/unstructured-api:latest

Service URL

url = "http://localhost:9500/general/v0/general"
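
Once the container is running, a file can be sent to this URL as a multipart form upload. The sketch below uses only the standard library. It assumes the `files` form field used in the curl example of the unstructured-api README, and that the response is a JSON list of element dictionaries with a `text` key. The function names are hypothetical.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "http://localhost:9500/general/v0/general"


def build_multipart(field, filename, data):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def partition_via_api(path, url=API_URL):
    """POST a local file to the service and return the parsed JSON response."""
    file_path = Path(path)
    # Assumption: the service reads the upload from a form field named "files".
    body, content_type = build_multipart(
        "files", file_path.name, file_path.read_bytes()
    )
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def element_texts(elements):
    """Collect the text of each element dictionary (empty string if absent)."""
    return [element.get("text", "") for element in elements]
```

With the service up, `element_texts(partition_via_api("a.txt"))` should yield roughly the same text as the local `partition` call in section (1).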

posted @ 2025-03-19 22:47  慕尘