[Tools] Converting PDF / DOCX to Markdown

Marker

Introduction

Marker converts documents to markdown, JSON, chunks, and HTML quickly and accurately.

  • Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
  • Formats tables, forms, equations, inline math, links, references, and code blocks
  • Extracts and saves images
  • Removes headers/footers/other artifacts
  • Extensible with your own formatting and logic
  • Does structured extraction, given a JSON schema (beta)
  • Optionally boost accuracy with LLMs (and your own prompt)
  • Works on GPU, CPU, or MPS

Installation

You'll need Python 3.10+ and PyTorch.
Python 3.12 is recommended (3.14 may cause problems). Windows installers:
https://www.python.org/downloads/windows/
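Before installing, it can help to confirm that the active interpreter falls inside the range described above. The following is a minimal sketch: the bounds simply restate this post's notes (3.10 minimum, 3.14 possibly problematic) and are not read from Marker itself, and the name `check_python` is made up for illustration.

```python
import sys

# Assumption: these bounds restate the notes in this post,
# they are not queried from the marker-pdf package.
MIN_VERSION = (3, 10)   # oldest version said to work
RISKY_FROM = (3, 14)    # versions from here on may cause problems


def check_python(version=None):
    """Classify a (major, minor) tuple as 'too_old', 'untested', or 'ok'."""
    if version is None:
        version = sys.version_info[:2]
    version = tuple(version[:2])
    if version < MIN_VERSION:
        return "too_old"
    if version >= RISKY_FROM:
        return "untested"
    return "ok"


if __name__ == "__main__":
    # Print the running interpreter's version and its classification.
    print(sys.version.split()[0], check_python())
```

If the result is anything other than `ok`, installing Python 3.12 alongside the current interpreter and creating a virtual environment from it is probably the least disruptive fix.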

Install with:

pip install marker-pdf
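To confirm that the install landed in the interpreter you intend to use, one option is to ask the standard library for the installed distribution's version. This is only a sketch; the helper name `installed_version` is hypothetical, and `marker-pdf` is the distribution name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="marker-pdf"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print("marker-pdf:", found if found else "not installed in this environment")
```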

To use Marker on documents other than PDFs, you need to install additional dependencies. On Windows, some of these may fail to build unless the Microsoft C++ toolchain is present, so install it first if pip reports that it is missing:
Get "Microsoft Visual C++ 14.0" or greater: https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-180#latest-supported-redistributable-version
Get "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

Then install with:

pip install marker-pdf[full]

 

Single-file conversion

marker_single /path/test_file.pdf

marker_single /path/test_file.docx
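The same commands can be driven from a script, for example when the conversion is one step in a larger pipeline. The sketch below only wraps the CLI with `subprocess`; the helper names are hypothetical, and the `--output_format` and `--output_dir` flags are taken from my reading of the project README, so confirm them with `marker_single --help` for your installed version.

```python
import shutil
import subprocess


def build_single_command(input_file, output_dir=None, output_format="markdown"):
    """Build the argv list for marker_single without running it."""
    # Assumption: --output_format and --output_dir exist in the installed version.
    cmd = ["marker_single", str(input_file), "--output_format", output_format]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    return cmd


def convert_file(input_file, output_dir=None):
    """Run marker_single on one file; raises if the CLI is missing or fails."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found on PATH; install marker-pdf first")
    subprocess.run(build_single_command(input_file, output_dir), check=True)
```

Calling `convert_file("/path/test_file.pdf", "out")` would then run the first command above with an explicit output directory. Passing the arguments as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.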

 

Batch conversion of a folder

marker /path/folder
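Before pointing `marker` at a large folder, it can be useful to preview which files look convertible. The sketch below is independent of Marker: the suffix list is an assumption derived from the formats named in the introduction (image suffixes are left out because the post does not say which ones are supported), the scan is deliberately non-recursive, and `find_convertible` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: suffixes inferred from the format list in the introduction.
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".docx", ".xlsx", ".html", ".epub"}


def find_convertible(folder):
    """Return sorted names of files directly inside `folder` with a supported suffix."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    import sys

    # Usage: python preview.py /path/folder
    for name in find_convertible(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(name)
```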

 

References

https://github.com/datalab-to/marker

Posted 2026-02-24 13:35 by jinzesudawei