Installing and Deploying GOT in an NPU Environment, with Initial Tests of OCR 2.0 Results

I. GOT Installation Environment

1. Environment Setup

GOT requires Python 3.10 and the following dependency versions:

torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
tiktoken==0.6.0
verovio==4.3.1
accelerate==0.28.0
gcc>=10.1.0
ascend>=8.0

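The Conda environment on the NPU server ships Python 3.7, so Python 3.10 has to be downloaded and installed separately, together with the GOT dependencies listed above and their own dependencies. To avoid affecting other applications on the NPU server, GOT runs in a venv virtual environment built on the new Python 3.10, located at /home/ma-user/work/projects/GOT-OCR2.0-main/ocr2_got_venv.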
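One way to confirm that the venv really holds the pinned versions is a short check run with the venv's interpreter. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, and it assumes that a local build suffix such as "+cpu" should be ignored when comparing versions.

```python
from importlib import metadata

# Pinned Python packages from the list above (gcc and ascend are not pip packages).
PINNED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "transformers": "4.37.2",
    "tiktoken": "0.6.0",
    "verovio": "4.3.1",
    "accelerate": "0.28.0",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to what was found."""
    problems = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Assumption: ignore local build suffixes such as "2.0.1+cpu".
        if found is None or found.split("+")[0] != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {found}")
```

An empty result means every pinned package is present at the expected version.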

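Figure 1 Installation path of the Python virtual environment used by the GOT model

Note that GOT relies on the verovio package to parse music staff notation. verovio compiles C++ code with g++ and needs C++20 support, which GCC 7.3.0 on the NPU server does not provide, so GCC must be upgraded to 10.1.0 or later.
In addition, the Ascend environment must be upgraded to 8.0 or later for the GOT model code to run correctly. Ascend 8.0 is installed separately under /home/ma-user/Ascend, apart from the existing 6.0 version. To switch to 8.0, run the bash script /home/ma-user/Ascend/ascend-toolkit/set_env.sh (sourcing it applies the settings to the current shell).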
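The GCC requirement above can be checked before attempting to build verovio. This is a minimal sketch with hypothetical helper names; it assumes g++ is on PATH and uses the "-dumpfullversion -dumpversion" flag pair to print the bare version number.

```python
import re
import shutil
import subprocess

MINIMUM_GCC = (10, 1, 0)  # minimum version stated above


def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text or "")
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version_text, minimum=MINIMUM_GCC):
    """True when the version in version_text is at least the minimum."""
    version = parse_version(version_text)
    return version is not None and version >= minimum


def installed_gpp_version():
    """Return the version string reported by g++, or None if g++ is absent."""
    if shutil.which("g++") is None:
        return None
    result = subprocess.run(
        ["g++", "-dumpfullversion", "-dumpversion"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    found = installed_gpp_version()
    status = "OK" if meets_minimum(found) else "upgrade needed"
    print(f"g++ version: {found} ({status})")
```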

Figure 2 Ascend 8.0 installation path

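2. GOT Model Deployment

2.1 Model Resources

The resources for the GOT model are:

Paper: https://arxiv.org/pdf/2409.01704
Source code: https://github.com/Ucas-HaoranWei/GOT-OCR2.0
GOT model weights: https://huggingface.co/stepfun-ai/GOT-OCR2_0/tree/main or https://modelscope.cn/models/stepfun-ai/GOT-OCR2_0/summary

2.2 Model Deployment Directory

The GOT model deployment directory is shown in Figure 3:

Figure 3 Overview of the GOT model deployment path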

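II. GOT Test Environment

1. Initializing the Test Environment

Initialize the test environment by following the steps in Figure 4.

Figure 4 Initializing the GOT test environment

2. Test Scripts

GOT testing relies on the run_ocr_2.0.py and run_ocr_2.0_crop.py scripts shown in Figure 4. Both were copied from the GOT demo scripts shown in Figure 5 and then adapted to the NPU by hand, because the demo scripts hard-code cuda calls in many places.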
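The adapted scripts themselves are not reproduced here. As a rough illustration of how hard-coded cuda strings can be replaced, the sketch below picks a device name at runtime. It assumes the torch_npu adapter package is installed on the NPU server; the helper name pick_device is hypothetical and is not taken from the GOT repository.

```python
import importlib.util


def pick_device():
    """Return "npu" when an Ascend NPU is usable, else "cuda", else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch itself is missing; nothing to select
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (importing registers the "npu" device)
            if torch.npu.is_available():
                return "npu"
        except Exception:
            pass  # adapter present but unusable; fall through to cuda/cpu
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    # Typical use in place of a hard-coded call such as model.cuda():
    #     model = model.to(device)
    #     image_tensor = image_tensor.to(device)
```

Routing every device reference through one helper keeps the same script usable on both GPU and NPU machines.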

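Figure 4 GOT model test scripts and their paths

Note that GOT cannot read PDFs or similar source documents directly. They must first be converted to images such as png, jpg, or webp before the run_ocr scripts can be called.

Figure 5 GOT model demo scripts

3. Test Image Resources

Figure 6 shows where the test images are stored. The pdf folder holds the test resources for the multi-crop and multi-page modes.

Figure 6 Test image resources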

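4. PDF-to-Image Tool

Before running GOT OCR on a PDF, the PDF must be converted to an image format such as png. For this test, a conversion script named transfer_pdf2image.py was written on top of the pdf2image package.

4.1 Installing pdf2image

    pip install pdf2image

4.2 Installing poppler

pdf2image depends on the poppler tools to work, so poppler must be installed; otherwise transfer_pdf2image.py fails with an error.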

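Windows

Download and install the binary package Release-XX.XX.X.zip from:
    https://github.com/oschwartz10612/poppler-windows/releases/tag/v21.03.0

Linux

Download the source code from the following URL, then build and install it.
    https://github.com/oschwartz10612/poppler-windows/releases/tag/v21.03.0
This URL is the poppler-windows repository. On most Linux distributions poppler can also be installed from the system package manager (for example, the poppler-utils package on Debian/Ubuntu).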
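After installing poppler, it is worth checking that its command-line tools are visible on PATH, since pdf2image calls them as external programs. A minimal sketch follows; the helper name is hypothetical, and it checks pdftoppm and pdfinfo, the two tools pdf2image uses by default.

```python
import shutil

# Command-line tools from poppler that pdf2image invokes by default.
REQUIRED_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(which=shutil.which):
    """Return the poppler tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("poppler looks usable")
```

On Windows, where poppler is unpacked from a zip file, the folder containing these executables has to be added to PATH, or passed to convert_from_path through its poppler_path argument.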

5. Test Script Usage Examples

5.1 Plain Text OCR

This command recognizes only the text in the given image.

    python3 run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type ocr

5.2 Format Text OCR

Besides the text in the given image, this command also recognizes the formatting of that text. Use it for content that is not plain text, such as tables, math formulas, molecular formulas, geometric figures, and sheet music.

    python3 run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format

5.3 Fine-Grained OCR

This command recognizes the text inside one region of the image, selected either by a box (verified) or by a color (not verified).

    python3 run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --box [x1,y1,x2,y2]
    python3 run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format/ocr --color red/green/blue
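The --box value has to be written without spaces, and square brackets can be interpreted by some shells. One way to avoid both pitfalls is to build the argument in Python and pass the command as a list. This is a minimal sketch with hypothetical helper names; it assumes (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner, in pixels.

```python
def format_box(x1, y1, x2, y2):
    """Format box corners as the "[x1,y1,x2,y2]" string used by --box."""
    coords = (x1, y1, x2, y2)
    if any(c < 0 for c in coords):
        raise ValueError("box coordinates must be non-negative")
    if x1 >= x2 or y1 >= y2:
        raise ValueError("expected x1 < x2 and y1 < y2")
    return "[" + ",".join(str(int(c)) for c in coords) + "]"


def fine_grained_command(image_file, box, weights="/GOT_weights/", ocr_type="ocr"):
    """Build the argument list for a box-restricted OCR run."""
    return [
        "python3", "run_ocr_2.0.py",
        "--model-name", weights,
        "--image-file", image_file,
        "--type", ocr_type,
        "--box", format_box(*box),
    ]


if __name__ == "__main__":
    cmd = fine_grained_command("/an/image/file.png", (0, 600, 2000, 1200))
    print(" ".join(cmd))
    # To execute: subprocess.run(cmd, check=True)  (a list bypasses the shell)
```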

5.4 Multi-Crop OCR

This command cuts a large image into smaller tiles and then recognizes the text in each tile in turn.

    python3 run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /an/image/file.png
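The idea of cutting a large image into tiles can be illustrated with a simple grid. The sketch below only shows the general technique and is not the cropping logic used by run_ocr_2.0_crop.py; the function name and the tile size of 1024 are assumptions chosen for illustration.

```python
def tile_boxes(width, height, tile):
    """Split a width x height image into a row-major grid of crop boxes.

    Each box is (left, top, right, bottom); tiles on the right and bottom
    edges are clipped to the image size.
    """
    if tile <= 0:
        raise ValueError("tile size must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


if __name__ == "__main__":
    for box in tile_boxes(2000, 1000, 1024):
        print(box)
    # With Pillow, each box can be passed to Image.crop(box) to get the tile.
```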

5.5 Multi-Page OCR

This command reads all images in the given folder and recognizes the text in each image in turn.

    python3 run_ocr_2.0_crop.py --model-name /GOT_weights/ --image-file /images/path/ --multi-page
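When a folder holds pages named like GraphRAG_page_1.png through GraphRAG_page_10.png, plain alphabetical ordering puts page 10 before page 2. Whether run_ocr_2.0_crop.py sorts its input is not stated here, so the following sketch only shows how the page files can be listed in numeric order for inspection or pre-processing. The helper names are hypothetical.

```python
import re
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def natural_key(name):
    """Sort key that compares digit runs as numbers, so page_2 < page_10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def page_images(folder):
    """Return the image files directly inside folder, in natural page order."""
    files = [p for p in Path(folder).iterdir()
             if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES]
    return sorted(files, key=lambda p: natural_key(p.name))


if __name__ == "__main__":
    names = ["GraphRAG_page_10.png", "GraphRAG_page_2.png", "GraphRAG_page_1.png"]
    print(sorted(names, key=natural_key))
```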

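5.6 Render OCR Result

This command saves the OCR result to results/demo.html. In this test, demo.html was generated for every data type except verovio, which failed with a memory allocation error. However, opening demo.html in Chrome or Edge displays nothing.

    python3 run_ocr_2.0.py --model-name /GOT_weights/ --image-file /an/image/file.png --type format --render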

III. GOT Test Results

1. Plain Text OCR

1.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_page/test1/GraphRAG_page_1.png --type ocr

GOT output

2. Fine-Grained OCR

2.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_page/test1/GraphRAG_page_1.png --type ocr --box [0,600,2000,1200]

GOT output

3. Table OCR

3.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/table_demo1.webp --type format --render

GOT output

4. Math OCR

4.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/math_demo1.webp --type format

GOT output

5. Molecular OCR

5.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/chemistry_demo1.jpg --type format --render

GOT output

6. Geometry OCR

6.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/geometry_demo1.jpg --type format --render

GOT output

7. Verovio OCR

7.1 Test Case

Command

    python run_ocr_2.0.py --model-name ../weights/ --image-file ../assets/images/verovio_demo1.webp --type format --render

GOT output

8. Multi-Crop OCR

8.1 Test Case 1

Command

    python run_ocr_2.0_crop.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_crop/2104.09864v5.png --render

GOT output

8.2 Test Case 2

Command

    python run_ocr_2.0_crop.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_crop/GraphRAG.png --render

GOT output

......

9. Multi-Page OCR

9.1 Test Case 1

Command

    python run_ocr_2.0_crop.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_page/test1 --multi-page --render

GOT output

9.2 Test Case 2

Command

    python run_ocr_2.0_crop.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_page/test2 --multi-page --render

GOT output

9.3 Test Case 3

Command

    python run_ocr_2.0_crop.py --model-name ../weights/ --image-file ../assets/images/pdf/multi_page/test3 --multi-page --render

GOT output
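The test commands in this part share one pattern, so they can be collected in a small driver that prints or runs them in sequence. This is a minimal sketch, not a script from the original test setup: the names CASES, build_command, and run_cases are hypothetical, only a few of the cases are listed, and it defaults to a dry run that just prints each command.

```python
import subprocess
import sys

WEIGHTS = "../weights/"
IMAGES = "../assets/images"

# (script, image path, extra arguments), taken from the test cases above.
CASES = [
    ("run_ocr_2.0.py", f"{IMAGES}/pdf/multi_page/test1/GraphRAG_page_1.png", ["--type", "ocr"]),
    ("run_ocr_2.0.py", f"{IMAGES}/table_demo1.webp", ["--type", "format", "--render"]),
    ("run_ocr_2.0.py", f"{IMAGES}/math_demo1.webp", ["--type", "format"]),
    ("run_ocr_2.0_crop.py", f"{IMAGES}/pdf/multi_crop/GraphRAG.png", ["--render"]),
    ("run_ocr_2.0_crop.py", f"{IMAGES}/pdf/multi_page/test1", ["--multi-page", "--render"]),
]


def build_command(script, image, extra, weights=WEIGHTS):
    """Build the argument list for one test run, using the current interpreter."""
    return [sys.executable, script, "--model-name", weights, "--image-file", image, *extra]


def run_cases(cases, dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    results = []
    for script, image, extra in cases:
        cmd = build_command(script, image, extra)
        print(" ".join(cmd))
        code = None if dry_run else subprocess.run(cmd).returncode
        results.append((cmd, code))
    return results


if __name__ == "__main__":
    run_cases(CASES)  # pass dry_run=False to actually run the scripts
```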

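IV. Summary

GOT is a composite large model with an Encoder-Decoder architecture. The Encoder is a ViT-like vision encoder, and the Decoder is the Qwen-0.5B language model, which decodes the Encoder's output into the OCR-equivalent recognition result.
In practice, PDF recognition in Plain Text and Format Text modes is acceptable. Multi-crop and multi-page also perform reasonably on small inputs, but recognition quality drops sharply as the amount of data grows. Non-text data types were also tested briefly: tables and formulas are recognized fairly well, geometric figures and chemical notation only moderately, and verovio sheet music is recognized but cannot be rendered to html.
In addition, the rendered html that GOT generates does not display in Chrome or Edge (a blank page is shown).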

-------------------------------------------------------------------------------------------------------------------------------

Appendix: transfer_pdf2image.py

import os
from pathlib import Path

from PIL import Image
from pdf2image import convert_from_path


def make_dirs_ifnecessary(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    if not os.path.exists(path):
        print(f"Creating {path}")
        os.makedirs(path)


def convert_pdf2image(pdf_path, output_dir, output_whole_also_flag=False):
    """
    Convert a PDF to PNG. By default each page of the PDF becomes one PNG file.
    Set output_whole_also_flag to True to also produce a single PNG that contains
    all pages. pdf_path may be a single PDF file or a directory of PDFs.
    """
    if pdf_path is None:
        print("pdf_path is None")
        return

    if not isinstance(pdf_path, str):
        print(f"pdf_path's type is {type(pdf_path)}, which is not supported")
        return

    if os.path.isfile(pdf_path):
        output_file = os.path.join(output_dir, Path(pdf_path).name)
        return _do_convert_pdf2image(pdf_path, output_file, output_whole_also_flag)
    else:
        # Directory input: mirror the sub-directory layout under output_dir.
        for cur_dir, _, sub_files in os.walk(pdf_path, topdown=True):
            # relpath has no leading separator; a leading separator would make
            # os.path.join discard output_dir.
            intermediate_dir = os.path.relpath(cur_dir, pdf_path)
            for sub_file in sub_files:
                if not sub_file.lower().endswith(".pdf"):
                    continue  # skip files that are not PDFs
                output_file = os.path.join(output_dir, intermediate_dir, sub_file)
                _do_convert_pdf2image(os.path.join(cur_dir, sub_file), output_file, output_whole_also_flag)


def _do_convert_pdf2image(pdf_path, output_file, output_whole_also_flag):
    print(f"Converting {pdf_path}")

    output_file = Path(output_file)
    make_dirs_ifnecessary(str(output_file.parent))

    pdf_images = convert_from_path(pdf_path)

    if len(pdf_images) <= 0:
        print("No content needs conversion")
        return

    # Save every page as its own PNG file
    for i, image in enumerate(pdf_images):
        print(f"Exporting page {i + 1}...")
        output_page_file = output_file.parent / f"{output_file.stem}_page_{i + 1}.png"
        image.save(str(output_page_file), "PNG")

    if output_whole_also_flag:
        whole_file = output_file.with_suffix(".png")
        print(f"Exporting {whole_file}...")

        if len(pdf_images) > 1:
            # Stack all pages vertically, using the first page's size as reference
            page_width, page_height = pdf_images[0].width, pdf_images[0].height
            combined_width, combined_height = page_width, len(pdf_images) * page_height
            # Mode "L" is kept from the original script, so the combined image is grayscale
            combined_image = Image.new("L", (combined_width, combined_height))

            for i, image in enumerate(pdf_images):
                if i > 0:
                    image = image.resize((page_width, page_height))
                combined_image.paste(image, (0, page_height * i))

            combined_image.save(str(whole_file), "PNG")
        else:
            pdf_images[0].save(str(whole_file), "PNG")


if __name__ == "__main__":
    file_root = "C:\\Users\\t60031665\\Downloads"
    pdf_path = os.path.join(file_root, "From Local to Global_A Graph RAG Approach to Query-Focused Summarization.pdf")

    convert_pdf2image(pdf_path, os.getcwd(), output_whole_also_flag=True)


posted @ 2026-03-30 16:56 tgltt
