Fork me on GitHub

利用java实现图片表格检测与结构识别

   

Guide

  1. Overview
  2. Requirements
  3. Demo
  4. Modules

Overview

This java package contains modules to help with finding and extracting tabular data from a PDF or image into a CSV format.

Given an image that contains a table…

Extract the the text into a CSV format…

节次 星期,周—,周二,周三,周四,周五
一,语文,英语,英语,自然,数学
二,语文,英语,英语,语文,数学
三,数学,语文,数学,语文,英语
四,数学,语文,数学,体育,英语
五,体育,思想品德,语文,数学,手工
六,美术,音乐,语文,数学,写字

Requirements

See maven dependency jar package.

  • pdfbox 2.0.26
  • javacv 1.5.7
  • djl 0.17.0
  • ...

 

Demo

There is a demo module that will try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

1.table/demo/MainDemo

That will run against the following image:

The following should be saved to your directory after running the class table/demo/MainDemo.java.

Extracted the following tables from the image:
[('/img_test/simple.png', ['/img_test/simple/table-0.png'])]
Extracted cells from /img_test/simple/table-0.png
Cells:
/img_test/simple/table-0/0-0.png
/img_test/simple/table-0/0-1.png
/img_test/simple/table-0/0-2.png
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

  • pdf_to_images uses pdfbox to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_cells extracts and orders cells from a table.
  • ocr_image uses djl to OCR the text from an image of a cell.
  • ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow.

project_address : https://github.com/jiangnanboy/table_ocr_java

posted @ 2023-07-15 15:27  石头木  阅读(357)  评论(0编辑  收藏  举报