
Intelligent text automatic processing (Part 1)

AutoText

An intelligent text automatic processing tool.

Project URL: https://github.com/jiangnanboy/AutoText

AutoText's main features are text correction, image OCR, and table structure recognition.

Guide

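Text correction

  • For details on text correction, see jcorrector
  • This project mainly covers ngram-based correction, deep-learning-based correction, template-based Chinese grammar correction, and correction of idioms and proper names
  • For detailed usage, see examples/correct in this project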
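
To make the ngram idea a little more concrete, here is a minimal sketch of one common way such a corrector can work. It is not jcorrector's implementation. It counts token bigrams in a tiny made-up corpus, then tries each substitution from a confusion set and keeps the sentence with the highest smoothed bigram score. The class name `BigramCorrectorSketch`, the corpus, and the confusion set are all hypothetical; for Chinese text the tokens would typically be single characters or segmented words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of AutoText or jcorrector.
class BigramCorrectorSketch {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Count unigrams and adjacent token pairs in the training sentences.
    BigramCorrectorSketch(List<List<String>> corpus) {
        for (List<String> sentence : corpus) {
            for (int i = 0; i < sentence.size(); i++) {
                unigramCounts.merge(sentence.get(i), 1, Integer::sum);
                if (i + 1 < sentence.size()) {
                    bigramCounts.merge(sentence.get(i) + " " + sentence.get(i + 1), 1, Integer::sum);
                }
            }
        }
    }

    // Log-probability of a sentence under the bigram model with add-one smoothing.
    double score(List<String> sentence) {
        int vocab = Math.max(unigramCounts.size(), 1);
        double logProb = 0.0;
        for (int i = 0; i + 1 < sentence.size(); i++) {
            int pair = bigramCounts.getOrDefault(sentence.get(i) + " " + sentence.get(i + 1), 0);
            int first = unigramCounts.getOrDefault(sentence.get(i), 0);
            logProb += Math.log((pair + 1.0) / (first + vocab));
        }
        return logProb;
    }

    // Try every single-token substitution from the confusion set and keep
    // the candidate that scores strictly higher than the best seen so far.
    List<String> correct(List<String> sentence, Map<String, List<String>> confusionSet) {
        List<String> best = sentence;
        double bestScore = score(sentence);
        for (int i = 0; i < sentence.size(); i++) {
            for (String replacement : confusionSet.getOrDefault(sentence.get(i), List.of())) {
                List<String> candidate = new ArrayList<>(sentence);
                candidate.set(i, replacement);
                double candidateScore = score(candidate);
                if (candidateScore > bestScore) {
                    best = candidate;
                    bestScore = candidateScore;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("i", "like", "to", "eat", "apples"),
                List.of("they", "like", "to", "eat", "bananas"),
                List.of("we", "want", "to", "read"));
        BigramCorrectorSketch corrector = new BigramCorrectorSketch(corpus);
        Map<String, List<String>> confusionSet = Map.of("too", List.of("to", "two"));
        // prints [i, like, to, eat, apples]
        System.out.println(corrector.correct(List.of("i", "like", "too", "eat", "apples"), confusionSet));
    }
}
```

A real corrector would use a much larger language model and a curated confusion set; the sketch only shows the scoring-and-selection loop.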

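Image OCR

  • This part mainly uses the detection and recognition components of PaddleOCR, with their models converted to ONNX format for inference. The project preprocesses each image before recognition, so that in a CPU environment one image takes about 10 seconds on average.

  • For detailed usage, see examples/ocr/text/OcrDemo in this project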

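  • PS

    • Model download (network drive)
    • Extraction code: b5vq
    • After downloading, the models can be placed under text_recgo in resources, or in another location
  • Usage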

    // read image file
    String imagePath = "examples\\ocr\\img_test\\text_example.png";
    var imageFile = Paths.get(imagePath);
    var image = ImageFactory.getInstance().fromFile(imageFile);
    
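    // init model; replaceFirst("/", "") strips the leading slash of Windows resource paths such as /C:/..., so drop it on Linux or macOS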
    String detectionModelFile = OcrDemo.class.getClassLoader().getResource(PropertiesReader.get("text_recog_det_model_path")).getPath().replaceFirst("/", "");
    String recognitionModelFile = OcrDemo.class.getClassLoader().getResource(PropertiesReader.get("text_recog_rec_model_path")).getPath().replaceFirst("/", "");
    Path detectionModelPath = Paths.get(detectionModelFile);
    Path recognitionModelPath = Paths.get(recognitionModelFile);
    OcrApp ocrApp = new OcrApp(detectionModelPath, recognitionModelPath);
    ocrApp.init();
    
    // predict result and consume time
    var timeInferStart = System.currentTimeMillis();
    Pair<List<TextListBox>, Image> imagePair = ocrApp.ocrImage(image, 960);
    System.out.println("consume time: " + (System.currentTimeMillis() - timeInferStart)/1000.0 + "s");
    for (var result : imagePair.getLeft()) {
        System.out.println(result);
    }
    // save ocr result image
    ocrApp.saveImageOcrResult(imagePair, "ocr_result.png", "examples\\ocr\\output");
    ocrApp.closeAllModel();
  • Result: the recognized text and its coordinates
    position: [800.0, 609.0, 877.0, 609.0, 877.0, 645.0, 800.0, 645.0], text: 8.23%
    position: [433.0, 607.0, 494.0, 607.0, 494.0, 649.0, 433.0, 649.0], text: 68.4
    position: [96.0, 610.0, 316.0, 611.0, 316.0, 641.0, 96.0, 640.0], text: 股东权益比率(%)
    position: [624.0, 605.0, 688.0, 605.0, 688.0, 650.0, 624.0, 650.0], text: 63.2
    position: [791.0, 570.0, 887.0, 570.0, 887.0, 600.0, 791.0, 600.0], text: -39.64%
    position: [625.0, 564.0, 687.0, 564.0, 687.0, 606.0, 625.0, 606.0], text: 49.7
    position: [134.0, 568.0, 279.0, 568.0, 279.0, 598.0, 134.0, 598.0], text: 毛利率(%)
    ......
  • Result display
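
Judging from the sample output above, each `position` appears to hold the four corner points of a text box (x1, y1, ..., x4, y4), and the lines are not printed in reading order. The following is a minimal sketch, under that assumption, of how the printed lines could be parsed, reduced to axis-aligned bounding boxes, and sorted top to bottom and left to right. The class `OcrResultSketch`, its nested `OcrLine` type, and the `rowTolerance` parameter are hypothetical helpers; they work on the printed text format and do not use the project's `TextListBox` API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing helper; not part of AutoText.
class OcrResultSketch {
    // One recognized text line: corner points (x1, y1, ..., x4, y4) plus the text.
    static final class OcrLine {
        final double[] corners;
        final String text;

        OcrLine(double[] corners, String text) {
            this.corners = corners;
            this.text = text;
        }

        // Axis-aligned bounding box as {minX, minY, maxX, maxY}.
        double[] boundingBox() {
            double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
            double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
            for (int i = 0; i + 1 < corners.length; i += 2) {
                minX = Math.min(minX, corners[i]);
                maxX = Math.max(maxX, corners[i]);
                minY = Math.min(minY, corners[i + 1]);
                maxY = Math.max(maxY, corners[i + 1]);
            }
            return new double[] {minX, minY, maxX, maxY};
        }
    }

    // Parse one printed line of the form "position: [x1, y1, ..., x4, y4], text: ...".
    static OcrLine parse(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(']', open);
        String[] parts = line.substring(open + 1, close).split(",");
        double[] corners = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            corners[i] = Double.parseDouble(parts[i].trim());
        }
        String marker = "text: ";
        String text = line.substring(line.indexOf(marker, close) + marker.length());
        return new OcrLine(corners, text);
    }

    static double centerY(OcrLine line) {
        double[] box = line.boundingBox();
        return (box[1] + box[3]) / 2.0;
    }

    // Group lines into rows by vertical center, then sort each row left to right.
    // A line starts a new row when its center is more than rowTolerance below
    // the center of the first line in the current row.
    static List<OcrLine> readingOrder(List<OcrLine> lines, double rowTolerance) {
        List<OcrLine> byY = new ArrayList<>(lines);
        byY.sort(Comparator.comparingDouble((OcrLine l) -> centerY(l)));
        Comparator<OcrLine> byLeftEdge = Comparator.comparingDouble((OcrLine l) -> l.boundingBox()[0]);
        List<OcrLine> ordered = new ArrayList<>();
        List<OcrLine> row = new ArrayList<>();
        for (OcrLine line : byY) {
            if (!row.isEmpty() && centerY(line) - centerY(row.get(0)) > rowTolerance) {
                row.sort(byLeftEdge);
                ordered.addAll(row);
                row.clear();
            }
            row.add(line);
        }
        row.sort(byLeftEdge);
        ordered.addAll(row);
        return ordered;
    }

    public static void main(String[] args) {
        List<OcrLine> lines = new ArrayList<>();
        lines.add(parse("position: [800.0, 609.0, 877.0, 609.0, 877.0, 645.0, 800.0, 645.0], text: 8.23%"));
        lines.add(parse("position: [433.0, 607.0, 494.0, 607.0, 494.0, 649.0, 433.0, 649.0], text: 68.4"));
        lines.add(parse("position: [624.0, 605.0, 688.0, 605.0, 688.0, 650.0, 624.0, 650.0], text: 63.2"));
        lines.add(parse("position: [791.0, 570.0, 887.0, 570.0, 887.0, 600.0, 791.0, 600.0], text: -39.64%"));
        lines.add(parse("position: [625.0, 564.0, 687.0, 564.0, 687.0, 606.0, 625.0, 606.0], text: 49.7"));
        for (OcrLine line : readingOrder(lines, 15.0)) {
            System.out.println(line.text);
        }
    }
}
```

With a tolerance of 15 pixels, the five sample lines come out as 49.7, -39.64%, 68.4, 63.2, 8.23%, which matches the two table rows they belong to. The tolerance is a guess that suits this sample; other images may need a value tied to the text height.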

 

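Table structure recognition

  • Rule-based and implemented with OpenCV. The main table types recognized are bordered tables, borderless tables, and partially bordered tables.
  • For detailed usage, see examples/ocr/table/TableDemo in this project
  • Usage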
    public static void borderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\bordered_example.png";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair< List<List<List<Integer>>>, Mat> pair = BorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }

    public static void unBorderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\unbordered_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair< List<List<List<Integer>>>, Mat> pair = UnBorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }

    public static void partiallyBorderedRecog() {
        String imagePath = "examples\\ocr\\img_test\\partially_example.jpg";
        Mat imageMat = imread(imagePath);
        System.out.println("imageMat : " + imageMat.size().height() + " " + imageMat.size().width() + " ");
        Pair< List<List<List<Integer>>>, Mat> pair = PartiallyBorderedRecog.recognizeStructure(imageMat);
        System.out.println(pair.getLeft());
        ImageUtils.imshow("Image", pair.getRight());
    }
  • Result: the coordinates of the table cells
    [[[58, 48, 247, 182], [309, 48, 247, 182], [560, 48, 247, 182], [], [], [1061, 48, 247, 182], [1312, 48, 247, 182], 
    [811, 48, 246, 182], [], [], [], []], [[58, 234, 247, 118], [309, 234, 247, 118], [560, 234, 247, 118], [], [811, 234, 246, 118], 
    [], [1061, 234, 247, 118], [], [], [1312, 234, 247, 118], [], []], [[58, 356, 247, 118], [], [309, 356, 247, 118], 
    [560, 356, 247, 118], [], [811, 356, 246, 118], [], [], [1061, 356, 247, 118], [], [1312, 356, 247, 118], []], [[58, 478, 247, 118],
    [309, 478, 247, 118], [], [560, 478, 247, 118], [811, 478, 246, 118], [], [], [1312, 478, 247, 118], [], [1061, 478, 247, 118], [], []],
    [[58, 600, 247, 119], [309, 600, 247, 119], [], [811, 600, 246, 119], [560, 600, 247, 119], [1061, 600, 247, 119], [],
    [1312, 600, 247, 119], [], [], [], []], [[58, 723, 247, 118], [], [309, 723, 247, 118], [811, 723, 246, 118], [560, 723, 247, 118],
    [], [], [1061, 723, 247, 118], [], [1312, 723, 247, 118], [], []], [[58, 845, 247, 118], [309, 845, 247, 118], [], [], 
    [811, 845, 246, 118], [560, 845, 247, 118], [], [1312, 845, 247, 118], [], [1061, 845, 247, 118], [], []]]
  • Result display
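
In the sample output above, each non-empty cell looks like `[x, y, width, height]` (for example, 58 + 247 plus a small gap gives the next cell's x of 309), rows contain empty `[]` placeholders, and cells within a row are not always in left-to-right order. Assuming that reading of the format is right, here is a minimal sketch that drops the placeholders and sorts each row by x before further use, such as cropping cells for OCR. The class `TableCellSketch` and the method `compactRows` are hypothetical names, not part of the project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing helper; not part of AutoText.
class TableCellSketch {
    // Remove empty placeholder cells and order the remaining cells of each row by x.
    // Assumes every real cell is a 4-element list [x, y, width, height].
    static List<List<List<Integer>>> compactRows(List<List<List<Integer>>> grid) {
        List<List<List<Integer>>> result = new ArrayList<>();
        for (List<List<Integer>> row : grid) {
            List<List<Integer>> cells = new ArrayList<>();
            for (List<Integer> cell : row) {
                if (cell.size() == 4) {
                    cells.add(cell);
                }
            }
            cells.sort(Comparator.comparingInt((List<Integer> cell) -> cell.get(0)));
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // A few cells modeled on the first row of the sample output, with one placeholder.
        List<List<List<Integer>>> grid = List.of(List.of(
                List.of(1061, 48, 247, 182),
                List.<Integer>of(),
                List.of(58, 48, 247, 182),
                List.of(811, 48, 246, 182)));
        // prints [[[58, 48, 247, 182], [811, 48, 246, 182], [1061, 48, 247, 182]]]
        System.out.println(compactRows(grid));
    }
}
```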

 

posted @ 2023-01-19 19:06  石头木