iTextSharp upgrade to iText7

Why am I getting duplicate pages extracted from iText7 C#?

Actually, consecutive pages are not returning the same text. Instead, you get

  • the text from page 1 when you extract page 1;
  • the text from pages 1 and 2 when you extract page 2;
  • the text from pages 1, 2, and 3 when you extract page 3;
  • ...

Often this happens in code that re-uses a single text extraction strategy object for multiple pages, because the strategy keeps the text it has already collected. But that's not the case in your code: you correctly create a new strategy object for each page. Thus, the cause must be in the PDF itself.
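For contrast, the re-use pitfall mentioned above typically looks like the following minimal sketch. It is not the code from the question; the file name "input.pdf" and the variable names are placeholders chosen for illustration.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// "input.pdf" is a placeholder path, not the file from the question.
PdfDocument pdfDoc = new PdfDocument(new PdfReader("input.pdf"));

// Pitfall: one strategy object shared by all pages.
var sharedStrategy = new SimpleTextExtractionStrategy();

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // The shared strategy keeps the text it has already collected,
    // so the result for page i also contains the text of pages 1 .. i-1.
    var accumulated = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), sharedStrategy);

    // Fix: hand a fresh strategy to every page.
    var perPage = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), new SimpleTextExtractionStrategy());

    Console.WriteLine("PAGE {0}: shared = {1} chars, fresh = {2} chars", i, accumulated.Length, perPage.Length);
}

pdfDoc.Close();
```

If the shared count keeps growing from page to page while the fresh count does not, strategy re-use is the cause. If both grow, as in this question, the duplication comes from the PDF itself.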

And indeed, each page of your document also contains the contents of all previous pages, merely outside its crop box. To extract only the text inside the respective page's crop box, you have to filter, e.g. like this:

using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    // A fresh strategy per page, so no text carries over between pages.
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    // Only let text inside the page's crop box reach the strategy.
    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();
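To check whether a document is built this way, it can help to print each page's media box and crop box. The following is only a diagnostic sketch under the assumption that the same 285187.pdf file is used; the output format is made up for illustration.

```csharp
using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;

PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"285187.pdf"));

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    var pdfPage = pdfDoc.GetPage(i);

    // The media box is the full page area; the crop box is the visible part.
    Rectangle media = pdfPage.GetMediaBox();
    Rectangle crop = pdfPage.GetCropBox();

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine("  media box: x={0} y={1} w={2} h={3}",
        media.GetX(), media.GetY(), media.GetWidth(), media.GetHeight());
    Console.WriteLine("  crop box:  x={0} y={1} w={2} h={3}",
        crop.GetX(), crop.GetY(), crop.GetWidth(), crop.GetHeight());
}

pdfDoc.Close();
```

A crop box that is smaller than the media box, or that moves from page to page, would be consistent with pages that carry the content of earlier pages outside their visible area. It is a hint rather than proof, since content can also be drawn outside the media box.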

Note that if the strategy is switched to LocationTextExtractionStrategy, the extracted content comes out the same as it did originally.

 

posted @ 2024-01-10 14:27  ChuckLu