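SciTech-EECS-Library: the PDF ISO international standard + xpdf: a C++ PDF library + PyPDF: a pure-Python PDF library + img2pdf and pdf2image: Python libraries for converting between PDF and images + stitching images (NumPy arrays) numerically + OpenCV: vertical and horizontal concatenation of image arrays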



PDF Standards: the ISO international standard

https://pdfa.org/resource/iso-32000-2/

ISO 32000-2

Portable document format – Part 2: PDF 2.0
This document specifies a digital form for representing electronic documents to enable users to exchange and view electronic documents independent of the environment in which they were created or the environment in which they are viewed or printed. It is intended for developers of software that creates PDF files (PDF writers), software that reads existing PDF files and (usually) interprets their contents for display (PDF readers), software that reads and displays PDF content and interacts with the computer users to possibly modify and save the PDF file (interactive PDF processors) and PDF products that read and/or write PDF files for a variety of other purposes (PDF processors). (PDF writers and PDF readers are more specialised classifications of interactive PDF processors and all are PDF processors).


PDF Specification Archive

The latest PDF 2.0 standard (ISO 32000-2:2020) is now available at no cost.
For additional reference, this page serves as an index of the evolution of the PDF format, and includes external links to all legacy Adobe PDF references and errata, as well as to the ISO 32000 family of standards.


PDF Version: ISO 32000-2:2020 (PDF 2.0)
Year: 2020
Reference Documents:

  • ISO 32000-2:2020 Document management — Portable document format — Part 2: PDF 2.0
  • Errata: ISO and industry-based resolutions for issues against ISO 32000-2:2020
  • ISO Technical Specification Extensions and Clarifications to PDF 2.0:
      • ISO/TS 32001:2022 Document management — Portable Document Format — Extensions to Hash Algorithm Support in ISO 32000-2 (PDF 2.0)
      • ISO/TS 32002:2022 Document management — Portable Document Format — Extensions to Digital Signatures in ISO 32000-2 (PDF 2.0)
      • ISO/TS 24064:2023 Document management — Portable document format — RichMedia annotations conforming to the ISO 10303-242 (STEP AP 242) specification
      • ISO/TS 32003:2023 Document management — Portable Document Format — Adding support of AES-GCM in PDF 2.0
      • ISO/TS 32004:202x Document management — Portable Document Format — Integrity protection in encrypted documents in PDF 2.0 Extension
      • ISO/TS 32005:2023 Document management — Portable Document Format — PDF 1.7 and 2.0 structure namespace inclusion in ISO 32000-2 Clarification



xpdf: a C++ PDF library

Download Xpdf and XpdfReader

About Xpdf and XpdfReader

Tools

The Xpdf open source project includes a PDF viewer along with a collection of command line tools which perform various functions on PDF files:

  • xpdf: PDF viewer
  • pdftotext: converts PDF to text
  • pdftops: converts PDF to PostScript
  • pdftoppm: converts PDF pages to netpbm (PPM/PGM/PBM) image files
  • pdftopng: converts PDF pages to PNG image files
  • pdftohtml: converts PDF to HTML
  • pdfinfo: extracts PDF metadata
  • pdfimages: extracts raw images from PDF files
  • pdffonts: lists fonts used in PDF files
  • pdfdetach: extracts attached files from PDF files
  • XpdfReader, available from the download page, is a closed source version of the PDF viewer,
    which includes a few extra features not found in the open source xpdf viewer.
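
Since the rest of this post works in Python, here is a minimal sketch of driving one of the command line tools above (pdftotext) from a script. The helper names pdftotext_command and pdf_to_text are illustrative and not part of Xpdf; the sketch assumes pdftotext is installed and on PATH, and uses only its -f, -l and -layout options.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None, layout=False):
    """Build the argument list for the pdftotext command line tool."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    if layout:
        cmd.append("-layout")           # try to keep the physical page layout
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, **options):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext is not on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, **options), check=True)
```

For example, pdf_to_text("example.pdf", "example.txt", first_page=1, last_page=2) would write the text of the first two pages to example.txt. Building the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.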

Cross-platform

All of the open source tools are available for Linux, Windows, and Mac.
The viewer (xpdf / XpdfReader) uses the Qt toolkit.


History

Xpdf was first released in 1995. It was written, and is still developed, by Derek Noonburg.



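PyPDF: a pure-Python PDF library

PyPDF is an open-source, pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files.
It can also add custom data, viewing options, and passwords to PDF files.
PyPDF can retrieve text and metadata from PDFs as well.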


PyPDF Documentation
https://pypdf.readthedocs.io/en/stable/


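Installation:

You can install PyPDF with pip:

pip install pypdf

# If you plan to use pypdf to encrypt or decrypt PDFs that use AES,
# you will need to install some extra dependencies. RC4 encryption is supported by the regular installation.
pip install pypdf[crypto]

Usage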

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

Extract Images (extracting the images embedded in a PDF page)

Every page of a PDF document can contain an arbitrary number of images.
The names of the files may not be unique.

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
for count, image_file_object in enumerate(page.images):
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)

Other images

Some other objects can contain images, such as stamp annotations.
For example, this document contains such stamps: test_stamp.pdf

You can extract the image from the annotation with the following code:

from pypdf import PdfReader

reader = PdfReader("test_stamp.pdf")
im = (
    reader.pages[0]["/Annots"][0]
    .get_object()["/AP"]["/N"]["/Resources"]["/XObject"]["/Im4"]
    .decode_as_image()
)

im.show()



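OpenCV image concatenation and transformation: cv.vconcat(), cv.hconcat(), cv.resize(), cv.flip()

cv.vconcat([img0, img1, img2]): concatenate vertically (all images must have the same width and type)
cv.hconcat([img0, img1, img2]): concatenate horizontally (all images must have the same height and type)
cv.resize(): scale
cv.flip(): flip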
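
Vertical concatenation only works when every image has the same width, which rendered PDF pages do not always have. The sketch below pads images to a common width before stacking. It uses NumPy's np.vstack and np.hstack rather than OpenCV, on the assumption that for same-dtype H x W x C arrays they produce the same layout as cv.vconcat and cv.hconcat; pad_to_width, stack_vertically and stack_horizontally are illustrative names.

```python
import numpy as np


def pad_to_width(img, width, fill=255):
    """Pad an H x W x C image on the right with `fill` up to `width` columns."""
    rows, cols, chs = img.shape
    if cols > width:
        raise ValueError("image is wider than the target width")
    padded = np.full((rows, width, chs), fill, dtype=img.dtype)
    padded[:, :cols, :] = img
    return padded


def stack_vertically(images, fill=255):
    """Stack images top to bottom after padding them to a common width."""
    width = max(img.shape[1] for img in images)
    return np.vstack([pad_to_width(img, width, fill) for img in images])


def stack_horizontally(images):
    """Stack images left to right; all heights must already match."""
    return np.hstack(images)
```

Padding with 255 gives a white margin, which suits scanned or rendered pages; a different fill value can be passed for dark backgrounds.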

Drawing lines on an image with OpenCV



Python libraries for converting between PDF and images

Installing the required libraries

apt-get install -y qpdf poppler-utils # the poppler utilities and qpdf must be installed first
pip3 install img2pdf pdf2image Pillow

PDF to image

Approach:

  • Import the pdf2image module
  • Convert the PDF pages to a list of PIL images with convert_from_path()
  • Save each image with save()
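
The three steps above can be sketched as follows. This is a minimal example, assuming pdf2image and poppler are installed; page_image_path and pdf_to_page_images are illustrative names, and the output naming scheme (<stem>_pageNNN.png) is an assumption of this sketch. pdf2image is imported inside the function so that the path helper can be used on its own.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir=".", ext="png"):
    """Name the image file for one page, e.g. out/report_page002.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_number:03d}.{ext}"


def pdf_to_page_images(pdf_path, out_dir=".", dpi=200):
    """Render every page of a PDF to its own PNG file and return the paths."""
    from pdf2image import convert_from_path  # requires poppler at run time

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    out_paths = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")  # PNG is lossless, unlike JPEG
        out_paths.append(target)
    return out_paths
```

Calling pdf_to_page_images("example.pdf") would write example_page001.png, example_page002.png and so on to the current directory. A higher dpi gives sharper images at the cost of larger files.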

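Below is an implementation that goes one step further: it converts every page, crops the page margins, and stitches the pages into one image.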

# Convert a multi-page PDF into a single, seamlessly stitched, nearly lossless image.
# Save this source code as pdf_to_one_image.py,
# then run `python pdf_to_one_image.py <the_pdf_file>`


from io import BytesIO
from pdf2image import convert_from_path as cfp
import os, sys, cv2 as cv, numpy as np


pdffile = name_prefix = sys.argv[1]


def mark_line(img, row_pos, col_pos, color, thickness):
    print("  mark ", row_pos, col_pos)
    t, (r0, r1), (c0, c1) = thickness, row_pos, col_pos
    img[max(r0, 0) : (r0 + t), max(c0, 0) : c1, :] = color
    img[max(r1 - t, 0) : r1, max(c0, 0) : c1, :] = color
    img[max(r0, 0) : r1, max(c0, 0) : (c0 + t), :] = color
    img[max(r0, 0) : r1, max(c1 - t, 0) : c1, :] = color
    return img


def detect_row_bound(img, noise_threshold=10, min_pixels=3):
    rows, cols, chs = img.shape
    head_index, tail_index = 0, (rows - 1)
    total_bound = cols * chs * 255
    total_threshold = min_pixels * chs * noise_threshold

    row_scores = [(total_bound - img[i, :, :].sum()) for i in range(rows)]

    hi, ti = head_index, tail_index
    while hi < rows:
        if row_scores[hi] > total_threshold:
            head_index = max(0, hi - 1)
            break
        hi += 1

    while ti > 0:
        if row_scores[ti] > total_threshold:
            tail_index = ti + 1
            break
        ti -= 1

    return [head_index, tail_index]


# image array processing [rows, columns, channels]
def crop_image(img, row_pos, col_pos, chs_pos, noise_threshold=10):
    cropped = img[
        row_pos[0] : row_pos[1], col_pos[0] : col_pos[1], chs_pos[0] : chs_pos[1]
    ]
    return cropped


# create colored image array from BytesIO object
def image_from_byte_io(bytes_io):
    bytes_io.seek(0)
    file_bytes = np.asarray(bytearray(bytes_io.read()), dtype=np.uint8)
    img = cv.imdecode(file_bytes, cv.IMREAD_COLOR)
    return img


def image_from_pdffile(pdf_file):
    # name_fmt = '%s.%d%s'
    images = cfp(pdf_file)
    images_num = len(images)
    bsios = [BytesIO() for i in range(images_num)]
    rets = [images[i].save(bsios[i], "JPEG") for i in range(images_num)]
    imgs = [image_from_byte_io(bsios[i]) for i in range(images_num)]
    return imgs, images_num


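# Detect the content bounds of each PDF page numerically, crop the pages
# automatically, and stitch them into one continuous image.
# Because the crop is driven by pixel arithmetic, the stitched result comes
# close to a lossless join.
# To check whether the crop is right, a comparison image is produced as well,
# with the crop region outlined in blue.

def pdf_to_image0(pdf_file, noise_threshold=10, min_pixels=3, line_height=0):
    # use the pdf_file argument rather than the module-level pdffile
    imgs, images_num = image_from_pdffile(pdf_file)
    if images_num < 1:
        # the caller unpacks two values, so return a pair
        return None, None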

    color, thickness = (255, 0, 0), 5
    rows, cols, chs = imgs[0].shape
    chs_pos, e_args, m_args = [0, chs], ((0, 0, 255), 2), (color, thickness)
    print("Cropping: rows:%s, cols:%s, " % (rows, cols))

    crop_col_bound = (0, cols)

    crop_row_head = detect_row_bound(imgs[0], noise_threshold, min_pixels)
    crop_row_tail = detect_row_bound(imgs[-1], noise_threshold, min_pixels)
    crop_row_tail = [
        max(0, crop_row_tail[0] - line_height),
        min(rows, crop_row_tail[1] + crop_row_head[0]),
    ]
    crop_row_head = [0, crop_row_head[1]]

    crop_args = [(crop_row_head, crop_col_bound)]

    for i in imgs[1:-1]:
        crop_row_bound = detect_row_bound(i, noise_threshold, min_pixels)
        crop_row_bound = [max(0, crop_row_bound[0] - line_height), crop_row_bound[1]]
        crop_args.append((crop_row_bound, crop_col_bound))

    crop_args.append((crop_row_tail, crop_col_bound))

    m_imgs = []
    i_imgs = []

    for i in range(len(imgs)):
        img = imgs[i]
        i_rows, i_cols, i_chs = img.shape

        crop_arg = crop_args[i]
        m_img = mark_line(img.copy(), (0, i_rows), (0, i_cols), *e_args)
        m_img = mark_line(m_img, *crop_arg, *m_args)
        m_imgs.append(m_img)

        i_img = crop_image(img.copy(), *crop_arg, [0, i_chs])
        i_imgs.append(i_img)

    m_concated = cv.vconcat(m_imgs)
    i_concated = cv.vconcat(i_imgs)
    return m_concated, i_concated

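# Crop by manually chosen ratios instead of detecting the content bounds:
# a margin of row_ratio of the page height is cut at every seam between pages
# (and col_ratio of the page width on the left and right), then the pages are
# stitched into one continuous image.
# To check whether the crop is right, a comparison image is produced as well,
# with the crop region outlined in blue.
def pdf_to_image1(pdf_file, row_ratio=0.0729, col_ratio=0.0):
    # use the pdf_file argument rather than the module-level pdffile
    imgs, images_num = image_from_pdffile(pdf_file)
    if images_num < 1:
        # the caller unpacks two values, so return a pair
        return None, None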

    color, thickness = (255, 0, 0), 5
    rows, cols, chs = imgs[0].shape
    row_delta, col_delta = int(rows * row_ratio), int(cols * col_ratio)
    chs_pos, e_args, m_args = [0, chs], ((0, 0, 255), 2), (color, thickness)
    print("Cropping: rows:%s, cols:%s, " % (rows, cols))

    crop_arg_head = ((0, rows - row_delta), (col_delta, cols - col_delta))
    crop_arg_other = ((row_delta, rows - row_delta), (col_delta, cols - col_delta))
    crop_arg_tail = ((row_delta, rows), (col_delta, cols - col_delta))

    # import pdb;pdb.set_trace()
    m_concated, i_concated = None, None
    if images_num == 1:
        m_head = mark_line(imgs[0].copy(), (0, rows), (0, cols), *e_args)
        m_head = mark_line(m_head, *crop_arg_other, *m_args)
        m_concated = m_head
        i_head = crop_image(imgs[0].copy(), (0, rows), (0, cols), chs_pos)
        i_concated = i_head
    elif images_num > 1:
        m_head = mark_line(imgs[0].copy(), (0, rows), (0, cols), *e_args)
        m_head = mark_line(m_head, *crop_arg_head, *m_args)
        m_other = [
            mark_line(i.copy(), (0, rows), (0, cols), *e_args) for i in imgs[1:-1]
        ]
        m_other = [mark_line(i, *crop_arg_other, *m_args) for i in m_other]
        m_tail = mark_line(imgs[-1].copy(), (0, rows), (0, cols), *e_args)
        m_tail = mark_line(m_tail, *crop_arg_tail, *m_args)
        m_imgs = [m_head] + m_other + [m_tail]
        m_concated = cv.vconcat(m_imgs)

        i_head = crop_image(imgs[0].copy(), *crop_arg_head, chs_pos)
        i_other = [crop_image(i.copy(), *crop_arg_other, chs_pos) for i in imgs[1:-1]]
        i_tail = crop_image(imgs[-1].copy(), *crop_arg_tail, chs_pos)
        i_imgs = [i_head] + i_other + [i_tail]
        i_concated = cv.vconcat(i_imgs)
    return m_concated, i_concated


# vertical concat all images in imgs_processed
m_concated, i_concated = pdf_to_image0(
    pdffile, noise_threshold=10, min_pixels=3, line_height=0
)
cv.imwrite(pdffile + "_0marked.jpg", m_concated)
cv.imwrite(pdffile + "_0.jpg", i_concated)

m_concated, i_concated = pdf_to_image1(pdffile, row_ratio=0.0729, col_ratio=0.0)

# save the processed image array
cv.imwrite(pdffile + "_1marked.jpg", m_concated)
cv.imwrite(pdffile + "_1.jpg", i_concated)
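
The row-bound detection used by detect_row_bound can be tried on a small synthetic page without a PDF. The sketch below is a vectorized variant of the same idea, not the function from the script: it scores each row by how far its pixels are from pure white and keeps the span between the first and last rows whose score exceeds the noise threshold. The name content_row_bounds and the handling of a blank page (keep every row) are assumptions of this sketch.

```python
import numpy as np


def content_row_bounds(img, noise_threshold=10, min_pixels=3):
    """Return (start, stop) so that img[start:stop] encloses the non-white rows."""
    rows, cols, chs = img.shape
    # "Darkness" of each row: 0 for a pure white row, larger for more ink.
    darkness = (255 - img.astype(np.int64)).reshape(rows, -1).sum(axis=1)
    # A row counts as content once it is darker than min_pixels slightly grey pixels.
    threshold = min_pixels * chs * noise_threshold
    content = np.flatnonzero(darkness > threshold)
    if content.size == 0:
        return 0, rows                    # blank page: keep every row
    start = max(0, int(content[0]) - 1)   # keep one row of margin above
    stop = int(content[-1]) + 1           # exclusive end index for slicing
    return start, stop
```

On a white 10-row page whose rows 3 to 5 are black, this returns (2, 6), so page[2:6] holds the content plus one margin row above it. Faint noise, such as a single pixel at value 250, stays below the threshold and does not move the bounds.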


Image to PDF

# importing necessary libraries
import os,sys, img2pdf
from PIL import Image

# storing path
img_path = "./do_nawab.png"
pdf_path = "./file.pdf"

with Image.open(img_path) as image:
    pdf_bytes = img2pdf.convert(image.filename)
    with open(pdf_path, "wb") as pdf_file:
        pdf_file.write(pdf_bytes)
posted @ 2025-07-28 20:39  abaelhe