Parsing CAPTCHAs with Python: Using Tesseract OCR

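CAPTCHAs are widely used to stop bots from submitting data automatically, but in scenarios such as automated testing and data collection we sometimes need to read them programmatically. This article shows how to recognize CAPTCHAs with Python and Tesseract OCR, and how to improve recognition accuracy.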

Environment Setup

Before starting, make sure Python and Tesseract OCR are installed correctly.
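1.1 Installing Python

If Python is not yet installed, download the latest version from the official Python website and install it. Once installation finishes, verify it: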

python --version

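1.2 Installing Tesseract OCR
Windows users

Download the Windows installer from the Tesseract OCR GitHub page and run it.

Then add the Tesseract installation directory to the PATH environment variable, for example C:\Program Files\Tesseract-OCR\.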

Linux/macOS users

Ubuntu

sudo apt update && sudo apt install tesseract-ocr

macOS (using Homebrew)

brew install tesseract

After installation, check that Tesseract is available:

tesseract --version

1.3 Installing the Python dependencies
pip install pytesseract opencv-python pillow numpy
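
Before moving on, it can help to confirm from Python that everything installed above is actually visible. The following is a minimal sketch using only the standard library; the function name check_environment and its messages are illustrative assumptions, not part of pytesseract or Tesseract.

```python
import importlib.util
import shutil

# Import names of the packages installed above
# (opencv-python is imported as cv2, pillow as PIL).
REQUIRED_MODULES = ("pytesseract", "cv2", "PIL", "numpy")


def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return a list of problems found; an empty list means everything was located."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not found: {name}")
    # shutil.which searches PATH the same way the shell does
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

If the script prints nothing, the modules and the tesseract executable were all found. On Windows, a missing executable usually means the PATH entry from section 1.2 has not taken effect yet.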

Code Implementation

The following Python example shows how to load a CAPTCHA image, preprocess it, and read it with Tesseract OCR.

2.1 Code example
import sys

import cv2
import pytesseract

# Windows users need to tell pytesseract where tesseract.exe lives
# (adjust the path if Tesseract is installed elsewhere).
if sys.platform == "win32":
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def preprocess_image(image_path):
    """Preprocess the CAPTCHA image to improve OCR accuracy."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # load as grayscale
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    img = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)[1]  # binarize
    img = cv2.GaussianBlur(img, (3, 3), 0)  # Gaussian blur to reduce noise
    return img

def recognize_captcha(image_path):
    """Recognize the CAPTCHA text with Tesseract OCR."""
    img = preprocess_image(image_path)
    text = pytesseract.image_to_string(img, config="--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    return text.strip()

if __name__ == "__main__":
    image_path = "captcha.png"  # replace with the path to your CAPTCHA image
    result = recognize_captcha(image_path)
    print("Recognized CAPTCHA:", result)

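Code Walkthrough
3.1 Image preprocessing

To improve OCR accuracy, the image goes through the following steps:

Grayscale conversion: removes color interference and keeps only brightness information.

Binarization: converts the image to pure black and white, increasing the contrast between characters and background.

Gaussian blur: smooths out background noise. Because it runs after binarization here, it also softens character edges and reintroduces gray levels; blurring first and thresholding second is a common alternative.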

img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # load as grayscale
img = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)[1]  # binarize
img = cv2.GaussianBlur(img, (3, 3), 0)  # Gaussian blur
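
To make the binarization step concrete, here is a small NumPy-only sketch of what a fixed threshold of 128 does to pixel values. It mimics the documented behavior of cv2.THRESH_BINARY (pixels strictly above the threshold become the maximum value, all others become 0); the helper name binarize and the sample patch are made up for illustration.

```python
import numpy as np


def binarize(gray, thresh=128, maxval=255):
    """Mimic cv2.THRESH_BINARY: pixels above thresh become maxval, the rest 0."""
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)


# A made-up 2x4 grayscale patch with a mix of dark and light pixels.
patch = np.array([[30, 120, 129, 250],
                  [200, 90, 128, 10]], dtype=np.uint8)

binary = binarize(patch)
print(binary)
# [[  0   0 255 255]
#  [255   0   0   0]]
```

Note that a pixel equal to the threshold (128) maps to 0. If CAPTCHA brightness varies a lot between images, a fixed threshold may not suit all of them, and the value 128 is best treated as a starting point to tune.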

3.2 OCR recognition

Use Tesseract OCR to extract the text from the image:

text = pytesseract.image_to_string(img, config="--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")

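--psm 6: treats the image as a single uniform block of text. For a one-line CAPTCHA, --psm 7 (a single text line) is often worth trying as well.

tessedit_char_whitelist=...: restricts the characters Tesseract may output, which reduces misrecognition.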
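
The config string is easier to experiment with when it is built from its parts. The whitelist is also best treated as a hint: depending on the Tesseract version and engine mode, stray characters or whitespace may still appear in the output, so filtering the result once more in Python is a cheap safeguard. The sketch below is one way to do both; build_config and clean_result are hypothetical helper names, not pytesseract functions.

```python
import string

# Same character set as the whitelist used above: digits plus uppercase letters.
ALLOWED = string.digits + string.ascii_uppercase


def build_config(psm=6, whitelist=ALLOWED):
    """Assemble the config string passed to pytesseract.image_to_string."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def clean_result(text, allowed=ALLOWED):
    """Drop whitespace and any character outside the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


print(build_config(psm=7))
print(clean_result(" A7 K-9\n"))  # A7K9
```

With these helpers, the OCR call could be written as clean_result(pytesseract.image_to_string(img, config=build_config(psm=7))).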

Running the Program

Make sure captcha.png is in the same directory as the script, then run:

python captcha_recognizer.py

The program loads the CAPTCHA image, runs OCR on it, and prints the recognized text.

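Improving CAPTCHA Recognition Accuracy
5.1 Choosing a suitable PSM mode

Tesseract offers several page segmentation modes (PSM). For CAPTCHAs, PSM 6 (a single uniform block of text) or PSM 7 (a single text line) is usually used: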

config="--psm 6"
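
Since the best mode depends on the CAPTCHA style, one option is to run several modes and keep the answer most of them agree on. The sketch below shows that idea under stated assumptions: pick_by_vote is a hypothetical helper, the voting rule is just one possible heuristic, and the recognize argument is any function you supply that maps a PSM number to recognized text.

```python
from collections import Counter


def pick_by_vote(recognize, modes=(6, 7, 8)):
    """Call recognize(psm) for each mode and return the most common non-empty result.

    On a tie, the result from the earliest mode in `modes` wins.
    """
    results = [recognize(psm) for psm in modes]
    results = [r for r in results if r]  # ignore modes that returned nothing
    if not results:
        return ""
    return Counter(results).most_common(1)[0][0]


# Stand-in recognizer with canned answers, so the sketch runs without Tesseract.
canned = {6: "A8C4", 7: "A8C4", 8: "ABC4"}
print(pick_by_vote(canned.get))  # A8C4

# With pytesseract, the recognizer could look like this (img is a preprocessed image):
#   recognize = lambda psm: pytesseract.image_to_string(img, config=f"--psm {psm}").strip()
```

PSM 8 (a single word) is included here because many CAPTCHAs are one short token; drop it if it does not help on your images.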

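5.2 Further image optimization

Noise removal: use cv2.medianBlur() to further reduce interference.

Character segmentation: for touching characters, try contour detection combined with projection-based segmentation.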
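
As a rough illustration of projection-based segmentation, the NumPy sketch below counts dark pixels per column and cuts wherever a column is empty. It assumes a strictly binarized image with dark characters (0) on a light background (255), for example after an optional cv2.medianBlur(img, 3) followed by thresholding. The function name split_by_projection is made up for this example. Characters that actually touch share no empty column, so they would come out as one segment and need an extra step, such as cutting at local minima of the projection or using contour analysis.

```python
import numpy as np


def split_by_projection(binary, min_width=1):
    """Return (start, end) column ranges of characters; end is exclusive.

    Assumes dark characters (0) on a light background (255).
    """
    binary = np.asarray(binary)
    ink_per_column = (binary == 0).sum(axis=0)  # vertical projection profile
    has_ink = ink_per_column > 0
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a character begins at this column
        elif not ink and start is not None:
            if x - start >= min_width:  # skip specks narrower than min_width
                segments.append((start, x))
            start = None
    # Close a character that runs up to the right edge of the image.
    if start is not None and len(has_ink) - start >= min_width:
        segments.append((start, len(has_ink)))
    return segments


# Tiny synthetic image: two "characters" separated by empty columns.
demo = np.full((3, 8), 255, dtype=np.uint8)
demo[1, 1:3] = 0
demo[0:2, 5:8] = 0
print(split_by_projection(demo))  # [(1, 3), (5, 8)]
```

Each (start, end) pair can then be used to crop a single character, for example binary[:, start:end], and recognize it on its own.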

5.3 Switching to deep-learning OCR

If Tesseract's accuracy is not good enough, consider a deep-learning OCR tool such as:

EasyOCR (Python)

PaddleOCR (Python)

Google Vision API (cloud OCR)
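
As an example of the first option, a minimal EasyOCR sketch might look like the following. It assumes EasyOCR has been installed separately (pip install easyocr) and that the CAPTCHA uses digits and uppercase letters; the function name recognize_with_easyocr is hypothetical. The import is placed inside the function only so the rest of a script keeps working when EasyOCR is absent.

```python
def recognize_with_easyocr(image_path, allowed="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Read a CAPTCHA with EasyOCR and return the concatenated text."""
    import easyocr  # optional, heavy dependency; imported lazily

    # Creating a Reader loads the models (and downloads them on first use),
    # so in real code it is worth creating it once and reusing it.
    reader = easyocr.Reader(["en"], gpu=False)
    # detail=0 returns only the recognized strings; allowlist limits the characters.
    pieces = reader.readtext(image_path, detail=0, allowlist=allowed)
    return "".join(pieces).replace(" ", "")
```

Calling recognize_with_easyocr("captcha.png") would then play the same role as recognize_captcha in the Tesseract version.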

Posted 2025-09-07 19:27 by ttocr、com
