Recognizing English Alphanumeric CAPTCHAs with Python and Tesseract

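CAPTCHAs are a common security measure used to stop bots from submitting forms automatically. This article walks through building a simple recognizer for English alphanumeric CAPTCHAs with Python and Tesseract OCR, using image preprocessing to improve recognition accuracy.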

I. Preparation

1. Check that Python is installed
python --version
2. Install the Tesseract OCR engine
Windows: download and install it from https://github.com/tesseract-ocr/tesseract

macOS: install with Homebrew:
brew install tesseract
Ubuntu / Debian:

sudo apt install tesseract-ocr
After installation, note the path to the Tesseract executable (for example, C:\Program Files\Tesseract-OCR\tesseract.exe).
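If you are not sure where the executable ended up, a small standard-library script can look for it. The following is a minimal sketch, not part of the original setup: the function name find_tesseract and the fallback paths in DEFAULT_CANDIDATES are assumptions chosen for illustration, so adjust them to your own machine.

```python
import os
import shutil

# Hypothetical fallback locations; edit these to match your system.
DEFAULT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
]


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the path to the tesseract executable, or None if not found."""
    # First check whether `tesseract` is reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the explicit candidate locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract not found")
```

If the function returns a path, that is the value to keep for the configuration step later on; if it returns None, the installation probably did not complete or used a location outside the candidate list.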

3. Install the Python dependencies
pip install pytesseract pillow opencv-python
II. Image Preprocessing and Recognition Code
Create a Python file named captcha_ocr.py with the following code:

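import cv2
import pytesseract

# Set the Tesseract executable path (Windows users; replace with your own path).
# On macOS/Linux, remove this line if tesseract is already on PATH.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the CAPTCHA image; cv2.imread returns None if the file cannot be read
image = cv2.imread('captcha.png')
if image is None:
    raise FileNotFoundError('captcha.png could not be read')

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply a 3x3 median blur to suppress speckle noise
gray = cv2.medianBlur(gray, 3)

# Binarize using Otsu's automatically chosen threshold
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Save the preprocessed image so it can be inspected
cv2.imwrite('processed.png', thresh)

# Tesseract options: engine mode, page segmentation mode, allowed characters
custom_config = r'--oem 3 --psm 8 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

# Run OCR on the binarized image
text = pytesseract.image_to_string(thresh, config=custom_config)

print(f"Recognized text: {text.strip()}")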
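The script only calls strip() on the OCR output, but raw results can still contain stray spaces or punctuation. The following is a minimal sketch of one possible post-processing step, assuming the same uppercase-and-digits character set as the whitelist; the function name clean_captcha_text and the expected_length parameter are hypothetical and not part of pytesseract.

```python
import re

# Same character set as the whitelist in the script (uppercase letters and digits).
ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def clean_captcha_text(raw, expected_length=None):
    """Normalize raw OCR output; return None if it fails the length check."""
    # Uppercase first, then drop everything outside the allowed set
    # (whitespace, newlines, punctuation).
    cleaned = re.sub(f"[^{ALLOWED}]", "", raw.upper())
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned


if __name__ == "__main__":
    print(clean_captcha_text(" d3-g7 y\n", expected_length=5))
```

Returning None on a length mismatch makes it easy for the caller to decide whether to retry with different preprocessing instead of submitting an obviously wrong answer.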
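III. Parameters
--oem 3: use the default OCR engine mode (based on what is available)

--psm 8: treat the image as a single word, which suits CAPTCHA recognition

tessedit_char_whitelist: restricts recognition to the listed characters (here, uppercase English letters and digits only)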
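To experiment with these parameters without editing the string by hand, the config can be assembled from its parts. This is a minimal sketch; build_tesseract_config is a hypothetical helper, not a pytesseract function, and its output is meant to be passed as the config argument of pytesseract.image_to_string.

```python
import string


def build_tesseract_config(oem=3, psm=8,
                           whitelist=string.ascii_uppercase + string.digits):
    """Assemble the config string passed to pytesseract.image_to_string."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    # Only add the whitelist option when a non-empty character set is given.
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    # 7 = treat the image as a single text line, 8 = treat it as a single word.
    for psm in (7, 8):
        print(build_tesseract_config(psm=psm))
```

With the default arguments the helper reproduces the custom_config string used in captcha_ocr.py, so switching between segmentation modes becomes a one-argument change.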

IV. Running the Program
Save the CAPTCHA image as captcha.png, then run:

python captcha_ocr.py
Example output:

Recognized text: D3G7Y
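A single correct result says little about how reliable the recognizer is. One way to check whether a preprocessing change actually helps is to compare predictions against a set of hand-labeled CAPTCHAs. The sketch below assumes you have collected such labels yourself; captcha_accuracy is a hypothetical helper, and the strings in the demo are made up purely to show the call, not real measurements.

```python
def captcha_accuracy(predictions, labels):
    """Return the fraction of CAPTCHAs whose predicted text matches exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    # A CAPTCHA counts as correct only if every character matches.
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Made-up strings for illustration only; replace with your own data.
predicted = ["D3G7Y", "A8KQ2", "ZZ910"]
expected = ["D3G7Y", "A8K02", "ZZ910"]

if __name__ == "__main__":
    print(f"accuracy: {captcha_accuracy(predicted, expected):.2f}")
```

Exact-match accuracy is deliberately strict, because a CAPTCHA answer with one wrong character is rejected just like a completely wrong one.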

Posted 2025-06-06 19:06 by ttocr、com