Captcha Image Recognition with R and Tesseract

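I. Project Overview
This project shows how to use the R language and the tesseract package to call the Tesseract OCR engine and extract and recognize the characters in captcha images. It is suited to tasks such as labeling captcha data and extracting text from images.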
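II. Environment Setup
1. Install Tesseract OCR
macOS (Homebrew):

brew install tesseract
Ubuntu (the development headers are needed so that the R package can compile):

sudo apt install tesseract-ocr libtesseract-dev libleptonica-dev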
2. Install R and the required packages
Install the tesseract package from within R:

install.packages("tesseract")
install.packages("magick") # used for image preprocessing; the scripts below load it
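
Before running the scripts, it can help to confirm that both packages actually load. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and is not part of either package.

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(c("tesseract", "magick"))
if (length(missing) > 0) {
  message("Still to install: ", paste(missing, collapse = ", "))
} else {
  message("Both packages are available.")
}
```

If a package is reported as missing, rerunning the matching install.packages() call and reading its output usually shows which system library is absent.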
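III. Captcha Recognition Script in R

library(tesseract)
library(magick)

# Configure the OCR engine and restrict output to uppercase letters and digits
eng <- tesseract("eng", options = list(tessedit_char_whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))

# Load the captcha image
img_path <- "captcha1.png"
image <- image_read(img_path)

# Preprocess: convert to grayscale, then threshold at 60% to wash out the light background
image <- image_convert(image, colorspace = "gray")
image <- image_threshold(image, type = "white", threshold = "60%")

# Run OCR and strip all whitespace, including the trailing newline
text <- ocr(image, engine = eng)
text <- gsub("\\s+", "", text)

cat("Recognition result:", text, "\n")
Sample run:

Recognition result: M3X9P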
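
OCR output on noisy captchas is often incomplete, so a quick plausibility check on the cleaned string can flag results that need manual review. The sketch below uses base R only. The helper name check_captcha is hypothetical, and the default length of 5 is an assumption taken from the sample output, not a property of every captcha.

```r
# Hypothetical helper: strip whitespace, then check the result against the
# same character set as the whitelist (uppercase letters and digits).
# expected_length = 5 is an assumption based on the sample output.
check_captcha <- function(text, expected_length = 5) {
  cleaned <- gsub("\\s+", "", text)
  pattern <- sprintf("^[A-Z0-9]{%d}$", expected_length)
  # perl = TRUE keeps the A-Z range independent of the current locale
  list(text = cleaned, plausible = grepl(pattern, cleaned, perl = TRUE))
}

check_captcha("M3X9P\n")$plausible  # TRUE
check_captcha("M3 X9\n")$plausible  # FALSE: only four characters remain
```

A result flagged as implausible is a candidate for different preprocessing, for example another threshold value, or for manual labeling.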
IV. Batch Recognition of Captcha Images

# Reuses the libraries and the eng engine from the script in section III
files <- list.files("captchas/", pattern = "\\.png$", full.names = TRUE)

for (f in files) {
  # Same preprocessing as the single-image script
  img <- image_read(f) %>%
    image_convert(colorspace = "gray") %>%
    image_threshold(type = "white", threshold = "60%")

  result <- ocr(img, engine = eng)
  result <- gsub("\\s+", "", result)

  cat(basename(f), "=>", result, "\n")
}
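
For labeling work it is usually more convenient to collect the batch results in a data frame than to print them, and to keep going when a single file fails. The following is a minimal sketch under those assumptions. The names recognize_all and fake_recognize are hypothetical, and the stand-in recognizer replaces the real image_read / ocr calls so that the example runs without any image files.

```r
# Hypothetical helper: apply a recognizer function to every file and collect
# the results in a data frame. `recognize` is any function that takes a file
# path and returns the recognized text.
recognize_all <- function(files, recognize) {
  texts <- vapply(files, function(f) {
    # A failure on one file yields NA instead of aborting the whole batch.
    tryCatch(gsub("\\s+", "", recognize(f)), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
  data.frame(file = basename(files), text = texts, stringsAsFactors = FALSE)
}

# Stand-in recognizer so the sketch runs without OCR; a real one would wrap
# the preprocessing and ocr() steps of the loop.
fake_recognize <- function(f) {
  if (grepl("bad", f)) stop("unreadable image")
  " AB12C\n"
}

results <- recognize_all(c("captchas/a.png", "captchas/bad.png"), fake_recognize)
print(results)
```

The resulting data frame can be saved with write.csv(results, "results.csv", row.names = FALSE); rows whose text is NA mark files that could not be processed.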

Posted on 2025-06-21 12:38 by ttocr、com
