Image CAPTCHA Recognition with Haskell and Tesseract

1. Environment Setup
Install the Haskell toolchain
GHCup is the recommended installer:

curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
Install Tesseract OCR

Ubuntu/Debian

sudo apt install tesseract-ocr

macOS

brew install tesseract
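
Before moving on, it can help to confirm that the tesseract binary is actually on the PATH. The following is a minimal sketch, assuming the `directory` package is available (it ships with GHC); the helper names `describeTool` and `checkTool` are hypothetical and not part of the original program.

```haskell
import System.Directory (findExecutable)

-- Hypothetical helper: turn a lookup result into a human-readable line.
describeTool :: String -> Maybe FilePath -> String
describeTool name Nothing     = name ++ ": not found on PATH"
describeTool name (Just path) = name ++ ": " ++ path

-- Look up a program on PATH and report where it was found, if anywhere.
checkTool :: String -> IO String
checkTool name = describeTool name <$> findExecutable name
```

In GHCi, `checkTool "tesseract" >>= putStrLn` prints either the resolved path or a "not found" line.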
2. Create the Project
Create the project with Stack:

stack new captcha-hs
cd captcha-hs
Edit package.yaml and add the dependencies:

dependencies:
- base >= 4.7 && < 5
- process
- regex-tdfa
- directory
- filepath
- text
3. Write the Recognition Program
Edit app/Main.hs:

import System.Process (callProcess)
import System.Directory (doesFileExist, removeFile)
import System.FilePath ((<.>))
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Text.Regex.TDFA ((=~), getAllTextMatches)

-- Clean the text: keep only uppercase letters and digits
cleanText :: T.Text -> T.Text
cleanText txt = T.concat (getAllTextMatches (txt =~ ("[A-Z0-9]" :: String)))

-- Run tesseract on the image and return the cleaned result
recognizeCaptcha :: FilePath -> IO T.Text
recognizeCaptcha imagePath = do
  let outputBase = "hs_output"
      whitelist  = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
  -- Pass the arguments as a list so paths with spaces need no shell quoting
  callProcess "tesseract"
    [ imagePath, outputBase
    , "-l", "eng"
    , "-c", "tessedit_char_whitelist=" ++ whitelist
    ]

  -- tesseract writes its result to <outputBase>.txt
  let txtFile = outputBase <.> "txt"
  exists <- doesFileExist txtFile
  if not exists
    then return (T.pack "recognition failed")
    else do
      content <- T.readFile txtFile
      removeFile txtFile
      return (cleanText content)

main :: IO ()
main = do
  let image = "captcha1.png" -- replace with the path to your image
  result <- recognizeCaptcha image
  putStrLn $ "Result: " ++ T.unpack result
4. Run the Program

stack build
stack exec captcha-hs-exe
Sample output:
Result: 4A2K
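
As an alternative to the temporary .txt file, Tesseract can print the recognized text directly when the output base is given as `stdout`. The sketch below assumes a Tesseract version that accepts `--psm 7` (treat the image as a single text line); the names `tesseractArgs`, `cleanString`, and `recognizeViaStdout` are hypothetical, and the function works on String rather than Text to stay self-contained.

```haskell
import Data.Char (isAsciiUpper, isDigit)
import System.Process (readProcess)

-- Characters the CAPTCHA is assumed to contain.
whitelist :: String
whitelist = ['A' .. 'Z'] ++ ['0' .. '9']

-- Arguments for a run that prints the recognized text to stdout.
-- "--psm 7" asks Tesseract to treat the image as a single line of text.
tesseractArgs :: FilePath -> [String]
tesseractArgs imagePath =
  [ imagePath, "stdout"
  , "-l", "eng"
  , "--psm", "7"
  , "-c", "tessedit_char_whitelist=" ++ whitelist
  ]

-- Same cleaning rule as cleanText, written as a plain filter on String.
cleanString :: String -> String
cleanString = filter (\c -> isAsciiUpper c || isDigit c)

-- Hypothetical variant of recognizeCaptcha that needs no temporary file.
recognizeViaStdout :: FilePath -> IO String
recognizeViaStdout imagePath =
  cleanString <$> readProcess "tesseract" (tesseractArgs imagePath) ""
```

The cleaning step is kept as a safety net in case the installed Tesseract version does not honor the whitelist. Reading from stdout also avoids two runs overwriting the same hs_output.txt file.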
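5. Extension: Batch-Processing Images
Add the following to app/Main.hs (the imports go at the top of the file):

import Control.Monad (forM_)
import System.Directory (listDirectory)
import System.FilePath ((</>), takeExtension)

-- Recognize every .png file in a directory and print one line per image
batchProcess :: FilePath -> IO ()
batchProcess dir = do
  files <- listDirectory dir
  let images = filter (\f -> takeExtension f == ".png") files
  forM_ images $ \f -> do
    res <- recognizeCaptcha (dir </> f)
    putStrLn $ f ++ " -> " ++ T.unpack res
Then call it from main:

main :: IO ()
main = batchProcess "captchas"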
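
Because the recognizer runs an external process, a single unreadable image raises an exception and stops the whole batch. One way to keep going is to capture failures per file. This is a minimal sketch, not part of the original program: `recognizeAll` and `renderResult` are hypothetical names, and the recognizer is passed in as a parameter so that recognizeCaptcha (or a stand-in) can be plugged in.

```haskell
import Control.Exception (SomeException, try)

-- Run a recognizer over every path, recording failures per file
-- instead of letting the first exception stop the whole batch.
recognizeAll :: (FilePath -> IO a) -> [FilePath] -> IO [(FilePath, Either String a)]
recognizeAll recognize = mapM step
  where
    step path = do
      outcome <- try (recognize path)
      pure (path, either (Left . describe) Right outcome)
    describe :: SomeException -> String
    describe = show

-- Render one result in the same "file -> text" shape that batchProcess prints.
renderResult :: (FilePath, Either String String) -> String
renderResult (path, Left err)  = path ++ " -> FAILED: " ++ err
renderResult (path, Right txt) = path ++ " -> " ++ txt
```

With the program above, a call could look like `recognizeAll (fmap T.unpack . recognizeCaptcha) paths >>= mapM_ (putStrLn . renderResult)`. Catching SomeException is deliberately broad for a small script; a longer-running tool would likely narrow it to IOException.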

Posted 2025-07-01 12:48