CAPTCHA Recognition with Haskell and Tesseract

1. Environment Setup
Install Haskell and GHC
GHCup is the recommended installer:

curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
Install Tesseract

Ubuntu / Debian

sudo apt install tesseract-ocr

macOS

brew install tesseract
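
Before going further, it can help to confirm that the tesseract binary is actually reachable on the PATH. The following is only a minimal sketch and not part of the original program: the names hasTool and reportTesseract are hypothetical, and the check relies on findExecutable from the directory package.

```haskell
import Data.Maybe (isJust)
import System.Directory (findExecutable)

-- | True when an executable with the given name can be found on the PATH.
-- (hasTool is a hypothetical helper name used only for this sketch.)
hasTool :: String -> IO Bool
hasTool name = isJust <$> findExecutable name

-- | Print a short hint about whether Tesseract looks installed.
reportTesseract :: IO ()
reportTesseract = do
  ok <- hasTool "tesseract"
  putStrLn $ if ok
    then "tesseract found on PATH"
    else "tesseract not found; install it with apt or brew first"
```

Calling reportTesseract from GHCi or at the start of main gives an early, readable message instead of a process error later on.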
2. Create a New Project
Create a new project with Stack:

stack new haskell-captcha
cd haskell-captcha
Edit package.yaml and add the dependencies:

dependencies:
  - base >= 4.7 && < 5
  - directory
  - process
  - text

3. Write the Main Program
Edit app/Main.hs:

import System.Environment (getArgs)
import System.Process (callProcess)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (removeFile, doesFileExist)

-- Characters the CAPTCHA may contain; everything else is discarded.
whitelist :: String
whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

-- Run Tesseract on one image and return the cleaned-up text.
runTesseract :: FilePath -> IO String
runTesseract imagePath = do
  let outputBase = "output"
  -- Arguments are passed as a list (no shell), so paths with spaces work.
  callProcess "tesseract"
    [ imagePath, outputBase
    , "-l", "eng"
    , "-c", "tessedit_char_whitelist=" ++ whitelist
    ]

  -- Tesseract writes its result to <outputBase>.txt.
  let txtFile = outputBase ++ ".txt"
  exists <- doesFileExist txtFile
  if exists
    then do
      content <- TIO.readFile txtFile
      removeFile txtFile
      let cleaned = T.unpack $ T.filter (\c -> c `elem` whitelist) (T.toUpper content)
      return cleaned
    else return "Could not read the recognition result"

main :: IO ()
main = do
  args <- getArgs
  case args of
    [img] -> do
      result <- runTesseract img
      putStrLn $ "Result: " ++ result
    _ -> putStrLn "Usage: haskell-captcha <path to CAPTCHA image>"
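
The cleanup step inside runTesseract (uppercase the text, then keep only whitelisted characters) is easy to try in isolation. Below is a minimal sketch on plain String rather than Text, assuming the same whitelist as above. The name cleanResult is hypothetical and is not part of the program. For ASCII input it should behave like the Text version.

```haskell
import Data.Char (toUpper)

-- Same character set as the whitelist in Main.hs.
whitelist :: String
whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

-- | Uppercase the raw OCR text and drop everything outside the whitelist
-- (spaces, newlines, punctuation). cleanResult is a hypothetical helper name.
cleanResult :: String -> String
cleanResult = filter (`elem` whitelist) . map toUpper
```

In GHCi, cleanResult "9a bk\n" evaluates to "9ABK": the lowercase letters are uppercased, and the space and trailing newline are removed.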
4. Build and Run

stack build
stack run -- captcha1.png
Example output:

Result: 9ABK
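
The program goes through a temporary output.txt file. Tesseract can also print the recognized text to standard output when the output base is the literal word stdout, which would avoid the temporary file. The sketch below assumes that behaviour and uses readProcessWithExitCode from the process package. The names tesseractArgs and ocrToStdout are hypothetical.

```haskell
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Same character set as the whitelist in Main.hs.
whitelist :: String
whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

-- | Argument list for one Tesseract call. Using "stdout" as the output base
-- asks Tesseract to print the text instead of writing a .txt file.
tesseractArgs :: FilePath -> [String]
tesseractArgs imagePath =
  [ imagePath, "stdout"
  , "-l", "eng"
  , "-c", "tessedit_char_whitelist=" ++ whitelist
  ]

-- | Run Tesseract and return either its error output (Left) or the raw,
-- not yet cleaned, recognized text (Right).
ocrToStdout :: FilePath -> IO (Either String String)
ocrToStdout imagePath = do
  (code, out, err) <-
    readProcessWithExitCode "tesseract" (tesseractArgs imagePath) ""
  return $ case code of
    ExitSuccess   -> Right out
    ExitFailure _ -> Left err
```

Returning Either keeps a failed run distinguishable from a CAPTCHA that simply produced no characters. If the tesseract binary is missing altogether, readProcessWithExitCode raises an IO exception rather than returning Left.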
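5. Extension: Batch Recognition
Add directory traversal to process multiple images. System.FilePath comes from the filepath package, so add filepath to the dependencies in package.yaml: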

import System.Directory (listDirectory)
import System.FilePath (takeExtension, (</>))

processDir :: FilePath -> IO ()
processDir dir = do
  files <- listDirectory dir
  let pngs = filter (\f -> takeExtension f == ".png") files
  mapM_ (\f -> do
          r <- runTesseract (dir </> f)
          putStrLn $ f ++ " -> " ++ r) pngs
Calling processDir "captchas" from main recognizes every .png CAPTCHA image in the captchas folder.
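
The filter in processDir matches only a lowercase .png extension. One possible variation is sketched here, under the assumption that CAPTCHA images may also be stored as .PNG, .jpg or .jpeg. The names isCaptchaImage and captchaFiles are hypothetical, and the extension list is an illustrative choice.

```haskell
import Data.Char (toLower)
import Data.List (sort)
import System.Directory (listDirectory)
import System.FilePath (takeExtension, (</>))

-- | True for file names whose extension looks like a supported image,
-- compared case-insensitively. The extension list is an assumption.
isCaptchaImage :: FilePath -> Bool
isCaptchaImage f =
  map toLower (takeExtension f) `elem` [".png", ".jpg", ".jpeg"]

-- | Full paths of the image files in a directory, sorted by file name.
captchaFiles :: FilePath -> IO [FilePath]
captchaFiles dir = do
  files <- listDirectory dir
  return [dir </> f | f <- sort files, isCaptchaImage f]
```

Each returned path can then be handed to runTesseract. Sorting the file names keeps the output order stable between runs.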

posted @ 2025-07-07 13:05  ttocr、com