CAPTCHA Recognition with Haskell and Tesseract
1. Environment Setup
Install Haskell and GHC
GHCup is the recommended installer:
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
Install Tesseract
Ubuntu / Debian
sudo apt install tesseract-ocr
macOS
brew install tesseract
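Before moving on, it can be useful to confirm that the tesseract binary is actually reachable from Haskell. The following is a minimal sketch, assuming the directory package is available (it ships with GHC); describeTool and checkTool are hypothetical helper names, not part of any library.

```haskell
import System.Directory (findExecutable)

-- Hypothetical helper: turn a PATH lookup result into a readable message.
describeTool :: String -> Maybe FilePath -> String
describeTool name (Just path) = name ++ " found at " ++ path
describeTool name Nothing     = name ++ " not found on PATH; install it first"

-- Look the tool up on PATH and report the outcome.
checkTool :: String -> IO String
checkTool name = describeTool name <$> findExecutable name
```

In GHCi, checkTool "tesseract" >>= putStrLn prints the resolved path if the installation above succeeded.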
2. Create the Project
Create a new project with Stack:
stack new haskell-captcha simple
cd haskell-captcha
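Edit package.yaml to add the dependencies (if your template generated only a .cabal file, add the same packages to its build-depends field instead):
dependencies:
- base >= 4.7 && < 5
- directory
- filepath   # provides System.FilePath, used in the batch section
- process
- text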
3. Write the Main Program
Edit app/Main.hs (or src/Main.hs, depending on where your template placed the Main module):
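import System.Environment (getArgs)
import System.Process (callProcess)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (removeFile, doesFileExist)

-- Characters the CAPTCHA may contain; everything else is discarded.
whitelist :: String
whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

-- Run the tesseract CLI on one image and return the cleaned-up text.
runTesseract :: FilePath -> IO String
runTesseract imagePath = do
  let outputBase = "output"
  -- callProcess passes the arguments directly, without a shell,
  -- so image paths containing spaces or quotes are handled safely.
  callProcess "tesseract"
    [ imagePath, outputBase
    , "-l", "eng"
    , "-c", "tessedit_char_whitelist=" ++ whitelist
    ]
  -- tesseract appends ".txt" to the output base name.
  let txtFile = outputBase ++ ".txt"
  exists <- doesFileExist txtFile
  if exists
    then do
      content <- TIO.readFile txtFile
      removeFile txtFile
      -- Upper-case the text, then keep only whitelisted characters.
      let cleaned = T.unpack $ T.filter (`elem` whitelist) (T.toUpper content)
      return cleaned
    else return "无法读取识别结果"  -- "could not read the recognition result"

main :: IO ()
main = do
  args <- getArgs
  case args of
    [img] -> do
      result <- runTesseract img
      putStrLn $ "识别结果: " ++ result  -- "recognition result"
    -- Usage: haskell-captcha <path to CAPTCHA image>
    _ -> putStrLn "用法: haskell-captcha <验证码图片路径>"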
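The program above has tesseract write output.txt and then reads the file back. As an alternative sketch, tesseract prints the recognized text to standard output when the output base is the literal word stdout, which removes the temporary file. The version below assumes that behavior and uses readProcessWithExitCode from the process package so a failed run is reported instead of throwing; tesseractArgs, cleanOcr and recognizeViaStdout are hypothetical names introduced only for this illustration.

```haskell
import Data.Char (toUpper)
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Same whitelist as in the main program.
whitelist :: String
whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

-- Hypothetical helper: the argument list handed to the tesseract CLI.
-- Using "stdout" as the output base makes tesseract print the text
-- instead of creating a .txt file.
tesseractArgs :: FilePath -> [String]
tesseractArgs imagePath =
  [ imagePath, "stdout"
  , "-l", "eng"
  , "-c", "tessedit_char_whitelist=" ++ whitelist
  ]

-- Hypothetical helper: upper-case the raw OCR text and keep only
-- whitelisted characters (drops spaces, newlines and stray symbols).
cleanOcr :: String -> String
cleanOcr = filter (`elem` whitelist) . map toUpper

-- Run tesseract and return either an error description or the cleaned text.
recognizeViaStdout :: FilePath -> IO (Either String String)
recognizeViaStdout imagePath = do
  (code, out, err) <-
    readProcessWithExitCode "tesseract" (tesseractArgs imagePath) ""
  return $ case code of
    ExitSuccess   -> Right (cleanOcr out)
    ExitFailure n -> Left ("tesseract exited with code " ++ show n ++ ": " ++ err)
```

Returning Either keeps error messages out of the result string, so a caller can tell a failed run apart from a CAPTCHA that genuinely produced no characters. If the tesseract binary is missing altogether, readProcessWithExitCode still raises an IOException, which this sketch does not catch.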
4. Build and Run
stack build
stack run -- captcha1.png
Example output (识别结果 is the program's "recognition result" label):
识别结果: 9ABK
5. Extension: Batch Recognition
Add directory traversal to process multiple images:
import System.Directory (listDirectory)
import System.FilePath (takeExtension, (</>))  -- needs the filepath package

-- Run OCR on every .png file in a directory and print one line per image.
processDir :: FilePath -> IO ()
processDir dir = do
  files <- listDirectory dir
  let pngs = filter (\f -> takeExtension f == ".png") files
  mapM_ (\f -> do
    r <- runTesseract (dir </> f)
    putStrLn $ f ++ " -> " ++ r) pngs
Calling processDir "captchas" from main then recognizes every .png CAPTCHA image in the captchas folder.
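The batch step can also be sketched so that it is easier to test and reuse. The variant below is an illustration under stated assumptions rather than the article's own code: it matches the .png extension case-insensitively, sorts the file names for a stable order, takes the recognizer as a parameter, and returns the results instead of printing them. selectPngs and recognizeDir are hypothetical names.

```haskell
import Data.Char (toLower)
import Data.List (sort)
import System.Directory (listDirectory)
import System.FilePath (takeExtension, (</>))

-- Hypothetical helper: keep PNG files (extension compared
-- case-insensitively) and sort them so the output order is stable.
selectPngs :: [FilePath] -> [FilePath]
selectPngs = sort . filter isPng
  where
    isPng f = map toLower (takeExtension f) == ".png"

-- Hypothetical variant of processDir: the recognizer is passed in, and the
-- results come back as (file name, recognized text) pairs.
recognizeDir :: (FilePath -> IO String) -> FilePath -> IO [(FilePath, String)]
recognizeDir recognize dir = do
  files <- listDirectory dir
  mapM (\f -> do
    text <- recognize (dir </> f)
    return (f, text)) (selectPngs files)
```

With the main program's runTesseract in scope, recognizeDir runTesseract "captchas" yields the same pairs that processDir prints, and a stub such as (\_ -> return "TEST") can stand in for the recognizer when tesseract is not installed.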