Image CAPTCHA Recognition with Haskell and Tesseract
1. Environment Setup
Install the Haskell toolchain
GHCup is the recommended installer:
curl --proto '=https' --tlsv1.2 -sSf https://get-ghcup.haskell.org | sh
Install Tesseract OCR
Ubuntu/Debian
sudo apt install tesseract-ocr
macOS
brew install tesseract
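Before moving on, it can help to confirm that the tools are actually reachable from the shell. The following is a minimal sketch, not part of the original program: the names describeTool and checkTools are hypothetical, and it relies only on findExecutable from the directory package, which ships with GHC.

```haskell
import System.Directory (findExecutable)

-- | Turn a PATH lookup result into a human-readable status line.
describeTool :: String -> Maybe FilePath -> String
describeTool name Nothing     = name ++ ": not found on PATH"
describeTool name (Just path) = name ++ ": found at " ++ path

-- | Look up each required tool on PATH and print its status.
checkTools :: [String] -> IO ()
checkTools = mapM_ (\name -> findExecutable name >>= putStrLn . describeTool name)
```

Loading the file in GHCi and evaluating checkTools ["ghc", "tesseract"] prints one line per tool, so a missing installation shows up before any OCR code is written.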
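2. Create the Project
Create the project with Stack, using the default template (it generates the package.yaml, app/Main.hs, and captcha-hs-exe target used in the steps below; the simple template does not):
stack new captcha-hs
cd captcha-hs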
Edit package.yaml and add the dependencies:
dependencies:
- base >= 4.7 && < 5
- process
- regex-tdfa
- directory
- text
- filepath
3. Write the Recognition Program
Edit app/Main.hs:
import System.Process (callCommand)
import System.Directory (doesFileExist, removeFile)
import qualified Data.Text.IO as T
import qualified Data.Text as T
import Text.Regex.TDFA ((=~))
-- Clean the text: keep only uppercase letters and digits
cleanText :: T.Text -> T.Text
cleanText txt = T.concat (concat matches)
  where
    -- the pattern has no groups, so each match is a one-element list
    matches = txt =~ "[A-Z0-9]" :: [[T.Text]]
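-- Run tesseract on the image and return the cleaned text
recognizeCaptcha :: FilePath -> IO T.Text
recognizeCaptcha imagePath = do
  let outputBase = "hs_output"
      whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
      -- the command goes through the shell, so imagePath must not
      -- contain spaces or shell metacharacters
      cmd = "tesseract " ++ imagePath ++ " " ++ outputBase ++
            " -l eng -c tessedit_char_whitelist=" ++ whitelist
  callCommand cmd  -- throws an exception if tesseract exits with a non-zero code
  let txtFile = outputBase ++ ".txt"  -- tesseract appends .txt to the output base
  exists <- doesFileExist txtFile
  if not exists
    then return (T.pack "recognition failed")
    else do
      content <- T.readFile txtFile
      removeFile txtFile
      return $ cleanText content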
main :: IO ()
main = do
  let image = "captcha1.png"  -- replace with the path to your image
  result <- recognizeCaptcha image
  putStrLn $ "Result: " ++ T.unpack result
4. Run the Program
stack build
stack exec captcha-hs-exe
Sample output:
Result: 4A2K
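The program above builds a shell command string and reads the result back from a temporary file. As an alternative, here is a hedged sketch that passes the arguments as a list (so no shell is involved) and asks tesseract to print to standard output by using stdout as the output base. The names cleanCaptcha, tesseractArgs, and recognizeCaptchaSafe are hypothetical, and the --psm 7 option (treat the image as a single text line) is an assumption that requires Tesseract 4 or later and may not suit every CAPTCHA layout.

```haskell
import Data.Char (isAsciiUpper, isDigit)
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- | Keep only the characters this kind of CAPTCHA can contain: A-Z and 0-9.
cleanCaptcha :: String -> String
cleanCaptcha = filter (\c -> isAsciiUpper c || isDigit c)

-- | Arguments for one tesseract run. Using "stdout" as the output base makes
-- tesseract print the recognized text instead of writing <base>.txt.
tesseractArgs :: FilePath -> [String]
tesseractArgs imagePath =
  [ imagePath
  , "stdout"
  , "-l", "eng"
  , "--psm", "7"  -- assumption: the CAPTCHA is a single line of text
  , "-c", "tessedit_char_whitelist=" ++ ['A' .. 'Z'] ++ ['0' .. '9']
  ]

-- | Run tesseract without a shell. Left carries the error output when
-- tesseract exits with a non-zero code.
recognizeCaptchaSafe :: FilePath -> IO (Either String String)
recognizeCaptchaSafe imagePath = do
  (code, out, err) <- readProcessWithExitCode "tesseract" (tesseractArgs imagePath) ""
  pure $ case code of
    ExitSuccess   -> Right (cleanCaptcha out)
    ExitFailure n -> Left ("tesseract exited with code " ++ show n ++ ": " ++ err)
```

Because the image path is a separate argument, file names containing spaces work without quoting, and no temporary file has to be cleaned up. If the tesseract executable itself is missing, readProcessWithExitCode raises an IOException rather than returning Left.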
5. Extension: Batch Processing Images
Add the following to app/Main.hs, placing the import alongside the existing ones:
import System.Directory (listDirectory)
batchProcess :: FilePath -> IO ()
batchProcess dir = do
  files <- listDirectory dir
  -- keep only file names ending in .png
  let images = filter (\f -> T.pack ".png" `T.isSuffixOf` T.pack f) files
  mapM_ (\f -> do
    res <- recognizeCaptcha (dir ++ "/" ++ f)
    putStrLn $ f ++ " -> " ++ T.unpack res) images
Then call it from main:
main = batchProcess "captchas"
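The suffix check in batchProcess is case-sensitive, so a file named A.PNG would be skipped, and listDirectory returns entries in no guaranteed order. The sketch below is one possible variant under stated assumptions: isPng, listCaptchaImages, and batchWith are hypothetical names, the filepath package is listed in package.yaml, and the recognizer is passed in as a parameter so that recognizeCaptcha or any stand-in can be plugged in.

```haskell
import Data.Char (toLower)
import Data.List (sort)
import System.Directory (listDirectory)
import System.FilePath (takeExtension, (</>))

-- | True for names whose extension is .png, ignoring case.
isPng :: FilePath -> Bool
isPng path = map toLower (takeExtension path) == ".png"

-- | Full paths of the PNG files directly inside a directory, sorted by name.
listCaptchaImages :: FilePath -> IO [FilePath]
listCaptchaImages dir = do
  entries <- listDirectory dir
  pure [dir </> name | name <- sort entries, isPng name]

-- | Apply a recognizer to every image and print one result line per file.
batchWith :: (FilePath -> IO String) -> FilePath -> IO ()
batchWith recognize dir = do
  images <- listCaptchaImages dir
  mapM_ (\path -> recognize path >>= \res -> putStrLn (path ++ " -> " ++ res)) images
```

With the program from section 3, a call could look like batchWith (fmap T.unpack . recognizeCaptcha) "captchas". Note that callCommand throws on a tesseract failure, so a single unreadable image would still stop the whole batch unless the recognizer catches that exception.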