Can CapsWriter-Offline support quieter voice input? A DeepWiki investigation
DeepWiki Q&A with Code Context for Repository: HaujetZhao/CapsWriter-Offline
Q1
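Can this project use a newer speech recognition model to support whispered (very quiet) input? In other words, would upgrading the model let it recognize speech quieter than what it handles today?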
Answer
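Based on an analysis of the CapsWriter-Offline project, the architecture can accommodate a newer speech recognition model, but its current support for whispered input is limited. Details follow.
Current technical architecture
The project uses Alibaba's open-source Paraformer offline speech recognition model CapsWriter-Offline:126-128 , loaded as a quantized ONNX model through the sherpa-onnx framework CapsWriter-Offline:35-37 . Model settings are grouped in configuration classes and can be adjusted in the configuration file CapsWriter-Offline:58-66 .
Current audio processing
Audio processing is currently simple: the client does basic format conversion and segmentation CapsWriter-Offline:67-89 , with no audio enhancement aimed at low-volume or whispered input. There is a time threshold that filters out recordings that are too short CapsWriter-Offline:27-27 , but it exists to avoid accidental triggers and has nothing to do with volume.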
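The format conversion mentioned above can be illustrated with a minimal sketch. It assumes, as the quoted `client_send_audio.py` excerpt suggests, that the client records multi-channel audio at 48 kHz and sends `np.mean(data[::3], axis=1)`; the function name `to_mono_16k` and the test signal are hypothetical.

```python
import numpy as np

def to_mono_16k(data: np.ndarray) -> np.ndarray:
    """Keep every third frame (48 kHz -> 16 kHz), then average the channels."""
    # data is expected to have shape (frames, channels), sampled at 48 kHz
    return np.mean(data[::3], axis=1)

# One second of silent stereo float32 audio at 48 kHz (illustrative input)
stereo = np.zeros((48000, 2), dtype=np.float32)
mono = to_mono_16k(stereo)
print(mono.shape)
```

Nothing in this step changes the signal level, which is consistent with the observation that quiet input reaches the recognizer just as quiet as it was recorded. Taking every third sample without a low-pass filter can also introduce aliasing, which may matter more when the useful signal is weak.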
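Feasibility of a model upgrade
Technically feasible, for three reasons:
- Modular architecture: the server and client run separately CapsWriter-Offline:21-21 , which makes it easier to upgrade the recognition module on its own.
- Configurable model paths: model file paths and parameters are defined in the configuration file CapsWriter-Offline:51-55 , so model files of the same type can be swapped by editing the configuration.
- Standardized interface: the project uses the sherpa-onnx framework CapsWriter-Offline:28-28 , which supports several kinds of ONNX speech recognition models.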
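The way the configuration classes feed the recognizer can be sketched as follows. The two classes are copied from the quoted `config.py` excerpts (with `ModelPaths` trimmed to the two attributes used); the helper name `public_attrs` is hypothetical, and the sketch stops short of calling sherpa-onnx so that it runs without the library installed.

```python
from pathlib import Path

class ModelPaths:
    paraformer_path = Path() / 'models' / 'paraformer-offline-zh' / 'model.int8.onnx'
    tokens_path = Path() / 'models' / 'paraformer-offline-zh' / 'tokens.txt'

class ParaformerArgs:
    paraformer = f'{ModelPaths.paraformer_path}'
    tokens = f'{ModelPaths.tokens_path}'
    num_threads = 6
    sample_rate = 16000
    feature_dim = 80
    decoding_method = 'greedy_search'
    debug = False

def public_attrs(cls) -> dict:
    """Collect class attributes whose names do not start with an underscore."""
    return {key: value for key, value in cls.__dict__.items() if not key.startswith('_')}

kwargs = public_attrs(ParaformerArgs)
# In the project, this dict is unpacked into
# sherpa_onnx.OfflineRecognizer.from_paraformer(**kwargs)
print(sorted(kwargs))
```

Because every public attribute becomes a keyword argument, a replacement model would need its own args class whose attribute names match the parameters of whichever recognizer factory is used.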
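Suggestions for whisper support
To support whispered input, consider the following upgrades:
- Switch to a model that is more robust to low-volume speech, such as OpenAI Whisper or another model tuned for quiet input
- Add audio preprocessing: apply volume normalization and noise reduction in the existing audio sending path
- Tune capture sensitivity: adjust the audio capture parameters so that low-volume input is picked up more reliably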
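As an illustration of the volume normalization suggested above, here is a minimal sketch that scales a clip toward a target RMS level. It is not code from the project: the function name, the -20 dBFS target, and the 30 dB gain cap are all assumptions chosen for the example, and the input is assumed to be float audio in the range [-1, 1].

```python
import numpy as np

def normalize_rms(audio: np.ndarray, target_dbfs: float = -20.0,
                  max_gain_db: float = 30.0) -> np.ndarray:
    """Scale float audio in [-1, 1] toward a target RMS level, with a gain cap."""
    rms = np.sqrt(np.mean(np.square(audio, dtype=np.float64)))
    if rms == 0.0:
        return audio.copy()  # pure silence: nothing to scale
    target_rms = 10.0 ** (target_dbfs / 20.0)
    # Cap the gain so that near-silent clips are not boosted without limit
    gain = min(target_rms / rms, 10.0 ** (max_gain_db / 20.0))
    # Clip so that boosted peaks stay inside the valid float range
    return np.clip(audio * gain, -1.0, 1.0).astype(audio.dtype)

# A quiet 220 Hz tone at 16 kHz with peak amplitude 0.01 (illustrative input)
t = np.arange(16000) / 16000
quiet = (0.01 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
boosted = normalize_rms(quiet)
```

Gain of this kind raises background noise by the same factor as the speech, which is why the suggestion pairs normalization with noise reduction.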
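Main challenges
- Compute requirements: a more capable model may need more memory and compute CapsWriter-Offline:29-30
- Model compatibility: the new model must be compatible with the sherpa-onnx framework
- Real-time performance: recognition accuracy has to be balanced against response latency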
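Conclusion
The modular architecture and configurable design of CapsWriter-Offline give it a sound technical basis for moving to a newer speech recognition model that handles whispered input. The main work is choosing a suitable model, adding the necessary audio preprocessing, and tuning the configuration for low-volume input.
Notes
The Paraformer model currently in use is mainly tuned for Chinese speech at normal volume. For good whisper support, consider a model optimized for low-volume or noisy conditions, and add suitable signal enhancement at the audio capture stage.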
Citations
File: readme.md (L21-21)
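6. The server and client are separate, so one server can serve multiple clients
File: readme.md (L29-30)
2. The onnxruntime used by the server to load the model only runs on Windows 10 and later
3. Loading the model on the server requires 4 GB of system memory and a 64-bit system
File: readme.md (L126-128)
The server uses [sherpa-onnx](https://k2-fsa.github.io/sherpa/onnx/index.html) to load Alibaba's open-source [Paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) model ([converted to quantized ONNX format](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/paraformer-models.html)) for speech recognition; the whole model is about 230MB. Pre-converted model files are available for download: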
- [csukuangfj/sherpa-onnx-paraformer-zh-2023-09-14](https://huggingface.co/csukuangfj/sherpa-onnx-paraformer-zh-2023-09-14)
File: util/server_init_recognizer.py (L28-28)
import sherpa_onnx
File: util/server_init_recognizer.py (L35-37)
recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    **{key: value for key, value in ParaformerArgs.__dict__.items() if not key.startswith('_')}
)
File: config.py (L27-27)
threshold = 0.3 # time threshold, after the hotkey is pressed, for triggering speech recognition
File: config.py (L51-55)
class ModelPaths:
    model_dir = Path() / 'models'
    paraformer_path = Path() / 'models' / 'paraformer-offline-zh' / 'model.int8.onnx'
    tokens_path = Path() / 'models' / 'paraformer-offline-zh' / 'tokens.txt'
    punc_model_dir = Path() / 'models' / 'punc_ct-transformer_cn-en'
File: config.py (L58-66)
class ParaformerArgs:
    paraformer = f'{ModelPaths.paraformer_path}'
    tokens = f'{ModelPaths.tokens_path}'
    num_threads = 6
    sample_rate = 16000
    feature_dim = 80
    decoding_method = 'greedy_search'
    debug = False
File: util/client_send_audio.py (L67-89)
if cache:
    data = np.concatenate(cache)
    cache.clear()
else:
    data = task['data']
# Save the audio to a local file
duration += len(data) / 48000
if Config.save_audio:
    write_file(file, data)
# Send the audio data for recognition
message = {
    'task_id': task_id, # task ID
    'seg_duration': Config.mic_seg_duration, # segment length
    'seg_overlap': Config.mic_seg_overlap, # segment overlap
    'is_final': False, # whether this is the last message
    'time_start': time_start, # recording start time
    'time_frame': task['time'], # time of this frame
    'source': 'mic', # data source: received from the microphone
    'data': base64.b64encode( # audio data
        np.mean(data[::3], axis=1).tobytes()
    ).decode('utf-8'),
Q2
Which open-source models optimized for low-volume input can run locally on a laptop?
Answer
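Based on my analysis of the CapsWriter-Offline project, the following open-source speech recognition models can run locally on a laptop and are candidates for low-volume input. None of them is known to be trained specifically on whispered speech, so their behavior on quiet audio should be checked on real recordings.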
Current Implementation Context
CapsWriter-Offline currently uses the Alibaba Paraformer model through sherpa-onnx CapsWriter-Offline:125-128 , which is approximately 230MB in size CapsWriter-Offline:125-126 . The system requires 4GB of memory and runs on 64-bit systems CapsWriter-Offline:30-30 .
Recommended Models for Low-Volume Input
1. OpenAI Whisper Models (via sherpa-onnx)
- Whisper-tiny: ~39M parameters, the fastest of the three
- Whisper-base: ~74M parameters, better accuracy while remaining lightweight
- Whisper-small: ~244M parameters, a reasonable balance of size and accuracy
Whisper was trained on a large and diverse audio corpus, which tends to make it robust to varied recording conditions; how well it copes with quiet or whispered speech still needs to be tested.
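The figures 39, 74 and 244 match Whisper's published parameter counts in millions, so on-disk size depends on the numeric precision of the exported weights. The short sketch below estimates weights-only size under that assumption; the helper name is hypothetical, and real model files add some overhead for metadata and any layers left unquantized.

```python
def weight_size_mb(params_millions: float, bytes_per_weight: int) -> float:
    """Weights-only size estimate in MB (10**6 bytes)."""
    # (params_millions * 1e6 weights) * bytes_per_weight / 1e6 bytes per MB
    return params_millions * bytes_per_weight

for name, params in [('tiny', 39), ('base', 74), ('small', 244)]:
    fp32 = weight_size_mb(params, 4)  # 32-bit floats
    int8 = weight_size_mb(params, 1)  # 8-bit quantized weights
    print(f'whisper-{name}: ~{fp32:.0f} MB fp32, ~{int8:.0f} MB int8')
```

For comparison, the quoted readme puts the quantized Paraformer model at about 230MB.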
2. Wav2Vec2 Models
- wav2vec2-base-960h: ~360MB, English only (fine-tuned on 960 hours of LibriSpeech)
- wav2vec2-large-960h-lv60-self: ~1.2GB, higher accuracy, also English only
3. SpeechT5 Models
- Mid-sized alternatives (~250-500MB); their low-volume performance is unverified and should be tested
- Can be quantized for a smaller footprint
4. Vosk Models
- vosk-model-small: ~50MB, very lightweight
- vosk-model-en-us-0.22: ~1.8GB, higher accuracy, English only
- Designed for streaming, real-time recognition
Integration Considerations
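The current architecture uses sherpa-onnx as the inference engine CapsWriter-Offline:35-37 , which supports several model families, Whisper among them. Vosk ships its own runtime, and wav2vec2 and SpeechT5 are usually run through Hugging Face Transformers, so those may not plug into sherpa-onnx directly. The system is configured with 6 threads and a 16kHz sample rate CapsWriter-Offline:61-62 ; 16kHz is also the input rate Whisper and wav2vec2 expect.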
Specific Optimizations for Low-Volume Input
For upgrading CapsWriter-Offline to handle low-volume input better:
- Audio Preprocessing: Apply automatic gain control before feeding audio to the recognizer
- Model Selection: Whisper models are a reasonable first candidate because of their diverse training data
- Threshold Adjustment: The 0.3-second threshold CapsWriter-Offline:27-27 is a key-press duration, not a volume level, so it does not need to change for quieter speech
- Memory Considerations: The smaller models should fit comfortably within the current 4GB memory requirement; the larger ones (over 1GB on disk) should be measured
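The automatic gain control item can be sketched as a frame-by-frame gain that adapts over time, unlike a single gain computed for a whole clip. This is a simplified illustration, not the project's code: the function name, the 20 ms frame at 16 kHz (320 samples), the target level, the gain cap, and the smoothing factor are all assumptions.

```python
import numpy as np

def simple_agc(audio: np.ndarray, frame_len: int = 320, target_rms: float = 0.1,
               max_gain: float = 30.0, smooth: float = 0.9) -> np.ndarray:
    """Frame-wise gain control for float audio in [-1, 1]."""
    out = np.empty_like(audio)
    gain = 1.0
    for start in range(0, len(audio), frame_len):
        frame = audio[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        # Hold the previous gain on near-silent frames instead of boosting noise
        desired = min(target_rms / rms, max_gain) if rms > 1e-6 else gain
        # Exponential smoothing avoids abrupt gain jumps between frames
        gain = smooth * gain + (1.0 - smooth) * desired
        out[start:start + frame_len] = np.clip(frame * gain, -1.0, 1.0)
    return out

# A quiet 200 Hz tone at 16 kHz with peak amplitude 0.01 (illustrative input)
t = np.arange(16000) / 16000
quiet = (0.01 * np.sin(2 * np.pi * 200 * t)).astype(np.float32)
leveled = simple_agc(quiet)
```

With these settings the gain ramps up over roughly the first second, so the start of an utterance is boosted less than the end; a production AGC would typically use separate attack and release times.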
Implementation Path
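Since the project already uses sherpa-onnx CapsWriter-Offline:5-5 , switching to a Whisper model should mostly be confined to the model initialization section CapsWriter-Offline:35-37 : the from_paraformer call and the ParaformerArgs settings would be replaced by their Whisper equivalents. Whether the pinned sherpa_onnx version (1.8.11) supports the chosen model should be confirmed before switching.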
Notes
The recommendations focus on models that keep the project's offline, local-processing approach while aiming to improve performance on low-volume input. Whisper models are the suggested starting point because of their robustness to varied audio conditions, although their behavior on quiet speech still has to be measured. The existing sherpa-onnx infrastructure should keep the integration small: mainly a different recognizer factory call and configuration, rather than architectural changes.
Citations
File: readme.md (L30-30)
3. Loading the model on the server requires 4 GB of system memory and a 64-bit system
File: readme.md (L125-128)
The server uses [sherpa-onnx](https://k2-fsa.github.io/sherpa/onnx/index.html) to load Alibaba's open-source [Paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch) model ([converted to quantized ONNX format](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/paraformer-models.html)) for speech recognition; the whole model is about 230MB. Pre-converted model files are available for download:
- [csukuangfj/sherpa-onnx-paraformer-zh-2023-09-14](https://huggingface.co/csukuangfj/sherpa-onnx-paraformer-zh-2023-09-14)
File: util/server_init_recognizer.py (L35-37)
recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    **{key: value for key, value in ParaformerArgs.__dict__.items() if not key.startswith('_')}
)
File: config.py (L27-27)
threshold = 0.3 # time threshold, after the hotkey is pressed, for triggering speech recognition
File: config.py (L61-62)
num_threads = 6
sample_rate = 16000
File: requirements-server.txt (L5-5)
sherpa_onnx==1.8.11
Q3
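Please answer in Chinese where possible: how well do these recommended models support Chinese? Can hotwords be customized as in the original system?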
Answer
Based on my analysis of the CapsWriter-Offline repository, I can provide detailed information about the Paraformer model's Chinese language support and hotword customization capabilities, but I cannot make direct comparisons with the other models you mentioned (OpenAI Whisper, Wav2Vec2, SpeechT5, Vosk) since they are not implemented in this codebase.
CapsWriter-Offline's Paraformer Implementation
CapsWriter-Offline uses Alibaba's Paraformer model specifically optimized for Chinese language recognition. The system supports comprehensive Chinese language processing with features like automatic Chinese-English mixed input, automatic Arabic numeral conversion, and automatic spacing adjustment between Chinese and English text CapsWriter-Offline:16-16 .
The system loads the Paraformer model using sherpa-onnx framework CapsWriter-Offline:35-37 , with the model configured in the ParaformerArgs class CapsWriter-Offline:58-66 .
Hotword Customization System
CapsWriter-Offline provides extensive hotword customization capabilities through three types of hotword files CapsWriter-Offline:17-17 :
1. Chinese Hotwords (hot-zh.txt)
The system supports Chinese hotword replacement based on pinyin matching. Two configuration switches control the matching: one enables polyphonic characters, the other decides whether tones must match CapsWriter-Offline:37-38 . Each hotword is expanded into every possible pinyin reading CapsWriter-Offline:43-88 , and the readme reports about 3 ms of added latency per 10,000 hotwords CapsWriter-Offline:59-59 .
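The expansion of polyphonic characters in the cited `util/hot_sub_zh.py` amounts to a Cartesian product over the candidate readings of each character. A minimal sketch of that idea, using `itertools.product` instead of the project's loop, is shown below; the function name is hypothetical, and the readings are copied from the 撒贝宁 example in the quoted docstring rather than computed with a pinyin library.

```python
from itertools import product

def expand_readings(per_char_readings):
    """Return every full reading of a word, given candidate readings per character."""
    return [list(combo) for combo in product(*per_char_readings)]

# Candidate readings per character, as a pinyin library might return them
readings = [['sā', 'sǎ'], ['bèi'], ['níng', 'nìng', 'zhù']]
variants = expand_readings(readings)
print(len(variants))
```

The number of variants is the product of the per-character counts (here 2 x 1 x 3), which is why a hotword made of several polyphonic characters produces many dictionary entries.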
2. English Hotwords (hot-en.txt)
English hotwords are replaced based on letter spelling matching CapsWriter-Offline:61-61 . The system includes a comprehensive list of technical terms, brand names, and common English words CapsWriter-Offline:1-271 .
3. Custom Rule Hotwords (hot-rule.txt)
The system supports regex-based custom replacement rules, one per line, where the search pattern and the replacement are separated by an equals sign surrounded by spaces, for example `毫安时 = mAh` CapsWriter-Offline:63-63 . This allows for complex pattern matching and replacement, including capture groups CapsWriter-Offline:43-55 .
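A minimal sketch of how rules in this format could be parsed and applied is given below. It is an illustration rather than the project's implementation: the function names and the choice to split on the first whitespace-delimited equals sign are assumptions, while the two rules are taken from the default rule text quoted in the citations.

```python
import re

RULES_TEXT = r'''
# lines starting with # are ignored
毫安时 = mAh
(艾特)\s*(\w+)\s*(点)\s*(\w+) = @\2.\4
'''

def parse_rules(text):
    """Parse 'pattern = replacement' lines into (compiled regex, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split on the first equals sign that has whitespace on both sides
        parts = re.split(r'\s+=\s+', line, maxsplit=1)
        if len(parts) == 2:
            rules.append((re.compile(parts[0]), parts[1]))
    return rules

def apply_rules(text, rules):
    """Apply every rule in order to the recognized text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

rules = parse_rules(RULES_TEXT)
print(apply_rules('容量五千毫安时', rules))    # 容量五千mAh
print(apply_rules('艾特 gmail 点 com', rules))  # @gmail.com
```

Because the rules run on the recognized text, they work the same way regardless of which recognition model produced it.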
Dynamic Loading
All hotword files support dynamic loading - the client automatically detects file changes and reloads the hotword dictionaries without requiring a restart CapsWriter-Offline:67-67 CapsWriter-Offline:97-119 .
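The cited handler reacts to file-change events. As a simpler standard-library illustration of the same reload-on-change idea, the sketch below polls the file's modification time instead of subscribing to events; this is a swapped-in technique, and the class and method names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

class HotwordFile:
    """Reload a hotword file when its modification time changes (polling sketch)."""

    def __init__(self, path):
        self.path = Path(path)
        self.mtime = None
        self.words = []

    def reload_if_changed(self) -> bool:
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False  # unchanged since the last load
        self.mtime = mtime
        lines = self.path.read_text(encoding='utf-8').splitlines()
        # Skip blank lines and '#' comments, as the hotword files do
        self.words = [s.strip() for s in lines
                      if s.strip() and not s.strip().startswith('#')]
        return True

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'hot-en.txt'
    path.write_text('# comment\nChatGPT\n', encoding='utf-8')
    hot = HotwordFile(path)
    first = hot.reload_if_changed()   # loads the file
    second = hot.reload_if_changed()  # nothing changed, so no reload
    words = list(hot.words)
```

An event-based watcher, as in the project, avoids the polling interval; the quoted handler also ignores events less than 2 seconds apart and waits 0.2 seconds before reading so that the editor can finish writing.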
Limitations of This Analysis
I cannot provide specific comparisons with OpenAI Whisper, Wav2Vec2, SpeechT5, and Vosk regarding their Chinese language support and hotword customization capabilities, as these models are not implemented in the CapsWriter-Offline codebase. To make such comparisons, you would need to research each model's documentation and implementation details separately.
Notes
The CapsWriter-Offline system demonstrates that effective Chinese speech recognition with hotword customization is achievable through post-processing approaches rather than requiring the recognition model itself to support hotwords. The Paraformer model handles the core Chinese speech recognition, while the sophisticated hotword system handles domain-specific terminology and proper nouns through text replacement after recognition is complete.
Citations
File: readme.md (L16-16)
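1. Fully offline, unlimited duration, low latency, high accuracy, mixed Chinese-English input, automatic Arabic numerals, automatic spacing between Chinese and English
File: readme.md (L17-17)
2. Hotwords: three kinds of hotwords can be added in `hot-en.txt hot-zh.txt hot-rule.txt`, and the client loads them dynamically
File: readme.md (L59-59)
- Put Chinese hotwords in `hot-zh.txt`, one per line; replacement is based on pinyin, and in testing every 10,000 hotwords add about 3 ms of latency
File: readme.md (L61-61)
- Put English hotwords in `hot-en.txt`, one per line; replacement is based on letter spelling
File: readme.md (L63-63)
- Put custom-rule hotwords in `hot-rule.txt`, one per line, with the search and replacement terms separated by an equals sign, e.g. `毫安时 = mAh`
File: readme.md (L67-67)
After a modification is detected, the client reloads the hotwords dynamically. Example: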
File: util/server_init_recognizer.py (L35-37)
recognizer = sherpa_onnx.OfflineRecognizer.from_paraformer(
    **{key: value for key, value in ParaformerArgs.__dict__.items() if not key.startswith('_')}
)
File: config.py (L37-38)
多音字 = True # True enables polyphonic-character matching
声调 = False # False ignores tone differences, so 「黄章」 can match 「慌张」
File: config.py (L58-66)
class ParaformerArgs:
    paraformer = f'{ModelPaths.paraformer_path}'
    tokens = f'{ModelPaths.tokens_path}'
    num_threads = 6
    sample_rate = 16000
    feature_dim = 80
    decoding_method = 'greedy_search'
    debug = False
File: util/hot_sub_zh.py (L43-88)
def 更新热词词典(热词文本: str):
    '''
    Convert text with one hotword per line into a pinyin dictionary.
    Lines starting with # are skipped.
    heteronym: whether polyphonic characters are enabled
    With polyphonic characters enabled, the returned dictionary looks like this:
    {'撒贝宁': [
        ['sā', 'bèi', 'níng'],
        ['sǎ', 'bèi', 'níng'],
        ['sā', 'bèi', 'nìng'],
        ['sǎ', 'bèi', 'nìng'],
        ['sā', 'bèi', 'zhù'],
        ['sǎ', 'bèi', 'zhù']
        ]
    }
    Without polyphonic characters, the returned dictionary looks like this:
    {'撒贝宁': [
        ['sā', 'bèi', 'níng'],
        ]
    }
    '''
    global 热词词典; 热词词典.clear()
    for 热词 in 热词文本.splitlines():
        热词 = 热词.strip() # strip surrounding whitespace from the hotword
        if not 热词 or 热词.startswith('#'): continue # skip blank lines and comments
        热词拼音 = pinyin(热词, 风格, 多音字) # get the pinyin
        if len(热词拼音) != len(热词):
            print(f'\x9b31m 热词「{热词}」得到的拼音数量与字数不符,抛弃\x9b0m')
            continue
        拼音列表 = [[], ]
        for 多音 in 热词拼音:
            音数 = len(多音)
            if 音数 > 1:
                原始列表, 拼音列表 = 拼音列表, []
                for 音 in 多音:
                    拼音列表.extend([x.copy() + [音] for x in 原始列表])
            else:
                for x in 拼音列表: x.append(多音[0])
        热词词典[热词] = 拼音列表
    return len(热词词典)
File: hot-en.txt (L1-271)
# 在此文件放置英文热词
# Put English hot words here, one per line. Line starts with # will be ignored.
# 杂项
IP
IO
PC
KB
MB
GB
TB
PB
PDF
DPI
Apple
Apple Watch
Apple Vision
iPhone
iPad
iMac
MacBook
Pro
iCloud
Siri
Hey Siri
Samsung
Android
Google
Micro-USB
Mini-USB
Type-C
USB
HDMI
SD
Readme
3M
SpaceX
StarShip
Falcon
Raptor
Merlin
Transformer
Paraformer
China
Chinese
England
English
America
American
Germany
German
Russia
Russian
Australia
Australian
French
Japan
Japanese
Korea
Korean
# 数字开头
7-Zip
# A开头
Adobe
After Effects
AI
Airbnb
Amazon
AMD
Apple
Audition
AutoCAD
App
# B开头
Bandizip
BIOS
Blender
# C开头
CAD
CameraRAW
CD
CD-ROM
CEO
CFO
CIA
Cisco
Clash
CMD
CPU
Creative Cloud
Cubase
ChatGPT
CapsWriter
# D开头
Dell
Dolby
Dropbox
Directory OPUS
# E开头
Edge
EPA
Epic
EU
Evernote
Excel
EXIF
# F开头
Facebook
FBI
FCC
FDA
Figma
Final Cut Pro
Firefox
FTP
# G开头
GDP
GitHub
Google
Google Docs
Google Drive
Google Sheets
Google Slides
GPT
GUI
GPU
# H开头
HD
# I开头
IBM
Illustrator
IMF
InDesign
Instagram
Intel
iPhone
# J开头
Java
JavaScript
# K开头
Kotlin
KPI
# L开头
Lightroom
LinkedIn
Linux
Logic Pro
Lua
# M开头
Maya
Microsoft
Markdown
# N开头
NASA
Netflix
Notepad
Notion
NSA
NVIDIA
# O开头
OBS
OCR
OneNote
OpenAI
Oracle
Outlook
# P开头
Photoshop
PHP
Pinterest
PowerPoint
PowerShell
PPT
Premiere Pro
Pro Tools
Python
# Q开头
QQ
# R开头
RAM
RAW
Reddit
RGB
Ruby
Rust
# S开头
Sketch
Skype
Slack
Snapchat
SolidWorks
SSD
Steam
# T开头
Teams
Telegram
Tencent
TikTok
Trello
TV
Twitter
TypeScript
# U开头
Uber
UEFI
UI
UNESCO
UNICEF
USB
# V开头
VIP
VPN
VR
VRChat
# W开头
WeChat
WhatsApp
WHO
WiFi
Windows
WinRAR
Wix
Word
WordPress
WTO
# X开头
Xbox
XD
# Y开头
YouTube
# Z开头
File: util/client_hot_update.py (L43-55)
r'''# 在此文件放置自定义规则,每行一条正则表达式,
# 左边是查找模式,右边是替换式,中间用带空格的等号分开
# 以 # 开头的会被忽略,将查找和匹配用等号隔开,文本两边的空格会被省略。例如:
毫安时 = mAh
赫兹 = Hz
伏特 = V
二、 = 二
负一 = -1
(艾特)\s*(QQ)\s*点\s* = @qq.
(艾特)\s*([一幺]六三)\s*点\s* = @163.
(艾特)\s*(\w+)\s*(点)\s*(\w+) = @\2.\4
''')
File: util/client_hot_update.py (L97-119)
def on_modified(self, event):
    # Ignore events that arrive less than 2 seconds after the previous one
    if time.time() - self.last_time < 2:
        return
    # Ignore paths that are not being watched
    event_path = Path(event.src_path)
    if event_path not in self.updates:
        return
    # Record the time of this update
    self.last_time = time.time()
    # Wait 0.2 s so the editor can finish writing the hotword file; otherwise it may be read as empty
    time.sleep(0.2)
    console.print('[green4]检测到配置文件更新,[/]', end='')
    # Reload
    try:
        self.updates[event_path]()
        console.line()
    except Exception as e:
        console.print(f'更新热词失败:{e}', style='bright_red')