Downloading most of Sogou's cell dictionaries (细胞词库)

import sys
import requests
from bs4 import BeautifulSoup as BS

def get_links (url):
  links = []
  try:
    r = requests.get(url); r.raise_for_status() # raise on HTTP errors
    links = BS(r.text, 'html.parser').find_all('a')
    # find_all() returns [] when nothing matches; find() returns None
  except Exception: pass
  # Tag.text is a plain attribute, but markup attributes such as href
  # need .get(); default to '' rather than None when href is missing
  return [(a.text, a.get('href', '')) for a in links]

B = 'https://pinyin.sogou.com/'
S = '/dict/detail/index/'
T = '?rf=dictindex'
if len(sys.argv) == 1:
  # no argument: list (name, id) of the dictionaries linked from the index page
  for name, href in get_links(B + 'dict/'):
    if S in href: print(name, href.replace(S, '').replace(T, ''))
else:
  # with a dictionary id: print that dictionary's download links
  for name, href in get_links(B + S + sys.argv[1] + T):
    if 'download_cell' in href: print('https:' + href)

I tried wget -r -A '*.scel' for a while without success, so I asked an AI to write a link_extractor.py and tweaked it a bit.

$ py le.py
...
汽车词汇大全 15153
歌手人名大全 20658
热门电影大全 20652
$ py le.py 15153
...
https://pinyin.sogou.com/d/dict/download_cell.php?id=82331&name=斯柯达
https://pinyin.sogou.com/d/dict/download_cell.php?id=93870&name=考啦维护
https://pinyin.sogou.com/d/dict/download_cell.php?id=93866&name=考啦学车
https://pinyin.sogou.com/d/dict/download_cell.php?id=93868&name=考啦陪驾

〔Converting Sogou cell dictionaries to text files〕〔Converting Sogou skins to Fcitx〕


Just ask an AI to "write a Python/tkinter program that lets you select one or more items from a list and returns their indices, like dpkg-reconfigure locales does".
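What comes back for that prompt is roughly the following sketch; the function name `choose` and the widget layout are my own guesses, not the actual generated program:

```python
import tkinter as tk

def choose(items):
  # Multi-select list box; returns the indices of the chosen rows,
  # in the spirit of the dpkg-reconfigure locales list dialog.
  picked = []
  root = tk.Tk()
  lb = tk.Listbox(root, selectmode='extended')  # Ctrl/Shift for multi-select
  for item in items:
    lb.insert('end', item)
  lb.pack(fill='both', expand=True)
  def ok():
    picked.extend(lb.curselection())  # tuple of selected row indices
    root.destroy()
  tk.Button(root, text='OK', command=ok).pack()
  root.mainloop()
  return tuple(picked)
```

Calling e.g. `choose(['汽车词汇大全', '歌手人名大全', '热门电影大全'])` pops up the list and returns whichever indices were highlighted when OK was pressed.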

PySimpleGUI no longer works. `pip install pysimplegui` installs in a flash, but what it installs is useless (16 KB).

>>> import PySimpleGUI
PySimpleGUI is now located on a private PyPI server. 
Please add to your pip command: -i https://PySimpleGUI.net/install

The version you just installed should be uninstalled:
   python -m pip uninstall PySimpleGUI
   python -m pip cache purge

Then install the latest from the private server:
python -m pip install --upgrade --extra-index-url https://PySimpleGUI.net/install PySimpleGUI

After following those instructions to bring in the real thing, … on first run … 99 US dollars …

With the -i or --input-file option, wget reads multiple URLs from a file and downloads them in turn.
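The links printed by le.py can be glued to that option with a few lines; `save_urls`, `fetch`, and `urls.txt` are names invented for this sketch:

```python
import subprocess

def save_urls(lines, path='urls.txt'):
  # Keep only the download links (the lines starting with 'https:') and
  # write them one per line, the format wget -i/--input-file expects.
  urls = [l.strip() for l in lines if l.strip().startswith('https:')]
  with open(path, 'w') as f:
    f.write('\n'.join(urls) + '\n')
  return urls

def fetch(path='urls.txt'):
  # wget downloads every URL in the file in turn,
  # still honoring the headers configured in ~/.wgetrc.
  subprocess.run(['wget', '--input-file', path], check=False)
```

Wired up as, say, `save_urls(sys.stdin); fetch()`, it can sit at the end of a pipe: `py le.py 15153 | py fetch.py`.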

My ~/.wgetrc:

debug=off
random_wait=off
header=Connection: keep-alive
header=sec-ch-ua: "Not/A)Brand";v="8", "Chromium";v="126"
header=sec-ch-ua-mobile: ?0
header=User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.251 Safari/537.36
header=sec-ch-ua-platform: "Linux"
header=Sec-Fetch-Site: same-origin
header=Sec-Fetch-Mode: no-cors
#header=Accept-Encoding: gzip, deflate, br, zstd
header=Accept-Language: zh-CN,zh-TW;q=0.9,zh;q=0.8,en-US;q=0.7,en;q=0.6

Copied from the 360 browser. There are many, many ways to capture these headers; one of them is to write the most minimal possible httpd in Python and print them.

Addenda:

  • With gzip-compressed responses, wget -r stops right after downloading index.html. The Accept-Encoding: gzip, deflate, br, zstd line should stay commented out
    • That header was added not to show off, but to download videos from Bilibili
      • Which failed shamefully. But you-get fetched 1477 episodes of 《四郎讲棋》. Besides being classy and good-looking, that program even takes an api-key
  • The Python script didn't download every .scel; not all links are handled yet (work in progress)
  • The old Python converter choked on 网络流行新词.scel, so handling was added for headers of the b'\x40\x15\0\0\x45\x43\x53\x01\x01\0\0\0' kind
  • Is claiming the fruits of netizens' labor nationwide for yourself really the spirit of the Internet?
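The classic .scel header starts with b'\x40\x15\0\0\x44\x43\x53\x01\x01\0\0\0'; the variant above differs only in the fifth byte (\x45). Dispatching on the magic might look like the sketch below; the word-table offsets are assumptions from poking at files, not documented values:

```python
# Assumed layout: the classic format keeps its word table at 0x2628;
# the b'\x45' variant reportedly shifts it (0x26c4 is a guess to verify).
WORD_TABLE_OFFSET = {
  b'\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00': 0x2628,  # classic scel
  b'\x40\x15\x00\x00\x45\x43\x53\x01\x01\x00\x00\x00': 0x26c4,  # newer variant
}

def word_table_offset(data):
  # Dispatch on the 12-byte magic at the start of the file.
  for magic, offset in WORD_TABLE_OFFSET.items():
    if data.startswith(magic):
      return offset
  raise ValueError('unrecognized scel header: %r' % data[:12])
```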

Many, many ways? Lately I like using long words to test input methods. :-)

from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from threading import Thread

class ReqHandler (SimpleHTTPRequestHandler):

  def do_GET (self):
    print(self.headers)      # dump whatever headers the browser sent
    self.send_response(204)  # answer, so the client doesn't wait forever
    self.end_headers()

def httpd_thread ():
  # allow_reuse_port must be set on the class: the socket is bound in the
  # constructor, so assigning it on the instance afterwards has no effect.
  # SO_REUSEPORT lets several sockets bind the exact same IP:port pair,
  # with the kernel hashing incoming connections across them. (For a quick
  # restart after a crash, allow_reuse_address/SO_REUSEADDR is what matters,
  # and http.server already enables it by default.)
  ThreadingHTTPServer.allow_reuse_port = True
  httpd = ThreadingHTTPServer(('', 8000), ReqHandler)
  print('Listening at', httpd.server_address[1])
  httpd.serve_forever()

Thread(daemon=True, target=httpd_thread).start()
try:
  while True: input()  # quit on Ctrl-C / Ctrl-D
except BaseException: pass

All of them together contain at least 4,449,756 words (not deduplicated). 168 MB shrinks to 26 MB under 7-Zip's maximum compression.

posted @ 2025-11-08 09:20  华容道专家