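The famous Brown corpus and nltk
The AI said: nltk.download('brown'); if your connection is slow, you can manually download it from a Baidu Netdisk share. That gave me a scare.
In no time I had brown.zip under ~/nltk_data, a whopping 3.2MB. Yes, 3.2MB. It holds about 500 text files of 20 to 30 KB each.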
from nltk.corpus import brown
print(len(brown.words()))  # 1,161,192 total words
print(f"Total sentences: {len(brown.sents())}")  # 57,340
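For a bit more than the two totals, here is a minimal sketch. `corpus_stats` is a hypothetical helper, not part of nltk; it assumes only that the corpus object exposes `words()` and `sents()`, as `nltk.corpus.brown` does.

```python
def corpus_stats(corpus):
    """Return basic size figures for any object exposing words() and sents()."""
    n_words = len(corpus.words())
    n_sents = len(corpus.sents())
    return {
        "words": n_words,
        "sentences": n_sents,
        # Mean number of tokens per sentence (0.0 for an empty corpus).
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
        # Distinct tokens after lowercasing; punctuation tokens are included.
        "vocabulary": len({w.lower() for w in corpus.words()}),
    }
```

With the corpus downloaded, `corpus_stats(brown)` should report the same two totals as above, which works out to roughly 20 tokens per sentence.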
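download('all') is brutal: 3.5GB, counting both the .zip files and their extracted contents.
After moving *.zip to another directory, 2.8G remains. corpora/ takes 2.4G, with 75 directories.
In corpora/words, en has 235,886 lines and en-basic has 850 lines. The README says: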
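The sizes above can be re-measured with a short script. This is a sketch using only the standard library; `dir_usage` and `subdir_count` are hypothetical helper names, and the figures are sums of file sizes, so they may differ slightly from what `du` reports in disk blocks.

```python
from pathlib import Path

def dir_usage(root):
    """Sum file sizes under root, split into .zip archives and everything else."""
    zipped = other = 0
    for p in Path(root).rglob("*"):
        if p.is_file():
            size = p.stat().st_size
            if p.suffix == ".zip":
                zipped += size
            else:
                other += size
    return {"zip": zipped, "other": other, "total": zipped + other}

def subdir_count(path):
    """Count the immediate subdirectories of path."""
    return sum(1 for p in Path(path).iterdir() if p.is_dir())
```

For example, `dir_usage(Path.home() / "nltk_data")` gives the zip versus extracted split in bytes, and `subdir_count(Path.home() / "nltk_data" / "corpora")` gives the directory count.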
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory.
For central installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix).
Next, select the packages or collections you want to download.
If you did not install the data to one of the above central locations, you will need to set the NLTK_DATA environment variable to specify the location of the data.
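For a non-central location, the same thing can be done from Python. A minimal sketch: `use_nltk_data_dir` is a hypothetical helper, and it assumes a single directory is enough (it overwrites any existing NLTK_DATA value). It sets the environment variable for the current process and, when nltk is installed, also appends the directory to `nltk.data.path`, the list nltk searches for data.

```python
import os

def use_nltk_data_dir(path):
    """Point NLTK at a custom data directory (hypothetical helper)."""
    path = os.path.abspath(os.path.expanduser(path))
    # NLTK reads NLTK_DATA when it is first imported, so set it early.
    os.environ["NLTK_DATA"] = path
    try:
        import nltk
    except ImportError:
        return path  # nltk is not installed; the variable is still set
    # If nltk was already imported, extend its search path directly.
    if path not in nltk.data.path:
        nltk.data.path.append(path)
    return path
```

Calling `use_nltk_data_dir("~/data/nltk_data")` before loading any corpus should be enough for the current process; for a permanent setting, export NLTK_DATA in the shell profile instead.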
