NLP | The mC4 Dataset
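mC4 is the multilingual counterpart of C4. C4 is a collection of roughly 750 GB of English text extracted from the public Common Crawl repository, which holds billions of web pages scraped from the Internet. Whereas C4 was explicitly designed to be English-only, mC4 covers the 108 language codes listed in the table below, drawn from the Common Crawl scrapes released to date and limited to languages with 10,000 or more web pages.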
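There is evidence that language models amplify the biases present in the datasets they are trained on. Although some researchers argue that no current machine learning technique can adequately prevent harmful outputs, the Google researchers tried to mitigate bias in mT5 by deduplicating lines across mC4 documents and filtering out pages that contain bad words. They also used a tool to detect each page's primary language and removed pages whose detection confidence was below 70%.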
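The confidence cut-off described above can be sketched roughly as follows. This is an illustration only: `detect` is a hypothetical callable that returns a `(language, confidence)` pair, standing in for a real language identifier, and `toy_detect` is a placeholder rather than an actual model.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical detector signature: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]

MIN_CONFIDENCE = 0.70  # threshold quoted in the text


def filter_by_language_confidence(
    pages: Iterable[str],
    detect: Detector,
    min_confidence: float = MIN_CONFIDENCE,
) -> Iterator[Tuple[str, str]]:
    """Yield (language, page) for pages whose detection confidence is high enough."""
    for page in pages:
        language, confidence = detect(page)
        if confidence >= min_confidence:
            yield language, page


def toy_detect(text: str) -> Tuple[str, float]:
    """Placeholder detector (NOT a real model): scores text by its share of ASCII characters."""
    ascii_share = sum(ch.isascii() for ch in text) / max(len(text), 1)
    if ascii_share >= 0.5:
        return "en", ascii_share
    return "und", 1.0 - ascii_share


# The second page scores only 0.5 with the toy detector, so it is dropped.
kept = list(filter_by_language_confidence(["hello world", "héééé ok"], toy_detect))
```

In a real pipeline the placeholder would be swapped for an actual language identifier; only the thresholding logic is the point here.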
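Google says that, as of October 2020, the largest mT5 model (13 billion parameters) topped every benchmark it was tested against. These include five tasks from the XTREME multilingual benchmark; the XNLI entailment task, covering 14 languages; the XQuAD, MLQA, and TyDi QA reading comprehension benchmarks, with 10, 7, and 11 languages respectively; and the PAWS-X paraphrase identification dataset, with 7 languages.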
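C4
- An unlabeled dataset.
- Contains about 750 GB of English text.
- Keeps only pages classified as English with a probability of at least 99%.
- Deduplicates lines across documents and removes pages that contain bad words.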
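A minimal sketch of the last step in the list, under stated assumptions: lines are compared by exact match after stripping whitespace, bad words are matched against whitespace-separated lowercase tokens, and `BAD_WORDS` is a placeholder set. None of these details are taken from the original pipeline.

```python
from typing import Iterable, List, Set

# Placeholder block list; the real list is not reproduced here.
BAD_WORDS: Set[str] = {"badword"}


def clean_corpus(pages: Iterable[str], bad_words: Set[str] = BAD_WORDS) -> List[str]:
    """Drop pages containing a blocked word, then drop lines already seen in earlier pages."""
    seen_lines: Set[str] = set()
    cleaned: List[str] = []
    for page in pages:
        tokens = set(page.lower().split())
        if tokens & bad_words:
            continue  # the whole page is removed, not just the offending line
        unique_lines = []
        for line in page.split("\n"):
            key = line.strip()
            if not key or key in seen_lines:
                continue  # skip blank lines and lines seen in an earlier page
            seen_lines.add(key)
            unique_lines.append(line)
        if unique_lines:
            cleaned.append("\n".join(unique_lines))
    return cleaned


pages = ["Home\nUnique text A", "Home\nUnique text B", "this has badword inside"]
result = clean_corpus(pages)
```

Here the repeated "Home" line survives only in the first page, and the third page is discarded entirely.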
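mC4
- Uses cld3 to identify and collect data in more than 100 languages.
- Does not apply C4's rule of removing lines that lack English terminal punctuation, because many languages do not use those marks.
- Instead applies a "line length filter" that requires a page to contain at least three lines of 200 or more characters.
- Deduplicates lines across documents and removes pages that contain bad words, as in C4.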
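The line length filter from the list above is simple enough to sketch directly. The function name and the choice to split on newline characters are assumptions made for illustration; the thresholds are the ones quoted in the text.

```python
MIN_LINES = 3    # a page needs at least three long lines
MIN_CHARS = 200  # a long line has 200 or more characters


def passes_line_length_filter(
    page: str, min_lines: int = MIN_LINES, min_chars: int = MIN_CHARS
) -> bool:
    """Return True if the page has at least `min_lines` lines of `min_chars`+ characters."""
    long_lines = sum(1 for line in page.split("\n") if len(line) >= min_chars)
    return long_lines >= min_lines


long_line = "a" * 200
keep = passes_line_length_filter("\n".join([long_line] * 3))
drop = passes_line_length_filter("\n".join([long_line, long_line, "short"]))
```

Unlike a punctuation-based rule, this check does not depend on any particular script, which is why it suits a multilingual corpus.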
Source paper:
mT5 (multilingual Text-to-Text Transfer Transformer) is pretrained on mC4, a new Common Crawl-based dataset covering 101 languages.

You can load the mC4 subset for any set of languages like this:
```python
from datasets import load_dataset

# Request only the listed language codes instead of the full corpus.
mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
```
Supported tasks and leaderboards
mC4 is mainly used to pretrain language models and word representations.
Dataset structure
Data languages
| language code | language name |
|---|---|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bg-Latn | Bulgarian (Latin) |
| bn | Bangla |
| ca | Catalan |
| ceb | Cebuano |
| co | Corsican |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| el-Latn | Greek (Latin) |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fil | Filipino |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| haw | Hawaiian |
| hi | Hindi |
| hi-Latn | Hindi (Latin) |
| hmn | Hmong, Mong |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iw | Hebrew (former code) |
| ja | Japanese |
| ja-Latn | Japanese (Latin) |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kyrgyz |
| la | Latin |
| lb | Luxembourgish |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| ny | Nyanja |
| pa | Punjabi |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| ru-Latn | Russian (Latin) |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sm | Samoan |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| st | Southern Sotho |
| su | Sundanese |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| und | Unknown language |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zh-Latn | Chinese (Latin) |
| zu | Zulu |
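The table has 108 rows, while the paper excerpt earlier mentions 101 languages. The sketch below, which uses only the codes from the table, shows one possible way to reconcile the two numbers: six rows are Latin-script variants of languages that are already listed, and one row (`und`) marks unknown language. Treating those seven rows as not being distinct languages is an interpretation of the table, not something the source states, and `split_code` is a helper introduced here for illustration.

```python
from typing import Optional, Tuple

# Language codes copied from the table above.
CODES = """
af am ar az be bg bg-Latn bn ca ceb co cs cy da de el el-Latn en eo es
et eu fa fi fil fr fy ga gd gl gu ha haw hi hi-Latn hmn ht hu hy id
ig is it iw ja ja-Latn jv ka kk km kn ko ku ky la lb lo lt lv mg
mi mk ml mn mr ms mt my ne nl no ny pa pl ps pt ro ru ru-Latn sd
si sk sl sm sn so sq sr st su sv sw ta te tg th tr uk und ur
uz vi xh yi yo zh zh-Latn zu
""".split()


def split_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a code such as 'zh-Latn' into ('zh', 'Latn'); plain codes get script None."""
    language, _, script = code.partition("-")
    return language, script or None


script_variants = [code for code in CODES if split_code(code)[1] is not None]
base_languages = {split_code(code)[0] for code in CODES} - {"und"}

print(len(CODES), len(script_variants), len(base_languages))
```

With the codes in the table this prints `108 6 101`.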
Data instances
An example from the en configuration:
```python
{'timestamp': '2018-06-24T01:32:39Z',
 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
```
Data fields
Each record has the following fields:
- url: the source URL, as a string
- text: the text content, as a string
- timestamp: the timestamp, as a string
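Because all three fields are plain strings, a little parsing is usually needed before they can be used for analysis. The sketch below is one possible approach using only the standard library; the `summarize` helper is hypothetical, the `text` value is shortened from the en example, and the timestamp format is inferred from that single example.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Record shaped like the en example; the text is truncated for brevity.
record = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}


def summarize(record: dict) -> dict:
    """Parse the string fields of one record into typed values."""
    crawled_at = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)  # the trailing 'Z' denotes UTC
    return {
        "host": urlparse(record["url"]).netloc,
        "crawled_at": crawled_at,
        "num_lines": len(record["text"].split("\n")),
        "num_chars": len(record["text"]),
    }


summary = summarize(record)
```

Grouping records by `host` or by crawl date is then straightforward, for instance to inspect how much text a single site contributes.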
