FastText in Practice

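FastText is an open-source library from Facebook for learning word vectors and classifying text, released in 2016. Its typical use case is supervised text classification. It offers simple, efficient methods for text classification and representation learning, with accuracy often comparable to deep learning models and much faster training.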

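FastText combines several of the most successful ideas from natural language processing and machine learning: representing sentences as a bag of words and a bag of n-grams, using subword information, and sharing information across classes through a hidden representation. It also uses a hierarchical softmax, which exploits the imbalanced class distribution, to speed up computation.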
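
To make the bag-of-n-grams and subword ideas concrete, here is a minimal sketch in plain Python. The helper names word_ngrams and char_ngrams are hypothetical and not part of the fastText API. The "<" and ">" word-boundary markers follow the convention used for fastText subwords; the library itself additionally hashes n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, minn, maxn):
    """Return character n-grams of length minn..maxn, with boundary markers."""
    marked = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(word_ngrams(["I", "like", "green", "tea"], 2))
print(char_ngrams("where", 3, 3))
```

With word n-grams the classifier sees some local word order; with character n-grams, rare or unseen words can still share pieces with known words.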

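These ideas serve two different tasks:

Efficient text classification: supervised learning
Learning word vector representations: unsupervised learning

For example, FastText can learn that "boy", "girl", "man" and "woman" refer to specific genders, and store the learned values (the word vectors) in an associated file. Then, when a program handles a user request (say, "Where is my girlfriend right now?"), it can look up the vectors FastText produced and recognize that the user is asking about a woman.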

Theory: to be written.

Data processing

The data is a set of 100k Weibo comments, each labeled as positive or negative sentiment.

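Each comment is segmented with jieba and stripped of stop words. The label and the sentence are then joined into one line, with the label carrying the "__label__" prefix.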

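import jieba

# data_all: pandas DataFrame with "review" and "label" columns; stopwords: a set of stop words
samples = []
for index, row in data_all.iterrows():
    # Segment with jieba, then drop stop words and whitespace-only tokens
    words = [w for w in jieba.cut(row["review"]) if w.strip() and w not in stopwords]
    # One sample per line: "__label__<label> token token ..."
    samples.append("__label__" + str(row["label"]) + " " + " ".join(words))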

Save the processed samples to a file.
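
A minimal sketch of this step, assuming samples is the list of strings built in the loop above. The function names split_samples and save_samples, the 10% test ratio, and the fixed seed are illustrative assumptions, not taken from the original post.

```python
import random


def split_samples(samples, test_ratio=0.1, seed=42):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def save_samples(samples, path):
    """Write one sample per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for line in samples:
            # A sample must stay on a single line, so replace embedded newlines
            f.write(line.replace("\n", " ") + "\n")
```

For example, train_samples, test_samples = split_samples(samples), followed by save_samples(train_samples, train_file_name) and save_samples(test_samples, test_file_name).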

Training

import fasttext
classifier = fasttext.train_supervised(train_file_name, lr=0.2, dim=100, epoch=50, wordNgrams=4, loss='softmax')

Parameters

arg_names = ['input', 'lr', 'dim', 'ws', 'epoch', 'minCount',
             'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams', 'loss', 'bucket',
             'thread', 'lrUpdateRate', 't', 'label', 'verbose', 'pretrainedVectors',
             'seed', 'autotuneValidationFile', 'autotuneMetric',
             'autotunePredictions', 'autotuneDuration', 'autotuneModelSize']

The values in brackets are the general defaults printed by the command-line help. Supervised training overrides some of them: lr 0.1, minCount 1, minn 0, maxn 0 and loss softmax. The thread default depends on the machine.

-input: training file path
-lr: learning rate [0.05]
-dim: size of word vectors [100]
-ws: size of the context window [5]
-epoch: number of epochs [5]
-minCount: minimal number of word occurrences [5]
-minCountLabel: minimal number of label occurrences [0]
-minn: min length of char ngram [3]
-maxn: max length of char ngram [6]
-neg: number of negatives sampled [5]
-wordNgrams: max length of word ngram [1]
-loss: loss function {ns, hs, softmax} [ns]
-bucket: number of buckets [2000000]
-thread: number of threads [12]
-lrUpdateRate: change the rate of updates for the learning rate [100]
-t: sampling threshold [0.0001]
-label: labels prefix [__label__]
-verbose: verbosity level [2]
-pretrainedVectors: pretrained word vectors for supervised learning []
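
One way to keep the training settings readable is to collect them in a dict and check the names before training, so a misspelled parameter fails early. This is only a sketch: check_params is a hypothetical helper, not part of fastText, and it may be stricter than the library itself, since it accepts only the exact names listed in arg_names.

```python
ARG_NAMES = {
    "input", "lr", "dim", "ws", "epoch", "minCount", "minCountLabel",
    "minn", "maxn", "neg", "wordNgrams", "loss", "bucket", "thread",
    "lrUpdateRate", "t", "label", "verbose", "pretrainedVectors",
    "seed", "autotuneValidationFile", "autotuneMetric",
    "autotunePredictions", "autotuneDuration", "autotuneModelSize",
}


def check_params(params):
    """Raise ValueError if any key is not a known fastText argument name."""
    unknown = sorted(set(params) - ARG_NAMES)
    if unknown:
        raise ValueError("unknown fastText parameters: " + ", ".join(unknown))
    return params


params = check_params(
    {"lr": 0.2, "dim": 100, "epoch": 50, "wordNgrams": 4, "loss": "softmax"}
)
# Then, with fasttext installed and train_file_name defined:
# classifier = fasttext.train_supervised(input=train_file_name, **params)
```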

Saving the model

classifier.save_model("model_file.bin")

Evaluating on the test set

result = classifier.test(test_file_name)
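
In the fasttext Python bindings, test returns a tuple of (number of samples, precision at k, recall at k), with k = 1 by default. The sketch below turns that tuple into a readable summary and adds an F1 score. summarize_test_result is a hypothetical helper name, and the numbers in the last line are made up for illustration; they are not results from this dataset.

```python
def summarize_test_result(result):
    """Convert a (n, precision, recall) tuple into a dict that also holds F1."""
    n, precision, recall = result
    denom = precision + recall
    # Guard against division by zero when both scores are 0
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"samples": n, "precision": precision, "recall": recall, "f1": f1}


# Illustrative values only; in practice pass classifier.test(test_file_name).
print(summarize_test_result((1000, 0.9, 0.9)))
```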

Prediction

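# Input must be segmented and stop-word filtered the same way as the training data
texts = ['李韬 之流 年 似锦 上班 滴人 偶 充满 羡慕 嫉妒 恨 偶在 苦哈哈 滴结 一月份 滴账 做 财务 滴人 艰辛 泪   李韬 之流 年 似锦 回复 贵州 媳妇 啥意思']
labels = classifier.predict(texts, k=2)  # k=2: return the two most likely labels
print(labels)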
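
When predict is called with a list of texts, it returns a pair: a list of label lists and a parallel list of probability arrays, one entry per input text. Below is a minimal sketch for turning that pair into (label, probability) tuples with the "__label__" prefix removed. format_predictions is a hypothetical helper, and the input in the last lines is made up to show the structure only.

```python
def format_predictions(prediction, prefix="__label__"):
    """Pair each predicted label (prefix stripped) with its probability."""
    all_labels, all_probs = prediction
    formatted = []
    for labels, probs in zip(all_labels, all_probs):
        pairs = []
        for label, prob in zip(labels, probs):
            name = label[len(prefix):] if label.startswith(prefix) else label
            pairs.append((name, float(prob)))
        formatted.append(pairs)
    return formatted


# Illustrative structure only; in practice pass classifier.predict(texts, k=2).
fake_prediction = ([["__label__1", "__label__0"]], [[0.75, 0.25]])
print(format_predictions(fake_prediction))
```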

Posted on 2022-02-08 11:39 by bubbleeee
