Neural Network SMS Text Classifier

https://www.freecodecamp.org/learn/machine-learning-with-python/machine-learning-with-python-projects/neural-network-sms-text-classifier

In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

You can access the full project instructions and starter code on Google Colaboratory.

 

Reference

https://www.kaggle.com/akhatova/sms-spam-classification-by-keras#3.-Keras-Model

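There are two approaches to this problem:

(1) Word-vector features + a regression model

(2) Sequence input + word embedding + a Keras model / CNN model

In my tests, word-vector features turned out to be the better fit for spam detection, so the final solution uses word-vector features + a Keras DNN model.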

 

Data

https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

The table below lists the provided dataset in different file formats, the amount of samples in each class and the total number of samples.

Application   File format   # Spam   # Ham   Total   Link
General       Plain text    747      4,827   5,574   Link 1
Weka          ARFF          747      4,827   5,574   Link 2

 

The collection consists of a single file, where each line holds the correct class (ham or spam) followed by the raw message.


ham   What you doing?how are you?
ham   Ok lar... Joking wif u oni...
ham   dun say so early hor... U c already then say...
ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham   Siva is in hostel aha:-.
ham   Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam  FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam  Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam  URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
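
A minimal sketch of reading this format with the standard library only. It assumes the label and the message are separated by a single tab, which the sample above does not show explicitly. The helper name parse_line and the inline SAMPLE string are illustrative, not part of the dataset or the project starter code.

```python
from collections import Counter

# A few lines copied from the sample above; the tab delimiter is an assumption.
SAMPLE = (
    "ham\tWhat you doing?how are you?\n"
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n"
)


def parse_line(line):
    """Split one raw line into (label, message) on the first tab only."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


# Skip blank lines, then count how many messages fall in each class.
records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

Splitting on the first tab only keeps any later tab characters inside the message text.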

 

Word-vector feature extraction - TfidfVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names_out().tolist())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
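
As a rough sketch of approach (1), the TF-IDF matrix can be fed straight into a linear classifier. The choice of LogisticRegression as the "regression model", the four toy messages, and the 0 = ham / 1 = spam encoding are all assumptions made for illustration. This is not the model the post finally uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy messages for illustration only; 0 = ham, 1 = spam (assumed encoding).
texts = [
    "Ok lar... Joking wif u oni...",
    "What you doing?how are you?",
    "URGENT! Your Mobile No was awarded a Bonus Caller Prize",
    "Sunshine Quiz! Win a super Sony DVD recorder",
]
labels = [0, 0, 1, 1]

# Learn the vocabulary and IDF weights on the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, labels)

# New messages must go through transform(), not fit_transform(),
# so they are encoded with the vocabulary learned above.
X_new = vectorizer.transform(["Win a Prize now"])
pred = clf.predict(X_new)
print(X_train.shape, pred)
```

Words that never appeared in the training texts are simply dropped by transform(), so the feature width stays fixed.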

 

Alternatively, use the TensorFlow preprocessing layers

https://www.tensorflow.org/guide/keras/preprocessing_layers#encoding_text_as_a_dense_matrix_of_ngrams_with_tf-idf_weighting
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf_idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams).
# Older TF releases spelled the mode "tf-idf" and exposed the layer under
# layers.experimental.preprocessing.
text_vectorizer = layers.TextVectorization(output_mode="tf_idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)

print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
    "\n",
)

# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)

 

 

Class imbalance

https://keras.io/examples/structured_data/imbalanced_classification/

 

Class-weight method

Compute the class weights: the class with fewer samples gets the higher weight.

counts = np.bincount(train_targets[:, 0])
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
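
Applied to the SMS dataset, the same idea might look like the sketch below. The counts come from the dataset table earlier in this post. The mapping 0 = ham / 1 = spam is an assumption, and the total / 2 rescaling is borrowed from the TensorFlow imbalanced-data tutorial linked in the metrics section, not from the Keras example above.

```python
import numpy as np

# Class counts from the dataset table: index 0 = ham, index 1 = spam (assumed).
counts = np.array([4827, 747])
total = counts.sum()

# Inverse-frequency weights, the same formula as in the snippet above.
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

# Optional rescaling so the weighted loss keeps a similar overall magnitude.
scale = total / 2.0
class_weight = {0: weight_for_0 * scale, 1: weight_for_1 * scale}

print(class_weight)
print("spam/ham weight ratio:", class_weight[1] / class_weight[0])
```

The rescaling does not change the ratio between the two weights. Each spam message still counts about 6.5 times as much as a ham message.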

 

Pass the class weights to the training call

metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)

 

Choosing metrics

https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#train_the_model

With imbalanced classes, accuracy alone is not enough. A model can reach high accuracy by learning only the majority class while ignoring the minority class.

METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
      keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
]
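
To see why accuracy alone misleads here, consider a degenerate classifier that labels every message as ham. The sketch below uses the class counts from the dataset table. Treating spam as the positive class is an assumption.

```python
# Class counts from the dataset table earlier in this post.
n_ham, n_spam = 4827, 747

# A degenerate "classifier" that labels every message as ham (negative).
tp = 0        # spam correctly flagged
fn = n_spam   # spam missed
tn = n_ham    # ham correctly passed
fp = 0        # ham wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive; report 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Accuracy comes out at roughly 87% even though not a single spam message is caught, which is why recall, precision, and the PR curve belong in the metric list.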

 

Oversampling

https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversample_the_minority_class

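For the minority class, draw samples with replacement until it has as many samples as the majority class.

In my opinion, this only fixes the count problem. The data quality does not improve and the lack of diversity remains, which ultimately limits how well the model generalizes on the minority class.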

Using NumPy

You can balance the dataset manually by choosing the right number of random indices from the positive examples:

ids = np.arange(len(pos_features))
choices = np.random.choice(ids, len(neg_features))

res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]

res_pos_features.shape
(181966, 29)
resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

order = np.arange(len(resampled_labels))
np.random.shuffle(order)
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]

resampled_features.shape
(363932, 29)
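
The diversity concern raised above can be checked directly. This sketch uses small random arrays as hypothetical stand-ins for pos_features and neg_features. The sizes (5 and 50 rows) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 minority (positive) rows, 50 majority (negative) rows.
pos_features = rng.normal(size=(5, 3))
neg_features = rng.normal(size=(50, 3))

# Sample minority indices with replacement until the counts match.
choices = rng.choice(len(pos_features), size=len(neg_features), replace=True)
res_pos_features = pos_features[choices]

# The row count now matches the majority class ...
print(res_pos_features.shape)

# ... but the number of distinct rows cannot exceed the original 5.
n_distinct = len(np.unique(res_pos_features, axis=0))
print(n_distinct)
```

The resampled array is larger, yet every row is a copy of one of the original minority samples.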

 

 

TSV

https://en.wikipedia.org/wiki/Tab-separated_values

A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe

Use pandas.read_table(filepath). The default separator is tab.
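
A short sketch of that call on data shaped like the SMS collection. The inline string stands in for a real file path, and the column names label, message, and target are hypothetical.

```python
import io

import pandas as pd

# Inline stand-in for a TSV file; a real call would pass a file path instead.
tsv_text = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tSunshine Quiz! Win a super Sony DVD recorder\n"
)

# header=None because the data has no header row; names= supplies column labels.
df = pd.read_table(io.StringIO(tsv_text), header=None, names=["label", "message"])

# Map the text labels to integers for training (0 = ham, 1 = spam, assumed).
df["target"] = df["label"].map({"ham": 0, "spam": 1})
print(df)
```

Real SMS text can contain double-quote characters, so passing quoting=csv.QUOTE_NONE may be needed to stop the parser from treating them as field quotes.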

 

posted @ 2021-05-12 12:18  lightsong