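[Hands-on] RNN: Text Sentiment Classification

Text Sentiment Classification: Lab Notes

This lab is HW4 of Prof. Hung-yi Lee's Machine Learning course (National Taiwan University, 2020). [Assignment description] [Official reference implementation] [Implementation code]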

Data

The data for this lab are tweets from Twitter. Each tweet is labeled as positive or negative: 0 --> negative, 1 --> positive.

The dataset consists of:

Labeled training data: 200,000 samples

1 +++$+++ are wtf ... awww thanks !
1 +++$+++ leavingg to wait for kaysie to arrive myspacin itt for now ilmmthek .!
0 +++$+++ i wish i could go and see duffy when she comes to mamaia romania .
1 +++$+++ i know eep ! i can ' t wait for one more day ....
0 +++$+++ so scared and feeling sick . fuck ! hope someone at hr help ... wish it would be wendita or karen .
0 +++$+++ my b day was thurs . i wanted 2 do 5 this weekend for my b day but i guess close enough next weekend . going alone
1 +++$+++ e3 is in the trending topics only just noticed ive been tweeting on my iphone until now
1 +++$+++ where did you get him from i know someone who would love that !
0 +++$+++ dam just got buzzed by another huge fly ! this time it landed on my head ... not impressed
1 +++$+++ tomorrowwwwwwwww !!! you ' ll love tomorrow ' s news !
0 +++$+++ gonna try 2 sleep . damn garageband next to me won ' t let me tho
0 +++$+++ wish weekend .. but not really also .. cuz next monday is exam and i haven ' t studied at all yet hate exam .. grr
1 +++$+++ check this vid out .... you ' ll piss yourself laughin
0 +++$+++ damn you gavin !!!!!! i want my computer back !!!!
1 +++$+++ it ' s great that you feel better , fresh air is nice im sure it will help too
0 +++$+++ got a bloody wheel clamp yesterday ï ¿ ½150 for 15 mins parking
0 +++$+++ homework and summer school . we ' ll go soon though !
0 +++$+++ no it ' s not right ..... it is so wrong ... i would never have expected it
1 +++$+++ says sa mga mag gf bf na nag aaway make piece not war
1 +++$+++ only has under 200 words left to write on her assignment
0 +++$+++ son graduated 5th grade today hes so grown !
data sample

Unlabeled training data: 1,200,000 samples, used for semi-supervised learning

mkhang mlbo . dami niang followers ee . di q rin naman sia masisisi . desperate n kng desperate , pero dpt tlga replyn nia q = d
don ' t you hate it when you hang on to a seemingly interesting movie to see the ending only to find out that the ending sucks ?
ok so never went to the movies because friend wasn ' t feeling well but next weekend . back to work today , wasn ' t too bad .
can ' t wait to see diversity ' s performance !
i love britney spears haha joey this is what u do go party with eric or do things haha
wish i could call in but i can ' t do blogtalk from work
1 more day !
nursing celeste with a tummy ache .
hates being this burnt !! ouch
just couldn ' t sleep last night . working 7a 3p , than dinner with megan . happy bday jl !
i love slaves ! by david raccah , linkedin , rotfl
is being super organised and making up orders to post first thing tomorrow !
laying in the bed . it feels soooooo good . what a long day
finally , at the airport . currently chilling out at the citibank lounge . maaaan , the wi fi here doesn ' t work ! lameeee !
back and still feeling shattered . still no cockney ... i ' m ashamed to say .
so do i
don ' t ask me difficult questions , i know how to spell , but not ponder the bigger picture !
hey guys ! i am a big fan too just like my twin lol .. have a good day ! and wishin ya the best of luck ! xd
ay dios mio ! 2 weeks left of college !!! ah can ' t wait !!
oh , we must be related ! i ' ve heard that line before !
i know , i don ' t know if kayley knows . he ' ll probably be resting again tomorrow , i hope not he ' ll be better .
good luck
the app never works for me
ew , im not that clever , im just lucky what bother you at the class ? the lessons ?
whoah crap , that was a mistake ... do not put the three letters together in a tweet im el im .. just got overwhelmed with follow bots .
problem with feedburner again . showing no . of feed readers less than actual ones .
im having problems don ' t worry
listening too mgmt time to pretend
data sample

Testing data: 200,000 samples (100,000 public, 100,000 private)

id,text
0,my dog ate our dinner . no , seriously ... he ate it .
1,omg last day sooon n of primary noooooo x im gona be swimming out of school wif the amount of tears am gona cry
2,stupid boys .. they ' re so .. stupid !
3,hi ! do u know if the nurburgring is open for tourists today ? we want to go , but there is an event today
4,having lunch in the office , and thinking of how to resolve this discount form issue
5,shopping was fun
6,wondering where all the nice weather has gone .
7,morning ! yeeessssssss new mimi in aug
8,umm ... maybe that ' s how the british spell it ?
9,yes it ' s 3 : 50 am . yes i ' m still awake . yes i can ' t sleep . yes i ' ll regret it tomorrow . haha i love you mr saturday
10,cute heart shaped portal cube . my baby is playing games , im reading fan fictions !
11,had a song on mtv movie awards !!!!!
12,thanks nite
13,did not start her religion isu i will fail
14,that sounds wonderful !! i shall have to try it one day soon !
15,i love ya mariah , i love listening to your songs , your such an inspiration for alot of people out there !!!
16,there is sooo much love on here that i could faint ! lol . go celtics !! i miss my b ball team . i ' m proud of you donnie !
17,just found out i ' m gonna be let out early tomorrow , cos we ' re getting the results . omg if i fail science ...
18,that was a good thing to wake up to your right we will , and thats why god made us friends !!! ily
19,and old cam ' pic of tene and i . goodtimes . heehe . i want my cake now , mum
20,ooh my god ! i know the feeling i cannot stand getting into london from harold wood
21,nothing ! just kept us there for 20 minutes until they realized a walkie talkie is just a little toy and not a spy tool
22,6flags today teexxxt i need to shower but i ' m being lazy . i really don ' t feel that good
23,apparently , these are from filming , not the aftermath of the skanky hoebag fans . celebrity sites twisted the truth
24,hey fairuz ili ! nice to see some friends here
25,also cancelled my nikon 50mm lens order needed to buy some struts and tires for my car ...
26,headed to dallas tomorrow ... need some sleep but i am not tired yet !!
27,i just found out that i won a shirt from pretty effin sweet , eh ? i wonder what i ' ll get
28,sad i didnt get tickets 2 nin ja in albuquerque and it sold out
29,has had the most enjoyable day she ' s had for a lonng time
30,this should do the trick
31,o i have 21 tests i do 10 subjects lucky ... n o right ... kl is it hard ??
32,lol ! i thought so ! have fun in vegas .
33,sarah vowell ? if your dad likes humor with his history
34,i like corpus
data sample
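
The three file formats can be parsed with a few lines of plain Python. This is a minimal sketch, not the official code. It assumes that the labeled file separates label and text with ` +++$+++ `, that the test file starts with an `id,text` header, and that tokens are already separated by spaces, as in the samples. The function names are my own.

```python
SEP = " +++$+++ "  # separator between label and text in the labeled file


def parse_labeled(lines):
    """Turn 'label +++$+++ text' lines into token lists and int labels."""
    sentences, labels = [], []
    for line in lines:
        label, text = line.rstrip("\n").split(SEP, 1)
        sentences.append(text.split())
        labels.append(int(label))  # "0"/"1" -> 0/1
    return sentences, labels


def parse_unlabeled(lines):
    """Each line of the unlabeled file is just a tweet."""
    return [line.split() for line in lines]


def parse_test(lines):
    """Skip the 'id,text' header and split each row at the FIRST comma only,
    because the tweet text itself may contain commas."""
    ids, sentences = [], []
    for line in lines[1:]:
        idx, text = line.rstrip("\n").split(",", 1)
        ids.append(int(idx))
        sentences.append(text.split())
    return ids, sentences
```

In practice `lines` would come from `open(path, encoding="utf-8").readlines()` for each of the three files.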

 

Implementation Steps

1. Data preprocessing

1.1) Read the data: train_label_data, train_no_label_data, and test_data. Feed all of it to a word2vec model (gensim) and train it to obtain w2v_all.model.
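
A rough sketch of this step follows. The hyperparameters (`vector_size=250`, `window=5`, `min_count=5`, `epochs=10`, skip-gram) are assumptions for illustration, not values stated in these notes, and the parameter names are those of gensim 4.x (older releases call them `size` and `iter`). `build_corpus` and `train_word2vec` are hypothetical helper names.

```python
def build_corpus(*sources):
    """Concatenate several lists of tokenized sentences into one corpus."""
    corpus = []
    for sentences in sources:
        corpus.extend(sentences)
    return corpus


def train_word2vec(corpus, vector_size=250, path="w2v_all.model"):
    """Train a skip-gram word2vec model on the corpus and save it to disk."""
    # Imported here so build_corpus stays usable without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        corpus,
        vector_size=vector_size,  # dimension of each word vector (assumed value)
        window=5,
        min_count=5,              # ignore words seen fewer than 5 times
        workers=4,
        epochs=10,
        sg=1,                     # 1 = skip-gram, 0 = CBOW
    )
    model.save(path)
    return model
```

Usage would look like `train_word2vec(build_corpus(train_x, train_x_no_label, test_x))`, where the three arguments are the tokenized sentence lists read from the three files.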

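1.2) Read the labeled training data train_label --> input, and convert the sentences in input into word-embedding form --> train_x:

  • From input, build the embedding matrix and the dictionaries mapping between word and idx. Remember to add <PAD> and <UNK>.
  • Convert every word of every sentence in input to its idx, and store the result in train_x, one sentence per entry, as the model input. Any word that has not been seen before is mapped to <UNK>.
  • Using the hyperparameter sen_len, truncate or pad every sentence (padding fills each missing position with <PAD>) so that all sentences have the same length.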
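
The three bullets above can be sketched in plain Python as follows. Assumptions: `word_vectors` is a hypothetical dict mapping each word to its vector (for example, collected from the trained word2vec model), and the vectors for `<PAD>` and `<UNK>` are drawn uniformly at random, which is one common choice rather than the only one.

```python
import random

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(word_vectors, seed=0):
    """Return word2idx, idx2word and the embedding matrix (list of rows).

    word_vectors: dict mapping word -> list of floats (assumed input format).
    """
    rng = random.Random(seed)
    word2idx, idx2word, matrix = {}, [], []
    for word, vec in word_vectors.items():
        word2idx[word] = len(idx2word)
        idx2word.append(word)
        matrix.append(list(vec))
    dim = len(matrix[0])
    # Append the two special tokens with random vectors of the same dimension.
    for token in (PAD, UNK):
        word2idx[token] = len(idx2word)
        idx2word.append(token)
        matrix.append([rng.uniform(-1, 1) for _ in range(dim)])
    return word2idx, idx2word, matrix


def encode(sentence, word2idx, sen_len):
    """Map tokens to indices, truncate to sen_len, then pad with <PAD>."""
    ids = [word2idx.get(w, word2idx[UNK]) for w in sentence[:sen_len]]
    ids += [word2idx[PAD]] * (sen_len - len(ids))
    return ids
```

`train_x` is then `[encode(s, word2idx, sen_len) for s in sentences]`, and `torch.tensor(matrix)` gives the embedding matrix that the model expects.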

1.3) Convert the labels from str to int --> y

2. Preparing the data

2.1) Split train_x and y into a training set and a validation set: X_train, X_val, y_train, y_val
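
One possible way to do the split, using only the standard library. The 10% validation ratio and the shuffling before the split are assumptions; simply slicing off the last part of the data would work as well.

```python
import random


def train_val_split(x, y, val_ratio=0.1, seed=0):
    """Shuffle indices once, then cut them into a validation and a training part."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(x) * val_ratio)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(x, train_idx), pick(x, val_idx), pick(y, train_idx), pick(y, val_idx)
```

Called as `X_train, X_val, y_train, y_val = train_val_split(train_x, y)`, it keeps each sentence paired with its own label.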

2.2) Build a Dataset and a DataLoader for train and val, so that shuffling and feeding batches to the model are convenient
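
A minimal Dataset/DataLoader sketch. The class name `TwitterDataset` and the toy data are illustrative assumptions. Labels are stored as floats because a sigmoid output is usually trained with `nn.BCELoss`, which expects float targets.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TwitterDataset(Dataset):
    """Wraps index sequences (and optionally labels) as tensors."""

    def __init__(self, x, y=None):
        self.x = torch.as_tensor(x, dtype=torch.long)  # word indices
        self.y = None if y is None else torch.as_tensor(y, dtype=torch.float)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        if self.y is None:  # test data has no labels
            return self.x[i]
        return self.x[i], self.y[i]


# Toy data: 4 sentences already converted to indices and padded to length 3.
train_set = TwitterDataset([[0, 1, 2], [1, 2, 2], [0, 0, 1], [2, 1, 0]], [1, 0, 1, 0])
train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
```

The validation loader is built the same way, typically with `shuffle=False`.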

3. Preparing the RNN model

3.1) Create an LSTM_Net model

# model.py
import torch
from torch import nn


class LSTM_Net(nn.Module):
    # `embedding` is the pretrained embedding matrix, shape (vocab_size, embedding_dim)
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super().__init__()
        # Embedding layer initialized from the pretrained matrix
        self.embedding = nn.Embedding(embedding.size(0), embedding.size(1))
        self.embedding.weight = nn.Parameter(embedding)
        # Freeze the embedding when fix_embedding is True; otherwise it is fine-tuned during training
        self.embedding.weight.requires_grad = not fix_embedding
        self.embedding_dim = embedding.size(1)

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(nn.Dropout(dropout),
                                        nn.Linear(hidden_dim, 1),
                                        nn.Sigmoid())

    def forward(self, inputs):
        inputs = self.embedding(inputs)   # (batch, seq_len, embedding_dim)
        x, _ = self.lstm(inputs, None)    # None: zero initial hidden and cell states
        # x has shape (batch, seq_len, hidden_dim): the top LSTM layer's output at every time step
        # Keep only the output at the last time step
        x = x[:, -1, :]
        x = self.classifier(x)            # (batch, 1), sigmoid score in (0, 1)
        return x

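4. Model training

4.1) Train in model.train() mode and validate in model.eval() mode. The procedure is similar to the earlier image-classification CNN experiment.

4.2) After each epoch's validation, save the model if its accuracy beats best_acc, so that once all epochs are done the saved model is the one with the best validation accuracy.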
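
The training procedure can be sketched as below. This is an illustration under assumptions, not the official loop: it uses `nn.BCELoss` with Adam, a 0.5 decision threshold, and a hypothetical checkpoint name `ckpt.model`, and it works with any model that returns sigmoid scores of shape `(batch, 1)`.

```python
import torch
from torch import nn


def run_epoch(model, loader, criterion, optimizer=None):
    """One pass over loader. Trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    if training:
        model.train()   # enables dropout
    else:
        model.eval()    # disables dropout
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for inputs, labels in loader:
            probs = model(inputs).squeeze(-1)        # (batch, 1) -> (batch,)
            loss = criterion(probs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * len(labels)
            correct += ((probs >= 0.5).float() == labels).sum().item()
            seen += len(labels)
    return total_loss / seen, correct / seen


def fit(model, train_loader, val_loader, epochs=5, lr=1e-3, ckpt_path="ckpt.model"):
    """Train for several epochs and keep the weights with the best validation accuracy."""
    criterion = nn.BCELoss()
    # Only optimize parameters that are not frozen (e.g. a fixed embedding).
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    best_acc = 0.0
    for epoch in range(epochs):
        train_loss, train_acc = run_epoch(model, train_loader, criterion, optimizer)
        val_loss, val_acc = run_epoch(model, val_loader, criterion)
        print(f"epoch {epoch + 1}: train acc {train_acc:.3f}, val acc {val_acc:.3f}")
        if val_acc > best_acc:   # save only when validation accuracy improves
            best_acc = val_acc
            torch.save(model.state_dict(), ckpt_path)
    return best_acc
```

Saving inside the loop, rather than once at the end, is what makes the final checkpoint the best epoch instead of merely the last one.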

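5. Predicting on the test data

5.1) Read the test data, and remember to apply the same embedding preprocessing as for the training data (word to idx, truncation, and padding)

5.2) Feed the processed test data to the model, obtain the predictions, and save them to a csv file.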
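
The last part of this step, turning the model's sigmoid scores into 0/1 labels and writing them out, could look like this. The 0.5 threshold and the `id,label` column names are assumptions; check the submission format before relying on them.

```python
import csv


def to_labels(probs, threshold=0.5):
    """Scores at or above the threshold become 1 (positive), the rest 0 (negative)."""
    return [1 if p >= threshold else 0 for p in probs]


def write_predictions(ids, labels, path):
    """Write one 'id,label' row per test sample, with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        writer.writerows(zip(ids, labels))
```

Here `probs` would be the scores collected from the model in `model.eval()` mode under `torch.no_grad()`, flattened into a plain list.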

 

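Supplement: semi-supervised learning

  This makes use of the unlabeled data. Here we use a method that is relatively easy to implement: Self-Training.

  Self-Training: use the trained model to predict on the unlabeled data, turn those predictions into labels for that data, and add the newly labeled data to the training set. You can adjust the threshold, or sample several times, to keep only the data the model is fairly confident about.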
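
The selection step of Self-Training can be sketched in plain Python. The 0.8 threshold is an assumed value to tune, and `select_confident` is a hypothetical helper name. `probs` stands for the trained model's sigmoid scores on the unlabeled sentences.

```python
def select_confident(sentences, probs, threshold=0.8):
    """Keep only samples the model is confident about and give them pseudo-labels.

    score >= threshold      -> pseudo-label 1 (positive)
    score <= 1 - threshold  -> pseudo-label 0 (negative)
    anything in between is discarded.
    """
    new_x, new_y = [], []
    for sentence, p in zip(sentences, probs):
        if p >= threshold:
            new_x.append(sentence)
            new_y.append(1)
        elif p <= 1 - threshold:
            new_x.append(sentence)
            new_y.append(0)
    return new_x, new_y
```

The returned pairs are appended to the labeled training set and the model is trained again. Raising the threshold adds fewer but cleaner pseudo-labels.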

 

 

posted @ 2020-06-29 15:31  YeZzz