Stanford Dependency Parser Summary
Reference: https://stanfordnlp.github.io/stanfordnlp/models.html
Paper: http://arxiv.org/abs/1901.10457
Model Overview
Given input text (one or more sentences), the pipeline:
① splits it into sentences and tokenizes them
② assigns each token a universal POS tag (UPOS), a treebank-specific POS tag (XPOS), and morphological features
③ recovers each token's lemma from the (token, UPOS) pair
④ predicts each token's governor (syntactic head) and the dependency relation to that head
Tokenize
1. Example
"This is a test sentence for stanfordnlp. This is another sentence."
====== Sentence 1 tokens =======
index: 1 token: This
index: 2 token: is
index: 3 token: a
index: 4 token: test
index: 5 token: sentence
index: 6 token: for
index: 7 token: stanfordnlp
index: 8 token: .
====== Sentence 2 tokens =======
index: 1 token: This
index: 2 token: is
index: 3 token: another
index: 4 token: sentence
index: 5 token: .
2. Paper reference
3. Implementation details
For each unit (character), the model decides whether it is an end of token (EOT) or end of sentence (EOS).
First layer:
BiLSTM:
input: [char_embedding, feats]
char_embedding: nn.Embedding over the char-vocab indices of the (padded) sentence
feats: a 4-dim feature vector per char (marks capitalization and whitespace)
Conv1d:
input: same as the BiLSTM
tok0 = Linear1(LSTM_out + Conv_out); sent0 = Linear2(LSTM_out + Conv_out)
mwt0 = Linear3(LSTM_out + Conv_out)
Second layer:
BiLSTM':
input: (LSTM_out + Conv_out) * (1 - sigmoid(-tok0)), i.e. the first-layer features gated by sigmoid(tok0)
tok0 += Linear1'(BiLSTM'_out); sent0 += Linear2'(BiLSTM'_out); mwt0 += Linear3'(BiLSTM'_out)
Model output
After argmax, every unit maps to a label in 0~4; the labels that appear below are:
0: inside a token
1: end of token
2: end of sentence
chars:  B a r a c k _ O b a m a _ w a s _ b o r n _ i n _ H a w a i i .
labels: 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 2
(_ marks a space)
POS
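The decoding step implied by this label scheme can be sketched in a few lines of plain Python (an illustration, not the library's actual code; it assumes whitespace only ever separates tokens):

```python
def decode_labels(text, labels):
    """Turn per-character labels (0 = inside token, 1 = end of token,
    2 = end of sentence, which also ends the current token) back into
    tokens grouped by sentence. Whitespace characters are dropped."""
    sentences, tokens, buf = [], [], ""
    for ch, lab in zip(text, labels):
        if not ch.isspace():
            buf += ch
        if lab in (1, 2):
            if buf:
                tokens.append(buf)
                buf = ""
            if lab == 2:
                sentences.append(tokens)
                tokens = []
    if buf:                      # flush any unterminated token/sentence
        tokens.append(buf)
    if tokens:
        sentences.append(tokens)
    return sentences
```

Running it on the example above recovers one sentence of seven tokens.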
1. Example
"Barack Obama was born in Hawaii."
word: Barack   upos: PROPN   xpos: NNP
word: Obama    upos: PROPN   xpos: NNP
word: was      upos: AUX     xpos: VBD
word: born     upos: VERB    xpos: VBN
word: in       upos: ADP     xpos: IN
word: Hawaii   upos: PROPN   xpos: NNP
word: .        upos: PUNCT   xpos: .
UPOS: Universal POS tags
ADJ: adjective    ADP: adposition    ADV: adverb    AUX: auxiliary
CCONJ: coordinating conjunction    DET: determiner    INTJ: interjection
NOUN: noun    NUM: numeral    PART: particle    PRON: pronoun    PROPN: proper noun
PUNCT: punctuation    SCONJ: subordinating conjunction    SYM: symbol    VERB: verb    X: other
XPOS:treebank-specific POS tags
NN: noun, common, singular    IN: preposition or subordinating conjunction    DT: determiner
NNP: noun, proper, singular    PRP: pronoun, personal    JJ: adjective or ordinal numeral
RB: adverb    .: sentence-final punctuation    VB: verb, base form    NNS: noun, common, plural
,: comma    CC: conjunction, coordinating    VBD: verb, past tense
VBP: verb, present tense, not 3rd person singular    VBZ: verb, present tense, 3rd person singular
CD: numeral, cardinal    VBN: verb, past participle    VBG: verb, present participle or gerund
MD: modal auxiliary    TO: "to" as preposition or infinitive marker    PRP$: pronoun, possessive
-LRB-/-RRB-: left/right round bracket    WDT: WH-determiner    WRB: WH-adverb
``/'': quotation marks    WP: WH-pronoun    RP: particle    UH: interjection
POS: genitive marker    HYPH: hyphen    JJR: adjective, comparative    NNPS: noun, proper, plural
JJS: adjective, superlative    EX: existential there    NFP: superfluous punctuation
GW: goes with    ADD: web or email address    RBR: adverb, comparative    $: currency symbol
PDT: predeterminer    RBS: adverb, superlative    SYM: symbol    LS: list item marker
FW: foreign word    AFX: affix    WP$: WH-pronoun, possessive    XX: unknown
2. Paper reference
UPOS
XPOS
3. Implementation details
BiLSTM
input: [word_embedding, char_embedding, pretrain_embedding]
char_embedding: computed by an LSTM with attention
① run nn.Embedding over the char indices and feed the result to the LSTM
② weight = sigmoid(Linear(lstm_output))
③ final_output = (weight * lstm_output).sum(1), where dim 1 is the char dimension within the token
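Steps ① to ③ can be sketched in plain Python (a toy version of the torch computation; `w` and `b` stand in for the Linear layer and carry illustrative values, not real weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attn_pool(hidden, w, b):
    """hidden: per-char LSTM outputs, shape [num_chars][dim].
    Returns a single token vector: sum_i sigmoid(w . h_i + b) * h_i."""
    out = [0.0] * len(hidden[0])
    for h in hidden:
        score = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)  # step ②
        for d in range(len(out)):                                  # step ③
            out[d] += score * h[d]
    return out
```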
Upos
upos_hid = ReLU(Linear1(BiLSTM_output))
upos_pred = Linear1'(upos_hid)
Xpos
xpos_hid = ReLU(Linear2(BiLSTM_output))
upos_emb = Embedding(argmax(upos_pred)), i.e. embed the index of the max value in the (len_upos_vocab*1) logits for each token
xpos_pred = BiLinear([upos_emb, 1], [xpos_hid, 1]) (the appended 1 acts as a bias term)
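The UPOS-to-XPOS conditioning can be illustrated with toy weights (plain lists instead of tensors; the bias-augmented inputs are omitted for brevity, and all numbers below are made up):

```python
def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def bilinear(u, W, x):
    """One score per class k: score_k = u^T W_k x."""
    return [sum(u[i] * Wk[i][j] * x[j]
                for i in range(len(u)) for j in range(len(x)))
            for Wk in W]

def predict_xpos(upos_logits, upos_emb_table, xpos_hid, W):
    upos_emb = upos_emb_table[argmax(upos_logits)]  # embed the hard UPOS choice
    return argmax(bilinear(upos_emb, W, xpos_hid))  # score XPOS classes against it
```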
Lemma
1. Example
"Barack Obama was born in Hawaii."
word: Barack   lemma: Barack
word: Obama    lemma: Obama
word: was      lemma: be
word: born     lemma: bear
word: in       lemma: in
word: Hawaii   lemma: Hawaii
word: .        lemma: .
2. Implementation details
Vocabulary filtering (map known)
Two lookup vocabs are involved: the (word, UPOS) vocab and the word-only vocab.
Using the tokens and UPOS tags from the previous steps, tokens missing from both vocabs are filtered out for the model;
tokens found in a vocab are mapped directly to their lemma via that vocab.
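A sketch of that lookup order, with made-up dictionary contents (the real vocabs are built from the treebank):

```python
# Hypothetical example data, not real treebank vocabs:
composite = {("was", "AUX"): "be"}   # (word, upos) -> lemma
words = {"born": "bear"}             # word -> lemma

def lookup_lemma(word, upos, composite_dict, word_dict):
    """Try the (word, upos) vocab first, then the word-only vocab;
    tokens found in neither are left to the seq2seq lemmatizer."""
    if (word, upos) in composite_dict:
        return composite_dict[(word, upos)]
    if word in word_dict:
        return word_dict[word]
    return None  # unknown: handled by the seq2seq model
```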
Seq2Seq
Outputs two tensors: char_level_predict and edit_logit (token_num * 3).
The edit dimension is 3 and post-processes char_level_predict: 0 means keep the seq2seq prediction, 1 means keep the original token, 2 means take the lowercased original token.
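The edit decision can be sketched directly:

```python
def apply_edit(edit, seq2seq_pred, token):
    """0 = keep the seq2seq prediction, 1 = keep the original token,
    2 = lowercase the original token (the scheme described above)."""
    if edit == 1:
        return token
    if edit == 2:
        return token.lower()
    return seq2seq_pred
```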
Encoder: LSTM
input: [char_embedding, upos_embedding]
output: En_out, (H_n, C_n)
edit_logit = Linear1'(ReLU(Linear1(H_n)))
Decoder (with beam search)
produces char_level_predict
Applying the edits to the predictions
char_level_predict is mapped back through char_vocab and merged into tokens; the edit rule above then selects the final lemma for each token.
Dependency Parse
For each token in the sentence, predict ① its syntactic head (governor) ② the dependency relation to that head
1. Example
"Barack Obama was born in Hawaii."
word: Barack   governor: born     deprel: nsubj:pass
word: Obama    governor: Barack   deprel: flat
word: was      governor: born     deprel: aux:pass
word: born     governor: root     deprel: root
word: in       governor: Hawaii   deprel: case
word: Hawaii   governor: born     deprel: obl
word: .        governor: born     deprel: punct
Deprel glossary
abbrev: abbreviation modifier
acomp: adjectival complement
advcl: adverbial clause modifier
advmod: adverbial modifier
agent: agent (typically introduced by "by")
amod: adjectival modifier
appos: appositional modifier
attr: attributive
aux: auxiliary (non-main verbs and auxiliaries such as BE, HAVE, SHOULD/COULD)
auxpass: passive auxiliary
cc: coordination (usually attached to the first conjunct)
ccomp: clausal complement
complm: complementizer (the word introducing a clausal complement)
conj: conjunct (links two coordinated words)
cop: copula (linking verbs such as be, seem, appear)
csubj: clausal subject
csubjpass: clausal passive subject
dep: dependent (unspecified dependency)
det: determiner (e.g. articles)
dobj: direct object
expl: expletive (mainly captures "there")
infmod: infinitival modifier
iobj: indirect object
mark: marker (e.g. that, whether, because, when)
mwe: multi-word expression
neg: negation modifier
nn: noun compound modifier
npadvmod: noun phrase as adverbial modifier
nsubj: nominal subject
nsubjpass: passive nominal subject
num: numeric modifier
number: element of compound number
partmod: participial modifier
pcomp: prepositional complement
pobj: object of a preposition
poss: possession modifier
possessive: possessive modifier (the relation between the possessor and 's)
preconj: preconjunct (e.g. either, both, neither)
predet: predeterminer
prep: prepositional modifier
prepc: prepositional clausal modifier
prt: phrasal verb particle
punct: punctuation
purpcl: purpose clause modifier
quantmod: quantifier phrase modifier
rcmod: relative clause modifier
ref: referent
rel: relative
root: root (the head of the whole sentence; everything hangs off it)
tmod: temporal modifier
xcomp: open clausal complement
xsubj: controlling subject
2. Paper reference
3. Implementation details
BiLSTM
input=[pretrain_emb,word_emb,lemma_emb,pos_emb,char_emb]
char_emb is produced by the charModel (the same char_embedding as in the POS tagger)
DeepBiaffine
Head and relation scores are computed by two separate DeepBiaffine modules.
For the head scorer, output_dim is 1, giving an output of shape (token_num*token_num*1).
For the relation scorer, output_dim is len(vocab['deprel']),
giving an output of shape (token_num*token_num*len(vocab['deprel'])).
DeepBiaffine takes two inputs, input1 and input2; for both scorers input1 = input2 = the BiLSTM output.
DeepBiaffine structure: BiLinear(Linear1(input1), Linear2(input2))
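A minimal deep-biaffine sketch with plain lists (toy weights; the real module adds bias columns and runs batched over torch tensors):

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def deep_biaffine(h_dep, h_head, W1, W2, U):
    """score = Linear1(h_dep)^T U Linear2(h_head): one score per edge."""
    d = matvec(W1, h_dep)    # dep-specific projection (Linear1)
    h = matvec(W2, h_head)   # head-specific projection (Linear2)
    return sum(d[i] * U[i][j] * h[j]
               for i in range(len(d)) for j in range(len(h)))
```

For the relation scorer the same structure is repeated once per deprel class, giving the (token_num*token_num*len(vocab['deprel'])) tensor.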
relation_pred (deprel)
relation_pred is simply the argmax of the DeepBiaffine output
head_pred (head)
head_pred additionally accounts for each token's position in the sentence
First build the head_offset matrix (token_num*token_num*1):
head_offset(i,j) = index[i] - index[j]
① the effect of linear order (heads before vs. after their dependents)
In a language where heads always follow their dependents, P(sgn(i − j) = 1|yij) would be extremely low, heavily penalizing rightward attachments.
head_pred += logSigmoid(Biaffine(lstm_out,lstm_out)*sgn(head_offset))
② the effect of the distance between tokens
Similarly, in a language where dependencies are always short, P(abs(i−j) ≫ 0|yij) would be extremely low, penalizing longer edges.
Predict the distance between i and j in the treebank; since dis(i,j) >= 1, add 1 to the prediction:
dis_pred = 1+log(1+exp(Biaffine(lstm_out,lstm_out)))
The actual distance between i and j in the sentence:
dis_tgt = abs(head_offset)
The gap between the two: dis_dif = dis_tgt - dis_pred
When dis_dif is near 0 there is no penalty (the edge length in the tree matches the distance in the sentence); large positive or negative dis_dif is penalized:
dis_pen = -log(dis_dif²/2+1)
head_pred += dis_pen
Finally, the head for each token is read off head_pred, and the corresponding entry of relation_pred gives the deprel.
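The distance terms above (dis_pred, dis_tgt, dis_pen) can be checked numerically with a scalar sketch (the model computes this for every (i, j) pair with tensors):

```python
import math

def distance_penalty(dist_score, i, j):
    """dist_score plays the role of Biaffine(lstm_out, lstm_out)(i, j)."""
    dis_pred = 1.0 + math.log1p(math.exp(dist_score))  # softplus + 1, always >= 1
    dis_tgt = abs(i - j)                               # actual distance in the sentence
    dis_dif = dis_tgt - dis_pred
    return -math.log(dis_dif ** 2 / 2.0 + 1.0)         # ~0 when lengths agree, negative otherwise
```

With a very negative score, dis_pred is close to 1, so an adjacent pair incurs almost no penalty while a long edge is penalized.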
