Stanford Dependency Parser Summary

Reference: https://stanfordnlp.github.io/stanfordnlp/models.html

Paper: http://arxiv.org/abs/1901.10457

Model Overview

Given input text (one or more sentences), the pipeline performs the following steps (a usage sketch follows the list):
① sentence splitting and tokenization
② tagging each token with a universal POS tag (UPOS), a treebank-specific POS tag (XPOS), and morphological features
③ recovering each word's lemma from its (token, UPOS) pair
④ predicting each token's governor (syntactic head) and dependency relation
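
A minimal usage sketch with the stanfordnlp Python package (0.x API; assumes the English models were fetched once with stanfordnlp.download('en')):

```python
import stanfordnlp

# Full pipeline: tokenize -> mwt -> pos -> lemma -> depparse
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma,depparse', lang='en')
doc = nlp("Barack Obama was born in Hawaii.")

for sentence in doc.sentences:
    for word in sentence.words:
        # sentence-internal index, surface form, UPOS/XPOS, lemma,
        # head index (0 = root) and dependency relation
        print(word.index, word.text, word.upos, word.xpos,
              word.lemma, word.governor, word.dependency_relation)
```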

Tokenize

1. Example output

"This is a test sentence for stanfordnlp. This is another sentence."
====== Sentence 1 tokens =======
index:   1        token: This
index:   2        token: is
index:   3        token: a
index:   4        token: test
index:   5        token: sentence
index:   6        token: for
index:   7        token: stanfordnlp
index:   8        token: .
====== Sentence 2 tokens =======
index:   1        token: This
index:   2        token: is
index:   3        token: another
index:   4        token: sentence
index:   5        token: .
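
The printout above can be reproduced with a tokenize-only pipeline (again assuming the 0.x API and downloaded models):

```python
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp("This is a test sentence for stanfordnlp. This is another sentence.")
for i, sentence in enumerate(doc.sentences, 1):
    print(f"====== Sentence {i} tokens =======")
    for token in sentence.tokens:
        print(f"index: {token.index}\ttoken: {token.text}")
```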

2. Paper reference

3. Implementation details

For each unit (character), the model predicts whether it is an EOT (end of token) or an EOS (end of sentence).
First layer:
BiLSTM:
input: [char_embedding, feats]
char_embedding: nn.Embedding lookup over the sentence's char-vocab indices (padded with <Pad>)
feats: a 4-dim feature vector per char (marking capital letters and whitespace)
Conv1d:
input: same as the BiLSTM
tok0 = Linear1(LSTM_out + Conv_out); sent0 = Linear2(LSTM_out + Conv_out); mwt0 = Linear3(LSTM_out + Conv_out)
Second layer:
BiLSTM':
input: (LSTM_out + Conv_out) * (1 - sigmoid(-tok0))
tok0 += Linear1'(BiLSTM'_out); sent0 += Linear2'(BiLSTM'_out); mwt0 += Linear3'(BiLSTM'_out)
Model output
After argmax, each unit receives a label in 0~4:
0: inside a token
1: end of token
2: end of sentence
(the remaining labels mark multi-word-token boundaries, matching the mwt head)
B a r a c k  O b a m a  w a s  b o r n  i n  H a w a i i . 
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 2 
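
A minimal PyTorch sketch of the two-layer tokenizer above; layer widths, the conv kernel size, and the single-logit heads are illustrative assumptions, not the released hyperparameters:

```python
import torch
import torch.nn as nn

class TokenizerSketch(nn.Module):
    """Two-layer tokenizer sketch: BiLSTM + Conv1d features, three linear
    heads per layer (tok / sent / mwt). Sizes are illustrative."""
    def __init__(self, n_chars, emb_dim=32, feat_dim=4, hid=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        in_dim = emb_dim + feat_dim
        self.lstm1 = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(in_dim, 2 * hid, kernel_size=5, padding=2)
        self.heads1 = nn.ModuleList([nn.Linear(2 * hid, 1) for _ in range(3)])
        self.lstm2 = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)
        self.heads2 = nn.ModuleList([nn.Linear(2 * hid, 1) for _ in range(3)])

    def forward(self, char_idx, feats):
        # char_idx: (B, T) char-vocab indices; feats: (B, T, 4) per-char features
        x = torch.cat([self.char_emb(char_idx), feats], dim=-1)
        lstm_out, _ = self.lstm1(x)
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h = lstm_out + conv_out                              # first-layer features
        tok0, sent0, mwt0 = (head(h) for head in self.heads1)
        # gate the second BiLSTM's input by the first-layer token-boundary score
        h2, _ = self.lstm2(h * (1 - torch.sigmoid(-tok0)))
        tok0 = tok0 + self.heads2[0](h2)
        sent0 = sent0 + self.heads2[1](h2)
        mwt0 = mwt0 + self.heads2[2](h2)
        return tok0, sent0, mwt0
```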

POS

1. Example output

"Barack Obama was born in Hawaii."
word: Barack      upos: PROPN        xpos: NNP
word: Obama       upos: PROPN        xpos: NNP
word: was        upos: AUX          xpos: VBD
word: born       upos: VERB         xpos: VBN
word: in         upos: ADP          xpos: IN
word: Hawaii      upos: PROPN        xpos: NNP
word: .          upos: PUNCT        xpos: .
UPOS: Universal POS tags
ADJ   : adjective
ADP   : adposition
ADV   : adverb
AUX   : auxiliary
CCONJ : coordinating conjunction
DET   : determiner
INTJ  : interjection
NOUN  : noun
NUM   : numeral
PART  : particle
PRON  : pronoun
PROPN : proper noun
PUNCT : punctuation
SCONJ : subordinating conjunction
SYM   : symbol
VERB  : verb
X     : other
XPOS: treebank-specific POS tags (Penn Treebank for English)
NN   : noun, common, singular or mass
IN   : preposition or conjunction, subordinating
DT   : determiner
NNP  : noun, proper, singular
PRP  : pronoun, personal
JJ   : adjective or numeral, ordinal
RB   : adverb
.    : sentence-final punctuation
VB   : verb, base form
NNS  : noun, common, plural
,    : comma
CC   : conjunction, coordinating
VBD  : verb, past tense
VBP  : verb, present tense, not 3rd person singular
VBZ  : verb, present tense, 3rd person singular
CD   : numeral, cardinal
VBN  : verb, past participle
VBG  : verb, present participle or gerund
MD   : modal auxiliary
TO   : "to" as preposition or infinitive marker
PRP$ : pronoun, possessive
-RRB-: right round bracket
-LRB-: left round bracket
WDT  : WH-determiner
WRB  : WH-adverb
:    : colon, semicolon, or dash
``   : opening quotation mark
''   : closing quotation mark
WP   : WH-pronoun
RP   : particle
UH   : interjection
POS  : genitive marker ('s)
HYPH : hyphen
JJR  : adjective, comparative
NNPS : noun, proper, plural
JJS  : adjective, superlative
EX   : existential there
NFP  : superfluous punctuation
GW   : "goes with" (part of a word split in the text)
ADD  : email address or URL
RBR  : adverb, comparative
$    : currency symbol
PDT  : pre-determiner
RBS  : adverb, superlative
SYM  : symbol
LS   : list item marker
FW   : foreign word
AFX  : affix
WP$  : WH-pronoun, possessive
XX   : unknown or unintelligible material

2. Paper reference

UPOS 
XPOS

3. Implementation details

BiLSTM
Input: [word_embedding, char_embedding, pretrain_embedding]
char_embedding is built with an LSTM + attention:
① embed the char indices with nn.Embedding and feed them to the LSTM
② weight = sigmoid(Linear(lstm_output))
③ final_output = (weight * lstm_output).sum(1), where dim 1 is the char dimension within a token
UPOS
upos_hid = ReLU(Linear1(BiLSTM_output))
upos_pred = Linear1'(upos_hid)
XPOS
xpos_hid = ReLU(Linear2(BiLSTM_output))
upos_emb = Embedding(argmax of upos_pred), i.e. for each token, embed the index of its highest-scoring UPOS
xpos_pred = BiLinear([upos_emb, 1], [xpos_hid, 1]), a bilinear transform with a column of ones appended to each input (sketched below)
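
A minimal PyTorch sketch of the char encoder (steps ①-③) and the two tagging heads; all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharAttention(nn.Module):
    """Steps 1-3 above: LSTM over a token's chars + sigmoid attention pooling."""
    def __init__(self, n_chars, emb_dim=32, hid=50):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid, batch_first=True)
        self.attn = nn.Linear(hid, 1)

    def forward(self, char_idx):                  # (n_tokens, max_chars)
        out, _ = self.lstm(self.emb(char_idx))    # (n_tokens, max_chars, hid)
        weight = torch.sigmoid(self.attn(out))    # step 2
        return (weight * out).sum(1)              # step 3: sum over the char dim

class POSHeads(nn.Module):
    """UPOS head, plus an XPOS head conditioned on the predicted UPOS."""
    def __init__(self, lstm_dim, hid, n_upos, n_xpos, upos_emb_dim=32):
        super().__init__()
        self.upos_hid = nn.Linear(lstm_dim, hid)
        self.upos_clf = nn.Linear(hid, n_upos)
        self.xpos_hid = nn.Linear(lstm_dim, hid)
        self.upos_emb = nn.Embedding(n_upos, upos_emb_dim)
        # bilinear over [upos_emb;1] and [xpos_hid;1]
        self.bilinear = nn.Bilinear(upos_emb_dim + 1, hid + 1, n_xpos)

    def forward(self, lstm_out):                  # (B, T, lstm_dim)
        upos_pred = self.upos_clf(torch.relu(self.upos_hid(lstm_out)))
        x_hid = torch.relu(self.xpos_hid(lstm_out))
        u_emb = self.upos_emb(upos_pred.argmax(-1))   # embed the predicted UPOS
        one = torch.ones_like(u_emb[..., :1])         # bias column of ones
        xpos_pred = self.bilinear(torch.cat([u_emb, one], -1),
                                  torch.cat([x_hid, one], -1))
        return upos_pred, xpos_pred
```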

Lemma

1. Example output

"Barack Obama was born in Hawaii."
word: Barack         lemma: Barack
word: Obama          lemma: Obama
word: was            lemma: be
word: born           lemma: bear
word: in             lemma: in
word: Hawaii         lemma: Hawaii
word: .              lemma: .

2. Implementation details

Vocabulary filtering (map known)
Two lookup vocabs are involved: a (word, UPOS) vocab and a word-only vocab.
Using the tokens and UPOS tags from the previous step, tokens found in either vocab are mapped straight to their lemma; only the remaining (unknown) tokens go through the seq2seq model.
Seq2Seq
Outputs two tensors: char_level_predict and edit_logit (token_num * 3).
The edit dimension is 3 and decides how char_level_predict is post-processed: 0 means keep the seq2seq prediction, 1 means copy the original token, 2 means lowercase the original token.
Encoder: LSTM
Input: [char_embedding, upos_embedding]
Output: En_out, (H_n, C_n)
edit_logit = Linear1'(ReLU(Linear1(H_n)))
Decoder (with beam search)
Produces char_level_predict.
Applying the edit to the prediction
Map char_level_predict back through the char vocab, merge the chars into a token, then choose the final lemma according to the edit rule (see the sketch below).
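
A sketch of the edit-selection rule (the function name is hypothetical):

```python
def apply_edit(edit_id: int, predicted: str, token: str) -> str:
    """Pick the final lemma from the seq2seq output and the edit label:
    0 = keep the seq2seq prediction, 1 = copy the token, 2 = lowercase it."""
    if edit_id == 1:
        return token
    if edit_id == 2:
        return token.lower()
    return predicted
```

For instance, edit label 1 keeps a name like "Barack" untouched regardless of what the decoder produced.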

Dependency Parse

For every token in the sentence, the parser predicts ① its syntactic head (governor) and ② the dependency relation between the token and that head.

1. Example output

"Barack Obama was born in Hawaii."
word: Barack            governor: born          deprel: nsubj:pass
word: Obama             governor: Barack        deprel: flat
word: was               governor: born          deprel: aux:pass
word: born              governor: root          deprel: root
word: in                governor: Hawaii        deprel: case
word: Hawaii            governor: born          deprel: obl
word: .                 governor: born          deprel: punct
deprel glosses (this list follows the older Stanford Dependencies scheme; the parser itself outputs Universal Dependencies relations such as the nsubj:pass, flat, and obl seen above):
abbrev     : abbreviation modifier
acomp      : adjectival complement
advcl      : adverbial clause modifier
advmod     : adverbial modifier
agent      : agent (typically introduced by "by")
amod       : adjectival modifier
appos      : appositional modifier
attr       : attributive
aux        : auxiliary (non-main verbs and auxiliaries, e.g. BE, HAVE, SHOULD/COULD)
auxpass    : passive auxiliary
cc         : coordination (usually attached to the first conjunct)
ccomp      : clausal complement
complm     : complementizer (the word introducing a complement clause)
conj       : conjunct (links two coordinated words)
cop        : copula (linking verbs such as be, seem, appear)
csubj      : clausal subject
csubjpass  : clausal passive subject
dep        : dependent (unspecified dependency)
det        : determiner (e.g. articles)
dobj       : direct object
expl       : expletive (mainly existential "there")
infmod     : infinitival modifier
iobj       : indirect object
mark       : marker (words such as that, whether, because, when)
mwe        : multi-word expression
neg        : negation modifier
nn         : noun compound modifier
npadvmod   : noun phrase as adverbial modifier
nsubj      : nominal subject
nsubjpass  : passive nominal subject
num        : numeric modifier
number     : element of compound number
partmod    : participial modifier
pcomp      : prepositional complement
pobj       : object of a preposition
poss       : possession modifier (possessive/genitive)
possessive : possessive modifier (the relation between a possessor and 's)
preconj    : preconjunct (often with either, both, neither)
predet     : predeterminer
prep       : prepositional modifier
prepc      : prepositional clausal modifier
prt        : phrasal verb particle
punct      : punctuation (kept in the scheme but typically filtered from SD output)
purpcl     : purpose clause modifier
quantmod   : quantifier phrase modifier
rcmod      : relative clause modifier
ref        : referent
rel        : relative
root       : root (the head of the whole sentence; everything hangs off it)
tmod       : temporal modifier
xcomp      : open clausal complement
xsubj      : controlling subject

2. Paper reference

3. Implementation details

BiLSTM
input = [pretrain_emb, word_emb, lemma_emb, pos_emb, char_emb]
where char_emb comes from the char model (the same LSTM + attention encoder as in the POS tagger)
DeepBiaffine
Head and relation scores are computed by two separate DeepBiaffine modules:
the head module has output_dim 1 and outputs (token_num * token_num * 1);
the relation module has output_dim len(vocab['deprel']) and outputs (token_num * token_num * len(vocab['deprel'])).

DeepBiaffine takes two inputs, input1 and input2; for both the head and relation modules, input1 = input2 = BiLSTM output.
DeepBiaffine structure: BiLinear(Linear1(input1), Linear2(input2)), sketched below.
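
A sketch of the DeepBiaffine module under those shapes (the hidden size and the ReLU after the projections are assumptions):

```python
import torch
import torch.nn as nn

class DeepBiaffine(nn.Module):
    """BiLinear(Linear1(input1), Linear2(input2)) over all token pairs.
    out_dim = 1 for head scores, len(vocab['deprel']) for relation scores."""
    def __init__(self, in_dim, hid, out_dim):
        super().__init__()
        self.proj1 = nn.Linear(in_dim, hid)
        self.proj2 = nn.Linear(in_dim, hid)
        # +1 appends a bias column of ones, making the map fully biaffine
        self.W = nn.Parameter(torch.zeros(out_dim, hid + 1, hid + 1))

    def forward(self, input1, input2):             # (B, T, in_dim) each
        h1 = torch.relu(self.proj1(input1))
        h2 = torch.relu(self.proj2(input2))
        one = torch.ones_like(h1[..., :1])
        h1 = torch.cat([h1, one], -1)               # (B, T, hid+1)
        h2 = torch.cat([h2, one], -1)
        # scores[b, i, j, o] = h1[b, i] @ W[o] @ h2[b, j]
        return torch.einsum('bih,ohk,bjk->bijo', h1, self.W, h2)
```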
relation_pred (deprel)
relation_pred is simply the argmax of the DeepBiaffine output.
head_pred (head)
For head_pred, the positions of the tokens within the sentence are additionally taken into account.
First build a head_offset matrix (token_num * token_num * 1):
head_offset(i, j) = index[i] - index[j]
① effect of attachment direction:
"In a language where heads always follow their dependents, P(sgn(i − j) = 1|yij) would be extremely low, heavily penalizing rightward attachments."
head_pred += logSigmoid(Biaffine(lstm_out, lstm_out) * sgn(head_offset))
② effect of attachment distance:
"Similarly, in a language where dependencies are always short, P(abs(i−j) ≫ 0|yij) would be extremely low, penalizing longer edges."
Predict the tree distance between i and j; since dis(i, j) >= 1, add 1 to the prediction:
dis_pred = 1 + log(1 + exp(Biaffine(lstm_out, lstm_out)))
The actual distance between i and j in the sentence:
dis_tgt = abs(head_offset)
Their difference: dis_dif = dis_tgt - dis_pred
When dis_dif is near 0 (the edge length in the tree matches the surface distance) there is no penalty; values far from 0 are penalized:
dis_pen = -log(dis_dif²/2 + 1)
head_pred += dis_pen
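
A sketch of both adjustments (the tensor names lin_score and dist_score, for the two extra Biaffine(lstm_out, lstm_out) outputs, are hypothetical):

```python
import torch
import torch.nn.functional as F

def adjust_head_scores(head_score, lin_score, dist_score):
    """Add the direction (1) and distance (2) terms to the raw arc scores.
    All three inputs are (B, T, T); rows index dependents, columns heads."""
    T = head_score.size(-1)
    idx = torch.arange(T, dtype=head_score.dtype)
    head_offset = idx.view(1, -1, 1) - idx.view(1, 1, -1)  # index[i] - index[j]
    # (1) direction: logSigmoid(lin_score * sgn(head_offset))
    out = head_score + F.logsigmoid(lin_score * torch.sign(head_offset))
    # (2) distance: dis_pred = 1 + log(1 + exp(dist_score)), compared to |i - j|
    dis_pred = 1 + F.softplus(dist_score)
    dis_dif = head_offset.abs() - dis_pred
    return out + (-torch.log(dis_dif ** 2 / 2 + 1))
```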
 
Finally, each token's head is read off from head_pred, and the deprel for that (token, head) pair is looked up in relation_pred, as in the snippet below.
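
A greedy decoding sketch (tensor names are hypothetical; the released parser decodes a well-formed tree, e.g. with Chu-Liu/Edmonds, rather than taking a per-token argmax):

```python
# head_scores: (B, T, T, 1) from the head DeepBiaffine (after the adjustments);
# rel_scores:  (B, T, T, n_deprel) from the relation DeepBiaffine.
head_pred = head_scores.squeeze(-1).argmax(-1)   # (B, T): head index per token
best_rel = rel_scores.argmax(-1)                 # (B, T, T): best relation per arc
deprel_pred = best_rel.gather(-1, head_pred.unsqueeze(-1)).squeeze(-1)  # (B, T)
```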
 