Natural Language Processing (NLP) Study Notes: Using the spaCy Toolkit

Official spaCy learning documentation

Introduction to spaCy

The nlp object

# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

Contains the processing pipeline
Includes language-specific rules for tokenization and more.
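One way to see what a freshly created nlp object contains is to inspect it. The sketch below assumes only that the spacy package is installed (no trained model is loaded); a blank English() pipeline should have no components, yet its tokenizer already applies the English rules.

```python
from spacy.lang.en import English

# A blank English pipeline: tokenizer only, no trained components
nlp = English()

# Names of the pipeline components (expected to be empty for a blank pipeline)
pipe_names = list(nlp.pipe_names)
print("Pipeline components:", pipe_names)

# The language-specific tokenization rules split off the trailing "!"
tokens = [token.text for token in nlp("Hello world!")]
print("Tokens:", tokens)
```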

The Doc object

# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

The Token object

doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

The Span object

doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)
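To make the relationship between a Doc and a Span more concrete, here is a small sketch (again on a blank English pipeline, which is an assumption of this example) that looks at the span's token boundaries. The end index of the slice is exclusive, and the Span remains a view onto its Doc rather than a copy.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# Slice tokens 1 and 2 (the end index 3 is exclusive)
span = doc[1:3]

# Token offsets of the span within the Doc
span_bounds = (span.start, span.end)
# A Span can be iterated over like a Doc
span_tokens = [token.text for token in span]
print(span_bounds, span_tokens, len(span))

# The Span keeps a reference to the Doc it was sliced from
print(span.doc is doc)
```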

Lexical attributes

doc = nlp("It costs $5.")
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
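Lexical attributes depend only on the token's text, not on its context, so they also work without a trained model. As a rough illustration (the example sentence is made up for this sketch), like_num recognizes spelled-out English numbers as well as digits:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I bought ten apples and 3 pears.")

# like_num is True for digits and for number words such as "ten"
numbers = [token.text for token in doc if token.like_num]
# is_alpha is True for tokens made of alphabetic characters only
words = [token.text for token in doc if token.is_alpha]
# is_punct is True for punctuation tokens
punctuation = [token.text for token in doc if token.is_punct]

print("Numbers:    ", numbers)
print("Words:      ", words)
print("Punctuation:", punctuation)
```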

Exercises (a few examples)

Getting started

Import the English class from spacy.lang.en and create the nlp object.
Create a doc and print its text.

from spacy.lang.en import English
nlp = English()
doc = nlp("This is a sentence.")
print(doc.text)

Documents, spans and tokens

When you call nlp on a string, spaCy first tokenizes the text and creates a Doc object.

Step 1

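Import the English language class and create the nlp object.
Process the text and store the resulting Doc object in the variable doc.

Select the first token of the Doc and print its text.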

# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")
# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

Step 2

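Import the English language class and create the nlp object.
Process the text and store the resulting Doc object in the variable doc.
Create slices of the Doc for the tokens "tree kangaroos" and "tree kangaroos and narwhals".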

# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
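When the slice boundaries are not obvious, it can help to list each token next to its index first. This is only a convenience sketch for the exercise above; the variable name indexed_tokens is made up for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Pair every token with its index in the Doc
indexed_tokens = [(token.i, token.text) for token in doc]
for index, text in indexed_tokens:
    print(index, text)

# "tree" is at index 2 and "narwhals" at index 5, so the slice is doc[2:6]
# (the end index is exclusive and therefore leaves out the final ".")
print(doc[2:6].text)
```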

Lexical attributes

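In this example, you'll use spaCy's Doc and Token objects and lexical attributes to find percentages in a text. You'll be looking for two subsequent tokens: a number and a percent sign.

Use the like_num token attribute to check whether a token in the doc resembles a number.
Get the token that follows the current token in the document. The index of the next token in the doc is token.i + 1.
Check whether the next token's text attribute is a percent sign "%".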

from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)
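The loop above works for this text because the last token is a period. If a text ended with a number, doc[token.i + 1] would raise an IndexError. A slightly more defensive variant might look like the following sketch; find_percentages is a hypothetical helper name, not part of spaCy.

```python
from spacy.lang.en import English

nlp = English()


def find_percentages(doc):
    """Return the text of every number-like token directly followed by "%"."""
    found = []
    for token in doc:
        # Only look ahead if there actually is a next token
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)

# A text that ends with a number no longer causes an IndexError
print(find_percentages(nlp("The rate rose to 5")))
```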

Statistical models

What are statistical models?

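Enable spaCy to predict linguistic attributes in context
- Part-of-speech tags
- Syntactic dependencies
- Named entities
Trained on labeled example texts
Can be updated with more examples to fine-tune predictions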

Model packages

$ python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

Binary weights
Vocabulary
Meta information (language, pipeline)
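Two of these ingredients can be inspected even without downloading a model. The sketch below uses a blank English pipeline as a stand-in (an assumption made so the example runs without en_core_web_sm); a loaded model package exposes the same nlp.meta and nlp.vocab attributes, with more entries filled in.

```python
from spacy.lang.en import English

nlp = English()

# Meta information: the language code and the pipeline component names
meta = nlp.meta
print("Language:", meta["lang"])
print("Pipeline:", list(meta["pipeline"]))

# Vocabulary: strings are stored once and referred to by their hash
coffee_hash = nlp.vocab.strings.add("coffee")
print("Hash:  ", coffee_hash)
print("String:", nlp.vocab.strings[coffee_hash])
```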

Predicting part-of-speech tags

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

Predicting syntactic dependencies

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Label	Description	Example
nsubj	nominal subject	She
dobj	direct object	pizza
det	determiner (article)	the

Predicting named entities

# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Tip: the spacy.explain method

Get quick definitions of the most common tags and labels

spacy.explain("GPE")
'Countries, cities, states'
spacy.explain("NNP")
'noun, proper singular'
spacy.explain("dobj")
'direct object'
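spacy.explain looks labels up in a built-in glossary, so it needs no loaded model. A small sketch that explains several labels at once, with a fallback for labels the glossary does not know (NOT_A_REAL_LABEL is a made-up label, and recent spaCy versions may also emit a warning for it):

```python
import spacy

labels = ["GPE", "NNP", "dobj"]

# Look up each label in spaCy's glossary
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<6}{description}")

# For unknown labels spacy.explain returns None, so provide a fallback
unknown = spacy.explain("NOT_A_REAL_LABEL") or "(no description available)"
print(unknown)
```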

Exercises

Model packages

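What is NOT included in a model package that you can load into spaCy?

A. A meta file including the language, pipeline and license.

B. Binary weights to make statistical predictions.

C. The labeled data that the model was trained on.

D. Strings of the model's vocabulary and their hashes.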

Loading models

Use spacy.load to load the small English model "en_core_web_sm".
Process the text and print the document text.

import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

Predicting linguistic annotations

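Now you'll try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").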

Part 1

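Process the text with the nlp object and create a doc.

For each token, print the token text, the token's .pos_ (part-of-speech tag) and the token's .dep_ (dependency label).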

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

Part 2

Process the text and create a doc object.

Iterate over doc.ents and print the entity text and its label_ attribute.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Predicting named entities in context

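Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's look at an example.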

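Process the text with the nlp object.
Iterate over the entities and print the entity text and label.

It looks like the model didn't predict "iPhone X". Create a span for those tokens manually.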

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)
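Going one step further than the exercise, the manually created span can be given a label and added to doc.ents. The sketch below is only an illustration: it uses a blank English pipeline (so doc.ents starts out empty and no model download is required), and the label "PRODUCT" is a choice made for this example. With a trained model, the new span must not overlap any entity the model already predicted.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled Span covering tokens 1 and 2 ("iPhone X")
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Add the span to the existing entities (none for a blank pipeline)
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```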

Rule-based matching

Why not just regular expressions?

Match on Doc objects, not just strings
Match on tokens and token attributes
Use the model's predictions
Example: "duck" (verb) vs. "duck" (noun)

Match patterns

Lists of dictionaries, one per token
Match exact token texts

[{"TEXT": "iPhone"}, {"TEXT": "X"}]

Match lexical attributes

[{"LOWER": "iphone"}, {"LOWER": "x"}]

Match any token attributes

[{"LEMMA": "buy"}, {"POS": "NOUN"}]
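The difference between the first two pattern types can be tried out without a trained model, because TEXT and LOWER are lexical attributes (the LEMMA and POS pattern, by contrast, relies on model predictions). A minimal sketch, assuming spaCy v3's matcher.add(name, [pattern]) signature and a made-up example sentence:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()

# One matcher per pattern so the results are easy to compare
exact_matcher = Matcher(nlp.vocab)
exact_matcher.add("EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

lower_matcher = Matcher(nlp.vocab)
lower_matcher.add("LOWERCASE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I compared the iPhone X with an IPHONE x clone.")

# TEXT matches the exact spelling only
exact_hits = [doc[start:end].text for _, start, end in exact_matcher(doc)]
# LOWER compares the lowercased token text, so any casing matches
lower_hits = [doc[start:end].text for _, start, end in lower_matcher(doc)]

print("TEXT: ", exact_hits)
print("LOWER:", lower_hits)
```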

Using the Matcher

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v3 expects a list of patterns; v2 used matcher.add(name, None, pattern))
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)
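Calling the matcher returns a list of (match_id, start, end) tuples. The sketch below shows one way to turn them back into readable results; it swaps in a blank English pipeline (an assumption, fine here because the pattern only uses TEXT) and the spaCy v3 matcher.add signature.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

results = []
for match_id, start, end in matches:
    # match_id is the hash of the pattern name; look up the string
    pattern_name = nlp.vocab.strings[match_id]
    # start and end are token indices, so they slice the Doc into a Span
    matched_span = doc[start:end]
    results.append((pattern_name, start, end, matched_span.text))

print(results)
```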

import spacy

# Import the spaCy Matcher
from spacy.matcher import Matcher

# Load the model and create the nlp object
nlp = spacy.load("en_core_web_sm")
# doc = nlp("I love little sheep.")
# doc2 = nlp("Indians spent over $71 billion on clothes in 2018")

# # The spaCy processing pipeline
# print(nlp.pipe_names)

# # Part-of-speech tagging
# for token in doc:
#     # Print the token and its part-of-speech tag
#     print(token.text, "-->", token.pos_)
# print(spacy.explain("PUNCT"))

# # Dependency parsing
# for token in doc:
#     print(token.text, "-->", token.dep_)

# # Named entity recognition
# for ent in doc2.ents:
#     print(ent.text, ent.label_)

# Rule-based matching with spaCy

# Initialize the Matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")
# Define the rule
pattern = [{"TEXT": "lemon"}, {"TEXT": "water"}]

# Add the rule
matcher.add("rule_1", [pattern])
matches = matcher(doc)
# Each match is a (match_id, start, end) tuple
print(matches)

Posted 2021-05-21 17:06 by 坐怀不乱的H先生
