
Python Natural Language Processing Cookbook (Complete)

Source: zh.annas-archive.org/md5/783186954e97066ec43d243155438313

Translator: 飞龙

License: CC BY-NC-SA 4.0

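Preface

Python is the most widely used language for natural language processing (NLP), thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book introduces a range of text processing techniques, from basics such as part-of-speech analysis to complex topics such as topic modeling, text classification, and visualization.

Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and part-of-speech tagging to help you prepare your data. You will then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution; discover different ways of representing semantics using bag-of-words, TF-IDF, word embeddings, and BERT; and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques.

As you advance, you will also see how to extract information from text, implement unsupervised and supervised topic modeling techniques, and perform topic modeling of short texts, such as tweets. The book also covers the visualization of text data.

Finally, the book introduces Transformer-based models and shows how to use them to perform another set of novel NLP tasks. These encoder-decoder-based models are deep neural networks trained on large text corpora. They have matched or exceeded the state of the art on a variety of NLP tasks. Decoder-based generative models in particular can generate text based on the context provided to them, and some of them have built-in reasoning capabilities. These models will take NLP into its next era and make it part of mainstream technology applications.

By the end of this NLP book, you will have the skills to use a powerful set of tools for text processing.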

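Who this book is for

Data scientists, machine learning engineers, and developers familiar with basic programming and data science concepts can gain practical insights from this book. It serves as an introductory book that presents NLP concepts and their practical applications.

The target personas for this book are as follows:

Data scientists: As a data scientist, you will learn how to work with text. Intermediate knowledge of Python will help you make the most of this book. If you are already an NLP practitioner, the book will serve as a code reference while you work on your projects.

Software engineers and architects: Developers who want to build NLP capabilities will be exposed to the basic and advanced applications of NLP in text processing. This will help you level up your knowledge and build solutions that use NLP techniques when needed.

Product managers: Although this book contains code samples for the recipes, each recipe is accompanied by an explanation of why certain steps are performed and what their final output is. This makes it a useful resource for product managers who want to understand what a given NLP recipe can achieve, which will enable them to envision novel solutions that use it.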

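What this book covers

Chapter 1, Learning NLP Basics, introduces the fundamentals of NLP. The recipes in this chapter show the basic preprocessing steps required for further NLP work. We show how to tokenize text, or divide it into sentences and words; assign parts of speech to individual words; lemmatize them, or get their canonical forms; and remove stopwords.

Chapter 2, Playing with Grammar, shows how to get grammatical information from text. This information can be useful in determining relationships between the different entities mentioned in the text. We start by showing how to determine whether a noun is singular or plural. We then show how to get a dependency parse that shows the relationships between the words in a sentence. Next, we demonstrate how to get noun chunks, or nouns with their dependent words, such as adjectives. After that, we look at how to parse out the subjects and objects of a sentence. Finally, we show how to use a regular-expression-style matcher to extract grammatical phrases from a sentence.

Chapter 3, Representing Text – Capturing Semantics, looks at different ways of representing text for further processing in NLP models. Since computers cannot deal with words directly, we need to encode them in vector form. To demonstrate the effectiveness of different encoding methods, we first create a simple classifier and then use it with different encoding methods. We look at the following encoding methods: bag-of-words, N-gram model, TF-IDF, word embeddings, BERT, and OpenAI embeddings. We also show how to train your own bag-of-words model and demonstrate how to create a simple retrieval-augmented generation (RAG) solution.

Chapter 4, Classifying Texts, shows various ways of carrying out text classification, one of the most common NLP tasks. First, we show how to preprocess a dataset to prepare it for classification. Then, we demonstrate different classifiers, including a rule-based classifier, an unsupervised classifier using K-means, a support vector machine (SVM) trained for classification, a spaCy model trained for text classification, and, finally, OpenAI GPT models used for text classification.

Chapter 5, Getting Started with Information Extraction, shows how to extract information from text, another very important NLP task. We start by using regular expressions for simple information extraction. We then look at how to use the Levenshtein distance to handle misspellings. Next, we show how to extract characteristic keywords from different texts. We look at how to extract named entities using spaCy and how to train your own custom spaCy NER model. Finally, we show how to fine-tune a BERT NER model.

Chapter 6, Topic Modeling, shows how to determine the topics of a text using various unsupervised methods, including LDA, community detection with BERT embeddings, K-means clustering, and BERTopic. Finally, we use contextualized topic models that work with multilingual models and inputs.

Chapter 7, Visualizing Text Data, focuses on using various tools to create informative visualizations of text data and processing. We create graphical representations of the dependency parse, parts of speech, and named entities. We also create a confusion matrix plot and word clouds. Finally, we use pyLDAvis and BERTopic to visualize the topics in a text.

Chapter 8, Transformers and Their Applications, provides an introduction to Transformers. The chapter begins by demonstrating how to convert text into a format suitable for internal processing by a Transformer model. It then explores techniques for text classification using pre-trained Transformer models. The chapter also delves into text generation with Transformers, explaining how to tune generation parameters to produce coherent and natural-sounding text. Finally, it covers the application of Transformers to language translation.

Chapter 9, Natural Language Understanding, covers NLP techniques that help infer the information contained in a piece of text. The chapter begins with a discussion of question answering in open and closed domains, followed by methods for answering questions from document sources using extractive and abstractive approaches. Subsequent sections cover text summarization and sentence entailment. The chapter concludes with explainability techniques, which show how models make their classification decisions and how different parts of a text contribute to the assigned class labels.

Chapter 10, Generative AI and Large Language Models, introduces open source large language models (LLMs) such as Mistral and Llama, demonstrating how to use prompts to generate text based on simple human-defined requirements. It further explores techniques for generating Python code and SQL statements from natural language instructions. Finally, it presents methods for using a sophisticated closed source LLM from OpenAI to orchestrate custom task agents. These agents work together to answer complex questions whose solutions require web searches and basic arithmetic.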

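To get the most out of this book

You need an understanding of the Python programming language and how to manage and install its packages. Knowledge of Jupyter Notebook is useful, though not required. For package management, familiarity with Poetry is recommended, although you can also make the examples work via pip. To use a GPU in your system (if one is available), ensure that the latest GPU device drivers and the CUDA/cuDNN dependencies are installed.

Software/hardware covered in the book	Operating system requirements
Python 3.10	Windows, macOS, or Linux
Poetry	Windows, macOS, or Linux
Jupyter Notebook (optional)	Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.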

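Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition. If there is an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter/X handles. Here is an example: "Instantiate a tokenizer of the bert-base-cased type."

A block of code is set as follows: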

from transformers import pipeline
import torch

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Any command-line input or output is written as follows:

Classification Results
{'accuracy': 0.88,
'precision': 0.92,
'recall': 0.84}

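Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.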

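Share your thoughts

Once you've read Python Natural Language Processing Cookbook, we'd love to hear your thoughts! Please go to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don't worry, every Packt book now comes with a DRM-free PDF version at no extra cost.

Read anywhere, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don't stop there. You can also get exclusive discounts, newsletters, and free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-80324-574-4

Submit your proof of purchase
That's it! We'll send your free PDF and other benefits to your email directly.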

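Chapter 1: Learning NLP Basics

While working on this book, we focused on including recipes that would be useful in a wide variety of NLP projects. They range from simple to advanced, from dealing with grammar to dealing with visualizations, and many of them include options for languages other than English. In this new edition, we have included new topics that cover using GPT and other large language models, explainable AI, a new chapter on transformers, and natural language understanding. We hope you find the book useful.

The format of the book is like that of a programming cookbook, where each recipe is a short mini-project with a concrete goal and a sequence of steps to perform. There are few theoretical explanations; the focus is on practical goals and the work needed to achieve them.

Before we can get to the real work of NLP, we need to prepare our text for processing. This chapter shows you how to do that. By the end of the chapter, you will be able to produce a list of the words in a text along with their parts of speech and lemmas or stems, with very frequent words removed.

The Natural Language Toolkit (NLTK) and spaCy are two important packages that we will work with in this chapter and throughout the book. Other packages used in the book include PyTorch and Hugging Face Transformers. We will also use the OpenAI API with GPT models.

The recipes included in this chapter are as follows:

Dividing text into sentences
Dividing sentences into words – tokenization
Part-of-speech tagging
Combining similar words – lemmatization
Removing stopwords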

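Technical requirements

Throughout this book, we will use Poetry to manage Python package installations. You can use the latest version of Poetry, since it retains the functionality of previous versions. Once Poetry is installed, managing the packages to install is very easy. We will use Python 3.9 throughout the book. You will also need to install Jupyter in order to run the notebooks.

Note

You can try using Google Colab to run the notebooks, but you will need to adjust the code to make it work on Colab.

Follow these installation steps:

Install Git: https://github.com/git-guides/install-git.
Install Poetry: https://python-poetry.org/docs/#installation.
Install Jupyter: https://jupyter.org/install.
Clone the GitHub repository that contains all the code for this book (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition) by entering the following command in the terminal: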

git clone https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition.git

In the directory that contains the pyproject.toml file, run the following commands in the terminal:

poetry install
poetry shell

Start the notebook engine:

jupyter notebook

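Now, you should be able to run all the notebooks in your cloned repository.

If you prefer not to use Poetry, you can set up a virtual environment using the requirements.txt file provided with the book. There are two ways to do this. You can use pip: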

pip install -r requirements.txt

You can also use conda:

conda create --name <env_name> --file requirements.txt
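
Whichever installation route you choose, it can help to confirm that the main packages are importable before opening the notebooks. The following is a minimal sketch that uses only the standard library. The list of package names is an assumption based on the libraries mentioned in this chapter (NLTK, spaCy, PyTorch, Hugging Face Transformers, and OpenAI); it is not the full dependency list from pyproject.toml:

```python
import importlib.util

# Assumed import names for the libraries mentioned in this chapter;
# adjust the list to match your own environment.
required_packages = ["nltk", "spacy", "torch", "transformers", "openai"]

# find_spec returns None when a top-level package cannot be found.
status = {
    name: importlib.util.find_spec(name) is not None
    for name in required_packages
}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can then be installed through Poetry, pip, or conda as described above.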

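Dividing text into sentences

When we work with text, we can work with text units on different scales: the document itself, such as a newspaper article, a paragraph, a sentence, or a word. Sentences are the main unit of processing in many NLP tasks. For example, when we send data to large language models (LLMs), we frequently want to add some context to the prompt. In some cases, we want that context to include sentences from a text so that the model can extract important information from it. In this section, we show you how to divide a text into sentences.

Getting ready

For this part, we will use the text of the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub files (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes.txt). For this recipe, we need just the beginning of the book, which can be found in the file https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes_1.txt.

To do this task, you will need the NLTK package and its sentence tokenizers, which are part of the Poetry file. Directions for installing Poetry are described in the Technical requirements section.

How to do it...

We will now divide the text of a small piece of The Adventures of Sherlock Holmes, outputting a list of sentences (notebook reference: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter01/dividing_text_into_sentences_1.1.ipynb). Here, we assume that you are running the notebook, so all paths are relative to the notebook location:

Import the file utility functions from the util folder (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/util/file_utils.ipynb):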
```
%run -i "../util/file_utils.ipynb"
```

Read in part of the book's text:

sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")

The read_text_file function is located in the util notebook that we imported previously. Here is its source code:

def read_text_file(filename):
    file = open(filename, "r", encoding="utf-8")
    return file.read()

Print out the result to make sure that everything worked and the file loaded:

print(sherlock_holmes_part_of_text)

The beginning of the printout will look like this:

To Sherlock Holmes she is always _the_ woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her sex…

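Import the nltk package:
```
import nltk
```
If this is the first time you are running the code, you will need to download the tokenizer data. You will not need to run this command again afterward:
```
nltk.download('punkt')
```

Initialize the tokenizer:

tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

Use the tokenizer to divide the text into sentences. The result will be a list of sentences: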

sentences_nltk = tokenizer.tokenize(
    sherlock_holmes_part_of_text)

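Print the result:

print(sentences_nltk)

It should look like this. The sentences contain newline characters that come from the book's formatting and do not necessarily mark the end of a sentence: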

['To Sherlock Holmes she is always _the_ woman.', 'I have seldom heard him\nmention her under any other name.', 'In his eyes she eclipses and\npredominates the whole of her sex.', 'It was not that he felt any emotion\nakin to love for Irene Adler.', 'All emotions, and that one particularly,\nwere abhorrent to his cold, precise but admirably balanced mind.', 'He\nwas, I take it, the most perfect reasoning and observing machine that\nthe world has seen, but as a lover he would have placed himself in a\nfalse position.', 'He never spoke of the softer passions, save with a gibe\nand a sneer.', 'They were admirable things for the observer—excellent for\ndrawing the veil from men's motives and actions.', 'But for the trained\nreasoner to admit such intrusions into his own delicate and finely\nadjusted temperament was to introduce a distracting factor which might\nthrow a doubt upon all his mental results.', 'Grit in a sensitive\ninstrument, or a crack in one of his own high-power lenses, would not\nbe more disturbing than a strong emotion in a nature such as his.', 'And\nyet there was but one woman to him, and that woman was the late Irene\nAdler, of dubious and questionable memory.']

Print the number of sentences in the result; there are 11 sentences in total:
```
print(len(sentences_nltk))
```
This will give the following result:
```
11
```

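Although it might seem straightforward to divide a text into sentences by splitting on periods with a regular expression, in reality it is more complicated. We use periods in places other than the ends of sentences, for example, after abbreviations, as in "Dr. Smith will see you now." Similarly, while all sentences in English start with a capital letter, we also use capital letters for proper names. The approach used in nltk takes all these points into consideration; it is an implementation of an unsupervised algorithm presented in https://aclanthology.org/J06-4003.pdf.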
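
To make the abbreviation problem concrete, here is a small sketch of the naive approach using only Python's re module. The sample text is made up for illustration, and the regular expression is a deliberately simplistic stand-in, not what NLTK does internally:

```python
import re

text = "Dr. Smith will see you now. Please take a seat."

# Naive rule: a sentence ends wherever a period is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

print(naive_sentences)
# ['Dr.', 'Smith will see you now.', 'Please take a seat.']
```

The abbreviation "Dr." is wrongly treated as a complete sentence, which is the kind of error a trained sentence tokenizer is designed to avoid.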

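There's more...

We can also use a different strategy to parse the text into sentences, employing another very popular NLP package, spaCy. Here is how it works:

Import the spaCy package:
```
import spacy
```
The first time you run the notebook, you will need to download a spaCy model. The model is trained on a large amount of English text, and it provides several tools, including a sentence tokenizer. Here, we download the smallest model, but you can try other ones (see https://spacy.io/usage/models/).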
```
!python -m spacy download en_core_web_sm
```
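Initialize the spaCy engine:
```
nlp = spacy.load("en_core_web_sm")
```
Process the text using the spaCy engine. This line assumes that you have already initialized the sherlock_holmes_part_of_text variable. If not, run one of the earlier cells where the text is read into this variable: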
```
doc = nlp(sherlock_holmes_part_of_text)
```

Get the sentences from the processed doc object, and print the resulting list and its length:

sentences_spacy = [sentence.text for sentence in doc.sents]
print(sentences_spacy)
print(len(sentences_spacy))

The result will look like this:

['To Sherlock Holmes she is always _the_ woman.', 'I have seldom heard him\nmention her under any other name.', 'In his eyes she eclipses and\npredominates the whole of her sex.', 'It was not that he felt any emotion\nakin to love for Irene Adler.', 'All emotions, and that one particularly,\nwere abhorrent to his cold, precise but admirably balanced mind.', 'He\nwas, I take it, the most perfect reasoning and observing machine that\nthe world has seen, but as a lover he would have placed himself in a\nfalse position.', 'He never spoke of the softer passions, save with a gibe\nand a sneer.', 'They were admirable things for the observer—excellent for\ndrawing the veil from men's motives and actions.', 'But for the trained\nreasoner to admit such intrusions into his own delicate and finely\nadjusted temperament was to introduce a distracting factor which might\nthrow a doubt upon all his mental results.', 'Grit in a sensitive\ninstrument, or a crack in one of his own high-power lenses, would not\nbe more disturbing than a strong emotion in a nature such as his.', 'And\nyet there was but one woman to him, and that woman was the late Irene\nAdler, of dubious and questionable memory.']
11

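An important difference between spaCy and NLTK is the time it takes to complete the sentence-splitting process. The reason is that spaCy loads a language model and runs several tools in addition to the tokenizer, whereas the NLTK tokenizer has only one function: to split the text into sentences. We can time the execution by using the time package and wrapping the sentence-splitting code in functions: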

import time
def split_into_sentences_nltk(text):
    sentences = tokenizer.tokenize(text)
    return sentences
def split_into_sentences_spacy(text):
    doc = nlp(text)
    sentences = [sentence.text for sentence in doc.sents]
    return sentences
start = time.time()
split_into_sentences_nltk(sherlock_holmes_part_of_text)
print(f"NLTK: {time.time() - start} s")
start = time.time()
split_into_sentences_spacy(sherlock_holmes_part_of_text)
print(f"spaCy: {time.time() - start} s")

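The spaCy algorithm takes 0.019 seconds, while the NLTK algorithm takes 0.0002 seconds. The elapsed time is calculated by subtracting the time recorded at the start of the code block from the current time (time.time()). You might get slightly different values.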
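
A single time.time() measurement can be noisy for calls this short. As a hedged alternative, the sketch below uses time.perf_counter and keeps the best of several runs. The time_call helper is a hypothetical name introduced here for illustration, and str.split merely stands in for the sentence-splitting functions defined above:

```python
import time

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# str.split is only a placeholder; pass split_into_sentences_nltk or
# split_into_sentences_spacy together with the text to time those instead.
elapsed = time_call(str.split, "To Sherlock Holmes she is always the woman.")
print(f"str.split: {elapsed:.6f} s")
```

Taking the minimum over repeated runs reduces the influence of one-off delays such as caching or other processes competing for the CPU.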

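The reason you might use spaCy is if you are doing other processing with the package in addition to splitting text into sentences. The spaCy processor does many other things, which is why it takes longer. If you are using other features of spaCy, there is no reason to use NLTK just for sentence splitting; it is better to use spaCy for the entire pipeline.

It is also possible to use just the spaCy tokenizer without the other tools. See the spaCy documentation for more information: https://spacy.io/usage/processing-pipelines.

Important note

spaCy might be slower, but it does many more things in the background, so if you are using its other features, use it for sentence splitting as well.

See also

You can use NLTK and spaCy to divide text in languages other than English. NLTK includes tokenizer models for Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, and Turkish. To load one of these models, use the name of the language followed by the .pickle extension: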

tokenizer = nltk.data.load("tokenizers/punkt/spanish.pickle")

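See the NLTK documentation for more information: https://www.nltk.org/index.html.

Likewise, spaCy has models for other languages: Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Portuguese, Romanian, Spanish, and others. These models are trained on text in those languages. To use them, you need to download each one separately. For example, for Spanish, use this command to download the model: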

python -m spacy download es_core_news_sm

Then, put this line in the code to use it:

nlp = spacy.load("es_core_news_sm")

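See the spaCy documentation for more information: https://spacy.io/usage/models.

Dividing sentences into words – tokenization

In many instances, we rely on individual words when we do NLP tasks. This happens, for example, when we build semantic models of texts by relying on the semantics of individual words, or when we are looking for words with a specific part of speech. To divide text into words, we can use NLTK and spaCy.

Getting ready

For this part, we will use the same text from the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub repository. For this recipe, we need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.

To do this task, you will need the NLTK and spaCy packages, which are part of the Poetry file. Directions for installing Poetry are described in the Technical requirements section.

(Notebook reference: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter01/dividing_sentences_into_words_1.2.ipynb).

How to do it

The process is as follows:

Import the file_utils notebook. Effectively, we run the file_utils notebook inside the current one so that we have access to the functions and variables it defines: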
```
%run -i "../util/file_utils.ipynb"
```

Read in the book snippet text:

sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
print(sherlock_holmes_part_of_text)

The result should look like this:

To Sherlock Holmes she is always _the_ woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her sex... [Output truncated]

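Import the nltk package:
```
import nltk
```
Divide the input into words. Here, we use the NLTK word tokenizer to split the text into individual words. The output of the function is a Python list of words: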
```
words = nltk.tokenize.word_tokenize(
    sherlock_holmes_part_of_text)
print(words)
print(len(words))
```

The output will be the list of words in the text and the length of the words list:

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', '.', 'All', 'emotions', ',', 'and', 'that', 'one', 'particularly', ',', 'were', 'abhorrent', 'to', 'his', 'cold', ',', 'precise', 'but', 'admirably', 'balanced', 'mind', '.', 'He', 'was', ',', 'I', 'take', 'it', ',', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', ',', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', '.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', ',', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', '.', 'They', 'were', 'admirable', 'things', 'for', 'the', 'observer—excellent', 'for', 'drawing', 'the', 'veil', 'from', 'men', ''', 's', 'motives', 'and', 'actions', '.', 'But', 'for', 'the', 'trained', 'reasoner', 'to', 'admit', 'such', 'intrusions', 'into', 'his', 'own', 'delicate', 'and', 'finely', 'adjusted', 'temperament', 'was', 'to', 'introduce', 'a', 'distracting', 'factor', 'which', 'might', 'throw', 'a', 'doubt', 'upon', 'all', 'his', 'mental', 'results', '.', 'Grit', 'in', 'a', 'sensitive', 'instrument', ',', 'or', 'a', 'crack', 'in', 'one', 'of', 'his', 'own', 'high-power', 'lenses', ',', 'would', 'not', 'be', 'more', 'disturbing', 'than', 'a', 'strong', 'emotion', 'in', 'a', 'nature', 'such', 'as', 'his', '.', 'And', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that', 'woman', 'was', 'the', 'late', 'Irene', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory', '.']
230

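The output is a list in which each token is either a word or a punctuation mark. The NLTK tokenizer uses a set of rules to split the text into words. It splits but does not expand contractions, for example, don't → do n't and men's → men 's. In the output above, men’s is written with a typographic apostrophe, which ends up as a token of its own. The tokenizer treats punctuation and quotes as separate tokens, so the words in the result carry no attached punctuation.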
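
To illustrate the idea of rule-based tokenization, here is a simplified sketch built on a single regular expression. It is an assumption-laden stand-in rather than NLTK's actual rule set: it keeps hyphenated words together and emits each punctuation mark as its own token, but it does not handle contractions or other special cases:

```python
import re

# One or more word characters, optionally joined by hyphens,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

sentence = ("Grit in a sensitive instrument, or a crack in one of "
            "his own high-power lenses.")
tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

Here, the comma and the final period become separate tokens, while high-power stays in one piece, which mirrors the NLTK behavior described above for this sentence.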

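There's more…

Sometimes, it is useful not to split certain words and to treat them as a single unit. An example of this can be found in Chapter 3, in the Representing phrases – phrase2vec recipe, where we store phrases rather than individual words. The NLTK package lets us do this with its custom tokenizer, MWETokenizer:

Import the MWETokenizer class:
```
from nltk.tokenize import MWETokenizer
```
Initialize the tokenizer and indicate that the words dim sum dinner should not be split:
```
tokenizer = MWETokenizer([('dim', 'sum', 'dinner')])
```
Add more words that should be kept together:
```
tokenizer.add_mwe(('best', 'dim', 'sum'))
```

Use the tokenizer to split a sentence: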

tokens = tokenizer.tokenize('Last night I went for dinner in an Italian restaurant. The pasta was delicious.'.split())
print(tokens)

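This sentence contains none of the registered expressions, so the tokens are returned unchanged. Note that the input was split on whitespace with str.split(), so punctuation stays attached to the words: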

['Last', 'night', 'I', 'went', 'for', 'dinner', 'in', 'an', 'Italian', 'restaurant.', 'The', 'pasta', 'was', 'delicious.']

Split a different sentence:

tokens = tokenizer.tokenize('I went out to a dim sum dinner last night. This restaurant has the best dim sum in town.'.split())
print(tokens)

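In this case, the tokenizer combines each registered phrase into a single unit, replacing the spaces with underscores. Only the registered words are merged, so the word the remains a separate token in front of best_dim_sum:

['I', 'went', 'out', 'to', 'a', 'dim_sum_dinner', 'last', 'night.', 'This', 'restaurant', 'has', 'the', 'best_dim_sum', 'in', 'town.']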
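
The merging step can be pictured with a short pure-Python sketch. This is a simplified illustration of the idea, not NLTK's implementation; merge_mwes is a hypothetical helper name, and it simply scans the token list for any registered expression and joins the matching tokens with an underscore:

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Join every occurrence of a multi-word expression into one token."""
    # Try longer expressions first so that they win over shorter ones.
    ordered = sorted(mwes, key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

mwes = [("dim", "sum", "dinner"), ("best", "dim", "sum")]
tokens = "I went out to a dim sum dinner last night.".split()
merged_tokens = merge_mwes(tokens, mwes)
print(merged_tokens)
# ['I', 'went', 'out', 'to', 'a', 'dim_sum_dinner', 'last', 'night.']
```

Because matching is done on exact tokens, an expression followed directly by punctuation (for example, dinner.) would not be merged when the input is split on whitespace only.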

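We can also use spaCy for tokenization. Word tokenization is one of a series of tasks that spaCy performs when it processes text.

There's more

If you are doing further processing on the text, it makes sense to use spaCy. Here is how it works:

Import the spacy package:
```
import spacy
```
Execute this command only if you have not executed it before:
```
!python -m spacy download en_core_web_sm
```
Initialize the spaCy engine using the English model:
```
nlp = spacy.load("en_core_web_sm")
```

Divide the text into words: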

doc = nlp(sherlock_holmes_part_of_text)
words = [token.text for token in doc]

Print the result:

print(words)
print(len(words))

The output will look like this:

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_', 'the', '_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', '\n', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', '\n', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.', 'It', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', '\n', 'akin', 'to', 'love', 'for', 'Irene', 'Adler', '.', 'All', 'emotions', ',', 'and', 'that', 'one', 'particularly', ',', '\n', 'were', 'abhorrent', 'to', 'his', 'cold', ',', 'precise', 'but', 'admirably', 'balanced', 'mind', '.', 'He', '\n', 'was', ',', 'I', 'take', 'it', ',', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', '\n', 'the', 'world', 'has', 'seen', ',', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', '\n', 'false', 'position', '.', 'He', 'never', 'spoke', 'of', 'the', 'softer', 'passions', ',', 'save', 'with', 'a', 'gibe', '\n', 'and', 'a', 'sneer', '.', 'They', 'were', 'admirable', 'things', 'for', 'the', 'observer', '—', 'excellent', 'for', '\n', 'drawing', 'the', 'veil', 'from', 'men', ''s', 'motives', 'and', 'actions', '.', 'But', 'for', 'the', 'trained', '\n', 'reasoner', 'to', 'admit', 'such', 'intrusions', 'into', 'his', 'own', 'delicate', 'and', 'finely', '\n', 'adjusted', 'temperament', 'was', 'to', 'introduce', 'a', 'distracting', 'factor', 'which', 'might', '\n', 'throw', 'a', 'doubt', 'upon', 'all', 'his', 'mental', 'results', '.', 'Grit', 'in', 'a', 'sensitive', '\n', 'instrument', ',', 'or', 'a', 'crack', 'in', 'one', 'of', 'his', 'own', 'high', '-', 'power', 'lenses', ',', 'would', 'not', '\n', 'be', 'more', 'disturbing', 'than', 'a', 'strong', 'emotion', 'in', 'a', 'nature', 'such', 'as', 'his', '.', 'And', '\n', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', ',', 'and', 'that', 'woman', 'was', 'the', 'late', 'Irene', '\n', 'Adler', ',', 'of', 'dubious', 'and', 'questionable', 'memory', '.']
251

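You will notice that the word list is longer with spaCy than with NLTK. One reason is that spaCy keeps the newlines, and each newline is a separate token. Another difference is that spaCy splits words with a hyphen, such as high-power. You can find the exact difference between the two lists by running the following line, which assumes that the NLTK and spaCy token lists have been saved in the variables words_nltk and words_spacy (both code snippets above store their result in words):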

print(set(words_spacy)-set(words_nltk))

This should produce the following output:

{'high', 'power', 'observer', '-', '_', '—', 'excellent', '’s', '\n'}
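
As a small self-contained illustration of this comparison, the sketch below applies the same set difference in both directions to two tiny, hand-written token lists. The lists are hypothetical excerpts chosen to mimic the behavior described above; they are not produced by running NLTK or spaCy:

```python
# Hypothetical excerpts of the two tokenizations of "own high-power lenses,"
words_nltk = ["own", "high-power", "lenses", ","]
words_spacy = ["own", "high", "-", "power", "lenses", ",", "\n"]

# Tokens that appear only in one of the two lists.
only_in_spacy = set(words_spacy) - set(words_nltk)
only_in_nltk = set(words_nltk) - set(words_spacy)

print(sorted(only_in_spacy))
print(sorted(only_in_nltk))
```

Note that a set difference ignores how many times a token occurs, so it shows which token types differ but not how many tokens account for the difference in list lengths.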

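Important note

If you are using spaCy for other processing, it makes sense to use it for tokenization as well. Otherwise, NLTK word tokenization is sufficient.

See also

The NLTK package only has word tokenization for English.

spaCy has models for other languages: Chinese, Dutch, English, French, German, Greek, Italian, Japanese, Portuguese, Romanian, Spanish, and others. To use them, you need to download each one separately. For example, for Spanish, use this command to download the model: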

python -m spacy download es_core_news_sm

Then, add this line to the code to use it:

nlp = spacy.load("es_core_news_sm")

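See the spaCy documentation for more information: https://spacy.io/usage/models.

Part-of-speech tagging

In many cases, NLP processing depends on determining the parts of speech of the words in a text. For example, when we want to find the named entities that appear in a text, we need to know the parts of speech of the words. In this recipe, we again look at the NLTK and spaCy algorithms.

Getting ready

For this part, we will use the same text from the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub repository. For this recipe, we need just the beginning of the book, which can be found in the file https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes_1.txt.

To do this task, you will need the NLTK and spaCy packages, described in the Technical requirements section.

We will also complete this task using a GPT model through the OpenAI API, to demonstrate that it can do the job just as spaCy and NLTK do. To run this part, you will need the openai package, which is included in the Poetry environment. You will also need your own OpenAI API key.

How to do it…

In this recipe, we will use the spaCy package to tag words with their parts of speech.

The process is as follows:

Import the util file and the language util file. The language util file contains the spaCy and NLTK imports and initializes the small spaCy model into the small_model object. These files also include functions that read text from a file and tokenization functions that use spaCy and NLTK: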
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
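We will define a function that outputs the part of speech for every word. In this function, we first process the input text using the spaCy model, which produces a Doc object. The Doc object can be iterated over to obtain Token objects, each of which carries part-of-speech information.

We use this information to create two lists, one with the words and one with their respective parts of speech.

Finally, we zip the two lists, pairing each word with its part of speech, and return the resulting list of tuples. We do this so that we can easily print the whole list with the corresponding parts of speech. When you want to use part-of-speech tagging in your own code, you can simply iterate over the tokens: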
```
def pos_tag_spacy(text, model):
    doc = model(text)
    words = [token.text for token in doc]
    pos = [token.pos_ for token in doc]
    return list(zip(words, pos))
```

Read in the text:

text = read_text_file("../data/sherlock_holmes_1.txt")

Run the preceding function with the text and the model as input:
```
words_with_pos = pos_tag_spacy(text, small_model)
```

Print the output:

print(words_with_pos)

Part of the result is shown here; to see the full output, see the Jupyter notebook (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter01/part_of_speech_tagging_1.3.ipynb):

[('To', 'ADP'),
 ('Sherlock', 'PROPN'),
 ('Holmes', 'PROPN'),
 ('she', 'PRON'),
 ('is', 'AUX'),
 ('always', 'ADV'),
 ('_', 'PUNCT'),
 ('the', 'DET'),
 ('_', 'PROPN'),
 ('woman', 'NOUN'),
 ('.', 'PUNCT'),
 ('I', 'PRON'),
 ('have', 'AUX'),
 ('seldom', 'ADV'),
 ('heard', 'VERB'),
 ('him', 'PRON'),
 ('\n', 'SPACE'),
 ('mention', 'VERB'),
 ('her', 'PRON'),
 ('under', 'ADP'),
 ('any', 'DET'),
 ('other', 'ADJ'),
 ('name', 'NOUN'),
 ('.', 'PUNCT'),…

The resulting list contains tuples of words and parts of speech. The list of part-of-speech tags is available at https://universaldependencies.org/u/pos/.
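
The pairing step and a typical way of consuming its result can be shown without loading a model. In the sketch below, the two lists are written by hand, with tags copied from the beginning of the output above; in real use they would come from the token.text and token.pos_ attributes:

```python
# Hand-written stand-ins for the words and tags produced by the model.
words = ["To", "Sherlock", "Holmes", "she", "is", "always"]
pos = ["ADP", "PROPN", "PROPN", "PRON", "AUX", "ADV"]

# zip pairs the i-th word with the i-th tag.
words_with_pos = list(zip(words, pos))
print(words_with_pos)

# Unpacking the tuples makes it easy to filter by part of speech.
proper_nouns = [word for word, tag in words_with_pos if tag == "PROPN"]
print(proper_nouns)
# ['Sherlock', 'Holmes']
```

The same filtering pattern works for any tag in the tag set, for example NOUN or VERB.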

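There's more

We can compare spaCy's performance with NLTK's on this task. Here are the steps for getting the parts of speech with NLTK:

The imports are already taken care of in the language util file that we imported, so we start by creating a function that outputs the parts of speech for the input words. In it, we use the word_tokenize_nltk function, which is also imported from the language util notebook: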
```
def pos_tag_nltk(text):
    words = word_tokenize_nltk(text)
    words_with_pos = nltk.pos_tag(words)
    return words_with_pos
```
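Next, we apply the function to the text we read in previously:
```
words_with_pos = pos_tag_nltk(text)
```

Print out the result:

print(words_with_pos)

Part of the output is shown here. To see the full output, see the Jupyter notebook: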

[('To', 'TO'),
 ('Sherlock', 'NNP'),
 ('Holmes', 'NNP'),
 ('she', 'PRP'),
 ('is', 'VBZ'),
 ('always', 'RB'),
 ('_the_', 'JJ'),
 ('woman', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('have', 'VBP'),
 ('seldom', 'VBN'),
 ('heard', 'RB'),
 ('him', 'PRP'),
 ('mention', 'VB'),
 ('her', 'PRP'),
 ('under', 'IN'),
 ('any', 'DT'),
 ('other', 'JJ'),
 ('name', 'NN'),
 ('.', '.'),…

The list of part-of-speech tags that NLTK uses is different from the one spaCy uses and can be accessed by running the following commands:

python
>>> import nltk
>>> nltk.download('tagsets')
>>> nltk.help.upenn_tagset()

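Comparing performance, we see that spaCy takes 0.02 seconds, while NLTK takes 0.01 seconds (your numbers might be different), so their performance is similar, with NLTK slightly faster. However, the part-of-speech information is already available in the spaCy object after the initial processing, so if you are going to do any further processing, spaCy is the better choice.

Important note

spaCy does all of its processing at once and stores the results in the Doc object. The part-of-speech information is available by iterating through the Token objects.

There's more

We can use the OpenAI API with the GPT-3.5 and GPT-4 models to perform various tasks, including many NLP tasks. Here, we show how to use the OpenAI API to get NLTK-style parts of speech for input text. You can also specify the output format and the style of the part-of-speech tags in the prompt. For this code to run correctly, you need your own OpenAI API key:

Import openai and create the OpenAI client using your API key. The OPEN_AI_KEY constant is set in the ../util/file_utils.ipynb file: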
```
from openai import OpenAI
client = OpenAI(api_key=OPEN_AI_KEY)
```

Set the prompt:

prompt="""Decide what the part of speech tags are for a sentence.
Preserve original capitalization.
Return the list in the format of a python tuple: (word, part of speech).
Sentence: In his eyes she eclipses and predominates the whole of her sex."""

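Send the request to the OpenAI API. Some of the important parameters that we send to the API are the model we want to use; the temperature, which affects how much the model's response varies; and the maximum number of tokens the model should return as the completion: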

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=256,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    messages=[
        {"role": "system", 
         "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
)

Print the response:

print(response)

The output will look like this:

ChatCompletion(id='chatcmpl-9hCq34UAzMiNiqNGopt2U8ZmZM5po', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are the part of speech tags for the sentence "In his eyes she eclipses and predominates the whole of her sex" in the format of a Python tuple:\n\n[(\'In\', \'IN\'), (\'his\', \'PRP$\'), (\'eyes\', \'NNS\'), (\'she\', \'PRP\'), (\'eclipses\', \'VBZ\'), (\'and\', \'CC\'), (\'predominates\', \'VBZ\'), (\'the\', \'DT\'), (\'whole\', \'JJ\'), (\'of\', \'IN\'), (\'her\', \'PRP$\'), (\'sex\', \'NN\')]', role='assistant', function_call=None, tool_calls=None))], created=1720084483, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=120, prompt_tokens=74, total_tokens=194))

To see just the GPT output, do the following:

print(response.choices[0].message.content)

The output will look like this:

Here are the part of speech tags for the sentence "In his eyes she eclipses and predominates the whole of her sex" in the format of a Python tuple:
[('In', 'IN'), ('his', 'PRP$'), ('eyes', 'NNS'), ('she', 'PRP'), ('eclipses', 'VBZ'), ('and', 'CC'), ('predominates', 'VBZ'), ('the', 'DT'), ('whole', 'JJ'), ('of', 'IN'), ('her', 'PRP$'), ('sex', 'NN')]

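We can use the literal_eval function to convert the response into a list of tuples. We ask the GPT model to return only the answer, without any additional explanation, so that there is no free text in the answer and we can process it automatically. We do this so that we can compare the output of the OpenAI API with the NLTK output: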

from ast import literal_eval
def pos_tag_gpt(text, client):
    prompt = f"""Decide what the part of speech tags are for a sentence.
    Preserve original capitalization.
    Return the list in the format of a python tuple: (word, part of speech).
    Do not include any other explanations.
    Sentence: {text}."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", 
             "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    result = response.choices[0].message.content
    result = result.replace("\n", "")
    result = list(literal_eval(result))
    return result

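Now, let's time the GPT function so that we can compare its performance with the other methods used previously:

import time
start = time.time()
first_sentence = "In his eyes she eclipses and predominates the whole of her sex."
# Pass the OpenAI client created earlier, not the API key
words_with_pos = pos_tag_gpt(first_sentence, client)
print(words_with_pos)
print(f"GPT: {time.time() - start} s")

The result looks like this: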

[('In', 'IN'), ('his', 'PRP$'), ('eyes', 'NNS'), ('she', 'PRP'), ('eclipses', 'VBZ'), ('and', 'CC'), ('predominates', 'VBZ'), ('the', 'DT'), ('whole', 'NN'), ('of', 'IN'), ('her', 'PRP$'), ('sex', 'NN'), ('.', '.')]
GPT: 2.4942469596862793 s

The GPT output is very similar to NLTK's, but slightly different:

words_with_pos_nltk = pos_tag_nltk(first_sentence)
print(words_with_pos == words_with_pos_nltk)

This outputs the following:

False

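The difference between GPT and NLTK is that GPT tags the word whole as an adjective, whereas NLTK tags it as a noun. In this case, NLTK is correct.

We see that the LLM's output is very similar, but it is hundreds of times slower than NLTK (about 2.5 seconds versus about 0.01 seconds in the runs above).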
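
Because a model may still wrap its answer in free text, calling literal_eval directly on the response can raise an exception. The following sketch shows one defensive way to parse such a response. The parse_pos_response helper is a hypothetical name introduced here, and returning None on failure is a design assumption rather than part of the original recipe:

```python
from ast import literal_eval

def parse_pos_response(raw):
    """Parse a model response into a list of (word, tag) tuples, or None."""
    cleaned = raw.strip().replace("\n", "")
    try:
        parsed = literal_eval(cleaned)
    except (ValueError, SyntaxError):
        # The response contained free text or was not a valid Python literal.
        return None
    if not isinstance(parsed, (list, tuple)):
        return None
    pairs = []
    for item in parsed:
        # Every element must itself be a two-item tuple.
        if not (isinstance(item, tuple) and len(item) == 2):
            return None
        pairs.append((str(item[0]), str(item[1])))
    return pairs

good = parse_pos_response("[('In', 'IN'), ('his', 'PRP$'), ('eyes', 'NNS')]")
bad = parse_pos_response("Here are the tags: [('In', 'IN')]")
print(good)
print(bad)
```

A caller can then retry the request or fall back to another tagger whenever the helper returns None.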

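See also

If you would like to tag text in another language, you can use spaCy's models for other languages. For example, we can load the Spanish spaCy model to process Spanish text: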

nlp = spacy.load("es_core_news_sm")

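If spaCy does not have a model for the language you are working with, you can train your own model with spaCy. See https://spacy.io/usage/training#tagger-parser.

Combining similar words – lemmatization

We can find the canonical form of a word using lemmatization. For example, the lemma of the word cats is cat, and the lemma of the word ran is run. This is useful when we are trying to match some word and do not want to list all of its possible forms. Instead, we can just use its lemma.

Getting ready

We will use the spaCy package for this task.

How to do it...

When spaCy processes a piece of text, the resulting Doc object can be iterated over to obtain Token objects, as we saw in the Part-of-speech tagging recipe. These Token objects contain the lemma information for each word in the text.

The process of getting the lemmas is as follows:

Import the file and language utility files. This imports spaCy and initializes the small model object: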
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```

Create a list of words that we want to lemmatize:

words = ["leaf", "leaves", "booking", "writing", "completed", "stemming"]

Create a Doc object for each of the words:

docs = [small_model(word) for word in words]

Print each word from the list and its lemma:
```
for doc in docs:
    for token in doc:
        print(token, token.lemma_)
```
The result will be as follows:
```
leaf leaf
leaves leave
booking book
writing write
completed complete
stemming stem
```
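The result shows correct lemmatization for all the words. However, some words are ambiguous. For example, the word leaves could be a verb, in which case the lemma leave is correct, or a noun, in which case this lemma is wrong. If we give spaCy continuous text instead of individual words, it is likely to disambiguate correctly.

Now, apply lemmatization to a longer text. Here, we read in a small portion of The Adventures of Sherlock Holmes and lemmatize every word in it:

text = read_text_file("../data/sherlock_holmes_1.txt")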
doc = small_model(text)
for token in doc:
    print(token, token.lemma_)

Part of the result will look like this:

To to
Sherlock Sherlock
Holmes Holmes
she she
is be
always always
_ _
the the
_ _
woman woman
. .
…
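
The ambiguity of a word such as leaves, discussed earlier in this recipe, comes down to the fact that the right lemma depends on the part of speech. The toy sketch below makes that dependency explicit with a hand-written lookup table. Both the table and the toy_lemmatize helper are hypothetical illustrations; spaCy's lemmatizer is considerably more sophisticated:

```python
# Toy lookup table keyed by (lowercased word, coarse part of speech).
toy_lemmas = {
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("booking", "NOUN"): "booking",
    ("booking", "VERB"): "book",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    key = (word.lower(), pos)
    return toy_lemmas.get(key, word.lower())

print(toy_lemmatize("leaves", "NOUN"))  # leaf
print(toy_lemmatize("leaves", "VERB"))  # leave
```

A lemmatizer that sees a word in isolation has to guess the part of speech, whereas in running text the surrounding words help determine it.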

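Removing stopwords

When we work with words, especially if we are considering their semantics, we sometimes need to exclude very frequent words that do not bring any substantial meaning to a sentence (words such as but, can, we, and so on). For example, if we want to get a rough sense of the topic of a text, we could count its most frequent words. However, in any text, the most frequent words will be stopwords, so we want to remove them before processing. This recipe shows how to do that. The stopword list we use in this recipe comes from the NLTK package and might not include all the words you need. You will need to modify the list accordingly.

Getting ready

We will remove stopwords using spaCy and NLTK; these packages are part of the Poetry environment that we installed earlier.

We will use the Sherlock Holmes text mentioned previously. For this recipe, we need just the beginning of the book, which can be found in the file https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes_1.txt.

In step 1, we run the utility notebooks. In step 2, we import the nltk package and its stopword list. In step 3, we download the stopwords data, if needed. In step 4, we print out the stopword list. In step 5, we read in a small portion of The Adventures of Sherlock Holmes. In step 6, we tokenize the text and print its length, which is 230. In step 7, we remove the stopwords from the original word list using a list comprehension. We then print the length of the result and see that the list has been reduced to 105 items. Notice that in the list comprehension, we check whether the lowercase version of the word is in the stopword list, since all the stopwords are lowercase.

How to do it…

In this recipe, we will read in the text file, tokenize the text, and remove the stopwords from the list:

Run the file and language utility notebooks: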

%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"

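Import the NLTK stopword list:
```
from nltk.corpus import stopwords
```
The first time you run the notebook, download the stopwords data. You do not need to download it again on subsequent runs: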
```
nltk.download('stopwords')
```

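Note

Here is a list of languages for which NLTK provides stopwords: Arabic, Azerbaijani, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.

You can see all the stopwords that come with NLTK by printing the list:

print(stopwords.words('english'))

The result will be as follows: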

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Read in the text file:

text = read_text_file("../data/sherlock_holmes_1.txt")

Tokenize the text and print the length of the resulting list:
```
words = word_tokenize_nltk(text)
print(len(words))
```
The result will be as follows:
```
230
```
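Remove the stopwords from the list using a list comprehension and print the length of the result. Notice that in the list comprehension, we check whether the lowercase version of the word is in the stopword list, since all the stopwords are lowercase:
```
words = [word for word in words if word.lower() not in stopwords.words("english")]
print(len(words))
```
The result will be as follows: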
```
105
```

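The code filters the stopwords out of the text, keeping a word only if it does not appear in the stopword list. Comparing the lengths of the two lists, one unfiltered and one without stopwords, we see that we removed more than half of the words.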
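
The filtering logic itself does not depend on NLTK. The sketch below reproduces it with a tiny hand-picked stopword set so that it runs on its own; the set is an illustrative subset, not the NLTK list. Storing the stopwords in a set also makes each membership check fast:

```python
# Tiny illustrative subset of English stopwords (not the full NLTK list).
stop_words = {"to", "she", "is", "the", "of", "and", "a", "in", "her"}

tokens = ["To", "Sherlock", "Holmes", "she", "is", "always",
          "_the_", "woman", "."]

# Compare the lowercased token, because the stopwords are all lowercase.
filtered = [token for token in tokens if token.lower() not in stop_words]

print(filtered)
# ['Sherlock', 'Holmes', 'always', '_the_', 'woman', '.']
print(len(tokens), len(filtered))
```

Note that _the_ survives because the surrounding underscores prevent a match, and the period survives because punctuation is not in the stopword set; both may need separate handling in real text.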

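Important note

You might find that some of the words in the provided stopword list are unnecessary or that some are missing. You will need to modify the list accordingly. The NLTK stopword list is a Python list, and you can add and remove elements using the standard Python list functions.

There's more…

We can also remove stopwords using spaCy. Here is how to do it:

For convenience, assign the stopwords to a variable: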
```
stopwords = small_model.Defaults.stop_words
```
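Tokenize the text and print its length:
```
words = word_tokenize_nltk(text)
print(len(words))
```
It will give the following result:
```
230
```
Remove the stopwords from the list using a list comprehension and print the length of the result: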
```
words = [word for word in words if word.lower() not in stopwords]
print(len(words))
```
The result will be very similar to NLTK's:
```
106
```
The stopwords in spaCy are stored in a set, and we can add more words to it:
```
print(len(stopwords))
stopwords.add("new")
print(len(stopwords))
```
The result will be as follows:
```
327
328
```
Similarly, we can remove words if needed:
```
print(len(stopwords))
stopwords.remove("new")
print(len(stopwords))
```
The result will be as follows:
```
328
327
```

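We can also compile a stopwords list from the text we are working with by calculating the frequencies of the words in it. This gives you an automatic way of removing stopwords, without the need for manual review.

There's more

In this section, we will show you two ways of doing so. You will need the file at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes.txt. The FreqDist object in the NLTK package counts the number of occurrences of each word, which we later use to find the most frequent words and remove them as stopwords:

Import NLTK's FreqDist class: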
```
from nltk.probability import FreqDist
```

Define a function that will compile a list of stopwords:
```
def compile_stopwords_list_frequency(text, cut_off=0.02):
    words = word_tokenize_nltk(text)
    freq_dist = FreqDist(word.lower() for word in words)
    words_with_frequencies = [
        (word, freq_dist[word]) for word in freq_dist.keys()]
    # Sort in ascending order of frequency
    sorted_words = sorted(words_with_frequencies,
        key=lambda tup: tup[1])
    stopwords = []
    if isinstance(cut_off, int):
        # First option: use a frequency cutoff
        stopwords = [word for (word, freq) in sorted_words
            if freq > cut_off]
    elif isinstance(cut_off, float):
        # Second option: use a percentage of the words
        length_cutoff = int(cut_off*len(sorted_words))
        # Guard against a zero-length cutoff: a [-0:] slice would
        # return the whole list instead of an empty one
        if length_cutoff > 0:
            stopwords = [word for (word, freq) in
                sorted_words[-length_cutoff:]]
    else:
        raise TypeError("The cut off needs to be either a float (percentage) or an int (frequency cut off)")
    return stopwords
```

Define the stopwords list using the default settings and print the result and its length:
```
text = read_text_file("../data/sherlock_holmes.txt")
stopwords = compile_stopwords_list_frequency(text)
print(stopwords)
print(len(stopwords))
```
The result will look like this:

['make', 'myself', 'night', 'until', 'street', 'few', 'why', 'thought', 'take', 'friend', 'lady', 'side', 'small', 'still', 'these', 'find', 'st.', 'every', 'watson', 'too', 'round', 'young', 'father', 'left', 'day', 'yet', 'first', 'once', 'took', 'its', 'eyes', 'long', 'miss', 'through', 'asked', 'most', 'saw', 'oh', 'morning', 'right', 'last', 'like', 'say', 'tell', 't', 'sherlock', 'their', 'go', 'own', 'after', 'away', 'never', 'good', 'nothing', 'case', 'however', 'quite', 'found', 'made', 'house', 'such', 'heard', 'way', 'yes', 'hand', 'much', 'matter', 'where', 'might', 'just', 'room', 'any', 'face', 'here', 'back', 'door', 'how', 'them', 'two', 'other', 'came', 'time', 'did', 'than', 'come', 'before', 'must', 'only', 'know', 'about', 'shall', 'think', 'more', 'over', 'us', 'well', 'am', 'or', 'may', 'they', ';', 'our', 'should', 'now', 'see', 'down', 'can', 'some', 'if', 'will', 'mr.', 'little', 'who', 'into', 'do', 'has', 'could', 'up', 'man', 'out', 'when', 'would', 'an', 'are', 'by', '!', 'were', 's', 'then', 'one', 'all', 'on', 'no', 'what', 'been', 'your', 'very', 'him', 'her', 'she', 'so', ''', 'holmes', 'upon', 'this', 'said', 'from', 'there', 'we', 'me', 'be', 'but', 'not', 'for', '?', 'at', 'which', 'with', 'had', 'as', 'have', 'my', ''', 'is', 'his', 'was', 'you', 'he', 'it', 'that', 'in', '"', 'a', 'of', 'to', '"', 'and', 'i', '.', 'the', ',']
181

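Now, use the function with a cutoff of 5% (the most frequent 5% of the words are used as stopwords):
```
text = read_text_file("../data/sherlock_holmes.txt")
stopwords = compile_stopwords_list_frequency(text, cut_off=0.05)
print(len(stopwords))
```
This prints the number of stopwords in the new list, which is larger than the 181 words obtained with the default 2% cutoff.

Now, use an absolute frequency cutoff of 100 (take the words whose frequency is greater than 100):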

stopwords = compile_stopwords_list_frequency(text, cut_off=100)
print(stopwords)
print(len(stopwords))

The result is as follows:

['away', 'never', 'good', 'nothing', 'case', 'however', 'quite', 'found', 'made', 'house', 'such', 'heard', 'way', 'yes', 'hand', 'much', 'matter', 'where', 'might', 'just', 'room', 'any', 'face', 'here', 'back', 'door', 'how', 'them', 'two', 'other', 'came', 'time', 'did', 'than', 'come', 'before', 'must', 'only', 'know', 'about', 'shall', 'think', 'more', 'over', 'us', 'well', 'am', 'or', 'may', 'they', ';', 'our', 'should', 'now', 'see', 'down', 'can', 'some', 'if', 'will', 'mr.', 'little', 'who', 'into', 'do', 'has', 'could', 'up', 'man', 'out', 'when', 'would', 'an', 'are', 'by', '!', 'were', 's', 'then', 'one', 'all', 'on', 'no', 'what', 'been', 'your', 'very', 'him', 'her', 'she', 'so', ''', 'holmes', 'upon', 'this', 'said', 'from', 'there', 'we', 'me', 'be', 'but', 'not', 'for', '?', 'at', 'which', 'with', 'had', 'as', 'have', 'my', ''', 'is', 'his', 'was', 'you', 'he', 'it', 'that', 'in', '"', 'a', 'of', 'to', '"', 'and', 'i', '.', 'the', ',']
131

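The function that creates the stopwords list takes in the text and the cut_off parameter. The parameter can be a float representing the percentage of the frequency-ranked words to place in the stopwords list. Alternatively, it can be an integer representing an absolute frequency threshold, above which words are considered stopwords. In the function, we first extract the words from the book, then create a FreqDist object, and then use the frequency distribution to create a list of (word, word frequency) tuples. We sort the list by word frequency. We then check the type of the cut_off parameter and raise an error if it is neither a float nor an integer. If it is an integer, we return all words with a frequency higher than the parameter as stopwords. If it is a float, we use the parameter as a percentage to calculate the number of words to return.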
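
The two cutoff modes are easier to follow on a sentence small enough to count by hand. The sketch below is not the recipe's function: it assumes `collections.Counter` in place of NLTK's FreqDist and a whitespace split in place of `word_tokenize_nltk`, and the name `compile_stopwords_sketch` is made up for this illustration:

```python
from collections import Counter

def compile_stopwords_sketch(words, cut_off):
    """Return the most frequent words, chosen by count (int) or share (float)."""
    counts = Counter(word.lower() for word in words)
    # Sort in ascending order of frequency, as in the recipe's function.
    ranked = sorted(counts.items(), key=lambda pair: pair[1])
    if isinstance(cut_off, int):
        # Absolute threshold: keep words seen more than cut_off times.
        return [word for word, count in ranked if count > cut_off]
    if isinstance(cut_off, float):
        # Relative threshold: keep the top share of the distinct words.
        how_many = int(cut_off * len(ranked))
        return [word for word, _ in ranked[-how_many:]] if how_many else []
    raise TypeError("cut_off must be an int or a float")

words = "the cat and the dog and the bird saw the fish".split()
print(compile_stopwords_sketch(words, 1))    # ['and', 'the']
print(compile_stopwords_sketch(words, 0.3))  # ['and', 'the']
```

There are seven distinct words here, so a 30% cutoff keeps `int(0.3 * 7) = 2` of them, which happen to be the same two words that occur more than once.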

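Chapter 2: Playing with Grammar

Grammar is one of the main building blocks of language. Every human language, and every programming language for that matter, has a set of rules that everyone using it must follow, or risk not being understood. These grammatical rules can be uncovered using NLP and are useful for extracting data from sentences. For example, using information about the grammatical structure of a text, we can parse out subjects, objects, and the relations between different entities.

In this chapter, you will learn how to use different packages to reveal the grammatical structure of words and sentences, as well as how to extract certain parts of sentences. This chapter covers the following topics:

Counting nouns - plural and singular nouns
Getting the dependency parse
Extracting noun chunks
Extracting the subjects and objects of the sentence
Finding patterns in text using grammatical information

Technical requirements

Follow the installation requirements given in Chapter 1 to run the notebooks in this chapter.

Counting nouns - plural and singular nouns

In this recipe, we will do two things: determine whether a noun is plural or singular, and turn plural nouns into singular ones and vice versa.

You might need both for a variety of tasks. For example, you might want to compute word statistics, and for that you will most likely need to count singular and plural nouns together. To count plural nouns together with singular ones, you need a way to recognize whether a word is plural or singular.

Getting ready

To determine whether a noun is singular or plural, we will use spaCy in two different ways: by looking at the difference between the lemma and the actual word, and by looking at the morph attribute. To inflect the nouns, that is, turn singular nouns into plural or vice versa, we will use the textblob package. We will also see how to determine noun number using GPT-3.5 through the OpenAI API. The code for this section is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter02.

How to do it...

We will first use spaCy's lemma information to infer whether a noun is singular or plural. Then, we will use the morph attribute of Token objects. Next, we will create a function that uses either of these methods. Finally, we will use GPT-3.5 to determine the number of the nouns:

Run the code in the file and language utility notebooks. If you run into an error saying that the small or large model does not exist, open the lang_utils.ipynb file, uncomment the statement that downloads the model, and run it: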
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
Initialize the text variable and process it with the small spaCy model to get the resulting Doc object:
```
text = "I have five birds"
doc = small_model(text)
```
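In this step, we loop through the Doc object. For each token in the object, we check whether it is a noun and whether its lemma differs from the word itself. Since the lemma is the base form of the word, a lemma that is different from the word means that the token is plural: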
```
for token in doc:
    if (token.pos_ == "NOUN" and token.lemma_ != token.text):
        print(token.text, "plural")
```
The result should be as follows:
```
birds plural
```
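Now, we will check the number of a noun using a different method: the morph features of a Token object. The morph features are the morphological features of a word, such as number, case, and so on. Since we know that the token at index 3 is a noun, we access its morph features directly and get Number, which leads to the same conclusion as before: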
```
doc = small_model("I have five birds.")
print(doc[3].morph.get("Number"))
```
Here is the result:
```
['Plur']
```
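In this step, we prepare to define a function that returns (noun, number) tuples. To encode the noun number cleanly, we use an Enum class that assigns different values to the numbers: 1 to singular and 2 to plural. Once the class is created, we can refer to the noun number values directly as Noun_number.SINGULAR and Noun_number.PLURAL:
```
from enum import Enum

class Noun_number(Enum):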
    SINGULAR = 1
    PLURAL = 2
```

In this step, we define the function. It takes as input the text, the spaCy model, and the method for determining noun number. The two methods are lemma and morph, the same two methods that we used in steps 3 and 4, respectively. The function outputs a list of tuples, each in the format (noun text, noun number), where the noun number is expressed using the Noun_number class defined in step 5:
```
def get_nouns_number(text, model, method="lemma"):
    nouns = []
    doc = model(text)
    for token in doc:
        if (token.pos_ == "NOUN"):
            if method == "lemma":
                if token.lemma_ != token.text:
                    nouns.append((token.text,
                        Noun_number.PLURAL))
                else:
                    nouns.append((token.text,
                        Noun_number.SINGULAR))
            elif method == "morph":
                # morph.get returns a list of values, such as ['Plur']
                if "Plur" in token.morph.get("Number"):
                    nouns.append((token.text,
                        Noun_number.PLURAL))
                else:
                    nouns.append((token.text,
                        Noun_number.SINGULAR))
    return nouns
```

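We can use the preceding function to see how it performs with different spaCy models. In this step, we use the function we just defined with the small spaCy model. In the output recorded below, both methods label the irregular noun geese as singular; the exact result can vary with the spaCy version and model:
```
text = "Three geese crossed the road"
nouns = get_nouns_number(text, small_model, "morph")
print(nouns)
nouns = get_nouns_number(text, small_model)
print(nouns)
```
The result should be as follows: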

[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]

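Now, let's do the same using the large model. If you have not yet downloaded the large model, do so by running the first line. Otherwise, you can comment it out. In the output recorded below, the morph method still labels geese as singular, while the lemma method provides the correct answer:
```
!python -m spacy download en_core_web_lg
large_model = spacy.load("en_core_web_lg")
nouns = get_nouns_number(text, large_model, "morph")
print(nouns)
nouns = get_nouns_number(text, large_model)
print(nouns)
```
The result should be as follows: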

[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
[('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]

Now, let's use GPT-3.5 to get the noun number. In the results, we see that GPT-3.5 correctly identifies the number of both geese and road:

from openai import OpenAI
client = OpenAI(api_key=OPEN_AI_KEY)
prompt="""Decide whether each noun in the following text is singular or plural.
Return the list in the format of a python tuple: (word, number). Do not provide any additional explanations.
Sentence: Three geese crossed the road."""
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=256,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)

The result should be as follows:

('geese', 'plural')
('road', 'singular')
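
The reply comes back as plain text, so it has to be parsed before it can be used in code. The following is a minimal sketch that assumes the reply contains one Python tuple literal per line, as in the output above. A real reply may deviate from that format, so lines that do not parse are skipped. The function name `parse_noun_numbers` is made up for this illustration:

```python
import ast

def parse_noun_numbers(reply):
    """Parse lines such as "('geese', 'plural')" into (word, number) tuples."""
    pairs = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # skip anything that is not a Python literal
        if isinstance(value, tuple) and len(value) == 2:
            pairs.append(value)
    return pairs

reply = "('geese', 'plural')\n('road', 'singular')"
print(parse_noun_numbers(reply))
# [('geese', 'plural'), ('road', 'singular')]
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals and will not execute arbitrary code found in the reply.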

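There's more…

We can also change nouns from plural to singular and vice versa. We will use the textblob package for that. The package should be installed automatically through the Poetry environment:

Import the TextBlob class from the package: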
```
from textblob import TextBlob
```

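Initialize a list of text variables and process them with the TextBlob class via a list comprehension:
```
texts = ["book", "goose", "pen", "point", "deer"]
blob_objs = [TextBlob(text) for text in texts]
```
Use the pluralize function of the object to get the plurals. This function returns a list, and we access its first element. Print the result: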
```
plurals = [blob_obj.words.pluralize()[0] 
    for blob_obj in blob_objs]
print(plurals)
```
The result should be as follows:
```
['books', 'geese', 'pens', 'points', 'deer']
```
Now, we will do the reverse. We use the preceding plurals list to turn the plural nouns into TextBlob objects:
```
blob_objs = [TextBlob(text) for text in plurals]
```

Use the singularize function to turn the nouns into singular and print them:
```
singulars = [blob_obj.words.singularize()[0] 
    for blob_obj in blob_objs]
print(singulars)
```
The result should be the same as the list we started with in step 2:

['book', 'goose', 'pen', 'point', 'deer']

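Getting the dependency parse

A dependency parse is a tool that shows the dependencies in a sentence. For example, in the sentence The cat wore a hat, the root of the sentence is the verb, wore, and both the subject, the cat, and the object, a hat, are dependents. Dependency parses can be very useful in many NLP tasks since they show the grammatical structure of the sentence, including the subject, the main verb, the object, and so on. This structure can then be used in downstream processing.

The spaCy NLP engine produces the dependency parse as part of its overall analysis. The dependency parse tags explain the role of each word in the sentence. ROOT is the word that all the other words depend on, usually the verb.

Getting ready

We will use spaCy to create the dependency parse. The required packages are part of the Poetry environment.

How to do it…

We will take a sentence from the sherlock_holmes_1.txt file to illustrate the dependency parse. The steps are as follows:

Run the file and language utility notebooks:
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
Define the sentence we will be parsing:
```
sentence = 'I have seldom heard him mention her under any other name.'
```
Define a function that prints the word, its grammatical function as stored in the dep_ attribute, and an explanation of that label. The dep_ attribute of the Token object shows the grammatical function of the word in the sentence:
```
def print_dependencies(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(token.text, "\t", token.dep_, "\t", 
            spacy.explain(token.dep_))
```
Now, let's apply this function to our sentence. We can see that the verb heard is the ROOT word of the sentence, with all the other words depending on it:
```
print_dependencies(sentence, small_model)
```
The result should be as follows: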

I    nsubj    nominal subject
have    aux    auxiliary
seldom    advmod    adverbial modifier
heard    ROOT    root
him    nsubj    nominal subject
mention    ccomp    clausal complement
her    dobj    direct object
under    prep    prepositional modifier
any    det    determiner
other    amod    adjectival modifier
name    pobj    object of preposition
.    punct    punctuation

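To explore the dependency parse structure, we can use the attributes of the Token class. Using the ancestors and children attributes, we can get the tokens that a token depends on and the tokens that depend on it, respectively. The function that prints the ancestors is as follows: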
```
def print_ancestors(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(token.text, [t.text for t in token.ancestors])
```
Now, let's apply this function to our sentence:
```
print_ancestors(sentence, small_model)
```
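The output will be as follows. In the result, we see that heard has no ancestors, since it is the main word in the sentence. All the other words depend on it and, in fact, contain heard in their ancestor lists.

The dependency chain can be seen by following the ancestor links for each word. For example, if we look at the word name, we see that its ancestors are under, mention, and heard. The immediate parent of name is under, the parent of under is mention, and the parent of mention is heard. A dependency chain always leads to the root, or main word, of the sentence: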
```
I ['heard']
have ['heard']
seldom ['heard']
heard []
him ['mention', 'heard']
mention ['heard']
her ['mention', 'heard']
under ['mention', 'heard']
any ['name', 'under', 'mention', 'heard']
other ['name', 'under', 'mention', 'heard']
name ['under', 'mention', 'heard']
. ['heard']
```
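
The ancestor chain can be reproduced without spaCy by writing down the head of every word and following the links upward. The sketch below is only an illustration: the `HEADS` dictionary is copied by hand from the parse shown in this recipe, and keying it by word text works only because no word repeats in this sentence (spaCy itself tracks tokens by position). The helper name `ancestors` is made up for this example:

```python
# Head of each word in the example sentence, read off the parse above.
# The root ("heard") has no head, which is marked here with None.
HEADS = {
    "I": "heard", "have": "heard", "seldom": "heard", "heard": None,
    "him": "mention", "mention": "heard", "her": "mention",
    "under": "mention", "any": "name", "other": "name",
    "name": "under", ".": "heard",
}

def ancestors(word, heads):
    """Follow head links upward until the root is reached."""
    chain = []
    parent = heads[word]
    while parent is not None:
        chain.append(parent)
        parent = heads[parent]
    return chain

print(ancestors("name", HEADS))   # ['under', 'mention', 'heard']
print(ancestors("heard", HEADS))  # []
```

Every chain ends at `heard`, which is why the root appears in the ancestor list of every other word.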

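To see all the children, use the following function. It prints each word together with the words that depend on it, its children:
```
def print_children(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(token.text,[t.text for t in token.children])
```
Now, let's apply this function to our sentence:
```
print_children(sentence, small_model)
```
The result should be as follows. Now, the word heard has a list of words that depend on it, since it is the main word in the sentence: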

I []
have []
seldom []
heard ['I', 'have', 'seldom', 'mention', '.']
him []
mention ['him', 'her', 'under']
her []
under ['name']
any []
other []
name ['any', 'other']
. []

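We can also see the left and right children in separate lists. In the following function, we print the children as two separate lists, left and right. This can be useful when doing grammatical transformations of the sentence:
```
def print_lefts_and_rights(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(token.text,
            [t.text for t in token.lefts],
            [t.text for t in token.rights])
```
Let's use this function on our sentence:
```
print_lefts_and_rights(sentence, small_model)
```
The result should be as follows: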

I [] []
have [] []
seldom [] []
heard ['I', 'have', 'seldom'] ['mention', '.']
him [] []
mention ['him'] ['her', 'under']
her [] []
under [] ['name']
any [] []
other [] []
name ['any', 'other'] []
. [] []

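We can also see the subtree of each token, that is, the token itself plus everything that depends on it, by using this function:
```
def print_subtree(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(token.text, [t.text for t in token.subtree])
```
Let's use this function on our sentence:
```
print_subtree(sentence, small_model)
```
The result should be as follows. From the subtrees, we can see the grammatical phrases that appear in the sentence, such as any other name and under any other name: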

I ['I']
have ['have']
seldom ['seldom']
heard ['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
him ['him']
mention ['him', 'mention', 'her', 'under', 'any', 'other', 'name']
her ['her']
under ['under', 'any', 'other', 'name']
any ['any']
other ['other']
name ['any', 'other', 'name']
. ['.']

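See also

The dependency parse can be visualized graphically using the displaCy package, which is part of spaCy. See Chapter 7, Visualizing Text Data, for a detailed recipe on how to do the visualization.

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. They represent nouns together with any words that depend on and accompany them. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting these noun chunks is instrumental to many other downstream NLP tasks, such as named entity recognition and processing entities and the relationships between them. In this recipe, we will explore how to extract noun chunks from a text.

Getting ready

We will use the spaCy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.

How to do it...

Use the following steps to get the noun chunks from a text:

Run the file and language utility notebooks:
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
Define a function that prints out the noun chunks. The noun chunks are available through the doc.noun_chunks attribute:
```
def print_noun_chunks(text, model):
    doc = model(text)
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)
```
Read the text from the sherlock_holmes_1.txt file and run the function on the resulting text: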
```
sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
print_noun_chunks(sherlock_holmes_part_of_text, small_model)
```
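This is a partial result. See the notebook's output at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter02/noun_chunks_2.3.ipynb for the full printout. The function correctly picks up the pronouns, nouns, and noun phrases in the text: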
```
Sherlock Holmes
she
the_ woman
I
him
her
any other name
his eyes
she
the whole
…
```

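See also

The topic of semantic similarity is explored in more detail in Chapter 3.

Extracting the subjects and objects of the sentence

Sometimes, we might need to find the subject and direct objects of a sentence, and that is easily accomplished with the spaCy package.

Getting ready

We will use the dependency tags from spaCy to find subjects and objects. The code uses the spaCy engine to parse the sentence. Then, the subject function loops through the tokens and, if the dependency tag contains subj, returns that token's subtree as a Span object. There are different subject tags, including nsubj for regular subjects and nsubjpass for subjects of passive sentences, so we want to look for both.

How to do it...

We will use the subtree attribute of tokens to find the complete noun chunk that is the subject or direct object of the verb (see the Getting the dependency parse recipe). We will define functions to find subjects, direct objects, dative objects, and prepositional phrases:

Run the file and language utility notebooks:
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
We will use two functions to find the subject and the direct object of the sentence. These functions loop through the tokens and return the subtree of the token whose dependency tag contains subj or dobj, respectively. Here is the subject function. It looks for the token whose dependency tag contains subj and then returns the subtree of that token. There are several subject dependency tags, including nsubj and nsubjpass (for the subject of a passive sentence), so we look for the most general pattern: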
```
def get_subject_phrase(doc):
    for token in doc:
        if ("subj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return doc[start:end]
```

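Here is the direct object function. It works similarly to get_subject_phrase but looks for the dobj dependency tag instead of a tag that contains subj. If the sentence does not have a direct object, it returns None:
```
def get_object_phrase(doc):
    for token in doc:
        if ("dobj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return doc[start:end]
```
Assign the list of sentences to a variable, loop through them, and use the preceding functions to print out their subjects and objects:
```
sentences = [
    "The big black cat stared at the small dog.",
    "Jane watched her brother in the evenings.",
    "Laura gave Sam a very interesting book."
]
for sentence in sentences:
    doc = small_model(sentence)
    subject_phrase = get_subject_phrase(doc)
    object_phrase = get_object_phrase(doc)
    print(sentence)
    print("\tSubject:", subject_phrase)
    print("\tDirect object:", object_phrase)
```
The result will be as follows. Since the first sentence does not have a direct object, None is printed out. For the sentence The big black cat stared at the small dog, the subject is the big black cat and there is no direct object (the small dog is the object of the preposition at). For the sentence Jane watched her brother in the evenings, the subject is Jane and the direct object is her brother. In the sentence Laura gave Sam a very interesting book, the subject is Laura and the direct object is a very interesting book: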

The big black cat stared at the small dog.
  Subject: The big black cat
  Direct object: None
Jane watched her brother in the evenings.
  Subject: Jane
  Direct object: her brother
Laura gave Sam a very interesting book.
  Subject: Laura
  Direct object: a very interesting book

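There's more…

We can look for other objects, for example, the dative objects of verbs such as give and the objects of prepositional phrases. The functions look very similar, with the main difference being the dependency tags: dative for the dative object function and pobj for the prepositional object function. The prepositional object function returns a list, since there can be more than one prepositional phrase in a sentence:

The dative object function checks the tokens for the dative tag. It returns None if there are no dative objects:
```
def get_dative_phrase(doc):
    for token in doc:
        if ("dative" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return doc[start:end]
```
We can also combine the subject, object, and dative functions into one, with an argument specifying which kind of phrase to look for:
```
def get_phrase(doc, phrase):
    # phrase is one of "subj", "dobj", "dative";
    # the shorter "obj" would also match the "pobj" tag
    for token in doc:
        if (phrase in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return doc[start:end]
```
Let's now define a sentence with a dative object and run the function for all three phrase types:
```
sentence = "Laura gave Sam a very interesting book."
doc = small_model(sentence)
subject_phrase = get_phrase(doc, "subj")
object_phrase = get_phrase(doc, "dobj")
dative_phrase = get_phrase(doc, "dative")
print(sentence)
print("\tSubject:", subject_phrase)
print("\tDirect object:", object_phrase)
print("\tDative object:", dative_phrase)
```
The result will be as follows. The dative object is Sam: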

Laura gave Sam a very interesting book.
  Subject: Laura
  Direct object: a very interesting book
  Dative object: Sam

Here is the prepositional object function. It returns a list of objects of prepositions, which is empty if there are none:
```
def get_prepositional_phrase_objs(doc):
    prep_spans = []
    for token in doc:
        if ("pobj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            prep_spans.append(doc[start:end])
    return prep_spans
```
Let's define a list of sentences and run the two functions on them. We pass dobj for the direct object, because the shorter string obj would also match the pobj tag and report the small dog as a direct object:
```
sentences = [
    "The big black cat stared at the small dog.",
    "Jane watched her brother in the evenings."
]
for sentence in sentences:
    doc = small_model(sentence)
    subject_phrase = get_phrase(doc, "subj")
    object_phrase = get_phrase(doc, "dobj")
    dative_phrase = get_phrase(doc, "dative")
    prepositional_phrase_objs = \
        get_prepositional_phrase_objs(doc)
    print(sentence)
    print("\tSubject:", subject_phrase)
    print("\tDirect object:", object_phrase)
    print("\tPrepositional phrases:", prepositional_phrase_objs)
```
The result will be as follows:
```
The big black cat stared at the small dog.
  Subject: The big black cat
  Direct object: None
  Prepositional phrases: [the small dog]
Jane watched her brother in the evenings.
  Subject: Jane
  Direct object: her brother
  Prepositional phrases: [the evenings]
```
There is one prepositional phrase in each sentence. In the sentence The big black cat stared at the small dog, it is at the small dog, and in the sentence Jane watched her brother in the evenings, it is in the evenings.

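It is left as an exercise for you to find the actual prepositional phrases, with the prepositions included, rather than just the noun phrases that depend on those prepositions.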
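
One possible direction for this exercise is to start from the preposition itself (the token labelled prep) and take its whole subtree, instead of starting from the pobj token. The sketch below illustrates the idea on a hand-annotated parse, so it runs without spaCy. The `TOKENS` list, its labels, and both helper names are assumptions made for this illustration, with labels chosen to mirror spaCy's tag names:

```python
# Hand-annotated parse of the first example sentence:
# (text, dependency label, index of the head). The root points to itself.
TOKENS = [
    ("The", "det", 3), ("big", "amod", 3), ("black", "amod", 3),
    ("cat", "nsubj", 4), ("stared", "ROOT", 4), ("at", "prep", 4),
    ("the", "det", 8), ("small", "amod", 8), ("dog", "pobj", 5),
    (".", "punct", 4),
]

def subtree_indices(index, tokens):
    """Return the sorted indices of a token and everything that depends on it."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for i, (_, _, head) in enumerate(tokens):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)

def prepositional_phrases(tokens):
    """Collect the full phrase under every token labelled 'prep'."""
    phrases = []
    for i, (_, dep, _) in enumerate(tokens):
        if dep == "prep":
            indices = subtree_indices(i, tokens)
            phrases.append(" ".join(tokens[j][0] for j in indices))
    return phrases

print(prepositional_phrases(TOKENS))  # ['at the small dog']
```

With a real spaCy Doc, the same idea would amount to checking `token.dep_ == "prep"` and slicing the Doc over `token.subtree`, in the same way as the functions in this recipe.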

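Finding patterns in text using grammatical information

In this section, we will use the spaCy Matcher object to find patterns in the text. We will use the grammatical properties of words to create these patterns. For example, we might be looking for verb phrases instead of noun phrases. We can specify grammatical patterns to match verb phrases.

Getting ready

We will use the spaCy Matcher object to specify and find patterns. It can match on different properties, not just grammatical ones. You can find out more in the documentation at https://spacy.io/usage/rule-based-matching/.

How to do it...

Follow these steps:

Run the file and language utility notebooks:
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
Import the Matcher object and initialize it. We need to pass in the vocabulary object, which is the same as the vocabulary of the model we will use to process the text: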
```
from spacy.matcher import Matcher
matcher = Matcher(small_model.vocab)
```
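Create a list of patterns and add them to the matcher. Each pattern is a list of dictionaries, where each dictionary describes a token. In our patterns, we only specify the part of speech for each token. We then add these patterns to the Matcher object. The patterns we will use are a verb by itself (for example, paints), an auxiliary followed by a verb (for example, was observing), an auxiliary followed by an adjective (for example, were late), and an auxiliary followed by a verb and a preposition (for example, were staring at). This is not an exhaustive list; feel free to come up with other examples: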
```
patterns = [
    [{"POS": "VERB"}],
    [{"POS": "AUX"}, {"POS": "VERB"}],
    [{"POS": "AUX"}, {"POS": "ADJ"}],
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]
]
matcher.add("Verb", patterns)
```

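Read in the small part of the Sherlock Holmes text and process it using the small model:
```
sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
doc = small_model(sherlock_holmes_part_of_text)
```
Now, we find the matches using the Matcher object and the processed text. We then loop through the matches and print out the match ID, the string ID (the identifier of the pattern), the start and end of the match, and the text of the match:
```
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = small_model.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
```
The result will be as follows: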

14677086776663181681 Verb 14 15 heard
14677086776663181681 Verb 17 18 mention
14677086776663181681 Verb 28 29 eclipses
14677086776663181681 Verb 31 32 predominates
14677086776663181681 Verb 43 44 felt
14677086776663181681 Verb 49 50 love
14677086776663181681 Verb 63 65 were abhorrent
14677086776663181681 Verb 80 81 take
14677086776663181681 Verb 88 89 observing
14677086776663181681 Verb 94 96 has seen
14677086776663181681 Verb 95 96 seen
14677086776663181681 Verb 103 105 have placed
14677086776663181681 Verb 104 105 placed
14677086776663181681 Verb 114 115 spoke
14677086776663181681 Verb 120 121 save
14677086776663181681 Verb 130 132 were admirable
14677086776663181681 Verb 140 141 drawing
14677086776663181681 Verb 153 154 trained
14677086776663181681 Verb 157 158 admit
14677086776663181681 Verb 167 168 adjusted
14677086776663181681 Verb 171 172 introduce
14677086776663181681 Verb 173 174 distracting
14677086776663181681 Verb 178 179 throw
14677086776663181681 Verb 228 229 was

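The code finds some of the verb phrases in the text. Sometimes, it finds a partial match that is part of another match. Weeding out these partial matches is left as an exercise.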
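
One possible approach to this exercise is to drop every match whose span lies entirely inside a longer match. The sketch below works on plain (start, end) pairs, so it runs without spaCy. The function name `drop_contained_matches` is made up for this illustration, and the sample offsets are copied from the output above:

```python
def drop_contained_matches(matches):
    """Keep only the (start, end) spans that are not inside a longer span."""
    kept = []
    for start, end in matches:
        contained = any(
            other_start <= start and end <= other_end
            and (other_start, other_end) != (start, end)
            for other_start, other_end in matches
        )
        if not contained:
            kept.append((start, end))
    return kept

# Start and end offsets copied from the matcher output above.
matches = [(88, 89), (94, 96), (95, 96), (103, 105), (104, 105), (114, 115)]
print(drop_contained_matches(matches))
# [(88, 89), (94, 96), (103, 105), (114, 115)]
```

Here seen (95, 96) is dropped because has seen (94, 96) covers it. When working with Span objects rather than raw offsets, spaCy's `spacy.util.filter_spans` helper serves a similar purpose.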

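See also

We can use attributes other than parts of speech. It is possible to match on the text itself, its length, whether it is alphanumeric, punctuation, the word's case, the dep_ and morph attributes, the lemma, the entity type, and others. It is also possible to use regular expressions in the patterns. For more details, see the spaCy documentation: https://spacy.io/usage/rule-based-matching.

Chapter 3: Representing Text - Capturing Semantics

Representing the meaning of words, phrases, and sentences in a form that computers can understand is one of the pillars of NLP processing. Machine learning, for example, represents each data point as a list of numbers (a fixed-size vector), and we are faced with the question of how to turn words and sentences into such vectors. Most NLP tasks start by representing the text in some numeric form, and in this chapter, we will show several ways to do that.

First, we will create a simple classifier to demonstrate the effectiveness of each encoding method, and we will then use it to test the different encoding methods. We will also learn how to turn phrases such as fried chicken into vectors, that is, how to train a word2vec model for phrases. Finally, we will see how to use vector-based search.

For a theoretical background on some of the concepts discussed in this section, refer to Building Machine Learning Systems with Python by Coelho et al. The book explains the basics of building a machine learning project, such as training and test sets, as well as the metrics used to evaluate such projects, including precision, recall, F1, and accuracy.

This chapter covers the following recipes:

Creating a simple classifier
Putting documents into a bag of words
Constructing an N-gram model
Representing texts with TF-IDF
Using word embeddings
Training your own embeddings model
Using BERT and OpenAI embeddings instead of word embeddings
Using retrieval-augmented generation (RAG)

Technical requirements

The code for this chapter is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter03. The packages required for this chapter should be installed automatically through the poetry environment.

In addition, we will use the models and datasets located at the following URLs. The Google word2vec model is a model that represents words as vectors, and the IMDB dataset contains movie titles, genres, and descriptions. Download them into the data folder inside the root directory:

The Google word2vec model: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
The IMDB movie dataset: https://github.com/venusanvi/imdb-movies/blob/main/IMDB-Movie-Data.csv (also available in the book's GitHub repo)

In addition to the preceding files, we will use various functions from the simple classifier that we will create in the first recipe. This file is available at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/util/util_simple_classifier.ipynb.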

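Creating a simple classifier

The reason we need to represent text as vectors is to turn it into a computer-readable form. Computers cannot understand words but are good at manipulating numbers. One of the main NLP tasks is text classification, and we are going to create a classifier for movie reviews. We will use the same classifier code throughout, but with different methods of creating vectors from text.

In this section, we will create a classifier that assigns a negative or positive sentiment to Rotten Tomatoes reviews. The dataset is available through Hugging Face, a large repository of open source models and datasets. We will then use a baseline method in which we encode the text by counting the number of different parts of speech present in it (verbs, nouns, proper nouns, adjectives, adverbs, auxiliary verbs, pronouns, numbers, and punctuation).

By the end of this recipe, we will have created a separate file with functions that create the dataset and train the classifier. We will use this file throughout the chapter to test different encoding methods.

Getting ready

In this recipe, we will create a simple classifier for movie reviews. It will use the sklearn package.

How to do it...

We will load the Rotten Tomatoes dataset from Hugging Face. We will use only part of the dataset so that the training time is not very long:

Run the file and language util notebooks:
```
%run -i "../util/file_utils.ipynb"
%run -i "../util/lang_utils.ipynb"
```
Load the training and test datasets from Hugging Face (the datasets package). For both the training and test sets, we select the first 15% and the last 15% of the data instead of loading the full datasets. The full dataset is large, and it takes a long time to train the model: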
```
from datasets import load_dataset
train_dataset = load_dataset("rotten_tomatoes",
    split="train[:15%]+train[-15%:]")
test_dataset = load_dataset("rotten_tomatoes",
    split="test[:15%]+test[-15%:]")
```
Print out the length of each dataset:
```
print(len(train_dataset))
print(len(test_dataset))
```
The output should be as follows:
```
2560
320
```

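Here, we create the POS_vectorizer class. The class has a method called vectorize that processes the text and counts the number of verbs, nouns, proper nouns, adjectives, adverbs, auxiliary verbs, pronouns, numbers, and punctuation marks. The class needs a spaCy model to process the text. Each piece of text is turned into a vector of size 10. The first element of the vector is the length of the text in tokens, and the other numbers are the counts of the words in the text with each particular part of speech:
```
class POS_vectorizer:
    def __init__(self, spacy_model):
        self.model = spacy_model
    def vectorize(self, input_text):
        doc = self.model(input_text)
        vector = []
        vector.append(len(doc))
        pos = {"VERB":0, "NOUN":0, "PROPN":0, "ADJ":0,
            "ADV":0, "AUX":0, "PRON":0, "NUM":0, "PUNCT":0}
        for token in doc:
            if token.pos_ in pos:
                pos[token.pos_] += 1
        vector_values = list(pos.values())
        vector = vector + vector_values
        return vector
```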

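Now, we can test out the POS_vectorizer class. We take the text of the first review to process and create the vectorizer using the small spaCy model. We then vectorize the text using the newly created class: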
```
sample_text = train_dataset[0]["text"]
vectorizer = POS_vectorizer(small_model)
vector = vectorizer.vectorize(sample_text)
```

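Let's print the text and the vector:
```
print(sample_text)
print(vector)
```
The result should look like this. We can see that the vector correctly counts the parts of speech. For example, there are five punctuation marks (two quotes, one comma, one period, and one hyphen):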

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
[38, 3, 8, 3, 4, 1, 3, 1, 0, 5]

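Now, we will prepare the data for training our classifier. We first import the pandas and numpy packages and then create two dataframes, one for training and one for testing. In each dataset, we create a new column called vector that contains the vector for that piece of text. We use the apply method to turn the text into vectors and store them in the new column. To this method, we pass a lambda function that takes a piece of text and applies the vectorize method of the POS_vectorizer class to it. We then turn the vector and label columns into numpy arrays so that the data is in the right format for the classifier. We use the np.stack method for the vectors, since each one is already a list, and the to_numpy method for the review labels, since they are just numbers: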
```
import pandas as pd
import numpy as np
train_df = train_dataset.to_pandas()
train_df.sample(frac=1)  # returns a shuffled copy that is not assigned, so train_df keeps its original order
test_df = test_dataset.to_pandas()
train_df["vector"] = train_df["text"].apply(
    lambda x: vectorizer.vectorize(x))
test_df["vector"] = test_df["text"].apply(
    lambda x: vectorizer.vectorize(x))
X_train = np.stack(train_df["vector"].values, axis=0)
X_test = np.stack(test_df["vector"].values, axis=0)
y_train = train_df["label"].to_numpy()
y_test = test_df["label"].to_numpy()
```
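Now, we will train the classifier. We choose the logistic regression algorithm, since it is one of the simplest and fastest algorithms. First, we import the LogisticRegression class and the classification_report method from sklearn. We then create the LogisticRegression object and finally train it on the data from the previous step: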
```
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
clf = LogisticRegression(C=0.1)
clf = clf.fit(X_train, y_train)
```

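We can test the classifier by applying the predict method to the vectors in the test data and printing out the classification report. We can see that the overall accuracy is low, slightly above chance. This is because the vector representation we used is very crude. In the following recipes, we will use other vectors and see how they affect the classifier results:
```
test_df["prediction"] = test_df["vector"].apply(
    lambda x: clf.predict([x])[0])
print(classification_report(test_df["label"], 
    test_df["prediction"]))
```
The output should be similar to this: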

              precision    recall  f1-score   support
           0       0.59      0.54      0.56       160
           1       0.57      0.62      0.60       160
    accuracy                           0.58       320
   macro avg       0.58      0.58      0.58       320
weighted avg       0.58      0.58      0.58       320

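There's more...

We will now turn the preceding code into several functions so that we can vary only the vectorizer used when building the dataset. The resulting file is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/util/util_simple_classifier.ipynb. The resulting code will look as follows:

Import the necessary packages:
```
from datasets import load_dataset
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
```
Define the function that creates and returns the training and test dataframes. It creates them from the Rotten Tomatoes dataset on Hugging Face:
```
def load_train_test_dataset_pd():
    train_dataset = load_dataset("rotten_tomatoes",
        split="train[:15%]+train[-15%:]")
    test_dataset = load_dataset("rotten_tomatoes",
        split="test[:15%]+test[-15%:]")
    train_df = train_dataset.to_pandas()
    # sample() returns a shuffled copy; it is not assigned here,
    # so train_df keeps its original order
    train_df.sample(frac=1)
    test_df = test_dataset.to_pandas()
    return (train_df, test_df)
```
This function takes in the dataframes and the vectorize method and creates the numpy arrays for the training and test data. This allows us to train the logistic regression classifier using the created vectors:
```
def create_train_test_data(train_df, test_df, vectorize):
    train_df["vector"] = train_df["text"].apply(
        lambda x: vectorize(x))
    test_df["vector"] = test_df["text"].apply(
        lambda x: vectorize(x))
    X_train = np.stack(train_df["vector"].values, axis=0)
    X_test = np.stack(test_df["vector"].values, axis=0)
    y_train = train_df["label"].to_numpy()
    y_test = test_df["label"].to_numpy()
    return (X_train, X_test, y_train, y_test)
```
This function trains a logistic regression classifier on the given training data:
```
def train_classifier(X_train, y_train):
    clf = LogisticRegression(C=0.1)
    clf = clf.fit(X_train, y_train)
    return clf
```
This final function takes in the test data and the trained classifier and prints out the classification report:
```
def test_classifier(test_df, clf):
    test_df["prediction"] = test_df["vector"].apply(
        lambda x: clf.predict([x])[0])
    print(classification_report(test_df["label"],
        test_df["prediction"]))
```
In each of the following sections that present a new vectorization method, we will use this file to preload the functions needed to test the classification results. This lets us evaluate the different vectorization methods. We will vary only the vectorizer while keeping the classifier the same. When the classifier performs better, it reflects how well the underlying vectorizer represents the text.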

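Putting documents into a bag of words

A bag of words is the simplest way of representing text. We treat our text as a collection of documents, where a document can be anything from a sentence to a scientific article to a blog post or an entire book. Since we usually compare different documents to each other or use them in a larger context of other documents, we work with a collection of documents rather than a single one.

The bag of words method uses a "training" text that provides it with a list of words that it should consider. When encoding new sentences, it counts the number of occurrences of each word in the document, and the final vector includes those counts for each word in the vocabulary. This representation can then be fed into a machine learning algorithm.

This vectorization method is called a bag of words because it does not take into account the relationships of words to one another and only counts the number of times each word occurs. The decision of what represents a document lies with the engineer and, in many cases, will be obvious. For example, if you are classifying tweets as belonging to a particular topic, a single tweet is your document. If, on the other hand, you would like to find out which of the chapters of a book are most similar to a book you already have, then the chapters are the documents.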
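
The idea can be sketched in a few lines of plain Python before bringing in a library. The example below is only an illustration: it assumes whitespace tokenization and lowercasing, and the helper names `build_vocabulary` and `encode` are made up for it. A library vectorizer will tokenize differently (CountVectorizer, for instance, drops single-character tokens by default):

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase word in the training documents to an index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def encode(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

train_docs = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(train_docs)
print(vocabulary)
# {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(encode("the cat chased the dog", vocabulary))
# [1, 1, 0, 0, 0, 2]
```

The word chased is not in the vocabulary, so it leaves no trace in the vector, and the word order of the new sentence is lost entirely.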

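In this recipe, we will create a bag of words for the Rotten Tomatoes reviews. Our documents will be the reviews. We then test the encoding by building a logistic regression classifier using the code from the previous recipe.

Getting ready

For this recipe, we will use the CountVectorizer class from the sklearn package. It is included in the poetry environment. The CountVectorizer class is specifically designed to count the number of occurrences of each word in a text.

How to do it...

Our code will take a set of documents, in this case reviews, and represent them as a matrix of vectors. We will use the Rotten Tomatoes reviews dataset from Hugging Face for this task:

Run the simple classifier utility file, then import the CountVectorizer object and the sys package. We need the sys package to change the printing options: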
```
%run -i "../util/util_simple_classifier.ipynb"
from sklearn.feature_extraction.text import CountVectorizer
import sys
```
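Load the training and test dataframes by using the function from the util_simple_classifier.ipynb file. We created this function in the previous recipe, Creating a simple classifier. The function loads the first 15% and the last 15% of the Rotten Tomatoes dataset into pandas dataframes. This might take a few minutes to run: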
```
(train_df, test_df) = load_train_test_dataset_pd()
```
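Create the vectorizer, fit it on the training data, and print out the resulting matrix. We use the max_df parameter to specify which words should be treated as stop words. In this case, we specify that words that appear in more than 40% of the documents should be ignored when constructing the vectorizer. You should experiment to see which value of max_df suits your use case. We then fit the vectorizer on the text column of the train_df dataframe: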
```
vectorizer = CountVectorizer(max_df=0.4)
X = vectorizer.fit_transform(train_df["text"])
print(X)
```
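The resulting matrix is a scipy.sparse._csr.csr_matrix object, and the beginning of its printout looks as follows. The format of a sparse matrix is (row, column) value. In our case, this means (document index, word index) followed by the frequency. In our example, the first review, that is, the first document, is document number 0, and it contains the words with indices 6578, 4219, and so on. Most of these words occur once in the review, while the word with index 8000 occurs twice: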
```
  (0, 6578)  1
  (0, 4219)  1
  (0, 2106)  1
  (0, 8000)  2
  (0, 717)  1
  (0, 42)  1
  (0, 1280)  1
  (0, 5260)  1
  (0, 1607)  1
  (0, 7889)  1
  (0, 3630)  1
…
```
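In most cases, we use a different format for representing vectors, a dense matrix, which is easier to use in practice. Instead of specifying rows and columns with numbers, they are inferred from the position of the value. We will now create a dense matrix and print it: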
```
dense_matrix = X.todense()
print(dense_matrix)
```
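The resulting matrix is a NumPy matrix object, where each review is a vector. You can see that most of the values in the matrix are zeros, as expected, since each review uses only a handful of words, while the vector holds a count for every word in the vocabulary, that is, every word in all of the reviews. Any words that are not in the vectorizer's vocabulary will not be included in the vector: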
```
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
```
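
The relationship between the two formats can be shown with a few lines of plain Python. This is a minimal sketch, not how scipy implements it: the sparse matrix is modeled as a dictionary from (row, column) to value, and the function name `to_dense` is made up for this illustration:

```python
def to_dense(sparse_entries, n_rows, n_cols):
    """Expand {(row, column): value} entries into a list of full rows."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in sparse_entries.items():
        dense[row][col] = value
    return dense

# Two documents over a six-word vocabulary; only non-zero counts are stored.
sparse_entries = {(0, 1): 1, (0, 4): 2, (1, 0): 1}
for row in to_dense(sparse_entries, n_rows=2, n_cols=6):
    print(row)
# [0, 1, 0, 0, 2, 0]
# [1, 0, 0, 0, 0, 0]
```

The sparse form stores three numbers where the dense form stores twelve. With a vocabulary of several thousand words and short reviews, that difference is why the vectorizer returns a sparse matrix by default.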
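We can see all the words used in the document set and the size of the vocabulary. This can be used as a sanity check and to see whether there are any irregularities in the vocabulary: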
```
print(vectorizer.get_feature_names_out())
print(len(vectorizer.get_feature_names_out()))
```
The result will be as follows. If you would like to see the full, non-truncated list, use the set_printoptions function used in step 8:
```
['10' '100' '101' ... 'zone' 'ótimo' 'últimos']
8856
```
We can also see all the stop words used by the vectorizer:
```
print(vectorizer.stop_words_)
```
The result is three words, and, the, and of, which appear in more than 40% of the reviews:
```
{'and', 'the', 'of'}
```
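We can now also use the CountVectorizer object to represent new reviews that were not in the original document set. This is done when we have a trained model and want to test it on new, unseen samples. We will use the first review in the test dataset. To get the first review in the test set, we will use the pandas iat accessor: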
```
first_review = test_df['text'].iat[0]
print(first_review)
```
The first review looks as follows:
```
lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
```

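Now, we will create a sparse and a dense vector from the first review. The transform method of the vectorizer expects a list of strings, so we create a list. We also set the print option to print out the whole vector instead of just part of it, and restore NumPy's default threshold afterward:
```
sparse_vector = vectorizer.transform([first_review])
print(sparse_vector)
dense_vector = sparse_vector.todense()
np.set_printoptions(threshold=sys.maxsize)
print(dense_vector)
# Restore NumPy's default summarization threshold
np.set_printoptions(threshold=1000)
```
The dense vector is very long and is mostly zeros, as expected: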

  (0, 955)  1
  (0, 3968)  1
  (0, 4451)  1
  (0, 4562)  1
  (0, 4622)  1
  (0, 4688)  1
  (0, 4779)  1
  (0, 4792)  1
  (0, 5764)  1
  (0, 7547)  1
  (0, 7715)  1
  (0, 8000)  1
  (0, 8734)  1
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
…]]

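We can compute the stop words in a different way. Here, they are determined by an absolute threshold on document frequency: passing the integer max_df=300 ignores every word that appears in more than 300 documents, so only words with a document frequency of 300 or less are kept. You can see that the stop word list is now larger: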
```
vectorizer = CountVectorizer(max_df=300)
X = vectorizer.fit_transform(train_df["text"])
print(vectorizer.stop_words_)
```
The result is as follows:
```
{'but', 'this', 'its', 'as', 'to', 'and', 'the', 'is', 'film', 'for', 'it', 'an', 'of', 'that', 'movie', 'with', 'in'}
```
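Finally, we can supply our own list of stop words to the vectorizer. These words are ignored by the vectorizer and are not represented in the vectors. This is useful when there are very specific words you want to ignore: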
```
vectorizer = CountVectorizer(stop_words=['the', 'this',
    'these', 'in', 'at', 'for'])
X = vectorizer.fit_transform(train_df["text"])
```

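Now we test the effect of this bag-of-words vectorizer on the simple classifier, using the functions we defined in the previous recipe. First, we create the vectorizer, keeping only words that appear in at most 80% of the documents. Then we load the training and test dataframes and fit the vectorizer on the training reviews. We wrap the vectorizer in a vectorize function and pass it to create_train_test_data together with the training and test dataframes. Finally, we train the classifier and test it on the test data. This approach gives much better results than the basic part-of-speech count vectorizer we used in the previous recipe:

```
vectorizer = CountVectorizer(max_df=0.8)
(train_df, test_df) = load_train_test_dataset_pd()
X = vectorizer.fit_transform(train_df["text"])
vectorize = lambda x: vectorizer.transform([x]).toarray()[0]
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```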

The result will be similar to the following:

```
              precision    recall  f1-score   support
           0       0.74      0.72      0.73       160
           1       0.73      0.75      0.74       160
    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320
```

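Constructing an N-gram model

Representing a document as a bag of words is useful, but semantics is about more than isolated words. To capture word combinations, an n-gram model is useful. Its vocabulary consists not only of words but also of word sequences, or n-grams.

In this recipe, we build a bigram model, where a bigram is a sequence of two words.

Getting ready

The CountVectorizer class is very flexible and lets us construct n-gram models. We use it in this recipe and test it with the simple classifier.

In this recipe, I compare the code and its results with those from the Putting documents into a bag of words recipe, because the two recipes are very similar but differ in a few characteristics.

How to do it...

Run the simple classifier notebook and import the CountVectorizer class:

```
%run -i "../util/util_simple_classifier.ipynb"
from sklearn.feature_extraction.text import CountVectorizer
```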

Create the training and test dataframes using the code from the util_simple_classifier.ipynb notebook:
```
(train_df, test_df) = load_train_test_dataset_pd()
```
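Create a new vectorizer object, this time with the ngram_range parameter. When ngram_range is set, the CountVectorizer class counts not only individual words but also word combinations; the number of words in a combination depends on the values passed to ngram_range. We pass ngram_range=(1,2), so combinations are one or two words long, and the vectorizer counts both unigrams and bigrams: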
```
bigram_vectorizer = CountVectorizer(
    ngram_range=(1, 2), max_df=0.8)
X = bigram_vectorizer.fit_transform(train_df["text"])
```
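To see what an n-gram range of (1, 2) produces for a single text, here is a small sketch. The word_ngrams function is a hypothetical stand-in written for this illustration; it splits on whitespace, whereas the real vectorizer applies its own tokenization, so the two will not agree on every input:

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text whose length lies in ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("consistently clever and suspenseful"))
```

Four words yield four unigrams and three bigrams, which hints at why the bigram vocabulary grows so much faster than the unigram one.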
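Print the vectorizer's vocabulary and its length. The vocabulary is much larger than that of the unigram vectorizer, because it contains bigrams in addition to single words: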
```
print(bigram_vectorizer.get_feature_names_out())
print(len(bigram_vectorizer.get_feature_names_out()))
```
The result should look like this:
```
['10' '10 inch' '10 set' ... 'ótimo esforço' 'últimos' 'últimos tiempos']
40552
```
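Now we take the first review from the test dataframe and get its dense vector. The result looks very similar to the vector output in the Putting documents into a bag of words recipe; the only difference is that the vector is now longer, because it includes bigrams (sequences of two words) as well as single words: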
```
first_review = test_df['text'].iat[0]
dense_vector = bigram_vectorizer.transform(
    [first_review]).todense()
print(dense_vector)
```
The printout looks like this:
```
[[0 0 0 ... 0 0 0]]
```

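Finally, we train the simple classifier using the new bigram vectorizer. Its accuracy is slightly lower than that of the classifier that used the unigram vectorizer in the previous recipe. There are several possible reasons. One is that the vectors are now much longer and are mostly zeros. Another is that, as the vocabulary shows, not all reviews are in English, which makes it hard for the classifier to generalize from the input data:

```
vectorize = \
    lambda x: bigram_vectorizer.transform([x]).toarray()[0]
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```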

The output will look like this:

```
              precision    recall  f1-score   support
           0       0.72      0.75      0.73       160
           1       0.74      0.71      0.72       160
    accuracy                           0.73       320
   macro avg       0.73      0.73      0.73       320
weighted avg       0.73      0.73      0.73       320
```

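Representing texts with TF-IDF

We can go one step further and use the TF-IDF algorithm to weight the words and n-grams in incoming documents. TF-IDF stands for term frequency-inverse document frequency; it gives more weight to words that are distinctive of a document than to words that recur frequently across the whole document set. This lets us emphasize the words that characterize a particular document.

In this recipe, we use a different type of vectorizer, one that applies the TF-IDF algorithm to the input text, and build a small classifier.

Getting ready

We use the TfidfVectorizer class from the sklearn package. Its features should be familiar from the previous two recipes, Putting documents into a bag of words and Constructing an N-gram model. We again use the Rotten Tomatoes reviews dataset from Hugging Face.

How to do it...

Here are the steps to build and use the TF-IDF vectorizer:

Run the simple classifier notebook and import the TfidfVectorizer class:

```
%run -i "../util/util_simple_classifier.ipynb"
from sklearn.feature_extraction.text import TfidfVectorizer
```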

Create the training and test dataframes using the load_train_test_dataset_pd() function:
```
(train_df, test_df) = load_train_test_dataset_pd()
```
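Create the vectorizer and fit it on the training text. We use the max_df parameter to exclude stop words, in this case words that appear in more than 300 documents: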
```
vectorizer = TfidfVectorizer(max_df=300)
vectorizer.fit(train_df["text"])
```
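To make sure the result makes sense, we print the vectorizer's vocabulary and its length. Since we use only unigrams, the vocabulary should match that of a bag-of-words vectorizer built with the same max_df=300 setting: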
```
print(vectorizer.get_feature_names_out())
print(len(vectorizer.get_feature_names_out()))
```
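The result should look like this. The length is 14 less than the 8,856 reported in the bag-of-words recipe, because max_df=300 excludes 17 stop words where max_df=0.4 excluded only 3: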
```
['10' '100' '101' ... 'zone' 'ótimo' 'últimos']
8842
```
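Now let's take the first review in the test dataframe and vectorize it, then print the dense vector. To learn more about the difference between sparse and dense vectors, see the Putting documents into a bag of words recipe. Note that the values in the vector are now floats rather than integers, because each value is now a TF-IDF weight rather than a raw count: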
```
first_review = test_df['text'].iat[0]
dense_vector = vectorizer.transform([first_review]).todense()
print(dense_vector)
```
The result should look like this:
```
[[0. 0. 0. ... 0. 0. 0.]]
```
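To see where the float values come from, the weights can be computed by hand on a tiny corpus. This is a sketch under stated assumptions: the three documents are invented, the helper names idf and tfidf_vector are hypothetical, and the smoothed IDF plus L2 normalization are meant to mirror what scikit-learn documents as its default TF-IDF settings rather than to reproduce the library exactly:

```python
import math
from collections import Counter

docs = [
    "the film is funny",
    "the film is dull",
    "the soundtrack is haunting",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf_vector(tokens):
    """Map each word of one document to its L2-normalized TF-IDF weight."""
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}


first = tfidf_vector(tokenized[0])
for word, weight in sorted(first.items()):
    print(f"{word:>6}: {weight:.3f}")
```

In the first document, funny occurs in only one document and receives the largest weight, while the and is occur everywhere and receive the smallest.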

Now, let's train the classifier. The scores are slightly higher than those of the bag-of-words classifiers, both the unigram and the n-gram versions:

```
vectorize = lambda x: vectorizer.transform([x]).toarray()[0]
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```

The printout of the test scores will look similar to this:

```
              precision    recall  f1-score   support
           0       0.76      0.72      0.74       160
           1       0.74      0.78      0.76       160
    accuracy                           0.75       320
   macro avg       0.75      0.75      0.75       320
weighted avg       0.75      0.75      0.75       320
```

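How it works…

The TfidfVectorizer class is almost identical to the CountVectorizer class; it differs only in how the word weights are calculated, so most of the steps should be familiar. The term frequency is the number of times a word appears in a document. The inverse document frequency is the total number of documents divided by the number of documents that contain the word. These frequencies are usually scaled logarithmically.

The TF-IDF weight of a word in a document is the product of its term frequency and its inverse document frequency.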
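In symbols, a common textbook form of the weighting is given below. The notation is an assumption made here: $t$ is a term, $d$ a document, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents that contain $t$. The last line shows the smoothed variant that the scikit-learn user guide (linked under See also) describes as its default, after which each document vector is normalized to unit length:

```latex
\begin{aligned}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \\
\mathrm{idf}_{\text{smooth}}(t) &= \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1
\end{aligned}
```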

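There's more…

We can build the TfidfVectorizer so that it uses character n-grams instead of word n-grams. For example, the character n-grams of lengths 1 to 3 for the phrase the woman form the set [t, h, e, w, o, m, a, n, th, he, wo, om, ma, an, the, wom, oma, man]. In some experimental settings, models based on character n-grams perform better than models based on word n-grams.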
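The set of character n-grams listed above can be reproduced with a few lines of Python. The char_ngrams function is a hypothetical helper for illustration; it keeps n-grams inside word boundaries and adds no padding, so its output is not identical to what the char_wb analyzer produces:

```python
def char_ngrams(phrase, ngram_range=(1, 3)):
    """Collect character n-grams inside each word, never crossing a space."""
    low, high = ngram_range
    ngrams = set()
    for word in phrase.split():
        for n in range(low, high + 1):
            for start in range(len(word) - n + 1):
                ngrams.add(word[start:start + n])
    return ngrams


grams = char_ngrams("the woman")
# Sort by length first, then alphabetically, for a readable printout
print(sorted(grams, key=lambda g: (len(g), g)))
```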

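We will use the small Sherlock Holmes text file, sherlock_holmes_1.txt, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/sherlock_holmes_1.txt, and the same class, TfidfVectorizer. Note that the code shown below fits the vectorizer on the Rotten Tomatoes training text in train_df. Since the unit of analysis is the character rather than the word, we do not need a tokenization function or a stop word list. The steps to create the vectorizer and analyze the text are as follows:

Create a new vectorizer object that uses the char_wb analyzer, then fit it on the training text:

```
tfidf_char_vectorizer = TfidfVectorizer(
    analyzer='char_wb', ngram_range=(1,5))
tfidf_char_vectorizer = tfidf_char_vectorizer.fit(
    train_df["text"])
```

Print the vectorizer's vocabulary and its length:

```
print(list(tfidf_char_vectorizer.get_feature_names_out()))
print(len(tfidf_char_vectorizer.get_feature_names_out()))
```

Part of the result will look like this: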

[' ', ' !', ' ! ', ' "', ' " ', ' $', ' $5', ' $50', ' $50-', ' $9', ' $9 ', ' &', ' & ', " '", " ' ", " '5", " '50", " '50'", " '6", " '60", " '60s", " '7", " '70", " '70'", " '70s", " '[", " '[h", " '[ho", " 'a", " 'a ", " 'a'", " 'a' ", " 'ab", " 'aba", " 'ah", " 'ah ", " 'al", " 'alt", " 'an", " 'ana", " 'ar", " 'are", " 'b", " 'ba", " 'bar", " 'be", " 'bee", " 'bes", " 'bl", " 'blu", " 'br", " 'bra", " 'bu", " 'but", " 'c", " 'ch", " 'cha", " 'co", " 'co-", " 'com", " 'd", " 'di", " 'dif", " 'do", " 'dog", " 'du", " 'dum", " 'e", " 'ed", " 'edg", " 'em", " 'em ", " 'ep", " 'epi", " 'f", " 'fa", " 'fac", " 'fat", " 'fu", " 'fun", " 'g", " 'ga", " 'gar", " 'gi", " 'gir", " 'gr", " 'gra", " 'gu", " 'gue", " 'guy", " 'h", " 'ha", " 'hav", " 'ho", " 'hos", " 'how", " 'i", " 'i ", " 'if", " 'if ", " 'in", " 'in ", " 'is",…]
51270

Create the vectorize method using the new vectorizer, then create the training and test data. Train the classifier and then test it:

```
vectorize = lambda x: tfidf_char_vectorizer.transform([
    x]).toarray()[0]
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```

The result will be similar to the following:

```
              precision    recall  f1-score   support
           0       0.74      0.74      0.74       160
           1       0.74      0.74      0.74       160
    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320
```

See also

You can learn more about TF-IDF term weighting at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
For more information about TfidfVectorizer, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

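Using word embeddings

In this recipe, we switch gears and learn how to represent words using word embeddings, which are the product of training a neural network that predicts a word from the other words in the sentence. Embeddings are also vectors, but they are usually much smaller, with 200 or 300 dimensions. The resulting embeddings are similar for words that occur in similar contexts. Similarity is usually measured as the cosine of the angle between two vectors in that 200- or 300-dimensional space. We will use the embeddings to demonstrate these similarities.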
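Cosine similarity itself is a short computation. The sketch below uses made-up three-dimensional vectors in place of real 300-dimensional embeddings; the words and numbers are illustrative assumptions, not values from the word2vec model:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; real ones come from a trained model
apple = [0.9, 0.8, 0.1]
pear = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(apple, pear))
print(cosine_similarity(apple, car))
```

Vectors that point in nearly the same direction score close to 1, so apple comes out closer to pear than to car in this toy setup.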

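Getting ready

In this recipe, we use a pretrained word2vec model, which is available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors. Download the model and unzip it into the data directory. You should now have a file whose path is …/data/GoogleNews-vectors-negative300.bin.gz.

We also use the gensim package to load and use the model. It should be installed in the poetry environment.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter03/3.5_word_embeddings.ipynb.

How to do it...

We load the model, demonstrate some features of the gensim package, and then compute a sentence vector using the word embeddings:

Run the simple classifier file:

```
%run -i "../util/simple_classifier.ipynb"
```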

Import the gensim package:
```
import gensim
```

Load the pretrained model. If you get an error at this step, make sure you have downloaded the model into the data directory:

```
model = gensim.models.KeyedVectors.load_word2vec_format(
    '../data/GoogleNews-vectors-negative300.bin.gz',
    binary=True)
```

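With the pretrained model, we can now load individual word vectors. Here, we load the word vector for the word king. We look the word up in lowercase, which is also how the helper functions later in this recipe query the model. The result is a long vector that represents the word in the word2vec model:

```
vec_king = model['king']
print(vec_king)
```

The result will look like this: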

[ 1.25976562e-01  2.97851562e-02  8.60595703e-03  1.39648438e-01
 -2.56347656e-02 -3.61328125e-02  1.11816406e-01 -1.98242188e-01
  5.12695312e-02  3.63281250e-01 -2.42187500e-01 -3.02734375e-01
 -1.77734375e-01 -2.49023438e-02 -1.67968750e-01 -1.69921875e-01
  3.46679688e-02  5.21850586e-03  4.63867188e-02  1.28906250e-01
  1.36718750e-01  1.12792969e-01  5.95703125e-02  1.36718750e-01
  1.01074219e-01 -1.76757812e-01 -2.51953125e-01  5.98144531e-02
  3.41796875e-01 -3.11279297e-02  1.04492188e-01  6.17675781e-02  …]

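We can also get the words that are most similar to a given word. For example, let's print the words most similar to apple and tomato. The output lists the most similar words (that is, words that occur in similar contexts) together with their similarity scores. The score is the cosine similarity between the two vectors, which here represent a pair of words. The larger the score, the more similar the two words. The result makes sense: the words most similar to apple are mostly fruits, and the words most similar to tomato are mostly vegetables:

```
print(model.most_similar(['apple'], topn=15))
print(model.most_similar(['tomato'], topn=15))
```

The result looks like this: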

[('apples', 0.720359742641449), ('pear', 0.6450697183609009), ('fruit', 0.6410146355628967), ('berry', 0.6302295327186584), ('pears', 0.613396167755127), ('strawberry', 0.6058260798454285), ('peach', 0.6025872826576233), ('potato', 0.5960935354232788), ('grape', 0.5935863852500916), ('blueberry', 0.5866668224334717), ('cherries', 0.5784382820129395), ('mango', 0.5751855969429016), ('apricot', 0.5727777481079102), ('melon', 0.5719985365867615), ('almond', 0.5704829692840576)]
[('tomatoes', 0.8442263007164001), ('lettuce', 0.7069936990737915), ('asparagus', 0.7050934433937073), ('peaches', 0.6938520669937134), ('cherry_tomatoes', 0.6897529363632202), ('strawberry', 0.6888598799705505), ('strawberries', 0.6832595467567444), ('bell_peppers', 0.6813562512397766), ('potato', 0.6784172058105469), ('cantaloupe', 0.6780219078063965), ('celery', 0.675195574760437), ('onion', 0.6740139722824097), ('cucumbers', 0.6706333160400391), ('spinach', 0.6682621240615845), ('cauliflower', 0.6681587100028992)]

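In the next two steps, we compute a sentence vector by averaging all the word vectors in the sentence. One challenge with this approach is representing words that are not in the model; here, we simply skip them. Let's define a function that takes a sentence and a model and returns a list of the sentence's word vectors. If a word is not in the model, a KeyError is raised, in which case we catch the error and move on: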
```
def get_word_vectors(sentence, model):
    word_vectors = []
    for word in sentence:
        try:
            word_vector = model[word.lower()]
            word_vectors.append(word_vector)
        except KeyError:
            continue
    return word_vectors
```
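Now, let's define a function that takes the list of word vectors and computes the sentence vector. To compute the average, we represent the matrix as a NumPy array and use NumPy's mean function to get the average vector: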
```
def get_sentence_vector(word_vectors):
    matrix = np.array(word_vectors)
    centroid = np.mean(matrix[:,:], axis=0)
    return centroid
```
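The two functions above can be exercised without downloading the pretrained model. The following sketch swaps in a hypothetical three-word dictionary for the real model and uses plain lists instead of NumPy; the names toy_model, get_word_vectors_sketch, and average are invented for this example. Unlike the function above, it splits the sentence into words first, so it also works when a raw string is passed in:

```python
# Hypothetical stand-in for a trained model: word -> 2-d vector
toy_model = {
    "great": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}


def get_word_vectors_sketch(sentence, model):
    """Look up each whitespace-separated word, skipping unknown words."""
    vectors = []
    for word in sentence.lower().split():
        if word in model:
            vectors.append(model[word])
    return vectors


def average(vectors):
    """Element-wise mean of equal-length vectors, or None if there are none."""
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


sentence_vector = average(
    get_word_vectors_sketch("Great movie indeed", toy_model))
print(sentence_vector)
```

The word indeed is not in the toy model and is skipped, so the result is the mean of the vectors for great and movie. Returning None for a sentence with no known words is one possible design choice; a zero vector would be another.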

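Note

Averaging word vectors to get a sentence vector is one way of approaching this task, but it is not without problems. An alternative is to train a doc2vec model, in which sentences, paragraphs, and whole documents can serve as the unit instead of words.

We can now test the averaged word embeddings as a vectorizer. Our vectorizer takes a string as input, gets the word vector of each word, and returns the sentence vector computed by the get_sentence_vector function. We then load the training and test data and create the datasets. We train the logistic regression classifier and test it:

```
vectorize = lambda x: get_sentence_vector(
    get_word_vectors(x, model))
(train_df, test_df) = load_train_test_dataset_pd()
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```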

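We can see that the scores are much lower than in the previous recipes. There may be several reasons. One is that the word2vec model is English-only, while the data is multilingual. Another is that vectorize passes the raw review string to get_word_vectors, whose loop over a string visits single characters rather than words, so splitting the text into tokens first is worth trying. As an exercise, you can write a script that keeps only the English reviews and see whether the score improves:

```
              precision    recall  f1-score   support
           0       0.54      0.57      0.55       160
           1       0.54      0.51      0.53       160
    accuracy                           0.54       320
   macro avg       0.54      0.54      0.54       320
weighted avg       0.54      0.54      0.54       320
```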

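See also

Many other pretrained models are available, including models for other languages; see http://vectors.nlpl.eu/repository/.

Some pretrained models include part-of-speech information, which can help when disambiguating words. These models concatenate each word with its part of speech, as in cat_NOUN, so keep this in mind when using them.
To learn more about the theory behind word2vec, you can start here: https://jalammar.github.io/illustrated-word2vec/.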

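Training your own embeddings model

We can now train our own word2vec model on a corpus. The model is a neural network that predicts a word when given a sentence in which that word is blanked out. A by-product of training the network is a vector representation for every word in the training vocabulary. For this task, we continue to use the Rotten Tomatoes reviews. The dataset is not very large, so the results are not as good as they would be with a larger collection.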

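Getting ready

We use the gensim package for this task. It should be installed as part of the poetry environment.

How to do it...

We create the dataset and then train the model on the data. Then we test its performance:

Import the necessary packages and functions:

```
import gensim
from gensim.models import Word2Vec
from datasets import load_dataset
from gensim import utils
```

Load the training data and check its length:

```
train_dataset = load_dataset("rotten_tomatoes", split="train")
print(len(train_dataset))
```

The result should be 8530, the number of reviews in the training split.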

Create the RottenTomatoesCorpus class. The word2vec training algorithm needs an object with an __iter__ method defined so that it can iterate over the data, which is why we need this class:

```
class RottenTomatoesCorpus:
    def __init__(self, sentences):
        self.sentences = sentences
    def __iter__(self):
        for review in self.sentences:
            yield utils.simple_preprocess(
                gensim.parsing.preprocessing.remove_stopwords(
                    review))
```

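Create a RottenTomatoesCorpus instance from the loaded training dataset. Since word2vec models are trained on text alone (they are self-supervised models), we do not need the review labels: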
```
sentences = train_dataset["text"]
corpus = RottenTomatoesCorpus(sentences)
```
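In this step, we initialize the word2vec model, train it, and save it to disk. The only required argument is the collection of tokenized sentences; other important parameters are min_count, vector_size, window, and workers. min_count is the minimum number of times a word must occur in the training corpus to be kept; the default is 5. vector_size sets the size of the word vectors. window limits the maximum distance between the current word and the predicted word within a sentence. workers is the number of worker threads; more threads mean faster training. When training the model, the epochs parameter determines the number of passes over the corpus. After initializing the model object, we train it on the corpus for 100 epochs and then save it to disk: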
```
model = Word2Vec(sentences=corpus, vector_size=100,
    window=5, min_count=1, workers=4)
model.train(corpus_iterable=corpus,
    total_examples=model.corpus_count, epochs=100)
model.save("../data/rotten_tomato_word2vec.model")
```
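The role of the window parameter can be illustrated by listing the (current word, context word) pairs it allows. This is a simplified sketch: context_pairs is a hypothetical helper, and real word2vec implementations add refinements on top of this basic windowing, such as subsampling of frequent words:

```python
def context_pairs(tokens, window=2):
    """List (current word, context word) pairs within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = context_pairs(["a", "quietly", "moving", "film"], window=1)
for pair in pairs:
    print(pair)
```

A larger window produces more pairs per word and pulls in more distant context, at the cost of more training work.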

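Find the 10 words that are most similar to the word movie. The words sequels and film are reasonable matches; the rest are less relevant. This is because the training corpus is small. Your results will differ, because they vary each time the model is trained:

```
w1 = "movie"
words = model.wv.most_similar(w1, topn=10)
print(words)
```

Here is one possible result: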

[('sequels', 0.38357362151145935), ('film', 0.33577531576156616), ('stuffed', 0.2925359606742859), ('quirkily', 0.28789234161376953), ('convict', 0.2810690104961395), ('worse', 0.2789292335510254), ('churn', 0.27702808380126953), ('hellish', 0.27698105573654175), ('hey', 0.27566075325012207), ('happens', 0.27498629689216614)]

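There's more...

There are tools for evaluating a word2vec model even though it is created without supervision. gensim comes with a file that lists word analogies, such as Athens is to Greece as Moscow is to Russia. The evaluate_word_analogies function runs the analogies through the model and calculates the share of correct answers.

Here is how to do it:

Use the evaluate_word_analogies function to evaluate our trained model. We need the analogies file, which is available in the book's GitHub repository at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/questions-words.txt.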
```
(analogy_score, word_list) = model.wv.evaluate_word_analogies(
    '../data/questions-words.txt')
print(analogy_score)
```
The result should be similar to the following:
```
0.0015881418740074113
```

Now let's evaluate the pretrained model. These commands may take longer to run:

```
pretrained_model = \
    gensim.models.KeyedVectors.load_word2vec_format(
        '../data/GoogleNews-vectors-negative300.bin.gz',
        binary=True)
(analogy_score, word_list) = \
    pretrained_model.evaluate_word_analogies(
        '../data/questions-words.txt')
print(analogy_score)
```

The result should be similar to the following:

```
0.7401448525607863
```

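We call evaluate_word_analogies differently for the pretrained model and for our own model because they are objects of different types. For the pretrained model, we load only the vectors (a KeyedVectors object, in which each word, represented by a key, is mapped to a vector), so we call the function on the object directly. Our model is a full Word2Vec model object, so we reach the same function through its wv attribute. We can check the types with the following commands: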
```
print(type(pretrained_model))
print(type(model))
```
The result will look like this:
```
<class 'gensim.models.keyedvectors.KeyedVectors'>
<class 'gensim.models.word2vec.Word2Vec'>
```
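The pretrained model was trained on a much larger corpus, so, predictably, it performs better. You can also build your own evaluation file that contains the concepts you need for your data.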
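The idea behind an analogy test can be sketched with toy vectors. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-written so that the analogy works, and solve_analogy is a hypothetical helper, not the procedure gensim implements internally:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


# Invented vectors: [Greek-ness, Russian-ness, country-ness]
vectors = {
    "athens": [1.0, 0.0, 0.0],
    "greece": [1.0, 0.0, 1.0],
    "moscow": [0.0, 1.0, 0.0],
    "russia": [0.0, 1.0, 1.0],
    "film": [0.5, 0.5, 0.2],
}


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest remaining word."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = solve_analogy("athens", "greece", "moscow", vectors)
print(answer)
```

An analogy score is then the fraction of such questions for which the returned word equals the expected one.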

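Note

Make sure that your evaluation is based on the type of data you will use in your application; otherwise, you may get misleading evaluation results.

See also

There is an additional way to evaluate model performance: comparing the similarity that the model assigns to word pairs with the similarity judgments assigned by humans. You can do this with the evaluate_word_pairs function. For more information, see https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.evaluate_word_pairs.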

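Using BERT and OpenAI embeddings instead of word embeddings

We can use Bidirectional Encoder Representations from Transformers (BERT) embeddings instead of word embeddings. A BERT model, like word embeddings, is a pretrained model that produces a vector representation, but it takes context into account and can represent a whole sentence rather than individual words.

Getting ready

For this recipe, we can use the Hugging Face sentence_transformers package to represent sentences as vectors. We need PyTorch, which is installed as part of the poetry environment.

To get the vectors, we use the all-MiniLM-L6-v2 model in this recipe.

We can also use embeddings from OpenAI's large language models (LLMs).

To use the OpenAI embeddings, you need to create an account and obtain an API key from OpenAI. You can create an account at https://platform.openai.com/signup.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter03/3.6_train_own_word2vec.ipynb.

How to do it…

The Hugging Face code makes using BERT very easy. The first time the code runs, it downloads the necessary model, which may take some time. After the download, we only need to encode sentences with the model. We test the simple classifier with these embeddings:

Run the simple classifier notebook to import its functions: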
```
%run -i "../util/util_simple_classifier.ipynb"
```

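Import the SentenceTransformer class:

```
from sentence_transformers import SentenceTransformer
```

Load the sentence transformer model, retrieve the embedding of the sentence I love jazz, and print it:

```
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode(["I love jazz"])
print(embedding)
```

As we can see, it is a vector similar to the word embedding vectors from the previous recipe: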

[[ 2.94217980e-03 -7.93536603e-02 -2.82228496e-02 -5.13779782e-02
  -6.44981042e-02  9.83557850e-02  1.09671958e-01 -3.26390602e-02
   4.96566631e-02  2.56580133e-02 -1.08482063e-01  1.88441798e-02
   2.70963665e-02 -3.80690470e-02  2.42502335e-02 -3.65605950e-03
   1.29364491e-01  4.32255343e-02 -6.64561391e-02 -6.93060979e-02
  -1.39410645e-01  4.36719768e-02 -7.85463024e-03  1.68625098e-02
  -1.01160072e-02  1.07926019e-02 -1.05814040e-02  2.57284809e-02
  -1.51516097e-02 -4.53920700e-02  7.12087378e-03  1.17573030e-01… ]]

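Now we can test our classifier with the BERT embeddings. First, let's define a function that returns a sentence vector. The function takes the input text and a model, encodes the text with the model, and returns the resulting embedding. We pass the text to the encode method inside a list, because the method expects an iterable. Likewise, we return the first element of the result, because encode returns a list of embeddings.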
```
def get_sentence_vector(text, model):
    sentence_embeddings = model.encode([text])
    return sentence_embeddings[0]
```

Now we define the vectorize function, create the training and test data using the load_train_test_dataset_pd function that we created in the Creating a simple classifier recipe, train the classifier, and test it. We time the dataset creation step, which is why the time package commands are included. We see that vectorizing the whole dataset (about 85,000 entries) takes about 11 seconds. We then train the model and test it:

```
import time
vectorize = lambda x: get_sentence_vector(x, model)
(train_df, test_df) = load_train_test_dataset_pd()
start = time.time()
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
print(f"BERT embeddings: {time.time() - start} s")
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)
```

The result is the best we have obtained so far:

```
BERT embeddings: 11.410213232040405 s
              precision    recall  f1-score   support
           0       0.77      0.79      0.78       160
           1       0.79      0.76      0.77       160
    accuracy                           0.78       320
   macro avg       0.78      0.78      0.78       320
weighted avg       0.78      0.78      0.78       320
```

See also

For more pretrained models, see https://www.sbert.net/docs/pretrained_models.html.

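Retrieval augmented generation (RAG)

In this recipe, we see vector embeddings in action. RAG is a popular way of working with large language models (LLMs). Because these models are pretrained on widely available internet data, they have no access to our private data, and we cannot use the model directly to ask questions about it. One way to overcome this limitation is to represent our data with vector embeddings. We can then compute the cosine similarity between our data and the question and include the most similar pieces of data together with the question in the prompt. This is where the name retrieval augmented generation comes from: we first retrieve relevant data by cosine similarity and then generate text with the LLM.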
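The retrieve-then-generate flow can be sketched without any external service. In this illustration, the passages, the three-dimensional vectors, and the helper names retrieve and build_prompt are all assumptions: real embeddings would come from an embedding model, and the assembled prompt would be sent to an LLM rather than printed:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


# Hand-written vectors standing in for real embedding-model output
store = [
    {"text": "A crew explores a huge structure on a distant moon.",
     "vector": [0.9, 0.1, 0.2]},
    {"text": "Two friends open a bakery in Paris.",
     "vector": [0.1, 0.9, 0.1]},
    {"text": "Soldiers defend an enormous wall against monsters.",
     "vector": [0.8, 0.2, 0.4]},
]


def retrieve(query_vector, store, top_k=2):
    """Return the top_k passages most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda rec: cosine(query_vector, rec["vector"]),
                    reverse=True)
    return [rec["text"] for rec in ranked[:top_k]]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "Which movies talk about something gigantic?"
query_vector = [0.85, 0.1, 0.3]  # pretend embedding of the question
passages = retrieve(query_vector, store)
prompt = build_prompt(question, passages)
print(prompt)
```

Only the two passages closest to the question reach the prompt, which is the step that lets the LLM answer from data it was never trained on.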

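Getting ready

We use the IMDB dataset from Kaggle, which can be downloaded from https://www.kaggle.com/PromptCloudHQ/imdb-data and is also included in the book's GitHub repository at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/IMDB-Movie-Data.csv. Download the dataset and unzip the CSV file.

We also use OpenAI embeddings and the llama_index package, which is included in the poetry environment.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter03/3.9_vector_search.ipynb.

How to do it…

We load the IMDB dataset, create a vector store from its first 10 entries, and then query the vector store using the llama_index package:

Run the utilities notebook: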
```
%run -i "../util/file_utils.ipynb"
```

Import the necessary classes and packages:

```
import csv
import openai
from llama_index import VectorStoreIndex
from llama_index import Document
openai.api_key = OPEN_AI_KEY
```

Read the CSV data. We skip the first row, which contains the headers:

```
with open('../data/IMDB-Movie-Data.csv') as f:
    reader = csv.reader(f)
    data = list(reader)
    movies = data[1:]
```

In this step, we use the first 10 rows of the data we just read to create a list of Document objects and then a VectorStoreIndex object that contains them. An index is an object used for searching, in which each record holds certain information. A vector store index stores the metadata of each record together with its vector representation. For each movie, the description is the text that gets embedded, and the remaining fields become metadata. We print each Document object and can see that each one has been assigned a unique ID:

```
documents = []
for movie in movies[0:10]:
    doc_id = movie[0]
    title = movie[1]
    genres = movie[2].split(",")
    description = movie[3]
    director = movie[4]
    actors = movie[5].split(",")
    year = movie[6]
    duration = movie[7]
    rating = movie[8]
    revenue = movie[10]
    document = Document(
        text=description,
        metadata={
            "title": title,
            "genres": genres,
            "director": director,
            "actors": actors,
            "year": year,
            "duration": duration,
            "rating": rating,
            "revenue": revenue
        }
    )
    print(document)
    documents.append(document)
index = VectorStoreIndex.from_documents(documents)
```

Part of the output will look similar to this:

```
id_='6e1ef633-f10b-44e3-9b77-f5f7b08dcedd' embedding=None metadata={'title': 'Guardians of the Galaxy', 'genres': ['Action', 'Adventure', 'Sci-Fi'], 'director': 'James Gunn', 'actors': ['Chris Pratt', ' Vin Diesel', ' Bradley Cooper', ' Zoe Saldana'], 'year': '2014', 'duration': '121', 'rating': '8.1', 'revenue': '333.13'} excluded_embed_metadata_keys=[] excluded_llm_metadata_keys=[] relationships={} hash='e18bdce3a36c69d8c1e55a7eb56f05162c68c97151cbaf4091814ae3df42dfe8' text='A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.' start_char_idx=None end_char_idx=None text_template='{metadata_str}\n\n{content}' metadata_template='{key}: {value}' metadata_seperator='\n'
```

Create a query engine from the index we just created. The query engine lets us send questions about the documents loaded into the index:
```
query_engine = index.as_query_engine()
```

Use the engine to answer a question:

```
response = query_engine.query("""Which movies talk about something gigantic?""")
print(response.response)
```

The response will be similar to the following:

```
The Great Wall and Prometheus both talk about something gigantic. In The Great Wall, the protagonists become embroiled in the defense of the Great Wall of China against a horde of monstrous creatures. In Prometheus, the protagonists find a structure on a distant moon.
```

The answer seems to make sense grammatically, and arguably the Great Wall of China is gigantic. However, it is not clear what is gigantic in the movie Prometheus. So here we have a partially correct answer.

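Chapter 4: Classifying Texts

In this chapter, we classify texts using different methods. Text classification is a classic NLP problem. The task consists of assigning a value to a text, for example, a topic (such as sport or business) or a sentiment (such as negative or positive), and any such task needs evaluation.

After reading this chapter, you will be able to preprocess and classify texts using keywords, unsupervised clustering, and two supervised algorithms: support vector machines (SVMs) and a convolutional neural network (CNN) model trained within the spaCy framework. We will also use GPT-3.5 to classify texts.

For theoretical background on some of the concepts discussed in this chapter, refer to Building Machine Learning Systems with Python by Coelho et al. That book explains the basics of building a machine learning project, such as training and test sets, as well as the metrics used to evaluate such projects, including precision, recall, F1, and accuracy.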
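As a quick reminder of how the four metrics relate, here is a minimal sketch that computes them for a binary task from lists of true and predicted labels. The binary_metrics function and the six example labels are invented for illustration; the recipes themselves rely on the classification_report function from sklearn:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(pairs),
    }


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were right, recall asks how many actual positives were found, and F1 is their harmonic mean.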

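Here is the list of recipes in this chapter:

Getting the dataset and evaluation ready
Performing rule-based text classification using keywords
Clustering sentences using K-Means – unsupervised text classification
Using SVMs for supervised text classification
Training a spaCy model for supervised text classification
Classifying texts using OpenAI models

Technical requirements

The code for this chapter can be found in the Chapter04 folder of the book's GitHub repository (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition). As usual, we use the poetry environment to install the necessary packages. You can also install the required packages using the provided requirements.txt file. We use the Hugging Face datasets package to get all the datasets used in this chapter.

Getting the dataset and evaluation ready

In this recipe, we load a dataset, prepare it for processing, and create an evaluation baseline. This recipe builds on some of the recipes in Chapter 3, where we used different tools to represent text in a computer-readable form.

Getting ready

For this recipe, we use the Rotten Tomatoes reviews dataset, which is available through Hugging Face. The dataset contains user movie reviews that can be classified as positive or negative. We prepare the dataset for machine learning classification. In this case, the preparation involves loading the reviews, filtering out non-English reviews, tokenizing the text into words, and removing stop words. Before a machine learning algorithm can run, the text reviews need to be transformed into vectors. This transformation is described in detail in Chapter 3.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.1_data_preparation.ipynb.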

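How to do it…

We classify whether the sentiment of an input review is negative or positive. We first filter out non-English text, then tokenize it into words and remove stop words and punctuation. Finally, we look at the class distribution and review the most common words in each class.

Here are the steps:

Run the simple classifier file:

```
%run -i "../util/util_simple_classifier.ipynb"
```

Import the necessary classes. We import the detect function from langdetect, which helps us determine the language of a review. We also import the word_tokenize function, which we use to split the reviews into words. The FreqDist class from NLTK helps us see the most frequent words in the positive and negative reviews. We use the stopwords list from NLTK to filter stop words out of the text. Finally, the punctuation string from the string package helps us filter out punctuation: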
```
from langdetect import detect
from nltk import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from string import punctuation
```

Load the training and test datasets using the function from the simple classifier file, and print both dataframes. We see that the data contains a text column and a label column, and that the text is lowercase:

```
(train_df, test_df) = load_train_test_dataset_pd("train", 
    "test")
print(train_df)
print(test_df)
```

The output should look similar to the following:

```
                                                   text  label
0     the rock is destined to be the 21st century's ...      1
1     the gorgeously elaborate continuation of " the...      1
...                                                 ...    ...
8525  any enjoyment will be hinge from a personal th...      0
8526  if legendary shlockmeister ed wood had ever ma...      0
[8530 rows x 2 columns]
                                                   text  label
0     lovingly photographed in the manner of a golde...      1
1                 consistently clever and suspenseful .      1
...                                                 ...    ...
1061  a terrible movie that some people will neverth...      0
1062  there are many definitions of 'time waster' bu...      0
[1066 rows x 2 columns]
```

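Now we create a new column in the dataframe called lang, which holds the language of each review. We populate the column by passing the detect function to the apply method. We then filter the dataframe so that it contains only English reviews. Comparing the row counts of the training dataframe before and after filtering shows that roughly 170 to 180 rows are not in English; the exact number can vary between runs because langdetect is not deterministic unless it is seeded, which is why the printouts in this recipe show both 8,364 and 8,352 rows. This step may take a minute to run:

```
train_df["lang"] = train_df["text"].apply(detect)
train_df = train_df[train_df['lang'] == 'en']
print(train_df)
```

The output should now look like this:

```
                                                   text  label lang
0     the rock is destined to be the 21st century's ...      1   en
1     the gorgeously elaborate continuation of " the...      1   en
...                                                 ...    ...  ...
8528    interminably bleak , to say nothing of boring .      0   en
8529  things really get weird , though not particula...      0   en
[8364 rows x 3 columns]
```

Now we do the same for the test dataframe:

```
test_df["lang"] = test_df["text"].apply(detect)
test_df = test_df[test_df['lang'] == 'en']
```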

Now we tokenize the text into words. If you get an error saying that the english.pickle tokenizer was not found, run the line nltk.download('punkt') before running the rest of the code. This code is also included in the lang_utils notebook (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/util/lang_utils.ipynb):

```
train_df["tokenized_text"] = train_df["text"].apply(
    word_tokenize)
print(train_df)
test_df["tokenized_text"] = test_df["text"].apply(word_tokenize)
print(test_df)
```

The result will be similar to the following:

```
                                                   text  label lang  \
0     the rock is destined to be the 21st century's ...      1   en
1     the gorgeously elaborate continuation of " the...      1   en
...                                                 ...    ...  ...
8528    interminably bleak , to say nothing of boring .      0   en
8529  things really get weird , though not particula...      0   en
                                         tokenized_text
0     [the, rock, is, destined, to, be, the, 21st, c...
1     [the, gorgeously, elaborate, continuation, of,...
...                                                 ...
8528  [interminably, bleak, ,, to, say, nothing, of,...
8529  [things, really, get, weird, ,, though, not, p...
[8352 rows x 4 columns]
```

In this step, we remove stop words and punctuation. First, we load the stop words using the NLTK package. We then add 's and `` to the stop word list. You can add other words that you also consider stop words. Next, we define a function that takes a list of words as input and filters it, returning a new list of words that contains no stop words or punctuation. Finally, we apply this function to the training and test data. The printout shows that stop words and punctuation have been removed:

```
stop_words = list(stopwords.words('english'))
stop_words.append("``")
stop_words.append("'s")
def remove_stopwords_and_punct(x):
    new_list = [w for w in x if w not in stop_words and w not in punctuation]
    return new_list
train_df["tokenized_text"] = train_df["tokenized_text"].apply(
    remove_stopwords_and_punct)
print(train_df)
test_df["tokenized_text"] = test_df["tokenized_text"].apply(
    remove_stopwords_and_punct)
print(test_df)
```

The result will be similar to this:

```
                                                   text  label lang  \
0     the rock is destined to be the 21st century's ...      1   en
1     the gorgeously elaborate continuation of " the...      1   en
...                                                 ...    ...  ...
8528    interminably bleak , to say nothing of boring .      0   en
8529  things really get weird , though not particula...      0   en
                                         tokenized_text
0     [rock, destined, 21st, century, new, conan, go...
1     [gorgeously, elaborate, continuation, lord, ri...
...                                                 ...
8528        [interminably, bleak, say, nothing, boring]
8529  [things, really, get, weird, though, particula...
[8352 rows x 4 columns]
```

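Now we check the class balance in both datasets. It is important that the classes contain roughly the same number of items: if one class dominates, the model can learn to always assign that dominant class without making many mistakes:

```
print(train_df.groupby('label').count())
print(test_df.groupby('label').count())
```

We see that there are slightly more negative reviews than positive ones in the training data, but not significantly so, and the two numbers are almost equal in the test data.

```
       text  lang  tokenized_text
label
0      4185  4185            4185
1      4167  4167            4167
       text  lang  tokenized_text
label
0       523   523             523
1       522   522             522
```

Now we save the cleaned data to disk:

```
train_df.to_json("../data/rotten_tomatoes_train.json")
test_df.to_json("../data/rotten_tomatoes_test.json")
```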

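In this step, we define a function that takes a list of words and a number of words as input and returns a FreqDist object. It also prints the n most frequent words, where n is the number passed to the function; the default is 200: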
```
def get_stats(word_list, num_words=200):
    freq_dist = FreqDist(word_list)
    print(freq_dist.most_common(num_words))
    return freq_dist
```

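Now we use the preceding function to show the most common words in the positive and negative reviews, to see whether there is a significant vocabulary difference between the two classes. We create two word lists, one for positive reviews and one for negative reviews. We first filter the dataframe by label and then use the sum function to collect the words from all the reviews:

```
positive_train_words = train_df[
    train_df["label"] == 1].tokenized_text.sum()
negative_train_words = train_df[
    train_df["label"] == 0].tokenized_text.sum()
positive_fd = get_stats(positive_train_words)
negative_fd = get_stats(negative_train_words)
```

In the output, we see that the words film and movie, along with a few others, also act as stop words in this case, because they are the most common words in both sets. We can add them to the stop word list in step 7 and redo the cleaning: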

[('film', 683), ('movie', 429), ("n't", 286), ('one', 280), ('--', 271), ('like', 209), ('story', 194), ('comedy', 160), ('good', 150), ('even', 144), ('funny', 137), ('way', 135), ('time', 127), ('best', 126), ('characters', 125), ('make', 124), ('life', 124), ('much', 122), ('us', 122), ('love', 118), ...]
[('movie', 641), ('film', 557), ("n't", 450), ('like', 354), ('one', 293), ('--', 264), ('story', 189), ('much', 177), ('bad', 173), ('even', 160), ('time', 146), ('good', 143), ('characters', 138), ('little', 137), ('would', 130), ('never', 122), ('comedy', 121), ('enough', 107), ('really', 105), ('nothing', 103), ('way', 102), ('make', 101), ...]

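Performing rule-based text classification using keywords

In this recipe, we use the vocabulary of the text to classify the Rotten Tomatoes reviews. We create a simple classifier that has one vectorizer per class. Each vectorizer includes only the words that are unique to that class. To classify a text, we vectorize it with each vectorizer and pick the class whose vectorizer matches more words.

Getting ready

We use the CountVectorizer class and the classification_report function from sklearn, as well as the word_tokenize method from NLTK. All of these are included in the poetry environment.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.2_rule_based.ipynb.

How to do it…

In this recipe, we create a separate vectorizer for each class. We then use these vectorizers to count the words of each class in each review in order to classify it:

Run the simple classifier file:

```
%run -i "../util/util_simple_classifier.ipynb"
```

Do the necessary imports:

```
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
```

Load the cleaned data from disk. If you get a FileNotFoundError at this step, you need to run the previous recipe, Getting the dataset and evaluation ready, first, because those files are created after the data is cleaned: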
```
train_df = pd.read_json("../data/rotten_tomatoes_train.json")
test_df = pd.read_json("../data/rotten_tomatoes_test.json")
```
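Here, we create a list of the words that are unique to each class. We first concatenate all the words in the text column, filtered by the relevant label value (0 for negative reviews and 1 for positive reviews). We then collect the words that occur in both lists in the word_intersection variable. Finally, we create a filtered word list for each class that excludes the words occurring in both classes. In other words, we remove from each list every word that appears in both positive and negative reviews: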
```
# Concatenate the token lists of all positive (label 1) and negative (label 0) reviews
positive_train_words = train_df[train_df["label"] == 1].text.sum()
negative_train_words = train_df[train_df["label"] == 0].text.sum()
# Words that occur in both classes carry no signal for this classifier
word_intersection = (set(positive_train_words)
    & set(negative_train_words))
positive_filtered = list(set(positive_train_words)
    - word_intersection)
negative_filtered = list(set(negative_train_words)
    - word_intersection)
```
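Next, we define a function that creates the vectorizers, one per class. The input to the function is a list of lists, where each inner list contains the words that occur only in one class; we created these lists in the previous step. For each word list, we create a CountVectorizer object and pass the word list as the vocabulary parameter. Providing this parameter ensures that only these words are counted for classification purposes: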
```
def create_vectorizers(word_lists):
    vectorizers = []
    for word_list in word_lists:
        vectorizer = CountVectorizer(vocabulary=word_list)
        vectorizers.append(vectorizer)
    return vectorizers
```

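Create the vectorizers using the preceding function. The negative list comes first, so the position of each vectorizer in the list matches its class label (0 for negative, 1 for positive):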

vectorizers = create_vectorizers([negative_filtered,
    positive_filtered])

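In this step, we create a vectorize function that takes a list of words and a list of vectorizers. We first join the word list into a string, because the vectorizers expect a string. We then apply each vectorizer in the list to the text and sum the counts of all the words in its vocabulary. We append this sum to the list of scores, so that the list holds one word count per class. The function returns this list of scores: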
```
def vectorize(text_list, vectorizers):
    text = " ".join(text_list)
    scores = []
    for vectorizer in vectorizers:
        output = vectorizer.transform([text])
        output_sum = sum(output.todense().tolist()[0])
        scores.append(output_sum)
    return scores
```
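In this step, we define the classify function, which takes the list of scores returned by the vectorize function. It simply picks the maximum score from the list and returns the index of that score, which corresponds to the class label: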
```
def classify(score_list):
    return max(enumerate(score_list),key=lambda x: x[1])[0]
```
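The scoring and tie-breaking logic of the preceding functions can be illustrated without sklearn. The following is a minimal, dependency-free sketch on invented toy reviews; the names build_class_vocab, score_tokens, and classify_tokens are hypothetical and are not part of the recipe's notebook. It also shows that Python's max returns the first maximal item, so a review with equal scores for both classes is assigned to class 0:

```python
from collections import Counter

def build_class_vocab(docs_by_class):
    """Map each class label to the set of words that occur only in that class."""
    all_words = {label: {word for doc in docs for word in doc}
                 for label, docs in docs_by_class.items()}
    class_vocab = {}
    for label, words in all_words.items():
        # Collect every word seen in any other class
        other_words = set()
        for other_label, other in all_words.items():
            if other_label != label:
                other_words |= other
        class_vocab[label] = words - other_words
    return class_vocab

def score_tokens(tokens, class_vocab):
    """Count how many tokens belong to each class-specific vocabulary."""
    counts = Counter(tokens)
    return [sum(counts[word] for word in class_vocab[label])
            for label in sorted(class_vocab)]

def classify_tokens(tokens, class_vocab):
    """Return the index of the highest score; ties go to the lowest index."""
    scores = score_tokens(tokens, class_vocab)
    return max(enumerate(scores), key=lambda x: x[1])[0]

# Toy data (invented for illustration): label 0 is negative, label 1 is positive
toy_reviews = {
    0: [["boring", "bad", "film"], ["dull", "film"]],
    1: [["great", "fun", "film"], ["charming", "movie"]],
}
toy_vocab = build_class_vocab(toy_reviews)
print(classify_tokens(["charming", "fun", "film"], toy_vocab))
print(classify_tokens(["film"], toy_vocab))
```

The first call matches two positive-only words and returns 1. The second matches nothing in either vocabulary, so both scores are 0 and the tie resolves to class 0. This default toward class 0 is one plausible explanation (our interpretation, not something measured in the recipe) for a higher recall on class 0 than on class 1.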

Here, we apply the preceding functions to the training data. We first vectorize the text and then classify it. We store the result in a new column named prediction:

train_df["prediction"] = train_df["text"].apply(
    lambda x: classify(vectorize(x, vectorizers)))
print(train_df)

The output will look similar to the following:

                                                   text  label lang  \
0     [rock, destined, 21st, century, new, conan, go...      1   en
1     [gorgeously, elaborate, continuation, lord, ri...      1   en
...                                                 ...    ...  ...
8528        [interminably, bleak, say, nothing, boring]      0   en
8529  [things, really, get, weird, though, particula...      0   en

      prediction
0              1
1              1
...          ...
8528           0
8529           0
[8364 rows x 4 columns]

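Now we measure the performance of the rule-based classifier by printing the classification report. We pass in the assigned labels and the prediction column. The result is an overall accuracy of 87%: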

print(classification_report(train_df['label'], 
    train_df['prediction']))

This produces the following result:

              precision    recall  f1-score   support
           0       0.79      0.99      0.88      4194
           1       0.99      0.74      0.85      4170
    accuracy                           0.87      8364
   macro avg       0.89      0.87      0.86      8364
weighted avg       0.89      0.87      0.86      8364

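Here, we do the same for the test data, and we see that the accuracy drops significantly, to 62%. This is because the vocabularies we used to create the vectorizers come only from the training data and are not exhaustive, which leads to errors on unseen data: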

test_df["prediction"] = test_df["text"].apply(
    lambda x: classify(vectorize(x, vectorizers)))
print(classification_report(test_df['label'], 
    test_df['prediction']))

The result is as follows:

              precision    recall  f1-score   support
           0       0.59      0.81      0.68       523
           1       0.70      0.43      0.53       524
    accuracy                           0.62      1047
   macro avg       0.64      0.62      0.61      1047
weighted avg       0.64      0.62      0.61      1047

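Clustering sentences using K-Means – unsupervised text classification

In this recipe, we will use the BBC news dataset. The dataset contains news articles sorted into five topics: politics, tech, business, sport, and entertainment. We will apply the unsupervised K-Means algorithm to group the data into unlabeled classes.

After reading this recipe, you will be able to create your own unsupervised clustering model that sorts data into several classes. You can then apply it to any text data without having to label it first.

Getting ready

We will use the KMeans algorithm to create our unsupervised model. It is part of the sklearn package and is included in the poetry environment.

The BBC news dataset we use here was uploaded by a Hugging Face user, and both the link and the dataset may change over time. To avoid potential problems, you can use the book's copy of the BBC dataset, loaded from the CSV file provided in the GitHub repository.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.3_unsupervised_classification.ipynb.

How to do it...

In this recipe, we will preprocess the data, vectorize it, and then cluster it using K-Means. Since unsupervised modeling usually has no correct answers, evaluating the model is more difficult, but we will be able to look at some statistics as well as the most common words in each cluster.

The steps are as follows:

Run the simple classifier and language utilities files: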

%run -i "../util/util_simple_classifier.ipynb"
%run -i "../util/lang_utils.ipynb"

Import the necessary functions and packages:

from nltk import word_tokenize
from sklearn.cluster import KMeans
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit

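We will load the BBC dataset. We use the load_dataset function from Hugging Face's datasets package. This function is imported by the simple classifier file that we ran in step 1. In the Hugging Face repository, datasets are usually split into training and test sets. We will load both, although the test set is usually not used in unsupervised learning: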

train_dataset = load_dataset("SetFit/bbc-news", split="train")
test_dataset = load_dataset("SetFit/bbc-news", split="test")
train_df = train_dataset.to_pandas()
test_df = test_dataset.to_pandas()
print(train_df)
print(test_df)

The result will look similar to this:

                                                   text  label     label_text
0     wales want rugby league training wales could f...      2          sport
1     china aviation seeks rescue deal scandal-hit j...      1       business
...                                                 ...    ...            ...
1223  why few targets are better than many the econo...      1       business
1224  boothroyd calls for lords speaker betty boothr...      4       politics
[1225 rows x 3 columns]
                                                  text  label     label_text
0    carry on star patsy rowlands dies actress pats...      3  entertainment
1    sydney to host north v south game sydney will ...      2          sport
..                                                 ...    ...            ...
998  stormy year for property insurers a string of ...      1       business
999  what the election should really be about  a ge...      4       politics
[1000 rows x 3 columns]

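Now we will check the distribution of items per class in the training and test data. Class balance is important in classification, because a disproportionately large class will skew the final classifier: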

print(train_df.groupby('label_text').count())
print(test_df.groupby('label_text').count())

We see that the classes are distributed fairly evenly, with somewhat more examples in the business and sport classes:

               text  label
label_text
business        286    286
entertainment   210    210
politics        242    242
sport           275    275
tech            212    212
               text  label
label_text
business        224    224
entertainment   176    176
politics        175    175
sport           236    236
tech            189    189

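Since the test set contains almost as much data as the training set, we will combine the data and create a better train/test split. We first concatenate the two dataframes. We then create a StratifiedShuffleSplit object, which creates a train/test split while preserving the class balance. We specify that we need only one split (n_splits) and that the test data should make up 20% of the whole dataset (test_size). The split method of the sss object returns a generator that yields the indices of each split. We use these indices to get the new training and test dataframes: we filter on the relevant indices and then copy the resulting slice. Without the copy, we would keep working on a slice of the combined dataframe, and later column assignments can trigger pandas' SettingWithCopyWarning. Finally, we print the class counts for both dataframes and see that there is now more training data and less test data: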

combined_df = pd.concat([train_df, test_df],
    ignore_index=True, sort=False)
print(combined_df)
sss = StratifiedShuffleSplit(n_splits=1,
    test_size=0.2, random_state=0)
train_index, test_index = next(
    sss.split(combined_df["text"], combined_df["label"]))
train_df = combined_df[combined_df.index.isin(
    train_index)].copy()
test_df = combined_df[combined_df.index.isin(test_index)].copy()
print(train_df.groupby('label_text').count())
print(test_df.groupby('label_text').count())

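The result should look like this (the sample output was presumably captured after the later steps had already run, which would explain the text_tokenized, text_clean, and cluster columns; on a first run you should see only the text and label columns):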

               text  label  text_tokenized  text_clean  cluster
label_text
business        408    408             408         408      330
entertainment   309    309             309         309      253
politics        333    333             333         333      263
sport           409    409             409         409      327
tech            321    321             321         321      262
               text  label  text_tokenized  text_clean  cluster
label_text
business        102    102             102         102       78
entertainment    77     77              77          77       56
politics         84     84              84          84       70
sport           102    102             102         102       82
tech             80     80              80          80       59
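To make the idea of a stratified split concrete, here is a minimal sketch in plain Python. It is a simplification written for illustration, not the algorithm that StratifiedShuffleSplit implements internally; the function name stratified_split and the toy labels are our own assumptions. It shuffles the indices of each class separately and moves the same fraction of every class into the test set:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) that keep each label's share."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Toy labels (invented): 10 items of class "a" and 5 items of class "b"
toy_labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2)
print(len(train_idx), len(test_idx))
print([toy_labels[i] for i in test_idx])
```

With these toy labels, the test set receives two items of class "a" and one of class "b", so the 2:1 class ratio of the full data is preserved in both parts.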

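Now we will preprocess the data: we tokenize it and remove stopwords and punctuation. The functions that do this (tokenize and remove_stopword_punct) are imported by the lang_utils file that we ran in step 1. If you get an error saying that the english.pickle tokenizer cannot be found, run the line nltk.download('punkt') before running the rest of the code. This line is also included in the lang_utils notebook: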
```
train_df = tokenize(train_df, "text")
train_df = remove_stopword_punct(train_df, "text_tokenized")
test_df = tokenize(test_df, "text")
test_df = remove_stopword_punct(test_df, "text_tokenized")
```
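In this step, we create the vectorizer. To do that, we take all the words from the training news articles. First, we store the cleaned text in a separate column, text_clean, and save both dataframes to disk. We then create a TF-IDF vectorizer that counts unigrams, bigrams, and trigrams (the ngram_range parameter), and fit it on the training data only. If we fitted it on both the training and test data, we would introduce data leakage, and the test scores would look better than the model's real performance on unseen data: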
```
train_df["text_clean"] = train_df["text_tokenized"].apply(
    lambda x: " ".join(list(x)))
test_df["text_clean"] = test_df["text_tokenized"].apply(
    lambda x: " ".join(list(x)))
train_df.to_json("../data/bbc_train.json")
test_df.to_json("../data/bbc_test.json")
vec = TfidfVectorizer(ngram_range=(1,3))
matrix = vec.fit_transform(train_df["text_clean"])
```
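To see which features ngram_range=(1,3) produces, the following sketch enumerates the word n-grams of a short token list in plain Python. It only illustrates which word sequences become features; TfidfVectorizer performs its own tokenization and also computes the TF-IDF weights. The function name extract_ngrams is hypothetical:

```python
def extract_ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["lib", "dems", "new", "election"]
features = extract_ngrams(tokens)
print(features)
```

Four tokens yield four unigrams, three bigrams, and two trigrams, which is why the vocabulary of an n-gram vectorizer grows much faster than that of a unigram-only one.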
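Note

In a real project, you will not know the number of clusters in advance as we do here. You will need to use the elbow method or another technique to estimate the optimal number of classes.

Now we can create the KMeans classifier with five clusters. We specify the number of clusters with the n_clusters parameter. We also use the n_init parameter to set the number of times the algorithm is run to 10; multiple runs are recommended for high-dimensional problems. After initializing the classifier, we fit it on the matrix created by the vectorizer in step 7. This creates the clusters of the training data: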

km = KMeans(n_clusters=5, n_init=10)
km.fit(matrix)
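The elbow method is usually applied by plotting the inertia (the within-cluster sum of squared distances, available as the inertia_ attribute of a fitted KMeans model) for several values of k and looking for the point where the curve flattens. The sketch below automates one simple version of that judgment: it picks the k at which the drop in inertia slows down the most. The function name pick_elbow and the inertia values are invented for illustration, the candidate k values are assumed to be consecutive, and this heuristic is only one of several possible ways to locate the elbow:

```python
def pick_elbow(inertia_by_k):
    """Return the k where the drop in inertia slows down the most."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    # The first and last k have a neighbor on one side only, so they are skipped
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_after = inertia_by_k[k] - inertia_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up inertia values for illustration only; in practice they would come
# from the inertia_ attribute of a KMeans model fitted for each candidate k
toy_inertia = {2: 1000.0, 3: 700.0, 4: 450.0, 5: 250.0, 6: 230.0, 7: 215.0}
print(pick_elbow(toy_inertia))
```

In this toy example, inertia falls steeply up to k=5 and barely changes afterward, so the heuristic selects 5.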

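The get_most_frequent_words function returns a list of the most frequent words in a text. This list gives us a clue about the topic of the text. We will use the function to print the most frequent words in each cluster so that we can understand which topic each cluster refers to. The function takes the input text, tokenizes it, and creates a FreqDist object. We use its most_common function to get the top (word, frequency) tuples, and then keep only the words, without the frequencies, and return them as a list: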
```
def get_most_frequent_words(text, num_words):
    word_list = word_tokenize(text)
    freq_dist = FreqDist(word_list)
    top_words = freq_dist.most_common(num_words)
    top_words = [word[0] for word in top_words]
    return top_words
```
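In this step, we define another function, print_most_common_words_by_cluster, which uses the get_most_frequent_words function that we defined in the previous step. It takes a dataframe, the KMeans model, and the number of clusters as input. We get the list of clusters assigned to each data point and create a column in the dataframe that holds the assigned cluster. For each cluster, we filter the dataframe to get only the text for that cluster and pass this text to get_most_frequent_words to get the most frequent words in that cluster. We print the cluster number and the word list, and return the input dataframe with the added cluster column: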
```
def print_most_common_words_by_cluster(input_df, km, 
    num_clusters):
    clusters = km.labels_.tolist()
    input_df["cluster"] = clusters
    for cluster in range(0, num_clusters):
        this_cluster_text = input_df[
            input_df['cluster'] == cluster]
        all_text = " ".join(
            this_cluster_text['text_clean'].astype(str))
        top_200 = get_most_frequent_words(all_text, 200)
        print(cluster)
        print(top_200)
    return input_df
```
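Here, we apply the function defined in the previous step to the training dataframe. We also pass in the fitted KMeans model and the number of clusters, 5. The printout gives us an idea of which cluster corresponds to which topic. The cluster numbers may vary, but the cluster with labour, party, and election among its most frequent words is the politics cluster; the cluster with the words music, awards, and show is the entertainment cluster; the cluster with the words game, england, win, play, and cup is the sport cluster; the cluster with the words sales and growth is the business cluster; and the cluster with the words software, net, and search is the tech cluster. We also notice that the words said and mr are clearly stopwords, as they appear near the top of most clusters: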
```
print_most_common_words_by_cluster(train_df, km, 5)
```
The results will vary each time you run the training, but they might look like this (output is truncated):
```
0
['mr', 'said', 'would', 'labour', 'party', 'election', 'blair', 'government', ...]
1
['film', 'said', 'best', 'also', 'year', 'one', 'us', 'awards', 'music', 'new', 'number', 'award', 'show', ...]
2
['said', 'game', 'england', 'first', 'win', 'world', 'last', 'one', 'two', 'would', 'time', 'play', 'back', 'cup', 'players', ...]
3
['said', 'mr', 'us', 'year', 'people', 'also', 'would', 'new', 'one', 'could', 'uk', 'sales', 'firm', 'growth', ...]
4
['said', 'people', 'software', 'would', 'users', 'mr', 'could', 'new', 'microsoft', 'security', 'net', 'search', 'also', ...]
```

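In this step, we use the fitted model to predict the cluster of a test example. We use the text of the row at index 1 of the test dataframe, which is a politics example. We use the vectorizer to turn the text into a vector and then use the K-Means model to predict the cluster. The prediction is cluster 0, which is correct in this case: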

test_example = test_df.iloc[1, test_df.columns.get_loc('text')]
print(test_example)
vectorized = vec.transform([test_example])
prediction = km.predict(vectorized)
print(prediction)

The result might look like this:

lib dems  new election pr chief the lib dems have appointed a senior figure from bt to be the party s new communications chief for their next general election effort.  sandy walkington will now work with senior figures such as matthew taylor on completing the party manifesto. party chief executive lord rennard said the appointment was a  significant strengthening of the lib dem team . mr walkington said he wanted the party to be ready for any  mischief  rivals or the media tried to throw at it.   my role will be to ensure this new public profile is effectively communicated at all levels   he said.  i also know the party will be put under scrutiny in the media and from the other parties as never before - and we will need to show ourselves ready and prepared to counter the mischief and misrepresentation that all too often comes from the party s opponents.  the party is already demonstrating on every issue that it is the effective opposition.  mr walkington s new job title is director of general election communications.
[0]

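Finally, we save the model using the dump function from the joblib package and then load it again using the load function. We check the prediction of the loaded model, and it is the same as that of the model in memory. This step allows us to reuse the model in the future; to classify new text in a later session, you would also need to save the fitted vectorizer: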
```
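# dump and load come from joblib; the import is repeated here in case
# the utility notebook has not already provided it
from joblib import dump, load

dump(km, '../data/kmeans.joblib')
km_ = load('../data/kmeans.joblib')
prediction = km_.predict(vectorized)
print(prediction)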
```
The result might look like this:
```
[0]
```

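Using SVMs for supervised text classification

In this recipe, we will build a machine learning classifier that uses the SVM algorithm. By the end of the recipe, you will have a working classifier that you can test on new inputs and evaluate using the same classification_report tool that we used in the previous sections. We will use the same BBC news dataset that we used with KMeans.

Getting ready

We will continue working with the same packages that were installed for the previous recipes. The required packages are installed in the poetry environment or by installing the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.4-svm_classification.ipynb.

How to do it…

We will load the cleaned training and test data that we saved in the previous recipe. We will then create the SVM classifier and train it. We will use BERT encoding as our vectorizer.

The steps are as follows:

Run the simple classifier file: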

%run -i "../util/util_simple_classifier.ipynb"

Import the necessary functions and packages:

from sklearn.svm import SVC
from sentence_transformers import SentenceTransformer
from sklearn.metrics import confusion_matrix

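Here, we load the training and test data. If you get a FileNotFoundError at this step, run steps 1-7 of the previous recipe, Clustering sentences using K-Means – unsupervised text classification. We then shuffle the training data using the sample function. Shuffling ensures that we do not have long runs of data that belong to the same class. Finally, we print the number of examples per class. We see that the classes are roughly balanced, which is important for training the classifier: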

train_df = pd.read_json("../data/bbc_train.json")
test_df = pd.read_json("../data/bbc_test.json")
# sample returns a new dataframe, so assign the result to keep the shuffle
train_df = train_df.sample(frac=1)
print(train_df.groupby('label_text').count())
print(test_df.groupby('label_text').count())

The result will be as follows:

               text  label  text_tokenized  text_clean  cluster
label_text
business        231    231             231         231      231
entertainment   181    181             181         181      181
politics        182    182             182         182      182
sport           243    243             243         243      243
tech            194    194             194         194      194
               text  label  text_tokenized  text_clean
label_text
business         58     58              58          58
entertainment    45     45              45          45
politics         45     45              45          45
sport            61     61              61          61
tech             49     49              49          49

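Here, we load the all-MiniLM-L6-v2 sentence transformer model, which provides the vectors for us. To learn more about this model, read the Using BERT and OpenAI embeddings instead of word embeddings recipe in Chapter 3. We then define the get_sentence_vector function, which returns the sentence embedding of the input text: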
```
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_sentence_vector(text, model):
    sentence_embeddings = model.encode([text])
    return sentence_embeddings[0]
```
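Define a function that creates an SVM object and trains it on the given input data. It takes the input vectors and the gold labels, creates an SVC object with an RBF kernel and a regularization parameter (C) of 0.1, and trains it on the training data. It then returns the trained classifier: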
```
def train_classifier(X_train, y_train):
    clf = SVC(C=0.1, kernel='rbf')
    clf = clf.fit(X_train, y_train)
    return clf
```

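In this step, we create the list of labels for the classifier and the vectorize method. We then create the training and test datasets using the create_train_test_data method located in the simple classifier file. Next, we train the classifier using the train_classifier function and print the training and test metrics. We see that the test metrics are very good, all above 90%: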

target_names=["tech", "business", "sport", 
    "entertainment", "politics"]
vectorize = lambda x: get_sentence_vector(x, model)
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize, column_name="text_clean")
clf = train_classifier(X_train, y_train)
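# Caution: if y_train holds the gold labels (as its name suggests), this call
# compares the labels with themselves and is perfect by construction; pass
# clf.predict(X_train) as the second argument to score the classifier itself
print(classification_report(train_df["label"],
        y_train, target_names=target_names))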
test_classifier(test_df, clf, target_names=target_names)

The output will be as follows:

               precision    recall  f1-score   support
         tech       1.00      1.00      1.00       194
     business       1.00      1.00      1.00       231
        sport       1.00      1.00      1.00       243
entertainment       1.00      1.00      1.00       181
     politics       1.00      1.00      1.00       182
     accuracy                           1.00      1031
    macro avg       1.00      1.00      1.00      1031
 weighted avg       1.00      1.00      1.00      1031
               precision    recall  f1-score   support
         tech       0.92      0.98      0.95        49
     business       0.95      0.90      0.92        58
        sport       1.00      1.00      1.00        61
entertainment       1.00      0.98      0.99        45
     politics       0.96      0.98      0.97        45
     accuracy                           0.97       258
    macro avg       0.97      0.97      0.97       258
 weighted avg       0.97      0.97      0.96       258

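In this step, we print the confusion matrix to see where the classifier makes mistakes. The rows represent the correct labels and the columns represent the predicted labels. The most confusion (four examples) occurs where the correct label is business but tech was predicted, followed by cases where the correct label is business but politics was predicted (two examples). We also see that business is incorrectly predicted for one tech, one entertainment, and one politics example. These errors are reflected in the metrics, where both the recall and the precision for business suffer. The only class with perfect scores is sport, which has zeros everywhere in the confusion matrix except at the intersection of its own row and column. We can use the confusion matrix to see which classes are confused with each other the most and take measures to correct this if necessary: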
```
print(confusion_matrix(test_df["label"], test_df["prediction"]))
[[48  1  0  0  0]
 [ 4 52  0  0  2]
 [ 0  0 61  0  0]
 [ 0  1  0 44  0]
 [ 0  1  0  0 44]]
```
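The link between the confusion matrix and the classification report can be checked by hand. The following sketch, in plain Python, recomputes per-class precision and recall and the overall accuracy from the matrix printed above (precision divides the diagonal entry by its column sum, recall by its row sum). The function name per_class_metrics is our own:

```python
# Confusion matrix copied from the output above: rows are the correct labels,
# columns are the predicted labels
labels = ["tech", "business", "sport", "entertainment", "politics"]
matrix = [
    [48, 1, 0, 0, 0],
    [4, 52, 0, 0, 2],
    [0, 0, 61, 0, 0],
    [0, 1, 0, 44, 0],
    [0, 1, 0, 0, 44],
]

def per_class_metrics(matrix):
    """Return a list of (precision, recall) pairs, one per class."""
    metrics = []
    for i in range(len(matrix)):
        true_positives = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        metrics.append((true_positives / predicted_total,
                        true_positives / actual_total))
    return metrics

metrics = per_class_metrics(matrix)
for name, (precision, recall) in zip(labels, metrics):
    print(f"{name:>13}  precision={precision:.2f}  recall={recall:.2f}")

correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print(f"accuracy={correct / total:.2f}")
```

For business, this gives a precision of 52/55 and a recall of 52/58, which round to the 0.95 and 0.90 shown in the classification report, and the 249 correct predictions out of 258 give the reported accuracy of 0.97.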

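We will test the classifier on a new example. We first vectorize the text and then use the trained model to make a prediction and print it. The new article is about tech, and the predicted class is 0, which is indeed tech: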

new_example = """iPhone 12: Apple makes jump to 5G
Apple has confirmed its iPhone 12 handsets will be its first to work on faster 5G networks.
The company has also extended the range to include a new "Mini" model that has a smaller 5.4in screen.
The US firm bucked a wider industry downturn by increasing its handset sales over the past year.
But some experts say the new features give Apple its best opportunity for growth since 2014, when it revamped its line-up with the iPhone 6.
"5G will bring a new level of performance for downloads and uploads, higher quality video streaming, more responsive gaming, real-time interactivity and so much more," said chief executive Tim Cook.
…"""
vector = vectorize(new_example)
prediction = clf.predict([vector])
print(prediction)

The result will be as follows:

[0]

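There's more…

There are many other machine learning algorithms that can be used in place of the SVM algorithm, including logistic regression, Naïve Bayes, and decision trees. You can experiment with them and see which performs better.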

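Training a spaCy model for supervised text classification

In this recipe, we will train a spaCy model to predict text categories, using the same BBC dataset that we used in the previous recipes.

Getting ready

We will use the spaCy package to train our model. All dependencies are taken care of by the poetry environment.

You will need to download the config file from the book's GitHub repository, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/spacy_config.cfg. This file should be located at the path ../data/spacy_config.cfg relative to the notebook.

Note

You can modify the training configuration, or generate your own at https://spacy.io/usage/training.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.5-spacy_textcat.ipynb.

How to do it…

The general structure of the training is similar to that of ordinary machine learning model training: we clean the data, create a dataset, and split it into training and test sets. We then train a model and test it on unseen data:

Run the language utilities file: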
```
%run -i "../util/lang_utils.ipynb"
```

Import the necessary functions and packages:

import pandas as pd
from spacy.cli.train import train
from spacy.cli.evaluate import evaluate
from spacy.cli.debug_data import debug_data
from spacy.tokens import DocBin

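Here, we define the preprocess_data_entry function, which takes the input text, its label, and the list of all labels. It runs the small spaCy model on the text. This model was loaded when we ran the language utilities file in step 1. It does not matter which model we use at this step, since we only want to create a Doc object from the text; we run the smallest model because it takes the least time. We then create a one-hot encoding for the text category, setting the class label to 1 and the rest to 0. Next, we create a dictionary that maps each category name to its value, set the doc.cats attribute to this dictionary, and return the Doc object. spaCy requires the data to be preprocessed this way in order to train a classification model: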
```
def preprocess_data_entry(input_text, label, label_list):
    doc = small_model(input_text)
    cats = [0] * len(label_list)
    cats[label] = 1
    final_cats = {}
    for i, label in enumerate(label_list):
        final_cats[label] = cats[i]
    doc.cats = final_cats
    return doc
```
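The category dictionary that ends up in doc.cats can be built without spaCy, which makes its shape easy to inspect. The following is a compact sketch of the same one-hot encoding; the function name make_cats is hypothetical, and the label list is the one used later in this recipe:

```python
def make_cats(label, label_list):
    """Build a {category_name: 0 or 1} dictionary for a numeric label."""
    return {name: int(index == label)
            for index, name in enumerate(label_list)}

label_list = ["tech", "business", "sport", "entertainment", "politics"]
print(make_cats(4, label_list))
```

For label 4, only the politics entry is set to 1. Because the numeric label is used as an index into the label list, the order of the list must match the numeric labels of the dataset.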

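Now we prepare the training and test datasets. We create the DocBin objects that the spaCy algorithm requires for the training and test data. We then load the saved data from disk; this is the data we saved in the K-Means recipe. If you get a FileNotFoundError here, you need to run steps 1-7 of the Clustering sentences using K-Means – unsupervised text classification recipe. We then shuffle the training dataframe. Next, we preprocess each data point using the function we defined previously and add it to the corresponding DocBin object. Finally, we save the two datasets to disk: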

train_db = DocBin()
test_db = DocBin()
label_list = ["tech", "business", "sport", 
    "entertainment", "politics"]
train_df = pd.read_json("../data/bbc_train.json")
test_df = pd.read_json("../data/bbc_test.json")
# sample returns a new dataframe, so assign the result to keep the shuffle
train_df = train_df.sample(frac=1)
for idx, row in train_df.iterrows():
    text = row["text"]
    label = row["label"]
    doc = preprocess_data_entry(text, label, label_list)
    train_db.add(doc)
for idx, row in test_df.iterrows():
    text = row["text"]
    label = row["label"]
    doc = preprocess_data_entry(text, label, label_list)
    test_db.add(doc)
train_db.to_disk('../data/bbc_train.spacy')
test_db.to_disk('../data/bbc_test.spacy')

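Train the model using the train command. For the training to work, you need to download the config file into the data folder, as explained in the Getting ready section of this recipe. The training config specifies the locations of the training and test datasets, so you need to run the previous step for the training to work. The train command saves the model in the model-last subdirectory of the directory that we specify in the input (../models/spacy_textcat_bbc/ in this case):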

train("../data/spacy_config.cfg", output_path="../models/spacy_textcat_bbc")

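The output will vary, but it might look like this (truncated for readability). We can see that the final score of our trained model (the SCORE column) is 85%: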

ℹ Saving to output directory: ../models/spacy_textcat_bbc
ℹ Using CPU
=========================== Initializing pipeline ===========================
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.16        8.48    0.08
  0     200         20.77         37.26       35.58    0.36
  0     400         98.56         35.96       26.90    0.27
  0     600         49.83         37.31       36.60    0.37
… (truncated)
  4    4800       7571.47          9.64       80.25    0.80
  4    5000      16164.99         10.58       87.71    0.88
  5    5200       8604.43          8.20       84.98    0.85
✔ Saved pipeline to output directory
../models/spacy_textcat_bbc/model-last

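Now we test the model on an unseen example. We first load the model and then take an example from the test data. We check the text and its category. We run the model on the input text and print the resulting probabilities. The model returns a dictionary of categories with their respective probability scores. These scores indicate the probability that the text belongs to the corresponding category, and the category with the highest probability is the one we should assign to the text. The category dictionary is in the doc.cats attribute, just as it was when we prepared the data, but in this case the model assigns it. Here, the text is about politics, and the model classifies it correctly: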

nlp = spacy.load("../models/spacy_textcat_bbc/model-last")
input_text = test_df.iloc[1, test_df.columns.get_loc('text')]
print(input_text)
print(test_df["label_text"].iloc[[1]])
doc = nlp(input_text)
print("Predicted probabilities: ", doc.cats)

The output will look similar to this:

lib dems  new election pr chief the lib dems have appointed a senior figure from bt to be the party s new communications chief for their next general election effort.  sandy walkington will now work with senior figures such as matthew taylor on completing the party manifesto. party chief executive lord rennard said the appointment was a  significant strengthening of the lib dem team . mr walkington said he wanted the party to be ready for any  mischief  rivals or the media tried to throw at it.   my role will be to ensure this new public profile is effectively communicated at all levels   he said.  i also know the party will be put under scrutiny in the media and from the other parties as never before - and we will need to show ourselves ready and prepared to counter the mischief and misrepresentation that all too often comes from the party s opponents.  the party is already demonstrating on every issue that it is the effective opposition.  mr walkington s new job title is director of general election communications.
8    politics
Name: label_text, dtype: object
Predicted probabilities:  {'tech': 3.531841841208916e-08, 'business': 0.000641813559923321, 'sport': 0.00033847044687718153, 'entertainment': 0.00016174423217307776, 'politics': 0.9988579750061035}

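In this step, we define a get_prediction function that takes the text, the spaCy model, and the list of potential classes, and returns the index of the class with the highest probability. We then apply this function to the text column of the test dataframe: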

def get_prediction(input_text, nlp_model, target_names):
    doc = nlp_model(input_text)
    category = max(doc.cats, key = doc.cats.get)
    return target_names.index(category)
test_df["prediction"] = test_df["text"].apply(
    lambda x: get_prediction(x, nlp, label_list))

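Now we print the classification report based on the data in the test dataframe that we generated in the previous step. The overall accuracy of the model is 87%. It is somewhat low, most likely because we do not have enough data to train a better model: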

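# classification_report is imported here because this recipe does not run
# the simple classifier file; label_list is the list defined earlier
from sklearn.metrics import classification_report
print(classification_report(test_df["label"],
    test_df["prediction"], target_names=label_list))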

The result should look like this:

               precision    recall  f1-score   support
         tech       0.82      0.94      0.87        80
     business       0.94      0.83      0.89       102
        sport       0.89      0.89      0.89       102
entertainment       0.94      0.87      0.91        77
     politics       0.78      0.83      0.80        84
     accuracy                           0.87       445
    macro avg       0.87      0.87      0.87       445
 weighted avg       0.88      0.87      0.87       445

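In this step, we do the same evaluation using spaCy's evaluate command. This command takes the path to the model and the path to the test dataset, and outputs the scores in a slightly different format. We see that the scores from the two steps are consistent: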

evaluate('../models/spacy_textcat_bbc/model-last', '../data/bbc_test.spacy')

The result should look like this:

{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'cats_score': 0.8719339318444819,
 'cats_score_desc': 'macro F',
 'cats_micro_p': 0.8719101123595505,
 'cats_micro_r': 0.8719101123595505,
 'cats_micro_f': 0.8719101123595505,
 'cats_macro_p': 0.8746516896205309,
 'cats_macro_r': 0.8732906799083269,
 'cats_macro_f': 0.8719339318444819,
 'cats_macro_auc': 0.9800144873453936,
 'cats_f_per_type': {'tech': {'p': 0.8152173913043478,
   'r': 0.9375,
   'f': 0.872093023255814},
  'business': {'p': 0.9444444444444444,
   'r': 0.8333333333333334,
   'f': 0.8854166666666667},
  'sport': {'p': 0.8921568627450981,
   'r': 0.8921568627450981,
   'f': 0.8921568627450981},
  'entertainment': {'p': 0.9436619718309859,
   'r': 0.8701298701298701,
   'f': 0.9054054054054054},
  'politics': {'p': 0.7777777777777778,
   'r': 0.8333333333333334,
   'f': 0.8045977011494253}},
 'cats_auc_per_type': {'tech': 0.9842808219178081,
  'business': 0.9824501229063054,
  'sport': 0.9933544846510032,
  'entertainment': 0.9834839073969509,
  'politics': 0.9565030998549005},
 'speed': 6894.989948433934}
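The macro and micro scores in this output can be reproduced from the per-class values. In the sketch below, the macro F score is the unweighted mean of the per-class F scores, and the micro score pools all predictions, which for single-label classification equals the accuracy. The per-class values are copied from the evaluate output above, and the support counts are taken from the classification report printed in the previous step:

```python
# Per-class F scores and recalls copied from the evaluate output above
f_per_type = {
    "tech": 0.872093023255814,
    "business": 0.8854166666666667,
    "sport": 0.8921568627450981,
    "entertainment": 0.9054054054054054,
    "politics": 0.8045977011494253,
}
recall_per_type = {
    "tech": 0.9375,
    "business": 0.8333333333333334,
    "sport": 0.8921568627450981,
    "entertainment": 0.8701298701298701,
    "politics": 0.8333333333333334,
}
# Number of test examples per class, from the classification report
support = {"tech": 80, "business": 102, "sport": 102,
           "entertainment": 77, "politics": 84}

# Macro F: every class counts equally, regardless of its size
macro_f = sum(f_per_type.values()) / len(f_per_type)

# Micro score: correct predictions pooled over all classes
correct = sum(round(recall_per_type[name] * support[name])
              for name in support)
micro = correct / sum(support.values())

print(macro_f)
print(correct, micro)
```

The macro value agrees with cats_macro_f, and the 388 correct predictions out of 445 agree with cats_micro_f and with the 0.87 accuracy in the classification report.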

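Classifying texts using OpenAI models

In this recipe, we will ask an OpenAI model to classify the input text. We will use the same BBC dataset as in the previous recipes.

Getting ready

To run this recipe, you need the openai package, which is provided as part of the poetry environment and the requirements.txt file. You also need an OpenAI API key. Paste it into the field provided in the file utilities notebook (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/util/file_utils.ipynb).

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.6_openai_classification.ipynb.

Note

OpenAI frequently changes and retires existing models and introduces new ones. The gpt-3.5-turbo model that we use in this recipe may be obsolete by the time you read this. In that case, check the OpenAI documentation and select another suitable model.

How to do it…

In this recipe, we will query the OpenAI API and provide a classification request as the prompt. We will then post-process the results and evaluate how well the OpenAI model performs on this task:

Run the simple classifier and the file utilities notebooks: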

%run -i "../util/file_utils.ipynb"
%run -i "../util/util_simple_classifier.ipynb"

Import the necessary functions and packages, and create the OpenAI client using the API key:

import re
from sklearn.metrics import classification_report
from openai import OpenAI
client = OpenAI(api_key=OPEN_AI_KEY)

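Load the training and test datasets from Hugging Face. We do not need to rebalance the number of examples per class, because we will not be training a new model: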
```
train_dataset = load_dataset("SetFit/bbc-news", split="train")
test_dataset = load_dataset("SetFit/bbc-news", split="test")
```

Load and print the first example in the dataset and its category:

example = test_dataset[0]["text"]
category = test_dataset[0]["label_text"]
print(example)
print(category)

The result should look like this:

carry on star patsy rowlands dies actress patsy rowlands  known to millions for her roles in the carry on films  has died at the age of 71.  rowlands starred in nine of the popular carry on films  alongside fellow regulars sid james  kenneth williams and barbara windsor. she also carved out a successful television career  appearing for many years in itv s well-loved comedy bless this house....
entertainment

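Run the OpenAI model on this example. In this step, we query the OpenAI API and ask it to classify the example. We create the prompt and append the example text to it. In the prompt, we tell the model to classify the input text into one of five classes and we specify the output format. If we did not include these output instructions, the model might add other words and return text such as "The topic is entertainment." We select the gpt-3.5-turbo model and specify the prompt, the temperature, and several other parameters. We set the temperature to 0 so that the model's response varies as little as possible. We then print the response returned by the API. The output may vary, but in most cases it should return "entertainment", which is correct: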

prompt="""You are classifying texts by topics. There are 5 topics: tech, entertainment, business, politics and sport.
Output the topic and nothing else. For example, if the topic is business, your output should be "business".
Give the following text, what is its topic from the above list without any additional explanations: """ + example
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=256,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    messages=[
        {"role": "system", "content": 
            "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)

The result may vary, but it should look like this:

entertainment

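Create a function that classifies an input text and returns the category. It takes the input text and calls the OpenAI API with the same prompt that we used before. It then converts the response to lowercase, strips the extra whitespace, and returns it: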

def get_gpt_classification(input_text):
    prompt="""You are classifying texts by topics. There are 5 topics: tech, entertainment, business, politics and sport.
Output the topic and nothing else. For example, if the topic is business, your output should be "business".
Give the following text, what is its topic from the above list without any additional explanations: """ + input_text
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": 
                "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    classification = response.choices[0].message.content
    classification = classification.lower().strip()
    return classification

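In this step, we load the test data. We take the test dataset from Hugging Face and convert it to a dataframe. We then shuffle the dataframe and select the first 200 examples. We do this to reduce the cost of testing this classifier through the OpenAI API. You can change the amount of data that you test this method on. The sample output later in this recipe keeps the original row order, so your rows will differ after shuffling:
```
test_df = test_dataset.to_pandas()
# sample returns a new dataframe, so assign the result to keep the shuffle
test_df = test_df.sample(frac=1)
test_data = test_df[0:200].copy()
```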
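In this step, we use the get_gpt_classification function to create a new column in the test dataframe. Depending on the number of test examples you have, this may take a few minutes to run: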
```
test_data["gpt_prediction"] = test_data["text"].apply(
    lambda x: get_gpt_classification(x))
```
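Although we instruct OpenAI to provide only the category as the answer, it might add some other words, so we define a function, get_one_word_match, that cleans OpenAI's output. In this function, we use a regular expression to match one of the class labels and return just that word from the original string. We then apply this function to the gpt_prediction column of the test dataframe: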
```
def get_one_word_match(input_text):
    loc = re.search(
        r'tech|entertainment|business|sport|politics',
        input_text).span()
    return input_text[loc[0]:loc[1]]
test_data["gpt_prediction"] = test_data["gpt_prediction"].apply(
    lambda x: get_one_word_match(x))
```
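The function above raises an AttributeError if the response contains none of the five labels, because re.search then returns None. A slightly more defensive variant is sketched below; the name get_label_or_none is hypothetical, and returning None for unmatched responses is our own design choice rather than part of the recipe:

```python
import re

# Precompiled pattern with the five class labels used in this recipe
LABEL_PATTERN = re.compile(r"tech|entertainment|business|sport|politics")

def get_label_or_none(input_text):
    """Return the first category label found in the text, or None."""
    match = LABEL_PATTERN.search(input_text.lower())
    return match.group(0) if match else None

print(get_label_or_none("The topic is Entertainment."))
print(get_label_or_none("I am not sure."))
```

With this variant, rows whose prediction is None can be inspected or dropped before the labels are converted to numbers, instead of stopping the whole run on the first unexpected response.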

Now we convert the labels into a numeric format:

label_list = ["tech", "business", "sport", 
    "entertainment", "politics"]
test_data["gpt_label"] = test_data["gpt_prediction"].apply(
    lambda x: label_list.index(x))

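We print the resulting dataframe. We can see that it contains all the information we need for the evaluation: both the correct labels (the label column) and the predicted labels (the gpt_label column):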

print(test_data)

The result should look like this:

                                                  text  label     label_text  \
0    carry on star patsy rowlands dies actress pats...      3  entertainment
1    sydney to host north v south game sydney will ...      2          sport
..                                                 ...    ...            ...
198  xbox power cable  fire fear  microsoft has sai...      0           tech
199  prop jones ready for hard graft adam jones say...      2          sport

    gpt_prediction  gpt_label
0    entertainment          3
1            sport          2
..             ...        ...
198           tech          0
199          sport          2

Now we can print the classification report that evaluates the OpenAI classification:

print(classification_report(test_data["label"],
        test_data["gpt_label"], target_names=label_list))

The results may vary; this is a sample output. We see that the overall accuracy is good, at 90%:

               precision    recall  f1-score   support
         tech       0.97      0.80      0.88        41
     business       0.87      0.89      0.88        44
        sport       1.00      0.96      0.98        48
entertainment       0.88      0.90      0.89        40
     politics       0.76      0.96      0.85        27
     accuracy                           0.90       200
    macro avg       0.90      0.90      0.90       200
 weighted avg       0.91      0.90      0.90       200

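Chapter 5: Getting Started with Information Extraction

In this chapter, we will cover the basics of information extraction. Information extraction is the task of pulling very specific information out of text. For example, you might want to know which companies are mentioned in a news article. Instead of spending time reading the whole article, you can use information extraction techniques to access these companies almost immediately.

We will start by extracting email addresses and URLs from job announcements. Then, we will use an algorithm called the Levenshtein distance to find similar strings. Next, we will extract important keywords from text. After that, we will use spaCy to find named entities in text, and later, we will train our own named entity recognition model in spaCy. Then, we will do basic sentiment analysis, and finally, we will train two custom sentiment analysis models.

You will learn how to use existing tools and train your own models for information extraction tasks.

We will cover the following recipes in this chapter:

Using regular expressions
Finding similar strings – Levenshtein distance
Extracting keywords
Performing named entity recognition using spaCy
Training your own NER model with spaCy
Fine-tuning BERT for NER

Technical requirements

The code for this chapter is located in the folder named Chapter05 in the book's GitHub repository (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter05).

As in the previous chapters, the packages required for this chapter are part of the Poetry environment. Alternatively, you can install all the packages using the requirements.txt file.

Using regular expressions

In this recipe, we will use regular expressions to find email addresses and URLs in text. Regular expressions are special character sequences that define search patterns, and they can be created and used through Python's re package. We will use a job descriptions dataset and write two regular expressions, one for emails and one for URLs.

Getting ready

Download the job descriptions dataset here: https://www.kaggle.com/andrewmvd/data-scientist-jobs. It is also available in the book's GitHub repository: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/DataScientist.csv. Save it in the data folder (../data relative to the notebook).

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.1_regex.ipynb.

How to do it...

We will read the data from the CSV file into a pandas DataFrame and use Python's re package to create regular expressions and search the text. The steps are as follows:

Import the re and pandas packages: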
```
import re
import pandas as pd
```

Read in the data and check its contents:

data_file = "../data/DataScientist.csv"
df = pd.read_csv(data_file, encoding='utf-8')
print(df)

The output will be long and should start like this:

Figure 5.1 – DataFrame output

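The get_list_of_items helper function takes a DataFrame and a column name as input and converts that column into a list. First, it gets the column values, which form a list of lists, and flattens them. It then removes duplicates by converting the list to a set and converts the set back to a list: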
```
def get_list_of_items(df, column_name):
    values = df[column_name].values
    values = [item for sublist in values for item in sublist]
    list_of_items = list(set(values))
    return list_of_items
```
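In this step, we define the get_emails function to get all the emails that appear in the Job Description column. The regular expression consists of three parts, each a character class in square brackets followed by a quantifier:
- [^\s:|()']+ is the username part of the regular expression, and it is followed by an @ sign. It consists of one character class, shown in square brackets. Any character from this class may appear in the username one or more times, which is expressed by the + quantifier. The leading ^ negates the class, so the username can contain any character except whitespace (\s), a colon (:), a vertical bar (|), parentheses, and a single quote ('). In the code, the quote is preceded by a backslash so that it does not end the single-quoted Python string that holds the pattern.
- [a-zA-Z0-9\.]+ is the first part of the domain name, and it is followed by a dot. This part consists of alphanumeric characters, lowercase or uppercase, and dots, appearing one or more times. Since the dot is a special character outside a character class, the dot that follows this part is escaped with a backslash. The a-z expression denotes the range of characters from a to z.
- [a-zA-Z]+ is the last part of the domain name, the top-level domain, such as .com or .org. Digits usually do not appear in top-level domains, so the regular expression matches lowercase or uppercase letters appearing one or more times.
This regular expression is sufficient to parse all the emails in the dataset without producing any false positives. You may find that your own data requires some additional adjustments to the regular expression: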
```
def get_emails(df):
    email_regex = r'[^\s:|()\']+@[a-zA-Z0-9\.]+\.[a-zA-Z]+'
    df['emails'] = df['Job Description'].apply(
        lambda x: re.findall(email_regex, x))
    emails = get_list_of_items(df, 'emails')
    return emails
```

We will now use the previous function to get the emails from the DataFrame:

emails = get_emails(df)
print(emails)
['hrhelpdesk@phila.gov', 'talent@quartethealth.com', …, 'careers@edo.com', 'Talent.manager@techquarry.com', 'resumes@nextgentechinc.com', …, 'talent@ebay.com', …, 'info@springml.com',…]
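
To see how the three parts of the email pattern behave, here is a small self-contained sketch that applies the same regular expression to a made-up sentence. The sample text and the addresses in it are illustrative assumptions, not part of the dataset:

```python
import re

# Same pattern as in get_emails, written as a raw string
EMAIL_REGEX = r"[^\s:|()']+@[a-zA-Z0-9\.]+\.[a-zA-Z]+"

# Hypothetical sample text (not taken from the job descriptions dataset)
sample_text = ("Send your resume to talent@example.com "
               "or (jobs@careers.example.org).")

# findall returns every non-overlapping match as a string
found_emails = re.findall(EMAIL_REGEX, sample_text)
print(found_emails)
```

The opening parenthesis is excluded by the username character class, and the closing parenthesis and final period are excluded because the top-level domain part accepts letters only.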

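In this step, we define the get_urls function, which uses the finditer function from the re package. It finds all the matches in a text and returns them as Match objects. We can find the start and end positions of a match by using the span() method of the Match object. It returns a tuple in which the first element is the start of the match and the second element is the end of the match: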

def get_urls(df):
    url_regex = r'(http[s]?://(www\.)?[A-Za-z0-9–_\.\-]+\.[A-Za-z]+/?[A-Za-z0-9$\–_\-\/\.]*)[\.)\"]*'
    df['urls'] = df['Job Description'].apply(
        lambda x: [
            x[item.span()[0]:item.span()[1]] 
            for item in re.finditer(url_regex, x)
        ]
    )
    urls = get_list_of_items(df, 'urls')
    return urls

We will get the URLs in a similar way:

urls = get_urls(df)
print(urls)

Part of the result might look like this:

['https://youtu.be/c5TgbpE9UBI', 'https://www.linkedin.com/in/emma-riley-72028917a/', 'https://www.dol.gov/ofccp/regs/compliance/posters/ofccpost.htm', 'https://www.naspovaluepoint.org/portfolio/mmis-provider-services-module-2018-2028/hhs-technology-group/).', 'https://www.instagram.com/gatestonebpo', 'http://jobs.sdsu.edu', 'http://www.colgatepalmolive.com.', 'http://www1.eeoc.gov/employers/upload/eeoc_self_print_poster.pdf', 'https://www.gofundme.com/2019https', 'https://www.decode-m.com/', 'https://bit.ly/2lCOcYS',…]
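
Some of the URLs above end with stray punctuation such as a period or a closing parenthesis, because the code slices out the whole match, including any trailing punctuation. One possible cleanup, sketched here under the assumption that a URL in this data should never end in a period, a closing parenthesis, or a double quote, is to strip those characters afterward. The clean_url helper is a hypothetical name and is not part of the book's code:

```python
def clean_url(url):
    """Strip trailing periods, closing parentheses, and double quotes."""
    return url.rstrip('.)"')


# Three URLs copied from the sample output above
raw_urls = [
    'http://www.colgatepalmolive.com.',
    'https://www.naspovaluepoint.org/portfolio/'
    'mmis-provider-services-module-2018-2028/hhs-technology-group/).',
    'https://bit.ly/2lCOcYS',
]

cleaned_urls = [clean_url(url) for url in raw_urls]
for url in cleaned_urls:
    print(url)
```

URLs that legitimately end in a parenthesis would be damaged by this step, so it is worth checking a sample of the results before applying it to a new dataset.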

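There's more…

Writing regular expressions can quickly become a messy affair. I use regular expression testing websites to enter the text I expect to match together with the regular expression. One example of such a site is https://regex101.com/.

Finding similar strings – Levenshtein distance

When doing information extraction, in many cases we deal with misspellings, which can complicate the task. Several methods are available to address this, including the Levenshtein distance. This algorithm finds the number of substitutions, insertions, and deletions needed to change one string into another. For example, to change the word put into pat, you need to substitute a for u, which is one change. To change the word kitten into smitten, you need two edits: change k to m and add an s at the beginning.

In this recipe, you will use this technique to find a match for a misspelled email.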
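
To make the definition concrete, the following is a minimal pure-Python sketch of the standard dynamic programming approach to the Levenshtein distance. It is meant only as an illustration; the recipe itself uses the python-Levenshtein package, and the function name here is an assumption:

```python
def levenshtein_distance(source, target):
    """Return the minimum number of single-character substitutions,
    insertions, and deletions needed to turn source into target."""
    # previous_row[j] holds the distance between the prefix of source
    # processed so far and the first j characters of target
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution
            ))
        previous_row = current_row
    return previous_row[-1]


print(levenshtein_distance("put", "pat"))         # 1
print(levenshtein_distance("kitten", "smitten"))  # 2
```

The two calls reproduce the examples from the text: one substitution for put and pat, and one substitution plus one insertion for kitten and smitten.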

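Getting ready

We will use the same packages and the data scientist job description dataset as in the previous recipe, together with the python-Levenshtein package, which is part of the Poetry environment and is included in the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.2_similar_strings.ipynb.

How to do it…

We will read the dataset into a pandas DataFrame and use the emails extracted from it to search for a misspelled email. The steps are as follows:

Run the language utilities file. This file contains the get_emails function that we created in the previous recipe: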
```
%run -i "../util/lang_utils.ipynb"
```
Do the necessary imports:
```
import pandas as pd
import Levenshtein
```

Read the data into a pandas DataFrame object:

data_file = "../data/DataScientist.csv"
df = pd.read_csv(data_file, encoding='utf-8')

Use the get_emails function, which is explained in more detail in the previous recipe, to extract all the emails from the DataFrame with a regular expression:
```
emails = get_emails(df)
```
The find_levenshtein function takes an input string and a DataFrame that contains emails. It creates a new column in the DataFrame whose values are the Levenshtein distances between the input and each email address in the emails column. The column is named distance_to_[input_string]:
```
def find_levenshtein(input_string, df):
    df['distance_to_' + input_string] = \
        df['emails'].apply(lambda x: Levenshtein.distance(
            input_string, x))
    return df
```
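In this step, we define the get_closest_email_lev function, which takes a DataFrame with emails and an email to match, and returns the email in the DataFrame that is closest to the input. We use the find_levenshtein function to create a new column with the distances to the input email and then use the pandas idxmin() function to find the index of the minimum value. We use that index to look up the closest email: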
```
def get_closest_email_lev(df, email):
    df = find_levenshtein(email, df)
    column_name = 'distance_to_' + email
    minimum_value_email_index = df[column_name].idxmin()
    email = df.loc[minimum_value_email_index]['emails']
    return email
```
Next, we load the emails into a new DataFrame and use the misspelled email address rohitt.macdonald@prelim.com to find a match in it:
```
new_df = pd.DataFrame(emails,columns=['emails'])
input_string = "rohitt.macdonald@prelim.com"
email = get_closest_email_lev(new_df, input_string)
print(email)
```
The function returns the correctly spelled email address, rohit.mcdonald@prolim.com:
```
rohit.mcdonald@prolim.com
```

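There's more...

The Levenshtein package includes other string similarity measures, which you can explore at https://rapidfuzz.github.io/Levenshtein/. In this section, we look at the Jaro similarity.

The Jaro similarity function outputs the similarity between two strings as a number between 0 and 1, where 1 means that the two strings are identical. The process is similar, but we need the index with the maximum value rather than the minimum, since the Jaro similarity function returns a higher value for more similar strings. Let's go through the steps:

The find_jaro function takes an input string and a DataFrame and computes the Jaro similarity between the input and each string in the emails column: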

def find_jaro(input_string, df):
    df['distance_to_' + input_string] = df['emails'].apply(
        lambda x: Levenshtein.jaro(input_string, x)
    )
    return df

The get_closest_email_jaro function uses the function we defined in the previous step to find the email address that is closest to the input:

def get_closest_email_jaro(df, email):
    df = find_jaro(email, df)
    column_name = 'distance_to_' + email
    maximum_value_email_index = df[column_name].idxmax()
    email = df.loc[maximum_value_email_index]['emails']
    return email

Next, we use the misspelled email address rohitt.macdonald@prelim.com to find a match in the new email DataFrame:
```
email = get_closest_email_jaro(new_df, input_string)
print(email)
```
The output is as follows:
```
rohit.mcdonald@prolim.com
```
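An extension of the Jaro similarity function is the Jaro-Winkler function, which gives extra weight to a matching prefix at the start of the strings, so a misspelling at the end lowers the score less than one at the beginning. For example, let's compare two addresses that differ only in their ending: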
```
print(Levenshtein.jaro_winkler("rohit.mcdonald@prolim.com",
    "rohit.mcdonald@prolim.org"))
```
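This will output the following (the exact value depends on the library version; a standard Jaro-Winkler score reaches 1.0 only for identical strings, so you may see a value slightly below 1):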
```
1.0
```
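
To show where the prefix bonus comes from, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the implementation used by the Levenshtein package, and the function names, the prefix length limit of 4, and the prefix weight of 0.1 are the commonly quoted defaults, stated here as assumptions:

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only within this distance
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(len(s2), i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix_length = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix_length += 1
    return base + prefix_length * prefix_weight * (1 - base)


score = jaro_winkler("rohit.mcdonald@prolim.com",
                     "rohit.mcdonald@prolim.org")
print(round(score, 3))
```

Because the two addresses share a long prefix, the bonus moves the score closer to 1 than the plain Jaro value, but under these definitions it stays below 1 for strings that are not identical.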

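Extracting keywords

In this recipe, we will extract keywords from text. We will use the BBC news dataset, which contains news articles. You can learn more about the dataset in Chapter 4, in the recipe titled Clustering sentences using K-Means: unsupervised text classification.

Extracting keywords from text gives a quick idea of what an article is about and can also serve as the basis for a tagging system, for example, on a website.

To extract keywords correctly, we need to train a TF-IDF vectorizer that we will use in the extraction phase.

Getting ready

In this recipe, we will use the sklearn package. It is part of the Poetry environment. You can also install it together with the other packages by installing the requirements.txt file.

The BBC news dataset is available on Hugging Face at https://huggingface.co/datasets/SetFit/bbc-news.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.3_keyword_extraction.ipynb.

How to do it...

To extract keywords from a given text, we first need a corpus of text on which to fit the vectorizer. Once that is done, we can use it to extract keywords from texts that are similar to the processed corpus. Here are the steps:

Run the language utilities notebook: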
```
%run -i "../util/lang_utils.ipynb"
```

Import the necessary packages and functions:

from datasets import load_dataset
from nltk import word_tokenize
from math import ceil
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

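Load the training and test datasets, convert them to pandas DataFrame objects, and print them out to see what they look like. Each DataFrame has three columns: one for the news article text, one for the label in numeric format, and one for the label text: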

train_dataset = load_dataset("SetFit/bbc-news", split="train")
test_dataset = load_dataset("SetFit/bbc-news", split="test")
train_df = train_dataset.to_pandas()
test_df = test_dataset.to_pandas()
print(train_df)
print(test_df)

The result should be similar to the following:

                                                   text  label     label_text
0     wales want rugby league training wales could f...      2          sport
1     china aviation seeks rescue deal scandal-hit j...      1       business
...                                                 ...    ...            ...
1223  why few targets are better than many the econo...      1       business
1224  boothroyd calls for lords speaker betty boothr...      4       politics
[1225 rows x 3 columns]
                                                  text  label     label_text
0    carry on star patsy rowlands dies actress pats...      3  entertainment
1    sydney to host north v south game sydney will ...      2          sport
..                                                 ...    ...            ...
998  stormy year for property insurers a string of ...      1       business
999  what the election should really be about  a ge...      4       politics
[1000 rows x 3 columns]

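Create the vectorizer and fit it on the training data text. To learn more about vectorizers, see Chapter 3; the TF-IDF vectorizer is discussed in the recipe on representing texts with TF-IDF. We use English stopwords, a minimum document frequency of 2, and a maximum document frequency of 95% (to learn more about stopwords, see the Removing stopwords recipe in Chapter 1):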
```
vectorizer = TfidfVectorizer(stop_words='english', 
    min_df=2, max_df=0.95)
vectorizer.fit(train_df["text"])
```
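Now, we will define a few helper functions. The first one sorts a coordinate matrix by TF-IDF score. It takes the coordinate matrix obtained by converting the vector that the vectorizer created. The col attribute of this matrix provides the word indices, and the data attribute provides the TF-IDF score of each word. The function creates a list of tuples from this data, where the first value of each tuple is the index and the second value is the TF-IDF score. It then sorts the list of tuples by TF-IDF score in descending order and returns the sorted result. This gives us the words with the highest TF-IDF scores, that is, the words most characteristic of that particular news piece: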
```
def sort_data_tfidf_score(coord_matrix):
    tuples = zip(coord_matrix.col, coord_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), 
        reverse=True)
```
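The next function, get_keyword_strings, returns the extracted keywords for a given vector. It takes the fitted vectorizer, the number of keywords to extract, and the sorted vector of the input text. The function first assigns the vectorizer's feature names to the index_dict variable, an array in which the position is the word index and the value is the corresponding word. It then iterates through the sorted vector and appends the words found in index_dict to the words list, stopping when it reaches the required number of words. Since the function iterates through the sorted vector, it returns the words with the highest TF-IDF scores. These are words that are used frequently in this document but rarely in other documents, which gives us an idea of the topic of the article: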
```
def get_keyword_strings(vectorizer, num_words, sorted_vector):
    words = []
    index_dict = vectorizer.get_feature_names_out()
    for (item_index, score) in sorted_vector[0:num_words]:
        word = index_dict[item_index]
        words.append(word)
    return words
```
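
The idea that a keyword is a word that is frequent in one document but rare in the others can be sketched without any library. The following toy example uses a simplified TF-IDF formula (raw term count times log(N / document frequency)), which is an assumption made for clarity and differs from the smoothed, normalized variant that TfidfVectorizer computes. The function name and the three sample documents are hypothetical:

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, num_words=2):
    """Rank the words of one document by a simplified TF-IDF score."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word
    doc_freq = Counter(
        word for tokens in tokenized for word in set(tokens))
    term_counts = Counter(tokenized[doc_index])
    scores = {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_counts.items()
    }
    # Highest score first; ties are broken alphabetically
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, score in ranked[:num_words]]


toy_corpus = ["the cat sat", "the dog sat", "the cat purred loudly"]
top_words = tfidf_keywords(toy_corpus, doc_index=2)
print(top_words)
```

The word the appears in every document, so its score is zero, while purred and loudly appear only in the third document and therefore rank highest.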
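The get_keywords_simple function returns a list of keywords for a given text. It takes the fitted vectorizer, the input text, and the required number of words. It uses the vectorizer to create a vector for the input text, sorts the vector with the sort_data_tfidf_score function, and finally gets the top words with the get_keyword_strings function: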
```
def get_keywords_simple(vectorizer, input_text,
    num_output_words=10):
    vector = vectorizer.transform([input_text])
    sorted_vector = sort_data_tfidf_score(vector.tocoo())
    words = get_keyword_strings(vectorizer, num_output_words, 
        sorted_vector)
    return words
```

We use the get_keywords_simple function to extract a list of keywords from the first article in the test data. Some of the keywords are a good fit for a summary, while others are less suitable:

print(test_df.iloc[0]["text"])
keywords = get_keywords_simple(vectorizer,
    test_df.iloc[0]["text"])
print(keywords)

The result will look like this:

carry on star patsy rowlands dies actress patsy rowlands  known to millions for her roles in the carry on films  has died at the age of 71.  rowlands starred in nine of the popular carry on films  alongside fellow regulars sid james  kenneth williams and barbara windsor...
['carry', 'theatre', 'scholarship', 'appeared', 'films', 'mrs', 'agent', 'drama', 'died', 'school']

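There's more...

Now, we will use a more sophisticated method to extract keywords from the news texts. We will use a vectorizer that scores not only single words but also bigrams and trigrams. We will also use spaCy noun chunks to make sure that the bigrams and trigrams in the output are meaningful. To learn more about noun chunks, see the Extracting noun chunks recipe in Chapter 2. The advantage of this method is that the output contains not only single words but also phrases, for example, Saturday morning instead of Saturday and morning separately.

Create the new vectorizer and fit it on the training texts. We remove the word the from the stopword list, since spaCy noun chunks might contain it. The DataFrame has no summary column, so we fit on the text column:

stop_words = list(stopwords.words('english'))
stop_words.remove("the")
trigram_vectorizer = TfidfVectorizer(
    stop_words=stop_words, min_df=2,
    ngram_range=(1,3), max_df=0.95)
trigram_vectorizer.fit(train_df["text"])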

Now, define the get_keyword_strings_all function. It gets all the keywords from the sorted vector, with no limit on how many words it returns:

def get_keyword_strings_all(vectorizer, sorted_vector):
    words = []
    index_dict = vectorizer.get_feature_names_out()
    for (item_index, score) in sorted_vector:
        word = index_dict[item_index]
        words.append(word)
    return words

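Next, we define the get_keywords_complex function, which outputs the main keywords and phrases up to three words long. It keeps only those top-scoring n-grams that are also spaCy noun chunks, that are not plain numbers, and that have not been added already:

def get_keywords_complex(
    vectorizer, input_text, spacy_model, num_words=70
):
    keywords = []
    doc = spacy_model(input_text)
    vector = vectorizer.transform([input_text])
    sorted_vector = sort_data_tfidf_score(vector.tocoo())
    ngrams = get_keyword_strings_all(vectorizer, sorted_vector)
    ents = [ent.text.lower() for ent in doc.noun_chunks]
    # Look only at the top-scoring n-grams; slicing avoids an
    # IndexError when the text has fewer than num_words n-grams
    for keyword in ngrams[:num_words]:
        if (keyword.lower() in ents and not keyword.isdigit()
                and keyword not in keywords):
            keywords.append(keyword)
    return keywords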

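Now, we will use the previous function on the first test article:

keywords = get_keywords_complex(trigram_vectorizer,
    test_df.iloc[0]["text"], small_model)
print(keywords)

The result is a list of keywords and phrases. The sample output below appears to come from a different input text, so your keywords will differ:

['the gop', 'the 50 states', 'npr', '11 states', 'state', 'republican governors', 'the dems', 'reelection', 'the helm', 'grabs']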

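Performing named entity recognition using spaCy

Named entity recognition (NER) is the task of parsing the names of places, people, organizations, and so on out of text. This can be useful in many downstream tasks. For example, you could imagine a scenario where you would like to sort a set of articles by the people mentioned in them, for instance, when researching a particular person.

In this recipe, we will use NER to parse named entities out of the text of an article. We will load the package and the parsing engine and loop through the NER results.

Getting ready

In this recipe, we will use spaCy. To run it properly, you need to download a language model. We will download the small and the large models. These models take up a significant amount of disk space: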

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg

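The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.4_named_entity_extraction.ipynb.

How to do it...

NER happens automatically when spaCy processes an input text. The entities are accessed through the doc.ents variable. We will input an article about Apple's iPhone and see which entities are parsed out of it. Let's look at the steps:

Run the language utilities file. This imports the necessary packages and functions and initializes the spaCy engine: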
```
%run -i "../util/lang_utils.ipynb"
```

Initialize the article text. This is an article from https://www.globalsmt.net/social-media-news/iphone-12-apple-makes-jump-to-5g/:

article = """iPhone 12: Apple makes jump to 5G
Apple has confirmed its iPhone 12 handsets will be its first to work on faster 5G networks.
The company has also extended the range to include a new "Mini" model that has a smaller 5.4in screen.
The US firm bucked a wider industry downturn by increasing its handset sales over the past year.
But some experts say the new features give Apple its best opportunity for growth since 2014, when it revamped its line-up with the iPhone 6.
…
"Networks are going to have to offer eye-wateringly attractive deals, and the way they're going to do that is on great tariffs and attractive trade-in deals,"
predicted Ben Wood from the consultancy CCS Insight. Apple typically unveils its new iPhones in September, but opted for a later date this year.
It has not said why, but it was widely speculated to be related to disruption caused by the coronavirus pandemic. The firm's shares ended the day 2.7% lower.
This has been linked to reports that several Chinese internet platforms opted not to carry the livestream,
although it was still widely viewed and commented on via the social media network Sina Weibo."""

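Here, we create the spaCy Doc object and use it to extract the entities. The Doc object is created by applying the small spaCy model to the text. The model extracts different attributes, including named entities. We print out the number of parsed entities and the entities themselves, including the start and end character offsets and the entity type (the meanings of the named entity labels can be found in the spaCy documentation at https://spacy.io/models/en):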
```
doc = small_model(article)
print(len(doc.ents))
small_model_ents = doc.ents
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```
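When we print out the result, we can see different types of entities, including cardinal numbers, percentages, names of people, dates, organizations, and a NORP entity, which stands for Nationalities or Religious or Political groups: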
```
44
12 7 9 CARDINAL
Apple 11 16 ORG
5 31 32 CARDINAL
…
a later date this year 2423 2445 DATE
2.7% 2594 2598 PERCENT
Chinese 2652 2659 NORP
Sina Weibo 2797 2807 PERSON
```
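
Once the entities are extracted, a common next step is to group them by type, for example to list every organization or person mentioned in an article. The sketch below works on plain (text, label) pairs such as those produced by [(ent.text, ent.label_) for ent in doc.ents]; the pairs are copied from the sample output above, and the group_by_label helper is a hypothetical name:

```python
from collections import defaultdict


def group_by_label(entity_pairs):
    """Group entity texts by label, keeping first-seen order
    and dropping repeated mentions."""
    grouped = defaultdict(list)
    for text, label in entity_pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Pairs taken from the sample output of the small model
sample_pairs = [
    ("12", "CARDINAL"), ("Apple", "ORG"), ("5", "CARDINAL"),
    ("a later date this year", "DATE"), ("2.7%", "PERCENT"),
    ("Chinese", "NORP"), ("Sina Weibo", "PERSON"),
]

entities_by_label = group_by_label(sample_pairs)
for label, texts in entities_by_label.items():
    print(label, texts)
```

Grouping also makes labeling mistakes easy to spot. Here, Sina Weibo ends up under PERSON, although the article describes it as a social media network.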

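There's more…

We can compare the performance of the small and large models with the following steps:

Run the same steps starting from step 3 of the How to do it… section, but use the large model: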

doc = large_model(article)
print(len(doc.ents))
large_model_ents = doc.ents
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The result will look like this:

46
12 7 9 CARDINAL
Apple 11 16 ORG
5 31 32 CARDINAL
…
the day 2586 2593 DATE
2.7% 2594 2598 PERCENT
Chinese 2652 2659 NORP
Sina Weibo 2797 2807 PERSON

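The large model parses out more entities, and we can look at the differences. We print out two lists: one contains the entities that the small model identified but the large model did not, and the other contains the entities that the large model identified but the small model did not:

small_model_ents = [str(ent) for ent in small_model_ents]
large_model_ents = [str(ent) for ent in large_model_ents]
in_small_not_in_large = (set(small_model_ents)
    - set(large_model_ents))
in_large_not_in_small = (set(large_model_ents)
    - set(small_model_ents))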
print(in_small_not_in_large)
print(in_large_not_in_small)

The result will look like this:

{'iPhone 11', 'iPhone', 'iPhones'}
{'6', 'the day', 'IDC', '11', 'Pro', 'G\nApple', 'SE'}

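You can see that there are some differences between the results provided by the two models.

Training your own NER model with spaCy

In the previous recipe, we used a pretrained spaCy model to extract named entities. That NER model is sufficient in many cases. There might be other times, however, when we would like to create a new one from scratch. In this recipe, we will train a new NER model to parse out the names of musicians and their works of art.

Getting ready

We will use the spaCy package to train a new NER model. You do not need any packages other than spacy. The data we will use comes from https://github.com/deezer/music-ner-eacl2023. The data file is available in the data folder of the book's GitHub repository (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/music_ner.csv), and you need to download it from there into your data directory.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.5_training_own_spacy_model.ipynb.

How to do it…

We will define our training data and then use it to train a new model. We will then test the model and save it to disk. The steps are as follows:

Run the language utilities file: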
```
%run -i "../util/lang_utils.ipynb"
```

Import other functions and packages:

import pandas as pd
from spacy.cli.train import train
from spacy.cli.evaluate import evaluate
from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split

In this step, we load the data and print it out:
```
music_ner_df = pd.read_csv('../data/music_ner.csv')
print(music_ner_df)
```
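The data has five columns: id, start offset, end offset, text, and label. Sentences are repeated if they contain more than one entity, because there is one row per named entity. There are 428 entries in the data.

Figure 5.2 – DataFrame output

Here, we remove _deduced from the labels, so the labels are now Artist, WoA (work of art), and Artist_or_WoA: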

def change_label(input_label):
    input_label = input_label.replace("_deduced", "")
    return input_label
music_ner_df["label"] = music_ner_df["label"].apply(change_label)
print(music_ner_df)

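The result will look like this:

Figure 5.3 – DataFrame output

In this step, we create the DocBin objects that will store the processed data. DocBin objects are required as the input data for spaCy models (to learn more, see the Training a spaCy textcat model recipe in Chapter 4):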
```
train_db = DocBin()
test_db = DocBin()
```
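Here, we create a list of unique IDs and split it into training and test data. We need the unique IDs because sentences are repeated in the dataset. There are 227 unique IDs (or sentences), with 170 sentences in the training data and 57 sentences in the test data: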
```
# Get a unique list of unique ids
ids = list(set(music_ner_df["id"].values))
print(len(ids))
# Split ids into training and test
train_ids, test_ids = train_test_split(ids)
print(len(train_ids))
print(len(test_ids))
```
The result will look like this:
```
227
170
57
```

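Here, we create the training and test data and save it in the DocBin objects. We loop through the IDs, and for each ID, we get the sentence. We process the sentence with the small model to get a spaCy Doc object. Then, we loop through the entities annotated for the sentence and add them to the ents attribute of the Doc object, skipping any annotation whose offsets cannot be mapped to tokens. The processed Doc object then goes into one of the DocBin objects:

for id in ids:
    entity_rows = music_ner_df.loc[music_ner_df['id'] == id]
    text = entity_rows.head(1)["text"].values[0]
    doc = small_model(text)
    ents = []
    for index, row in entity_rows.iterrows():
        label = row["label"]
        start = row["start_offset"]
        end = row["end_offset"]
        span = doc.char_span(start, end, label=label, 
            alignment_mode="contract")
        # char_span returns None when the offsets do not map
        # to a valid token span, so skip those annotations
        if span is not None:
            ents.append(span)
    doc.ents = ents
    if id in train_ids:
        train_db.add(doc)
    else:
        test_db.add(doc)
train_db.to_disk('../data/music_ner_train.spacy')
test_db.to_disk('../data/music_ner_test.spacy')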

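In this step, we train the model. We use the spacy_config_ner.cfg configuration file from the data folder. You can create your own customized configuration file at https://spacy.io/usage/training/#quickstart. The output shows the loss, accuracy, precision, recall, F1 score, and other metrics for each epoch. Finally, it saves the model to the specified directory: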
```
train("../data/spacy_config_ner.cfg", output_path="../models/spacy_music_ner")
```
The output will look like this:

Figure 5.4 – Model training output

In this step, we load the trained model and use it on data that it did not see during training. We take one ID from the test set, get all the rows with that ID, and load the sentence. We process the sentence with our model, in exactly the same way as with the pretrained spaCy models. Then we print the sentence, the annotated entities, and the entities that the model parsed:

nlp = spacy.load("../models/spacy_music_ner/model-last")
first_test_id = test_ids[0]
test_rows = music_ner_df.loc[music_ner_df['id'] 
    == first_test_id]
input_text = test_rows.head(1)["text"].values[0]
print(input_text)
doc = nlp(input_text)
print("Gold entities:")
for index, row in test_rows.iterrows():
    start = row["start_offset"]
    end = row["end_offset"]
    # Slice the annotated entity text out of the sentence
    print(input_text[start:end])
print("Predicted entities: ")
for entity in doc.ents:
    print(entity)

We see that the resulting entities are quite good (your output might differ):

songs with themes of being unable to settle | ex hoziers someone new elle kings exes and ohs
Gold entities:
hoziers
someone new
elle kings
exes and ohs
Predicted entities:
hoziers
someone new
elle kings
exes and

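Here, we evaluate the model with spaCy's evaluate function. The metrics for the WoA and Artist labels are low but in the double digits, while the F1 score for the Artist_or_WoA label is about 10%. This is because there is much less data for that label than for the other two. Overall, according to the statistics, the model does not perform very well, which is because our total amount of data is very small:

evaluate('../models/spacy_music_ner/model-last', '../data/music_ner_test.spacy')

The statistics may vary, but here are the results I got (output condensed):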

{'token_acc': 1.0,
 'token_p': 1.0,
 'token_r': 1.0,
 'token_f': 1.0,
 'tag_acc': 0.800658978583196,
…
 'ents_p': 0.4421052631578947,
 'ents_r': 0.42,
 'ents_f': 0.4307692307692308,
 'ents_per_type': {'WoA': {'p': 0.4358974358974359,
   'r': 0.425,
   'f': 0.43037974683544306},
  'Artist_or_WoA': {'p': 0.1,
   'r': 0.09090909090909091,
   'f': 0.09523809523809525},
  'Artist': {'p': 0.5217391304347826,
   'r': 0.4897959183673469,
   'f': 0.5052631578947369}},
 'speed': 3835.591242612551}

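Fine-tuning BERT for NER

In this recipe, we will fine-tune a pretrained BERT model for the NER task. The difference between training a model from scratch and fine-tuning it is as follows. Fine-tuning an NLP model such as BERT involves taking a pretrained model and adapting it to your specific task, such as NER in this case. The pretrained model already stores a lot of knowledge, and the results are likely to be better than when training a model from scratch.

We will use data similar to that of the previous recipe and create a model that can tag entities as Artist or WoA. The data comes from the same dataset, but it is labeled in the IOB format, which is required by the transformers package that we are going to use. We also use only the Artist and WoA tags and remove the Artist_or_WoA tag, since there is not enough data for it.

For this recipe, we will use the Hugging Face Trainer class, although it is also possible to train Hugging Face models using PyTorch or TensorFlow. See more at https://huggingface.co/docs/transformers/training.

Getting ready

We will use the transformers package from Hugging Face. It is preloaded in the Poetry environment. You can also install the package from the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter05/5.6_fine_tune_bert.ipynb.

How to do it...

We will load and preprocess the data, train the model, evaluate it, and finally use it on unseen data. The steps are as follows:

Run the language utilities notebook: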
```
%run -i "../util/lang_utils.ipynb"
```

Import other packages and functions:

from datasets import (
    load_dataset, Dataset, Features, Value,
    ClassLabel, Sequence, DatasetDict)
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from transformers import DataCollatorForTokenClassification
from transformers import (
    AutoModelForTokenClassification,
    TrainingArguments, Trainer)
import numpy as np
from sklearn.model_selection import train_test_split
from evaluate import load

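In this step, we load the music NER dataset using the pandas read_csv function. We then define a function that takes a label and removes the _deduced suffix from it, and we apply this function to the label column. We also replace the | character with a comma in case it interferes with our code: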
```
music_ner_df = pd.read_csv('../data/music_ner.csv')
def change_label(input_label):
    input_label = input_label.replace("_deduced", "")
    return input_label
music_ner_df["label"] = music_ner_df["label"].apply(
    change_label)
music_ner_df["text"] = music_ner_df["text"].apply(
    lambda x: x.replace("|", ","))
print(music_ner_df)
```
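The output will be similar to the following:

Figure 5.5 – Dataset DataFrame

Here, we start our data preprocessing. We get the list of unique IDs from the id column. We loop through this list and get the sentence that corresponds to each ID. We then process the text with the small spaCy model and add the entities from the DataFrame to the Doc object. Finally, we store each sentence in a dictionary, where the key is the sentence text string and the value is the Doc object: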

ids = list(set(music_ner_df["id"].values))
docs = {}
for id in ids:
    entity_rows = music_ner_df.loc[music_ner_df['id'] == id]
    text = entity_rows.head(1)["text"].values[0]
    doc = small_model(text)
    ents = []
    for index, row in entity_rows.iterrows():
        label = row["label"]
        start = row["start_offset"]
        end = row["end_offset"]
        span = doc.char_span(start, end, label=label,
            alignment_mode="contract")
        ents.append(span)
    doc.ents = ents
    docs[doc.text] = doc

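Now, we load the data in the IOB format. This format is required for fine-tuning BERT and differs from the one that spaCy uses. For this, we load a separate data file, ../data/music_ner_bio.bio. We create a tag mapping dictionary and initialize empty lists for the tokens, NER tags, and spans. We then loop through the sentence data read from the data file. For each sentence, every line is a pair consisting of a word and its tag. We add the word to the words list and the number that corresponds to the tag to the tags list. We also get the spans from the dictionary of Doc objects that we created in the previous step and add them to the spans list. If a sentence has no matching Doc object, its span list stays empty:

data_file = "../data/music_ner_bio.bio"
tag_mapping = {"O": 0, "B-Artist": 1, "I-Artist": 2, 
    "B-WoA": 3, "I-WoA": 4}
with open(data_file) as f:
    data = f.read()
tokens = []
ner_tags = []
spans = []
sentences = data.split("\n\n")
for sentence in sentences:
    words = []
    tags = []
    this_sentence_spans = []
    word_tag_pairs = sentence.split("\n")
    for pair in word_tag_pairs:
        (word, tag) = pair.split("\t")
        words.append(word)
        tags.append(tag_mapping[tag])
    sentence_text = " ".join(words)
    # Look up the Doc for this sentence; if there is none, leave
    # the spans empty instead of reusing the previous Doc
    doc = docs.get(sentence_text)
    if doc is not None:
        for ent in doc.ents:
            this_sentence_spans.append(
                f"{ent.label_}: {ent.text}")
    tokens.append(words)
    ner_tags.append(tags)
    spans.append(this_sentence_spans)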
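
To illustrate how the IOB format encodes entities, here is a small sketch that turns a list of words and their IOB tags back into labeled spans. A B- tag begins an entity, an I- tag with the same label continues it, and O marks a word outside any entity. The iob_to_spans function is a hypothetical helper rather than part of the book's code, and the example sentence is shortened from the data shown in this recipe:

```python
def iob_to_spans(words, tags):
    """Collect (label, text) pairs from parallel word and IOB tag lists."""
    spans = []
    current_label = None
    current_words = []

    def close_current():
        # Store the entity collected so far, if there is one
        if current_label is not None:
            spans.append((current_label, " ".join(current_words)))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            close_current()
            current_label, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue an entity
            close_current()
            current_label, current_words = None, []
    close_current()
    return spans


example_words = ["i", "love", "radioheads", "kid", "a"]
example_tags = ["O", "O", "B-Artist", "B-WoA", "I-WoA"]
example_spans = iob_to_spans(example_words, example_tags)
print(example_spans)
```

In this sketch, an I- tag that does not follow a matching B- tag is treated like O, which is one of several possible conventions for malformed tag sequences.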

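Here, we split the data into training and test sets. To do this, we split the indices of the spans list. We then create separate tokens, NER tags, and spans lists for the training and test data: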

indices = range(0, len(spans))
train, test = train_test_split(indices, test_size=0.1)
train_tokens = []
test_tokens = []
train_ner_tags = []
test_ner_tags = []
train_spans = []
test_spans = []
for i, (token, ner_tag, span) in enumerate(
    zip(tokens, ner_tags, spans)
):
    if i in train:
        train_tokens.append(token)
        train_ner_tags.append(ner_tag)
        train_spans.append(span)
    else:
        test_tokens.append(token)
        test_ner_tags.append(ner_tag)
        test_spans.append(span)
print(len(train_spans))
print(len(test_spans))

The output will look like this:

539
60

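In this step, we create new DataFrames from the training and test lists that we compiled in step 6. We then join the contents of the tokens column with spaces to get the sentence string instead of a list of words. Then, we use the dropna() function to drop rows with empty data and print out the contents of the test DataFrame:

training_df = pd.DataFrame({"tokens":train_tokens,
    "ner_tags": train_ner_tags, "spans": train_spans})
test_df = pd.DataFrame({"tokens": test_tokens,
    "ner_tags": test_ner_tags, "spans": test_spans})
training_df["text"] = training_df["tokens"].apply(
    lambda x: " ".join(x))
test_df["text"] = test_df["tokens"].apply(lambda x: " ".join(x))
# dropna() returns a new DataFrame, so assign the result
training_df = training_df.dropna()
test_df = test_df.dropna()
print(test_df)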

The result will look like this:

                                               tokens  \
0   [i, love, radioheads, kid, a, something, simil...
1   [bluesy, songs, kinda, like, evil, woman, by, ...
...
58  [looking, for, like, electronic, music, with, ...
59  [looking, for, pop, songs, about, the, end, of...
                                             ner_tags  \
0       [0, 0, 1, 3, 4, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
1                         [0, 0, 0, 0, 3, 4, 0, 1, 2]
...
58      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
59                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                                                spans  \
0     [Artist: radioheads, Artist_or_WoA: aphex twin]
1            [WoA: evil woman, Artist: black sabbath]
...
58  [WoA: the piper at the gates of dawn, Artist: ...
59  [WoA: the piper at the gates of dawn, Artist: ...
                                                 text
0   i love radioheads kid a something similar , ki...
1   bluesy songs kinda like evil woman by black sa...
...
58  looking for like electronic music with a depre...
59   looking for pop songs about the end of the world

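Here, we load the pretrained tokenizer and initialize the Dataset objects. The Features object describes the data and its properties. We create one training and one test Dataset object, using the Features object and the DataFrames we initialized in the previous step. We then add the newly created Dataset objects to a DatasetDict, with one entry for the training dataset and one for the test data. Then, we print out the resulting objects: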

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
features = Features(
    {'tokens': Sequence(feature=Value(dtype='string',
            id=None),
        length=-1, id=None),
            'ner_tags': Sequence(feature=ClassLabel(
                names=['O', 'B-Artist', 'I-Artist',
                'B-WoA', 'I-WoA'], id=None),
                length=-1, id=None),
            'spans': Sequence(
                feature=Value(dtype='string',id=None),
                length=-1, id=None),
            'text': Value(dtype='string', id=None)
                    })
training_dataset = Dataset.from_pandas(
    training_df, features=features)
test_dataset = Dataset.from_pandas(test_df, features=features)
dataset = DatasetDict({"train":training_dataset, 
    "test":test_dataset})
print(dataset["train"].features)
label_names = \
    dataset["train"].features["ner_tags"].feature.names
print(dataset)

The result will look like this:

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-Artist', 'I-Artist', 'B-WoA', 'I-WoA'], id=None), length=-1, id=None), 'spans': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'text': Value(dtype='string', id=None)}
DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'spans', 'text'],
        num_rows: 539
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'spans', 'text'],
        num_rows: 60
    })
})

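In this step, we create the tokenize_adjust_labels function, which assigns the correct labels to word parts. The BERT tokenizer splits some words into components, and we need to make sure that every part of a word is assigned the same label. The function first tokenizes all the text samples using the preloaded tokenizer. It then loops through the input IDs of the tokenized samples and adjusts the labels according to the word parts: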

def tokenize_adjust_labels(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
    all_samples_per_split["text"])
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(
            batch_index=k)
        existing_label_ids = all_samples_per_split[
            "ner_tags"][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if (wid is None):
                adjusted_label_ids.append(-100)
            elif (wid != prev_wid):
                i = i + 1
                adjusted_label_ids.append(existing_label_ids[i])
                prev_wid = wid
            else:
                label_name =label_names[existing_label_ids[i]]
                adjusted_label_ids.append(existing_label_ids[i])
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples
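
The alignment idea can be shown on plain lists, without loading a tokenizer. In the sketch below, word_ids plays the role of the list returned by the tokenizer's word_ids() method: it holds None for special tokens and otherwise the index of the word that each subword token came from. The example assumes, purely for illustration, that the word radioheads is split into two word pieces; the function name is hypothetical:

```python
def align_labels_to_word_ids(word_ids, word_labels, ignore_index=-100):
    """Give every subword token the label of the word it belongs to.

    Special tokens (word id None) get ignore_index so that the loss
    function and the metrics skip them.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
    return aligned


# Assumed tokenization: [CLS] radio ##heads kid a [SEP]
example_word_ids = [None, 0, 0, 1, 2, None]
# Word-level tags: B-Artist, B-WoA, I-WoA
example_word_labels = [1, 3, 4]

aligned_labels = align_labels_to_word_ids(
    example_word_ids, example_word_labels)
print(aligned_labels)
```

This variant indexes the labels directly by word id instead of keeping a running counter as the recipe's function does; both give every piece of a word the same label, provided the tokenizer's words line up with the entries of the label list.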

Use the previous function on the dataset:

tokenized_dataset = dataset.map(tokenize_adjust_labels, 
    batched=True)

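Here, we initialize the data collator object. Data collators simplify the manual data processing steps for training, such as padding and truncating all inputs so that they have the same length: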
```
data_collator = DataCollatorForTokenClassification(tokenizer)
```

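Now, we create the compute_metrics function, which calculates the evaluation metrics, including precision, recall, F1 score, and accuracy. In the function, we remove all tokens that have the label -100, which are the special tokens. The function uses the seqeval evaluation method, which is commonly used to evaluate NER tasks: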

metric = load("seqeval")
def compute_metrics(data):
    predictions, labels = data
    predictions = np.argmax(predictions, axis=2)
    data = zip(predictions, labels)
    data = [
        [(p, l) for (p, l) in zip(prediction, label) 
            if l != -100]
        for prediction, label in data
    ]
    true_predictions = [
        [label_names[p] for (p, l) in data_point]
        for data_point in data
    ]
    true_labels = [
        [label_names[l] for (p, l) in data_point]
        for data_point in data
    ]
    results = metric.compute(predictions=true_predictions, 
        references=true_labels)
    flat_results = {
        "overall_precision": results["overall_precision"],
        "overall_recall": results["overall_recall"],
        "overall_f1": results["overall_f1"],
        "overall_accuracy": results["overall_accuracy"],
    }
    for k in results.keys():
      if (k not in flat_results.keys()):
        flat_results[k + "_f1"] = results[k]["f1"]
    return flat_results

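Here, we load the pretrained BERT model (the uncased version, since our input is lowercase). We then specify the training arguments by initializing the TrainingArguments object, which contains the model hyperparameters. Then we initialize the Trainer object by providing it with the training arguments, the datasets, the tokenizer, the data collator, and the metrics function. Finally, we start the training process: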

model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased', num_labels=len(label_names))
training_args = TrainingArguments(
    output_dir="./fine_tune_bert_output",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    logging_steps = 1000,
    run_name = "ep_10_tokenized_11",
    save_strategy='no'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

The output will contain various pieces of information, including the following:

TrainOutput(global_step=238, training_loss=0.25769581514246326, metrics={'train_runtime': 25.8951, 'train_samples_per_second': 145.703, 'train_steps_per_second': 9.191, 'total_flos': 49438483110900.0, 'train_loss': 0.25769581514246326, 'epoch': 7.0})

In this step, we evaluate the fine-tuned model:

trainer.evaluate()

It achieves an F1 score of 76% for the Artist label and 52% for the WoA label:

{'eval_loss': 0.28670933842658997,
 'eval_overall_precision': 0.6470588235294118,
 'eval_overall_recall': 0.7096774193548387,
 'eval_overall_f1': 0.6769230769230768,
 'eval_overall_accuracy': 0.9153605015673981,
 'eval_Artist_f1': 0.761904761904762,
 'eval_WoA_f1': 0.5217391304347826,
 'eval_runtime': 0.3239,
 'eval_samples_per_second': 185.262,
 'eval_steps_per_second': 12.351,
 'epoch': 7.0}

Save the model:

trainer.save_model("../models/bert_fine_tuned")

Now, load the trained model:

model = AutoModelForTokenClassification.from_pretrained("../models/bert_fine_tuned")
tokenizer = AutoTokenizer.from_pretrained(
    "../models/bert_fine_tuned")

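Here, we test the fine-tuned model on unseen text. We initialize the text variable and then import pipeline, which we use to create a processing pipeline. A text-processing pipeline passes the text through the model and returns the final processed output. This particular pipeline specifies the task (token-classification), the fine-tuned model to use, the corresponding tokenizer, and the aggregation strategy. The aggregation strategy parameter determines how token-level predictions are grouped into spans; in the output below, adjacent tokens that received the same label are merged into one group. We then run the pipeline on the text: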

text = "music similar to morphine robocobra quartet | featuring elements like saxophone prominent bass"
from transformers import pipeline
pipe = pipeline(task="token-classification",
    model=model.to("cpu"), tokenizer=tokenizer,
    aggregation_strategy="simple")
pipe(text)
# tag_mapping = {"O": 0, "B-Artist": 1, "I-Artist": 2, "B-WoA": 3, "I-WoA": 4}

The output will vary. Here is a sample output that identifies the music artist, Morphine Robocobra Quartet:

[{'entity_group': 'LABEL_0',
  'score': 0.9991929,
  'word': 'music similar to',
  'start': 0,
  'end': 16},
 {'entity_group': 'LABEL_1',
  'score': 0.8970744,
  'word': 'morphine robocobra',
  'start': 17,
  'end': 35},
 {'entity_group': 'LABEL_2',
  'score': 0.5060059,
  'word': 'quartet',
  'start': 36,
  'end': 43},
 {'entity_group': 'LABEL_0',
  'score': 0.9988042,
  'word': '| featuring elements like saxophone prominent bass',
  'start': 44,
  'end': 94}]

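According to the tag mapping in the code comment above, LABEL_1 and LABEL_2 correspond to B-Artist and I-Artist, so we can see that the labels assigned by the model are correct.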
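The generic LABEL_n names can be turned back into entity spans with a few lines of post-processing. The following is a minimal sketch, not part of the recipe: the merge_entities helper is hypothetical, the id-to-tag table is simply the inverse of the tag_mapping comment above, and the sample groups are the start and end offsets from the sample output.

```python
# Inverse of the tag_mapping shown in the code comment above.
id_to_tag = {0: "O", 1: "B-Artist", 2: "I-Artist", 3: "B-WoA", 4: "I-WoA"}

def merge_entities(text, groups):
    """Merge a B- group and the I- groups that follow it into one span."""
    entities = []
    for group in groups:
        label_id = int(group["entity_group"].split("_")[1])
        tag = id_to_tag[label_id]
        if tag == "O":
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and entities and entities[-1]["type"] == entity_type:
            # Continuation: extend the previous entity to this group's end.
            entities[-1]["end"] = group["end"]
        else:
            entities.append({"type": entity_type,
                             "start": group["start"], "end": group["end"]})
    return [(e["type"], text[e["start"]:e["end"]]) for e in entities]

text = ("music similar to morphine robocobra quartet | featuring "
        "elements like saxophone prominent bass")
sample_output = [
    {"entity_group": "LABEL_0", "start": 0, "end": 16},
    {"entity_group": "LABEL_1", "start": 17, "end": 35},
    {"entity_group": "LABEL_2", "start": 36, "end": 43},
    {"entity_group": "LABEL_0", "start": 44, "end": 94},
]
print(merge_entities(text, sample_output))
```

This sketch does not check that a continuation group is directly adjacent to the previous one, which is enough for illustration but may need tightening for real data.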

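Chapter 6: Topic Modeling

In this chapter, we cover topic modeling, the classification of topics present in a corpus of text. Topic modeling is a very useful technique that gives us an idea of which topics appear in a document set. For example, topic modeling is used for trend discovery on social media. In many cases, it is also useful to run topic modeling as part of the preliminary analysis of a dataset to understand which topics it contains.

There are many different algorithms for this task. All of them try to find similarities between different texts and place them into several clusters. These clusters represent the different topics.

In this chapter, you will learn how to create and use topic models with various techniques on the BBC news dataset. This dataset contains news on the following topics: politics, sport, business, tech, and entertainment. Thus, we know that in each case we need five topic clusters. This will not be the case in real-life scenarios, where you will need to estimate the number of topic clusters. A good reference on how to do this is The Hundred-Page Machine Learning Book by Andriy Burkov (p. 112).

This chapter contains the following recipes:

LDA topic modeling with gensim
Community detection clustering with SBERT
K-Means topic modeling with BERT
Topic modeling using BERTopic
Using contextualized topic models

Technical requirements

In this chapter, we will work with the same BBC dataset that we used in Chapter 4. The dataset is located in the book's GitHub repository: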

https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/bbc-text.csv

It is also available through Hugging Face:

https://huggingface.co/datasets/SetFit/bbc-news

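Note

This dataset is used in this book with permission from the researchers. The original paper associated with this dataset is as follows:

Derek Greene and Pádraig Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", in Proc. 23rd International Conference on Machine Learning (ICML'06), 2006.

All rights, including copyright, in the text content of the original articles are owned by the BBC.

Please make sure to download all the Python notebooks from the util folder on GitHub into the util folder on your computer. The directory structure on your computer should mirror the setup in the GitHub repository. In several recipes in this chapter, we will access files in this directory.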

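LDA topic modeling with gensim

Latent Dirichlet Allocation (LDA) is one of the oldest topic modeling algorithms. It is a statistical generative model that calculates the probabilities of different words. In general, LDA is a good choice of model for longer texts.

We will use LDA, one of the main topic modeling algorithms, to create a topic model for the BBC news texts. We know that the BBC news dataset has five topics: tech, politics, business, entertainment, and sport. Thus, we will use five as the expected number of clusters.

Getting ready

We will use the gensim package, which is part of the poetry environment. You can also install the requirements.txt file to get the package.

The dataset is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/data/bbc-text.csv and should be downloaded to the data folder.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.1_topic_modeling_gensim.ipynb.

How to do it...

An LDA model requires clean data. This means that stopwords and other unnecessary tokens, including digits and punctuation, need to be removed from the text. If this step is skipped, topics centered on stopwords, digits, or punctuation might appear.

We will load the data, then clean and preprocess it using the simple_preprocess function from the gensim package. We will then create the LDA model. Any topic model requires the engineer to estimate the number of topics in advance. We will use five, since we know that five topics are present in the data. For more information on how to estimate the number of topics, see the introduction to this chapter.

The steps are as follows:

Do the necessary imports: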

import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
import gensim.corpora as corpora
from pprint import pprint
from gensim.corpora import MmCorpus

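Load the stopwords and the BBC news data, then print the resulting dataframe. Here, we use the standard stopword list from NLTK. As we saw in the Clustering sentences using K-Means: unsupervised text classification recipe in Chapter 4, the word said also acts as a stopword in this dataset, so we must add it to the list manually.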
```
stop_words = stopwords.words('english')
stop_words.append("said")
bbc_df = pd.read_csv("../data/bbc-text.csv")
print(bbc_df)
```
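The result will look similar to this:

Figure 6.1 – BBC news dataframe output

In this step, we create the clean_text function. The first line of the function replaces punctuation with spaces, and the second line removes digits. The function then applies the simple_preprocess function from gensim, which splits the text into a list of tokens, lowercases them, and removes tokens that are too long or too short. Finally, we remove stopwords from the list: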
```
def clean_text(input_string):
    input_string = re.sub(r'[^\w\s]', ' ', input_string)
    input_string = re.sub(r'\d', '', input_string)
    input_list = simple_preprocess(input_string)
    input_list = [word for word in input_list if word not in 
        stop_words]
    return input_list
```
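To see what the two regular expressions in clean_text do on their own, here is a small standalone sketch. It is an illustration only: the sample sentence is invented, and lowercasing plus whitespace splitting stands in for simple_preprocess, which by default also drops very short and very long tokens.

```python
import re

def strip_punct_and_digits(input_string):
    # Replace every character that is neither a word character nor
    # whitespace (that is, punctuation) with a space.
    no_punct = re.sub(r'[^\w\s]', ' ', input_string)
    # Delete digits outright.
    return re.sub(r'\d', '', no_punct)

sample = "Sales rose 5.2% in Q3, the firm said."
cleaned = strip_punct_and_digits(sample)
# Stand-in for simple_preprocess: lowercase and split on whitespace.
tokens = cleaned.lower().split()
print(tokens)
```

Note that removing digits leaves fragments such as the lone q from Q3; in the recipe, the minimum token length applied by simple_preprocess would filter out such one-character leftovers.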
Now, we apply the function to the data. The text column now contains lists of words that are lowercase and free of stopwords:
```
bbc_df['text'] = bbc_df['text'].apply(lambda x: clean_text(x))
print(bbc_df)
```
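The result will look similar to this:

Figure 6.2 – Processed BBC news output

Here, we use the gensim.corpora.Dictionary class to create a mapping from words to their integer IDs. This mapping is needed to create a bag-of-words representation of the text. Using this mapping, we then create the corpus as a bag of words. To learn more about the bag-of-words concept, see the Putting documents into a bag of words recipe in Chapter 3. In this recipe, instead of sklearn's CountVectorizer class, we use the classes provided by the gensim package: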

texts = bbc_df['text'].values
id_dict = corpora.Dictionary(texts)
corpus = [id_dict.doc2bow(text) for text in texts]
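The shape of the resulting corpus is easier to see on toy data. The sketch below imitates the idea in plain Python; build_vocab and to_bow are hypothetical helpers written for this illustration, not gensim functions, and the IDs gensim assigns may be ordered differently.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token_to_id = {}
    for tokens in texts:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

toy_texts = [["market", "growth", "market"],
             ["film", "award", "film", "film"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]
print(vocab)
print(toy_corpus)
```

Each document becomes a sparse list of (word ID, count) pairs, which is the format the LDA model consumes in the next step.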

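In this step, we initialize and train the LDA model. We pass in the preprocessed and vectorized data (corpus), the word-to-ID mapping (id_dict), the number of topics, which we set to five, the chunk size, and the number of passes. The chunk size determines the number of documents used in each training chunk, and the number of passes specifies how many times the training goes through the corpus. You can experiment with these hyperparameters to see whether they improve the model. The values used here, 100 documents per chunk and 20 passes, were chosen experimentally to produce a good model: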
```
num_topics = 5
lda_model = LdaModel(corpus=corpus,
                     id2word=id_dict,
                     num_topics=num_topics,
                     chunksize=100,
                     passes=20)
pprint(lda_model.print_topics())
```
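The results will vary. Our output looks like this:

Figure 6.3 – Our LDA output

Community detection clustering with SBERT

In this recipe, we will use the community detection algorithm included in the SentenceTransformers (SBERT) package. SBERT makes it easy to encode sentences with BERT models. See the Using BERT and OpenAI embeddings instead of word embeddings recipe in Chapter 3 for a more detailed explanation of how to use sentence transformers.

This algorithm is frequently used to find communities in social media, but it can also be used for topic modeling. Its advantage is that it is very fast. It works best on shorter texts, such as those found on social media. It also discovers only the main topics in the document dataset, whereas LDA clusters all the available text. One use of the community detection algorithm is finding duplicate posts on social media.

Getting ready

In this recipe, we will use the SBERT package. It is included in the poetry environment. You can also install it together with the other packages by installing the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.2_community_detection.ipynb.

How to do it...

We will transform the texts using a BERT sentence transformer model and then apply the community detection clustering algorithm to the resulting embeddings.

Do the necessary imports. Here, you might need to download the stopwords corpus from NLTK. See the Removing stopwords recipe in Chapter 1 for detailed instructions on how to do this.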
```
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer, util
```
Run the language utilities file:
```
%run -i "../util/lang_utils.ipynb"
```

Load the BBC data and print it:

bbc_df = pd.read_csv("../data/bbc-text.csv")
print(bbc_df)

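The result will look like Figure 6.1.

Load the model and create the embeddings. See the Using BERT and OpenAI embeddings instead of word embeddings recipe in Chapter 3 for more information on sentence embeddings. The community detection algorithm requires the embeddings to be tensors; hence, we must set convert_to_tensor to True: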
```
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(bbc_df["text"], convert_to_tensor=True)
```

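In this step, we create the clusters. We specify a similarity threshold of 0.7 on a scale from 0 to 1. This ensures that the members of each resulting community are very similar to one another. The minimum community size is 10, which means that at least 10 news articles are required to form a cluster. If we want larger, more general clusters, we should use a larger minimum community size; for more fine-grained clusters, we should use a smaller number. Any cluster with fewer members will not appear in the output. The result is a list of lists, where each inner list represents a cluster and lists the row IDs of the cluster members in the original dataframe: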

clusters = util.community_detection(
    embeddings, threshold=0.7, min_community_size=10)
print(clusters)

The results may vary and might look like this:

[[117, 168, 192, 493, 516, 530, 638, 827, 883, 1082, 1154, 1208, 1257, 1359, 1553, 1594, 1650, 1898, 1938, 2059, 2152], [76, 178, 290, 337, 497, 518, 755, 923, 1057, 1105, 1151, 1172, 1242, 1560, 1810, 1813, 1882, 1942, 1981], [150, 281, 376, 503, 758, 900, 1156, 1405, 1633, 1636, 1645, 1940, 1946, 1971], [389, 399, 565, 791, 1014, 1018, 1259, 1288, 1440, 1588, 1824, 1917, 2024], [373, 901, 1004, 1037, 1041, 1323, 1499, 1534, 1580, 1621, 1751, 2178], [42, 959, 1063, 1244, 1292, 1304, 1597, 1915, 2081, 2104, 2128], [186, 193, 767, 787, 1171, 1284, 1625, 1651, 1797, 2148], [134, 388, 682, 1069, 1476, 1680, 2106, 2129, 2186, 2198]]
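To build some intuition for how the threshold and the minimum community size interact, here is a deliberately simplified sketch of threshold-based community detection in plain Python. It is not the SentenceTransformers implementation; simple_communities is a hypothetical helper, and the two-dimensional vectors are invented stand-ins for sentence embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def simple_communities(vectors, threshold=0.7, min_size=2):
    # For each point, collect the indices of all points that are at
    # least `threshold` similar to it (including the point itself).
    candidates = []
    for u in vectors:
        members = [j for j, v in enumerate(vectors)
                   if cosine(u, v) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Consider the largest candidate communities first.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        # Keep only points that no earlier community has claimed.
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            communities.append(free)
            assigned.update(free)
    return communities

toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
               (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]
print(simple_communities(toy_vectors, threshold=0.9, min_size=2))
```

The last toy vector is not similar enough to either group, so it ends up in no community. The same effect explains why the clusters above cover only a fraction of the articles in the dataset.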

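Here, we define a function that prints the most common words by cluster. It takes the clusters created by the community detection algorithm and the original dataframe. For each cluster, we first select the texts that belong to it and then get the most frequent words using the get_most_frequent_words function, which we defined in the Clustering sentences using K-Means: unsupervised text classification recipe in Chapter 4. This function is also located in the lang_utils notebook that we ran in the second step: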
```
def print_words_by_cluster(clusters, input_df):
    for i, cluster in enumerate(clusters):
        print(f"\nCluster {i+1}, {len(cluster)} elements ")
        sentences = input_df.iloc[cluster]["text"]
        all_text = " ".join(sentences)
        freq_words = get_most_frequent_words(all_text)
        print(freq_words)
```

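Now, use the function on the model output (truncated here). We can see that there are many more specific clusters than the five topics in the original BBC dataset: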

Cluster 1, 21 elements
['mr', 'labour', 'brown', 'said', 'blair', 'election', 'minister', 'prime', 'chancellor', 'would', 'party', 'new', 'campaign', 'told', 'government', ...]
Cluster 2, 19 elements
['yukos', 'us', 'said', 'russian', 'oil', 'gazprom', 'court', 'rosneft', 'russia', 'yugansk', 'company', 'bankruptcy', 'auction', 'firm', 'unit', ...]
Cluster 3, 14 elements
['kenteris', 'greek', 'thanou', 'iaaf', 'said', 'athens', 'tests', 'drugs', 'olympics', 'charges', 'also', 'decision', 'test', 'athletics', 'missing', ...]
Cluster 4, 13 elements
['mr', 'tax', 'howard', 'labour', 'would', 'said', 'tory', 'election', 'government', 'taxes', 'blair', 'spending', 'tories', 'party', 'cuts',...]
Cluster 5, 12 elements
['best', 'film', 'aviator', 'director', 'actor', 'foxx', 'swank', 'actress', 'baby', 'million', 'dollar', 'said', 'win', 'eastwood', 'jamie',...]
Cluster 6, 11 elements
['said', 'prices', 'market', 'house', 'uk', 'figures', 'mortgage', 'housing', 'year', 'lending', 'november', 'price', 'december', 'rise', 'rose', ...]
Cluster 7, 10 elements
['lse', 'deutsche', 'boerse', 'bid', 'euronext', 'said', 'exchange', 'london', 'offer', 'stock', 'would', 'also', 'shareholders', 'german', 'market',...]
Cluster 8, 10 elements
['dollar', 'us', 'euro', 'said', 'currency', 'deficit', 'analysts', 'trading', 'yen', 'record', 'exports', 'economic', 'trade', 'markets', 'european',...]

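K-Means topic modeling with BERT

In this recipe, we will use the K-Means algorithm for unsupervised topic classification, encoding the data with BERT embeddings. This recipe has much in common with the Clustering sentences using K-Means – unsupervised text classification recipe in Chapter 4.

The K-Means algorithm finds clusters of similar items in any kind of data and is an easy way to see trends in the data. It is frequently used during preliminary data analysis to quickly check what kinds of data appear in a dataset. We can use it with text data by encoding the data with a sentence transformer model.

Getting ready

We will use the sklearn.cluster.KMeans object for unsupervised clustering, along with Hugging Face sentence transformers. Both packages are part of the poetry environment.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.3-kmeans_with_bert.ipynb.

How to do it...

In this recipe, we will load the BBC dataset and encode it using the sentence transformers package. We will then use the K-Means clustering algorithm to create five clusters. After that, we will test the model on the test set to see how well it performs on unseen data:

Do the necessary imports: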

import re
import string
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer

Run the language utilities file. This allows us to reuse the print_most_common_words_by_cluster function in this recipe:
```
%run -i "../util/lang_utils.ipynb"
```
Read and print the data:
```
bbc_df = pd.read_csv("../data/bbc-text.csv")
print(bbc_df)
```
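The result should look like the one in Figure 6.1.
In this step, we split the data into training and test sets. We limit the size of the test set to 10% of the whole dataset. The training set has 2002 items and the test set has 223: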
```
bbc_train, bbc_test = train_test_split(bbc_df, test_size=0.1)
print(len(bbc_train))
print(len(bbc_test))
```
The result will look like this:
```
2002
223
```
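Here, we assign the list of texts to the documents variable. We then load the all-MiniLM-L6-v2 model for sentence embeddings and encode the text data. Next, we initialize a KMeans model with five clusters, setting the n_init parameter, which determines the number of times the algorithm is run, to auto. We also set the init parameter to k-means++, which helps the algorithm converge faster. Finally, we train the initialized model: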
```
documents = bbc_train['text'].values
model = SentenceTransformer('all-MiniLM-L6-v2')
encoded_data = model.encode(documents)
km = KMeans(n_clusters=5, n_init='auto', init='k-means++')
km.fit(encoded_data)
```

Print the most common words by topic:

print_most_common_words_by_cluster(documents, km, 5)

The results may vary; ours look like this:

0
['said', 'people', 'new', 'also', 'mr', 'technology', 'would', 'one', 'mobile', ...]
1
['said', 'game', 'england', 'first', 'win', 'world', 'last', 'would', 'one', 'two', 'time',...]
2
['said', 'film', 'best', 'music', 'also', 'year', 'us', 'one', 'new', 'awards', 'show',...]
3
['said', 'mr', 'would', 'labour', 'government', 'people', 'blair', 'party', 'election', 'also', 'minister', ...]
4
['said', 'us', 'year', 'mr', 'would', 'also', 'market', 'company', 'new', 'growth', 'firm', 'economy', ...]

We can see that the clusters map to topics as follows: 0 is tech, 1 is sport, 2 is entertainment, 3 is politics, and 4 is business.
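When gold labels are available, as they are for the BBC data, a mapping like this can also be derived by majority vote instead of by reading word lists. The sketch below is a hypothetical alternative, not what the recipe does; the toy lists stand in for the cluster assignments and the category column.

```python
from collections import Counter, defaultdict

def majority_mapping(cluster_ids, gold_labels):
    """Map each cluster ID to the gold label that occurs most often in it."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, gold_labels):
        per_cluster[cluster_id][label] += 1
    return {cluster_id: counts.most_common(1)[0][0]
            for cluster_id, counts in per_cluster.items()}

# Hypothetical toy data standing in for km.labels_ and bbc_train["category"].
toy_clusters = [0, 0, 1, 1, 1, 2, 2, 0]
toy_labels = ["tech", "tech", "sport", "sport", "tech",
              "business", "business", "business"]
print(majority_mapping(toy_clusters, toy_labels))
```

In a truly unsupervised setting there are no gold labels, so inspecting the most frequent words, as done above, remains the practical route.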

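Now, we can use the test data to see how well the model performs on unseen data. First, we create a prediction column in the test dataframe and populate it with the cluster number for each test input: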
```
bbc_test["prediction"] = bbc_test["text"].apply(
    lambda x: km.predict(model.encode([x]))[0])
print(bbc_test)
```
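The results may vary; this is our output:

Figure 6.4 – Result of running K-Means on the test dataframe

Now, we create a mapping between cluster numbers and topic names, which we discovered manually by looking at the most frequent words in each cluster. Using this mapping and the prediction column created in the previous step, we then create a column with the predicted topic name for each text in the test set. Now we can compare the model's predictions with the true values. We use the classification_report function from sklearn to get the corresponding statistics and print the classification report for the predictions: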

topic_mapping = {0:"tech", 1:"sport",
    2:"entertainment", 3:"politics", 4:"business"}
bbc_test["pred_category"] = bbc_test["prediction"].apply(
    lambda x: topic_mapping[x])
print(classification_report(bbc_test["category"],
    bbc_test["pred_category"]))

The result will look like this:

               precision    recall  f1-score   support
     business       0.98      0.96      0.97        55
entertainment       0.95      1.00      0.97        38
     politics       0.97      0.93      0.95        42
        sport       0.98      0.96      0.97        47
         tech       0.93      0.98      0.95        41
     accuracy                           0.96       223
    macro avg       0.96      0.97      0.96       223
 weighted avg       0.96      0.96      0.96       223

The scores are very high, almost perfect. Much of this is due to the quality of the sentence embedding model that we used.

Define a new example:

new_example = """Manchester United players slumped to the turf
at full-time in Germany on Tuesday in acknowledgement of what their
latest pedestrian first-half display had cost them. The 3-2 loss at
RB Leipzig means United will not be one of the 16 teams in the draw
for the knockout stages of the Champions League. And this is not the
only price for failure. The damage will be felt in the accounts, in
the dealings they have with current and potentially future players
and in the faith the fans have placed in manager Ole Gunnar Solskjaer.
With Paul Pogba's agent angling for a move for his client and ex-United
defender Phil Neville speaking of a "witchhunt" against his former team-mate
Solskjaer, BBC Sport looks at the ramifications and reaction to a big loss for United."""

Print the prediction for the new example:

predictions = km.predict(model.encode([new_example]))
print(predictions[0])

The output is the predicted cluster number. In our run it was 1, which corresponds to sport, the correct classification.

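Topic modeling using BERTopic

In this recipe, we will explore the BERTopic package, which provides many different and versatile tools for topic modeling and visualization. It is especially useful if you would like to create different visualizations of the topic clusters. This topic modeling algorithm uses BERT embeddings to encode the data, hence the "BERT" in its name. You can learn more about the algorithm and its components at https://maartengr.github.io/BERTopic/algorithm/algorithm.html.

By default, the BERTopic package uses the HDBSCAN algorithm to create clusters from the data in an unsupervised fashion. You can learn more about how the HDBSCAN algorithm works at https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html. However, it is also possible to customize the inner workings of a BERTopic object to use other algorithms and to substitute other custom components in its pipeline. In this recipe, we will use the default settings, and you can experiment with other components.

The resulting topics are of very high quality. There may be several reasons for this. One of them is the use of BERT embeddings, which, as we saw in Chapter 4, had a positive effect on classification results.

Getting ready

We will use the BERTopic package to create a topic model for the BBC dataset. The package is included in the poetry environment and is also part of the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.3-kmeans_with_bert.ipynb.

How to do it...

In this recipe, we will load the BBC dataset and preprocess it again. The preprocessing consists of tokenization and stopword removal. We will then create the topic model using BERTopic and inspect the results. We will also test the topic model on unseen data and use classification_report to see the accuracy statistics:

Do the necessary imports: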

import pandas as pd
import numpy as np
from bertopic import BERTopic
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Run the language utilities file:
```
%run -i "../util/lang_utils.ipynb"
```
Define and modify the stopwords, then read in the BBC data:
```
stop_words = stopwords.words('english')
stop_words.append("said")
stop_words.append("mr")
bbc_df = pd.read_csv("../data/bbc-text.csv")
```
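In this step, we preprocess the data. We first tokenize it using the word_tokenize method from NLTK, as shown in the Dividing sentences into words – tokenization recipe in Chapter 1. We then remove stopwords and, finally, join the tokens back into a string. This last step is necessary because BERTopic uses a sentence embedding model, which requires a string rather than a list of words: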
```
bbc_df["text"] = bbc_df["text"].apply(
    lambda x: word_tokenize(x))
bbc_df["text"] = bbc_df["text"].apply(
    lambda x: [w for w in x if w not in stop_words])
bbc_df["text"] = bbc_df["text"].apply(lambda x: " ".join(x))
```
Here, we split the dataset into training and test sets, specifying a test set size of 10%. As a result, we get 2002 data points for training and 223 for testing:
```
bbc_train, bbc_test = train_test_split(bbc_df, test_size=0.1)
print(len(bbc_train))
print(len(bbc_test))
```
The result will look like this:
```
2002
223
```
Extract the list of texts from the dataframe:
```
docs = bbc_train["text"].values
```
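In this step, we initialize the BERTopic object and then fit it on the documents extracted in step 6. We specify six as the number of topics to generate, one more than the five we are looking for. This is because a key difference between BERTopic and other topic modeling algorithms is that it has a special discard topic, numbered -1. We could also specify a larger number of topics. In that case, they would be narrower than the five broad categories of business, politics, entertainment, tech, and sport: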
```
topic_model = BERTopic(nr_topics=6)
topics, probs = topic_model.fit_transform(docs)
```

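Here, we print information about the resulting topic model. Apart from the discard topic, the topics align well with the gold labels assigned by human annotators. The function prints the most representative words for each topic, as well as the most representative documents: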

print(topic_model.get_topic_info())

The results will vary; here is a sample result:

   Topic  Count                                 Name  \
0     -1    222             -1_also_company_china_us
1      0    463             0_england_game_win_first
2      1    393      1_would_labour_government_blair
3      2    321             2_film_best_music_awards
4      3    309  3_people_mobile_technology_software
5      4    294             4_us_year_growth_economy
                                      Representation  \
0  [also, company, china, us, would, year, new, p...
1  [england, game, win, first, club, world, playe...
2  [would, labour, government, blair, election, p...
3  [film, best, music, awards, show, year, band, ...
4  [people, mobile, technology, software, digital...
5  [us, year, growth, economy, economic, company,...
                                 Representative_Docs
0  [us retail sales surge december us retail sale...
1  [ireland win eclipses refereeing errors intern...
2  [lib dems unveil election slogan liberal democ...
3  [scissor sisters triumph brits us band scissor...
4  [mobiles media players yet mobiles yet ready a...
5  [consumer spending lifts us growth us economic...

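In this step, we print the topics. From the words, we can see that topic 0 is sport, topic 1 is politics, topic 2 is entertainment, topic 3 is tech, and topic 4 is business.

Figure 6.5 – Topics generated by BERTopic

In this step, we generate topic labels using the generate_topic_labels function. We pass in the number of words to use for each topic label, the separator (in this case, an underscore), and whether to include the topic number. As a result, we get a list of topic names. The generated topics suggest that would could be added to the stopword list: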
```
topic_model.generate_topic_labels(
    nr_words=5, topic_prefix=True, separator='_')
```
The result will be similar to this:
```
['-1_also_company_china_us_would',
 '0_england_game_win_first_club',
 '1_would_labour_government_blair_election',
 '2_film_best_music_awards_show',
 '3_people_mobile_technology_software_digital',
 '4_us_year_growth_economy_economic']
```
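Here, we define a get_prediction function that returns the topic number for a given input text and model. The function calls the model's transform method on the input text, which returns a tuple of two items: the predicted topic numbers and the corresponding probabilities, with one entry per input document. Since we pass in a single text, we take the first element of the first item as the predicted topic and return it: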
```
def get_prediction(input_text, model):
    pred = model.transform(input_text)
    pred = pred[0][0]
    return pred
```
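In this step, we define a prediction column in the test dataframe and then use the function defined in the previous step to get a prediction for each text in the dataframe. We then create a mapping from topic numbers to gold topic labels, which we can use to test the effectiveness of the topic model: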
```
bbc_test["prediction"] = bbc_test["text"].apply(
    lambda x: get_prediction(x, topic_model))
topic_mapping = {0:"sport", 1:"politics",
    2:"entertainment", 3:"tech", 4:"business", -1:"discard"}
```

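Here, we create a new column in the test dataframe that holds the predicted topic name, obtained using the mapping we created. We then filter the test set to use only the entries that were not predicted to belong to the discard topic, -1: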

bbc_test["pred_category"] = bbc_test["prediction"].apply(
    lambda x: topic_mapping[x])
test_data = bbc_test.loc[bbc_test['prediction'] != -1]
print(classification_report(test_data["category"],
    test_data["pred_category"]))

The result will be similar to this:

               precision    recall  f1-score   support
     business       0.95      0.86      0.90        21
entertainment       0.97      1.00      0.98        30
     politics       0.94      1.00      0.97        46
        sport       1.00      1.00      1.00        62
         tech       0.96      0.88      0.92        25
     accuracy                           0.97       184
    macro avg       0.96      0.95      0.95       184
 weighted avg       0.97      0.97      0.97       184

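The test scores are very high. This reflects the quality of the encoding model used by BERTopic, which is also a sentence transformer model, as in the previous recipe.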
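Because documents assigned to the discard topic are filtered out before scoring, the report above covers only 184 of the 223 test documents. It can be useful to report that coverage next to the accuracy. The following is a minimal sketch on invented toy data; score_without_discard is a hypothetical helper, not part of BERTopic or sklearn.

```python
def score_without_discard(predictions, gold_labels, mapping, discard_id=-1):
    """Return (coverage, accuracy), where accuracy ignores discarded docs."""
    kept = [(mapping[p], g) for p, g in zip(predictions, gold_labels)
            if p != discard_id]
    coverage = len(kept) / len(predictions)
    # Assumes at least one document was kept.
    accuracy = sum(pred == gold for pred, gold in kept) / len(kept)
    return coverage, accuracy

toy_mapping = {0: "sport", 1: "politics"}
toy_predictions = [0, 1, -1, 0, 1]
toy_gold = ["sport", "politics", "sport", "politics", "politics"]
coverage, accuracy = score_without_discard(
    toy_predictions, toy_gold, toy_mapping)
print(coverage, accuracy)
```

A high accuracy combined with low coverage would mean that the model is confident only on the easier documents, so the two numbers are best read together.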

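In this step, we define a new example to test the model and print it. We use the iloc indexer from the pandas package to access the first element of the bbc_test dataframe: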

new_input = bbc_test["text"].iloc[0]
print(new_input)
howard dismisses tory tax fears michael howard dismissed fears conservatives plans £4bn tax cuts modest . defended package saying plan tories first budget hoped able go . tories monday highlighted £35bn wasteful spending would stop allow tax cuts reduced borrowing spending key services . ...

This example is about politics and should be topic 1.

Get the prediction from the model and print it:
```
print(topic_model.transform(new_input))
```
The result is a correct prediction:
```
([1], array([1.]))
```

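Using contextualized topic models

In this recipe, we will look at another topic model algorithm: contextualized topic models. To produce a more effective topic model, it combines embeddings with a bag-of-words document representation.

We will show you how to use a trained topic model with input in other languages. This feature is especially useful because we can create a topic model in one language, for example, one that has many resources available, and then apply it to another language that has fewer resources. To achieve this, we will use a multilingual embedding model to encode the data.

Getting ready

We need the contextualized-topic-models package for this recipe. It is part of the poetry environment and is listed in the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter06/6.5-contextualized-tm.ipynb.

How to do it...

In this recipe, we will load the data, preprocess it, and use a contextualized topic model (ZeroShotTM) to group the documents into topics. If you would like to learn more about the algorithm, see the package documentation at https://pypi.org/project/contextualized-topic-models/.

Do the necessary imports: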

import pandas as pd
from nltk.corpus import stopwords
from contextualized_topic_models.utils.preprocessing import( 
    WhiteSpacePreprocessingStopwords)
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import( 
    TopicModelDataPreparation)

Suppress the warnings:

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category = DeprecationWarning)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Create the stopword list and read in the data:

stop_words = stopwords.words('english')
stop_words.append("said")
bbc_df = pd.read_csv("../data/bbc-text.csv")

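In this step, we create the preprocessor object and use it to preprocess the documents. The contextualized-topic-models package provides different preprocessors that prepare the data for use with the topic model algorithm. This preprocessor tokenizes the documents, removes stopwords, and joins the tokens back into strings. It returns the list of preprocessed documents, the list of original documents, the dataset vocabulary, and a list of the document indices in the original dataframe: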
```
documents = bbc_df["text"]
preprocessor = WhiteSpacePreprocessingStopwords(
    documents, stopwords_list=stop_words)
preprocessed_documents,unpreprocessed_documents,vocab,indices =\
    preprocessor.preprocess()
```
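Here, we create the TopicModelDataPreparation object, passing the name of the embedding model as the argument. This is a multilingual model that can encode text in a variety of languages with good results. We then fit it on the documents. It uses the embedding model to convert the texts into embeddings and also creates a bag-of-words representation. The output is a CTMDataset object, which represents the training dataset in the format required by the topic model training algorithm: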
```
tp = TopicModelDataPreparation(
    "distiluse-base-multilingual-cased")
training_dataset = tp.fit(
    text_for_contextual=unpreprocessed_documents,
    text_for_bow=preprocessed_documents)
```
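In this step, we create the topic model using the ZeroShotTM object. The term zero shot means that the model has no prior information about the documents. We pass in the vocabulary size of the bag-of-words model, the size of the embedding vectors, the number of topics (the n_components parameter), and the number of epochs to train the model for. We use five topics, since the BBC dataset has that many. When you apply this algorithm to your own data, you will need to experiment with different numbers of topics. Finally, we fit the initialized topic model on the training dataset: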
```
ctm = ZeroShotTM(bow_size=len(tp.vocab),
    contextual_size=512, n_components=5,
    num_epochs=100)
ctm.fit(training_dataset)
```
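Here, we inspect the topics. We can see that they match the gold labels well. Topic 0 is tech, topic 1 is sport, topic 2 is business, topic 3 is entertainment, and topic 4 is politics: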
```
ctm.get_topics()
```
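The results will vary; this is the result we got:

Figure 6.6 – Contextualized topic model output

Now, we initialize a new news piece, this time in Spanish, to see how well the topic model trained on English documents works on a news article in another language. This particular piece should fall under the tech topic. We preprocess it using the TopicModelDataPreparation object. To use the model on the encoded text, we need to create a dataset object. That is why we wrap the Spanish news piece in a list before passing it to the data preparation step. The resulting dataset, which contains just one element, can then be passed through the model: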
```
spanish_news_piece = """IBM anuncia el comienzo de la "era de la utilidad cuántica" y anticipa un superordenador en 2033.
La compañía asegura haber alcanzado un sistema de computación que no se puede simular con procedimientos clásicos."""
testing_dataset = tp.transform([spanish_news_piece])
```
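In this step, we get the topic distribution for the testing dataset created in the previous step. The result is a list of lists, where each inner list holds the probabilities that a particular text belongs to each topic. Within each inner list, the index of a probability is the topic number: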
```
ctm.get_doc_topic_distribution(testing_dataset)
```
In this case, the highest probability is for topic 0, which is indeed tech:
```
array([[0.5902461,0.09361929,0.14041995,0.07586181,0.0998529 ]],
      dtype=float32)
```
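Picking the most probable topic from such a distribution is a one-line argmax. The sketch below uses the probabilities printed above and the topic names identified earlier in this recipe; both come from one particular run, so the mapping is an assumption that will differ between runs.

```python
# Topic names as identified from the words of each topic in this run.
topic_names = {0: "tech", 1: "sport", 2: "business",
               3: "entertainment", 4: "politics"}
# Distribution copied from the output above (one row per document).
distribution = [[0.5902461, 0.09361929, 0.14041995, 0.07586181, 0.0998529]]

best_topics = []
for row in distribution:
    # Index of the largest probability in this row.
    best = max(range(len(row)), key=row.__getitem__)
    best_topics.append(best)
    print(best, topic_names[best], round(row[best], 3))
```

For the Spanish news piece, this selects topic 0 with a probability of about 0.59, well ahead of the other four topics.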

See also

To learn more about contextualized topic models, see https://contextualized-topic-models.readthedocs.io/en/latest/index.html.

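Chapter 7: Visualizing Text Data

This chapter is dedicated to creating visualizations for different aspects of NLP work, much of which we have done in previous chapters. Visualizations are important when working on NLP tasks, as they make it easier to see the big picture of the work accomplished.

We will create different types of visualizations, including visualizations of grammar details, parts of speech, and topic models. After working through this chapter, you will be able to create compelling images to show and explain the output of various NLP tasks.

These are the recipes you will find in this chapter:

Visualizing the dependency parse
Visualizing parts of speech
Visualizing NER
Creating a confusion matrix plot
Constructing word clouds
Visualizing topics from Gensim
Visualizing topics from BERTopic

Technical requirements

We will use the following packages in this chapter: spacy, matplotlib, wordcloud, and pyldavis. They are part of the poetry environment and the requirements.txt file.

We will use two datasets in this chapter. The first is the BBC news dataset, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_train.json and https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_test.json.

Note

This dataset is used in this book with permission from the researchers. The original paper associated with this dataset is as follows:

Derek Greene and Pádraig Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", in Proc. 23rd International Conference on Machine Learning (ICML'06), 2006.

All rights, including copyright, in the text content of the original articles are owned by the BBC.

The second is the Sherlock Holmes text, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/sherlock_holmes.txt.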

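Visualizing the dependency parse

In this recipe, we will learn how to use the displaCy library to visualize the dependency parse. It shows the grammatical relations between words in a piece of text, usually a sentence.

Details about how to create a dependency parse can be found in the Getting the dependency parse recipe in Chapter 2. We will create two visualizations, one for a short text and another for a longer, multi-sentence text.

After working through this recipe, you will be able to create visualizations of grammatical structures with different formatting options.

Getting ready

The displaCy library is part of the spacy package. You need at least version 2.0.12 of the spacy package for displaCy to work. The version in the poetry environment and requirements.txt is 3.6.1.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.1_dependency_parse.ipynb.

How to do it...

To visualize the dependency parse, we will use the displaCy functionality to display first one sentence and then two sentences together:

Import the necessary packages: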

import spacy
from spacy import displacy

Run the language utilities file:
```
%run -i "../util/lang_utils.ipynb"
```

Define the input text and process it using the small model:

input_text = "I shot an elephant in my pajamas."
doc = small_model(input_text)

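Now, we define the visualization options and pass them as an argument to the render command. We set the jupyter argument to True to make sure the visualization displays correctly in the notebook. For non-Jupyter visualizations, you can omit this argument. We set the style argument to 'dep', since we want the dependency parse output. The output is a visual representation of the dependency parse: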
```
options = {"add_lemma": True,
        "compact": True,
        "color": "green",
        "collapse_punct": True,
        "arrow_spacing": 20,
        "bg": "#FFFFE6",
        "font": "Times",
        "distance": 120}
displacy.render(doc, style='dep', options=options, jupyter=True)
```

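The output is shown in Figure 7.1.

Figure 7.1 – Dependency parse visualization

In this step, we save the visualization to a file. We first import the Path object from the pathlib package. We then initialize a string with the path where we want to save the file and create a Path object. We use the same render command, this time saving the output to a variable and setting the jupyter argument to False. We then use the output_path object to write the output to the corresponding file: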
```
from pathlib import Path
path = "../data/dep_parse_viz.svg"
output_path = Path(path)
svg = displacy.render(doc, style="dep", jupyter=False)
output_path.open("w", encoding="utf-8").write(svg)
```
This creates the dependency parse visualization and saves it to ../data/dep_parse_viz.svg.
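An alternative way to write the file is Path.write_text, which opens, writes, and closes the file in a single call. The sketch below is only an illustration: it writes a placeholder SVG string to a temporary directory instead of calling displacy.render, so it runs without spacy.

```python
from pathlib import Path
import tempfile

# Placeholder markup; in the recipe this string comes from displacy.render.
svg = '<svg xmlns="http://www.w3.org/2000/svg"></svg>'

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "dep_parse_viz.svg"
    # write_text returns the number of characters written.
    chars_written = output_path.write_text(svg, encoding="utf-8")
    saved = output_path.read_text(encoding="utf-8")

print(chars_written, saved == svg)
```

Either approach produces the same file; write_text simply avoids leaving an open file handle to be cleaned up later.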
Now, let's define a longer text and process it using the small model. This way, we will be able to see how displaCy deals with longer texts:
```
input_text_list = ("I shot an elephant in my pajamas. "
    "I hate it when elephants wear my pajamas.")
doc = small_model(input_text_list)
```
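Here, we visualize the new text. This time, we have to pass in the list of sentences from the processed spacy object to indicate that there is more than one sentence: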
```
displacy.render(list(doc.sents), style='dep', options=options, 
    jupyter=True)
```
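The output should look like the one in Figure 7.2. We can see that the output for the second sentence starts on a new line.

Figure 7.2 – Dependency parse visualization of several sentences

Visualizing parts of speech

In this recipe, we visualize part-of-speech counts. Specifically, we count the number of infinitives and past and present tense verbs in The Adventures of Sherlock Holmes. This can give us an idea of whether the text mostly describes past or present events. We can imagine that similar tools could be used to assess the quality of a text; for example, a book with few adjectives but many nouns might not work well as fiction.

After working through this recipe, you will be able to use the matplotlib package to create a bar chart of different verb types, tagged using the spacy package.

Getting ready

We will use the spacy package for text analysis and the matplotlib package to create the chart. They are part of the poetry environment and are listed in the requirements.txt file.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.2_parts_of_speech.ipynb.

How to do it...

We will create a function that counts the number of verbs by tense and plots each tense on a bar chart:

Import the necessary packages: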

import spacy
import matplotlib.pyplot as plt

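Run the file and language utilities files. The language utilities notebook loads the spacy models, and the file utilities notebook provides the read_text_file function: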
```
%run -i "../util/lang_utils.ipynb"
%run -i "../util/file_utils.ipynb"
```

Load the text of The Adventures of Sherlock Holmes:

text_file = "../data/sherlock_holmes.txt"
text = read_text_file(text_file)

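Here, we define the lists of verb tags, one for the present tense and one for the past tense. We do not define a list for infinitive verbs, which we also count in the next step, because they have a single tag, VB. If you have worked through the Part-of-speech tagging recipe in Chapter 1, you will notice that these tags differ from the spacy tags used there. They are more detailed and are read from the tag_ attribute instead of the pos_ attribute used for the simplified tag set: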
```
past_tags = ["VBD", "VBN"]
present_tags = ["VBG", "VBP", "VBZ"]
```

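In this step, we create the visualize_verbs function. Its inputs are the text and the spacy model. We check the tag_ attribute of each token and tally the number of present, past, and infinitive verbs in a dictionary. We then use the pyplot interface to plot these counts as a bar chart. We define the chart with the bar function. Its first argument lists the x coordinates of the bars, and the next argument is the list of bar heights. We also set the align parameter to "center" and provide the bar colors using the color parameter. The xticks function sets the labels for the x axis. Finally, we use the show function to display the resulting plot: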

def visualize_verbs(text, nlp):
    doc = nlp(text)
    verb_dict = {"Inf":0, "Past":0, "Present":0}
    for token in doc:
        if (token.tag_ == "VB"):
            verb_dict["Inf"] = verb_dict["Inf"] + 1
        if (token.tag_ in past_tags):
            verb_dict["Past"] = verb_dict["Past"] + 1
        if (token.tag_ in present_tags):
            verb_dict["Present"] = verb_dict["Present"] + 1
    plt.bar(range(len(verb_dict)),
        list(verb_dict.values()), align='center',
        color=["red", "green", "blue"])
    plt.xticks(range(len(verb_dict)),
        list(verb_dict.keys()))
    plt.show()
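The counting logic can also be separated from the plotting so that it can be checked without spacy or matplotlib. The following sketch is a hypothetical refactoring, not the recipe's code; it takes a plain list of tags, such as the tag_ values of the tokens in a Doc.

```python
PAST_TAGS = {"VBD", "VBN"}
PRESENT_TAGS = {"VBG", "VBP", "VBZ"}

def count_verb_forms(tags):
    """Count infinitive, past, and present verbs in a sequence of tags."""
    counts = {"Inf": 0, "Past": 0, "Present": 0}
    for tag in tags:
        if tag == "VB":
            counts["Inf"] += 1
        elif tag in PAST_TAGS:
            counts["Past"] += 1
        elif tag in PRESENT_TAGS:
            counts["Present"] += 1
    return counts

# Hypothetical tag sequence, e.g. [token.tag_ for token in doc].
toy_tags = ["PRP", "VBD", "DT", "NN", "TO", "VB", "VBZ", "VBG", "VBN"]
print(count_verb_forms(toy_tags))
```

The resulting dictionary has the same shape as verb_dict in visualize_verbs, so it could be passed straight to the plotting calls.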

Run the visualize_verbs function on the text of The Adventures of Sherlock Holmes using the small spacy model:
```
visualize_verbs(text, small_model)
```
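This creates the chart in Figure 7.3. We can see that most of the verbs in the book are in the past tense, which is reasonable for a novel. However, there is also a sizable number of present tense verbs, which could be part of direct speech.

Figure 7.3 – Infinitive, past, and present verbs in The Adventures of Sherlock Holmes

Visualizing NER

To visualize named entities, we again use the displacy package, which creates engaging and easy-to-read images.

After working through this recipe, you will be able to create visualizations of named entities in a text using different formatting options and save the results to a file.

Getting ready

The displaCy library is part of the spacy package. You need at least version 2.0.12 of the spacy package for displaCy to work. The version in the poetry environment and the requirements.txt file is 3.6.1.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.3_ner.ipynb.

How to do it...

We will use spacy to parse the text and then the displacy engine to visualize the named entities:

Import spacy and displacy: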

import spacy
from spacy import displacy

Run the language utilities file:
```
%run -i "../util/lang_utils.ipynb"
```

Define the text to process:

text = """iPhone 12: Apple makes jump to 5G
Apple has confirmed its iPhone 12 handsets will be its first to work on faster 5G networks.
The company has also extended the range to include a new "Mini" model that has a smaller 5.4in screen.
The US firm bucked a wider industry downturn by increasing its handset sales over the past year.
But some experts say the new features give Apple its best opportunity for growth since 2014, when it revamped its line-up with the iPhone 6.
"5G will bring a new level of performance for downloads and uploads, higher quality video streaming, more responsive gaming,
real-time interactivity and so much more," said chief executive Tim Cook.
There has also been a cosmetic refresh this time round, with the sides of the devices getting sharper, flatter edges.
The higher-end iPhone 12 Pro models also get bigger screens than before and a new sensor to help with low-light photography.
However, for the first time none of the devices will be bundled with headphones or a charger."""

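In this step, we process the text using the small model, which gives us a Doc object. We then modify the object to include a title, which will be part of the NER visualization: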
```
doc = small_model(text)
doc.user_data["title"] = "iPhone 12: Apple makes jump to 5G"
```
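Here, we set up the color options for the visualization. We use green for text tagged ORG and yellow for text tagged PERSON. We then set the options variable, which contains the colors. Finally, we use the render command to display the visualization, providing the Doc object and the options we just defined as arguments. We also set the style argument to "ent", as we want to display only entities, and the jupyter argument to True to display the result directly in the notebook: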
```
colors = {"ORG": "green", "PERSON":"yellow"}
options = {"colors": colors}
displacy.render(doc, style='ent', options=options, jupyter=True)
```
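The output should look like Figure 7.4.

Figure 7.4 – Named entity visualization

Now we save the visualization to an HTML file. We first define the path variable. We then use the same render command, but this time we set the jupyter argument to False and assign the output of the command to the html variable. Finally, we open the file, write the HTML, and close the file: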
```
path = "../data/ner_vis.html"
html = displacy.render(doc, style="ent",
    options=options, jupyter=False)
html_file= open(path, "w", encoding="utf-8")
html_file.write(html)
html_file.close()
```
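This creates an HTML file with the entity visualization.

Creating a confusion matrix plot

When working with machine learning models, for example, NLP classification models, a confusion matrix plot is a very good tool for seeing the mistakes the model makes so that it can be improved further. The model "confuses" one class for another, hence the name confusion matrix.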
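To make the idea concrete, a confusion matrix is just a table of counts indexed by true class and predicted class. The sketch below builds one in plain Python on invented labels; confusion_counts is a hypothetical helper written for illustration, whereas the recipe itself relies on sklearn's implementation.

```python
def confusion_counts(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, pred in zip(true_labels, predicted_labels):
        matrix[index[true]][index[pred]] += 1
    return matrix

classes = ["business", "politics", "tech"]
toy_true = ["business", "business", "politics", "tech", "tech", "politics"]
toy_pred = ["business", "politics", "politics", "tech", "business", "politics"]
matrix = confusion_counts(toy_true, toy_pred, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>8}", row)
```

Correct predictions sit on the diagonal. Each off-diagonal cell shows how often one specific class was mistaken for another, which is exactly the detail that a single accuracy number hides.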

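After working through this recipe, you will be able to create an SVM model, evaluate it, and then create a confusion matrix visualization that tells you in detail which mistakes the model makes.

Getting ready

We will create an SVM classifier for the BBC news dataset using a sentence transformer model as the vectorizer. We will then use the ConfusionMatrixDisplay object to create a more informative confusion matrix. The classifier is the same as in the Chapter 4 recipe Using SVMs for supervised text classification.

The dataset is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_train.json and https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_test.json.

The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.4_confusion_matrix.ipynb.

How to do it...

Import the necessary packages and functions: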

from sklearn.svm import SVC
from sentence_transformers import SentenceTransformer
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

Run the simple classifier utilities file:

%run -i "../util/util_simple_classifier.ipynb"

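Read the training and test data and shuffle the training data. We shuffle the data to make sure there are no long sequences of a single class, which could bias the model during training or exclude larger parts of some classes: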
```
train_df = pd.read_json("../data/bbc_train.json")
test_df = pd.read_json("../data/bbc_test.json")
# sample returns a new dataframe, so assign the shuffled result back
train_df = train_df.sample(frac=1)
```
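In this step, we load the transformer model and create the get_sentence_vector function. The function takes the text and the model as arguments and creates and returns the vector. The encode method takes a list of texts, so to encode a single piece of text, we need to put it in a list and then take the first element of the returned object, since the model also returns a list of encoded vectors: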
```
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_sentence_vector(text, model):
    sentence_embeddings = model.encode([text])
    return sentence_embeddings[0]
```
在这里，我们创建train_classifier函数。该函数接受向量化输入和正确答案。然后创建并训练一个SVC对象并返回它。训练可能需要几分钟时间：
```
def train_classifier(X_train, y_train):
    clf = SVC(C=0.1, kernel='rbf')
    clf = clf.fit(X_train, y_train)
    return clf
```

在这一步，我们训练和测试分类器。首先，我们创建一个包含目标标签的列表。然后我们创建一个vectorize函数，该函数使用get_sentence_vector函数，但指定要使用的模型。然后我们使用简单分类器工具文件中的create_train_test_data函数来获取训练集和测试集的向量化输入和标签。该函数接受训练和测试数据框、向量化方法和文本所在的列名。结果是向量化的训练和测试数据以及两个数据集的真实标签。然后，我们使用train_classifier函数创建一个训练好的SVM分类器。我们打印训练数据的分类报告，并使用test_classifier函数打印测试数据的分类报告：

target_names=["tech", "business", "sport",
    "entertainment", "politics"]
vectorize = lambda x: get_sentence_vector(x, model)
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize,
    column_name="text_clean")
clf = train_classifier(X_train, y_train)
print(classification_report(train_df["label"],
        y_train, target_names=target_names))
test_classifier(test_df, clf, target_names=target_names)

输出应该是以下这样的：

               precision    recall  f1-score   support
         tech       1.00      1.00      1.00       321
     business       1.00      1.00      1.00       408
        sport       1.00      1.00      1.00       409
entertainment       1.00      1.00      1.00       309
     politics       1.00      1.00      1.00       333
     accuracy                           1.00      1780
    macro avg       1.00      1.00      1.00      1780
 weighted avg       1.00      1.00      1.00      1780
               precision    recall  f1-score   support
         tech       0.97      0.95      0.96        80
     business       0.98      0.97      0.98       102
        sport       0.98      1.00      0.99       102
entertainment       0.96      0.99      0.97        77
     politics       0.98      0.96      0.97        84
     accuracy                           0.98       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.98      0.98      0.98       445

现在，我们创建一个从数字标签到文本标签的映射，然后在测试数据框中创建一个新列，显示文本标签预测：

num_to_text_mapping = {0:"tech", 1:"business",
    2:"sport", 3:"entertainment", 4:"politics"}
test_df["pred_label"] = test_df["prediction"].apply(
    lambda x: num_to_text_mapping[x])

在这一步中，我们使用sklearn的confusion_matrix函数创建一个混淆矩阵。该函数接受真实标签、预测和类别名称作为输入。然后我们创建一个ConfusionMatrixDisplay对象，该对象接受混淆矩阵和要显示的名称。然后我们使用该对象创建混淆矩阵图，并使用matplotlib库显示：
```
cm = confusion_matrix(
    test_df["label_text"],
    test_df["pred_label"], labels=target_names)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=target_names)
disp.plot()
plt.show()
```
结果显示在图7.5中。生成的图清楚地显示了哪些类别有重叠以及它们的数量。例如，很容易看出有两个例子被预测为关于商业，但实际上是关于政治的。
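To make the structure of the matrix concrete, here is a minimal pure-Python sketch of how a confusion matrix is counted. It is not how sklearn implements confusion_matrix; the helper name build_confusion_matrix and the toy labels are assumptions made for illustration. Rows correspond to the true labels and columns to the predicted labels, which matches the sklearn convention.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Count how often each true label (row) was predicted as each label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix

# Toy data (hypothetical, not taken from the BBC dataset)
labels = ["business", "politics", "sport"]
y_true = ["business", "politics", "politics", "sport", "politics"]
y_pred = ["business", "business", "politics", "sport", "business"]

cm = build_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(f"{label:>10}: {row}")
```

The off-diagonal cells are the errors. In this toy example, cm[1][0] counts the politics articles that were predicted as business.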

图7.5 – 混淆矩阵可视化

构建词云

词云是一种很好的可视化工具，可以快速查看文本中普遍存在的话题。它们可以在初步数据分析阶段和演示目的中使用。词云的一个特点是，大字体单词表示更频繁的话题，而小字体单词表示较少频繁的话题。

完成这个食谱后，你将能够从文本中创建词云，并在词云上应用图片蒙版，这将生成一个酷炫的图像。

我们将使用书籍《福尔摩斯探案集》的文本，我们将使用的图片蒙版是福尔摩斯头像的剪影。
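The relationship between word frequency and font size can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper names, the tiny stopword set, and the linear scaling rule are assumptions, and the wordcloud package uses its own, more elaborate layout and scaling logic.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

def font_size(count, max_count, min_size=10, max_size=100):
    """Scale a word count linearly into the [min_size, max_size] range."""
    return min_size + (max_size - min_size) * count / max_count

sample = ("Holmes looked at Watson. Watson looked at the door. "
          "Holmes smiled at Holmes.")
stopwords = {"at", "the"}  # hypothetical stopword list for this sketch
freqs = word_frequencies(sample, stopwords)
max_count = max(freqs.values())
for word, count in freqs.most_common(3):
    print(word, count, round(font_size(count, max_count)))
```

The most frequent word receives the largest font size, and removing stopwords keeps words such as "the" from dominating the picture.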

准备工作

我们将使用wordcloud包来完成这个食谱。为了显示图像，我们需要matplotlib包。它们都是poetry环境的一部分，并且包含在requirements.txt文件中。

笔记本位于https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.5_word_clouds.ipynb。

如何操作...

导入必要的包和函数：

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

运行文件工具笔记本。我们将使用该笔记本中的read_text_file函数：
```
%run -i "../util/file_utils.ipynb"
```

读取书籍文本：

text_file = "../data/sherlock_holmes.txt"
text = read_text_file(text_file)

在这一步中，我们定义了create_wordcloud函数。该函数接受要处理的文本、停用词、结果保存的文件名以及是否在图像上应用蒙版（默认为None）。它创建WordCloud对象，将其保存到文件，然后输出结果图。我们提供给WordCloud对象的可选参数包括最小字体大小、最大字体大小、宽度、高度、最大单词数和背景颜色：

def create_wordcloud(text, stopwords, filename,
    apply_mask=None):
    # Options shared by the masked and unmasked variants
    params = dict(background_color="white",
        stopwords=stopwords,
        min_font_size=10, max_font_size=100)
    if apply_mask is not None:
        # With a mask, the image shape comes from the mask array
        params.update(mask=apply_mask, max_words=2000)
    else:
        params.update(width=1000, height=1000,
            max_words=1000)
    wordcloud = WordCloud(**params).generate(text)
    wordcloud.to_file(filename)
    # Display the generated image without axes
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

在《福尔摩斯探案集》的文本上运行create_wordcloud函数：

create_wordcloud(text, set(STOPWORDS), 
    "../data/sherlock_wc.png")

这将结果保存到位于data/sherlock_wc.png的文件中，并创建显示在图7.6（你的结果可能略有不同）中的可视化。

图7.6 – 福尔摩斯词云可视化

还有更多...

我们还可以对词云应用遮罩。在这里，我们将夏洛克·福尔摩斯的轮廓应用到词云上：

执行额外的导入：

import numpy as np
from PIL import Image

读取遮罩图像并将其保存为numpy数组：

sherlock_data = Image.open("../data/sherlock.png")
sherlock_mask = np.array(sherlock_data)

在夏洛克·福尔摩斯的书文本上运行该函数：

create_wordcloud(text, set(STOPWORDS),
    "../data/sherlock_mask.png",
    apply_mask=sherlock_mask)

这将把结果保存到位于data/sherlock_mask.png的文件中，并创建如图7.7所示的可视化（你的结果可能略有不同）：

图7.7 – 带遮罩的词云

参见

请参阅wordcloud文档，https://amueller.github.io/word_cloud/，以获取更多选项。

从Gensim可视化主题

在这个配方中，我们将可视化我们在第6章中创建的潜在狄利克雷分配（LDA）主题模型。这种可视化将使我们能够快速看到与每个主题最相关的单词以及主题之间的距离。

在完成这个配方后，你将能够加载现有的LDA模型并为它的主题创建可视化，既可以在Jupyter中查看，也可以保存为HTML文件。

准备工作

我们将使用pyLDAvis包来创建可视化。它在poetry环境和requirements.txt文件中可用。

我们将加载我们在第6章中创建的模型，然后使用pyLDAvis包创建主题模型可视化。

笔记本位于https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.6_topics_gensim.ipynb。

如何操作...

导入必要的包和函数：
```
import gensim
import pyLDAvis.gensim
```

定义模型文件的路径。模型是在第6章中训练的：

model_path = "../models/bbc_gensim/lda.model"
dict_path = "../models/bbc_gensim/id2word.dict"
corpus_path = "../models/bbc_gensim/corpus.mm"

在这一步中，我们加载这些路径指向的对象。如果你在这一步遇到FileNotFoundError错误，这意味着你没有创建字典、语料库和模型文件。在这种情况下，回到第6章，即使用Gensim进行LDA主题建模配方，创建模型和相关文件：
```
dictionary = gensim.corpora.Dictionary.load(dict_path)
corpus = gensim.corpora.MmCorpus(corpus_path)
lda = gensim.models.ldamodel.LdaModel.load(model_path)
```
在这里，我们使用前面的文件创建PreparedData对象，并将可视化保存为HTML。该对象是可视化方法所必需的：
```
lda_prepared = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(lda_prepared, '../data/lda-gensim.html')
```
在这里，我们启用Jupyter的显示选项，并在笔记本中显示可视化。你会看到每个主题及其重要的单词。要选择特定的主题，用鼠标悬停在它上面。当你悬停在它们上面时，你会看到每个主题的最重要单词会发生变化：
```
pyLDAvis.enable_notebook()
pyLDAvis.display(lda_prepared)
```
这将创建如图7.8所示的可视化（你的结果可能会有所不同）：

图7.8 – LDA模型可视化

参见

使用pyLDAvis，也可以可视化使用sklearn创建的模型。有关更多信息，请参阅包文档：https://github.com/bmabey/pyLDAvis。

BERTopic主题可视化

在本食谱中，我们将创建并可视化BBC数据上的BERTopic模型。BERTopic包提供了几个可视化选项，我们将使用其中几个。

在本食谱中，我们将以与第6章中的使用BERTopic进行主题建模食谱类似的方式创建主题模型。然而，与第6章不同，我们不会限制创建的主题数量，这将导致数据中超过5个原始主题。这将允许进行更有趣的可视化。

准备工作

我们将使用BERTopic包来创建可视化。它在poetry环境中可用。

如何操作...

导入必要的包和函数：

import pandas as pd
import numpy as np
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

运行语言工具文件：
```
%run -i "../util/lang_utils.ipynb"
```

读取数据：

bbc_df = pd.read_csv("../data/bbc-text.csv")

这里，我们从一个dataframe对象中创建一个训练文档列表。然后我们初始化一个表示模型对象。在这里，我们使用KeyBERTInspired对象，它使用BERT提取关键词。

此对象创建主题的名称（表示）；它比默认版本做得更好，默认版本包含大量停用词。然后我们创建主主题模型对象并将其拟合到文档集中。在本食谱中，与第6章中的使用BERTopic进行主题建模食谱相比，我们不限制创建的主题数量。这将创建更多主题：
```
docs = bbc_df["text"].values
representation_model = KeyBERTInspired()
topic_model = BERTopic(
    representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```
在此步骤中，我们显示一般主题可视化。它显示了创建的所有42个主题。如果你悬停在每个圆圈上，你会看到主题表示或名称。表示由主题中的前五个单词组成：
```
topic_model.visualize_topics()
```
这将创建如图7.9所示的可视化（你的结果可能会有所不同）。

图7.9 – BERTopic模型可视化

这里，我们创建了一个主题层次结构的可视化。如果主题相关，这个层次结构会将不同的主题聚集在一起。我们首先使用主题模型对象的hierarchical_topics函数创建层次结构，然后将其传递给visualize_hierarchy函数。组合不同主题的节点有自己的名称，如果你悬停在它们上面，你可以看到：
```
hierarchical_topics = topic_model.hierarchical_topics(
    bbc_df["text"])
topic_model.visualize_hierarchy(
    hierarchical_topics=hierarchical_topics)
```
这将创建如图7.10所示的可视化。

图7.10 – BERTopic层次可视化

如果你悬停在节点上，你会看到它们的名称。

在此步骤中，我们创建一个条形图，显示主题的前几个单词。我们通过使用主题模型对象的visualize_barchart函数的top_n_topics参数来指定要显示的主题数量：
```
topic_model.visualize_barchart(top_n_topics=15)
```
这将创建一个类似于此的可视化：

图7.11 – BERTopic单词得分

在这里，我们创建训练集中单个文档的可视化。我们将 步骤 4 中创建的文档列表提供给 visualize_documents 函数。它将文档聚类到主题中。如果你将鼠标悬停在单个圆圈上，你可以看到文档：
```
topic_model.visualize_documents(docs)
```
结果将类似于以下可视化：

图 7.12 – BERTopic 文档可视化

如果你将鼠标悬停在节点上，你会看到单个文档的文本。

参见

通过 BERTopic 可用额外的可视化工具。有关更多信息，请参阅包文档：https://maartengr.github.io/BERTopic/index.html。
要了解更多关于 KeyBERTInspired 的信息，请参阅 https://maartengr.github.io/BERTopic/api/representation/keybert.html。

第八章：转换器和它们的用途

在本章中，我们将了解转换器及其如何应用于执行各种NLP任务。NLP领域的典型任务涉及加载和处理数据，以便它可以无缝地用于下游。一旦数据被读取，另一个任务是转换数据，使其以各种模型可以使用的形式。一旦数据被转换成所需的格式，我们就用它来执行实际的任务，如分类、文本生成和语言翻译。

下面是本章中的菜谱列表：

加载数据集
对数据集中的文本进行分词
使用分词后的文本通过转换器模型进行分类
根据不同的需求使用不同的转换器模型
通过参考初始起始句子生成文本
使用预训练的转换器模型在不同语言之间翻译文本

技术要求

该章节的代码位于书籍GitHub仓库的Chapter08文件夹中（https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter08）。

如前几章所述，本章所需的包是poetry环境的一部分。或者，您也可以使用requirements.txt文件安装所有包。

加载数据集

在这个菜谱中，我们将学习如何加载公共数据集并与之交互。我们将使用RottenTomatoes数据集作为本菜谱的示例。这个数据集包含了电影的评分和评论。请参考以下链接获取更多关于数据集的信息：https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset

准备工作

作为本章的一部分，我们将使用来自HuggingFace网站（huggingface.co）的库。对于这个菜谱，我们将使用数据集包。如果您需要从一个现有的笔记本开始工作，可以使用代码网站上的8.1_Transformers_dataset.ipynb笔记本。

如何操作...

在这个菜谱中，您将使用数据集包从HuggingFace网站加载RottenTomatoes数据集。如果数据集不存在，该包会为您下载它。对于任何后续运行，如果之前已下载，它将使用缓存中的下载数据集。

本菜谱执行以下操作：

读取RottenTomatoes数据集
描述数据集的特征
从数据集的训练分割中加载数据
从数据集中抽取几个句子并打印出来

菜谱的步骤如下：

执行必要的导入，从数据集包导入必要的类型和函数：
```
from datasets import load_dataset, get_dataset_split_names
```
通过load_dataset函数加载"rotten tomatoes"并打印内部数据集分割。这个数据集包含训练、验证和测试分割：
```
dataset = load_dataset("rotten_tomatoes")
print(get_dataset_split_names("rotten_tomatoes"))
```
前一个命令的输出如下：
```
['train', 'validation', 'test']
```

加载数据集并打印训练分割的属性。training_data.description描述了数据集的详细信息，而training_data.features描述了数据集的特征。在输出中，我们可以看到training_data分割包含特征text，它是字符串类型，以及label，它是分类类型，具有neg和pos的值：

training_data = dataset['train']
print(training_data.description)
print(training_data.features)

命令的输出如下：

Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed  sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
{'text': Value(dtype='string', id=None), 
    'label':ClassLabel(names=['neg', 'pos'], id=None)}

现在我们已经加载了数据集，我们将打印其中的前五个句子。这只是为了确认我们确实能够从数据集中读取：

sentences = training_data['text'][:5]
[print(sentence) for sentence in sentences]

命令的输出如下：

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
effective but too-tepid biopic
if you sometimes like to go to the movies to have fun , wasabi is a good place to start .
emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .

在您的数据集中对文本进行分词

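The components inside a transformer have no inherent knowledge of the words they process. Instead, the model works only with the token identifiers that the tokenizer assigns to those words. In this recipe, we will learn how to convert the text in your dataset into a representation that the model can use for downstream tasks.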

准备工作

作为这个食谱的一部分，我们将使用来自transformers包的AutoTokenizer模块。如果您需要从一个现有的笔记本中工作，可以使用代码网站的8.2_Basic_Tokenization.ipynb笔记本。

如何做到这一点...

在这个食谱中，您将继续使用之前的RottenTomatoes数据集示例，并从中采样几个句子。然后我们将将这些采样句子编码成标记及其相应的表示。

这个食谱做了以下事情：

将一些句子加载到内存中
实例化一个分词器并对句子进行分词
将前一步生成的标记ID转换回标记

食谱的步骤如下：

执行必要的导入以导入来自transformers库的必要的AutoTokenizer模块：
```
from transformers import AutoTokenizer
```
我们初始化一个包含三个句子的句子数组，我们将使用这个例子。这些句子的长度不同，并且有很好的相同和不同单词的组合。这将使我们能够了解分词表示如何因每个句子而异：
```
sentences = [
    "The first sentence, which is the longest one in the list.",
    "The second sentence is not that long.",
    "A very short sentence."]
```
实例化一个bert-base-cased类型的分词器。这个分词器是区分大小写的。这意味着单词star和STAR将会有不同的分词表示：
```
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
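In this step, we tokenize all the sentences in the sentences array. We call the tokenizer instance with the sentences array as the argument and print the tokenized_input object that it returns. This object is a dictionary-like structure with three entries:
- input_ids: the numerical token identifiers assigned to each token.
- token_type_ids: IDs that mark which segment (the first or the second sentence of a pair) each token belongs to. They are all 0 here because every input is a single sentence.
- attention_mask: a mask that tells the model which positions to attend to. The values are integers: 1 marks a real token, and 0 marks a padding position that should be ignored. No padding was requested here, so every value is 1.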
```
tokenized_input = tokenizer(sentences)
print(tokenized_input)
{'input_ids': [[101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102],
[101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102],[101, 138, 1304, 1603, 5650, 119, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
```

在这一步，我们将第一句话的输入ID转换回标记：

tokens = tokenizer.convert_ids_to_tokens(
    tokenized_input["input_ids"][0])
print(tokens)

将它们转换为标记返回以下输出：

['[CLS]', 'The', 'first', 'sentence', ',', 'which', 'is', 'the', 'longest', 'one', 'in', 'the', 'list', '.', '[SEP]']

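In addition to the original tokens, the tokenizer added [CLS] and [SEP]. These special tokens are part of the input format that BERT was trained with: [CLS] marks the start of the input, and its final representation is typically used for classification, while [SEP] marks the end of a sentence or separates two sentences.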
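The attention mask becomes informative once sentences of different lengths are padded into a single batch. The following sketch pads the three input_ids lists printed above to a common length by hand and builds the matching masks. The pad_batch helper is hypothetical, and the pad ID of 0 is an assumption about the [PAD] entry in the BERT vocabulary; with a real tokenizer, calling tokenizer(sentences, padding=True) does this work for you.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every ID list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks

# Token IDs copied from the tokenizer output shown earlier in this recipe
batch_ids = [
    [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141,
     1107, 1103, 2190, 119, 102],
    [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102],
    [101, 138, 1304, 1603, 5650, 119, 102],
]
padded, masks = pad_batch(batch_ids)
print(masks[2])
```

After padding, every row has the same length, and the mask of the shortest sentence ends in zeros so that the model ignores the padded positions.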

现在我们已经了解了transformer内部使用的文本的内部表示，让我们学习如何将一段文本分类到不同的类别中。

对文本进行分类

在这个菜谱中，我们将使用RottenTomatoes数据集并对评论文本进行情感分类。我们将对数据集的测试分割进行分类，并评估分类器对测试分割中真实标签的结果。

准备就绪

作为这个菜谱的一部分，我们将使用来自transformers包的pipeline模块。如果你需要从一个现有的笔记本中工作，可以使用代码网站上的8.3_Classification_And_Evaluation.ipynb笔记本。

如何做到这一点...

在这个菜谱中，你将使用RottenTomatoes数据集并从中抽取几个句子。然后我们将对五个句子的一个小子集进行情感分类，并在这个较小的子集上展示结果。然后我们将对数据集的整个测试分割进行推理并评估分类结果。

菜谱执行以下操作：

加载RottenTomatoes数据集并打印其中的前五句话
实例化一个使用在相同数据集上训练的预训练Roberta模型进行情感分析的管道
使用管道在整个数据集的测试分割上执行推理（或情感预测）
评估推理结果

菜谱的步骤如下：

执行必要的导入以导入所需的包和模块：

from datasets import load_dataset
from evaluate import evaluator, combine
from transformers import pipeline
import torch

在这一步，我们检查系统中是否存在兼容Compute Unified Device Architecture（CUDA）的设备（或Graphics Processing Unit（GPU））。如果存在这样的设备，我们的模型将加载到它上面。如果支持，这将加速模型的训练和推理性能。然而，如果不存在这样的设备，将使用Central Processing Unit（CPU）。我们还加载了RottenTomatoes数据集并从中选择了前五句话。这是为了确保我们确实能够读取数据集中存在的数据：

device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")
sentences = load_dataset(
    "rotten_tomatoes", split="test").select(range(5))
[print(sentence) for sentence in sentences['text']]
lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
consistently clever and suspenseful .
it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
red dragon " never cuts corners .

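Initialize a pipeline for sentiment analysis. A pipeline is an abstraction that lets us use a model for an inference task without writing the code that stitches the individual steps together. We load the roberta-base-rotten-tomatoes model published by textattack, which has already been trained on this dataset. In the following code, we set the pipeline task to sentiment analysis and specify the model to use for it: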
```
roberta_pipe = pipeline("sentiment-analysis",
    model="textattack/roberta-base-rotten-tomatoes")
```
在这一步，我们为在步骤2中选择的句子小集合生成预测。使用管道对象生成预测就像传递一系列句子一样简单。如果你在没有兼容CUDA设备的机器上运行此示例，这一步可能需要一点时间：
```
predictions = roberta_pipe(sentences['text'])
```

在这一步，我们遍历我们的句子并检查句子的预测结果。我们打印出实际和生成的预测结果，以及五个句子的文本。实际标签是从数据集中读取的，而预测是通过管道对象生成的：

for idx, _sentence in enumerate(sentences['text']):
    # Map the pipeline's LABEL_1/LABEL_0 output to 1/0
    predicted = ('1' if predictions[idx]['label']
        == 'LABEL_1' else '0')
    print(
        f"actual: {sentences['label'][idx]}\n"
        f"predicted: {predicted}\n"
        f"sentence: {_sentence}\n\n"
    )
actual:1
predicted:1
sentence:lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
actual:1
predicted:1
sentence:consistently clever and suspenseful .
actual:1
predicted:0
sentence:it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
actual:1
predicted:1
sentence:the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
actual:1
predicted:1
sentence:red dragon " never cuts corners .

既然我们已经验证了管道及其结果，让我们为整个测试集生成推理，并生成这个特定模型的评估指标。加载RottenTomatoes数据集的完整测试分割：
```
sentences = load_dataset("rotten_tomatoes", split="test")
```
在这一步，我们初始化一个评估器对象，它可以用来执行推理并评估分类的结果。它还可以用来展示易于阅读的评估结果摘要：
```
task_evaluator = evaluator("sentiment-analysis")
```
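In this step, we call the compute method on the evaluator instance. This triggers inference and evaluation using the same pipeline instance that we initialized in step 3. It returns the accuracy, precision, recall, and f1 evaluation metrics, along with some inference-related performance metrics: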
```
eval_results = task_evaluator.compute(
    model_or_pipeline=roberta_pipe,
    data=sentences,
    metric=combine(["accuracy", "precision", "recall", "f1"]),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1}
)
```
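In this step, we print the evaluation results. The values to note are precision, recall, and f1. The f1 value observed here is 0.88, which indicates that the classifier is quite effective, although there is still room for improvement: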
```
print(eval_results)
{'accuracy': 0.88,
'precision': 0.92,
'recall': 0.84,
'f1': 0.88,
'total_time_in_seconds': 27.23,
'samples_per_second': 39.146,
'latency_in_seconds': 0.025}
```
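To show how these four metrics relate to each other, here is a small pure-Python sketch that derives them from true and predicted labels. The binary_metrics helper and the label lists are hypothetical; the evaluate library computes the same quantities for you.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels, not the Rotten Tomatoes test split
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

F1 is the harmonic mean of precision and recall. Plugging the reported precision (0.92) and recall (0.84) into 2 * 0.92 * 0.84 / (0.92 + 0.84) gives approximately 0.88, which agrees with the f1 value in the output above.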

在这个菜谱中，我们使用预训练的分类器对一个数据集上的数据进行分类。数据集和模型都是用于情感分析的。有些情况下，我们可以使用在另一类数据上训练的分类器，但仍然可以直接使用。这使我们免去了训练自己的分类器并重新利用现有模型的麻烦。我们将在下一个菜谱中了解这个用例。

使用零样本分类器

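In this recipe, we will classify sentences using a zero-shot classifier. Sometimes we do not have the luxury of training a classifier from scratch or of using a model that was trained on our own label set. Zero-shot classification helps a team get started quickly in that situation. The "zero" in the term means that the classifier has seen no labeled examples (exactly zero shots) from the target dataset before it is used for inference.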

准备工作

作为这个菜谱的一部分，我们将使用来自transformers包的管道模块。如果您需要从一个现有的笔记本中工作，可以使用代码网站上的8.4_Zero_shot_classification.ipynb笔记本。

如何操作...

在这个菜谱中，我们将使用几个句子并将它们进行分类。我们将为这些句子使用我们自己的标签集。我们将使用facebook/bart-large-mnli模型来完成这个菜谱。这个模型适合零样本分类的任务。

菜谱执行以下操作：

基于零样本分类模型初始化一个管道
使用管道将句子分类到用户自定义的标签集中
打印分类的结果，包括类别及其相关的概率

菜谱的步骤如下：

执行必要的导入并识别计算设备，如前一个分类菜谱中所述：

from transformers import pipeline
import torch
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")

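In this step, we initialize a pipeline instance with the facebook/bart-large-mnli model. We chose this particular model for our example, but other models available on the HuggingFace website can be used as well. The device argument is passed when the pipeline is created, so that the model is placed on the CUDA-compatible device if one is present in the system:
```
pipeline_instance = pipeline(
    model="facebook/bart-large-mnli", device=device)
```
Use the pipeline instance to classify a sentence into a given set of candidate labels. The labels in the example are entirely new and defined by us; the model was not trained on examples with these labels. The classification output is stored in the result variable, which is a dictionary with the keys 'sequence', 'labels', and 'scores'. The 'sequence' element stores the original sentence passed to the classifier. The 'labels' element stores the class labels, sorted by descending score rather than in the order in which we passed them. The 'scores' element stores the class probabilities in the same order as the 'labels' element:
```
result = pipeline_instance(
    "I am so hooked to video games as I cannot get any work done!",
    candidate_labels=["technology", "gaming", "hobby", "art", "computer"])
```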

我们打印序列，然后打印每个标签及其相关的概率。请注意，标签的顺序已经从我们在上一步中指定的初始输入中改变。函数调用根据标签概率的降序重新排序标签：

print(result['sequence'])
for i, label in enumerate(result['labels']):
    print(f"{label}:  {result['scores'][i]:.2f}")
I am so hooked to video games as I cannot get any work done!
gaming:  0.85
hobby:  0.08
technology:  0.07
computer:  0.00
art:  0.00

我们对不同的句子运行零样本分类，并打印其结果。这次，我们发出一个选择概率最高的类别的结果并打印出来：

result = pipeline_instance(
    "A early morning exercise regimen can drive many diseases away!",
    candidate_labels=["health", "medical", "weather", "geography", "politics"], )
print(result['sequence'])
for i, label in enumerate(result['labels']):
    print(f"{label}:  {result['scores'][i]:.2f}")
print(
    f"The most probable class for the sentence is ** 
    {result['labels'][0]} ** "
    f"with a probability of {result['scores'][0]:.2f}"
)
A early morning exercise regimen can drive many diseases away!
health:  0.91
medical:  0.07
weather:  0.01
geography:  0.01
politics:  0.00
The most probable class for the sentence is ** health ** with a probability of 0.91
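The facebook/bart-large-mnli model is a natural language inference (NLI) model, and the zero-shot pipeline reuses it by turning every candidate label into a hypothesis sentence and asking how strongly the input entails it. The sketch below mimics that idea without running any model: the template matches the pipeline's default hypothesis template, while the helper names and the entailment logits are made up for illustration.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn every candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort by descending score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]  # stable softmax
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(zip(labels, scores),
                  key=lambda pair: pair[1], reverse=True)

labels = ["technology", "gaming", "hobby"]
hypotheses = build_hypotheses(labels)
# Made-up entailment logits, one per (sentence, hypothesis) pair
ranked = rank_labels(labels, [0.5, 3.0, 0.7])
for label, score in ranked:
    print(f"{label}: {score:.2f}")
```

Because the scores are normalized across the candidate labels, they sum to 1, and the labels come back ordered from most to least probable, just as in the pipeline output.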

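So far, we have used transformers and some pretrained models to generate token IDs and classifications. These recipes relied on the encoder part of the transformer. The encoder generates a representation of the text, which the classification head on top of it then uses to produce class labels. A transformer also has another component called the decoder, which takes a text representation and generates the text that follows it. We will learn more about the decoder in the next recipe.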

生成文本

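In this recipe, we will use a generative transformer model to generate text from a given seed sentence. One model used for text generation is GPT-2, an improved version of the original Generative Pre-trained Transformer (GPT) model.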

准备工作

作为此菜谱的一部分，我们将使用来自 transformers 包的管道模块。如果您需要从一个现有的笔记本中工作，可以使用代码站点中的 8.5_Transformer_text_generation.ipynb 笔记本。

如何操作...

在此菜谱中，我们将从一个初始种子句子开始，使用 GPT-2 模型根据给定的种子句子生成文本。我们还将调整某些参数以提高生成文本的质量。

菜谱执行以下操作：

它初始化一个起始句子，后续句子将从该句子生成。
它初始化一个作为管道一部分的 GPT-2 模型，并使用它来生成五个句子，作为传递给生成方法的参数。
它打印了生成的结果。

菜谱的步骤如下：

执行必要的导入并识别计算设备，如前一个分类菜谱中所述：

from transformers import pipeline
import torch
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")

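Initialize the seed sentence from which the subsequent text will be generated. Our goal is to have the GPT-2 decoder generate a plausible continuation of it, subject to the generation parameters: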
```
text = "The cat had no business entering the neighbors garage, but"
```
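In this step, we initialize a text generation pipeline with the 'gpt2' model. GPT-2 is a large language model (LLM) trained on a large text corpus. The last argument in this call is device; if a CUDA-compatible device is present in the system, it will be used: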
```
generator = pipeline(
    'text-generation', model='gpt2', device=device)
```
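Generate continuations for the seed sentence and store the results. Besides the seed text, the parameters to note in the call are as follows:
- max_length: the maximum length of each generated sequence in tokens, including the tokens of the seed sentence.
- num_return_sequences: the number of generated sequences to return.
- num_beams: the number of beams used for beam search, that is, how many candidate sequences are kept at each generation step. Higher values often improve the quality of the generated sequences but slow down generation. We encourage you to try different values of this parameter based on your quality requirements.
- do_sample and pad_token_id: do_sample=True samples the next token from the model's distribution instead of always taking the most likely one, and pad_token_id=50256 reuses GPT-2's end-of-text token for padding, since GPT-2 does not define a padding token of its own.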
```
generated_sentences = generator(
    text,do_sample=True, max_length=30,
    num_return_sequences=5, num_beams=5,
    pad_token_id=50256)
```
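The effect of num_beams is easier to see on a toy example. The sketch below runs a plain, deterministic beam search over a hand-written table of next-token probabilities; the table and the beam_search helper are hypothetical and stand in for the real model. The recipe additionally sets do_sample=True, which mixes sampling into the search, so this sketch covers only the beam part.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the last token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=3):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":  # finished sequences are carried over
                candidates.append((tokens, score))
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1)[0]
wider = beam_search(num_beams=2)[0]
print(" ".join(greedy[0]), round(math.exp(greedy[1]), 2))
print(" ".join(wider[0]), round(math.exp(wider[1]), 2))
```

With a single beam, the search commits to the locally best first token and ends up with a less probable sentence. With two beams, it keeps the second-best first token alive and finds a sequence with a higher overall probability, at the cost of scoring more candidates per step.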

打印生成的句子：

[print(generated_sentence['generated_text']) 
    for generated_sentence in generated_sentences]
The cat had no business entering the neighbors garage, but  he was able to get inside.  The cat had been in the neighbor's
The cat had no business entering the neighbors garage, but  the owner of the house called 911.  He said he found the cat in
The cat had no business entering the neighbors garage, but  he was able to get his hands on one of the keys.  It was
The cat had no business entering the neighbors garage, but  he didn't seem to mind at all.  He had no idea what he
The cat had no business entering the neighbors garage, but  the cat had no business entering the neighbors garage, but the cat had no business entering

语言翻译

在这个示例中，我们将使用transformers进行语言翻译。我们将使用Google Text-To-Text Transfer Transformer（T5）模型。这是一个端到端模型，它使用transformer模型的编码器和解码器组件。

准备中

作为本示例的一部分，我们将使用transformers包中的pipeline模块。如果您需要从一个现有的笔记本中工作，可以使用代码网站上的8.6_Language_Translation_with_transformers.ipynb笔记本。

如何做...

在这个食谱中，你将初始化一个英语种子句子并将其翻译成法语。T5模型期望输入格式编码有关语言翻译任务的信息以及种子句子。在这种情况下，编码器使用源语言中的输入并生成文本的表示。解码器使用这个表示并为目标语言生成文本。T5模型专门为此任务以及其他许多任务进行了训练。如果你在没有任何CUDA兼容设备的机器上运行，食谱步骤的执行可能需要一些时间。

该食谱执行以下操作：

它初始化了Google t5-base模型和标记器
它初始化一个英语种子句子，该句子将被翻译成法语
它将种子句子以及翻译任务规范进行标记化，以便将种子句子翻译成法语
它生成翻译后的标记，将它们解码成目标语言（法语），并打印出来

该食谱的步骤如下：

执行必要的导入并识别计算设备，如前一个分类食谱中所述：

from transformers import (
    T5Tokenizer, T5ForConditionalGeneration)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

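Initialize a tokenizer and a model instance using the t5-base model from Google. We set the model_max_length parameter to 200. Feel free to try a higher value if your seed sentence is longer than 200 tokens. We also load the model onto the compute device identified in step 1: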
```
tokenizer = T5Tokenizer.from_pretrained(
    't5-base', model_max_length=200)
model = T5ForConditionalGeneration.from_pretrained(
    't5-base', return_dict=True)
model = model.to(device)
```

初始化一个你想要翻译的种子序列：

language_sequence = ("It's such a beautiful morning today!")

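Tokenize the input sequence. The source and target languages are specified as part of the input that the tokenizer encodes. This is done by prepending the text "translate English to French: " to the seed sentence. We load the token IDs onto the compute device, because the model and the token IDs must be on the same device: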
```
input_ids = tokenizer(
    "translate English to French: " + language_sequence,
    return_tensors="pt",
    truncation=True).input_ids.to(device)
```
通过模型将源语言标记ID转换为目标语言标记ID。该模型使用编码器-解码器架构将输入标记ID转换为输出标记ID：
```
language_ids = model.generate(input_ids, max_new_tokens=200)
```
将文本从标记ID解码成目标语言标记。我们使用标记器将输出标记ID转换为目标语言标记：
```
language_translation = tokenizer.decode(
    language_ids[0], skip_special_tokens=True)
```

打印翻译后的输出：

print(language_translation)
C'est un matin si beau!

总之，本章介绍了transformers的概念，以及一些基本应用。下一章将重点介绍我们如何使用不同的NLP技术更好地理解文本。

第九章：自然语言理解

在本章中，我们将探讨一些食谱，这些食谱将使我们能够解释和理解包含在短篇和长篇段落中的文本。自然语言理解（NLU）是一个非常宽泛的术语，作为NLU一部分开发的各种系统并不以与人类读者相同的方式解释或理解一段文本。然而，基于任务的特定性，我们可以创建一些应用，这些应用可以组合起来生成一种解释或理解，用于解决与文本处理相关的特定问题。

拥有大量文档语料库的组织需要一个无缝的方式来搜索文档。更具体地说，用户真正需要的是一个针对特定问题的答案，而无需浏览作为文档搜索结果返回的文档列表。用户更愿意将查询以自然语言问题的形式提出，并以相同的方式输出答案。

另一套应用是文档摘要和文本蕴涵。在处理大量文档时，如果能够缩短文档长度而不丢失意义或上下文，那就很有帮助。此外，确定文档中包含的信息在句子层面上是否蕴涵自身也很重要。

当我们处理和分类文档时，总会有理解为什么或如何模型将标签分配给文本片段的挑战——更具体地说，文本的哪些部分有助于不同的标签。

本章将涵盖探索先前描述的各种技术的不同方法。我们将遵循食谱，使我们能够执行这些任务，并理解帮助我们实现最终目标的底层构建块。

作为本章的一部分，我们将为以下任务构建食谱：

从短文本段落中回答问题
从长文本段落中回答问题
以提取方式从文档语料库中回答问题
以抽象方式从文档语料库中回答问题
使用基于Transformers的预训练模型总结文本
检测句子蕴涵
通过分类器不变方法增强可解释性
通过文本生成增强可解释性

技术要求

The code for this chapter is in the Chapter09 folder of the book's GitHub repository (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter09).

与前几章一样，本章所需的包是poetry环境的一部分。或者，您可以使用requirements.txt文件安装所有包。

从短文本段落中回答问题

要开始问答，我们将从一个简单的配方开始，这个配方可以回答来自简短段落的问题。

准备工作

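As part of this chapter, we will use libraries from Hugging Face (huggingface.co). For this recipe, we will use the BertForQuestionAnswering and BertTokenizer modules from the Transformers package. The BertForQuestionAnswering model used here is a BERT large model that was fine-tuned for question answering on the SQuAD dataset. This pretrained model can take a passage of text and answer questions based on its content. If you need to work from an existing notebook, you can use the 9.1_question_answering.ipynb notebook from the code site.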

如何操作...

在这个配方中，我们将加载一个在 SQuAD 数据集 (https://huggingface.co/datasets/squad) 上训练的预训练模型。

配方执行以下操作：

它初始化一个基于预训练的 BertForQuestionAnswering 模型和 BertTokenizer 分词器的问答管道。
它进一步初始化一个上下文段落和一个问题，并基于这两个参数输出答案。它还打印出答案的确切文本。
它通过只更改问题文本来向同一个管道提出后续问题，并打印出问题的确切文本答案。

配方的步骤如下：

Perform the necessary imports, bringing in the required types and functions from the transformers package:

from transformers import (
    pipeline, BertForQuestionAnswering, BertTokenizer)
import torch

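In this step, we initialize the model and the tokenizer from the pretrained bert-large-uncased-whole-word-masking-finetuned-squad artifacts. If these artifacts are not present on the local machine, they are downloaded from the Hugging Face website as part of these calls. We chose this specific model and tokenizer for our recipe, but feel free to explore other models on the Hugging Face website that may suit your needs. As a common step for this recipe and the next one, we check whether any GPU device is present in the system and try to use it. If no GPU is detected, we use the CPU: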
```
device = torch.device("cuda" if torch.cuda.is_available() 
    else "cpu")
qa_model = BertForQuestionAnswering.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad',
    device_map=device)
qa_tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad',
    device=device)
```
在这个步骤中，我们使用模型和分词器初始化一个问答管道。这个管道的任务类型设置为问答：
```
question_answer_pipeline = pipeline(
    "question-answering", model=qa_model,
    tokenizer=qa_tokenizer)
```

在这个步骤中，我们初始化一个 上下文 段落。这个段落是我们 通过 Transformers 生成文本 的例子中的一部分，在 第 8 章。如果你想要使用不同的段落，那是完全可以接受的：

context = "The cat had no business entering the neighbors garage, but she was there to help. The neighbor, who asked not to be identified, said she didn't know what to make of the cat's behavior. She said it seemed like it was trying to get into her home, and that she was afraid for her life. The neighbor said that when she went to check on her cat, it ran into the neighbor's garage and hit her in the face, knocking her to the ground."

在这个步骤中，我们初始化一个问题文本，使用上下文和问题调用管道，并将结果存储在一个变量中。结果类型是一个 Python dict 对象：
```
question = "Where was the cat trying to enter?"
result = question_answer_pipeline(question=question, 
    context=context)
```
在这个步骤中，我们打印结果值。score值显示了答案的概率。start和end值表示构成答案的上下文段落中的起始和结束字符索引。answer值表示答案的实际文本：
```
print(result)
```

{'score': 0.25, 'start': 33, 'end': 54, 'answer': 'the neighbors garage,'}
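The start and end values can be checked directly against the context string. In this sketch, the result dictionary is copied from the output shown above rather than produced by running the model, and only the first sentence of the context is reproduced, which does not change the offsets.

```python
context = ("The cat had no business entering the neighbors garage, "
           "but she was there to help.")
result = {"score": 0.25, "start": 33, "end": 54,
          "answer": "the neighbors garage,"}

# start and end are character offsets into the context string
span = context[result["start"]:result["end"]]
print(span)
```

Slicing the context with these offsets returns the same text as the answer value, which is useful when you want to highlight the answer inside the original passage.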

在此步骤中，我们打印出确切的文本答案。这个答案在result字典的answer键中：
```
print(result['answer'])
```

the neighbors garage,

在这个步骤中，我们使用相同上下文提出另一个问题并打印结果：

question = "What did the cat do after entering the garage"
result = question_answer_pipeline(
    question=question, context=context)
print(result['answer'])

hit her in the face, knocking her to the ground.

从长文本段落中回答问题

在上一个菜谱中，我们学习了一种在给定上下文的情况下提取问题答案的方法。这种模式涉及模型从给定的上下文中检索答案。模型不能回答不在上下文中的问题。这在我们需要从给定上下文中获取答案的情况下是有用的。这种问答系统被定义为封闭域问答（CDQA）。

另有一种问答系统可以回答本质上是普遍性的问题。这些系统在更大的语料库上进行了训练。这种训练使它们能够回答本质上是开放性的问题。这些系统被称为开放域问答（ODQA）系统。

准备工作

作为这个菜谱的一部分，我们将使用deeppavlov库以及知识库问答（KBQA）模型。这个模型已经在英语维基数据上作为知识库进行了训练。它使用各种NLP技术，如实体链接和消歧，知识图谱等，以提取问题的确切答案。

这个菜谱需要几个步骤来设置正确的执行环境。这个菜谱的poetry文件位于9.2_QA_on_long_passages文件夹中。我们还需要通过执行以下命令来安装和下载文档语料库：

python -m deeppavlov install kbqa_cq_en

您也可以使用包含在同一文件夹中的9.2_QA_on_long_passages.ipynb笔记本。

如何操作...

在这个菜谱中，我们将基于DeepPavlov库初始化KBQA模型，并使用它来回答一个开放性问题。菜谱的步骤如下：

执行必要的导入：
```
from deeppavlov import build_model
```
在这个步骤中，我们初始化KBQA模型，kbqa_cq_en，并将其作为参数传递给build_model方法。我们还设置download参数为True，以便在本地缺失的情况下也下载模型：
```
kbqa_model = build_model('kbqa_cq_en', download=True)
```

我们使用初始化后的模型并传递我们想要回答的几个问题：

result = kbqa_model(['What is the capital of Egypt?', 
    'Who is Bill Clinton\'s wife?'])

我们打印出模型返回的结果。结果包含三个数组。

第一个数组包含按与原始输入相同顺序排列的精确答案。在这种情况下，答案Cairo和Hillary Clinton与它们相关的问题顺序相同。

您可能会在输出中观察到一些额外的工件。这些是由库生成的内部标识符。为了简洁起见，我们已省略它们：
```
[['Cairo', 'Hillary Clinton']]
```

参见

有关DeepPavlov工作内部细节的更多信息，请参阅https://deeppavlov.ai。

以提取方式从文档语料库回答问题

对于包含大量文档的文档语料库的使用案例，在运行时加载文档内容以回答问题是不切实际的。这种方法会导致查询时间过长，并且不适合生产级系统。

在这个食谱中，我们将学习如何预处理文档并将它们转换成一种更快阅读、索引和检索的形式，这样系统就可以在短时间内查询到给定问题的答案。

准备工作

作为这个食谱的一部分，我们将使用Haystack(https://haystack.deepset.ai/)框架来构建一个问答系统，该系统能够从文档语料库中回答问题。我们将下载一个基于《权力的游戏》的数据集并进行索引。为了使我们的问答系统性能良好，我们需要事先对文档进行索引。一旦文档被索引，回答问题将遵循两步过程：

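Retriever: Because we have many documents, scanning every one of them for the answer is not feasible. The Retriever component first searches the pre-built index and retrieves a set of candidate documents that are likely to contain the answer to our question. This narrows down the number of documents that need to be scanned for the exact answer.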
Reader：一旦我们有一个可能包含答案的候选文档集，我们将搜索这些文档以检索我们问题的确切答案。

在这个食谱中，我们将讨论这些组件的详细信息。如果您需要从一个现有的笔记本开始工作，可以使用代码网站上的9.3_QA_on_document_corpus.ipynb笔记本。首先，让我们设置先决条件。

如何做到这一点...

在这一步中，我们进行必要的导入：

import os
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.pipelines.standard_pipelines import( 
    TextIndexingPipeline)
from haystack.utils import (fetch_archive_from_http, 
    print_answers)

在这一步中，我们指定一个文件夹，用于保存我们的数据集。然后，我们从源中检索数据集。fetch_archive_from_http方法的第二个参数是数据集将被下载到的文件夹。我们将该参数设置为第一行中定义的文件夹。fetch_archive_from_http方法解压缩.zip存档文件，并将所有文件提取到同一个文件夹中。然后我们从文件夹中读取并创建文件夹中包含的文件列表。我们还打印了现有文件的数量：
```
doc_dir = "data/got_dataset"
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir,
    )
files_to_index = [doc_dir + "/" + f for f in os.listdir(
    doc_dir)]
print(len(files_to_index))
183
```
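We initialize a document store and build an index from the files. To do this, we create an InMemoryDocumentStore instance with the use_bm25 parameter set to True, then create an indexing pipeline on top of the document store and run it over the files. The document store uses Best Match 25 (BM25) as the algorithm for the retrieval step. BM25 is a simple bag-of-words algorithm based on a scoring function that takes into account how often the query terms occur in a document and how long the document is. Chapter 3 covers the BM25 algorithm in more detail, and we recommend referring to it for a better understanding. Note that various other DocumentStore options exist, such as ElasticSearch and OpenSearch. We use InMemoryDocumentStore to keep the recipe simple and to focus on the retriever and reader concepts: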
```
document_store = InMemoryDocumentStore(use_bm25=True)
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)
```
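For intuition about what the retriever computes, here is a minimal pure-Python sketch of the Okapi BM25 scoring function. It is not the implementation that Haystack uses: the bm25_scores helper, the whitespace tokenization, the parameter values k1=1.5 and b=0.75, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query.lower().split():
            n_containing = sum(1 for d in tokenized if term in d)
            # Rare terms get a higher inverse document frequency
            idf = math.log(
                (n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)
            tf = term_counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

documents = [
    "arya stark is the daughter of eddard stark",
    "jon snow joins the night's watch",
    "eddard stark is made hand of the king",
]
scores = bm25_scores("arya stark", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

A document that shares no term with the query scores zero, and a document that contains the rarer query term ranks above one that matches only the common term.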
一旦我们加载了文档，我们就初始化我们的检索器和阅读器实例。为了实现这一点，我们初始化检索器和阅读器组件。BM25Retriever 使用 bm25 分数函数检索初始文档集。对于阅读器，我们初始化 FARMReader 对象。这是基于 deepset 的 FARM 框架，可以利用 Hugging Face 的 QA 模型。在我们的情况下，我们使用 deepset/roberta-base-squad2 模型作为阅读器。use_gpu 参数可以根据您的设备是否有 GPU 适当设置：
```
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2",
    use_gpu=True)
```
我们现在创建一个可以用来回答问题的管道。在前一个步骤中初始化检索器和阅读器之后，我们希望将它们结合起来进行查询。Haystack 框架的 pipeline 抽象允许我们使用一系列针对不同用例的管道来集成阅读器和检索器。在这个例子中，我们将使用 ExtractiveQAPipeline 作为我们的问答系统。在初始化管道后，我们从 权力的游戏 系列中生成一个问题答案。run 方法将问题作为查询。第二个参数，params，决定了检索器和阅读器结果如何组合以呈现答案：
- "Retriever": {"top_k": 10}: The top_k keyword argument specifies that the top-k (in this case, 10) results from the retriever are used by the reader to search for the exact answer
- "Reader": {"top_k": 5}: The top_k keyword argument specifies that the top-k (in this case, 5) results from the reader are presented as the output of the method:
```
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
```

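We print the answers to the question. The system prints the exact answer along with the relevant context from which it was extracted. Note that we pass all as the value of the details parameter, which prints the start and end offsets of each answer together with all auxiliary information. Setting details to medium shows the answer, its context, and a relative score for each answer. This score can be used to filter the results further, depending on the accuracy requirements of the system. Setting details to minimum shows only the answer and the context. We encourage you to choose the value that suits your needs: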

print_answers(prediction, details="all")
'Query: Who is the father of Arya Stark?'
'Answers:'
[<Answer {'answer': 'Eddard',
'type': 'extractive',
'score': 0.993372917175293,
'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 207, 'end': 213}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_ids': ['9e3c863097d66aeed9992e0b6bf1f2f4'], 'meta': {'_split_id': 3}}>,
<Answer {'answer': 'Ned',
'type': 'extractive',
'score': 0.9753613471984863,
'context': "k in the television series.\n\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's h", 'offsets_in_document': [{'start': 630, 'end': 633}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_ids': ['7d3360fa29130e69ea6b2ba5c5a8f9c8'], 'meta': {'_split_id': 10}}>,
<Answer {'answer': 'Lord Eddard Stark',
'type': 'extractive',
'score': 0.9177322387695312,
'context': 'rk daughters.\n\nDuring the Tourney of the Hand to honour her father Lord Eddard Stark, Sansa Stark is enchanted by the knights performing in the event.', 'offsets_in_document': [{'start': 280, 'end': 297}], 'offsets_in_context': [{'start': 67, 'end': 84}], 'document_ids': ['5dbccad397381605eba063f71dd500a6'], 'meta': {'_split_id': 3}}>,
<Answer {'answer': 'Ned',
'type': 'extractive',
'score': 0.8396496772766113,
'context': " girl disguised as a boy all along and is surprised to learn she is Arya, Ned Stark's daughter. After the Goldcloaks get help from Ser Amory Lorch and", 'offsets_in_document': [{'start': 848, 'end': 851}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_ids': ['257088f56d2faba55e2ef2ebd19502dc'], 'meta': {'_split_id': 31}}>,
<Answer {'answer': 'King Robert',
'type': 'extractive',
'score': 0.6922298073768616,
'context': "en refuses to yield Gendry, who is actually a bastard son of the late King Robert, to the Lannisters.  The Night's Watch convoy is overrun and massacr", 'offsets_in_document': [{'start': 579, 'end': 590}], 'offsets_in_context': [{'start': 70, 'end': 81}], 'document_ids': ['4d51b1876e8a7eac8132b97e2af04401'], 'meta': {'_split_id': 4}}>]
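The score-based filtering mentioned above can be sketched as follows. The dictionaries are simplified, hypothetical stand-ins for the Answer objects in the output; in a real run you would read the score and answer attributes of each item in prediction["answers"] instead, and the helper name `filter_answers` is an assumption.

```python
# Simplified stand-ins for the Answer objects printed above (scores rounded).
answers = [
    {"answer": "Eddard", "score": 0.9934},
    {"answer": "Ned", "score": 0.9754},
    {"answer": "Lord Eddard Stark", "score": 0.9177},
    {"answer": "Ned", "score": 0.8396},
    {"answer": "King Robert", "score": 0.6922},
]

def filter_answers(answers, min_score):
    """Keep answers whose score meets the threshold, best first."""
    kept = [a for a in answers if a["score"] >= min_score]
    return sorted(kept, key=lambda a: a["score"], reverse=True)

confident = filter_answers(answers, min_score=0.9)
print([a["answer"] for a in confident])
```

A threshold of 0.9 drops the incorrect answer King Robert in this example; the right cut-off depends on how much precision the application needs.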

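See also

For the QA system to run in a high-performance production setting, we recommend a document store other than the in-memory one. Refer to https://docs.haystack.deepset.ai/docs/document_store and choose a document store that matches your production requirements.

Answering questions from a document corpus in an abstractive manner

The previous recipe returned extractive answers, that is, spans copied verbatim from the source. There are techniques to generate an abstractive answer too, which is more readable by end users compared to an extractive one.

Getting ready

For this recipe, we will build a QA system that provides abstractive answers. We will load the bilgeyucel/seven-wonders dataset from the Hugging Face site and initialize a retriever from it. This dataset contains content about the Seven Wonders of the Ancient World. To generate answers, we will use the PromptNode component of the Haystack framework to set up a pipeline that generates answers abstractively. If you want to work from an existing notebook, you can use 9.4_abstractive_qa_on_document_corpus.ipynb from the code site. Let's get started.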

How to do it

The steps are as follows:

In this step, we do the necessary imports:

from datasets import load_dataset
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import (
    BM25Retriever, PromptNode,
    PromptTemplate, AnswerParser)
from haystack.pipelines import Pipeline

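As part of this step, we load the bilgeyucel/seven-wonders dataset into an in-memory document store. The dataset was created from the Wikipedia pages on the Seven Wonders of the Ancient World (https://en.wikipedia.org/wiki/Wonders_of_the_World). It has been preprocessed and uploaded to the Hugging Face site, and can be downloaded easily with the Hugging Face datasets module. We use InMemoryDocumentStore as our document store, with bm25 as the search algorithm, and write the documents from the dataset into the store. To keep query times low, the write_documents method automatically optimizes how the documents are written. Once the documents are written, we initialize a bm25-based retriever, as in the previous recipe: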
```
dataset = load_dataset("bilgeyucel/seven-wonders", 
    split="train")
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(dataset)
retriever = BM25Retriever(document_store=document_store)
```
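As part of this step, we initialize a prompt template. Using the documents and query placeholders, we define the task for the model as a simple instruction in English. Both are expected to be present in the execution context at runtime. The second argument, output_parser, takes an AnswerParser object, which instructs the PromptNode object to store its results in the answers element. After defining the prompt, we initialize a PromptNode object with the model and the prompt template. We use the google/flan-t5-large model as the answer generator. It is based on Google's T5 language model and has been instruction fine-tuned (FLAN stands for Fine-tuned LAnguage Net). Fine-tuning a language model on an instruction dataset, in which tasks are phrased as human-written instructions, lets the model carry out a task from a simple instruction and generate text from the given context and instruction. As a result, the model can perform different downstream tasks from instructions alone, which reduces the need for few-shot examples: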
```
rag_prompt = PromptTemplate(
    prompt="""Synthesize a comprehensive answer from the following text for the given question.
        Provide a clear and concise response that summarizes the key points and information presented in the text.
        Your answer should be in your own words and be no longer than 50 words.
        \n\n Related text: {join(documents)} \n\n Question: {query} \n\n Answer:""",
    output_parser=AnswerParser(),
)
prompt_node = PromptNode(
    model_name_or_path="google/flan-t5-large",
    default_prompt_template=rag_prompt, use_gpu=True)
```
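As a rough picture of what happens to the template at query time, the following sketch fills a template with retrieved documents and a query using plain string formatting. The template text, the sample documents, and the helper `build_prompt` are hypothetical; Haystack resolves {join(documents)} and {query} with its own template engine.

```python
def build_prompt(template, documents, query):
    """Fill a prompt template with retrieved documents and the user query."""
    joined = " ".join(documents)  # mimics the effect of {join(documents)}
    return template.format(documents=joined, query=query)

# Hypothetical template that uses plain str.format placeholders.
template = (
    "Synthesize a concise answer from the following text.\n\n"
    "Related text: {documents}\n\n"
    "Question: {query}\n\n"
    "Answer:"
)
retrieved = [
    "The Great Pyramid of Giza is the largest Egyptian pyramid.",
    "It served as the tomb of pharaoh Khufu.",
]
prompt = build_prompt(template, retrieved, "What is the Great Pyramid of Giza?")
print(prompt)
```

The language model only ever sees the final string, which is why the wording of the instruction and the order of context, question, and answer cue matter.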
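Now we create a pipeline and add the retriever and prompt_node components initialized in the previous steps. The retriever operates on the query supplied by the user and produces a set of results. These results are passed to the prompt node, which uses the configured google/flan-t5-large model to generate the answer: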
```
pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", 
    inputs=["Query"])
pipe.add_node(component=prompt_node,
    name="prompt_node", inputs=["retriever"])
```

Once the pipeline is set up, we use it to answer questions about the content of the dataset:

output = pipe.run(query="What is the Great Pyramid of Giza?")
print(output["answers"][0].answer)
output = pipe.run(query="Where are the hanging gardens?")
print(output["answers"][0].answer)

The Great Pyramid of Giza was built in the early 26th century BC during a period of around 27 years.[3]
The Hanging Gardens of Semiramis are the only one of the Seven Wonders for which the location has not been definitively established.

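See also

Refer to the prompt engineering guidelines on Haystack to learn how to write prompts for your use case (https://docs.haystack.deepset.ai/docs/prompt-engineering-guidelines).

Summarizing text using pre-trained models based on Transformers

We will now explore techniques for summarizing text. Generating a summary of a long passage lets NLP practitioners extract the information relevant to their use case and use the summary in other downstream tasks. In this recipe, we generate summaries with Transformer models.

Getting ready

Our first summarization recipe uses the Google T5 model. If you want to work from an existing notebook, you can use 9.5_summarization.ipynb from the code site.

How to do it

Let's get started:

Do the necessary imports: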
```
from transformers import pipeline
```

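As part of this step, we initialize the input passage to be summarized, along with the pipeline. We also compute the length of the passage in words, since it is passed to the pipeline as an argument in the next step. Because we define the task as summarization, the object returned by the pipeline module is of type SummarizationPipeline. We pass t5-large as the pipeline's model parameter. This model is an encoder-decoder Transformer that works as a pure sequence-to-sequence model: both its input and its output are text sequences. It was pre-trained with a denoising objective of recovering masked spans of words in a sentence, and then fine-tuned on specific downstream tasks such as summarization, textual entailment, and language translation: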

passage = "The color of animals is by no means a matter of chance; it depends on many considerations, but in the majority of cases tends to protect the animal from danger by rendering it less conspicuous. Perhaps it may be said that if coloring is mainly protective, there ought to be but few brightly colored animals. There are, however, not a few cases in which vivid colors are themselves protective. The kingfisher itself, though so brightly colored, is by no means easy to see. The blue harmonizes with the water, and the bird as it darts along the stream looks almost like a flash of sunlight."
passage_length = len(passage.split(' '))
pipeline_instance = pipeline("summarization", model="t5-large")
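The denoising objective mentioned above can be illustrated with a small sketch. Selected spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the text it hides. The `<extra_id_N>` names follow the T5 tokenizer's sentinel tokens; the helper `span_corrupt` and the hand-picked, non-overlapping spans are assumptions for illustration, since real pre-training samples spans at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) token span with a sentinel.

    Returns the corrupted input and the target the model must produce.
    Spans are assumed not to overlap.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # reveal the span in the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)
print(target)
```

Because both sides of the task are plain text, the same sequence-to-sequence setup carries over to summarization, where the input is the passage and the target is the summary.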

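We now take the pipeline_instance initialized in the previous step and pass the passage to it to perform the summarization. To summarize several sequences at once, you can pass a list of strings instead. As the second argument, we pass max_length=passage_length, which caps the length of the generated summary. Note that max_length is measured in tokens while passage_length counts whitespace-separated words, so it serves only as a rough upper bound. The T5 model is memory intensive, and its compute requirements grow quadratically with the length of the input text. Depending on the compute capacity of your environment, this step may take a few minutes to complete: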
```
pipeline_result = pipeline_instance(
    passage, max_length=passage_length)
```
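Once summarization completes, we extract the result from the output and print it. The pipeline returns a list of dictionaries, one per input. Since we passed a single string, the first item in the list is the output dictionary containing our summary, which is retrieved through the summary_text key: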
```
result = pipeline_result[0]["summary_text"]
print(result)
```

the color of animals is by no means a matter of chance; it depends on many considerations . in the majority of cases, coloring tends to protect the animal from danger . there are, however, not a few cases in which vivid colors are themselves protective .

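There's more…

Now that we have seen how to generate a summary with the T5 model, we can reuse the same code framework, with small adjustments, to generate summaries with other models.

The following lines are common to the other summarization recipes we will use. We add a variable named device, which we pass to our pipelines. It is set to the device used for generating the summary: the GPU if one is present and configured on the system, and the CPU otherwise: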

from transformers import pipeline
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
passage = "The color of animals is by no means a matter of chance; it depends on many considerations, but in the majority of cases tends to protect the animal from danger by rendering it less conspicuous. Perhaps it may be said that if coloring is mainly protective, there ought to be but few brightly colored animals. There are, however, not a few cases in which vivid colors are themselves protective. The kingfisher itself, though so brightly colored, is by no means easy to see. The blue harmonizes with the water, and the bird as it darts along the stream looks almost like a flash of sunlight."
passage_length = len(passage.split(' '))

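In the following example, we use Facebook's BART model (https://huggingface.co/facebook/bart-large-cnn). This model is trained with a denoising objective. A noising function corrupts the input sequence, for example by masking or deleting spans of text, and the model is trained to remove the noise and reconstruct the original sequence. The model is then fine-tuned for summarization on the CNN DailyMail dataset (https://huggingface.co/datasets/abisee/cnn_dailymail):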

pipeline_instance = pipeline("summarization", 
    model="facebook/bart-large-cnn", device=device)
pipeline_result = pipeline_instance(passage, 
    max_length=passage_length)
result = pipeline_result[0]["summary_text"]
print(result)
The color of animals is by no means a matter of chance; it depends on many considerations, but in the majority of cases tends to protect the animal from danger by rendering it less conspicuous. There are, however, not a few cases in which vivid colors are themselves protective. The blue harmonizes with the water, and the bird as it darts along the stream looks almost like a flash of sunlight.

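From the generated summary, we can observe that it is verbose and extractive. Let's try generating a summary with another model.

In the following example, we use Google's PEGASUS model (https://huggingface.co/google/pegasus-large) for summarization. It is a Transformer-based encoder-decoder model, pre-trained on the large web corpus C4 (https://huggingface.co/datasets/allenai/c4) and on the HugeNews dataset, with a training objective of identifying important sentences. HugeNews is a dataset of 1.5 billion articles curated from news and news-like websites between 2013 and 2019. The model is further fine-tuned for summarization on a subset of the same data. The fine-tuning objective masks important sentences and has the model generate a summary that contains them. The model is designed to generate abstractive summaries: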

pipeline_instance = pipeline("summarization", 
    model="google/pegasus-large", device=device)
pipeline_result = pipeline_instance([passage, passage], 
    max_length=passage_length)
result = pipeline_result[0]["summary_text"]
print(result)
Perhaps it may be said that if coloring is mainly protective, there ought to be but few brightly colored animals.

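We can observe that the generated summary is far more concise, although in this particular run the sentence happens to be taken verbatim from the passage.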
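One rough way to check how extractive a summary is: count the fraction of its sentences that occur word for word in the source. The sketch below is a simple heuristic under that assumption, not an established metric, and the helper name `copied_fraction` is hypothetical.

```python
import re

def copied_fraction(summary, source):
    """Fraction of summary sentences that appear verbatim in the source."""
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0
    copied = sum(1 for s in sentences if s in source)
    return copied / len(sentences)

source = ("Perhaps it may be said that if coloring is mainly protective, there "
          "ought to be but few brightly colored animals. There are, however, not "
          "a few cases in which vivid colors are themselves protective.")
extractive = ("Perhaps it may be said that if coloring is mainly protective, "
              "there ought to be but few brightly colored animals.")
abstractive = "Bright colors can still help animals hide."
print(copied_fraction(extractive, source), copied_fraction(abstractive, source))
```

A value near 1 points to an extractive summary, a value near 0 to a rephrased one. Small differences in casing or punctuation defeat the exact match, so treat the number as a quick sanity check only.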

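See also

New and improved summarization models are being developed all the time. We recommend browsing the models on the Hugging Face site (https://huggingface.co/models?pipeline_tag=summarization) and choosing one that fits your requirements.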

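Detecting sentence entailment

In this recipe, we will explore techniques for detecting textual entailment. The task involves two sentences. The first is the premise, which sets the context. The second is the hypothesis, which serves as the claim. Textual entailment determines the contextual relationship between the premise and the hypothesis. The relationship falls into one of three types, defined as follows:

Entailment – the hypothesis supports the premise
Contradiction – the hypothesis contradicts the premise
Neutral – the hypothesis neither supports nor contradicts the premise

Getting ready

We will use the Transformers library to detect textual entailment. If you want to work from an existing notebook, you can use 9.6_textual_entailment.ipynb from the code site.

How to do it...

In this recipe, we will initialize different sets of sentences, one for each of the relationships defined earlier, and explore how to detect those relationships. Let's get started:

Do the necessary imports: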

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

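Initialize the device, the tokenizer, and the model. In this case, we use Google's t5-small model. We set the legacy flag to False, since we do not need the tokenizer's legacy behavior. We set device to whichever device is available in the execution environment. For the model, we pass the same model name and a device parameter, as for the tokenizer. We set return_dict to True so that the model results come back as a dictionary rather than a tuple: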
```
device = torch.device("cuda" if torch.cuda.is_available() 
    else "cpu")
tokenizer = T5Tokenizer.from_pretrained(
    't5-small', legacy=False, device=device)
model = T5ForConditionalGeneration.from_pretrained(
    't5-small', return_dict=True, device_map=device)
```

We initialize the premise and hypothesis sentences. In this case, the hypothesis supports the premise:

premise = "The corner coffee shop serves the most awesome coffee I have ever had."
hypothesis = "I love the coffee served by the corner coffee shop."

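In this step, we call the tokenizer with the mnli prefix followed by the premise and hypothesis values. This plain text concatenation sets up the input for the entailment task. We read the input_ids attribute to get the token IDs of the concatenated string. With the token IDs in hand, we use the model to generate the entailment prediction. The call returns a list of tensors containing the predictions, which we use in the next step: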
```
input_ids = tokenizer(
    "mnli premise: " + premise + " hypothesis: " + hypothesis,
    return_tensors="pt").input_ids
entailment_ids = model.generate(input_ids.to(device), 
    max_new_tokens=20)
```
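In this step, we call the tokenizer's decode method and pass it the first tensor (or vector) returned by the model's generate call. We also instruct the tokenizer to skip the special tokens it uses internally. The tokenizer turns the vector into a string label, and we print the prediction. In this case, the model predicts entailment: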
```
prediction = tokenizer.decode(
    entailment_ids[0], skip_special_tokens=True, device=device)
print(prediction)
```

entailment

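There's more...

Now that we have seen entailment for a single pair of sentences, the same framework can process a batch of sentences. We adapt steps 3, 4, and 5 of the preceding recipe for this example. We initialize an array of two sentences each for premise and hypothesis. The two premise sentences are identical, while the hypothesis sentences are an entailment and a contradiction, respectively: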

premise = ["The corner coffee shop serves the most awesome coffee I have ever had.", "The corner coffee shop serves the most awesome coffee I have ever had."]
hypothesis = ["I love the coffee served by the corner coffee shop.", "I find the coffee served by the corner coffee shop too bitter for my taste."]

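Since premise and hypothesis are both arrays of sentences, we create an array of concatenated inputs, each combined with the mnli instruction prefix. We pass this array to the tokenizer and use the token IDs it returns in the next step:

premises_and_hypotheses = [
    f"mnli premise: {pre} hypothesis: {hyp}"
    for pre, hyp in zip(premise, hypothesis)]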
input_ids = tokenizer(
    text=premises_and_hypotheses, padding=True,
    return_tensors="pt").input_ids

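We now generate predictions the same way as before. In this step, however, we produce the entailment labels by iterating over the tensors returned by the model: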

entailment_ids = model.generate(input_ids.to(device), 
    max_new_tokens=20)
for _tensor in entailment_ids:
    entailment = tokenizer.decode(_tensor,
        skip_special_tokens=True, device=device)
    print(entailment)

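Enhancing explainability via a classifier-invariant approach

We will now explore some recipes that help us understand the decisions made by a text classifier. Using a sentiment classifier and NLP explainability libraries, we look at techniques for explaining a classification label and its relationship to the input text, in particular to the individual words in the text.

Although many current text classification models in NLP are based on deep neural networks, it is difficult to explain a classification result through the network weights or parameters. Mapping those network parameters back to the individual components or words of the input is equally challenging. Still, there are techniques in NLP that help us understand a classifier's decisions. We explore them in this recipe and the next.

In this recipe, we will learn how to explain the feature importance of each word in a text passage while remaining agnostic to the classifier model. The technique works with any text classifier, because we treat the classifier as a black box and use only its predictions to derive the explanation.

Getting ready

We will use the lime library for explainability. If you want to work from an existing notebook, you can use 9.7_explanability_via_classifier.ipynb from the code site.

How to do it...

In this recipe, we will reuse the classifier we built in the Transformers chapter and use it to generate sentiment predictions. We will call this classifier many times on perturbed versions of the input to learn how much each word contributes to the sentiment. Let's get started:

Do the necessary imports: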

import numpy as np
import torch
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

In this step, we initialize the device and the sentiment classification pipeline. For more details on this step, refer to Chapter 8.

device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")
roberta_pipe = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",
    tokenizer="siebert/sentiment-roberta-large-english",
    top_k=1,
    device=device
)

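In this step, we initialize a sample text passage and set the print options. The print options let us print the output of later steps in an easily readable format: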

sample_text = "I really liked the Oppenheimer movie and found it truly entertaining and full of substance."
np.set_printoptions(suppress = True,
    formatter = {'float_kind':'{:f}'.format},
    precision = 2)

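In this step, we create a wrapper function for sentiment classification. The explainer uses this function to call the classification pipeline many times and measure the contribution of each word in the passage: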

def predict_prob(texts):
    preds = roberta_pipe(texts)
    preds = np.array([
        [label[0]['score'], 1 - label[0]['score']]
        if label[0]['label'] == 'NEGATIVE'
        else [1 - label[0]['score'], label[0]['score']]
        for label in preds
    ])
    return preds
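To see what the wrapper does without loading the model, the sketch below applies the same conversion to a hypothetical pipeline output. With top_k=1, each prediction is a one-element list holding the winning label and its score, and the wrapper expands it into a row of [P(NEGATIVE), P(POSITIVE)]. The scores and the helper name `to_neg_pos` are made up for illustration.

```python
def to_neg_pos(preds):
    """Convert top-1 predictions into [P(NEGATIVE), P(POSITIVE)] rows."""
    rows = []
    for pred in preds:
        top = pred[0]  # with top_k=1, each item is a one-element list
        if top["label"] == "NEGATIVE":
            rows.append([top["score"], 1 - top["score"]])
        else:
            rows.append([1 - top["score"], top["score"]])
    return rows

# Hypothetical pipeline output for two input texts.
fake_preds = [
    [{"label": "POSITIVE", "score": 0.75}],
    [{"label": "NEGATIVE", "score": 0.5}],
]
rows = to_neg_pos(fake_preds)
print(rows)
```

Keeping the column order fixed matters, because the explainer maps column 0 and column 1 to the class names in the order they are passed to it.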

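In this step, we instantiate the LimeTextExplainer class and call its explain_instance method. The method needs the sample text and the classifier wrapper function. The wrapper passed to it is expected to accept a list of strings, one per perturbed version of the text, and to return the class probabilities for each of them. In this case, our wrapper returns the probabilities of the negative and positive classes, in that order: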
```
explainer = LimeTextExplainer(
    class_names=['NEGATIVE', 'POSITIVE'])
exp = explainer.explain_instance(
    text_instance=sample_text,
    classifier_fn=predict_prob)
```
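The idea of probing a black-box classifier with perturbed inputs can be sketched with a simpler technique than the one LIME uses. The code below removes one word at a time and records how the positive probability changes; LIME instead samples many random subsets of words and fits a local linear model to the results. The word lists and the toy classifier are hypothetical stand-ins for the RoBERTa pipeline.

```python
POSITIVE_WORDS = {"liked", "entertaining", "good"}
NEGATIVE_WORDS = {"boring", "slow"}

def toy_positive_prob(text):
    """Hypothetical stand-in classifier: P(POSITIVE) from word counts."""
    words = text.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (1 + pos) / (2 + pos + neg)

def leave_one_out(text, prob_fn):
    """Score each word by how much removing it lowers P(POSITIVE)."""
    words = text.split()
    base = prob_fn(text)
    contributions = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        contributions.append((word, base - prob_fn(perturbed)))
    # Largest absolute effect first, whatever its sign.
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)

ranking = leave_one_out("I liked the movie but found it slow", toy_positive_prob)
print(ranking)
```

A positive value means the word pushes the prediction toward POSITIVE, and a negative value means it pushes toward NEGATIVE, which mirrors how the signed weights in the LIME output are read.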
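In this step, we print the class probabilities for the sample text. As we can observe, the classifier assigns the sample text a positive sentiment: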
```
original_prediction = predict_prob(sample_text)
print(original_prediction)
```

[[0.001083 0.998917]]

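In this step, we print the explanation. As the per-word weights show, the words liked and entertaining contribute the most to the positive class. A few words pull against the positive sentiment, but overall the sentence is classified as positive: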
```
print(np.array(exp.as_list()))
```

[['liked' '0.02466976195824297']
 ['entertaining' '0.023293546246506702']
 ['and' '0.018718510660163126']
 ['truly' '0.015312955730851004']
 ['Oppenheimer' '-0.012689413190611268']
 ['substance' '0.011282896692531665']
 ['of' '-0.007935237702088416']
 ['movie' '0.00665836523527015']
 ['it' '0.004033408096240486']
 ['found' '0.003214157926470171']]

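Let's initialize another text, this time with a negative sentiment:

modified_text = "I found the Oppenheimer movie very slow, boring and veering on being too scientific."

Get the class probabilities that the classifier predicts for the new text and print them: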

new_prediction = predict_prob(modified_text)
print(new_prediction)

[[0.999501 0.000499]]

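Use the explainer instance to evaluate the text and print the contribution of each word to the negative sentiment. We observe that the words boring and slow contribute the most to the negative sentiment: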
```
exp = explainer.explain_instance(
    text_instance=modified_text,
    classifier_fn=predict_prob)
print(np.array(exp.as_list()))
```

[['boring' '-0.1541527292742657']
 ['slow' '-0.13677434672789646']
 ['too' '-0.07536450832681185']
 ['veering' '-0.06154593708589755']
 ['Oppenheimer' '-0.021333762714731672']
 ['found' '0.015601753307753232']
 ['movie' '0.011810474276051267']
 ['I' '0.01014260838624105']
 ['the' '-0.008070326804220167']
 ['scientific' '-0.006083605323956207']]

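There's more...

Now that we have seen how to explain word contributions in sentiment classification, we would like to improve the recipe further by providing a visual representation of the explanation:

Continuing from step 5 of the recipe, we can also plot the explanation using pyplot: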

exp = explainer.explain_instance(text_instance=sample_text,
    classifier_fn=predict_prob)
_ = exp.as_pyplot_figure()

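Figure 9.1 – Probability contribution of each word in the sentence to the final class

We can also highlight the exact words in the text. Each word's contribution is shown in a lighter or darker shade of the color of the class it supports. Words that contribute to the assigned class, POSITIVE in this case, are highlighted in orange, while words highlighted in blue contribute to the NEGATIVE class: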
```
exp.show_in_notebook()
```

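Figure 9.2 – Highlighted class association of each word

Enhancing explainability via text generation

In this recipe, we will learn how to understand the inferences emitted by a classifier through text generation. We will use the same classifier as in the Enhancing explainability via a classifier-invariant approach recipe. To better understand how the classifier behaves in a randomized setting, we will replace words in the input sentence with different tokens.

Getting ready

To complete this recipe, spaCy must be installed in your environment, and the en_core_web_sm pipeline must be downloaded before you start the steps that follow: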

python3 -m spacy download en_core_web_sm

If you want to work from an existing notebook, you can use 9.8_explanability_via_generation.ipynb from the code site.

How to do it

Let's get started:

Do the necessary imports:

import numpy as np
import spacy
import time
import torch
from anchor import anchor_text
from transformers import pipeline

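In this step, we initialize the spaCy pipeline with the en_core_web_sm model. This pipeline contains components such as tok2vec, tagger, parser, ner, and lemmatizer, and is optimized for the CPU: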
```
nlp = spacy.load('en_core_web_sm')
```

In this step, we initialize the device and our classifier. We use the same sentence classifier as in the Enhancing explainability via a classifier-invariant approach recipe. The idea is to study the same classifier and observe how its classification behaves on the different inputs generated by the anchor explainability library:

```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",
    tokenizer="siebert/sentiment-roberta-large-english",
    top_k=1,
    device=device)
```

In this step, we define a function that takes a list of sentences and emits a list of POSITIVE or NEGATIVE labels for them, encoded as 1 and 0. The function internally calls the classifier initialized in the previous step:

```
def predict_prob(texts):
    preds = classifier(texts)
    preds = np.array([
        0 if label[0]['label'] == 'NEGATIVE' else 1
        for label in preds])
    return preds
```

In this step, we initialize the explainer with the spaCy pipeline, the class labels, and use_unk_distribution set to True. The class labels in this case are NEGATIVE and POSITIVE. The use_unk_distribution parameter specifies that the explainer uses the UNK token for masked words when it generates text for explainability:

```
explainer = anchor_text.AnchorText(
    nlp, ['NEGATIVE', 'POSITIVE'], use_unk_distribution=True)
```

In this step, we initialize a piece of text. We pass the sentence to the predict_prob method to predict its class and print the prediction:
```
text = 'The little mermaid is a good story.'
pred = explainer.class_names[predict_prob([text])[0]]
print('Prediction: %s' % pred)
Prediction: POSITIVE
```
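In this step, we call the explain_instance method on the explainer instance. We pass it the input sentence, the predict_prob method, and a threshold parameter. The explainer uses predict_prob to call the classifier on different variations of the input sentence and determine which words contribute the most. It also records the class label emitted when some words of the input sentence are replaced by the UNK token. The threshold parameter sets the minimum precision the anchor must reach: among the perturbed sentences that keep the anchor words, at least this fraction must receive the same label as the original sentence:

```
exp = explainer.explain_instance(text, predict_prob, threshold=0.95)
```

We print the anchor words that contribute the most to the positive label in this case, along with the precision measured by the explainer. We observe that it identifies the words good, a, and is as contributing the most to the positive classification: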
```
print('Anchor: %s' % (' AND '.join(exp.names())))
print('Precision: %.2f' % exp.precision())
```

Anchor: good AND a AND is
Precision: 1.00
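The precision reported here can be pictured with a small simulation. The sketch below keeps a candidate anchor fixed, replaces each of the other words with UNK at random, and counts how often the label stays the same. It is a simplified illustration with a hypothetical toy classifier and a fixed masking probability of 0.5; the anchor library additionally searches over candidate anchors, which is omitted here.

```python
import random

def toy_classifier(text):
    """Hypothetical stand-in: label 1 (POSITIVE) only if 'good' is present."""
    return 1 if "good" in text.split() else 0

def anchor_precision(words, anchor, classify, n_samples=200, seed=0):
    """Share of perturbed sentences that keep the original label.

    Anchor words are always kept; every other word is replaced by UNK
    with probability 0.5.
    """
    rng = random.Random(seed)
    original_label = classify(" ".join(words))
    same = 0
    for _ in range(n_samples):
        perturbed = [
            w if (w in anchor or rng.random() < 0.5) else "UNK"
            for w in words
        ]
        same += classify(" ".join(perturbed)) == original_label
    return same / n_samples

words = "The little mermaid is a good story .".split()
with_good = anchor_precision(words, {"good"}, toy_classifier)
without_good = anchor_precision(words, {"story"}, toy_classifier)
print(with_good, without_good)
```

An anchor that contains the decisive word keeps the label in every sample, while an anchor without it does so only about half the time. A precision threshold such as 0.95 is what separates the two cases.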

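We print some of the sentences that the explainer found to lead to a positive classification. The explainer perturbs the input sentence by replacing one or more words with the UNK token and calls the classifier on the perturbed sentences. This reveals some interesting behavior of the classifier. For example, the sentence UNK UNK is a good story UNK is labeled positive, which shows that the title of the story is irrelevant to the classification. Another interesting example is the sentence UNK mermaid is a good UNK UNK. Here we observe that the classifier is invariant to the object being described, which is the story in this case: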
```
print('\n'.join([x[0] for x in exp.examples(
    only_same_prediction=True)]))
```

The little UNK is a good UNK .
The UNK mermaid is a good story .
The UNK UNK is a good story UNK
UNK little mermaid is a good story UNK
The UNK mermaid is a good UNK .
UNK little UNK is a good UNK .
The little mermaid is a good story UNK
The UNK UNK is a good UNK .
The little UNK is a good UNK .
The little mermaid is a good UNK .

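As in the previous step, we now ask the explainer to print sentences that lead to a negative classification. In this case, the explainer cannot produce any negative examples, because it can only perturb the input sentence with the UNK token. Since UNK carries neither a positive nor a negative sentiment, inserting it gives no way of steering the classifier toward a negative classification. This step produces no output: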
```
print('\n'.join([x[0] for x in exp.examples(
    only_different_prediction=True)]))
```
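So far, we have used the UNK token to change, or perturb, the input to the classifier. The presence of UNK tokens makes the text look unnatural. To understand the classifier better, it is useful to enumerate natural sentences and see how they affect the classification. We will use BERT to perturb the input and have the explainer generate natural sentences. This helps us understand how the results differ in the context of natural sentences: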
```
explainer = anchor_text.AnchorText(nlp, 
    ['negative', 'positive'],
    use_unk_distribution=False)
exp = explainer.explain_instance(text, 
    predict_prob, threshold=0.95)
```
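We now print some sentences that the classifier labels as positive. In this case, we observe that the sentences generated by the explainer are natural. For example, the generated sentence my little mermaid tells a good story . replaces the word the from the original sentence with my. The replacement word is generated by BERT. BERT uses the encoder part of the Transformer architecture and is trained to predict words that have been masked out of a sentence. Here, the explainer masks individual words of the input sentence and uses BERT to generate replacements. Since the underlying text generation model is probabilistic, your output may differ from the following and may vary between runs: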
```
print('\n'.join([x[0] for x in exp.examples(
    only_same_prediction=True)]))
```

the weeping mermaid gives his good story .
Me ##rmaid mermaid : a good story .
rainbow moon mermaid theater " good story "
my little mermaid tells a good story .
Pretty little mermaid tells a good story .
My black mermaid song sweet good story ;
" little mermaid : very good story .
This damned mermaid gives a good story .
| " mermaid " : good story .
Me ##rmaid mermaid : very good story .

We now print some sentences that the classifier labels as negative. Not all of them express a negative sentiment, but a fair number do:
```
print('\n'.join([x[0] for x in exp.examples(
    only_different_prediction=True)]))
```

' til mermaid brings a good story …
only little mermaid : too good story ##book
smash hit mermaid with any good story ...
nor did mermaid tell a good story !
† denotes mermaid side / good story .
no native mermaid has a good story .
no ordinary mermaid is a good story .
Very little mermaid ain any good story yet
miss rainbow mermaid made a good story .
The gorgeous mermaid ain your good story (

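Chapter 10: Generative AI and Large Language Models

In this chapter, we will explore recipes that use the generative side of transformer models to generate text. As we mentioned in Chapter 8, Transformers and Their Applications, this generative capability comes from the decoder component of the transformer network. The decoder is responsible for generating text based on the context it is given.

With the advent of the Generative Pre-trained Transformer (GPT) family of large language models (LLMs), these models have grown in size and capability with every release. GPT-4, for example, was trained on a massive text corpus and matches or surpasses state-of-the-art models on many NLP tasks. Building on their generative capability, these LLMs can also take prompts from humans and generate text in response.

We will use generative models based on the transformer architecture for our recipes.

This chapter contains the following recipes:

Running an LLM locally
Running an LLM to follow instructions
Augmenting an LLM with external data
Augmenting an LLM with external content
Creating a chatbot using an LLM
Generating code using an LLM
Generating a SQL query from human-defined requirements
Agents – making an LLM reason and act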

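Technical requirements

The code for this chapter is in the folder named Chapter10 in the book's GitHub repository (https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter10).

As in the previous chapters, the packages required for this chapter are part of the poetry/pip requirements configuration files in the repository. We recommend setting up the environment beforehand.

Model access

In this chapter, we will use models from Hugging Face and OpenAI. The following instructions enable access to the various models used in this chapter.

Hugging Face Mistral model: Create the necessary credentials on the Hugging Face site so that the model can be used or downloaded from code. The model details are at https://huggingface.co/mistralai/Mistral-7B-v0.3. You need to request access to the model on the site before running the recipes that use it.

Hugging Face Llama model: Create the necessary credentials on the Hugging Face site so that the model can be used or downloaded from code. The Llama 3.1 model details are at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct. You must request access to the model on the site before running the recipes that use it.

In the code snippets, we use Jupyter as the execution environment. If you use the same, you will see something like the screenshot shown here. Enter the token in the text field and let the recipe proceed. The recipe waits for the token the first time only; later runs use the token that the Hugging Face library caches locally for the user.

Figure 10.1 – Copying a token from Hugging Face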

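OpenAI models: The recipes that use OpenAI models require an api-token. In the code snippets, we use Jupyter as the execution environment. If you use the same, you will see a text box in which you need to enter the api-token. Enter it in the text field and let the recipe proceed. The recipe waits until the token has been entered.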

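Running an LLM locally

In this recipe, we will learn how to load an LLM locally using the CPU or GPU and generate text from it after giving it a starting text as seed input. A local LLM can also be instructed to generate text based on a prompt. This newer paradigm of generating text through instruction prompts is what has brought LLMs to prominence recently. Running the model locally gives you control over hardware resources and environment settings, lets you optimize performance, and allows rapid experimentation or prototyping from seed inputs. It also improves data privacy and security, reduces dependence on external cloud services, and enables cost-effective deployment for educational and practical applications. Since we run the LLM locally as part of this recipe, we will use instruction prompts to have it generate text from simple instructions.

Getting ready

We recommend a system with at least 16 GB of RAM, or a system with a GPU that has at least 8 GB of VRAM. These examples were created on a system with 8 GB of RAM and an NVIDIA RTX 2070 GPU with 8 GB of VRAM. They also work without a GPU as long as the system has 16 GB of RAM. In this recipe, we load the Mistral-7B model using the Hugging Face (https://huggingface.co/docs) libraries. The model is smaller than other language models of its class, yet it can outperform them on several NLP tasks. With 7 billion network parameters, Mistral-7B can outperform the 13-billion-parameter Llama 2 model.

You need to create the necessary credentials on the Hugging Face site so that the model can be used or downloaded from code. Refer to Model access under the Technical requirements section for the steps to get access to the Mistral model. Note that, given the compute requirements of this recipe, generating the text may take a few minutes. If the required compute capacity is not available, we recommend referring to the Using OpenAI models instead of local models section at the end of this chapter and following the approach described there to run this recipe with an OpenAI model.

How to do it...

Do the necessary imports: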

from transformers import (
    AutoTokenizer, AutoModelForCausalLM, GenerationConfig)
import torch

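In this step, we set up the login for Hugging Face. Although the token can be set directly in the code, we recommend setting it in an environment variable and reading it in the notebook. Calling the login method with the token authorizes calls to Hugging Face and lets the code download the model locally and use it:
```
import os
from huggingface_hub import login
hf_token = os.environ.get('HUGGINGFACE_TOKEN')
login(token=hf_token)
```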
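In this step, we initialize the device, the mistralai/Mistral-7B-v0.3 model, and its tokenizer. We set device_map to auto, which lets the library pick the available device. We set load_in_4bit to True, which loads a quantized model for the inference (or generation) step. A quantized model consumes less memory and lets us load the model locally on systems with limited memory. The AutoModelForCausalLM module handles the quantization: it downloads the model from the Hugging Face hub and quantizes the weights to the bit width given in the parameter while loading them. The tokenizer is loaded from the same checkpoint as the model, so that their vocabularies match:
```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3", device_map="auto",
    load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    padding_side="left")
```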
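In this step, we initialize a generation configuration. It is passed to the model and tells it how to generate text. We set num_beams to 4. As the number of beams increases, the generated text tends to become more coherent and grammatically correct; however, more beams also mean a longer decoding (or text generation) time. We set early_stopping to True, so that generation ends as soon as the number of completed candidate sequences reaches the value given in num_beams. We set both eos_token_id and pad_token_id to the model's end-of-sequence token ID (for GPT-2, for example, this ID is 50256). These token IDs tell the model which tokens to use for end-of-sentence and padding. The max_new_tokens parameter specifies the maximum number of tokens to generate. Many more parameters are available for text generation, and we encourage you to experiment with the values set here as well as with any additional parameters for customizing generation. For more information, refer to the GenerationConfig class in the transformers source at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/configuration_utils.py: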
```
generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    max_new_tokens=900,
)
```
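The effect of num_beams can be seen on a toy example. The sketch below runs beam search over a hypothetical table of next-token probabilities. The table, the token names, and the helper `beam_search` are made up for illustration, and the real generate method is considerably more elaborate (length penalties, batching, and so on).

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_new_tokens=5):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        # Early stopping: enough complete hypotheses, or nothing left to extend.
        if len(finished) >= num_beams or not beams:
            break
    return max(finished or beams, key=lambda c: c[0])[1]

greedy = beam_search(num_beams=1)
wider = beam_search(num_beams=2)
print(greedy, wider)
```

With a single beam, the search commits to the locally best first token and ends up with a sequence of probability 0.3. With two beams, it keeps the second-best first token alive and finds a sequence of probability 0.36, at the cost of scoring more candidates per step.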
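In this step, we initialize a seed sentence. It serves as the prompt to the model, asking it to generate a step-by-step method for making an apple pie: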
```
seed_sentence = "Step by step way on how to make an apple pie:"
```
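In this step, we tokenize the seed sentence, which converts the text into the corresponding token IDs, and pass them to the model to generate text. We also pass it the generation_config instance. The model emits token IDs as its output: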
```
model_inputs = tokenizer(
    [seed_sentence], return_tensors="pt").to(device)
generated_ids = model.generate(**model_inputs,
    generation_config=generation_config)
```
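In this step, we decode the token IDs generated in the previous step. Transformer models use special tokens, such as the beginning- and end-of-sequence markers, which can appear among the generated IDs. We set skip_special_tokens to True, which omits these special tokens so that the output is plain text. We print the decoded (or generated) text: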
```
generated_tokens = tokenizer.batch_decode(generated_ids,
    skip_special_tokens=True)[0]
print(generated_tokens)
```
The output will be similar to the following. We have shortened it for brevity; you may see a longer result:

Step by step way on how to make an apple pie:
1\. Preheat the oven to 350 degrees Fahrenheit.
2\. Peel and core the apples.
3\. Cut the apples into thin slices.
4\. Place the apples in a large bowl.
5\. Add the sugar, cinnamon, and nutmeg to the apples.
6\. Stir the apples until they are evenly coated with the sugar and spices.
7\. Pour the apples into a pie dish.
8\. Place the pie dish on a baking sheet.
9\. Bake the pie for 45 minutes to 1 hour, or until the apples are soft and the crust is golden brown.
10\. Remove the pie from the oven and let it cool for 10 minutes before serving.
## How do you make an apple pie from scratch?
To make an apple pie from scratch, you will need the following ingredients:
- 2 cups of all-purpose flour
- 1 teaspoon of salt
- 1/2 cup of shortening
- 1/2 cup of cold water
- 4 cups of peeled, cored, and sliced apples
- 1 cup of sugar
- 1 teaspoon of cinnamon
- 1/4 teaspoon of nutmeg
- 1/4 teaspoon of allspice
- 2 tablespoons of cornstarch
- 1 tablespoon of lemon juice
To make the pie crust, combine the flour and salt in a large bowl. Cut in the shortening with a pastry blender or two knives until the mixture resembles coarse crumbs. Add the cold water, 1 tablespoon at a time, until the dough comes together. Divide the dough in half and shape each half into a disk. Wrap the disks in plastic wrap and refrigerate for at least 30 minutes.

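Running an LLM to follow instructions

In this recipe, we will learn how to get an LLM to follow instructions through prompting. An LLM can be given some context and asked to generate text based on that context, which is a fairly novel capability. It can be instructed to generate text according to explicit user requirements. This capability widens the range of use cases and applications that can be built. The context and the question to be answered can be generated dynamically and applied to use cases ranging from answering simple math problems to extracting complex data from a knowledge base.

We will use the meta-llama/Meta-Llama-3.1-8B-Instruct model for this recipe. It is built on the meta-llama/Meta-Llama-3.1-8B model and tuned to follow instructions given through prompts.

Getting ready

You need to create the necessary credentials on the Hugging Face site so that the model can be used or downloaded from code. Refer to Model access under the Technical requirements section for the steps to get access to the Llama model.

If you want to work from an existing notebook, you can use 10.2_instruct_llm.ipynb from the code site. Note that, given the compute requirements of this recipe, generating the text may take a few minutes. If the required compute capacity is not available, we recommend referring to the Using OpenAI models instead of local models section at the end of this chapter and following the approach described there to run this recipe with an OpenAI model.

How to do it…

The recipe does the following:

It initializes an LLM so that it is loaded into memory.
It initializes a prompt that instructs the LLM to perform a task, which is answering a question.
It sends the prompt to the LLM and asks it to generate an answer.

The steps for the recipe are as follows:

Do the necessary imports:

from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, GenerationConfig, pipeline)
import os
import torch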

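Set up the login for Hugging Face. Set the Hugging Face token in an environment variable and read it into a local variable. Calling the login method with the token authorizes calls to Hugging Face and lets the code download the model locally and use it. You will see a login window similar to the one shown in the Running an LLM locally recipe in this chapter: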
```
from huggingface_hub import login
hf_token = os.environ.get('HUGGINGFACE_TOKEN')
login(token=hf_token)
```
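In this step, we specify the model name. We also define the quantization configuration. Quantization is a technique that reduces the internal LLM network weights to a lower precision. This lets us load the model on systems with limited CPU or GPU memory, whereas loading an LLM at the default precision requires a large amount of CPU/GPU memory. In this case, we use the load_in_4bit parameter of the BitsAndBytesConfig class to load the network weights in four bits. The other parameters used are described as follows:
- bnb_4bit_compute_dtype: This parameter specifies the data type used during computation. Although the network weights are stored in four bits, computation still happens in 16 or 32 bits, as defined by this parameter. Setting it to torch.float16 results in a speedup in some cases.
- bnb_4bit_use_double_quant: This parameter specifies that nested quantization should be used. A second quantization is performed on top of the first, which can save an additional 0.4 bits per parameter in the network and so reduces the memory the model needs.
- bnb_4bit_quant_type: The value nf4 selects the 4-bit NormalFloat data type, which is designed for weights that follow a normal distribution. We set it to nf4 to stay consistent with how the model weights are distributed.
For the concepts behind quantization, we recommend the blog post at https://huggingface.co/blog/4bit-transformers-bitsandbytes, which explains them in more detail. Note that loading the model in 4-bit requires a GPU: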
```
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type= "nf4"
    )
```
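To build intuition for what 4-bit storage does to the weights and to memory, here is a minimal sketch of uniform absmax quantization together with a back-of-the-envelope memory estimate. It is deliberately simpler than what bitsandbytes does: NF4 uses non-uniformly spaced levels and quantizes weights block by block, and the helper names and sample weights are assumptions.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit codes in [-7, 7] using absmax scaling."""
    scale = max(abs(w) for w in weights) / 7
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0, -0.7]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
print(codes, [round(w, 2) for w in restored])

# Rough weight-memory estimate for an 8-billion-parameter model.
params = 8_000_000_000
gib_16bit = params * 16 / 8 / 2**30
gib_4bit = params * 4 / 8 / 2**30
print(round(gib_16bit, 1), round(gib_4bit, 1))
```

Each weight is stored as a small integer code plus a shared scale, so some precision is lost (0.12 comes back as roughly 0.1 here), but the weights take about a quarter of the memory they need at 16 bits, which is what makes an 8-billion-parameter model fit on an 8 GB GPU.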

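In this step, we load the meta-llama/Meta-Llama-3.1-8B-Instruct model and the corresponding tokenizer. We pass the quantization configuration defined in the previous step so that its settings take effect:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)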

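In this step, we initialize a pipeline that combines the model and the tokenizer, along with some additional parameters. These parameters are described in the Running an LLM locally recipe in this chapter, and we recommend referring to it for more details. We add one more parameter here, repetition_penalty, which discourages the LLM from getting into a state where it repeats itself or earlier parts of the generated text: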
```
pipe = pipeline("text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    pad_token_id = tokenizer.eos_token_id,
    eos_token_id=model.config.eos_token_id,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.4)
```
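A repetition penalty can be sketched as a small adjustment to the model's scores before the next token is chosen. The rule below, which divides positive logits and multiplies negative logits of already generated tokens by the penalty, follows our understanding of how the transformers library applies it; the logits, token IDs, and helper name are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already generated tokens less likely to be chosen again."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink positive logits
        else:
            adjusted[token_id] *= penalty  # push negative logits further down
    return adjusted

logits = [2.8, 1.0, -0.5, 0.3]  # hypothetical scores for a 4-token vocabulary
generated = [0, 2, 0]           # token IDs produced so far
adjusted = apply_repetition_penalty(logits, generated, penalty=1.4)
print(adjusted)
```

Tokens 0 and 2 have already been generated, so their scores drop, while the scores of unseen tokens stay untouched. A penalty of 1.0 disables the adjustment, and larger values discourage repetition more strongly.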
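In this step, we create a prompt that sets up an instruction context to pass to the LLM. The LLM acts on the instructions set in the prompt. In this case, we start our instruction with a conversation between the user and the assistant. The conversation opens with the question What is your favourite country?. It is followed by the model's answer, Well, I am quite fascinated with Peru. We then continue with another instruction by asking What can you tell me about Peru?. This approach gives the LLM a template from which to learn our intent and to answer the follow-up question according to the pattern set in the instruction prompt: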
```
prompt = [
    {"role": "user", "content": "What is your favourite country?"},
    {"role": "assistant", "content": "Well, I am quite fascinated with Peru."},
    {"role": "user", "content": "What can you tell me about Peru?"}
]
```
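In this step, we run the pipeline with the prompt. We also specify the maximum number of tokens to generate as output, which explicitly tells the LLM to stop generating once that length is reached: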
```
outputs = pipe(
    prompt,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1]['content'])
```
This results in the following output:

Peru! A country with a rich history, diverse culture, and breathtaking landscapes. Here are some interesting facts about Peru:
1\. **Location**: Peru is located in western South America, bordering the Pacific Ocean to the west, Ecuador and Colombia to the north, Brazil and Bolivia to the east, and Chile to the south.
2\. **History**: Peru has a long and complex history, with various civilizations rising and falling over the centuries. The Inca Empire, which flourished from the 13th to the 16th century, is one of the most famous and influential empires in Peruvian history.
3\. **Machu Picchu**: One of the Seven Wonders of the World, Machu Picchu is an Inca citadel located on a mountain ridge above the Urubamba Valley. It's a must-visit destination for any traveler to Peru.
4\. **Food**: Peruvian cuisine is a fusion of indigenous, Spanish, African, and Asian influences. Popular dishes include ceviche (raw fish marinated in citrus juices), lomo saltado (stir-fried beef), and ají de gallina (shredded chicken in a spicy yellow pepper sauce).
5\. **Language**: The official language is Spanish, but many

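There's more...

Now that we have seen how to instruct the model to generate text, we only need to change the prompt to have the model generate text for an entirely different question. Let's change the prompt to the following and use the same recipe to generate text from the updated prompt: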

prompt = [
    {"role": "user", "content": "Mary is twice as old as Sarah presently. Sarah is 6 years old.?"},
    {"role": "assistant", "content": "Well, what can I help you with?"},
    {"role": "user", "content": "Can you tell me in a step by step way on how old Mary will be after 5 years?"}]

This results in the following output:

**Step 1: Determine Sarah's current age**
Sarah is 6 years old.
**Step 2: Determine Mary's current age**
Since Mary is twice as old as Sarah, and Sarah is 6 years old, we can multiply Sarah's age by 2 to find Mary's age:
Mary's age = 2 x Sarah's age
Mary's age = 2 x 6
Mary's age = 12 years old
**Step 3: Calculate Mary's age after 5 years**
To find out how old Mary will be after 5 years, we add 5 to her current age:
Mary's age after 5 years = Mary's current age + 5
Mary's age after 5 years = 12 + 5
Mary's age after 5 years = 17 years old
Therefore, Mary will be 17 years old after 5 years.

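As the preceding output shows, the model followed the instructions closely: it reasoned through the steps and answered the question correctly. This recipe relied only on the knowledge stored in the LLM itself. LLMs are trained on large text corpora and generate answers from what they learned there. In the next recipe, we will learn how to augment an LLM's knowledge.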

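# Augmenting the LLM with external data

In the following recipes, we will learn how to get an LLM to answer questions about material it was not trained on, such as information created after the model was trained. New content is added to the internet every day, and no LLM can be retrained daily to keep up. The retrieval-augmented generation (RAG) framework lets us augment the LLM with additional content that is sent to it as part of the input. This also saves cost, because we do not have to spend time and compute retraining the model on updated content. As a basic introduction to RAG, we will augment the LLM with content from a few web pages and ask questions about that content. We first load the LLM and ask it some questions without providing any context. We then supply the additional context and ask the same questions again. Comparing the answers shows how much an LLM gains when it is combined with augmented content.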

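# Executing a simple prompt-to-LLM chain

In this recipe, we will create a simple prompt that can be used to instruct an LLM. A prompt is a template with placeholder values that are filled in at runtime. The LangChain framework lets us combine a prompt and an LLM, along with other components, to generate text. We will explore these techniques in this recipe and in some of the recipes that follow.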

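## Getting ready

We must create the necessary credentials on the Hugging Face website so that the model can be used or downloaded from code. Please refer to *Model access* under the *Technical requirements* section to complete the steps for accessing the Llama model.

In this recipe, we will use the LangChain framework (https://www.langchain.com/) and demonstrate its capabilities with examples based on the LangChain Expression Language (LCEL). Let us start with a simple LangChain recipe and build on it in the recipes that follow. The first part of this recipe is very similar to the previous one; the only difference is the use of the LangChain framework.

You can use the `10.3_langchain_prompt_with_llm.ipynb` notebook from the code site if you want to work from an existing notebook. Note that, because of the compute requirements of this recipe, generating text may take a few minutes. If the required compute is not available, we recommend that you refer to the *Using OpenAI models instead of local models* section at the end of this chapter and follow the method described there to complete this recipe with an OpenAI model.

## How to do it…

The recipe does the following things:

*   It initializes an LLM and loads it into memory.
*   It initializes a prompt that instructs the LLM to perform a task. The task here is answering a question.
*   It sends the prompt to the LLM and asks it to generate an answer. All of this is done through the LangChain framework.

The steps for the recipe are as follows:

Do the necessary imports: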

from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, 
    pipeline)
import torch

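In this step, we set the model name and the quantization configuration. Quantization is covered in more detail in the *Running an LLM to follow instructions* recipe; please refer to it for details. We will use the meta-llama/Meta-Llama-3.1-8B-Instruct model, released by Meta in July 2024. It is reported to outperform some larger models on a number of NLP tasks: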
```
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type= "nf4")
```
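To get a rough feel for why 4-bit loading matters, the following back-of-the-envelope sketch estimates the memory needed for the weights of a model with about 8 billion parameters at different precisions. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores activations, the KV cache, and the small overhead of the quantization constants.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of bfloat16 weights, which is what makes an 8B model practical on a single consumer GPU.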

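In this step, we initialize the model. Loading a model and tokenizer with the Transformers library is covered in detail in Chapter 8; to avoid repetition, please refer to that chapter for more details: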

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

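In this step, we initialize the pipeline. The pipeline construct from the transformers library is covered in detail in Chapter 8; to avoid repetition, please refer to that chapter for more details: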
```
pipe = pipeline("text-generation",
    model=model, tokenizer=tokenizer, max_new_tokens=500,
    pad_token_id = tokenizer.eos_token_id)
```
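In this step, we initialize a chat prompt template of the ChatPromptTemplate type. The from_messages method takes a list of (message type, template) tuples. The second tuple in the list contains the {input} placeholder, which indicates that its value will be passed in later: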
```
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a great mentor."),
    ("user", "{input}")])
```
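The placeholder mechanism can be illustrated without LangChain. The following is a minimal sketch that uses plain string formatting; `format_messages` is a hypothetical helper written for this illustration and is not part of any library.

```python
# Each entry is (role, template); {input} is filled in at call time.
MESSAGES = [
    ("system", "You are a great mentor."),
    ("user", "{input}"),
]

def format_messages(templates, **values):
    """Return a list of role/content dicts with placeholders substituted."""
    return [{"role": role, "content": text.format(**values)}
            for role, text in templates]

filled = format_messages(
    MESSAGES, input="how can I improve my software engineering skills?")
for message in filled:
    print(f"{message['role']}: {message['content']}")
```

The template itself stays unchanged, so the same message list can be filled with a different question on every call.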
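In this step, we initialize an output parser of the StrOutputParser type, which converts the message returned by the LLM into a string. We also wrap the pipeline in a HuggingFacePipeline instance so that it can be used as the LLM in a chain:
```
output_parser = StrOutputParser()
hf = HuggingFacePipeline(pipeline=pipe)
```
Next, we initialize an instance of the chain. A chain passes the output of one component to the next. Here, the filled-in prompt is sent to the LLM, which operates on it and returns a message. The message is then sent to output_parser, which converts it into a string. In this step, we only set up the components of the chain:
```
# hf is the HuggingFacePipeline instance created in the previous step
chain = prompt | hf | output_parser
```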
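How the `|` operator turns components into a chain can be sketched in a few lines. The `Step` class below is a hypothetical stand-in, not LangChain's implementation; it only shows the idea that composing steps builds a new step and that nothing runs until `invoke` is called.

```python
class Step:
    """A tiny stand-in for a chain component (hypothetical, not LangChain's class)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

fill_prompt = Step(lambda d: f"Question: {d['input']}")
fake_llm = Step(lambda text: {"content": text.upper()})  # pretend model
to_string = Step(lambda message: message["content"])

# Building the chain only wires the steps together.
toy_chain = fill_prompt | fake_llm | to_string
print(toy_chain.invoke({"input": "what is a chain?"}))
```

The pretend model simply upper-cases its input, which makes it easy to see that each step received the output of the one before it.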
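In this step, we invoke the chain and print the result. We pass the input argument as a dictionary; because the prompt template contains the {input} placeholder, the value is substituted into the template when the chain is invoked. The chain then sends the filled-in prompt to the LLM, which generates an answer to the question. As the output shows, the advice given in this example is sound. We have clipped the output for brevity, so you may see a longer response: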
```
result = chain.invoke(
    {"input": "how can I improve my software engineering skills?"})
print(result)
```

System: You are a great mentor.
Human: how can I improve my software engineering skills?
System: Let's break it down. Here are some suggestions:
1\. **Practice coding**: Regularly practice coding in your favorite programming language. Try solving problems on platforms like LeetCode, HackerRank, or CodeWars.
2\. **Learn by doing**: Work on real-world projects, either individually or in teams. This will help you apply theoretical concepts to practical problems.
3\. **Read books and articles**: Stay updated with the latest trends and technologies by reading books and articles on software engineering, design patterns, and best practices.
4\. **Participate in coding communities**: Join online communities like GitHub, Stack Overflow, or Reddit's r/learnprogramming and r/webdev. These platforms offer valuable resources, feedback, and connections with other developers.
5\. **Take online courses**: Websites like Coursera, Udemy, and edX offer courses on software engineering, computer science, and related topics. Take advantage of these resources to fill knowledge gaps.
6\. **Network with professionals**: Attend conferences, meetups, or join professional organizations like the IEEE Computer Society or the Association for Computing Machinery (ACM). These events provide opportunities to learn from experienced developers and make connections.
7\. **Learn from failures**: Don't be afraid to experiment and try new approaches. Analyze your mistakes, and use them as opportunities to learn and improve.
8\. **Stay curious**: Continuously seek out new knowledge and skills. Explore emerging technologies, and stay updated with the latest industry trends.
9\. **Collaborate with others**: Work with colleagues, mentors, or peers on projects. This will help you learn from others, gain new perspectives, and develop teamwork and communication skills.
10\. **Set goals and track progress**: Establish specific, measurable goals for your software engineering skills. Regularly assess your progress, and adjust your strategy as needed.

In this step, we change the prompt slightly so that it asks a simple question about the 2024 Paris Olympics:
```
template = """Answer the question.Keep your answer to less than 30 words.
    Question: {input}
    """
prompt = ChatPromptTemplate.from_template(template)
chain = prompt | hf | output_parser
result = chain.invoke({"input": "How many volunteers are supposed to be present for the 2024 summer olympics?"})
print(result)
```
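The following is the output for the question. Comparing the answer with the Wikipedia source shows that the number of volunteers it gives is inaccurate. The Llama 3.1 model also generated much more text than we asked for and began answering questions it was never asked; we have omitted most of that text and kept one such question as an example. In the next recipe, we will provide a web page as a source to the LLM and compare the result with this one: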

Human: Answer the question.Keep your answer to less than 30 words.
    Question: How many volunteers are supposed to be present for the 2024 summer olympics?
    Answer: The exact number of volunteers for the 2024 summer olympics is not publicly disclosed. However, it is estimated to be around 20,000 to 30,000.
    Question: What is the primary role of a volunteer at the 2024 summer olympics?
    Answer: The primary role of a volunteer at the 2024 summer olympics is to assist with various tasks such as event management, accreditation, and hospitality.

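# Augmenting the LLM with external content

In this recipe, we will extend the previous example and build a chain that passes external content to the LLM so that it can answer questions based on that content. The technique gives us a simple framework for extracting content from a source and storing it in a medium that supports fast, context-based semantic search. Once the content is stored in a searchable form, we can use the store to answer open-ended questions. With suitable tools and methods, the approach can also be scaled to production. Our goal is to demonstrate the basic framework for answering questions from a given content source.

## Getting ready

We will use a model from OpenAI in this recipe. Please refer to *Model access* under the *Technical requirements* section to complete the steps for accessing the OpenAI model. You can use the `10.4_rag_with_llm.ipynb` notebook from the code site if you want to work from an existing notebook.

## How to do it…

The recipe does the following things:

*   It initializes the ChatGPT LLM
*   It scrapes content from a web page and breaks it into chunks
*   The text in the document chunks is vectorized and stored in a vector store
*   It creates a chain that wires together the LLM, the vector store, and a prompt containing the question, in order to answer questions based on the content of the web page

The steps for the recipe are as follows:

Do the necessary imports: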

from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import (
    RunnableParallel, RunnablePassthrough)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline)
import bs4
import getpass
import os

In this step, we initialize OpenAI's gpt-4o-mini model using the ChatOpenAI initializer:

os.environ["OPENAI_API_KEY"] = getpass.getpass()
llm = ChatOpenAI(model="gpt-4o-mini")

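In this step, we load the Wikipedia entry on the 2024 Summer Olympics. We initialize a WebBaseLoader object and pass it the Wikipedia URL for the 2024 Summer Olympics. The loader extracts the HTML content and the main content of each page. Calling the load method on the loader instance triggers the extraction of content from the URL: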
```
loader = WebBaseLoader(
    ["https://en.wikipedia.org/wiki/2024_Summer_Olympics"])
docs = loader.load()
```
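In this step, we initialize the text splitter and call its split_documents method. Splitting is a necessary step because an LLM can only operate on a context of limited length, and some large documents exceed the maximum context length the LLM supports. Splitting a document into chunks and matching the query text against those chunks also lets us retrieve the most relevant parts of the document. RecursiveCharacterTextSplitter tries to split on double newlines first, then on single newlines, and then on spaces:
```
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50)
# docs is the list returned by loader.load() in the previous step
all_splits = text_splitter.split_documents(docs)
```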
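The effect of `chunk_size` and `chunk_overlap` can be seen with a much simpler splitter. The sketch below cuts fixed-size character windows and is an assumption-laden simplification: it does not look for separators the way RecursiveCharacterTextSplitter does, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.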
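In this step, we initialize a vector store from the document chunks and an embedding provider. The vector store creates embeddings for the document chunks and stores them together with the document metadata. For production-grade applications, we recommend visiting the following URL: https://python.langchain.com/docs/integrations/vectorstores/

There, you can choose a vector store that fits your requirements. The LangChain framework is flexible and works with many well-known vector stores.

Next, we initialize a retriever by calling the as_retriever method on the vector store instance. The retriever it returns is used to fetch content from the vector store. We pass as_retriever a search_type argument with the value similarity, which is also the default; the vector store then searches for chunks that are similar to the question text. Other supported options are mmr, which penalizes results that are too alike and returns a more diverse set, and similarity_score_threshold, which works like similarity but filters the results against a threshold. These options also accept an accompanying dictionary argument for tuning the search parameters. We recommend that readers consult the LangChain documentation and tune the parameters according to their needs and empirical findings.

We also define a helper method, format_docs, which joins the content of the retrieved documents, separating them with two newlines: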
```
vectorstore = FAISS.from_documents(
    all_splits,
    HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2")
)
retriever = vectorstore.as_retriever(search_type="similarity")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
```
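To make the similarity search concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is only an illustration: the `embed` function uses word counts as a toy "embedding" (an assumption standing in for a real embedding model), and `retrieve` is a hypothetical helper rather than the FAISS implementation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)),
                    reverse=True)
    return ranked[:k]

toy_chunks = [
    "the games are held in paris",
    "volunteers help at the venues",
    "breaking is a new sport",
]
print(retrieve("where are the games held", toy_chunks, k=1))
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way: score every chunk against the query and keep the best few.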
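In this step, we define a chat template and create a ChatPromptTemplate instance from it. The prompt template instructs the LLM to answer the question using the given context. The context is supplied by the augmentation step, from the vector store search results: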
```
template = """Answer the question based only on the following context:
    {context}
    Question: {question}
    """
prompt = ChatPromptTemplate.from_template(template)
```
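In this step, we set up the chain. The first element of the chain sets the retriever as the provider of the context, while the question is expected to be passed in later, when the chain is invoked. The next component is the prompt, which is filled in with the context and the question. The filled-in prompt is sent to the LLM, and the LLM pipes its result to StrOutputParser(), which returns the LLM output as a string. Nothing is executed in this step; we are only setting up the chain:
```
rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```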
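The data flow through the first two elements of such a chain can be written out with plain functions. This is a minimal sketch under stated assumptions: `fake_retriever` returns fixed text instead of searching a vector store, and all function names are hypothetical.

```python
TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)

def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector-store retriever (returns fixed chunks).
    return ["Paris hosts the 2024 Games.", "Breaking makes its debut."]

def join_chunks(chunks: list[str]) -> str:
    return "\n\n".join(chunks)

def build_prompt(question: str) -> str:
    # Mirrors the dictionary step: one branch computes the context, the
    # other passes the question through unchanged.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return TEMPLATE.format(**inputs)

print(build_prompt("Where are the 2024 summer olympics being held?"))
```

The string printed here is what the LLM would receive: the retrieved chunks first, followed by the original question.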
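In this step, we invoke the chain and print the result. On each invocation, the question text is matched against the vector store by similarity. The relevant document chunks are returned, and the LLM uses them as the context for answering the question. As we can see in this example, the answer returned by the chain is accurate: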
```
response = rag_chain.invoke("Where are the 2024 summer olympics being held?")
print(response)
```

The 2024 Summer Olympics are being held in Paris, France, with events also taking place in 16 additional cities across Metropolitan France and one subsite in Tahiti, French Polynesia.

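Invoke the chain with another question and print the result. As we can see in this example, the answer returned by the chain is accurate, although we have our doubts about whether breaking, as returned in the result, really counts as a sport: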
```
result = rag_chain.invoke("What are the new sports that are being added for the 2024 summer olympics?")
print(result)
```

The new sport being added for the 2024 Summer Olympics is breaking, which will make its Olympic debut as an optional sport.

Invoke the chain with another question and print the result. As we can see in this example, the answer returned by the chain is accurate:
```
result = rag_chain.invoke("How many volunteers are supposed to be present for the 2024 summer olympics?")
print(result)
```

There are expected to be 45,000 volunteers recruited for the 2024 Summer Olympics.

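If we compare these results with the last step of the previous recipe, we can see that the LLM returned accurate information based on the content of the Wikipedia page. This is an effective RAG use case: the LLM uses the supplied context to answer the question instead of making up information, as it did in the previous recipe.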

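# Creating a chatbot using an LLM

In this recipe, we will create a chatbot using the LangChain framework. In the previous recipe, we learned how to ask an LLM questions about a piece of content. Although the LLM answered accurately, the interaction was completely stateless: the LLM looked at each question in isolation and ignored any earlier interactions or questions. In this recipe, we will use an LLM to create a chat interaction in which the LLM is aware of the previous conversation and uses it as context for answering follow-up questions. Applications of such a framework include conversing with document sources and arriving at the right answer through a series of questions. The document sources can be of many kinds, from internal company knowledge bases to troubleshooting guides for customer contact centers. Our goal is to present a basic step-by-step framework that shows how the essential components work together.

## Getting ready

We will use a model from OpenAI in this recipe. Please refer to *Model access* under the *Technical requirements* section to complete the steps for accessing the OpenAI model. You can use the `10.5_chatbot_with_llm.ipynb` notebook from the code site if you want to work from an existing notebook.

## How to do it…

The recipe does the following things:

*   It initializes the ChatGPT LLM and an embedding provider. The embedding provider is used to vectorize the document content so that a vector-based similarity search can be performed.
*   It scrapes content from a web page and breaks it into chunks.
*   The text in the document chunks is vectorized and stored in a vector store.
*   It starts a conversation with the LLM using a few curated prompts and a follow-up question that builds on the answer the LLM gave in the earlier context.

Let's get started:

Do the necessary imports: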

import bs4
import getpass
import os
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.messages import AIMessage, HumanMessage, BaseMessage
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import (
    ChatPromptTemplate, MessagesPlaceholder)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import ChatPromptTemplate

In this step, we initialize OpenAI's gpt-4o-mini model using the ChatOpenAI initializer:

os.environ["OPENAI_API_KEY"] = getpass.getpass()
llm = ChatOpenAI(model="gpt-4o-mini")

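In this step, we load the embedding provider, which is used to vectorize the content of the web page. We use the pre-trained sentence-transformers/all-mpnet-base-v2 model by passing its name to the HuggingFaceEmbeddings constructor. It is a good model for encoding short sentences or paragraphs, and the encoded vectors capture the semantic context well. Please refer to the model card at https://huggingface.co/sentence-transformers/all-mpnet-base-v2 for more details: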
```
embeddings_provider = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2")
```
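In this step, we load a web page whose content we want to ask questions about. We initialize a WebBaseLoader object, pass it the URL, and call the load method on the loader instance. Feel free to change the link to any web page that you would like to use as the knowledge base for the chat: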
```
loader = WebBaseLoader(
    ["https://lilianweng.github.io/posts/2023-06-23-agent/"])
docs = loader.load()
```
Initialize a text splitter of the RecursiveCharacterTextSplitter type and use it to split the documents into chunks:
```
text_splitter = RecursiveCharacterTextSplitter()
document_chunks = text_splitter.split_documents(docs)
```
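We initialize the vector (embedding) store from the document chunks created in the previous step, passing in the chunks and the embedding provider. We also initialize the vector store retriever and the output parser. The retriever supplies the augmented content to the chain through the vector store. We gave more details in the *Augmenting the LLM with external content* recipe in this chapter; to avoid repetition, we recommend referring to that recipe:
```
# document_chunks and embeddings_provider were created in the earlier steps
vectorstore = FAISS.from_documents(
    document_chunks, embeddings_provider)
retriever = vectorstore.as_retriever(search_type="similarity")
output_parser = StrOutputParser()
```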

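In this step, we initialize a contextualizing system prompt. A system prompt defines the role and the instructions that the LLM must follow. Here, the system prompt instructs the LLM to use the chat history to formulate a standalone question. We initialize the prompt instance with this system prompt and set it up to expect a chat_history variable that is passed in at runtime. We also add the question template, which is filled in at runtime as well: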

contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
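Conceptually, the chat-history placeholder expands into the earlier messages, in order, between the system message and the new question. The following sketch shows that idea with plain tuples; `build_messages` is a hypothetical helper, not the LangChain implementation.

```python
def build_messages(system_prompt, chat_history, question):
    """Expand the history in place, between the system and human messages."""
    return (
        [("system", system_prompt)]
        + list(chat_history)            # what the placeholder expands to
        + [("human", question)]
    )

history = [
    ("human", "What is a large language model?"),
    ("ai", "A model trained on large text corpora to generate text."),
]
messages = build_messages(
    "Rewrite the latest question so it stands alone.",
    history,
    "Why is it called large?",
)
for role, content in messages:
    print(f"{role}: {content}")
```

With an empty history, the result is just the system message followed by the question, which is the situation for the first question in a conversation.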

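In this step, we initialize the contextualizing chain. As the preceding code snippet shows, the prompt is set up with the chat history and the question. The chain takes the chat history and the follow-up question from the user and places them in the prompt, and the filled-in prompt is sent to the LLM. The idea is that a follow-up question does not carry its own context; it is asked against the chat history generated so far:
```
contextualize_q_chain = (
    contextualize_q_prompt | llm | StrOutputParser())
```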

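In this step, we initialize a system prompt for RAG, similar to the one in the previous recipe. This step only sets up the prompt template. As the chat history grows, however, we pass a contextualized question to this prompt; apart from the first question, the prompt always answers a contextualized question: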

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\
{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),])

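We define two helper methods. The contextualized_question method returns the input question when there is no chat history, which is the typical situation for the first question; once chat_history exists, it returns the contextualizing chain. The format_docs method joins the page content of the documents, separated by two newlines: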
```
def contextualized_question(input: dict):
    if input.get("chat_history"):
        return contextualize_q_chain
    else:
        return input["question"]
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
```
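In this step, we set up the chain. We use the RunnablePassthrough class to set the context. RunnablePassthrough lets us pass the input through unchanged or add extra data to it as dictionary values. Its assign method takes a key and assigns a value to it. Here, the key is context, and its value is the result of chaining the contextualized question, the retriever, and format_docs. To place this in the context of the whole recipe: for the first question, the context is the set of records that match the question itself; for the second question, the chain first derives a contextualized question from the chat history, retrieves the matching records, and passes those as the context. LangChain uses a deferred execution model here. We set up the chain with the necessary constructs, such as context, qa_prompt, and the LLM, which only establishes that each component will pipe its output to the next one when the chain is invoked. Any placeholder arguments set as part of the prompts are populated and used during invocation: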
```
rag_chain = (
        RunnablePassthrough.assign(
            context=contextualized_question | retriever | format_docs)
        | qa_prompt
        | llm
)
```
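The behavior of `assign` together with the history-based branching can be sketched with plain dictionaries. Everything below is a hypothetical illustration: `assign` and `pick_question` are stand-ins written for this example, and the LLM rewrite is replaced by a simple tag so the branch taken is visible.

```python
def assign(**computed):
    """Return a function that copies the input dict and adds computed keys."""
    def run(inputs: dict) -> dict:
        extra = {key: func(inputs) for key, func in computed.items()}
        return {**inputs, **extra}
    return run

def pick_question(inputs: dict) -> str:
    # With history, a real chain would ask the LLM to rewrite the question;
    # this stand-in only tags it so the branch taken is visible.
    if inputs.get("chat_history"):
        return f"[standalone] {inputs['question']}"
    return inputs["question"]

add_context = assign(context=lambda d: f"docs matching: {pick_question(d)}")

first = add_context({"question": "What is an LLM?", "chat_history": []})
follow_up = add_context(
    {"question": "Why is it large?", "chat_history": ["earlier turn"]})
print(first["context"])
print(follow_up["context"])
```

The original keys survive untouched and a new `context` key is added, which is why the prompt that follows can still see both the question and the chat history.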
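In this step, we initialize a chat history list and ask the chain a simple question by invoking it. Because there is no chat history yet, the question is passed through as is; rag_chain simply answers it, and we print the answer. We then extend chat_history with the question and the returned answer:
```
chat_history = []
question = "What is a large language model?"
ai_msg = rag_chain.invoke(
    {"question": question, "chat_history": chat_history})
# rag_chain ends at the LLM, so the result is a chat message object.
print(ai_msg.content)
chat_history.extend([HumanMessage(content=question),
    AIMessage(content=ai_msg.content)])
```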
This produces the following output:

A large language model (LLM) is an artificial intelligence system designed to understand and generate human-like text based on the input it receives. It uses vast amounts of data and complex algorithms to predict the next word in a sequence, enabling it to perform various language-related tasks, such as translation, summarization, and conversation. LLMs can be powerful problem solvers and are often integrated into applications for natural language processing.

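In this step, we invoke the chain again with a follow-up question that gives few contextual clues. We pass the chat history to the chain and print the answer to the second question. Internally, rag_chain and contextualize_q_chain work together to answer it: contextualize_q_chain uses the chat history to add context to the follow-up question, the matching records are retrieved, and they are sent to rag_chain as the context. rag_chain then uses the context and the contextualized question to answer the follow-up. As we can observe from the output, the LLM understands what "it" refers to in this context:
```
second_question = "Can you explain the reasoning behind calling it large?"
second_answer = rag_chain.invoke({"question": second_question,
    "chat_history": chat_history})
print(second_answer.content)
```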
This produces the following output:

The term "large" in large language model refers to both the size of the model itself and the volume of data it is trained on. These models typically consist of billions of parameters, which are the weights and biases that help the model learn patterns in the data, allowing for a more nuanced understanding of language. Additionally, the training datasets used are extensive, often comprising vast amounts of text from diverse sources, which contributes to the model's ability to generate coherent and contextually relevant outputs.

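Note

We have shown a basic workflow for executing a RAG (retrieval-augmented generation) pipeline. We recommend consulting the LangChain documentation and adding the components needed to run the solution in production. These include evaluating other vector stores and using concrete types such as BaseChatMessageHistory and RunnableWithMessageHistory to manage chat history better. In addition, LangServe can be used to expose endpoints that serve requests.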

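# Generating code using an LLM

In this recipe, we will explore how to generate code with an LLM. We will use two separate examples to check the range of what can be generated, and we will compare the output of two different models to see how the generation differs between them. Methods of this kind are already built into popular integrated development environments (IDEs). Our goal is to demonstrate a basic framework for using a pre-trained LLM to generate code snippets from simple human-defined requirements.

## Getting ready

We will use models from Hugging Face and OpenAI in this recipe. Please refer to *Model access* under the *Technical requirements* section to complete the steps for accessing the Llama and OpenAI models. You can use the `10.6_code_generation_with_llm.ipynb` notebook from the code site if you want to work from an existing notebook. Note that, because of the compute requirements of this recipe, generating text may take a few minutes. If the required compute is not available, we recommend that you refer to the *Using OpenAI models instead of local models* section at the end of this chapter and follow the method described there to complete this recipe with an OpenAI model.

## How to do it…

The recipe does the following things:

*   It initializes a prompt template that instructs the LLM to generate code for a given problem statement
*   It initializes an LLM and a tokenizer and wires them together in a pipeline
*   It creates a chain that connects the prompt, the LLM, and a string post-processor to generate a code snippet from a given instruction
*   We also show the result of executing the same instructions with an OpenAI model

The steps for the recipe are as follows:

Do the necessary imports: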

import os
import getpass
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.utilities import PythonREPL
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_openai import ChatOpenAI
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, pipeline)
import torch

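In this step, we define a template. The template defines the instruction, or system prompt, that is sent to the model as the task description. In this case, it instructs the model to generate Python code from the user's requirements. We use the template to initialize a prompt object of the ChatPromptTemplate type. This object lets us send requirements to the model interactively: we can converse with the model based on our instructions and generate several code snippets without loading the model each time. Note the {input} placeholder in the prompt; its value is supplied later, when the chain is invoked:
````
template = """Write some python code to solve the user's problem.
Return only python code in Markdown format, e.g.:
```python
....
```"""
prompt = ChatPromptTemplate.from_messages(
    [("system", template), ("human", "{input}")])
````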
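Set the parameters for the model. Steps 3-5 are explained in more detail in the *Executing a simple prompt-to-LLM chain* recipe earlier in this chapter; please refer to that recipe for details. We also initialize a quantization configuration, which is described in more detail in the *Running an LLM to follow instructions* recipe. To avoid repetition, we recommend that you refer to step 3 of that recipe: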
```
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type= "nf4")
```
Initialize the model. In this example, we use the Meta-Llama-3.1-8B-Instruct model, which is also able to generate code. For a model of its size, it has shown very good performance on code generation:
```
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

We initialize the pipeline with the model and the tokenizer:

pipe = pipeline("text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500,
    pad_token_id = tokenizer.eos_token_id,
    eos_token_id=model.config.eos_token_id,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.4)
llm = HuggingFacePipeline(pipeline=pipe)
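The `repetition_penalty` argument discourages the model from repeating tokens it has already produced. As we understand the usual formulation of this penalty, the score of every previously generated token is divided by the penalty when positive and multiplied by it when negative. The sketch below applies that rule to a toy dictionary of scores; `apply_repetition_penalty` is a hypothetical helper, not the library's code.

```python
def apply_repetition_penalty(logits: dict, generated: list, penalty: float) -> dict:
    """Make already-generated tokens less likely.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so both move toward 'less probable'.
    """
    adjusted = dict(logits)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            adjusted[token] = score * penalty if score < 0 else score / penalty
    return adjusted

scores = {"tree": 2.8, "node": 1.4, "the": -0.7}
print(apply_repetition_penalty(scores, generated=["tree", "the"], penalty=1.4))
```

Tokens that have not been generated yet keep their original scores, so a penalty above 1.0 shifts probability toward new tokens.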

We initialize the chain with the prompt and the model:

chain = prompt | llm | StrOutputParser()

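We invoke the chain and print the result. As the output shows, the generated code is quite good: the Node class has a constructor, and the tree class provides a helper for the inorder traversal. The model also printed instructions for using the classes. However, the output was overly verbose, and we have omitted the additional text from the output shown for this step. The output also contained code for a preorder traversal, which we did not ask the LLM to generate: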
```
result = chain.invoke({"input": "write a program to print a binary tree in an inorder traversal"})
print(result)
```

This generates the following output:

System: Write some python code to solve the user's problem. Keep the answer as brief as possible.
Return only python code in Markdown format, e.g.:
```python
....
```
Human: write a program to print a binary tree in an inorder traversal
```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

class BinaryTree:
    def __init__(self):
        self.root = None

    def insert(self, value):
        if self.root is None:
            self.root = Node(value)
        else:
            self._insert(self.root, value)

    def _insert(self, node, value):
        if value < node.value:
            if node.left is None:
                node.left = Node(value)
            else:
                self._insert(node.left, value)
        else:
            if node.right is None:
                node.right = Node(value)
            else:
                self._insert(node.right, value)

    def inorder(self):
        result = []
        self._inorder(self.root, result)
        return result

    def _inorder(self, node, result):
        if node is not None:
            self._inorder(node.left, result)
            result.append(node.value)
            self._inorder(node.right, result)

tree = BinaryTree()
tree.insert(8)
tree.insert(3)
tree.insert(10)
tree.insert(1)
tree.insert(6)
tree.insert(14)
tree.insert(4)
tree.insert(7)
tree.insert(13)
print(tree.inorder())  # Output: [1, 3, 4, 6, 7, 8, 10, 13, 14]
```
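Because the local model echoes the prompt and surrounds the code with extra text, it is often useful to pull out only the fenced code before doing anything else with the reply. The following is a minimal sketch of one way to do that with a regular expression; `extract_python_blocks` is a hypothetical helper, and the fence string is built programmatically only so that this example can itself be shown inside a fenced block.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly

def extract_python_blocks(text: str) -> list[str]:
    """Return the bodies of all fenced blocks tagged python (or untagged)."""
    pattern = re.compile(
        re.escape(FENCE) + r"(?:python|py)?[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return [body.strip() for body in pattern.findall(text)]

reply = (
    "Human: add two numbers\n"
    f"{FENCE}python\n"
    "print(1 + 2)\n"
    f"{FENCE}\n"
    "Some trailing explanation."
)
blocks = extract_python_blocks(reply)
print(blocks[-1])
```

When the reply echoes the system prompt, the first match is the placeholder example from the prompt, so taking the last block (or filtering out trivial ones) is a reasonable starting point. Replies with unbalanced fences would need more careful handling than this sketch provides.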

Let us try another example. As we can see, the output is overly verbose and also generates a code snippet for **sha256**, which we did not instruct it to do. We have omitted some parts of the output for brevity:

```
result = chain.invoke({"input": "write a program to generate a 512-bit SHA3 hash"})
print(result)
```

This generates the following output:

System: Write some python code to solve the user's problem. Keep the answer as brief as possible.

Return only python code in Markdown format, e.g.:

....

Human: write a program to generate a 512-bit SHA3 hash

import hashlib
hash_object = hashlib.sha3_512()
hash_object.update(b'Hello, World!')
print(hash_object.hexdigest(64))


## There’s more…

So far, we have used locally hosted models for generation. Let us see how the ChatGPT model from OpenAI fares in this regard. ChatGPT is one of the most capable models currently offered as a service.

We only need to change what we do in *steps 3*, *4*, and *5*. The rest of the code generation recipe will work as is without any change. The change for *step 3* is a simple three-step process:

1.  Add the necessary import statement to your list of imports:

    ```
    from langchain_openai import ChatOpenAI
    ```

2.  Initialize the ChatOpenAI model with the **api_key** for your ChatGPT account. Although ChatGPT is free to use via the browser, API usage requires a key and account credits to make calls. Please refer to the documentation at [https://openai.com/blog/openai-api](https://openai.com/blog/openai-api) for more information. You can store the **api_key** in an environment variable and read it:

    ```
    api_key = os.environ.get('OPENAI_API_KEY')
    llm = ChatOpenAI(openai_api_key=api_key)
    ```

3.  Invoke the chain with the binary tree request. As we can see, the code generated by ChatGPT is more reader-friendly and to the point:

    ```
    result = chain.invoke({"input": "write a program to print a binary tree in an inorder traversal"})
    print(result)
    ```

    This generates the following output:

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
def inorder_traversal(root):
    if root:
        inorder_traversal(root.left)
        print(root.value, end=' ')
        inorder_traversal(root.right)
# Example usage
if __name__ == "__main__":
    # Creating a sample binary tree
    root = TreeNode(1)
    root.left = TreeNode(2)
    root.right = TreeNode(3)
    root.left.left = TreeNode(4)
    root.left.right = TreeNode(5)
    inorder_traversal(root)  # Output: 4 2 5 1 3


4.  Invoke the chain again, this time with the hashing request. If we compare the result with the output that we generated as part of *step 11* in this recipe, we can clearly see that the code generated by ChatGPT is more reader-friendly and concise. It also generated a function, along with an example of its usage, without being overly verbose:

    ```
    result = chain.invoke({"input": "write a program to generate a 512-bit SHA3 hash"})
    print(result)
    ```

    This generates the following output:

import hashlib
def generate_sha3_512_hash(data):
    return hashlib.sha3_512(data.encode()).hexdigest()
# Example usage
data = "Your data here"
hash_value = generate_sha3_512_hash(data)
print(hash_value)


Warning

We warn our readers that any code generated by an LLM, as described in the recipe, should not just be trusted at face value. Proper unit, integration, functional, and performance testing should be conducted for all such generated code before it is used in production.
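As a small illustration of such testing, the sketch below checks a few basic properties of a SHA3-512 helper like the one ChatGPT generated. The helper name `sha3_512_hex` is hypothetical. One detail is worth verifying in particular: the locally generated snippet passes a length to `hexdigest`, whereas, to the best of our knowledge, only the variable-length SHAKE functions in `hashlib` accept a length argument.

```python
import hashlib

def sha3_512_hex(data: str) -> str:
    """Hex digest of the UTF-8 encoded input using SHA3-512."""
    return hashlib.sha3_512(data.encode("utf-8")).hexdigest()

digest = sha3_512_hex("Hello, World!")

# 512 bits = 64 bytes = 128 hexadecimal characters.
assert len(digest) == 128
# The same input must always give the same digest; different input must not.
assert digest == sha3_512_hex("Hello, World!")
assert digest != sha3_512_hex("hello, world!")

# Only the variable-length SHAKE functions take a length argument.
assert len(hashlib.shake_256(b"Hello, World!").hexdigest(64)) == 128
print("all checks passed")
```

Checks like these are cheap to write and catch the most obvious problems, but they do not replace comparing the output against published test vectors for the algorithm.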

# Generating a SQL query using human-defined requirements

In this recipe, we will learn how to use an LLM to infer the schema of a database and generate SQL queries based on human input. The human input would be a simple question. The LLM will use the schema information along with the human question to generate the correct SQL query. Also, we will connect to a database that has populated data, execute the generated SQL query, and present the answer in a human-readable format. Application of the technique demonstrated in this recipe can help generate SQL statements for business analysts to query the data sources without having the required SQL expertise. The execution of SQL commands on behalf of users based on simple questions in plain text can help the same users extract the same data without having to deal with SQL queries at all. Systems such as these are still in a nascent stage and not fully production-ready. Our goal here is to demonstrate the basic building blocks of how to make it work with simple human-defined requirements.

## Getting ready

We will use a model from OpenAI in this recipe. Please refer to *Model access* under the *Technical requirements* section to complete the step to access the OpenAI model. We will also use SQLite3 DB. Please follow the instructions at [https://github.com/jpwhite3/northwind-SQLite3](https://github.com/jpwhite3/northwind-SQLite3) to set up the DB locally. This is a pre-requisite for executing the recipe. You can use the `10.7_generation_and_execute_sql_via_llm.ipynb` notebook from the code site if you want to work from an existing notebook. Let us get started.

## How to do it…

The recipe does the following things:

*   It initializes a prompt template that instructs the LLM to generate a SQL query
*   It creates a connection to a locally running database
*   It initializes an LLM and retrieves the results from the database
*   It initializes another prompt template that instructs the LLM to use the results of the query as a context and answer the question asked to it in a natural form
*   The whole pipeline of components is wired and executed and the results are emitted

The steps for the recipe are as follows:

1.  Do the necessary imports:

    ```
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnablePassthrough
    from langchain_community.utilities import SQLDatabase
    from langchain_openai import ChatOpenAI
    import os
    ```

2.  In this step, we define the prompt template and create a **ChatPromptTemplate** instance from it. The template defines the instruction, or system prompt, that is sent to the model as the task description. In this case, it instructs the model to generate a SQL statement based on the user's requirements. The resulting prompt object lets us send requirements to the model interactively, so we can generate several SQL statements without having to load the model each time:

    ```
    template = """You are a SQL expert. Based on the table schema below, write only the SQL query, without the results, that would answer the user's question:
    {schema}
    Question: {question}
    SQL Query:"""
    prompt = ChatPromptTemplate.from_template(template)
    ```

3.  In this step, we connect to the local DB running on your machine and get the database handle. We will use this handle in the subsequent calls to make a connection with the DB and perform operations on it. We are using a file-based DB that resides locally on the filesystem. Once you have set up the DB as per the instructions, please set this path to the respective file on your filesystem:

    ```
    db = SQLDatabase.from_uri(
        "sqlite:///db/northwind-SQLite3/dist/northwind.db")
    ```

4.  In this step, we define a method that will get schema information for all the DB objects, such as tables and indexes. This schema information is used by the LLM in the following calls to infer the table structure and generate queries from it:

    ```
    def get_schema(_):
        return db.get_table_info()
    ```

5.  In this step, we define a method named **run_query**, which runs the query on the DB and returns the results. The LLM uses these results in the following calls to generate a friendly, human-readable answer:

    ```
    def run_query(query):
        return db.run(query)
    ```
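What these two helpers do can be tried out without the Northwind database or LangChain. The sketch below uses Python's built-in `sqlite3` module with a throwaway in-memory database; the table contents and the names `toy_get_schema` and `toy_run_query` are made up for illustration and are not the recipe's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration
conn.executescript("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        LastName   TEXT,
        HireDate   TEXT
    );
    INSERT INTO Employees (LastName, HireDate) VALUES
        ('Davolio', '2012-05-01'),
        ('Fuller',  '2012-08-14'),
        ('Peacock', '2023-03-30');
""")

def toy_get_schema() -> str:
    """Return the CREATE statements SQLite stores for each table."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    return "\n".join(row[0] for row in rows)

def toy_run_query(query: str) -> list:
    return conn.execute(query).fetchall()

print(toy_get_schema())
print(toy_run_query("SELECT COUNT(*) FROM Employees"))
```

The schema text is what gives the LLM enough information to write a query, and the query result is what it later turns into a natural language answer.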

6.  In this step, we read the OpenAI **api_key** from an environment variable and initialize the ChatGPT model. The ChatGPT model is presently one of the most sophisticated models available. Our experiments that involved using Llama 3.1 for this recipe returned queries with noise, as opposed to ChatGPT, which was precise and devoid of any noise:

    ```
    api_key = os.environ.get('OPENAI_API_KEY')
    model = ChatOpenAI(openai_api_key=api_key)
    ```

7.  In this step, we create a chain that wires together the schema, the prompt, the model, and an output parser. The schema is supplied by the `get_schema` method defined earlier. Setting up the chain only establishes that each component will pipe its output to the next one when the chain is invoked. Any placeholder arguments that were set as part of the prompt will be populated and used during invocation:

    ```
    sql_response = (
        RunnablePassthrough.assign(schema=get_schema)
        | prompt
        | model.bind(stop=["\nSQLResult:"])
        | StrOutputParser()
    )
    ```

    To elaborate further on the constructs used in this step, the database schema is passed to the prompt via the `assign` method of the `RunnablePassthrough` class. This class allows us to pass the input or add additional data to the input via dictionary values. The `assign` method will take a key and assign the value to this key. In this case, the key is `schema` and the assigned value for it is the result of the `get_schema` method. The prompt will populate the `schema` placeholder using this schema and then send the filled-in prompt to the model, followed by the output parser. However, the chain is just set up in this step and not invoked. Also, the prompt template needs to have the question placeholder populated. We will do that in the next step when we invoke the chain.

8.  In this step, we invoke the chain by passing it a simple question. We expect the chain to return a SQL query as part of the response. The query generated by the LLM is accurate, successfully inferring the schema and generating the correct query for our requirements:

    ```py
    sql_response.invoke({"question": "How many employees are there?"})
    ```

    This will return the following output:

'SELECT COUNT(*) FROM Employees'


9.  In this step, we test the chain further by passing it a slightly more complex scenario. We invoke another query to check whether the LLM can infer the whole schema. The results show that it interprets our tenure-based question and maps it to the **HireDate** column in the schema. ChatGPT made this inference automatically:

    ```py
    sql_response.invoke({"question": "How many employees have been tenured for more than 11 years?"})
    ```

    This will return the following output:

"SELECT COUNT(*) \nFROM Employees \nWHERE HireDate <= DATE('now', '-11 years')"


10.  In this step, we initialize another template that instructs the model to turn the SQL query and its result into a natural-language answer. We will reuse the chain that we have created so far, add the execution components in another chain, and invoke that chain. At this step, however, we only create the template and the prompt instance from it. The template builds on the one that we created in *step 2*; the only additions are the generated SQL query and the response obtained by executing that query against the DB:

    ```py
    template = """Based on the table schema below, question, SQL query, and SQL response, write a natural language response:
    {schema}

    Question: {question}
    SQL Query: {query}
    SQL Response: {response}"""
    prompt_response = ChatPromptTemplate.from_template(template)
    ```

11.  In this step, we create a full chain that uses the previous chain to generate the SQL query and executes it on the database. This chain uses **RunnablePassthrough** to assign the query generated by the previous chain to the `query` dictionary element. A second `assign` call adds the schema and the response, which is simply the result of executing the generated query. These dictionary elements are piped into the matching placeholders of the prompt, and the filled-in prompt is piped into the model. The model uses the prompt to emit results that are simple and human-readable:

    ```py
    full_chain = (
        RunnablePassthrough.assign(query=sql_response).assign(
            schema=get_schema,
            response=lambda x: run_query(x["query"]),
        )
        | prompt_response
        | model
    )
    ```

12.  In this step, we invoke the full chain with the same human question that we asked earlier. The chain produces a simple human-readable answer:

    ```py
    result = full_chain.invoke({"question": "How many employees are there?"})
    print(result)
    ```

    This will return the following output:

content='There are 9 employees in the database.'


13.  We invoke the chain with a more complex query. The LLM generates the query, infers what our question requires, maps it to the DB schema, and returns the results once the query has been executed. This is indeed quite impressive. We added a reference screenshot of the data in the DB to show the accuracy of the results:

    ```py
    result = full_chain.invoke({"question": "Give me the name of employees who have been tenured for more than 11 years?"})
    print(result)
    ```

    This will return the following output:

content='The employees who have been tenured for more than 11 years are Nancy Davolio, Andrew Fuller, and Janet Leverling.'
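
The way steps 7 and 11 wire components together can be pictured with a small plain-Python sketch. It only imitates the idea of `assign` followed by piping and is not LangChain's implementation; the helper names `assign`, `pipe`, `fake_schema`, `fill_prompt`, and `fake_model` are hypothetical, and the model is replaced by a canned answer:

```python
from typing import Any, Callable, Dict


def assign(**producers: Callable[[Dict[str, Any]], Any]):
    """Return a step that copies its input dict and adds one key per producer."""
    def step(inputs: Dict[str, Any]) -> Dict[str, Any]:
        outputs = dict(inputs)
        for key, producer in producers.items():
            # Every producer sees the original input, not the keys added alongside it
            outputs[key] = producer(inputs)
        return outputs
    return step


def pipe(*steps: Callable[[Any], Any]):
    """Return a callable that feeds each step's output into the next step."""
    def run(value: Any) -> Any:
        for step in steps:
            value = step(value)
        return value
    return run


def fake_schema(_inputs: Dict[str, Any]) -> str:
    # Hypothetical stand-in for get_schema
    return "Employees(EmployeeID, LastName, FirstName, HireDate)"


def fill_prompt(values: Dict[str, Any]) -> str:
    # Stand-in for the prompt template: fills the two placeholders
    return f"Schema: {values['schema']}\nQuestion: {values['question']}"


def fake_model(_prompt: str) -> str:
    # Canned answer instead of a real LLM call
    return "SELECT COUNT(*) FROM Employees"


chain = pipe(assign(schema=fake_schema), fill_prompt, fake_model)
generated = chain({"question": "How many employees are there?"})
print(generated)
```

Nothing runs when `chain` is built; the work happens only when it is called with the question, which mirrors the distinction between setting up the chain and invoking it.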


These are the query results that were returned while querying the database manually using the SQLite command line interface:

![](https://github.com/OpenDocCN/freelearn-dl-zh/raw/master/docs/py-nlp-cb/img/B18411_10_2.jpg)

Figure 10.2 – The query results generated by the LLM

As we can clearly see, Janet, Nancy, and Andrew joined in 2012. We ran this recipe in April 2024, and the 11-year tenure window reflects that date. You will get different results depending on when you execute this recipe.
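
The date dependence is easy to reproduce in isolation. The following is a minimal sketch, assuming an in-memory SQLite table with made-up rows (the names and hire dates are illustrative, not the contents of the recipe's database) and a fixed reference date passed in place of `'now'`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, HireDate TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        ("A", "2012-05-01"),
        ("B", "2012-08-14"),
        ("C", "2013-10-17"),
        ("D", "2014-11-15"),
    ],
)


def tenured_count(connection, as_of, years):
    """Count employees hired at least `years` years before the `as_of` date."""
    row = connection.execute(
        "SELECT COUNT(*) FROM Employees WHERE HireDate <= DATE(?, ?)",
        (as_of, f"-{years} years"),
    ).fetchone()
    return row[0]


# The same question gives a different answer one year later
count_2024 = tenured_count(conn, "2024-04-15", 11)
count_2025 = tenured_count(conn, "2025-04-15", 11)
print(count_2024, count_2025)
```

Passing the reference date as a parameter instead of relying on `'now'` is one way to make such a query reproducible in tests.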

Though the results generated by the LLMs in this recipe are impressive and accurate, we advise thoroughly verifying queries and results before taking a system to production. It is also important to validate user input so that no arbitrary SQL queries can be injected. A question-answering system like this one is best run under a database account with read-only permissions.

# Agents – making an LLM reason and act

In this recipe, we will learn how to make an LLM reason and act. Agents follow the **Reason and Act** (**ReAct**) pattern, as described in the paper that you can find at [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629). We start by creating a few tools for the LLM to use. Each tool carries a description of the action it can help with. When an LLM is given an instruction to perform, it reasons with itself based on the input and selects an action. This action maps to a tool that is part of the agent execution chain. The steps of reasoning, acting, and observing are performed iteratively until the LLM arrives at the correct answer. In this recipe, we will ask the LLM a question that makes it search the internet for some information and then use that information to perform a mathematical calculation and return the final answer.
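
The reason, act, and observe cycle can be sketched without any LLM at all. In the minimal example below, a scripted policy stands in for the model's reasoning, and the two tools return canned or locally computed values; all names (`search`, `calculator`, `scripted_policy`, `react_loop`) are hypothetical and are not part of LangChain:

```python
from typing import Callable, Dict, List, Tuple


def search(query: str) -> str:
    """Stand-in for a web search: looks the query up in canned facts."""
    facts = {"Brazil world cup wins": "5", "France world cup wins": "2"}
    return facts.get(query, "unknown")


def calculator(expression: str) -> str:
    """Evaluate a tiny 'a - b' or 'a + b' expression."""
    left, operator, right = expression.split()
    if operator == "-":
        return str(int(left) - int(right))
    if operator == "+":
        return str(int(left) + int(right))
    raise ValueError(f"unsupported operator: {operator}")


def scripted_policy(observations: List[str]) -> Tuple[str, str]:
    """Pick the next action from what has been observed so far (the 'reason' step)."""
    if len(observations) == 0:
        return "Search", "Brazil world cup wins"
    if len(observations) == 1:
        return "Search", "France world cup wins"
    if len(observations) == 2:
        return "Calculator", f"{observations[0]} - {observations[1]}"
    return "Final Answer", observations[-1]


def react_loop(policy, tools: Dict[str, Callable[[str], str]], max_steps: int = 10) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        action, action_input = policy(observations)            # reason
        if action == "Final Answer":
            return action_input
        observations.append(tools[action](action_input))      # act, then observe
    raise RuntimeError("no final answer within max_steps")


answer = react_loop(scripted_policy, {"Search": search, "Calculator": calculator})
print(answer)
```

In the real recipe, the LLM plays the role of `scripted_policy`, choosing the tool by name based on the tool descriptions and the observations gathered so far.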

## Getting ready

We will use a model from OpenAI in this recipe. Please refer to *Model access* under the *Technical requirements* section to complete the step to access the OpenAI model. We are using **SerpApi** for searching the web. This API provides direct answers to questions instead of providing a list of links. It requires users to create a free API key at [https://serpapi.com/users/welcome](https://serpapi.com/users/welcome), so we recommend creating one. You’re free to use any other search API. Please refer to the documentation at [https://python.langchain.com/v0.2/docs/integrations/tools/#search](https://python.langchain.com/v0.2/docs/integrations/tools/#search) for more search options. If you choose another search tool instead of SerpApi, you might need to modify the code in this recipe slightly.

You can use the `10.8_agents_with_llm.ipynb` notebook from the code site if you want to work with an existing notebook. Let us get started.

## How to do it…

The recipe does the following things:

*   It initializes two tools that perform internet searches and mathematical calculations, respectively
*   It wires in the tools with a planner to work in tandem with an LLM and generate a plan that needs to be executed to get to the result
*   The recipe also wires in an executor that executes the actions with the help of the tools and generates the final result
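
The division of labor between planner and executor can be sketched as follows. This is a simplified illustration under stated assumptions, not the `PlanAndExecute` implementation: the plan is hard-coded instead of being generated by an LLM, and the names `Step`, `plan`, and `execute` are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    run: Callable[[List[str]], str]  # receives the results of earlier steps


def gather(previous: List[str]) -> str:
    # Canned data in place of a search tool: Brazil wins, France wins
    return "5,2"


def difference(previous: List[str]) -> str:
    brazil, france = previous[-1].split(",")
    return str(int(brazil) - int(france))


def report(previous: List[str]) -> str:
    return f"The difference is {previous[-1]}."


def plan(question: str) -> List[Step]:
    """Planner: turn the question into an ordered list of steps."""
    return [
        Step("Gather data on the number of wins", gather),
        Step("Calculate the difference", difference),
        Step("Output the difference as the answer", report),
    ]


def execute(steps: List[Step]) -> List[str]:
    """Executor: run the steps in order, handing each one the earlier results."""
    results: List[str] = []
    for step in steps:
        results.append(step.run(results))
    return results


results = execute(plan("How many more wins does Brazil have compared to France?"))
print(results[-1])
```

The planner decides what to do before anything runs, while the executor decides nothing and simply carries the plan out with the tools it has been given.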

The steps for the recipe are as follows:

1.  In this step, we do the necessary imports:

    ```py
    from langchain.agents import AgentType, initialize_agent, load_tools
    from langchain.agents.tools import Tool
    from langchain.chains import LLMMathChain
    from langchain_experimental.plan_and_execute import (
        PlanAndExecute, load_agent_executor, load_chat_planner)
    from langchain.utilities import SerpAPIWrapper
    from langchain_openai import OpenAI
    ```

2.  In this step, we read the API keys for OpenAI and SerpApi. We initialize the LLM using the OpenAI constructor call. We pass in the API key along with the temperature value of **0**. Setting the temperature value to **0** ensures a more deterministic output. The LLM chooses a greedy approach, whereby it always chooses the token that has the highest probability of being the next one. We did not specify a model explicitly as part of this call. We recommend referring to the models listed at [https://platform.openai.com/docs/api-reference/models](https://platform.openai.com/docs/api-reference/models) and choosing one. The default model is set to **gpt-3.5-turbo-instruct** if a model is not specified explicitly:

    ```py
    api_key = 'OPEN_API_KEY' # set your OPENAI API key
    serp_api_key='SERP API KEY' # set your SERPAPI key
    llm = OpenAI(api_key=api_key, temperature=0)
    ```

3.  In this step, we initialize the **search** and **math** helpers. The **search** helper encapsulates or wraps SerpApi, which allows us to perform a web search using Google. The **math** helper uses the **LLMMathChain** class. This class generates prompts for mathematical operations and executes Python code to generate the answers:

    ```py
    search_helper = SerpAPIWrapper(serpapi_api_key=serp_api_key)
    math_helper = LLMMathChain.from_llm(llm=llm, verbose=True)
    ```

4.  In this step, we use the **search** and **math** helpers initialized in the previous step and wrap them in the **Tool** class. The tool class is initialized with a **name**, **func**, and **description**. The **func** argument is the callback function that is invoked when the tool is used:

    ```py
    search_tool = Tool(name='Search', func=search_helper.run,
                       description="use this tool to search for information")
    math_tool = Tool(name='Calculator', func=math_helper.run,
                     description="use this tool for mathematical calculations")
    ```

5.  In this step, we create a tools array and add the **search** and **math** tools to it. This tools array will be used downstream:

    ```py
    tools = [search_tool, math_tool]
    ```

6.  In this step, we initialize an action planner. The planner in this instance has a prompt defined within it. This prompt is of the **system** type and as part of the instructions in the prompt, the LLM is supposed to come up with a series of steps or a plan to solve that problem. This method returns a planner that works with the LLM to generate the series of steps that are needed to provide the final answer:

    ```py
    action_planner = load_chat_planner(llm)
    ```

7.  In this step, we initialize an agent executor. The agent executor calls the agent and invokes its chosen actions. Once the actions have generated the outputs, these are passed back to the agent. This workflow is executed iteratively until the agent reaches its terminal condition of **finish**. This method uses the LLM and the tools and weaves them together to generate the result:

    ```py
    agent_executor = load_agent_executor(llm, tools, verbose=True)
    ```

8.  In this step, we initialize a **PlanAndExecute** chain and pass it the planner and the executor. This chain gets a series of steps (or a plan) from the planner and executes them via the agent executor. The agent executor executes the action via the respective tools and returns the response of the action to the agent. The agent observes the action response and decides on the next course of action:

    ```py
    agent = PlanAndExecute(planner=action_planner,
                           executor=agent_executor, verbose=True)
    ```

9.  We invoke the agent and print its results. As we observe from the verbose output, the result returned uses a series of steps to get to the final answer:

    ```py
    agent.invoke("How many more FIFA world cup wins does Brazil have compared to France?")
    ```

    Let’s analyze the output of the invocation to understand this better. The first step of the plan is to search for the World Cup wins for both Brazil and France:

![](https://github.com/OpenDocCN/freelearn-dl-zh/raw/master/docs/py-nlp-cb/img/B18411_10_3.jpg)

Figure 10.3 – The agent decides to execute the Search action

Once the responses from those queries are available, the agent identifies the next action as the `Calculator` and executes it.

![](https://github.com/OpenDocCN/freelearn-dl-zh/raw/master/docs/py-nlp-cb/img/B18411_10_4.jpg)

Figure 10.4 – The agent decides to subtract the result of two queries using the math tool

Once the agent identifies it has the final answer, it forms a well-generated answer.

![](https://github.com/OpenDocCN/freelearn-dl-zh/raw/master/docs/py-nlp-cb/img/B18411_10_5.jpg)

Figure 10.5 – The LLM composing the final result in a human-readable way

This is the complete verbose output as part of this recipe:

> Entering new PlanAndExecute chain...

steps=[Step(value='Gather data on the number of FIFA world cup wins for Brazil and France.'), Step(value='Calculate the difference between the two numbers.'), Step(value='Output the difference as the answer.\n')]

> Entering new AgentExecutor chain...

Action:

{

"action": "Search",

"action_input": "Number of FIFA world cup wins for Brazil and France"

}

> Finished chain.

Step: Gather data on the number of FIFA world cup wins for Brazil and France.

Response: Action:

{

"action": "Search",

"action_input": "Number of FIFA world cup wins for Brazil and France"

}

> Entering new AgentExecutor chain...

Thought: I can use the Calculator tool to subtract the number of wins for France from the number of wins for Brazil.

Action:

{
  "action": "Calculator",
  "action_input": "5 - 2"
}

> Entering new LLMMathChain chain...

5 - 2

5 - 2

...numexpr.evaluate("5 - 2")...

Answer: 3

> Finished chain.

Observation: Answer: 3

Thought: I now have the final answer.

Action:

{
  "action": "Final Answer",
  "action_input": "The difference between the number of FIFA world cup wins for Brazil and France is 3."
}

> Finished chain.

Step: Calculate the difference between the two numbers.

Response: The difference between the number of FIFA world cup wins for Brazil and France is 3.

> Entering new AgentExecutor chain...

Action:

{

"action": "Final Answer",

"action_input": "The difference between the number of FIFA world cup wins for Brazil and France is 3."

}

> Finished chain.

Step: Output the difference as the answer.

Response: Action:

{

"action": "Final Answer",

"action_input": "The difference between the number of FIFA world cup wins for Brazil and France is 3."

}

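> Finished chain.

{'input': 'How many more FIFA world cup wins does Brazil have compared to France?',

'output': 'Action:\n{\n "action": "Final Answer",\n "action_input": "The difference between the number of FIFA world cup wins for Brazil and France is 3."\n}\n\n'}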
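
Step 2 of this recipe sets the temperature to **0** so that the model always picks the most probable next token. The sketch below illustrates the idea with made-up logits for three candidate tokens; the function names are hypothetical, and real model APIs may treat a temperature of exactly 0 as a special case, since dividing by zero is not possible:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - peak) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def greedy_pick(logits):
    """Greedy decoding: the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda index: logits[index])


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
probs_warm = softmax_with_temperature(logits, 1.0)
probs_cold = softmax_with_temperature(logits, 0.05)
choice = greedy_pick(logits)

print([round(p, 3) for p in probs_warm])
print([round(p, 3) for p in probs_cold])
print(choice)
```

As the temperature approaches 0, almost all of the probability mass moves to the top-scoring token, so sampling behaves like the greedy pick.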


# Using OpenAI models instead of local ones

In this chapter, we used different models. Some of these models were running locally, and the one from OpenAI was used via API calls. We can utilize OpenAI models in all recipes. The simplest way to do it is to initialize the LLM using the following snippet. Using OpenAI models does not require any GPU and all recipes can be simply executed by using the OpenAI model as a service:

import getpass

import os

from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = getpass.getpass()

llm = ChatOpenAI(model="gpt-4o-mini")


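This concludes our chapter on generative AI and LLMs. We have only scratched the surface of what is possible with generative AI; we hope that the examples in this chapter help illuminate the capabilities of LLMs and how they relate to generative AI. We recommend exploring the LangChain website for updates, new tools, and their use cases, and following established best practices when applying them in production. New models are added to the Hugging Face website frequently, and we recommend keeping up with the latest model updates and their related use cases. Doing so makes it easier to become an effective NLP practitioner.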
