Text Data Augmentation and Its Python Implementation

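1. Background

1.1 Data scarcity: the "natural resource shortage" of the AI era

With artificial intelligence and machine learning advancing so quickly, we often hear that "data is the new oil". For text analysis tasks, though, high-quality labeled data is closer to a scarce rare-earth resource: expensive to obtain, yet critical to the quality of the final product.

Suppose you are building a sentiment analysis system for customer reviews on an e-commerce platform. Ideally you would have tens of thousands of reviews labeled "positive", "negative", and "neutral". In practice you usually face the following:

Too little labeled data, especially specialized data for a particular domain

Imbalanced class distribution (for example, 90% positive reviews and 10% negative)

High cost of acquiring new labels (expert annotation is expensive per item)

Uneven data quality, with annotation consistency hard to guarantee

These problems are not special cases; they are common across text analysis. A widely quoted figure, attributed here to Gartner, is that data scientists spend about 80% of their time on data preparation, much of it coping with data scarcity and quality problems.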

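1.2 Data augmentation: "turning stone into gold" for text analysis

Data augmentation offers a way around these problems. It transforms and expands existing text data so that, without additional labeling cost, you get more training data, a better class distribution, and a model that generalizes better.

If raw text data is the basic ingredients, data augmentation is the experienced cook who turns the same ingredients into many different dishes by pan-frying, stir-frying, braising, deep-frying, or roasting them. Even with limited ingredients (data), you can still produce a varied menu (the augmented dataset).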

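1.3 Goals and intended readers

This article is meant as a practical guide to text data augmentation, covering basic concepts, more advanced techniques, and industry applications. It is written for:

NLP engineers who want to improve model performance and robustness

Data scientists facing data scarcity or class imbalance

Machine learning researchers exploring recent augmentation techniques

Business analysts who want to understand how AI techniques can create more value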

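1.4 Core challenges and structure of this article

Before going further, here are the core challenges of text data augmentation:

- Semantic preservation: how to make sure the transformed text keeps the original meaning

- Diversity: how to generate varied samples that actually add information

- Domain adaptation: how to design effective augmentation for specific domains such as medicine or law

- Evaluation: how to measure the effect of augmentation objectively

- Efficiency and cost: how to augment efficiently with limited compute

The rest of the article is organized around these challenges and follows a "concepts, techniques, practice, outlook" progression.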

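2. Core Concepts

2.1 The essence of data augmentation: the art of "similar but different"

At its core, data augmentation generates samples that are "similar but different": semantically close to the original data, but different in surface form. The similarity keeps the augmented data relevant to the original task; the difference adds diversity to the training data and helps the model learn more robust features.

Think of how a language teacher teaches vocabulary:

To explain the word "joy", the teacher may offer synonyms such as "happy", "glad", and "cheerful" (corresponding to synonym-replacement augmentation)

The teacher may express the same idea with different sentence structures: "I am very happy" and "How happy this makes me!" (corresponding to sentence-pattern augmentation)

The teacher may place the word in different contexts: "I was happy to receive the gift" and "I was happy to see my friend" (corresponding to context-expansion augmentation)

With these methods the student (the model) comes to understand "joy" from several angles instead of memorizing a single way of expressing it.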

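2.2 How data augmentation differs from related techniques

Before going deeper, it helps to separate data augmentation from a few related techniques:

Data augmentation vs. data synthesis

Data augmentation: starts from existing real data and produces new samples by transforming it

Data synthesis: generates entirely new data from scratch (for example, text generated with a GAN or a large language model)

Data augmentation vs. transfer learning

Data augmentation: improves model performance by increasing the diversity of the data

Transfer learning: transfers knowledge learned on one task to another, related task

Data augmentation vs. active learning

Data augmentation: derives more data from existing data, with no manual labeling

Active learning: selects the most valuable unlabeled samples for manual labeling

These techniques are not mutually exclusive and can complement each other. For example, you can initialize the model with transfer learning, expand the training set with data augmentation, and use an active learning strategy to pick the key samples that get labeled by hand.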

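2.3 Challenges specific to text data augmentation

Compared with image augmentation, text augmentation faces its own set of challenges:

- Discreteness: an image is a continuous pixel matrix that tolerates smooth transformations such as rotation and scaling. Text is made of discrete words, and a tiny change can shift the meaning drastically (for example, "good" becoming "not good").

- Semantic preservation: an image transformation such as a small rotation usually leaves the meaning of the content intact. Replacing or reordering words in a sentence can change its meaning completely.

- Context dependence: the meaning of a word depends heavily on context (for example, "apple" can be a fruit or a company), whereas the meaning of a pixel is comparatively fixed.

- Domain specificity: text from different domains (law, medicine, technology) has its own terminology and phrasing, so general-purpose augmentation methods may work poorly.

Because of these challenges, text augmentation needs finer-grained, smarter methods than simple random transformations.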
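The discreteness problem can be seen with a toy example. The sketch below is not a real sentiment model: `toy_sentiment`, its word lists, and the negation rule are all made up for illustration. It only shows that inserting a single token can flip the label, which is why random edits need care.

```python
POSITIVE = {"good", "great", "excellent"}   # illustrative lexicon, not a real resource
NEGATORS = {"not", "never", "no"}


def toy_sentiment(text: str) -> str:
    """Toy lexicon classifier: a negator right before a positive word flips it."""
    words = text.lower().split()
    score = 0
    for i, word in enumerate(words):
        if word in POSITIVE:
            negated = i > 0 and words[i - 1] in NEGATORS
            score += -1 if negated else 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


original = "the battery is good"
edited = "the battery is not good"   # one inserted token

print(toy_sentiment(original))  # positive
print(toy_sentiment(edited))    # negative
```

An image rotated by a few degrees keeps its label; here a one-word edit does not, so an augmentation step that inserts or deletes words at random can silently produce mislabeled samples.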

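2.4 Core principles of text data augmentation

Given these challenges, effective text augmentation should follow these principles:

- Semantic consistency: the augmented text should keep the core meaning and the label of the original

- Diversity: augmented samples should vary enough in how they are phrased

- Naturalness: generated text should follow the grammar and idiom of natural language

- Moderation: augment in moderation, because over-augmenting can cause semantic drift

- Task fit: choose methods according to the task and the characteristics of the data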
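The moderation principle can be approximated with a cheap surface check. The sketch below is one possible heuristic, not an established metric: `change_ratio`, `is_moderate`, and the thresholds `min_change` and `max_change` are assumptions chosen for illustration. It measures how many tokens changed, and says nothing about whether the meaning survived.

```python
from difflib import SequenceMatcher


def change_ratio(original: str, augmented: str) -> float:
    """Token-level difference: 0.0 means identical, 1.0 means nothing in common."""
    a = original.lower().split()
    b = augmented.lower().split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def is_moderate(original: str, augmented: str,
                min_change: float = 0.05, max_change: float = 0.5) -> bool:
    """Keep samples that changed a little, reject copies and near-rewrites."""
    return min_change <= change_ratio(original, augmented) <= max_change


source = "the quick brown fox jumps"
print(is_moderate(source, "the fast brown fox jumps"))  # True  (one of five tokens changed)
print(is_moderate(source, source))                      # False (no change at all)
print(is_moderate(source, "a completely new sentence")) # False (nothing shared)
```

A filter like this can run after any augmentation method to discard unchanged copies and samples that drifted too far; a semantic check, for example with sentence embeddings, would be needed to test meaning itself.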

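2.5 A classification of text augmentation techniques

By complexity and underlying technique, text augmentation methods fall into three broad groups:

1. Basic techniques: simple word-level operations such as replacement, insertion, and deletion, giving cheap and efficient data expansion

(1) Synonym replacement

(2) Random insertion

(3) Random deletion

(4) Random swap

(5) Back translation

2. Intermediate techniques: combine language models or knowledge bases for smarter, semantic-level augmentation

(1) Context-aware replacement

(2) Syntactic transformation

(3) Knowledge graph augmentation

(4) Discourse restructuring

3. Advanced techniques: use deep learning models to generate high-quality, diverse augmented text

(1) Generative adversarial network (GAN) augmentation

(2) Pre-trained language model augmentation

(3) Style transfer augmentation

(4) Domain-adaptive augmentation

The classification is not strict; many advanced techniques also build on the basic operations. The technical sections that follow walk through most of these techniques, covering principle, implementation, and typical use.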

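3. Techniques and Implementation

3.1 Basic techniques

3.1.1 Synonym Replacement

Principle: replace non-key words in a sentence with their synonyms, so that the meaning stays the same while the wording varies.

Steps:

1. Tokenize and filter out stop words

2. Identify replaceable content words (nouns, verbs, adjectives, adverbs)

3. Pick a suitable synonym for each word to be replaced

4. Substitute the selected words

Code: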

import random

import nltk
from nltk.corpus import wordnet
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# Download the required resources. Newer NLTK releases may additionally
# need 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


def get_synonyms(word, tag):
    """Return WordNet synonyms of `word`, given its Penn Treebank POS tag."""
    # Map Penn Treebank tag prefixes to WordNet POS constants
    tag_map = {
        'NN': wordnet.NOUN,
        'VB': wordnet.VERB,
        'JJ': wordnet.ADJ,
        'RB': wordnet.ADV,
    }
    wordnet_tag = tag_map.get(tag[:2])
    if not wordnet_tag:
        return []

    synonyms = set()
    for syn in wordnet.synsets(word, pos=wordnet_tag):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ')
            if synonym.lower() != word.lower():  # skip the word itself
                synonyms.add(synonym)
    return list(synonyms)


def synonym_replacement(text, replace_rate=0.2):
    """Synonym-replacement augmentation."""
    words = word_tokenize(text)
    tagged_words = pos_tag(words)

    # Positions of content words (nouns, verbs, adjectives, adverbs).
    # Tracking positions instead of word strings keeps repeated words apart.
    candidates = [i for i, (_, tag) in enumerate(tagged_words)
                  if tag.startswith(('NN', 'VB', 'JJ', 'RB'))]
    if not candidates:
        return text

    # Number of words to replace
    num_replace = max(1, int(len(candidates) * replace_rate))
    random.shuffle(candidates)

    augmented_words = words.copy()
    replaced = 0
    for idx in candidates:
        if replaced >= num_replace:
            break
        # The loop variable is called `tag`, not `pos_tag`, so that the
        # imported NLTK function pos_tag() is not shadowed
        word, tag = tagged_words[idx]
        synonyms = get_synonyms(word, tag)
        if synonyms:
            augmented_words[idx] = random.choice(synonyms)
            replaced += 1

    # Joining on spaces leaves a space before punctuation ("dog .")
    return ' '.join(augmented_words)


# Example
original_text = "The quick brown fox jumps over the lazy dog."
augmented_text = synonym_replacement(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

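Sample output (illustrative; results vary from run to run):

Original: "The quick brown fox jumps over the lazy dog."

Augmented: "The fast brown fox leaps over the lazy dog."

Pros and cons:

Pros: simple to implement, cheap to compute, usually preserves the core meaning

Cons: can produce unnatural phrasing, does not capture context dependence, limited effect on specialized domain text

Typical use: basic tasks such as text classification and sentiment analysis, where strict semantic preservation is not essential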
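For specialized text, where WordNet coverage is weak, one common workaround is a hand-curated synonym table. The sketch below assumes such a table exists: `DOMAIN_SYNONYMS`, the `protected` term set, and the function name are hypothetical and only illustrate the idea. It splits on whitespace, so words with attached punctuation are left alone.

```python
import random
from typing import Dict, FrozenSet, List, Optional

# Hypothetical domain lexicon; in practice it would come from a curated glossary
DOMAIN_SYNONYMS: Dict[str, List[str]] = {
    "physician": ["doctor"],
    "medication": ["drug"],
}


def domain_synonym_replacement(
    text: str,
    synonyms: Dict[str, List[str]],
    protected: FrozenSet[str] = frozenset(),
    rate: float = 0.3,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace words found in `synonyms`, never touching `protected` terms."""
    rng = rng or random.Random()
    words = text.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() in synonyms and w.lower() not in protected]
    if not candidates:
        return text
    num_replace = max(1, int(len(candidates) * rate))
    for i in rng.sample(candidates, num_replace):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


sentence = "The physician changed the medication"
print(domain_synonym_replacement(sentence, DOMAIN_SYNONYMS, rate=1.0))
# The doctor changed the drug
print(domain_synonym_replacement(sentence, DOMAIN_SYNONYMS,
                                 protected=frozenset({"medication"}), rate=1.0))
# The doctor changed the medication
```

The `protected` set is the important design choice: terms whose exact wording carries the label or a precise technical meaning can be excluded from replacement entirely.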

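3.1.2 Random Insertion

Principle: pick a random word in the sentence, look up one of its synonyms, and insert it at a random position, which increases sentence length and diversity.

Code: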

def random_insertion(text, insert_rate=0.2):
    """Random-insertion augmentation: insert synonyms of content words."""
    words = word_tokenize(text)
    tagged_words = pos_tag(words)

    # Content words whose synonyms may be inserted
    content_words = [(word, tag) for word, tag in tagged_words
                     if tag.startswith(('NN', 'VB', 'JJ', 'RB'))]
    if not content_words:
        return text

    # Number of words to insert
    num_inserts = max(1, int(len(words) * insert_rate))
    augmented_words = words.copy()

    for _ in range(num_inserts):
        # Pick a random content word. The variable is called `tag`, not
        # `pos_tag`, so the imported NLTK function pos_tag() is not shadowed.
        word, tag = random.choice(content_words)
        synonyms = get_synonyms(word, tag)
        if synonyms:
            insertion = random.choice(synonyms)
            # randint is inclusive, so the end of the sentence is a valid slot
            insert_pos = random.randint(0, len(augmented_words))
            augmented_words.insert(insert_pos, insertion)

    return ' '.join(augmented_words)


# Example
original_text = "I love programming in Python because it is simple and powerful."
augmented_text = random_insertion(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

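Sample output (illustrative; results vary from run to run):

Original: "I love programming in Python because it is simple and powerful."

Augmented: "I love programming in Python because it is simple and powerful. easy"

3.1.3 Random Deletion

Principle: delete each word of the sentence with a given probability, which makes the model more robust to noise and incomplete input.

Code: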

def random_deletion(text, delete_prob=0.1):
    """Random-deletion augmentation."""
    words = word_tokenize(text)

    # Leave very short sentences untouched
    if len(words) <= 3:
        return text

    # Keep each word with probability 1 - delete_prob
    augmented_words = [word for word in words
                       if random.random() > delete_prob]

    # Make sure at least one word survives
    if not augmented_words:
        return random.choice(words)

    return ' '.join(augmented_words)


# Example
original_text = "Natural language processing is a subfield of artificial intelligence."
augmented_text = random_deletion(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

Sample output (illustrative; results vary from run to run):

Original: "Natural language processing is a subfield of artificial intelligence."

Augmented: "Natural processing is a subfield of artificial intelligence."
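A quick calculation helps when choosing the deletion probability. It assumes each of the n tokens is dropped independently with probability p, which is what the per-word check `random.random() > delete_prob` amounts to:

```latex
\mathbb{E}[\text{deleted tokens}] = n\,p,
\qquad
P(\text{sentence unchanged}) = (1-p)^{n}
% Example: n = 10, p = 0.1 gives E = 1 and P(unchanged) = 0.9^{10} \approx 0.35
```

The example sentence above tokenizes into 10 tokens (nine words plus the period). With p = 0.1, about one token is removed on average, and roughly a third of the calls return the sentence unchanged, so duplicates are worth filtering out.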

3.1.4 Random Swap

Principle: swap the positions of two randomly chosen words in the sentence to vary its structure.

Code:

def random_swap(text, swap_rate=0.2):
    """Random-swap augmentation."""
    words = word_tokenize(text)

    # Leave very short sentences untouched
    if len(words) <= 3:
        return text

    # Number of swaps to perform
    num_swaps = max(1, int(len(words) * swap_rate))
    augmented_words = words.copy()

    for _ in range(num_swaps):
        # Pick two distinct positions and swap them
        idx1, idx2 = random.sample(range(len(augmented_words)), 2)
        augmented_words[idx1], augmented_words[idx2] = (
            augmented_words[idx2], augmented_words[idx1])

    return ' '.join(augmented_words)


# Example
original_text = "Data augmentation techniques improve the performance of machine learning models."
augmented_text = random_swap(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

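Sample output (illustrative; results vary from run to run):

Original: "Data augmentation techniques improve the performance of machine learning models."

Augmented: "Data improve techniques augmentation the performance of machine learning models."

3.1.5 EDA (Easy Data Augmentation)

Principle: combine the four basic methods above (synonym replacement, random insertion, random deletion, random swap) into a single augmentation strategy.

Code: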

def eda_augmentation(text, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4):
    """
    EDA augmentation.

    Parameters:
        text: original text
        alpha_sr: synonym replacement rate
        alpha_ri: random insertion rate
        alpha_rs: random swap rate
        p_rd: random deletion probability
        num_aug: number of augmented samples to generate
    """
    augmented_texts = []

    for _ in range(num_aug):
        # Pick one of the four operations at random for each sample
        aug_method = random.choice(['sr', 'ri', 'rs', 'rd'])
        if aug_method == 'sr':
            aug_text = synonym_replacement(text, alpha_sr)
        elif aug_method == 'ri':
            aug_text = random_insertion(text, alpha_ri)
        elif aug_method == 'rs':
            aug_text = random_swap(text, alpha_rs)
        else:  # 'rd'
            aug_text = random_deletion(text, p_rd)
        augmented_texts.append(aug_text)

    # Remove duplicates (this does not preserve order)
    augmented_texts = list(set(augmented_texts))

    # Always return at least one sample
    if not augmented_texts:
        augmented_texts.append(text)

    return augmented_texts


# Example
original_text = "The cat sat on the mat and purred contently."
augmented_texts = eda_augmentation(original_text)
print(f"Original: {original_text}")
print("Augmented:")
for i, text in enumerate(augmented_texts):
    print(f" {i+1}. {text}")

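Sample output (illustrative; results vary from run to run):

Original: "The cat sat on the mat and purred contently."

Augmented:

 1. "The cat sat on mat and purred contently."

 2. "The cat sat on the mat and purred happily."

 3. "The cat sat the on mat and purred contently."

 4. "The cat sat on the mat and purred contently. content"

Reported results: the EDA paper (Wei and Zou, 2019) found gains on five text classification tasks, with the largest gains on small training sets. On average, training with EDA on 50% of the training data matched the accuracy of normal training on the full dataset.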
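In training, each augmented text has to carry the label of the sample it came from. The helper below is a minimal sketch of that bookkeeping; `augment_dataset` and `toy_augment` are hypothetical names, and `toy_augment` is a deterministic stand-in for a real augmenter that returns a list of variants, such as an EDA function.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def augment_dataset(
    samples: List[Sample],
    augment_fn: Callable[[str], List[str]],
    keep_original: bool = True,
) -> List[Sample]:
    """Expand a labeled dataset; every variant inherits its source label."""
    expanded: List[Sample] = []
    seen = set()
    for text, label in samples:
        variants = ([text] if keep_original else []) + list(augment_fn(text))
        for variant in variants:
            key = (variant, label)
            if key not in seen:  # drop exact duplicates, keep first-seen order
                seen.add(key)
                expanded.append(key)
    return expanded


def toy_augment(text: str) -> List[str]:
    """Stand-in augmenter for the demo: one fixed substitution plus a copy."""
    return [text.replace("good", "great"), text]


train = [("good phone", "positive"), ("bad battery", "negative")]
for text, label in augment_dataset(train, toy_augment):
    print(label, "|", text)
# positive | good phone
# positive | great phone
# negative | bad battery
```

Copying the label is only valid for label-preserving operations; any operation that can change the meaning needs a check or a filter before its output is added.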

3.1.6 Back Translation

Principle: translate the text into another language and then back into the original language, producing text with similar meaning but different wording.

Code:

from deep_translator import GoogleTranslator


def back_translation(text, src_lang='en', intermediate_lang='fr'):
    """
    Back-translation augmentation.

    Parameters:
        text: original text
        src_lang: language of the original text
        intermediate_lang: pivot language
    """
    try:
        # Translate into the pivot language
        translator = GoogleTranslator(source=src_lang, target=intermediate_lang)
        intermediate_text = translator.translate(text)

        # Translate back into the source language
        translator = GoogleTranslator(source=intermediate_lang, target=src_lang)
        return translator.translate(intermediate_text)
    except Exception as e:
        print(f"Translation failed: {e}")
        return text


def multi_back_translation(text, src_lang='en',
                           intermediate_langs=('fr', 'de', 'es', 'zh-CN')):
    """Back-translate through several pivot languages to get several samples."""
    augmented_texts = []
    for lang in intermediate_langs:
        # back_translation() returns the input unchanged if the call fails
        aug_text = back_translation(text, src_lang, lang)
        # Keep only results that changed and have not been seen yet
        if aug_text and aug_text != text and aug_text not in augmented_texts:
            augmented_texts.append(aug_text)
    return augmented_texts


# Example
original_text = "Artificial intelligence is transforming the way we live and work."
augmented_texts = multi_back_translation(original_text)
print(f"Original: {original_text}")
print("Augmented:")
for i, text in enumerate(augmented_texts):
    print(f" {i+1}. {text}")

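Sample output (illustrative; it depends on the translation service):

Original: "Artificial intelligence is transforming the way we live and work."

Augmented:

 1. "Artificial intelligence is changing the way we live and work."

 2. "Artificial intelligence is revolutionizing our lifestyles and work."

 3. "Artificial intelligence is transforming how we live and work."

Pros and cons:

Pros: produces fluent, grammatical, varied text and generally preserves the meaning well.

Cons: depends on the quality of the translation service, costs more to run, and can introduce translation errors.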
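Because every back translation costs two translation requests, caching repeated requests is a cheap way to cut that cost. The sketch below assumes a translate callable with the signature `translate(text, source, target)`; that signature, `make_cached_back_translator`, and the `fake_translate` stand-in are all assumptions for illustration, and no real service is called.

```python
from typing import Callable, Dict, List, Tuple

Translate = Callable[[str, str, str], str]  # (text, source, target) -> translated text


def make_cached_back_translator(translate: Translate) -> Callable[..., str]:
    """Wrap a translate function so that each distinct request is sent only once."""
    cache: Dict[Tuple[str, str, str], str] = {}

    def cached_translate(text: str, source: str, target: str) -> str:
        key = (text, source, target)
        if key not in cache:
            cache[key] = translate(text, source, target)
        return cache[key]

    def back_translate(text: str, src_lang: str = "en", pivot_lang: str = "fr") -> str:
        pivot = cached_translate(text, src_lang, pivot_lang)
        return cached_translate(pivot, pivot_lang, src_lang)

    return back_translate


# Stand-in translator for the demo: reversing twice restores the text
calls: List[Tuple[str, str, str]] = []


def fake_translate(text: str, source: str, target: str) -> str:
    calls.append((text, source, target))
    return text[::-1]


back_translate = make_cached_back_translator(fake_translate)
back_translate("hello")
back_translate("hello")
print(len(calls))  # 2 requests instead of 4
```

With deep_translator, the callable could be `lambda text, s, t: GoogleTranslator(source=s, target=t).translate(text)`. For large datasets the in-memory dictionary would typically be replaced by a cache on disk.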

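3.2 Intermediate techniques

3.2.1 Contextual Word Replacement

Principle: unlike plain synonym replacement, context-aware replacement takes into account what a word means in its particular context and chooses the replacement accordingly. This matters most for polysemous words.

Approach: use a pre-trained language model such as BERT to obtain a contextual embedding for each word, then look for words that are close in embedding space but not identical and use one as the replacement. The simplified implementation below searches for candidates only among the other words of the same sentence.

Code: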

import random

import numpy as np
import torch
from scipy.spatial.distance import cosine
from transformers import BertTokenizer, BertModel

# Load the pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()


def get_contextual_embedding(text):
    """Return a list of (word, contextual embedding) pairs for `text`."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # final-layer hidden states

    # Merge WordPiece sub-tokens ("##...") back into words and average their
    # vectors. Index i + 1 skips the leading [CLS] token.
    tokens = tokenizer.tokenize(text)
    word_embeddings = []
    current_word = []
    current_embedding = []
    for i, token in enumerate(tokens):
        vector = last_hidden_state[0, i + 1].numpy()
        if token.startswith('##'):
            # Sub-token: append to the word in progress
            current_word.append(token[2:])
            current_embedding.append(vector)
        else:
            if current_word:
                # Finish the previous word
                word_embeddings.append((''.join(current_word),
                                        np.mean(current_embedding, axis=0)))
            current_word = [token]
            current_embedding = [vector]
    if current_word:  # flush the last word
        word_embeddings.append((''.join(current_word),
                                np.mean(current_embedding, axis=0)))
    return word_embeddings


def contextual_replacement(text, top_k=5, replace_rate=0.2):
    """Context-aware replacement (simplified).

    Limitation: candidates come only from the other words of the same
    sentence, so the output is a lowercased sentence whose words are all
    drawn from the input. A vocabulary-wide search is needed for real synonyms.
    """
    word_embeddings = get_contextual_embedding(text)
    if not word_embeddings:
        return text

    words = [word for word, _ in word_embeddings]
    embeddings = np.array([emb for _, emb in word_embeddings])

    # Only alphabetic tokens are replaced, so punctuation stays in place
    replaceable = [i for i, w in enumerate(words) if w.isalpha()]
    if not replaceable:
        return text
    num_replace = max(1, int(len(words) * replace_rate))
    replace_indices = random.sample(replaceable, min(num_replace, len(replaceable)))

    augmented_words = words.copy()
    for idx in replace_indices:
        original_word = words[idx]
        original_emb = embeddings[idx]

        # Cosine similarity to every other word (higher means more similar)
        similarities = []
        for i, (word, emb) in enumerate(word_embeddings):
            if i != idx and word.isalpha():
                similarities.append((word, 1 - cosine(original_emb, emb)))

        # Take the top_k most similar words as candidates
        similarities.sort(key=lambda x: x[1], reverse=True)
        candidates = [word for word, _ in similarities[:top_k]
                      if word != original_word]
        if candidates:
            augmented_words[idx] = random.choice(candidates)

    return ' '.join(augmented_words)


# Example
original_text = "I went to the bank to deposit my money and then sat on the river bank."
augmented_text = contextual_replacement(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

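Target output (what a contextual method aims for):

Original: "I went to the bank to deposit my money and then sat on the river bank."

Augmented: "I went to the bank to deposit my money and then sat on the river shore."

In this example a contextual model can tell the two senses of "bank" apart (the financial institution and the riverside), because each occurrence gets a different embedding. A suitable replacement for the second "bank" is "shore", whereas plain synonym replacement has no way to distinguish the two occurrences. Note that the simplified implementation above only draws candidates from the sentence itself, so it cannot produce "shore"; that requires candidates from the whole vocabulary, for example the predictions of a masked language model.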
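A common way to get vocabulary-wide candidates is to mask the target word and let a masked language model fill the gap. The sketch below shows only the masking and selection logic. It assumes a `fill_mask` callable that takes the masked sentence and returns a list of dicts with a `token_str` key, which is the shape returned by the Hugging Face `fill-mask` pipeline for a single mask; `masked_lm_replacement` and `stub_fill_mask` are hypothetical names, and the stub's predictions are hard-coded for the demo.

```python
import random
from typing import Callable, Dict, List, Optional

MASK = "[MASK]"  # BERT's mask token; other model families use a different one


def masked_lm_replacement(
    text: str,
    fill_mask: Callable[[str], List[Dict]],
    position: int,
    top_k: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace the word at `position` with a prediction from a masked LM."""
    rng = rng or random.Random()
    words = text.split()
    original = words[position]
    core = original.rstrip(".,!?;:")      # keep trailing punctuation aside
    tail = original[len(core):]

    masked = " ".join(words[:position] + [MASK + tail] + words[position + 1:])
    predictions = fill_mask(masked)[:top_k]
    candidates = [p["token_str"].strip() for p in predictions
                  if p["token_str"].strip().lower() != core.lower()]
    if not candidates:
        return text
    words[position] = rng.choice(candidates) + tail
    return " ".join(words)


def stub_fill_mask(masked_text: str) -> List[Dict]:
    """Hard-coded stand-in for a real model, for illustration only."""
    return [{"token_str": "bank", "score": 0.6},
            {"token_str": "shore", "score": 0.2}]


print(masked_lm_replacement("I sat on the river bank.", stub_fill_mask, position=5))
# I sat on the river shore.
```

With transformers installed, `fill_mask` could be `pipeline("fill-mask", model="bert-base-uncased")`. Predictions from a masked model fit the context but are not guaranteed to be synonyms, so a similarity filter or a label check is still advisable.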

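3.2.2 Syntactic Transformation

Principle: generate new sentences by changing the syntactic structure, for example turning an active sentence into a passive one or moving modifiers, while keeping the meaning. Turning an affirmative sentence into a negative one, also shown in the code, reverses the meaning, so it is only usable when the label is updated accordingly (for example to create contrast samples).

Code (using spaCy for syntactic analysis):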

import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")


def _phrase(token):
    """Text of the whole phrase headed by `token` (its dependency subtree)."""
    return ' '.join(t.text for t in token.subtree)


def active_to_passive(sentence):
    """Convert a simple active sentence to passive voice (rough heuristic)."""
    doc = nlp(sentence)
    for token in doc:
        # Look for a main verb with both a subject and a direct object
        if token.dep_ != "ROOT" or token.pos_ != "VERB":
            continue
        subj = next((c for c in token.children if c.dep_ == "nsubj"), None)
        obj = next((c for c in token.children if c.dep_ == "dobj"), None)
        if subj is None or obj is None:
            continue

        # Simplification: regular "-ed" participle and present-tense "is";
        # irregular verbs, tense, and number agreement are not handled.
        # spaCy exposes the lemma as token.lemma_ (not token._.lemma).
        lemma = token.lemma_
        participle = lemma + ("d" if lemma.endswith("e") else "ed")
        parts = [_phrase(obj), "is", participle, "by", _phrase(subj)]

        # Append the remaining modifiers (e.g. "last month"), skipping punctuation
        for child in token.children:
            if child.dep_ not in ("nsubj", "dobj", "punct"):
                parts.append(_phrase(child))
        return ' '.join(parts)
    return sentence


def add_negation(sentence):
    """Negate the main verb. This reverses the meaning of the sentence."""
    doc = nlp(sentence)
    root = next((t for t in doc
                 if t.dep_ == "ROOT" and t.pos_ in ("VERB", "AUX")), None)
    if root is None:
        return sentence
    aux = next((c for c in root.children if c.dep_ in ("aux", "auxpass")), None)

    pieces = []
    for t in doc:
        if aux is not None and t.i == aux.i:
            # "has released" -> "has not released"
            pieces.append(t.text + " not" + t.whitespace_)
        elif aux is None and t.i == root.i:
            if root.pos_ == "AUX":
                # Copula: "is" -> "is not"
                pieces.append(t.text + " not" + t.whitespace_)
            else:
                # Do-support: "released" -> "did not release"
                do = {"VBD": "did", "VBZ": "does"}.get(t.tag_, "do")
                pieces.append(f"{do} not {t.lemma_}{t.whitespace_}")
        else:
            pieces.append(t.text_with_ws)
    return ''.join(pieces)


def syntactic_transformation(text):
    """Syntactic-transformation augmentation: try several rewrites."""
    transformations = []

    # Active to passive
    passive = active_to_passive(text)
    if passive and passive != text:
        transformations.append(passive)

    # Negation (meaning-reversing; handle the label accordingly)
    negation = add_negation(text)
    if negation and negation != text:
        transformations.append(negation)

    # Fall back to the original text if nothing changed
    return transformations if transformations else [text]


# Example
original_text = "The company released a new product last month."
transformations = syntactic_transformation(original_text)
print(f"Original: {original_text}")
print("Transformed:")
for i, text in enumerate(transformations):
    print(f" {i+1}. {text}")

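Sample output (illustrative; the exact result depends on the spaCy parse):

Original: "The company released a new product last month."

Transformed:

 1. "a new product is released by The company last month"

 2. "The company did not release a new product last month."

Note: this implementation is simplified. Real use needs more thorough syntactic analysis and conversion rules (tense, agreement, irregular verbs), or a dedicated syntactic transformation model.

3.2.3 Knowledge Graph Augmentation

Principle: use an external knowledge source (such as WordNet, ConceptNet, or DBpedia) to add related information about entities in the text and enrich its content.

Approach: identify the entities in the text, fetch related concepts, attributes, or relations from the knowledge graph, and weave this information into the original text in natural language.

Code (using ConceptNet):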

import random
from urllib.parse import quote

import requests


def get_related_concepts(word, lang='en', limit=3):
    """Fetch concepts related to `word` from the ConceptNet web API."""
    # ConceptNet terms are lowercase, with underscores instead of spaces
    term = quote(word.lower().replace(' ', '_'))
    url = (f"http://api.conceptnet.io/query?node=/c/{lang}/{term}"
           f"&rel=/r/RelatedTo&limit={limit}")
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return []
    if response.status_code != 200:
        return []

    related_concepts = set()
    for edge in response.json().get('edges', []):
        # The queried word can sit at either end of an edge
        for node in (edge.get('start', {}), edge.get('end', {})):
            concept = node.get('label', '')
            # Skip other languages and the word itself
            if (concept and node.get('language') == lang
                    and concept.lower() != word.lower()):
                related_concepts.add(concept)
    return list(related_concepts)


def knowledge_graph_augmentation(text, add_prob=0.3):
    """Knowledge-graph augmentation.

    Reuses the spaCy pipeline `nlp` loaded in section 3.2.2.
    """
    doc = nlp(text)
    augmented_tokens = []
    for token in doc:
        # For nouns, try to append a related concept
        if token.pos_ == "NOUN" and random.random() < add_prob:
            related_concepts = get_related_concepts(token.text)
            if related_concepts:
                concept = random.choice(related_concepts)
                # "RelatedTo" is a loose relation, so "such as" is only
                # an approximation of how the two terms are connected
                augmented_tokens.append(f"{token.text} (such as {concept})")
                continue
        augmented_tokens.append(token.text)
    return ' '.join(augmented_tokens)


# Example
original_text = "I need to buy a laptop for programming and data analysis."
augmented_text = knowledge_graph_augmentation(original_text)
print(f"Original:  {original_text}")
print(f"Augmented: {augmented_text}")

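Sample output (illustrative; it depends on the API response and on random choices):

Original: "I need to buy a laptop for programming and data analysis."

Augmented: "I need to buy a laptop (such as notebook) for programming (such as software_development) and data analysis."

3.3 Advanced techniques

3.3.1 Transformer-based text generation

Principle: use a pre-trained language model (such as GPT, BART, or T5) to generate new text that is semantically close to the original but worded differently.

Code (using the Hugging Face transformers library):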

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


def paraphrase_generation(text, model_name="t5-small", num_return_sequences=3):
    """Generate paraphrases with a T5-style seq2seq model.

    Note: the stock "t5-small" checkpoint was not trained on a "paraphrase:"
    task, so for usable output pass a T5 checkpoint fine-tuned for paraphrasing.
    """
    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    paraphraser = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

    # T5 models expect a task prefix
    input_text = f"paraphrase: {text}"

    paraphrases = paraphraser(
        input_text,
        max_length=100,
        num_return_sequences=num_return_sequences,
        do_sample=True,    # sampling is needed to get several distinct outputs
        temperature=0.7,   # higher values give more random output
        top_k=50,
        top_p=0.95,
    )

    # Extract the generated strings
    return [para['generated_text'] for para in paraphrases]


# Example
original_text = "Climate change is one of the most pressing issues facing our planet today."
paraphrases = paraphrase_generation(original_text)
print(f"Original: {original_text}")
print("Paraphrases:")
for i, para in enumerate(paraphrases):
    print(f" {i+1}. {para}")

Sample output (illustrative, of the kind a paraphrase-tuned model produces):

Original: "Climate change is one of the most pressing issues facing our planet today."

Paraphrases:

 1. "Today, climate change is one of the most critical problems confronting our planet."

 2. "One of the most urgent challenges facing our planet today is climate change."

 3. "Our planet is currently facing one of the most pressing problems: climate change."

Going further: for a specific domain, the pre-trained model can be fine-tuned on in-domain data so that the generated text better matches the domain.
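Sampling-based generation often returns near-copies of the input or of each other, so a small post-filter is useful before the paraphrases are added to a training set. The sketch below is one simple way to do it; `filter_paraphrases` and its normalization rule (lowercase, no punctuation, collapsed whitespace) are assumptions for illustration.

```python
import re
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def filter_paraphrases(source: str, candidates: Iterable[str]) -> List[str]:
    """Drop empty outputs, copies of the source, and duplicate candidates."""
    seen = {_normalize(source)}
    kept: List[str] = []
    for candidate in candidates:
        key = _normalize(candidate)
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
    return kept


generated = ["hello world", "Hi there, world.", "Hi there world", ""]
print(filter_paraphrases("Hello world.", generated))
# ['Hi there, world.']
```

This only removes trivial duplicates. Checking that a paraphrase still means the same thing needs a semantic comparison, which is outside the scope of this sketch.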

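3.3.2 Conditional generation

Principle: generate text with a target attribute (such as sentiment, topic, or domain) in order to balance the training distribution or adapt to a specific scenario.

Code (generating text with a given sentiment):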

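# pipeline, AutoTokenizer and AutoModelForSeq2SeqLM are imported in section 3.3.1

def conditional_text_generation(text, target_sentiment="positive",
                                model_name="mrm8488/t5-base-finetuned-emotion"):
    """Generate text conditioned on a target sentiment.

    Caution: the default checkpoint is published as an emotion classifier
    (text in, emotion label out), so it is not expected to rewrite text.
    The "<label>: <text>" input format below only works with a seq2seq
    model fine-tuned on that format; substitute such a model here.
    """
    # Load the conditional generation model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    generator = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

    # Input format: "<sentiment label>: <original text>"
    input_text = f"{target_sentiment}: {text}"

    result = generator(
        input_text,
        max_length=100,
        do_sample=True,    # temperature only takes effect when sampling
        temperature=0.8,
    )
    return result[0]['generated_text']


# Example
original_text = "The new phone has a great camera but the battery life is disappointing."
positive_version = conditional_text_generation(original_text, "positive")
negative_version = conditional_text_generation(original_text, "negative")
print(f"Original: {original_text}")
print(f"Positive version: {positive_version}")
print(f"Negative version: {negative_version}")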

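Sample output (illustrative, of the kind a sentiment-conditioned generator aims for):

Original: "The new phone has a great camera but the battery life is disappointing."

Positive version: "The new phone has an amazing camera and the battery life exceeds all expectations!"

Negative version: "The new phone has a terrible camera and the battery life is extremely disappointing."

This approach is especially useful for class imbalance: when negative samples are far rarer than positive ones, more negative samples can be generated to fill the gap.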
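Before generating anything, it helps to work out how many samples each class is short by. The sketch below is a minimal planning helper; `augmentations_needed` and its `target_ratio` parameter are hypothetical names, and it assumes the goal is to bring every class up to a fraction of the largest class.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def augmentations_needed(labels: Iterable[str],
                         target_ratio: float = 1.0) -> Dict[str, Dict[str, int]]:
    """For each class: samples missing, and variants needed per existing sample."""
    counts = Counter(labels)
    target = math.ceil(max(counts.values()) * target_ratio)
    plan = {}
    for label, count in counts.items():
        missing = max(0, target - count)
        plan[label] = {
            "missing": missing,
            "per_sample": math.ceil(missing / count),
        }
    return plan


labels = ["positive"] * 90 + ["negative"] * 10
print(augmentations_needed(labels))
# {'positive': {'missing': 0, 'per_sample': 0}, 'negative': {'missing': 80, 'per_sample': 8}}
```

For the 90/10 split mentioned earlier, each negative review would need about eight generated variants to reach parity. Lowering `target_ratio` is a way to limit how much of the minority class ends up synthetic.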

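3.3.3 Style Transfer Augmentation

Principle: keep the semantic content of the text and convert it to a different style (formal or informal, technical or plain, and so on).

Approach: use a dedicated style transfer model, or use prompt engineering to steer a large language model toward text in the target style.

Code (using prompt engineering):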

# pipeline and AutoTokenizer are imported in section 3.3.1

def style_transfer(text, target_style="formal", model_name="gpt2"):
    """Prompt-based style transfer.

    Note: "gpt2" is a plain language model, not an instruction-tuned one, so
    it tends to continue the prompt rather than follow it. Use an
    instruction-tuned model for usable rewrites.
    """
    # Prompt templates for each target style
    style_prompts = {
        "formal": f"Rewrite the following text in a formal academic style: {text}\nFormal version:",
        "informal": f"Rewrite the following text in a casual, conversational style: {text}\nCasual version:",
        "technical": f"Explain the following concept using technical terminology: {text}\nTechnical explanation:",
        "simple": f"Explain the following concept using simple language for a child to understand: {text}\nSimple explanation:"
    }
    if target_style not in style_prompts:
        raise ValueError(f"Unsupported target style: {target_style}")
    prompt = style_prompts[target_style]

    # Load the tokenizer once and build the generator
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

    result = generator(
        prompt,
        max_new_tokens=100,   # budget in tokens, independent of prompt length
        do_sample=True,       # temperature only takes effect when sampling
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
    )
    generated_text = result[0]['generated_text']

    # The output contains the prompt; keep what follows the final cue
    if "version:" in generated_text:
        return generated_text.split("version:")[-1].strip()
    elif "explanation:" in generated_text:
        return generated_text.split("explanation:")[-1].strip()
    return generated_text


# Example
original_text = "AI is changing how we live and work in many ways."
formal_version = style_transfer(original_text, "formal")
informal_version = style_transfer(original_text, "informal")
print(f"Original: {original_text}")
print(f"Formal:   {formal_version}")
print(f"Informal: {informal_version}")

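Sample output (illustrative, of the kind an instruction-following model produces):

Original: "AI is changing how we live and work in many ways."

Formal: "Artificial intelligence is transforming numerous aspects of human existence and professional activities."

Informal: "AI's totally changing how we live and work, like in so many ways, it's crazy!"

3.4 Augmentation strategies and combinations

A single augmentation method often has limited effect; combining several can work better. Here are some common combination strategies:

3.4.1 Sequential combination

Apply several methods in a fixed order: first back translation to get a semantically similar text, then synonym replacement on the back-translated result, and finally random deletion.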

def sequential_augmentation(text):
    """Apply several augmentation methods in sequence."""
    # 1. Back translation first
    aug_text = back_translation(text)
    # 2. Then synonym replacement
    aug_text = synonym_replacement(aug_text)
    # 3. Finally random deletion
    aug_text = random_deletion(aug_text)
    return aug_text
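The same idea can be written once for any list of steps. The helper below is a small sketch: `compose` and the type alias `Augmenter` are hypothetical names, and the two demo steps are trivial stand-ins so that the output is deterministic.

```python
from functools import reduce
from typing import Callable

Augmenter = Callable[[str], str]


def compose(*steps: Augmenter) -> Augmenter:
    """Chain augmenters left to right: compose(f, g)(x) == g(f(x))."""
    def run(text: str) -> str:
        return reduce(lambda current, step: step(current), steps, text)
    return run


# Trivial stand-in steps for the demo
def shout(text: str) -> str:
    return text.upper()


def exclaim(text: str) -> str:
    return text + "!"


print(compose(shout, exclaim)("nice phone"))  # NICE PHONE!
```

With the functions from this article, `compose(back_translation, synonym_replacement, random_deletion)` would behave like `sequential_augmentation`, and reordering or dropping a step becomes a one-line change.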

3.4.2 Random combination

For each augmented sample, pick one or more methods at random and apply them in combination:

def random_combination_augmentation(text, num_aug=5):
    """Apply random combinations of augmentation methods."""
    augmentation_methods = [
        synonym_replacement,
        random_insertion,
        random_deletion,
        random_swap,
        back_translation,
        contextual_replacement
    ]

    augmented_texts = []
    for _ in range(num_aug):
        # Pick 1 to 3 methods at random
        num_methods = random.randint(1, 3)
        selected_methods = random.sample(augmentation_methods, num_methods)

        aug_text = text
        for method in selected_methods:
            try:
                aug_text = method(aug_text)
            except Exception:
                # Skip a failing method and keep the text produced so far
                continue

        if aug_text != text:
            augmented_texts.append(aug_text)

    return list(set(augmented_texts))  # remove duplicates

Posted on 2025-07-16 21:25 by limingqi
