Competition Learning Activity: Text Matching

Text Matching


Key Points

  • Computing statistical distances between texts
  • Training word vectors & unsupervised sentence encoding
  • Building and training a BERT model

Application Scenarios for Text Matching

Text matching is a fundamental problem in NLP. Typical application scenarios include:

  • Information retrieval: retrieving texts similar to a query;
  • News recommendation: recommending similar news articles;
  • Intelligent customer service: retrieving questions and answers similar to the user's question.

Dataset

The LCQMC dataset focuses on intent matching, which is more general than a paraphrase corpus. It contains 260,068 human-annotated question pairs, split as follows:

  • train_set: 238766
  • dev_set: 8802
  • test_set: 12500

Evaluation

  • Accuracy: \(acc = \frac{\text{number of correctly predicted samples}}{\text{total number of samples}}\)
  • Pearson correlation between text similarity and labels (TODO): the Pearson coefficient measures how strongly two sequences of values are correlated; here it is computed between the sequence of predicted similarity scores and the sequence of labels (see the sketch below).
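A minimal sketch of both metrics, assuming y_pred_label holds hard label predictions, y_pred_score holds predicted similarity scores, and y_true holds the gold labels (all names are illustrative, not from the original code):

# Evaluation metric sketch (illustrative)
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred_label = np.array([1, 0, 0, 1, 0])             # hard predictions, for accuracy
y_pred_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1])   # similarity scores, for Pearson

acc = (y_pred_label == y_true).mean()                 # correct predictions / total samples
pearson = np.corrcoef(y_pred_score, y_true)[0, 1]     # Pearson correlation of scores vs. labels
print(acc, pearson)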

Learning Check-in

Environment Dependencies

Import dependencies

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

Loading the Dataset

# Load the dataset

def load_dataset():
    """
    Load the LCQMC text matching dataset
    """
    
    train_set = pd.read_csv('https://mirror.coggle.club/dataset/LCQMC.train.data.zip', 
            sep='\t', names=['query1', 'query2', 'label'])

    valid_set = pd.read_csv('https://mirror.coggle.club/dataset/LCQMC.valid.data.zip', 
            sep='\t', names=['query1', 'query2', 'label'])

    test_set = pd.read_csv('https://mirror.coggle.club/dataset/LCQMC.test.data.zip', 
            sep='\t', names=['query1', 'query2', 'label'])

    return train_set, valid_set, test_set

train_set, valid_set, test_set = load_dataset()
train_set.head()
query1 query2 label
0 喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1
1 我手机丢了,我想换个手机 我想买个新手机,求推荐 1
2 大家觉得她好看吗 大家觉得跑男好看吗? 0
3 求秋色之空漫画全集 求秋色之空全集漫画 1
4 晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0
print(f'train_size:{train_set.shape[0]}, valid_size:{valid_set.shape[0]}, test_size:{test_set.shape[0]}')
train_size:238766, valid_size:8802, test_size:12500

Text Data Analysis

  • Compare the average text length of similar pairs with that of dissimilar pairs
  • Count the distinct characters and words (jieba tokenization) across all texts
dataset = pd.concat([train_set, valid_set, test_set])
# Text data analysis

def data_analysis(dataset):
    """
    Count the characters and words over all samples and compare the average text
    length across labels. Attach the tokenized query1 and query2 as new columns on
    dataset for later steps (the function itself returns None).
    """
    
    word_vocab, char_vocab = set(), set()
    len_pos, len_neg = [], []
    query1_lst, query2_lst = [], []
    
    for index, query1, query2, label in dataset.itertuples():
        len_query = len(query1) + len(query2)
        if label:
            len_pos.append(len_query)
        else:
            len_neg.append(len_query)
        
        char_vocab |= (set(query1 + query2))
        
        words1 = jieba.lcut(query1, cut_all = False)
        words2 = jieba.lcut(query2, cut_all = False)
        word_vocab |= (set(words1) | set(words2))
        
        query1_lst.append(words1)
        query2_lst.append(words2)
        
    print(f'相似文本对平均长度:{0.25 * sum(len_pos) / len(len_pos)}')
    print(f'不相似文本对平均长度:{0.25 * sum(len_neg) / len(len_neg)}')
    print(f'词数:{len(word_vocab)}字数:{len(char_vocab)}')    
    
    # dataset is a concat of train/valid/test and keeps their original (duplicated)
    # indices, so assigning a pd.Series would align on those labels and silently
    # misalign rows; assign the plain lists instead (positional assignment)
    dataset['query1_seg'] = query1_lst
    dataset['query2_seg'] = query2_lst
    
    return

data_analysis(dataset)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.869 seconds.
Prefix dict has been built successfully.


相似文本对平均长度:5.108003967137094
不相似文本对平均长度:5.9488821926706485
词数:40441字数:5088

The results show that the average lengths of similar and dissimilar text pairs differ only slightly.

Text Similarity (Statistical Features)

Compute the following statistical features for query1 and query2, and measure each feature's correlation with the label (Pearson correlation coefficient); a toy example follows the list:

  • Text length: abs(len(query1) - len(query2))/(len(query1) + len(query2))
  • Word count: abs(num(query1) - num(query2))/(num(query1) + num(query2))
  • Word overlap: Jaccard similarity over the token sets
  • LCS: len(LCS)/(len(query1) + len(query2))
  • TF-IDF encoding similarity: cosine similarity
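As a toy example of the first three features, take the pair from row 3 above; the token lists below are illustrative and not necessarily jieba's exact output:

# Toy feature values for a single pair (illustrative)
q1, q2 = '求秋色之空漫画全集', '求秋色之空全集漫画'
q1_seg, q2_seg = ['求', '秋色', '之空', '漫画', '全集'], ['求', '秋色', '之空', '全集', '漫画']

length_feat = abs(len(q1) - len(q2)) / (len(q1) + len(q2))                      # 0.0, same character length
num_feat = abs(len(q1_seg) - len(q2_seg)) / (len(q1_seg) + len(q2_seg))         # 0.0, same word count
jaccard_sim = len(set(q1_seg) & set(q2_seg)) / len(set(q1_seg) | set(q2_seg))   # 1.0, identical token sets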
dataset.head()
query1 query2 label
0 喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1
1 我手机丢了,我想换个手机 我想买个新手机,求推荐 1
2 大家觉得她好看吗 大家觉得跑男好看吗? 0
3 求秋色之空漫画全集 求秋色之空全集漫画 1
4 晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0
# Fit the TF-IDF vectorizer

def get_tv(dataset):
    """
    Fit a TfidfVectorizer on the tokenized corpus
    """
    
    seg1 = dataset.query1_seg.map(lambda x: ' '.join(x)).to_list()
    seg2 = dataset.query2_seg.map(lambda x: ' '.join(x)).to_list()
    
    corpus = seg1 + seg2
    
    tv = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
    tv_fit = tv.fit(corpus)
    
    return tv_fit


tv_fit = get_tv(dataset)
# Statistical feature analysis

def get_cos_similarity(v1: list, v2: list):
    
    num = float(np.dot(v1, v2))  # dot product
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)  # product of the vector norms
    
    # map cosine similarity from [-1, 1] into [0, 1]
    return 0.5 + 0.5 * (num / denom) if denom != 0 else 0
def cal_LCS(s1, s2):
    """
    Compute the length of the longest common subsequence (LCS) of s1 and s2,
    normalized by the combined length of the two strings
    """
    
    l1, l2 = len(s1), len(s2)
    dp = [[0 for _ in range(l2 + 1)] for _ in range(l1 + 1)]
    
    for i in range(l1):
        for j in range(l2):
            
            if s1[i] == s2[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
                
            else:
                dp[i + 1][j + 1] = max(dp[i + 1][j], dp[i][j + 1])
    
    # dp[l1][l2] holds the LCS length
    res = dp[l1][l2] / (len(s1) + len(s2))
    
    return res


def cal_statistic(dataset):
    """
    Compute the statistical features listed above
    """
        
    res = []
        
    for index, query1, query2, label, query1_seg, query2_seg in dataset.itertuples():
        
        # text-length difference ratio
        length = abs(len(query1) - len(query2))/(len(query1) + len(query2))
        
        # word-count difference ratio
        num = abs(len(query1_seg) - len(query2_seg)) / (len(query1_seg) + len(query2_seg))
        
        # word overlap: Jaccard similarity
        jaccard = len(set(query1_seg) & set(query2_seg)) / len(set(query1_seg) | set(query2_seg))
        
        # LCS
        lcs = cal_LCS(query1, query2)
        
        # tfidf_sim
        query1_tfidf, query2_tfidf = tv_fit.transform([' '.join(query1_seg), ' '.join(query2_seg)]).toarray()

        tfidf_sim = get_cos_similarity(query1_tfidf, query2_tfidf)
        
        
        res.append([length, num, jaccard, lcs, tfidf_sim, label])
        
        if index and not index % 50000:
            print(f'{index} finished!')
            
    print('all finished!')    
    
    stat_df = pd.DataFrame(np.array(res), columns=['文本长度', '单词个数', '单词差异(jaccard)', 'LCS', 'tfidf_sim', 'label'], dtype=float)

    return stat_df


stat_df = cal_statistic(dataset)

stat_df.corr()
50000 finished!
100000 finished!
150000 finished!
200000 finished!
all finished!
文本长度 单词个数 单词差异(jaccard) LCS tfidf_sim label
文本长度 1.000000 0.767464 -0.343058 -0.312184 -0.247515 -0.201464
单词个数 0.767464 1.000000 -0.286983 -0.198526 -0.093974 -0.078558
单词差异(jaccard) -0.343058 -0.286983 1.000000 0.712443 0.659670 0.534212
LCS -0.312184 -0.198526 0.712443 1.000000 0.608842 0.492255
tfidf_sim -0.247515 -0.093974 0.659670 0.608842 1.000000 0.533298
label -0.201464 -0.078558 0.534212 0.492255 0.533298 1.000000

The results show that the word-overlap feature (Jaccard similarity) has the highest correlation with the label, i.e., word overlap is the most discriminative feature.

Text Similarity (Word Vectors and Sentence Encoding)

  1. Tokenize with jieba, then train word vectors with word2vec
  2. Compute TF-IDF and BM25 weights for each word
  3. Try the following unsupervised sentence-encoding schemes (a toy sketch follows this list)
  • Mean-Pooling: column-wise average of the embeddings of all words in the sentence
  • Max-Pooling: column-wise maximum of the embeddings of all words in the sentence (i.e., the maximum of each embedding dimension)
  • TFIDF-Pooling: TF-IDF-weighted average of the word embeddings, a refinement of Mean-Pooling
  • BM25-Pooling: refines the TF part of TFIDF-Pooling, mainly to account for the effect of sentence length on TF
  • SIF-Pooling: keeps the weighted-average idea; after weighted averaging, compute the first principal component of all sentence vectors and subtract each sentence vector's projection onto it.
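A minimal sketch of the pooling variants on a single toy sentence, assuming emb is the [num_words, embedding_dim] matrix of its word vectors and w the per-word TF-IDF or BM25 weights (names are illustrative; the full implementation follows below):

# Pooling sketch on one sentence (illustrative)
import numpy as np

emb = np.random.rand(3, 100)    # word embeddings of a 3-word sentence
w = np.array([0.5, 1.2, 0.8])   # per-word TF-IDF or BM25 weights

mean_pool = emb.mean(axis=0)                            # Mean-Pooling
max_pool = emb.max(axis=0)                              # Max-Pooling
weighted_pool = (w.reshape(-1, 1) * emb).mean(axis=0)   # TFIDF-/BM25-Pooling, as implemented below
# SIF-Pooling then removes, from every sentence vector, its projection onto the
# first principal component of all sentence vectors (see remove_pc below)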
# Train the word2vec model

from gensim.models.word2vec import Word2Vec
import os

def get_word2vec_model(dataset):
    
    path_model = "../resource/word2vec.model"
    
    if os.path.exists(path_model):
        model = Word2Vec.load(path_model)
    
    else:
        corpus = dataset.query1_seg.to_list() + dataset.query2_seg.to_list()
    
        model = Word2Vec(sentences=corpus, vector_size=100, min_count=1)
        model.save(path_model)
    
    return model

w2v_model = get_word2vec_model(dataset)
# w2v embedding
corpus = dataset.query1_seg.to_list() + dataset.query2_seg.to_list()
    
w2v_embedding = list(map(lambda x: w2v_model.wv[x], corpus))
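For reference, the per-term weight implemented by get_term_weights in the next cell is the Okapi BM25 weight, and the TF-IDF weight is the plain product of term frequency and IDF:

\(w_{BM25}(t, d) = \mathrm{idf}(t) \cdot \frac{tf(t,d)\,(k_1+1)}{tf(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}\), with \(k_1 = 1.5\), \(b = 0.75\), and \(\mathrm{idf}(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5}\); negative IDF values are floored at \(\epsilon\) times the average IDF, with \(\epsilon = 0.25\).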
# Compute BM25 and TF-IDF term weights

from gensim.corpora import Dictionary
from gensim.models.bm25model import OkapiBM25Model, BM25ABC
import math

class BM25Model(BM25ABC):
    def __init__(self, corpus=None, dictionary=None, k1=1.5, b=0.75, epsilon=0.25):
        self.k1, self.b, self.epsilon = k1, b, epsilon
        super().__init__(corpus, dictionary)

    def precompute_idfs(self, dfs, num_docs):
        idf_sum = 0
        idfs = dict()
        negative_idfs = []
        for term_id, freq in dfs.items():
            idf = math.log(num_docs - freq + 0.5) - math.log(freq + 0.5)
            idfs[term_id] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(term_id)
        average_idf = idf_sum / len(idfs)

        eps = self.epsilon * average_idf
        for term_id in negative_idfs:
            idfs[term_id] = eps

        return idfs

    def get_term_weights(self, num_tokens, term_frequencies, idfs):
        term_weights = idfs * (term_frequencies * (self.k1 + 1)
                              / (term_frequencies + self.k1 * (1 - self.b + self.b
                                                              * num_tokens / self.avgdl)))
        return term_weights
    
    def __getitem__(self, bow):
        
        num_tokens = sum(freq for term_id, freq in bow)

        term_ids, term_frequencies, idfs = [], [], []
        for term_id, term_frequency in bow:
            term_ids.append(term_id)
            term_frequencies.append(term_frequency)
            idfs.append(self.idfs.get(term_id) or 0.0)
        term_frequencies, idfs = np.array(term_frequencies), np.array(idfs)

        bm25_weights = self.get_term_weights(num_tokens, term_frequencies, idfs)
        tfidf_weights = term_frequencies * idfs
        
        vector = [
            (term_id, float(bm25_weight), float(tfidf_weight))
            for term_id, bm25_weight, tfidf_weight
            in zip(term_ids, bm25_weights, tfidf_weights)
        ]
        return vector

    
def get_weight(corpus):
    
    dictionary = Dictionary(corpus)
    
    bm25_model = BM25Model(dictionary=dictionary)
    idf_map = bm25_model.idfs
    
    weight_corpus = list(map(lambda x: bm25_model[dictionary.doc2bow(x)], corpus))
    
    bm25_weight, tfidf_weight = [], []
    
    for weight, doc in zip(weight_corpus, corpus):
        
        id_to_bm25, id_to_tfidf = {}, {}
        for id, bm25, tfidf in weight:
            id_to_bm25[id] = bm25
            id_to_tfidf[id] = tfidf
        
        bm25_vec = list(map(lambda x: id_to_bm25[dictionary.token2id[x]], doc))
        tfidf_vec = list(map(lambda x: id_to_tfidf[dictionary.token2id[x]], doc))
        
        bm25_weight.append(bm25_vec)
        tfidf_weight.append(tfidf_vec)
    
    return bm25_weight, tfidf_weight

bm25_weight, tfidf_weight = get_weight(corpus)

print(len(bm25_weight), len(tfidf_weight))
520136 520136
# pooling

from sklearn.decomposition import TruncatedSVD


def compute_pc(X,npc=1):
    """
    Compute the principal components. DO NOT MAKE THE DATA ZERO MEAN!
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: component_[i,:] is the i-th pc
    """
    svd = TruncatedSVD(n_components=npc, n_iter=7, random_state=0)
    svd.fit(X)
    return svd.components_

def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc==1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX


mean_pooling = list(map(lambda x: np.array(x).mean(axis=0), w2v_embedding))
max_pooling = list(map(lambda x: np.array(x).max(axis=0), w2v_embedding))

bm25_pooling, tfidf_pooling = [], []

for w2v, bm25, tfidf in zip(w2v_embedding, bm25_weight, tfidf_weight):
    bm25_pooling.append((np.array(bm25).reshape(-1,1) * w2v).mean(axis=0))
    tfidf_pooling.append((np.array(tfidf).reshape(-1,1) * w2v).mean(axis=0))

sif_bm25_pooling = remove_pc(np.array(bm25_pooling))
sif_tfidf_pooling = remove_pc(np.array(tfidf_pooling))
sif_mean_pooling = remove_pc(np.array(mean_pooling))
# Similarity computation

def cal_similarity(pooling):
    half_len = int(len(pooling) / 2)
    res = list(map(lambda x: get_cos_similarity(x[0], x[1]), zip(pooling[:half_len], pooling[half_len:])))
    return res

mean_sim = cal_similarity(mean_pooling)
max_sim = cal_similarity(max_pooling)
bm25_sim = cal_similarity(bm25_pooling)
tfidf_sim = cal_similarity(tfidf_pooling)
sif_bm25_sim = cal_similarity(sif_bm25_pooling)
sif_tfidf_sim = cal_similarity(sif_tfidf_pooling)
sif_mean_sim = cal_similarity(sif_mean_pooling)

sim_df = pd.DataFrame(np.array([mean_sim, max_sim, bm25_sim, tfidf_sim, sif_bm25_sim, sif_tfidf_sim, sif_mean_sim]).T, columns=['mean_sim', 'max_sim', 'bm25_sim', 'tfidf_sim', 'sif_bm25_sim', 'sif_tfidf_sim', 'sif_mean_sim'], dtype=float)
sim_df['label'] = pd.Series(dataset['label'].tolist())
sim_df.corr()
mean_sim max_sim bm25_sim tfidf_sim sif_bm25_sim sif_tfidf_sim sif_mean_sim label
mean_sim 1.000000 0.430473 0.820236 0.820078 0.817288 0.819759 0.975252 0.504288
max_sim 0.430473 1.000000 0.235300 0.233249 0.233063 0.232914 0.396588 0.153789
bm25_sim 0.820236 0.235300 1.000000 0.998204 0.999388 0.997849 0.823554 0.623700
tfidf_sim 0.820078 0.233249 0.998204 1.000000 0.997549 0.999593 0.823530 0.618218
sif_bm25_sim 0.817288 0.233063 0.999388 0.997549 1.000000 0.997193 0.820727 0.623515
sif_tfidf_sim 0.819759 0.232914 0.997849 0.999593 0.997193 1.000000 0.823204 0.618578
sif_mean_sim 0.975252 0.396588 0.823554 0.823530 0.820727 0.823204 1.000000 0.506946
label 0.504288 0.153789 0.623700 0.618218 0.623515 0.618578 0.506946 1.000000

The results show that BM25-weighted averaging works best; adding the SIF step (removing the first principal component) on top of mean_pooling, bm25_pooling, and tfidf_pooling brings little additional gain.

Text Matching Model (Siamese LSTM)

  1. Define the Siamese network (embedding layer, LSTM layer, fully connected layer)
  2. Train the Siamese network on the text matching data
  3. Predict on the test data
# Choose an appropriate seq_len

from matplotlib import pyplot as plt
from collections import Counter
seqs_len = list(sorted([len(x) for x in (dataset.query1_seg.to_list() + dataset.query2_seg.to_list())]))

len_counter = Counter(seqs_len)
plt.bar(len_counter.keys(), len_counter.values())
<BarContainer object of 33 artists>

seq_len = 15
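A quick sanity check on this choice (a hypothetical extra cell, not in the original notebook): compare seq_len against high percentiles of the length distribution instead of reading the bar chart alone.

# Sanity-check seq_len against the length distribution (hypothetical)
import numpy as np
print(np.percentile(seqs_len, [50, 95, 99]))  # if the 95th/99th percentile is <= 15, few queries get truncated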
# Dataset preprocessing

def load_pretrained_embedding_matrix(w2v_model):
    '''
    Returns:
    1. the pretrained embedding matrix with an OOV vector and a PAD vector appended
    2. vocab: a word-to-index dict used to convert tokenized text into index sequences
    3. oov_idx: index of the OOV vector in the embedding matrix
    4. padding_idx: index of the PAD vector in the embedding matrix
    5. embedding_dim: the word-embedding dimension
    '''
    
    embedding_dim = w2v_model.wv.vectors.shape[1]
    vec_oov = np.random.rand(embedding_dim)
    vec_pad = np.zeros(embedding_dim)
    embedding_matrix = np.vstack((w2v_model.wv.vectors, vec_oov, vec_pad))

    vocab = w2v_model.wv.key_to_index
    
    oov_idx = len(vocab)
    padding_idx = oov_idx + 1
    
    return embedding_matrix, vocab, oov_idx, padding_idx, embedding_dim


def DataProcess(sent):
    
    if len(sent) <= seq_len:
        sent = list(map(lambda x: vocab.get(x, oov_idx), sent)) + [padding_idx] * (seq_len - len(sent))
    else:
        sent = list(map(lambda x: vocab.get(x, oov_idx), sent[:seq_len]))
        
    return np.array(sent)


embedding_matrix, vocab, oov_idx, padding_idx, embedding_dim = load_pretrained_embedding_matrix(w2v_model)

q1 = dataset.query1_seg.map(DataProcess)
q2 = dataset.query2_seg.map(DataProcess)
label = dataset.label

data = pd.concat(([q1, q2, label]), axis=1)
# note: shuffling before the positional split below means the three partitions
# no longer correspond to the official LCQMC train/valid/test splits
data = data.sample(frac=1, ignore_index=True)
data.head()
query1_seg query2_seg label
0 [22, 1634, 72, 0, 10067, 39658, 39658, 39658, ... [20, 1634, 10067, 2, 39658, 39658, 39658, 3965... 1
1 [19, 201, 63, 1, 32, 0, 8, 39658, 39658, 39658... [1, 32, 0, 19, 21352, 7614, 39658, 39658, 3965... 0
2 [44, 6741, 9219, 7, 39658, 39658, 39658, 39658... [44, 6741, 9219, 103, 39658, 39658, 39658, 396... 1
3 [1175, 35, 4455, 20288, 15181, 54, 39658, 3965... [35, 5106, 54, 3332, 39658, 39658, 39658, 3965... 0
4 [764, 1139, 3634, 243, 6749, 0, 253, 54, 39658... [16, 3634, 2368, 54, 4462, 2341, 0, 13, 184, 1... 0
from torch.utils.data import Dataset, DataLoader
import torch

class MyDataset(Dataset):
    
    def __init__(self, data, mode):
        
        if mode == 'train':
            self.data = data.iloc[:238766,:].values
        if mode == 'valid':
            self.data = data.iloc[238766:238766+8802,:].values
        if mode == 'test':
            self.data = data.iloc[238766+8802:,:].values
    
    def __len__(self):
        
        return self.data.shape[0]
    
    def getdata(self, index):
        
        q1, q2, label = self.data[index]
        
            
        return torch.tensor(q1), torch.tensor(q2), torch.tensor(label)
            
    def __getitem__(self, index):
        
        if isinstance(index, slice):
            start, end, stride = index.indices(len(self))
            q1s, q2s, labels = [], [], []
            for i in range(start, end):
                q1, q2, label = self.getdata(i)
                q1s.append(q1)
                q2s.append(q2)
                labels.append(label)
            return q1s, q2s, labels
        else:
            return self.getdata(index)

train_dataset = MyDataset(data, mode = 'train')
valid_dataset = MyDataset(data, mode = 'valid')
test_dataset = MyDataset(data, mode = 'test')
BATCH_SIZE = 16

train_data_loader = DataLoader(train_dataset, batch_size=4*BATCH_SIZE, shuffle=True, drop_last=True)
valid_data_loader = DataLoader(valid_dataset, batch_size=4*BATCH_SIZE, shuffle=True, drop_last=True)
test_data_loader = DataLoader(test_dataset, batch_size=4*BATCH_SIZE, shuffle=True, drop_last=True)

train_sample = tuple(next(iter(train_data_loader)))
valid_sample = tuple(next(iter(valid_data_loader)))
test_sample = tuple(next(iter(test_data_loader)))

# print(train_sample, valid_sample, test_sample)
# Define the model

import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    def __init__(self, pretrained_weight, embedding_dim, hidden_dim, out_dim, padding_idx, batch_size):
        super(SiameseLSTM, self).__init__()
        
        
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.out_dim = out_dim
        
        
#         self.h0 = torch.randn(1, self.batch_size, self.hidden_dim).cuda()
#         self.c0 = torch.randn(1, self.batch_size, self.hidden_dim).cuda()
        
        
        self.embed = nn.Embedding.from_pretrained(pretrained_weight, padding_idx=padding_idx, freeze=False)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
        self.hidden2tag = nn.Linear(4 * self.hidden_dim, self.out_dim)

        
    def forward(self, q1, q2):
        """
        Input:
            q1: [batch_size, seq_len]
            q2: [batch_size, seq_len]
        """
        
        # [batch_size, seq_len, embedding_dim]
        embed1, embed2 = self.embed(q1), self.embed(q2)              
        
        
        # [batch_size, seq_len, hidden_dim]
#         output1, (h1, c1) = self.lstm(embed1, (self.h0, self.c0))
#         output2, (h2, c2) = self.lstm(embed2, (self.h0, self.c0))
        
        output1, (h1, c1) = self.lstm(embed1)
        output2, (h2, c2) = self.lstm(embed2)
        
        
        # [batch_size, hidden_dim] take the LSTM output at the last time step
        # (output is [batch_size, seq_len, hidden_dim] with batch_first=True, so
        # permute(1, 0, 2)[-1] selects the final step; this is not max pooling)
        output1 = output1.permute(1, 0, 2)[-1]
        output2 = output2.permute(1, 0, 2)[-1]

        
        # [batch_size, 4*hidden_dim]
        sim = torch.cat([output1, output1 * output2, torch.abs(output1 - output2), output2], dim=-1)

        
        # [batch_size, 2]
        output = self.hidden2tag(sim)

        return output
torch.cuda.is_available()
True
torch.cuda.device_count()
4
!nvidia-smi
Tue Jan 24 11:01:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:03:00.0 Off |                    0 |
| N/A   34C    P0    51W / 250W |    865MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   35C    P0    51W / 250W |    611MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   33C    P0    50W / 250W |    611MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    50W / 250W |    611MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10876      C   ...3/envs/jupyter/bin/python      863MiB |
|    1   N/A  N/A     10876      C   ...3/envs/jupyter/bin/python      609MiB |
|    2   N/A  N/A     10876      C   ...3/envs/jupyter/bin/python      609MiB |
|    3   N/A  N/A     10876      C   ...3/envs/jupyter/bin/python      609MiB |
+-----------------------------------------------------------------------------+
model = SiameseLSTM(
    pretrained_weight = torch.FloatTensor(embedding_matrix), 
    embedding_dim = embedding_dim, 
    hidden_dim = 16, 
    out_dim = 2, 
    padding_idx=padding_idx, 
    batch_size = BATCH_SIZE
)
model = nn.DataParallel(model).cuda()
print(model)
DataParallel(
  (module): SiameseLSTM(
    (embed): Embedding(39659, 100, padding_idx=39658)
    (lstm): LSTM(100, 16, batch_first=True)
    (hidden2tag): Linear(in_features=64, out_features=2, bias=True)
  )
)
# Accuracy computation

def accuracy(pred, label):
    return (pred.argmax(dim=1) == label).float().mean().item()
# Train the model

import torch.optim as optim

EPOCH = 10
LEARNING_RATE = 0.001


def process(data_loader, mode, optimizer=None):
    
    step, total_loss, total_acc = 0, 0, 0
    
    for q1, q2, tags in data_loader:
        pred = model(q1.cuda(), q2.cuda())
        loss = nn.CrossEntropyLoss()(pred, tags.cuda())
        
        total_loss += loss.item()
        total_acc += accuracy(pred, tags.cuda())
        
        if mode == 'train':
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        step += 1
    
    print(f'{mode}_loss:{total_loss / step}, {mode}_acc:{total_acc / step}')
            
        
seed = 1
torch.manual_seed(seed)

optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

for i in range(EPOCH):
    
    print(f'-------------------------第{i+1}轮训练开始---------------------')
    
    # 训练
    
    process(train_data_loader, mode='train', optimizer=optimizer)
    
    # 计算验证集损失
    with torch.no_grad():
        process(valid_data_loader, mode='valid')
-------------------------第1轮训练开始---------------------
train_loss:0.535547758698783, train_acc:0.7363815348525469
valid_loss:0.4924165106167758, valid_acc:0.7692746350364964
-------------------------第2轮训练开始---------------------
train_loss:0.41942384708423075, train_acc:0.8191353887399464
valid_loss:0.4552816731216264, valid_acc:0.7974452554744526
-------------------------第3轮训练开始---------------------
train_loss:0.36078374552023634, train_acc:0.8509509048257373
valid_loss:0.4463498212777785, valid_acc:0.8078239051094891
-------------------------第4轮训练开始---------------------
train_loss:0.3236872991791679, train_acc:0.8693951072386059
valid_loss:0.45117809269985143, valid_acc:0.8153512773722628
-------------------------第5轮训练开始---------------------
train_loss:0.29681315174171496, train_acc:0.8821003686327078
valid_loss:0.4543193623314809, valid_acc:0.8125
-------------------------第6轮训练开始---------------------
train_loss:0.27694827924425097, train_acc:0.8920911528150134
valid_loss:0.4587254343676741, valid_acc:0.8171760948905109
-------------------------第7轮训练开始---------------------
train_loss:0.26128301543580623, train_acc:0.8985673592493297
valid_loss:0.46427330396471234, valid_acc:0.8163777372262774
-------------------------第8轮训练开始---------------------
train_loss:0.24749427433387844, train_acc:0.9042937332439678
valid_loss:0.47178024216725006, valid_acc:0.8185447080291971
-------------------------第9轮训练开始---------------------
train_loss:0.2365480364946993, train_acc:0.9088011058981234
valid_loss:0.48314058378230046, valid_acc:0.8126140510948905
-------------------------第10轮训练开始---------------------
train_loss:0.22686020918689848, train_acc:0.9128141756032172
valid_loss:0.5018375272298381, valid_acc:0.8170620437956204
# Test the model

process(test_data_loader, mode='test')
test_loss:0.4900959795866257, test_acc:0.8205929487179487
PATH = 'LSTM_model.pt'
torch.save(model.state_dict(), PATH)

The chosen model is quite simple; it has essentially converged by the fourth epoch, reaching a test accuracy of about 0.82.

Directions for improvement: increase model capacity (BiLSTM), learning-rate decay, dropout, gradient clipping; a sketch of such a variant follows.
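A hedged sketch of one such variant, a bidirectional LSTM with dropout plus gradient clipping and learning-rate decay in the training loop; hyperparameters are illustrative and this class was not trained here:

# Sketch: a BiLSTM variant of the Siamese encoder (illustrative, not run)
class SiameseBiLSTM(nn.Module):
    def __init__(self, pretrained_weight, embedding_dim, hidden_dim, out_dim, padding_idx, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained_weight, padding_idx=padding_idx, freeze=False)
        # a single-layer bidirectional LSTM; output size per time step is 2 * hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.hidden2tag = nn.Linear(4 * 2 * hidden_dim, out_dim)

    def forward(self, q1, q2):
        out1, _ = self.lstm(self.embed(q1))
        out2, _ = self.lstm(self.embed(q2))
        v1 = out1.max(dim=1).values   # max-pool over the time dimension
        v2 = out2.max(dim=1).values
        sim = torch.cat([v1, v1 * v2, torch.abs(v1 - v2), v2], dim=-1)
        return self.hidden2tag(self.dropout(sim))

# in the training loop, one could additionally clip gradients and decay the learning rate:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)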

Text Matching Model (BERT)

  • Encode the texts with BERT and compute sentence-pair similarity
  • Define a BERT network, train it on the data as a BERT-NSP task, and predict on the test data
  • Use the BERT model to build a Sentence-BERT, train it, and predict.
# Encode sentences with BERT

from transformers import BertConfig, BertModel, BertTokenizer
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

model = BertModel.from_pretrained("bert-base-chinese")
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
dataset.head()
query1 query2 label
0 喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1
1 我手机丢了,我想换个手机 我想买个新手机,求推荐 1
2 大家觉得她好看吗 大家觉得跑男好看吗? 0
3 求秋色之空漫画全集 求秋色之空全集漫画 1
4 晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0
# Tokenize sentences

def bert_encode(sentences):
    
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=32)
    
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])    
#     print(f'embedding:{pair_embeddings}')
    
    # Normalize embeddings
    encode = F.normalize(embeddings, p=2, dim=1)
    
    return encode


query1 = dataset.query1.to_list()
query2 = dataset.query2.to_list()

num_query = len(query1)
index, batch_size = 0, 128

sim_bert = []
while index + batch_size < num_query:
    encode1 = bert_encode(query1[index:index + batch_size])
    encode2 = bert_encode(query2[index:index + batch_size])

    sim_bert += list(map(lambda x: torch.dot(x[0], x[1]), zip(encode1, encode2)))
    index += batch_size

encode1 = bert_encode(query1[index:])
encode2 = bert_encode(query2[index:])

sim_bert += list(map(lambda x: torch.dot(x[0], x[1]), zip(encode1, encode2)))
    
print(len(sim_bert))
260068
df = pd.DataFrame(np.array([sim_bert, dataset.label.values]).T, columns=['bert_sims', 'label'])
df.corr()
bert_sims label
bert_sims 1.000000 0.557954
label 0.557954 1.000000
# BERT-NSP training
import transformers
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, TensorDataset

transformers.logging.set_verbosity_error()

# Split into training and validation sets
# stratify: sample by label so the training and validation splits share the same label distribution
q1_train, q1_val, q2_train, q2_val, train_label, test_label =  train_test_split(
    dataset['query1'], 
    dataset['query2'],
    dataset['label'],
    test_size=0.2, 
    stratify=dataset['label'])


from transformers import BertTokenizer
# Tokenizer and vocabulary

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
train_encoding = tokenizer(list(q1_train), list(q2_train), 
                           truncation=True, padding=True, max_length=32)
test_encoding = tokenizer(list(q1_val), list(q2_val), 
                          truncation=True, padding=True, max_length=32)

# Dataset wrapper
class MyDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    # Fetch a single sample
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item
    
    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(train_encoding, list(train_label))
test_dataset = MyDataset(test_encoding, list(test_label))

for k, v in train_encoding.items():
    for ids in v:
        print(tokenizer.decode(ids))
        break
    break
[CLS] 辽 宁 卫 生 职 业 技 术 学 院 的 住 宿 条 件 [SEP] 辽 宁 卫 生 职 业 技 术 学 院 医 疗 美 容 [SEP]
from transformers import BertForNextSentencePrediction, get_linear_schedule_with_warmup
import torch

model = BertForNextSentencePrediction.from_pretrained('bert-base-chinese')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# From single samples to batched loading
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=True)

# Optimizer and learning-rate schedule
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# note: total_steps covers only one epoch of batches, so the linear schedule decays
# the learning rate to zero by the end of epoch 1 even though more epochs are run
total_steps = len(train_loader) * 1
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

# Model training

def process(data_loader, mode, optimizer=None, scheduler=None):
    
    step, total_loss, total_acc = 0, 0, 0
    
    for batch in data_loader:
        # Forward pass
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        loss = outputs[0]
        logits = outputs[1]
        
        total_loss += loss.item()
        
        pred = logits.detach().cpu()
        label_ids = labels.to('cpu')
        
        total_acc += accuracy(pred, label_ids)
        
        if mode == 'train':
            
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            scheduler.step()
        
        step += 1
        if mode == 'train' and step % 2000 == 0:
            print(f'iter_num: {step}, {mode}_loss: {total_loss / step}, progress:{step/total_steps*100}%')
            
    print(f'{mode}_total_loss:{total_loss / step}, {mode}_acc:{total_acc / step}')
            
        
for i in range(5):
    
    print(f'-------------------------第{i+1}轮训练开始---------------------')
    
    # 训练
    model.train()
    process(train_loader, mode='train', optimizer=optimizer, scheduler=scheduler)
    
    # 计算验证集损失
    model.eval()
    with torch.no_grad():
        process(test_loader, mode='valid')
        

-------------------------第1轮训练开始---------------------
iter_num: 2000, train_loss: 0.2818656249437481, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.25996779286302624, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.246443612488918, progress:92.27929867733006%
train_total_loss:0.24448637775899182, train_acc:0.8984403400394008
valid_total_loss:0.17493474610913023, valid_acc:0.9312881875481835
-------------------------第2轮训练开始---------------------
iter_num: 2000, train_loss: 0.18335341547057032, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.1840216007931158, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.18221153588220476, progress:92.27929867733006%
train_total_loss:0.18244350250490682, train_acc:0.9271094034045192
valid_total_loss:0.17494391908760862, valid_acc:0.9312838182673752
-------------------------第3轮训练开始---------------------
iter_num: 2000, train_loss: 0.18436180991400034, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.1840214733267203, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.18291711654793472, progress:92.27929867733006%
train_total_loss:0.18276111815716028, train_acc:0.9271478531123014
valid_total_loss:0.1749338287400653, valid_acc:0.9312903721931713
-------------------------第4轮训练开始---------------------
iter_num: 2000, train_loss: 0.18274084678804503, progress:30.759766225776687%



(KeyboardInterrupt traceback omitted: training was stopped manually during epoch 4, inside the model's forward pass.)
# Add adversarial training (FGM)

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=0.001):
        # match this substring to the name of the embedding parameters in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and 'embeddings.word_embeddings' in name:
                
                # back up the original parameters
                self.backup[name] = param.data.clone()
                
                norm = torch.norm(param.grad)
                if norm != 0:
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self):
        # match this substring to the name of the embedding parameters in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and 'embeddings.word_embeddings' in name: 
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
        
def process(data_loader, mode, optimizer=None, scheduler=None):
    
    step, total_loss, total_acc = 0, 0, 0
    
    for batch in data_loader:
        # Forward pass
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        loss = outputs[0]
        logits = outputs[1]
        
        total_loss += loss.item()
        
        pred = logits.detach().cpu()
        label_ids = labels.to('cpu')
        
        total_acc += accuracy(pred, label_ids)
        
        if mode == 'train':
            
            fgm = FGM(model)
            
            optimizer.zero_grad()
            loss.backward()
            
            fgm.attack() # add an adversarial perturbation to the embeddings
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            outputs[0].backward() # backward pass; the adversarial gradients accumulate on top of the normal gradients
            fgm.restore() # restore the original embedding parameters
            
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            scheduler.step()
        
        step += 1
        if mode == 'train' and step % 2000 == 0:
            print(f'iter_num: {step}, {mode}_loss: {loss.item()}, progress:{step/total_steps*100}%')
            
    print(f'{mode}_average_loss:{total_loss / step}, {mode}_acc:{total_acc / step}')
            
        
for i in range(3):
    
    print(f'-------------------------第{i+1}轮训练开始---------------------')
    
    # 训练
    model.train()
    process(train_loader, mode='train', optimizer=optimizer, scheduler=scheduler)
    
    # 计算验证集损失
    model.eval()
    with torch.no_grad():
        process(test_loader, mode='valid')
        

-------------------------第1轮训练开始---------------------
iter_num: 2000, train_loss: 0.10326074808835983, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.21262915432453156, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.21447157859802246, progress:92.27929867733006%
train_average_loss:0.24003259284792497, train_acc:0.9002618949455835
valid_average_loss:0.16738152858047475, valid_acc:0.9339346634593607
-------------------------第2轮训练开始---------------------
iter_num: 2000, train_loss: 0.19769251346588135, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.4742729067802429, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.18049080669879913, progress:92.27929867733006%
train_average_loss:0.17529396013528914, train_acc:0.9301901862405697
valid_average_loss:0.1673766056904171, valid_acc:0.9339390327493362
-------------------------第3轮训练开始---------------------
iter_num: 2000, train_loss: 0.23520785570144653, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.11421637237071991, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.1428006887435913, progress:92.27929867733006%
train_average_loss:0.1755074827219364, train_acc:0.9306874108640864
valid_average_loss:0.1673783917777568, valid_acc:0.9339412173851568
# Build the Sentence-BERT (SBERT) model
# SBERT encodes each sentence with a pretrained BERT and pools the token embeddings
# (e.g., mean pooling or the CLS vector) into a sentence vector, which is then fed to a
# downstream classifier. In effect, BERT replaces the LSTM (and everything before it) as the encoder.


import torch.nn as nn
from transformers import BertModel, BertTokenizer
import torch
import torch.nn.functional as F

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0] #First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class SentenceBert(nn.Module):
    def __init__(self, model_name = 'bert-base-chinese', out_dim=2, hidden_size=768):
        super(SentenceBert, self).__init__()
        
        self.out_dim = out_dim
        self.hidden_size = hidden_size  # BERT hidden size (768 for bert-base), not a vocabulary size
        
        self.encoder = BertModel.from_pretrained(model_name)
        
        self.hidden2tag = nn.Linear(4 * self.hidden_size, self.out_dim)

    
    def forward(self, encoded_input1, encoded_input2):

        # Compute token embeddings
        # note: wrapping the encoder in no_grad() freezes BERT, so only the hidden2tag
        # classification head receives gradient updates during training
        with torch.no_grad():
            model_output1 = self.encoder(**encoded_input1)
            model_output2 = self.encoder(**encoded_input2)

        # Perform pooling
        pool_embed1 = mean_pooling(model_output1, encoded_input1['attention_mask'])
        pool_embed2 = mean_pooling(model_output2, encoded_input2['attention_mask'])

        # Normalize embeddings
        embed1 = F.normalize(pool_embed1, p=2, dim=1)
        embed2 = F.normalize(pool_embed2, p=2, dim=1)
        
        
        sim = torch.cat([embed1, embed1 * embed2, torch.abs(embed1 - embed2), embed2], dim=-1)

        
        # [batch_size, 2]
        output = self.hidden2tag(sim)

        return output
    
import transformers
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, TensorDataset

transformers.logging.set_verbosity_error()

# Split into training and validation sets
# stratify: sample by label so the training and validation splits share the same label distribution
q1_train, q1_val, q2_train, q2_val, train_label, test_label =  train_test_split(
    dataset['query1'], 
    dataset['query2'],
    dataset['label'],
    test_size=0.2, 
    stratify=dataset['label'])


tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

q1_train_encoding = tokenizer(list(q1_train), truncation=True, padding=True, max_length=32)
q2_train_encoding = tokenizer(list(q2_train), truncation=True, padding=True, max_length=32)
q1_test_encoding = tokenizer(list(q1_val), truncation=True, padding=True, max_length=32)
q2_test_encoding = tokenizer(list(q2_val), truncation=True, padding=True, max_length=32)
# Dataset wrapper
class MyDataset(Dataset):
    def __init__(self, q1_encodings, q2_encodings, labels):
        self.q1_encodings = q1_encodings
        self.q2_encodings = q2_encodings
        self.labels = labels
    
    # Fetch a single sample
    def __getitem__(self, idx):
        item1 = {key: torch.tensor(val[idx]) for key, val in self.q1_encodings.items()}
        item2 = {key: torch.tensor(val[idx]) for key, val in self.q2_encodings.items()}
        label = torch.tensor(self.labels[idx])
        return item1, item2, label
    
    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(q1_train_encoding, q2_train_encoding, list(train_label))
test_dataset = MyDataset(q1_test_encoding, q2_test_encoding, list(test_label))

train_dataset[:2]
({'input_ids': tensor([[ 101,  784,  720, 4277, 2094, 4638, 1922, 7345, 7262, 1962, 8043,  102,
              0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0],
          [ 101, 4263,  677, 1166,  782, 4638, 5439, 2038, 2582,  720, 1215,  102,
              0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0]]),
  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0]])},
 {'input_ids': tensor([[ 101,  784,  720, 4277, 2094, 4638, 7676, 3717, 8024, 1914, 2208, 7178,
           8043,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0],
          [ 101, 4263,  677, 1166,  782, 4638, 5439, 2038, 2582,  720, 1215, 8043,
            102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
              0,    0,    0,    0,    0,    0,    0,    0]]),
  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0]]),
  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0]])},
 tensor([0, 1]))
model = SentenceBert()
model = nn.DataParallel(model).cuda()
train_loader = DataLoader(train_dataset, batch_size=4 * 8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=4 * 8, shuffle=True)
# Train the model
from transformers import get_linear_schedule_with_warmup

import torch.optim as optim


def process(data_loader, mode, optimizer=None, scheduler=None):
    
    step, total_loss, total_acc = 0, 0, 0
    
    for q1, q2, tags in data_loader:
        for k, v in q1.items():
            q1[k] = v.cuda()
        for k, v in q2.items():
            q2[k] = v.cuda()
        pred = model(q1, q2)
        loss = nn.CrossEntropyLoss()(pred, tags.cuda())
        
        total_loss += loss.item()
        total_acc += accuracy(pred, tags.cuda())
        
        if mode == 'train':
            optimizer.zero_grad()
            loss.backward()
            
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
        
        step += 1
        
        if mode == 'train' and step % 2000 == 0:
                print(f'iter_num: {step}, {mode}_loss: {loss.item()}, progress:{step/total_steps*100}%')
            
    print(f'{mode}_average_loss:{total_loss / step}, {mode}_acc:{total_acc / step}')
            
EPOCH = 5
LEARNING_RATE = 2e-4

# note: the schedule is again sized for a single epoch of batches, so the learning
# rate reaches zero after epoch 1 and later epochs barely change the model
total_steps = len(train_loader) * 1

optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)


for i in range(EPOCH):
    
    print(f'-------------------------第{i+1}轮训练开始---------------------')
    
    # 训练
    model.train()
    process(train_loader, mode='train', optimizer=optimizer, scheduler=scheduler)
    
    # 计算验证集损失
    model.eval()
    with torch.no_grad():
        process(test_loader, mode='valid')
-------------------------第1轮训练开始---------------------
iter_num: 2000, train_loss: 0.6034929752349854, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.460986852645874, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.4471384882926941, progress:92.27929867733006%
train_average_loss:0.5418581028088758, train_acc:0.7537261262302287
valid_average_loss:0.48876352958280367, valid_acc:0.7563944166436847
-------------------------第2轮训练开始---------------------
iter_num: 2000, train_loss: 0.557519793510437, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.5344920754432678, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.5915530920028687, progress:92.27929867733006%
train_average_loss:0.5003710666305283, train_acc:0.7816580912216888
valid_average_loss:0.4888062159476978, valid_acc:0.7563202864276234
-------------------------第3轮训练开始---------------------
iter_num: 2000, train_loss: 0.5355170369148254, progress:30.759766225776687%
iter_num: 4000, train_loss: 0.5583447217941284, progress:61.519532451553374%
iter_num: 6000, train_loss: 0.5360739231109619, progress:92.27929867733006%
train_average_loss:0.5007460270660249, train_acc:0.7811530018743827
valid_average_loss:0.48879986114314267, valid_acc:0.7563697065961024
-------------------------第4轮训练开始---------------------
iter_num: 2000, train_loss: 0.5362592935562134, progress:30.759766225776687%



(The fourth epoch was interrupted manually; the resulting KeyboardInterrupt traceback is omitted here.)

The results show that this SBERT setup lags behind the LSTM baseline and is far behind the cross-encoder BERT. One likely contributor, visible in the code above, is the training configuration rather than the architecture itself: the scheduler's num_training_steps only spans a single epoch, so the learning rate has decayed to zero by the end of epoch 1 (validation accuracy stays frozen around 0.756 afterwards), and the initial learning rate of 2e-4 is unusually high for BERT fine-tuning. The Sentence-BERT reference implementation is still worth comparing against; a hedged sketch of an adjusted setup is given below.
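A minimal sketch of that adjustment, reusing the model, train_loader and process defined above (the 2e-5 learning rate and the 10% warmup are assumptions, not values from the original run); the training loop itself stays the same:

# Hedged sketch, not a verified fix: schedule the decay over all epochs and
# use a smaller learning rate, as is common for BERT fine-tuning.
EPOCH = 5
LEARNING_RATE = 2e-5                          # 2e-4 is unusually high for BERT fine-tuning

total_steps = len(train_loader) * EPOCH       # cover every epoch, not just the first

optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # short warmup phase (assumption)
    num_training_steps=total_steps)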

# SimCSE 数据集构造
import transformers
from transformers import BertModel, BertTokenizer, BertConfig
from torch.utils.data import Dataset, DataLoader

transformers.logging.set_verbosity_error()


class TrainDataset(Dataset):
    """
    训练数据集, 重写__getitem__和__len__方法
    """
    
    def __init__(self, data):
        self.data = data
      
    def __len__(self):
        return len(self.data)
    
    def text_2_id(self, text: str):
        # 添加自身两次, 经过bert编码之后, 互为正样本
        return tokenizer([text, text], max_length=32, truncation=True, padding='max_length', return_tensors='pt')
    
    def __getitem__(self, index: int):
        return self.text_2_id(self.data[index])
    

class TestDataset(Dataset):
    """
    测试数据集, 重写__getitem__和__len__方法
    """
    
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def text_2_id(self, text: str):
        return tokenizer(text, max_length=32, truncation=True, padding='max_length', return_tensors='pt')
    
    def __getitem__(self, index: int):
        da = self.data[index]        
        return self.text_2_id([da[0]]), self.text_2_id([da[1]]), int(da[2])

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

train_data = TrainDataset(train_set.query1.to_list()[:100000] + train_set.query2.to_list()[:100000])
val_data = TestDataset(list(zip(valid_set.query1.to_list(), valid_set.query2.to_list(), valid_set.label.to_list()))[:5000])
test_data = TestDataset(list(zip(test_set.query1.to_list(), test_set.query2.to_list(), test_set.label.to_list()))[:5000])
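As a quick sanity check (a hypothetical snippet, not part of the original notebook), every unsupervised training item packs the same sentence twice so that the two encodings become a positive pair, while test items hold one sentence per field:

item = train_data[0]
print({k: v.shape for k, v in item.items()})
# expected: input_ids / token_type_ids / attention_mask all torch.Size([2, 32])

src, tgt, label = val_data[0]
print(src['input_ids'].shape, tgt['input_ids'].shape, label)
# expected: torch.Size([1, 32]) torch.Size([1, 32]) and an int label of 0 or 1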
# Build the unsupervised SimCSE model
import torch
import torch.nn as nn
from loguru import logger

class SimcseModel(nn.Module):
    """Simcse无监督模型定义"""
    def __init__(self, pretrained_model='bert-base-chinese', pooling='last-avg', DROPOUT=0.3):
        super(SimcseModel, self).__init__()
        config = BertConfig.from_pretrained(pretrained_model)       
        config.attention_probs_dropout_prob = DROPOUT   # 修改config的dropout系数
        config.hidden_dropout_prob = DROPOUT           
        self.bert = BertModel.from_pretrained(pretrained_model, config=config)
        self.pooling = pooling
        
    def forward(self, input_ids, attention_mask, token_type_ids):

        out = self.bert(input_ids, attention_mask, token_type_ids, output_hidden_states=True)

        if self.pooling == 'cls':
            return out.last_hidden_state[:, 0]  # [batch, 768]
        
        if self.pooling == 'pooler':
            return out.pooler_output            # [batch, 768]
        
        if self.pooling == 'last-avg':
            last = out.last_hidden_state.transpose(1, 2)    # [batch, 768, seqlen]
            return torch.avg_pool1d(last, kernel_size=last.shape[-1]).squeeze(-1)       # [batch, 768]
        
        if self.pooling == 'first-last-avg':
            first = out.hidden_states[1].transpose(1, 2)    # [batch, 768, seqlen]
            last = out.hidden_states[-1].transpose(1, 2)    # [batch, 768, seqlen]                   
            first_avg = torch.avg_pool1d(first, kernel_size=last.shape[-1]).squeeze(-1) # [batch, 768]
            last_avg = torch.avg_pool1d(last, kernel_size=last.shape[-1]).squeeze(-1)   # [batch, 768]
            avg = torch.cat((first_avg.unsqueeze(1), last_avg.unsqueeze(1)), dim=1)     # [batch, 2, 768]
            return torch.avg_pool1d(avg.transpose(1, 2), kernel_size=2).squeeze(-1)     # [batch, 768]
    
    
def simcse_unsup_loss(y_pred: 'tensor') -> 'tensor':
    """无监督的损失函数
    y_pred (tensor): bert的输出, [batch_size * 2, 768]
    
    """
    
    # 得到y_pred对应的label, [1, 0, 3, 2, ..., batch_size-1, batch_size-2]
    y_true = torch.arange(y_pred.shape[0]).cuda()
    y_true = (y_true - y_true % 2 * 2) + 1
    
    # Compute pairwise cosine similarity within the batch, giving a [batch*2, batch*2] similarity matrix
    sim = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=-1)
    
    # 将相似度矩阵对角线置为很小的值, 消除自身的影响
    sim = sim - torch.eye(y_pred.shape[0]).cuda() * 1e12
    
    # 相似度矩阵除以温度系数
    sim = sim / 0.05
    
    # 计算相似度矩阵与y_true的交叉熵损失
    loss = F.cross_entropy(sim, y_true)
    
    return loss
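To make the label construction concrete, here is a purely illustrative computation for a batch of 3 sentences (6 encoded rows): each row's target is its dropout twin.

y = torch.arange(6)              # [0, 1, 2, 3, 4, 5]
y = (y - y % 2 * 2) + 1          # [1, 0, 3, 2, 5, 4]
print(y.tolist())                # row 0 matches row 1, row 2 matches row 3, ...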


def eval(model, dataloader) -> float:
    """模型评估函数 
    批量预测, batch结果拼接, 一次性求spearman相关度
    """
    
    model.eval()
    sim_tensor = torch.tensor([], device=DEVICE)
    label_array = np.array([])
    
    with torch.no_grad():
        for source, target, label in dataloader:
            
            # source        [batch, 1, seq_len] -> [batch, seq_len]
            source_input_ids = source.get('input_ids').squeeze(1).to(DEVICE)
            source_attention_mask = source.get('attention_mask').squeeze(1).to(DEVICE)
            source_token_type_ids = source.get('token_type_ids').squeeze(1).to(DEVICE)
            source_pred = model(source_input_ids, source_attention_mask, source_token_type_ids)
            
            # target        [batch, 1, seq_len] -> [batch, seq_len]
            target_input_ids = target.get('input_ids').squeeze(1).to(DEVICE)
            target_attention_mask = target.get('attention_mask').squeeze(1).to(DEVICE)
            target_token_type_ids = target.get('token_type_ids').squeeze(1).to(DEVICE)
            target_pred = model(target_input_ids, target_attention_mask, target_token_type_ids)
            
            # concat
            sim = F.cosine_similarity(source_pred, target_pred, dim=-1)
            sim_tensor = torch.cat((sim_tensor, sim), dim=0)            
            label_array = np.append(label_array, np.array(label))
    
    # corrcoef 
    return spearmanr(label_array, sim_tensor.cpu().numpy()).correlation

            
def train(model, train_dl, dev_dl, optimizer) -> None:
    """模型训练函数"""
    
    model.train()
    global best
    
    for batch_idx, source in enumerate(train_dl):
        
        # Reshape [batch, 2, seq_len] -> [batch * 2, seq_len]
        real_batch_num = source.get('input_ids').shape[0]
        input_ids = source.get('input_ids').view(real_batch_num * 2, -1).to(DEVICE)
        attention_mask = source.get('attention_mask').view(real_batch_num * 2, -1).to(DEVICE)
        token_type_ids = source.get('token_type_ids').view(real_batch_num * 2, -1).to(DEVICE)
        
        out = model(input_ids, attention_mask, token_type_ids)        
        loss = simcse_unsup_loss(out)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch_idx and batch_idx % 1000 == 0:     
            print(f'idx:{batch_idx}, loss: {loss.item():.4f}, progress:{100*batch_idx/len(train_dl):.4f}%')
            corrcoef = eval(model, dev_dl)
            model.train()
            if best < corrcoef:
                best = corrcoef
                torch.save(model.state_dict(), SAVE_PATH)
                print(f"higher corrcoef: {best:.4f} in batch: {batch_idx}, save model")
       
import torch.nn.functional as F

DEVICE = 'cuda'
POOLING = 'cls'
model_name = 'bert-base-chinese'
BATCH_SIZE = 64
LR = 2e-6
   
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE)
dev_dataloader = DataLoader(val_data, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE)
from scipy.stats import spearmanr

# load model
model = SimcseModel(pretrained_model=model_name, pooling=POOLING)
model = nn.DataParallel(model).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# train
EPOCHS = 5
SAVE_PATH = 'best_model.pt'
best = 0
for epoch in range(EPOCHS):
    print(f'------------------------ epoch: {epoch + 1}------------------------')
    train(model, train_dataloader, dev_dataloader, optimizer)
    print(f'train is finished, best model is saved at {SAVE_PATH}')

# eval

model.load_state_dict(torch.load(SAVE_PATH))
dev_corrcoef = eval(model, dev_dataloader)
test_corrcoef = eval(model, test_dataloader)
print(f'dev_corrcoef: {dev_corrcoef:.4f}')
print(f'test_corrcoef: {test_corrcoef:.4f}')
------------------------ epoch: 1------------------------
idx:1000, loss: 0.1693, progress:32.0000%
higher corrcoef: 0.6005 in batch: 1000, save model
idx:2000, loss: 0.0759, progress:64.0000%
idx:3000, loss: 0.0367, progress:96.0000%
train is finished, best model is saved at best_model.pt
------------------------ epoch: 2------------------------
idx:1000, loss: 0.0159, progress:32.0000%
idx:2000, loss: 0.0193, progress:64.0000%
idx:3000, loss: 0.0090, progress:96.0000%
train is finished, best model is saved at best_model.pt
------------------------ epoch: 3------------------------
idx:1000, loss: 0.0097, progress:32.0000%
idx:2000, loss: 0.0059, progress:64.0000%
idx:3000, loss: 0.0052, progress:96.0000%
train is finished, best model is saved at best_model.pt
------------------------ epoch: 4------------------------
idx:1000, loss: 0.0040, progress:32.0000%
idx:2000, loss: 0.0058, progress:64.0000%
idx:3000, loss: 0.0034, progress:96.0000%
train is finished, best model is saved at best_model.pt
------------------------ epoch: 5------------------------
idx:1000, loss: 0.0030, progress:32.0000%
idx:2000, loss: 0.0028, progress:64.0000%
idx:3000, loss: 0.0028, progress:96.0000%
train is finished, best model is saved at best_model.pt
dev_corrcoef: 0.6005
test_corrcoef: 0.6579

import random
import time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.stats import spearmanr
from torch.utils.data import DataLoader, Dataset
from transformers import BertConfig, BertModel, BertTokenizer

# 基本参数
EPOCHS = 2
BATCH_SIZE = 32
LR = 1e-6
MAXLEN = 32
POOLING = 'cls'   # choose in ['cls', 'pooler', 'last-avg', 'first-last-avg']
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 

# 预训练模型
model_name = 'bert-base-chinese'

# 微调后参数存放位置
SAVE_PATH = 'simcse_sup.pt'

def load_train_data(data) -> list:
    """
    Build (anchor, positive, negative) triples: query1/query2 of row i form the positive
    pair (the input is pre-filtered to label == 1), and query1 of row i+1 acts as the negative
    """
    
    q1 = data.query1.to_list()
    q2 = data.query2.to_list()
    
    train_data = []
    for i in range(len(q1) - 1):
        train_data.append([q1[i], q2[i], q1[i+1]])
        
    return train_data 
    

class TrainDataset(Dataset):
    """训练数据集, 重写__getitem__和__len__方法
    """
    def __init__(self, data: list):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def text_2_id(self, text: str):
        return tokenizer([text[0], text[1], text[2]], max_length=MAXLEN, 
                         truncation=True, padding='max_length', return_tensors='pt')
    
    def __getitem__(self, index: int):
        return self.text_2_id(self.data[index])
    
    
class TestDataset(Dataset):
    """测试数据集, 重写__getitem__和__len__方法
    """
    def __init__(self, data: list):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def text_2_id(self, text: str):
        return tokenizer(text, max_length=MAXLEN, truncation=True, 
                         padding='max_length', return_tensors='pt')
    
    def __getitem__(self, index):
        line = self.data[index]
        return self.text_2_id([line[0]]), self.text_2_id([line[1]]), int(line[2])

train_data = TrainDataset(load_train_data(train_set[train_set['label'] == 1]))
val_data = TestDataset(list(zip(valid_set.query1.to_list(), valid_set.query2.to_list(), valid_set.label.to_list()))[:5000])
test_data = TestDataset(list(zip(test_set.query1.to_list(), test_set.query2.to_list(), test_set.label.to_list()))[:5000])
print(train_data[0])
{'input_ids': tensor([[ 101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614,  784,
          720, 3416, 4638, 1957, 4495,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614,  784,  720,
         3416, 4638, 1957, 4495,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 2769, 2797, 3322,  696,  749, 8024, 2769, 2682, 2940,  702, 2797,
         3322,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])}
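For illustration only (a made-up two-column frame, not from the dataset), the triple construction pairs each positive query pair with the next row's query1 as a stand-in negative:

toy = pd.DataFrame({'query1': ['a1', 'b1', 'c1'],
                    'query2': ['a2', 'b2', 'c2']})
print(load_train_data(toy))
# expected: [['a1', 'a2', 'b1'], ['b1', 'b2', 'c1']]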
# Build the supervised SimCSE model
class SimcseModel(nn.Module):
    """Simcse有监督模型定义"""
    def __init__(self, pretrained_model: str, pooling: str):
        super(SimcseModel, self).__init__()
        # config = BertConfig.from_pretrained(pretrained_model)   # 有监督不需要修改dropout
        self.bert = BertModel.from_pretrained(pretrained_model)
        self.pooling = pooling
        
    def forward(self, input_ids, attention_mask, token_type_ids):
        
        # out = self.bert(input_ids, attention_mask, token_type_ids)
        out = self.bert(input_ids, attention_mask, token_type_ids, output_hidden_states=True)

        if self.pooling == 'cls':
            return out.last_hidden_state[:, 0]  # [batch, 768]
        
        if self.pooling == 'pooler':
            return out.pooler_output            # [batch, 768]
        
        if self.pooling == 'last-avg':
            last = out.last_hidden_state.transpose(1, 2)    # [batch, 768, seqlen]
            return torch.avg_pool1d(last, kernel_size=last.shape[-1]).squeeze(-1)       # [batch, 768]
        
        if self.pooling == 'first-last-avg':
            first = out.hidden_states[1].transpose(1, 2)    # [batch, 768, seqlen]
            last = out.hidden_states[-1].transpose(1, 2)    # [batch, 768, seqlen]                   
            first_avg = torch.avg_pool1d(first, kernel_size=last.shape[-1]).squeeze(-1) # [batch, 768]
            last_avg = torch.avg_pool1d(last, kernel_size=last.shape[-1]).squeeze(-1)   # [batch, 768]
            avg = torch.cat((first_avg.unsqueeze(1), last_avg.unsqueeze(1)), dim=1)     # [batch, 2, 768]
            return torch.avg_pool1d(avg.transpose(1, 2), kernel_size=2).squeeze(-1)     # [batch, 768]
                  
            
def simcse_sup_loss(y_pred: 'tensor') -> 'tensor':
    """有监督的损失函数
    y_pred (tensor): bert的输出, [batch_size * 3, 768]
    
    """
    # 得到y_pred对应的label, 每第三句没有label, 跳过, label= [1, 0, 4, 3, ...]
    y_true = torch.arange(y_pred.shape[0], device=DEVICE)
    use_row = torch.where((y_true + 1) % 3 != 0)[0]
    y_true = (use_row - use_row % 3 * 2) + 1
    # Compute pairwise cosine similarity within the batch, giving a [batch*3, batch*3] similarity matrix
    sim = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=-1)
    # 将相似度矩阵对角线置为很小的值, 消除自身的影响
    sim = sim - torch.eye(y_pred.shape[0], device=DEVICE) * 1e12
    # 选取有效的行
    sim = torch.index_select(sim, 0, use_row)
    # 相似度矩阵除以温度系数
    sim = sim / 0.05
    # 计算相似度矩阵与y_true的交叉熵损失
    loss = F.cross_entropy(sim, y_true)
    return loss
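Again purely illustrative, for a batch of 2 triples (6 rows): the negatives (rows 2 and 5) are dropped as anchors, and each remaining row's target column is its positive.

y = torch.arange(6)                         # [0, 1, 2, 3, 4, 5]
use_row = torch.where((y + 1) % 3 != 0)[0]  # [0, 1, 3, 4]  (rows 2 and 5 are the negatives)
y_true = (use_row - use_row % 3 * 2) + 1    # [1, 0, 4, 3]
print(use_row.tolist(), y_true.tolist())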
    

def eval(model, dataloader) -> float:
    """模型评估函数 
    批量预测, 计算cos_sim, 转成numpy数组拼接起来, 一次性求spearman相关度
    """
    model.eval()
    sim_tensor = torch.tensor([], device=DEVICE)
    label_array = np.array([])
    with torch.no_grad():
        for source, target, label in dataloader:
            # source        [batch, 1, seq_len] -> [batch, seq_len]
            source_input_ids = source['input_ids'].squeeze(1).to(DEVICE)
            source_attention_mask = source['attention_mask'].squeeze(1).to(DEVICE)
            source_token_type_ids = source['token_type_ids'].squeeze(1).to(DEVICE)
            source_pred = model(source_input_ids, source_attention_mask, source_token_type_ids)
            # target        [batch, 1, seq_len] -> [batch, seq_len]
            target_input_ids = target['input_ids'].squeeze(1).to(DEVICE)
            target_attention_mask = target['attention_mask'].squeeze(1).to(DEVICE)
            target_token_type_ids = target['token_type_ids'].squeeze(1).to(DEVICE)
            target_pred = model(target_input_ids, target_attention_mask, target_token_type_ids)
            # concat
            sim = F.cosine_similarity(source_pred, target_pred, dim=-1)
            sim_tensor = torch.cat((sim_tensor, sim), dim=0)
            label_array = np.append(label_array, np.array(label))  
    # corrcoef       
    return spearmanr(label_array, sim_tensor.cpu().numpy()).correlation
        

def train(model, train_dl, dev_dl, optimizer) -> None:
    """模型训练函数 
    """
    model.train()
    global best
    early_stop_batch = 0
    for batch_idx, source in enumerate(train_dl, start=1):
        
        # Reshape [batch, 3, seq_len] -> [batch * 3, seq_len]
        real_batch_num = source.get('input_ids').shape[0]
        input_ids = source.get('input_ids').view(real_batch_num * 3, -1).to(DEVICE)
        attention_mask = source.get('attention_mask').view(real_batch_num * 3, -1).to(DEVICE)
        token_type_ids = source.get('token_type_ids').view(real_batch_num * 3, -1).to(DEVICE)
        
        # 训练
        out = model(input_ids, attention_mask, token_type_ids)
        loss = simcse_sup_loss(out)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # 评估
        if batch_idx % 100 == 0:
            print(f'batch_idx:{batch_idx}, loss: {loss.item():.4f}, progress:{100*batch_idx/len(train_dl):.4f}%')
            corrcoef = eval(model, dev_dl)
            model.train()
            if best < corrcoef:
                early_stop_batch = 0
                best = corrcoef
                torch.save(model.state_dict(), SAVE_PATH)
                print(f"higher corrcoef: {best:.4f} in batch: {batch_idx}, save model")
                continue
            early_stop_batch += 1
            if early_stop_batch == 5:
                print(f"corrcoef doesn't improve for {early_stop_batch} batch, early stop!")
                print(f"train use sample number: {(batch_idx - 10) * BATCH_SIZE}")
                return 
tokenizer = BertTokenizer.from_pretrained(model_name)

# load data
    
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE)
dev_dataloader = DataLoader(val_data, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE)
    
# load model    
model = SimcseModel(pretrained_model=model_name, pooling=POOLING)
model.to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# train
best = 0
for epoch in range(EPOCHS):
    print(f'----------------------epoch: {epoch + 1}-----------------------')
    train(model, train_dataloader, dev_dataloader, optimizer)
print(f'train is finished, best model is saved at {SAVE_PATH}')

# eval
model.load_state_dict(torch.load(SAVE_PATH))
dev_corrcoef = eval(model, dev_dataloader)
test_corrcoef = eval(model, test_dataloader)
print(f'dev_corrcoef: {dev_corrcoef:.4f}')
print(f'test_corrcoef: {test_corrcoef:.4f}')
    
----------------------epoch: 1-----------------------
batch_idx:100, loss: 1.1291, progress:2.3089%
higher corrcoef: 0.4878 in batch: 100, save model
batch_idx:200, loss: 0.9710, progress:4.6179%
higher corrcoef: 0.5499 in batch: 200, save model
batch_idx:300, loss: 0.9440, progress:6.9268%
higher corrcoef: 0.5832 in batch: 300, save model
batch_idx:400, loss: 0.9957, progress:9.2357%
higher corrcoef: 0.6009 in batch: 400, save model
batch_idx:500, loss: 0.8824, progress:11.5447%
higher corrcoef: 0.6119 in batch: 500, save model
batch_idx:600, loss: 0.9363, progress:13.8536%
higher corrcoef: 0.6195 in batch: 600, save model
batch_idx:700, loss: 0.8632, progress:16.1625%
higher corrcoef: 0.6234 in batch: 700, save model
batch_idx:800, loss: 0.8339, progress:18.4715%
higher corrcoef: 0.6258 in batch: 800, save model
batch_idx:900, loss: 0.8889, progress:20.7804%
higher corrcoef: 0.6271 in batch: 900, save model
batch_idx:1000, loss: 0.8728, progress:23.0894%
higher corrcoef: 0.6275 in batch: 1000, save model
batch_idx:1100, loss: 0.8358, progress:25.3983%
batch_idx:1200, loss: 0.8673, progress:27.7072%
batch_idx:1300, loss: 0.8888, progress:30.0162%
batch_idx:1400, loss: 0.8947, progress:32.3251%
batch_idx:1500, loss: 0.7396, progress:34.6340%
corrcoef doesn't improve for 5 batch, early stop!
train use sample number: 47680
----------------------epoch: 2-----------------------
batch_idx:100, loss: 0.7435, progress:2.3089%
batch_idx:200, loss: 0.7990, progress:4.6179%
batch_idx:300, loss: 0.8162, progress:6.9268%
batch_idx:400, loss: 0.7929, progress:9.2357%
batch_idx:500, loss: 0.7649, progress:11.5447%
corrcoef doesn't improve for 5 batch, early stop!
train use sample number: 15680
train is finished, best model is saved at simcse_sup.pt
dev_corrcoef: 0.6275
test_corrcoef: 0.6614
