Learning AI with AI - Diagnostic Conclusion Information Extraction - Data Augmentation
The data has already been annotated and converted to BERT training format (see the earlier post 数据格式转换BERT训练格式). However, training an NER model directly on such a small amount of labeled data tends to overfit, so data augmentation is needed to address the data scarcity.
The difference between data augmentation and data annotation:
Suppose there are 100 reports, 10 of them annotated, and 90 left unlabeled.
📊 The essential difference
1. Data Augmentation
- Goal: create new, diverse training samples from the data that is already annotated
- Method: transform the annotated text (synonym replacement, word-order changes, adding noise, etc.)
- Result: the number of annotated reports stays the same (still 10), but the number of training samples grows
- Example: "平均心率为76次/分" → "平均心律是76次/分" (synonym replacement)
2. Auto-Labeling / Semi-Automatic Annotation
- Goal: generate labels for the unlabeled data (your 90 unlabeled reports)
- Method: model prediction, rule matching, active learning, etc.
- Result: the annotated data grows from 10 reports to 100
- Example: use your trained NER model to predict entities in the remaining 90 reports
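The key property of augmentation is that labels must stay aligned with the transformed text. A minimal character-level sketch (the BIO tags and the synonym pair here are illustrative assumptions, not taken from the real dataset):

```python
# Character-level BIO tags for a tiny annotated snippet (illustrative labels).
text = "平均心率为76次/分"
labels = ["B-指标名称", "I-指标名称", "I-指标名称", "I-指标名称", "O",
          "B-数值", "I-数值", "B-单位", "I-单位", "I-单位"]
assert len(text) == len(labels)

def replace_synonym(text, labels, old, new):
    """Replace `old` with `new` and patch the tag list so it stays aligned.

    Only safe when the replaced span lies entirely inside one entity
    (or entirely in "O" territory); the span's tags are regenerated.
    """
    start = text.find(old)
    if start < 0:
        return text, labels
    end = start + len(old)
    if labels[start] == "O":
        new_tags = ["O"] * len(new)
    else:
        prefix, ent_type = labels[start].split("-", 1)
        new_tags = [f"{prefix}-{ent_type}"] + [f"I-{ent_type}"] * (len(new) - 1)
    return text[:start] + new + text[end:], labels[:start] + new_tags + labels[end:]

aug_text, aug_labels = replace_synonym(text, labels, "心率", "心律")
print(aug_text)  # 平均心律为76次/分
```

Because both synonyms are two characters long, the tag sequence is unchanged; a synonym of a different length would grow or shrink the tagged span accordingly.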
🔄 What do you actually need?
Since you have 100 reports and only 10 are annotated, what you really need is:
Plan A: augment first, then label (recommended)
# Step 1: augment the 10 annotated reports into 200 training samples
# Step 2: train an initial model on those 200 samples
# Step 3: use that model to predict the remaining 90 reports (auto-labeling)
# Step 4: manually review and correct the auto-labeled results
# Step 5: retrain a better model on all 100 annotated reports
Plan B: active learning + augmentation
# Step 1: augment the 10 reports and train an initial model
# Step 2: predict the 90 unlabeled reports and find the "most uncertain" samples
# Step 3: manually annotate these "most valuable" samples (say 20)
# Step 4: with 30 annotated reports, augment and train again
# Step 5: repeat until all 100 are annotated
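Plan B's "most uncertain samples" are usually found by scoring prediction confidence, for example with mean token entropy. A self-contained sketch, with fabricated probability distributions standing in for real softmaxed model logits:

```python
import math

def uncertainty_score(token_probs):
    """Mean per-token entropy: higher = the model is less sure about this sample."""
    entropies = [-sum(p * math.log(max(p, 1e-12)) for p in dist)
                 for dist in token_probs]
    return sum(entropies) / len(entropies)

# Fake per-token label distributions for 3 samples (2 tokens, 3 labels each).
samples = {
    "confident":  [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],
    "uncertain":  [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]],
    "in_between": [[0.80, 0.15, 0.05], [0.70, 0.20, 0.10]],
}

# Annotate the highest-entropy samples first.
ranked = sorted(samples, key=lambda k: uncertainty_score(samples[k]), reverse=True)
print(ranked)  # ['uncertain', 'in_between', 'confident']
```

In a real run, `token_probs` would come from `softmax(logits)` of the initial NER model over each unlabeled report.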
Step 1: augment the 10 annotated reports into 200 training samples
augmentor.py
import json
import random
import copy
import re
from transformers import AutoTokenizer
from collections import defaultdict
from typing import List, Dict, Tuple
import numpy as np
class ECGDataAugmentor:
def __init__(self, tokenizer, label2id):
self.tokenizer = tokenizer
self.label2id = label2id
self.id2label = {v: k for k, v in label2id.items()}
# 心电图专用同义词词典
self.ecg_synonyms = {
# 心率相关
"心率": ["心律", "心跳", "心搏"],
"次/分": ["bpm", "次每分", "每分钟"],
"平均心率": ["平均心律", "平均心跳", "平均心搏"],
"最快心率": ["最高心率", "最大心率", "最快心律"],
"最慢心率": ["最低心率", "最小心率", "最慢心律"],
# 事件类型
"心动过速": ["窦性心动过速", "快速心律失常", "心率过快"],
"心动过缓": ["窦性心动过缓", "缓慢心律失常", "心率过慢"],
"室性早搏": ["室早", "室性期前收缩", "PVC"],
"单发室早": ["单发性室早", "孤立性室早", "单源性室早"],
"三联律": ["三联律室早", "室性三联律"],
# 诊断结论
"窦性心律": ["窦性节律", "正常窦性心律"],
"频发室性早搏": ["室性早搏频发", "多发室早", "室早频发"],
"心率变异性分析": ["心率变异性", "HRV分析", "心率变异性检测"],
# 指标名称
"SDNN": ["标准差NN", "NN间期标准差"],
"SDANN": ["标准差ANN", "平均NN间期标准差"],
# 单位
"ms": ["毫秒", "毫秒单位"],
# 其他
"诊断": ["结论", "诊断结果", "检查结论"],
"正常参考值范围": ["正常范围", "参考范围", "正常值范围"],
}
# 医学数值范围
self.medical_ranges = {
"心率": {"min": 40, "max": 180, "step": 1},
"室早次数": {"min": 0, "max": 10000, "step": 1},
"百分比": {"min": 0.1, "max": 50.0, "step": 0.1},
"SDNN": {"min": 50, "max": 300, "step": 0.01},
"SDANN": {"min": 30, "max": 200, "step": 0.01},
"时间": {"formats": ["%m-%d %H:%M:%S", "%H:%M:%S", "%Y-%m-%d %H:%M"]}
}
# 报告模板
self.report_templates = [
"平均心率为{avg_hr}次/分,最快心率是{max_hr}次/分,发生于{max_time},最慢心率是{min_hr}次/分,发生于{min_time},其中心动过速事件(心率>100次/分),持续时间占总时间的{tachy_percent}%,心动过缓事件(心率<60次/分),持续时间占总时间的{brady_percent}%。 室性早搏共发生{pvc_count}次,占总心搏数的{pvc_percent}%,包括{pvc_single}次单发室早.{pvc_triplet}次三联律。 诊断: 1、窦性心律(心率波动于{min_hr_range}次/分--{max_hr_range}次/分之间) 2、频发室性早搏({pvc_single_diag}次单发室早.插入性室早.{pvc_triplet_diag}次三联律) 3、心率变异性分析:SDNN {sdnn}(正常参考值范围:102-180ms),SDANN {sdann}(正常参考值范围:92-162ms)",
"心率监测结果:平均{avg_hr}次/分,最高{max_hr}次/分({max_time}),最低{min_hr}次/分({min_time})。心动过速占比{tachy_percent}%,心动过缓占比{brady_percent}%。室性早搏{pvc_count}次(占比{pvc_percent}%),其中单发{pvc_single}次,三联律{pvc_triplet}次。诊断:1.窦性心律({min_hr_range}-{max_hr_range}次/分)2.频发室性早搏 3.心率变异性:SDNN {sdnn}ms,SDANN {sdann}ms",
"监测期间心率{avg_hr}次/分,最快{max_hr}次/分于{max_time},最慢{min_hr}次/分于{min_time}。心动过速{tachy_percent}%,心动过缓{brady_percent}%。室早{pvc_count}次({pvc_percent}%),单发{pvc_single}次,三联律{pvc_triplet}次。结论:1、窦性心律({min_hr_range}-{max_hr_range}次/分)2、频发室早 3、HRV:SDNN {sdnn},SDANN {sdann}",
]
# 实体模式库(从标注数据中提取)
self.entity_patterns = defaultdict(list)
def extract_entity_patterns(self, samples: List[Dict]):
"""从样本中提取实体模式"""
for sample in samples:
tokens = sample["tokens"]
labels = [self.id2label.get(l, "O") for l in sample["labels"]]  # -100 padding decodes to "O"
current_entity = None
entity_text = ""
for i, label in enumerate(labels):
if label.startswith("B-"):
if current_entity:
entity_type = current_entity["type"]
self.entity_patterns[entity_type].append(current_entity["text"])
entity_type = label[2:]
current_entity = {
"type": entity_type,
"text": tokens[i].replace("##", ""),
"start": i
}
elif label.startswith("I-") and current_entity and label[2:] == current_entity["type"]:
current_entity["text"] += tokens[i].replace("##", "")
elif label == "O":
if current_entity:
entity_type = current_entity["type"]
self.entity_patterns[entity_type].append(current_entity["text"])
current_entity = None
if current_entity:
entity_type = current_entity["type"]
self.entity_patterns[entity_type].append(current_entity["text"])
def generate_medical_values(self, entity_type: str, original_value: str = None):
"""生成合理的医学数值"""
if entity_type == "数值":
if original_value:
try:
# Perturb the original value by up to ±10%
if "." in original_value:
val = float(original_value)
variation = val * 0.1
new_val = val + random.uniform(-variation, variation)
return f"{new_val:.2f}"
else:
val = int(original_value)
variation = max(1, int(val * 0.1))  # at least ±1
new_val = val + random.randint(-variation, variation)
return str(max(1, new_val))  # keep the value positive
except ValueError:
pass
# 随机生成数值
num_type = random.choice(["int", "float"])
if num_type == "int":
return str(random.randint(1, 10000))
else:
return f"{random.uniform(0.1, 100.0):.2f}"
elif entity_type == "日期时间":
# 生成随机时间
month = random.randint(1, 12)
day = random.randint(1, 28)
hour = random.randint(0, 23)
minute = random.randint(0, 59)
second = random.randint(0, 59)
return f"{month:02d}-{day:02d} {hour:02d}:{minute:02d}:{second:02d}"
elif entity_type == "百分比":
return f"{random.uniform(0.1, 50.0):.1f}%"
else:
return original_value if original_value else "未知"
def synonym_replacement_augment(self, samples: List[Dict], num_to_generate: int) -> List[Dict]:
"""同义词替换增强(针对医学文本优化)"""
augmented = []
for _ in range(num_to_generate):
sample = copy.deepcopy(random.choice(samples))
text = self.tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
# 进行同义词替换
new_text = text
replacements_made = 0
for word, synonyms in self.ecg_synonyms.items():
if word in new_text and random.random() < 0.4:
replacement = random.choice(synonyms)
new_text = new_text.replace(word, replacement, 1)
replacements_made += 1
# 如果进行了替换,重新编码
if replacements_made > 0:
encoding = self.tokenizer(
new_text,
max_length=len(sample["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
sample["input_ids"] = encoding["input_ids"]
sample["attention_mask"] = encoding["attention_mask"]
sample["tokens"] = self.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
augmented.append(sample)
return augmented
def value_perturbation_augment(self, samples: List[Dict], num_to_generate: int) -> List[Dict]:
"""数值扰动增强(医学数值的合理变化)"""
augmented = []
for _ in range(num_to_generate):
sample = copy.deepcopy(random.choice(samples))
tokens = sample["tokens"]
labels = [self.id2label.get(l, "O") for l in sample["labels"]]  # -100 padding decodes to "O"
# 找出数值实体
new_tokens = tokens.copy()
for i, (token, label) in enumerate(zip(tokens, labels)):
if label == "B-数值" or label == "I-数值":
clean_token = token.replace("##", "")
# 检查是否是数字
if clean_token.replace('.', '').replace('-', '').isdigit():
# 生成新的医学合理数值
new_value = self.generate_medical_values("数值", clean_token)
# 将新数值token化
new_value_tokens = self.tokenizer.tokenize(new_value)
if len(new_value_tokens) == 1:
new_tokens[i] = new_value_tokens[0]
else:
# 多token数值处理(简化)
new_tokens[i] = new_value_tokens[0]
# 注意:这里简化处理,实际应该处理多token情况
# 重新构建文本
new_text = self.tokenizer.convert_tokens_to_string(new_tokens)
# 重新编码
encoding = self.tokenizer(
new_text,
max_length=len(sample["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
sample["input_ids"] = encoding["input_ids"]
sample["attention_mask"] = encoding["attention_mask"]
sample["tokens"] = self.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
augmented.append(sample)
return augmented
def template_based_augment(self, samples: List[Dict], num_to_generate: int) -> List[Dict]:
"""基于模板的增强(生成全新的心电图报告)"""
augmented = []
# 从样本中提取典型数值范围
typical_values = self._extract_typical_values(samples)
for _ in range(num_to_generate):
# 选择模板
template = random.choice(self.report_templates)
# 生成合理的医学数值
params = {
'avg_hr': random.randint(50, 100),
'max_hr': random.randint(100, 180),
'min_hr': random.randint(40, 70),
'max_time': self.generate_medical_values("日期时间"),
'min_time': self.generate_medical_values("日期时间"),
'tachy_percent': random.uniform(0.1, 10.0),
'brady_percent': random.uniform(0.1, 20.0),
'pvc_count': random.randint(100, 5000),
'pvc_percent': random.uniform(0.1, 30.0),
'pvc_single': random.randint(100, 5000),
'pvc_triplet': random.randint(0, 100),
'min_hr_range': random.randint(40, 70),
'max_hr_range': random.randint(100, 180),
'pvc_single_diag': random.randint(100, 5000),
'pvc_triplet_diag': random.randint(0, 100),
'sdnn': random.uniform(50.0, 300.0),
'sdann': random.uniform(30.0, 200.0),
}
# 应用典型值
for key, value_range in typical_values.items():
if key in params:
if isinstance(value_range, tuple):
params[key] = random.uniform(value_range[0], value_range[1])
else:
params[key] = value_range
# 确保数值合理性
params['max_hr'] = max(params['max_hr'], params['avg_hr'] + 20)
params['min_hr'] = min(params['min_hr'], params['avg_hr'] - 20)
params['max_hr_range'] = params['max_hr']
params['min_hr_range'] = params['min_hr']
params['pvc_single'] = min(params['pvc_single'], params['pvc_count'])
# Format values: two decimals for SDNN/SDANN, one for percentages
for key in ['tachy_percent', 'brady_percent', 'pvc_percent', 'sdnn', 'sdann']:
params[key] = f"{params[key]:.2f}" if key in ['sdnn', 'sdann'] else f"{params[key]:.1f}"
# 生成文本
try:
new_text = template.format(**params)
# 使用第一个样本作为参考长度
ref_sample = samples[0]
# 编码
encoding = self.tokenizer(
new_text,
max_length=len(ref_sample["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
# NOTE: reusing the reference sample's labels is only a placeholder --
# a template-generated text needs its own labels (from rules or a model),
# otherwise these samples inject label noise into training
labels = ref_sample["labels"].copy()
if len(labels) > len(encoding["input_ids"]):
labels = labels[:len(encoding["input_ids"])]
else:
labels.extend([self.label2id["O"]] * (len(encoding["input_ids"]) - len(labels)))
augmented.append({
"input_ids": encoding["input_ids"],
"attention_mask": encoding["attention_mask"],
"labels": labels,
"tokens": self.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
})
except Exception as e:
print(f"模板生成失败: {e}")
continue
return augmented
def entity_swap_augment(self, samples: List[Dict], num_to_generate: int) -> List[Dict]:
"""实体交换增强(交换同类实体)"""
augmented = []
# 提取所有样本的实体
all_entities = self._extract_all_entities_by_type(samples)
for _ in range(num_to_generate):
sample = copy.deepcopy(random.choice(samples))
tokens = sample["tokens"]
labels = [self.id2label.get(l, "O") for l in sample["labels"]]  # -100 padding decodes to "O"
# 找出所有实体
entities = self._extract_entities_from_tokens(tokens, labels)
if not entities:
continue
# 随机选择一个实体类型进行交换
entity_types = list(set([e["type"] for e in entities]))
if not entity_types:
continue
entity_type_to_swap = random.choice(entity_types)
# 获取同类型的其他实体
candidate_entities = all_entities.get(entity_type_to_swap, [])
if len(candidate_entities) < 2:
continue
# 选择要交换的实体和目标实体
entities_of_type = [e for e in entities if e["type"] == entity_type_to_swap]
if not entities_of_type:
continue
entity_to_replace = random.choice(entities_of_type)
# 确保有不同文本的候选实体
different_entities = [e for e in candidate_entities if e["text"] != entity_to_replace["text"]]
if not different_entities:
continue
target_entity = random.choice(different_entities)
# 执行交换
new_tokens = tokens.copy()
start, end = entity_to_replace["start"], entity_to_replace["end"]
replacement_tokens = self.tokenizer.tokenize(target_entity["text"])
# Replace the entity's tokens; list slicing also handles a
# replacement with a different token count
new_tokens[start:end] = replacement_tokens
# 重新构建文本
new_text = self.tokenizer.convert_tokens_to_string(new_tokens)
# 重新编码
encoding = self.tokenizer(
new_text,
max_length=len(sample["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
sample["input_ids"] = encoding["input_ids"]
sample["attention_mask"] = encoding["attention_mask"]
sample["tokens"] = self.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
augmented.append(sample)
return augmented
def random_deletion_augment(self, samples: List[Dict], num_to_generate: int) -> List[Dict]:
"""随机删除增强(删除非关键信息)"""
augmented = []
for _ in range(num_to_generate):
sample = copy.deepcopy(random.choice(samples))
text = self.tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
# 使用标点分割句子
sentences = re.split(r'[,。;;]', text)
sentences = [s.strip() for s in sentences if s.strip()]
if len(sentences) > 3:
# 随机删除一个句子(非诊断部分)
non_diagnostic_indices = [i for i, s in enumerate(sentences)
if not any(word in s for word in ["诊断", "结论", "1、", "2、", "3、"])]
if non_diagnostic_indices:
idx_to_remove = random.choice(non_diagnostic_indices)
del sentences[idx_to_remove]
new_text = ",".join(sentences) + "。"
# 重新编码
encoding = self.tokenizer(
new_text,
max_length=len(sample["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
sample["input_ids"] = encoding["input_ids"]
sample["attention_mask"] = encoding["attention_mask"]
sample["tokens"] = self.tokenizer.convert_ids_to_tokens(encoding["input_ids"])
augmented.append(sample)
return augmented
def combine_augmentations(self, samples: List[Dict], target_multiple: int = 10) -> List[Dict]:
"""组合多种增强方法"""
print(f"原始数据: {len(samples)} 条")
print(f"目标数据: {len(samples) * target_multiple} 条")
# 提取实体模式
self.extract_entity_patterns(samples)
augmented = copy.deepcopy(samples)
# 各种增强方法及其权重
augmentation_methods = [
(self.synonym_replacement_augment, 25, "同义词替换"),
(self.value_perturbation_augment, 25, "数值扰动"),
(self.template_based_augment, 20, "模板生成"),
(self.entity_swap_augment, 15, "实体交换"),
(self.random_deletion_augment, 15, "随机删除"),
]
# 计算需要生成的总数
target_count = len(samples) * target_multiple
needed = target_count - len(samples)
total_weight = sum(w for _, w, _ in augmentation_methods)
for method, weight, name in augmentation_methods:
to_generate = int(needed * (weight / total_weight))
if to_generate == 0:
continue
print(f"\n使用 {name} 生成 {to_generate} 条数据...")
try:
generated = method(samples, to_generate)
augmented.extend(generated)
print(f" 成功生成 {len(generated)} 条数据")
except Exception as e:
print(f" 增强失败: {e}")
# 如果还不够,复制一些原数据
if len(augmented) < target_count:
needed = target_count - len(augmented)
extra = random.choices(samples, k=needed)
# 对额外样本添加轻微变化
for sample in extra:
sample_copy = copy.deepcopy(sample)
# 轻微的同义词替换
text = self.tokenizer.decode(sample_copy["input_ids"], skip_special_tokens=True)
for word, synonyms in self.ecg_synonyms.items():
if word in text and random.random() < 0.2:
replacement = random.choice(synonyms)
text = text.replace(word, replacement, 1)
if text != self.tokenizer.decode(sample_copy["input_ids"], skip_special_tokens=True):
encoding = self.tokenizer(
text,
max_length=len(sample_copy["input_ids"]),
padding="max_length",
truncation=True,
return_tensors=None
)
sample_copy["input_ids"] = encoding["input_ids"]
sample_copy["attention_mask"] = encoding["attention_mask"]
augmented.append(sample_copy)
print(f"\n补充 {needed} 条轻微修改的原数据")
print(f"\n增强完成! 最终数据: {len(augmented)} 条")
return augmented[:target_count]
def _extract_typical_values(self, samples: List[Dict]) -> Dict:
"""从样本中提取典型数值范围"""
typical = defaultdict(list)
for sample in samples:
tokens = sample["tokens"]
labels = [self.id2label.get(l, "O") for l in sample["labels"]]  # -100 padding decodes to "O"
entities = self._extract_entities_from_tokens(tokens, labels)
for entity in entities:
if entity["type"] == "数值":
try:
clean_text = entity["text"].replace(',', '').replace(',', '')
if '.' in clean_text:
typical["float"].append(float(clean_text))
else:
typical["int"].append(int(clean_text))
except ValueError:
pass
# 计算典型范围
result = {}
if typical.get("int"):
result["heart_rate"] = (min(typical["int"]), max(typical["int"]))
if typical.get("float"):
result["percentage"] = (min(typical["float"]), max(typical["float"]))
return result
def _extract_all_entities_by_type(self, samples: List[Dict]) -> Dict[str, List[Dict]]:
"""从所有样本中按类型提取实体"""
entities_by_type = defaultdict(list)
for sample in samples:
tokens = sample["tokens"]
labels = [self.id2label.get(l, "O") for l in sample["labels"]]  # -100 padding decodes to "O"
entities = self._extract_entities_from_tokens(tokens, labels)
for entity in entities:
entities_by_type[entity["type"]].append(entity)
return entities_by_type
def _extract_entities_from_tokens(self, tokens: List[str], labels: List[str]) -> List[Dict]:
"""从token和label序列中提取实体"""
entities = []
current_entity = None
for i, label in enumerate(labels):
if label.startswith("B-"):
if current_entity is not None:
entities.append(current_entity)
entity_type = label[2:]
current_entity = {
'type': entity_type,
'text': tokens[i].replace("##", ""),
'start': i,
'end': i + 1
}
elif label.startswith("I-"):
if current_entity is not None and label[2:] == current_entity['type']:
current_entity['text'] += tokens[i].replace("##", "")
current_entity['end'] = i + 1
elif label == "O":
if current_entity is not None:
entities.append(current_entity)
current_entity = None
if current_entity is not None:
entities.append(current_entity)
return entities
def create_synthetic_labels_for_augmented_text(text: str, tokenizer, label2id, original_sample: Dict) -> List[int]:
"""
为增强后的文本创建标签(简化版本,实际应该使用规则或模型)
这里使用基于规则的简单方法
"""
# 解码原标签
id2label = {v: k for k, v in label2id.items()}
# 简单的规则:基于关键词匹配
rules = {
"指标名称": ["平均心率", "最快心率", "最慢心率", "SDNN", "SDANN", "心率变异性分析"],
"数值": r"\d+\.?\d*",
"单位": ["次/分", "次", "ms", "%"],
"日期时间": r"\d{2}-\d{2} \d{2}:\d{2}:\d{2}",
"事件类型": ["心动过速事件", "心动过缓事件", "室性早搏"],
"条件定义": ["心率>100次/分", "心率<60次/分", "正常参考值范围"],
"时间占比": ["持续时间占总时间的", "占总心搏数的"],
"事件子类": ["单发室早", "三联律"],
"诊断类别": ["诊断"],
"诊断结论": ["窦性心律", "频发室性早搏", "心率变异性分析"],
"数值范围": ["心率波动于", "次/分之间"],
}
# Simplified: the `rules` above are illustrative and not applied here;
# a full implementation would run rule- or model-based NER over `text`.
# For now, reuse the original sample's labels (truncated or padded)
labels = original_sample["labels"].copy()
tokens = tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
if len(labels) > len(tokens):
labels = labels[:len(tokens)]
else:
labels.extend([label2id["O"]] * (len(tokens) - len(labels)))
return labels
def main_ecg_augmentation():
"""心电图数据增强主函数"""
print("=" * 60)
print("心电图报告数据增强工具")
print("=" * 60)
# 1. 加载转换后的数据
output_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\out\bert_training_data.json"
labels_info_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\out\labels_info.json"
print("加载数据...")
with open(output_file, "r", encoding="utf-8") as f:
bert_data = json.load(f)
with open(labels_info_file, "r", encoding="utf-8") as f:
labels_info = json.load(f)
label2id = labels_info["label2id"]
# 2. 加载tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# 3. 准备完整样本
complete_samples = []
for item in bert_data:
tokens = tokenizer.convert_ids_to_tokens(item["input_ids"])
complete_samples.append({
"input_ids": item["input_ids"],
"attention_mask": item["attention_mask"],
"labels": item["labels"],
"tokens": tokens
})
print(f"加载了 {len(complete_samples)} 个样本")
# 4. 创建增强器并执行增强
augmentor = ECGDataAugmentor(tokenizer, label2id)
print("\n开始数据增强...")
augmented_samples = augmentor.combine_augmentations(complete_samples, target_multiple=20)
# 5. 验证增强结果
print("\n" + "=" * 60)
print("增强结果验证")
print("=" * 60)
# 显示几个样本对比
for i in range(min(3, len(complete_samples), len(augmented_samples))):
print(f"\n--- 样本 {i + 1} ---")
# 原样本
orig_text = tokenizer.decode(complete_samples[i]["input_ids"], skip_special_tokens=True)
print(f"原文本: {orig_text[:80]}...")
# 增强样本
aug_text = tokenizer.decode(augmented_samples[i]["input_ids"], skip_special_tokens=True)
print(f"增强后: {aug_text[:80]}...")
# 检查变化
if orig_text != aug_text:
print("✓ 文本已修改")
else:
print("✗ 文本未修改")
# 6. 保存增强数据
print(f"\n保存增强数据...")
# 准备保存格式
data_to_save = []
for sample in augmented_samples:
data_to_save.append({
"input_ids": sample["input_ids"],
"attention_mask": sample["attention_mask"],
"labels": sample["labels"]
})
# 保存到新文件
augmented_file = output_file.replace(".json", "_ecg_augmented.json")
with open(augmented_file, "w", encoding="utf-8") as f:
json.dump(data_to_save, f, indent=2, ensure_ascii=False)
# 更新标签信息
labels_info["original_samples"] = len(complete_samples)
labels_info["augmented_samples"] = len(augmented_samples)
labels_info["augmentation_ratio"] = len(augmented_samples) / len(complete_samples)
labels_info_file_aug = labels_info_file.replace(".json", "_augmented.json")
with open(labels_info_file_aug, "w", encoding="utf-8") as f:
json.dump(labels_info, f, indent=2, ensure_ascii=False)
print(f"\n" + "=" * 60)
print("增强完成!")
print(f"原始数据: {len(complete_samples)} 条")
print(f"增强后数据: {len(augmented_samples)} 条")
print(f"增强倍数: {len(augmented_samples) / len(complete_samples):.1f} 倍")
print(f"增强数据保存到: {augmented_file}")
print("=" * 60)
return augmented_samples
if __name__ == "__main__":
main_ecg_augmentation()
(vippython) PS D:\OpenSource\Python\VipPython> $env:HF_ENDPOINT = "https://hf-mirror.com"
(vippython) PS D:\OpenSource\Python\VipPython> uv run .\information_extraction\augmentor.py
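Before training on the augmented file, it is worth a quick sanity check that every record still has `input_ids`, `attention_mask`, and `labels` of equal length, since re-encoding after text edits is exactly where misalignment creeps in. A small sketch (the records here are dummies; in practice you would load bert_training_data_ecg_augmented.json):

```python
def validate_record(record, num_labels=None):
    """Return a list of problems found in one augmented training record."""
    problems = []
    n = len(record["input_ids"])
    if len(record["attention_mask"]) != n:
        problems.append("attention_mask length mismatch")
    if len(record["labels"]) != n:
        problems.append("labels length mismatch")
    if num_labels is not None:
        # -100 is the conventional "ignore" label for padding/special tokens
        bad = [l for l in record["labels"] if l != -100 and not (0 <= l < num_labels)]
        if bad:
            problems.append(f"label ids out of range: {bad[:5]}")
    return problems

ok_record  = {"input_ids": [101, 102, 0], "attention_mask": [1, 1, 0], "labels": [0, 3, -100]}
bad_record = {"input_ids": [101, 102, 0], "attention_mask": [1, 1],    "labels": [0, 99, -100]}
print(validate_record(ok_record, num_labels=10))   # []
print(validate_record(bad_record, num_labels=10))  # two problems reported
```

Running this over every record of the augmented JSON (and dropping offenders) is cheap insurance before a multi-hour training run.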

Step 2: train an initial model on the 200 augmented samples
Training script
Train an initial NER model on the 200 augmented samples: train_ecg_ner.py
Main features:
- Load the augmented training data and label info
- Split into training and validation sets (80:20)
- Use the bert-base-chinese pretrained model
- Configure the AdamW optimizer and a linear learning-rate scheduler
- Train for 10 epochs, keeping the best model
- Log training and validation loss
Output:
- Loss information during training
- Best model saved under models/ecg_ner/
- Configuration saved to models/ecg_ner/config.json
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torch.optim import AdamW  # transformers' AdamW is deprecated/removed in newer versions
from transformers import AutoModelForTokenClassification, AutoTokenizer, get_linear_schedule_with_warmup
import json
import os
from tqdm import tqdm
from ner_dataset import MedicalNERDatasetWithLabels
def train_ecg_ner():
"""
训练心电图报告NER模型
"""
print("=" * 60)
print("心电图报告NER模型训练")
print("=" * 60)
# 1. 配置参数
config = {
"data_path": "data/out/bert_training_data_ecg_augmented.json",
"labels_info_path": "data/out/labels_info_augmented.json",
"model_name": "bert-base-chinese",
"max_length": 256,
"batch_size": 16,
"epochs": 10,
"learning_rate": 2e-5,
"warmup_steps": 500,
"weight_decay": 0.01,
"output_dir": "models/ecg_ner",
"device": "cuda" if torch.cuda.is_available() else "cpu"
}
print(f"使用设备: {config['device']}")
print(f"训练数据: {config['data_path']}")
print(f"标签信息: {config['labels_info_path']}")
# 2. 加载数据集
print("\n加载数据集...")
dataset = MedicalNERDatasetWithLabels(
config["data_path"],
config["labels_info_path"],
config["max_length"]
)
print(f"数据集大小: {len(dataset)}")
print(f"标签数量: {dataset.num_labels}")
print(f"标签映射: {dataset.label2id}")
# 3. 分割训练集和验证集
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print(f"\n训练集大小: {len(train_dataset)}")
print(f"验证集大小: {len(val_dataset)}")
# 4. 创建数据加载器
train_loader = DataLoader(
train_dataset,
batch_size=config["batch_size"],
shuffle=True,
num_workers=0
)
val_loader = DataLoader(
val_dataset,
batch_size=config["batch_size"],
shuffle=False,
num_workers=0
)
# 5. 加载预训练模型
print("\n加载预训练模型...")
model = AutoModelForTokenClassification.from_pretrained(
config["model_name"],
num_labels=dataset.num_labels
)
model.to(config["device"])
# 6. 设置优化器和学习率调度器
optimizer = AdamW(
model.parameters(),
lr=config["learning_rate"],
weight_decay=config["weight_decay"]
)
total_steps = len(train_loader) * config["epochs"]
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=config["warmup_steps"],
num_training_steps=total_steps
)
# 7. 创建输出目录
os.makedirs(config["output_dir"], exist_ok=True)
# 8. 训练模型
print("\n开始训练...")
best_val_loss = float("inf")
for epoch in range(config["epochs"]):
print(f"\n--- 第 {epoch + 1} 轮训练 ---")
# 训练模式
model.train()
train_loss = 0.0
for batch in tqdm(train_loader, desc="训练中"):
input_ids = batch["input_ids"].to(config["device"])
attention_mask = batch["attention_mask"].to(config["device"])
labels = batch["labels"].to(config["device"])
# 前向传播
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
train_loss += loss.item()
# 反向传播
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
avg_train_loss = train_loss / len(train_loader)
print(f"训练损失: {avg_train_loss:.4f}")
# 验证模式
model.eval()
val_loss = 0.0
with torch.no_grad():
for batch in tqdm(val_loader, desc="验证中"):
input_ids = batch["input_ids"].to(config["device"])
attention_mask = batch["attention_mask"].to(config["device"])
labels = batch["labels"].to(config["device"])
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
val_loss += loss.item()
avg_val_loss = val_loss / len(val_loader)
print(f"验证损失: {avg_val_loss:.4f}")
# 保存最佳模型
if avg_val_loss < best_val_loss:
best_val_loss = avg_val_loss
model.save_pretrained(config["output_dir"])
print(f"保存最佳模型 (验证损失: {best_val_loss:.4f})")
# 9. 保存配置信息
config["best_val_loss"] = best_val_loss
with open(os.path.join(config["output_dir"], "config.json"), "w", encoding="utf-8") as f:
json.dump(config, f, indent=2, ensure_ascii=False)
print("\n" + "=" * 60)
print("训练完成!")
print(f"最佳验证损失: {best_val_loss:.4f}")
print(f"模型保存路径: {config['output_dir']}")
print("=" * 60)
if __name__ == "__main__":
train_ecg_ner()
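One caveat about the script above: `random_split` on augmented data lets near-duplicates of the same source report land on both sides of the train/validation boundary, which inflates validation scores. If each record carried a `source_id` field (an assumption — the current pipeline does not emit one), the split could be done per source report instead:

```python
import random

def group_split(records, val_ratio=0.2, seed=42):
    """Split records so that all augmentations of one source report
    stay on the same side of the train/validation boundary."""
    groups = sorted({r["source_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, int(len(groups) * val_ratio))
    val_groups = set(groups[:n_val])
    train = [r for r in records if r["source_id"] not in val_groups]
    val = [r for r in records if r["source_id"] in val_groups]
    return train, val

# 10 source reports, 20 augmented copies each (dummy records)
records = [{"source_id": s, "aug_idx": i} for s in range(10) for i in range(20)]
train, val = group_split(records)
train_ids = {r["source_id"] for r in train}
val_ids = {r["source_id"] for r in val}
print(len(train), len(val), train_ids & val_ids)  # the id sets do not overlap
```

Emitting a `source_id` during augmentation is a one-line change in each augment method, and it makes the validation loss a much more honest signal.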
Evaluation script: evaluate_ecg_ner.py
Evaluates the performance of the trained model.
Main features:
- Load the trained model
- Evaluate on the augmented data (note: this is the same data the model was trained on, so the scores are optimistic; hold out an untouched split for a realistic estimate)
- Compute precision, recall, F1, and related metrics
- Generate a detailed classification report
- Save the evaluation results
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification, AutoTokenizer
import json
from tqdm import tqdm
from ner_dataset import MedicalNERDatasetWithLabels
from sklearn.metrics import classification_report
def evaluate_ecg_ner():
"""
评估心电图报告NER模型
"""
print("=" * 60)
print("心电图报告NER模型评估")
print("=" * 60)
# 1. 配置参数
config = {
"data_path": "data/out/bert_training_data_ecg_augmented.json",
"labels_info_path": "data/out/labels_info_augmented.json",
"model_dir": "models/ecg_ner",
"max_length": 256,
"batch_size": 16,
"device": "cuda" if torch.cuda.is_available() else "cpu"
}
print(f"使用设备: {config['device']}")
print(f"评估数据: {config['data_path']}")
print(f"模型路径: {config['model_dir']}")
# 2. 加载数据集
print("\n加载数据集...")
dataset = MedicalNERDatasetWithLabels(
config["data_path"],
config["labels_info_path"],
config["max_length"]
)
# 3. 创建数据加载器
data_loader = DataLoader(
dataset,
batch_size=config["batch_size"],
shuffle=False,
num_workers=0
)
# 4. 加载模型
print("\n加载模型...")
model = AutoModelForTokenClassification.from_pretrained(config["model_dir"])
model.to(config["device"])
model.eval()
# 5. 加载tokenizer
tokenizer = AutoTokenizer.from_pretrained(config["model_dir"])
# 6. 评估模型
print("\n开始评估...")
all_true_labels = []
all_pred_labels = []
with torch.no_grad():
for batch in tqdm(data_loader, desc="评估中"):
input_ids = batch["input_ids"].to(config["device"])
attention_mask = batch["attention_mask"].to(config["device"])
labels = batch["labels"].to(config["device"])
# 前向传播
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)
# 获取预测标签
predictions = torch.argmax(outputs.logits, dim=2)
# 收集标签(排除-100的填充值)
for i in range(input_ids.shape[0]):
for j in range(input_ids.shape[1]):
if labels[i, j] != -100:
all_true_labels.append(labels[i, j].item())
all_pred_labels.append(predictions[i, j].item())
# 7. 计算评估指标
print("\n" + "=" * 60)
print("评估结果")
print("=" * 60)
# 生成标签映射
id2label = dataset.id2label
target_names = [id2label[i] for i in sorted(id2label.keys())]
# Print the classification report; pass `labels` explicitly so the
# report does not fail when some classes are absent from the data
report = classification_report(
all_true_labels,
all_pred_labels,
labels=sorted(id2label.keys()),
target_names=target_names,
zero_division=0
)
print(report)
# 8. 保存评估结果
evaluation_result = {
"true_labels": all_true_labels,
"pred_labels": all_pred_labels,
"report": report
}
with open("models/ecg_ner/evaluation_result.json", "w", encoding="utf-8") as f:
json.dump(evaluation_result, f, indent=2, ensure_ascii=False)
print("\n评估完成!")
print(f"评估结果保存路径: models/ecg_ner/evaluation_result.json")
if __name__ == "__main__":
evaluate_ecg_ner()
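The token-level classification_report above counts every token independently, which overstates quality for extraction: an entity only counts if its whole span and type are correct. Entity-level precision/recall/F1 can be computed by comparing decoded BIO spans (a self-contained sketch; the seqeval library implements the same idea):

```python
def bio_spans(labels):
    """Decode a BIO tag sequence into a set of (start, end, type) spans."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(labels + ["O"]):  # sentinel flushes the last entity
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != ent_type):
            if start is not None:
                spans.add((start, i, ent_type))
            # a stray I- with the wrong type is treated like O here
            start, ent_type = ((i, tag[2:]) if tag.startswith("B-") else (None, None))
        # an I- tag of the matching type just extends the current span
    return spans

def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over parallel BIO tag sequences."""
    tp = fp = fn = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = bio_spans(t), bio_spans(p)
        tp += len(ts & ps)
        fp += len(ps - ts)
        fn += len(ts - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

true = [["B-数值", "I-数值", "O", "B-单位"]]
pred = [["B-数值", "I-数值", "O", "O"]]  # the 单位 entity was missed
print(entity_f1(true, pred))  # 2/3 ≈ 0.667
```

Note how one missed entity drops F1 to 0.667 even though 3 of 4 token labels are correct; this is the number that reflects extraction quality.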
Training configuration
- Model: the bert-base-chinese pretrained model
- Batch size: 16
- Learning rate: 2e-5
- Epochs: 10
- Max sequence length: 256
- Device: GPU is used automatically when available
Step 3: use this model to predict the remaining 90 reports (auto-labeling)
Step 4: manually review and correct the auto-labeled results
Step 5: retrain a better model on all 100 annotated reports
🚀 For your situation: code to auto-label the 90 reports
Since you have 90 unlabeled reports, here is an auto-labeling approach:
import json
import os
import re    # used by rule_based_labeling
import copy  # used by create_training_data_from_mix
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
from tqdm import tqdm
import numpy as np
class AutoLabeler:
def __init__(self, model_path=None, label2id=None):
"""Initialize the auto-labeler."""
# Load a previously trained model if one exists; otherwise only the
# rule-based labeling path is usable
self.tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
self.label2id = label2id or {}
self.id2label = {v: k for k, v in self.label2id.items()}  # needed by predict_entities
if model_path and os.path.exists(model_path):
print(f"加载已有模型: {model_path}")
self.model = AutoModelForTokenClassification.from_pretrained(model_path)
else:
print("使用基础BERT模型")
self.model = None
def train_initial_model(self, train_data, label2id, epochs=3):
"""用10份数据训练初步模型"""
# 这里简化为训练流程示意
print("训练初步模型...")
# 实际需要实现训练代码
pass
def predict_entities(self, texts, batch_size=8):
"""预测文本中的实体"""
if self.model is None:
print("请先训练或加载模型")
return []
self.model.eval()
all_predictions = []
for i in tqdm(range(0, len(texts), batch_size)):
batch_texts = texts[i:i+batch_size]
# 编码
encodings = self.tokenizer(
batch_texts,
padding=True,
truncation=True,
max_length=256,
return_tensors="pt"
)
# 预测
with torch.no_grad():
outputs = self.model(**encodings)
predictions = torch.argmax(outputs.logits, dim=2)
# 转换回标签
for j in range(len(batch_texts)):
tokens = self.tokenizer.convert_ids_to_tokens(encodings["input_ids"][j])
pred_labels = [self.id2label[p.item()] for p in predictions[j]]
# Decode BIO tags back into entity spans; assumes an _extract_entities
# helper (e.g. the span-decoding logic from ECGDataAugmentor) is available
entities = self._extract_entities(tokens, pred_labels)
all_predictions.append(entities)
return all_predictions
def auto_label_unlabeled_data(self, labeled_file, unlabeled_file, output_file):
"""自动标注未标注数据"""
print("="*60)
print("自动标注未标注数据")
print("="*60)
# 1. 加载已标注数据
print("1. 加载已标注数据...")
with open(labeled_file, "r", encoding="utf-8") as f:
labeled_data = json.load(f)
# 2. 加载未标注数据
print("2. 加载未标注数据...")
unlabeled_texts = []
if unlabeled_file.endswith('.json'):
with open(unlabeled_file, "r", encoding="utf-8") as f:
unlabeled_data = json.load(f)
unlabeled_texts = [item["text"] for item in unlabeled_data]
elif unlabeled_file.endswith('.txt'):
with open(unlabeled_file, "r", encoding="utf-8") as f:
unlabeled_texts = [line.strip() for line in f if line.strip()]
print(f"未标注数据: {len(unlabeled_texts)} 条")
# 3. 方法1:基于规则的自动标注(如果没有模型)
print("3. 开始基于规则的自动标注...")
auto_labeled_data = self.rule_based_labeling(unlabeled_texts)
# 4. 保存结果
print(f"4. 保存自动标注结果...")
with open(output_file, "w", encoding="utf-8") as f:
json.dump(auto_labeled_data, f, indent=2, ensure_ascii=False)
print(f"自动标注完成!保存到: {output_file}")
return auto_labeled_data
def rule_based_labeling(self, texts):
"""基于规则的自动标注(针对心电图报告)"""
auto_labeled = []
# 心电图报告规则
rules = [
# 指标名称
(r"(平均心率|最快心率|最慢心率|SDNN|SDANN|心率变异性分析)", "指标名称"),
# Numbers -- note: \b does not fire between Chinese characters and digits
# (both count as \w in Python's re), so plain \d+ is used instead of \b\d+\b
(r"\d+(\.\d+)?", "数值"),
# 单位
(r"(次/分|次|ms|%|bpm)", "单位"),
# 日期时间
(r"\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "日期时间"),
# 事件类型
(r"(心动过速事件|心动过缓事件|室性早搏)", "事件类型"),
# 条件定义
(r"(心率>100次/分|心率<60次/分|正常参考值范围)", "条件定义"),
# 时间占比
(r"(持续时间占总时间的|占总心搏数的)", "时间占比"),
# 事件子类
(r"(单发室早|三联律)", "事件子类"),
# 诊断类别
(r"诊断", "诊断类别"),
# 诊断结论
(r"(窦性心律|频发室性早搏)", "诊断结论"),
# 数值范围
(r"心率波动于.*次/分之间", "数值范围"),
]
for text_idx, text in enumerate(tqdm(texts, desc="规则标注")):
annotations = []
# 应用规则
for pattern, label in rules:
for match in re.finditer(pattern, text):
annotations.append({
"value": {
"start": match.start(),
"end": match.end(),
"text": match.group(),
"labels": [label]
},
"id": f"auto_{text_idx}_{len(annotations)}",
"from_name": "entity",
"to_name": "text",
"type": "labels",
"origin": "automatic"
})
# 去重(避免重叠标注)
annotations = self._remove_overlapping_annotations(annotations)
# 创建Label Studio格式
item = {
"id": text_idx + 1,
"annotations": [{
"id": 1,
"result": annotations,
"was_cancelled": False,
"ground_truth": False,
"lead_time": 0
}],
"data": {
"text": text
},
"meta": {},
"created_at": "2024-01-01T00:00:00.000000Z",
"updated_at": "2024-01-01T00:00:00.000000Z"
}
auto_labeled.append(item)
return auto_labeled
def _remove_overlapping_annotations(self, annotations):
"""去除重叠的标注"""
if not annotations:
return annotations
# 按起始位置排序
sorted_anns = sorted(annotations, key=lambda x: x["value"]["start"])
filtered = []
for ann in sorted_anns:
if not filtered:
filtered.append(ann)
else:
last = filtered[-1]
# 检查是否重叠
if ann["value"]["start"] >= last["value"]["end"]:
filtered.append(ann)
# 如果重叠,保留更长的标注
elif ann["value"]["end"] - ann["value"]["start"] > last["value"]["end"] - last["value"]["start"]:
filtered[-1] = ann
return filtered
def create_training_data_from_mix(self, labeled_data, auto_labeled_data, output_file):
"""混合人工标注和自动标注数据创建训练集"""
print("="*60)
print("创建混合训练数据集")
print("="*60)
# 人工标注数据(高质量)
mixed_data = copy.deepcopy(labeled_data)
# 自动标注数据(较低质量)
for item in auto_labeled_data:
# 标记为自动标注
item["annotations"][0]["ground_truth"] = False
item["annotations"][0]["origin"] = "automatic"
mixed_data.append(item)
print(f"人工标注数据: {len(labeled_data)} 条")
print(f"自动标注数据: {len(auto_labeled_data)} 条")
print(f"混合数据总计: {len(mixed_data)} 条")
# 保存
with open(output_file, "w", encoding="utf-8") as f:
json.dump(mixed_data, f, indent=2, ensure_ascii=False)
print(f"混合数据集保存到: {output_file}")
return mixed_data
def main_auto_labeling():
"""主函数:自动标注90份未标注数据"""
# 文件路径
labeled_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\label_studio_export.json" # 10份已标注
# 假设你的90份未标注数据在这个文件里
unlabeled_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\unlabeled_reports.json" # 90份未标注
# 输出文件
auto_labeled_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\auto_labeled_reports.json"
mixed_training_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\mixed_training_data.json"
# 创建自动标注器
labeler = AutoLabeler()
# 1. 自动标注未标注数据
auto_labeled = labeler.auto_label_unlabeled_data(
labeled_file,
unlabeled_file,
auto_labeled_file
)
# 2. 创建混合训练数据
with open(labeled_file, "r", encoding="utf-8") as f:
labeled_data = json.load(f)
mixed_data = labeler.create_training_data_from_mix(
labeled_data,
auto_labeled,
mixed_training_file
)
print("\n" + "="*60)
print("下一步建议:")
print("1. 用混合数据训练模型")
print("2. 人工检查自动标注结果")
print("3. 修正错误标注")
print("4. 用修正后的数据重新训练")
print("="*60)
# 如果还没有未标注数据,先创建示例
def create_sample_unlabeled_data():
"""创建示例未标注数据"""
sample_texts = [
"平均心率为82次/分,最快心率是158次/分,发生于01-20 14:23:45,最慢心率是52次/分,发生01-21 08:12:30,其中心动过速事件(心率>100次/分),持续时间占总时间的2.1%,心动过缓事件(心率<60次/分),持续时间占总时间的3.5%。室性早搏共发生1850次,占总心搏数的6.5%,包括1850次单发室早.28次三联律。诊断:1、窦性心律(心率波动于52次/分--158次/分之间)2、频发室性早搏(1850次单发室早.插入性室早.28次三联律) 3、心率变异性分析:SDNN 198.75(正常参考值范围:102-180ms),SDANN 125.63(正常参考值范围:92-162ms)",
"心率监测:平均68次/分,最高132次/分(01-19 16:45:22),最低48次/分(01-20 06:30:15)。室早总数3210次,占比8.2%。SDNN 167.42ms,SDANN 118.76ms。",
"24小时动态心电图:平均心率75bpm,最快146bpm,最慢44bpm。室性早搏4200次,三联律65次。HRV分析正常。"
]
output_file = r"D:\OpenSource\Python\VipPython\information_extraction\data\unlabeled_reports.json"
data = []
for i, text in enumerate(sample_texts):
data.append({
"id": i + 1,
"text": text,
"source": "sample"
})
with open(output_file, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
print(f"创建了 {len(data)} 条示例未标注数据到: {output_file}")
return data
if __name__ == "__main__":
# 先创建示例未标注数据(如果你还没有)
create_sample_unlabeled_data()
# 运行自动标注
main_auto_labeling()
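The rule-based path can be exercised on a single sentence without any model. A condensed version of the same idea as rule_based_labeling, with only three rules:

```python
import re

RULES = [
    (r"平均心率|最快心率|最慢心率", "指标名称"),
    (r"\d+(?:\.\d+)?", "数值"),
    (r"次/分|ms|%", "单位"),
]

def label_sentence(text):
    """Apply regex rules; on overlaps keep the earliest, longest span."""
    matches = []
    for pattern, label in RULES:
        for m in re.finditer(pattern, text):
            matches.append({"start": m.start(), "end": m.end(),
                            "text": m.group(), "label": label})
    # sort by start, preferring longer spans at the same position
    matches.sort(key=lambda a: (a["start"], -(a["end"] - a["start"])))
    kept = []
    for ann in matches:
        if not kept or ann["start"] >= kept[-1]["end"]:
            kept.append(ann)
    return kept

anns = label_sentence("平均心率为76次/分")
print([(a["text"], a["label"]) for a in anns])
# [('平均心率', '指标名称'), ('76', '数值'), ('次/分', '单位')]
```

This is exactly the precision/recall trade-off of rule-based labeling: spans the rules cover come out clean, everything else is silently missed, which is why the manual-review step afterwards is not optional.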
📋 Concrete implementation steps:
Step 1: organize your data
data/
├── labeled/ # 10 annotated reports
│ └── label_studio_export.json
├── unlabeled/ # 90 unlabeled reports
│ └── unlabeled_reports.json
└── output/ # output directory
Step 2: run the auto-labeling
# Run the main_auto_labeling() function above
# It produces:
# 1. auto_labeled_reports.json - the 90 auto-labeled reports
# 2. mixed_training_data.json - 10 manual + 90 auto-labeled reports combined
Step 3: manual correction (the key step)
# Review the auto-labeled results and fix any errors
# Label Studio works well for quick corrections
Step 4: train the final model
# Train the final model on the corrected 100 reports
🎯 Summary
| Method | Input | Output | Purpose |
|---|---|---|---|
| Data augmentation | 10 annotated reports | 200 training samples | Prevent overfitting, improve generalization |
| Auto-labeling | 10 annotated + 90 unlabeled | 100 annotated reports | Obtain more labeled data quickly |
| Combined | 10 annotated + 90 unlabeled | 100 high-quality annotations | Best of both |
Recommended order:
- Auto-label first, turning the 90 unlabeled reports into labeled data
- Then augment the 100 labeled reports into 1000+ training samples
- Train a strong final model
This addresses both the shortage of data and the risk of overfitting.
This post is from 博客园 (cnblogs), by VipSoft. Please keep the original link when reposting: https://www.cnblogs.com/vipsoft/p/19533984