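How to import Hugging Face models into your own project (Python version)

Ways to use Hugging Face

Colab is recommended for testing code and training models online (access may require a VPN): https://colab.research.google.com/

6.1 Using pipeline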

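from transformers import pipeline

# temperature and top_p only affect sampling-based generation (do_sample=True)
# NLLB is multilingual, so the translation pipeline also expects src_lang and
# tgt_lang, passed either here or when calling translator(...)
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    temperature=0.7,
    top_p=0.9,
    length_penalty=1.2
)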
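
To make the sampling parameters above less abstract, here is a minimal pure-Python sketch of what temperature scaling and top-p (nucleus) filtering do to a vector of scores. It is only an illustration under simplifying assumptions, not the code that transformers actually runs; the helper names `apply_temperature` and `top_p_filter` and the toy scores are made up for this example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(scores, temperature):
    # Dividing by a temperature below 1 sharpens the distribution,
    # a temperature above 1 flattens it
    return softmax([s / temperature for s in scores])

def top_p_filter(probs, top_p):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p, then renormalize over that set
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical scores for four candidate tokens
toy_scores = [2.0, 1.0, 0.5, -1.0]
probs = apply_temperature(toy_scores, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

A sampler would then draw the next token from `candidates` only. With greedy or beam search decoding there is no random draw, which is why these two parameters have no effect unless sampling is enabled.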

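6.2 Using a tokenizer and a preloaded model: single input

Steps: split the input into tokens with the tokenizer -> pass the resulting tokens to the preloaded model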

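import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Name of the model checkpoint to load from the Hugging Face Hub
checkpoint = "ERCDiDip/langdetect"
model_langReco = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Load the tokenizer that matches the model
tokenizer_langReco = AutoTokenizer.from_pretrained(checkpoint)

def langReco(sentence):
  # return_tensors="pt" returns the encoded input as PyTorch tensors
  inputs = tokenizer_langReco(sentence, return_tensors="pt")

  outputs = model_langReco(**inputs)

  # logits are the raw scores from the model's final layer, one per candidate label
  # they still need to go through softmax to become a probability distribution
  logits = outputs.logits

  # torch.argmax returns the index of the largest logit, which is the predicted class ID
  # .item() converts that one-element tensor to a Python scalar
  predicted_class_id = torch.argmax(logits).item()

  # Map the class ID to a human-readable language name
  return model_langReco.config.id2label[predicted_class_id]

# Test
sentence = "I am learning machine learning"
print(langReco(sentence))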
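
The comments above mention that logits have to pass through softmax before they can be read as probabilities. A minimal sketch of that step in plain Python follows; the logits and the `toy_id2label` mapping are made-up values for illustration, not output from the real model. With PyTorch tensors the same conversion would be `torch.softmax(logits, dim=-1)`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up values: a real model returns one logit per label
toy_logits = [0.3, 4.1, -1.2]
toy_id2label = {0: "de", 1: "en", 2: "fr"}  # hypothetical mapping

probs = softmax(toy_logits)

# The argmax of the probabilities is the same index as the argmax of the logits
predicted_id = max(range(len(probs)), key=lambda i: probs[i])
predicted_label = toy_id2label[predicted_id]
confidence = probs[predicted_id]
print(predicted_label, round(confidence, 3))
```

Since softmax preserves the ordering of the scores, it is only needed when you want a confidence value in addition to the predicted label.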

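6.3 Using a tokenizer and a preloaded model: batch processing

Key point: all sequences in a batch must have the same length; shorter ones are padded with [PAD]

Example:

Step 1: prepare the input list:

sentences = ["I love NML", "I am learning machine learning"]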

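Step 2: use the tokenizer

# Define the tokenizer
tokenizer_langReco = AutoTokenizer.from_pretrained(checkpoint)

# Apply the tokenizer
# padding=True : pads shorter sequences with the padding token (usually [PAD]) up to the length of the longest sequence in the batch
# truncation=True : truncates sequences that are too long so they do not exceed the model's maximum supported length
inputs = tokenizer_langReco(sentences, return_tensors="pt", padding=True, truncation=True)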
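
To show what padding and truncation produce, here is a simplified pure-Python sketch. It assumes right-side padding and plain slicing for truncation; a real tokenizer also keeps special tokens intact when truncating. The function name `pad_and_truncate`, the pad ID 0, the `max_length` of 8, and the token IDs are all hypothetical.

```python
def pad_and_truncate(batch_ids, pad_id=0, max_length=8):
    # Cut every sequence down to max_length (simplified: plain slicing)
    truncated = [ids[:max_length] for ids in batch_ids]
    # Pad to the longest sequence that remains in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for two sentences of different lengths
toy_batch = [[101, 7, 8, 102], [101, 7, 9, 10, 11, 12, 102]]
encoded = pad_and_truncate(toy_batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model tell padding apart from real tokens, so the padded positions do not influence the prediction.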

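Step 3: get the predictions

outputs = model_langReco(**inputs)

# logits are the raw scores from the model's final layer, one per candidate label
# they still need to go through softmax to become a probability distribution
logits = outputs.logits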

# torch.argmax(logits, dim=1) finds the index of the max value along dimension 1 (the class dimension)
# This returns a tensor of predicted class IDs, one for each sentence in the batch
predicted_class_ids = torch.argmax(logits, dim=1)

# Convert the tensor of IDs to a list of Python integers
predicted_class_ids_list = predicted_class_ids.tolist()
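
For a batch, the logits have shape (batch size, number of labels), so the argmax has to run along the label dimension. The plain-Python sketch below mimics that with nested lists; the values and the helper name `argmax_per_row` are made up for illustration.

```python
# One row per sentence, one column per label (made-up values)
toy_logits = [
    [0.2, 3.5, -0.7],  # sentence 0
    [2.9, 0.1, 0.4],   # sentence 1
]

def argmax_per_row(rows):
    # For each row, return the column index holding the largest value,
    # which is what an argmax over dimension 1 does on a 2-D tensor
    return [max(range(len(row)), key=lambda i: row[i]) for row in rows]

predicted_ids = argmax_per_row(toy_logits)
print(predicted_ids)
```

Calling `torch.argmax` without `dim` flattens the tensor first, which is fine for a single sentence but would return only one index for a whole batch.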

Step 4: map the results

predicted_languages = []

# Iterate through the predicted class IDs for each sentence in the batch
for predicted_id in predicted_class_ids_list:
    # Check if the predicted class ID exists in the id2label dictionary
    if predicted_id in model_langReco.config.id2label:
        # Map the class ID to a human-readable language name
        predicted_languages.append(model_langReco.config.id2label[predicted_id])
    else:
        # If the ID is not found, add an informative message for that sentence
        print(f"Warning: Predicted class ID {predicted_id} not found in model's id2label mapping for one of the sentences.")
        predicted_languages.append(f"Unknown language (ID: {predicted_id})")

print(predicted_languages)
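
A more compact way to write the same mapping is possible with `dict.get` and `zip`, which also keeps each sentence next to its result. This is only one possible variant, and the mapping, sentences, and IDs below are hypothetical stand-ins for the model's `id2label` and the batch predictions.

```python
toy_id2label = {0: "en", 1: "zh"}  # hypothetical label mapping
toy_sentences = ["hello world", "你好", "???"]
toy_predicted_ids = [0, 1, 7]  # 7 is deliberately missing from the mapping

# dict.get returns the fallback string when the ID is not in the mapping
results = [
    (sentence, toy_id2label.get(pid, f"Unknown language (ID: {pid})"))
    for sentence, pid in zip(toy_sentences, toy_predicted_ids)
]

for sentence, language in results:
    print(f"{sentence!r} -> {language}")
```

The explicit if/else version is easier to extend with logging, while this form is shorter when only the fallback label is needed.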

posted on 2025-05-15 09:26 alwaysnepenthe