HF Fine-tuning (Part 1)
YelpReviewFull Dataset¶
Dataset download: YelpReviewFull
Dataset Summary¶
The Yelp review dataset consists of reviews from Yelp. It was extracted from the Yelp Dataset Challenge 2015 data.
Supported Tasks and Leaderboards¶
Text classification, sentiment classification: this dataset is mainly used for text classification: given the text, predict the sentiment.
Languages¶
The reviews are mainly written in English.
Dataset Structure¶
Data Instances¶
A typical data point consists of the review text and its corresponding label.
An example from the YelpReviewFull test set looks as follows:
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire.\\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\".\\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
Data Fields¶
- 'text': the review text, escaped using double quotes ("); any internal double quote is escaped by two double quotes (""). New lines are escaped with a backslash followed by an "n" character, i.e. "\n".
- 'label': corresponds to the score associated with the review (between 1 and 5; the loaded dataset stores these as class indices 0-4, as the example above shows).
Data Splits¶
The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.
Loading the Dataset¶
from datasets import load_dataset
dataset = load_dataset("/dataset/yelp_review_full")
# The dataset already comes with train and test splits
dataset
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})
# Data format
dataset["train"][222]
{'label': 2, 'text': "Its not bad. I'm not into chain kind of places, but if you're with a group that doesn't do anything more adventurous, you'll do fine here. I vacillated between 2 and 3 stars but the house beer I had will make sure I come back from time to time."}
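Labels come back as integers 0-4; the dataset's ClassLabel feature can map them to human-readable star names. A minimal sketch using the dataset loaded above (the variable label_feature is introduced here for illustration):
# Inspect the ClassLabel feature to map integer labels back to star names
label_feature = dataset["train"].features["label"]
print(label_feature.names)       # the five class names, e.g. '1 star' ... '5 stars'
print(label_feature.int2str(2))  # star name for class index 2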
import random
import pandas as pd
import datasets
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
# Show 10 random examples
show_random_elements(dataset["train"])
 | label | text |
---|---|---|
0 | 3 stars | If you go to Cave Creek you must go to the Buffalo Chip, Harold's , hideaway etc. just great local restaurants in a great town north of Phoenix/Scottsdale. I like the Chip more for drinking dancing and rodeo than the food. But it is okay for a little grub But it is fun. So try it |
1 | 3 stars | Chuy's was pretty good. Seems like all of their locations are pretty similar. I think they'd be best known for their cheap margaritas... $2 pints, $4 small pitchers. They aren't too strong and come from a mediocre mix, but if you're looking for something sweet and cheap its a good bet.\n\nFree serve-yourself chips and salsa are pretty good. \n\nAs for food... pretty decent prices and okay to good food. I wouldn't say its authentic Mexican, actually I think the menu is a bit confused. But the food is good overall. It's a place we'll go to every month or so. |
2 | 3 stars | I ended up eating at Taggia while staying at Firesky so it was a choice of convenience. I've had the food from here several times using room service and it's never anything to complain about. It was the same story the day I had lunch here. I had an organic greens salad and shared the margherita and goat cheese pizzas with my fellow lunchers. All of the food was good - the goat cheese pizza in particular with its thin, crispy crust.\n\nUnfortunately the day we ate here our service was MIA. We were told we could seat ourselves so we did. After about 10 minutes someone came by to take our drink order and maybe 10 minutes later our waters arrived. Well 2 out of 3 of them did anyway. Then we ordered two salads and two pizzas to share. One pizza came first. WTH? Where were the salads? Or the other pizza? The salads showed up a few minutes later and then our server realized that she had forgotten our second pizza. No biggie since we had salads and one pizza to eat. But the service was lackluster with a L. Like Andrea R says, I wouldn't go out of my way to eat here, but when in the area it's a good option to have. |
3 | 2 star | I recently had a work luncheon at Ricardo's, I had been before years ago and it was extremely unmemorable. This visit would be more memorable but for the wrong reasons. \n\nWhen given the choice, I prefer to order off the menu than choose a buffet. But the whole group went to the buffet and I didn't want to be the oddball. I had two carne asada tacos, cheese enchilada and chips & salsa. The enchilada was bland the only hint of flavor was the acidity from the tomatoes. The salsa, too, was bland and watery. The chips were pretty generic. The first taco was ok, a bit bland, but tender. The second was filled with grizzly meat. It really turned my stomach. Fortunately, the service was friendly and they were able to accomodate our large group. |
4 | 4 stars | We had a great time at this resort over the long weekend. The staff was super friendly, especially Adam, David and Cassie. Great job!!! And our suite was perfect to accommodate three women with lots of bags, make-up and shoes. The Hole in the Wall restaurant had a really good breakfast, friendly staff and an outdoors patio. Not so for the Rico Restaurant. They were a bit rude, overwhelmed and obviously didn't want our business. We also floated down the Lazy River, it was definitely Lazy...pretty slow but perfect temp. All in all, I'll be back. |
5 | 1 star | Im an owner with no kids, this place is not for my husband and I.. The element here is all about families and cooking in and playing in the pool from the moment it opens.\n\nThe restaurant bar is a bit of a joke, and the pressure to buy more points makes a relaxing vacation more stressful. We were an original owner and saw most of it built.\n\nWe noticed that they no longer offer a shuttle which is a mistake for those that want to go to the strip and not have to worry about driving. But after this weekend I see that they don't need to offer the shuttle because more than half the people there don't plan on leaving the facility at all.\n\nThe guests that we ran into all seemed to be there on a free vacation offer. They were tatted up and pushing baby strollers... and screaming to each other WAIT TIL YOU SEE THE POOL....\n\nMy hubby and myself both looked at each other and said OUT LOUD, we don't think we will be coming back here again ever to this location.\n\nWe came home and looked into selling it all together, but then thought maybe we would try another location that Diamond Resorts has to offer before we do so..\n\nSo bottom line, if you have kids and love the pool and slides and pool some more.. this is for you.. If your looking for a weekend with the hubby or friends in Vegas to relax and to enjoy what Vegas is all about... this resort is not for you.. |
6 | 3 stars | Booked a room here through Priceline for the Tuesday before Thanksgiving. Actually booked it on the drive in from Las Vegas through my cell phone, which was pretty sweet. Paid $25 + tax, so you can't beat the price. We had a hard time finding it as Google Maps was wrong about it's location, but you can't blame the hotel for that.\n\nWhat I can blame the hotel for is not giving me a king size bed. Priceline had booked me with 2 doubles, and in my experience I am always able to switch unless they're sold out. The front desk clerk told me they were indeed not sold out, but it was their policy not to let Priceline users switch rooms. So much for considering staying there the rest of Thanksgiving week.\n\nI was going to let it go, but then at 9:30am the maid knocked on our door and woke me up 2 hours earlier than I had planned (we got to sleep at 5am, give me a break). I ended up coming out of my room and saw that my do not disturb sign was on, so she must have chosen to ruin my day for fun. Tried to fall back asleep but was then kept up by the sounds of what looked to be a loud garbage truck parked right outside of our room.\n\nI give up on sleeping. At least they have solid free Internet so I can Yelp this hotel. Courtyard, you're lucky you're getting 3 stars from me. |
7 | 2 star | Been to 4 Cirques and this is the least favorite. Sets and costumes are absolutely amazing but the acts were very unimpressive compared to the older ones we've seen. \nThe only exception was the opening act with the two twin men on ropes that swung out into the audience. They weren't in the book at the shop so I think they added them in later to spice up the program. Breathtaking!\n\nPros- Art direction, sets, lights\n\nCons- Acts seen before and ANNOYING clowns |
8 | 2 star | KOOLAID KID reminded me of home...nice touch..i know I know..this is suppose to be about the chicken & waffles, but I must say quenching my thirst is very important to me..so back to the food..it was just that chicken(no flavor) & waffles(nothing special)..mac & cheese was very nice...and the new building was very very nice..okay that's all |
9 | 1 star | Just called this location and I live 1.8 miles away. I asked them to deliver and they informed me that they would not deliver to my house because it was a couple hundred yards out of the map plan. They asked me to call the power and southern store. This store advised me that they could not deliver because jimmy johns has a two mile radius they can deliver to. Called this store back and they once again decided to tell me even though I was in the two mile radius they did not want to deliver to me and my only option was for pickup. I will never eat at this location. I know the owners at Firehouse Subs and they go out of the way and this location is just lazy. Not getting my money jimmy johns no matter how fast you are. Laziness is worse |
Preprocessing the Data¶
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/models/bert-base-cased")
"""
文本处理
1 对于长度不等的输入数据,可以使用填充(padding)和截断(truncation)策略来处理
2 Datasets.map + lambda 处理整个数据集
"""
tokenized_datasets = dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length", truncation=True),
    batched=True,
)
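Equivalently, the lambda can be replaced with a named preprocessing function, which is easier to extend later; a sketch under the same tokenizer settings (tokenize_function is a name introduced here for illustration):
# Named-function version of the same batched map
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)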
# Inspect the processed examples
show_random_elements(tokenized_datasets["train"], num_examples=2)
 | label | text | input_ids | token_type_ids | attention_mask |
---|---|---|---|---|---|
0 | 5 stars | Officially my favorite steakhouse hands down! As if Vegas doesn't have enough to write about, this little gem is in one of the remaining standouts of what some call \"Old Vegas\". The meal got off to a great start at the bar as we waited for our table. The service was fast, the drinks were well made and naturally they were top shelf. I was particularly pleased with the professional look and feel of the place. Its very old school, mafioso, dark, white linen, with the sounds of clinking wine glasses and lively conversation in the background. \n\nThe food: Excellent! Your steak is served as intended, well aged beef cooked to order on a hot plate...nothing else! Fantastic. The sides are of sharable portions so agree on one for every 2 or 3 people unless you plan on toting a box of leftovers around Vegas. The mac and cheese is perfect as is the smooth and creamy lobster bisque. The other members in my party had several cuts of meat and one had a chicken dish...my sincerest apologies for not having info on those. Fact is, I wanted every square inch of space in my belly for the rib eye on my plate. Like sooo many other reviews about this place, it was melt in your mouth good. We also ordered a chocolate gnash desert that could only be made better by eating it off of your favorite super models body, otherwise it was absolute perfection. \n\nService was superb. Our waiter was attentive, informative and practiced. He was professional and friendly. Over all a fantastic experience. I will definitely be returning. Highly recommend this location to anyone. You will not be disappointed. | [101, 9018, 1193, 1139, 5095, 26704, 3255, 1493, 1205, 106, 1249, 1191, 6554, 2144, 112, 189, 1138, 1536, 1106, 3593, 1164, 117, 1142, 1376, 176, 5521, 1110, 1107, 1141, 1104, 1103, 2735, 2484, 10469, 1104, 1184, 1199, 1840, 165, 107, 2476, 6554, 165, 107, 119, 1109, 7696, 1400, 1228, 1106, 170, 1632, 1838, 1120, 1103, 2927, 1112, 1195, 3932, 1111, 1412, 1952, 119, 1109, 1555, 1108, 2698, 117, 1103, 8898, 1127, 1218, 1189, 1105, 8534, 1152, 1127, 1499, 12202, 119, 146, 1108, 2521, 7229, 1114, 1103, 1848, 1440, 1105, 1631, 1104, 1103, 1282, 119, 2098, 1304, 1385, 1278, 117, 12477, ...] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...] |
1 | 1 star | Horrible college! If you can go somewhere else they are so unorganized and try to find ways to make you attend their school. | [101, 9800, 27788, 2134, 106, 1409, 1128, 1169, 1301, 4476, 1950, 1152, 1132, 1177, 8362, 1766, 24087, 1105, 2222, 1106, 1525, 3242, 1106, 1294, 1128, 4739, 1147, 1278, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...] |
Data Sampling¶
Use 30,000 samples to demonstrate small-scale training of BERT (with the PyTorch Trainer).
The shuffle() function randomly rearranges the values of a column. If you want more control over the algorithm used to shuffle the dataset, you can specify the generator parameter of this function to use a different numpy.random.Generator, as sketched below.
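For example, a minimal sketch (assuming the tokenized_datasets built above) that passes an explicit numpy.random.Generator:
import numpy as np

# An explicit generator pins down the shuffling algorithm and seed
rng = np.random.default_rng(42)
shuffled_train = tokenized_datasets["train"].shuffle(generator=rng)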
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(30000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(5000))
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("/models/bert-base-cased", num_labels=5)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
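This warning is expected: the bert-base-cased checkpoint ships without a sequence-classification head, so the new 5-way classifier layer (classifier.weight, classifier.bias) is randomly initialized and only becomes meaningful after fine-tuning.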
Training Hyperparameters (TrainingArguments)¶
Source code definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161
The most important setting: the output path for the model weights (output_dir).
from transformers import TrainingArguments

model_dir = "./models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data and step count, set it to 100
training_args = TrainingArguments(
    output_dir=model_dir,
    per_device_train_batch_size=128,
    num_train_epochs=1,
    logging_steps=100,
)
# Full hyperparameter configuration
print(training_args)
TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, batch_eval_metrics=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=None, eval_strategy=no, evaluation_strategy=None, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=models/bert-base-cased-finetune-yelp/runs/Dec26_06-22-56_com1, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=100, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=models/bert-base-cased-finetune-yelp, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=128, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=models/bert-base-cased-finetune-yelp, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=500, save_strategy=steps, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, )
Metric Evaluation during Training (Evaluate)¶
The Hugging Face Evaluate library gives you access to dozens of evaluation methods across different domains (natural language processing, computer vision, reinforcement learning, and more) with a single line of code. Full list of currently supported metrics: https://huggingface.co/evaluate-metric
The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.
The Evaluate library provides a simple accuracy function, which can be loaded with evaluate.load.
import numpy as np
import evaluate

# export HF_ENDPOINT=https://hf-mirror.com  (mirror for mainland China; the accuracy metric is downloaded by default)
metric = evaluate.load("accuracy")

# Use metric.compute to calculate prediction accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predictions (all Transformers models return logits)
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
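As a quick sanity check, metric.compute can also be called directly on made-up values:
# Toy example: 2 of 3 predictions match the references
print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))
# -> {'accuracy': 0.666...}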
Monitoring Metrics during Training¶
Normally, to monitor how evaluation metrics change during training, we can set the evaluation_strategy parameter in TrainingArguments so that metrics are reported at the end of each epoch. (Newer versions of Transformers rename this parameter to eval_strategy, as the deprecation warning below shows.)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=model_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=128,
    num_train_epochs=2,
    logging_steps=30,
)
/root/anaconda3/envs/jupylab/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn(
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.889300 | 0.879301 | 0.608000 |
| 2 | 0.723600 | 0.867261 | 0.627800 |
TrainOutput(global_step=470, training_loss=0.8827837457048132, metrics={'train_runtime': 1196.3641, 'train_samples_per_second': 50.152, 'train_steps_per_second': 0.393, 'total_flos': 1.578708854784e+16, 'train_loss': 0.8827837457048132, 'epoch': 2.0})
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(1000))
trainer.evaluate(small_test_dataset)
{'eval_loss': 0.8601201176643372, 'eval_accuracy': 0.621, 'eval_runtime': 6.9634, 'eval_samples_per_second': 143.607, 'eval_steps_per_second': 17.951, 'epoch': 2.0}
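Beyond aggregate metrics, trainer.predict exposes the raw logits and label ids, which is handy for error analysis; a minimal sketch on the same subset (pred_output is a name introduced for illustration):
# PredictionOutput bundles raw logits, label ids, and metrics (prefixed "test_")
pred_output = trainer.predict(small_test_dataset)
print(pred_output.predictions.shape)   # (1000, 5): one row of logits per sample
print(pred_output.metrics["test_accuracy"])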
Saving the Model and Training State¶
# Use trainer.save_model to save the model; it can later be reloaded with from_pretrained()
trainer.save_model("./models/bert-base-cased-finetune-yelp")
# Use trainer.save_state to save the trainer state (global step, logs, etc.)
trainer.save_state()
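To verify the saved checkpoint, here is a minimal reload-and-predict sketch (the sample sentence is made up; the tokenizer is reloaded from the original base-model path, since this Trainer was not given a tokenizer to save alongside the weights):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

finetuned = AutoModelForSequenceClassification.from_pretrained("./models/bert-base-cased-finetune-yelp")
tokenizer = AutoTokenizer.from_pretrained("/models/bert-base-cased")

inputs = tokenizer("The staff were friendly and the food was great!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = finetuned(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class index: 0 (1 star) ... 4 (5 stars)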