ChatGPT said:
Yes — I know the datasets Python package.
📦 What is datasets?
datasets is a popular open-source library from Hugging Face for easily accessing, sharing, processing, and streaming datasets—especially for machine learning and NLP.
✅ Key Features
- Huge collection of ready-to-use datasets (text, audio, image, tabular, multimodal).
- One-line loading of many public datasets (e.g., IMDB, MNIST, SQuAD).
- Fast and memory-efficient thanks to Apache Arrow.
- Built-in streaming, so you can use massive datasets without downloading them (see the sketch just below this list).
- Easy dataset transformation (map, filter, shard, train-test split).
- Seamless integration with PyTorch, TensorFlow, JAX, and tokenizers.
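A minimal sketch of the streaming mode mentioned above, reusing the IMDB dataset from the examples that follow (nothing is downloaded up front; examples arrive as you iterate):

from datasets import load_dataset

# streaming=True returns an iterable dataset instead of downloading everything
streamed = load_dataset("imdb", split="train", streaming=True)
for example in streamed:
    print(example["text"][:50])
    break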
🧰 Example: Loading a dataset
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset)
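As a quick illustration of the PyTorch integration listed above, a loaded split can be handed straight to a DataLoader after switching its format. A minimal sketch, assuming the IMDB dataset loaded above (its "label" column is numeric, so it comes back as tensors):

from torch.utils.data import DataLoader

torch_dataset = dataset["train"].with_format("torch")
loader = DataLoader(torch_dataset, batch_size=8, shuffle=True)
batch = next(iter(loader))
print(batch["label"].shape)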
🔄 Example: Applying a transformation
def add_length(example):
    example["length"] = len(example["text"])
    return example

dataset = dataset.map(add_length)
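The other transformations mentioned in the feature list follow the same pattern. A short sketch of filtering and splitting, assuming the "length" column added by the map call above:

# Keep only short reviews, then make an 80/20 train-test split
short = dataset["train"].filter(lambda ex: ex["length"] < 500)
splits = short.train_test_split(test_size=0.2, seed=42)
print(splits)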
📥 Installation
pip install datasets
You said:
ChatGPT said:
Yes — I know the evaluate Python package.
📦 What is evaluate?
evaluate is a library from Hugging Face used for loading, computing, and comparing evaluation metrics for machine-learning models.
It complements the datasets library and provides a unified interface to hundreds of metrics.
✅ Key Features
- One-line loading of many standard metrics (e.g., accuracy, BLEU, ROUGE, F1, precision, recall, perplexity)
- Consistent API across all metrics
- Built-in dataset integration
- Support for local/custom metrics
- Fast and efficient, powered by Apache Arrow (like datasets)
🧰 Basic Usage
Load and compute a metric:
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0]
)
print(result)
Example output:
{'accuracy': 0.75}
🧩 Using a metric inside a training loop
Often used with Hugging Face Transformers:
metric = evaluate.load("f1") def compute_metrics(eval_pred): logits, labels = eval_pred preds = logits.argmax(axis=-1) return metric.compute(predictions=preds, references=labels)
📥 Installation
pip install evaluate
You said:
ChatGPT said:
Yes — I know the accelerate Python package.
🚀 What is accelerate?
accelerate is a Hugging Face library that helps you run PyTorch training on multiple GPUs or TPUs, or with mixed precision, without having to write complex distributed-training code.
It abstracts away:
- DistributedDataParallel (DDP)
- Multi-GPU setups
- Mixed-precision (fp16/bf16) training
- Device placement
- Gradient accumulation
- DeepSpeed + Fully Sharded Data Parallel (FSDP)
✅ Why it's useful
- Write your training loop once → it runs on CPU, a single GPU, many GPUs, or TPU automatically.
- Minimal code changes to scale from a local machine to large clusters.
- Clean and non-intrusive: your training loop stays standard PyTorch.
🧰 Minimal Example
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
train_dataloader = DataLoader(my_dataset, batch_size=16)

# Prepare everything
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Training loop
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
You don’t need to worry about model.to(device), DDP setup, or fp16 — Accelerator() handles it.
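Mixed precision and gradient accumulation (both listed earlier) are switched on through the Accelerator constructor and the accumulate context manager. A minimal sketch of how the loop above would change (reusing the model, optimizer, and dataloader already prepared):

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)

for batch in train_dataloader:
    # The optimizer step only takes effect every 4 batches
    with accelerator.accumulate(model):
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()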
⚙️ CLI
accelerate also includes a very useful setup wizard:
accelerate config
Then launch training with:
accelerate launch train.py
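Settings from the config can also be overridden on the command line; a sketch (check accelerate launch --help for the exact options available in your version):

accelerate launch --num_processes 2 --mixed_precision fp16 train.py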
📥 Installation
pip install accelerate
