ZhangZhihui's Blog  
Do you know the Python package 'datasets'?
 
ChatGPT said:

Yes — I know the datasets Python package.

📦 What is datasets?

datasets is a popular open-source library from Hugging Face for easily accessing, sharing, processing, and streaming datasets—especially for machine learning and NLP.

✅ Key Features

  • Huge collection of ready-to-use datasets (text, audio, image, tabular, multimodal).

  • One-line loading of many public datasets (e.g., IMDB, MNIST, SQuAD).

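  • Fast and memory-efficient: data is stored in Apache Arrow format and memory-mapped from disk rather than loaded fully into RAM.

  • Built-in streaming, so you can iterate over massive datasets without downloading them in full first.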

  • Easy dataset transformation (map, filter, shard, train-test split).

  • Seamless integration with PyTorch, TensorFlow, JAX, and tokenizers.

🧰 Example: Loading a dataset

from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset)

🔄 Example: Applying a transformation

def add_length(example):
    example["length"] = len(example["text"])
    return example

dataset = dataset.map(add_length)

📥 Installation

pip install datasets

 

You said:
Do you know the Python package 'evaluate'?
 
ChatGPT said:

Yes — I know the evaluate Python package.

📦 What is evaluate?

evaluate is a library from Hugging Face used for loading, computing, and comparing evaluation metrics for machine-learning models.

It complements the datasets library and provides a unified interface to dozens of metrics, plus community-contributed ones hosted on the Hugging Face Hub.

✅ Key Features

  • One-line loading of many standard metrics
    (e.g., accuracy, BLEU, ROUGE, F1, precision, recall, perplexity)

  • Consistent API across all metrics

  • Built-in dataset integration

  • Support for local/custom metrics

  • Efficient storage of predictions and references, which are cached in Apache Arrow format (like datasets)

🧰 Basic Usage

Load and compute a metric:

import evaluate

accuracy = evaluate.load("accuracy")

result = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0]
)

print(result)

Example output:

{'accuracy': 0.75}
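
To see where 0.75 comes from, here is the same calculation worked by hand in plain Python. This is only an illustration of the arithmetic, not how evaluate implements the metric internally.

```python
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 0]

# Count the positions where the prediction matches the reference
correct = sum(p == r for p, r in zip(predictions, references))

# Accuracy is the fraction of matching positions: 3 of 4 here
accuracy = correct / len(references)

print({"accuracy": accuracy})
```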

🧩 Using a metric inside a training loop

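Often used with Hugging Face Transformers, where a function like this is passed to Trainer as compute_metrics:

import evaluate

metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the gold labels (NumPy arrays)
    logits, labels = eval_pred
    # Take the highest-scoring class as the prediction
    preds = logits.argmax(axis=-1)
    # The default F1 is binary; pass average="macro" (or similar) for multi-class labels
    return metric.compute(predictions=preds, references=labels)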
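
To show what such a function computes, the sketch below works through binary F1 by hand with NumPy on made-up logits and labels. It deliberately does not call evaluate, so the values are an illustration of the formula rather than the library's own implementation.

```python
import numpy as np

# Hypothetical logits for 4 examples and 2 classes, with gold labels
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 0.2]])
labels = np.array([0, 1, 0, 0])

# Predicted class = index of the largest logit in each row
preds = logits.argmax(axis=-1)

# Confusion counts for the positive class (label 1)
tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print({"f1": f1})
```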

📥 Installation

pip install evaluate
 
You said:
Do you know the Python package 'accelerate'?
 
ChatGPT said:

Yes — I know the accelerate Python package.

🚀 What is accelerate?

accelerate is a Hugging Face library that helps you run PyTorch training on multiple GPUs, on TPUs, or with mixed precision, without having to write complex distributed-training code.

It abstracts away:

  • DistributedDataParallel (DDP)

  • Multi-GPU setups

  • Mixed-precision (fp16/bf16) training

  • Device placement

  • Gradient accumulation

  • DeepSpeed + Fully Sharded Data Parallel (FSDP)

✅ Why it's useful

  • Write your training loop once → it runs on CPU, 1 GPU, many GPUs, or TPU automatically.

  • Minimal code changes to scale from local to large clusters.

  • Clean and non-intrusive — your training loop stays standard PyTorch.

🧰 Minimal Example

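from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader

accelerator = Accelerator()

# MyModel and my_dataset are placeholders for your own nn.Module and Dataset
model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
train_dataloader = DataLoader(my_dataset, batch_size=16)

# Prepare everything: moves objects to the right device and wraps them for distributed training
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Training loop
model.train()
for batch in train_dataloader:
    optimizer.zero_grad()
    # Assumes dict-style batches and a Transformers-style output with a .loss attribute
    outputs = model(**batch)
    # Replaces loss.backward() so gradient scaling and distributed sync are handled
    accelerator.backward(outputs.loss)
    optimizer.step()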

You don’t need to write model.to(device), DDP setup, or fp16 casting yourself: accelerator.prepare() and accelerator.backward() handle them according to your configuration.

⚙️ CLI

accelerate also includes a very useful setup wizard:

accelerate config

Then launch training with:

accelerate launch train.py

📥 Installation

pip install accelerate

 

 

posted on 2025-11-28 10:16  ZhangZhihuiAAA