Appendix C: Key PyTorch Libraries (torchvision, torchtext, torchaudio)
PyTorch provides a powerful core framework for tensor operations, automatic differentiation, and building deep learning models. However, most real-world machine learning tasks involve working with specialized data types such as images, text, and audio. To simplify this, PyTorch includes three companion libraries:
- torchvision – for image data, image models, and transformations
- torchtext – for text preprocessing, datasets, and embeddings
- torchaudio – for audio loading, preprocessing, and speech applications
These libraries provide optimized data utilities, pretrained models, and industry-ready pipelines that make it easier to build end-to-end ML workflows.
C.1 torchvision – Computer Vision with PyTorch
torchvision is the most commonly used companion library of PyTorch for computer vision tasks. It includes:
- Popular image datasets
- Image transformations
- Pretrained models for classification, detection, segmentation
- Utility functions for image reading and visualization
C.1.1 Installation
pip install torchvision
C.1.2 Popular Vision Datasets
torchvision.datasets includes standardized benchmark datasets:
| Dataset | Task | Description |
|---|---|---|
| MNIST | Digit classification | 70,000 grayscale digit images |
| CIFAR-10 / CIFAR-100 | Object classification | 32×32 color images of 10 or 100 classes |
| ImageNet | Large-scale classification | 1M+ images, 1000 classes |
| COCO | Detection, segmentation | Annotated objects + masks |
| CelebA | Face attributes | Celebrity face dataset |
Example: Loading CIFAR-10
from torchvision import datasets, transforms
transform = transforms.ToTensor()
train_dataset = datasets.CIFAR10(root="data/", train=True, download=True, transform=transform)
C.1.3 Image Transformations
torchvision.transforms provides composable preprocessing and data augmentation transforms for images.
Common Transforms
- ToTensor()
- Normalize(mean, std)
- Resize(size)
- RandomCrop(size)
- RandomHorizontalFlip()
Example Transform Pipeline
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
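To use such a pipeline, it is passed to a dataset and the dataset is wrapped in a DataLoader. A minimal sketch, continuing the CIFAR-10 example from C.1.2 (the batch size and worker count are illustrative values):
from torch.utils.data import DataLoader
from torchvision import datasets

# Each image is transformed on access once the pipeline is attached to the dataset
train_dataset = datasets.CIFAR10(root="data/", train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)

images, labels = next(iter(train_loader))   # images: [64, 3, 224, 224]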
C.1.4 Pretrained Models
torchvision.models includes state-of-the-art CNN architectures:
- ResNet (18–152)
- VGG
- DenseNet
- MobileNet
- EfficientNet
- Faster R-CNN, Mask R-CNN (detection)
Loading a pretrained model
from torchvision import models
# In recent torchvision releases (0.13+), weights are selected via the `weights` argument
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
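A pretrained backbone is often adapted to a new task by replacing its classification head. A minimal fine-tuning sketch, continuing from the model loaded above and assuming a hypothetical 10-class problem:
import torch.nn as nn

# Freeze the pretrained backbone so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a 10-class output
model.fc = nn.Linear(model.fc.in_features, 10)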
C.1.5 Utility Functions
Useful helpers:
- torchvision.io.read_image()
- torchvision.utils.make_grid()
- torchvision.utils.save_image()
Visualizing a batch
from torchvision.utils import make_grid
import matplotlib.pyplot as plt
# images_batch: a tensor of shape [B, C, H, W], e.g. one batch from a DataLoader
grid = make_grid(images_batch)
plt.imshow(grid.permute(1, 2, 0))
plt.show()
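The I/O helpers work directly on tensors. A short sketch of reading and saving an image, assuming a placeholder file name sample.jpg:
from torchvision.io import read_image
from torchvision.utils import save_image

img = read_image("sample.jpg")                 # uint8 tensor of shape [C, H, W]
save_image(img.float() / 255.0, "copy.png")    # save_image expects float values in [0, 1]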
C.2 torchtext – Natural Language Processing with PyTorch
torchtext provides text pipelines and dataset utilities for building NLP models with PyTorch.
C.2.1 Installation
pip install torchtext
C.2.2 Features of torchtext
1. Text Datasets
Includes popular NLP datasets (a loading example follows this list):
- AG News
- IMDB
- MultiNLI
- SQuAD
- WikiText
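Loading one of these datasets yields an iterator over (label, text) pairs. A minimal sketch using AG News, assuming a torchtext version that provides the iterable dataset API:
from torchtext.datasets import AG_NEWS

train_iter = AG_NEWS(split='train')
label, text = next(iter(train_iter))   # e.g. an integer class label and a raw news string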
2. Vocab and Tokenization
Tools for:
- Word tokenization
- Vocabulary building
- Numericalization (mapping tokens → integers)
3. Embeddings
Pretrained embeddings:
- GloVe
- FastText
4. Iterators and DataPipes
Efficient streaming of text data for large corpora.
C.2.3 Example: Tokenization and Vocabulary
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(
    yield_tokens(["hello world", "deep learning with pytorch"]),
    specials=["<unk>"],
)
vocab.set_default_index(vocab["<unk>"])   # map out-of-vocabulary tokens to <unk>
C.2.4 Working with Pretrained Embeddings
from torchtext.vocab import GloVe
glove = GloVe(name='6B', dim=100)   # downloads the pretrained vectors on first use
vector = glove['computer']          # 100-dimensional embedding for the word "computer"
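Pretrained vectors are often used to initialize an embedding layer for a vocabulary. A minimal sketch, assuming the vocab built in C.2.3 and freezing the embedding weights:
import torch.nn as nn

# Look up a GloVe vector for every token in the vocabulary (unknown tokens receive zero vectors)
weights = glove.get_vecs_by_tokens(vocab.get_itos())
embedding = nn.Embedding.from_pretrained(weights, freeze=True)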
C.2.5 Example NLP Pipeline
tokenizer = get_tokenizer('basic_english')

def text_pipeline(text):
    # Tokenize, then map each token to its index in the vocabulary built in C.2.3
    tokens = tokenizer(text)
    return vocab(tokens)

sample = text_pipeline("This is a PyTorch example")   # list of token indices; unseen words map to <unk>
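For training, variable-length index lists are usually padded into a single batch tensor. A minimal collate sketch building on text_pipeline above, assuming index 0 is acceptable as padding:
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(texts):
    # Convert each raw string to a tensor of token indices, then pad to a common length
    tensors = [torch.tensor(text_pipeline(t), dtype=torch.long) for t in texts]
    return pad_sequence(tensors, batch_first=True, padding_value=0)

batch = collate_batch(["This is a PyTorch example", "hello world"])   # shape: [2, max_len]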
C.3 torchaudio – Audio and Speech Processing with PyTorch
torchaudio integrates audio loading, transformation, feature extraction, and pretrained speech models.
C.3.1 Installation
pip install torchaudio
C.3.2 Audio I/O
Supports reading WAV, MP3, FLAC, etc.
Loading Audio
import torchaudio
waveform, sample_rate = torchaudio.load("audio.wav")
- waveform: tensor of shape [channels, samples]
- sample_rate: sampling frequency
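Complementing the loading example above, file metadata can be inspected without decoding the full audio, and tensors can be written back to disk. A short sketch using the same file:
info = torchaudio.info("audio.wav")
print(info.sample_rate, info.num_channels, info.num_frames)

# Save a (possibly modified) waveform back to disk at the original sampling rate
torchaudio.save("output.wav", waveform, sample_rate)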
C.3.3 Audio Transformations
Common transforms in torchaudio.transforms:
- Resample
- MelSpectrogram
- MFCC
- TimeMasking
- FrequencyMasking
Example: Extracting Mel Spectrogram
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000)
mel_spec = mel(waveform)
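The MelSpectrogram above is configured for 16 kHz input, so audio loaded at a different rate should be resampled first. A short sketch using the Resample transform:
# Bring the loaded waveform to 16 kHz before computing the mel spectrogram
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
waveform_16k = resampler(waveform)
mel_spec = mel(waveform_16k)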
C.3.4 Pretrained Speech Models
torchaudio.pipelines provides state-of-the-art ASR (Automatic Speech Recognition) models:
- Wav2Vec2
- HuBERT
- Conformer
- Emformer
Example: Speech-to-Text
import torch

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

# Match the sample rate expected by the pretrained bundle
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    emission, _ = model(waveform)   # per-frame scores over the model's character labels
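The emission tensor holds per-frame scores over the bundle's character labels; producing text requires a CTC decoder. A minimal greedy-decoding sketch (collapse repeated labels, drop the blank symbol), assuming the emission computed above:
labels = bundle.get_labels()             # character set; the first entry is the CTC blank
indices = emission[0].argmax(dim=-1)     # most likely label index per frame

decoded = []
prev = None
for idx in indices.tolist():
    if idx != prev and labels[idx] != labels[0]:   # collapse repeats, skip blanks
        decoded.append(labels[idx])
    prev = idx

text = "".join(decoded).replace("|", " ")   # '|' marks word boundaries in this label set
print(text)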
C.3.5 Audio Augmentation
Useful for improving speech model robustness:
- Adding background noise (a simple mixing sketch follows the masking example below)
- Time stretching
- Pitch shifting
Example (SpecAugment-style frequency masking applied to a spectrogram):
augment = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)
aug_audio = augment(mel_spec)
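Background noise can also be mixed in directly on the waveform. A minimal sketch that adds Gaussian noise at a target signal-to-noise ratio (the helper name and the 10 dB value are illustrative):
import torch

def add_noise(waveform, snr_db=10.0):
    # Scale Gaussian noise so the mixture has the requested SNR (in dB)
    noise = torch.randn_like(waveform)
    signal_power = waveform.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise

noisy_waveform = add_noise(waveform, snr_db=10.0)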
C.4 Comparison of PyTorch Companion Libraries
| Feature | torchvision | torchtext | torchaudio |
|---|---|---|---|
| Domain | Images | Text | Audio |
| Supports Pretrained Models | Yes | Limited | Yes (ASR models) |
| Data I/O | Strong | Good | Strong |
| Transformations | Extensive | Moderate | Extensive |
| Dataset Availability | High | High | Medium |
| Common Applications | CNNs, detection, segmentation | NLP, embedding, classification | Speech recognition, audio analysis |
C.5 Best Practices
- Use torchvision.transforms for robust augmentation in CV tasks
- For NLP, prefer torchtext DataPipes + pretrained embeddings
- Use torchaudio for feature extraction (MelSpectrogram, MFCC)
- Combine vision, audio, and text modules for multimodal deep learning
- Prefer pretrained models whenever available to accelerate training
C.6 Summary
This appendix introduced the three major PyTorch domain libraries:
- torchvision for images
- torchtext for natural language data
- torchaudio for audio and speech
These libraries simplify dataset loading, preprocessing, augmentation, and modeling across popular machine learning domains. They are essential tools for building efficient and scalable AI systems using PyTorch.