Appendix C: Key PyTorch Libraries (torchvision, torchtext, torchaudio)

PyTorch provides a core framework for tensor operations, automatic differentiation, and building deep learning models. Most real-world machine learning tasks, however, involve specialized data types such as images, text, and audio. To support them, the PyTorch project maintains three companion domain libraries, each installed as a separate package:

torchvision – for image data, image models, and transformations
torchtext – for text preprocessing, datasets, and embeddings
torchaudio – for audio loading, preprocessing, and speech applications

These libraries provide dataset loaders, preprocessing utilities, and pretrained models that make it easier to build end-to-end ML workflows.

C.1 torchvision – Computer Vision with PyTorch

torchvision is PyTorch's companion library for computer vision tasks. It includes:

Popular image datasets
Image transformations
Pretrained models for classification, detection, segmentation
Utility functions for image reading and visualization

C.1.1 Installation

pip install torchvision

C.1.2 Popular Vision Datasets

torchvision.datasets includes standardized benchmark datasets:

Dataset	Task	Description
MNIST	Digit classification	70,000 grayscale digit images
CIFAR-10 / CIFAR-100	Object classification	32×32 color images of 10 or 100 classes
ImageNet	Large-scale classification	1M+ images, 1000 classes
COCO	Detection, segmentation	Annotated objects + masks
CelebA	Face attributes	Celebrity face dataset

Example: Loading CIFAR-10

from torchvision import datasets, transforms

transform = transforms.ToTensor()

train_dataset = datasets.CIFAR10(root="data/", train=True, download=True, transform=transform)
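
Once a dataset is constructed, it is usually wrapped in a torch.utils.data.DataLoader to obtain shuffled mini-batches. The sketch below is illustrative only: to avoid the download, it uses a synthetic TensorDataset with CIFAR-10's image shape (3×32×32) as a stand-in for train_dataset, and the batch size of 64 is an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for train_dataset: 256 random "images" shaped like
# CIFAR-10 samples (3 channels, 32x32) with integer labels in [0, 10).
images = torch.rand(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
stand_in_dataset = TensorDataset(images, labels)

# batch_size=64 is an arbitrary illustrative value.
loader = DataLoader(stand_in_dataset, batch_size=64, shuffle=True)

# Each iteration yields one (images, labels) mini-batch.
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([64, 3, 32, 32])
print(batch_labels.shape)  # torch.Size([64])
```

With the real dataset, the same DataLoader call would take train_dataset in place of the stand-in.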

C.1.3 Image Transformations

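torchvision.transforms provides composable preprocessing and augmentation transforms. Most operate on PIL images or tensors and are applied on the fly as samples are loaded; they are not, in general, differentiable operations.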

Common Transforms

ToTensor()
Normalize(mean, std)
Resize(size)
RandomCrop(size)
RandomHorizontalFlip()

Example Transform Pipeline

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
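
Normalize applies (x - mean) / std independently to each channel; the mean and std values in the pipeline above are the commonly used ImageNet statistics. The same arithmetic can be written with plain tensor broadcasting, which also makes it easy to undo before plotting. This is an illustrative sketch on a constant synthetic image, not torchvision's implementation.

```python
import torch

# Per-channel statistics reshaped to [C, 1, 1] so they broadcast over H and W.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Synthetic image in [C, H, W] layout whose every pixel equals 0.5.
image = torch.full((3, 4, 4), 0.5)

# Forward: subtract the channel mean, divide by the channel std.
normalized = (image - mean) / std

# Inverse: recover the original pixel values, e.g. before visualization.
restored = normalized * std + mean

print(normalized[:, 0, 0])  # one normalized value per channel
```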

C.1.4 Pretrained Models

torchvision.models includes widely used architectures for classification, detection, and segmentation, many with pretrained weights:

ResNet (18–152)
VGG
DenseNet
MobileNet
EfficientNet
Faster R-CNN, Mask R-CNN (detection)

Loading a pretrained model

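from torchvision import models

# `pretrained=True` is deprecated since torchvision 0.13; pass a weights enum instead
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()  # inference mode (batch norm uses its running statistics)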

C.1.5 Utility Functions

Useful helpers:

torchvision.io.read_image()
torchvision.utils.make_grid()
torchvision.utils.save_image()

Visualizing a batch

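from torchvision.utils import make_grid
import matplotlib.pyplot as plt

# images_batch: tensor of shape [B, C, H, W] with values in [0, 1], e.g. one batch from a DataLoader
grid = make_grid(images_batch, nrow=8)  # tile the batch into a single [C, H', W'] image
plt.imshow(grid.permute(1, 2, 0))       # matplotlib expects [H, W, C]
plt.show()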

C.2 torchtext – Natural Language Processing with PyTorch

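torchtext provides text pipelines and dataset utilities for building NLP models with PyTorch. Note that active development of torchtext has stopped (0.18 was announced as its last stable release), so the examples in this section target those final releases.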

C.2.1 Installation

pip install torchtext

C.2.2 Features of torchtext

1. Text Datasets

Includes popular NLP datasets:

AG News
IMDB
MultiNLI
SQuAD
WikiText

2. Vocab and Tokenization

Tools for:

Word tokenization
Vocabulary building
Numericalization (mapping tokens → integers)

3. Embeddings

Pretrained embeddings:

GloVe
FastText

4. Iterators and DataPipes

Efficient streaming of text data for large corpora.

C.2.3 Example: Tokenization and Vocabulary

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

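# Reserve "<unk>" and map out-of-vocabulary tokens to it;
# without a default index, looking up an unknown token raises an error
vocab = build_vocab_from_iterator(
    yield_tokens(["hello world", "deep learning with pytorch"]),
    specials=["<unk>"],
)
vocab.set_default_index(vocab["<unk>"])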
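
Conceptually, a vocabulary is a token-to-index mapping built from token counts, with a fallback index for unknown tokens. The pure-Python sketch below illustrates the idea without any dependency; it is not torchtext's implementation, and the whitespace tokenizer, the `<unk>` token, and all helper names are assumptions made for this illustration.

```python
from collections import Counter

def simple_tokenize(text):
    # Simplified stand-in for a real tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def build_simple_vocab(texts, unk_token="<unk>"):
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Index 0 is reserved for unknown tokens; the remaining tokens are ordered
    # by descending frequency, with ties broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([unk_token] + ordered)}

def numericalize(text, vocab, unk_token="<unk>"):
    # Map each token to its index, falling back to the unknown-token index.
    return [vocab.get(tok, vocab[unk_token]) for tok in simple_tokenize(text)]

toy_vocab = build_simple_vocab(["hello world", "deep learning with pytorch"])
ids = numericalize("hello pytorch fans", toy_vocab)
print(ids)  # [2, 4, 0]  ("fans" is out of vocabulary, so it maps to <unk>)
```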

C.2.4 Working with Pretrained Embeddings

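from torchtext.vocab import GloVe

# Downloads and caches the GloVe 6B vectors on first use (a large download)
glove = GloVe(name='6B', dim=100)
vector = glove['computer']  # tensor of shape [100]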
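
Pretrained vectors are typically copied into an nn.Embedding layer so that a model can look them up by token index. The sketch below uses a small random matrix as a stand-in for real GloVe vectors (the vocabulary size of 5 and the dimension of 100 are illustrative assumptions), so it runs without any download.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Stand-in for a [vocab_size, dim] matrix of pretrained vectors.
torch.manual_seed(0)
pretrained_vectors = torch.randn(5, 100)

# freeze=True keeps the vectors fixed during training.
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Look up a batch containing one 3-token sequence.
token_ids = torch.tensor([[0, 3, 4]])
embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([1, 3, 100])

# Cosine similarity is a common way to compare two word vectors.
similarity = F.cosine_similarity(pretrained_vectors[0], pretrained_vectors[3], dim=0)
```

Setting freeze=False instead would let the vectors be fine-tuned along with the rest of the model.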

C.2.5 Example NLP Pipeline

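from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

# Reuses `vocab` from C.2.3; it must have a default index set, because
# most tokens in the sample below are not in that tiny vocabulary
def text_pipeline(text):
    tokens = tokenizer(text)
    return vocab(tokens)  # list of integer token ids

sample = text_pipeline("This is a PyTorch example")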
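
Because sentences differ in length, the integer sequences produced by such a pipeline must be padded before they can be stacked into a batch. A minimal sketch using torch.nn.utils.rnn.pad_sequence follows; the token ids and the padding index 0 are made up for illustration, and a real vocabulary would usually reserve a dedicated padding token.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # assumed padding id for this illustration

def collate_batch(sequences):
    # Convert each list of token ids to a 1-D tensor.
    tensors = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
    # Keep the true lengths so a model can ignore the padded positions.
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Right-pad every sequence to the length of the longest one.
    padded = pad_sequence(tensors, batch_first=True, padding_value=PAD_INDEX)
    return padded, lengths

padded, lengths = collate_batch([[5, 2, 9], [7, 4], [3, 8, 6, 1]])
print(padded)
# tensor([[5, 2, 9, 0],
#         [7, 4, 0, 0],
#         [3, 8, 6, 1]])
```

A function like collate_batch can be passed to a DataLoader through its collate_fn argument.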

C.3 torchaudio – Audio and Speech Processing with PyTorch

torchaudio integrates audio loading, transformation, feature extraction, and pretrained speech models.

C.3.1 Installation

pip install torchaudio

C.3.2 Audio I/O

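torchaudio can read common formats such as WAV, MP3, and FLAC; which formats are available depends on the audio backend that is installed.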

Loading Audio

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")

waveform: Tensor of shape [channels, samples]
sample_rate: Sampling frequency in Hz
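
Two things commonly done right after loading are computing the clip duration from the tensor shape and mixing multi-channel audio down to mono. The sketch below synthesizes a one-second stereo sine wave in place of loading audio.wav, so the 440 Hz tone and the 16 kHz rate are illustrative assumptions.

```python
import math
import torch

sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate    # one second of time stamps
tone = torch.sin(2 * math.pi * 440.0 * t)      # 440 Hz sine wave
waveform = torch.stack([tone, 0.5 * tone])     # stereo tensor: [2, 16000]

# Duration follows directly from the [channels, samples] layout.
num_channels, num_samples = waveform.shape
duration_seconds = num_samples / sample_rate

# Average across the channel dimension to get a mono signal: [1, 16000].
mono = waveform.mean(dim=0, keepdim=True)

print(duration_seconds)  # 1.0
print(mono.shape)        # torch.Size([1, 16000])
```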

C.3.3 Audio Transformations

Common transforms in torchaudio.transforms:

Resample
MelSpectrogram
MFCC
TimeMasking
FrequencyMasking

Example: Extracting Mel Spectrogram

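# Use the file's own sample rate so the mel filterbank covers the correct frequency range
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
mel_spec = mel(waveform)  # shape: [channels, n_mels, time_frames]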
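
A mel spectrogram starts from a short-time Fourier transform (STFT), whose window and hop sizes determine the number of time frames. The sketch below uses core torch.stft on one second of synthetic 16 kHz noise with n_fft=400 and hop_length=200; these values are assumed here to match MelSpectrogram's defaults and should be checked against the documentation for the installed version. It stops at the power spectrogram and does not apply the mel filterbank.

```python
import torch

sample_rate = 16000
n_fft = 400        # assumed window / FFT size
hop_length = 200   # assumed hop between successive frames

torch.manual_seed(0)
waveform = torch.randn(1, sample_rate)  # synthetic audio: [channels, samples]

# Complex STFT with a Hann window; center=True (the default) pads both ends.
spec = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
)
power_spec = spec.abs() ** 2  # [channels, n_fft // 2 + 1, frames]

# With center=True, the frame count is samples // hop_length + 1.
expected_frames = waveform.shape[1] // hop_length + 1
print(power_spec.shape)  # torch.Size([1, 201, 81])
```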

C.3.4 Pretrained Speech Models

torchaudio.pipelines bundles pretrained speech models, many of them for ASR (Automatic Speech Recognition), and torchaudio.models provides the underlying architectures:

Wav2Vec2
HuBERT
Conformer
Emformer

Example: Speech-to-Text

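import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# The model expects audio at bundle.sample_rate; resample the loaded waveform to match
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    emissions, _ = model(waveform)  # per-frame label scores; decoding to text is a separate step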
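
The model's output is a matrix of per-frame label scores (emissions), not text, so a decoding step is still needed. The simplest option is greedy CTC decoding: take the best label per frame, collapse consecutive repeats, and drop the blank symbol. The sketch below runs on a hand-made emission matrix with a hypothetical four-symbol label set; a real pipeline would use the label set that belongs to the pretrained model.

```python
import torch

# Hypothetical label set: "-" is the CTC blank, "|" marks a word boundary.
labels = ("-", "|", "H", "I")
BLANK = 0

def greedy_ctc_decode(emission, labels, blank=BLANK):
    """Decode one utterance from emission scores of shape [time, num_labels]."""
    best = torch.argmax(emission, dim=-1)        # best label index per frame
    collapsed = torch.unique_consecutive(best)   # merge consecutive repeats
    kept = [i for i in collapsed.tolist() if i != blank]  # drop blanks
    return "".join(labels[i] for i in kept).replace("|", " ").strip()

# Hand-made emissions whose frame-wise best labels are: H H - I I | -
frame_labels = torch.tensor([2, 2, 0, 3, 3, 1, 0])
emission = torch.nn.functional.one_hot(frame_labels, num_classes=len(labels)).float()

text = greedy_ctc_decode(emission, labels)
print(text)  # HI
```

Greedy decoding ignores any language model; beam search with a lexicon or language model usually gives better transcripts at higher cost.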

C.3.5 Audio Augmentation

Useful for improving speech model robustness:

Add background noise
Time stretching
Pitch shifting

Example (SpecAugment-style frequency masking, applied to the mel spectrogram from C.3.3):

augment = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)
aug_audio = augment(mel_spec)
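
Of the augmentations listed above, adding background noise is straightforward to sketch with plain tensor operations: scale the noise so that the ratio of signal power to noise power matches a target signal-to-noise ratio (SNR). The helper name add_noise_at_snr, the Gaussian noise, and the 10 dB target are assumptions made for illustration; this is not a torchaudio API.

```python
import math
import torch

def add_noise_at_snr(waveform, noise, snr_db):
    """Mix `noise` into `waveform` so the result has the requested SNR in dB."""
    signal_power = waveform.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Solve signal_power / (scale**2 * noise_power) == 10 ** (snr_db / 10).
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise, scale

torch.manual_seed(0)
sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate
clean = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)  # synthetic tone: [1, 16000]
noise = torch.randn_like(clean)                           # Gaussian stand-in for background noise

noisy, scale = add_noise_at_snr(clean, noise, snr_db=10.0)

# Verify the achieved SNR from the clean signal and the scaled noise.
achieved_db = 10 * torch.log10(clean.pow(2).mean() / (scale * noise).pow(2).mean())
print(round(achieved_db.item(), 2))  # 10.0
```

Lower SNR values produce noisier training examples; sampling the SNR at random per example is a common variation.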

C.4 Comparison of PyTorch Companion Libraries

Feature	torchvision	torchtext	torchaudio
Domain	Images	Text	Audio
Supports Pretrained Models	Yes	Limited	Yes (ASR models)
Data I/O	Strong	Good	Strong
Transformations	Extensive	Moderate	Extensive
Dataset Availability	High	High	Medium
Common Applications	CNNs, detection, segmentation	NLP, embedding, classification	Speech recognition, audio analysis

C.5 Best Practices

Use torchvision.transforms for robust augmentation in CV tasks
For NLP, prefer torchtext DataPipes + pretrained embeddings
Use torchaudio for feature extraction (MelSpectrogram, MFCC)
Combine vision, audio, and text modules for multimodal deep learning
Prefer pretrained models whenever available to accelerate training

C.6 Summary

This appendix introduced the three major PyTorch domain libraries:

torchvision for images
torchtext for natural language data
torchaudio for audio and speech

These libraries simplify dataset loading, preprocessing, augmentation, and modeling for image, text, and audio data, so that application code can concentrate on the model and the training loop.