Appendix C: Key PyTorch Libraries (torchvision, torchtext, torchaudio)


PyTorch provides a core framework for tensor operations, automatic differentiation, and building deep learning models. Most real-world machine learning tasks, however, involve specialized data types such as images, text, and audio. To support these, the PyTorch project maintains three companion domain libraries, each installed as a separate package alongside torch:

  • torchvision – for image data, image models, and transformations

  • torchtext – for text preprocessing, datasets, and embeddings

  • torchaudio – for audio loading, preprocessing, and speech applications

These libraries provide dataset loaders, preprocessing transforms, and pretrained models, so that an end-to-end workflow can be assembled without writing custom data-handling code for each domain.


C.1 torchvision – Computer Vision with PyTorch

torchvision is the most commonly used companion library of PyTorch for computer vision tasks. It includes:

  • Popular image datasets

  • Image transformations

  • Pretrained models for classification, detection, segmentation

  • Utility functions for image reading and visualization


C.1.1 Installation

pip install torchvision

C.1.2 Popular Vision Datasets

torchvision.datasets includes standardized benchmark datasets:

Dataset | Task | Description
MNIST | Digit classification | 70,000 grayscale digit images
CIFAR-10 / CIFAR-100 | Object classification | 32×32 color images of 10 or 100 classes
ImageNet | Large-scale classification | 1M+ images, 1000 classes
COCO | Detection, segmentation | Annotated objects + masks
CelebA | Face attributes | Celebrity face dataset

Example: Loading CIFAR-10

from torchvision import datasets, transforms

transform = transforms.ToTensor()

train_dataset = datasets.CIFAR10(root="data/", train=True, download=True, transform=transform)
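Classes in torchvision.datasets follow PyTorch's map-style dataset protocol: they implement __len__ and __getitem__, and the transform is applied when an item is fetched, not when the dataset is created. The sketch below imitates that protocol in plain Python, with no PyTorch dependency, to show where the transform runs. ToyImageDataset, to_unit_range, and the random pixel data are hypothetical stand-ins introduced here, not part of torchvision.

```python
import random

class ToyImageDataset:
    """Hypothetical stand-in that mimics the map-style dataset protocol."""

    def __init__(self, num_items, num_classes=10, transform=None, seed=0):
        rng = random.Random(seed)
        # Each "image" is a flat list of 4 pixel values in [0, 255].
        self.images = [[rng.randint(0, 255) for _ in range(4)] for _ in range(num_items)]
        self.labels = [rng.randrange(num_classes) for _ in range(num_items)]
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image, label = self.images[index], self.labels[index]
        if self.transform is not None:
            image = self.transform(image)  # applied lazily, once per fetched item
        return image, label

def to_unit_range(image):
    """Scale 8-bit pixel values to [0.0, 1.0], as ToTensor does for 8-bit images."""
    return [pixel / 255 for pixel in image]

dataset = ToyImageDataset(num_items=5, transform=to_unit_range)
image, label = dataset[0]
print(len(dataset), image, label)
```

Because the stored data stays untouched, random augmentations placed in the transform produce a different result each time the same index is read.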

C.1.3 Image Transformations

torchvision.transforms provides composable preprocessing and augmentation transforms. They are chained with Compose and applied to each image as it is loaded.

Common Transforms

  • ToTensor()

  • Normalize(mean, std)

  • Resize(size)

  • RandomCrop(size)

  • RandomHorizontalFlip()

Example Transform Pipeline

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
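Normalize(mean, std) computes output = (input - mean) / std independently for each channel, after ToTensor has scaled pixel values to [0, 1]. Below is a minimal plain-Python sketch of that arithmetic, using the mean and std values from the pipeline above. The helper names normalize_pixel and denormalize_pixel are illustrative only.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel means (R, G, B) from the pipeline above
STD = [0.229, 0.224, 0.225]   # per-channel standard deviations

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel whose values lie in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, MEAN, STD)]

def denormalize_pixel(rgb):
    """Invert the normalization, e.g. before displaying an image."""
    return [value * s + m for value, m, s in zip(rgb, MEAN, STD)]

mid_gray = [0.5, 0.5, 0.5]
normalized = normalize_pixel(mid_gray)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)
```

A pixel equal to the channel mean maps to 0, and the inverse step is useful when plotting batches that have already been normalized.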

C.1.4 Pretrained Models

torchvision.models includes widely used architectures for classification and detection, with pretrained weights available:

  • ResNet (18–152)

  • VGG

  • DenseNet

  • MobileNet

  • EfficientNet

  • Faster R-CNN, Mask R-CNN (detection)

Loading a pretrained model

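from torchvision import models

# The pretrained=True argument is deprecated (torchvision 0.13+); pass a weights enum instead.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()  # switch to inference mode before making predictions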

C.1.5 Utility Functions

Useful helpers:

  • torchvision.io.read_image()

  • torchvision.utils.make_grid()

  • torchvision.utils.save_image()

Visualizing a batch

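from torchvision.utils import make_grid
import matplotlib.pyplot as plt

# images_batch: float tensor of shape [B, C, H, W], for example one batch from a DataLoader
grid = make_grid(images_batch)        # tiled image of shape [C, H_grid, W_grid]
plt.imshow(grid.permute(1, 2, 0))     # matplotlib expects [H, W, C]
plt.show()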
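make_grid tiles a batch of shape [B, C, H, W] into a single image of shape [C, grid_H, grid_W], which is why the result is permuted to [grid_H, grid_W, C] before plotting. The sketch below works out the grid size by hand, assuming make_grid's default arguments nrow=8 (images per row) and padding=2, and a batch of at least two images. grid_shape is a hypothetical helper written for this appendix, not a torchvision function.

```python
import math

def grid_shape(batch_size, channels, height, width, nrow=8, padding=2):
    """Return the (C, H, W) shape of the tiled image for a [B, C, H, W] batch.

    Assumes batch_size >= 2; single-channel batches are expanded to 3 channels.
    """
    out_channels = 3 if channels == 1 else channels
    cols = min(nrow, batch_size)         # images placed in each row
    rows = math.ceil(batch_size / cols)  # rows needed to fit the whole batch
    grid_height = rows * (height + padding) + padding
    grid_width = cols * (width + padding) + padding
    return out_channels, grid_height, grid_width

# 64 CIFAR-10-sized images form an 8 x 8 grid.
print(grid_shape(batch_size=64, channels=3, height=32, width=32))
```

Knowing the expected shape makes it easier to choose a figure size and to spot a batch that was passed in the wrong layout.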

C.2 torchtext – Natural Language Processing with PyTorch

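torchtext provides text pipelines and dataset utilities for building NLP models with PyTorch. Active development of torchtext has stopped, and version 0.18 (April 2024) was announced as its final release, so the examples in this section assume torchtext 0.18 or earlier together with a compatible PyTorch version.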


C.2.1 Installation

pip install torchtext

C.2.2 Features of torchtext

1. Text Datasets

Includes popular NLP datasets:

  • AG News

  • IMDB

  • MultiNLI

  • SQuAD

  • WikiText

2. Vocab and Tokenization

Tools for:

  • Word tokenization

  • Vocabulary building

  • Numericalization (mapping tokens → integers)

3. Embeddings

Pretrained embeddings:

  • GloVe

  • FastText

4. Iterators and DataPipes

Datasets are exposed as iterable DataPipes that stream examples one at a time, so large corpora do not have to fit in memory.


C.2.3 Example: Tokenization and Vocabulary

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

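corpus = ["hello world", "deep learning with pytorch"]

# Reserve an <unk> token and use it as the default index; without a default,
# looking up a word that is not in the vocabulary raises an error.
vocab = build_vocab_from_iterator(yield_tokens(corpus), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])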

C.2.4 Working with Pretrained Embeddings

from torchtext.vocab import GloVe

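# The first call downloads the GloVe 6B archive (a large file) and caches it locally.
glove = GloVe(name='6B', dim=100)
vector = glove['computer']  # 1-D tensor with 100 elements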
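Pretrained word vectors are usually compared with cosine similarity, so that related words score close to 1. The sketch below shows the computation on hand-made 3-dimensional vectors in plain Python. The numbers are invented for illustration and are not actual GloVe values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not taken from GloVe).
toy_vectors = {
    "computer": [0.9, 0.1, 0.3],
    "laptop": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(toy_vectors["computer"], toy_vectors["laptop"])
sim_unrelated = cosine_similarity(toy_vectors["computer"], toy_vectors["banana"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With real embedding tensors, torch.nn.functional.cosine_similarity(a, b, dim=0) performs the same calculation on two 1-D tensors.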

C.2.5 Example NLP Pipeline

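# Reuses get_tokenizer and the vocab built in C.2.3.
tokenizer = get_tokenizer('basic_english')

def text_pipeline(text):
    tokens = tokenizer(text)   # e.g. ['this', 'is', 'a', 'pytorch', 'example']
    return vocab(tokens)       # list of integer ids, one per token

# Words missing from the vocabulary raise an error unless the vocab has a
# default index, e.g. vocab.set_default_index(vocab["<unk>"]).
sample = text_pipeline("This is a PyTorch example")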
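The pipeline above has three stages: tokenize, build a vocabulary, and map tokens to integer ids. The plain-Python sketch below mirrors those stages without torchtext, which can help in seeing what the library does or when torchtext is not installed. The whitespace tokenizer and the helper names (tokenize, build_vocab, numericalize) are simplifications introduced here; the basic_english tokenizer additionally separates punctuation.

```python
from collections import Counter

UNK = "<unk>"

def tokenize(text):
    """Lowercase and split on whitespace (a simplification of basic_english)."""
    return text.lower().split()

def build_vocab(corpus, min_freq=1):
    """Map each token to an integer id; id 0 is reserved for unknown tokens."""
    counts = Counter(token for text in corpus for token in tokenize(text))
    # Most frequent first; ties broken alphabetically for a deterministic order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    vocab = {UNK: 0}
    for token, freq in ordered:
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

def numericalize(text, vocab):
    """Convert text to a list of ids, falling back to the <unk> id."""
    return [vocab.get(token, vocab[UNK]) for token in tokenize(text)]

corpus = ["hello world", "deep learning with pytorch"]
vocab = build_vocab(corpus)
print(vocab)
print(numericalize("This is a PyTorch example", vocab))
```

Only "pytorch" occurs in the toy corpus, so every other word in the sample sentence maps to the <unk> id 0.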

C.3 torchaudio – Audio and Speech Processing with PyTorch

torchaudio integrates audio loading, transformation, feature extraction, and pretrained speech models.


C.3.1 Installation

pip install torchaudio

C.3.2 Audio I/O

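torchaudio reads common formats such as WAV, FLAC, and MP3. Which formats are available depends on the audio backend installed on the system.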

Loading Audio

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")

The call returns two values:

  • waveform: float tensor of shape [channels, samples], with values scaled to [-1.0, 1.0] by default

  • sample_rate: sampling frequency in Hz


C.3.3 Audio Transformations

Common transforms in torchaudio.transforms:

  • Resample

  • MelSpectrogram

  • MFCC

  • TimeMasking

  • FrequencyMasking

Example: Extracting Mel Spectrogram

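# Use the sample rate returned by torchaudio.load; a mismatched value
# places the mel filterbank at the wrong frequencies.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
mel_spec = mel(waveform)  # shape: [channels, n_mels, time_frames]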
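The output of MelSpectrogram has shape [channels, n_mels, frames]. The plain-Python sketch below estimates that shape, and the clip duration, from the sample count. It assumes the transform's default settings (n_fft=400, hop_length equal to half the window, n_mels=128, centered frames); clip_duration and mel_shape are hypothetical helpers, and other settings would change the result.

```python
def clip_duration(num_samples, sample_rate):
    """Duration in seconds of a waveform with num_samples samples per channel."""
    return num_samples / sample_rate

def mel_shape(channels, num_samples, n_fft=400, hop_length=None, n_mels=128):
    """Expected [channels, n_mels, frames] shape, assuming centered STFT frames."""
    if hop_length is None:
        hop_length = n_fft // 2  # assumed default: half the window length
    frames = 1 + num_samples // hop_length
    return channels, n_mels, frames

# One second of mono audio sampled at 16 kHz.
print(clip_duration(16000, 16000))
print(mel_shape(channels=1, num_samples=16000))
```

Checking the expected frame count in advance helps when sizing the input layer of a model that consumes spectrograms.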

C.3.4 Pretrained Speech Models

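torchaudio.pipelines bundles pretrained speech models together with the sample rate and output labels they expect. Model families available in torchaudio for ASR (Automatic Speech Recognition) include:

  • Wav2Vec2

  • HuBERT

  • Conformer (architecture provided in torchaudio.models)

  • Emformer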

Example: Speech-to-Text

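import torch

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# The model expects mono audio at bundle.sample_rate (16 kHz for this bundle).
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    emissions, _ = model(waveform)  # per-frame label scores: [batch, frames, num_labels]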
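The model output is a matrix of per-frame label scores, not text. For CTC-trained models such as this Wav2Vec2 bundle, a common way to obtain a transcript is greedy decoding: take the best label for each frame, merge consecutive repeats, and drop the blank label. A minimal plain-Python sketch follows, assuming the best index per frame has already been computed (for example with argmax over the last dimension). The small label set is hypothetical, chosen to resemble the convention of a '-' blank at index 0 and '|' as the word separator; the real labels should be read from bundle.get_labels().

```python
def greedy_ctc_decode(frame_indices, labels, blank=0, word_sep="|"):
    """Collapse repeated labels, remove blanks, and join into a transcript."""
    decoded = []
    previous = None
    for index in frame_indices:
        # A label counts only when it differs from the previous frame's label.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded).replace(word_sep, " ").strip()

# Hypothetical label set: blank, word separator, then a few letters.
labels = ("-", "|", "H", "I", "T", "E", "R")

# Best label index per frame for an imagined utterance.
frames = [2, 2, 0, 3, 3, 1, 1, 4, 0, 2, 5, 5, 6, 0, 5]
transcript = greedy_ctc_decode(frames, labels)
print(transcript)
```

Greedy decoding is the simplest option; beam search with a language model usually gives better transcripts at a higher computational cost.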

C.3.5 Audio Augmentation

Useful for improving speech model robustness:

  • Add background noise

  • Time stretching

  • Pitch shifting

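Example (frequency masking, a further augmentation that is applied to a spectrogram and not to the raw waveform):

# Zero out a random band of fewer than 30 consecutive mel bins
augment = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)
aug_spec = augment(mel_spec)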
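FrequencyMasking hides a random band of consecutive frequency bins, so that a model cannot rely on any single band. The plain-Python sketch below imitates the idea on a nested-list spectrogram: it draws a band width below freq_mask_param and a start bin, then sets those rows to a mask value. The function name and the sampling details are assumptions made for illustration and may differ from torchaudio's implementation.

```python
import random

def mask_frequency_band(spec, freq_mask_param, rng, mask_value=0.0):
    """Return a masked copy of spec ([n_bins][n_frames]) plus the band position."""
    n_bins = len(spec)
    width = rng.randrange(freq_mask_param)     # band width in [0, freq_mask_param)
    width = min(width, n_bins)                 # never wider than the spectrogram
    start = rng.randrange(n_bins - width + 1)  # index of the first masked bin
    masked = [row[:] for row in spec]          # copy so the input stays intact
    for bin_index in range(start, start + width):
        masked[bin_index] = [mask_value] * len(masked[bin_index])
    return masked, start, width

rng = random.Random(0)
spec = [[1.0] * 5 for _ in range(40)]  # 40 frequency bins x 5 frames of ones
masked, start, width = mask_frequency_band(spec, freq_mask_param=30, rng=rng)
zero_rows = [i for i, row in enumerate(masked) if all(v == 0.0 for v in row)]
print(start, width, zero_rows)
```

Time masking works the same way along the frame axis, and the two are often combined during training.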

C.4 Comparison of PyTorch Companion Libraries

Feature | torchvision | torchtext | torchaudio
Domain | Images | Text | Audio
Supports Pretrained Models | Yes | Limited | Yes (ASR models)
Data I/O | Strong | Good | Strong
Transformations | Extensive | Moderate | Extensive
Dataset Availability | High | High | Medium
Common Applications | CNNs, detection, segmentation | NLP, embedding, classification | Speech recognition, audio analysis

C.5 Best Practices

  • Use torchvision.transforms for robust augmentation in CV tasks

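  • For NLP, use torchtext's tokenizer, vocabulary, and pretrained-embedding utilities, and pin the torchtext version to a matching PyTorch release, since the library is no longer actively developed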

  • Use torchaudio for feature extraction (MelSpectrogram, MFCC)

  • Combine vision, audio, and text modules for multimodal deep learning

  • Prefer pretrained models whenever available to accelerate training


C.6 Summary

This appendix introduced the three major PyTorch domain libraries:

  • torchvision for images

  • torchtext for natural language data

  • torchaudio for audio and speech

These libraries simplify dataset loading, preprocessing, augmentation, and modeling across common machine learning domains, and they remove much of the boilerplate otherwise needed to get image, text, and audio data into a PyTorch model.


