Chapter 17: Model Deployment with PyTorch

Abstract:

Deploying PyTorch models involves making a trained model accessible for inference in a production environment. This process can vary significantly depending on the target environment and desired scale.
Key Steps in PyTorch Model Deployment:
  • Model Export/Serialization:
    • TorchScript: PyTorch models are often converted to TorchScript, an intermediate representation that can be run independently of Python. This enables deployment in C++ environments, mobile devices, and serverless functions.
    • Saving the Model: The model's state dictionary (its learned parameters) can be saved with torch.save(); the architecture itself is not included, so reloading requires the model class, whereas a TorchScript archive is self-contained.
Python
import torch
import torchvision.models as models

# Assuming 'model' is your trained PyTorch model
model = models.resnet18(pretrained=True)
torch.save(model.state_dict(), 'model_weights.pth')

# For TorchScript:
scripted_model = torch.jit.script(model)
scripted_model.save("scripted_model.pt")
  • Choosing a Deployment Strategy:
    • Local Deployment (e.g., Flask/FastAPI): For smaller-scale applications or rapid prototyping, models can be served locally using web frameworks like Flask or FastAPI, creating an API endpoint for inference.
    • Cloud Platforms (e.g., AWS SageMaker, Azure ML, Google Cloud Vertex AI): These platforms offer managed services for deploying and scaling machine learning models, often integrating with tools like TorchServe for efficient serving.
    • Edge Devices (e.g., Raspberry Pi, NVIDIA Jetson): For embedded systems, models can be deployed on specialized hardware, potentially requiring platform-specific optimizations or libraries (e.g., PyTorch for Arm).
    • Containerization (e.g., Docker, Kubernetes): Packaging the model and its dependencies into a Docker container ensures consistent deployment across different environments and facilitates scaling with Kubernetes.
  • Inference Code/Serving Logic:
    • A script or handler is needed to load the deployed model, preprocess input data, perform inference, and post-process the output.
    • For API deployments, this logic is typically embedded within the API endpoint handler.
    • For TorchServe, custom handlers are used to define the inference process.
  • Monitoring and Management:
    • After deployment, monitoring the model's performance, resource utilization, and potential drift is crucial.
    • Cloud platforms often provide built-in monitoring tools, while custom solutions can be implemented for other deployments.



Chapter 17: Model Deployment

Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the process of deploying trained PyTorch models to production environments.

  • Export PyTorch models using TorchScript and ONNX for interoperability.

  • Serve deep learning models through Flask and FastAPI web frameworks.

  • Integrate PyTorch models into mobile and edge devices for real-time inference.

  • Understand best practices for efficient, scalable, and reliable model deployment.


17.1 Introduction to Model Deployment

Training a deep learning model is only half the journey. The ultimate goal is deployment — making the model available for real-world inference where it serves predictions for users or systems.

Model deployment involves:

  1. Exporting the trained model to a portable and optimized format.

  2. Serving the model through an API, web service, or mobile app.

  3. Monitoring and updating the model as data and requirements evolve.

PyTorch provides flexible tools for model deployment through TorchScript, ONNX, and integrations with web frameworks and mobile devices.


17.2 Exporting Models with TorchScript

What is TorchScript?

TorchScript is a way to create serializable and optimizable models from PyTorch code. It allows models written in Python to be saved, loaded, and executed independently of Python, making deployment faster and more portable.

There are two main ways to create TorchScript models:

  1. Tracing – records the operations from example inputs.

  2. Scripting – directly converts Python code with control flow into TorchScript.


A. Tracing a Model

import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        return torch.relu(self.linear(x))

# Instantiate and trace the model
model = SimpleModel()
example_input = torch.rand(1, 10)
traced_model = torch.jit.trace(model, example_input)

# Save the TorchScript model
traced_model.save("traced_model.pt")

# Load it for inference
loaded_model = torch.jit.load("traced_model.pt")
print(loaded_model(torch.rand(1, 10)))

When to use tracing:

  • When your model has static control flow (no if or loops based on data).
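
To see why this matters, here is a small sketch of what goes wrong when a function with data-dependent branching is traced: only the branch taken by the example input is recorded (PyTorch also emits a TracerWarning in this situation).

import torch

def branching(x):
    if x.sum() > 0:      # data-dependent control flow
        return x + 1
    return x - 1

# The example input takes the "+ 1" branch, so that branch gets baked in
traced = torch.jit.trace(branching, torch.ones(3))

print(traced(torch.ones(3)))    # tensor([2., 2., 2.]) -- matches eager mode
print(traced(-torch.ones(3)))   # tensor([0., 0., 0.]) -- eager mode would return tensor([-2., -2., -2.])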


B. Scripting a Model

For dynamic models, use scripting:

@torch.jit.script
def scripted_forward(x):
    # Data-dependent branch: scripting preserves it, unlike tracing
    return x * 2 if x.sum() > 0 else x / 2

print(scripted_forward(torch.tensor([1.0, -1.0])))

Or for full modules:

scripted_model = torch.jit.script(model)
scripted_model.save("scripted_model.pt")

Advantages of TorchScript:

  • No need for Python during inference.

  • Faster execution with optimizations.

  • Portable across environments (C++, mobile, etc.).


17.3 Exporting Models with ONNX

What is ONNX?

ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models. It allows models trained in one framework (e.g., PyTorch) to be run with other frameworks and runtimes (e.g., ONNX Runtime, TensorRT, or OpenVINO).

ONNX is ideal for cross-platform deployment, especially on hardware accelerators and embedded systems.


Exporting a PyTorch Model to ONNX

import torch
import torch.onnx

# Example model and input (SimpleModel from Section 17.2)
model = SimpleModel()
model.eval()  # export in inference mode
dummy_input = torch.randn(1, 10)

# Export to ONNX format
torch.onnx.export(
    model, 
    dummy_input, 
    "simple_model.onnx",
    input_names=["input"], 
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=13
)
print("Model exported to simple_model.onnx")

Validating the ONNX Model

You can verify the model using onnx and onnxruntime:

import onnx
import onnxruntime as ort
import numpy as np

# Load and check the ONNX model
onnx_model = onnx.load("simple_model.onnx")
onnx.checker.check_model(onnx_model)

# Run inference
ort_session = ort.InferenceSession("simple_model.onnx")
inputs = {"input": np.random.randn(1, 10).astype(np.float32)}
outputs = ort_session.run(None, inputs)

print(outputs)

Advantages of ONNX:

  • Cross-framework interoperability.

  • Hardware acceleration support (e.g., NVIDIA TensorRT, Intel OpenVINO).

  • Optimized for deployment on cloud and edge devices.
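
To make use of hardware acceleration, onnxruntime lets you request specific execution providers when creating a session. A minimal sketch (the GPU provider requires a matching onnxruntime-gpu build and drivers; unavailable providers are simply skipped):

import onnxruntime as ort

# Prefer GPU execution and fall back to CPU
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("simple_model.onnx", providers=providers)
print(session.get_providers())  # shows which providers were actually selected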


17.4 Serving Models with Flask and FastAPI

After exporting your model, you need a serving mechanism so users or applications can send data and receive predictions via HTTP APIs.


A. Serving with Flask

Flask is a lightweight web framework ideal for simple deployments.

from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load the model
model = torch.jit.load("traced_model.pt")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    inputs = torch.tensor(data["inputs"], dtype=torch.float32)
    with torch.no_grad():
        outputs = model(inputs).tolist()
    return jsonify({"outputs": outputs})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Test using curl or Postman:

curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"inputs": [[0.5, 0.2, 0.1, 0.9, 0.4, 0.3, 0.8, 0.7, 0.6, 0.0]]}'

B. Serving with FastAPI

FastAPI is a modern, high-performance web framework ideal for production environments.

from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = torch.jit.load("traced_model.pt")
model.eval()

class InputData(BaseModel):
    inputs: list[list[float]]  # a batch of feature rows

@app.post("/predict")
def predict(data: InputData):
    inputs = torch.tensor(data.inputs, dtype=torch.float32)
    with torch.no_grad():
        outputs = model(inputs).tolist()
    return {"outputs": outputs}

Run with (assuming the code above is saved as app.py):

uvicorn app:app --reload

Access API docs automatically at:
👉 http://127.0.0.1:8000/docs
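
The endpoint can also be exercised from Python. A minimal client sketch using the requests library, assuming the service is running locally on the default port:

import requests

payload = {"inputs": [[0.5, 0.2, 0.1, 0.9, 0.4, 0.3, 0.8, 0.7, 0.6, 0.0]]}
response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.json())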


17.5 Integration with Mobile and Edge Devices

PyTorch supports mobile and embedded inference through PyTorch Mobile and TorchScript models.

A. PyTorch Mobile Workflow

  1. Train and export model with TorchScript.

  2. Use the PyTorch Mobile library in Android or iOS apps.

  3. Load the model for inference on-device.
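
Step 1 can be sketched in Python as follows, reusing SimpleModel from Section 17.2. The optimize_for_mobile pass and the lite-interpreter .ptl format are optional but recommended for recent PyTorch Mobile releases; the plain .pt archive used in the Android example below also works.

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = SimpleModel()
model.eval()
scripted = torch.jit.script(model)

# Apply mobile-friendly graph optimizations and save for the lite interpreter
mobile_model = optimize_for_mobile(scripted)
mobile_model._save_for_lite_interpreter("mobile_model.ptl")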


B. Example: Android Integration

Steps:

  1. Convert model to TorchScript

    traced_model.save("mobile_model.pt")
    
  2. Include the model in your Android assets folder

  3. Load model in Kotlin/Java:

import org.pytorch.Module;
import org.pytorch.Tensor;
import org.pytorch.IValue;

// assetFilePath(...) is a small helper (as in the PyTorch Android demo apps)
// that copies the bundled asset to a file path the runtime can read.
Module module = Module.load(assetFilePath(this, "mobile_model.pt"));
// Shape {1, 10} must match the model's expected input
Tensor inputTensor = Tensor.fromBlob(new float[]{0.1f, 0.2f, ...}, new long[]{1, 10});
Tensor outputTensor = module.forward(IValue.from(inputTensor)).toTensor();
float[] outputs = outputTensor.getDataAsFloatArray();

Benefits:

  • Runs offline on mobile.

  • Low latency and privacy-friendly.

  • Supports Android, iOS, and edge platforms.


C. Edge Deployment Options

For edge devices like Raspberry Pi, Jetson Nano, or microcontrollers:

  • Use TorchScript or ONNX Runtime.

  • Optimize with quantization and pruning for low-power inference.

  • Integrate with IoT frameworks like AWS IoT Greengrass or Azure IoT Edge.
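
As an example of such an optimization, post-training dynamic quantization takes only a few lines. A minimal sketch reusing SimpleModel from Section 17.2 (actual speed and size gains depend on the model and hardware):

import torch
from torch import nn

model = SimpleModel()
model.eval()

# Store Linear weights as int8; activations are quantized on the fly at inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The quantized model can still be scripted and saved for deployment
scripted_q = torch.jit.script(quantized)
scripted_q.save("quantized_model.pt")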


17.6 Best Practices for Deployment

  1. Use TorchScript/ONNX for portability.

  2. Containerize your service using Docker for scalability.

  3. Monitor latency and throughput in production.

  4. Cache models to avoid reloading on each request.

  5. Implement versioning for model updates.

  6. Secure APIs with authentication and rate-limiting.
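
For example, practice 4 (caching) can be as simple as loading the model once per process rather than inside the request handler. A minimal sketch:

from functools import lru_cache
import torch

@lru_cache(maxsize=1)
def get_model():
    # Loaded on the first call only; later requests reuse the same instance
    model = torch.jit.load("traced_model.pt")
    model.eval()
    return model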


17.7 Summary

In this chapter, you learned how to:

  • Export PyTorch models with TorchScript and ONNX.

  • Serve models using Flask and FastAPI APIs.

  • Integrate PyTorch models into mobile and edge devices.

  • Follow deployment best practices for scalable and efficient inference.

Model deployment bridges the gap between model development and real-world application — turning AI research into tangible value.


17.8 Exercises

  1. TorchScript Practice:
    Convert your existing PyTorch classification model to TorchScript and load it for inference.

  2. ONNX Conversion:
    Export the same model to ONNX and verify it using onnxruntime.

  3. API Deployment:
    Create a FastAPI endpoint that accepts image input and returns class predictions.

  4. Mobile Integration:
    Try running a TorchScript model on an Android emulator using PyTorch Mobile.

  5. Edge Optimization:
    Experiment with quantization (torch.quantization) and compare inference speed before and after.
