Core Concepts

1. AI Model Deployment

What is AI model deployment?

AI model deployment is the process of taking a trained model (e.g., a .pt, .safetensors, .bin, or custom asset) and making it usable in the real world.

Typical deployment includes:

  • packaging model files

  • provisioning compute resources

  • handling Python environments and dependencies

  • exposing an API endpoint

  • managing versions

  • scaling the model for multiple users

  • handling retries, logging, timeouts, and errors

  • securing access

  • storing assets

  • monitoring usage

On most platforms, deployment is the hardest and most DevOps-heavy part of running AI.
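For contrast, even a bare-bones manual deployment covers only a fraction of that list. A minimal sketch, assuming a TorchScript model saved as model.pt (FastAPI and PyTorch are illustrative choices here, not a prescribed stack):

import io

import torch
from fastapi import FastAPI, UploadFile

app = FastAPI()

# Packaging the asset, pinning dependencies, and provisioning the host
# are all manual steps before this line even runs.
model = torch.jit.load("model.pt")
model.eval()

@app.post("/predict")
async def predict(file: UploadFile):
    data = await file.read()
    # Assumes the client sends a serialized tensor; a real service still
    # needs auth, retries, timeouts, versioning, logging, and scaling.
    tensor = torch.load(io.BytesIO(data))
    with torch.no_grad():
        output = model(tensor)
    return {"output": output.tolist()}

Even this toy endpoint leaves versioning, scaling, monitoring, and access control entirely unsolved.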

How deployment works in Norman

Norman automates the entire deployment lifecycle.

When you call:

model = await norman.upload_model(model_config)

Norman handles:

  • Asset ingestion → stores your model files securely

  • Configuration parsing → inputs, outputs, encodings, metadata

  • Automatic versioning

  • Dependency environment setup (per model)

  • Containerization & preparation

  • Routing into the Compute service

  • API availability → instantly invokable

  • Management & visibility in your Models Library

You never configure servers, build images, or write YAML. You simply upload the model, and Norman deploys it.
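The model_config passed above is simply a description of the model and its interface. As a hedged sketch, the field names below are illustrative assumptions, not the documented schema:

# Illustrative only: consult the Norman reference for the actual schema.
model_config = {
    "model_name": "image_reverser_model",
    "assets": ["weights/model.safetensors"],         # files to ingest
    "requirements": ["torch==2.3.0", "pillow"],      # per-model dependency env
    "inputs": [{"display_title": "Input", "encoding": "image/png"}],
    "outputs": [{"display_title": "Output", "encoding": "image/png"}],
}

model = await norman.upload_model(model_config)

Because versioning is automatic, re-running the call with updated assets yields a new version rather than overwriting the deployed one.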

2. AI Model Inference

What is AI model inference?

AI inference is the process of running a deployed model on input data to produce an output.

Common examples:
  • Sending an image → getting a classification

  • Sending text → receiving a completion

  • Uploading audio → receiving a transcription

  • Sending tensors → receiving tensors

Inference must be fast, reliable, reproducible, scalable, and traceable.

Most real-world systems require queues, workers, file storage, and output routing.
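Without a platform, that usually means wiring those pieces yourself. A deliberately simplified sketch of the moving parts, with every name illustrative:

import queue
import threading
import uuid

tasks: queue.Queue = queue.Queue()     # stand-in for a real message broker
results: dict = {}                     # stand-in for durable output storage

def run_model(payload: bytes) -> bytes:
    # Placeholder for actual model execution.
    return payload[::-1]

def worker() -> None:
    while True:
        task = tasks.get()
        # Real systems also add retries, timeouts, logging, and GPU routing here.
        results[task["id"]] = run_model(task["payload"])
        tasks.task_done()

def submit(payload: bytes) -> str:
    task_id = str(uuid.uuid4())
    tasks.put({"id": task_id, "payload": payload})
    return task_id

threading.Thread(target=worker, daemon=True).start()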

How inference works in Norman

Norman turns inference into a single API call:

response = await norman.invoke({
    "model_name": "image_reverser_model",
    "inputs": [
        {"display_title": "Input", "data": "/path/to/file.png"}
    ]
})

Behind the scenes, Norman handles:
  • File upload/push (images, audio, text, tensors, etc.)

  • Task creation and scheduling

  • Execution inside the model’s environment

  • Deployment on CPU or GPU depending on model requirements

  • Automatic autoscaling based on load

  • Streaming logs and intermediate events

  • Output storage

  • Binary return via the SDK

Every invocation is fully tracked, observable, and saved for later inspection.

No infrastructure, no batching logic, no queue management: Norman abstracts all of it.
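A complete call site might look like the following. The import path, client constructor, and the shape of the returned object are assumptions for illustration, since only invoke itself is documented above:

import asyncio

from norman import Norman   # assumed import path and client class

async def main() -> None:
    norman = Norman()        # assumed constructor; real auth/config may differ
    response = await norman.invoke({
        "model_name": "image_reverser_model",
        "inputs": [
            {"display_title": "Input", "data": "/path/to/file.png"}
        ]
    })
    # "Binary return via the SDK": here we assume the response is raw bytes.
    with open("output.png", "wb") as f:
        f.write(response)

asyncio.run(main())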
