Core Concepts
1. AI Model Deployment
What is AI model deployment?
AI model deployment is the process of taking a trained model (e.g., a .pt, .safetensors, .bin, or custom asset) and making it usable in the real world.
Typical deployment includes:
packaging model files
provisioning compute resources
handling Python environments and dependencies
exposing an API endpoint
managing versions
scaling the model for multiple users
handling retries, logging, timeouts, and errors
securing access
storing assets
monitoring usage
On most platforms, deployment is the hardest and most DevOps-heavy part of running AI.
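For contrast, here is a minimal sketch of what manual deployment often looks like: a hand-rolled serving endpoint that you would still have to containerize, version, scale, and monitor yourself. The framework (FastAPI), file names, and handler logic below are illustrative assumptions, not part of Norman.

import torch
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# Load the trained model object at startup (path and format are assumptions).
model = torch.load("model.pt", map_location="cpu")
model.eval()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    payload = await file.read()
    # Decode the payload, run the model, serialize the output...
    # Retries, timeouts, logging, auth, and scaling are all still on you.
    return {"status": "ok"}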
How deployment works in Norman
Norman automates the entire deployment lifecycle.
When you deploy a model through the SDK, as sketched below:
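A minimal sketch, assuming a hypothetical norman.deploy method; the actual method name and request fields in the SDK may differ.

# Hypothetical call: the method name and fields are assumptions,
# not the confirmed Norman SDK signature.
deployment = await norman.deploy({
    "model_name": "image_reverser_model",
    "assets": ["/path/to/model.safetensors"],
})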
Norman handles:
Asset ingestion → stores your model files securely
Configuration parsing → inputs, outputs, encodings, metadata
Automatic versioning
Dependency environment setup (per model)
Containerization & preparation
Routing into the Compute service
API availability → instantly invokable
Management & visibility in your Models Library
You never configure servers, build images, or write YAML. You simply upload the model; Norman deploys it.
2. AI Model Inference
What is AI model inference?
AI inference is the process of running a deployed model on input data to produce an output.
Common examples:
Sending an image → getting a classification
Sending text → receiving a completion
Uploading audio → receiving a transcription
Sending tensors → receiving tensors
Inference must be fast, reliable, reproducible, scalable, and traceable.
Most real-world systems require queues, workers, file storage, and output routing.
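To make that concrete, here is a minimal sketch of the kind of queue-and-worker plumbing such systems typically hand-build; the queue, worker, and model/storage details are generic assumptions, not Norman internals.

import queue
import threading

# Illustrative only: hand-built inference plumbing that Norman replaces.
task_queue: queue.Queue = queue.Queue()
results: dict = {}

def run_model(data: str) -> str:
    # Placeholder for real model execution.
    return f"processed:{data}"

def worker() -> None:
    while True:
        task = task_queue.get()                        # wait for the next request
        results[task["id"]] = run_model(task["data"])  # execute and route the output
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

task_queue.put({"id": "task-1", "data": "/path/to/file.png"})
task_queue.join()  # block until the worker has drained the queue
print(results["task-1"])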
How inference works in Norman
Norman turns inference into a single API call:
response = await norman.invoke({
    "model_name": "image_reverser_model",
    "inputs": [
        {"display_title": "Input", "data": "/path/to/file.png"}
    ]
})
Behind the scenes, Norman handles:
File upload/push (images, audio, text, tensors, etc.)
Task creation and scheduling
Execution inside the model’s environment
Deployment on CPU or GPU depending on model requirements
Autoscaling based on load
Streaming logs and intermediate events
Output storage
Binary return via the SDK
Every invocation is fully tracked, observable, and saved for later inspection.
No infrastructure, no batching logic, no queue management — Norman abstracts all of it.
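As a closing sketch, the returned output can then be consumed directly in code; the response shape below is an assumption, not the documented SDK surface.

# Assumption: the real SDK response may expose outputs differently.
output_bytes = response["outputs"][0]["data"]
with open("reversed_output.png", "wb") as f:
    f.write(output_bytes)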