All Models

Text
TinyLlama-1.1B (1.1B params, Apache-2.0)
Pocket-size Llama-2 drop-in you can run almost anywhere.
2K-token context. Keeps Llama-2’s tokenizer and architecture with a 2,048-token RoPE window, so existing prompts and LoRA adapters just work.
Trained like a giant. Pre-trained on ~3T tokens of web and code, then chat-aligned with UltraChat SFT followed by DPO; holds its own against models 3–7× its size.
Featherweight deploys. ~2 GB VRAM in BF16; a 4-bit quant is only ~637 MB, small enough for laptops, phones, or edge gateways.
Plug-and-play. Works out of the box with transformers >= 4.34, vLLM, llama.cpp, Ollama, and GGUF tooling: call from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") and go (see the sketch after this list).
No legal knots. Apache-2.0 means full commercial use—no “Llama license” clauses.
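
A minimal load-and-generate sketch with transformers. The model id and the transformers >= 4.34 floor come from the list above; the BF16 dtype, device_map="auto" (which needs accelerate installed), and the sampling settings are illustrative assumptions, not the only way to run it.

```python
# Minimal TinyLlama chat inference with transformers (>= 4.34 for chat templates).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 GB of weights; a 4-bit quant path shrinks this further
    device_map="auto",           # requires the accelerate package
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize RoPE in one sentence."},
]
# apply_chat_template renders the chat prompt format stored with the tokenizer
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.95)
# Decode only the newly generated tokens, skipping the rendered prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```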
Why pick it for Norman AI?
TinyLlama lets us spin up an on-device or “budget” inference tier without touching our Llama-based toolchain: same prompt format and tokenizer at a fraction of the RAM. Ideal for edge apps, speculative-decoding draft models (sketch below), and per-tenant fine-tunes where every gigabyte counts.
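
One way the speculative-decoding angle can look in practice: transformers' assisted generation accepts a small draft model via the assistant_model argument, and TinyLlama qualifies because it shares Llama-2's tokenizer. The 7B target id and the prompt below are placeholders, not a fixed pairing.

```python
# Sketch: TinyLlama as the draft model for assisted (speculative) decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-chat-hf"      # example target (gated on the Hub)
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # shares the Llama-2 tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The fastest way to serve small models is", return_tensors="pt").to(target.device)
# The draft proposes several tokens per step; the target verifies them in one forward pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```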