All Models

Text
Qwen 2.5 1.5B (1.54B params, Apache-2.0)
Pocket-sized Qwen that still handles marathon prompts.
Spec sheet. 28-layer decoder with Grouped-Query Attention (12 query / 2 KV heads), RoPE, and SwiGLU. Native 32k-token window; stretch toward 128k with YaRN or Dual Chunk Attention.
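A minimal sketch of the long-context option, assuming the YaRN rope_scaling values the Qwen docs suggest for the larger Qwen2.5 checkpoints (factor 4.0 over the 32k base window); verify the exact numbers for the 1.5B variant against its model card:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed YaRN settings (factor 4.0 over the native 32,768-token window);
# check the Qwen2.5 docs before relying on these exact values.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B")
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B",
    config=config,
    torch_dtype="bfloat16",
    device_map="auto",
)
```

The snippet only shows how the scaling hooks into the Hugging Face config; for serving genuinely long prompts the Qwen team generally points users toward vLLM.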
Small model, big scores. Beats Phi-2 and Gemma-2B on MMLU (56.5) and other reasoning/coding benchmarks in the Qwen2 paper's tables.
Runs on almost anything. BF16 weights need ≈4.6 GB of VRAM; an int4 quant dives below 1.2 GB, perfect for laptops and edge GPUs.
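A rough sketch of the low-VRAM path using bitsandbytes 4-bit loading in transformers (NF4 here is an illustrative choice, not something the card above prescribes):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the 1.5B weights in roughly a gigabyte of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B",
    quantization_config=quant_config,
    device_map="auto",
)
```

llama.cpp and Ollama users would reach for a GGUF quant instead; the idea is the same.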
Multilingual + structured-output savvy. Covers 29+ languages, with better JSON/table handling and sturdier system-prompt obedience.
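A small sketch of coaxing JSON out of the instruct variant (Qwen/Qwen2.5-1.5B-Instruct) via a system prompt and the built-in chat template; the schema in the prompt is made up for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # instruct variant follows system prompts
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    # Hypothetical schema, purely for illustration.
    {"role": "system", "content": 'Answer only with JSON: {"city": string, "country": string}.'},
    {"role": "user", "content": "Where is the Eiffel Tower?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```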
Plug-and-play. Supported in transformers >= 4.51, vLLM, llama.cpp, Ollama, etc. Call from_pretrained("Qwen/Qwen2.5-1.5B") and you're off.
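The same plug-and-play path, wrapped in the transformers pipeline helper (the prompt text is arbitrary):

```python
from transformers import pipeline

# Wraps the from_pretrained call above into a ready-to-use generator.
generator = pipeline(
    "text-generation", model="Qwen/Qwen2.5-1.5B", device_map="auto"
)
print(generator("The capital of Norway is", max_new_tokens=20)[0]["generated_text"])
```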
Why pick it for Norman AI?
Qwen 2.5 1.5B gives us long-context chat, solid math/code chops, and full Apache-2.0 freedom, all inside a 5 GB GPU envelope: ideal for cost-squeezed inference tiers, per-tenant fine-tunes, or on-device assistants.