Qwen3 Hardware Requirements: Run Alibaba's Latest LLM Locally (2026)
AI helped assemble the Qwen3 size table. The MoE-VRAM explanation and the per-size hardware picks were edited by hand against the live Alibaba model cards.
Updated May 2026 · Apache 2.0 license · Dense and MoE variants
Alibaba's Qwen3 family spans eight models from 0.6B to 235B parameters, including dense and Mixture-of-Experts variants. Qwen3-8B is a strong open-weight 8B model; see Alibaba's official Qwen3-8B model card for reported benchmark numbers. This guide covers approximate VRAM requirements, thinking mode overhead, and hardware recommendations for every Qwen3 variant.
Qwen3 at a glance
- Model lineup: 0.6B, 1.7B, 4B, 8B, 14B, 32B dense, plus 30B-A3B and 235B-A22B MoE
- Entry-level pick (Qwen3-4B Q4): ~3 GB VRAM, fits any GPU or runs CPU-only
- Qwen3-8B Q4: ~5 GB VRAM. See the model card for reported benchmark scores.
- Thinking mode: Built-in chain-of-thought reasoning, uses 2 to 3x more tokens per response
- llama.cpp support: Native Qwen3 support added April 2025
- Ollama command: ollama pull qwen3:8b
Qwen3 Model Variants Overview
Qwen3 offers one of the broadest model lineups of any open-weight family, with dense models for predictable VRAM usage and MoE variants for faster inference on high-memory systems. All models support 32K context by default, and that window can be extended. Every Qwen3 model includes built-in thinking mode for enhanced reasoning.
Qwen3-4B
Best entry-level · 3 GB at Q4
Runs on almost any hardware including CPU-only setups. A practical pick for coding and chat on budget machines (~3 GB at Q4). Q8 fits comfortably on an 8 GB GPU.
Qwen3-8B
Best overall 8B · 5 GB at Q4
Strong reasoning and coding for its parameter count. Fits any 8 GB GPU at Q4. Vendor-reported benchmarks are published on the official Qwen3-8B Hugging Face model card.
Qwen3-14B
Sweet spot for 12 GB · 8 GB at Q4
A capable mid-size dense model with coding, math, and multilingual support. The RTX 4070 12 GB runs this model at Q4 with 4 GB to spare for KV cache. Vendor-reported benchmarks are on the official Qwen3-14B Hugging Face model card.
Qwen3-32B
Largest dense Qwen3 · 20 GB at Q4
Largest dense Qwen3 variant. Requires an RTX 4090 24 GB at Q4 (tight on VRAM). Dual-GPU setups or 48 GB cards are recommended for Q8. See the Qwen3-32B Hugging Face model card for vendor-reported benchmarks.
Qwen3-30B-A3B
MoE: fast inference · 20 GB at Q4
MoE model: 30B total parameters loaded into memory, but only 3B active per token. Loads like a 30B dense model but generates tokens at near 3B speed once loaded.
Qwen3-235B-A22B
Cloud or multi-GPU only · 130 GB+ at Q4
Alibaba's frontier MoE model. At ~130 GB+ for the Q4 weights alone, keeping it fully in VRAM takes six or more 24 GB GPUs (144 GB total). Not practical for most home setups. Best accessed via API or cloud.
For most users building a local AI setup, Qwen3-8B is a sensible starting point: ~5 GB at Q4 fits any 8 GB GPU, and thinking mode is available without loading a separate model. Vendor-reported benchmarks for each variant live on the corresponding Hugging Face model card.
VRAM Requirements by Model and Quantization
The tables below show VRAM required for model weights only. Add 1 to 3 GB for KV cache at standard context lengths. Thinking mode responses are longer and consume proportionally more KV cache.
Dense Models
| Model | Params | Q4 VRAM | Q8 VRAM | Minimum GPU | Best GPU |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | ~0.5 GB | ~0.8 GB | Any GPU 4 GB+ | Any modern GPU or CPU |
| Qwen3-1.7B | 1.7B | ~1.2 GB | ~2 GB | Any GPU 4 GB+ | Any modern GPU |
| Qwen3-4B | 4B | ~3 GB | ~5 GB | Any GPU 4 GB+ or CPU | RTX 4060 8 GB |
| Qwen3-8B | 8B | ~5 GB | ~9 GB | RTX 4060 8 GB (Q4) | RTX 4070 12 GB (Q8) |
| Qwen3-14B | 14B | ~8 GB | ~15 GB | RTX 4070 12 GB (Q4) | RTX 4060 Ti 16 GB (Q8) |
| Qwen3-32B | 32B | ~20 GB | ~35 GB | RTX 4090 24 GB (Q4) | Dual RTX 4090 or 48 GB GPU |
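The Q4 and Q8 columns can be roughly reproduced from parameter count and bits per weight. The sketch below assumes about 4.85 bits per weight for Q4_K_M and 8.5 for Q8_0. Those averages are my assumptions for illustration, not values from the model cards, and real GGUF files differ slightly because parameter counts are rounded and some tensors are kept at higher precision.

```python
# Rough weights-only size estimate: parameters x bits per weight / 8.
# The bits-per-weight averages are assumptions, not model-card values.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for name, params in [("Qwen3-8B", 8), ("Qwen3-14B", 14), ("Qwen3-32B", 32)]:
    q4 = weights_gb(params, "Q4_K_M")
    q8 = weights_gb(params, "Q8_0")
    print(f"{name}: Q4 ~{q4:.1f} GB, Q8 ~{q8:.1f} GB")
```

The results land close to the table's rounded figures, which is all a back-of-envelope estimate like this is meant to do.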
MoE Models
| Model | Total Params | Active Params | Q4 VRAM | Q8 VRAM | Minimum Hardware | Best Hardware |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | ~20 GB | ~32 GB | RTX 4090 24 GB (Q4, tight) | RTX 5090 32 GB or Mac Studio 64 GB |
| Qwen3-235B-A22B | 235B | 22B | ~130 GB+ | ~240 GB+ | 6x RTX 4090 (144 GB total, multi-GPU) | Cloud or multi-GPU server |
Why MoE models need more VRAM than their active params suggest
Qwen3-30B-A3B activates only 3B parameters per token, but all 30B expert weights must be loaded into VRAM. The router must be able to dispatch any token to any expert without loading or unloading weights mid-inference. VRAM is determined by total parameters, not active ones.
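To make the total-versus-active distinction concrete, here is a minimal sketch that sizes both quantities for Qwen3-30B-A3B. The 4.85 bits-per-weight figure is an assumed average for a Q4-class quantization, so treat the outputs as ballpark numbers rather than exact file sizes.

```python
# Memory follows total parameters; per-token work follows active parameters.
# BITS_PER_WEIGHT_Q4 is an assumed average for a Q4-class quantization.
BITS_PER_WEIGHT_Q4 = 4.85


def quantized_gb(params_billion: float, bits_per_weight: float = BITS_PER_WEIGHT_Q4) -> float:
    """Approximate size in GB of params_billion parameters after quantization."""
    return params_billion * bits_per_weight / 8


total_b, active_b = 30, 3  # Qwen3-30B-A3B: total vs. active parameters

resident_gb = quantized_gb(total_b)   # must stay loaded for the router to pick any expert
touched_gb = quantized_gb(active_b)   # weights actually read to produce one token

print(f"Resident in memory: ~{resident_gb:.1f} GB")
print(f"Read per token:     ~{touched_gb:.1f} GB ({active_b / total_b:.0%} of the weights)")
```

The first number drives which GPU you need; the second is why generation feels closer to a small model once everything is loaded.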
For CPU-only inference, multiply VRAM figures by roughly 2 to estimate system RAM needed, plus OS overhead. Qwen3-4B and smaller models run well on CPU with 16 GB RAM.
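The CPU-only rule of thumb above is simple enough to write down directly. In this sketch the 4 GB operating-system overhead is a placeholder I picked for illustration; check your own idle memory usage before relying on it.

```python
# System RAM rule of thumb for CPU-only inference: VRAM figure x 2 + OS overhead.
# OS_OVERHEAD_GB is an assumed placeholder; measure your own idle usage.
OS_OVERHEAD_GB = 4.0


def cpu_ram_needed_gb(vram_gb: float, os_overhead_gb: float = OS_OVERHEAD_GB) -> float:
    """Estimate system RAM for CPU-only inference from a VRAM figure."""
    return vram_gb * 2 + os_overhead_gb


def fits_in_ram(vram_gb: float, installed_ram_gb: float) -> bool:
    """True when the estimate fits within the installed system RAM."""
    return cpu_ram_needed_gb(vram_gb) <= installed_ram_gb


print(cpu_ram_needed_gb(3))   # Qwen3-4B Q4 (~3 GB) -> 10.0
print(fits_in_ram(3, 16))     # True
print(fits_in_ram(20, 32))    # Qwen3-32B Q4 (~20 GB) -> False
```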
Thinking Mode: VRAM and Performance Impact
Qwen3's built-in thinking mode adds chain-of-thought reasoning before producing a final answer. This is controlled by the model itself using special tokens, not by loading a separate model. The VRAM for weights does not change, but thinking mode generates significantly longer sequences, which increases KV cache usage.
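The link between sequence length and KV cache can be sketched with the usual sizing formula for grouped-query attention. The layer count, KV head count, and head dimension below are assumed values for an 8B-class model, used only for illustration; read the real ones from the model's config.json before trusting any absolute number.

```python
# KV cache size: 2 (keys and values) x layers x KV heads x head dim
# x bytes per element x tokens held in the cache.
# The architecture values below are assumptions for illustration;
# read the real ones from the model's config.json.
from dataclasses import dataclass


@dataclass
class AttentionShape:
    layers: int
    kv_heads: int
    head_dim: int


def kv_cache_gib(shape: AttentionShape, tokens: int, bytes_per_element: float = 2.0) -> float:
    """KV cache size in GiB for `tokens` cached positions (fp16 by default)."""
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_element
    return per_token * tokens / 2**30


assumed_8b_shape = AttentionShape(layers=36, kv_heads=8, head_dim=128)

short_answer = kv_cache_gib(assumed_8b_shape, tokens=2_000)
long_reasoning = kv_cache_gib(assumed_8b_shape, tokens=6_000)  # 3x the tokens
print(f"{short_answer:.2f} GiB vs {long_reasoning:.2f} GiB")
```

The cache grows linearly with the number of tokens kept, so a response that is three times longer occupies three times the cache, whatever the absolute figures turn out to be.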
How thinking mode affects VRAM
- Model weights: No change. Same VRAM as non-thinking inference.
- KV cache: 2 to 3x larger due to longer reasoning sequences before the final answer.
- Practical tip: Reduce context length to 8K or 16K on 8 GB GPUs when using thinking mode.
| Model | Mode | KV Cache (32K ctx) | Total VRAM | Practical GPU |
|---|---|---|---|---|
| Qwen3-8B Q4 | Standard | ~2 GB | ~7 GB | Fits 8 GB GPU |
| Qwen3-8B Q4 | Thinking | ~4-6 GB | ~9-11 GB | Needs 12 GB GPU |
| Qwen3-14B Q4 | Standard | ~2 GB | ~10 GB | Fits 12 GB GPU |
| Qwen3-14B Q4 | Thinking | ~4-6 GB | ~12-14 GB | Needs 16 GB GPU |
| Qwen3-32B Q4 | Standard | ~2 GB | ~22 GB | Fits 24 GB GPU |
| Qwen3-32B Q4 | Thinking | ~4-6 GB | ~24-26 GB | Tight on 24 GB, use 16K ctx |
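The "Total VRAM" and "Practical GPU" columns are just weights plus KV cache compared against the card's capacity. A minimal sketch of that check, using the table's own figures and ignoring runtime overhead such as compute buffers (an assumption that makes the result slightly optimistic):

```python
# Budget check using the table's figures: weights + KV cache must fit in VRAM.
# Runtime overhead (compute buffers, display output) is ignored here.
def total_vram_gb(weights_gb: float, kv_cache_gb: float) -> float:
    return weights_gb + kv_cache_gb


def fits(weights_gb: float, kv_cache_gb: float, gpu_vram_gb: float) -> bool:
    """True when weights plus KV cache fit within the GPU's VRAM."""
    return total_vram_gb(weights_gb, kv_cache_gb) <= gpu_vram_gb


# Qwen3-8B Q4 (~5 GB weights) with the table's KV cache values.
print(fits(5, 2, 8))    # standard mode on an 8 GB GPU -> True
print(fits(5, 6, 8))    # thinking mode, upper KV estimate -> False
print(fits(5, 6, 12))   # same workload on a 12 GB GPU -> True
```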
To disable thinking mode in Ollama, append /no_think to your prompt or put it in the system prompt. In llama.cpp, the same /think and /no_think switches can be placed in the prompt, because they are handled by the model and its chat template rather than by the runtime.
Recommended Hardware by Use Case
8 GB VRAM: Qwen3-4B Q8 or Qwen3-8B Q4
- RTX 4060 8 GB, RTX 3070 8 GB, RX 7600 8 GB
- 16 GB system RAM recommended
- Qwen3-4B Q8 (~5 GB) or Qwen3-8B Q4 (~5 GB)
- Thinking mode: use 8K-16K context
Expected speed: ~20-30 tok/s
12 GB VRAM: Qwen3-8B Q8 or Qwen3-14B Q4
- RTX 4070 12 GB, RTX 3060 12 GB
- 32 GB system RAM recommended
- Qwen3-8B Q8 (~9 GB) or Qwen3-14B Q4 (~8 GB)
- Thinking mode comfortable at 16K-32K context
Expected speed: ~25-40 tok/s
24 GB VRAM: Qwen3-32B Q4 or Qwen3-14B Q8
- RTX 4090 24 GB, RTX 3090 24 GB
- 32-64 GB system RAM
- Qwen3-32B Q4 (~20 GB) tight, or Qwen3-14B Q8 (~15 GB) with room
- Thinking mode: 16K context on Qwen3-32B
Expected speed: ~20-50 tok/s
48 GB VRAM: Qwen3-32B Q8, very comfortable
- Dual RTX 4090 (48 GB total), RTX 6000 Ada 48 GB
- 64 GB system RAM
- Qwen3-32B Q8 (~35 GB), large KV cache for thinking mode
- Full thinking mode at 32K-64K context
Expected speed: ~30-60 tok/s
RTX 4070 12 GB: Best GPU for Qwen3-14B
The RTX 4070 12 GB is a natural match for Qwen3-14B at Q4 quantization. The 8 GB model footprint leaves about 4 GB for KV cache, which covers 32K context in standard mode. In thinking mode the KV cache estimate rises to 4 to 6 GB, so drop to 16K context or quantize the KV cache to stay within 12 GB. Estimated generation speed is around 17 tokens per second (see the Inference Speed by Hardware table).
RTX 4090 24 GB: Best GPU for Qwen3-32B
The RTX 4090 with 24 GB VRAM is the primary single-GPU option for Qwen3-32B Q4. VRAM is tight at around 20 GB for weights, leaving only 4 GB for KV cache, so limit context to 16K when using thinking mode. For a more comfortable experience, pairing two RTX 4090 cards gives 48 GB total, which accommodates Qwen3-32B Q8 and generous context.
Inference Speed by Hardware
Token generation speed is primarily limited by GPU memory bandwidth. The table below shows estimated Q4_K_M speeds at short context. Thinking mode does not change tokens per second, but it produces 2 to 3x more tokens per query, so total response time grows by roughly the same factor.
| Hardware | Bandwidth | Qwen3-8B Q4 | Qwen3-14B Q4 | Qwen3-32B Q4 |
|---|---|---|---|---|
| RTX 5090 32 GB | 1,792 GB/s | ~120 t/s | ~62 t/s | ~31 t/s |
| RTX 4090 24 GB | 1,008 GB/s | ~68 t/s | ~35 t/s | ~17 t/s |
| RTX 4070 Ti Super 16 GB | 672 GB/s | ~45 t/s | ~23 t/s | - |
| RTX 4070 12 GB | 504 GB/s | ~34 t/s | ~17 t/s | - |
| RTX 4060 Ti 16 GB | 288 GB/s | ~19 t/s | ~10 t/s | - |
| RTX 4060 8 GB | 272 GB/s | ~18 t/s | - | - |
| Mac Studio M4 Max 64 GB | ~600 GB/s | ~40 t/s | ~21 t/s | ~10 t/s |
| Mac mini M4 Pro 48 GB | ~273 GB/s | ~18 t/s | ~9 t/s | ~5 t/s |
Source: speed estimates derived from each GPU's memory bandwidth divided by model size in memory, then cross-referenced with XiongjieDai community llama-bench runs where comparable runs exist. Real-world results vary by software version, quantization variant, and system configuration. Mac bandwidth figures are approximate.
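The bandwidth-divided-by-model-size reasoning can be sketched as follows. This is not the exact sitewide formula: the efficiency factor of one third is my assumption, chosen by eye so that the theoretical ceiling lands near the Qwen3-8B column, and it does not hold exactly for the other columns.

```python
# Bandwidth-bound estimate: each generated token reads the weights once,
# so tokens/s is capped near bandwidth / model size in memory.
# EFFICIENCY is an assumed fudge factor, not a measured constant.
EFFICIENCY = 1 / 3


def ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical upper bound on tokens per second."""
    return bandwidth_gb_s / model_gb


def estimated_tps(bandwidth_gb_s: float, model_gb: float, efficiency: float = EFFICIENCY) -> float:
    """Ceiling scaled by an assumed real-world efficiency."""
    return ceiling_tps(bandwidth_gb_s, model_gb) * efficiency


def response_seconds(tokens: int, tps: float) -> float:
    """Wall-clock generation time for a response of `tokens` tokens."""
    return tokens / tps


# RTX 4090 (1,008 GB/s) with Qwen3-8B Q4 (~5 GB of weights).
tps = estimated_tps(1008, 5)
print(f"ceiling ~{ceiling_tps(1008, 5):.0f} t/s, estimate ~{tps:.0f} t/s")
print(f"500 tokens: {response_seconds(500, tps):.1f} s")
print(f"1500 tokens (3x, thinking): {response_seconds(1500, tps):.1f} s")
```

The last two lines show why thinking mode feels slower even though tokens per second is unchanged: three times the tokens takes three times as long.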
Running Qwen3 with Ollama
Ollama is the fastest way to get Qwen3 running locally. It handles GPU detection, quantization selection, and provides a REST API. Qwen3 support is available in Ollama 0.6 and later. Ollama natively supports thinking mode via the /think and /no_think prompt commands.
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
2. Pull Qwen3-8B (recommended starting point)
ollama pull qwen3:8b
3. Run interactively
ollama run qwen3:8b
Pull Qwen3-14B or Qwen3-32B
ollama pull qwen3:14b
ollama pull qwen3:32b
Enable or disable thinking mode in chat
/think
/no_think
Ollama auto-selects Q4_K_M by default and uses GPU when available. On Mac, it uses Apple Silicon's unified memory automatically. Use ollama ps to check GPU utilization after loading.
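For scripted use, the REST API can be called with nothing but the Python standard library. The sketch below assumes Ollama is running locally on its default port (11434) with qwen3:8b already pulled; the helper names build_request and ask are mine, not part of Ollama. Thinking mode is switched off here by appending /no_think to the prompt.

```python
# Sketch of calling a local Ollama server over its REST API.
# Assumes Ollama is running on its default port (11434) with qwen3:8b pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, think: bool = True, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a non-streaming generate request; /no_think disables thinking mode."""
    if not think:
        prompt = f"{prompt} /no_think"
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt: str, think: bool = True) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, think)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, ask("Summarize what a KV cache is.", think=False) should return a short answer without the reasoning preamble. Nothing is sent until ask is called, so importing the helpers is safe on a machine without Ollama.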
Running Qwen3 with llama.cpp
llama.cpp added native Qwen3 support in April 2025. It gives you fine-grained control over quantization, layer offloading between GPU and CPU, and KV cache quantization. Use it when you need maximum control or when running on mixed CPU and GPU systems.
Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
Download Qwen3-8B GGUF from Hugging Face into the current directory
huggingface-cli download Qwen/Qwen3-8B-GGUF --include "*Q4_K_M.gguf" --local-dir .
Run inference with full GPU offloading (adjust the -m path to the downloaded filename)
./build/bin/llama-cli -m qwen3-8b-q4_k_m.gguf -ngl 99 -c 32768 -p "You are a helpful assistant."
Quantize KV cache to save VRAM (useful for thinking mode)
./build/bin/llama-cli -m qwen3-8b-q4_k_m.gguf -ngl 99 -c 32768 -ctk q8_0 -ctv q8_0
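The -ctk q8_0 and -ctv q8_0 flags quantize the KV cache keys and values, roughly halving KV cache VRAM at minimal quality cost. This is especially useful when thinking mode generates long sequences. On older llama.cpp builds, quantizing the V cache also requires flash attention to be enabled. For AMD GPUs, build with -DGGML_HIP=ON (named -DGGML_HIPBLAS=ON in older releases) instead of CUDA.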
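The "roughly halving" claim follows from the storage formats. fp16 uses 16 bits per element, while q8_0 stores blocks of 32 eight-bit values plus one 16-bit scale. A short sketch of the arithmetic (the 4 GB starting figure is only an example input):

```python
# Why q8_0 roughly halves the KV cache: fp16 stores 16 bits per element,
# while q8_0 stores blocks of 32 8-bit values plus one 16-bit scale.
FP16_BITS = 16.0
Q8_0_BITS = (32 * 8 + 16) / 32   # 8.5 bits per element


def quantized_cache_gb(fp16_cache_gb: float, bits_per_element: float = Q8_0_BITS) -> float:
    """Scale an fp16 KV cache size to a different element width."""
    return fp16_cache_gb * bits_per_element / FP16_BITS


print(Q8_0_BITS / FP16_BITS)      # 0.53125
print(quantized_cache_gb(4.0))    # 2.125
```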
Frequently Asked Questions
Can I run Qwen3 on 8 GB VRAM?
Yes. Qwen3-4B at Q4 quantization uses about 3 GB of VRAM, leaving room for KV cache on any 8 GB GPU. Qwen3-8B at Q4 needs approximately 5 GB, which also fits an 8 GB card. Either model runs well on an RTX 4060 8 GB or RTX 3070 8 GB, delivering 20 to 30 tokens per second.
What is the best Qwen3 model for a 12 GB GPU?
The RTX 4070 12 GB is an excellent match for Qwen3-14B at Q4 quantization, which uses about 8 GB of VRAM. This leaves 4 GB for KV cache, enabling comfortable 32K context. For maximum quality on 12 GB, Qwen3-8B at Q8 quantization is another strong option at around 9 GB VRAM.
What is Qwen3 thinking mode and how much extra VRAM does it need?
Qwen3 models have a built-in chain-of-thought reasoning mode called thinking mode. When enabled, the model generates internal reasoning steps before producing its final answer. This does not increase VRAM for weights, but the longer output sequences consume more KV cache. Expect 2 to 3 times more tokens per response, meaning KV cache usage grows proportionally. On an 8 GB GPU, reduce context length to 8K or 16K to stay within VRAM limits when using thinking mode.
How do Qwen3 MoE models compare to dense models for VRAM?
Qwen3-30B-A3B is a Mixture-of-Experts model with 30 billion total parameters but only 3 billion active per token. However, all 30 billion parameters must be loaded into memory. At Q4 quantization the model needs about 20 GB of VRAM to load, similar to Qwen3-32B dense. The benefit is that inference speed is closer to a 3B model once loaded, since only a small fraction of weights are used per token. Qwen3-235B-A22B needs ~130 GB+ at Q4, which means six or more 24 GB GPUs to hold the weights, and is not practical for most home setups.
How do I run Qwen3 with Ollama?
Install Ollama from ollama.com, then use: ollama pull qwen3:8b for the 8B model, ollama pull qwen3:14b for the 14B model, or ollama pull qwen3:32b for the 32B model. Run the model interactively with ollama run qwen3:8b. Ollama automatically selects Q4 quantization and uses your GPU. To enable thinking mode in Ollama, use the /think command prefix in the chat interface.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hugging Face Hub. Alibaba's official Qwen3 model cards (0.6B through 235B-A22B MoE).
- Ollama. Qwen3 GGUF quants and the thinking/non-thinking variants the runtime ships.
- Modal: How much VRAM do I need for LLM inference. VRAM math used for the dense-vs-MoE memory comparison on this page.
Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.