Qwen3 Hardware Requirements: Run Alibaba's Latest LLM Locally (2026)

AI helped assemble the Qwen3 size table. The MoE-VRAM explanation and the per-size hardware picks were edited by hand against the live Alibaba model cards.

Updated May 2026 · Apache 2.0 license · Dense and MoE variants

Alibaba's Qwen3 family spans eight models from 0.6B to 235B parameters, including dense and Mixture-of-Experts variants. Qwen3-8B is a strong open-weight model for its size; Alibaba's official Qwen3-8B model card lists the reported benchmark numbers. This guide covers approximate VRAM requirements, thinking mode overhead, and hardware recommendations for every Qwen3 variant.

Qwen3 at a glance

Qwen3 Model Variants Overview

Qwen3 offers one of the broadest model lineups of any open-weight family, with dense models for predictable VRAM usage and MoE variants for faster inference on high-memory systems. All models support 32K context by default and can be extended. Every Qwen3 model includes built-in thinking mode for enhanced reasoning.

Qwen3-4B

Best entry-level

3 GB at Q4

Runs on almost any hardware including CPU-only setups. A practical pick for coding and chat on budget machines (~3 GB at Q4). Q8 fits comfortably on an 8 GB GPU.

Qwen3-8B

Best overall 8B

5 GB at Q4

Strong reasoning and coding for its parameter count. Fits any 8 GB GPU at Q4. Vendor-reported benchmarks are published on the official Qwen3-8B Hugging Face model card.

Qwen3-14B

Sweet spot for 12 GB

8 GB at Q4

A capable mid-size dense model with coding, math, and multilingual support. The RTX 4070 12 GB runs this model at Q4 with 4 GB to spare for KV cache. Vendor-reported benchmarks are on the official Qwen3-14B Hugging Face model card.

Qwen3-32B

Largest dense Qwen3

20 GB at Q4

Largest dense Qwen3 variant. Requires an RTX 4090 24 GB at Q4 (tight on VRAM). Dual-GPU setups or 48 GB cards are recommended for Q8. See the Qwen3-32B Hugging Face model card for vendor-reported benchmarks.

Qwen3-30B-A3B

MoE: fast inference

20 GB at Q4

MoE model: 30B total parameters loaded into memory, but only 3B active per token. Loads like a 30B dense model but generates tokens at near 3B speed once loaded.

Qwen3-235B-A22B

Cloud or multi-GPU only

130 GB+ at Q4

Alibaba's frontier MoE model. At ~130 GB of Q4 weights, local inference needs six or more 24 GB GPUs (four cards total only 96 GB). Not practical for most home setups. Best accessed via API or cloud.

For most users building a local AI setup, Qwen3-8B is a sensible starting point: ~5 GB at Q4 fits any 8 GB GPU, and thinking mode is available without loading a separate model. Vendor-reported benchmarks for each variant live on the corresponding Hugging Face model card.

VRAM Requirements by Model and Quantization

The tables below show VRAM required for model weights only. Add 1 to 3 GB for KV cache at standard context lengths. Thinking mode responses are longer and consume proportionally more KV cache.
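As a rough cross-check on the weight figures in these tables, the footprint can be approximated as parameter count times bits per weight. The sketch below is an estimate under stated assumptions, not the method used to build the tables: the Q4_K_M value of 4.85 bits per weight is an assumed average, the parameter counts are the nominal sizes from the model names, and `weights_gb` is a hypothetical helper.

```python
# Approximate bits per weight for common GGUF quantizations.
# Q8_0 is 8.5 by construction (32 int8 weights plus one fp16 scale per block);
# the Q4_K_M value is an assumed average, since it mixes tensor types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight memory in GB (10^9 bytes) for a model of the given size."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8  # billions of params * bytes per param


if __name__ == "__main__":
    for name, params in [("Qwen3-8B", 8), ("Qwen3-14B", 14), ("Qwen3-32B", 32)]:
        q4 = weights_gb(params, "Q4_K_M")
        q8 = weights_gb(params, "Q8_0")
        print(f"{name}: Q4 ~{q4:.1f} GB, Q8 ~{q8:.1f} GB")
```

Actual GGUF files differ by a few percent because real parameter counts are not round numbers and some tensors are stored at higher precision.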

Dense Models

Model | Params | Q4 VRAM | Q8 VRAM | Minimum GPU | Best GPU
Qwen3-0.6B | 0.6B | ~0.5 GB | ~0.8 GB | Any GPU 4 GB+ | Any modern GPU or CPU
Qwen3-1.7B | 1.7B | ~1.2 GB | ~2 GB | Any GPU 4 GB+ | Any modern GPU
Qwen3-4B | 4B | ~3 GB | ~5 GB | Any GPU 4 GB+ or CPU | RTX 4060 8 GB
Qwen3-8B | 8B | ~5 GB | ~9 GB | RTX 4060 8 GB (Q4) | RTX 4070 12 GB (Q8)
Qwen3-14B | 14B | ~8 GB | ~15 GB | RTX 4070 12 GB (Q4) | RTX 4060 Ti 16 GB (Q8)
Qwen3-32B | 32B | ~20 GB | ~35 GB | RTX 4090 24 GB (Q4) | Dual RTX 4090 or 48 GB GPU

MoE Models

Model | Total Params | Active Params | Q4 VRAM | Q8 VRAM | Minimum Hardware | Best Hardware
Qwen3-30B-A3B | 30B | 3B | ~20 GB | ~32 GB | RTX 4090 24 GB (Q4, tight) | RTX 5090 32 GB or Mac Studio 64 GB
Qwen3-235B-A22B | 235B | 22B | ~130 GB+ | ~240 GB+ | 6x RTX 4090 (multi-GPU, Q4) | Cloud or multi-GPU server

Why MoE models need more VRAM than their active params suggest

Qwen3-30B-A3B activates only 3B parameters per token, but all 30B expert weights must be loaded into VRAM. The router must be able to dispatch any token to any expert without loading or unloading weights mid-inference. VRAM is determined by total parameters, not active ones.
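The distinction between what must be resident and what is read per token can be sketched in a few lines. This is an illustration under assumptions: the `Model` class is hypothetical, the parameter counts are the nominal ones from the model names, and 4.85 bits per weight is an assumed Q4_K_M average.

```python
from dataclasses import dataclass

BYTES_PER_PARAM_Q4 = 4.85 / 8  # assumed average for Q4_K_M


@dataclass
class Model:
    name: str
    total_params_b: float   # billions of parameters stored
    active_params_b: float  # billions of parameters used per token

    def resident_gb(self) -> float:
        """Memory needed to hold every weight, routed to or not."""
        return self.total_params_b * BYTES_PER_PARAM_Q4

    def read_per_token_gb(self) -> float:
        """Weights actually touched while generating one token."""
        return self.active_params_b * BYTES_PER_PARAM_Q4


dense = Model("Qwen3-32B", 32, 32)
moe = Model("Qwen3-30B-A3B", 30, 3)

for m in (dense, moe):
    print(f"{m.name}: resident ~{m.resident_gb():.1f} GB, "
          f"per-token read ~{m.read_per_token_gb():.1f} GB")
```

The resident figure drives the VRAM requirement, while the much smaller per-token read is what lets the MoE model generate faster than a dense model of similar total size.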

For CPU-only inference, multiply VRAM figures by roughly 2 to estimate system RAM needed, plus OS overhead. Qwen3-4B and smaller models run well on CPU with 16 GB RAM.

Thinking Mode: VRAM and Performance Impact

Qwen3's built-in thinking mode adds chain-of-thought reasoning before producing a final answer. This is controlled by the model itself using special tokens, not by loading a separate model. The VRAM for weights does not change, but thinking mode generates significantly longer sequences, which increases KV cache usage.
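Qwen3 marks its reasoning with `<think>` and `</think>` tags in the generated text. A minimal sketch for separating the reasoning from the final answer follows, assuming your runtime returns the raw text with the tags intact (some front ends strip or hide them); `split_reasoning` is a hypothetical helper name.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Comparing the length of the two parts on your own prompts gives a direct measure of how many extra tokens thinking mode adds.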

How thinking mode affects VRAM

Model weights

No change. Same VRAM as non-thinking inference.

KV cache

2 to 3x larger due to longer reasoning sequences before the final answer.

Practical tip

Reduce context length to 8K or 16K on 8 GB GPUs when using thinking mode.

Model | Mode | KV Cache (32K ctx) | Total VRAM | Practical GPU
Qwen3-8B Q4 | Standard | ~2 GB | ~7 GB | Fits 8 GB GPU
Qwen3-8B Q4 | Thinking | ~4-6 GB | ~9-11 GB | Needs 12 GB GPU
Qwen3-14B Q4 | Standard | ~2 GB | ~10 GB | Fits 12 GB GPU
Qwen3-14B Q4 | Thinking | ~4-6 GB | ~12-14 GB | Needs 16 GB GPU
Qwen3-32B Q4 | Standard | ~2 GB | ~22 GB | Fits 24 GB GPU
Qwen3-32B Q4 | Thinking | ~4-6 GB | ~24-26 GB | Tight on 24 GB, use 16K ctx
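KV cache size grows linearly with the number of tokens held in context, which is why longer thinking-mode sequences cost more memory. The sketch below uses the standard formula for grouped-query attention (keys plus values, per layer, per KV head). The Qwen3-8B shape values are assumptions to confirm against the model's config.json, and `kv_cache_gb` is a hypothetical helper.

```python
def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache size in GiB: keys and values for every layer and cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 1024**3


# Assumed Qwen3-8B attention shape; confirm against the model's config.json.
QWEN3_8B = dict(n_layers=36, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 16_384, 32_768):
    print(f"{tokens:>6} tokens: ~{kv_cache_gb(tokens, **QWEN3_8B):.2f} GiB at FP16")
```

Under these assumptions a completely filled 32K window comes to about 4.5 GiB at FP16, and halving the context halves the cache, which is the reasoning behind the 8K to 16K tip for smaller GPUs.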

To disable thinking mode in Ollama, append /no_think to your prompt or set the system prompt accordingly. In llama.cpp, thinking mode is controlled by the model's built-in token logic.

Recommended Hardware by Use Case

8 GB VRAM: Qwen3-4B Q8 or Qwen3-8B Q4

  • RTX 4060 8 GB, RTX 3070 8 GB, RX 7600 8 GB
  • 16 GB system RAM recommended
  • Qwen3-4B Q8 (~5 GB) or Qwen3-8B Q4 (~5 GB)
  • Thinking mode: use 8K-16K context

Expected speed: ~20-30 tok/s

12 GB VRAM: Qwen3-8B Q8 or Qwen3-14B Q4

  • RTX 4070 12 GB, RTX 3060 12 GB
  • 32 GB system RAM recommended
  • Qwen3-8B Q8 (~9 GB) or Qwen3-14B Q4 (~8 GB)
  • Thinking mode: 16K context is comfortable, 32K is tight

Expected speed: ~25-40 tok/s

24 GB VRAM: Qwen3-32B Q4 or Qwen3-14B Q8

  • RTX 4090 24 GB, RTX 3090 24 GB
  • 32-64 GB system RAM
  • Qwen3-32B Q4 (~20 GB) tight, or Qwen3-14B Q8 (~15 GB) with room
  • Thinking mode: 16K context on Qwen3-32B

Expected speed: ~20-50 tok/s

48 GB VRAM: Qwen3-32B Q8, very comfortable

  • Dual RTX 4090 (48 GB total), RTX 6000 Ada 48 GB
  • 64 GB system RAM
  • Qwen3-32B Q8 (~35 GB), large KV cache for thinking mode
  • Full thinking mode at 32K-64K context

Expected speed: ~30-60 tok/s
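The tiers above follow a simple rule: take the largest model and quantization whose weights, plus a KV cache reserve, fit in VRAM. A minimal sketch of that rule, using the weight figures from this guide's tables and an assumed 3 GB reserve (the upper end of the 1 to 3 GB guidance); `pick_model` is a hypothetical helper.

```python
# (model, quant, weights in GB) taken from the VRAM tables in this guide,
# ordered from most to least demanding.
CANDIDATES = [
    ("Qwen3-32B", "Q8", 35), ("Qwen3-32B", "Q4", 20),
    ("Qwen3-14B", "Q8", 15), ("Qwen3-14B", "Q4", 8),
    ("Qwen3-8B", "Q8", 9), ("Qwen3-8B", "Q4", 5),
    ("Qwen3-4B", "Q8", 5), ("Qwen3-4B", "Q4", 3),
]


def pick_model(vram_gb: float, kv_reserve_gb: float = 3.0):
    """Return the first candidate whose weights plus KV reserve fit, else None."""
    for model, quant, weights in CANDIDATES:
        if weights + kv_reserve_gb <= vram_gb:
            return model, quant
    return None


for vram in (8, 12, 24, 48):
    print(vram, "GB ->", pick_model(vram))
```

Raising `kv_reserve_gb` is one way to model thinking mode, since longer reasoning sequences need a larger cache.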

RTX 4070 12 GB: Best GPU for Qwen3-14B

The RTX 4070 12 GB is a natural match for Qwen3-14B at Q4 quantization. The 8 GB model footprint leaves 4 GB free for KV cache, which covers 32K context in standard mode. Thinking mode can push KV cache to 4-6 GB at 32K, so drop to 16K context or quantize the KV cache when reasoning runs long. Inference speed runs at around 35 to 40 tokens per second, making responses feel real-time.

RTX 4090 24 GB: Best GPU for Qwen3-32B

The RTX 4090 with 24 GB VRAM is the primary single-GPU option for Qwen3-32B Q4. VRAM is tight at around 20 GB for weights, leaving only 4 GB for KV cache, so limit context to 16K when using thinking mode. For a more comfortable experience, pairing two RTX 4090 cards gives 48 GB total, which accommodates Qwen3-32B Q8 and generous context.

Inference Speed by Hardware

Token generation speed is primarily limited by GPU memory bandwidth. The table below shows estimated Q4_K_M speeds at short context. Thinking mode does not change tokens per second, but produces more tokens per query, so total response time is 2 to 4x longer.

Hardware | Bandwidth | Qwen3-8B Q4 | Qwen3-14B Q4 | Qwen3-32B Q4
RTX 5090 32 GB | 1,792 GB/s | ~120 t/s | ~62 t/s | ~31 t/s
RTX 4090 24 GB | 1,008 GB/s | ~68 t/s | ~35 t/s | ~17 t/s
RTX 4070 Ti Super 16 GB | 672 GB/s | ~45 t/s | ~23 t/s | -
RTX 4070 12 GB | 504 GB/s | ~34 t/s | ~17 t/s | -
RTX 4060 Ti 16 GB | 288 GB/s | ~19 t/s | ~10 t/s | -
RTX 4060 8 GB | 272 GB/s | ~18 t/s | - | -
Mac Studio M4 Max 64 GB | ~600 GB/s | ~40 t/s | ~21 t/s | ~10 t/s
Mac mini M4 Pro 48 GB | ~273 GB/s | ~18 t/s | ~9 t/s | ~5 t/s

Source: speed estimates derived from each GPU's memory bandwidth divided by model size in memory, then cross-referenced with XiongjieDai community llama-bench runs where comparable runs exist. Real-world results vary by software version, quantization variant, and system configuration. Mac bandwidth figures are approximate.
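The bandwidth-over-model-size reasoning can be sketched as follows. Dividing bandwidth by weight size gives a theoretical ceiling, since each generated token requires reading the weights once; real throughput is lower. The `efficiency` factor here is a hypothetical tuning parameter, and the 0.35 in the example is an illustrative value, not a measurement.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if every weight is read once per token."""
    return bandwidth_gb_s / weights_gb


def estimated_tps(bandwidth_gb_s: float, weights_gb: float,
                  efficiency: float) -> float:
    """Scale the ceiling by an assumed efficiency factor (tune per setup)."""
    return decode_ceiling_tps(bandwidth_gb_s, weights_gb) * efficiency


def response_seconds(output_tokens: int, tps: float) -> float:
    """Time to generate a response of the given length."""
    return output_tokens / tps


# RTX 4090 bandwidth and Qwen3-32B Q4 weights from the tables in this guide.
tps = estimated_tps(1008, 20, efficiency=0.35)  # 0.35 is illustrative only
print(f"estimated speed: {tps:.1f} t/s")
print(f"500-token answer: {response_seconds(500, tps):.0f} s")
print(f"1500 tokens including reasoning: {response_seconds(1500, tps):.0f} s")
```

The last two lines show why thinking mode feels slower even though tokens per second is unchanged: three times the tokens takes three times as long.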

Running Qwen3 with Ollama

Ollama is the fastest way to get Qwen3 running locally. It handles GPU detection, quantization selection, and provides a REST API. Qwen3 support is available in Ollama 0.6 and later. Ollama natively supports thinking mode via the /think and /no_think prompt commands.

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

2. Pull Qwen3-8B (recommended starting point)

ollama pull qwen3:8b

3. Run interactively

ollama run qwen3:8b

Pull Qwen3-14B or Qwen3-32B

ollama pull qwen3:14b
ollama pull qwen3:32b

Enable or disable thinking mode in chat

/think
/no_think

Ollama auto-selects Q4_K_M by default and uses GPU when available. On Mac, it uses Apple Silicon's unified memory automatically. Use ollama ps to check GPU utilization after loading.
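The REST API mentioned above can be called from a script using only the Python standard library. This is a minimal sketch, assuming Ollama is running on its default local address and that `qwen3:8b` has already been pulled. `build_payload` and `generate` are hypothetical helper names, and the thinking switch is simply appended to the prompt as described in this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_payload(prompt: str, thinking: bool, model: str = "qwen3:8b") -> dict:
    """Build a non-streaming generate request, appending the /think or /no_think switch."""
    switch = "/think" if thinking else "/no_think"
    return {"model": model, "prompt": f"{prompt} {switch}", "stream": False}


def generate(prompt: str, thinking: bool = True) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_payload(prompt, thinking)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Printing the payload needs no server; call generate() once Ollama is running.
    print(json.dumps(build_payload("What is a KV cache?", thinking=False), indent=2))
```

With the server running, `generate("What is a KV cache?", thinking=False)` returns the model's reply as a string.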

Running Qwen3 with llama.cpp

llama.cpp added native Qwen3 support in April 2025. It gives you fine-grained control over quantization, layer offloading between GPU and CPU, and KV cache quantization. Use it when you need maximum control or when running on mixed CPU and GPU systems.
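When a model does not fully fit in VRAM, the number of layers to offload (the value passed to -ngl) can be estimated from free VRAM. The sketch below is a rough approximation that assumes all layers are the same size and ignores embedding and output tensors. The layer count in the example is an assumption to check against the model's metadata, and `layers_on_gpu` is a hypothetical helper.

```python
import math


def layers_on_gpu(free_vram_gb: float, weights_gb: float, n_layers: int,
                  kv_reserve_gb: float = 2.0) -> int:
    """Estimate how many equal-sized layers fit after reserving room for KV cache."""
    usable = max(free_vram_gb - kv_reserve_gb, 0.0)
    fit = math.floor(usable * n_layers / weights_gb)
    return min(n_layers, fit)


# Example: ~8 GB of Q4 weights, assumed 40 layers, on an 8 GB card.
print("suggested -ngl:", layers_on_gpu(free_vram_gb=8, weights_gb=8, n_layers=40))
```

Treat the result as a starting point and lower it if llama.cpp reports an out-of-memory error at load time.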

Build llama.cpp with CUDA support

cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

Download Qwen3-8B GGUF from Hugging Face

huggingface-cli download Qwen/Qwen3-8B-GGUF --include "*Q4_K_M.gguf" --local-dir .

Run inference with full GPU offloading

./build/bin/llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99 -c 32768 -p "You are a helpful assistant."

Quantize KV cache to save VRAM (useful for thinking mode)

./build/bin/llama-cli -m Qwen3-8B-Q4_K_M.gguf -ngl 99 -c 32768 -ctk q8_0 -ctv q8_0

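The -ctk q8_0 and -ctv q8_0 flags quantize the KV cache keys and values, roughly halving KV cache VRAM at minimal quality cost. Quantizing the value cache requires flash attention, so enable it with the -fa flag if your llama.cpp build does not turn it on automatically. This is especially useful when thinking mode generates long sequences. For AMD GPUs, build with -DGGML_HIP=ON instead of -DGGML_CUDA=ON.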
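The "roughly halving" figure follows from the q8_0 block layout: 32 values stored as int8 plus one fp16 scale, which works out to 8.5 bits per value against 16 for FP16. A short sketch of that arithmetic, with hypothetical helper names:

```python
# Bits stored per cached element. q8_0 packs 32 int8 values with one fp16
# scale per block, adding 0.5 bits per value on top of the 8 data bits.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5}


def kv_cache_scale(cache_type: str) -> float:
    """Fraction of the FP16 KV cache size used by the given cache type."""
    return BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]


def quantized_kv_gb(fp16_kv_gb: float, cache_type: str) -> float:
    """Size of a KV cache after switching from FP16 to the given cache type."""
    return fp16_kv_gb * kv_cache_scale(cache_type)


for cache_type in BITS_PER_ELEMENT:
    print(f"{cache_type}: {kv_cache_scale(cache_type):.0%} of FP16, "
          f"4 GB cache -> {quantized_kv_gb(4, cache_type):.2f} GB")
```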

Frequently Asked Questions

Can I run Qwen3 on 8 GB VRAM?

Yes. Qwen3-4B at Q4 quantization uses about 3 GB of VRAM, leaving room for KV cache on any 8 GB GPU. Qwen3-8B at Q4 needs approximately 5 GB, which also fits an 8 GB card. Either model runs well on an RTX 4060 8 GB or RTX 3070 8 GB, delivering 20 to 30 tokens per second.

What is the best Qwen3 model for a 12 GB GPU?

The RTX 4070 12 GB is an excellent match for Qwen3-14B at Q4 quantization, which uses about 8 GB of VRAM. This leaves 4 GB for KV cache, enabling comfortable 32K context. For maximum quality on 12 GB, Qwen3-8B at Q8 quantization is another strong option at around 9 GB VRAM.

What is Qwen3 thinking mode and how much extra VRAM does it need?

Qwen3 models have a built-in chain-of-thought reasoning mode called thinking mode. When enabled, the model generates internal reasoning steps before producing its final answer. This does not increase VRAM for weights, but the longer output sequences consume more KV cache. Expect 2 to 3 times more tokens per response, meaning KV cache usage grows proportionally. On an 8 GB GPU, reduce context length to 8K or 16K to stay within VRAM limits when using thinking mode.

How do Qwen3 MoE models compare to dense models for VRAM?

Qwen3-30B-A3B is a Mixture-of-Experts model with 30 billion total parameters but only 3 billion active per token. However, all 30 billion parameters must be loaded into memory. At Q4 quantization the model needs about 20 GB of VRAM to load, similar to Qwen3-32B dense. The benefit is that inference speed is closer to a 3B model once loaded, since only a small fraction of weights are used per token. Qwen3-235B-A22B needs about 130 GB at Q4, which means six or more 24 GB GPUs, and is not practical for most home setups.

How do I run Qwen3 with Ollama?

Install Ollama from ollama.com, then use: ollama pull qwen3:8b for the 8B model, ollama pull qwen3:14b for the 14B model, or ollama pull qwen3:32b for the 32B model. Run the model interactively with ollama run qwen3:8b. Ollama automatically selects Q4 quantization and uses your GPU. To enable thinking mode in Ollama, use the /think command prefix in the chat interface.


Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.