What Hardware Do You Need to Run DeepSeek Locally?
AI assembled the first DeepSeek table; each entry was reconciled against DeepSeek's own model cards and the open llama-bench runs cited below.
Updated May 2026 · DeepSeek R1 distilled models · VRAM requirements · Consumer GPU guide
DeepSeek R1 went viral in January 2025 as a reasoning model that matched frontier performance at a fraction of the cost. The good news for local inference: the distilled versions (1.5B through 70B) run on consumer hardware you can buy today. This guide covers VRAM requirements for every DeepSeek R1 model, which GPU or Mac fits each one, speed expectations, and where to download them.
What is DeepSeek?
DeepSeek is both a Chinese AI company and a family of models. The models that matter for local inference are:
- R1: The flagship reasoning model (671B MoE). Requires a data-center-scale multi-GPU setup. Not practical locally.
- R1 Distilled: Smaller dense models (1.5B to 70B) with R1's reasoning ability baked in via distillation. These are what most people run locally.
- V3: A 685B MoE chat model, similar in scale to full R1 and not practical on consumer hardware.
DeepSeek R1 VRAM Requirements by Model
VRAM is estimated with the formula: ceil(params in billions × bytes_per_param) GB + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param. Add extra headroom for long reasoning chains; see the KV cache section below.
| Model | Size | Q4 VRAM | Q8 VRAM | Minimum Hardware | Best Hardware |
|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | ~2.3 GB | ~3 GB | Any GPU 4GB+ | Mac mini M4 16GB |
| R1-Distill-Qwen-7B | 7B | ~5.5 GB | ~8.5 GB | RTX 4060 8GB (Q4) | RTX 4060 Ti 16GB (Q8) |
| R1-Distill-Llama-8B | 8B | ~6 GB | ~9.5 GB | RTX 4060 8GB (Q4) | RTX 4060 Ti 16GB (Q8) |
| R1-Distill-Qwen-14B | 14B | ~8.5 GB | ~15.5 GB | Arc B580 12GB (Q4) | RTX 4060 Ti 16GB (Q8) |
| R1-Distill-Qwen-32B | 32B | ~17.5 GB | ~33.5 GB | RTX 4090 24GB (Q4)* | Mac mini M4 Pro 48GB (Q8) |
| R1-Distill-Llama-70B | 70B | ~36.5 GB | ~71.5 GB | Mac Studio 64GB (Q4) | Mac Studio 128GB (Q8) |
| Full R1 671B (MoE) | 671B | 340GB+ storage | — | Multi-GPU server | Not practical |
* R1-Distill-32B at Q4 on the RTX 4090 uses ~17.5GB of 24GB — fits with ~6.5GB headroom. The RX 7900 XTX 24GB and RTX 3090 24GB also work. Use the VRAM Calculator for context-length-adjusted estimates.
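The sizing formula behind the table can be sketched in a few lines of Python. The function and constant names here are illustrative, not part of any library, and the bytes-per-parameter values are the approximations stated above rather than exact file sizes.

```python
import math

# Approximate bytes per parameter, as stated in this guide (assumptions, not exact).
BYTES_PER_PARAM = {"q4": 0.5, "q8": 1.0}
OVERHEAD_GB = 1.5  # fixed runtime overhead used by the guide's formula


def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """Return ceil(params x bytes_per_param) + 1.5 GB for a given quantization."""
    weights_gb = math.ceil(params_billion * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


if __name__ == "__main__":
    for size in (7, 14, 32, 70):
        print(f"{size}B  Q4: {estimate_vram_gb(size, 'q4')} GB  Q8: {estimate_vram_gb(size, 'q8')} GB")
```

For the 7B to 70B rows this reproduces the table (for example 17.5 GB for 32B at Q4). The smallest rows can differ slightly because of the rounding step.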
What DeepSeek Can You Run on Your GPU?
Find your GPU or Mac below. Each card lists what distilled models fit and what does not.
RTX 4060 8GB
Runs:
- +R1-Distill-1.5B (Q4/Q8)
- +R1-Distill-7B (Q4 only)
- +R1-Distill-8B (Q4 only)
Does not fit:
- -R1-Distill-14B and larger
Q4 only for the 7B/8B — no headroom for Q8.
Intel Arc B580 12GB
Runs:
- +R1-Distill-1.5B
- +R1-Distill-7B (Q4 & Q8)
- +R1-Distill-8B (Q4 & Q8)
- +R1-Distill-14B (Q4 only)
Does not fit:
- -R1-Distill-32B and larger
Best budget pick for 14B at Q4. Intel Arc support (through the SYCL and Vulkan backends in llama.cpp) is improving but less mature than CUDA.
RTX 4060 Ti 16GB
Runs:
- +R1-Distill-1.5B
- +R1-Distill-7B (Q4 & Q8)
- +R1-Distill-8B (Q4 & Q8)
- +R1-Distill-14B (Q4 & Q8)
Does not fit:
- -R1-Distill-32B and larger
Strong sweet spot — runs 14B at full Q8 quality, which is the best reasoning you get at this price.
RTX 4070 12GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B (Q4 only)
Does not fit:
- -R1-Distill-14B Q8 (needs ~15.5GB)
- -R1-Distill-32B and larger
12GB limits you to Q4 on 14B. The 4060 Ti 16GB is a better buy for DeepSeek use cases.
RTX 4070 Ti Super 16GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B (Q4 & Q8)
Does not fit:
- -R1-Distill-32B and larger
Same VRAM ceiling as 4060 Ti 16GB but faster bandwidth (~672 GB/s vs 288 GB/s) for quicker reasoning chains.
RTX 4080 16GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B (Q4 & Q8)
Does not fit:
- -R1-Distill-32B and larger
16GB cap is the same as 4070 Ti Super. Very fast token generation but limited model ceiling.
RTX 3090 24GB (used)
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B
- +R1-Distill-32B (Q4 only)
Does not fit:
- -R1-Distill-32B Q8
- -R1-Distill-70B
24GB fits 32B at Q4 (~17.5GB used, ~6.5GB headroom). Bandwidth (~936 GB/s) is slightly lower than the 4090's (~1,008 GB/s) but solid for the price.
AMD RX 7900 XTX 24GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B
- +R1-Distill-32B (Q4 only)
Does not fit:
- -R1-Distill-32B Q8
- -R1-Distill-70B
Same 24GB ceiling as RTX 4090/3090. ROCm support works with Ollama and llama.cpp. See the AMD ROCm guide.
RTX 4090 24GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B
- +R1-Distill-32B (Q4 only)
Does not fit:
- -R1-Distill-32B Q8
- -R1-Distill-70B
Best single-GPU option at 24GB. Fastest 32B Q4 inference on a 24GB consumer card (~15 tok/s).
RTX 5090 32GB
Runs:
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B
- +R1-Distill-32B (Q4 only)
Does not fit:
- -R1-Distill-32B Q8 (needs ~33.5GB)
- -R1-Distill-70B (needs ~36.5GB)
Highest-bandwidth consumer card for 32B at Q4. By the sizing formula, 32B at Q8 (~33.5GB) and 70B at Q4 (~36.5GB) both exceed the 32GB limit.
Mac mini M4 16GB
Runs:
- +R1-Distill-1.5B
- +R1-Distill-7B (Q4 & Q8)
- +R1-Distill-8B (Q4 & Q8)
Does not fit:
- -R1-Distill-14B and larger
Unified memory is shared between the CPU and GPU, so most of the 16GB is available for models, though macOS and running apps take a share. Silent, efficient, great value for 7B reasoning.
Mac mini M4 24GB
Runs:
- +R1-Distill-1.5B
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B (Q4 & Q8)
Does not fit:
- -R1-Distill-32B and larger
Extra 8GB vs the 16GB model unlocks 14B at full Q8 quality.
Mac mini M4 Pro 48GB
Runs:
- +R1-Distill-1.5B
- +R1-Distill-7B
- +R1-Distill-8B
- +R1-Distill-14B
- +R1-Distill-32B (Q4 & Q8)
Does not fit:
- -R1-Distill-70B
48GB comfortably fits the 32B at both Q4 and Q8. Memory bandwidth is ~273 GB/s, so 32B generation is slow (~4 tok/s at Q4 by this guide's estimates).
Mac Studio M4 Max 64GB
Runs:
- +All distilled models including R1-Distill-70B (Q4)
Does not fit:
- -R1-Distill-70B at Q8 (needs ~71.5GB)
The minimum practical option for the 70B distill. ~600 GB/s bandwidth gives solid reasoning chain speed.
Mac Studio M4 Max 128GB
Runs:
- +All distilled models including R1-Distill-70B (Q4 & Q8)
Does not fit:
- -Full 671B R1 (still not enough VRAM)
Runs the full distill lineup. 70B at Q8 fits with room to spare.
Reasoning Tokens: Why DeepSeek R1 Needs More VRAM Than You Expect
DeepSeek R1 models use chain-of-thought reasoning. Before outputting an answer, the model generates internal "thinking" tokens that work through the problem step by step. Every one of these tokens takes up space in the KV cache, the memory buffer that holds the attention state of earlier tokens so they do not need to be recomputed.
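To see how much of a response is reasoning, you can separate the thinking from the final answer. The sketch below assumes the runtime returns the reasoning inline, wrapped in `<think>...</think>` tags, which is how R1 distills commonly emit it; some runtimes strip or separate the tags, so treat this as an assumption. The function name is illustrative.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from raw model output with inline <think> tags."""
    thinking = "\n".join(block.strip() for block in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return thinking, answer


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
    thinking, answer = split_reasoning(raw)
    print("thinking:", thinking)
    print("answer:", answer)
```

Comparing the length of the two parts gives a rough sense of how many extra tokens a given prompt costs.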
KV cache grows with every token
Each reasoning step adds to the KV cache. A complex coding or math problem can generate 500-2,000 thinking tokens before the answer begins. Depending on model size, attention layout, and context length, this can temporarily add anywhere from a few hundred MB to several GB of KV cache on top of the base model weight memory.
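KV cache growth can be estimated with the standard per-token formula: two tensors (keys and values) per layer, each sized by the number of KV heads times the head dimension. The sketch below is a rough estimate only. The example values (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, not numbers from a DeepSeek model card; read the real ones from the model's config file.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys + values, for every layer and every token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical architecture values, chosen only for illustration.
    for tokens in (2_000, 8_192, 32_768):
        size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=tokens)
        print(f"{tokens:>6} tokens -> {size / GIB:.2f} GiB")
```

The cache scales linearly with token count, which is why long reasoning chains and large context windows matter more than the prompt itself.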
Budget 1.5-2x more VRAM at peak
The base VRAM figures in the table above cover model weights plus a fixed 1.5 GB overhead, not the KV cache that builds up during long reasoning. During active inference on complex prompts, peak VRAM usage can be 1.5-2x higher. Leave headroom: do not run a model that barely fits.
Responses are slower per query
Even at the same tokens-per-second speed, total response time is longer because the model generates more tokens. A 7B DeepSeek R1 response may take 3-4x longer than a Llama 3.1 8B response to the same prompt.
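The arithmetic is simple enough to sketch. The function below is a hypothetical helper, and the token counts in the example are made-up inputs rather than measurements.

```python
def response_seconds(thinking_tokens: int, answer_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock generation time: every token, thinking or answer, costs the same."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return (thinking_tokens + answer_tokens) / tokens_per_sec


if __name__ == "__main__":
    speed = 60.0  # illustrative tokens/sec
    plain = response_seconds(0, 300, speed)
    reasoning = response_seconds(1500, 300, speed)
    print(f"no thinking: {plain:.1f}s, with thinking: {reasoning:.1f}s")
```

The speed is identical in both cases; only the number of generated tokens changes.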
Context length multiplies this effect
Longer system prompts and conversation history also grow KV cache. For extended reasoning sessions, 16K-32K context, or agentic pipelines, consider stepping up one model size tier to ensure stability.
Practical rule: if the model weight fits in VRAM with less than 20% headroom, you will likely see out-of-memory errors or extreme slowdowns on complex reasoning tasks. See the VRAM Calculator to estimate KV cache at your context length.
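The 20% rule is easy to check before you buy or download. This is a minimal sketch with illustrative function names; the required figure is whatever the table or the VRAM Calculator gives you.

```python
def headroom_fraction(required_gb: float, capacity_gb: float) -> float:
    """Fraction of total memory left free after loading the model."""
    return (capacity_gb - required_gb) / capacity_gb


def is_comfortable(required_gb: float, capacity_gb: float, min_headroom: float = 0.20) -> bool:
    """True if at least `min_headroom` of capacity stays free (the guide's 20% rule)."""
    return headroom_fraction(required_gb, capacity_gb) >= min_headroom


if __name__ == "__main__":
    # (label, required GB from the table, card capacity in GB)
    cases = [("32B Q4 on 24GB", 17.5, 24), ("14B Q8 on 16GB", 15.5, 16)]
    for label, need, cap in cases:
        print(f"{label}: {headroom_fraction(need, cap):.0%} free, comfortable={is_comfortable(need, cap)}")
```

A configuration that fails this check can still load, but long reasoning chains or larger context windows are more likely to push it out of memory.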
Inference Speed by Hardware
Token generation speed is bottlenecked by memory bandwidth. Faster bandwidth means faster reasoning chains. The table below shows estimated Q4_K_M speeds. Note that for reasoning models, these speeds apply to the full token stream including thinking tokens — total wall-clock time per query is longer.
| Hardware | Bandwidth | 7B Q4 tok/s | 14B Q4 tok/s | 32B Q4 tok/s |
|---|---|---|---|---|
| RTX 5090 32GB | 1,792 GB/s | ~102 t/s | ~53 t/s | ~27 t/s |
| RTX 4090 24GB | 1,008 GB/s | ~58 t/s | ~30 t/s | ~15 t/s |
| RTX 4060 Ti 16GB | 288 GB/s | ~16 t/s | ~9 t/s | — |
| Mac Studio M4 Max 64GB | ~600 GB/s | ~34 t/s | ~18 t/s | ~9 t/s |
| Mac mini M4 Pro 48GB | ~273 GB/s | ~15 t/s | ~8 t/s | ~4 t/s |
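Speed estimates start from the bandwidth rule: tokens/sec ≤ bandwidth (GB/s) / model size in memory (GB). That ratio is a theoretical ceiling, and the table's figures sit well below it, so treat them as conservative. Real-world results vary by software version and system configuration. Mac mini M4 Pro bandwidth estimated at ~273 GB/s. Mac Studio M4 Max bandwidth estimated at ~600 GB/s.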
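The bandwidth rule can be written as a one-line calculation. Treat the result as an upper bound: it assumes every generated token requires reading the full model from memory once and ignores compute, KV cache reads, and software overhead. The function name is illustrative.

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed: memory bandwidth divided by model size."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Bandwidth figures from the table above; 17.5 GB is the 32B Q4 estimate.
    for name, bandwidth in [("RTX 5090", 1792), ("RTX 4090", 1008), ("Mac mini M4 Pro", 273)]:
        print(f"{name}: at most {tokens_per_sec_ceiling(bandwidth, 17.5):.1f} tok/s for 32B Q4")
```

Measured speeds land below this ceiling, so use it to compare hardware rather than to predict exact throughput.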
Where to Download and Run DeepSeek R1
Ollama
ollama run deepseek-r1:7b
Easiest option. One command downloads and runs the model. Use the :1.5b, :7b, :8b, :14b, :32b, or :70b tags. The GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon.
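Once the model is pulled, you can also query it from a script through Ollama's local REST API. The sketch below assumes Ollama is running on its default port (11434) and that the `deepseek-r1:7b` tag has already been downloaded; the helper names are illustrative, and only the Python standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "deepseek-r1:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "deepseek-r1:7b", timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (thinking tokens may be included)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` returns the full generated text. The generous timeout is deliberate, since reasoning models can spend a long time thinking before they answer.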
LM Studio
Search "deepseek-r1" in Discover
GUI-based model browser and chat interface. Download GGUF quantizations directly from within the app. Best for non-technical users. Runs on Windows, Mac, and Linux.
Hugging Face + llama.cpp
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Download GGUF files from the bartowski or unsloth community repos on Hugging Face. Run with llama.cpp for maximum control over quantization and context length.
For step-by-step installation instructions, see the how to run LLMs locally guide. For AMD GPU setup with ROCm, see the AMD ROCm guide.
Which Hardware Should You Buy for DeepSeek?
RTX 4060 8GB
Runs the 7B and 8B distills at Q4. Solid reasoning capability for coding, writing, and math at a budget price. Upgrade if you want 14B.
Intel Arc B580 12GB
Best value path to the 14B distill. 12GB fits 14B at Q4. Intel Arc support in llama.cpp and Ollama has improved significantly, but verify driver and backend compatibility before buying.
RTX 4060 Ti 16GB
Best bang for DeepSeek use. 16GB runs the 14B at Q8, which is the highest quality reasoning you get at this price. Strongly recommended.
RTX 4090 24GB
Best single-GPU card for DeepSeek. Runs the 32B distill at Q4, which delivers near-full-R1 reasoning quality on complex tasks. Fast bandwidth (~1,008 GB/s) keeps thinking chains snappy.
Mac mini M4 24GB
Runs 14B at Q8 silently and efficiently. Unified memory avoids the VRAM vs RAM split. For 32B, step up to Mac mini M4 Pro 48GB. For 70B, Mac Studio M4 Max 64GB.
For a full cross-budget GPU comparison, see the best GPU for LLMs guide.
Related Resources
Best GPU for LLMs — Full Guide
All budget tiers from entry to workstation
LLM Quantization Explained
Q4 vs Q8 vs FP16 — when quality trade-offs matter
How to Run LLMs Locally
Step-by-step Ollama, LM Studio, llama.cpp setup
Apple Silicon for LLMs
M4, M4 Pro, M4 Max — which Mac for DeepSeek?
Ollama vs LM Studio
Which tool to use for running DeepSeek locally
CPU-Only Inference
Run DeepSeek-R1-Distill-8B without a GPU
AMD GPU ROCm Guide
Run DeepSeek on RX 7900 XTX with ROCm
Best LLMs to Run Locally
DeepSeek vs Qwen3, Phi-4, Llama 3.3 — 2026 picks
Qwen3 Hardware Requirements
Compare DeepSeek R1 Distill vs Qwen3 thinking
Llama 3.3 70B Hardware Guide
Hardware for the best open-weights 70B model
Qwen3 vs DeepSeek vs Llama 3.3
Decide which model family fits your hardware
How to Run DeepSeek R1 Locally
Step-by-step Ollama setup guide for all hardware tiers
DeepSeek V3 Hardware Requirements
Can you run DeepSeek V3 locally? 685B model needs 390 GB VRAM — and what to use instead.
VRAM Calculator
Calculate exact VRAM at your context length
Frequently Asked Questions
What GPU do I need to run DeepSeek R1 locally?
It depends on which distilled model you want. DeepSeek-R1-Distill-7B at Q4 needs ~5.5 GB VRAM — an RTX 4060 8GB handles it. The 14B at Q4 needs ~8.5 GB — the Intel Arc B580 12GB or RTX 4070 12GB work. The 32B at Q4 needs ~17.5 GB — RTX 4090 24GB or better. The 70B distill needs ~36.5 GB, requiring a Mac Studio M4 Max 64GB or a dual-GPU setup. The full 671B R1 is not practical on consumer hardware.
Can I run DeepSeek R1 on an 8GB GPU?
Yes, with the distilled models. DeepSeek-R1-Distill-7B at Q4_K_M needs about 5.5 GB VRAM and runs on an RTX 4060 8GB. The 1.5B distill needs only ~2.3 GB at Q4 and runs on virtually any GPU with 4GB+. You cannot run the 14B, 32B, or larger models on 8GB VRAM.
Why is DeepSeek R1 slower than regular LLMs of the same size?
DeepSeek R1 uses chain-of-thought reasoning, generating internal "thinking" tokens before giving its final answer. A single query can trigger hundreds to thousands of thinking tokens. Total token generation per query is 2-5x higher than a non-reasoning model of the same size, so each response takes noticeably longer even at the same tokens-per-second speed.
What is the difference between DeepSeek R1 and the distilled models?
The full DeepSeek R1 is a 671-billion parameter Mixture-of-Experts model requiring data-center hardware. The distilled models (1.5B to 70B) are smaller dense models trained using knowledge distillation from the full R1, transferring its reasoning capability into a far smaller footprint. The distilled models are what most people run locally. They retain strong reasoning ability despite being orders of magnitude smaller.
Can a Mac mini run DeepSeek R1?
Yes. The Mac mini M4 16GB runs the 1.5B and 7B distills well. The Mac mini M4 24GB adds the 14B at Q8. The Mac mini M4 Pro 48GB fits the 32B at Q4 and Q8. Apple Silicon uses unified memory, so most of the installed memory is usable for models. For the 70B distill, you need a Mac Studio M4 Max with at least 64GB.
Related Guides
LLM Quantization Explained: Q4, Q8, F16
Run DeepSeek on less VRAM by choosing the right quantization level.
Running 70B Models Locally
Hardware requirements and realistic expectations for large models.
KV Cache and VRAM: How Context Length Blows Up GPU Memory
Why long context windows eat VRAM fast and how to manage it.
Ollama vs llama.cpp: Which Should You Use?
Comparing the two most popular local LLM runtimes.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:
- Hugging Face Hub. Official DeepSeek model cards (V2, V3, R1, Coder variants) for parameter and context-length numbers.
- Modal: How much VRAM do I need for LLM inference. VRAM-budget formula used to size each DeepSeek quant against real GPUs.
- XiongjieDai GPU-Benchmarks-on-LLM-Inference. Independent tokens-per-second runs for DeepSeek quants across NVIDIA and Apple silicon.
Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.