What Hardware Do You Need to Run DeepSeek Locally?

Q: What GPU do I need to run DeepSeek R1 locally?

It depends on which DeepSeek model you want to run. The distilled models are the practical choice for local use. DeepSeek-R1-Distill-7B at Q4 needs ~5.5 GB VRAM — an RTX 4060 8GB handles it. The 14B model at Q4 needs ~8.5 GB — the Intel Arc B580 12GB or RTX 4070 12GB works. The 32B model at Q4 needs ~17.5 GB — you need an RTX 4090 24GB or RTX 5090 32GB. The 70B distill needs ~36.5 GB and requires a Mac Studio M4 Max 64GB or dual-GPU setup. The full 671B model is not practical on consumer hardware.

Q: Can I run DeepSeek R1 on an 8GB GPU?

Yes, with the right distilled model. DeepSeek-R1-Distill-7B at Q4_K_M needs about 5.5 GB VRAM and runs on an RTX 4060 8GB or any other 8GB GPU. The 1.5B distill needs only ~2.3 GB at Q4 and runs on virtually any GPU with 4GB+. The 14B, 32B, and larger models do not fit in 8GB VRAM at Q4 or higher: even the 14B at Q4 needs ~8.5 GB.

Q: Why is DeepSeek R1 slower than regular LLMs of the same size?

DeepSeek R1 is a reasoning model that works through a chain of thought (CoT) before giving its final answer. It first generates "thinking" tokens, which some interfaces hide or collapse but which are still computed one by one. A single query can trigger hundreds to thousands of thinking tokens before the actual response begins. Total token generation per query is therefore 2-5x higher than for a non-reasoning model of the same size, so each response takes noticeably longer even if raw tokens-per-second speed is the same.

Q: Can a Mac mini run DeepSeek R1?

Yes. The Mac mini M4 with 16GB of unified memory can run the 1.5B and 7B distilled models at good speeds. The Mac mini M4 24GB also fits the 14B at Q4. The Mac mini M4 Pro 48GB fits the 32B model at Q4. Apple Silicon uses unified memory, so the GPU draws on the same pool as the CPU instead of a separate, smaller VRAM bank; macOS and your other apps share that pool, so not all of it is free for the model. For the 70B distill, you need a Mac Studio M4 Max with at least 64GB unified memory.

AI assembled the first DeepSeek table; each entry was reconciled against DeepSeek's own model cards and the open llama-bench runs cited below.

Updated May 2026 · DeepSeek R1 distilled models · VRAM requirements · Consumer GPU guide

DeepSeek R1 went viral in January 2025 as a reasoning model that matched frontier performance at a fraction of the cost. The good news for local inference: the distilled versions (1.5B through 70B) run on consumer hardware you can buy today. This guide covers VRAM requirements for every DeepSeek model, which GPU or Mac fits each one, speed expectations, and where to download them.

What is DeepSeek?

DeepSeek is both a Chinese AI company and a family of models. The models that matter for local inference are:

R1: The flagship reasoning model (671B MoE). Requires a data-center-scale multi-GPU setup. Not practical locally.
R1 Distilled: Smaller dense models (1.5B to 70B) with R1's reasoning ability baked in via distillation. These are what most people run locally.
V3: A 685B MoE chat model, similar in scale to the full R1 and not practical on consumer hardware.

DeepSeek R1 VRAM Requirements by Model

VRAM is calculated using the formula: ceil(params × bytes_per_param) + 1.5 GB overhead. Q4_K_M uses ~0.5 bytes/param, Q8 uses ~1.0 bytes/param. Add extra headroom for long reasoning chains — see the KV cache section below.

Model	Size	Q4 VRAM	Q8 VRAM	Minimum Hardware	Best Hardware
R1-Distill-Qwen-1.5B	1.5B	~2.3 GB	~3 GB	Any GPU 4GB+	Mac mini M4 16GB
R1-Distill-Qwen-7B	7B	~5.5 GB	~8.5 GB	RTX 4060 8GB (Q4)	RTX 4060 Ti 16GB (Q8)
R1-Distill-Llama-8B	8B	~6 GB	~9.5 GB	RTX 4060 8GB (Q4)	RTX 4060 Ti 16GB (Q8)
R1-Distill-Qwen-14B	14B	~8.5 GB	~15.5 GB	Arc B580 12GB (Q4)	RTX 4060 Ti 16GB (Q8)
R1-Distill-Qwen-32B	32B	~17.5 GB	~33.5 GB	RTX 4090 24GB (Q4)*	RTX 5090 32GB (Q4, extra headroom)
R1-Distill-Llama-70B	70B	~36.5 GB	—	Mac Studio 64GB	Mac Studio 128GB
Full R1 671B (MoE)	671B	340GB+ storage	—	Multi-GPU server	Not practical

* R1-Distill-32B at Q4 on the RTX 4090 uses ~17.5GB of 24GB — fits with ~6.5GB headroom. The RX 7900 XTX 24GB and RTX 3090 24GB also work. Use the VRAM Calculator for context-length-adjusted estimates.
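The sizing formula above can be sketched in a few lines of Python. This is a minimal illustration, not the site's calculator: the function name and the dictionary are made up for this example, and the bytes-per-parameter values and the 1.5 GB overhead are the ones quoted in this guide. A couple of table rows (1.5B and 8B) appear to be rounded slightly differently, so expect small deviations there.

```python
import math

# Bytes per parameter for each quantization, as quoted in this guide.
BYTES_PER_PARAM = {"Q4_K_M": 0.5, "Q8": 1.0}
OVERHEAD_GB = 1.5  # fixed overhead term from the guide's formula


def estimate_vram_gb(params_billions: float, quant: str = "Q4_K_M") -> float:
    """Return ceil(params x bytes_per_param) + 1.5 GB (hypothetical helper)."""
    weights_gb = math.ceil(params_billions * BYTES_PER_PARAM[quant])
    return weights_gb + OVERHEAD_GB


for size in (7, 14, 32, 70):
    q4 = estimate_vram_gb(size)
    q8 = estimate_vram_gb(size, "Q8")
    print(f"{size}B  Q4: {q4:.1f} GB  Q8: {q8:.1f} GB")
```

The result covers weights plus the fixed overhead only; KV cache growth during long reasoning chains comes on top.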

What DeepSeek Can You Run on Your GPU?

Find your GPU or Mac below. Each card lists what distilled models fit and what does not.

RTX 4060 8GB

Runs:

+R1-Distill-1.5B (Q4/Q8)
+R1-Distill-7B (Q4 only)
+R1-Distill-8B (Q4 only)

Does not fit:

-R1-Distill-14B and larger

Q4 only for the 7B/8B — no headroom for Q8.

Intel Arc B580 12GB

Runs:

+R1-Distill-1.5B
+R1-Distill-7B (Q4 & Q8)
+R1-Distill-8B (Q4 & Q8)
+R1-Distill-14B (Q4 only)

Does not fit:

-R1-Distill-32B and larger

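Best budget pick for 14B at Q4. Arc does not use ROCm (that is AMD's stack); it relies on backends such as llama.cpp's SYCL and Vulkan support, which are improving but less mature than CUDA.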

RTX 4060 Ti 16GB

Runs:

+R1-Distill-1.5B
+R1-Distill-7B (Q4 & Q8)
+R1-Distill-8B (Q4 & Q8)
+R1-Distill-14B (Q4 & Q8)

Does not fit:

-R1-Distill-32B and larger

Strong sweet spot: runs 14B at full Q8 quality, the best reasoning you get at this price. Q8 uses ~15.5GB of the 16GB, so switch to Q4 for long reasoning chains.

RTX 4070 12GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B (Q4 only)

Does not fit:

-R1-Distill-14B Q8 (needs ~15.5GB)
-R1-Distill-32B and larger

12GB limits you to Q4 on 14B. The 4060 Ti 16GB is a better buy for DeepSeek use cases.

RTX 4070 Ti Super 16GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B (Q4 & Q8)

Does not fit:

-R1-Distill-32B and larger

Same VRAM ceiling as 4060 Ti 16GB but faster bandwidth (~672 GB/s vs 288 GB/s) for quicker reasoning chains.

RTX 4080 16GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B (Q4 & Q8)

Does not fit:

-R1-Distill-32B and larger

16GB cap is the same as 4070 Ti Super. Very fast token generation but limited model ceiling.

RTX 3090 24GB (used)

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B
+R1-Distill-32B (Q4 only, tight)

Does not fit:

-R1-Distill-32B Q8
-R1-Distill-70B

24GB fits 32B at Q4 (~17.5GB used, ~6.5GB headroom). Bandwidth (~936 GB/s) is a little lower than the 4090's (~1,008 GB/s) but solid for the price.

AMD RX 7900 XTX 24GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B
+R1-Distill-32B (Q4 only)

Does not fit:

-R1-Distill-32B Q8
-R1-Distill-70B

Same 24GB ceiling as RTX 4090/3090. ROCm support works with Ollama and llama.cpp. See the AMD ROCm guide.

RTX 4090 24GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B
+R1-Distill-32B (Q4 only)

Does not fit:

-R1-Distill-32B Q8
-R1-Distill-70B

Best single-GPU option at 24GB. Fastest 32B Q4 inference among the 24GB cards (~15 tok/s); only the RTX 5090 is faster.

RTX 5090 32GB

Runs:

+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B
+R1-Distill-32B (Q4 only)

Does not fit:

-R1-Distill-32B Q8 (needs ~33.5GB)
-R1-Distill-70B (needs 36.5GB)

Fits 32B at Q4 with ~14.5GB of headroom for long reasoning chains. 32B at Q8 (~33.5GB) and 70B at Q4 (~36.5GB) both exceed the 32GB limit.

Mac mini M4 16GB

Runs:

+R1-Distill-1.5B
+R1-Distill-7B (Q4 & Q8)
+R1-Distill-8B (Q4 & Q8)

Does not fit:

-R1-Distill-14B and larger

Unified memory means the GPU can draw on the same 16GB pool as the CPU, minus what macOS and other apps are using. Silent, efficient, great value for 7B reasoning.

Mac mini M4 24GB

Runs:

+R1-Distill-1.5B
+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B (Q4 & Q8)

Does not fit:

-R1-Distill-32B and larger

Extra 8GB vs the 16GB model unlocks 14B at full Q8 quality.

Mac mini M4 Pro 48GB

Runs:

+R1-Distill-1.5B
+R1-Distill-7B
+R1-Distill-8B
+R1-Distill-14B
+R1-Distill-32B (Q4 & Q8)

Does not fit:

-R1-Distill-70B

48GB comfortably fits the 32B at both Q4 and Q8. Bandwidth is ~273 GB/s, so generation is slower than on a high-end discrete GPU (~4 tok/s for 32B at Q4).

Mac Studio M4 Max 64GB

Runs:

+All distilled models including R1-Distill-70B (Q4)

Does not fit:

-R1-Distill-70B at Q8 (needs ~71.5GB)

The minimum practical option for the 70B distill. ~600 GB/s bandwidth gives solid reasoning chain speed.

Mac Studio M4 Max 128GB

Runs:

+All distilled models including R1-Distill-70B (Q4 & Q8)

Does not fit:

-Full 671B R1 (still not enough VRAM)

Runs the full distill lineup. 70B at Q8 fits with room to spare.
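The per-card lists above boil down to comparing each model's table figure with the memory you have. The sketch below is a hypothetical helper, not part of any tool: the dictionary copies the approximate figures from the requirements table (with the ~71.5 GB figure for 70B at Q8 taken from the Mac Studio card), and the check is a bare "does it fit" test with no extra headroom and no allowance for memory that macOS keeps for itself.

```python
# Approximate VRAM needs in GB, copied from this guide's requirements table.
VRAM_TABLE_GB = {
    "R1-Distill-Qwen-1.5B": {"Q4": 2.3, "Q8": 3.0},
    "R1-Distill-Qwen-7B": {"Q4": 5.5, "Q8": 8.5},
    "R1-Distill-Llama-8B": {"Q4": 6.0, "Q8": 9.5},
    "R1-Distill-Qwen-14B": {"Q4": 8.5, "Q8": 15.5},
    "R1-Distill-Qwen-32B": {"Q4": 17.5, "Q8": 33.5},
    "R1-Distill-Llama-70B": {"Q4": 36.5, "Q8": 71.5},
}


def models_that_fit(memory_gb: float, quant: str = "Q4") -> list[str]:
    """Return the models whose table figure is at most memory_gb."""
    return [
        name
        for name, need in VRAM_TABLE_GB.items()
        if need[quant] <= memory_gb
    ]


for memory in (8, 12, 16, 24, 32):
    largest_q4 = models_that_fit(memory, "Q4")[-1]
    largest_q8 = models_that_fit(memory, "Q8")[-1]
    print(f"{memory}GB  largest at Q4: {largest_q4}  largest at Q8: {largest_q8}")
```

Treat a model that passes this check by only a few hundred megabytes as a poor fit in practice, since the KV cache needs room too.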

Reasoning Tokens: Why DeepSeek R1 Needs More VRAM Than You Expect

DeepSeek R1 models use chain-of-thought reasoning. Before outputting an answer, the model generates "thinking" tokens that work through the problem step by step. Each of those tokens takes up space in the KV cache, the memory buffer that stores the attention keys and values of every previous token so they do not need to be recomputed.

KV cache grows with every token

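Each reasoning step adds to the KV cache. A complex coding or math problem can generate 500-2,000 thinking tokens before the answer begins. The cache grows linearly with token count, and the per-token cost scales with the model's layer and KV-head count, so the extra memory is modest on the 1.5B and 7B distills and can reach several GB on the 32B and 70B distills at long context.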

Factor in 1.5-2x more VRAM per query

The base VRAM figures in the table above cover model weights plus a fixed 1.5 GB overhead; they do not grow with context. During active inference on complex prompts, peak VRAM usage can be 1.5-2x higher. Leave headroom: do not run a model that barely fits.

Responses are slower per query

Even at the same tokens-per-second speed, total response time is longer because the model generates more tokens. A response from the 7B R1 distill may take 3-4x longer than a Llama 3.1 8B response to the same prompt.
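The arithmetic behind that slowdown is simple enough to sketch. The token counts and the 30 tok/s speed below are made-up illustrative values, not measurements, and the function name is hypothetical.

```python
def response_seconds(answer_tokens: int, thinking_tokens: int,
                     tokens_per_second: float) -> float:
    """Wall-clock generation time when every token costs the same."""
    return (answer_tokens + thinking_tokens) / tokens_per_second


# Illustrative values only: a 300-token answer at 30 tok/s,
# with and without 900 thinking tokens in front of it.
plain = response_seconds(300, 0, 30.0)
reasoning = response_seconds(300, 900, 30.0)
print(f"non-reasoning: {plain:.0f} s, reasoning: {reasoning:.0f} s, "
      f"ratio: {reasoning / plain:.1f}x")
```

With these numbers the reasoning response takes four times as long even though the per-token speed is identical.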

Context length multiplies this effect

Longer system prompts and conversation history also grow the KV cache. For extended reasoning sessions, 16K-32K context, or agentic pipelines, consider stepping up one VRAM tier, or dropping down one model size, to ensure stability.
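To see how context length drives KV cache size, here is a rough sketch using the standard per-token formula: 2 (keys and values) x layers x KV heads x head dimension x bytes per element. The architecture values below (64 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions chosen to resemble a large grouped-query-attention model; read the real values from the model's config file before relying on the result.

```python
def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_element: int = 2) -> float:
    """Estimate KV cache size in GiB for n_tokens of context."""
    # Factor 2 covers the separate key and value tensors in each layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return n_tokens * per_token_bytes / 1024**3


# Assumed architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for context in (4096, 16384, 32768):
    size = kv_cache_gib(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{context:>6} tokens -> {size:.1f} GiB of KV cache")
```

The cache grows linearly with tokens, so a session that is eight times longer needs eight times the cache. Runtimes that quantize the KV cache lower `bytes_per_element` and shrink the total accordingly.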

Practical rule: if the model weight fits in VRAM with less than 20% headroom, you will likely see out-of-memory errors or extreme slowdowns on complex reasoning tasks. See the VRAM Calculator to estimate KV cache at your context length.
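The 20% rule can be expressed as a quick check. This sketch assumes headroom is measured as a fraction of total VRAM, which is one reasonable reading of the rule; the function names are hypothetical and the sizes come from the requirements table in this guide.

```python
def headroom_fraction(model_gb: float, vram_gb: float) -> float:
    """Fraction of total VRAM left free after loading the model."""
    return (vram_gb - model_gb) / vram_gb


def is_comfortable(model_gb: float, vram_gb: float,
                   min_headroom: float = 0.20) -> bool:
    """True when free VRAM meets the 20% headroom rule of thumb."""
    return headroom_fraction(model_gb, vram_gb) >= min_headroom


# (label, model size in GB from the table, card VRAM in GB)
cases = [
    ("7B Q4 on 8GB", 5.5, 8),
    ("14B Q4 on 12GB", 8.5, 12),
    ("14B Q8 on 16GB", 15.5, 16),
    ("32B Q4 on 24GB", 17.5, 24),
]
for label, model_gb, vram_gb in cases:
    pct = headroom_fraction(model_gb, vram_gb) * 100
    verdict = "ok" if is_comfortable(model_gb, vram_gb) else "too tight"
    print(f"{label}: {pct:.0f}% headroom, {verdict}")
```

By this measure the 14B at Q8 on a 16GB card is the one combination in the list that loads but leaves too little room for long reasoning chains.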

Inference Speed by Hardware

Token generation speed is bottlenecked by memory bandwidth. Faster bandwidth means faster reasoning chains. The table below shows estimated Q4_K_M speeds. Note that for reasoning models, these speeds apply to the full token stream including thinking tokens — total wall-clock time per query is longer.

Hardware	Bandwidth	7B Q4 tok/s	14B Q4 tok/s	32B Q4 tok/s
RTX 5090 32GB	1,792 GB/s	~102 t/s	~53 t/s	~27 t/s
RTX 4090 24GB	1,008 GB/s	~58 t/s	~30 t/s	~15 t/s
RTX 4060 Ti 16GB	288 GB/s	~16 t/s	~9 t/s	—
Mac Studio M4 Max 64GB	~600 GB/s	~34 t/s	~18 t/s	~9 t/s
Mac mini M4 Pro 48GB	~273 GB/s	~15 t/s	~8 t/s	~4 t/s

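Speed estimates start from the bandwidth rule of thumb: tokens/sec ≈ bandwidth (GB/s) / model size in memory (GB). That ratio is a theoretical ceiling, because every generated token requires reading the full set of weights once; the table figures sit below it, and real-world results vary by software version and system configuration. Mac mini M4 Pro bandwidth estimated at ~273 GB/s. Mac Studio M4 Max bandwidth estimated at ~600 GB/s.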
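The bandwidth rule of thumb is easy to reproduce. The sketch below computes only the theoretical ceiling (bandwidth divided by model size in memory); it is not a benchmark, the function name is hypothetical, and measured speeds will come in lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Bandwidth figures (GB/s) as listed in this guide's speed table.
cards = {
    "RTX 5090 32GB": 1792,
    "RTX 4090 24GB": 1008,
    "Mac mini M4 Pro 48GB": 273,
}
MODEL_GB = 17.5  # 32B distill at Q4, from the requirements table

for name, bandwidth in cards.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s for a {MODEL_GB} GB model")
```

The ratio is most useful for comparing cards against each other: doubling bandwidth roughly doubles the ceiling for the same model.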

Where to Download and Run DeepSeek R1

Ollama

ollama run deepseek-r1:7b

Easiest option. Once Ollama is installed, one command downloads and runs the model. Use the :1.5b, :7b, :8b, :14b, :32b, or :70b tags. The GPU is auto-detected on NVIDIA, AMD (ROCm), and Apple Silicon.
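If you want to measure tokens per second on your own hardware, Ollama also exposes a local HTTP API. The sketch below assumes an Ollama server running on its default port (11434) with the deepseek-r1:7b tag already pulled; the helper names are made up for this example, and the `eval_count` and `eval_duration` fields are the generation counters Ollama reports in its non-streaming response, with durations in nanoseconds.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "deepseek-r1:7b") -> dict:
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(result: dict) -> float:
    """Generated tokens divided by generation time (reported in ns)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "deepseek-r1:7b") -> dict:
    """Send the prompt to the local server and return the parsed JSON."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `result = generate("Why is the sky blue?")` followed by `print(tokens_per_second(result))` gives a measured speed that includes the thinking tokens, which is the number that matters for a reasoning model.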

LM Studio

Search "deepseek-r1" in Discover

GUI-based model browser and chat interface. Download GGUF quantizations directly from within the app. Best for non-technical users. Runs on Windows, Mac, and Linux.

Hugging Face + llama.cpp

deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Download GGUF files from the bartowski or unsloth community repos on Hugging Face. Run with llama.cpp for maximum control over quantization and context length.

For step-by-step installation instructions, see the how to run LLMs locally guide. For AMD GPU setup with ROCm, see the AMD ROCm guide.

Which Hardware Should You Buy for DeepSeek?

Entry budget

RTX 4060 8GB

Runs the 7B and 8B distills at Q4. Solid reasoning capability for coding, writing, and math at a budget price. Upgrade if you want 14B.

Budget, 14B target

Intel Arc B580 12GB

Best value path to the 14B distill. 12GB fits 14B at Q4. Arc software support (including Ollama) has improved significantly, but Arc does not use ROCm, so verify driver and backend compatibility before buying.

Mid-range sweet spot

RTX 4060 Ti 16GB

Best bang for DeepSeek use. 16GB runs the 14B at Q8 (~15.5 GB), the highest-quality reasoning you get at this price, though with almost no headroom; drop to Q4 for long reasoning chains. Strongly recommended.

High end

RTX 4090 24GB

Best single-GPU card for DeepSeek at 24GB. Runs the 32B distill at Q4, the largest distill that fits on a single 24GB card. Fast bandwidth (~1,008 GB/s) keeps thinking chains snappy.

Mac ecosystem

Mac mini M4 24GB

Runs 14B at Q8 silently and efficiently. Unified memory avoids the VRAM vs RAM split. For 32B, step up to Mac mini M4 Pro 48GB. For 70B, Mac Studio M4 Max 64GB.

For a full cross-budget GPU comparison, see the best GPU for LLMs guide.

Related Resources

Best GPU for LLMs — Full Guide

All budget tiers from entry to workstation

LLM Quantization Explained

Q4 vs Q8 vs FP16 — when quality trade-offs matter

How to Run LLMs Locally

Step-by-step Ollama, LM Studio, llama.cpp setup

Apple Silicon for LLMs

M4, M4 Pro, M4 Max — which Mac for DeepSeek?

Ollama vs LM Studio

Which tool to use for running DeepSeek locally

CPU-Only Inference

Run DeepSeek-R1-Distill-8B without a GPU

AMD GPU ROCm Guide

Run DeepSeek on RX 7900 XTX with ROCm

Best LLMs to Run Locally

DeepSeek vs Qwen3, Phi-4, Llama 3.3 — 2026 picks

Qwen3 Hardware Requirements

Compare DeepSeek R1 Distill vs Qwen3 thinking

Llama 3.3 70B Hardware Guide

Hardware for the best open-weights 70B model

Qwen3 vs DeepSeek vs Llama 3.3

Decide which model family fits your hardware

How to Run DeepSeek R1 Locally

Step-by-step Ollama setup guide for all hardware tiers

DeepSeek V3 Hardware Requirements

Can you run DeepSeek V3 locally? 685B model needs 390 GB VRAM — and what to use instead.

VRAM Calculator

Calculate exact VRAM at your context length

Frequently Asked Questions

What GPU do I need to run DeepSeek R1 locally?

It depends on which distilled model you want. DeepSeek-R1-Distill-7B at Q4 needs ~5.5 GB VRAM — an RTX 4060 8GB handles it. The 14B at Q4 needs ~8.5 GB — the Intel Arc B580 12GB or RTX 4070 12GB work. The 32B at Q4 needs ~17.5 GB — RTX 4090 24GB or better. The 70B distill needs ~36.5 GB, requiring a Mac Studio M4 Max 64GB or a dual-GPU setup. The full 671B R1 is not practical on consumer hardware.

Can I run DeepSeek R1 on an 8GB GPU?

Yes, with the distilled models. DeepSeek-R1-Distill-7B at Q4_K_M needs about 5.5 GB VRAM and runs on an RTX 4060 8GB. The 1.5B distill needs only ~2.3 GB at Q4 and runs on virtually any GPU with 4GB+. You cannot run the 14B, 32B, or larger models on 8GB VRAM.

Why is DeepSeek R1 slower than regular LLMs of the same size?

DeepSeek R1 uses chain-of-thought reasoning, generating internal "thinking" tokens before giving its final answer. A single query can trigger hundreds to thousands of thinking tokens. Total token generation per query is 2-5x higher than a non-reasoning model of the same size, so each response takes noticeably longer even at the same tokens-per-second speed.

What is the difference between DeepSeek R1 and the distilled models?

The full DeepSeek R1 is a 671-billion parameter Mixture-of-Experts model requiring data-center hardware. The distilled models (1.5B to 70B) are smaller dense models trained using knowledge distillation from the full R1, transferring its reasoning capability into a far smaller footprint. The distilled models are what most people run locally. They retain strong reasoning ability despite being orders of magnitude smaller.

Can a Mac mini run DeepSeek R1?

Yes. The Mac mini M4 16GB runs the 1.5B and 7B distills well. The Mac mini M4 24GB adds the 14B at Q8. The Mac mini M4 Pro 48GB fits the 32B at Q4 and Q8. Apple Silicon uses unified memory, so the GPU shares one pool with the CPU, minus whatever macOS and other apps need. For the 70B distill, you need a Mac Studio M4 Max with at least 64GB.



Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide I leaned on:

Hugging Face Hub. Official DeepSeek model cards (V2, V3, R1, Coder variants) for parameter and context-length numbers.
Modal: How much VRAM do I need for LLM inference. VRAM-budget formula used to size each DeepSeek quant against real GPUs.
XiongjieDai GPU-Benchmarks-on-LLM-Inference. Independent tokens-per-second runs for DeepSeek quants across NVIDIA and Apple silicon.

Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.