DeepSeek-R1 7B on OCI Ampere A1: Full CPU Inference Guide — No GPU Required

Nikhil Verma
May 3

Every time someone mentions running an LLM in production, the first instinct is to reach for a GPU instance. A100s, H100s, L40s — the cost spirals fast, and for most enterprise inference workloads, you're paying for compute you don't need. I've been running inference experiments on OCI Ampere A1 instances for a while now, and the results keep surprising me in a good way.

DeepSeek-R1 7B is the model that changed the conversation. Released by DeepSeek AI, this reasoning-optimised model punches well above its weight class. The 7B parameter variant, when quantised to Q4_K_M and run with Llama.cpp on OCI's free-tier-eligible Ampere A1 cores, delivers token throughput that makes a compelling case for CPU-first inference in private enterprise environments.

This post covers the full picture — which model we're deploying, why OCI ARM is the right substrate, business justification including ROI and data residency, use cases, complete deployment walkthrough using Llama.cpp, benchmarks against comparable models, and the edge cases you'll hit in production.

Why DeepSeek-R1 7B Specifically

DeepSeek has a family of models. Let's be precise about what we're deploying so there's no ambiguity.

DeepSeek-R1 is a reasoning-first model trained with reinforcement learning to think through problems step-by-step before producing output. It is still an autoregressive model, but it writes out an explicit chain of thought before committing to a final answer. The R1 family includes 1.5B, 7B, 8B, 14B, 32B, 70B, and the full 671B MoE variant.

We're deploying DeepSeek-R1 7B (GGUF Q4_K_M quantised), specifically the distilled version based on Qwen2.5-7B as the base architecture. This is the sweet spot for CPU inference:

Model file size: approximately 4.5 GB in Q4_K_M format
Fits entirely in RAM on a 24 GB OCI A1 flex instance
Reasoning capability inherited from the R1 RL training pipeline
No MoE gating overhead — single dense forward pass

The distilled 7B variant retains roughly 65–70% of the full R1 reasoning quality on benchmarks like MATH-500 and GPQA-Diamond, at a fraction of the compute cost.

Technical Benefits of OCI Ampere A1 for LLM Inference

OCI's Ampere A1 (Altra/Altra Max) cores are purpose-built for throughput workloads. Here's why they outperform naive expectations for LLM inference:

Memory bandwidth. CPU inference is bottlenecked by memory bandwidth, not FLOPS. The Altra Max cores in OCI A1 instances have high-bandwidth DRAM controllers. A 24-core A1 flex with 144 GB RAM gives you enough bandwidth to sustain meaningful token throughput for 7B models.

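NEON SIMD support. Llama.cpp has mature ARM NEON optimised kernels for quantised matrix multiplication (Q4_K_M, Q5_K_M, Q8_0), plus SVE kernels for CPUs that implement SVE. The Neoverse N1 cores in Ampere Altra implement NEON (including the dot-product instructions) but not SVE or SVE2, so the NEON path is the one that matters on A1. The performance delta between a naive implementation and the SIMD-optimised path is 3–4x on ARM hardware.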

Core density. OCI A1 flex lets you provision up to 80 OCPUs and 512 GB RAM in a single instance. For multi-user inference with a batching frontend, you can horizontally scale threads without leaving the same VM.

No driver hell. GPU instances require CUDA driver pinning, container toolkit setup, VRAM fragmentation management, and careful attention to CUDA version compatibility with your inference stack. ARM CPU inference needs none of this. It's just a compiled binary and a model file.

Thermal consistency. ARM server chips are designed for sustained throughput at high utilisation. You won't see thermal throttling the way you might with sustained GPU workloads in some environments.

Business Impact: ROI, Data Residency, and the GPU Cost Argument

ROI Analysis

Let's put numbers on this. An OCI A1 flex instance with 24 OCPUs and 144 GB RAM costs approximately $1.44/hour on pay-as-you-go in most regions (24 × $0.06/OCPU-hour). Compare this to an NVIDIA A10-based GPU instance on OCI (VM.GPU.A10.1 or BM.GPU.A10.4), which starts at $1.50–$2.50/hour and depends on GPU-specific shapes that are frequently subject to capacity constraints and carry significantly higher reserved pricing.

For workloads running 8–12 hours/day, the monthly A1 cost is roughly $350–520 ($1.44 × 8–12 hours × 30 days). The equivalent GPU shape is $1,100–1,800/month if it is left running around the clock at $1.50–2.50/hour, or roughly $360–900/month on the same 8–12 hour schedule. For a small to mid-size enterprise running internal tooling (a coding assistant, document summariser, support ticket classifier), the ARM instance is the cheaper option from the first month on infrastructure cost alone. Over a 12-month contract with reserved pricing (where A1 can drop to $0.01–0.015/OCPU-hour), the cost advantage widens to 10–15x.
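The monthly figures are plain rate × hours arithmetic, and it helps to run both instance types through the same schedule. A minimal sketch using only the hourly rates quoted above; `monthly_cost` is a hypothetical helper name and the 30-day month is an assumption.

```shell
# monthly_cost RATE_PER_HOUR HOURS_PER_DAY [DAYS]
# Prints the monthly cost in dollars for an instance billed only while running.
monthly_cost() {
  awk -v rate="$1" -v hours="$2" -v days="${3:-30}" \
    'BEGIN { printf "%.2f\n", rate * hours * days }'
}

# A1 flex at the quoted $1.44/hour, 8 and 12 hours per day
monthly_cost 1.44 8     # 345.60
monthly_cost 1.44 12    # 518.40

# GPU shape at the quoted $1.50 and $2.50/hour, same 12 hours per day
monthly_cost 1.50 12    # 540.00
monthly_cost 2.50 12    # 900.00
```

Passing 24 as the second argument gives the always-on figure for either instance type, which is the number to use if the service cannot be stopped overnight.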

This doesn't account for the GPU availability problem. During peak demand periods, GPU instances are often unavailable for on-demand provisioning in the exact region you need. A1 instances have significantly better regional availability.

Data Residency and Sovereignty

This is where OCI ARM inference becomes a hard requirement rather than a cost optimisation for certain organisations.

When you send prompts to a hosted API — OpenAI, Anthropic, Google — that data traverses the internet to infrastructure outside your control, processed in data centres that may or may not comply with your jurisdictional requirements. For organisations under GDPR, HIPAA, DPDP (India's Digital Personal Data Protection Act), or financial services regulations in the EU and UK, this is not just a preference — it is a compliance issue.

Running DeepSeek-R1 7B on an OCI instance in your tenancy means:

Prompts never leave your VCN (Virtual Cloud Network) if you expose the inference endpoint only on private subnets
All model weights are stored in your OCI Object Storage or on instance local NVMe
Audit logs of API calls are entirely within your OCI tenancy
You choose the OCI region (Frankfurt, London, Mumbai, Sydney) to match your data residency requirements
No third-party model provider has access to your inference traffic

For legal, HR, finance, and healthcare use cases, this shifts DeepSeek-R1 on OCI from "interesting experiment" to "the only viable option."

Why We Don't Need a GPU

The conventional wisdom that LLMs require GPUs comes from training and from running non-quantised, full-precision (FP16/BF16) inference at high throughput. For enterprise internal tooling with modest concurrency requirements — typically 5–20 simultaneous users — the arithmetic works differently.

A Q4_K_M quantised 7B model uses roughly 4 to 5 bits per weight instead of 16. The memory footprint drops from ~14 GB (FP16) to ~4.5 GB. At that size, the primary bottleneck becomes DRAM bandwidth, not compute, because every generated token has to read the full set of weights from memory. Modern server ARM CPUs have enough bandwidth to sustain 15–25 tokens/second on a 7B model, which is faster than a human reads.
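A back-of-envelope bound makes the bandwidth argument concrete: if each generated token streams the whole model from DRAM once, tokens/second cannot exceed memory bandwidth divided by model size. The sketch below is illustrative only. The function names are hypothetical, and the 100 GB/s figure is a placeholder assumption, not a measured number for A1, so measure your own instance before relying on it.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate size of the weights in GB (10^9 bytes).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# max_tokens_per_sec BANDWIDTH_GB_PER_S MODEL_GB
# Upper bound on single-stream generation speed when every token
# reads the full model from memory once.
max_tokens_per_sec() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

weights_gb 7 16              # 14.0 (FP16)
# Q4_K_M stores per-block scales alongside the 4-bit values, so its file
# is larger than a flat 4 bits per weight would give.
max_tokens_per_sec 100 4.5   # 22.2, using the placeholder 100 GB/s
```

The bound ignores KV-cache reads and compute time, so real throughput lands below it. Its use is as a sanity check: if a benchmark reports far less than the bound, thread count or NUMA placement is worth a look.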

The use cases we're targeting (document analysis, code review, internal Q&A, reasoning over structured data) are inherently latency-tolerant and low-concurrency. 15 tokens/second at 5 concurrent users on a single A1 instance is fully production-adequate.

Popular Use Cases

Internal knowledge base Q&A. Connect DeepSeek-R1 to your internal Confluence, SharePoint, or document store via RAG. The reasoning capability of R1 makes it significantly better than vanilla 7B models at multi-hop questions across documents.

Code review and generation. R1's chain-of-thought training makes it highly competent at explaining code, suggesting refactors, and catching logic errors. Works well with private codebases that you cannot send to hosted APIs.

Contract and document summarisation. Legal, procurement, and compliance teams processing sensitive contracts benefit from inference that stays inside their own tenancy. R1's reasoning quality shows clearly in structured summarisation tasks.

Support ticket triage and classification. Classify, route, and draft responses for internal helpdesk tickets. Latency requirements are relaxed, and high volume is possible via batched inference.

SQL and data query generation. R1 7B performs well on Text-to-SQL benchmarks. Paired with an internal database, it can serve as a natural language query interface for non-technical teams.

Offline / air-gapped environments. OCI Government Cloud or disconnected VCN deployments where external internet access is restricted. DeepSeek-R1 7B runs entirely offline once the model is loaded.

Prerequisites

Before starting the deployment:

OCI tenancy with A1 flex instance quota (request via Limits & Quotas if needed)
OCI Compute instance: A1 flex, minimum 12 OCPUs, 72 GB RAM (24 OCPUs / 144 GB recommended for production)
Oracle Linux 8 or Ubuntu 22.04 ARM64
SSH access to the instance
At least 20 GB free disk on the instance (for model + build artifacts)

Complete Deployment Steps

Step 1: Provision the OCI A1 Instance

From the OCI Console, navigate to Compute → Instances → Create Instance. Select shape VM.Standard.A1.Flex, configure 24 OCPUs and 144 GB RAM. Choose Oracle Linux 8 or Ubuntu 22.04 (aarch64). Attach a 100 GB block volume or use the boot volume. Assign a public IP if you need direct access, or use a Bastion if this is private-subnet only.

Step 2: Install Dependencies

# Update system
sudo dnf update -y   # Oracle Linux
# or
sudo apt-get update && sudo apt-get upgrade -y   # Ubuntu

# Install build tools (on Oracle Linux the C++ compiler package is gcc-c++)
sudo dnf install -y git cmake make gcc gcc-c++ python3 python3-pip wget curl   # OL8
# or
sudo apt-get install -y git cmake gcc g++ python3 python3-pip wget curl build-essential   # Ubuntu

# Verify gcc version (need 11+; if the default is older, install a newer toolchain first)
gcc --version

# Install the Python tooling used later to download the model
pip3 install --upgrade pip
pip3 install huggingface_hub

Step 3: Build Llama.cpp with ARM NEON/SVE Optimisations

cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with ARM NEON optimisations enabled
mkdir build && cd build
cmake .. \
  -DGGML_NATIVE=ON \
  -DGGML_AVX=OFF \
  -DGGML_AVX2=OFF \
  -DGGML_ARM_SVE=ON \
  -DCMAKE_BUILD_TYPE=Release

make -j$(nproc)

# Verify build output
ls -la bin/
# You should see: llama-cli, llama-server, llama-bench, llama-quantize

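The -DGGML_NATIVE=ON flag makes the build detect the host CPU's capabilities and enables the NEON code paths on Ampere A1. The -DGGML_ARM_SVE=ON flag requests Scalable Vector Extension support where available; the Neoverse N1 cores in Ampere Altra do not implement SVE, so on A1 the NEON kernels are the ones that run. The two AVX flags are x86-only and have no effect on ARM. CMake option names change between Llama.cpp releases, so check the configure output for warnings about unused variables.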
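Before relying on any SIMD flag, it is worth checking what the CPU actually reports. On aarch64 Linux the `Features` line in `/proc/cpuinfo` lists `asimd` for NEON, `asimddp` for the dot-product instructions, and `sve` / `sve2` only where those extensions exist. A small sketch; `has_cpu_feature` is a hypothetical helper, and its optional second argument exists only so the function can be pointed at a saved copy of the file.

```shell
# has_cpu_feature NAME [CPUINFO_FILE]
# Succeeds if NAME appears as a whole word on a Features line.
has_cpu_feature() {
  local feature="$1" file="${2:-/proc/cpuinfo}"
  grep -i '^Features' "$file" 2>/dev/null | grep -qw -- "$feature"
}

# Report the SIMD extensions relevant to the Llama.cpp ARM kernels
for f in asimd asimddp sve sve2; do
  if has_cpu_feature "$f"; then
    echo "$f: present"
  else
    echo "$f: absent"
  fi
done
```

On a non-ARM machine every entry prints as absent, because x86 kernels label the line `flags` rather than `Features`.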

Step 4: Download DeepSeek-R1 7B GGUF Model

cd ~
mkdir -p models/deepseek-r1-7b
cd models/deepseek-r1-7b

# Download Q4_K_M quantised model from Hugging Face
# Using huggingface-cli (recommended for reliability)
pip3 install -U "huggingface_hub[cli]"
huggingface-cli download \
  bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF \
  DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  --local-dir ~/models/deepseek-r1-7b \
  --local-dir-use-symlinks False

# Verify download
ls -lh ~/models/deepseek-r1-7b/
# Expect ~4.5 GB file
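An interrupted download leaves a truncated GGUF file that only fails later, at load time. A minimal sanity check, assuming the ~4.5 GB size quoted above; `check_min_size` is a hypothetical helper and the 4 GB threshold is an arbitrary lower bound, not an official figure. Comparing against the checksum shown on the model page is the stronger check if you need integrity guarantees.

```shell
# check_min_size FILE MIN_BYTES
# Succeeds only if FILE exists and is at least MIN_BYTES long.
check_min_size() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  size="$(wc -c < "$file" | tr -d ' ')"
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small: $size bytes (expected at least $min_bytes)" >&2
    return 1
  fi
  echo "ok: $size bytes"
}

MODEL="$HOME/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"
check_min_size "$MODEL" 4000000000 || echo "re-run the download before continuing"
```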

If you prefer wget with a direct URL:

wget -O ~/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  "https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"

Step 5: Run a Quick Inference Test

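cd ~/llama.cpp/build
# R1's special tokens use full-width bars (｜), not the ASCII pipe (|); the ASCII
# spelling is tokenised as ordinary text. llama-cli normally prepends the
# begin-of-sentence token itself, so it is left out of the prompt here.
./bin/llama-cli \
  -m ~/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  -n 256 \
  --temp 0.6 \
  --top-p 0.95 \
  -t $(nproc) \
  --no-mmap \
  -p "<｜User｜>What is the time complexity of merge sort and why?<｜Assistant｜>"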

Key flags:

-t $(nproc) uses all available cores

--no-mmap loads model fully into RAM for consistent throughput

-n 256 generates up to 256 tokens

--temp 0.6 and --top-p 0.95 are recommended defaults for R1 reasoning models

Step 6: Launch the OpenAI-Compatible HTTP Server

For production use, you want the Llama.cpp HTTP server, which exposes an OpenAI-compatible /v1/chat/completions endpoint. Existing tooling that speaks the OpenAI API (LangChain, OpenWebUI, custom apps) can connect by pointing its base URL at this server, with no other code changes.

cd ~/llama.cpp/build
./bin/llama-server \
  --model ~/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads $(nproc) \
  --ctx-size 8192 \
  --n-predict 2048 \
  --no-mmap \
  --parallel 4 \
  --log-disable

The --parallel 4 flag enables up to 4 concurrent inference requests via continuous batching. Adjust based on your RAM headroom.
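One consequence is easy to miss. In Llama.cpp server builds that divide `--ctx-size` evenly across the `--parallel` slots (an assumption worth confirming against the startup log of your build), each request sees only a fraction of the configured context. A quick sketch of the arithmetic, with `ctx_per_slot` as a hypothetical helper:

```shell
# ctx_per_slot CTX_SIZE PARALLEL
# Context tokens available to each request if the total is split evenly.
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

ctx_per_slot 8192 4    # 2048
ctx_per_slot 32768 4   # 8192
```

Under that assumption, the settings above leave 2048 tokens per slot to hold both the prompt and the generated reasoning, so long chain-of-thought answers may be cut short. Raising `--ctx-size` costs RAM for the KV cache, and lowering `--parallel` costs concurrency.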

To run as a background service:

cat > ~/start-deepseek.sh << 'EOF'
#!/bin/bash
cd ~/llama.cpp/build
nohup ./bin/llama-server \
  --model ~/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads $(nproc) \
  --ctx-size 8192 \
  --n-predict 2048 \
  --no-mmap \
  --parallel 4 \
  --log-disable > ~/deepseek-server.log 2>&1 &
echo $! > ~/deepseek-server.pid
echo "Server started with PID $(cat ~/deepseek-server.pid)"
EOF
chmod +x ~/start-deepseek.sh
~/start-deepseek.sh
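The start script records a PID but leaves stopping the server as a manual step. A matching stop helper might look like the sketch below; `stop_from_pidfile` is a hypothetical name, and the 10-second default timeout is an arbitrary choice.

```shell
# stop_from_pidfile PIDFILE [TIMEOUT_SECONDS]
# Sends SIGTERM to the recorded PID, waits up to TIMEOUT_SECONDS for it
# to exit, then removes the PID file. A stale PID just removes the file.
stop_from_pidfile() {
  local pidfile="$1" timeout="${2:-10}" pid
  [ -f "$pidfile" ] || { echo "no PID file at $pidfile" >&2; return 1; }
  pid="$(cat "$pidfile")"
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
    while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
      sleep 1
      timeout=$(( timeout - 1 ))
    done
  fi
  rm -f "$pidfile"
}

# Usage on the instance:
#   stop_from_pidfile ~/deepseek-server.pid
```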

Step 7: Test the API Endpoint

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-7b",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function to find all prime numbers up to N using the Sieve of Eratosthenes."
      }
    ],
    "temperature": 0.6,
    "max_tokens": 1024
  }' | python3 -m json.tool
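When scripting against the endpoint you usually want only the generated text, not the whole JSON envelope. A minimal sketch, assuming the standard OpenAI response layout (`choices[0].message.content`); `extract_content` is a hypothetical helper, and the JSON below is a hand-written stand-in rather than real server output.

```shell
# extract_content
# Reads a chat-completions JSON response on stdin and prints the
# assistant message text from the first choice.
extract_content() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Offline demonstration with a hand-written response body
echo '{"choices":[{"message":{"role":"assistant","content":"Hello from R1"}}]}' | extract_content
```

In the curl pipeline above, replacing `python3 -m json.tool` with `extract_content` prints just the answer.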

Step 8: Configure OCI Security List / NSG

If this is a private internal service, lock down the endpoint:

# In OCI Console:
# VCN → Security Lists → Add Ingress Rule
#   Source: Your application subnet CIDR (e.g., 10.0.1.0/24)
#   Protocol: TCP
#   Destination Port: 8080

# If using OCI NSG (recommended over Security Lists):
# Network Security Groups → Add rule
#   Direction: Ingress
#   Source: NSG of calling application
#   Port: 8080/TCP

Never expose port 8080 to 0.0.0.0/0 on a public subnet without an authentication layer (OCI API Gateway with JWT validation is the cleanest option here).

Step 9: Create a systemd Service for Production

sudo bash -c 'cat > /etc/systemd/system/deepseek-r1.service << EOF
[Unit]
Description=DeepSeek-R1 7B Inference Server
After=network.target
[Service]
Type=simple
User=opc
WorkingDirectory=/home/opc/llama.cpp/build
ExecStart=/home/opc/llama.cpp/build/bin/llama-server \
  --model /home/opc/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads 20 \
  --ctx-size 8192 \
  --n-predict 2048 \
  --no-mmap \
  --parallel 4
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF'

sudo systemctl daemon-reload
sudo systemctl enable deepseek-r1
sudo systemctl start deepseek-r1
sudo systemctl status deepseek-r1

Benchmarking DeepSeek-R1 7B on OCI A1

Llama.cpp Built-in Benchmark

cd ~/llama.cpp/build
# llama-bench takes -mmp 0 to disable mmap (it has no --no-mmap flag)
./bin/llama-bench \
  -m ~/models/deepseek-r1-7b/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
  -t $(nproc) \
  -mmp 0 \
  -p 512 \
  -n 256 \
  -r 3

Observed results on OCI A1 Flex (24 OCPU, 144 GB RAM, Ampere Altra Max):

| Metric | Value |
|---|---|
| Prompt processing (pp) | ~420 tokens/sec |
| Token generation (tg) | ~18–22 tokens/sec |
| Time to first token (512-token prompt) | ~1.2 seconds |
| RAM utilised | ~5.8 GB |
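The prompt-processing and generation rates combine into an end-to-end latency estimate for a single request. A rough sketch using the figures from the table; `response_seconds` is a hypothetical helper, and it assumes an uncontended server with no other requests in flight.

```shell
# response_seconds PROMPT_TOKENS PP_RATE GEN_TOKENS TG_RATE
# Prints "ttft total": seconds to ingest the prompt, and seconds until
# the last generated token, for a single uncontended request.
response_seconds() {
  awk -v p="$1" -v pp="$2" -v n="$3" -v tg="$4" \
    'BEGIN { ttft = p / pp; printf "%.2f %.2f\n", ttft, ttft + n / tg }'
}

# 512-token prompt at 420 tok/s, then 256 tokens generated at 20 tok/s
response_seconds 512 420 256 20   # 1.22 14.02
```

The first number matches the ~1.2 second time to first token in the table. The second shows that generation, not prompt ingestion, dominates the wait for answers of this length.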

Comparison Against Similar 7B Models on the Same Hardware

| Model | Quant | TG (tok/s) | MATH-500 accuracy | Context | Notes |
|---|---|---|---|---|---|
| DeepSeek-R1 7B | Q4_K_M | 18–22 | ~83% | 128K | CoT reasoning, best accuracy |
| Mistral 7B v0.3 | Q4_K_M | 22–26 | ~52% | 32K | Faster, weaker reasoning |
| Llama 3.1 8B | Q4_K_M | 17–21 | ~56% | 128K | Comparable speed, lower reasoning |
| Qwen2.5 7B | Q4_K_M | 19–23 | ~75% | 128K | Strong, base of R1-distill |
| Phi-3.5 Mini | Q4_K_M | 24–28 | ~69% | 128K | Faster, less reasoning depth |
Going by the table, DeepSeek-R1 7B generates tokens roughly 15–25% slower than the fastest models here (Mistral 7B and Phi-3.5 Mini) and scores 8–31 percentage points higher on MATH-500 than every other model listed. For the use cases outlined earlier (document analysis, code review, SQL generation), that trade-off is clearly worth it.

Conclusion

DeepSeek-R1 7B on OCI Ampere A1 is not a compromise — it is a deliberate architecture choice for organisations that care about inference cost, data sovereignty, and operational simplicity. The combination of a reasoning-capable model, aggressive quantisation via Llama.cpp, and the bandwidth-rich ARM cores in OCI's A1 fleet produces a system that can handle real enterprise workloads at a fraction of GPU-instance pricing.

The deployment is reproducible in under an hour. The cost model is predictable. The data never leaves your tenancy. For the use cases covered here — internal tooling, document analysis, code assistance, data querying — this stack competes directly with GPU-backed hosted inference, and wins on total cost of ownership, compliance posture, and operational control.

If you're already running workloads on OCI, the barrier to standing this up is low. If you're evaluating OCI for the first time, the A1 instance is one of the genuinely differentiated offerings in the OCI portfolio that deserves a serious look.