High-Bandwidth AMD Server Build for Local LLM Inference
(CPU-Optimized NUMA Build Guide)

Running very large language models locally presents a fundamental challenge:
fast inference requires massive memory bandwidth.
There are three primary architectural approaches to solving this problem:
- Stack multiple GPUs for high GDDR bandwidth
- Unified memory architectures (UMA) such as Apple Silicon
- High-channel-count CPU NUMA systems (dual-socket EPYC)
This guide focuses on the third approach: a dual-socket AMD EPYC Genoa system with 24 channels of DDR5-4800, delivering approximately 920 GB/s of theoretical memory bandwidth.
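That theoretical figure follows directly from channel count and transfer rate (each DDR5 channel is 64 bits wide, so 8 bytes per transfer); a quick sanity check:

```shell
# Peak DDR5 bandwidth = channels x MT/s x 8 bytes per 64-bit channel
channels=24
rate_mts=4800
bw_mbs=$((channels * rate_mts * 8))    # result in MB/s
echo "${bw_mbs} MB/s (~$((bw_mbs / 1000)) GB/s theoretical peak)"
```

Real-world STREAM-style benchmarks land below this peak, but it sets the ceiling that token generation speed is bound by.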
Approach Comparison
1. Multi-GPU Systems
Stacking GPUs provides extremely high memory bandwidth via GDDR, but introduces practical limitations.
Challenges
- Power consumption
- Easily exceeds what a standard household breaker can support
- Often requires 1600W+ PSUs or multiple power supplies
- Cooling requirements
- Significant airflow needed
- Systems become noisy and generate substantial heat
- PCIe lane limitations
- Limited slots and bandwidth
- Workarounds include risers, split slots, mining frames — but these increase complexity
- Cost
- High-VRAM GPUs are extremely expensive
- Example: 80GB-class cards can exceed $15,000
- VRAM constraints
- Even if the model fits, combining:
- Model weights
- Large context windows
- TTS models
- Image generation models
- ...often leads to out-of-memory errors
Advantages
- Extremely fast inference if fully contained within VRAM
- Excellent for training workloads
- Minimal need for offloading or quantization compromises
2. Apple Silicon (Unified Memory Architecture)
Apple Silicon provides large shared memory pools in a compact system.
Advantages
- Quiet and compact
- Stylish, well-integrated hardware
- Shared CPU/GPU memory model
Limitations
- Less widely supported by inference tooling overall
- Metal performance often below theoretical expectations
- Memory is soldered and not upgradeable
- Limited PCIe expandability
- Cannot fully utilize all available memory for inference
- Prompt processing performance can be slow
- macOS ecosystem constraints
- Cost comparable to other approaches, with less flexibility
3. CPU-Centric NUMA Architecture (EPYC)
This is the approach used in this build.
Overview
- Dual-socket AMD EPYC Genoa
- 24-channel DDR5-4800
- ~920 GB/s aggregate bandwidth
- Fully populated RAM slots
Upfront Investment
- Approximately $6,000 USD (at time of writing)
- Requires filling all memory channels for maximum bandwidth
- Upgrading memory typically means replacing all modules
Advantages
- 384GB / 768GB / 1.5TB+ RAM configurations are practical
- Extensive PCIe lanes for expansion
- Strong general-purpose compute (VMs, labs, parallel builds)
- GPU remains free for:
- Context processing
- TTS/STT
- Image generation
- Future GPU or accelerator upgrades remain possible
- Entire system runs comfortably on a 1000W PSU
- Can be built to run quietly with large, low-RPM fans
- Future EPYC upgrades possible as enterprise hardware depreciates
Limitations
- Large physical footprint
- SP5 platform components are physically massive
- Limited suitability for training (CPUs are not GPUs)
- Requires NUMA tuning for peak performance
Model Performance Expectations
70B-Class Models
- Miqu 70B Q5
- ~8 tokens/sec without special tuning
- Potentially 20+ tokens/sec with optimization
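These figures are consistent with a simple bandwidth-bound model: on CPU, every generated token must stream the full weight set from RAM once, so tokens/sec is capped at roughly bandwidth divided by model size. A rough sketch (model size approximate):

```shell
# Bandwidth-bound ceiling: tok/s <= bandwidth / bytes streamed per token
model_gb=44        # ~70B weights at ~5 bits/weight (Q5), approximate
bw_gbs=920         # theoretical aggregate bandwidth of this build
echo "~$((bw_gbs / model_gb)) tok/s upper bound"
```

Real throughput sits below this bound (NUMA cross-traffic, achievable vs theoretical bandwidth), which matches ~8 tok/s untuned and 20+ only after optimization.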
120B-Class Models
- Mistral Large and similar
- ~3 tokens/sec
Mixture-of-Experts (MoE) Models
These are a sweet spot for CPU systems.
Since only a subset of parameters activates per token, far less data must stream from RAM for each token generated:
- DeepSeek v3 / R1 (~600B class)
- ~10 tokens/sec with empty context
- Snowflake 480B-class models
- Mixtral 8x22B WizardLM variants
- Significantly faster due to smaller experts
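Because only the active parameters stream per token, the bandwidth-bound ceiling rises sharply. A sketch assuming ~37B active parameters and ~4-bit quantization (both numbers are approximations, not measured values):

```shell
active_b=37                          # ~37B active params per token (assumed)
gb_per_tok=$((active_b * 4 / 8))     # ~18 GB streamed per token at 4 bits
echo "~$((920 / gb_per_tok)) tok/s bandwidth ceiling"
```

The gap between that ceiling and the observed ~10 tok/s reflects expert routing overhead, NUMA cross-traffic, and the difference between theoretical and achievable bandwidth.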
405B Dense Models
- Require 424GB+ RAM minimum
- ~1 token/sec range
- This build can run them, but performance is modest
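A dense 405B model must stream every weight for every token, which is why throughput collapses. A rough estimate at 8 bits/weight:

```shell
size_gb=$((405 * 8 / 8))    # ~405 GB of weights at Q8; KV cache is extra
echo "weights ~${size_gb} GB, ceiling ~$((920 / size_gb)) tok/s"
```

Add KV cache and runtime overhead and you arrive at the 424GB+ figure, with observed throughput (~1 tok/s) sitting under the ~2 tok/s bandwidth ceiling.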
Is This Better Than a GPU Box?
For raw LLM inference speed alone, possibly not.
However, this configuration is easier to justify when:
- You need CPU cores for virtualization
- You require large memory pools
- You want PCIe expandability
- You want flexible storage and networking
- You are not primarily training models
Hardware Build (Rig 1)
Platform
- Motherboard: Gigabyte MZ73-LM1
- CPUs: Dual AMD EPYC 9334 (QS samples)
Note: Newer EPYC “Turin” CPUs support DDR5-6000/6400.
Firmware updates allow compatibility with this board.
Early engineering samples have appeared on eBay at ~$3k per socket.
Memory
- 24 × 32GB Samsung DDR5-4800 RDIMM
- Fully populated for maximum bandwidth
Power Supply
- 1000W quality PSU (example: FSP Hydro PTM X Pro)
Storage
- Onboard M.2 slots (NVMe only)
- 4× SATA breakout included
- SlimSAS 8i expansion allows up to 16 additional drives
Case
- Large airflow-focused chassis required
- SP5 sockets and RAM consume significant space
- Ensure large, slow-moving fans for quiet cooling
Cooling
- SP5-compatible coolers (4U server height recommended)
- Example: CoolServer SP5 coolers
- Idle temperatures near ambient
- ~30°C rise under full load
GPU (Optional but Recommended)
- 24GB GPU (example: NVIDIA A5000)
- Onboard video for console
- Use GPU for:
- TTS
- Stable Diffusion
- Context processing
- Acceleration tasks

Linux Configuration
Distribution
- Debian (Trixie), headless
- Kernel 6.6+ strongly recommended
- 6.12 currently performs best in testing
Critical Tuning
- Disable Transparent Hugepages for stability under memory pressure
- Reverse proxy (nginx) with backend services firewalled
- Avoid exposing inference services to the internet
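Disabling THP at runtime is a one-liner per sysfs knob (run as root; persist it via your init system or the `transparent_hugepage=never` kernel parameter):

```shell
# Disable Transparent Hugepages until the next reboot (requires root)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Verify: the active setting is shown in [brackets]
cat /sys/kernel/mm/transparent_hugepage/enabled
```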
CUDA Compilation Notes
- Modern distros ship gcc13/14
- Some CUDA toolkits require forcing gcc 12/13
- May require editing package dependencies (libtinfo versions)
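One way to pin the CUDA host compiler without touching system defaults is to point the build at an older gcc explicitly; the package names below are illustrative for Debian:

```shell
sudo apt install gcc-12 g++-12
# Per-invocation:
nvcc -ccbin g++-12 kernel.cu -o kernel
# Or for CMake-based projects such as llama.cpp:
cmake -B build -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12
```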
BIOS Settings
- Use UEFI, not legacy BIOS
- Ensure xGMI link at full speed (4×)
- Disable unused devices to free PCIe lanes
- Configure NUMA:
- NPS0 (single image) recommended for simplicity
- Higher NPS values for multi-user tuning
NUMA Optimization Techniques
EPYC systems require workload-aware tuning.
Multi-Instance Isolation
With NPS set above 0, numactl can bind an inference instance's CPUs and memory allocations to a single NUMA node, keeping each process on local memory.
Check the NUMA topology with numactl -H.
Use GNU parallel to benchmark which NPS setting works best for your workload.
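A minimal sketch of node isolation, assuming a llama.cpp-style server binary and model path (both illustrative):

```shell
# Show the NUMA topology (node count depends on the NPS BIOS setting)
numactl -H
# Bind one inference instance's CPUs and allocations to NUMA node 3
numactl --cpunodebind=3 --membind=3 \
    ./llama-server -m ./models/model.gguf --port 8083
```

Repeat with different node numbers and ports to run several isolated instances for multi-user serving.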
Memory Locality Optimization
For maximum tokens/sec, keep the model's memory pages on the same NUMA node as the cores running inference.
Notes:
- First response may be slow
- Model pages fault into memory progressively
- Full speed typically reached after several responses
- Same principle applies to long-running servers (llama-server, oobabooga, kobold, etc.)
Proper memory locality can double inference speed.
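The progressive page-fault warm-up described above can be forced ahead of time by sequentially reading the model file once; shown here against a throwaway file, since on the real system you would read the .gguf itself:

```shell
# Warm-up read: fault a file's pages into the page cache before serving
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=8 status=none  # stand-in for the model file
cat "$f" > /dev/null     # first sequential read populates the page cache
rm -f "$f"
echo warmed
```

If the server is bound with --membind, performing the warm-up read under the same binding keeps the cached pages on the serving node.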
Conclusion
If you want the absolute lowest cost per token/sec, a GPU-focused build may still win.
However, a high-channel-count dual-socket EPYC system offers:
- Massive RAM capacity
- Strong memory bandwidth
- Expandability
- Quiet operation
- General-purpose compute flexibility
- Long-term upgrade paths
For users who need more than just inference performance — virtualization, experimentation, hosting multiple models — this CPU-centric design is a compelling alternative.
If your goal is the simplest cost-effective inference-only system, consider a smaller, more traditional GPU-focused configuration instead.