Hey everyone,
Following up on my post about the Hades engine’s 230 img/s world record: it wasn’t a straight line. In fact, for most of the project, I was betting on the wrong horse.
This is the story of two engines, a CPU champion I almost abandoned, and a GPU titan that taught me a lesson about the nature of hardware.
Chapter 1: The Beginning (The Age of Python)
Every project starts somewhere. Mine started with the standard rembg library.
- rembg (single image): ~5 seconds
- parallel rembg: ~3.5 seconds
It worked, but it was slow. I knew I had to go deeper, to the metal.
Chapter 2: The Two Paths - Hades (CPU) and Hyperion (GPU)
I decided to build two engines in Rust, simultaneously, to see which would win.
Path 1: The CPU Sorcerer (Hades)
This was my original champion. The goal was pure architectural elegance. The journey was a slow, painstaking grind of iterative improvement:
- v1 (First Rust port): 1 image/sec
- v2 (Optimized I/O): 0.3 images/sec
- v3 (Better Parallelism): 250 ms/image (4 img/s)
- v4 (SIMD experiments): 210 ms/image (4.7 img/s)
For a long time, it felt like I was hitting a wall. Then came the breakthrough.
- v5 (Zero-Copy Architecture): 9.9 images/sec on my local laptop (i7-11800H)
This wasn’t just a small speedup; it was a quantum leap. It proved that the right software architecture could fundamentally change the game.
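To make the zero-copy idea concrete, here is a minimal sketch (not Hades’ actual code): each pipeline stage mutates a single pre-allocated buffer through mutable slices instead of allocating and copying a fresh buffer per stage. The `premultiply_alpha` stage is an illustrative stand-in for a real pipeline step.

```rust
// Hypothetical sketch of a zero-copy stage: it borrows the frame buffer
// mutably and rewrites pixels in place -- no new allocation, no copy.
fn premultiply_alpha(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        let a = px[3] as u16;
        for c in &mut px[..3] {
            *c = ((*c as u16 * a) / 255) as u8;
        }
    }
}

fn main() {
    // One buffer allocated up front; every stage borrows it mutably.
    let mut frame = vec![255u8; 4 * 4]; // 2x2 RGBA image, fully opaque
    frame[3] = 128; // make the first pixel half-transparent
    premultiply_alpha(&mut frame);
    // 255 * 128 / 255 = 128, so the first pixel's color channels drop to 128.
    println!("first pixel after premultiply: {:?}", &frame[..4]);
}
```

The point is the ownership model: because Rust can prove each stage’s exclusive borrow is safe, the whole pipeline can pass the same buffer from stage to stage without a single defensive copy.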
Path 2: The GPU Titan (Hyperion)
This was supposed to be the easy win. The all-powerful brute. In reality? It was “broken for ages.”
The GPU is a different beast. It doesn’t reward cleverness in the same way. It only rewards overwhelming force. Getting it to work was a constant fight:
- Initial attempts: countless crashes and slow performance.
- Breakthrough 1: 16 images/sec (finally working!)
- Breakthrough 2: 20 images/sec (tuning the data pipeline)
- Breakthrough 3: 25 images/sec on my local 4 GB RTX 3050.
On my local machine, the GPU was the clear winner, 2.5x faster than the CPU. The path seemed obvious.
Chapter 3: The Revelation - The Kernels of Wisdom
This is where I learned the most important lesson. I realized I was treating the GPU like a faster CPU. I was wrong.
- The GPU is a dumb, powerful hammer. It works when you give it a bag of giant nails. The secret isn’t elegant logic; it’s batch size. It’s about structuring your entire pipeline to avoid memory transfers and feed the beast uniform chunks of work. I had to pre-allocate a huge chunk of VRAM (85% of the total) and manage my own session sizes to stop the driver from becoming a bottleneck.
- The CPU is a cunning sorcerer. It rewards elegant architecture. The zero-copy pipeline allowed it to “teleport” data where it was needed, using its sophisticated cache hierarchy and branch prediction to take shortcuts a GPU could never dream of.
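The batching lesson can be sketched in a few lines. This is an illustrative model, not Hyperion’s real code: the VRAM budget (85% of a 4 GB card) and the per-image footprint (1024×1024 RGBA in f32) are assumptions, but they show how a fixed memory reservation dictates batch size, which in turn dictates how many driver round-trips a workload costs.

```rust
// Assumed numbers for illustration: 85% of a 4 GB card, and a
// 1024x1024 RGBA f32 tensor per image (4 bytes/channel, 4 channels).
const VRAM_BUDGET_BYTES: usize = 4 * 1024 * 1024 * 1024 / 100 * 85;
const BYTES_PER_IMAGE: usize = 1024 * 1024 * 4 * 4;

/// How many images fit in the reserved VRAM at once.
fn batch_size_for_budget() -> usize {
    VRAM_BUDGET_BYTES / BYTES_PER_IMAGE
}

/// Group individual jobs into uniform chunks so the GPU is fed
/// one large dispatch instead of hundreds of tiny ones.
fn into_batches(jobs: Vec<u32>, batch: usize) -> Vec<Vec<u32>> {
    jobs.chunks(batch).map(|c| c.to_vec()).collect()
}

fn main() {
    let batch = batch_size_for_budget();
    let batches = into_batches((0..100).collect(), batch);
    println!("batch size {batch}, {} dispatch(es) for 100 images", batches.len());
}
```

Under these assumptions, 100 images collapse into a single dispatch. That collapse, not any per-image cleverness, is where the GPU wins come from.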
Chapter 4: The Final Arena (The Cloud)
With both engines tuned, I deployed them to high-end cloud hardware for the final showdown.
- Hyperion (GPU) on an AWS A10G: a respectable 33.81 images/sec.
- Hades (CPU) on a 384-core GCP Colossus: a world-shattering 230.22 images/sec.
The tortoise, powered by pure software architecture, didn’t just beat the hare. It lapped it. Nearly seven times over.
This journey taught me that true performance isn’t just about hardware; it’s about eliminating overhead at every level of the stack.
Most AI services today are built on interpreted languages like Python. They pay a heavy “spin-up” tax for every request, initializing environments and loading models, which kills their throughput.
The Hades and Hyperion engines are different. They are persistent, ahead-of-time compiled Rust binaries. They are always hot, always ready. There is no interpreter, no cold start penalty. This architectural choice, combined with a zero-copy pipeline, is a fundamental and permanent advantage.
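The “always hot” property can be sketched with Rust’s `std::sync::OnceLock`. This is a toy model, not the engines’ real API: `Model` and `load_weights` are illustrative stand-ins for an expensive one-time model load that every subsequent request then reuses from memory.

```rust
use std::sync::OnceLock;

// Illustrative stand-ins: in a real engine this would be an expensive
// one-time ONNX/weights load, paid at first use and never again.
struct Model {
    weights: Vec<f32>,
}

fn load_weights() -> Model {
    Model { weights: vec![0.5; 4] }
}

// Process-wide singleton: initialized once, shared by every request.
static MODEL: OnceLock<Model> = OnceLock::new();

fn infer(input: f32) -> f32 {
    // First caller pays the load cost; everyone after gets a hot model.
    let model = MODEL.get_or_init(load_weights);
    model.weights.iter().map(|w| w * input).sum()
}

fn main() {
    println!("{}", infer(2.0)); // 4 weights * 0.5 * 2.0 = 4
    println!("{}", infer(1.0)); // reuses the already-loaded model
}
```

An interpreted service re-pays initialization on every cold start; a persistent compiled binary pays it exactly once per process lifetime, which is the structural advantage described above.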
This is the new benchmark for high-performance AI inference. If you’re operating at scale and this level of performance can give you a competitive edge, we should talk.
Conclusion: An Open Invitation
This journey has taught me that there is no “best” hardware. There is only the best synergy between software and silicon. The Hades engine proves that if you tailor your architecture to the strengths of the CPU, you can achieve results that defy expectations.
But this is just my journey, and I’m one developer. The more of these low-level optimizations we map, the clearer it will become how far cloud AI inference can go.
My next project is to tune the Hyperion GPU engine and benchmark it on GCP H100 and H200 instances to see if it can beat my record, while simultaneously training a custom TPU model for parallel batching on TPU slices. I believe feeding a full v5e or v6e slice, rather than a single GPU, will deliver the ultimate throughput.