Hey everyone,
Following up on my post about the Hades engine’s 230 img/s world record: it wasn’t a straight line. In fact, for most of the project, I was betting on the wrong horse.
This is the story of two engines, a CPU champion I almost abandoned, and a GPU titan that taught me a lesson about the nature of hardware.
Chapter 1: The Beginning (The Age of Python)
Every project starts somewhere. Mine started with the standard rembg library.
- rembg (single image): ~5 seconds
- parallel rembg: ~3.5 seconds
It worked, but it was slow. I knew I had to go deeper, to the metal.
Chapter 2: The Two Paths - Hades (CPU) and Hyperion (GPU)
I decided to build two engines in Rust, simultaneously, to see which would win.
Path 1: The CPU Sorcerer (Hades)
This was my original champion. The goal was pure architectural elegance. The journey was a slow, painstaking grind of iterative improvement:
- v1 (First Rust port): 1 image/sec
- v2 (Optimized I/O): 0.3 images/sec
- v3 (Better Parallelism): 250 ms/image (4 img/s)
- v4 (SIMD experiments): 210 ms/image (4.7 img/s)
For a long time, it felt like I was hitting a wall. Then came the breakthrough.
- v5 (Zero-Copy Architecture): 9.9 images/sec on my local laptop (i7-11800H)
This wasn’t just a small speedup; it was a quantum leap. It proved that the right software architecture could fundamentally change the game.
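To make the zero-copy idea concrete, here is a minimal sketch (not Hades’ actual code): each pipeline stage mutates a single pre-allocated buffer through mutable slices instead of allocating and copying a fresh buffer per stage. The `premultiply_alpha` stage is an illustrative stand-in for a real pipeline step.

```rust
// Hypothetical sketch of a zero-copy stage: it borrows the frame buffer
// mutably and rewrites pixels in place -- no new allocation, no copy.
fn premultiply_alpha(rgba: &mut [u8]) {
    for px in rgba.chunks_exact_mut(4) {
        let a = px[3] as u16;
        for c in &mut px[..3] {
            *c = ((*c as u16 * a) / 255) as u8;
        }
    }
}

fn main() {
    // One buffer allocated up front; every stage borrows it mutably.
    let mut frame = vec![255u8; 4 * 4]; // 2x2 RGBA image, fully opaque
    frame[3] = 128; // make the first pixel half-transparent
    premultiply_alpha(&mut frame);
    // 255 * 128 / 255 = 128, so the first pixel's color channels drop to 128.
    println!("first pixel after premultiply: {:?}", &frame[..4]);
}
```

The point is the ownership model: because Rust can prove each stage’s exclusive borrow is safe, the whole pipeline can pass the same buffer from stage to stage without a single defensive copy.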
Path 2: The GPU Titan (Hyperion)
This was supposed to be the easy win. The all-powerful brute. In reality? It was “broken for ages.”
The GPU is a different beast. It doesn’t reward cleverness in the same way. It only rewards overwhelming force. Getting it to work was a constant fight:
- Initial attempts: countless crashes and slow performance.
- Breakthrough 1: 16 images/sec (finally working!)
- Breakthrough 2: 20 images/sec (tuning the data pipeline)
- Breakthrough 3: 25 images/sec on my local 4 GB RTX 3050.
On my local machine, the GPU was the clear winner, 2.5x faster than the CPU. The path seemed obvious.
Chapter 3: The Revelation - The Kernels of Wisdom
This is where I learned the most important lesson. I realized I was treating the GPU like a faster CPU. I was wrong.
- The GPU is a dumb, powerful hammer. It works when you give it a bag of giant nails. The secret isn’t elegant logic; it’s batch size. It’s about structuring your entire pipeline to avoid memory transfers and feed the beast uniform chunks of work. I had to pre-allocate a huge chunk of VRAM (85% of the total) and manage my own session sizes to stop the driver from becoming a bottleneck.
- The CPU is a cunning sorcerer. It rewards elegant architecture. The zero-copy pipeline allowed it to “teleport” data where it was needed, using its sophisticated cache hierarchy and branch prediction to take shortcuts a GPU could never dream of.
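The batching lesson can be sketched in a few lines. This is an illustrative model, not Hyperion’s real code: the VRAM budget (85% of a 4 GB card) and the per-image footprint (1024×1024 RGBA in f32) are assumptions, but they show how a fixed memory reservation dictates batch size, which in turn dictates how many driver round-trips a workload costs.

```rust
// Assumed numbers for illustration: 85% of a 4 GB card, and a
// 1024x1024 RGBA f32 tensor per image (4 bytes/channel, 4 channels).
const VRAM_BUDGET_BYTES: usize = 4 * 1024 * 1024 * 1024 / 100 * 85;
const BYTES_PER_IMAGE: usize = 1024 * 1024 * 4 * 4;

/// How many images fit in the reserved VRAM at once.
fn batch_size_for_budget() -> usize {
    VRAM_BUDGET_BYTES / BYTES_PER_IMAGE
}

/// Group individual jobs into uniform chunks so the GPU is fed
/// one large dispatch instead of hundreds of tiny ones.
fn into_batches(jobs: Vec<u32>, batch: usize) -> Vec<Vec<u32>> {
    jobs.chunks(batch).map(|c| c.to_vec()).collect()
}

fn main() {
    let batch = batch_size_for_budget();
    let batches = into_batches((0..100).collect(), batch);
    println!("batch size {batch}, {} dispatch(es) for 100 images", batches.len());
}
```

Under these assumptions, 100 images collapse into a single dispatch. That collapse, not any per-image cleverness, is where the GPU wins come from.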
Chapter 4: The Final Arena (The Cloud)
With both engines tuned, I deployed them to high-end cloud hardware for the final showdown.
- Hyperion (GPU) on an AWS A10G: a respectable 33.81 images/sec.
- Hades (CPU) on a 384-core GCP Colossus: a world-shattering 230.22 images/sec.
The tortoise, powered by pure software architecture, didn’t just beat the hare. It lapped it. Nearly seven times over.
This journey taught me that true performance isn’t just about hardware; it’s about eliminating overhead at every level of the stack.
Most AI services today are built on interpreted languages like Python. They pay a heavy “spin-up” tax for every request, initializing environments and loading models, which kills their throughput.
The Hades and Hyperion engines are different. They are persistent, ahead-of-time compiled Rust binaries. They are always hot, always ready. There is no interpreter, no cold start penalty. This architectural choice, combined with a zero-copy pipeline, is a fundamental and permanent advantage.
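The “always hot” property can be sketched with Rust’s `std::sync::OnceLock`. This is a toy model, not the engines’ real API: `Model` and `load_weights` are illustrative stand-ins for an expensive one-time model load that every subsequent request then reuses from memory.

```rust
use std::sync::OnceLock;

// Illustrative stand-ins: in a real engine this would be an expensive
// one-time ONNX/weights load, paid at first use and never again.
struct Model {
    weights: Vec<f32>,
}

fn load_weights() -> Model {
    Model { weights: vec![0.5; 4] }
}

// Process-wide singleton: initialized once, shared by every request.
static MODEL: OnceLock<Model> = OnceLock::new();

fn infer(input: f32) -> f32 {
    // First caller pays the load cost; everyone after gets a hot model.
    let model = MODEL.get_or_init(load_weights);
    model.weights.iter().map(|w| w * input).sum()
}

fn main() {
    println!("{}", infer(2.0)); // 4 weights * 0.5 * 2.0 = 4
    println!("{}", infer(1.0)); // reuses the already-loaded model
}
```

An interpreted service re-pays initialization on every cold start; a persistent compiled binary pays it exactly once per process lifetime, which is the structural advantage described above.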
This is the new benchmark for high-performance AI inference. If you’re operating at scale and this level of performance can give you a competitive edge, we should talk.
Conclusion: An Open Invitation
This journey has taught me that there is no “best” hardware. There is only the best synergy between software and silicon. The Hades engine proves that if you tailor your architecture to the strengths of the CPU, you can achieve results that defy expectations.
But this is just my journey, and I’m one developer. The more of these low-level optimizations we map, the clearer it will become how far cloud AI inference can go.
My next project is to tune the Hyperion GPU engine and benchmark it on GCP H100 and H200 instances to see if it can beat my record, while simultaneously training a custom TPU model for parallel batching on TPU slices. I believe feeding a full v5e or v6e slice, rather than a single GPU, will deliver the ultimate throughput.