What "High Performance" Actually Means

Kevin Karsopawiro

Tags: trading, performance, rust, go

Here's an uncomfortable truth: most systems described as "high performance" aren't. They're regular systems that happen to be fast enough. There's nothing wrong with that, but let's be precise about our terminology.

When you work on trading systems where microseconds directly translate to money, you develop a different relationship with performance. The techniques are different. The tradeoffs are different. And most importantly, the mindset is different.

Let me share what that world looks like.

The Only Metric That Matters

First, let's talk about how to measure performance correctly.

If you're only tracking average latency, you're missing the picture. Average latency is like average wealth: Bill Gates walks into a bar, and suddenly the average patron is a billionaire. Doesn't tell you much about actual people.

What matters is the latency distribution. Specifically:

  • p50: The median. Half your requests are faster, half are slower.
  • p99: 1% of requests are slower than this. This is where problems hide.
  • p99.9: One in a thousand requests. This is where demons live.
  • Max: The worst case. Sometimes this is what kills you.

A system with 1ms average latency might have a p99 of 50ms. That means 1% of your users are having a terrible time. If you're serving millions of requests, that's tens of thousands of slow experiences.

[Expanding brain meme: average latency -> p50 -> p99 -> p99.9 with jitter analysis]
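
To make this concrete, here's a minimal sketch of reporting a latency distribution instead of a single average. The helper names are mine, not from any particular library, and a production system would record into a streaming histogram rather than sorting a batch of samples:

use std::time::Duration;

// Offline analysis helper: sort the samples once, then read values straight
// off the distribution. Hot paths would record into a histogram instead.
fn percentile(sorted: &[Duration], q: f64) -> Duration {
    let idx = ((sorted.len() as f64 - 1.0) * q).round() as usize;
    sorted[idx]
}

fn report(mut samples: Vec<Duration>) {
    samples.sort_unstable();
    let avg: Duration = samples.iter().sum::<Duration>() / samples.len() as u32;
    println!("avg   {:?}", avg);
    println!("p50   {:?}", percentile(&samples, 0.50));
    println!("p99   {:?}", percentile(&samples, 0.99));
    println!("p99.9 {:?}", percentile(&samples, 0.999));
    println!("max   {:?}", *samples.last().unwrap());
}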

The Numbers You Should Memorize

Here are the latency numbers that should inform every architectural decision:

  Operation                               Time
  L1 cache reference                      1 ns
  L2 cache reference                      4 ns
  L3 cache reference                      12 ns
  Main memory access                      100 ns
  SSD random read                         16,000 ns
  Network round trip (same datacenter)    500,000 ns

Look at these ratios. Memory is 100x slower than L1 cache. Network is 5,000x slower than memory. SSD is 160x slower than memory.

When you make a database call, even to a fast database in the same datacenter, you're spending 500,000 nanoseconds on network alone. At 3-4 GHz that's roughly 1.5 to 2 million CPU cycles, and a superscalar core retires several instructions per cycle, so the round trip is enough time to do millions of CPU operations.

This is why truly low-latency systems keep everything in memory. Not "cache frequently accessed data." Everything. The database is for persistence and recovery, not for serving requests.
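
Here's a minimal sketch of that shape. The record type and log format are made up for illustration; the point is that reads are a hash lookup in process memory, and the disk only sees an append-only journal used to rebuild state after a restart.

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Write};

// Illustrative record type; a real system would define its own schema.
#[derive(Clone, Copy)]
struct Position {
    quantity: i64,
    avg_price: u64,
}

struct InMemoryStore {
    positions: HashMap<u64, Position>, // all serving state lives in memory
    journal: BufWriter<File>,          // append-only log, used only for recovery
}

impl InMemoryStore {
    // Hot path: a hash lookup, no I/O, no network.
    fn get(&self, account: u64) -> Option<Position> {
        self.positions.get(&account).copied()
    }

    fn update(&mut self, account: u64, pos: Position) -> std::io::Result<()> {
        self.positions.insert(account, pos);
        // Persistence is for recovery, not for serving; a real system would
        // batch these writes and fsync off the critical path.
        writeln!(self.journal, "{} {} {}", account, pos.quantity, pos.avg_price)
    }
}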

What "Low Latency" Actually Requires

Let's get specific about what real low-latency systems look like:

No Allocations In The Hot Path

Memory allocation means allocator lock contention, occasional syscalls when the allocator has to grow the heap, and unpredictable latency. In performance-critical code, you pre-allocate everything at startup.

// Allocate pools at startup (pre-filled so the hot path never touches the allocator)
let mut order_pool: Vec<Order> = vec![Order::default(); 100_000];
let mut message_buffer: Vec<u8> = Vec::with_capacity(65_536);

// Hot path uses pre-allocated memory
fn process_order(&mut self, data: &[u8]) {
    // Reuse an existing slot; never allocate here
    let order = &mut self.order_pool[self.next_slot];
    order.parse_from(data);
    self.next_slot = (self.next_slot + 1) % self.order_pool.len();
    // ...
}

This is not how you should write most code. It's harder to read and maintain. But when nanoseconds matter, you do what's necessary.

No Garbage Collection

GC pauses are the enemy of consistent latency. Even modern collectors like ZGC or Shenandoah, with pauses measured in hundreds of microseconds or less, introduce stalls you can't afford when your entire latency budget is a few microseconds.

This doesn't mean Java is bad. It means Java isn't the right tool for sub-millisecond latency requirements. Use Rust, C++, or carefully managed C.

Cache-Friendly Data Structures

Modern CPUs are fast. Memory is slow. The bottleneck is usually getting data into the CPU, not processing it.

// Bad: pointer-heavy structure, cache unfriendly
struct BoxedOrder {
    data: Box<OrderData>,
    metadata: Box<Metadata>,
}

// Good: contiguous memory, cache friendly
#[derive(Clone, Copy, Default)]
struct Order {
    price: u64,
    quantity: u32,
    side: u8,
    // All data inline, no indirection
}

// Even better: an array of structs for sequential access
let orders: [Order; 10_000] = [Order::default(); 10_000];

When the CPU loads memory, it loads entire cache lines (64 bytes). If your data is scattered across the heap, you waste most of each cache line. Contiguous arrays let the prefetcher work effectively.
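
You can see the effect with a rough comparison like the one below. It's not a rigorous benchmark (numbers vary by machine, and a real measurement would use a harness such as criterion), but summing one field over a contiguous Vec versus the same data behind one Box per element usually shows a large gap.

use std::time::Instant;

#[derive(Clone, Copy)]
struct Order {
    price: u64,
    quantity: u32,
    side: u8,
}

fn main() {
    let n = 10_000_000;
    let order = Order { price: 100, quantity: 1, side: 0 };

    // Contiguous: one allocation, sequential access, prefetcher-friendly.
    let contiguous: Vec<Order> = vec![order; n];
    // Boxed: one heap allocation per order, scattered across memory.
    let boxed: Vec<Box<Order>> = (0..n).map(|_| Box::new(order)).collect();

    let t = Instant::now();
    let sum_a: u64 = contiguous.iter().map(|o| o.price).sum();
    println!("contiguous: {:?} (sum {sum_a})", t.elapsed());

    let t = Instant::now();
    let sum_b: u64 = boxed.iter().map(|o| o.price).sum();
    println!("boxed:      {:?} (sum {sum_b})", t.elapsed());
}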

Kernel Bypass For Networking

The network stack adds latency. For the fastest systems, you bypass the kernel entirely using technologies like DPDK or kernel bypass NICs. Packets go directly from the network card to userspace memory.

This is complex, requires specialized hardware, and most systems don't need it. But if you're competing on microseconds, it's table stakes.

The Architecture Implications

These requirements have profound implications for system architecture:

Monoliths, Not Microservices

Every network hop adds latency. Every service boundary adds serialization overhead. For low-latency work, you colocate. Everything that needs to be fast runs in the same process, often on the same CPU core.

This isn't an argument against microservices in general. They're great for organizational scaling, deployment independence, and fault isolation. But those benefits come with latency costs. Know your priorities.

Data Locality Is Everything

The processing logic runs on the same machine as the data. Ideally, the hot working set fits in L3 cache, which is shared between cores. Moving data across machines is a last resort.

Single-Threaded Hot Paths

Counterintuitively, the fastest code is often single-threaded. No locks, no contention, no cache coherency traffic between cores. You handle concurrency at the architecture level (multiple independent processes) rather than within the hot path.
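
Here's a sketch of that shape, with the shard count, message type, and routing key all made up for illustration. Each worker owns its state outright and runs single-threaded; the only coordination is the inbound queue.

use std::sync::mpsc;
use std::thread;

fn main() {
    let shards: u64 = 4;
    let mut senders = Vec::new();
    let mut workers = Vec::new();

    for id in 0..shards {
        let (tx, rx) = mpsc::channel::<u64>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // This worker owns its state outright: no locks, no shared memory,
            // no cache lines bouncing between cores.
            let mut processed = 0u64;
            for _order_id in rx {
                processed += 1; // hot path stays single-threaded
            }
            println!("shard {id}: processed {processed} orders");
        }));
    }

    // Concurrency lives at the routing layer: orders are partitioned by key,
    // so no two workers ever touch the same state.
    for order_id in 0..1_000u64 {
        senders[(order_id % shards) as usize].send(order_id).unwrap();
    }

    drop(senders); // close the channels so the workers drain and exit
    for w in workers {
        w.join().unwrap();
    }
}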

When You Actually Need This

Here's the important caveat: most systems don't need microsecond latency.

If your users are on the other side of the internet, network latency dominates. Shaving microseconds off your processing time doesn't matter when there's 50ms of network between you and the user.

If you're building web applications, your performance bottleneck is probably database queries and DOM rendering. Optimizing CPU-bound code won't help.

If you're processing batch workloads, throughput matters more than latency. The techniques that maximize throughput (batching, parallel processing) often increase individual request latency.

Low-latency engineering makes sense when:

  • You're competing on speed (trading, gaming, real-time bidding)
  • You have tight latency SLAs (financial services, certain embedded systems)
  • You're building infrastructure that others depend on (databases, message queues)

For everything else, build for maintainability first and optimize the actual bottlenecks second.

The Measurement Discipline

If there's one thing to take away, it's this: measure, don't guess.

Before optimizing anything, profile your system. Understand where time actually goes. You will be surprised. The bottleneck is almost never where you expect.

After making changes, measure again. Confirm the improvement. Sometimes "optimizations" make things worse due to unexpected interactions.

Track percentiles, not averages. The problems hide in the tails.

Benchmark with realistic data and realistic load patterns. Microbenchmarks lie. They show you best-case performance that you'll never see in production.
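
One way to put that into practice is to record every request into a latency histogram and report the tail rather than the mean. This sketch uses the hdrhistogram crate (one common choice, added as a dependency); the workload and the histogram bounds are placeholders.

use hdrhistogram::Histogram;
use std::time::Instant;

fn handle_request() {
    // Placeholder for the work you actually want to measure.
    std::hint::black_box((0..1_000u64).sum::<u64>());
}

fn main() {
    // Track 1 ns .. 1 s with 3 significant digits of precision.
    let mut hist = Histogram::<u64>::new_with_bounds(1, 1_000_000_000, 3).unwrap();

    for _ in 0..100_000 {
        let start = Instant::now();
        handle_request();
        hist.record(start.elapsed().as_nanos() as u64).unwrap();
    }

    println!("p50   {} ns", hist.value_at_quantile(0.50));
    println!("p99   {} ns", hist.value_at_quantile(0.99));
    println!("p99.9 {} ns", hist.value_at_quantile(0.999));
    println!("max   {} ns", hist.max());
}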

The Mindset Shift

Working on genuinely low-latency systems changes how you think about software:

  • Every abstraction has a cost. Sometimes abstraction is worth it, sometimes it isn't.
  • The compiler and CPU are not magic. You need to understand what your code actually does.
  • Performance is a feature. It requires investment like any other feature.
  • Measurement is non-negotiable. Intuition fails at the nanosecond scale.

These lessons apply broadly, even if you never build a trading system. Understanding where performance comes from makes you better at knowing when it matters and when it doesn't.

And knowing when performance doesn't matter is just as valuable as knowing how to optimize. Premature optimization is still the root of all evil. But informed decisions about when to optimize? That's engineering.