<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Doogal Simpson's Dev Blog]]></title><description><![CDATA[Level up from Junior to Professional. Tactical, no-fluff software engineering articles by Doogal Simpson on clean code, architecture, and career growth.]]></description><link>https://doogal.dev</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 09:41:28 GMT</lastBuildDate><atom:link href="https://doogal.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Networking Logic: How Your Data Navigates the Internet]]></title><description><![CDATA[TL;DR: Downloading a file is an asynchronous process where data is fragmented into packets that navigate a mesh of routers independently. By using Layer 3 protocols to prioritize low-latency paths ove]]></description><link>https://doogal.dev/networking-logic-how-your-data-navigates-the-internet</link><guid isPermaLink="true">https://doogal.dev/networking-logic-how-your-data-navigates-the-internet</guid><category><![CDATA[networking]]></category><category><![CDATA[#InternetInfrastructure]]></category><category><![CDATA[computerscience]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[DataRouting]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:21:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/9d51d71d-48d7-4222-a0a8-01f5486a98e2.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Downloading a file is an asynchronous process where data is fragmented into packets that navigate a mesh of routers independently. 
By using Layer 3 protocols to prioritize low-latency paths over physical distance, the network ensures resilient delivery even when specific nodes fail or become congested.</strong></p>
<p>When you download a file, you aren't opening a single pipe; you're solving a distributed logistics problem. I look at this process as thousands of independent packets of information converging on a destination. They don't follow a pre-set, static path. Instead, they navigate a web of hardware, jumping from server to server until they reassemble on your machine. To understand how this works, you have to look past the file itself and focus on the routing logic that moves it.</p>
<h2>How does data actually move across the internet?</h2>
<p>Data moves by being fragmented into discrete packets, usually constrained by a 1500-byte Maximum Transmission Unit (MTU). These packets are injected into the network via the TCP/IP stack and routed independently, which allows the network to utilize all available bandwidth across various physical paths.</p>
<p>I think of it as a group of a hundred friends going on a road trip. You can’t fit them all in one car, so I put them in individual vehicles. Because every vehicle has the final address, they don't need to stay in a bumper-to-bumper convoy. One car might take the highway, another might take a toll road, and a third might take a bypass. As long as they reach the final coordinate, the specific path each car takes is secondary to the goal of arriving at the destination.</p>
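<p>The fragmentation step can be sketched in a few lines of Python. This is a toy model: the 40-byte header figure is my own assumption (rough IPv4 + TCP overhead), and real stacks negotiate an MSS rather than slicing buffers like this.</p>

```python
# Toy model of packet fragmentation (illustrative only).
MTU = 1500                       # typical Ethernet MTU in bytes
HEADER = 40                      # assumed IPv4 + TCP header overhead per packet
PAYLOAD_PER_PACKET = MTU - HEADER

def fragment(data: bytes):
    """Split data into (sequence_number, chunk) pairs, one per packet."""
    return [
        (seq, data[offset:offset + PAYLOAD_PER_PACKET])
        for seq, offset in enumerate(range(0, len(data), PAYLOAD_PER_PACKET))
    ]

packets = fragment(b"x" * 5000)  # 5000 bytes -> 4 numbered packets
```

<p>Every chunk carries its own number, which is exactly why the "cars" don't need to travel together: the destination can reassemble them in any arrival order.</p>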
<h2>What is the role of a router in packet switching?</h2>
<p>I look at routers as Layer 3 decision engines that use routing tables and protocols like BGP or OSPF to forward packets toward their destination. A router evaluates the header of every incoming packet and determines the next hop based on current link costs, congestion, and path availability.</p>
<p>In my view, the router is the sat-nav at every intersection. When a packet hits a router, the hardware picks the best available output port. It isn't looking for the shortest physical distance; it's looking for the lowest-cost path, where cost is a routing metric tuned to favour low latency and available bandwidth. If a primary fiber link is saturated, the router will push the packet onto a secondary path that might be physically longer but currently has more headroom. It's a series of local hop decisions that add up to global delivery.</p>
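<p>At its core, that per-hop decision is a longest-prefix match against a routing table. Here's a toy version using Python's standard library (the prefixes and hop names are invented for illustration; real routers do this lookup in specialised hardware and populate the table via BGP/OSPF):</p>

```python
import ipaddress

# Invented routing table: prefix -> next hop.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "hop-A",
    ipaddress.ip_network("10.1.0.0/16"): "hop-B",
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
}

def next_hop(destination: str) -> str:
    """Longest-prefix match: the most specific matching route wins."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]
```

<p>Here <code>next_hop("10.1.2.3")</code> picks "hop-B" because the /16 route is more specific than the /8, while anything outside 10.0.0.0/8 falls through to the default gateway.</p>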
<h2>How does the network handle path failure and congestion?</h2>
<p>Network resilience is managed through dynamic re-routing and Time to Live (TTL) values that prevent packets from looping indefinitely. When a node fails or a path congests, the network reaches "convergence," where routers update their internal tables to reflect the new state of the network and bypass the failure.</p>
<p>If I’m on a road trip and hit a "Road Closed" sign, my sat-nav recalculates the route based on real-time telemetry. The internet functions the same way. If a data center goes offline or an undersea cable is cut, your packets don't just stop. They take a detour through different nodes. I find this design elegant because it assumes the underlying infrastructure is unreliable and builds the reliability into the endpoint logic instead.</p>
<h2>Technical Specs of the Data Trip</h2>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Technical Reality</th>
<th>Analogy Equivalent</th>
</tr>
</thead>
<tbody><tr>
<td><strong>MTU</strong></td>
<td>1500-byte packet limit</td>
<td>Car seating capacity.</td>
</tr>
<tr>
<td><strong>TTL</strong></td>
<td>Hop-limit counter</td>
<td>A fuel gauge that drops at every intersection.</td>
</tr>
<tr>
<td><strong>Router</strong></td>
<td>Layer 3 Forwarding</td>
<td>The sat-nav at every crossroads.</td>
</tr>
<tr>
<td><strong>TCP</strong></td>
<td>Sequence Reassembly</td>
<td>The manifest used to check friends in at the goal.</td>
</tr>
<tr>
<td><strong>Latency</strong></td>
<td>RTT (Round Trip Time)</td>
<td>Total travel time for one vehicle.</td>
</tr>
</tbody></table>
<h2>FAQ</h2>
<p><strong>Do packets always arrive in the correct order?</strong><br />No. Since packets take different paths, Packet #100 might arrive before Packet #1. I rely on the TCP layer on the receiving machine to act as a buffer, holding early arrivals and releasing data to the application in sequence-number order as the missing pieces come in.</p>
<p><strong>What happens if a packet is lost in transit?</strong><br />If a packet is dropped due to hardware failure or congestion, the receiver identifies the gap in the sequence. It sends a request back to the source to re-transmit that specific missing packet. This ensures the integrity of the file without needing to restart the entire transfer from the beginning.</p>
<p><strong>How does a router determine the 'best' path?</strong><br />Routers maintain routing tables that are updated via neighbor exchange protocols. They use metrics like hop count and bandwidth to determine the path of least resistance. The network-wide process of routers agreeing on a consistent set of tables after a change is called convergence, and it's what keeps traffic as efficient as possible.</p>
]]></content:encoded></item><item><title><![CDATA[What About Second HTTP? Solving the 100-File Connection Bottleneck]]></title><description><![CDATA[TL;DR: HTTP/2 replaces the inefficient six-connection limit of HTTP/1.1 with a single, multiplexed stream. By breaking assets into small, interleaved chunks, it eliminates head-of-line blocking and pr]]></description><link>https://doogal.dev/what-about-second-http-solving-the-100-file-connection-bottleneck</link><guid isPermaLink="true">https://doogal.dev/what-about-second-http-solving-the-100-file-connection-bottleneck</guid><category><![CDATA[webdevelopment]]></category><category><![CDATA[http2]]></category><category><![CDATA[networking]]></category><category><![CDATA[performance]]></category><category><![CDATA[#softwareengineering]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:18:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/a2eeb520-3f03-45b2-9c49-0f844e62f93f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: HTTP/2 replaces the inefficient six-connection limit of HTTP/1.1 with a single, multiplexed stream. By breaking assets into small, interleaved chunks, it eliminates head-of-line blocking and prevents multiple connections from fighting for bandwidth, allowing browsers to request and download hundreds of files simultaneously without the overhead of repeated TCP handshakes.</strong></p>
<p>We’ve had first HTTP, but what about second HTTP? Back in the "olden days," websites were tiny. You had an HTML file, a CSS file, and maybe a couple of GIFs. For that scale, HTTP/1.1 worked fine. But modern web engineering has moved toward shipping hundreds of small files—JavaScript modules, fragmented CSS, and optimized assets. </p>
<p>When you try to shove 100 files through a protocol designed for five, things get slow. It isn't just about the raw size of the data; it’s about how the protocol manages the "wire" itself. Let's look at why we had to move on to HTTP/2 to build the websites we actually want to build today.</p>
<h2>Why is the 6-connection limit a problem for modern sites?</h2>
<p>HTTP/1.1 browsers typically limit themselves to six concurrent TCP connections per domain. If your site requires 100 files to render, the browser is forced to download them in batches of six, leaving the remaining 94 files stuck in a queue until a slot opens up.</p>
<p>This isn't just a queuing issue; it's a resource management disaster. Each of those six connections takes time to establish via a TCP handshake. Once they are open, these connections don't cooperate; they actively fight each other for available bandwidth. Instead of one smooth stream of data, you have six competing processes creating noise and congestion on the network. For a site with 100+ assets, this "batching" adds latency that no amount of raw bandwidth can overcome.</p>
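<p>A back-of-the-envelope sketch of the batching penalty (the 100ms RTT is an assumed figure, and real browsers overlap work in ways this deliberately ignores):</p>

```python
import math

FILES = 100        # assets the page needs
CONNECTIONS = 6    # typical HTTP/1.1 per-domain connection limit
RTT = 0.1          # assumed round-trip time in seconds

# With six pipes, files move in waves; each wave costs at least one round trip.
batches = math.ceil(FILES / CONNECTIONS)
queueing_floor = batches * RTT  # waiting baked in before the last batch even starts
```

<p>Seventeen waves at one round trip each is roughly 1.7 seconds of pure queuing, before a single byte of payload is counted.</p>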
<h2>What happens when a file gets stuck in HTTP/1.1?</h2>
<p>In HTTP/1.1, if one file in a connection downloads slowly or gets "stuck," that entire connection is blocked until the transfer completes. This is known as Head-of-Line (HOL) blocking, where a single heavy asset prevents every subsequent file in the queue from moving forward.</p>
<p>If you're down to five active connections because one is hung up on a large image or a slow server response, your throughput drops immediately. There is no way for the browser to say, "Hey, skip that big file and send me the tiny JS snippet instead." The protocol is strictly sequential within those six pipes. If the "head" of the line is blocked, everything behind it stays put.</p>
<h2>How does HTTP/2 multiplexing eliminate the queue?</h2>
<p>HTTP/2 sidesteps the six-connection rule and uses a single, high-performance connection to request everything at once. It does this by chopping every file into little chunks and interleaving them, so data for all 100 files starts moving down the wire simultaneously.</p>
<p>Because it’s one connection, we avoid the overhead of multiple handshakes and the bandwidth contention issues where separate connections fight for priority. If one file gets stuck or arrives slowly, it doesn't matter. The browser is already busy receiving the chunks for the other 99 files. Everything gets sent down the wire in parallel and is reconstructed by the browser on the other end.</p>
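<p>The interleaving idea is easy to model: take a frame from each stream in turn and put them all on one wire. This is a toy sketch; real HTTP/2 also applies per-stream prioritisation and flow control rather than plain round-robin.</p>

```python
from itertools import chain, zip_longest

def interleave(streams):
    """Round-robin frames from every stream onto one wire."""
    return [frame for frame in chain.from_iterable(zip_longest(*streams))
            if frame is not None]

wire = interleave([["a1", "a2", "a3"], ["b1"], ["c1", "c2"]])
```

<p>Short streams finish early and simply drop out of the rotation, so a slow or heavy stream never blocks the others.</p>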
<table>
<thead>
<tr>
<th>Feature</th>
<th>HTTP/1.1 (The Old Way)</th>
<th>HTTP/2 (The Modern Way)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Concurrency</strong></td>
<td>6 connections (Max)</td>
<td>1 connection (Multiplexed)</td>
</tr>
<tr>
<td><strong>File Handling</strong></td>
<td>Sequential (one at a time per pipe)</td>
<td>Parallel (all at once via chunks)</td>
</tr>
<tr>
<td><strong>Network Efficiency</strong></td>
<td>High contention, multiple handshakes</td>
<td>Low overhead, optimized bandwidth</td>
</tr>
<tr>
<td><strong>Failure Mode</strong></td>
<td>HOL blocking stalls the queue</td>
<td>Interleaved frames prevent stalls</td>
</tr>
</tbody></table>
<h2>Is setting up one connection really faster than six?</h2>
<p>You might think more pipes equal more speed, but in networking, the opposite is often true because of the "slow-start" algorithm in TCP. A single HTTP/2 connection can optimize its throughput faster and more accurately than six competing connections that are constantly triggering congestion control mechanisms.</p>
<p>By moving to HTTP/2, we've stopped trying to hack around the protocol and started using one that understands we’re building sites made of hundreds of files. It’s about getting everything down the wire as fast as possible so the user isn't left staring at a loading spinner while the browser tries to manage a congested queue. Cheers!</p>
<h2>FAQ</h2>
<p><strong>Do I still need to bundle my files into one giant 'bundle.js' with HTTP/2?</strong>
While bundling isn't as critical for bypassing connection limits as it was in HTTP/1.1, it’s still useful for compression efficiency. However, HTTP/2 makes it much more performant to ship unbundled modules, which can lead to better caching strategies.</p>
<p><strong>Does HTTP/2 work over unencrypted connections?</strong>
While the spec doesn't strictly require encryption, all major browser implementations (Chrome, Firefox, Safari) only support HTTP/2 over TLS (HTTPS). If you want the speed of Second HTTP, you need a TLS certificate.</p>
<p><strong>How does the browser know how to put the chunks back together?</strong>
HTTP/2 uses a framing layer. Each chunk of data is wrapped in a 'frame' that contains a stream identifier. The browser sees these IDs and knows exactly which file each chunk belongs to, allowing it to reassemble the assets perfectly even though they arrived interleaved.</p>
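<p>That reassembly logic can be sketched like this (the stream IDs and chunks are invented for illustration; real HTTP/2 frames also carry type, length, and flag fields):</p>

```python
from collections import defaultdict

# Frames as (stream_id, chunk) pairs, arriving interleaved on one connection.
frames = [(1, b"<ht"), (3, b"bod"), (1, b"ml>"), (3, b"y{}")]

reassembled = defaultdict(bytes)
for stream_id, chunk in frames:
    reassembled[stream_id] += chunk  # append each chunk to its own stream
```

<p>Each stream accumulates independently, so interleaved arrival never corrupts an individual asset.</p>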
]]></content:encoded></item><item><title><![CDATA[The 600ms Tax: Why Every TCP Connection Starts with a State Negotiation]]></title><description><![CDATA[TL;DR: A TCP handshake is a mandatory three-step negotiation—SYN, SYN-ACK, and ACK—required to synchronize sequence numbers and reserve memory buffers before data transfer. This protocol overhead adds]]></description><link>https://doogal.dev/the-600ms-tax-why-every-tcp-connection-starts-with-a-state-negotiation</link><guid isPermaLink="true">https://doogal.dev/the-600ms-tax-why-every-tcp-connection-starts-with-a-state-negotiation</guid><category><![CDATA[networking]]></category><category><![CDATA[TCP]]></category><category><![CDATA[webperformance]]></category><category><![CDATA[backend]]></category><category><![CDATA[#softwareengineering]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:17:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/86a9eb6a-ef5d-45a2-95c6-970de04c28af.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: A TCP handshake is a mandatory three-step negotiation—SYN, SYN-ACK, and ACK—required to synchronize sequence numbers and reserve memory buffers before data transfer. This protocol overhead adds a minimum of 1.5 round-trips of latency, making persistent connection reuse a critical optimization for reducing time-to-first-byte (TTFB).</strong></p>
<p>Fetching a file from a remote server isn't just a matter of bandwidth; it’s a battle against the physics of the network. If a one-way trip to a server takes 200ms, you aren't waiting 200ms for your data. You’re likely waiting 600ms before the first byte even arrives, because the connection itself costs three of those trips. This delay isn't a limitation of your fiber line—it’s the intentional cost of establishing a reliable state between two machines over an unreliable infrastructure.</p>
<p>I look at this as a "resource reservation tax." Before a single packet of your HTML or JSON is sent, both the client and the server have to agree on exactly how they will track the data. This negotiation is what we call the TCP handshake.</p>
<h2>What is a TCP handshake and why is it necessary?</h2>
<p>A TCP handshake is a three-way exchange used to initialize a reliable logical connection by synchronizing sequence numbers and allocating memory buffers on both the client and server. This process ensures that both parties have the state necessary to track packet delivery, handle retransmissions, and reassemble data in the correct order.</p>
<p>Think of it as initializing a shared state machine. Both computers need to know where to start counting bytes—the sequence number—and they need to set aside specific memory for that specific connection. Without this synchronization, the receiving end would have no way to distinguish between a new packet and a delayed packet from a previous session. It’s about creating a predictable environment out of the chaos of the internet.</p>
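<p>As a toy model of that synchronization (the dict fields are my own shorthand; real TCP carries these values in the segment header):</p>

```python
import random

# Each side picks a random Initial Sequence Number (ISN) to start counting from.
client_isn = random.randrange(2**32)
server_isn = random.randrange(2**32)

# SYN: client proposes its ISN.
syn = {"flags": "SYN", "seq": client_isn}

# SYN-ACK: server acknowledges client_isn + 1 and proposes its own ISN.
syn_ack = {"flags": "SYN-ACK", "seq": server_isn, "ack": syn["seq"] + 1}

# ACK: client confirms the server's ISN; both sides now share synchronized state.
ack = {"flags": "ACK", "seq": client_isn + 1, "ack": syn_ack["seq"] + 1}
```

<p>Once both sides have seen their own ISN acknowledged, every future byte has an unambiguous position in the stream.</p>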
<h2>How does the three-way handshake impact network latency?</h2>
<p>The TCP handshake impacts latency by requiring three sequential legs of communication—client to server, server to client, and client back to server—before the application data can be requested. This creates a latency floor where the initial connection setup time is directly proportional to the physical distance between the two machines.</p>
<table>
<thead>
<tr>
<th>Phase</th>
<th>Direction</th>
<th>Technical Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><strong>SYN</strong></td>
<td>Client -&gt; Server</td>
<td>Client proposes an initial sequence number and requests synchronization.</td>
</tr>
<tr>
<td><strong>SYN-ACK</strong></td>
<td>Server -&gt; Client</td>
<td>Server acknowledges client's sequence, proposes its own, and allocates memory buffers.</td>
</tr>
<tr>
<td><strong>ACK</strong></td>
<td>Client -&gt; Server</td>
<td>Client confirms the server's state; usually piggybacks the first actual data request (e.g., HTTP GET).</td>
</tr>
</tbody></table>
<p>If each leg of this journey takes 200ms, you’ve spent 600ms just establishing that the two computers can "hear" each other and have enough memory allocated to handle the session. This is why a small 1KB file can often feel as slow to load as a much larger one; the overhead of the handshake is the dominant factor.</p>
<h2>Why is connection reuse essential for high-performance apps?</h2>
<p>Connection reuse, or persistence, allows multiple requests to be sent over a single established TCP connection, bypassing the 1.5 RTT handshake penalty for subsequent data transfers. By maintaining the synchronized state and allocated memory, the client and server can communicate with zero additional setup overhead after the initial connection.</p>
<p>Imagine you’re building a microservice that needs to fetch twenty different assets. If you opened a new TCP connection for every single asset, you’d be paying that 600ms tax twenty times over. That is 12 seconds of just "saying hello." Instead, modern protocols like HTTP/1.1 and HTTP/2 establish a small pool of persistent connections. We pay the handshake price once, keep the buffers warm, and then stream the data through the existing pipe. This is the single most effective way to mitigate the impact of physical distance on application performance.</p>
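<p>The arithmetic behind that claim, as a sketch using the 200ms-per-leg figure from earlier (a real client would parallelise some of these handshakes, so treat this as the worst case):</p>

```python
LEG = 0.2           # seconds per one-way trip, per the example above
HANDSHAKE_LEGS = 3  # SYN, SYN-ACK, ACK

def setup_cost(requests: int, reuse: bool) -> float:
    """Total seconds spent on handshakes alone."""
    handshakes = 1 if reuse else requests
    return handshakes * HANDSHAKE_LEGS * LEG

cold = setup_cost(20, reuse=False)  # twenty fresh connections
warm = setup_cost(20, reuse=True)   # one persistent connection
```

<p>Cold connections spend 12 seconds just "saying hello"; the persistent pool pays the 600ms tax exactly once.</p>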
<h2>FAQ</h2>
<p><strong>Why can't we just start sending data with the first SYN packet?</strong>
Standard TCP requires the three-way handshake to prevent "Sequence Number Guessing" attacks and to ensure the server doesn't allocate resources for spoofed IP addresses. However, an optimization called TCP Fast Open (TFO) does allow data to be included in the SYN packet for subsequent connections if the client has connected to that server before.</p>
<p><strong>Does a TCP handshake happen for every HTTP request?</strong>
In modern web development, no. Thanks to the <code>Keep-Alive</code> header in HTTP/1.1 and the multiplexing capabilities of HTTP/2 and HTTP/3, a single TCP (or QUIC) connection is typically kept open and reused for hundreds of requests to the same origin.</p>
<p><strong>What happens if the ACK packet in the handshake is lost?</strong>
If the final ACK from the client is lost, the server will eventually time out the half-open connection and deallocate the memory it reserved. The client, assuming the connection is open, will attempt to send data; the server, having no state for it, rejects it (typically with a reset), forcing the client to re-establish the connection.</p>
<p>Cheers.</p>
]]></content:encoded></item><item><title><![CDATA[TCP Exponential Backoff: Why Your Retries are Doubling]]></title><description><![CDATA[TCP prevents network meltdowns by doubling its wait time (Exponential Backoff) every time a packet fails to acknowledge. Instead of spamming a congested link, I look at how the protocol calculates a d]]></description><link>https://doogal.dev/tcp-exponential-backoff-why-your-retries-are-doubling</link><guid isPermaLink="true">https://doogal.dev/tcp-exponential-backoff-why-your-retries-are-doubling</guid><category><![CDATA[networking]]></category><category><![CDATA[TCP]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[computerscience]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:15:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/c8a9f178-7cbc-4ea9-903c-eee20cb437dc.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TCP prevents network meltdowns by doubling its wait time (Exponential Backoff) every time a packet fails to acknowledge. Instead of spamming a congested link, I look at how the protocol calculates a dynamic Retransmission Timeout (RTO) and then backs off to allow hardware buffers to clear and avoid total congestion collapse.</strong></p>
<p>I find it wild that we can just download a file from a physical computer on another continent through a chaotic web of underwater cables and intermediary servers. When I think about TCP, I'm looking at the protocol responsible for taking that file, chopping it into chunks, and ensuring it arrives mostly reliably despite the physical insanity of the global internet infrastructure. The genius isn't just in the delivery, but in how the protocol knows when to stop talking so the network doesn't cave in on itself.</p>
<h2>How does TCP handle packet loss?</h2>
<p>I see TCP ensuring reliability by requiring a specific acknowledgment (ACK) for every data segment sent. If the sender transmits a chunk and doesn't receive an ACK within a set window, it assumes the packet was lost and initiates a retransmission.</p>
<p>I usually think of this like a microservice health check or a database heartbeat. If I send a request and don't get a response, I have to decide when that request has officially failed. If I retry after a single millisecond, I'm going to overwhelm a service that might just be slightly lagged. If I wait five seconds, I'm killing my application's throughput. I need TCP to find that precise "wait" window for every unique connection to keep the data moving as fast as possible without causing a bottleneck.</p>
<h2>What determines the Retransmission Timeout (RTO)?</h2>
<p>The RTO is a dynamic timer that I use to judge when a packet is officially "lost" based on previous Round Trip Time (RTT) measurements. It isn't a static value because the latency I see to a server in London is vastly different from the latency to a server in my own rack.</p>
<p>If my previous packets have been successfully acknowledged in 500ms, my TCP stack might set an RTO of 700ms to provide a buffer for minor jitter. But once that 700ms timer expires without an ACK, the logic has to change. If I just kept hitting the network at 700ms intervals during a failure, I'd be making a bad situation worse. This is why I rely on exponential backoff to handle the silence.</p>
<h2>Why does TCP use exponential backoff?</h2>
<p>I use exponential backoff to prevent "congestion collapse," a state where a network is so saturated with retransmissions that no actual work can get through. By doubling the RTO after every failure, I'm effectively using a circuit breaker to reduce the load on the network until the bottleneck clears.</p>
<p>To understand why I need this, we have to look at the hardware buffers. Every router between my machine and the destination has a finite amount of memory to queue incoming packets. When network traffic spikes and those buffers reach capacity, the router performs a "tail drop"—it simply discards any new incoming packets because there is nowhere to put them. </p>
<p>If every device on that segment responded to a drop by immediately resending data at high frequency, they would create a retransmission storm. The buffers would stay at 100% utilization, and the router would spend all its resources dropping packets rather than routing them. By exponentially increasing the wait time, I'm giving those hardware buffers the space they need to drain and recover. It's about being a good neighbor to the rest of the traffic on the wire.</p>
<h2>How does backoff scale across retries?</h2>
<p>With every consecutive failure to receive an ACK, I double the previous RTO. This binary exponential backoff continues until I hit a maximum threshold or the operating system finally decides the connection is dead and kills the socket.</p>
<table>
<thead>
<tr>
<th>Retry Count</th>
<th>Backoff Multiplier</th>
<th>Example RTO (ms)</th>
</tr>
</thead>
<tbody><tr>
<td>Initial Transmission</td>
<td>1x</td>
<td>700</td>
</tr>
<tr>
<td>1</td>
<td>2x</td>
<td>1,400</td>
</tr>
<tr>
<td>2</td>
<td>4x</td>
<td>2,800</td>
</tr>
<tr>
<td>3</td>
<td>8x</td>
<td>5,600</td>
</tr>
<tr>
<td>4</td>
<td>16x</td>
<td>11,200</td>
</tr>
<tr>
<td>5</td>
<td>32x</td>
<td>22,400</td>
</tr>
</tbody></table>
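<p>The table above is just a doubling loop. Sketched in Python (the 60-second cap is an assumed kernel-style maximum, not something from the table):</p>

```python
INITIAL_RTO = 0.7   # seconds, matching the 700 ms example above
MAX_RTO = 60.0      # assumed cap; kernels clamp the backoff somewhere around here

def rto_schedule(retries: int):
    """RTO for the initial send plus each retry: double on every failure."""
    return [min(INITIAL_RTO * 2 ** n, MAX_RTO) for n in range(retries + 1)]
```

<p>Calling <code>rto_schedule(5)</code> reproduces the table: 0.7, 1.4, 2.8, 5.6, 11.2, and 22.4 seconds.</p>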
<p>Eventually, the connection does time out. There's a limit to how long I'll wait, but this aggressive backing off is what keeps a local network failure from cascading into a total blackout for every other user on that same infrastructure. It’s the difference between a minor lag and a total network shutdown.</p>
<h2>FAQ</h2>
<h3>What is the difference between RTT and RTO?</h3>
<p>RTT (Round Trip Time) is the actual measured time it takes for a packet to travel to the destination and back. RTO (Retransmission Timeout) is the calculated duration the sender will wait for an acknowledgment before assuming the packet was lost, typically derived from RTT plus a variance buffer.</p>
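<p>The standard estimator (RFC 6298) keeps a smoothed RTT plus a variance term and derives the RTO from both. A minimal sketch; the 1-second floor is the RFC's minimum, though Linux in practice uses a lower one:</p>

```python
ALPHA, BETA = 1 / 8, 1 / 4  # smoothing gains from RFC 6298

def update_rto(srtt, rttvar, sample, min_rto=1.0):
    """Fold one new RTT sample into the smoothed estimate and derive the RTO."""
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
    srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = max(min_rto, srtt + 4 * rttvar)
    return srtt, rttvar, rto

srtt, rttvar, rto = update_rto(srtt=0.5, rttvar=0.05, sample=0.5)
```

<p>A perfectly steady 500ms link shrinks the variance term over time, so the RTO converges toward the floor instead of staying padded.</p>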
<h3>Why not just use a fixed retry interval?</h3>
<p>Fixed intervals lead to congestion collapse. If thousands of devices all retry at the same static interval, they can synchronize their retransmissions, creating massive spikes in traffic that keep router buffers full and prevent the network from ever recovering.</p>
<h3>Can I tune the maximum number of TCP retries?</h3>
<p>In Linux, I can tune this via <code>sysctl</code> using the <code>net.ipv4.tcp_retries2</code> parameter. This setting dictates how many times the kernel will retry before giving up on an established connection. While I can lower this to fail faster, increasing it too much can lead to stale sockets hanging around for over half an hour on a dead link.</p>
]]></content:encoded></item><item><title><![CDATA[TCP: Why the Internet Works Even When It's Broken]]></title><description><![CDATA[TL;DR: TCP is how we send big files over a mess of unreliable cables. It chops data into numbered chunks and won't stop nagging the receiver until every single piece is accounted for. If a packet gets]]></description><link>https://doogal.dev/tcp-why-the-internet-works-even-when-it-s-broken</link><guid isPermaLink="true">https://doogal.dev/tcp-why-the-internet-works-even-when-it-s-broken</guid><category><![CDATA[networking]]></category><category><![CDATA[TCP]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[computerscience]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:12:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/eca7fee4-2cdb-49b6-94ed-43841f3c18a6.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: TCP is how we send big files over a mess of unreliable cables. It chops data into numbered chunks and won't stop nagging the receiver until every single piece is accounted for. If a packet gets dropped or a router chokes, TCP just resends it until the job is done.</strong></p>
<p>The internet is a series of physical cables and aging hardware that is constantly failing. Between your computer and a server, there are dozens of points of failure where data can be lost, corrupted, or just dropped because a router got too hot. TCP (Transmission Control Protocol) is the protocol that keeps your data from becoming a corrupted mess by assuming the network is going to fail.</p>
<h2>How does TCP handle data loss on an unreliable network?</h2>
<p>TCP handles data loss by breaking large files into small chunks and requiring a formal acknowledgment for every single one. Instead of hoping a 1GB file arrives in one piece, it treats the network as a "best-effort" medium and takes full responsibility for verifying that every byte landed safely.</p>
<p>Sending a massive file over a physical wire is a gamble. If one bit flips or a single packet hits a congested router and gets dropped, the whole transmission is ruined. TCP doesn't gamble. It puts every page of your data into its own envelope, labels it with a page number, and sends it out. Then, it waits for a phone call. If the recipient says they got pages 1, 2, and 4, the sender knows page 3 was lost in the mail. The sender doesn't have to guess; they just grab a copy of page 3 and send it again until the recipient confirms they have it.</p>
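<p>The "which page went missing" check is trivial once everything is numbered. A toy sketch; real TCP tracks byte ranges rather than page numbers:</p>

```python
def missing_pages(received, total):
    """Sequence numbers that never arrived and need retransmitting."""
    return sorted(set(range(1, total + 1)) - set(received))

gaps = missing_pages([1, 2, 4], total=4)  # page 3 needs a resend
```
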
<h2>What is the step-by-step process of TCP data transmission?</h2>
<p>TCP transmission follows a strict loop of segmentation, sequencing, and verification. It transforms a raw, unreliable stream of bits into a structured conversation between two machines to ensure the final payload is identical to the source.</p>
<table>
<thead>
<tr>
<th>Phase</th>
<th>What Happens</th>
<th>Why it Matters</th>
</tr>
</thead>
<tbody><tr>
<td>Segmentation</td>
<td>Chop the payload into MTU-sized segments.</td>
<td>Keeps chunks small enough for hardware to handle without choking.</td>
</tr>
<tr>
<td>Sequencing</td>
<td>Stamp every packet with a sequence ID.</td>
<td>Lets the receiver rebuild the file in order, even if packets arrive late.</td>
</tr>
<tr>
<td>Transmission</td>
<td>Push segments onto the physical wire.</td>
<td>This is the "unreliable" part where cables and routers take over.</td>
</tr>
<tr>
<td>ACK Loop</td>
<td>Wait for Acknowledgment (ACK) signals.</td>
<td>The only way the sender knows the data actually arrived.</td>
</tr>
<tr>
<td>Retransmit</td>
<td>Resend segments if an ACK times out.</td>
<td>Fixes network errors automatically without the user ever noticing.</td>
</tr>
</tbody></table>
<h2>Why does TCP use sequence numbers for every packet?</h2>
<p>Sequence numbers act as the index that allows the receiver to reassemble data in the correct order and identify gaps. Without these numbers, the receiver would have no way of knowing if a packet was missing or if the data arrived out of sequence.</p>
<p>Think about a high-res image being sent across the country. Packet #50 might take a faster route through the network and arrive before packet #49. Without sequence numbers, your computer would just stick the bits together in the order they arrived, and the image would look like static. The sequence number tells the OS exactly where that chunk belongs in the final file, allowing it to buffer early arrivals until the missing gaps are filled.</p>
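<p>A simplified model of that buffering logic (real TCP works on byte offsets and a sliding window, but the idea is the same):</p>

```python
def deliver_in_order(packets):
    """Buffer out-of-order packets; release only the contiguous prefix."""
    buffered = dict(packets)  # seq -> payload, in whatever order they arrived
    out, expected = [], 0
    while expected in buffered:
        out.append(buffered.pop(expected))
        expected += 1
    return b"".join(out)

data = deliver_in_order([(1, b"B"), (0, b"A"), (2, b"C")])
```

<p>Packet 1 arrives early and just sits in the buffer; nothing reaches the application until packet 0 fills the gap.</p>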
<h2>What happens when a TCP acknowledgment is never received?</h2>
<p>When an acknowledgment (ACK) doesn't return within a specific window, the sender assumes the packet died in transit and triggers a retransmission. It keeps the data in a local buffer and refuses to clear it until it's 100% sure the other side has it.</p>
<p>This is the core of TCP's reliability. If you are pushing code to a server and the connection flutters, TCP doesn't just let that chunk of data vanish. It will keep retrying that specific sequence ID until the server finally responds with a green light. It’s a persistent, nagging mechanism that ensures the integrity of the data at the cost of some overhead.</p>
<h2>FAQ</h2>
<p><strong>What is the cost of TCP reliability?</strong>
The primary cost is latency overhead. Because every segment must be acknowledged and a "handshake" is required to open the connection, TCP is naturally slower than protocols that just fire data into the void without checking whether it landed.</p>
<p><strong>Why use TCP over UDP?</strong>
You use TCP when accuracy is non-negotiable, like loading a website, sending an email, or downloading software. You use UDP when speed is more important than a few dropped packets, like in a Zoom call or a competitive multiplayer game where a momentary glitch is better than the whole stream pausing to wait for a retransmission.</p>
<p><strong>Does TCP ever give up on resending data?</strong>
Yes. While TCP is persistent, it isn't infinite. If it fails to get an ACK after a set number of retries or a specific timeout period, it will eventually "reset" the connection and signal to the application that the network path is dead.</p>
]]></content:encoded></item><item><title><![CDATA[Fixing Biased Entropy: The Von Neumann Unbiasing Trick]]></title><description><![CDATA[TL;DR: I've found that hardware entropy sources are rarely uniform. To solve this, I use Von Neumann Unbiasing, which pairs bits and discards identical results (00, 11). By mapping 01 to 0 and 10 to 1]]></description><link>https://doogal.dev/fixing-biased-entropy-the-von-neumann-unbiasing-trick</link><guid isPermaLink="true">https://doogal.dev/fixing-biased-entropy-the-von-neumann-unbiasing-trick</guid><category><![CDATA[computerscience]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[Math]]></category><category><![CDATA[probability]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:11:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/fe5a6105-1174-4526-a137-e71bbeea5efe.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: I've found that hardware entropy sources are rarely uniform. To solve this, I use Von Neumann Unbiasing, which pairs bits and discards identical results (00, 11). By mapping 01 to 0 and 10 to 1, I can extract a perfectly fair 50/50 distribution from any biased source, provided the bias is constant and bits are independent.</strong></p>
<p>I’ve found that hardware is always noisier than you’d expect—and rarely in the way you want. When I pull entropy from thermal jitter or diode noise, I'm dealing with the messy physical world, which doesn't care about my requirement for a perfect distribution. A sensor might lean toward a logic high or low due to temperature fluctuations or voltage drops, and in practice, achieving a perfect 0.5 probability out of a physical component is almost impossible.</p>
<p>If I see someone using biased entropy for key generation, I know they're shrinking their effective keyspace and making their system vulnerable to brute-force attacks. A cryptographic key is only as strong as the randomness used to create it. If your bits are weighted 60/40, you’ve introduced a pattern that an attacker can exploit. I need to process that raw, physical noise into a mathematically balanced stream before it is used in a production environment.</p>
<h2>Why are hardware random number generators often biased?</h2>
<p>Physical sensors are influenced by environmental conditions and internal circuit resistance that favor one electrical state over another. Unlike a mathematical pseudo-random number generator (PRNG), a hardware source is a physical device subject to manufacturing defects and external interference.</p>
<p>I want you to imagine you’re building a microservice that relies on an internal hardware RNG. If that hardware is even slightly more sensitive to a certain voltage threshold, it will produce more ones than zeros. This bias can be subtle—perhaps only a 1% shift—but in the world of security, that shift is enough to weaken the randomness of every session key your service generates.</p>
<h2>How do you turn a weighted coin flip into a 50/50 result?</h2>
<p>I group incoming bits into pairs and discard any pair where the bits match. I only output a single bit when I see a transition—either a zero followed by a one, or a one followed by a zero.</p>
<p>This logic ensures that the output remains fair even if the source is heavily biased. Here is the mapping logic I use to clean the stream:</p>
<table>
<thead>
<tr>
<th>First Bit</th>
<th>Second Bit</th>
<th>Action</th>
<th>Result</th>
</tr>
</thead>
<tbody><tr>
<td>0</td>
<td>0</td>
<td>Discard</td>
<td>(None)</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>Discard</td>
<td>(None)</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>Accept</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>Accept</td>
<td>1</td>
</tr>
</tbody></table>
<h2>Why does this trick guarantee a perfect probability split?</h2>
<p>The reason I love this trick is that the math is elegantly simple: p * q is always equal to q * p. Even if your source favors one side, the probability of seeing a specific two-bit transition is mathematically identical to that of seeing its reverse.</p>
<p>Let’s say I am looking at a broken hardware sensor that lands on heads (1) 75% of the time and tails (0) 25% of the time. </p>
<ul>
<li>The probability of (1,1) is 0.75 * 0.75 (0.5625) -&gt; Discarded.</li>
<li>The probability of (0,0) is 0.25 * 0.25 (0.0625) -&gt; Discarded.</li>
<li>The probability of (0,1) is 0.25 * 0.75 (0.1875) -&gt; Result: 0.</li>
<li>The probability of (1,0) is 0.75 * 0.25 (0.1875) -&gt; Result: 1.</li>
</ul>
<p>Since 0.1875 is exactly equal to 0.1875, I get an exactly 50% chance of getting a 0 or a 1. The original bias doesn't change the fact that the two mixed outcomes are equally likely.</p>
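<p>A quick simulation makes the 75/25 example concrete; the "sensor" here is just a seeded pseudo-random stream standing in for real hardware noise:</p>

```python
import random

def biased_bits(p_one, rng):
    """Endless bit stream where P(1) = p_one, standing in for a broken sensor."""
    while True:
        yield 1 if rng.random() < p_one else 0

def von_neumann(bits):
    """Yield unbiased bits: 01 -> 0, 10 -> 1, matching pairs discarded."""
    while True:
        x, y = next(bits), next(bits)
        if x != y:
            yield x

rng = random.Random(42)
out = [b for _, b in zip(range(100_000), von_neumann(biased_bits(0.75, rng)))]
share_of_ones = sum(out) / len(out)  # lands very close to 0.5
```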
<h2>What are the performance trade-offs of unbiasing?</h2>
<p>The primary trade-off I see is throughput; I am forced to throw away a massive amount of raw data, which can lead to entropy starvation in systems like Linux. When the entropy pool behind <code>/dev/random</code> runs dry, the OS blocks, which can halt a deployment or stall a cryptographic handshake.</p>
<p>In that 75/25 bias scenario, I am discarding 62.5% of the raw bits. If I have a system generating long-term SSH host keys during a cloud instance boot-up, this discarding logic can cause a visible hang. I've seen setup scripts stop and deployment pipelines stall because the system was waiting on a hardware-accelerated source that was too biased to keep up with the demand. When I implement this in firmware, I keep the logic as lean as possible:</p>
<pre><code class="language-python">def unbias(bit_stream):
    """Return one unbiased bit from an iterator of biased bits."""
    while True:
        # Draw a pair; matching pairs (00, 11) are discarded.
        x, y = next(bit_stream), next(bit_stream)
        if x != y:
            return x  # 01 maps to 0, 10 maps to 1
</code></pre>
<h2>FAQ</h2>
<h3>Does this work if the hardware bias changes over time?</h3>
<p>No. This algorithm relies on the probability (p) remaining constant across both bits in the pair. If the bias is drifting rapidly—for instance, if I am looking at a sensor that is overheating and its 1/0 ratio is swinging wildly every millisecond—the unbiasing effect breaks down and I may still end up with skewed output.</p>
<h3>What happens if the bits are correlated?</h3>
<p>If the bits are not independent—meaning a 1 is more likely to be followed by another 1 (autocorrelation)—this trick fails. In those cases, I would typically use a cryptographic hash function like SHA-256 as an entropy extractor to flatten the distribution and remove the patterns.</p>
<h3>Is there a more efficient way to extract bits?</h3>
<p>Yes, algorithms like the Peres or Elias strategies extract more entropy by looking at longer bit sequences. However, I rarely use them because they require complex state management and larger memory buffers. Von Neumann is my go-to for low-level work because it requires zero memory and can be implemented with a simple loop or a few logic gates.</p>
]]></content:encoded></item><item><title><![CDATA[Why Your Computer Can't Just Pick a Number: Navigating the Spectrum of Randomness]]></title><description><![CDATA[TL;DR: Computers are deterministic, meaning they struggle to create "true" randomness. I solve this using a spectrum of techniques: Pseudo-Random Number Generators (PRNGs) for logic like gaming, hardw]]></description><link>https://doogal.dev/why-your-computer-can-t-just-pick-a-number-navigating-the-spectrum-of-randomness</link><guid isPermaLink="true">https://doogal.dev/why-your-computer-can-t-just-pick-a-number-navigating-the-spectrum-of-randomness</guid><category><![CDATA[computerscience]]></category><category><![CDATA[Cryptography]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[General Programming]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:08:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/a667e327-6938-4d85-8352-279a1dd34a34.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Computers are deterministic, meaning they struggle to create "true" randomness. I solve this using a spectrum of techniques: Pseudo-Random Number Generators (PRNGs) for logic like gaming, hardware-based True Random Number Generators (TRNGs) for standard security, and quantum systems for absolute cryptographic unpredictability where physics guarantees the result.</strong></p>
<p>One of the best things about computers is that they do exactly what you tell them. And one of the worst things about computers is that they will do exactly what you tell them. If I want a random number, I’m immediately running into a wall because machines are deterministic by design. They don't "guess"; they calculate.</p>
<p>When I need a random result, I can just flip a coin or roll a die. It’s messy, physical, and easy. But for a computer, providing a random value requires breaking its own internal logic to find a source of chaos. Depending on the stakes—whether I'm building a loot drop for an RPG or a high-level encryption layer—I have to choose the right level of randomness.</p>
<h2>Why is generating a random number so hard for a computer?</h2>
<p>Computers are deterministic systems, meaning if I give a machine the same input and state, it will produce the exact same output every single time. Because a computer lacks the natural "messiness" of a human, it cannot generate a truly random value without an external source of noise.</p>
<p>I’ve found that many engineers overlook how rigid our hardware really is. Every operation is the result of a defined instruction set. If I ask a function to return a random value, that function has to execute logic. And if that logic is based on math, it’s reproducible. To get something that feels random, I have to point the computer at something it can't control.</p>
<h2>What is a pseudo random number generator and when should I use one?</h2>
<p>A Pseudo-Random Number Generator (PRNG) is a deterministic algorithm that takes a starting "seed" and runs it through a formula to produce a sequence of numbers that appear random. While the output looks chaotic to a user, the entire sequence is actually fixed and will repeat perfectly if I use the same seed twice.</p>
<p>I use PRNGs for the vast majority of my work—specifically in areas like video games or UI testing. If I’m building a game like Minecraft, I actually want this determinism; it’s what allows players to share a "world seed" and see the exact same terrain. For standard tasks like calling <code>Math.random()</code>, a PRNG is plenty, but I have to remember that if an attacker knows my seed, they can predict every "random" number that follows.</p>
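<p>A minimal demonstration of that seed determinism, using Python's standard <code>random</code> module as a stand-in for any PRNG (the "terrain" values are just illustrative integers):</p>

```python
import random

world_seed = 1337
gen_a = random.Random(world_seed)
gen_b = random.Random(world_seed)

# Same seed, same algorithm: bit-for-bit identical "random" terrain.
terrain_a = [gen_a.randint(0, 255) for _ in range(5)]
terrain_b = [gen_b.randint(0, 255) for _ in range(5)]
assert terrain_a == terrain_b
```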
<h2>How does a true random number generator harvest physical entropy?</h2>
<p>True Random Number Generators (TRNGs) move beyond algorithms by harvesting entropy from physical chaos within the hardware, such as CPU temperature fluctuations or the nanosecond timing of hardware interrupts. Instead of calculating a number, the system is essentially "measuring" the noise of the physical world.</p>
<p>I’ve seen people point to simple system time as a source of truth, but let’s be clear: system time is usually just a seed for a PRNG. To get to the TRNG level, I’m looking for hardware "jitter." These are the tiny, unpredictable micro-fluctuations in thermal noise or the exact moment a packet hits a network card. This is the standard for things like gambling websites, where I need to ensure that no amount of reverse-engineering can reveal a pattern in the deck shuffle.</p>
<table>
<thead>
<tr>
<th>Level of Randomness</th>
<th>Source</th>
<th>Predictability</th>
<th>Best Use Case</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Pseudo (PRNG)</strong></td>
<td>Seeded Algorithms</td>
<td>High (if seed is known)</td>
<td>Games, UI, Simulations</td>
</tr>
<tr>
<td><strong>True (TRNG)</strong></td>
<td>Hardware Entropy (Heat/Jitter)</td>
<td>Very Low</td>
<td>Gambling, SSL Certificates</td>
</tr>
<tr>
<td><strong>Quantum (QRNG)</strong></td>
<td>Subatomic Particles</td>
<td>Zero</td>
<td>High-Stakes Cryptography</td>
</tr>
</tbody></table>
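<p>In Python, this split maps cleanly onto the standard library: <code>random</code> is a seeded PRNG, while <code>secrets</code> draws from the OS entropy pool, which is typically fed by hardware noise. A minimal sketch of when I reach for each:</p>

```python
import random
import secrets

# PRNG: fast and reproducible, which means predictable if the seed leaks.
rng = random.Random(2024)
loot_roll = rng.randint(1, 100)        # fine for a game

# OS entropy pool (hardware-backed on most systems): for security.
session_token = secrets.token_hex(16)  # fine for auth tokens

assert random.Random(2024).randint(1, 100) == loot_roll  # the PRNG repeats
```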
<h2>Do I need quantum randomness for secure cryptography?</h2>
<p>Quantum randomness is the gold standard used when the stakes are high enough that I need unpredictability guaranteed by the laws of physics. This involves measuring events at the subatomic level—like sending particles at a half-silvered mirror—where the outcome is a literal 50/50 probability.</p>
<p>In the world of cryptography, "good enough" usually isn't enough. If there is even a slight statistical bias in my random number source, a sophisticated attacker can exploit it to break an encryption key. By reaching into the realm of quantum mechanics, I ensure that the randomness is genuine and absolute. It moves the security of the system from a software challenge to a physical certainty.</p>
<h2>FAQ</h2>
<p><strong>Is the random number generator in my programming language secure?</strong>
Generally, no. Most default functions like <code>Math.random()</code> or <code>rand()</code> are PRNGs designed for speed, not security. If I’m generating a password or a session token, I always reach for a cryptographically secure library like the Web Crypto API or <code>crypto/rand</code> in Go.</p>
<p><strong>What happens if I use the same seed in a PRNG?</strong>
I will get the exact same sequence of "random" numbers every time. This is a common pitfall in testing; if I don't vary my seed (often by using the current timestamp), my "random" tests will actually be testing the exact same path over and over.</p>
<p><strong>Where does a headless server get its entropy if there’s no user input?</strong>
Modern servers gather entropy from hardware sources like the RDRAND instruction on Intel CPUs or interrupt timings from the disk and network. If a system runs out of this entropy, it can actually "starve," causing processes that require high-quality randomness to hang until more chaos is harvested.</p>
]]></content:encoded></item><item><title><![CDATA[The 'Top 1%' Hiring Myth: It’s a Ratio, Not a Talent Rank]]></title><description><![CDATA[TL;DR: When a company claims to hire the "top 1%," they are describing a recruitment ratio—one hire for every 100 CVs—not an objective ranking of talent. Engineering skill is context-dependent, meanin]]></description><link>https://doogal.dev/the-top-1-hiring-myth-it-s-a-ratio-not-a-talent-rank</link><guid isPermaLink="true">https://doogal.dev/the-top-1-hiring-myth-it-s-a-ratio-not-a-talent-rank</guid><category><![CDATA[#softwareengineering]]></category><category><![CDATA[techhiring]]></category><category><![CDATA[careeradvice]]></category><category><![CDATA[EngineeringManagement]]></category><category><![CDATA[recruiting]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:07:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/1be02418-3ff8-43c5-9d68-1c06f872d117.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: When a company claims to hire the "top 1%," they are describing a recruitment ratio—one hire for every 100 CVs—not an objective ranking of talent. Engineering skill is context-dependent, meaning the "elite" candidate for a kernel-heavy infrastructure team is often a poor fit for a product-focused startup.</strong></p>
<p>When a company says they only hire the top 1%, it’s a marketing flex used to justify high prices to VCs and ego boosts to candidates. It suggests there’s a master leaderboard of engineers where every candidate is boiled down to a single "talent" attribute. This is a statistical sleight of hand. They want you to think they’ve found the best human in the pile, but the reality is much more mundane.</p>
<h2>What does it actually mean to hire the top 1%?</h2>
<p>It means the company received 100 CVs and hired one person. This is a measure of recruitment volume and filtering intensity, not a scientific ranking of engineering capability.</p>
<p>In a hypothetical lineup of 100 developers, the "top 1%" pitch implies a linear ranking where candidate #1 is objectively superior to #2. But engineering talent isn't a scalar value. If a team needs a Rust specialist to write memory-safe kernels, a world-class React developer who can ship a feature in two hours is effectively useless to them. Both are elite in their domains, but they aren't interchangeable on a single scale. The "1%" label is just the result of a specific filter applied to a specific pile of resumes.</p>
<h2>Why is technical talent subjective across different companies?</h2>
<p>Every engineering team over-indexes on specific pain points, meaning one firm’s "perfect hire" is another firm’s "hard pass." Talent is context-dependent, shifting based on whether a team needs product intuition, deep infrastructure knowledge, or client-facing communication.</p>
<table>
<thead>
<tr>
<th>Company Need</th>
<th>Priority Trait</th>
<th>Engineering Profile</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Rapid Prototyping</strong></td>
<td>Product Sense</td>
<td>High-velocity delivery over perfect abstractions.</td>
</tr>
<tr>
<td><strong>High-Scale Infra</strong></td>
<td>Technical Depth</td>
<td>Focus on latency, concurrency, and low-level optimization.</td>
</tr>
<tr>
<td><strong>Technical Consulting</strong></td>
<td>Communication</td>
<td>Translating technical debt into stakeholder risk.</td>
</tr>
<tr>
<td><strong>Early-Stage Growth</strong></td>
<td>Generalist</td>
<td>Polyglot capable of jumping from CSS to DB migrations.</td>
</tr>
</tbody></table>
<h2>How does random chance impact the "Top 1%" claim?</h2>
<p>Significantly. Hiring is as much about the interviewer’s mood and niche biases as it is about your GitHub streak.</p>
<p>It is common for the same cohort of 100 candidates to produce completely different "winners" at different companies. One developer might land the offer because they happen to be using the exact library the lead dev is currently struggling with. Another might get rejected because the interviewer has an irrational vendetta against a specific framework. Because different companies prioritize different signals, the "top 1%" label is just a byproduct of whoever happened to fit that week's specific requirements and the interviewers' individual preferences.</p>
<h2>Is the recruitment process an objective talent filter?</h2>
<p>No, it is a matching process that frequently mistakes coincidence for quality. Different companies have different priorities, and the "top 1%" at one company might be at the bottom of the list for another based solely on the tech stack or the product philosophy.</p>
<p>You could take the same 100 candidates, send them to ten different companies, and walk away with ten different "top 1%" hires. All ten companies would claim they found the elite tier, but in reality, they just found the person who best matched their specific, biased requirements at that exact moment in time. </p>
<h2>FAQ</h2>
<p><strong>Why do big tech companies use the 1% metric?</strong>
It manufactures scarcity and maintains a "premium" brand image. This attracts a high volume of applicants, which ironically makes the ratio even smaller and reinforces the claim.</p>
<p><strong>Is there such a thing as an objectively "elite" engineer?</strong>
There are high-impact engineers, but their status is usually a result of being in an environment where their specific skills act as force multipliers. An elite systems architect is just another dev in a team that only needs basic CRUD apps.</p>
<p><strong>Should I tailor my profile for "top tier" companies?</strong>
Yes, because they aren't looking for "talent" in the abstract; they are looking for a specific set of attributes—like product sense or specific language depth—that solve their immediate technical hurdles.</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Why Your Commits Aren't Actually Diffs]]></title><description><![CDATA[TL;DR: Git is a content-addressable filesystem that stores project states as full snapshots rather than incremental deltas. Every object—blobs, trees, and commits—is identified by a unique SHA-1 hash ]]></description><link>https://doogal.dev/git-internals-why-your-commits-aren-t-actually-diffs</link><guid isPermaLink="true">https://doogal.dev/git-internals-why-your-commits-aren-t-actually-diffs</guid><category><![CDATA[Git]]></category><category><![CDATA[#GitInternals]]></category><category><![CDATA[versioncontrol]]></category><category><![CDATA[softwarearchitecture]]></category><category><![CDATA[datastructures]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:06:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/aee0ce83-9372-477a-9240-bb70c61199a5.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Git is a content-addressable filesystem that stores project states as full snapshots rather than incremental deltas. Every object—blobs, trees, and commits—is identified by a unique SHA-1 hash of its content, creating an immutable chain where any change to a single byte results in entirely new objects.</strong></p>
<p>You see green and red lines in a pull request and assume Git stores diffs. It doesn't. I view Git as a persistent key-value store where the key is a hash and the value is your data. When I commit code, I am not saving a list of changes; I am saving a snapshot of the entire project state at that exact moment.</p>
<h2>How does Git store actual file content?</h2>
<p>Git ignores filenames and stores raw data as 'blobs' named after their own SHA-1 hashes.</p>
<p>When I create a file called <code>hello.txt</code> with the content "hello world" and add it to a repo, Git prefixes that content with a small header, hashes the result, and creates a blob. If I look inside the <code>.git/objects</code> directory, I can see exactly how this is stored. Git takes the first two characters of the hash to create a directory and uses the remaining 38 characters as the filename. For example, the well-known hash of an empty blob, <code>e69de29...</code>, is stored at <code>.git/objects/e6/9de29...</code>. This is the core of content-addressable storage: the address of the data is derived from the data itself. If I change a single character in that file, the hash changes, and Git writes an entirely new blob file.</p>
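<p>This hashing scheme is easy to reproduce: a blob's object ID is the SHA-1 of a short header ("blob", the content length, a NUL byte) followed by the content itself. A minimal sketch in Python:</p>

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Object ID Git assigns a blob: sha1 of a 'blob <size>\\0' header + content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

oid = git_blob_hash(b"hello world")
# The address is derived from the data itself:
path = f".git/objects/{oid[:2]}/{oid[2:]}"

# The well-known empty-blob hash, identical in every Git repo:
assert git_blob_hash(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```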
<h2>What role does a tree object play?</h2>
<p>A tree object defines the project structure by mapping human-readable filenames to their specific blob or sub-tree hashes.</p>
<p>I think of a tree as a simple directory listing. Each line records file permissions, the object type, the SHA-1 hash, and the filename. This architecture is why I can rename a file without Git needing to copy the actual file data. The blob hash remains identical because the content "hello world" hasn't changed; I have only updated the tree object to point that same hash to a new filename. Because trees are also named after the hash of their content, any change to a filename or a sub-directory hash results in a brand-new tree hash.</p>
<table>
<thead>
<tr>
<th>Object</th>
<th>Data Responsibility</th>
<th>Identity Hash Source</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Blob</strong></td>
<td>Stores raw file bytes</td>
<td>The literal file content</td>
</tr>
<tr>
<td><strong>Tree</strong></td>
<td>Maps names to hashes</td>
<td>The directory list content</td>
</tr>
<tr>
<td><strong>Commit</strong></td>
<td>Links trees to parents</td>
<td>The metadata and tree pointer</td>
</tr>
</tbody></table>
<h2>What happens when a file is modified?</h2>
<p>Modifying a single byte triggers a cascade where a new blob, a new tree, and a new commit are all created with unique hashes.</p>
<p>If I update <code>hello.txt</code> from "hello world" to "hello world, how are you?", the system rebuilds the state of the world. Git writes a new blob for the updated string. Because the hash for <code>hello.txt</code> is now different, the tree containing it must be updated, resulting in a new tree hash. Finally, I create a new commit pointing to that new root tree. This commit also stores the hash of its parent commit. This pointer is what creates the chain we call history. Because the parent hash is part of the commit's content, if I change one bit in an old commit, its hash changes, breaking every subsequent hash in the chain. This immutability is why Git history is so mathematically consistent.</p>
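<p>The cascade can be sketched with a toy commit hash; the real commit format also includes author and timestamp fields omitted here:</p>

```python
import hashlib

def commit_hash(tree: str, parent: str, message: str) -> str:
    """Toy commit ID hashed over the tree pointer, parent pointer, and message."""
    payload = f"tree {tree}\nparent {parent}\n\n{message}".encode()
    return hashlib.sha1(payload).hexdigest()

c1 = commit_hash("tree-a", "none", "initial commit")
c2 = commit_hash("tree-b", c1, "add feature")  # c2 embeds c1's hash

# Rewrite history: a different first commit produces a different c1...
c1_evil = commit_hash("tree-evil", "none", "initial commit")
# ...which cascades, because the child's content (its parent field) changed.
assert commit_hash("tree-b", c1_evil, "add feature") != c2
```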
<h2>FAQ</h2>
<h3>Does Git's snapshot model waste a lot of disk space?</h3>
<p>Git periodically runs a garbage collection process (<code>git gc</code>) that packs objects into compressed files. While it uses delta compression for physical storage, Git maintains the snapshot model at the logical level, ensuring data retrieval is fast and consistent.</p>
<h3>How does Git know if a file hasn't changed?</h3>
<p>When I run a commit, Git compares the current hash of a file's content to the hash stored in the previous tree. If they match, Git simply reuses the existing hash in the new tree object rather than creating a redundant blob.</p>
<h3>Why are commit hashes unique across different machines?</h3>
<p>Since the hash is derived from the content—including the tree hash, author, timestamp, and parent hash—the identity is unique to that specific snapshot. This allows developers to work asynchronously without a central server assigning version numbers.</p>
]]></content:encoded></item><item><title><![CDATA[Your JavaScript Array is a Hash Map in Disguise]]></title><description><![CDATA[TL;DR: JavaScript arrays are fundamentally objects where integer keys are treated as strings. To save performance, engines like V8 attempt to optimize these into contiguous memory blocks (Elements Kin]]></description><link>https://doogal.dev/your-javascript-array-is-a-hash-map-in-disguise</link><guid isPermaLink="true">https://doogal.dev/your-javascript-array-is-a-hash-map-in-disguise</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[V8Engine]]></category><category><![CDATA[webdevelopment]]></category><category><![CDATA[PerformanceOptimization]]></category><category><![CDATA[#softwareengineering]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:04:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/9c8f8d94-db3e-4bce-812e-83c1ee44dfcb.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: JavaScript arrays are fundamentally objects where integer keys are treated as strings. To save performance, engines like V8 attempt to optimize these into contiguous memory blocks (Elements Kinds). However, mixing types or creating sparse "holes" triggers a de-optimization to Dictionary Mode, moving data from the 1ns CPU cache to 100ns RAM lookups.</strong></p>
<p>If you have ever wondered why <code>typeof []</code> returns <code>"object"</code>, the answer isn't just "JavaScript is weird." It is an architectural warning. In the underlying C++ of the V8 engine, arrays are not fixed-size contiguous memory buffers by default. They are hash maps—specifically, exotic objects that map integer keys to values. While this makes JS incredibly flexible, it creates a massive performance hurdle that the engine has to work overtime to solve.</p>
<h3>Why does typeof return "object" for a JavaScript array?</h3>
<p>JavaScript arrays are keyed collections that inherit from the <code>Object</code> prototype, meaning they are essentially specialized objects where the keys are property names. Even though we access elements using <code>arr[0]</code>, the engine internally treats that index as a string key <code>"0"</code> to maintain compliance with the language specification.</p>
<p>Under the hood, this means a standard array doesn't have a guaranteed memory layout. In a language like C, an array of four integers is a single 16-byte block of memory. In JavaScript, a naive array is a collection of pointers scattered across the heap. To find an element, the engine has to perform a hash lookup, which is computationally expensive compared to a simple memory offset. This architectural choice is why V8 spends so much effort trying to "guess" when it can treat your array like a real, contiguous block of memory.</p>
<h3>How does V8 use "Elements Kinds" to optimize performance?</h3>
<p>V8 uses a system called Elements Kinds to track the internal structure of an array, attempting to store data in the most efficient C++ representation possible. If you create an array of small integers, V8 labels it <code>PACKED_SMI_ELEMENTS</code> and stores it as a contiguous block of memory, allowing the CPU to access it with near-zero overhead.</p>
<p>This optimization is all about hardware efficiency. When data is contiguous, it lives in the CPU's L1 or L2 cache. The CPU can use "prefetching" to load the next few elements into the cache before your code even asks for them. Retrieval from the cache takes about 1 nanosecond. However, if the array becomes a hash map (Dictionary Mode), the CPU has to engage in "pointer chasing." It must go all the way to the system RAM—which can take 100 nanoseconds or more—to find the memory address of the next bucket in the hash map. That 100x latency jump is the hidden tax of unoptimized JavaScript.</p>
<h3>What triggers the transition to Dictionary Mode?</h3>
<p>The transition from a fast, packed array to a slow hash map is often a one-way street. V8 starts with the most optimized state and "downgrades" the array as you introduce complexity, such as mixing data types or creating large gaps between indices.</p>
<p>If you have a <code>PACKED_SMI_ELEMENTS</code> array and you push a floating-point number into it, the engine transitions it to <code>PACKED_DOUBLE_ELEMENTS</code>. If you then push a string, it becomes <code>PACKED_ELEMENTS</code> (a generic array of pointers). The most destructive action, however, is creating a "holey" array. If you define <code>let a = [1, 2, 3]</code> and then suddenly set <code>a[1000] = 4</code>, V8 refuses to allocate 997 empty memory slots. Instead, it panics and converts the entire structure into <code>DICTIONARY_ELEMENTS</code>. Once an array is downgraded to a dictionary, it rarely—if ever—gets promoted back to a packed state.</p>
<pre><code class="language-javascript">// Starts as PACKED_SMI_ELEMENTS (Fastest)
const arr = [1, 2, 3]; 

// Transitions to PACKED_DOUBLE_ELEMENTS
arr.push(1.5); 

// Transitions to DICTIONARY_ELEMENTS (Hash Map)
// This creates a 'hole', triggering a full de-optimization
arr[1000] = 42; 
</code></pre>
<h3>Why do these 100ns delays matter in intensive tasks?</h3>
<p>In standard UI development, a 100ns delay is invisible. However, in high-throughput backend processing or 60fps graphics programming, these delays are catastrophic. In a <code>requestAnimationFrame</code> loop, you have a hard limit of 16.6ms to finish all calculations. If you are iterating over thousands of "arrays" that are actually hash maps, the constant round-trips to RAM will eat your frame budget and cause visible stuttering.</p>
<p>Similarly, if you are building a data-intensive microservice that processes millions of JSON objects, the cumulative cost of hash map lookups instead of direct memory access can result in a 10x or 100x decrease in total throughput. This is why tools like TensorFlow.js or high-performance game engines use <code>TypedArrays</code> (like <code>Int32Array</code>), which bypass this "Elements Kind" guessing game entirely and force the engine to use contiguous memory.</p>
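<p>As a rough illustration of why <code>TypedArrays</code> sidestep the Elements Kind guessing game, the sketch below (plain JavaScript, no libraries) shows their fixed type and fixed length in action:</p>

```javascript
// TypedArrays keep a fixed element type and a contiguous backing store.
const views = new Int32Array(4);  // backed by a 16-byte ArrayBuffer
views[0] = 42;
views[1] = 1.9;                   // coerced to the integer 1 -- no kind transition
views[100] = 7;                   // out-of-range writes are silently ignored, never "holey"

console.log(views.length);        // 4 -- length is fixed at construction
console.log(views[1]);            // 1
```

<p>Because the backing <code>ArrayBuffer</code> is allocated up front, there is no kind to transition and no hole to create.</p>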
<h3>V8 Array State Transitions</h3>
<table>
<thead>
<tr>
<th>State</th>
<th>Description</th>
<th>Latency</th>
</tr>
</thead>
<tbody><tr>
<td><code>PACKED_SMI</code></td>
<td>Contiguous small integers</td>
<td>~1ns (Cache)</td>
</tr>
<tr>
<td><code>PACKED_DOUBLE</code></td>
<td>Contiguous floats</td>
<td>~1ns (Cache)</td>
</tr>
<tr>
<td><code>HOLEY_ELEMENTS</code></td>
<td>Array with missing indices</td>
<td>Variable (Slower)</td>
</tr>
<tr>
<td><code>DICTIONARY_ELEMENTS</code></td>
<td>Pure Hash Map (De-optimized)</td>
<td>~100ns (RAM)</td>
</tr>
</tbody></table>
<h3>FAQ</h3>
<p><strong>How can I prevent my arrays from becoming hash maps?</strong>
Initialize your arrays with their final size if possible and avoid "holey" assignments. Most importantly, keep your arrays monomorphic—meaning, don't mix integers, strings, and objects in the same collection.</p>
<p><strong>Are TypedArrays immune to this de-optimization?</strong>
Yes. <code>Int32Array</code>, <code>Float64Array</code>, and others are backed by an <code>ArrayBuffer</code>. They have a fixed length and a fixed type, which guarantees they stay as contiguous blocks of memory regardless of what you do with the data.</p>
<p><strong>Does deleting an element make an array a hash map?</strong>
Using the <code>delete</code> keyword on an array index (e.g., <code>delete arr[2]</code>) creates a hole, which transitions the array to a <code>HOLEY</code> state. While it might not immediately hit Dictionary Mode, it significantly slows down access because the engine must now check the prototype chain for that missing index.</p>
]]></content:encoded></item><item><title><![CDATA[Scalable Proximity Search: Why Geohashing Beats Radius Queries]]></title><description><![CDATA[TL;DR: Geohashing maps 2D coordinates to a 1D string using recursive binary partitioning and bit interleaving. By encoding these bits into a Base-32 string, we leverage B-Tree prefix matching for effi]]></description><link>https://doogal.dev/scalable-proximity-search-why-geohashing-beats-radius-queries</link><guid isPermaLink="true">https://doogal.dev/scalable-proximity-search-why-geohashing-beats-radius-queries</guid><category><![CDATA[algorithms]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[#backenddevelopment]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[Geohashing]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:02:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/3a1404c6-ff28-4374-bf07-46ca9ba9aa0f.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Geohashing maps 2D coordinates to a 1D string using recursive binary partitioning and bit interleaving. By encoding these bits into a Base-32 string, we leverage B-Tree prefix matching for efficient spatial lookups, bypassing the high CPU costs of Haversine distance calculations at scale.</strong></p>
<p>Calculating Haversine distances for 10,000 moving objects per tick is a disaster for database performance. Standard SQL radius queries aren't built for high-concurrency spatial updates; they are computationally heavy and fail to scale because they require calculating the distance between the query point and every potential candidate in the dataset. To keep latency low, you need to stop thinking in floating-point coordinates and start thinking in B-Tree friendly strings. This is where geohashing comes in, moving the heavy lifting from the CPU to the database index.</p>
<h3>How does binary partitioning resolve geographic coordinates?</h3>
<p>Binary partitioning recursively divides the map into smaller quadrants, assigning a 0 or 1 based on whether a point falls in the lower/left or upper/right half of the current bounding box. This creates a deterministic bitstream representing a specific geographic area rather than a precise point.</p>
<p>I recently posted a video about geohashing because it is one of the most efficient ways to handle real-time location data. We start by splitting the globe by latitude. If a coordinate is in the Northern hemisphere, we assign a 1; if it is in the Southern hemisphere, a 0. We then take that specific hemisphere and split it again. Is the point in the upper or lower half of that new section? This recursive halving continues until we reach the desired resolution. We perform the exact same operation horizontally for longitude (left/right splits), resulting in two separate binary streams that describe the point's location with increasing precision.</p>
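<p>The recursive halving can be sketched in a few lines of JavaScript. The helper name below is mine, not a library API; it emits a 1 when the value falls in the upper half of the current range and a 0 otherwise:</p>

```javascript
// Recursively halve a range, recording which half the value lands in.
function bisectBits(value, lo, hi, precision) {
  let bits = '';
  for (let i = 0; i < precision; i++) {
    const mid = (lo + hi) / 2;
    if (value >= mid) { bits += '1'; lo = mid; }  // upper half
    else              { bits += '0'; hi = mid; }  // lower half
  }
  return bits;
}

// Latitude 51.5 (London): Northern hemisphere, so the stream starts with 1.
const latBits = bisectBits(51.5, -90, 90, 5);   // "11001"
const lonBits = bisectBits(-0.12, -180, 180, 5); // "01111"
console.log(latBits, lonBits);
```

<p>Each extra bit halves the bounding box again, so precision grows exponentially with stream length.</p>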
<h3>Why do we interleave bitstreams for 1D spatial indexing?</h3>
<p>Interleaving, also known as bit-zipping, alternates bits from the latitude and longitude streams to create a single sequence that preserves 2D proximity within a 1D format. This process follows a Morton order curve (or Z-order curve), which maps multi-dimensional data to one dimension while maintaining the locality of the data points.</p>
<p>In a standard implementation, the first bit of the latitude stream is followed by the first bit of the longitude stream, then the second bit of each, and so on. This ensures that the resulting combined bitstream—and the subsequent geohash string—represents a specific "tile" on the map. Because the bits are interleaved, points that are close together in a 2D space are highly likely to share the same prefix in their 1D bitstream.</p>
<table>
<thead>
<tr>
<th>Bit Index (i)</th>
<th>Latitude Bit (V_i)</th>
<th>Longitude Bit (H_i)</th>
<th>Interleaved Result (V_0 H_0 ... V_i H_i)</th>
</tr>
</thead>
<tbody><tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>10</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1001</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>1</td>
<td>100111</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td>10011100</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>1</td>
<td>1001110011</td>
</tr>
</tbody></table>
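<p>The bit-zipping in the table above is mechanical enough to sketch directly. This follows the article's convention of leading with the latitude bit; note that the canonical geohash spec interleaves the longitude bit first, so the ordering is a convention, not a law:</p>

```javascript
// Zip two equal-length bitstreams: V_0 H_0 V_1 H_1 ...
function interleave(latBits, lonBits) {
  let out = '';
  for (let i = 0; i < latBits.length; i++) {
    out += latBits[i] + lonBits[i];
  }
  return out;
}

// Reproducing the table: V = 10101, H = 01101
console.log(interleave('10101', '01101')); // "1001110011"
```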
<h3>How does Base-32 encoding optimize database lookups?</h3>
<p>Base-32 encoding converts the final interleaved bitstream into a compact, human-readable string using a specific 32-character alphabet (0-9, b-z) that deliberately excludes ambiguous characters like a, i, l, and o. This representation is highly efficient for database indexing because strings are stored lexicographically in B-Trees, allowing for lightning-fast range scans.</p>
<p>When a taxi app needs to find drivers near you, it does not calculate the distance to every driver in the city. It identifies your geohash—for example, <code>gcpvj0</code>—and queries the database for any driver whose geohash starts with the same prefix. This turns a complex spatial intersection into a simple string comparison. Since the database index is sorted, the system can find all records within the same geographic tile by performing a single index seek followed by a range scan, which is significantly faster than any geometric calculation.</p>
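<p>The encoding step itself is just a table lookup over 5-bit chunks. A minimal sketch, using the standard geohash alphabet (the function name is mine):</p>

```javascript
// The geohash Base-32 alphabet: digits plus b-z, skipping a, i, l, o.
const ALPHABET = '0123456789bcdefghjkmnpqrstuvwxyz';

// Map each complete 5-bit chunk of the interleaved stream to one character.
function toBase32(bits) {
  let out = '';
  for (let i = 0; i + 5 <= bits.length; i += 5) {
    out += ALPHABET[parseInt(bits.slice(i, i + 5), 2)];
  }
  return out;
}

// "1001110011" splits into 10011 and 10011; both decode to 19 -> 'm'.
console.log(toBase32('1001110011')); // "mm"
```

<p>Because each character covers exactly 5 bits, trimming characters off the end of a hash widens the tile, which is exactly what makes prefix range scans work.</p>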
<h3>FAQ</h3>
<p><strong>What is a Z-order curve and how does it relate to geohashing?</strong>
A Z-order curve is a space-filling curve that visits every point in a multi-dimensional grid while preserving the locality of points. Geohashing uses this curve by interleaving bits; the path the bits follow as you increase precision looks like a repeating 'Z' shape. This is what allows us to represent 2D data in a 1D string index without losing spatial context.</p>
<p><strong>Why does the geohash alphabet exclude certain letters?</strong>
The standard geohash alphabet (the Crockford-inspired Base-32) excludes the letters 'a', 'i', 'l', and 'o' to prevent human transcription errors and character confusion. This makes the hashes more robust when being passed through URLs, logged in debugging consoles, or manually entered by engineers.</p>
<p><strong>How do you handle the 'boundary problem' where nearby points have different hashes?</strong>
Because the Z-order curve occasionally 'jumps' (for example, at the equator or the prime meridian), two points can be centimeters apart but have entirely different geohash prefixes. To mitigate this, production proximity services do not just search the user's current geohash tile; they calculate and query the eight immediate neighboring tiles as well.</p>
<p>Cheers.</p>
]]></content:encoded></item><item><title><![CDATA[Beyond the Haversine Formula: Why I Use Geohashing for Spatial Search]]></title><description><![CDATA[TL;DR: Geohashing encodes 2D coordinates into hierarchical string prefixes, transforming expensive O(n) geometric calculations into efficient indexed lookups. By mapping geographic areas to unique str]]></description><link>https://doogal.dev/beyond-the-haversine-formula-why-i-use-geohashing-for-spatial-search</link><guid isPermaLink="true">https://doogal.dev/beyond-the-haversine-formula-why-i-use-geohashing-for-spatial-search</guid><category><![CDATA[softwarearchitecture]]></category><category><![CDATA[DatabaseOptimization]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[#backenddevelopment]]></category><category><![CDATA[Geohashing]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 28 Mar 2026 12:00:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/6a6a1954-7171-4374-9bee-a26fdef9675b.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Geohashing encodes 2D coordinates into hierarchical string prefixes, transforming expensive O(n) geometric calculations into efficient indexed lookups. By mapping geographic areas to unique string identifiers, databases can execute high-speed prefix scans rather than performing floating-point evaluations on every record. This allows applications to scale proximity searches to millions of concurrent users without hitting CPU bottlenecks.</strong></p>
<p>I have seen too many backends crawl to a halt because they are trying to solve geometry problems at the query layer. The naive approach to finding a nearby taxi or a local restaurant is to store latitude and longitude as floats and then run a radius query. While this works in a development environment with a hundred rows, it fails at production scale because it forces the database to perform an unindexed geometric predicate on every single record.</p>
<p>When I am architecting a system that needs to handle thousands of concurrent spatial queries, I stop asking the database to do trigonometry. Instead, I treat location as a string matching problem. By moving the complexity from the CPU to the index, I can keep response times low even as the dataset grows into the millions.</p>
<h2>Why is calculating distance in a database expensive?</h2>
<p>Proximity is not a property that standard B-Tree indexes can efficiently filter without specialized spatial extensions. This usually forces the database engine to perform an expensive floating-point evaluation on every record in the table to determine if a point falls within the search radius.</p>
<p>In a standard relational database, an index is great for finding an exact match or a range of numbers. But a radius is a circle, and latitude/longitude are two independent variables. To find objects in that circle, the database has to calculate the distance between your center point and every potential candidate. If I have 100,000 drivers and 1,000 users searching at once, the CPU is effectively pinned just trying to solve high-school geometry millions of times per second. It simply does not scale.</p>
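<p>To make that cost concrete, here is a sketch of the Haversine formula the engine would otherwise evaluate for every candidate row (plain JavaScript; the function name is mine):</p>

```javascript
// Great-circle distance between two lat/lon points, in kilometres.
function haversineKm(lat1, lon1, lat2, lon2) {
  const R = 6371; // mean Earth radius in km
  const rad = (d) => (d * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// London to Paris: several trig calls for ONE pair of points,
// repeated for every record a radius query has to consider.
console.log(haversineKm(51.5074, -0.1278, 48.8566, 2.3522)); // ~343 km
```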
<h2>How does Geohashing solve the spatial indexing problem?</h2>
<p>Geohashing encodes latitude and longitude into a single string by recursively partitioning the map into a grid. This maps 2D spatial data onto a 1D string, where points that are physically close share the same character prefix, allowing the database to bucket locations together.</p>
<p>I think of it as a recursive subdivision of the world. If I split the map into a grid and label a large square 'B', any point inside that square starts with the letter 'B'. If I divide 'B' into smaller squares and label one 'C', any point in that smaller square now has the prefix 'BC'. As I continue this process, I build a hierarchical prefix tree. A six-character geohash represents a specific neighborhood, while an eight-character hash points to a specific street corner. </p>
<table>
<thead>
<tr>
<th>Geohash Length</th>
<th>Approximate Area Coverage</th>
<th>Engineering Use Case</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>5,000km x 5,000km</td>
<td>Global data sharding</td>
</tr>
<tr>
<td>4</td>
<td>39km x 19km</td>
<td>Regional search / Weather</td>
</tr>
<tr>
<td>6</td>
<td>1.2km x 0.6km</td>
<td>Local delivery / Dispatching</td>
</tr>
<tr>
<td>8</td>
<td>38m x 19m</td>
<td>Precise asset tracking</td>
</tr>
</tbody></table>
<h2>Why is prefix matching better than radius math?</h2>
<p>Prefix matching turns a complex spatial calculation into a simple index range scan. By querying for a specific string prefix, I am leveraging the database’s primary or secondary index to find nearby points in logarithmic time rather than linear time.</p>
<p>When I use geohashes, I am no longer asking the database to calculate distances. I am asking it to find every record where the <code>location_hash</code> starts with a specific string, like <code>bcde</code>. This is an operation that every modern database, from PostgreSQL to DynamoDB, is built to do at high speed. It essentially turns a spatial query into a standard B-Tree lookup, which is significantly cheaper on the CPU than executing the Haversine formula across the entire dataset.</p>
<h2>How do you handle points on the edge of a grid?</h2>
<p>I handle the "edge case"—where two people are physically close but separated by an arbitrary grid line—by querying the user's current square plus its eight immediate neighbors. This ensures complete coverage without sacrificing the performance gains of the indexed lookup.</p>
<p>While querying nine squares sounds like more work than querying one, it is still orders of magnitude faster than a full table scan. Most geohashing libraries provide a simple function to calculate these "neighboring" hashes. By fetching these nine prefixes in a single batch, I can guarantee that I never miss a nearby taxi just because it happens to be across the street in a different grid cell.</p>
<h2>FAQ</h2>
<p><strong>Can I use Geohashing with NoSQL databases like DynamoDB?</strong>
Yes, this is a primary use case for Geohashing. Since DynamoDB doesn't have native spatial types, you can store the geohash as a Sort Key to perform efficient prefix scans, which is the only way to do performant "nearby" searches in most NoSQL environments.</p>
<p><strong>How do I decide the length of the geohash to store?</strong>
I usually store at a high precision (10-12 characters) but query at a lower precision. For a ride-sharing app, I might query at length 6 to get a 1km search area, then do a quick client-side sort to find the absolute closest driver from that filtered subset.</p>
<p><strong>Is Geohashing better than PostGIS?</strong>
PostGIS is excellent for complex polygons and precise geographic analysis. However, if your only goal is to find "points near me" at massive scale, Geohashing is often easier to implement, cheaper to run, and more portable across different database technologies.</p>
]]></content:encoded></item><item><title><![CDATA[Why Your Database Hates COUNT(DISTINCT) and Why HyperLogLog is the Cure]]></title><description><![CDATA[TL;DR: HyperLogLog (HLL) is a probabilistic data structure that estimates unique counts by analyzing the bit patterns of hashed IDs. Instead of storing every user ID, it tracks the maximum number of l]]></description><link>https://doogal.dev/why-your-database-hates-count-distinct-and-why-hyperloglog-is-the-cure</link><guid isPermaLink="true">https://doogal.dev/why-your-database-hates-count-distinct-and-why-hyperloglog-is-the-cure</guid><category><![CDATA[algorithms]]></category><category><![CDATA[BackendEngineering]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[performance]]></category><category><![CDATA[ProbabilisticDataStructures]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Wed, 18 Mar 2026 15:42:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/dd38086b-1a8a-49c0-afc2-aa18a45836d1.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: HyperLogLog (HLL) is a probabilistic data structure that estimates unique counts by analyzing the bit patterns of hashed IDs. Instead of storing every user ID, it tracks the maximum number of leading zeros in hashed values, allowing you to estimate billions of unique views using about 12KB of memory with ~2% error.</strong></p>
<p>Scaling unique view counts is a silent database killer. If you try to track every <code>user_id</code> for every post on a platform with millions of users, your infrastructure costs will eventually eclipse the value of the feature itself. You're effectively burning RAM to show a number on a UI that doesn't even need to be 100% precise.</p>
<p>I’ve seen plenty of teams try the naive route: a dedicated table of user IDs and a big <code>COUNT(DISTINCT)</code> query. At a certain scale, that stops being a query and starts being a resource exhaustion event. If you want to count millions of unique views across millions of posts without your database screaming for mercy, you have to stop storing data and start using math.</p>
<h2>Why is the naive approach to unique counts so expensive?</h2>
<p><strong>Count distinct operations scale linearly with cardinality. Storing a billion 64-bit IDs consumes 8GB of RAM—unsustainable when you’re tracking millions of individual posts simultaneously.</strong></p>
<p>The math is brutal. If you have 1,000,000 posts and each has 1,000 unique views, you aren't just storing a million numbers; you're storing a billion relations. Even if you hash those IDs to save space, the storage footprint remains massive. You are paying for 100% accuracy in a context where 98% accuracy is indistinguishable to the end user. Most platforms don't need to know a post has exactly 1,004,202 views; they just need to know it's around 1M.</p>
<h2>How does the leading zeros math actually work?</h2>
<p><strong>HyperLogLog hashes incoming data into a string of bits where zeros and ones are equally likely. By tracking the longest sequence of leading zeros observed across all hashes, the algorithm uses the probability of "coin flipping" to estimate total cardinality.</strong></p>
<p>Think of it as a statistical shortcut. If I tell you I flipped a coin and got a sequence starting with one zero (Heads), you wouldn’t be surprised; that happens 1 in 2 times. If I tell you I found a sequence starting with ten zeros in a row, you’d correctly guess that I’ve probably flipped that coin about 1,024 times.</p>
<p>Here is how HLL applies that to your data:</p>
<ol>
<li><p>Run a <code>user_id</code> through a hash function to get a binary string.</p>
</li>
<li><p>Count the continuous zeros at the start of that string.</p>
</li>
<li><p>Keep track of the <em>maximum</em> number of zeros you’ve seen for that specific post.</p>
</li>
</ol>
<p>If the maximum number of continuous zeros you've seen is 10, the math (2^10) tells you that you’ve likely seen around 1,000 unique users. You aren't storing the IDs; you're just storing one small integer: the "max zeros" count.</p>
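<p>The three steps above fit in a few lines of JavaScript. This is a bare sketch of the idea, not a real HLL: it uses a simple FNV-1a hash for illustration and a single register, which is exactly the high-variance setup that bucketing (covered by full implementations) exists to fix:</p>

```javascript
// Toy 32-bit FNV-1a hash -- a stand-in, not a production choice.
function hash32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Step 2: count the zeros at the front of the 32-bit binary representation.
const leadingZeros = (n) => Math.clz32(n);

// Step 3: one small integer of state per post -- the max zeros seen so far.
let maxZeros = 0;
for (let i = 0; i < 5000; i++) {
  maxZeros = Math.max(maxZeros, leadingZeros(hash32(`user-${i}`)));
}
console.log(2 ** maxZeros); // rough estimate of unique users seen
```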
<h2>When should I choose an estimate over a precise count?</h2>
<p><strong>Choose probabilistic structures when the cost of storage outweighs the value of perfect precision. For high-volume metrics like video views, social media likes, or unique site visitors, a 2% error rate is a fair trade for a 99% reduction in memory usage.</strong></p>
<p>In a real-world system like Redis, HyperLogLog structures are capped at about 12KB. This is a constant memory footprint. It doesn't matter if you are counting a hundred users or a hundred billion; that 12KB doesn't grow.</p>
<h3>Scaling Strategy: Exact vs. Estimated</h3>
<table>
<thead>
<tr>
<th>Metric Type</th>
<th>Accuracy Required</th>
<th>Recommended Tool</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Financial Transactions</strong></td>
<td>100% (Strict)</td>
<td>SQL COUNT(DISTINCT)</td>
</tr>
<tr>
<td><strong>Unique Website Visitors</strong></td>
<td>~98% (High)</td>
<td>HyperLogLog (HLL)</td>
</tr>
<tr>
<td><strong>Social Media View Counts</strong></td>
<td>~98% (High)</td>
<td>HyperLogLog (HLL)</td>
</tr>
<tr>
<td><strong>Real-time Trending Tags</strong></td>
<td>~95% (Medium)</td>
<td>HLL or Top-K Algorithms</td>
</tr>
</tbody></table>
<h2>FAQ</h2>
<p><strong>Can HyperLogLog handle merging counts from different time periods?</strong> Yes. One of the best features of HLL is that it is "mergeable." If you have one HLL for Monday’s visitors and another for Tuesday’s, you can combine them to get a unique count for the whole 48-hour period without re-processing the raw data.</p>
<p><strong>What happens if I have a very small number of users?</strong> HyperLogLog is actually quite smart about this. Most implementations use "Linear Counting" for small sets to keep the error rate near zero, only switching to the probabilistic "leading zeros" math once the volume hits a certain threshold.</p>
<p><strong>Does the choice of hash function matter?</strong> Absolutely. The hash function must be uniform, meaning every bit in the output has an equal 50/50 chance of being a 0 or a 1. If your hash function is biased, your estimates will be consistently wrong.</p>
<p>Cheers, Doogal</p>
]]></content:encoded></item><item><title><![CDATA[Stop Joining Millions of Rows for Every Single Swipe]]></title><description><![CDATA[TL;DR: Dating apps avoid the architectural nightmare of joining millions of left-swipe records by using Bloom filters. By hashing user IDs into a bit array, they get a 100% guarantee that a '0' means ]]></description><link>https://doogal.dev/stop-joining-millions-of-rows-for-every-single-swipe</link><guid isPermaLink="true">https://doogal.dev/stop-joining-millions-of-rows-for-every-single-swipe</guid><category><![CDATA[computerscience]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[datastructures]]></category><category><![CDATA[systemdesign]]></category><category><![CDATA[#backenddevelopment]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Wed, 18 Mar 2026 15:39:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/a9b47d14-5f56-43f6-8920-97610d487408.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Dating apps avoid the architectural nightmare of joining millions of left-swipe records by using Bloom filters. By hashing user IDs into a bit array, they get a 100% guarantee that a '0' means a user is new, while accepting a rare false positive as a necessary trade-off for high-concurrency performance.</strong></p>
<p>I’ve seen plenty of teams try to scale discovery feeds by throwing more hardware at SQL joins, and it is a losing game. If I have to check a user’s entire history of left swipes against a pool of millions of profiles every time they refresh their deck, the database isn't just going to be slow—it’s going to stop breathing. We aren't talking about a simple query; we are talking about a cross-reference that grows every time a user interacts with the app.</p>
<p>Scaling a discovery engine requires moving away from absolute row-level checks and toward probabilistic data structures. When I look at how top-tier apps handle the "have they seen this person?" problem, I don’t see massive JOIN statements. I see Bloom filters.</p>
<h2>Why can't we just use a standard database join?</h2>
<p>I can't use a standard join because the latency of checking billions of swipe records for every profile in a stack is unsustainable at scale. Even with perfect indexing, the I/O overhead of a massive <code>NOT IN</code> or <code>LEFT JOIN</code> on that volume will kill the request-response cycle and frustrate the user.</p>
<p>Think about the math. If I have 50 million users and each user has swiped on a few thousand people, my swipes table is a disaster waiting to happen. If I try to run a query to "show me 10 people this user hasn't swiped on," the engine has to scan or index-hop through a mountain of data. By the time the database returns a result, the user has already closed the app. I need a way to filter those candidates without actually touching the disk for every individual check.</p>
<h2>How does a Bloom filter handle membership tests?</h2>
<p>I use a Bloom filter to treat membership as a bitmask operation rather than a record lookup. It consists of a fixed-size array of bits, all starting at zero, which I flip to one based on the output of multiple hash functions.</p>
<p>When a user swipes left on someone, I don't just log the event; I run that profile’s ID through a set of hash functions. If I'm using two functions, they might return the integers 1 and 3. I go to my bit array, find those indices, and set them to one. That profile is now "recorded."</p>
<p>Later, when I'm deciding whether to show that same profile again, I run the ID through those same two hashes. If I get back 1 and 3, and both are already set to "1," I assume the user has seen them. But if I check a different profile and the hashes return 3 and 5, and index 5 is still a "0," I know for a fact—with 100% mathematical certainty—that this user has never seen this profile. I show it to them immediately.</p>
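<p>A toy version of that walkthrough looks like this. The two hash functions here are simplistic stand-ins chosen for readability, and the bit array uses a byte per bit for clarity rather than memory efficiency:</p>

```javascript
// Minimal Bloom filter: a bit array plus two hash functions.
class BloomFilter {
  constructor(size) {
    this.size = size;
    this.bits = new Uint8Array(size); // one byte per "bit", for clarity
  }
  hashes(id) {
    let h1 = 0, h2 = 0;
    for (let i = 0; i < id.length; i++) {
      h1 = (h1 * 31 + id.charCodeAt(i)) >>> 0;
      h2 = (h2 * 131 + id.charCodeAt(i)) >>> 0;
    }
    return [h1 % this.size, h2 % this.size];
  }
  add(id) {
    for (const idx of this.hashes(id)) this.bits[idx] = 1;
  }
  mightContain(id) {
    // All bits set: "probably seen". Any bit zero: "definitely never seen".
    return this.hashes(id).every((idx) => this.bits[idx] === 1);
  }
}

const seen = new BloomFilter(1024);
seen.add('profile-42');
console.log(seen.mightContain('profile-42')); // true -- definitely recorded
console.log(seen.mightContain('profile-99')); // a 0 anywhere means definitely never seen
```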
<h3>The Engineering Trade-offs of Probabilistic Filtering</h3>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Bloom Filter Approach</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Storage</strong></td>
<td>Bit-level storage; constant footprint.</td>
</tr>
<tr>
<td><strong>Speed</strong></td>
<td>Constant time (O(k)) lookups.</td>
</tr>
<tr>
<td><strong>Accuracy</strong></td>
<td>100% Negative accuracy; Probabilistic Positive.</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>Negligible CPU/Memory overhead.</td>
</tr>
</tbody></table>
<h2>What happens when the filter gives a false positive?</h2>
<p>I accept the false positive as a necessary trade-off: the app simply skips showing a specific profile because it incorrectly thinks the user has already swiped on it. In a discovery engine with millions of candidates, missing one potential match is an invisible error that saves the entire system from a performance collapse.</p>
<p>If I used a standard Hash Set to keep track of every swipe, the memory usage would balloon until I was spending a fortune on RAM just to keep the feed functional. With a Bloom filter, I keep the memory footprint predictable. If the filter tells me all the bits are "1," but it’s actually a collision and the user hasn't seen that person, I just move to the next candidate. The user doesn't care about the one person they didn't see; they care that the app didn't hang for five seconds while loading the next card.</p>
<h2>FAQ</h2>
<p><strong>Can I remove a swipe from a Bloom filter?</strong> No. In a standard Bloom filter, I can't flip a bit back to zero because I don't know which other IDs also hashed to that same bit. If I need to support "undoing" a swipe, I’d have to use a more complex structure like a Counting Bloom Filter or just rebuild the filter from the source of truth occasionally.</p>
<p><strong>How do I decide the size of the bit array?</strong> It’s a balance between memory and the error rate. If I make the array too small, it fills up with ones too quickly and starts giving me false positives for everyone. I calculate the size based on how many swipes I expect a user to perform over the lifetime of their account to keep the collision rate under a certain threshold.</p>
<p><strong>Does this replace my primary database?</strong> Absolutely not. I still store the actual swipe data in a persistent database like Postgres or Cassandra for long-term records and analytics. The Bloom filter is a performance layer I use to make real-time decisions in the discovery feed without hitting the heavy data store every single time.</p>
]]></content:encoded></item><item><title><![CDATA[How Big Tech Scales View Counts: The Power of HyperLogLog and Harmonic Means]]></title><description><![CDATA[TL;DR: Scaling unique view counts for millions of posts requires more than just a COUNT(DISTINCT) query. Modern platforms use HyperLogLog, a probabilistic data structure that estimates cardinality usi]]></description><link>https://doogal.dev/how-big-tech-scales-view-counts-the-power-of-hyperloglog-and-harmonic-means</link><guid isPermaLink="true">https://doogal.dev/how-big-tech-scales-view-counts-the-power-of-hyperloglog-and-harmonic-means</guid><category><![CDATA[datastructures]]></category><category><![CDATA[scalability]]></category><category><![CDATA[computerscience]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[#backenddevelopment]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Wed, 18 Mar 2026 15:35:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/57ce6a3e-9631-4222-8b9d-4e4a059ca9db.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Scaling unique view counts for millions of posts requires more than just a COUNT(DISTINCT) query. Modern platforms use HyperLogLog, a probabilistic data structure that estimates cardinality using hashing and bucketing. By applying a harmonic mean across thousands of independent buckets, engineers can maintain high accuracy with a tiny memory footprint.</strong></p>
<p>When we talk about scale, we often focus on throughput or latency, but memory consumption for unique metrics is a silent killer. If you are building a platform with millions of users and millions of posts, your first instinct might be to store a set of User IDs for every post to ensure you don't count the same person twice. If you have 10 million unique visitors on a post, and each user ID is a 64-bit integer, you are burning roughly 80MB of memory just for one post's view count. Multiply that by a million posts, and you are looking at an 80TB memory requirement just for view counts. It is simply not feasible.</p>
<p>To solve this, we use HyperLogLog (HLL). It allows us to estimate the number of unique items—what we call cardinality—without storing the actual items. Instead of keeping every User ID, we hash the IDs and look for patterns in the bits. Specifically, we look at the number of leading zeros in the hashed value.</p>
<h2>How does HyperLogLog estimate millions of unique views?</h2>
<p>HyperLogLog estimates cardinality by observing the maximum number of leading zeros in the hashed binary representations of incoming data. It operates on the mathematical probability that in a random distribution of bits, a sequence of n leading zeros will occur once every 2^n elements.</p>
<p>Think of it like flipping a coin. If I tell you that I flipped a coin until I got heads, and it took me ten tries (meaning I got nine tails in a row), you can reasonably guess that I’ve been flipping coins for a while. If I see a hashed User ID that starts with 10 zeros, I can estimate that I have probably processed around 2^10 unique users. I don't need to know who the users are; I just need to record the highest number of leading zeros I've seen so far. This allows HLL to represent billions of unique items using only a few kilobytes of state.</p>
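<p>To make the trick concrete, here is a minimal sketch of the raw leading-zero estimator. It is illustrative only: <code>hash32</code> is a stand-in integer mixer, where a production HLL would use something like MurmurHash.</p>

```typescript
// Minimal sketch of the raw HLL trick: track the max leading-zero count.
function hash32(x: number): number {
  // Simple integer mixer standing in for a real hash function.
  let h = x >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b);
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b);
  return (h ^ (h >>> 16)) >>> 0;
}

function estimateCardinality(userIds: number[]): number {
  let maxZeros = 0;
  for (const id of userIds) {
    const h = hash32(id);
    const zeros = h === 0 ? 32 : Math.clz32(h); // leading zeros in 32 bits
    maxZeros = Math.max(maxZeros, zeros);
  }
  return 2 ** maxZeros; // n leading zeros appear roughly once per 2^n items
}
```

<p>Note that duplicates are free: re-hashing the same User ID can never raise the maximum, which is exactly why HLL counts <em>unique</em> views without remembering who viewed.</p>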
<h2>Why is a single HyperLogLog estimate often inaccurate?</h2>
<p>A single HLL estimate relies on the maximum number of leading zeros in a hash, meaning it can only scale in powers of two. Because the estimate is derived from a single maximum value, one "lucky" hash with an unusually long string of zeros can cause the entire estimate to jump from 1,024 to 2,048 instantly.</p>
<p>I like to think of this as a ladder where the rungs are spaced further apart the higher you go. If your only markers are at 1,024, 2,048, and 4,096, you have no way to represent 1,500. If you are tracking a post with 1,025 views, but one user’s ID happens to hash into a value that starts with 11 zeros by pure chance, your system will report 2,048 views. You are off by nearly 100% because of a single outlier. This variance is the primary weakness of a raw HLL implementation.</p>
<h2>How does bucketing fix the estimation variance?</h2>
<p>Bucketing divides the incoming data into thousands of independent streams, calculating a separate estimate for each to ensure that one outlier cannot corrupt the entire result. By using the first few bits of a hash to assign a User ID to a specific bucket, we distribute the "luck" of the hashes across a wider range of data points.</p>
<p>When a User ID comes in, we might use the first 10 bits of its hash to choose one of 1,024 buckets. We then perform the HLL leading-zero count on the remaining bits for just that bucket. Instead of one giant, shaky estimate for the entire post, we now have 1,024 small estimates. If one bucket gets a "lucky" hash and over-estimates its portion of the traffic, it only represents 1/1,024th of the total data. The other 1,023 buckets will likely be much more accurate, diluting the impact of the outlier.</p>
<h2>Why use a harmonic mean instead of a regular average?</h2>
<p>The harmonic mean is used to aggregate bucket estimates because it is significantly more resilient to large outliers than a standard arithmetic mean. It effectively "ignores" the buckets that over-report due to chance, keeping the final count grounded in the majority of the data.</p>
<p>In a standard arithmetic mean, if you have ten buckets where nine report 100 and one reports 1,000, your average is 190. That single outlier has pulled your average nearly 100% higher than the reality. The harmonic mean—calculated as the reciprocal of the average of the reciprocals—weights smaller values more heavily. For that same set of numbers, the harmonic mean would be much closer to 100. Since HLL outliers are almost always over-estimations (the "lucky" hashes), the harmonic mean is the perfect tool to pull the estimate back toward the true count.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Traditional Set (COUNT DISTINCT)</th>
<th>HyperLogLog with Bucketing</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Memory Scaling</strong></td>
<td>O(n) - Grows with every new user</td>
<td>O(log log n) - Nearly constant</td>
</tr>
<tr>
<td><strong>Accuracy</strong></td>
<td>100% (Exact)</td>
<td>~99.2% (Probabilistic)</td>
</tr>
<tr>
<td><strong>Storage Size</strong></td>
<td>Megabytes to Gigabytes</td>
<td>~12KB per counter</td>
</tr>
<tr>
<td><strong>Best Use Case</strong></td>
<td>Financial transactions / Billing</td>
<td>View counts / Unique visitors</td>
</tr>
</tbody></table>
<h2>FAQ</h2>
<p><strong>Can I use HyperLogLog to see if a specific user has viewed a post?</strong> No. HyperLogLog is a "write-only" structure for the original data. It forgets the actual IDs immediately after processing them to save space. If you need to check for the existence of a specific ID, you would need a Bloom Filter.</p>
<p><strong>What is the standard error rate I can expect?</strong> For a typical implementation using 16,384 buckets (which takes about 12KB of space), the standard error is roughly 0.81%. For view counts on a social media post, this level of error is virtually indistinguishable to the end user.</p>
<p><strong>Is the harmonic mean expensive to calculate?</strong> Compared to the massive CPU and I/O cost of querying millions of rows in a traditional database, the harmonic mean is negligible. It involves a single pass over your buckets (usually a few thousand integers) and a bit of division, which modern CPUs handle in microseconds.</p>
]]></content:encoded></item><item><title><![CDATA[How to build a profanity filter that actually works]]></title><description><![CDATA[TL;DR: A production-ready profanity filter isn't just a list of banned words; it's a pipeline. You start with sanitization to normalize character substitutions, followed by a Trie for efficient prefix]]></description><link>https://doogal.dev/how-to-build-a-profanity-filter-that-actually-works</link><guid isPermaLink="true">https://doogal.dev/how-to-build-a-profanity-filter-that-actually-works</guid><category><![CDATA[#softwareengineering]]></category><category><![CDATA[datastructures]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[nlp]]></category><category><![CDATA[MachineLearning]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sun, 15 Mar 2026 11:52:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/8f0b7fff-9358-4439-9b6f-492b659d2a1c.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: A production-ready profanity filter isn't just a list of banned words; it's a pipeline. You start with sanitization to normalize character substitutions, followed by a Trie for efficient prefix matching. To avoid the Scunthorpe problem, you cross-reference matches against an allow-list or use context-aware ML models to score intent, balancing raw speed with semantic accuracy.</strong></p>
<p>Building a content filter seems like a Junior-level task until you actually have to deploy it to a live chat or a comment section. If you just use a regex or <code>String.contains()</code> on a list of banned words, you’ll quickly realize that users are incredibly creative at bypassing filters. Whether it's adding a period (<code>b.u.m</code>), using leetspeak (<code>b@m</code>), or hiding a word inside a valid one (<code>bumpy</code>), a simple search-and-replace won't cut it. You need a multi-stage pipeline that balances performance with accuracy.</p>
<h3>How do you handle character substitutions and leetspeak?</h3>
<p>Sanitization normalizes the input before it ever hits your matching logic by stripping non-alpha characters and mapping homoglyphs back to their base ASCII equivalents. </p>
<p>Before you run any comparisons, you need a canonical version of the text. This involves two steps: character mapping and stripping noise (punctuation, whitespace, and special characters). The order matters: map <code>@</code> back to <code>a</code> before you strip symbols, or the substitution gets deleted along with the punctuation. If a user types <code>b.u.m</code>, your sanitizer should collapse that into <code>bum</code>. If they use <code>@</code> for <code>a</code> or <code>0</code> for <code>o</code>, you map those visual lookalikes back to their standard letters. </p>
<pre><code class="language-javascript">// Conceptual normalization flow: map lookalikes first, then strip noise.
// Stripping first would delete '@' before it could be mapped to 'a'.
const map = { '@': 'a', '0': 'o', '1': 'i', '3': 'e', '5': 's' };
const clean = input.toLowerCase()
  .split('').map(c =&gt; map[c] || c).join('') // Map homoglyphs to base letters
  .replace(/[^a-z]/g, ''); // Strip symbols and leftover digits
</code></pre>
<p>This "clean" string is what you actually pass to your detection engine. Without this step, your dictionary would need to be millions of permutations long to catch even the simplest evasions.</p>
<h3>Why use a Trie instead of a simple Hash Map?</h3>
<p>A Trie (prefix tree) allows for O(L) lookup complexity where L is the length of the input string, making it significantly more efficient for detecting banned prefixes within a continuous stream of text.</p>
<p>In a standard hash map approach, to find every banned word in a 500-character paragraph, you would have to generate every possible substring and check it against the map. That’s an O(N²) operation. With a Trie, you iterate through the message once. As you walk the string, you walk the Trie. If you hit a terminal node in the Trie, you’ve found a match. This is the difference between a filter that lags your app and one that processes thousands of messages per second. It allows you to identify not just exact matches, but matches embedded within a larger stream of characters without re-scanning the string for every entry in your database.</p>
<h3>How do you solve the Scunthorpe Problem?</h3>
<p>To solve the Scunthorpe problem, you must validate flagged matches against an allow-list to ensure the "bad word" isn't actually a substring of a legitimate word like "bumpy" or "album."</p>
<p>This is where many engineers get stuck. If your Trie flags the word "bum," you shouldn't immediately trigger a block. Instead, you need to perform a look-ahead and look-behind on the original string. This is essentially a secondary validation step. If the Trie identifies a match at index <code>i</code> through <code>j</code>, you check if that specific range is part of a known-good word in your allow-list.</p>
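<p>A sketch of that secondary validation (the allow-list entries and function name are hypothetical): expand the flagged range outward to the full word it sits inside, then check that word against the known-good set.</p>

```typescript
// Hypothetical allow-list of legitimate words that contain banned substrings.
const allowList = new Set(['bumpy', 'album', 'scunthorpe']);

// Given a Trie match at text[start..end), decide if it is a false positive.
function isFalsePositive(text: string, start: number, end: number): boolean {
  const isLetter = (c: string) => /[a-z]/i.test(c);
  // Look-behind: walk back to the start of the surrounding word.
  let s = start;
  while (s > 0 && isLetter(text[s - 1])) s--;
  // Look-ahead: walk forward to the end of the surrounding word.
  let e = end;
  while (e < text.length && isLetter(text[e])) e++;
  return allowList.has(text.slice(s, e).toLowerCase());
}
```
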
<table>
<thead>
<tr>
<th>Filtering Stage</th>
<th>Technical Goal</th>
<th>Latency Cost</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Sanitization</strong></td>
<td>Normalize input characters</td>
<td>Low (O(N))</td>
</tr>
<tr>
<td><strong>Trie Traversal</strong></td>
<td>Fast prefix matching</td>
<td>Low (O(L))</td>
</tr>
<tr>
<td><strong>Allow-listing</strong></td>
<td>Resolve Scunthorpe false positives</td>
<td>Moderate (O(Match Count))</td>
</tr>
<tr>
<td><strong>ML Inference</strong></td>
<td>Context and intent scoring</td>
<td>High (O(Inference Time))</td>
</tr>
</tbody></table>
<h3>When should you use Machine Learning instead of a Trie?</h3>
<p>Use ML scoring when you need to detect intent or harassment that doesn't rely on specific banned words, but be aware that ML introduces a significant latency trade-off compared to the Trie approach.</p>
<p>A Trie is a deterministic, high-speed tool. It is great at finding "bad words," but it’s terrible at finding "bad behavior." It can't catch sarcasm or a user being hostile without using slurs. This is where models like Google’s Perspective API or custom BERT-based classifiers come in. They provide a toxicity score (0 to 1) based on the context of the whole sentence.</p>
<p>However, from a systems design perspective, you shouldn't run every single message through an ML model. Inference is expensive and slow. A common pattern is to use the Trie as a first-pass filter. If the Trie catches a high-confidence match, you block it immediately. If the message passes the Trie but the user has been flagged recently or the message contains suspicious patterns, you then asynchronously or conditionally route it to the ML model for a deeper score. This saves your CPU cycles for messages that actually need the semantic analysis.</p>
<h3>FAQ</h3>
<p><strong>How do you handle words that are safe in one context but not another?</strong>
This is the limitation of the Trie. If a word’s toxicity is context-dependent, you have to rely on ML scoring. A Trie can only tell you if a word exists; only a transformer-based model can tell you what that word <em>means</em> in that specific sentence.</p>
<p><strong>What happens if a user uses Unicode characters that look like Latin letters?</strong>
This is a sanitization edge case called "IDN homograph attacks." Your character map needs to include common Unicode lookalikes (like the Cyrillic 'а') and map them back to their Latin counterparts before the text hits the Trie.</p>
<p><strong>Should I block the message or just mask the bad words?</strong>
In high-throughput systems, masking (<code>***</code>) is often preferred because it provides immediate feedback to the user without breaking the flow of the UI. However, for severe toxicity, outright blocking is necessary to prevent the storage of harmful content in your database.</p>
]]></content:encoded></item><item><title><![CDATA[How Docker Actually Works: A Deep Dive into Namespaces and Cgroups]]></title><description><![CDATA[TL;DR: Docker containers are just standard Linux processes restricted by Namespaces and Cgroups. Namespaces provide visibility isolation by partitioning kernel resources like PIDs and networking, whil]]></description><link>https://doogal.dev/how-docker-actually-works-a-deep-dive-into-namespaces-and-cgroups</link><guid isPermaLink="true">https://doogal.dev/how-docker-actually-works-a-deep-dive-into-namespaces-and-cgroups</guid><category><![CDATA[Docker]]></category><category><![CDATA[Linux]]></category><category><![CDATA[Devops]]></category><category><![CDATA[containers]]></category><category><![CDATA[Kernel]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Sat, 14 Mar 2026 11:39:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/93b8362f-a5d6-4408-9986-6895f5dfc939.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Docker containers are just standard Linux processes restricted by Namespaces and Cgroups. Namespaces provide visibility isolation by partitioning kernel resources like PIDs and networking, while Cgroups (Control Groups) provide resource isolation by enforcing hard limits on CPU, memory, and I/O usage to prevent host exhaustion.</strong></p>
<p>Docker feels like magic until your container gets OOMKilled or you can’t reach a port you swore was open. Then you realize you aren’t running a mini-virtual machine; you’re just running a process in a very fancy cage. That cage is built out of two fundamental Linux kernel features: Namespaces and Cgroups. If you want to move beyond the basics of <code>docker run</code>, you need to understand how the kernel handles these two mechanisms.</p>
<h2>How do Linux Namespaces isolate container processes?</h2>
<p>Namespaces partition kernel resources so that one set of processes sees one set of resources while another set sees a completely different set. They provide an isolated view of global kernel resources—such as the process tree, network interfaces, and mount points—without actually virtualizing the hardware.</p>
<p>Think of a private office inside a communal WeWork. When you’re inside those walls, you feel like the CEO of your own domain. You see your own desk and your own files. In Linux terms, your process thinks it is PID 1, the first process on the system. However, from the perspective of the WeWork manager—the host kernel—you’re just Tenant 3458. Namespaces make this possible by using the <code>unshare</code> system call to detach a process from the host's default view.</p>
<p>This isolation extends to the network stack. In a NET namespace, your process gets its own routing table and its own virtual network interfaces. Docker typically hooks this up by creating a <code>veth</code> (virtual ethernet) pair: one end stays on the host’s <code>docker0</code> bridge, and the other is shoved into the container's namespace. The container thinks it has a physical NIC, but it’s just a tunnel to the host’s bridge.</p>
<h2>What is the role of Cgroups in Docker resource management?</h2>
<p>Cgroups (Control Groups) are the resource governors of the Linux kernel. They define the hard limits on how much CPU, memory, and I/O a process can consume to prevent a "noisy neighbor" from crashing your host or starving other containers.</p>
<p>If Namespaces are the walls of the office, Cgroups are the circuit breakers. Imagine a startup in the office next to you tries to run fifty crypto-mining rigs. In a raw Linux environment, they’d suck all the power out of the building and trip your lights. Cgroups prevent this by metering consumption. You can see this in action on any Linux machine by looking at <code>/sys/fs/cgroup/</code>. This filesystem is where the kernel exposes the control knobs for every running process.</p>
<p>Modern systems have largely transitioned to Cgroup v2, which replaced the fragmented, multiple-hierarchy mess of v1 with a unified hierarchy. This makes it easier for the kernel to manage resources like memory and I/O together. When you set a memory limit in your Docker Compose file, the kernel monitors that process's usage against the threshold defined in the cgroup. If the process tries to overreach, the kernel’s Out-Of-Memory (OOM) killer calculates an <code>oom_score_adj</code> and terminates the offender to keep the rest of the system stable.</p>
<h2>How do Namespaces and Cgroups compare?</h2>
<p>Namespaces determine what a process is allowed to see, while Cgroups determine what it is allowed to use. One is about scoping identity and visibility, while the other is about measuring and limiting physical hardware consumption.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Linux Primitive</th>
<th>Primary Responsibility</th>
<th>The Analogy</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Namespaces</strong></td>
<td><code>ns</code></td>
<td>Visibility / Isolation</td>
<td>The office walls and door</td>
</tr>
<tr>
<td><strong>Cgroups</strong></td>
<td><code>cgroup</code></td>
<td>Resource Allocation</td>
<td>The utility meter/circuit breaker</td>
</tr>
<tr>
<td><strong>Focus</strong></td>
<td>Security &amp; Context</td>
<td>Performance &amp; Stability</td>
<td>Privacy vs. Power Usage</td>
</tr>
</tbody></table>
<h2>Why does the distinction between Namespaces and Cgroups matter?</h2>
<p>Understanding this distinction is the difference between debugging a permissions error and a performance bottleneck. It allows you to pinpoint whether a failure stems from a restricted view of the system or a resource ceiling imposed by the kernel.</p>
<p>Let's say your microservice is failing to connect to a database. If the network interface is missing or the routing table is empty, you're looking at a Namespace configuration issue—the process literally can't see the path to the outside world. On the other hand, if your service is mysteriously disappearing during high-traffic spikes without throwing a stack trace, you’re likely hitting a Cgroup limit. The kernel doesn't ask for permission; it sees the memory limit has been breached and kills the process instantly (Exit Code 137).</p>
<h2>FAQ</h2>
<p><strong>Can I manually inspect the namespaces of a running container?</strong>
Yes. You can use the <code>nsenter</code> tool or the <code>lsns</code> command to see which namespaces are active. Every process on Linux has a directory in <code>/proc/[pid]/ns/</code> that contains symlinks to the namespaces it currently occupies.</p>
<p><strong>Does Cgroup v2 change how Docker performs?</strong>
Cgroup v2 provides more consistent resource accounting, especially for buffered I/O, which was notoriously difficult to track in v1. Most modern Linux distributions use v2 by default, and Docker leverages this for better performance isolation.</p>
<p><strong>Is it possible for a process to escape a namespace?</strong>
Namespace escapes usually require a kernel vulnerability or a misconfiguration, such as running a container with the <code>--privileged</code> flag or mounting the host's <code>/proc</code> filesystem inside the container. In a standard setup, the isolation is enforced by the kernel's internal security checks.</p>
]]></content:encoded></item><item><title><![CDATA[Scaling Profanity Filters: Why I Use Tries for Real-Time Chat]]></title><description><![CDATA[TL;DR: When I'm building high-traffic chat systems, a standard list lookup for profanity is too slow because search time grows with the size of the dictionary. I use a Trie (prefix tree) to move to O(]]></description><link>https://doogal.dev/scaling-profanity-filters-why-i-use-tries-for-real-time-chat</link><guid isPermaLink="true">https://doogal.dev/scaling-profanity-filters-why-i-use-tries-for-real-time-chat</guid><category><![CDATA[datastructures]]></category><category><![CDATA[GameDev]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Fri, 13 Mar 2026 11:37:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/63b0a744-f5cd-4c38-889b-9faaf3b976cf.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: When I'm building high-traffic chat systems, a standard list lookup for profanity is too slow because search time grows with the size of the dictionary. I use a Trie (prefix tree) to move to O(K) performance. This ensures that filtering speed depends entirely on the length of the word being checked, not the number of banned words.</strong></p>
<p>I’ve seen developers fall into the same trap over and over: they maintain a blacklist of 5,000 words and run a basic <code>.contains()</code> check for every word in a player's message. In a small app, you won't notice. But when I'm looking at game architecture handling millions of messages, that O(N) overhead is a disaster. </p>
<p>Every time you add a new word to that list, you're increasing the workload for your CPU. If you're processing a sentence with ten words against a list of 5,000, you’re potentially doing 50,000 comparisons. That overhead becomes unsustainable when you scale. To fix this, I move the logic into a Trie.</p>
<h2>Why is a simple list lookup too slow for profanity filters?</h2>
<p>I find that linear searches force the CPU to iterate through a dictionary until it finds a match, which means latency spikes as the blacklist grows. This dictionary-dependent speed is a bottleneck in any real-time system where milliseconds matter.</p>
<p>If you have a bank of 5,000 bad words, your server has to ask "Is it this word?" five thousand times for every piece of text sent. When I'm scaling a game to millions of active users, that's billions of unnecessary operations. You aren't just checking if a word exists; you're wasting cycles re-scanning the same prefixes across a flat array. </p>
<h2>What is a Trie and how does it optimize string searching?</h2>
<p>I use a Trie—a type of prefix tree—to break words down into a sequence of character nodes. This shifts the complexity from O(N) to O(K), where performance is determined solely by the number of letters in the word I’m checking.</p>
<p>Think of it as a roadmap of characters. Instead of scanning a whole list, I start at a root node and follow a path. For the word "BUM," I jump to 'B', then 'U', then 'M'. If the 'M' node is flagged as a word, I've caught a match. This is where the performance win happens: I only ever make three jumps, whether my dictionary has 50 words or 50,000. The size of the blacklist no longer affects my search time.</p>
<h2>How do you implement a Trie for word validation?</h2>
<p>I implement this by building a node structure where each node contains a map of its children and a boolean flag. The search method simply walks through these maps character by character until it hits a terminal node or a dead end.</p>
<pre><code class="language-typescript">class TrieNode {
  children: Map&lt;string, TrieNode&gt; = new Map();
  isBadWord: boolean = false;

  search(word: string): boolean {
    let current: TrieNode = this;
    for (const char of word) {
      const next = current.children.get(char);
      if (!next) return false; // Dead end: no banned word down this path
      current = next;
    }
    return current.isBadWord; // Terminal node flag marks a full match
  }
}
</code></pre>
<h2>What are the trade-offs when choosing a Trie over a List?</h2>
<p>The primary trade-off I consider is that a Trie trades memory for speed. While it provides near-instant lookups, it requires more RAM to store the individual node objects and character maps compared to a flat array of strings.</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Array / List Search</th>
<th>Trie (Prefix Tree)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Search Time</strong></td>
<td>O(N) - Depends on list size</td>
<td>O(K) - Depends on word length</td>
</tr>
<tr>
<td><strong>Scaling Cost</strong></td>
<td>Linear - Slower as list grows</td>
<td>Constant - Size doesn't matter</td>
</tr>
<tr>
<td><strong>Memory Usage</strong></td>
<td>Minimal (stores strings as-is)</td>
<td>Moderate (object/map overhead)</td>
</tr>
<tr>
<td><strong>Best For</strong></td>
<td>Small, static lists</td>
<td>High-velocity chat streams</td>
</tr>
</tbody></table>
<p>In my experience, the memory overhead is almost always worth the massive CPU savings. For a modern game server, a few extra megabytes of RAM to keep a 5,000-word Trie in memory is a small price to pay for lightning-fast validation across millions of chat messages.</p>
<p>Cheers.</p>
<h2>FAQ</h2>
<h3>How does a Trie handle substrings or "hidden" words?</h3>
<p>To catch words buried in other strings, I don't just check the whole word; I start a Trie traversal at every character index of the player's message. This allows me to catch banned terms even if they are part of a larger, unformatted string.</p>
<h3>Is a Trie faster than a Hash Set?</h3>
<p>In many cases, yes. While a Hash Set has O(1) average lookup, you still have to hash the entire input string first. A Trie allows me to fail-fast the moment a prefix doesn't match, which is often more efficient for long strings or partial matches.</p>
<h3>Can I use a Trie for languages with different alphabets?</h3>
<p>Definitely. Since I use a Map for the children, the Trie doesn't care if the keys are ASCII, UTF-8, or emojis. As long as you can map a character to a node, the O(K) lookup logic remains identical across different languages.</p>
]]></content:encoded></item><item><title><![CDATA[Why Your Linked List Wants to Be a Bloody Tree]]></title><description><![CDATA[Quick Answer: A linked list hits a wall when searching because it’s stuck in linear O(n) time. By giving every node two pointers instead of one, you create a Binary Search Tree (BST). This wacky struc]]></description><link>https://doogal.dev/why-your-linked-list-wants-to-be-a-bloody-tree</link><guid isPermaLink="true">https://doogal.dev/why-your-linked-list-wants-to-be-a-bloody-tree</guid><category><![CDATA[datastructures]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[computerscience]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Fri, 13 Mar 2026 11:30:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/f8bc2953-01a2-4a42-bb0a-934207cea68b.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Quick Answer: A linked list hits a wall when searching because it’s stuck in linear O(n) time. By giving every node two pointers instead of one, you create a Binary Search Tree (BST). This wacky structure uses its branching logic to cut your search space in half with every step, turning a billion-element crawl into a 30-step sprint.</strong></p>
<p>So, what happens if you make a linked list where, instead of every node pointing to a single node, every node points to two more? You’re a human being, you’ve got free will, and you can do what you want. But if you do that, you’ve not got a linked list anymore, you’ve got to have made a bloody tree, haven’t you?</p>
<p>Trees are super interesting data structures. They are flexible, they have loads of use cases, and they are basically the reason your computer doesn't catch fire when you try to find a file. One of the coolest forms is the Binary Search Tree (BST), which takes the biggest problem we have with linked lists—the search time—and absolutely guts it.</p>
<h2>What happens if I just give a node two pointers?</h2>
<p>You’ve stopped being linear and started building a hierarchy. Instead of a single path from head to tail, you’ve created a structure that fans out, where one parent node leads to two children, effectively turning a simple queue into a tree.</p>
<p>The linked list is a linear crawl. If you want the item at the end, you have to walk past everyone else first. By giving nodes two pointers, you’re allowing the data to branch. This means you can start making decisions about where to go next based on the value you're looking for, rather than just blindly following a single wire.</p>
<h2>Why is searching a linked list so slow?</h2>
<p>Searching a linked list is Order n (O(n)), which is a fancy way of saying it’s a slog. If you have a billion elements and the thing you want is at the end, you’re performing a billion checks before you can go home.</p>
<p>We don’t like that. It's slow, it's inefficient, and it doesn't scale. If your data grows, your wait time grows at exactly the same rate. This is the bottleneck that makes linked lists frustrating for anything other than simple stacks or queues where you only care about the ends.</p>
<h2>How do trees handle a billion elements in 30 steps?</h2>
<p>By using a Binary Search Tree, you ensure that one path from a node contains values larger than the current one, and the other path contains smaller values. This means every time you jump from one node to the next, you’re reducing your search space by half.</p>
<p>After the first node, you’ve gone from a billion elements to half a billion. Then 250 million. Then 125 million. The base-2 log of a billion is just under 30, so instead of checking a billion elements, you’re making about 30 comparisons. That is the power of the "wacky linked list" approach.</p>
<table>
<thead>
<tr>
<th>Data Structure</th>
<th>Search Complexity</th>
<th>Logic for Lookups</th>
</tr>
</thead>
<tbody><tr>
<td>Linked List</td>
<td>O(n)</td>
<td>Linear walk; check every single node.</td>
</tr>
<tr>
<td>Binary Search Tree</td>
<td>O(log n)</td>
<td>Binary split; discard half the data per jump.</td>
</tr>
<tr>
<td>Sorted Array</td>
<td>O(log n)</td>
<td>Binary search; requires contiguous memory.</td>
</tr>
</tbody></table>
<h2>Is my computer’s folder structure really just a tree?</h2>
<p>Yes, it’s the most obvious real-world example of this logic. Every folder contains folders, which contain more folders, which eventually contain files.</p>
<p>In this analogy, your files are the "leaves"—the nodes at the very end with no children—and your folders are the nodes. Imagine if your OS didn't have this tree structure. To find a single cat photo on a 2TB drive, it would have to scan every single sector on the disk in a straight line. You’d be waiting forever. Because it's a tree, the OS just follows a few branches (Root -&gt; Users -&gt; Doogal -&gt; Pictures) and finds it instantly.</p>
<p>To build this in code, you’re basically just doubling your pointer overhead. Here is how a node looks when it stops being a list and starts being a tree:</p>
<pre><code class="language-c">struct Node {
  int value;
  struct Node *left;  // Smaller stuff goes here
  struct Node *right; // Larger stuff goes here
};
</code></pre>
<h2>FAQ</h2>
<h3>Does a tree use more memory than a linked list?</h3>
<p>Yes, you're literally doubling the pointer overhead. In a standard linked list, you have one pointer per node; in a binary tree, you have two (<code>left</code> and <code>right</code>). You’re trading a bit of memory for a massive increase in search speed.</p>
<h3>What if my tree isn't organized by value?</h3>
<p>Then it’s just a regular binary tree, not a Binary Search Tree. You’ll still have a hierarchical structure, but you lose that O(log n) search advantage because you won't know which branch to pick to find your data.</p>
<h3>Can a tree become as slow as a linked list?</h3>
<p>It can. If you insert numbers in order (1, 2, 3, 4...), the tree just grows in one long line to the right. We call that a "degenerate" tree. It’s basically just a linked list with extra steps, which is why we usually use balancing logic to keep the tree wide instead of long.</p>
<p>Cheers!</p>
]]></content:encoded></item><item><title><![CDATA[32-Element Branching: How Scala Vectors Solve Immutable Memory Pressure]]></title><description><![CDATA[TL;DR: Traditional immutable arrays are slow because updating an element requires a full O(n) copy. Scala’s Vector solves this by using a 32-way branching trie. This enables structural sharing, allowi]]></description><link>https://doogal.dev/32-element-branching-how-scala-vectors-solve-immutable-memory-pressure</link><guid isPermaLink="true">https://doogal.dev/32-element-branching-how-scala-vectors-solve-immutable-memory-pressure</guid><category><![CDATA[Scala]]></category><category><![CDATA[datastructures]]></category><category><![CDATA[#FunctionalProgramming]]></category><category><![CDATA[#softwareengineering]]></category><category><![CDATA[BitmappedVectorTrie]]></category><dc:creator><![CDATA[Doogal Simpson]]></dc:creator><pubDate>Wed, 11 Mar 2026 21:58:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6944f357f971a2de872fd33d/dc21b3f2-96c6-4634-bab8-760c15142a2a.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Traditional immutable arrays are slow because updating an element requires a full O(n) copy. Scala’s Vector solves this by using a 32-way branching trie. This enables structural sharing, allowing the collection to reuse most of the original memory and reducing complexity to an effectively constant O(log32 n).</strong></p>
<p>I have always found the overhead of immutability to be one of the most interesting engineering trade-offs. In an immutable context, you can’t mutate state; you have to return new state. This means that if you are adding an element to a standard array, you have to create a new array with all the previous elements plus your new one. If you are adding thousands of elements, your CPU spends more time shuffling pointers and allocating memory than running business logic. I want to look at how Scala solved this with the Vector, a data structure that is surprisingly elegant once you look under the hood.</p>
<h2>Why is O(n) memory allocation a problem for immutable state?</h2>
<p>O(n) allocation forces the runtime to duplicate every reference in an array during an update, which causes linear growth in CPU cycles and memory usage. This creates massive pressure on the garbage collector and leads to latency spikes as the dataset grows.</p>
<p>If you are building a service that maintains a large list of active user sessions in an immutable array, every single attribute change triggers a full duplication of that list. It isn't just about the memory footprint; it’s about the 'stop-the-world' GC events triggered by thousands of short-lived array copies. To scale, we need a way to modify a 'leaf' of data without re-writing the entire forest. This is the exact problem that persistent data structures were designed to solve.</p>
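<p>To make that cost concrete, here is a tiny sketch (the one-million-element size and the session strings are just illustrative): "updating" one slot of an immutable array allocates a brand-new array of the same length.</p>

```scala
// A plain array of a million session states.
val sessions = Array.fill(1000000)("active")

// Changing one slot immutably means copying all 1,000,000 slots.
val afterUpdate = sessions.updated(0, "idle")

// The original is untouched, and the result is a distinct allocation.
println(afterUpdate(0))          // idle
println(sessions(0))             // active
println(afterUpdate eq sessions) // false
```

<p>Do that thousands of times per second and the garbage collector is churning through megabytes of short-lived copies.</p>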
<h2>How does the 32-way branching trie enable structural sharing?</h2>
<p>A 32-way trie breaks the collection into a tree where each node is a 32-element array of pointers. When an update occurs, the system only creates new nodes for the path leading to the changed index, while the rest of the new tree simply points to existing, unchanged branches.</p>
<p>I find this concept of structural sharing to be a masterclass in efficiency. Because each node has 32 slots, the tree is incredibly wide and shallow. A three-level tree already holds 32,768 elements (32^3). To update one element in that list, you only need to allocate three new 32-element arrays. The rest of the data—the other 32,672 elements—is reused by pointing the new nodes at the old branches. You are trading a massive O(n) copy for a surgical operation that touches only a handful of 32-slot nodes.</p>
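<p>The arithmetic above is easy to check. For a full three-level trie, an update rewrites one 32-slot node per level and leaves every other slot untouched:</p>

```scala
val branching = 32
val depth = 3

// A full three-level trie addresses 32^3 = 32,768 elements.
val totalSlots = math.pow(branching, depth).toInt

// One update allocates one new 32-slot node per level on the path.
val rewrittenSlots = depth * branching // 3 * 32 = 96

// Everything else is shared with the old tree, pointer for pointer.
val sharedSlots = totalSlots - rewrittenSlots // 32,672
```
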
<table>
<thead>
<tr>
<th>Operation</th>
<th>Immutable Array</th>
<th>Scala Vector</th>
</tr>
</thead>
<tbody><tr>
<td>Access</td>
<td>O(1)</td>
<td>O(log32 n)</td>
</tr>
<tr>
<td>Update</td>
<td>O(n)</td>
<td>O(log32 n)</td>
</tr>
<tr>
<td>Memory Use</td>
<td>Full Copy</td>
<td>Structural Sharing</td>
</tr>
<tr>
<td>Complexity</td>
<td>Linear</td>
<td>Effective Constant</td>
</tr>
</tbody></table>
<h2>Why is the 32-element structure a perfect interview topic?</h2>
<p>I find this a super interesting data structure to bring up in interviews because it bridges the gap between high-level Big O theory and low-level hardware constraints. It allows you to have a real conversation about recursion, bit-masking, and why designers pick specific constants over others.</p>
<p>When I talk about the branching factor of 32, I look for an understanding of why that number was chosen. It’s 2^5, meaning you can use fast bit-shifting for indexing instead of expensive division. Furthermore, a 32-pointer array occupies 256 bytes on a 64-bit machine, exactly four consecutive 64-byte cache lines, so an entire node can be pulled into cache with a handful of sequential fetches. You’re showing an interviewer that you understand that software performance is ultimately limited by how fast the hardware can move bits between the cache and the registers.</p>
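<p>The bit-shifting trick fits in a few lines. <code>pathTo</code> here is my own illustrative helper, not Vector's real internals: each trie level consumes 5 bits of the index, so the slot at every level falls out of a shift and a mask.</p>

```scala
// The slot to take at each level, top-down, for a trie of the given depth.
// 32 = 2^5, so each level reads 5 bits of the index: shift, then mask with 31.
def pathTo(index: Int, depth: Int): Seq[Int] =
  ((depth - 1) to 0 by -1).map(level => (index >>> (level * 5)) & 31)

// Index 1000 in a depth-3 trie: 0 * 32^2 + 31 * 32 + 8 = 1000.
println(pathTo(1000, 3)) // Vector(0, 31, 8)
```

<p>No division, no modulo: just shifts and masks, which is exactly why a power of two was chosen.</p>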
<h2>How does the recursion depth remain manageable?</h2>
<p>Because the branching factor is 32, the depth of the tree is the log base 32 of N, which stays extremely low even for massive datasets. For almost any addressable memory space, the tree is only 5 or 6 levels deep, allowing the recursive navigation of the tree to perform with the speed of an iterative loop.</p>
<p>Navigating a 32-way trie is a naturally recursive process: you mask the index to find the correct slot in the current array, then recurse into the child node. Because the depth is so shallow—5 levels gets you to over 33 million elements—you get the elegance of recursive logic without any risk of stack overflow. It’s one of the few places where recursion provides the performance profile of an O(1) jump.</p>
<pre><code class="language-scala">// Conceptual look at a Vector update
val original = Vector.range(1, 1000)
// This doesn't copy 1000 elements; it creates
// a new path in the trie and shares the rest.
val updated = original.updated(500, 42)
</code></pre>
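<p>The shallow-depth claim is easy to verify with a quick sketch. <code>trieDepth</code> is a hypothetical helper that counts how many 32-way levels are needed to cover a given size:</p>

```scala
// Smallest depth such that 32^depth >= size (minimum depth of 1).
def trieDepth(size: Long): Int = {
  var depth = 1
  var capacity = 32L
  while (capacity < size) {
    capacity *= 32
    depth += 1
  }
  depth
}

println(trieDepth(32))          // 1: a single node
println(trieDepth(33554432L))   // 5: 32^5 = 33,554,432 elements
println(trieDepth(1000000000L)) // 6: a billion elements, six hops
```
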
<h2>FAQ</h2>
<p><strong>Why use 32 instead of a binary tree (branching factor of 2)?</strong>
A binary tree is much deeper, requiring significantly more pointer dereferences to reach a leaf. 32-way branching keeps the tree shallow enough that almost any element can be reached in 6 hops or fewer, which is much closer to the performance of a flat array.</p>
<p><strong>Is a Vector slower for reads than a regular array?</strong>
Yes. You are trading a direct memory jump (O(1)) for a few pointer hops (O(log32 n)). However, in an immutable context, the massive speed gain on updates and appends far outweighs the minor penalty on read latency.</p>
<p><strong>When should I use a standard Array instead of a Vector?</strong>
If you are doing local, mutable performance-critical work—like a heavy mathematical calculation inside a single function—a standard array is faster. Use a Vector when your data needs to be shared across threads or throughout an application where immutability and structural sharing are required to keep memory pressure low.</p>
]]></content:encoded></item></channel></rss>