Under The Hood

Playing in sync is hard. ondaire does the hard part.

Getting two speakers to play at the very same instant over flaky Wi-Fi means fighting both the network and physics at once, and then proving you actually won. Below are the four problems that pull rooms apart and the fix for each, followed by the measurements that back it up: first from a microphone in the room, then from the cluster’s own live telemetry. Four problems, four fixes:

network jitter

Wi-Fi delivers packets in bursts and at uneven intervals — played naively, audio stutters and rooms drift apart.

Every frame is stamped with a presentation time and played at that exact deadline against a shared clock. A small per-group playout buffer absorbs the jitter, so output stays smooth and every room hits the same instant.
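As a rough illustration of deadline-based playout, here is a minimal sketch of a buffer that holds frames until their presentation time arrives on the shared clock. The names `Frame` and `PlayoutBuffer` are hypothetical and are not ondaire's actual API; the real buffer also has to feed a sound card, which is left out here.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Frame:
    presentation_time: float              # deadline in seconds on the shared clock
    payload: bytes = field(compare=False)  # audio data, ignored when ordering


class PlayoutBuffer:
    """Holds frames that arrived early and releases them at their deadline."""

    def __init__(self):
        self._heap = []

    def push(self, frame):
        # Frames may arrive in bursts or out of order; the heap re-sorts them.
        heapq.heappush(self._heap, frame)

    def pop_due(self, now):
        """Return every frame whose deadline is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0].presentation_time <= now:
            due.append(heapq.heappop(self._heap))
        return due
```

Because release depends only on the stamped deadline, the order and timing of arrival stop mattering as long as each frame arrives before its deadline.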

packet loss

On a busy network, UDP datagrams simply vanish, and a dropped frame is an audible click or gap.

Audio is Opus-coded so each frame fits one small packet (no fragile IP fragmentation), and the master sends forward-error-correction parity alongside it, so most lost frames are rebuilt with no retransmit. A TCP transport is one toggle away when you'd rather trade a little latency for certainty.
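The text above does not say which forward-error-correction scheme is used. As one common possibility, a single XOR parity packet per small group of frames can rebuild any one lost frame in that group without a retransmit. The sketch below assumes that scheme, and assumes the length of the lost frame is known to the receiver (for example from a header field); both function names are hypothetical.

```python
def xor_parity(frames):
    """Build one parity packet: byte-wise XOR of all frames, zero-padded."""
    size = max(len(f) for f in frames)
    parity = bytearray(size)
    for f in frames:
        for i, b in enumerate(f):
            parity[i] ^= b
    return bytes(parity)


def rebuild_missing(received, parity, missing_len):
    """Recover the single lost frame of a group.

    received    -- the frames of the group that did arrive
    parity      -- the parity packet for the whole group
    missing_len -- byte length of the lost frame (assumed known)
    """
    out = bytearray(parity)
    for f in received:
        for i, b in enumerate(f):
            out[i] ^= b          # XOR-ing a frame in again cancels it out
    return bytes(out[:missing_len])
```

XOR parity can repair only one loss per group, so smaller groups cost more bandwidth but survive burstier loss.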

system clock drift

No two devices agree on what time it is — their clocks start at different offsets and tick at slightly different rates.

One node is the time reference. Each player continuously measures its clock offset and round-trip time to that master and translates “play at T” into its own local time, so the same frame lands at the same real-world instant on every speaker.
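One well-known way to measure offset and round-trip time is the four-timestamp exchange used by NTP. Whether ondaire uses exactly this exchange is an assumption; the sketch also assumes the network delay is roughly the same in both directions, and the function names are hypothetical.

```python
def offset_and_rtt(t0, t1, t2, t3):
    """Estimate clock offset and round-trip time from one request/reply.

    t0 -- player sends request     (player clock)
    t1 -- master receives request  (master clock)
    t2 -- master sends reply       (master clock)
    t3 -- player receives reply    (player clock)
    """
    rtt = (t3 - t0) - (t2 - t1)               # time on the wire, both ways
    offset = ((t1 - t0) + (t2 - t3)) / 2.0    # master clock minus player clock
    return offset, rtt


def to_local(master_time, offset):
    """Translate a master-clock deadline into the player's own clock."""
    return master_time - offset
```

For example, if the master's clock reads 0.25 s ahead of the player's, a frame stamped "play at 20.0" on the master is due when the player's own clock reads 19.75.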

DAC clock drift

Even with perfect timing, no two sound cards sample at exactly 48 kHz. Their rates differ by a few parts per million, enough for rooms to slide out of phase over a long track.

A continuous rate servo watches how fast each DAC actually drains its buffer versus the master timeline and resamples by a micro-correction to match. Rooms stay phase-locked for hours, not just for the first minute.
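The measurement half of such a servo can be sketched as follows, assuming the player can count how many samples its DAC consumed over an interval timed on the master timeline. The function names and the way the correction is expressed (a resampling ratio) are illustrative assumptions, not a description of ondaire's internals.

```python
def drift_ppm(samples_consumed, elapsed_master_s, nominal_rate=48_000):
    """How fast the DAC really drains, relative to nominal, in ppm.

    Positive means the DAC consumes samples faster than the master timeline
    supplies them, so the player must stretch the audio slightly.
    """
    actual_rate = samples_consumed / elapsed_master_s
    return (actual_rate / nominal_rate - 1.0) * 1e6


def resample_ratio(ppm):
    """Output samples to produce per input sample to cancel that drift."""
    return 1.0 + ppm * 1e-6
```

A DAC that drains 28,800,432 samples in ten master-clock minutes, where exactly 28,800,000 were expected, is running 15 ppm fast.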

Measured, not promised

We put a microphone in the room.

Anyone can claim “perfect sync,” so we did the harder thing and measured it from the air. A single microphone in the room with two Raspberry Pi speakers, a calibrated sine sweep, and a matched filter together recover each speaker’s acoustic arrival time to a fraction of a sample. That is the real gap your ears would hear, room reflections and all. Ten minutes, every burst, nothing smoothed away. Two views of the same recording:

Inter-speaker offset over a ten-minute run: a flat line hugging zero, the bulk of bursts inside ±0.1 ms and 99% inside ±0.4 ms.

Inter-speaker sync 84 µs

Locked, the whole session

Across a ten-minute run the two speakers held a median of 84 µs apart: flat, burst after burst, not the drift-and-snap of a loop fighting itself. Your ears fuse two arrivals into one source when they land within roughly five milliseconds of each other (the precedence effect), so this sits deep inside “one sound.” Honest read: the graph has the slow sound-card warm-up drift removed, because the master’s cross-room equalization tracks that out, so what you see is the moment-to-moment sync.
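The acoustic measurement rests on a matched filter: slide the known sweep over the microphone recording and find where it correlates best. The sketch below is a pure-Python toy version with a parabolic fit around the peak for sub-sample resolution. The sweep parameters and function names are assumptions, and a real analysis would use an FFT-based correlation for speed.

```python
import math


def chirp(n, f0, f1, rate):
    """Linear sine sweep from f0 to f1 Hz, n samples long."""
    k = (f1 - f0) / (n / rate)          # sweep rate in Hz per second
    return [math.sin(2 * math.pi * (f0 * t + 0.5 * k * t * t))
            for t in (i / rate for i in range(n))]


def arrival_sample(recording, template):
    """Return the (fractional) sample index where the template best matches."""
    m = len(template)
    corr = [sum(recording[lag + i] * template[i] for i in range(m))
            for lag in range(len(recording) - m + 1)]
    peak = max(range(len(corr)), key=corr.__getitem__)
    if 0 < peak < len(corr) - 1:
        a, b, c = corr[peak - 1], corr[peak], corr[peak + 1]
        denom = a - 2 * b + c
        if denom != 0:
            # Vertex of the parabola through the three points around the peak.
            return peak + 0.5 * (a - c) / denom
    return float(peak)
```

Running this once per speaker and subtracting the two arrival times (minus the known spacing between their sweeps) gives the inter-speaker offset in samples; dividing by the sample rate converts it to seconds.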

Cumulative distribution of the inter-speaker offset: 50% of bursts within 84 µs, 95% within 0.31 ms, 99.5% within 0.44 ms, none past half a millisecond.

How close, how often 99.5% < 0.44 ms

Sub-half-millisecond, by the numbers

The whole distribution, not a cherry-picked peak: half the bursts land within 84 µs, 95% within 0.31 ms, 99.5% within 0.44 ms — and nothing crosses half a millisecond. Honest read: this is what a single microphone hears, so it already includes the room and the mic’s own ~150 µs of noise; the electrical sync between the cards is at least this tight, not looser.
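Figures like "95% within 0.31 ms" are percentiles of the absolute offset. A minimal sketch of that calculation follows, using the nearest-rank method, which is an assumption (the page does not say how its percentiles were computed). The sample values in the usage note are made up for illustration and are not the measured data.

```python
import math


def percentile_abs(offsets, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of |offset|."""
    ranked = sorted(abs(x) for x in offsets)
    rank = max(1, math.ceil(p * len(ranked) / 100))
    return ranked[rank - 1]


def fraction_within(offsets, limit):
    """Share of bursts whose |offset| is at most `limit`."""
    return sum(abs(x) <= limit for x in offsets) / len(offsets)
```

With ten made-up offsets in microseconds, `[-40, 10, 90, -300, 120, 60, -20, 450, 80, -70]`, the median absolute offset is 70 and 70% of the bursts fall within 100.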

Caught in the act

The same story, in the cluster’s own numbers.

No microphone this time, just the cluster’s own telemetry. We played for twenty minutes and polled every node once a second: each one reports its clock offset to the master and exactly how many samples the playout servo injected, dropped, or replaced with silence. The setup: two real Raspberry Pi Zero players (zero-01, zero-02) and a master (study) as the time reference. Graphed raw.

Clock drift rate over 20 minutes: the master is a flat zero line; zero-01 sits near +12 ppm and zero-02 near +15 ppm, each a roughly flat line wandering a little.

The problem · crystal drift +12 & +15 ppm

Three clocks, three speeds

The master clock is the straight zero line. Each Pi’s quartz runs at its own measured rate (zero-01 about +12 ppm fast, zero-02 about +15 ppm), and this is the rate, not an accumulating offset: a roughly flat line that only wanders as the boards warm. Left uncorrected, +15 ppm pulls a speaker about a millisecond out of sync every seventy seconds, and roughly a fifth of a second apart over a four-hour evening. This is the raw measurement (the slope of each device’s reported clock offset), not a spec-sheet figure.
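To make the drift arithmetic concrete, here is a small sketch that fits a slope to once-a-second offset readings and converts a ppm rate into accumulated error. The least-squares fit is an assumed way to take "the slope"; the function names are hypothetical.

```python
def slope_ppm(times_s, offsets_s):
    """Least-squares slope of clock offset versus time, in ppm."""
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_o = sum(offsets_s) / n
    num = sum((t - mean_t) * (o - mean_o) for t, o in zip(times_s, offsets_s))
    den = sum((t - mean_t) ** 2 for t in times_s)
    return num / den * 1e6


def seconds_to_drift(ppm, drift_s):
    """How long an uncorrected clock takes to accumulate drift_s of error."""
    return drift_s / (ppm * 1e-6)


def drift_after(ppm, elapsed_s):
    """Accumulated error in seconds after elapsed_s of uncorrected drift."""
    return ppm * 1e-6 * elapsed_s
```

At +15 ppm, `seconds_to_drift(15, 1e-3)` is about 66.7 s, and `drift_after(15, 4 * 3600)` is about 0.216 s.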

Injection and drop rate per node: injected hovers on each crystal’s ppm line, dropped stays near zero.

The fix · rate matching net ≈ 0 ppm

Cancelled, sample by sample

The playout servo resamples each node by a hair to erase that drift: it injects a duplicate sample now and then — at about +13 and +14 ppm, landing right on each crystal’s own drift line (faint) — and drops almost nothing. Injected minus dropped nets to essentially zero against the master, which is why the rooms stay phase-locked. The independent clock measurement, the servo’s own reported rate, and this realized injection rate all agree to within about a part per million — three different signals telling the same story.
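One simple way to turn a ppm correction into "a duplicate sample now and then" is a fractional accumulator: add the correction for every sample played and insert one extra sample each time the running total crosses 1. This is an illustrative sketch under that assumption, not a description of how ondaire's servo schedules its corrections; the function names are hypothetical.

```python
def injections(ppm, n_samples):
    """Yield the indices of samples after which one duplicate is inserted."""
    acc = 0.0
    step = ppm * 1e-6            # fraction of a sample owed per sample played
    for i in range(n_samples):
        acc += step
        if acc >= 1.0:           # a whole sample is owed: inject one
            acc -= 1.0
            yield i


def net_ppm(injected, dropped, total_samples):
    """Realized correction rate from the servo's counters, in ppm."""
    return (injected - dropped) / total_samples * 1e6
```

At 48 kHz a +13 ppm correction works out to roughly one injected sample every 1.6 seconds, far too sparse to hear, and `net_ppm` is the kind of figure that can be compared against the independently measured clock slope.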

Silence inserted per minute: a low trickle under ~1 ms/min for both nodes, totalling 13 ms and 4 ms over the run.

The cost · dropouts 13 ms in 20 min

Almost no silence at all

When a buffer briefly runs dry the node outputs silence rather than the wrong sample. Across the full twenty minutes that came to 13 ms on zero-01 and 4 ms on zero-02 — a steady trickle well under a millisecond per minute, roughly 0.001% of the session, none of it clustered into an audible gap. Honest read: this is a real Wi-Fi link, so it isn’t zero; it’s just small enough not to hear.