
Spin Clap

a subframe audio/video sync measurement device

To assess the performance of syncing programs, I designed and built an apparatus to measure the synchronization between the in-camera audio stream and the video stream, as recorded on the camera memory card.

Stepping frame by frame in any NLE, one can verify that the sound is roughly in sync with the images. But how can synchronization be measured with a precision better than a single frame duration? The method described here determined the sync error of some consumer models to within 2 milliseconds¹.

Experimental Setup

A mechanical slate generates events filmed by the camera. The measurement must be repeated many times to evaluate its reproducibility, so the slate is a rotating arm turning at 30 RPM, recorded for a minute or two: this yields some 30-60 clicking events, all processed to build valid statistical distributions.
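As a hedged sketch of that statistical step (the per-revolution sync error $\Delta t$ is defined in "Principle of operation" below; the values here are invented for illustration):

```python
import numpy as np

# Hypothetical per-revolution sync errors, in milliseconds: one value per
# click event, as produced by the processing described below.
delta_t_ms = np.array([41.8, 44.1, 42.9, 43.5, 42.2, 44.0, 43.1, 42.6])

print(f"n = {delta_t_ms.size}, "
      f"mean = {delta_t_ms.mean():.1f} ms, "
      f"std = {delta_t_ms.std(ddof=1):.1f} ms")
```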

No sound is actually produced: on each turn, an electrical contact sends a 100 mV pulse directly into the camera microphone input. Before running the experiment, a static calibration shot records the arm at its reference position, exactly the angular position where the electrical click will occur while spinning.

For more details and images of the apparatus, see the #audiosyncerror Mastodon hashtag.

Principle of operation

Successive arm positions feed a linear regression that interpolates the precise time at which the arm passes through the reference position. This time value $T_v$ (video time) is compared with the time of the pulse detected in the audio recording, $T_a$ (audio time). Ideally the two times coincide, and the sync error $\Delta t$

$$\Delta t = T_a - T_v$$

should be zero, or at the very least much less than a frame duration. If $T_a < T_v$, the audio click happens before the arm reaches the reference position, so the audio is leading. Hence cameras with lagging audio have a positive $\Delta t$, and leading audio gives $\Delta t < 0$.
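A minimal sketch of that computation, assuming numpy (the timestamps, angles, and $T_a$ below are invented for illustration; at 30 RPM the arm turns at $\pi$ rad/s):

```python
import numpy as np

# One pass: frame timestamps (s) and the arm angle (rad) measured in each
# frame. Hypothetical values: 25 fps sampling of an arm turning at pi rad/s.
rng = np.random.default_rng(0)
frame_times = np.arange(10) * 0.04
arm_angles = np.pi * frame_times - 0.50 + 0.005 * rng.standard_normal(10)

theta_ref = 0.0  # angular reference position from the calibration shot

# Linear fit angle(t) = a*t + b, then solve a*T_v + b = theta_ref:
a, b = np.polyfit(frame_times, arm_angles, 1)
T_v = (theta_ref - b) / a      # video time of the reference crossing

T_a = 0.2012                   # pulse time found in the audio track (hypothetical)
delta_t = T_a - T_v            # > 0: audio lags, < 0: audio leads
print(f"sync error: {1000 * delta_t:+.2f} ms")
```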

Automated image and sound processing

About ten frames are used for each pass (five before the click and five after), and in each frame the angular arm position must be measured with the best available spatial precision (at the pixel level). Doing all this manually would be tedious, so the video frames and the sound track are analyzed with a Python script (I won't publish this hack, but I can share it if you ask).
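For the audio side, here is a hedged sketch of one way to locate the click pulses, assuming scipy and numpy (the file name and threshold are hypothetical):

```python
import numpy as np
from scipy.io import wavfile

# Locate the click pulses in the recorded track by threshold crossing,
# with a refractory period so each 2 s revolution yields one event time.
rate, samples = wavfile.read("spinclap_audio.wav")  # hypothetical file
if samples.ndim > 1:
    samples = samples[:, 0]                         # keep one channel
x = np.abs(samples.astype(np.float64))

above = x > 0.5 * x.max()
onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1  # rising edges

refractory = rate            # 1 s in samples (arm period is 2 s at 30 RPM)
pulse_times, last = [], -refractory
for i in onsets:
    if i - last >= refractory:
        pulse_times.append(i / rate)  # candidate T_a values, in seconds
        last = i
```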

To facilitate automated pattern recognition, high-contrast videos are shot in complete darkness with a dimly lit LED fixed on the rotating arm. At this low light level the camera selects its maximum exposure time, so the LED draws a long arc and successive arcs nearly touch each other (this corresponds almost to a 360-degree shutter angle).
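A hedged sketch of the corresponding image analysis, assuming OpenCV for frame access (the clip name, hub position, and threshold are all hypothetical):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("spinclap_pass.mp4")     # hypothetical clip of one pass
fps = cap.get(cv2.CAP_PROP_FPS)

times, angles = [], []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
    if gray.max() == 0:        # skip fully black frames
        frame_idx += 1
        continue
    # In a dark scene the LED arc dominates: an intensity-weighted centroid
    # of the bright pixels gives a sub-pixel estimate of its position.
    ys, xs = np.nonzero(gray > 0.5 * gray.max())
    w = gray[ys, xs]
    x_c, y_c = np.average(xs, weights=w), np.average(ys, weights=w)
    # Arm angle around the hub, here assumed at the image centre.
    h, wd = gray.shape
    angles.append(np.arctan2(y_c - h / 2, x_c - wd / 2))
    times.append(frame_idx / fps)
    frame_idx += 1
cap.release()

angles = np.unwrap(np.array(angles))  # remove 2*pi jumps before the linear fit

# The calibration shot's LED position maps to theta_ref in the same way,
# which then feeds the linear fit shown earlier.
```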

After manually entering the $(x, y)$ reference position of the LED obtained from the 'calibration' static sequence, the Python script goes through the following steps:

Here’s a composite (and reversed) image of a single clockwise pass where nine frames were used for the fit (plotted below the video composite).


  1. The 2 ms figure is the standard deviation over multiple measurements; a systematic error could remain on top of it, but that is unlikely.