
Spin Clap

a subframe audio/video sync measurement device

To assess the performance of syncing programs, I designed and built an apparatus to measure the synchronization between the in-camera audio stream and the video stream, as recorded on the camera memory card.

Stepping frame by frame in any NLE, one can verify that the sound is roughly in sync with the images. But how can synchronization be measured with a precision better than a single frame duration? The method described here determined the sync error of some consumer models to within 2 milliseconds¹.

Experimental Setup

A mechanical slate generates events filmed by the camera. The measurement must be repeated many times to evaluate its reproducibility, so the slate is a rotating arm turning at 30 RPM, recorded for a minute or two: this yields some 30-60 clicking events, all processed to build valid statistical distributions.
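As a hedged sketch of that statistical step (the per-revolution sync error $\Delta t$ is defined in "Principle of operation" below; the values here are invented for illustration):

```python
import numpy as np

# Hypothetical per-revolution sync errors, in milliseconds: one value per
# click event, as produced by the processing described below.
delta_t_ms = np.array([41.8, 44.1, 42.9, 43.5, 42.2, 44.0, 43.1, 42.6])

print(f"n = {delta_t_ms.size}, "
      f"mean = {delta_t_ms.mean():.1f} ms, "
      f"std = {delta_t_ms.std(ddof=1):.1f} ms")
```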

No sound is actually produced: on each turn, an electrical contact sends a 100 mV pulse directly into the camera microphone input. Before running the experiment, a static calibration shot records the arm at its reference position, exactly the angular position where the electrical click will occur while spinning.

For more details and images of the apparatus, see the #audiosyncerror Mastodon hashtag.

Principle of operation

Successive arm positions feed a linear regression that interpolates the precise time at which the arm passes through the reference position. This time value $T_v$ (video time) is compared with the time of the pulse detected in the audio recording, $T_a$ (audio time). Ideally the two times coincide, and the sync error $\Delta t$

$$\Delta t = T_a - T_v$$

should be zero, or at the very least much less than a frame duration. If $T_a < T_v$, the audio click happens before the arm reaches the reference position, so the audio is leading. Hence cameras with lagging audio have a positive $\Delta t$, and leading audio gives $\Delta t < 0$.
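A minimal sketch of that computation, assuming numpy (the timestamps, angles, and $T_a$ below are invented for illustration; at 30 RPM the arm turns at $\pi$ rad/s):

```python
import numpy as np

# One pass: frame timestamps (s) and the arm angle (rad) measured in each
# frame. Hypothetical values: 25 fps sampling of an arm turning at pi rad/s.
rng = np.random.default_rng(0)
frame_times = np.arange(10) * 0.04
arm_angles = np.pi * frame_times - 0.50 + 0.005 * rng.standard_normal(10)

theta_ref = 0.0  # angular reference position from the calibration shot

# Linear fit angle(t) = a*t + b, then solve a*T_v + b = theta_ref:
a, b = np.polyfit(frame_times, arm_angles, 1)
T_v = (theta_ref - b) / a      # video time of the reference crossing

T_a = 0.2012                   # pulse time found in the audio track (hypothetical)
delta_t = T_a - T_v            # > 0: audio lags, < 0: audio leads
print(f"sync error: {1000 * delta_t:+.2f} ms")
```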

Automated image and sound processing

About ten frames are used for each pass (five before the click and five after), and in each frame the angular arm position must be measured with the best available spatial precision (at the pixel level). Doing all this manually would be tedious, so the video frames and the sound track are analyzed with a Python script (I won't publish this hack, but I can share it if you ask).
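For the audio side, here is a hedged sketch of one way to locate the click pulses, assuming scipy and numpy (the file name and threshold are hypothetical):

```python
import numpy as np
from scipy.io import wavfile

# Locate the click pulses in the recorded track by threshold crossing,
# with a refractory period so each 2 s revolution yields one event time.
rate, samples = wavfile.read("spinclap_audio.wav")  # hypothetical file
if samples.ndim > 1:
    samples = samples[:, 0]                         # keep one channel
x = np.abs(samples.astype(np.float64))

above = x > 0.5 * x.max()
onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1  # rising edges

refractory = rate            # 1 s in samples (arm period is 2 s at 30 RPM)
pulse_times, last = [], -refractory
for i in onsets:
    if i - last >= refractory:
        pulse_times.append(i / rate)  # candidate T_a values, in seconds
        last = i
```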

To facilitate automated pattern recognition, high-contrast videos are shot in complete darkness with a dimly lit LED fixed on the rotating arm. At this low light level the camera selects its maximum exposure time, so the LED draws a long arc and successive arcs nearly touch each other (this corresponds almost to a 360-degree shutter angle).
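A hedged sketch of the corresponding image analysis, assuming OpenCV for frame access (the clip name, hub position, and threshold are all hypothetical):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("spinclap_pass.mp4")     # hypothetical clip of one pass
fps = cap.get(cv2.CAP_PROP_FPS)

times, angles = [], []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
    if gray.max() == 0:        # skip fully black frames
        frame_idx += 1
        continue
    # In a dark scene the LED arc dominates: an intensity-weighted centroid
    # of the bright pixels gives a sub-pixel estimate of its position.
    ys, xs = np.nonzero(gray > 0.5 * gray.max())
    w = gray[ys, xs]
    x_c, y_c = np.average(xs, weights=w), np.average(ys, weights=w)
    # Arm angle around the hub, here assumed at the image centre.
    h, wd = gray.shape
    angles.append(np.arctan2(y_c - h / 2, x_c - wd / 2))
    times.append(frame_idx / fps)
    frame_idx += 1
cap.release()

angles = np.unwrap(np.array(angles))  # remove 2*pi jumps before the linear fit

# The calibration shot's LED position maps to theta_ref in the same way,
# which then feeds the linear fit shown earlier.
```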

After manually entering the $(x, y)$ reference position of the LED obtained from the 'calibration' static sequence, the Python script goes through the following steps:

Here’s a composite (and reversed) image of a single clockwise pass where nine frames were used for the fit (plotted below the video composite).


  1. The 2 ms figure is the standard deviation over multiple measurements; a systematic error could remain on top of it, but that is unlikely.