
At some point back when the dinosaurs roamed the Earth and I was in
high school, I borrowed my first digital signal processing book from a friend.
I later went on to an engineering education and master's thesis about DSP,
but the very basics of DSP never stop fascinating me. Today, I wanted to
write something about one of them and how it affects audio processing in
Nageru (and finally, how Debian's policies put me in a bit of a bind on
this issue).
DSP texts tend to obscure profound truths with large amounts of maths,
so I'll try to present a somewhat less general rule that doesn't require
going into the mathematical details. That rule is:
Adding a signal to weighted, delayed copies of itself is a filtering operation.
(It's simple, but ignoring it will have sinister effects, as we'll see later.)
Let's see exactly what that means with a motivating example. Let's say that
I have a signal where I want to get rid of (or rather, reduce) high frequencies.
The simplest way I can think of is to add every sample to its neighbor; that is,
set y[n] = x[n] + x[n-1]. For each sample, we
add the previous sample, i.e., the signal as it was one sample ago.
(We ignore what happens at the edges; the common
convention is to assume signals extend out to infinity with zeros.)
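As a concrete sketch of this two-tap filter (Python here, purely for illustration):

```python
def two_tap_filter(x):
    """Apply y[n] = x[n] + x[n-1] to a list of samples.
    The signal is assumed to be zero before sample 0."""
    y = []
    prev = 0.0  # the "previous sample" before the signal starts
    for sample in x:
        y.append(sample + prev)
        prev = sample
    return y

print(two_tap_filter([1.0, 2.0, 3.0]))  # [1.0, 3.0, 5.0]
```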
What effect will this have? We can figure it out with some trigonometry,
but let's just demonstrate it by plotting instead: We assume 48 kHz sample rate
(which means that our one-sample delay is 20.83 µs) and a 22 kHz
note (definitely treble!), and plot the signal with one-sample delay
(the x axis is sample number):

As you can see, the resulting signal is a new signal of the same frequency
(which is always true; linear filtering can never create new frequencies,
just boost or dampen existing ones), but with much lower amplitude.
The signal and its delayed version mostly cancel each other
out. Also note that the signal has changed
phase; the resulting signal
is slightly delayed compared to the original.
Now let's look at a 50 Hz signal (turn on your bass face). We need to zoom out a bit to see
full 50 Hz cycles:

The original signal and the delayed one overlap almost exactly! For a lower
frequency, the one-sample delay means almost nothing (since the waveform is
varying so slowly), and thus, in this case, the resulting signal is amplified,
not dampened. (The signal has changed phase here, too, actually by exactly as much
in terms of real time, but we don't really see it, because we've zoomed out.)
Real signals are not pure sines, but they can be seen as sums of many sines
(another fundamental DSP result), and since filtering is a linear operation,
it affects those sines independently. In other words, we now have a very
simple filter that will amplify low frequencies and dampen high frequencies
(and delay the entire signal a little bit). We can do this for all
frequencies from 0 to 24000 Hz; let's ask Octave to do it for us:

(Of course, in a real filter, we'd probably multiply the result with 0.5
to leave the bass untouched instead of boosting it, but it doesn't really
change anything. A real filter would have a lot more coefficients, though,
and they wouldn't all be the same!)
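For reference, the gain of y[n] = x[n] + x[n-1] at frequency f can be computed directly as |1 + e^(-2πif/fs)|, which works out to 2|cos(πf/fs)|. A small sketch of the sweep (Python here, not the Octave code used for the plot):

```python
import cmath
import math

fs = 48000.0  # sample rate in Hz

def gain(f):
    # Frequency response of y[n] = x[n] + x[n-1] is H(f) = 1 + e^(-2*pi*i*f/fs);
    # its magnitude is 2*|cos(pi*f/fs)|.
    return abs(1 + cmath.exp(-2j * math.pi * f / fs))

print(round(gain(50), 3))     # 2.0: bass is boosted (doubled)
print(round(gain(22000), 3))  # 0.261: treble is strongly dampened
print(round(gain(24000), 3))  # 0.0: the Nyquist frequency is cancelled entirely
```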
Let's now turn to a problem that will at first seem different: Combining
audio from multiple different time sources. For instance, when mixing video,
you could have input from two different cameras or sound cards and would want
to combine them (say, a source playing music and then some audience sound
from a camera). However, unless you are lucky enough to have a professional-grade setup
where everything runs off the same clock (and separate clock source cables
run to every device), they won't be in sync; sample clocks are good, but they are
not perfect, and they have e.g. some temperature variance. Say we have really
good clocks and they only differ by 0.01%; this means that after an hour of
streaming, we have 360 ms delay, completely ruining lip sync!
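Spelled out as arithmetic (Python, purely illustrative; integer microseconds to keep the numbers exact):

```python
drift_ppm = 100      # 0.01% clock difference = 100 parts per million
seconds = 3600       # one hour of streaming
offset_us = drift_ppm * seconds  # 1 ppm of one second is 1 microsecond
offset_ms = offset_us // 1000
print(offset_ms)  # 360: after an hour, the sources are 360 ms out of sync
```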
This means we'll need to resample at least one of the sources to match the
other; that is, play one of them faster or slower than it came in originally.
There are two problems here: How do you determine how much to resample
the signals, and how do we resample them?
The former is a difficult problem in its own right; just about every algorithm
not backed by solid control theory
is doomed to fail in one way or another, and when it fails, it's extremely
annoying to listen to. Nageru follows
a 2012 paper by Fons Adriaensen;
GStreamer does, well, something else. It fails pretty badly in a number of
cases; see e.g.
this 2015 master's thesis
that tries to patch it up. However, let's ignore this part of the problem
for now and focus on the resampling.
So let's look at the case where we've determined we have a signal and need to
play it 0.01% faster (or slower); in a real situation, this number would vary
a bit (clocks are not even consistently wrong). This means that at some point,
we want to output sample number 3000 and that corresponds to input sample
number 3000.3, i.e., we need to figure out what's between two input samples.
As with so many other things, there's a way to do this that's simple, obvious
and wrong, namely linear interpolation.
The basis of linear interpolation is to look at the two neighboring samples
and weigh them according to the position we want. If we need sample 3000.3,
we calculate y = 0.7 x[3000] + 0.3 x[3001] (don't switch
the two coefficients!), or, if we want to save one multiplication and get
better numerical behavior, we can use the equivalent
y = x[3000] + 0.3 (x[3001] - x[3000]).
And if we need sample 5000.5, we take y = 0.5 x[5000] + 0.5 x[5001].
And after a while, we'll be back on integer samples; output sample
10000 corresponds to x[10001] exactly.
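A minimal linear-interpolation resampler along these lines might look like this (a Python sketch; a ratio of 1.0001 would correspond to the 0.01% case above, but any ratio works):

```python
def resample_linear(x, ratio):
    """Resample x so that output sample n is taken at input position n * ratio,
    using linear interpolation between the two neighboring input samples."""
    y = []
    n = 0
    while True:
        pos = n * ratio
        i = int(pos)
        if i + 1 >= len(x):
            break
        frac = pos - i
        # The numerically nicer one-multiplication form:
        y.append(x[i] + frac * (x[i + 1] - x[i]))
        n += 1
    return y

# Output sample 1 falls at input position 1.5, halfway between x[1] and x[2]:
print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], 1.5))  # [0.0, 1.5, 3.0]
```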
By now, I guess it should be obvious what's going on: We're creating a filter!
Linear interpolation will inevitably result in high frequencies being
dampened; and even worse, we are creating a
time-varying filter, which means
that the amount of dampening will vary over time. This manifests itself as
a kind of high-frequency flutter, where the amount of flutter depends on
the relative resampling frequencies. There's also cubic resampling (which
can mean any of several different algorithms), but it only reduces
the problem; it doesn't solve it.
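The time variation is easy to see numerically: linear interpolation at fractional position a is itself a two-tap filter, y = (1-a) x[i] + a x[i+1], whose gain at a given frequency depends on a. A sketch (Python; the 12 kHz tone at 48 kHz is chosen only to make the numbers round):

```python
import cmath
import math

fs = 48000.0
f = 12000.0  # test frequency

def interp_gain(a):
    # Gain of y = (1 - a) * x[i] + a * x[i+1] at frequency f:
    # |(1 - a) + a * e^(-2*pi*i*f/fs)|
    w = 2 * math.pi * f / fs
    return abs((1 - a) + a * cmath.exp(-1j * w))

for a in (0.0, 0.25, 0.5):
    print(round(interp_gain(a), 3))  # 1.0, 0.791, 0.707: the gain wobbles with a
```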
The proper way of interpolating depends a lot on exactly what you want
(e.g., whether you intend to change the rate quickly or not);
this paper
lays out a bunch of them, and was the paper that originally made me understand
why linear interpolation is so bad. Nageru outsources this problem to
zita-resampler,
again by Fons Adriaensen; it yields extremely high-quality resampling
under controlled delay, through a relatively common technique known as
polyphase filters.
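To give a rough idea of the technique (this is only the general shape of a polyphase resampler, not zita-resampler's actual implementation; all names and parameters here are made up for illustration): instead of computing a fresh filter per output sample, you precompute a bank of fractional-delay lowpass filters, one per quantized phase, and pick the right one for each output position.

```python
import math

def polyphase_bank(num_phases=32, taps=8, cutoff=0.45):
    """Precompute one windowed-sinc lowpass per fractional delay (phase)."""
    bank = []
    for p in range(num_phases):
        frac = p / num_phases
        center = (taps - 1) / 2 + frac  # fractional center of this phase's filter
        coeffs = []
        for t in range(taps):
            x = t - center
            # Ideal lowpass impulse response, sampled at this phase's offsets:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x) if x != 0 else 2 * cutoff
            window = max(0.5 + 0.5 * math.cos(math.pi * x / (taps / 2)), 0.0)  # crude Hann-ish window
            coeffs.append(sinc * window)
        s = sum(coeffs)
        bank.append([c / s for c in coeffs])  # normalize so DC gain is exactly 1
    return bank

def resample_polyphase(x, ratio, bank):
    taps = len(bank[0])
    y, n = [], 0
    while True:
        pos = n * ratio
        i = int(pos)
        if i + taps > len(x):
            break
        phase = bank[int((pos - i) * len(bank))]  # nearest precomputed phase
        y.append(sum(c * s for c, s in zip(phase, x[i:i + taps])))
        n += 1
    return y

bank = polyphase_bank()
out = resample_polyphase([1.0] * 100, 1.0001, bank)
print(all(abs(v - 1.0) < 1e-9 for v in out))  # True: DC passes through unchanged
```

The inner loop is just a short dot product per output sample, which is exactly the kind of code that SSE vectorizes well.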
Unfortunately, doing this kind of calculation takes CPU. Not a lot of CPU,
but Nageru runs in rather CPU-challenged environments (ultraportable laptops
where the GPU wants most of the TDP, and the CPU has to go down to the
lowest frequency), and it is moving in a direction where it needs to resample
many more channels (more on that later), so every bit of CPU helps. So I coded
up an SSE optimization of the inner loop for a particular common case
(stereo signals) and sent it in for upstream inclusion. (It made the
code 2.7 times as fast without any structural changes or reducing precision,
which is pretty much what you can expect from SSE.)
Unfortunately, after a productive discussion, suddenly upstream went silent.
I tried pinging, pinging again, and after half a year pinging again, but to
no avail. I
filed the patch in Debian's BTS,
but the maintainer is understandably reluctant to carry a delta against upstream.
I also can't embed a copy; Debian policy would dictate that I build against
the system's zita-resampler. I could work around it by
rewriting zita-resampler until it looks nothing like the original, which
might be a good idea anyway if I wanted to squeeze out the last drops of speed;
there are AVX optimizations to be had in addition to SSE, and the structure
as-is isn't ideal for SSE optimizations (although some of the changes I have
in mind would have to be offset against increased L1 cache footprint,
so careful benchmarking would be needed). But in a sense, it feels like just
working around a policy that's there for good reason. So like I said, I'm
in a bit of a bind. Maybe I should just buy a faster laptop.
Oh, and how does GStreamer solve this? Well, it doesn't use linear interpolation.
It does something even worse: it uses nearest neighbor. Gah.
Update: I was asked to clarify that this is about the audio resampling done
by the GStreamer audio sink to sync signals, not in the audioresample element, which solves
a related but different problem (static sample rate conversion).
The audioresample element supports a number of different resampling methods;
I haven't evaluated them.