Narabu is a new intraframe video codec. You probably want to read

part 1,

part 2,

part 3
and

part 4
first.
At this point, we've basically caught up with where I am, so thing are less
set in stone. However, let's look at what qualitatively differentiates
encoding from decoding; unlike in interframe video codecs (where you need
to do motion vector search and stuff), encoding and decoding are very much
mirror images of each other, so we can intuitively expect them to be
relatively similar in performance. The procedure is DCT, quantization,
entropy coding, and that's it.
One important difference is in the entropy coding. Since our rANS encoding
is non-adaptive (a choice made largely for simplicity, but also because
our streams are so short) works by first signaling a distribution and then
encoding each coefficient using that distribution. However, we don't know
that distribution until we've DCT-ed all blocks in the picture, so we can't
just DCT each block and entropy code the coefficients on-the-fly.
There are a few ways to deal with this:

- DCT and quantize first, then store to a temporary array, count up all the
values, make rANS distributions, and then entropy code from that array.
This is conceptually the simplest, but see below.
- Use some sort of approximate distribution, which also saves us from doing
the actual counting. This has the problem that
different images look different, so it loses efficiency (and could do so
quite badly). Furthermore, it means this approximate distribution could
never have a 0 for any frequency, which also is a net-negative when we're
talking about only 12-bit resolution.
- Use the values from the
*previous* frame, since video tends to have
correlation between frames (and we can thus expect the distribution to
be a better fit than some generic distribution decided ahead of time).
Has many of the same problems as previous point, though, plus we get
back the overhead of counting. Still, an interesting possibility that
I haven't tried in practice yet.
- Do DCT and quantization just to get the values for counting, then do
it all over again later when we have the distribution. Saves the store
to the temporary array (the loads will be the same, since we need to
load the image twice), at the cost of doing DCT twice. I haven't tried
this either.
- A variant on the previous one: Do DCT on a few random blocks (10%?)
to get an approximate distribution. I did something similar to this
at some point, and it was surprisingly good, but I haven't gone further
with it.

As you can see, tons of possible strategies here. For simplicity, I've ended
up with the former, although this could very well be changed at some point.
There are some interesting subproblems, though:
First of all, we need to decide the data type of this temporary array.
The DCT tends to concentrate energy into fewer coefficients (which is
a great property for compression!), so even after quantization, some of
them will get quite large. This means we cannot store them in an 8-bit
texture; however, even the bigger ones are very rarely bigger than 10 bits
(including a sign bit), so using 16-bit textures wastes precious memory
bandwidth.
I ended up slicing the coefficients by horizontal index and then pairing
them up (so that we generate pairs 0+7, 1+6, 2+5 and 3+4 for the first line
of the 8x8 DCT block, 8+15, 9+14, 10+13 and 11+12 for the next line, etc.).
This allows us to pack two coefficients into a 16-bit texture, for an

*average* of 8 bits per coefficient, which is what we want. It makes for
some slightly fiddly clamping and bit-packing since we are packing signed
values, but nothing really bad.
Second, and perhaps surprisingly enough, counting efficiently is nontrivial.
We want a histogram over what coefficients are used the more often, ie.,
for each coefficient, we want something like

`++counts[dist][coeff]`

(recall we have four distinct distributions). However, since we're in a
massively parallel algorithm, this add needs to be atomic, and since values
like e.g. 0 are super-common, all of our GPU cores will end up fighting over
the cache line containing

`counts[dist][0]`

. This is not fast.
Think 10 ms/frame not fast.
Local memory to the rescue again; all modern GPUs have fast atomic adds
to local memory (basically integrating adders into the cache, as I understand
it, although I might have misunderstood here). This means we just make a
suitably large local group, build up our sub-histogram in local memory and
then add all nonzero buckets (atomically) to the global histogram. This
improved performance

*dramatically*, to the point where it was well below 0.1
ms/frame.
However, our histogram is still a bit raw; it sums to 1280x720 = 921,600
values, but we want an approximation that sums to exactly 4096 (12 bits), with
some additional constraints (like no nonzero coefficients). Charles Bloom has

an exposition of a nearly optimal algorithm,
although it took me a while to understand it. The basic idea is: Make a good
approximation by multiplying each frequency by 4096/921600 (rounding
intelligently). This will give you something that very nearly rounds to
4096 either above or below, e.g. 4101. For each step you're above or below
the total (five in this case), find the best single coefficient to adjust (most entropy gain, or least
loss); Bloom is using a heap, but on the GPU, each core is slow but we have
many of them, so it's better just to try all 256 possibilities in parallel
and have a simple voting scheme through local memory to find the best one.
And then finally, we want a cumulative distribution function, but that is
simple through a parallel prefix sum on the 256 elements.
And then finally, we can take our DCT coefficients and the finished rANS
distribution, and write the data! We'll have to leave some headroom for
the streams (I've allowed 1 kB for each, which should be ample except for
adversarial data and we'll probably solve that just by truncating the
writes and accepting the corruption), but we'll compact them when we write to
disk.
Of course, the Achilles heel here is performance. Where decoding 720p (luma only)
on my GTX 950 took 0,4 ms or so, encoding is 1,2 ms or so, which is just too
slow. Remember that 4:2:2 is twice that, and we want multiple streams, so
2,4 ms per frame is eating too much. I don't really know why it's so slow;
the DCT isn't bad, the histogram counting is fast, it's just the rANS shader
that's slow for some reason I don't understand, and also haven't had the time
to really dive deeply into. Of course, a faster GPU would be faster, but I
don't think I can reasonably demand that people get a 1080 just to encode
a few video streams.
Due to this, I haven't really worked out the last few kinks. In particular,
I haven't implemented DC coefficient prediction (it needs to be done before
tallying up the histograms, so it can be a bit tricky to do efficiently,
although perhaps local memory will help again to send data between
neighboring DCT blocks). And I also haven't properly done bounds checking
in the encoder or decoder, but it should hopefully be simple as long as
we're willing to accept that evil input decodes into garbage instead of
flagging errors explicitly. It also depends on a GLSL extension that my
Haswell laptop doesn't have to get 64-bit divides when precalculating the
rANS tables; I've got some code to simulate 64-bit divides using 32-bit,
but it doesn't work yet.
The code as it currently stands is in

this Git repository;
you can consider it licensed under GPLv3. It's really not very user-friendly
at this point, though, and rather rough around the edges.
Next time, we'll wrap up with some performance numbers. Unless I magically
get more spare time in the meantime and/or some epiphany about how to make
the encoder faster. :-)