Search Results: "Steinar H. Gunderson"

21 September 2021

Russell Coker: Links September 2021

Matthew Garrett wrote an interesting and insightful blog post about the license of software developed or co-developed by machine-learning systems [1]. One of his main points is that people in the FOSS community should aim for less copyright protection.

The USENIX ATC 21/OSDI 21 Joint Keynote Address titled "It's Time for Operating Systems to Rediscover Hardware" has some insightful points to make [2]. Timothy Roscoe makes some incendiary points but backs them up with evidence. Is Linux really an OS? I recommend that everyone who's interested in OS design watch this lecture.

Cory Doctorow wrote an interesting set of 6 articles about Disneyland, ride pricing, and crowd control [3]. He proposes some interesting ideas for reforming Disneyland.

Benjamin Bratton wrote an insightful article about how philosophy failed in the pandemic [4]. He focuses on the Italian philosopher Giorgio Agamben, who has a history of writing stupid articles that match QAnon talking points but with better language skills.

Ars Technica has an interesting article about penetration testers extracting an encryption key from the bus used by the TPM on a laptop [5]. It's not a likely attack in the real world, as most networks can be broken more easily by other methods. But it's still interesting to learn about how the technology works.

The Portalist has an article about David Brin's Startide Rising series of novels and his thoughts on the concept of Uplift (which he denies inventing) [6].

Jacobin has an insightful article titled "You're Not Lazy But Your Boss Wants You to Think You Are" [7]. Making people identify as lazy is bad for them and bad for getting them to do work. But this is the first time I've seen it described as a facet of abusive capitalism.

Jacobin has an insightful article about free public transport [8]. Apparently there are already many regions that have free public transport (Tallinn, the capital of Estonia, being one example).
Fare-free public transport allows bus drivers to concentrate on driving, not taking fares, removes the need for ticket inspectors, and generally provides a better service. It allows passengers to board buses and trams faster, thus reducing traffic congestion, encourages more people to use public transport instead of driving, and reduces road maintenance costs.

Interesting research from Israel about bypassing facial ID [9]. Apparently they can make a set of 9 images that can pass for over 40% of the population. I didn't expect facial recognition to be an effective form of authentication, but I didn't expect it to be that bad.

Edward Snowden wrote an insightful blog post about types of conspiracies [10].

Kevin Rudd wrote an informative article about Sky News in Australia [11]. We need to have a Royal Commission now, before we have our own January 6th event.

Steve from Big Mess o' Wires wrote an informative blog post about USB-C and 4K 60Hz video [12]. Basically you can't have a single USB-C hub do 4K 60Hz video and be a USB 3.x hub unless you have compression software running on your PC (slow and only works on Windows), or have DisplayPort 1.4 or Thunderbolt (both not well supported). None of the options are well documented on online store pages, so lots of people will get unpleasant surprises when their deliveries arrive. Computers suck.

Steinar H. Gunderson wrote an informative blog post about GaN technology for smaller power supplies [13]. A 65W USB-C PSU that fits the usual wall wart form factor is an interesting development.

15 November 2017

Steinar H. Gunderson: Introducing Narabu, part 6: Performance

Narabu is a new intraframe video codec. You probably want to read part 1, part 2, part 3, part 4 and part 5 first. Like I wrote in part 5, there basically isn't a big splashy ending where everything is resolved here; you're basically getting some graphs with some open questions and some interesting observations. First of all, though, I'll need to make a correction: In the last part, I wrote that encoding takes 1.2 ms for 720p luma-only on my GTX 950, which isn't correct; I remembered the wrong number. The right number is 2.3 ms, which I guess explains even more why I don't think it's acceptable at the current stage. (I'm also pretty sure it's possible to rearchitect the encoder so that it's much better, but I am moving on to other video-related things for the time being.) I encoded a picture straight off my DSLR (luma-only) at various resolutions, keeping the aspect. Then I decoded it a bunch of times on my GTX 950 (low-end last-generation NVIDIA) and on my HD 4400 (ultraportable Haswell laptop) and measured the times. They're normalized for megapixels per second decoded; remember that doubling width (x axis) means quadrupling the pixels. Here it is: [Narabu decoding performance graph] I'm not going to comment much beyond two observations: Encoding only contains the GTX 950 because I didn't finish the work to get that single int64 divide off: [Narabu encoding performance graph] This is interesting. I have few explanations. Probably more benchmarking and profiling would be needed to make sense of any of it. In fact, it's so strange that I would suspect a bug, but it does indeed seem to create a valid bitstream that is decoded by the decoder. Do note, however, that seemingly even on the smallest resolutions, there's a 1.7 ms base cost (you can't see it on the picture, but you'd see it in an unnormalized graph).
I don't have a very good explanation for this either (even though there are some costs that are dependent on the alphabet size instead of the number of pixels), but figuring it out would probably be a great start for getting the performance up. So that concludes the series, on a cliffhanger. :-) Even though it's not in a situation where you can just take it and put it into something useful, I hope it was an interesting introduction to the GPU! And in the meantime, I've released version 1.6.3 of Nageru, my live video mixer (also heavily GPU-based) with various small adjustments and bug fixes found before and during Trøndisk. And Movit is getting compute shaders for that extra speed boost, although parts of it are bending my head. Exciting times in GPU land :-)

4 November 2017

Steinar H. Gunderson: Trøndisk 2017 live stream

We're streaming live from Trøndisk 2017, the first round of the Norwegian ultimate frisbee series, today from 0945 CET and throughout the day/weekend. It's an interesting first for Nageru in that it's sports, where everything happens much faster and there are more demands for overlay graphics (I've made a bunch of CasparCG templates). I had hoped to get to use Narabu in this, but as the (unfinished) post series indicates, I simply had to prioritize other things. There's plenty of new things for us anyway, not least that I'll be playing and not operating. :-) Feel free to tune in to the live stream, although we don't have international stream reflectors. It's a fun sport with many nice properties. :-) The videos will be up on YouTube not too long after the day is over, too. Edit: All games from Saturday and Sunday are now online; see this YouTube list. Most commentary is in Norwegian, although some games are in English. I was happy everything worked well, and the production crew did a great job (I was busy playing), but of course there's tons of small things we want to improve for next time.

2 November 2017

Steinar H. Gunderson: Introducing Narabu, part 5: Encoding

Narabu is a new intraframe video codec. You probably want to read part 1, part 2, part 3 and part 4 first. At this point, we've basically caught up with where I am, so things are less set in stone. However, let's look at what qualitatively differentiates encoding from decoding; unlike in interframe video codecs (where you need to do motion vector search and stuff), encoding and decoding are very much mirror images of each other, so we can intuitively expect them to be relatively similar in performance. The procedure is DCT, quantization, entropy coding, and that's it. One important difference is in the entropy coding. Our rANS encoding is non-adaptive (a choice made largely for simplicity, but also because our streams are so short); it works by first signaling a distribution and then encoding each coefficient using that distribution. However, we don't know that distribution until we've DCT-ed all blocks in the picture, so we can't just DCT each block and entropy code the coefficients on-the-fly. There are a few ways to deal with this: As you can see, tons of possible strategies here. For simplicity, I've ended up with the former, although this could very well be changed at some point. There are some interesting subproblems, though: First of all, we need to decide the data type of this temporary array. The DCT tends to concentrate energy into fewer coefficients (which is a great property for compression!), so even after quantization, some of them will get quite large. This means we cannot store them in an 8-bit texture; however, even the bigger ones are very rarely bigger than 10 bits (including a sign bit), so using 16-bit textures wastes precious memory bandwidth. I ended up slicing the coefficients by horizontal index and then pairing them up (so that we generate pairs 0+7, 1+6, 2+5 and 3+4 for the first line of the 8x8 DCT block, 8+15, 9+14, 10+13 and 11+12 for the next line, etc.).
This allows us to pack two coefficients into a 16-bit texture, for an average of 8 bits per coefficient, which is what we want. It makes for some slightly fiddly clamping and bit-packing since we are packing signed values, but nothing really bad. Second, and perhaps surprisingly enough, counting efficiently is nontrivial. We want a histogram over which coefficients are used most often, i.e., for each coefficient, we want something like ++counts[dist][coeff] (recall we have four distinct distributions). However, since we're in a massively parallel algorithm, this add needs to be atomic, and since values like e.g. 0 are super-common, all of our GPU cores will end up fighting over the cache line containing counts[dist][0]. This is not fast. Think 10 ms/frame not fast. Local memory to the rescue again; all modern GPUs have fast atomic adds to local memory (basically integrating adders into the cache, as I understand it, although I might have misunderstood here). This means we just make a suitably large local group, build up our sub-histogram in local memory and then add all nonzero buckets (atomically) to the global histogram. This improved performance dramatically, to the point where it was well below 0.1 ms/frame. However, our histogram is still a bit raw; it sums to 1280x720 = 921,600 values, but we want an approximation that sums to exactly 4096 (12 bits), with some additional constraints (like no used symbol getting zero slots). Charles Bloom has an exposition of a nearly optimal algorithm, although it took me a while to understand it. The basic idea is: Make a good approximation by multiplying each frequency by 4096/921600 (rounding intelligently). This will give you something that sums to very nearly 4096, either just above or below, e.g. 4101.
For each step you're above or below the total (five in this case), find the best single coefficient to adjust (most entropy gain, or least loss); Bloom is using a heap, but on the GPU, each core is slow but we have many of them, so it's better just to try all 256 possibilities in parallel and have a simple voting scheme through local memory to find the best one. And then, we want a cumulative distribution function, but that is simple through a parallel prefix sum on the 256 elements. And then finally, we can take our DCT coefficients and the finished rANS distribution, and write the data! We'll have to leave some headroom for the streams (I've allowed 1 kB for each, which should be ample except for adversarial data, and we'll probably solve that just by truncating the writes and accepting the corruption), but we'll compact them when we write to disk. Of course, the Achilles heel here is performance. Where decoding 720p (luma only) on my GTX 950 took 0.4 ms or so, encoding is 1.2 ms or so, which is just too slow. Remember that 4:2:2 is twice that, and we want multiple streams, so 2.4 ms per frame is eating too much. I don't really know why it's so slow; the DCT isn't bad, the histogram counting is fast, it's just the rANS shader that's slow for some reason I don't understand, and also haven't had the time to really dive deeply into. Of course, a faster GPU would be faster, but I don't think I can reasonably demand that people get a 1080 just to encode a few video streams. Due to this, I haven't really worked out the last few kinks. In particular, I haven't implemented DC coefficient prediction (it needs to be done before tallying up the histograms, so it can be a bit tricky to do efficiently, although perhaps local memory will help again to send data between neighboring DCT blocks).
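The normalize-then-adjust step can be sketched in a few lines of CPU-side Python. This is not Bloom's exact method (he uses a heap; here every bucket is tried each round, mirroring the try-all-256-in-parallel approach described above), and the names and test histogram are mine:

```python
import math

PROB_SCALE = 4096  # the histogram must sum to exactly 2^12

def normalize(counts):
    """Scale a raw histogram to sum to PROB_SCALE, then build the CDF."""
    total = sum(counts)
    # Initial rounding; every used symbol keeps at least one slot.
    freqs = [max(1, round(c * PROB_SCALE / total)) if c else 0 for c in counts]
    # Greedy correction: bump the frequency whose change costs the least
    # (or gains the most) total code length, until the sum is exact.
    while sum(freqs) != PROB_SCALE:
        delta = 1 if sum(freqs) < PROB_SCALE else -1
        best, best_cost = None, None
        for i, f in enumerate(freqs):
            if f == 0 or (delta == -1 and f == 1):
                continue  # unused symbols stay at 0; used ones stay >= 1
            # Change in total code length: count * -log2(freq) per symbol.
            cost = -counts[i] * (math.log2(f + delta) - math.log2(f))
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        freqs[best] += delta
    cdf = [0]
    for f in freqs:
        cdf.append(cdf[-1] + f)
    return freqs, cdf
```

On the GPU, the inner "find the cheapest adjustment" loop is the part that becomes a parallel vote through local memory.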
And I also haven't properly done bounds checking in the encoder or decoder, but it should hopefully be simple as long as we're willing to accept that evil input decodes into garbage instead of flagging errors explicitly. It also depends on a GLSL extension for 64-bit divides (used when precalculating the rANS tables) that my Haswell laptop doesn't have; I've got some code to simulate 64-bit divides using 32-bit, but it doesn't work yet. The code as it currently stands is in this Git repository; you can consider it licensed under GPLv3. It's really not very user-friendly at this point, though, and rather rough around the edges. Next time, we'll wrap up with some performance numbers. Unless I magically get more spare time in the meantime and/or some epiphany about how to make the encoder faster. :-)

28 October 2017

Steinar H. Gunderson: Introducing Narabu, part 4: Decoding

Narabu is a new intraframe video codec. You probably want to read part 1, part 2 and part 3 first. So we're at the stage where the structure is in place. How do we decode? Once we have the structure, it's actually fairly straightforward: First of all, we need to figure out where each slice starts and ends. This is done on the CPU, but it's mostly just setting up pointers, so it's super-cheap. It doesn't see any pixels at all, just lengths and some probability distributions (those are decoded on the CPU, but they're only a few hundred values and no FP math is involved). Then, we set up local groups of size 64 (8x8) with 512 floats of shared memory. Each group will be used for decoding one slice (320 blocks), all coefficients. Each thread starts off doing rANS decoding (this involves a lookup into a small LUT, and of course some arithmetic) and dequantization of 8 blocks (for its coefficient); this means we now have 512 coefficients, so we add a barrier, do 64 horizontal IDCTs (one 8-point IDCT per thread), add a new barrier, do 64 vertical IDCTs, and then finally write out the data. We can then do the next 8 coefficients the same way (we kept the rANS decoding state), and so on. Note how the parallelism changes during the decoding; it's a bit counterintuitive at first, and barriers are not for free, but it avoids saving the coefficients to global memory (which is costly). First, the parallelism is over coefficients, then over horizontal DCT blocks, then over vertical DCT blocks, and then we start all over again. In CPU multithreading, this would be very tricky, and probably not worth it at all, but on the GPU, it gives us tons of parallelism. One problem is that the rANS work is going to be unbalanced within each warp. There's a lot more data (and thus more calculation and loading of compressed data from memory) in the lower-frequency coefficients, which means the threads handling higher-frequency coefficients end up idle while the low-frequency ones are still working.
I tried various schemes to balance it better (like making even larger thread groups to get e.g. more coefficient 0s that could work together, or reordering the threads' coefficients in a zigzag), but seemingly it didn't help any. Again, the lack of a profiler here is hampering effective performance investigations. In any case, the performance is reasonably good (my GTX 950 does 1280x720 luma-only in a little under 0.4 ms, which equates to ~1400 fps for full 4:2:2). As a side note, GoPro open-sourced their CineForm HD implementation the other day, and I guess it only goes to show that these kinds of things really belong on the GPU; they claim similar performance numbers to what I get on my low-end NVIDIA GPU (923.6 fps for 1080p 4:2:2, which would be roughly 1800 fps at 720p60), but that's on a 4GHz 8-core Broadwell-E, which is basically taking the most expensive enthusiast desktop CPU you can get your hands on and then overclocking it. Kudos to GoPro for freeing it, though (even under a very useful license), even though FFmpeg already had a working reverse-engineered implementation. :-) Next time, we'll look at the encoder, which is a bit more complicated. After that, we'll do some performance tests and probably wrap up the series. Stay tuned.
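The horizontal-then-vertical IDCT passes above work because the 2D DCT is separable. A pure-Python sanity check (matrix form rather than a fast butterfly, and entirely CPU-side; nothing here is the actual shader code) shows row and column passes of an orthonormal 8-point DCT-II round-tripping a block:

```python
import math
import random

N = 8
# Orthonormal 8-point DCT-II basis matrix: row u, sample x.
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(A):
    return [list(row) for row in zip(*A)]

random.seed(1)
block = [[random.uniform(-128, 128) for _ in range(N)] for _ in range(N)]

# Forward: 1D DCT down each column, then along each row (8 + 8 passes).
coeffs = matmul(matmul(C, block), transpose(C))
# Inverse: the transposed passes recover the block exactly (up to FP error).
back = matmul(matmul(transpose(C), coeffs), C)
err = max(abs(back[i][j] - block[i][j]) for i in range(N) for j in range(N))
```

Because the basis is orthonormal, the inverse passes are just the transposed matrix, which is why each GPU thread only ever needs an 8-point 1D transform.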

22 October 2017

Steinar H. Gunderson: Introducing Narabu, part 3: Parallel structure

Narabu is a new intraframe video codec. You probably want to read part 1 and part 2 first. Now having a rough idea of how the GPU works, it's obvious that we need our codec to split the image into a lot of independent chunks to be parallel; we're aiming for about a thousand chunks. We'll also aim for a format that's as simple as possible; other people can try to push the frontiers on research, I just want something that's fast, reasonably good, and not too hard to implement. First of all, let's look at JPEG. It works by converting the image to Y'CbCr (usually with subsampled chroma), splitting each channel into 8x8 blocks, DCT-ing them, quantizing, and then using Huffman coding to encode those coefficients. This is a very standard structure, and JPEG, despite being really old by now, is actually hard to beat, so let's use that as a template. But we probably can't use JPEG wholesale. Why? The answer is that the Huffman coding is just too serial. It's all just one stream, and without knowing where the previous symbol ends, we don't know where to start decoding the next. You can partially solve this by having frequent restart markers, but these reduce coding efficiency, and there's no index of them; you'd have to make a parallel scan for them, which is annoying. So we'll need to at least do something about the entropy coding. Do we need to change anything else to reach our ~1000 parallel target? 720p is the standard video target these days; the ~1M raw pixels would be great if we could do all of them independently, but we obviously can't, since DCT is not independent per pixel (that would sort of defeat the purpose). However, there are 14,400 DCT blocks (8x8), so we can certainly do all the DCTs in parallel. Quantization after that is trivially parallelizable, at least as long as we don't aim for trellis or the likes. So indeed, it's only entropy coding that we need to worry about.
Since we're changing entropy coding anyway, I picked rANS, which is confusing but has a number of neat properties; it's roughly as expensive as Huffman coding, but has an efficiency closer to that of arithmetic encoding, and you can switch distributions for each symbol. It requires a bit more calculation, but GPUs have plenty of ALUs (a typical recommendation is 10:1 calculation to memory access), so that should be fine. I picked a pretty common variation with 32-bit state and 8-bit I/O, since 64-bit arithmetic is not universally available on GPUs (I checked a variation with 64-bit state and 32-bit I/O, but it didn't really matter much for speed). Fabian Giesen has described how you can actually do rANS encoding and decoding in parallel over a warp, but I've treated it as a purely serial operation. I don't do anything like adaptation, though; each coefficient is assigned to one out of four rANS static distributions that are signaled at the start of each stream. (Four is a nice tradeoff between coding efficiency, L1 cache use and the cost of transmitting the distributions themselves. I originally used eight, but a bit of clustering showed it was possible to get it down to four at basically no extra cost. JPEG does a similar thing, with separate Huffman codes for AC and DC coefficients. And I've got separate distributions for luma and chroma, which also makes a lot of sense.) The restart cost of rANS is basically writing the state to disk; some of that we would need to write even without a restart, and there are ways to combine many such end states for less waste (I'm not using them), but let's be conservative and assume we waste all of it for each restart. At 150 Mbit/second for 720p60, the luma plane of one frame is about 150 kB. 10% (15 kB) sounds reasonable to sacrifice to restarts, which means we can have about 3750 of them, or one for each 250th pixel or so. 
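The restart accounting above is simple arithmetic; a few lines make the numbers explicit (assuming, conservatively as in the text, that each restart costs a full flushed 32-bit state, i.e. 4 bytes):

```python
# Conservative restart budget for the 720p60 luma plane.
luma_bytes_per_frame = 150 * 1000            # ~150 kB/frame at 150 Mbit/s
restart_budget = luma_bytes_per_frame // 10  # sacrifice ~10% to restarts
streams = restart_budget // 4                # 4 bytes of flushed state each
pixels_per_stream = (1280 * 720) / streams   # how thinly 720p gets sliced
```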
(3750 restarts also means 3750 independent entropy coding streams, so we're still well above our 1000 target.) So the structure is more or less clear; we'll DCT the blocks in parallel (since an 8x8 block can be expressed as 8 vertical DCTs and then 8 horizontal DCTs, we can even get 8-way parallelism there), and then encode at least 250 coefficients into an rANS stream. I've chosen to let a stream encompass a single coefficient from each of 320 DCT blocks instead of all coefficients from 5 DCT blocks; it's sort of arbitrary and might be a bit unintuitive, but it feels more natural, especially since you don't need to switch rANS distributions for each coefficient. So that's a lot of rambling for a sort-of abstract format. Next time we'll go into how decoding works. Or encoding. I haven't really made up my mind yet.
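For concreteness, here is a toy Python model of the rANS variant chosen above: 32-bit state, byte-wise renormalization, and a single static distribution over 12-bit slots (the real codec signals four distributions per stream; the constants and names are mine, following Fabian Giesen's formulation, so treat this as a sketch rather than the actual shader code):

```python
PROB_BITS = 12                 # 12-bit slots, as in the text
PROB_SCALE = 1 << PROB_BITS
RANS_L = 1 << 23               # lower bound of the normalized state interval

def cum_freqs(freqs):
    """Cumulative frequency table; freqs must sum to PROB_SCALE."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    assert cum[-1] == PROB_SCALE
    return cum

def rans_encode(symbols, freqs):
    cum = cum_freqs(freqs)
    x, out = RANS_L, []
    for s in reversed(symbols):            # the encoder runs backwards
        f = freqs[s]
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:
            out.append(x & 0xFF)           # renormalize: emit a byte
            x >>= 8
        x = ((x // f) << PROB_BITS) + (x % f) + cum[s]
    return x, bytes(reversed(out))         # decoder reads forwards

def rans_decode(x, data, freqs, n):
    cum = cum_freqs(freqs)
    # Slot -> symbol lookup table, one entry per slot (the "small LUT").
    slot2sym = [s for s, f in enumerate(freqs) for _ in range(f)]
    pos, out = 0, []
    for _ in range(n):
        slot = x & (PROB_SCALE - 1)        # look at the lowest 12 bits
        s = slot2sym[slot]
        out.append(s)
        x = freqs[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L and pos < len(data):
            x = (x << 8) | data[pos]       # refill from the stream
            pos += 1
    return out
```

Note that, per stream, this is purely serial: each decode step needs the state left behind by the previous one, which is exactly why the format needs thousands of independent streams.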

19 October 2017

Steinar H. Gunderson: Introducing Narabu, part 2: Meet the GPU

Narabu is a new intraframe video codec. You may or may not want to read part 1 first. The GPU, despite being far more flexible than it was fifteen years ago, is still a very different beast from your CPU, and not all problems map well to it performance-wise. Thus, before designing a codec, it's useful to know what our platform looks like. A GPU has lots of special functionality for graphics (well, duh), but we'll be concentrating on the compute shader subset in this context, i.e., we won't be drawing any polygons. Roughly, a GPU (as I understand it!) is built up about as follows: A GPU contains 1–20 cores; NVIDIA calls them SMs (shader multiprocessors), Intel calls them subslices. (Trivia: A typical mid-range Intel GPU contains two cores, and thus is designated GT2.) One such core usually runs the same program, although on different data; there are exceptions, but typically, if your program can't fill an entire core with parallelism, you're wasting energy. Each core, in addition to tons (thousands!) of registers, also has some shared memory (also called local memory sometimes, although that term is overloaded), typically 32–64 kB, which you can think of in two ways: Either as a sort-of explicit L1 cache, or as a way to communicate internally on a core. Shared memory is a limited, precious resource in many algorithms. Each core/SM/subslice contains about 8 execution units (Intel calls them EUs, NVIDIA/AMD call them something else) and some memory access logic. These multiplex a bunch of threads (say, 32) and run in a round-robin-ish fashion. This means that a GPU can handle memory stalls much better than a typical CPU, since it has so many streams to pick from; even though each thread runs in-order, it can just kick off an operation and then go to the next thread while the previous one is working. Each execution unit has a bunch of ALUs (typically 16) and executes code in a SIMD fashion. NVIDIA calls these ALUs "CUDA cores", AMD calls them "stream processors".
Unlike on CPU, this SIMD has full scatter/gather support (although sequential access, especially in certain patterns, is much more efficient than random access), lane enable/disable so it can work with conditional code, etc. The typically fastest operation is a 32-bit float muladd; usually that's single-cycle. GPUs love 32-bit FP code. (In fact, in some GPU languages, you won't even have 8-, 16-bit or 64-bit types. This is annoying, but not the end of the world.) The vectorization is not exposed to the user in typical code (GLSL has some vector types, but they're usually just broken up into scalars, so that's a red herring), although in some programming languages you can get to swizzle the SIMD stuff internally to gain advantage of that (there are also schemes for broadcasting bits by voting etc.). However, it is crucially important to performance; if you have divergence within a warp, this means the GPU needs to execute both sides of the if. So less divergent code is good. Such a SIMD group is called a warp by NVIDIA (I don't know if the others have names for it). NVIDIA has SIMD/warp width always 32; AMD used to be 64 but is now 16. Intel supports 4–32 (the compiler will autoselect based on a bunch of factors), although 16 is the most common. The upshot of all of this is that you need massive amounts of parallelism to be able to get useful performance out of a GPU. A rule of thumb is that if you could have launched about a thousand threads for your problem on CPU, it's a good fit for a GPU, although this is of course just a guideline. There's a ton of APIs available to write compute shaders. There's CUDA (NVIDIA-only, but the dominant player), D3D compute (Windows-only, but multi-vendor), OpenCL (multi-vendor, but highly variable implementation quality), OpenGL compute shaders (all platforms except macOS, which has too old drivers), Metal (Apple-only) and probably some that I forgot.
I've chosen to go for OpenGL compute shaders since I already use OpenGL shaders a lot, and this saves on interop issues. CUDA probably is more mature, but my laptop is Intel. :-) No matter which one you choose, the programming model looks very roughly like this pseudocode:
for (size_t workgroup_idx = 0; workgroup_idx < NUM_WORKGROUPS; ++workgroup_idx) {  // in parallel over cores
        char shared_mem[REQUESTED_SHARED_MEM];  // private for each workgroup
        for (size_t local_idx = 0; local_idx < WORKGROUP_SIZE; ++local_idx) {  // in parallel on each core
                main(workgroup_idx, local_idx, shared_mem);
        }
}
except in reality, the indices will be split in x/y/z for your convenience (you control all six dimensions, of course), and if you haven't asked for too much shared memory, the driver can silently make larger workgroups if it helps increase parallelism (this is totally transparent to you). main() doesn't return anything, but you can do reads and writes as you wish; GPUs have large amounts of memory these days, and staggering amounts of memory bandwidth. Now for the bad part: Generally, you will have no debuggers, no way of logging and no real profilers (if you're lucky, you can get to know how long each compute shader invocation takes, but not what takes time within the shader itself). Especially the latter is maddening; the only real recourse you have is some timers, and then placing timer probes or trying to comment out sections of your code to see if something goes faster. If you don't get the answers you're looking for, forget printf; you need to set up a separate buffer, write some numbers into it and pull that buffer down from the GPU. Profilers are an essential part of optimization, and I had really hoped the world would be more mature here by now. Even CUDA doesn't give you all that much insight; sometimes I wonder if all of this is because GPU drivers and architectures are meant to be shrouded in mystery for competitiveness reasons, but I'm honestly not sure. So that's it for a crash course in GPU architecture. Next time, we'll start looking at the Narabu codec itself.
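The dispatch model in the pseudocode above can be mimicked on the CPU in a few lines. This Python sketch (names illustrative, not any real API) gives each "workgroup" its own private shared memory, has each "local thread" fill one slot, and writes one reduced result per workgroup to a global buffer:

```python
NUM_WORKGROUPS = 4
WORKGROUP_SIZE = 8

global_in = list(range(NUM_WORKGROUPS * WORKGROUP_SIZE))
global_out = [0] * NUM_WORKGROUPS

for workgroup_idx in range(NUM_WORKGROUPS):       # in parallel over cores
    shared_mem = [0] * WORKGROUP_SIZE             # private for each workgroup
    for local_idx in range(WORKGROUP_SIZE):       # in parallel on each core
        shared_mem[local_idx] = global_in[workgroup_idx * WORKGROUP_SIZE
                                          + local_idx]
    # ...a barrier would sit here on a real GPU, before threads
    # read each other's shared-memory writes...
    global_out[workgroup_idx] = sum(shared_mem)   # per-workgroup reduction
```

The key property the sketch preserves: shared memory is visible only within a workgroup, while the global buffers are visible to everyone.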

18 October 2017

Steinar H. Gunderson: Introducing Narabu, part 1: Introduction

Narabu is a new intraframe video codec, from the Japanese verb narabu (並ぶ), which means to line up or be parallel. Let me first state straight up that Narabu isn't where I hoped it would be at this stage; the encoder isn't fast enough, and I have to turn my attention to other projects for a while. Nevertheless, I think it is interesting as a research project in its own right, and I don't think it should stop me from trying to write up a small series. :-) In the spirit of Leslie Lamport, I'll be starting off with describing what problem I was trying to solve, which will hopefully make the design decisions a lot clearer. Subsequent posts will dive into background information and then finally Narabu itself. I want a codec to send signals between different instances of Nageru, my free software video mixer, and also longer-term between other software, such as recording or playout. The reason is pretty obvious for any sort of complex configuration; if you are doing e.g. both a stream mix and a bigscreen mix, they will naturally want to use many of the same sources, and sharing them over a single GigE connection might be easier than getting SDI repeaters/splitters, especially when you have a lot of them. (Also, in some cases, you might want to share synthetic signals, such as graphics, that never existed on SDI in the first place.) This naturally leads to the following demands: There's a bunch of intraframe formats around. The most obvious thing to do would be to use Intel Quick Sync to produce H.264 (intraframe H.264 blows basically everything else out of the sky in terms of PSNR, and QSV hardly uses any power at all), but sadly, that's limited to 4:2:0. I thought about encoding the three color planes as three different monochrome streams, but monochrome is not supported either. Then there's a host of software solutions. x264 can do 4:2:2, but even on ultrafast, it gobbles up an entire core or more at 720p60 at the target bitrates (mostly in entropy coding).
FFmpeg has implementations of all kinds of other codecs, like DNxHD, CineForm, MJPEG and so on, but they all use much more CPU for encoding than the target. NDI would seem to fit the bill exactly, but fails the licensing check, and also isn't robust to corrupted or malicious data. (That, and their claims about video quality are dramatically overblown for any kinds of real video data I've tried.) So, sadly, this leaves only really one choice, namely rolling my own. I quickly figured I couldn't beat the world on CPU video codec speed, and didn't really want to spend my life optimizing AVX2 DCTs anyway, so again, the GPU will come to our rescue in the form of compute shaders. (There are some other GPU codecs out there, but all that I've found depend on CUDA, so they are NVIDIA-only, which I'm not prepared to commit to.) Of course, the GPU is quite busy in Nageru, but if one can make an efficient enough codec that one stream can work at only 5% or so of the GPU (meaning 1200 fps or so), it wouldn't really make a dent. (As a spoiler, the current Narabu encoder isn't there for 720p60 on my GTX 950, but the decoder is.) In the next post, we'll look a bit at the GPU programming model, and what it means for what our codec needs to look like at the design level.
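The GigE constraint above is easy to quantify. Assuming 8 bits per sample, raw 720p60 4:2:2 video nearly saturates a gigabit link by itself, which is why roughly 6:1 compression is needed to hit the 150 Mbit/s figure used in part 3:

```python
width, height, fps = 1280, 720, 60
luma = width * height                # one Y sample per pixel
chroma = 2 * (width // 2) * height   # 4:2:2: two half-width chroma planes
raw_mbit_per_s = (luma + chroma) * 8 * fps / 1e6   # assuming 8-bit samples
compression_needed = raw_mbit_per_s / 150          # vs. a 150 Mbit/s target
```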

11 September 2017

Steinar H. Gunderson: rANS encoding of signed coefficients

I'm currently trying to make sense of some still image coding (more details to come at a much later stage!), and for a variety of reasons, I've chosen to use rANS as the entropy coder. However, there's an interesting little detail that I haven't actually seen covered anywhere; maybe it's just because I've missed something, or maybe because it's too blindingly obvious, but I thought I would document what I ended up with anyway. (I had hoped for something even more elegant, but I guess the obvious would have to do.) For those that don't know rANS coding, let me try to handwave it as much as possible. Your state is typically a single word (in my case, a 32-bit word), which is refilled from the input stream as needed. The encoder and decoder work in reverse order relative to each other; let's just talk about the decoder. Basically it works by looking at the lowest 12 (or whatever) bits of the decoder state, mapping each of those 2^12 slots to a decoded symbol. More common symbols are given more slots, proportionally to the frequency. Let me just write a tiny, tiny example with 2 bits and three symbols instead, giving four slots:
Lowest bits  Symbol
00           0
01           0
10           1
11           2
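The slot lookup the table describes can be sketched in a couple of lines (illustrative only; a real rANS decoder would also advance the state using the symbol's frequency and cumulative frequency, which is omitted here):

```python
# Illustrative slot->symbol lookup for the 2-bit example table above.
# Only the table lookup is shown; state renormalization is left out.
TABLE = [0, 0, 1, 2]          # slot index (lowest 2 bits of state) -> symbol

def peek_symbol(state):
    return TABLE[state & 0b11]

assert [peek_symbol(s) for s in range(4)] == [0, 0, 1, 2]
```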
Note that the zero coefficient here maps to one out of two slots (ie., a range); you don't choose which one yourself, the encoder stashes some information in there (which is used to recover the next control word once you know which symbol there is).

Now for the actual problem: When storing DCT coefficients, we typically want to also store a sign (ie., not just 1 or 2, but also -1/+1 and -2/+2). The statistical distribution is symmetrical, so the sign bit is incompressible (except that of course there's no sign bit needed for 0). We could have done this by introducing new symbols -1 and -2 in addition to our three other ones, but this means we'll need more bits of precision, and accordingly larger look-up tables (which is negative for performance). So let's find something better. We could also simply store it separately somehow; if the coefficient is non-zero, store the bits in some separate repository. Perhaps more elegantly, you can encode a second symbol in the rANS stream with probability 1/2, but this is more expensive computationally. But both of these have the problem that they're divergent in terms of control flow; nonzero coefficients potentially need to do a lot of extra computation and even loads. This isn't nice for SIMD, and it's not nice for GPU. It's generally not really nice.

The solution I ended up with was simulating a larger table with a smaller one. Simply rotate the table so that the zero symbol has the top slots instead of the bottom slots, and then replicate the rest of the table. For instance, take this new table:
Lowest bits  Symbol
000          1
001          2
010          0
011          0
100          0
101          0
110          -1
111          -2
(The observant reader will note that this doesn't describe the exact same distribution as last time; zero has twice the relative frequency as in the other table. But ignore that for the time being.) In this case, the bottom half of the table doesn't actually need to be stored! We know that if the three bottom bits are >= 110 (6 in decimal), we have a negative value, can subtract 6, and then continue decoding. If we go past the end of our 2-bit table despite that, we know we are decoding a zero coefficient (which doesn't have a sign), so we can just clamp the read; or for a GPU, reads out-of-bounds on a texture will typically return 0 anyway. So it all works nicely, and the divergent I/O is gone. If this piqued your interest, you probably want to read up on rANS in general; Fabian Giesen (aka ryg) has some notes that work as a good starting point, but beware; some of this is pretty confusing. :-)
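The folding trick can be sketched in a few lines (illustrative Python, not the post's actual implementation; the values come from the 3-bit example table above, with only the four-entry positive half stored):

```python
# Folded signed-symbol lookup: only the replicated positive half of the
# 3-bit table is stored; negatives and the clamped zero run are derived.
STORED = [1, 2, 0, 0]   # 2-bit table; zero occupies the top (replicated) slots

def decode_symbol(bits3):
    """Map 3 low bits of the decoder state to a signed symbol."""
    if bits3 >= 6:                 # bottom (negative) half: subtract 6, negate
        return -STORED[bits3 - 6]  # yields -1 or -2
    return STORED[min(bits3, 3)]   # clamp, like an out-of-bounds texture read

assert [decode_symbol(b) for b in range(8)] == [1, 2, 0, 0, 0, 0, -1, -2]
```

Note how the clamp on the read reproduces exactly what an out-of-bounds texture fetch returning 0 would give you on a GPU.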

7 September 2017

Steinar H. Gunderson: Licensing woes

On releasing modified versions of GPLv3 software in binary form only (quote anonymized):
And in my opinion it's perfectly ok to give out a binary release of a project, that is a work in progress, so that people can try it out and coment on it. It's easier for them to have it as binary and not need to compile it themselfs. If then after a (long) while the code is still only released in binary form, then it's ok to start a discussion. But only for a quick test, that is unneccessary. So people, calm down and enjoy life!
I wonder at what point we got here.

20 August 2017

Steinar H. Gunderson: Random codec notes

Post-Solskogen, there haven't been all that many commits in the main Nageru repository, but that doesn't mean the project is standing still. In particular, I've been working with NVIDIA to shake out a crash bug in their drivers (which in itself uncovered some interesting debugging techniques, although in the end, the bug turned out to be uncovered by the boring standard technique of analyzing crash dumps and writing a minimal program to reproduce). But I've also been looking at intraframe codecs; my sort-of plan was to go to VideoLAN Dev Days to present my findings, but unfortunately, there seems to be a schedule conflict, so instead, you can have some scattered random notes: Work in progress :-) Maybe something more coherent will come out eventually. Edit: Forgot about TF switching!

5 August 2017

Steinar H. Gunderson: Dear conference organizers

Dear conference organizers, In this day and age, people stream conferences and other events over the Internet. Most of the Internet happens to be in a different timezone from yours (it's crazy, I know!). This means that if you publish a schedule, please say which timezone it's in. We've even got this thing called JavaScript now, which allows you to also convert times to the user's local timezone (the future is now!), so you might want to consider using it. :-) (Yes, this goes for you, DebConf, and also for you, Assembly.)
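The conversion the post asks for is genuinely trivial once the schedule carries a timezone; here is a sketch in Python (the post suggests doing the same client-side in JavaScript; the schedule entry below is made up for illustration):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical schedule entry: publish the timezone along with the time,
# and converting to any viewer's local zone is a one-liner.
talk_start = datetime(2017, 8, 5, 18, 0, tzinfo=ZoneInfo("Europe/Oslo"))
print(talk_start.astimezone(ZoneInfo("America/New_York")))
# 2017-08-05 12:00:00-04:00 (Oslo is UTC+2 in August, New York UTC-4)
```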

17 July 2017

Steinar H. Gunderson: Solskogen 2017: Nageru all the things

Solskogen 2017 is over! What a blast that was; I especially enjoyed that so many old-timers came back to visit, it really made the party for me. This was the first year we were using Nageru for not only the stream but also for the bigscreen mix, and I was very relieved to see the lack of problems; I've had nightmares about crashes with 150+ people watching (plus 200-ish more on stream), but there were no crashes and hardly a dropped frame. The transition to a real mixing solution as well as from HDMI to SDI everywhere gave us a lot of new opportunities, which allowed a number of creative setups, some of them cobbled together on the spot. It's been a lot of fun, but also a lot of work. And work will continue for an even better show next year after some sleep. :-)

9 July 2017

Steinar H. Gunderson: Nageru 1.6.1 released

I've released version 1.6.1 of Nageru, my live video mixer. Now that Solskogen is coming up, there's been a lot of activity on the Nageru front, but hopefully everything is actually coming together now. Testing has been good, but we'll see whether it stands up to the battle-hardening of the real world or not. Hopefully I won't be needing any last-minute patches. :-) Besides the previously promised Prometheus metrics (1.6.1 ships with a rather extensive set, as well as an example Grafana dashboard) and frame queue management improvements, a surprising late addition was that of a new transcoder called Kaeru (following the naming style of Nageru itself, from the Japanese verb kaeru, which means roughly to replace or exchange; iKnow! claims it can also mean convert, but I haven't seen support for this anywhere else). Normally, when I do streams, I just let Nageru do its thing and send out a single 720p60 stream (occasionally 1080p), usually around 5 Mbit/sec; less than that doesn't really give good enough quality for the high-movement scenarios I'm after. But Solskogen is different in that there's a pretty diverse audience when it comes to networking conditions; even though I have a few mirrors spread around the world (and some JavaScript to automatically pick the fastest one; DNS round-robin is really quite useless here!), not all viewers can sustain such a bitrate. Thus, there's also a 480p variant with around 1 Mbit/sec or so, and it needs to come from somewhere. Traditionally, I've been using VLC for this, but streaming is really a niche thing for VLC. I've been told it will be an increased focus for 4.0 now that 3.0 is getting out the door, but over the last few years, there's been a constant trickle of little issues that have been breaking my transcoding pipeline.
My solution for this was to simply never update VLC, but now that I'm up to stretch, this didn't really work anymore, and I'd been toying around with the idea of making a standalone transcoder for a while. (You'd ask: why not the ffmpeg(1) command-line client? But it's a bit too centered around files and not streams; I use it for converting to HLS for iOS devices, but it has a nasty habit of I/O blocking real work, and its HTTP server really isn't meant for production work. I could survive the latter if it supported Metacube and I could feed it into Cubemap, but it doesn't.) It turned out Nageru had already grown most of the pieces I needed; it had video decoding through FFmpeg, x264 encoding with speed control (so that it automatically picks the best preset the machine can sustain at any given time) and muxing, audio encoding, proper threading everywhere, and a usable HTTP server that could output Metacube. All that was required was to add audio decoding to the FFmpeg input, and then replace the GPU-based mixer and GUI with a very simple driver that just connects the decoders to the encoders. (This means it runs fine on a headless server with no GPU, but it also means you'll get FFmpeg's scaling, which isn't as pretty or fast as Nageru's. I think it's an okay tradeoff.) All in all, this was only about 250 lines of delta, which pales compared to the ~28000 lines of delta between 1.3.1 (used for last Solskogen) and 1.6.1. It only supports a rather limited set of Prometheus metrics, and it has some limitations, but it seems to be stable and deliver pretty good quality. I've denoted it experimental for now, but overall, I'm quite happy with how it turned out, and I'll be using it for Solskogen. Nageru 1.6.1 is on its way into Debian, but it depends on a new version of Movit which needs to go through the NEW queue (a soname bump), so it might be a few days. In the meantime, I'll be busy preparing for Solskogen. :-)
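The speed control mentioned above can be caricatured in a few lines (a simplified sketch, not Nageru's actual controller; the preset list and thresholds are illustrative assumptions):

```python
# Rough sketch of x264 speed control: measure how long each frame took to
# encode, and step the preset up or down so the machine keeps up in real time.
PRESETS = ["ultrafast", "superfast", "veryfast", "faster", "fast", "medium"]

class SpeedControl:
    def __init__(self, frame_budget_sec):
        self.budget = frame_budget_sec     # e.g. 1/60 s for a 60 fps stream
        self.level = len(PRESETS) - 1      # start at the slowest/best preset

    def report_encode_time(self, seconds):
        if seconds > self.budget and self.level > 0:
            self.level -= 1                # falling behind: pick a faster preset
        elif seconds < 0.7 * self.budget and self.level < len(PRESETS) - 1:
            self.level += 1                # plenty of headroom: improve quality

    @property
    def preset(self):
        return PRESETS[self.level]

sc = SpeedControl(1 / 60)
sc.report_encode_time(0.030)               # way over budget at 60 fps
print(sc.preset)                           # fast
```

The real controller naturally has to smooth over per-frame noise rather than react to single measurements, but the feedback idea is the same.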

25 June 2017

Steinar H. Gunderson: Frame queue management in Nageru 1.6.1

Nageru 1.6.1 is on its way, and what was intended to only be a release centered around monitoring improvements (more specifically a full set of native Prometheus metrics) actually ended up getting a fairly substantial change to how Nageru manages its frame queues. To understand what's changing and why, it's useful to first understand the history of Nageru's queue management.

Nageru 1.0.0 started out with a fairly simple scheme, but with some basics that are still relevant today: One of the input cards was deemed the master card, and whenever it delivers a frame, the master clock ticks and an output frame is produced. (There are some subtleties about dropped frames and/or the master card changing frame rates, but I'm going to ignore them, since they're not important to the discussion.) To this end, every card keeps a preallocated frame queue; when a card delivers a frame, it's put into the queue, and when the master clock ticks, it tries picking out one frame from each of the other cards' queues to mix together. Note that mix here could be as simple as picking one input and throwing all the other ones away; the queueing algorithm doesn't care, it just feeds all of them to the theme and lets that run whatever GPU code it needs to match the user's preferences. The only thing that really keeps the queues bounded is that the frames in them are preallocated (in GPU memory), so if one queue gets longer than 16 frames, Nageru starts dropping frames from it. But is 16 the right number? There are two conflicting demands here, ignoring memory usage: you want latency as low as possible, but you also never want to drop frames if it can be avoided. The 1.0.0 scheme does about as well as one could possibly hope in never dropping frames, but unfortunately, it can be pretty poor at latency. For instance, if your master card runs at 50 Hz and you have a 60 Hz card, the latter will eventually build up a delay of 16 * 16.7 ms = 266.7 ms; clearly unacceptable, and rather unneeded.
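The latency figure quoted above checks out (a back-of-the-envelope sketch, not Nageru code):

```python
# Why a fixed 16-frame queue is bad for latency: a 60 Hz input feeding a
# 50 Hz master clock fills its queue, and 16 frames at 60 Hz is a lot.
frames = 16
frame_time_ms = 1000 / 60              # ~16.7 ms per frame at 60 Hz
print(round(frames * frame_time_ms, 1))  # 266.7 (ms of added latency)
```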
You could ask the user to specify a queue length, but the user probably doesn't know, and also shouldn't really have to care; more knobs to twiddle are a bad thing, and even more so knobs the user is expected to twiddle. Thus, Nageru 1.2.0 introduced queue autotuning; it keeps a running estimate of how big the queue needs to be to avoid underruns, simply based on experience. If we've been dropping frames on a queue and then there's an underrun, the safe queue length is increased by one, and if the queue has been having excess frames for more than a thousand successive master clock ticks, we reduce it by one again. Whenever the queue has more than this safe number, we drop frames. This was simple, effective and largely fixed the problem. However, when adding metrics, I noticed a peculiar effect: Not all of my devices have equally good clocks. In particular, when setting up for 1080p50, my output card's internal clock (which assumes the role of the master clock when using HDMI/SDI output) seems to tick at about 49.9998 Hz, and my simple home camcorder delivers frames at about 49.9995 Hz. Over the course of an hour, this means it produces one more frame than it should have, which should of course be dropped. Having an SDI setup with synchronized clocks (blackburst/tri-level) would of course fix this problem, but most people are not so lucky with their cameras, not to mention the price of PC graphics cards with SDI outputs! However, this happens very slowly, which means that for a significant amount of time, the two clocks will very nearly be in sync, and thus racing. Who ticks first is determined largely by luck in the jitter (normal is maybe 1 ms, but occasionally, you'll see delayed delivery of as much as 10 ms), and this means that the 1000-tick estimate is likely to be thrown off, and the result is hundreds of dropped frames and underruns in that period. Once the clocks have diverged enough again, you're off the hook, but again, this isn't a good place to be.
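The 1.2.0-style autotuning rule described above can be sketched roughly like this (a simplification for illustration, not Nageru's actual implementation):

```python
# Simplified queue autotuning: grow the safe length on an underrun,
# shrink it again after a long run of excess frames.
class QueueAutotuner:
    EXCESS_TICKS_BEFORE_SHRINK = 1000

    def __init__(self):
        self.safe_len = 1
        self.excess_ticks = 0

    def on_master_tick(self, queue_len):
        if queue_len == 0:                  # underrun: we trimmed too hard
            self.safe_len += 1
            self.excess_ticks = 0
        elif queue_len > self.safe_len:     # excess frames queued up
            self.excess_ticks += 1
            if self.excess_ticks >= self.EXCESS_TICKS_BEFORE_SHRINK:
                self.safe_len = max(1, self.safe_len - 1)
                self.excess_ticks = 0
        else:
            self.excess_ticks = 0
        return min(queue_len, self.safe_len)  # frames kept; the rest dropped
```

The racing-clocks problem follows directly from this shape: one unlucky underrun bumps the estimate, and it takes a thousand quiet ticks to come back down.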
Thus, Nageru 1.6.1 changes the algorithm around yet again, by incorporating more data to build an explicit jitter model. 1.5.0 was already timestamping each frame to be able to measure end-to-end latency precisely (now also exposed in Prometheus metrics), but from 1.6.1, these timestamps are actually used in the queueing algorithm. I ran several eight- to twelve-hour tests and simply stored all the event arrivals to a file, and then simulated a few different algorithms (including the old algorithm) to see how they fared in measures such as latency and number of drops/underruns. I won't go into the full details of the new queueing algorithm (see the commit if you're interested), but the gist is: Based on the last 5000 frames, it tries to estimate the maximum possible jitter for each input (ie., how late the frame could possibly be). Based on this as well as clock offsets, it determines whether it's really sure that there will be an input frame available on the next master tick even if it drops the queue, and then trims the queue to fit. The result is pretty satisfying; here's the end-to-end latency of my camera being sent through to the SDI output: As you can see, the latency goes up, up, up until Nageru figures it's now safe to drop a frame, and then does it in one clean drop event; no more hundreds of drops involved. There are very late frame arrivals involved in this run (two extra frame drops, to be precise), but the algorithm simply determines immediately that they are outliers, and drops them without letting them linger in the queue. (Immediate dropping is usually preferred to sticking around for a bit and then dropping it later, as it means you only get one disturbance event in your stream as opposed to two. Of course, you can only do it if you're reasonably sure it won't lead to more underruns later.) Nageru 1.6.1 will ship before Solskogen, as I intend to run it there :-) And there will probably be lovely premade Grafana dashboards from the Prometheus data.
Although it would have been a lot nicer if Grafana were more packaging-friendly, so I could pick it up from stock Debian and run it on armhf. Hrmf. :-)
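The 1.6.1 gist described above can be caricatured as follows (a heavily simplified sketch; see the commit the post references for the real algorithm, which also accounts for clock offsets):

```python
# Heavily simplified jitter-aware trimming: estimate worst-case lateness
# from recent history, and only drop queued frames when even a maximally
# late arrival would still make the next master tick.
from collections import deque

class JitterModel:
    def __init__(self, history=5000):
        self.lateness = deque(maxlen=history)  # observed lateness, seconds

    def observe(self, lateness_sec):
        self.lateness.append(lateness_sec)

    def max_jitter(self):
        return max(self.lateness, default=0.0)

    def safe_to_drop(self, time_to_next_tick_sec):
        # Drop only if even the worst observed lateness arrives in time.
        return self.max_jitter() < time_to_next_tick_sec

m = JitterModel()
for late in (0.001, 0.002, 0.010):    # ~1 ms normal jitter, one 10 ms outlier
    m.observe(late)
print(m.safe_to_drop(1 / 50))         # True: 10 ms < 20 ms tick interval
```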

30 May 2017

Steinar H. Gunderson: Nageru 1.6.0 released

I've released version 1.6.0 of Nageru, my live video mixer, together with dependent libraries Movit 1.5.1 and bmusb 0.7.0. The primary new feature this time is integration with CasparCG, the dominating open-source broadcast graphics system, which opens up a whole new world of possibilities for intelligent overlay graphics. (Actually, the feature is a bit more generic than that; any FFmpeg file or stream will do as input. Audio isn't supported yet, though.) You can see a simple HTML5 CasparCG setup in the ultimate tournament stream test we did in April, in preparation for a larger event in September; CasparCG generates a stream with alpha, which is then fed into Nageru and used on top of the three camera sources. Apart from that, there's a new frame analyzer that helps with calibrating your signal chain; there are lots of devices that will happily mess with your signal, and measuring is the first step in counteracting that. (There's also a few input interpretation tweaks that will help most common issues.) 1.6.0 is on its way to Debian experimental, along with its dependencies (stretch will release with Nageru 1.4.2); there are likely to be backports when stretch releases and the backport queue opens up.

26 May 2017

Steinar H. Gunderson: Last minute stretch bugs

The last week, I found no less than three pet bugs that I hope will be allowed to go in before the stretch release: I promise, none of these were found late because I upgraded to stretch too late; it was just a perfect storm. :-)

27 April 2017

Steinar H. Gunderson: Chinese HDMI-to-SDI converters, part II

Following up on my previous post, I have only a few small nuggets of extra information: The power draw appears to be 230-240 mA at 5 V, or about 1.15 W. (It doesn't matter whether they have a signal or not.) This means you can power them off of a regular USB power bank; in a recent test, we used Deltaco 20000 mAh (at 3.7 V) power banks, which are supposed to power a GoPro plus such a converter for somewhere between 8-12 hours (we haven't measured exactly). It worked brilliantly, and solved a headache of how to get AC to the camera and converter; just slap on a power bank instead, and all you need to run is SDI. Curiously enough, 230 mA is so little that the power bank thinks it doesn't count as load, and just turns itself off after ten seconds or so. However, with the GoPro also active, it stays on all the time. At least it ran for the two hours that we needed without a hitch. The SDI converters don't accept USB directly, but you can purchase dumb USB-to-5.5x2.1mm converters cheap from wherever, which works fine even though USB is supposed to give you only 100 mA without handshaking. Some eBay sellers seem to include them with the converters, even. I guess the power bank just doesn't care; it's spec-ed to 2.1 A on two ports and 1 A on the last one. Update: Sebastian Reichel pointed out that USB 2.0 ups the limit to 500 mA, so you should be within spec in any case. :-) And here's a picture of the entire contraption as a bonus: Power bank and HDMI-to-SDI converter
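The 8-12 hour runtime claim roughly checks out on the back of an envelope (the converter's ~1.15 W is from the post; the GoPro draw and conversion efficiency below are my assumptions, not measurements):

```python
# Back-of-the-envelope check of the power-bank runtime quoted above.
bank_wh = 20.0 * 3.7        # 20000 mAh at 3.7 V = 74 Wh
converter_w = 0.235 * 5     # ~235 mA at 5 V, per the measurement in the post
gopro_w = 4.5               # ASSUMED typical GoPro draw while recording
efficiency = 0.85           # ASSUMED boost-converter efficiency
hours = bank_wh * efficiency / (converter_w + gopro_w)
print(round(hours, 1))      # ~11 hours, consistent with the 8-12 h claim
```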

18 April 2017

Steinar H. Gunderson: Chinese HDMI-to-SDI converters

I often need to convert signals from HDMI to SDI (and occasionally back). This requires a box of some sort, and eBay obliges; there's a bunch of different sellers of the same devices, selling them around $20-25. They don't seem to have a brand name, but they are invariably sold as 3G-SDI converters (meaning they should go up to 1080p60) and look like this: There are also corresponding SDI-to-HDMI converters that look pretty much the same except they convert the other way. (They're easy to confuse, but that's not a problem unique to them.) I've used them for a while now, and there are pros and cons. They seem reliable enough, and they're 1/4th the price of e.g. Blackmagic's Micro converters, which is a real bargain. However, there are also some issues: The last issue is by far the worst, but it only affects 3G-SDI resolutions. 720p60, 1080p30 and 1080i60 all work fine. And to be fair, not even Blackmagic's own converters actually send 352M correctly most of the time. I wish there were a way I could publish this somewhere people would actually read it before buying these things, but without a name, it's hard for people to find it. They're great value for money, and I wouldn't hesitate to recommend them for almost all use; but then, there's that almost. :-)

5 April 2017

Steinar H. Gunderson: Nageru 1.5.0 released

I just released version 1.5.0 of Nageru, my live video mixer. The biggest feature is obviously the HDMI/SDI live output, but there are lots of small nuggets everywhere; it's been four months in the making. I'll simply paste the NEWS entry here:
Nageru 1.5.0, April 5th, 2017
  - Support for low-latency HDMI/SDI output in addition to (or instead of) the
    stream. This currently only works with DeckLink cards, not bmusb. See the
    manual for more information.
  - Support changing the resolution from the command line, instead of locking
    everything to 1280x720.
  - The A/V sync code has been rewritten to be more in line with Fons
    Adriaensen's original paper. It handles several cases much better,
    in particular when trying to match 59.94 and 60 Hz sources to each other.
    However, it might occasionally need a few extra seconds on startup to
    lock properly if startup is slow.
  - Add support for using x264 for the disk recording. This makes it possible,
    among other things, to run Nageru on a machine entirely without VA-API.
  - Support for 10-bit Y'CbCr, both on input and output. (Output requires
    x264 disk recording, as Quick Sync Video does not support 10-bit H.264.)
    This requires compute shader support, and is in general a little bit
    slower on input and output, due to the extra amount of data being shuffled
    around. Intermediate precision is 16-bit floating-point or better,
    as before.
  - Enable input mode autodetection for DeckLink cards that support it.
    (bmusb mode has always been autodetected.)
  - Add functionality to add a time code to the stream; useful for debugging.
  - The live display is now both more performant and of higher image quality.
  - Fix a long-standing issue where the preview displays would be too bright
    when using an NVIDIA GPU. (This did not affect the finished stream.)
  - Many other bugfixes and small improvements.
1.5.0 is on its way into Debian experimental (it's too late for the stretch release, especially as it also depends on Movit and bmusb from experimental), or you can get it from the home page as always.