Search Results: "cfm"

8 December 2025

François Marier: Learning a new programming language with an LLM

I started learning Go this year. First, I picked a Perl project I wanted to rewrite, got a good book and ignored AI tools since I thought they would do nothing but interfere with learning. Eventually though, I decided to experiment a bit and ended up finding a few ways to use AI assistants effectively even when learning something new.

Searching more efficiently The first use case that worked for me was search. Instead of searching on a traditional search engine and then ending up on Stack Overflow, I could get the answer I was looking for directly in an AI side-window in my editor. Of course, that's bad news for Stack Overflow. I was, however, skeptical from the beginning since LLMs make mistakes, sometimes making up function signatures or APIs that don't exist. Therefore I got into the habit of going to the official standard library documentation to double-check suggestions. For example, if the LLM suggests using strings.SplitN, I verify the function signature and behaviour carefully before using it. Basically, "don't trust and do verify." I stuck to the standard library in my project, but if an LLM recommends third-party dependencies for you, make sure they exist and that Socket doesn't flag them as malicious. Research has found that 5-20% of packages suggested by LLMs don't actually exist, making this a real attack vector (dubbed "slopsquatting").

Autocomplete is too distracting A step I took early on was to disable AI autocomplete in my editor. When learning a new language, you need to develop muscle memory for the syntax. Also, Go is no Java. There's not that much boilerplate to write in general. I found it quite distracting to see some almost correct code replace my thinking about the next step. I can see how one could go faster with these suggestions, but being a developer is not just about cranking out lines of code as fast as possible, it's also about constantly learning new things (and retaining them).

Asking about idiomatic code One of the most useful prompts when learning a new language is "Is this the most idiomatic way to do this in Go?". Large language models are good at recognizing patterns and can point out when you're writing code that works but doesn't follow the conventions of the language. This is especially valuable early on when you don't yet have a feel for what "good" code looks like in that language. It's usually pretty easy (at least for an experienced developer) to tell when the LLM suggestion is actually counterproductive or wrong. If it increases complexity or is harder to read/decode, it's probably not a good idea to do it.

Reviews One way a new dev gets better is through code review. If you have access to a friend who's an expert in the language you're learning, then you can definitely gain a lot by asking for feedback on your code. If you don't have access to such a valuable resource, or as a first step before you consult your friend, I found that AI-assisted code reviews can be useful:
  1. Get the model to write the review prompt for you. Describe what you want reviewed and let it generate a detailed prompt.
  2. Feed that prompt to multiple models. They each have different answers and will detect different problems.
  3. Be prepared to ignore 50% of what they recommend. Some suggestions will be stylistic preferences, others will be wrong, or irrelevant.
The value is in the other 50%: the suggestions that make you think about your code differently or catch genuine problems. Similarly for security reviews:
  • A lot of what they flag will need to be ignored (false positives, or things that don't apply to your threat model).
  • Some of it may highlight areas for improvement that you hadn't considered.
  • Occasionally, they will point out real vulnerabilities.
But always keep in mind that AI chatbots are trained to be people-pleasers and often feel the need to suggest something when nothing was needed.

An unexpected benefit One side effect of using AI assistants was that having them write the scaffolding for unit tests motivated me to increase my code coverage. Trimming unnecessary test cases and adding missing ones is pretty quick when the grunt work is already done, and I ended up testing more of my code (being a personal project written in my own time) than I might have otherwise.

Learning In the end, I continue to believe in the value of learning from quality books (I find reading on paper most effective). In addition, I like to create Anki questions for common mistakes or things I find I have to look up often. Remembering something will always be faster than asking an AI tool. So my experience this year tells me that LLMs can supplement traditional time-tested learning techniques, but I don't believe they make them obsolete. P.S. I experimented with getting an LLM to ghost-write this post for me from an outline (+ a detailed style guide) and I ended up having to rewrite at least 75% of it. It was largely a waste of time.

21 October 2025

Gunnar Wolf: LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

This post is a review for Computing Reviews for LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation, an article published in Proceedings of the ACM on Software Engineering, Volume 2, Issue ISSTA
How good can large language models (LLMs) be at generating code? This may not seem like a very novel question, as several benchmarks (for example, HumanEval and MBPP, published in 2021) existed before LLMs burst into public view and started the current artificial intelligence (AI) inflation. However, as the paper's authors point out, code generation is very seldom done as an isolated function, but instead must be deployed in a coherent fashion together with the rest of the project or repository it is meant to be integrated into. Today, several benchmarks (for example, CoderEval or EvoCodeBench) measure the functional correctness of LLM-generated code via test case pass rates. This paper brings a new proposal to the table: comparing LLM-generated repository-level evaluated code by examining the hallucinations generated. The authors begin by running the Python code generation tasks proposed in the CoderEval benchmark against six code-generating LLMs. Next, they analyze the results and build a taxonomy to describe code-based LLM hallucinations, with three types of conflicts (task requirement, factual knowledge, and project context) as first-level categories and eight subcategories within them. The authors then compare the results of each of the LLMs per the main hallucination category. Finally, they try to find the root cause for the hallucinations. The paper is structured very clearly, not only presenting the three research questions (RQ) but also referring to them as needed to explain why and how each partial result is interpreted. RQ1 (establishing a hallucination taxonomy) is the most thoroughly explored. While RQ2 (LLM comparison) is clear, it just presents straightforward results without much analysis. RQ3 (root cause discussion) is undoubtedly interesting, but I feel it to be much more speculative and not directly related to the analysis performed. After tackling their research questions, Zhang et al. propose a possible mitigation to counter the effect of hallucinations: enhance the LLM with retrieval-augmented generation (RAG) so it better understands task requirements, factual knowledge, and project context. The presented results show that all of the models are clearly (though modestly) improved by the proposed RAG-based mitigation. The paper is clearly written and easy to read. It should provide its target audience with interesting insights and discussions. I would have liked more details on their RAG implementation, but I suppose that's for a follow-up work.

21 September 2025

Gunnar Wolf: We, Programmers: A Chronicle of Coders from Ada to AI

This post is a review for Computing Reviews for We, Programmers: A Chronicle of Coders from Ada to AI, a book published by Addison-Wesley
When this book was presented as available for review, I jumped on it. After all, who doesn't love reading a nice bit of computing history, as told by a well-known author (affectionately known as "Uncle Bob"), one who has been immersed in computing since forever? What's not to like there? Reading on, the book does not disappoint. Much to the contrary, it digs into details absent in most computer history books that, being an operating systems and computer architecture geek, I absolutely enjoyed. But let me first address the book's organization. The book is split into four parts. Part 1, Setting the Stage, is a short introduction, answering the question "Who are we?" ("we" being the programmers, of course). It describes the fascination many of us felt when we realized that the computer was there to obey us, to do our bidding, and we could absolutely control it. Part 2 talks about the giants of the computing world, on whose shoulders we stand. It digs in with a level of detail I have never seen before, discussing their personal lives and technical contributions (as well as the hoops they had to jump through to get their work done). Nine chapters cover these giants, ranging chronologically from Charles Babbage and Ada Lovelace to Ken Thompson, Dennis Ritchie, and Brian Kernighan (understandably, giants who worked together are grouped in the same chapter). This is the part with the most historically overlooked technical details. For example, what was the word size in the first computers, before even the concept of a byte had been brought into regular use? What was the register structure of early central processing units (CPUs), and why did it lead to requiring self-modifying code to be able to execute loops? Then, just as Unix and C get invented, Part 3 skips to computer history as seen through the eyes of Uncle Bob. I must admit, while the change of rhythm initially startled me, it ends up working quite well.
The focus is no longer on the giants of the field, but on one particular person (who casts a very long shadow). The narrative follows the author's life: a boy with access to electronics due to his father's line of work; a computing industry leader, in the early 2000s, with extreme programming; one of the first producers of training materials in video format, a role that today might be recognized as an influencer. This first-person narrative reaches the year 2023. But the book is not just a historical overview of the computing world, of course. Uncle Bob includes a final section with his thoughts on the future of computing. As this is a book for programmers, it is fitting to start with the changes in programming languages that we should expect to see and where such changes are likely to take place. The unavoidable topic of artificial intelligence is presented next: What is it and what does it spell for computing, and in particular for programming? Interesting (and sometimes surprising) questions follow: What does the future of hardware development look like? What is prone to be the evolution of the World Wide Web? What is the future of programming and programmers? At just under 500 pages, the book is a volume to be taken seriously. But space is very well used with this text. The material is easy to read, often funny and always informative. If you enjoy computer history and understanding the little details in the implementations, it might very well be the book you want.

18 September 2025

John Goerzen: Running an Accurate 80x25 DOS-Style Console on Modern Linux Is Possible After All

Here, in classic Goerzen deep dive fashion, is more information than you knew you wanted about a topic you've probably never thought of. I found it pretty interesting, because it took me down a rabbit hole of subsystems I've never worked with much and a mishmash of 1980s and 2020s tech. I had previously tried and failed to get an actual 80x25 Linux console, but I've since figured it out! This post is about the Linux text console, not X or Wayland. We're going to get the console right without using those systems. These instructions are for Debian trixie, but should be broadly applicable elsewhere also. The end result can look like this: [Photo of a color VGA monitor displaying a BBS login screen] (That's a Wifi Retromodem that I got at VCFMW last year in the Hayes modem case)

What's a pixel? How would you define a pixel these days? Probably something like "a uniquely-addressable square dot in a two-dimensional grid". In the world of VGA and CRTs, that was just a logical abstraction. We got an API centered around that because it was convenient. But, down the VGA cable and on the device, that's not what a pixel was. A pixel, back then, was a time interval. On multisync monitors, which were common except in the very early days of VGA, the timings could be adjusted, which produced logical pixels of different sizes. Those screens often had a maximum resolution but not necessarily a native resolution in the sense that an LCD panel does. Different timings produced different-sized pixels with equal clarity (or, on cheaper monitors, equal fuzziness). A side effect of this was that pixels need not be square. And, in fact, in the standard DOS VGA 80x25 text mode, they weren't. You might be seeing why DVI, DisplayPort, and HDMI replaced VGA for LCD monitors: with a VGA cable, you did a pixel-to-analog-timings conversion, then the display did a timings-to-pixels conversion, and this process could be a bit lossy. (Hence why you sometimes needed to fill the screen with an image and push the center button on those older LCD screens) (Note to the pedantically-inclined: yes I am aware that I have simplified several things here; for instance, a color LCD pixel is made up of approximately 3 sub-dots of varying colors, and that things like color eInk displays have two pixel grids with different sizes of pixels layered atop each other, and printers are another confusing thing altogether, and and and... MOST PEOPLE THINK OF A PIXEL AS A DOT THESE DAYS, OK?)

What was DOS text mode? We think of this as the standard display: 80 columns wide and 25 rows tall. 80x25. By the time Linux came along, the standard Linux console was VGA text mode, something like the 4th incarnation of text modes on PCs (after CGA, MDA, and EGA). VGA also supported certain other sizes of characters giving certain other text dimensions, but if I cover all of those, this will explode into a ridiculously more massive page than it already is. So to display text on an 80x25 DOS VGA system, ultimately characters and attributes were written into the text buffer in memory. The VGA system then rendered it to the display as a 720x400 image (at 70Hz) with non-square pixels such that the result was approximately a 4:3 aspect ratio. The font used for this rendering was a bitmapped one using 8x16 cells. You might do some math here and point out that 8 * 80 is only 640, and you'd be correct. The fonts were 8x16 but the rendered cells were 9x16. The extra pixel was normally used for spacing between characters. However, in line graphics mode, characters 0xC0 through 0xDF repeated the 8th column in the position of the 9th, allowing the continuous line-drawing characters we're used to from TUIs.
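The cell arithmetic and the 9th-column rule described above can be written down as a short Python sketch. This is only an illustration: the function names text_grid and expand_row are hypothetical, each glyph row is modeled as an 8-bit integer with the most significant bit as the leftmost pixel, and line graphics mode is assumed to be enabled.

```python
# Illustrative sketch only; names and bit layout are assumptions,
# not actual VGA register programming.

def text_grid(width_px, height_px, cell_w, cell_h):
    """Return (columns, rows) of text cells that fit in the image."""
    return width_px // cell_w, height_px // cell_h

def expand_row(char_code, row_bits):
    """Widen one 8-pixel glyph row to the 9 pixels actually rendered.

    For characters 0xC0-0xDF the 9th pixel repeats the 8th;
    for every other character it is left blank as spacing.
    """
    ninth = (row_bits & 1) if 0xC0 <= char_code <= 0xDF else 0
    return (row_bits << 1) | ninth

print(text_grid(720, 400, 9, 16))                    # 9x16 cells
print(text_grid(720, 400, 8, 16))                    # 8x16 cells
print(format(expand_row(0xC4, 0b11111111), "09b"))   # line-drawing character
print(format(expand_row(0x41, 0b11111111), "09b"))   # ordinary character
```

With 9x16 cells a 720x400 image holds exactly 80x25 characters, whereas 8x16 cells in the same image give 90 columns.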

Problems rendering DOS fonts on modern systems By now, you're probably seeing some of the issues we have rendering DOS screens on more modern systems. These aren't new at all; I remember some of these from back in the days when I ran OS/2, and I think also saw them on various terminals and consoles in OS/2 and Windows. Some issues you'd encounter would be:
  • Incorrect aspect ratio caused by using the original font and rendering it using 1:1 square pixels (resulting in a squashed appearance)
  • Incorrect aspect ratio for ANOTHER reason, caused by failing to render column 9, resulting in text that is overall too narrow
  • Characters appearing to be touching each other when they shouldn't (failing to render column 9; looking at you, dosbox)
  • Gaps between line drawing characters that should be continuous, caused by rendering column 9 as empty space in all cases

Character set issues DOS was around long before Unicode was. In the DOS world, there were codepages that selected the glyphs for roughly the high half of the 256 possible characters. CP437 was the standard for the USA; others existed for other locations that needed different characters. On Unix, the USA pre-Unicode standard was Latin-1. Same concept, but with different character mappings. Nowadays, just about everything is based on UTF-8. So, we need some way to map our CP437 glyphs into Unicode space. If we are displaying DOS-based content, we'll also need a way to map CP437 characters to Unicode for display later, and we need these maps to match so that everything comes out right. Whew. So, let's get on with setting this up!
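To make the codepage difference concrete, here is a small Python sketch that reads the same three bytes through CP437 and through Latin-1, using Python's built-in codecs. The byte values are chosen only for illustration.

```python
# The same three bytes interpreted through two pre-Unicode code pages.
raw = bytes([0xC9, 0xCD, 0xBB])

as_cp437 = raw.decode("cp437")      # DOS code page 437: box-drawing characters
as_latin1 = raw.decode("latin-1")   # ISO-8859-1: accented letters and a guillemet

print(as_cp437)
print(as_latin1)

# What a UTF-8 console actually receives for the CP437 interpretation:
print(as_cp437.encode("utf-8").hex(" "))
```

Under CP437 these bytes are the top edge of a double-line box; under Latin-1 they are unrelated letters, which is why the mapping used when converting content and the mapping built into the font have to agree.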

Selecting the proper video mode As explained in my previous post, proper hardware support for DOS text mode is limited to x86 machines that do not use UEFI. Non-x86 machines, or x86 machines with UEFI, simply do not contain the necessary support for it. As these are now standard, most of the time, the text console you see on Linux is actually the kernel driving the video hardware in graphics mode, and doing the text rendering in software. That's all well and good, but it makes it quite difficult to actually get an 80x25 console. First, we need to be running at 720x400. This is where I ran into difficulty last time. I realized that my laptop's LCD didn't advertise any video modes other than its own native resolution. However, almost all external monitors will, and 720x400@70 is a standard VGA mode from way back, so it should be well-supported. You need to find the Linux device name for your device. You can look at the possible devices with ls -l /sys/class/drm. If you also have a GUI, xrandr may help too. But in any case, each directory under /sys/class/drm has a file named modes, and if you cat them all, you will eventually come across one with a bunch of modes defined. Drop the leading card0 or whatever from the directory name, and that's your device. (Verify that 720x400 is in modes while you're at it.) Now, you're going to edit /etc/default/grub and add something like this to GRUB_CMDLINE_LINUX_DEFAULT:
video=DP-1:720x400@70
Of course, replace DP-1 with whatever your device is. Now you can run update-grub and reboot. You should have a 720x400 display. At first, I thought I had succeeded by using Linux's built-in VGA font with that mode. But it looked too tall. After noticing that repeated 0s were touching, I got suspicious about the missing 9th column in the cells. stty -a showed that my screen was 90x25, which is exactly what it would show if I was using 8x16 instead of 9x16 cells. Sooo... I need to prepare a 9x16 font.
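The device lookup described above (reading every modes file under /sys/class/drm and dropping the card prefix) could be scripted roughly as follows. This is a sketch under assumptions: the function name connectors_with_mode is hypothetical, connector directories are assumed to be named like card0-DP-1, and each modes file is assumed to list one mode per line.

```python
from pathlib import Path

def connectors_with_mode(mode="720x400", drm_root="/sys/class/drm"):
    """Return connector names (e.g. 'DP-1') whose modes file lists `mode`."""
    found = []
    for modes_file in sorted(Path(drm_root).glob("card*-*/modes")):
        if mode in modes_file.read_text().split():
            # Directory is named like 'card0-DP-1'; drop the 'cardN-' prefix.
            found.append(modes_file.parent.name.split("-", 1)[1])
    return found

if __name__ == "__main__":
    # Print a candidate kernel parameter for each matching connector.
    for name in connectors_with_mode():
        print(f"video={name}:720x400@70")
```

An empty result means no connected display advertises 720x400, which matches the laptop-panel situation mentioned earlier.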

Preparing a font Here's where it gets complicated. I'll give you the simple version and the hard mode. The simple mode is this: Download https://www.complete.org/downloads/CP437-VGA.psf.gz and stick it in /usr/local/etc, then skip to the "Activating the font" section below. The font assembled here is based on the Ultimate Oldschool PC Font Pack v2.2, which is (c) 2016-2020 VileR and licensed under Creative Commons Attribution-ShareAlike 4.0 International License. My psf file is derived from this using the instructions below.

Building it yourself First, install some necessary software: apt-get install fontforge bdf2psf. Start by going to the Oldschool PC Font Pack Download page. Download oldschool_pc_font_pack_v2.2_FULL.zip and unpack it. The file we're interested in is otb - Bm (linux bitmap)/Bm437_IBM_VGA_9x16.otb. Open it in fontforge by running fontforge Bm437_IBM_VGA_9x16.otb. When it asks if you will load the bitmap fonts, hit select all, then yes. Go to File -> generate fonts. Save in a BDF, no need for outlines, and use guess for resolution. Now you have a file such as Bm437_IBM_VGA_9x16-16.bdf. Excellent. Now we need to generate a Unicode map file. We will make sure this matches the system's by enumerating every character from 0x00 to 0xFF, converting it from CP437 to Unicode, and writing the appropriate map. Here's a Python script to do that:
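# Emit one "U+xxxx" line per CP437 byte value, in order, for bdf2psf.
for i in range(0, 256):
    cp437b = b'%c' % i                 # a single byte with value i
    uni = ord(cp437b.decode('cp437'))  # its Unicode code point
    print(f"U+{uni:04x}")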
Save that file as genmap.py and run python3 genmap.py > cp437-uni. Now, we're ready to build the psf file:
bdf2psf --fb Bm437_IBM_VGA_9x16-16.bdf \
  /dev/null cp437-uni 256 CP437-VGA.psf
By convention, we normally store these files gzipped, so gzip CP437-VGA.psf. You can test it on the console with setfont CP437-VGA.psf.gz. Now copy this file into /usr/local/etc.
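As an optional sanity check on the map, the same 256 entries can be rebuilt in memory and inspected. This is a minimal sketch; the helper name cp437_map_lines is hypothetical and it only mirrors what genmap.py prints.

```python
def cp437_map_lines():
    """Build the same 256 'U+xxxx' lines that genmap.py prints."""
    return [f"U+{ord(bytes([i]).decode('cp437')):04x}" for i in range(256)]

lines = cp437_map_lines()
assert len(lines) == 256          # one entry per glyph slot
assert len(set(lines)) == 256     # no two slots share a code point

# Spot-check a letter, a box corner and the full block.
print(lines[0x41], lines[0xC9], lines[0xDB])
```

One thing this makes visible: Python's cp437 codec decodes bytes 0x00 through 0x1F to the ordinary control characters, so those slots end up as U+0000 through U+001F in the map.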

Activating the font Now, edit /etc/default/console-setup. It should look like this:
# CONFIGURATION FILE FOR SETUPCON

# Consult the console-setup(5) manual page.

ACTIVE_CONSOLES="/dev/tty[1-6]"

CHARMAP="UTF-8"

CODESET="Lat15"
FONTFACE="VGA"
FONTSIZE="8x16"
FONT=/usr/local/etc/CP437-VGA.psf.gz

VIDEOMODE=

# The following is an example how to use a braille font
# FONT='lat9w-08.psf.gz brl-8x8.psf'
At this point, you should be able to reboot. You should have a proper 80x25 display! Log in and run stty -a to verify it is indeed 80x25.

Using and testing CP437 Part of the point of CP437 is to be able to access BBSs, ANSI art, and similar. Now, remember, the Linux console is still in UTF-8 mode, so we have to translate CP437 to UTF-8, then let our font map translate it back to CP437. A weird trip, but it works. Let's test it using the Textfiles ANSI art collection. In the artworks section, I randomly grabbed a file near the top: borgman.ans. Download that, and display with:
clear; iconv -f CP437 -t UTF-8 < borgman.ans
You should see something similar to, but actually more accurate than, the textfiles PNG rendering of it, which you'll note has an incorrect aspect ratio and some rendering issues. I spot-checked with a few others and they seemed to look good. belinda.ans in particular tries quite a few characters and should give you a good sense if it is working.
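The conversion that iconv performs in the command above can be sketched in Python as well. The helper name cp437_to_utf8 and the sample bytes are hypothetical; the ANSI escape sequences are plain ASCII and pass through unchanged.

```python
def cp437_to_utf8(data: bytes) -> bytes:
    """Re-encode CP437 bytes as UTF-8, leaving ANSI escape sequences intact."""
    return data.decode("cp437").encode("utf-8")

# Bold blue, two full-block characters (0xDB), then reset.
sample = b"\x1b[1;34m\xdb\xdb\x1b[0m"
print(cp437_to_utf8(sample))

# To display a whole file on a UTF-8 console, something like:
#   import sys
#   sys.stdout.buffer.write(cp437_to_utf8(open("borgman.ans", "rb").read()))
```

Each 0xDB byte becomes the three-byte UTF-8 sequence for U+2588, which the console font map then points back at the original glyph.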

Use with interactive programs That's all well and good, but you're probably going to want to actually use this with some interactive program that expects CP437. Maybe Minicom, Kermit, or even just telnet? For this, you'll want to apt-get install luit. luit maps CP437 (or any other encoding) to UTF-8 for display, and then of course the Linux console maps UTF-8 back to the CP437 font. Here's a way you can repeat the earlier experiment using luit to run the cat program:
clear; luit -encoding CP437 cat borgman.ans
You can run any command under luit. You can even run luit -encoding CP437 bash if you like. If you do this, it is probably a good idea to follow the instructions on generating locales in my post on serial terminals, and then within luit, set LANG=en_us.IBM437. But note especially that you can run programs like minicom and others for accessing BBSs under luit.

Final words This gave you a nice DOS-type console. Although it doesn't have glyphs for many codepoints, it does run in UTF-8 mode and therefore is compatible with modern software. You can achieve greater compatibility with more UTF-8 codepoints with the DOS font, at the expense of accuracy of character rendering (especially for the double-line drawing characters) by using /usr/share/bdf2psf/standard.equivalents instead of /dev/null in the bdf2psf command. Or you could go for another challenge, such as using the DEC vt-series fonts for coverage of ISO-8859-1. But just using fonts extracted from DEC ROM won't work properly, because DEC terminals had even more strangeness going on than DOS fonts.

11 September 2025

Gunnar Wolf: Saying _hi_ to my good Reproducible Builds friends while reading a magazine article

Just wanted to share: I enjoy reading George V. Neville-Neil's Kode Vicious column, which regularly appears in some of ACM's publications I follow, such as ACM Queue or Communications. Today I was very pleasantly surprised while reading the column titled "Can't we have nice things": Kode Vicious answers a question on why computing has nothing comparable to the beauty of ancient physics laboratories turned into museums (e.g. Faraday's laboratory) by giving a great hat tip to a project that stemmed from Debian, and where many of my good Debian friends spend a lot of their energies: Reproducible Builds. KV says:
Once the proper measurement points are known, we want to constrain the system such that what it does is simple enough to understand and easy to repeat. It is quite telling that the push for software that enables reproducible builds only really took off after an embarrassing widespread security issue ended up affecting the entire Internet. That there had already been 50 years of software development before anyone thought that introducing a few constraints might be a good idea is, well, let's just say it generates many emotions, none of them happy, fuzzy ones.
Yes, KV is a seasoned free software author. But I found it heartwarming that the Reproducible Builds project is mentioned without needing to introduce it (assuming familiarity across the computing industry and academia), recognized as game-changing as we understood it would be over ten years ago when it was first announced, and enabling of beauty in computing. Congratulations to all of you who have made this possible!

25 August 2025

Gunnar Wolf: The comedy of computation, or, how I learned to stop worrying and love obsolescence

This post is a review for Computing Reviews for The comedy of computation, or, how I learned to stop worrying and love obsolescence, a book published by Stanford University Press
The Comedy of Computation is not an easy book to review. It is a very enjoyable book that analyzes several examples of how being computational has been approached across literary genres in the last century: how authors of stories, novels, theatrical plays and movies, focusing on comedic genres, have understood the role of the computer in defining human relations, reactions and even self-image. Mangrum structures his work in six thematic chapters, where he presents different angles on human society: how racial stereotypes have advanced in human imagination and perception about a future where we interact with mechanical or computational partners (from mechanical tools performing jobs that were identified with racial profiles to intelligent robots that threaten to control society); the genericity of computers, and how people can be seen as generic, interchangeable characters, often fueled by the tendency people exhibit to confer anthropomorphic qualities to inanimate objects; people's desire to be seen as truly "authentic", regardless of what it ultimately means; romantic involvement and romance-led stories (with the computer seen as a facilitator for human-to-human romances, a distractor away from them, or being itself a part of the couple); and the absurdity in anthropomorphization, in comparing fundamentally different aspects such as intelligence and speed at solving mathematical operations, as well as the absurdity presented blatantly as such by several techno-utopian visions. But presenting this as a linear set of concepts does not do justice to the book. Throughout the sections of each chapter, a different work serves as the axis: novels and stories, Hollywood movies, Broadway plays, some covers for Time magazine, a couple of presenting the would-be future, even a romantic comedy entirely written by "bots".
And for each of them, Benjamin Mangrum presents a very thorough analysis, drawing relations and comparing with contemporary works, but also with Shakespeare, classical Greek myths, and a very long et cetera. This book is hard to review because of the depth of work the author did: Reading it repeatedly made me look for other works, or at least longer references for them. Still, despite being a work with such erudition, Mangrum's text is easy and pleasant to read, without feeling heavy or written in an overly academic style. I very much enjoyed reading this book. It is certainly not a technical book about computers and society in any way; it is an exploration of human creativity and our understanding of the aspects the author has found as central to understanding the impact of computing on humankind. However, there is one point I must mention before closing: I believe the editorial decision to present the work as a running text, with all the material conceptualized as footnotes presented as a separate final chapter over 50 pages long, detracts from the final result. Personally, I enjoy reading the footnotes because they reveal the author's thought processes, even if they stray from the central line of thought. Even more, given my review copy was a PDF, I could not even keep said chapter open with one finger, bouncing back and forth. For all purposes, I missed out on the notes; now that I finished reading and stumbled upon that chapter, I know I missed an important part of the enjoyment.

11 June 2025

Gunnar Wolf: Understanding Misunderstandings - Evaluating LLMs on Networking Questions

This post is a review for Computing Reviews for Understanding Misunderstandings - Evaluating LLMs on Networking Questions, an article published in Association for Computing Machinery (ACM), SIGCOMM Computer Communication Review
Large language models (LLMs) have awed the world, emerging as the fastest-growing application of all time: ChatGPT reached 100 million active users in January 2023, just two months after its launch. After an initial cycle, they have gradually been mostly accepted and incorporated into various workflows, and their basic mechanics are no longer beyond the understanding of people with moderate computer literacy. Now, given that the technology is better understood, we face the question of how convenient LLM chatbots are for different occupations. This paper embarks on the question of whether LLMs can be useful for networking applications. This paper systematizes querying three popular LLMs (GPT-3.5, GPT-4, and Claude 3) with questions taken from several network management online courses and certifications, and presents a taxonomy of six axes along which the incorrect responses were classified. The authors also measure four strategies toward improving answers, and observe that, while some of those strategies were marginally useful, they sometimes resulted in degraded performance. The authors queried the commercially available instances of Gemini and GPT, which achieved scores over 90 percent for basic subjects but fared notably worse in topics that require understanding and converting between different numeric notations, such as working with Internet protocol (IP) addresses, even if they are trivial (that is, presenting the subnet mask for a given network address expressed as the typical IPv4 dotted-quad representation). As a last item in the paper, the authors compare performance with three popular open-source models: Llama3.1, Gemma2, and Mistral with their default settings. Although those models are almost 20 times smaller than the GPT-3.5 commercial model used, they reached comparable performance levels. Sadly, the paper does not delve deeper into these models, which can be deployed locally and adapted to specific scenarios.
The paper is easy to read and does not require deep mathematical or AI-related knowledge. It presents a clear comparison along the described axes for the 503 multiple-choice questions presented. This paper can be used as a guide for structuring similar studies over different fields.
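For readers unfamiliar with the kind of notation conversion mentioned above, this is what presenting a subnet mask in dotted-quad form amounts to, using Python's standard ipaddress module. The network 192.168.10.0/26 is an arbitrary example chosen here, not a question taken from the paper.

```python
import ipaddress

# Arbitrary example network; not taken from the reviewed paper.
net = ipaddress.ip_network("192.168.10.0/26")

print(net.netmask)              # prefix length /26 as a dotted quad
print(f"{int(net.netmask):032b}")  # the same mask as 32 bits
print(net.num_addresses)        # addresses covered by the prefix
```

The conversion is purely mechanical (26 one-bits followed by 6 zero-bits, grouped into four octets), which is what makes it a trivial task for conventional code.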

4 June 2025

Gunnar Wolf: The subjective value of privacy: Assessing individuals' calculus of costs and benefits in the context of state surveillance

This post is an unpublished review for The subjective value of privacy: Assessing individuals' calculus of costs and benefits in the context of state surveillance
Internet users, software developers, academics, entrepreneurs basically everybody is now aware of the importance of considering privacy as a core part of our online experience. User demand, and various national or regional laws, have made privacy a continuously present subject. And privacy is such an all-encompassing, complex topic, the angles from which it can be studied seems never to finish; I recommend computer networking-oriented newcomers to the topic to refer to Brian Kernighan s excellent work [1]. However, how do regular people like ourselves, in our many capacities feel about privacy? Lukas Antoine presents a series of experiments aiming at better understanding how people throughout the world understands privacy, and when is privacy held as more or less important than security in different aspects, Particularly, privacy is often portrayed as a value set at tension against surveillance, and particularly state surveillance, in the name of security: conventional wisdom presents the idea of privacy calculus. This is, it is often assumed that individuals continuously evaluate the costs and benefits of divulging their personal data, sharing data when they expect a positive net outcome, and denying it otherwise. This framework has been accepted for decades, and the author wishes to challenge it. This book is clearly his doctoral thesis on political sciences, and its contents are as thorough as expected in this kind of product. The author presents three empirical studies based on cross-survey analysis. The first experiment explores the security justifications for surveillance and how they influence their support. The second one searches whether the stance on surveillance can be made dependent on personal convenience or financial cost. The third study explores whether privacy attitude is context-dependant or can be seen as a stable personality trait. 
The studies aim to address the shortcomings of published literature in the field, mainly, (a) the lack of comprehensive research on state surveillance, needed or better understanding privacy appreciation, (b) while several studies have tackled the subjective measure of privacy, there is a lack of cross-national studies to explain wide-ranging phenomena, (c) most studies in this regard are based on population-based surveys, which cannot establish causal relationships, (d) a seemingly blind acceptance of the privacy calculus mentioned above, with no strong evidence that it accurately measures people s motivations for disclosing or withholding their data. The specific take, including the framing of the tension between privacy and surveillance has long been studied, as can be seen in Steven Nock s 1993 book [2], but as Sannon s article in 2022 shows [3], social and technological realities require our undertanding to be continuously kept up to date. The book is full with theoretical references and does a very good job of explaining the path followed by the author. It is, though, a heavy read, and, for people not coming from the social sciences tradition, leads to the occasional feeling of being lost. The conceptual and theoretical frameworks and presented studies are thorough and clear. The author is honest in explaining when the data points at some of his hypotheses being disproven, while others are confirmed. The aim of the book is for people digging deep into this topic. Personally, I have authored several works on different aspects of privacy (such as a book [4] and a magazine number [5]), but this book did get me thinking on many issues I had not previously considered. Looking for comparable works, I find Friedewald et al. s 2017 book [6] chapter organization to follow a similar thought line. 
My only complaint would be that, for the publication as part of its highly prestigious publisher, little attention has been paid to editorial aspects: sub-subsection depth is often excessive and unclear. Also, when publishing monographs based on doctoral works, it is customary to no longer refer to the work as a thesis and to soften some of the formal requirements such a work often has, with the aim of producing a more gentle and readable book; this book seems just like the mass-production of an (otherwise very interesting and well made) thesis work. References:

12 May 2025

Reproducible Builds: Reproducible Builds in April 2025

Welcome to our fourth report from the Reproducible Builds project in 2025. These monthly reports outline what we've been up to over the past month, and highlight items of news from elsewhere in the increasingly-important area of software supply-chain security. Lastly, if you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. Table of contents:
  1. reproduce.debian.net
  2. Fifty Years of Open Source Software Supply Chain Security
  3. 4th CHAINS Software Supply Chain Workshop
  4. Mailing list updates
  5. Canonicalization for Unreproducible Builds in Java
  6. OSS Rebuild adds new TUI features
  7. Distribution roundup
  8. diffoscope & strip-nondeterminism
  9. Website updates
  10. Reproducibility testing framework
  11. Upstream patches

reproduce.debian.net The last few months have seen the introduction, development and deployment of reproduce.debian.net. In technical terms, this is an instance of rebuilderd, our server designed to monitor the official package repositories of Linux distributions and attempt to reproduce the observed results there. This month, however, we are pleased to announce that reproduce.debian.net now tests all the Debian trixie architectures except s390x and mips64el. The ppc64el architecture was added through the generous support of Oregon State University Open Source Laboratory (OSUOSL), and we can support the armel architecture thanks to CodeThink.

Fifty Years of Open Source Software Supply Chain Security Russ Cox has published a must-read article in ACM Queue on Fifty Years of Open Source Software Supply Chain Security. Subtitled "For decades, software reuse was only a lofty goal. Now it's very real.", Russ' article goes on to outline the history and original goals of software supply-chain security in the US military in the early 1970s, all the way to the XZ Utils backdoor of 2024. Through that lens, Russ explores the problem and how it has changed, and hasn't changed, over time. He concludes as follows:
We are all struggling with a massive shift that has happened in the past 10 or 20 years in the software industry. For decades, software reuse was only a lofty goal. Now it's very real. Modern programming environments such as Go, Node and Rust have made it trivial to reuse work by others, but our instincts about responsible behaviors have not yet adapted to this new reality. We all have more work to do.

4th CHAINS Software Supply Chain Workshop Convened as part of the CHAINS research project at the KTH Royal Institute of Technology in Stockholm, Sweden, the 4th CHAINS Software Supply Chain Workshop occurred during April. The programme included a number of relevant sessions, and the full listing of the agenda is available on the workshop's website.

Mailing list updates On our mailing list this month:
  • Luca DiMaio of Chainguard posted to the list reporting that they had successfully implemented reproducible filesystem images with both ext4 and an EFI system partition. They go on to list the various methods, and the thread generated at least fifteen replies.
  • David Wheeler announced that the OpenSSF is building a glossary of sorts in order that they consistently use the same meaning for the same term and, moreover, that they have drafted a definition for "reproducible build". The thread generated a significant number of replies on the definition, leading to a potential update to Reproducible Builds' own definition.
  • Lastly, kpcyrd posted to the list with a timely reminder and update on their repro-env tool. As first reported in our July 2023 report, kpcyrd mentions that:
    My initial interest in reproducible builds was "how do I distribute pre-compiled binaries on GitHub without people raising security concerns about them". I've cycled back to this original problem about 5 years later and built a tool that is meant to address this. [...]

Canonicalization for Unreproducible Builds in Java Aman Sharma, Benoit Baudry and Martin Monperrus have published a new scholarly study related to reproducible builds within Java. Titled Canonicalization for Unreproducible Builds in Java, the article's abstract is as follows:
[...] Achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using OSS-Rebuild and jNorm respectively. Finally, we present Chains-Rebuild, a tool that raises reproducibility success from 9.48% to 26.89% on 12,283 unreproducible artifacts. To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and Chains-Rebuild, a canonicalization tool for mitigating unreproducible builds in Java.
A full PDF of their article is available from arXiv.
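As a rough illustration of what artifact canonicalization means in general, the Python sketch below rewrites a ZIP archive (JAR files are ZIP archives) with sorted entry names, a fixed timestamp and fixed permissions, so that two archives with identical contents become byte-identical. This is an assumption-laden toy, not how OSS-Rebuild, jNorm or Chains-Rebuild work internally; `canonicalize_zip` and `FIXED_TIMESTAMP` are hypothetical names introduced here.

```python
import io
import zipfile

# Hypothetical constant: the earliest date the ZIP format can represent.
FIXED_TIMESTAMP = (1980, 1, 1, 0, 0, 0)


def canonicalize_zip(data: bytes) -> bytes:
    """Rewrite a ZIP archive so that entry order and metadata no longer vary."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as source, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as target:
        # Sorting removes differences caused by file system traversal order.
        for name in sorted(source.namelist()):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIMESTAMP)
            info.external_attr = 0o644 << 16  # same permissions for every entry
            info.compress_type = zipfile.ZIP_STORED
            target.writestr(info, source.read(name))
    return out.getvalue()
```

Comparing the canonical forms of two builds, rather than the raw artifacts, separates differences in content from differences in packaging metadata. Storing entries uncompressed is a simplification that sidesteps compressor-version differences at the cost of size.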

OSS Rebuild adds new TUI features OSS Rebuild aims to automate rebuilding upstream language packages (e.g. from PyPI, crates.io and npm registries) and publish signed attestations and build definitions for public use. OSS Rebuild ships a text-based user interface (TUI) for viewing, launching, and debugging rebuilds. While previously requiring ownership of a full instance of OSS Rebuild's hosted infrastructure, the TUI now supports a fully local mode of build execution and artifact storage. Thanks to Giacomo Benedetti for his usage feedback and work to extend the local-only development toolkit. Another feature added to the TUI was an experimental chatbot integration that provides interactive feedback on rebuild failure root causes and suggests fixes.

Distribution roundup In Debian this month:
  • Roland Clobus posted another status report on reproducible ISO images on our mailing list this month, with the summary that "all live images build reproducibly from the online Debian archive".
  • Debian developer Simon Josefsson published another two reproducibility-related blog posts this month, the first on the topic of Verified Reproducible Tarballs. Simon sardonically challenges the reader as follows: "Do you want a supply-chain challenge for the Easter weekend? Pick some well-known software and try to re-create the official release tarballs from the corresponding Git checkout. Is anyone able to reproduce anything these days?" After that, they also published a blog post on Building Debian in a GitLab Pipeline using their multi-stage rebuild approach.
  • Roland also posted to our mailing list to highlight that "there is now another tool in Debian that generates reproducible output, equivs". This is a tool to create trivial Debian packages that might Depend on other packages. As Roland writes, "building the [equivs] package has been reproducible for a while, [but] now the output of the [tool] has become reproducible as well".
  • Lastly, 9 reviews of Debian packages were added, 10 were updated and 10 were removed this month, adding to our extensive knowledge about identified issues.
The IzzyOnDroid Android APK repository made more progress in April. Thanks to funding by NLnet and Mobifree, the project was also able to put more time into their tooling. For instance, developers can now easily run their own verification builder in "less than 5 minutes". This currently supports Debian-based systems, but support for RPM-based systems is incoming.
  • The rbuilder_setup tool can now set up the entire framework within less than five minutes. The process is configurable, too, so everything from just the basics to verify builds up to a fully-fledged RB environment is also possible.
  • This tool works on Debian, RedHat and Arch Linux, as well as their derivatives. The project has received successful reports from Debian, Ubuntu, Fedora and some Arch Linux derivatives so far.
  • Documentation on how to work with reproducible builds (making apps reproducible, debugging unreproducible packages, etc) is available in the project's wiki page.
  • Future work is also in the pipeline, including documentation, guidelines and helpers for debugging.
NixOS defined an Outreachy project for improving build reproducibility. In the application phase, NixOS saw some strong candidates providing contributions, both on the NixOS side and upstream: guider-le-ecit analyzed a libpinyin issue. Tessy James fixed an issue in arandr and helped analyze one in libvlc that led to a proposed upstream fix. Finally, 3pleX fixed an issue which was accepted in upstream kitty, one in upstream maturin, one in upstream python-sip and one in the Nix packaging of python-libbytesize. Sadly, the funding for this internship fell through, so NixOS were forced to abandon their search. Lastly, in openSUSE news, Bernhard M. Wiedemann posted another monthly update for their work there.

diffoscope & strip-nondeterminism diffoscope is our in-depth and content-aware diff utility that can locate and diagnose reproducibility issues. This month, Chris Lamb made the following changes, including preparing and uploading a number of versions to Debian:
  • Use the --walk argument over the potentially dangerous alternative --scan when calling out to zipdetails(1).
  • Correct a longstanding issue where many >-based version tests used in conditional fixtures were broken. This was used to ensure that specific tests were only run when the version on the system was newer than a particular number. Thanks to Colin Watson for the report (Debian bug #1102658).
  • Address a long-hidden issue in the test_versions testsuite as well, where we weren't actually testing the greater-than comparisons mentioned above, as it was masked by the tests for equality.
  • Update copyright years.
In strip-nondeterminism, however, Holger Levsen updated the Continuous Integration (CI) configuration in order to use the standard Debian pipelines via debian/salsa-ci.yml instead of using .gitlab-ci.yml.

Website updates Once again, there were a number of improvements made to our website this month including:
  • Aman Sharma added OSS-Rebuild's stabilize tool to the Tools page.
  • Chris Lamb added a configure.ac (GNU Autotools) example for using SOURCE_DATE_EPOCH. Chris also updated the SOURCE_DATE_EPOCH snippet and moved the archive metadata to a more suitable location.
  • Denis Carikli added GNU Boot to our ever-evolving Projects page.
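For readers unfamiliar with SOURCE_DATE_EPOCH (an environment variable holding a Unix timestamp that build tools should use instead of the current time), here is a minimal Python sketch of how a build script might honour it. This is not the configure.ac example or the snippet mentioned above; the function names and the `example-tool` banner are hypothetical.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp() -> datetime:
    """Return the build date, preferring SOURCE_DATE_EPOCH over the wall clock."""
    # Fall back to the current time only when the variable is not set.
    epoch = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
    # Always use UTC so the result does not depend on the builder's time zone.
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


def build_banner(version: str) -> str:
    """Hypothetical version string that embeds the (reproducible) build date."""
    return f"example-tool {version} (built {build_timestamp():%Y-%m-%d})"


if __name__ == "__main__":
    print(build_banner("1.0"))
```

With the variable exported to the same value, two builds on different days and machines embed the same date, which removes one common source of unreproducibility.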

Reproducibility testing framework The Reproducible Builds project operates a comprehensive testing framework running primarily at tests.reproducible-builds.org in order to check packages and other artifacts for reproducibility. In April, a number of changes were made by Holger Levsen, including:
  • reproduce.debian.net-related:
    • Add armel.reproduce.debian.net to support the armel architecture.
    • Add a new ARM node, codethink05.
    • Add ppc64el.reproduce.debian.net to support testing of the ppc64el architecture.
    • Improve the reproduce.debian.net front page.
    • Make various changes to the ppc64el nodes.
    • Make various changes to the arm64 and armhf nodes.
    • Various changes related to the rebuilderd-worker entry point.
    • Create and deploy a pkgsync script.
    • Fix the monitoring of the riscv64 architecture.
    • Make a number of changes related to starting the rebuilderd service.
  • Backup-related:
    • Backup the rebuilder databases every week.
    • Improve the node health checks.
  • Misc:
    • Re-use existing connections to the SSH proxy node on the riscv64 nodes.
    • Node maintenance.
In addition:
  • Jochen Sprickerhof fixed the riscv64 host names and requested access to all the rebuilderd nodes.
  • Mattia Rizzolo updated the self-serve rebuild scheduling tool, replacing the deprecated SSO-style authentication with OpenIDC, which authenticates against salsa.debian.org.
  • Roland Clobus updated the configuration for the osuosl3 node to designate 4 workers for bigger builds.

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches.

Finally, if you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website.

17 December 2024

Gunnar Wolf: The science of detecting LLM-generated text

This post is a review for Computing Reviews for "The science of detecting LLM-generated text", an article published in Communications of the ACM
While artificial intelligence (AI) applications for natural language processing (NLP) are no longer something new or unexpected, nobody can deny the revolution and hype that started, in late 2022, with the announcement of the first public version of ChatGPT. By then, synthetic translation was well established and regularly used, many chatbots had started attending users' requests on different websites, voice recognition personal assistants such as Alexa and Siri had been widely deployed, and complaints of news sites filling their space with AI-generated articles were already commonplace. However, the ease of prompting ChatGPT or other large language models (LLMs) and getting extensive answers (its text generation quality is so high that it is often hard to discern whether a given text was written by an LLM or by a human) has sparked significant concern in many different fields. This article was written to present and compare the current approaches to detecting human- or LLM-authorship in texts. The article presents several different ways LLM-generated text can be detected. The first, and main, taxonomy followed by the authors is whether the detection can be done aided by the LLM's own functions ("white-box detection") or only by evaluating the generated text via a public application programming interface (API) ("black-box detection"). For black-box detection, the authors suggest training a classifier to discern the origin of a given text. Although this works at first, this task is doomed from its onset to be highly vulnerable to new LLMs generating text that will not follow the same patterns, and thus will probably evade recognition. The authors report that human evaluators find human-authored text to be more emotional and less objective, and that it uses grammar to indicate the tone of the sentiment that should be used when reading the text, a trait that has not been picked up by LLMs yet.
Human-authored text also tends to have higher sentence-level coherence, with less term repetition in a given paragraph. The frequency distribution for more and less common words is much more homogeneous in LLM-generated texts than in human-written ones. White-box detection includes strategies whereby the LLMs will cooperate in identifying themselves in ways that are not obvious to the casual reader. This can include watermarking, be it rule based or neural based; in this case, both processes become a case of steganography, as the involvement of an LLM is explicitly hidden and spread through the full generated text, aiming at having a low detectability and high recoverability even when parts of the text are edited. The article closes by listing the authors' concerns about all of the above-mentioned technologies. Detecting an LLM, be it with or without the collaboration of the LLM's designers, is more of an art than a science, and methods deemed as robust today will not last forever. We also cannot assume that LLMs will continue to be dominated by the same core players; LLM technology has been deeply studied, and good LLM engines are available as free/open-source software, so users needing to do so can readily modify their behavior. This article presents itself as merely a survey of methods available today, while also acknowledging the rapid progress in the field. It is timely and interesting, and easy to follow for the informed reader coming from a different subfield.

11 November 2024

Gunnar Wolf: Why academics under-share research data - A social relational theory

This post is a review for Computing Reviews for "Why academics under-share research data - A social relational theory", an article published in Journal of the Association for Information Science and Technology
As an academic, I have cheered for and welcomed the open access (OA) mandates that, slowly but steadily, have been accepted in one way or another throughout academia. It is now often accepted that public funds mean public research. Many of our universities or funding bodies will demand that, with varying intensities: sometimes they demand research to be published in an OA venue, sometimes a mandate will only prefer it. Lately, some journals and funder bodies have expanded this mandate toward open science, requiring not only research outputs (that is, articles and books) to be published openly but for the data backing the results to be made public as well. As a person who has been involved with free software promotion since the mid 1990s, it was natural for me to join the OA movement and to celebrate when various universities adopt such mandates. Now, what happens after a university or funder body adopts such a mandate? Many individual academics cheer, as it is the right thing to do. However, the authors observe that this is not really followed thoroughly by academics. What can be observed, rather, is the slow pace or feet dragging of academics when they are compelled to comply with OA mandates, or even an outright refusal to do so. If OA and open science are close to the ethos of academia, why aren't more academics enthusiastically sharing the data used for their research? This paper finds a subversive practice embodied in the refusal to comply with such mandates, and explores a hypothesis based on Karl Marx's productive worker theory and Pierre Bourdieu's ideas of symbolic capital.
The paper explains that academics, as productive workers, become targets for exploitation: given that it's not only the academics' sharing ethos, but private industry's push for data collection and industry-aligned research, they adapt to technological changes and jump through all kinds of hurdles to create more products, in a result that can be understood as a neoliberal productivity measurement strategy. Neoliberalism assumes that mechanisms that produce more profit for academic institutions will result in better research; it also leads to the disempowerment of academics as a class, although they are rewarded as individuals due to the specific value they produce. The authors continue by explaining how open science mandates seem to ignore the historical ways of collaboration in different scientific fields, and exploring different angles of how and why data can be seen as under-shared, failing to comply with different aspects of said mandates. This paper, built on the social sciences tradition, is clearly a controversial work that can spark interesting discussions. While it does not specifically touch on computing, it is relevant to Computing Reviews readers due to the relatively high percentage of academics among us.

21 September 2024

Gunnar Wolf: 50 years of queries

This post is a review for Computing Reviews for "50 years of queries", an article published in Communications of the ACM
The relational model is probably the one innovation that brought computers to the mainstream for business users. This article by Donald Chamberlin, creator of one of the first query languages (that evolved into the ubiquitous SQL), presents its history as a commemoration of the 50th anniversary of his publication of said query language. The article begins by giving background on information processing before the advent of today's database management systems: with systems storing and processing information based on sequential-only magnetic tapes in the 1950s, adopting a record-based, fixed-format filing system was far from natural. The late 1960s and early 1970s saw many fundamental advances, among which one of the best known is E. F. Codd's relational model. The first five pages (out of 12) present the evolution of the data management community up to the 1974 SIGFIDET conference. This conference was so important in the eyes of the author that, in his words, it is the event that starts the clock on 50 years of relational databases. The second part of the article tells about the growth of the structured English query language (SEQUEL), eventually renamed SQL, including the importance of its standardization and its presence in commercial products as the dominant database language since the late 1970s. Chamberlin presents short histories of the various implementations, many of which remain dominant names today, that is, Oracle, Informix, and DB2. Entering the 1990s, open-source communities introduced MySQL, PostgreSQL, and SQLite. The final part of the article presents controversies and criticisms related to SQL and the relational database model as a whole. Chamberlin presents the main points of controversy throughout the years: 1) the SQL language lacks orthogonality; 2) SQL tables, unlike formal relations, might contain null values; and 3) SQL tables, unlike formal relations, may contain duplicate rows.
He explains the issues and tradeoffs that guided the language design as it unfolded. Finally, a section presents several points that explain how SQL and the relational model have remained, for 50 years, a winning concept, as well as some thoughts regarding the NoSQL movement that gained traction in the 2010s. This article is written with clear language and structure, making it easy and pleasant to read. It does not drive a technical point, but instead is a recap on half a century of developments in one of the fields most important to the commercial development of computing, written by one of the greatest authorities on the topic.
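The second and third criticisms listed above are easy to observe in practice. The following Python sketch uses the standard `sqlite3` module with an in-memory database; the `visits` table and its contents are invented for illustration and do not come from Chamberlin's article.

```python
import sqlite3


def demo() -> tuple:
    """Show duplicate rows and NULL handling in a plain SQL table."""
    conn = sqlite3.connect(":memory:")
    # No primary key or UNIQUE constraint, so nothing forbids repeated rows.
    conn.execute("CREATE TABLE visits (name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [("Ada", "London"), ("Ada", "London"), ("Edgar", None)],
    )
    all_rows = conn.execute("SELECT name, city FROM visits").fetchall()
    # DISTINCT recovers set semantics, as in a formal relation.
    distinct = conn.execute("SELECT DISTINCT name, city FROM visits").fetchall()
    # Comparing with NULL yields "unknown", so this predicate matches no row.
    null_eq = conn.execute(
        "SELECT COUNT(*) FROM visits WHERE city = NULL").fetchone()[0]
    is_null = conn.execute(
        "SELECT COUNT(*) FROM visits WHERE city IS NULL").fetchone()[0]
    conn.close()
    return len(all_rows), len(distinct), null_eq, is_null


if __name__ == "__main__":
    print(demo())  # (3, 2, 0, 1)
```

The table holds three rows but only two distinct ones, and the row with a missing city is found by `IS NULL` but not by `= NULL`: both behaviours depart from the set-based, two-valued logic of formal relations.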

2 September 2024

Gunnar Wolf: Free and open source software and other market failures

This post is a review for Computing Reviews for "Free and open source software and other market failures", an article published in Communications of the ACM
Understanding the free and open-source software (FOSS) movement has, since its beginning, implied crossing many disciplinary boundaries. This article describes FOSS's history, explaining its undeniable success throughout the 1990s, and why the movement today feels in a way as if it were on autopilot, lacking the steam it once had. The author presents several examples of different industries where, as happened with FOSS in computing, fundamental innovations happened not because the leading companies of each field were attentive to customers' needs but, to a certain degree, despite their not even considering those needs, typically due to the hubris that comes from being a market leader. Kamp exemplifies his hypothesis by presenting the messy landscape of the commercial, mutually incompatible systems of Unix in the 1980s. Different companies had set out to implement their particular flavor of open Unix computers, but with clear examples of vendor lock-in techniques. He speculates that, if we had been able to buy a reasonably priced and solid Unix for our 32-bit PCs, nobody would be running FreeBSD or Linux today, except possibly as an obscure hobby. He states that the FOSS movement was born out of the utter market failure of the different Unix vendors. The focus of the article then shifts to the FOSS movement itself: 25 years ago, as FOSS systems slowly gained acceptance and then adoption in the serious market and at the center of the dot-com boom of the early 2000s, Linux user groups (LUGs) with tens of thousands of members bloomed throughout the world; knowing this history, why have all but a few of them vanished into oblivion? Kamp suggests that the strength and vitality that LUGs had ultimately reflected the anger that prompted technical users to take the situation into their own hands and fix it; once the software industry was forced to change, the strongly cohesive FOSS movement diluted.
"The frustrations and anger of [information technology, IT] in 2024," Kamp writes, "are entirely different from those of 1991." As an example, the author closes by citing the difficulty of maintaining, despite having the resources to do so, an aging legacy codebase that needs to continue working year after year.

25 May 2024

Gunnar Wolf: How computers make books: from graphics rendering, search algorithms, and functional programming to indexing and typesetting

This post is a review for Computing Reviews for "How computers make books: from graphics rendering, search algorithms, and functional programming to indexing and typesetting", a book published by Manning
If we look at the age-old process of creating books, how many different areas can a computer help us with? And how can each of them be used to teach computer science (CS) fundamentals to a nontechnical audience? This is the premise of John Whitington's enticing book, and the result is quite amazing. The book immediately drew my attention when looking at the titles available for review. After all, my initiation into computing as a kid was learning the LaTeX typesetting system while my father worked on his first book on scientific language and typography [1]. Whitington picks 11 different technical aspects of book production, from how dots of ink are transferred to a white page and how they are made into controllable, recognizable shapes, all the way to forming beautiful typefaces and the nuances of properly addressing white-space to present aesthetically pleasing paragraphs, building it all into specific formats aimed at different ends. But if we dig beyond just the chapter titles, we will find a very interesting book on CS that, without ever using technical language or notation, presents aspects as varied as anti-aliasing, vector and raster images, character sets such as ASCII and Unicode, an introduction to programming, input methods for different writing systems, efficient encoding (compression) methods, both for text and images, lossless and lossy, and recursion and dithering methods. To my absolute surprise, while the author thankfully spared the reader the syntax usually associated with LISP-related languages, the programming examples clearly stem from the LISP school, presenting solutions based on tail recursion. Of course, it is no match for Donald Knuth's classic book on this same topic [2], but could very well be a primer for readers to approach it. The book is light and easy to read, and keeps a very informal, nontechnical tone throughout.
My only complaint relates to reading it in PDF format; the topic of this book, and the care with which the images were provided by the author, warrant high resolution. The included images are not only decorative but an integral part of the book. Maybe this is specific to my review copy, but all of the raster images were in very low resolution. This book is quite different from what readers may usually expect, as it introduces several significant topics in the field. CS professors will enjoy it, of course, but also readers with a humanities background, students new to the field, or even those who are just interested in learning a bit more.

References
  1. Sánchez y Gándara, A.; Magariños Lamas, F.; Wolf, K. B. Manual de lenguaje y tipografía científica en castellano. Trillas, Mexico City, Mexico, 1986, https://www.fis.unam.mx/~bwolf/manual.html
  2. Knuth, D. E. Digital typography. CSLI Lecture Notes. CSLI Publications, Stanford, CA, 1999, https://www-cs-faculty.stanford.edu/~knuth/dt.html

9 May 2024

Gunnar Wolf: Hacks, leaks, and revelations: The art of analyzing hacked and leaked data

This post is a review for Computing Reviews for Hacks, leaks, and revelations: The art of analyzing hacked and leaked data, a book published by No Starch Press.
Imagine you've come across a trove of files documenting a serious deed and you feel the need to blow the whistle. Or maybe you are an investigative journalist and this whistleblower trusts you and wants to give you said data. Or maybe you are a technical person, trusted by said journalist to help them do things right: not only to help them avoid being exposed while leaking the information, but also to assist them in analyzing the contents of the dataset. This book will be a great aid for all of the above tasks. The author, Micah Lee, is both a journalist and a computer security engineer. The book is written entirely from his experience handling important datasets, and is organized in a very logical and sound way. Lee organized the 14 chapters in five parts. The first part (the most vital to transmitting the book's message, in my opinion) begins by talking about the care that must be taken when handling a sensitive dataset: how to store it, how to communicate it to others, sometimes even what to redact (exclude) so the information retains its strength but does not endanger others (or yourself). The first two chapters introduce several tools for encrypting information and keeping communication anonymous, not getting too deep into details and keeping it aimed at a mostly nontechnical audience. Something that really sets this book apart from others like it is that Lee's aim is not only to tell stories about the hacks and leaks he has worked with, or to present the technical details on how he analyzed them, but to teach readers how to do the work. From Part 2 onward the book adopts a tutorial style, teaching the reader numerous tools for obtaining and digging information out of huge and very timely datasets. Lee guides the reader through various data breaches, all of them leaked within the last five years: BlueLeaks, Oath Keepers email dumps, Heritage Foundation, Parler, Epik, and Cadence Health. 
He guides us through a tutorial on using the command line (mostly targeted at Linux, but considering macOS and Windows as well), running Docker containers, learning the basics of Python, parsing and filtering structured data, writing small web applications for getting at the right bits of data, and working with structured query language (SQL) databases. The book does an excellent job of fulfilling its very ambitious aims, and this is even more impressive given the wide range of professional profiles it is written for; that being said, I do have a couple of critiques. First, the book is ideologically loaded: the datasets all expose the alt-right movement that has gained strength in the last decade. Lee takes the reader through many instances of COVID deniers, rioters for Donald Trump during the January 2021 attempted coup, attacks against Black Lives Matter activists, and other extremism research; thus this book could alienate right-wing researchers, who might also be involved in handling important whistleblowing cases. Second, given the breadth of the topic and my 30-plus years of programming experience, I was very interested in the first part of each chapter but less so in the tutorial part. I suppose a journalist reading through the same text might find the sections about the importance of data handling and source protection to be similarly introductory. This is unavoidable, of course, given the nature of this work. However, while Micah Lee is an excellent example of a journalist with the appropriate technical know-how to process the types of material he presents as examples, expecting any one person to become a professional in both fields is asking too much. All in all, this book is excellent. The writing style is informal and easy to read, the examples are engaging, and the analysis is very good. It will certainly teach you something, no matter your background, and it might very well complement your professional skills.

7 March 2024

Gunnar Wolf: Constructed truths: truth and knowledge in a post-truth world

This post is a review for Computing Reviews for Constructed truths: truth and knowledge in a post-truth world, a book published in Springer Link.
Many of us grew up used to having some news sources we could implicitly trust, such as well-positioned newspapers and radio or TV news programs. We knew they would only hire responsible journalists rather than risk diluting public trust and losing their brand's value. However, with the advent of the Internet and social media, we are witnessing what has been termed the post-truth phenomenon. The undeniable freedom that horizontal communication has given us automatically brings with it the emergence of filter bubbles and echo chambers, and truth seems to become a group belief. Contrary to my original expectations, the core topic of the book is not about how current-day media brings about post-truth mindsets. Instead it goes into a much deeper philosophical debate: What is truth? Does truth exist by itself, objectively, or is it a social construct? If activists with different political leanings debate a given subject, is it even possible for them to understand the same points for debate, or do they truly experience parallel realities? The author wrote this book clearly prompted by the unprecedented events that took place in 2020, as the COVID-19 crisis forced humanity into isolation and online communication. Donald Trump is explicitly and repeatedly presented throughout the book as an example of an actor that took advantage of the distortions caused by post-truth. The first chapter frames the narrative from the perspective of information flow over the last several decades, on how the emergence of horizontal, uncensored communication free of editorial oversight started empowering the netizens and created a temporary information flow utopia. But soon afterwards, algorithmic gatekeepers started appearing, creating a set of personalized distortions on reality; users started getting news aligned to what they already showed interest in. 
This led to an increase in polarization and the growth of narrative-framing-specific communities that served as echo chambers for disjoint views on reality. That, in turn, fed the growth of conspiracy theories and, necessarily, the science denial and pseudoscience that reached unimaginable peaks during the COVID-19 crisis. Finally, when readers decide based on completely subjective criteria whether a scientific theory such as global warming is true or propaganda, or question what most traditional news outlets present as facts, we face the phenomenon known as fake news. Fake news leads to post-truth, a state where it is impossible to distinguish between truth and falsehood, and serves only a rhetorical function, making rational discourse impossible. Toward the end of the first chapter, the tone of writing turns away from describing developments in the spread of news and facts over the last decades and quickly goes deep into philosophy, into the very thorny subject pursued by said discipline for millennia: How can truth be defined? Can different perspectives bring about different truth values for any given idea? Does truth depend on the observer, on their knowledge of facts, on their moral compass or on their honest opinions? Zoglauer dives into epistemology, following various thinkers' ideas on what can be understood as truth: constructivism (whether knowledge and truth values can be learnt by an individual building from their personal experience), objectivity (whether experiences, and thus truth, are universal, or whether they are naturally individual), and whether we can proclaim something to be true when it corresponds to reality. For the final chapter, he dives into the role information and knowledge play in assigning and understanding truth value, as well as the value of second-hand knowledge: Do we really own knowledge because we can look up facts online (even if we carefully check the sources)? 
Can I, without any medical training, diagnose a sickness and treatment by honestly and carefully looking up its symptoms in medical databases? Wrapping up, while I very much enjoyed reading this book, I must confess it is completely different from what I expected. This book digs much more into the abstract than into information flow in modern society, or the impact on early 2020s politics as its editorial description suggests. At 160 pages, the book is not a heavy read, and Zoglauer's writing style is easy to follow, even across the potentially very deep topics it presents. Its main readership is not necessarily computing practitioners or academics. However, for people trying to better understand epistemology through its expressions in the modern world, it will be a very worthy read.

23 February 2024

Gunnar Wolf: 10 things software developers should learn about learning

This post is a review for Computing Reviews for 10 things software developers should learn about learning, an article published in Communications of the ACM.
As software developers, we understand the detailed workings of the different components of our computer systems. And, probably due to how computers were presented since their appearance as "digital brains" in the 1940s, we sometimes believe we can transpose that knowledge to how our biological brains work, be it as learners or as problem solvers. This article aims at making the reader understand several mechanisms related to how learning and problem solving actually work in our brains. It focuses on helping expert developers convey knowledge to new learners, as well as learners who need to get up to speed and start coding. The article's narrative revolves around software developers, but much of what it presents can be applied to different problem domains. The article takes this mission through ten points, with roughly the same space given to each of them, starting with wrong assumptions many people have about the similarities between computers and our brains. The first section, "Human Memory Is Not Made of Bits," explains the brain processes of remembering as a way of strengthening the force of a memory ("reconsolidation") and the role of activation in related network pathways. The second section, "Human Memory Is Composed of One Limited and One Unlimited System," goes on to explain the organization of memories in the brain between long-term memory (functionally limitless, permanent storage) and working memory (storing small amounts of information used for solving a problem at hand). However, the focus soon shifts to how experience in knowledge leads to different ways of using the same concepts, the importance of going from abstract to concrete knowledge applications and back, and the role of skills repetition over time. Toward the end of the article, the focus shifts from the mechanical act of learning to expertise. 
Section 6, "The Internet Has Not Made Learning Obsolete," emphasizes that problem solving is not just putting together the pieces of a puzzle; searching online for solutions to a problem does not activate the neural pathways that would get fired up otherwise. The final sections tackle the differences that expertise brings to play when teaching or training a newcomer: the same tools that help the beginner's productivity as "training wheels" will often hamper the expert user's, as their knowledge has become automated. The article is written with a very informal and easy-to-read tone and vocabulary, and brings forward several issues that might seem like common sense but do ring bells when it comes to my own experiences both as a software developer and as a teacher. The article closes by suggesting several books that further expand on the issues it brings forward. While I could not identify a single focus or thesis with which to characterize this article, the several points it makes will likely help readers better understand (and bring forward to consciousness) mental processes often taken for granted, and consider often-overlooked aspects when transmitting knowledge to newcomers.

20 January 2024

Gunnar Wolf: A deep learning technique for intrusion detection system using a recurrent neural networks based framework

This post is a review for Computing Reviews for A deep learning technique for intrusion detection system using a recurrent neural networks based framework, an article published in Computer Communications.
So let's assume you already know and understand that artificial intelligence's main building blocks are perceptrons, that is, mathematical models of neurons. And you know that, while a single perceptron is too limited to get interesting information from, very interesting structures (neural networks) can be built with them. You also understand that neural networks can be trained with large datasets, and you can get them to become quite efficient and accurate classifiers for data comparable to your dataset. Finally, you are interested in applying this knowledge to defensive network security, particularly in choosing the right recurrent neural network (RNN) framework to create an intrusion detection system (IDS). Are you still with me? Good! This paper might be right for you! The paper builds on a robust and well-written introduction and related work sections to arrive at explaining in detail what characterizes an RNN, the focus of this work, among other configurations also known as neural networks, and why they are particularly suited for machine learning (ML) tasks. RNNs must be trained for each problem domain, and publicly available datasets are commonly used for such tasks. The authors present two labeled datasets representing normal and hostile network data, identified according to different criteria: NSL-KDD and UNSW-NB15. They proceed to show a framework to analyze and compare different RNNs and run them against said datasets, segmented for separate training and validation phases, compare results, and finally select the best available model for the task, measuring both training speed and classification accuracy. The paper is quite heavy due to both its domain-specific terminology (many acronyms are used throughout the text) and its use of mathematical notation, both to explain specific properties of each of the RNN types and for explaining the preprocessing carried out for feature normalization and selection. 
This is partly what led me to start the first paragraph by assuming that we, as readers, already understand a large body of material if we are to fully follow the text. The paper does begin by explaining its core technologies, but quickly ramps up and might get too technical for nonexpert readers. It is undeniably an interesting and valuable read, showing the state of the art in IDS and ML-assisted technologies. It does not detail any specific technology applying its findings, but we will probably find the information conveyed here soon enough in industry publications.
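To make the evaluation loop described above a bit more tangible, here is a minimal sketch in plain Python. It is not the authors' framework: the function names, the toy threshold "model" standing in for an RNN, and the choice of min-max scaling are all my own assumptions for illustration. It only shows the general shape of the process: normalize the features, split the records into training and validation parts, then measure training time and classification accuracy for a candidate model.

```python
import time

def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def split(rows, labels, train_fraction=0.8):
    """Cut the dataset, in order, into a training and a validation part."""
    cut = int(len(rows) * train_fraction)
    return (rows[:cut], labels[:cut]), (rows[cut:], labels[cut:])

def evaluate(train_fn, train, valid):
    """Return (training time in seconds, validation accuracy) for one model."""
    start = time.perf_counter()
    predict = train_fn(*train)
    elapsed = time.perf_counter() - start
    xs, ys = valid
    correct = sum(predict(x) == y for x, y in zip(xs, ys))
    return elapsed, correct / len(ys)

def train_threshold(xs, ys):
    """Hypothetical stand-in for a real model: label a record as hostile (1)
    when its first feature exceeds the mean seen during training."""
    threshold = sum(x[0] for x in xs) / len(xs)
    return lambda x: int(x[0] > threshold)

# Toy data: one feature per record, label 1 means "hostile".
rows = [[0], [1], [9], [10], [2], [8]]
labels = [0, 0, 1, 1, 0, 1]

train, valid = split(min_max_normalize(rows), labels, train_fraction=0.5)
seconds, accuracy = evaluate(train_threshold, train, valid)
```

A real comparison would call `evaluate` once per candidate model and keep the one with the best trade-off between the two numbers. Note that this sketch normalizes before splitting, which is a simplification; a careful setup would compute the scaling parameters from the training part only.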

22 December 2023

Gunnar Wolf: Pushing some reviews this way

Over roughly the last year and a half I have been participating as a reviewer in ACM's Computing Reviews, and have even been honored as a Featured Reviewer. Given I have long enjoyed reading friends' reviews of their reading material (particularly, hats off to the very active Russ Allbery, who both beats all of my frequency expectations (I could never sustain the rhythm he reads at!) and holds documented records for his >20 years as a book reader, with far more clarity and readability than I can aim for!), I decided to explicitly share my reviews via this blog, as the audience is somewhat congruent; I will also link here some reviews that were not approved for publication, clearly marking them so. I will probably work on wrangling my Jekyll site to display an (auto-)updated page and RSS feed for the reviews. In the meantime, the reviews I have published are:

21 March 2022

Gunnar Wolf: Long, long, long live Emacs after 39 years

Reading Planet Debian (see, Sam, we are still having a conversation over there?), I read Anarcat's 20+ years of Emacs. And... Well, should I brag, er, contribute to the discussion? Of course, why not? Emacs is the first computer program I can name that I ever learnt to use to do something minimally useful. 39 years ago.
From the Space Cadet keyboard that (obviously) influenced Emacs' early design
The Emacs editor was born, according to Wikipedia, in 1976, the same year as myself. I am clearly not among its first users. It was already a well-established citizen when I first learnt it; I am fortunate to be the son of a Physics researcher at UNAM. My father used to take me to his institute after he noticed how I was attracted to computers; we would usually spend some hours there between 7 and 11PM on Friday nights. His institute had a computer room where they had very sweet gear: some 10 Heathkit terminals. The terminals were connected (via individual switches) to both a PDP-11 and a Foonly F2 computer. The room also had a beautiful thermal printer, a beautiful Tektronix vector graphics output terminal, and some other stuff. The main use for my father was typesetting some books; he had recently (1979) published Integral Transforms in Science and Engineering (that must be my first mention in scientific literature), and I remember he was working on the proceedings of a conference he held in Oaxtepec (the account he used in the system was oax, not his usual kbw, which he lent me). He was also working on Manual de Lenguaje y Tipografía Científica en Castellano, where you can see some examples of TeX; due to a hardware crash, the book has the rare privilege of being a direct copy of the output of the thermal printer: it was not possible to produce a higher resolution copy for several years. But it is fun and interesting to see what we were able to produce with in-house tools back in 1985! So, what could he teach me so I could use the computers while he worked? TeX, of course. No, not LaTeX (that was published in 1984). LaTeX is a set of macros developed initially by Leslie Lamport, used to make TeX easier; TeX was developed by Donald Knuth, and if I have this information correct, it was Knuth himself who installed and demonstrated TeX on the Foonly computer, during a visit to UNAM. 
Now, after 39 years hammering at Emacs buffers... Have I grown extra fingers? Nope. I cannot even write decent elisp code, and can barely read it. I do use org-mode (a lot!) and love it; I have written basically five books, many articles and lots of presentations and minor documents with it. But I don't read my mail or handle my git from Emacs. I could say, I'm a relative newbie after almost four decades. Four decades... When we got a PC in 1986, my father got the people at the Institute to get him memacs (micro-emacs). There was probably a ten year period I barely used any emacs, but always recognized it. My fingers have memorized a dozen or so movement commands, and a similar number of file management commands. And yes, Emacs and TeX are still the main tools I use day to day.
