Search Results: "tora"

13 October 2024

Andy Simpkins: The state of the art

A long time ago a computer was a woman (I think almost exclusively women, not men) who was employed to do a lot of repetitive mathematics, typically for accounting and stock / order processing. Then along came Lyons, who deployed an artificial computer to perform the same task, only with fewer errors in less time. Modern-day computing was born; we had entered the age of the Digital Computer. These computers were large and consumed huge amounts of power, but were precise, and gave repeatable, verifiable results.

Over time the huge mainframe digital computers shrank in size, increased in performance, and consumed far less power, so much so that they often didn't need the specialist CFC-based, refrigerated liquid cooling systems of their bigger mainframe counterparts, only requiring forced air flow, and occasionally just convection cooling. They shrank so far, and became cheap enough, that the Personal Computer came to be, replacing the mainframe with its time-shared resources with a machine per user. Desktop or even portable laptop computers were everywhere.

We networked them together, so now we could share information around the office; a few computers were given the specialist task of being available all the time so we could share documents, or host databases. These servers were basically PCs designed to operate 24/7, usually more powerful than their desktop counterparts (or at least with faster storage and networking). Next we joined these networks together and the internet was born. The dream of a paperless office might actually become realised: we could now send messages (and documents) from one organisation (or individual) to another via email. We could make our specialist computers' applications available outside just the office, and web servers / web apps came of age.

Fast-forward a few years and all of a sudden we need huge data-halls filled with rack-scale machines augmented with exotic GPUs and NPUs, again with refrigerated liquid cooling, all to do the same tasks that we were doing previously, without the magical buzzword that has been named AI; because we all need another dot-com bubble or blockchain bandwagon to jump aboard. Our AI-enabled searches take slightly longer, consume orders of magnitude more power, and, best of all, the results we are given may or may not be correct. Progress: less precise answers, taking longer, consuming more power, without any verification, and often giving a different result if you repeat your question. AND we still need a personal computing device to access this wondrous thing. Remind me again why we are here?

(Timelines and huge swathes of history simply ignored to make an attempted comic point; this is intended to make a point and not be scholarly work.)

Taavi Väänänen: Bulk downloading Wikimedia Commons categories

Wikimedia Commons, the Wikimedia project for freely licensed media files, also contains a bunch of photos by me and photos of me at various events. While I don't think Commons is going away anytime soon, I would still like to have a local copy of those images available on my own storage hardware. Obviously this requires some way to query for photos you want to download. I'm using Commons categories for this, since that's easy to implement and works for both use cases. The Commons community tends to come up with very specific categories that you can use, and if not, you can usually categorize the files yourself.
(Image: me replying 'shh' to a Discord message showing myself categorizing photos about me and accusing me of COI editing. Caption: thankfully Commons has no such thing as a Conflict of Interest (COI) policy.)
There is almost an existing tool for this: Sam Wilson's mwcli project has support for exporting images one has uploaded to Commons. However, I couldn't use that to download photos of me that others have uploaded, plus it's written in PHP and I don't exactly want to deal with the problem of figuring out how to package it in a way I could neatly install it on my NAS. So I wrote my own tool for it, called comload. It's written in Python because Python is easy to deploy (I can just throw it in a .deb and upload it to my internal repository), and because I did not find a Go library to handle Action API pagination for me. The basic usage is like this:
$ comload --subcats "Taavi Väänänen"
This will download any files in Category:Taavi Väänänen and its sub-categories to the current directory. Former image versions, as well as the image description and SDC data, if any, are also included. And it's smart enough to not download any files that are already there on future runs, so you can just throw it in a systemd timer to get any future files (an illustrative sketch follows below). I'd still like it to handle moved files without creating a duplicate copy, but otherwise I'm really happy with the current state. comload is available from PyPI and from my Git server directly, and is licensed under the GPLv3.
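As an illustration of that systemd timer approach, a pair of units could look something like the following (a minimal sketch; the unit names, paths and schedule here are hypothetical and need adapting):

# /etc/systemd/system/comload.service
[Unit]
Description=Mirror Wikimedia Commons categories with comload

[Service]
Type=oneshot
WorkingDirectory=/srv/commons-mirror
ExecStart=/usr/bin/comload --subcats "Taavi Väänänen"

# /etc/systemd/system/comload.timer
[Unit]
Description=Run comload daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

Enabled with systemctl enable --now comload.timer, this re-runs the download on a daily schedule, and the tool's skip-existing behaviour keeps the repeated runs cheap.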

11 October 2024

Steve McIntyre: Rock 5 ITX

It's been a while since I've posted about arm64 hardware. The last machine I spent my own money on was a SolidRun Macchiatobin, about 7 years ago. It's a small (mini-ITX) board with a 4-core arm64 SoC (4 * Cortex-A72) on it, along with things like a DIMM socket for memory, lots of networking, 3 SATA disk interfaces. The Macchiatobin was a nice machine compared to many earlier systems, but it took quite a bit of effort to get it working to my liking. I replaced the on-board U-Boot firmware binary with an EDK2 build, and that helped. After a few iterations we got a new build including graphical output on a PCIe graphics card. Now it worked much more like a "normal" x86 computer. I still have that machine running at home, and it's been a reasonably reliable little build machine for arm development and testing. It's starting to show its age, though - the onboard USB ports no longer work, and so it's no longer useful for doing things like installation testing. :-/ So... I was involved in a conversation in the #debian-arm IRC channel a few weeks ago, and diederik suggested the Radxa Rock 5 ITX. It's another mini-ITX board, this time using a Rockchip RK3588 CPU. Things have moved on - the CPU is now an 8-core big.LITTLE config: 4*Cortex-A76 and 4*Cortex-A55. The board has NVMe on-board, 4*SATA, built-in Mali graphics from the CPU, soldered-on memory. Just about everything you need on an SBC for a small low-power desktop, a NAS or whatever. And for about half the price I paid for the Macchiatobin. I hit "buy" on one of the listed websites. :-) A few days ago, the new board landed. I picked the version with 24GB of RAM and bought the matching heatsink and fan. I set it up in an existing case borrowed from another old machine and tried the Radxa "Debian" build. All looked OK, but I clearly wasn't going to stay with that. Onwards to running a native Debian setup! I installed an EDK2 build from https://github.com/edk2-porting/edk2-rk3588 onto the onboard SPI flash, then rebooted with a Debian 12.7 (Bookworm) arm64 installer image on a USB stick. How much trouble could this be? I was shocked! It Just Worked (TM) I'm running a standard Debian arm64 system. The graphical installer ran just fine. I installed onto the NVMe, adding an Xfce desktop for some simple tests. Everything Just Worked. After many years of fighting with a range of different arm machines (from simple SBCs to desktops and servers), this was without doubt the most straightforward setup I've ever done. Wow! It's possible to go and spend a lot of money on an Ampere machine, and I've seen them work well too. But for a hobbyist user (or even a smaller business), the Rock 5 ITX is a lovely option. Total cost to me for the board with shipping fees, import duty, etc. was just over £240. That's great value, and I can wholeheartedly recommend this board! The two things that are missing compared to the Macchiatobin? This is soldered-on memory (but hey, 24G is plenty for me!) It also doesn't have a PCIe slot, but it has sufficient onboard network, video and storage interfaces that I think it will cover most people's needs. Where's the catch? It seems these are very popular right now, so it can be difficult to find these machines in stock online. FTAOD, I should also point out: I bought this machine entirely with my own money, for my own use for development and testing. I've had no contact with the Radxa or Rockchip folks at all here, I'm just so happy with this machine that I've felt the need to shout about it! :-) Here's some pictures...
(Images: Rock 5 ITX top view; Rock 5 ITX back panel view; Rock 5 EDK2 startup; Rock 5 Xfce login; Rock 5 ITX running Firefox)

8 September 2024

Jacob Adams: Linux's Bedtime Routine

How does Linux move from an awake machine to a hibernating one? How does it then manage to restore all state? These questions led me to read way too much C in trying to figure out how this particular hardware/software boundary is navigated. This investigation will be split into a few parts, with the first one going from invocation of hibernation to synchronizing all filesystems to disk. This article has been written using Linux version 6.9.9, the source of which can be found in many places, but can be navigated easily through the Bootlin Elixir Cross-Referencer: https://elixir.bootlin.com/linux/v6.9.9/source Each code snippet will begin with a link to the above giving the file path and the line number of the beginning of the snippet.

A Starting Point for Investigation: /sys/power/state and /sys/power/disk These two system files exist to allow debugging of hibernation, and thus to control the exact state used directly. Writing specific values to the state file controls the exact sleep mode used, and disk controls the specific hibernation mode1. This is extremely handy as an entry point to understand how these systems work, since we can just follow what happens when they are written to.
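For example, on a machine with hibernation support these files can be driven directly from a root shell (the states listed will vary with kernel configuration and hardware):

# cat /sys/power/state
freeze mem disk
# cat /sys/power/disk
[platform] shutdown reboot suspend test_resume
# echo disk > /sys/power/state    # writing "disk" is what invokes hibernate()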

Show and Store Functions These two files are defined using the power_attr macro: kernel/power/power.h:80
#define power_attr(_name) \
static struct kobj_attribute _name##_attr = {	\
	.attr	= {				\
		.name = __stringify(_name),	\
		.mode = 0644,			\
	},					\
	.show	= _name##_show,			\
	.store	= _name##_store,		\
}
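For illustration, power_attr(state) would expand to roughly the following (hand-expanded here, not a snippet from the kernel tree):

static struct kobj_attribute state_attr = {
	.attr	= {
		.name = __stringify(state),	/* i.e. "state" */
		.mode = 0644,
	},
	.show	= state_show,
	.store	= state_store,
};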
show is called on reads and store on writes. state_show is a little boring for our purposes, as it just prints all the available sleep states. kernel/power/main.c:657
/*
 * state - control system sleep states.
 *
 * show() returns available sleep state labels, which may be "mem", "standby",
 * "freeze" and "disk" (hibernation).
 * See Documentation/admin-guide/pm/sleep-states.rst for a description of
 * what they mean.
 *
 * store() accepts one of those strings, translates it into the proper
 * enumerated value, and initiates a suspend transition.
 */
static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
			  char *buf)
{
	char *s = buf;
#ifdef CONFIG_SUSPEND
	suspend_state_t i;

	for (i = PM_SUSPEND_MIN; i < PM_SUSPEND_MAX; i++)
		if (pm_states[i])
			s += sprintf(s,"%s ", pm_states[i]);
#endif
	if (hibernation_available())
		s += sprintf(s, "disk ");
	if (s != buf)
		/* convert the last space to a newline */
		*(s-1) = '\n';
	return (s - buf);
}
state_store, however, provides our entry point: if the string disk is written to the state file, it calls hibernate(). kernel/power/main.c:715
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t n)
{
	suspend_state_t state;
	int error;

	error = pm_autosleep_lock();
	if (error)
		return error;

	if (pm_autosleep_state() > PM_SUSPEND_ON) {
		error = -EBUSY;
		goto out;
	}

	state = decode_state(buf, n);
	if (state < PM_SUSPEND_MAX) {
		if (state == PM_SUSPEND_MEM)
			state = mem_sleep_current;

		error = pm_suspend(state);
	} else if (state == PM_SUSPEND_MAX) {
		error = hibernate();
	} else {
		error = -EINVAL;
	}

 out:
	pm_autosleep_unlock();
	return error ? error : n;
}
kernel/power/main.c:688
static suspend_state_t decode_state(const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
	suspend_state_t state;
#endif
	char *p;
	int len;

	p = memchr(buf, '\n', n);
	len = p ? p - buf : n;

	/* Check hibernation first. */
	if (len == 4 && str_has_prefix(buf, "disk"))
		return PM_SUSPEND_MAX;

#ifdef CONFIG_SUSPEND
	for (state = PM_SUSPEND_MIN; state < PM_SUSPEND_MAX; state++) {
		const char *label = pm_states[state];

		if (label && len == strlen(label) && !strncmp(buf, label, len))
			return state;
	}
#endif

	return PM_SUSPEND_ON;
}
Could we have figured this out just via function names? Sure, but this way we know for sure that nothing else is happening before this function is called.

Autosleep Our first detour is into the autosleep system. When checking the state above, you may notice that the kernel grabs the pm_autosleep_lock before checking the current state. autosleep is a mechanism originally from Android that sends the entire system to either suspend or hibernate whenever it is not actively working on anything. This is not enabled for most desktop configurations, since it's primarily for mobile systems and inverts the standard suspend and hibernate interactions. This system is implemented as a workqueue2 that checks the current number of wakeup events, processes and drivers that need to run3, and if there aren't any, then the system is put into the autosleep state, typically suspend. However, it could be hibernate if configured that way via /sys/power/autosleep in a similar manner to using /sys/power/state to manually enable hibernation. kernel/power/main.c:841
static ssize_t autosleep_store(struct kobject *kobj,
			       struct kobj_attribute *attr,
			       const char *buf, size_t n)
{
	suspend_state_t state = decode_state(buf, n);
	int error;

	if (state == PM_SUSPEND_ON
	    && strcmp(buf, "off") && strcmp(buf, "off\n"))
		return -EINVAL;

	if (state == PM_SUSPEND_MEM)
		state = mem_sleep_current;

	error = pm_autosleep_set_state(state);
	return error ? error : n;
}

power_attr(autosleep);
#endif /* CONFIG_PM_AUTOSLEEP */
kernel/power/autosleep.c:24
static DEFINE_MUTEX(autosleep_lock);
static struct wakeup_source *autosleep_ws;

static void try_to_suspend(struct work_struct *work)
{
	unsigned int initial_count, final_count;

	if (!pm_get_wakeup_count(&initial_count, true))
		goto out;

	mutex_lock(&autosleep_lock);

	if (!pm_save_wakeup_count(initial_count) ||
		system_state != SYSTEM_RUNNING) {
		mutex_unlock(&autosleep_lock);
		goto out;
	}

	if (autosleep_state == PM_SUSPEND_ON) {
		mutex_unlock(&autosleep_lock);
		return;
	}
	if (autosleep_state >= PM_SUSPEND_MAX)
		hibernate();
	else
		pm_suspend(autosleep_state);

	mutex_unlock(&autosleep_lock);

	if (!pm_get_wakeup_count(&final_count, false))
		goto out;

	/*
	 * If the wakeup occurred for an unknown reason, wait to prevent the
	 * system from trying to suspend and waking up in a tight loop.
	 */
	if (final_count == initial_count)
		schedule_timeout_uninterruptible(HZ / 2);

 out:
	queue_up_suspend_work();
}

static DECLARE_WORK(suspend_work, try_to_suspend);

void queue_up_suspend_work(void)
{
	if (autosleep_state > PM_SUSPEND_ON)
		queue_work(autosleep_wq, &suspend_work);
}
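Tying this back to the sysfs interface: the same state strings accepted by /sys/power/state can be written to /sys/power/autosleep (requires CONFIG_PM_AUTOSLEEP); for example, as root:

# echo mem > /sys/power/autosleep     # opportunistically suspend to RAM when idle
# echo disk > /sys/power/autosleep    # opportunistically hibernate instead
# echo off > /sys/power/autosleep     # disable autosleep again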

The Steps of Hibernation

Hibernation Kernel Config It's important to note that most of the hibernate-specific functions below do nothing unless you've defined CONFIG_HIBERNATION in your Kconfig4. As an example, hibernate itself is defined as the following if CONFIG_HIBERNATION is not set. include/linux/suspend.h:407
static inline int hibernate(void) { return -ENOSYS; }

Check if Hibernation is Available We begin by confirming that we actually can perform hibernation, via the hibernation_available function. kernel/power/hibernate.c:742
if (!hibernation_available()) {
	pm_pr_dbg("Hibernation not available.\n");
	return -EPERM;
}
kernel/power/hibernate.c:92
bool hibernation_available(void)
{
	return nohibernate == 0 &&
		!security_locked_down(LOCKDOWN_HIBERNATION) &&
		!secretmem_active() && !cxl_mem_active();
}
nohibernate is controlled by the kernel command line; it's set via either nohibernate or hibernate=no. security_locked_down is a hook for Linux Security Modules to prevent hibernation. This is used to prevent hibernating to an unencrypted storage device, as specified in the manual page kernel_lockdown(7). Interestingly, either level of lockdown, integrity or confidentiality, locks down hibernation because with the ability to hibernate you can extract basically anything from memory and even reboot into a modified kernel image. secretmem_active checks whether there is any active use of memfd_secret, and if so it prevents hibernation. memfd_secret returns a file descriptor that can be mapped into a process but is specifically unmapped from the kernel's memory space. Hibernating with memory that not even the kernel is supposed to access would expose that memory to whoever could access the hibernation image. This particular feature of secret memory was apparently controversial, though not as controversial as performance concerns around fragmentation when unmapping kernel memory (which did not end up being a real problem). cxl_mem_active just checks whether any CXL memory is active. A full explanation is provided in the commit introducing this check, but there's also a shortened explanation from cxl_mem_probe that sets the relevant flag when initializing a CXL memory device. drivers/cxl/mem.c:186
* The kernel may be operating out of CXL memory on this device,
* there is no spec defined way to determine whether this device
* preserves contents over suspend, and there is no simple way
* to arrange for the suspend image to avoid CXL memory which
* would setup a circular dependency between PCI resume and save
* state restoration.

Check Compression The next check is for whether compression support is enabled, and if so whether the requested algorithm is enabled. kernel/power/hibernate.c:747
/*
 * Query for the compression algorithm support if compression is enabled.
 */
if (!nocompress) {
	strscpy(hib_comp_algo, hibernate_compressor, sizeof(hib_comp_algo));
	if (crypto_has_comp(hib_comp_algo, 0, 0) != 1) {
		pr_err("%s compression is not available\n", hib_comp_algo);
		return -EOPNOTSUPP;
	}
}
The nocompress flag is set via the hibernate command line parameter, setting hibernate=nocompress. If compression is enabled, then hibernate_compressor is copied to hib_comp_algo. This synchronizes the current requested compression setting (hibernate_compressor) with the current compression setting (hib_comp_algo). Both values are character arrays of size CRYPTO_MAX_ALG_NAME (128 in this kernel). kernel/power/hibernate.c:50
static char hibernate_compressor[CRYPTO_MAX_ALG_NAME] = CONFIG_HIBERNATION_DEF_COMP;
/*
 * Compression/decompression algorithm to be used while saving/loading
 * image to/from disk. This would later be used in 'kernel/power/swap.c'
 * to allocate comp streams.
 */
char hib_comp_algo[CRYPTO_MAX_ALG_NAME];
hibernate_compressor defaults to lzo if that algorithm is enabled, otherwise to lz4 if enabled5. It can be overwritten using the hibernate.compressor setting to either lzo or lz4. kernel/power/Kconfig:95
choice
	prompt "Default compressor"
	default HIBERNATION_COMP_LZO
	depends on HIBERNATION
config HIBERNATION_COMP_LZO
	bool "lzo"
	depends on CRYPTO_LZO
config HIBERNATION_COMP_LZ4
	bool "lz4"
	depends on CRYPTO_LZ4
endchoice
config HIBERNATION_DEF_COMP
	string
	default "lzo" if HIBERNATION_COMP_LZO
	default "lz4" if HIBERNATION_COMP_LZ4
	help
	  Default compressor to be used for hibernation.
kernel/power/hibernate.c:1425
static const char * const comp_alg_enabled[] = {
#if IS_ENABLED(CONFIG_CRYPTO_LZO)
	COMPRESSION_ALGO_LZO,
#endif
#if IS_ENABLED(CONFIG_CRYPTO_LZ4)
	COMPRESSION_ALGO_LZ4,
#endif
};

static int hibernate_compressor_param_set(const char *compressor,
		const struct kernel_param *kp)
{
	unsigned int sleep_flags;
	int index, ret;

	sleep_flags = lock_system_sleep();

	index = sysfs_match_string(comp_alg_enabled, compressor);
	if (index >= 0) {
		ret = param_set_copystring(comp_alg_enabled[index], kp);
		if (!ret)
			strscpy(hib_comp_algo, comp_alg_enabled[index],
				sizeof(hib_comp_algo));
	} else {
		ret = index;
	}

	unlock_system_sleep(sleep_flags);

	if (ret)
		pr_debug("Cannot set specified compressor %s\n",
			 compressor);

	return ret;
}

static const struct kernel_param_ops hibernate_compressor_param_ops = {
	.set    = hibernate_compressor_param_set,
	.get    = param_get_string,
};

static struct kparam_string hibernate_compressor_param_string = {
	.maxlen = sizeof(hibernate_compressor),
	.string = hibernate_compressor,
};
We then check whether the requested algorithm is supported via crypto_has_comp. If not, we bail out of the whole operation with EOPNOTSUPP. As part of crypto_has_comp we perform any needed initialization of the algorithm, loading kernel modules and running initialization code as needed6.
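Since both knobs mentioned above are plain kernel command line parameters, selecting a compressor (or disabling compression) is just a matter of the boot command line; illustrative examples:

hibernate.compressor=lz4    # pick lz4 (requires CONFIG_CRYPTO_LZ4)
hibernate=nocompress        # skip image compression entirely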

Grab Locks The next step is to grab the sleep and hibernation locks via lock_system_sleep and hibernate_acquire. kernel/power/hibernate.c:758
sleep_flags = lock_system_sleep();

/* The snapshot device should not be opened while we're running */
if (!hibernate_acquire()) {
	error = -EBUSY;
	goto Unlock;
}
First, lock_system_sleep marks the current thread as not freezable, which will be important later7. It then grabs the system_transition_mutex, which locks taking snapshots or modifying how they are taken, resuming from a hibernation image, entering any suspend state, or rebooting.

The GFP Mask The kernel also issues a warning if the gfp mask is changed via either pm_restore_gfp_mask or pm_restrict_gfp_mask without holding the system_transition_mutex. GFP flags tell the kernel how it is permitted to handle a request for memory. include/linux/gfp_types.h:12
 * GFP flags are commonly used throughout Linux to indicate how memory
 * should be allocated.  The GFP acronym stands for get_free_pages(),
 * the underlying memory allocation function.  Not every GFP flag is
 * supported by every function which may allocate memory.
In the case of hibernation specifically we care about the IO and FS flags, which are reclaim operators, ways the system is permitted to attempt to free up memory in order to satisfy a specific request for memory. include/linux/gfp_types.h:176
 * Reclaim modifiers
 * -----------------
 * Please note that all the following flags are only applicable to sleepable
 * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them).
 *
 * %__GFP_IO can start physical IO.
 *
 * %__GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 * allocator recursing into the filesystem which might already be holding
 * locks.
gfp_allowed_mask sets which flags are permitted to be set at the current time. As the comment below outlines, preventing these flags from being set avoids situations where the kernel needs to do I/O to allocate memory (e.g. read/writing swap8) but the devices it needs to read/write to/from are not currently available. kernel/power/main.c:24
/*
 * The following functions are used by the suspend/hibernate code to temporarily
 * change gfp_allowed_mask in order to avoid using I/O during memory allocations
 * while devices are suspended.  To avoid races with the suspend/hibernate code,
 * they should always be called with system_transition_mutex held
 * (gfp_allowed_mask also should only be modified with system_transition_mutex
 * held, unless the suspend/hibernate code is guaranteed not to run in parallel
 * with that modification).
 */
static gfp_t saved_gfp_mask;
void pm_restore_gfp_mask(void)
{
	WARN_ON(!mutex_is_locked(&system_transition_mutex));
	if (saved_gfp_mask) {
		gfp_allowed_mask = saved_gfp_mask;
		saved_gfp_mask = 0;
	}
}

void pm_restrict_gfp_mask(void)
{
	WARN_ON(!mutex_is_locked(&system_transition_mutex));
	WARN_ON(saved_gfp_mask);
	saved_gfp_mask = gfp_allowed_mask;
	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
}
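As a contrived sketch (not kernel code) of what this masking achieves: the page allocator ANDs each request's flags with gfp_allowed_mask, so while the restriction is active even a GFP_KERNEL allocation behaves like one that was never allowed to start I/O:

/* Contrived illustration only: */
void *a = kmalloc(4096, GFP_KERNEL);	/* may normally write dirty pages
					   or swap in order to reclaim */
void *b = kmalloc(4096, GFP_NOIO);	/* never starts I/O for reclaim */

/* Between pm_restrict_gfp_mask() and pm_restore_gfp_mask(), the flags
 * of 'a' are masked so it is effectively as restricted as 'b', and
 * reclaim cannot touch devices that may already be suspended. */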

Sleep Flags After grabbing the system_transition_mutex, the kernel then returns and captures the previous state of the thread's flags in sleep_flags. This is used later to remove PF_NOFREEZE if it wasn't previously set on the current thread. kernel/power/main.c:52
unsigned int lock_system_sleep(void)
{
	unsigned int flags = current->flags;
	current->flags |= PF_NOFREEZE;
	mutex_lock(&system_transition_mutex);
	return flags;
}
EXPORT_SYMBOL_GPL(lock_system_sleep);
include/linux/sched.h:1633
#define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
Then we grab the hibernate-specific semaphore to ensure no one can open a snapshot or resume from it while we perform hibernation. Additionally, this lock is used to prevent hibernate_quiet_exec, which is used by the nvdimm driver to activate its firmware with all processes and devices frozen, ensuring it is the only thing running at that time9. kernel/power/hibernate.c:82
bool hibernate_acquire(void)
{
	return atomic_add_unless(&hibernate_atomic, -1, 0);
}

Prepare Console The kernel next calls pm_prepare_console. This function only does anything if CONFIG_VT_CONSOLE_SLEEP has been set. This prepares the virtual terminal for a suspend state, switching away to a console used only for the suspend state if needed. kernel/power/console.c:130
void pm_prepare_console(void)
{
	if (!pm_vt_switch())
		return;

	orig_fgconsole = vt_move_to_console(SUSPEND_CONSOLE, 1);
	if (orig_fgconsole < 0)
		return;

	orig_kmsg = vt_kmsg_redirect(SUSPEND_CONSOLE);
	return;
}
The first thing is to check whether we actually need to switch the VT at all. kernel/power/console.c:94
/*
 * There are three cases when a VT switch on suspend/resume are required:
 *   1) no driver has indicated a requirement one way or another, so preserve
 *      the old behavior
 *   2) console suspend is disabled, we want to see debug messages across
 *      suspend/resume
 *   3) any registered driver indicates it needs a VT switch
 *
 * If none of these conditions is present, meaning we have at least one driver
 * that doesn't need the switch, and none that do, we can avoid it to make
 * resume look a little prettier (and suspend too, but that's usually hidden,
 * e.g. when closing the lid on a laptop).
 */
static bool pm_vt_switch(void)
{
	struct pm_vt_switch *entry;
	bool ret = true;

	mutex_lock(&vt_switch_mutex);
	if (list_empty(&pm_vt_switch_list))
		goto out;

	if (!console_suspend_enabled)
		goto out;

	list_for_each_entry(entry, &pm_vt_switch_list, head) {
		if (entry->required)
			goto out;
	}

	ret = false;
out:
	mutex_unlock(&vt_switch_mutex);
	return ret;
}
There is an explanation of the conditions under which a switch is performed in the comment above the function, but we'll also walk through the steps here. Firstly we grab the vt_switch_mutex to ensure nothing will modify the list while we're looking at it. We then examine the pm_vt_switch_list. This list is used to indicate the drivers that require a switch during suspend. They register this requirement, or the lack thereof, via pm_vt_switch_required. kernel/power/console.c:31
/**
 * pm_vt_switch_required - indicate VT switch at suspend requirements
 * @dev: device
 * @required: if true, caller needs VT switch at suspend/resume time
 *
 * The different console drivers may or may not require VT switches across
 * suspend/resume, depending on how they handle restoring video state and
 * what may be running.
 *
 * Drivers can indicate support for switchless suspend/resume, which can
 * save time and flicker, by using this routine and passing 'false' as
 * the argument.  If any loaded driver needs VT switching, or the
 * no_console_suspend argument has been passed on the command line, VT
 * switches will occur.
 */
void pm_vt_switch_required(struct device *dev, bool required)
Next, we check console_suspend_enabled. This is set to false by the kernel parameter no_console_suspend, but defaults to true. Finally, if there are any entries in the pm_vt_switch_list, then we check to see if any of them require a VT switch. Only if none of these conditions apply do we return false. If a VT switch is in fact required, then we move first the currently active virtual terminal/console10 (vt_move_to_console) and then the current location of kernel messages (vt_kmsg_redirect) to the SUSPEND_CONSOLE. The SUSPEND_CONSOLE is the last entry in the list of possible consoles, and appears to just be a black hole to throw away messages. kernel/power/console.c:16
#define SUSPEND_CONSOLE	(MAX_NR_CONSOLES-1)
Interestingly, these are separate functions because you can use TIOCL_SETKMSGREDIRECT (an ioctl11) to send kernel messages to a specific virtual terminal, but by default it's the same as the currently active console. The locations of the previously active console and the previous kernel messages location are stored in orig_fgconsole and orig_kmsg, to restore the state of the console and kernel messages after the machine wakes up again. Interestingly, this means orig_fgconsole also ends up storing any errors, so it has to be checked to ensure it's not less than zero before we try to do anything with the kernel messages on both suspend and resume. drivers/tty/vt/vt_ioctl.c:1268
/* Perform a kernel triggered VT switch for suspend/resume */
static int disable_vt_switch;
int vt_move_to_console(unsigned int vt, int alloc)
{
	int prev;

	console_lock();
	/* Graphics mode - up to X */
	if (disable_vt_switch) {
		console_unlock();
		return 0;
	}
	prev = fg_console;

	if (alloc && vc_allocate(vt)) {
		/* we can't have a free VC for now. Too bad,
		 * we don't want to mess the screen for now. */
		console_unlock();
		return -ENOSPC;
	}

	if (set_console(vt)) {
		/*
		 * We're unable to switch to the SUSPEND_CONSOLE.
		 * Let the calling function know so it can decide
		 * what to do.
		 */
		console_unlock();
		return -EIO;
	}
	console_unlock();

	if (vt_waitactive(vt + 1)) {
		pr_debug("Suspend: Can't switch VCs.");
		return -EINTR;
	}
	return prev;
}
Unlike most other locking functions we've seen so far, console_lock needs to be careful to ensure nothing else is panicking and needs to dump to the console before grabbing the semaphore for the console and setting a couple of flags.

Panics Panics are tracked via an atomic integer set to the id of the processor currently panicking. kernel/printk/printk.c:2649
/**
 * console_lock - block the console subsystem from printing
 *
 * Acquires a lock which guarantees that no consoles will
 * be in or enter their write() callback.
 *
 * Can sleep, returns nothing.
 */
void console_lock(void)
{
	might_sleep();

	/* On panic, the console_lock must be left to the panic cpu. */
	while (other_cpu_in_panic())
		msleep(1000);

	down_console_sem();
	console_locked = 1;
	console_may_schedule = 1;
}
EXPORT_SYMBOL(console_lock);
kernel/printk/printk.c:362
/*
 * Return true if a panic is in progress on a remote CPU.
 *
 * On true, the local CPU should immediately release any printing resources
 * that may be needed by the panic CPU.
 */
bool other_cpu_in_panic(void)
{
	return (panic_in_progress() && !this_cpu_in_panic());
}
kernel/printk/printk.c:345
static bool panic_in_progress(void)
{
	return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
}
kernel/printk/printk.c:350
/* Return true if a panic is in progress on the current CPU. */
bool this_cpu_in_panic(void)
{
	/*
	 * We can use raw_smp_processor_id() here because it is impossible for
	 * the task to be migrated to the panic_cpu, or away from it. If
	 * panic_cpu has already been set, and we're not currently executing on
	 * that CPU, then we never will be.
	 */
	return unlikely(atomic_read(&panic_cpu) == raw_smp_processor_id());
}
console_locked is a debug value, used to indicate that the lock should be held, and our first indication that this whole virtual terminal system is more complex than might initially be expected. kernel/printk/printk.c:373
/*
 * This is used for debugging the mess that is the VT code by
 * keeping track if we have the console semaphore held. It's
 * definitely not the perfect debug tool (we don't know if _WE_
 * hold it and are racing, but it helps tracking those weird code
 * paths in the console code where we end up in places I want
 * locked without the console semaphore held).
 */
static int console_locked;
console_may_schedule is used to see if we are permitted to sleep and schedule other work while we hold this lock. As we'll see later, the virtual terminal subsystem is not re-entrant, so there's all sorts of hacks in here to ensure we don't leave important code sections that can't be safely resumed.

Disable VT Switch As the comment below lays out, when another program is handling graphical display anyway, there's no need to do any of this, so the kernel provides a switch to turn the whole thing off. Interestingly, this appears to only be used by three drivers, so the specific hardware support required must not be particularly common.
drivers/gpu/drm/omapdrm/dss
drivers/video/fbdev/geode
drivers/video/fbdev/omap2
drivers/tty/vt/vt_ioctl.c:1308
/*
 * Normally during a suspend, we allocate a new console and switch to it.
 * When we resume, we switch back to the original console.  This switch
 * can be slow, so on systems where the framebuffer can handle restoration
 * of video registers anyways, there's little point in doing the console
 * switch.  This function allows you to disable it by passing it '0'.
 */
void pm_set_vt_switch(int do_switch)
{
	console_lock();
	disable_vt_switch = !do_switch;
	console_unlock();
}
EXPORT_SYMBOL(pm_set_vt_switch);
The rest of the vt_move_to_console function is pretty normal, however: it simply allocates space if needed to create the requested virtual terminal, and then sets the current virtual terminal via set_console.

Virtual Terminal Set Console With set_console, we begin (as if we haven't been already) to enter the madness that is the virtual terminal subsystem. As mentioned previously, modifications to its state must be made very carefully, as other stuff happening at the same time could create complete messes. All this to say, calling set_console does not actually perform any work to change the state of the current console. Instead it indicates what changes it wants and then schedules that work. drivers/tty/vt/vt.c:3153
int set_console(int nr)
{
	struct vc_data *vc = vc_cons[fg_console].d;

	if (!vc_cons_allocated(nr) || vt_dont_switch ||
		(vc->vt_mode.mode == VT_AUTO && vc->vc_mode == KD_GRAPHICS)) {
		/*
		 * Console switch will fail in console_callback() or
		 * change_console() so there is no point scheduling
		 * the callback
		 *
		 * Existing set_console() users don't check the return
		 * value so this shouldn't break anything
		 */
		return -EINVAL;
	}

	want_console = nr;
	schedule_console_callback();
	return 0;
}
The check for vc->vc_mode == KD_GRAPHICS is where most end-user graphical desktops will bail out of this change, as they're in graphics mode and don't need to switch away to the suspend console. vt_dont_switch is a flag used by the ioctls11 VT_LOCKSWITCH and VT_UNLOCKSWITCH to prevent the system from switching virtual terminal devices when the user has explicitly locked it. VT_AUTO is a flag indicating that automatic virtual terminal switching is enabled12, and thus deliberate switching to a suspend terminal is not required. However, if you do run your machine from a virtual terminal, then we indicate to the system that we want to change to the requested virtual terminal via the want_console variable and schedule a callback via schedule_console_callback. drivers/tty/vt/vt.c:315
void schedule_console_callback(void)
{
	schedule_work(&console_work);
}
console_work is a workqueue2 that will execute the given task asynchronously.

Console Callback drivers/tty/vt/vt.c:3109
/*
 * This is the console switching callback.
 *
 * Doing console switching in a process context allows
 * us to do the switches asynchronously (needed when we want
 * to switch due to a keyboard interrupt).  Synchronization
 * with other console code and prevention of re-entrancy is
 * ensured with console_lock.
 */
static void console_callback(struct work_struct *ignored)
{
	console_lock();

	if (want_console >= 0) {
		if (want_console != fg_console &&
		    vc_cons_allocated(want_console)) {
			hide_cursor(vc_cons[fg_console].d);
			change_console(vc_cons[want_console].d);
			/* we only changed when the console had already
			   been allocated - a new console is not created
			   in an interrupt routine */
		}
		want_console = -1;
	}
...
console_callback first looks to see if there is a console change wanted via want_console and then changes to it if it's not the current console and has been allocated already. We first remove any cursor state with hide_cursor. drivers/tty/vt/vt.c:841
static void hide_cursor(struct vc_data *vc)
{
	if (vc_is_sel(vc))
		clear_selection();

	vc->vc_sw->con_cursor(vc, false);
	hide_softcursor(vc);
}
A full dive into the tty driver is a task for another time, but this should give a general sense of how this system interacts with hibernation.

Notify Power Management Call Chain kernel/power/hibernate.c:767
pm_notifier_call_chain_robust(PM_HIBERNATION_PREPARE, PM_POST_HIBERNATION)
This will call a chain of power management callbacks, passing first PM_HIBERNATION_PREPARE, and then PM_POST_HIBERNATION if a callback fails (or, via another call later, once hibernation completes). kernel/power/main.c:98
int pm_notifier_call_chain_robust(unsigned long val_up, unsigned long val_down)
{
	int ret;

	ret = blocking_notifier_call_chain_robust(&pm_chain_head, val_up, val_down, NULL);

	return notifier_to_errno(ret);
}
The power management notifier is a blocking notifier chain, which means it has the following properties. include/linux/notifier.h:23
 *	Blocking notifier chains: Chain callbacks run in process context.
 *		Callouts are allowed to block.
The callback chain is a linked list with each entry containing a priority and a function to call. The function technically takes in a data value, but it is always NULL for the power management chain. include/linux/notifier.h:49
struct notifier_block;
typedef	int (*notifier_fn_t)(struct notifier_block *nb,
			unsigned long action, void *data);
struct notifier_block {
	notifier_fn_t notifier_call;
	struct notifier_block __rcu *next;
	int priority;
};
The head of the linked list is protected by a read-write semaphore. include/linux/notifier.h:65
struct blocking_notifier_head {
	struct rw_semaphore rwsem;
	struct notifier_block __rcu *head;
};
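For context, a driver would typically attach to this particular chain with register_pm_notifier; a minimal sketch (hypothetical driver code, not from the kernel tree) might look like:

#include <linux/notifier.h>
#include <linux/suspend.h>

static int my_pm_callback(struct notifier_block *nb,
			  unsigned long action, void *data)
{
	switch (action) {
	case PM_HIBERNATION_PREPARE:
		/* quiesce our (hypothetical) device before the image is taken */
		return NOTIFY_OK;
	case PM_POST_HIBERNATION:
		/* undo the preparation after resume, or when an earlier
		 * callback vetoed the transition */
		return NOTIFY_OK;
	}
	return NOTIFY_DONE;
}

static struct notifier_block my_pm_nb = {
	.notifier_call	= my_pm_callback,
	.priority	= 0,
};

/* in the driver's init path: register_pm_notifier(&my_pm_nb); */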
Because it is prioritized, appending to the list requires walking it until an item with lower13 priority is found to insert the current item before. kernel/notifier.c:252
/*
 *	Blocking notifier chain routines.  All access to the chain is
 *	synchronized by an rwsem.
 */
static int __blocking_notifier_chain_register(struct blocking_notifier_head *nh,
					      struct notifier_block *n,
					      bool unique_priority)
{
	int ret;

	/*
	 * This code gets used during boot-up, when task switching is
	 * not yet working and interrupts must remain disabled.  At
	 * such times we must not call down_write().
	 */
	if (unlikely(system_state == SYSTEM_BOOTING))
		return notifier_chain_register(&nh->head, n, unique_priority);

	down_write(&nh->rwsem);
	ret = notifier_chain_register(&nh->head, n, unique_priority);
	up_write(&nh->rwsem);
	return ret;
}
kernel/notifier.c:20
/*
 *	Notifier chain core routines.  The exported routines below
 *	are layered on top of these, with appropriate locking added.
 */
static int notifier_chain_register(struct notifier_block **nl,
				   struct notifier_block *n,
				   bool unique_priority)
{
	while ((*nl) != NULL) {
		if (unlikely((*nl) == n)) {
			WARN(1, "notifier callback %ps already registered",
			     n->notifier_call);
			return -EEXIST;
		}
		if (n->priority > (*nl)->priority)
			break;
		if (n->priority == (*nl)->priority && unique_priority)
			return -EBUSY;
		nl = &((*nl)->next);
	}
	n->next = *nl;
	rcu_assign_pointer(*nl, n);
	trace_notifier_register((void *)n->notifier_call);
	return 0;
}
Each callback can return one of a series of options. include/linux/notifier.h:18
#define NOTIFY_DONE		0x0000		/* Don't care */
#define NOTIFY_OK		0x0001		/* Suits me */
#define NOTIFY_STOP_MASK	0x8000		/* Don't call further */
#define NOTIFY_BAD		(NOTIFY_STOP_MASK|0x0002)
						/* Bad/Veto action */
When notifying the chain, if a function returns STOP or BAD then the previous parts of the chain are called again with PM_POST_HIBERNATION14 and an error is returned. kernel/notifier.c:107
/**
 * notifier_call_chain_robust - Inform the registered notifiers about an event
 *                              and rollback on error.
 * @nl:		Pointer to head of the blocking notifier chain
 * @val_up:	Value passed unmodified to the notifier function
 * @val_down:	Value passed unmodified to the notifier function when recovering
 *              from an error on @val_up
 * @v:		Pointer passed unmodified to the notifier function
 *
 * NOTE:	It is important the @nl chain doesn't change between the two
 *		invocations of notifier_call_chain() such that we visit the
 *		exact same notifier callbacks; this rules out any RCU usage.
 *
 * Return:	the return value of the @val_up call.
 */
static int notifier_call_chain_robust(struct notifier_block **nl,
				     unsigned long val_up, unsigned long val_down,
				     void *v)
{
	int ret, nr = 0;

	ret = notifier_call_chain(nl, val_up, v, -1, &nr);
	if (ret & NOTIFY_STOP_MASK)
		notifier_call_chain(nl, val_down, v, nr-1, NULL);

	return ret;
}
Each of these callbacks tends to be quite driver-specific, so we'll cease discussion of this here.

Sync Filesystems The next step is to ensure all filesystems have been synchronized to disk. This is performed via a simple helper function that times how long the full synchronize operation, ksys_sync, takes. kernel/power/main.c:69
void ksys_sync_helper(void)
{
	ktime_t start;
	long elapsed_msecs;

	start = ktime_get();
	ksys_sync();
	elapsed_msecs = ktime_to_ms(ktime_sub(ktime_get(), start));
	pr_info("Filesystems sync: %ld.%03ld seconds\n",
		elapsed_msecs / MSEC_PER_SEC, elapsed_msecs % MSEC_PER_SEC);
}
EXPORT_SYMBOL_GPL(ksys_sync_helper);
ksys_sync wakes and instructs a set of flusher threads to write out every filesystem, first their inodes15, then the full filesystem, and then finally all block devices, to ensure all pages are written out to disk. fs/sync.c:87
/*
 * Sync everything. We start by waking flusher threads so that most of
 * writeback runs on all devices in parallel. Then we sync all inodes reliably
 * which effectively also waits for all flusher threads to finish doing
 * writeback. At this point all data is on disk so metadata should be stable
 * and we tell filesystems to sync their metadata via ->sync_fs() calls.
 * Finally, we writeout all block devices because some filesystems (e.g. ext2)
 * just write metadata (such as inodes or bitmaps) to block device page cache
 * and do not sync it on their own in ->sync_fs().
 */
void ksys_sync(void)
{
	int nowait = 0, wait = 1;

	wakeup_flusher_threads(WB_REASON_SYNC);
	iterate_supers(sync_inodes_one_sb, NULL);
	iterate_supers(sync_fs_one_sb, &nowait);
	iterate_supers(sync_fs_one_sb, &wait);
	sync_bdevs(false);
	sync_bdevs(true);
	if (unlikely(laptop_mode))
		laptop_sync_completion();
}
It follows an interesting pattern of using iterate_supers to run both sync_inodes_one_sb and then sync_fs_one_sb on each known filesystem16. It also calls both sync_fs_one_sb and sync_bdevs twice, first without waiting for any operations to complete and then again waiting for completion17. When laptop_mode is enabled the system runs additional filesystem synchronization operations after the specified delay without any writes. mm/page-writeback.c:111
/*
 * Flag that puts the machine in "laptop mode". Doubles as a timeout in jiffies:
 * a full sync is triggered after this time elapses without any disk activity.
 */
int laptop_mode;
EXPORT_SYMBOL(laptop_mode);
However, when running a filesystem synchronization operation, the system will add an additional timer to schedule more writes after the laptop_mode delay. We don't want the state of the system to change at all while performing hibernation, so we cancel those timers. mm/page-writeback.c:2198
/*
 * We're in laptop mode and we've just synced. The sync's writes will have
 * caused another writeback to be scheduled by laptop_io_completion.
 * Nothing needs to be written back anymore, so we unschedule the writeback.
 */
void laptop_sync_completion(void)
{
	struct backing_dev_info *bdi;

	rcu_read_lock();

	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
		del_timer(&bdi->laptop_mode_wb_timer);

	rcu_read_unlock();
}
As a side note, the ksys_sync function is simply called when the system call sync is used. fs/sync.c:111
SYSCALL_DEFINE0(sync)
{
	ksys_sync();
	return 0;
}
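So the familiar userspace sync(2) lands in exactly this code; a trivial demonstration:

#include <unistd.h>

int main(void)
{
	sync();	/* enters the kernel via SYSCALL_DEFINE0(sync), running ksys_sync() */
	return 0;
}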

The End of Preparation With that the system has finished preparations for hibernation. This is a somewhat arbitrary cutoff, but next the system will begin a full freeze of userspace to then dump memory out to an image and finally to perform hibernation. All this will be covered in future articles!
  1. Hibernation modes are outside the scope of this article; see the previous article for a high-level description of the different types of hibernation.
  2. Workqueues are a mechanism for running asynchronous tasks. A full description of them is a task for another time, but the kernel documentation on them is available here: https://www.kernel.org/doc/html/v6.9/core-api/workqueue.html
  3. This is a bit of an oversimplification, but since this isn't the main focus of this article this description has been kept to a higher level.
  4. Kconfig is Linux's build configuration system that sets many different macros to enable/disable various features.
  5. Kconfig defaults to the first default found.
  6. Including checking whether the algorithm is "larval", which appears to indicate that it requires additional setup, but is an interesting choice of name for such a state.
  7. Specifically when we get to process freezing, which we'll get to in the next article in this series.
  8. Swap space is outside the scope of this article, but in short it is a buffer on disk that the kernel uses to store memory not currently in use, to free up space for other things. See Swap Management for more details.
  9. The code for this is lengthy and tangential, thus it has not been included here. If you're curious about the details of this, see kernel/power/hibernate.c:858 for the details of hibernate_quiet_exec, and drivers/nvdimm/core.c:451 for how it is used in nvdimm.
  10. Annoyingly this code appears to use the terms console and virtual terminal interchangeably.
  11. ioctls are special device-specific I/O operations that permit performing actions outside of the standard file interactions of read/write/seek/etc.
  12. I'm not entirely clear on how this flag works; this subsystem is particularly complex.
  13. In this case a higher number is higher priority.
  14. Or whatever the caller passes as val_down, but in this case we're specifically looking at how this is used in hibernation.
  15. An inode refers to a particular file or directory within the filesystem. See Wikipedia for more details.
  16. Each active filesystem is registered with the kernel through a structure known as a superblock, which contains references to all the inodes contained within the filesystem, as well as function pointers to perform the various required operations, like sync.
  17. I'm including minimal code in this section, as I'm not looking to take a deep dive into the filesystem code at this time.

7 September 2024

Sergio Durigan Junior: Chatting in the 21st century

Several people have been asking me to explain and/or write about my solution for chatting nowadays. I realize that the current scenario is much more complex than, say, 10 or 20 years ago. Back then, this post would probably be more about the IRC client I used than about different chatting technologies. I have also spent a non-trivial amount of time setting things up the way I want, so I understand that it's about time to write about my setup: not only because I think it can be helpful to others, but also because I would like to document things for myself.

The backbone: Matrix I chose to use Matrix as the place where I integrate everything. Despite there being some heavy (and justified) criticism of the protocol itself, it serves me well for what I need right now. Obviously, I don't like the fact that I have to provide Matrix and all of its accompanying bridges a VPS with 4GB of RAM and 3 vCPUs, but I think that that ship has sailed, unfortunately. In an ideal world, I would be using XMPP and dedicating only a fraction of the resources I'm using today to have a full chat system. And since I have been running my personal XMPP server for more than a decade now, I did try to find a solution that would allow me to keep using it, but unfortunately the protocol has become almost a hobbyist thing, so there's that.

A few disclaimers I self-host everything, including my Matrix server. Much of what I did won't work if you don't self-host Matrix, so keep that in mind. This won't be a post teaching you how to deploy the services. My intention is to describe what I use and for what purpose. Also, as much as I try to use Debian packages for everything I do, I opted to deploy all services using a community-maintained Ansible playbook which is very well written and organized: matrix-docker-ansible-deploy. Last but not least, as I said above, you will likely need a machine with a good amount of RAM, CPU and storage, especially if you deploy Synapse as your Matrix homeserver (which is what I recommend if you plan to use the bridges I'll mention). My current VPS has 4GB of RAM, 3 vCPUs and 80GB of storage (of which I'm currently using approximately 55GB).

Problem #1: my Matrix client(s) There are a lot of clients that can talk the Matrix protocol, but most of them are either web clients or GUI programs. I live on the terminal, more specifically inside Emacs, so I settled for the amazing ement.el Emacs mode. It works surprisingly well, but unfortunately doesn't support end-to-end encryption out of the box; for that, you have to hook it up with pantalaimon. Unfortunately, that project seems abandoned and therefore I don't recommend you use it. I don't use it myself. When I have to reply to some E2E-encrypted message from another user, I go to my web browser and use my self-hosted Element client. It's a nuisance, but one that I'm willing to accept because of security concerns. If you're into web clients and don't want to use Element (because it is heavy), you can try Cinny. It's lightweight and supports a decent set of features. If you're a terminal lover but don't use Emacs, you may want to try gomuks or iamb.

Problem #2: IRC bridging There are basically two types of IRC bridges for Matrix:
  • The regular and most used matrix-appservice-irc. This bridge takes Matrix to IRC (think of IRC users with the [m] suffix appended to their nicknames), and is what the matrix.org and other big homeservers (including matrix.debian.social) use. It's a complex service which allows thousands of Matrix users to connect to IRC networks, but that unfortunately has complex problems and is only worth using if you intend to host a community server.
  • A bouncer-like bridge called Heisenbridge. This is what I use personally. It takes IRC to Matrix, which means that people on IRC will not know that you're using Matrix. This bridge is much simpler, and because it acts like a bouncer it's pretty much impossible for it to cause problems with the IRC network.
Due to the fact that I sometimes like to use other IRC clients, I still run a regular ZNC bouncer, and I use Heisenbridge to connect to my ZNC. This means that I can use, e.g., ERC inside Emacs and my Matrix bridge at the same time. But you don't necessarily need to run another bouncer; you can simply use Heisenbridge and connect directly to the IRC network(s) you want. A word of caution, though: unlike ZNC, Heisenbridge doesn't support per-user configuration when you use it in bouncer mode. This is the reason why you need to self-host it, and why it's not possible to offer the service to other users (they would have access to your IRC network configuration otherwise). It's also worth talking about logs. I find that keeping logs of everything that goes on on IRC has saved me a bunch of times, and so I find it really important to continue doing that. Unfortunately, neither ement.el nor Element supports logging things out of the box (at least not that I know of). This is also one of the reasons why I still keep my ZNC around: I configure it to log everything.

Problem #3: Telegram I don't use Telegram myself, but unfortunately several people from the Debian community do, especially in Brazil. There is a whole Debian community on Telegram, and I wanted to be able to bridge our Debian Matrix channels to their Telegram counterparts. I am currently using mautrix-telegram for that, and it's working great. You need someone with a Telegram account to configure their credentials so that the bridge can connect to it, but afterwards it's really easy to bridge channels together.

Problem #4: GitLab webhooks Something else I wanted to be able to do was to receive notifications regarding new issues, merge requests and other activities from Salsa. For this, I'm using maubot, which is awesome and has a huge list of plugins. I'm using the gitlab one.

Final thoughts Overall, I'm satisfied with the setup I have now. It has certainly taken some time and effort to find the right tool for each problem I needed to solve, and I still feel like there are some rough edges to soften (like the fact that my Emacs client doesn't support E2E encryption out of the box, or the whole logging situation), but otherwise things are working fine and I haven't had any big problems with the deployment. You do have to be much more careful about stuff (for example, when I installed an unrelated service that hijacked my Apache configuration and made Matrix's federation silently stop working), though. If you have more specific questions about any part of my setup, shoot me an email and I'll do my best to help. Happy chatting!

29 August 2024

Jonathan Carter: Orphaning bcachefs-tools in Debian

Around a decade ago, I was happy to learn about bcache, a Linux block cache system that implements tiered storage (like a pool of hard disks with SSDs for cache) on Linux. At that stage, ZFS on Linux was nowhere close to where it is today, so any progress on gaining more ZFS features in general Linux systems was very welcome. These days we care a bit less about tiered storage, since any cost benefit in using anything other than NVMe tends to quickly evaporate compared to the time you eventually lose on it. In 2015, it was announced that bcache would grow into its own filesystem. This was particularly exciting and it caused quite a buzz in the Linux community, because it brought along with it more features that compare with ZFS (and also btrfs), including built-in compression, built-in encryption, check-summing and RAID implementations. Unlike ZFS, it didn't have a dkms module, so if you wanted to test bcachefs back then, you'd have to pull the entire upstream bcachefs kernel source tree and compile it. Not ideal, but for the promise of a new, shiny, full-featured filesystem, it was worth it. In 2019, it seemed that the time had come for bcachefs to be merged into Linux, so I thought that it was about time we had the userspace tools (bcachefs-tools) packaged in Debian. Even if the Debian kernel wouldn't have it yet by the time the bullseye (Debian 11) release happened, it might still have been useful for a future backported kernel or users who roll their own. By total coincidence, the first git snapshot that I got into Debian (version 0.1+git20190829.aa2a42b) was committed exactly 5 years ago today. It was quite easy to package, since it was written in C and shipped with a makefile that just worked, and it made it past NEW into unstable on 19 January 2020, just as I was about to head off to FOSDEM as the pandemic started, but that's of course a whole other story. Fast-forwarding towards the end of 2023, version 1.2 shipped with some utilities written in Rust. This caused a little delay, since I wasn't at all familiar with Rust packaging yet, so I shipped an update that didn't yet include those utilities, and saw this as an opportunity to learn more about how the Rust ecosystem worked and Rust in Debian. So, back in April the Rust dependencies for bcachefs-tools in Debian didn't at all match the build requirements. I got some help from the Rust team, who say that the common practice is to relax the dependencies of Rust software so that it builds in Debian (a sketch of what that looks like follows below). So errno, which needed exactly version 0.2, was relaxed so that it could build with version 0.4 in Debian; udev 0.7 was relaxed for 0.8 in Debian; memoffset from 0.8.5 to 0.6.5; paste from 1.0.11 to 1.0.8; and bindgen from 0.69.9 to 0.66. I found this a bit disturbing, but it seems that some Rust people have lots of confidence that if something builds, it will run fine. And at least it did build, and the resulting binaries did work, although I'm personally still not very comfortable or confident about this approach (perhaps that might change as I learn more about Rust). With that in mind, at this point you may wonder how any distribution could sanely package this. The problem is that they can't. Fedora and other distributions with stable releases take a similar approach to what we've done in Debian, while distributions with much more relaxed policies (like Arch) include all the dependencies as they are vendored upstream. As it stands now, bcachefs-tools is impossible to maintain in Debian stable.
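To illustrate what such a relaxation looks like in practice (an illustrative sketch, not the actual Debian packaging diff), an upstream pin in Cargo.toml such as:

[dependencies]
errno = "0.2"            # caret semantics: >=0.2.0, <0.3.0 -- rejects Debian's 0.4

might be patched in the Debian package to accept the version Debian actually ships:

[dependencies]
errno = ">=0.2, <0.5"    # relaxed: also accepts the 0.4 packaged in Debian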
While my primary concerns when packaging are for Debian unstable and the next stable release, I also keep in mind people who have to support these packages long after I've stopped caring about them (like Freexian, who does LTS support for Debian, or Canonical, who provides long-term Ubuntu support, and probably other organisations that I've never even heard of yet). And of course, if bcachefs-tools doesn't have any usable stable releases, it doesn't have any LTS releases either, so anyone who needs to support bcachefs-tools long-term has to carry the support burden on their own, and if they bundle its dependencies, then those as well.

I'll admit that I don't have any solution for fixing this. I suppose if I were upstream I might look into the possibility of at least supporting a larger range of recent dependencies (usually easy enough if you don't hop onto the newest features right away) so that distributions with stable releases only need to concern themselves with providing some minimum recent versions, but even if that could work, the upstream author is 100% against any solution other than vendoring all the dependencies with the utility and insisting that it must only be built using these bundled dependencies. I've made 6 uploads for this package so far this year, but I still constantly get complaints that it's out of date and ancient. If a piece of software is considered so old that it's useless by the time it's been published for two or three months, then there's no way it can survive even a usual stable release cycle, never mind any kind of long-term support.

With this in mind (not even considering some hostile emails that I recently received from the upstream developer, or his public rants on lkml and reddit), I decided to remove bcachefs-tools from Debian completely. Although, after discussing this with another DD, I was convinced to orphan it instead, which I have now done. I made an upload to experimental so that it's still available if someone wants to work on it (without having to go through NEW again), it's been removed from unstable so that it doesn't migrate to testing, and the ancient (especially by bcachefs-tools standards) versions that are in stable and oldstable will be removed too, since they are very likely to cause damage with any recent kernel versions that support bcachefs.

And so, my adventure with bcachefs-tools comes to an end. I'd advise that if you consider using bcachefs for any kind of production use in the near future, you first consider how supportable it is long-term, and whether there's really anyone at all that is succeeding in providing stable support for it.

Michael Ablassmeier: proxmox backup S3 proxy

A few weeks ago Tiziano Bacocco started a small project to implement a (golang) proxy that allows storing Proxmox backups on S3-compatible storage: pmoxs3backuproxy, a feature which the current backup server does not have. I had wanted to look at the Proxmox Backup Server implementation for a while, so I jumped on the bandwagon and helped with adding most of the API endpoints required to use it seamlessly as a drop-in replacement in PVE. The current version can be configured as a storage backend in PVE, and you can then schedule your backups to the S3 storage as usual. It now supports both the Fixed index format, required to create virtual machine backups, and the Dynamic index format, used by the regular proxmox-backup-client for file and container backups (full and incremental). The other endpoints, like adding notes, removing or protecting backups, and mounting images using the PVE frontend (or proxmox-backup-client), work too. It comes with a garbage collector that prunes the backup storage as snapshots expire and runs integrity checks on the data. You can also configure it as a so-called remote storage in the Proxmox Backup Server itself and pull back complete buckets using proxmox-backup-manager pull, if your local datastore crashes. I think it will become even more interesting if future Proxmox versions allow pushing backups to other stores, too.
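For illustration, pointing PVE at such a proxy looks much like adding a regular PBS storage entry in /etc/pve/storage.cfg; a hedged sketch (the storage name, address and fingerprint below are hypothetical; see the project's README for the real setup):

pbs: s3proxy
        server 192.0.2.10
        datastore mybucket
        username root@pam
        fingerprint 55:BC:29:...:FF
        content backup

The password is supplied separately, as with a normal PBS storage (typically via pvesm or a file under /etc/pve/priv).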

22 August 2024

Matthew Garrett: What the fuck is an SBAT and why does everyone suddenly care

Short version: Secure Boot Advanced Targeting, and if that's enough for you, you can skip the rest; you're welcome.

Long version: When UEFI Secure Boot was specified, everyone involved was, well, a touch naive. The basic security model of Secure Boot is that all the code that ends up running in a kernel-level privileged environment should be validated before execution - the firmware verifies the bootloader, the bootloader verifies the kernel, the kernel verifies any additional runtime loaded kernel code, and now we have a trusted environment to impose any other security policy we want. Obviously people might screw up, but the spec included a way to revoke any signed components that turned out not to be trustworthy: simply add the hash of the untrustworthy code to a variable, and then refuse to load anything with that hash even if it's signed with a trusted key.

Unfortunately, as it turns out, scale. Every Linux distribution that works in the Secure Boot ecosystem generates their own bootloader binaries, and each of them has a different hash. If there's a vulnerability identified in the source code for said bootloader, there's a large number of different binaries that need to be revoked. And, well, the storage available to store the variable containing all these hashes is limited. There's simply not enough space to add a new set of hashes every time it turns out that grub (a bootloader initially written for a simpler time when there was no boot security and which has several separate image parsers and also a font parser and look you know where this is going) has another mechanism for a hostile actor to cause it to execute arbitrary code, so another solution was needed.

And that solution is SBAT. The general concept behind SBAT is pretty straightforward. Every important component in the boot chain declares a security generation that's incorporated into the signed binary. When a vulnerability is identified and fixed, that generation is incremented. An update can then be pushed that defines a minimum generation - boot components will look at the next item in the chain, compare its name and generation number to the ones stored in a firmware variable, and decide whether or not to execute it based on that. Instead of having to revoke a large number of individual hashes, it becomes possible to push one update that simply says "Any version of grub with a security generation below this number is considered untrustworthy".
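For a rough illustration (hedged; the exact entries, versions and generation numbers below are made up, not copied from any shipped binary), the .sbat metadata embedded in a component is a tiny CSV of component names, generations and vendor details:

sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
grub,3,Free Software Foundation,grub,2.12,https://www.gnu.org/software/grub/

A revocation update then only has to say, in effect, "any grub below generation 3 is untrustworthy", instead of enumerating every affected binary hash.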

So why is this suddenly relevant? SBAT was developed collaboratively between the Linux community and Microsoft, and Microsoft chose to push a Windows update that told systems not to trust versions of grub with a security generation below a certain level. This was because those versions of grub had genuine security vulnerabilities that would allow an attacker to compromise the Windows secure boot chain, and we've seen real world examples of malware wanting to do that (Black Lotus did so using a vulnerability in the Windows bootloader, but a vulnerability in grub would be just as viable for this). Viewed purely from a security perspective, this was a legitimate thing to want to do.

(An aside: the "Something has gone seriously wrong" message that's associated with people having a bad time as a result of this update? That's a message from shim, not any Microsoft code. Shim pays attention to SBAT updates in order to avoid violating the security assumptions made by other bootloaders on the system, so even though it was Microsoft that pushed the SBAT update, it's the Linux bootloader that refuses to run old versions of grub as a result. This is absolutely working as intended)

The problem we've ended up in is that several Linux distributions had not shipped versions of grub with a newer security generation, and so those versions of grub are assumed to be insecure (it's worth noting that grub is signed by individual distributions, not Microsoft, so there's no externally introduced lag here). Microsoft's stated intention was that Windows Update would only apply the SBAT update to systems that were Windows-only, and any dual-boot setups would instead be left vulnerable to attack until the installed distro updated its grub and shipped an SBAT update itself. Unfortunately, as is now obvious, that didn't work as intended and at least some dual-boot setups applied the update and that distribution's Shim refused to boot that distribution's grub.

What's the summary? Microsoft (understandably) didn't want it to be possible to attack Windows by using a vulnerable version of grub that could be tricked into executing arbitrary code and then introduce a bootkit into the Windows kernel during boot. Microsoft did this by pushing a Windows Update that updated the SBAT variable to indicate that known-vulnerable versions of grub shouldn't be allowed to boot on those systems. The distribution-provided Shim first-stage bootloader read this variable, read the SBAT section from the installed copy of grub, realised these conflicted, and refused to boot grub with the "Something has gone seriously wrong" message. This update was not supposed to apply to dual-boot systems, but did anyway. Basically:

1) Microsoft applied an update to systems where that update shouldn't have been applied
2) Some Linux distros failed to update their grub code and SBAT security generation when exploitable security vulnerabilities were identified in grub

The outcome is that some people can't boot their systems. I think there's plenty of blame here. Microsoft should have done more testing to ensure that dual-boot setups could be identified accurately. But also distributions shipping signed bootloaders should make sure that they're updating those and updating the security generation to match, because otherwise they're shipping a vector that can be used to attack other operating systems and that's kind of a violation of the social contract around all of this.

It's unfortunate that the victims here are largely end users faced with a system that suddenly refuses to boot the OS they want to boot. That should never happen. I don't think asking arbitrary end users whether they want secure boot updates is likely to result in good outcomes, and while I vaguely tend towards UEFI Secure Boot not being something that benefits most end users, it's also a thing you really don't want to discover you want after the fact, so I have sympathy for it being on by default. I do sympathise with Microsoft's choices here, other than the failed attempt to avoid the update on dual-boot systems.

Anyway. I was extremely involved in the implementation of this for Linux back in 2012 and wrote the first prototype of Shim (which is now a massively better bootloader maintained by a wider set of people and that I haven't touched in years), so if you want to blame an individual please do feel free to blame me. This is something that shouldn't have happened, and unless you're either Microsoft or a Linux distribution it's not your fault. I'm sorry.


15 August 2024

Joerg Jaspert: Electric Car, Vacation trip

Electric Car A while ago I got my hands on an electric car - after not having owned a car for most of my life (there really is not much need here). It wasn't planned, nor a goal of mine, but it kind of came out of talks with my boss, so now it's there. Due to some special rules in the German tax system it turns out really cheap for me - it comes from the company, which allows personal use. So I have to pay taxes on the value of the car plus the number of kilometers I have to drive to work. And the latter is what is good for me - I have home office in my contract, so there is no drive to work, except maybe once a year for something. So there is no regular trip to calculate, only the occasional real trip to the office. And it being an electric car, running costs are also cheap - way below the outdated tech that needs gas to run, is annoyingly loud, and stinks.

The car In the past I used either car sharing or a rental car when I needed one, depending on what I actually needed. The last few times I rented, I already tried electric variants, so I had something to compare with. What I have now is a Citroën e-Berlingo XL from 2023 (so not the latest revision from 2024), which is on the huge side for space, but small for battery. It has 7 seats (though I currently have the last 2 taken out, there being no daily need) and more storage space than I need even on a vacation trip. The engine (or rather, its battery) is on the small side - a capacity of 50 kWh means a range of only 278 km according to WLTP. In reality it is a good bit less - as usual, those numbers flatter the manufacturer. It turns out that on highways, in a more realistic mode than WLTP (read: real driving), you get somewhere around 120 km before wanting a charger again. But, looking around, at least in Germany that is not a big problem; there are more than enough chargers available.

Driving The actual driving experience is good. It sure is a huge car (4.7 m long, 1.85 m high/wide) and feels more like driving a (small) bus, but it is easy to handle. The maximum speed is limited to 135 km/h (or the small battery would suck even more), but that is more than enough, even on German highways. I did my vacation trip with cruise control set to 115 km/h and only went above that manually a very few times. Really relaxed driving, that was.

Vacation trip So we had a vacation just recently, and instead of renting a car for the trip we, of course, wanted to take the e-Berlingo. The distance was about double what the car can (realistically!) do, so one charging stop somewhere in the middle was a must. Not having had to charge on highways yet - and this being entirely new with this car - that made for a bit of nervousness, but it all turned out really well. There are really nice tools like ABRP to plan your trip including charging, which can take live availability data of charging points into the planning. And it turned out nicely - we reached our planned charging point and found a long queue of cars waiting. But it turned out it was all those poor folks that need actual gasoline for their outdated combustion engines. The charging points for EVs still had enough free space, so we could bypass the queue and start charging directly. We also did not need to repark the car after just a few minutes; we could start our break right away. With a charging time of approx. 30 minutes using the fast charger, such a break is long enough to get enough energy for the next part of the trip and short enough not to be annoying.

Charging prices At home, it depends. If one has a photovoltaic system for power, charging is basically free. If not, it depends on whichever contract one has; costs will be somewhere between 0 and ~30 cents per kWh. Not much, and way below gasoline costs. Outside, using a fast charger, prices vary depending on where you charge - and with which charging card. Prices run between 40 and 70 cents/kWh, and the same charging point can vary, just from the card one uses. That is a thing the EU could actually go and regulate better, similar to the phone roaming regulations it passed. Still, the costs are way below gasoline.

Charging cards There is a huge number of different providers available, and all do their own thing in pricing and in how one can use them. They do have standards (say, the plugs are standardized, and by now the way to start charging too), and that enables roaming (using a charging card of one provider at a charging point of another), but other than that, it seems to be random. That is - if you use card A at a charging point of provider B you may pay €0.49 per kWh; if you use card C at the same point, it may charge you €0.79. And card D isn't taken at all. With some (the newer ones) you can pay directly by credit card; with many you can't. Some may allow PayPal or Google/Apple Pay. So in the end you need more than just one charging card - I have collected 8 free ones by now - just to be sure you can find a combination that isn't hugely overpriced.

31 July 2024

Matthew Palmer: Health Industry Company Sues to Prevent Certificate Revocation

It's not often that a company is willing to make a sworn statement to a court about how its IT practices are incompatible with the needs of the Internet, but when they do, it's popcorn time.

The Combatants In the red corner, weighing in at... nah, I'm not going to do that schtick. The plaintiff in the case is Alegeus Technologies, LLC, a Delaware Corporation that, according to their filings, is "a leading provider of a business-to-business, white-label funding and payment platform for healthcare carriers and third-party administrators to administer consumer-directed employee benefit programs". Not being subject to the bonkers US health care system, I have only a passing familiarity with the sorts of things they do, but presumably it involves moving a lot of money around, which is sometimes important. The defendant is DigiCert, a CA which, based on analysis I've done previously, is the second-largest issuer of WebPKI certificates by volume.

The History According to a recently opened Mozilla CA bug, DigiCert found an issue in their domain control validation workflow that meant it may have been possible for a miscreant to have certificates issued to them that they weren't legitimately entitled to. Given that validating domain names is basically the YOU HAD ONE JOB! of a CA, this is a big deal. The CA/Browser Forum Baseline Requirements (BRs) (which all CAs are required to adhere to, by virtue of being included in various browser and OS trust stores) say that revocation is required within 24 hours when "[t]he CA obtains evidence that the validation of domain authorization or control for any Fully Qualified Domain Name or IP address in the Certificate should not be relied upon" (section 4.9.1.1, point 5). DigiCert appears to have at least tried to do the right thing, by opening the above Mozilla bug giving some details of the problem, and notifying their customers that their certificates were going to be revoked. One may quibble about how fast they're doing it, but they're giving it a decent shot, at least. A complicating factor in all this is that, only a touch over a month ago, Google Chrome announced the removal of another CA, Entrust, from its own trust store program, citing "a pattern of compliance failures, unmet improvement commitments, and the absence of tangible, measurable progress in response to publicly disclosed incident reports". Many of these compliance failures were failures to revoke certificates in a timely manner. One imagines that DigiCert would not like to gain a reputation for tardy revocation, particularly at the moment.

The Legal Action Now we come to Alegeus Technologies. They've opened a civil case whose first action is to request the issuance of a Temporary Restraining Order (TRO) that prevents DigiCert from revoking certificates issued to Alegeus (which the court has issued). This is a big deal, because TROs are legal instruments that, if not obeyed, constitute contempt of court (or something similar), and courts do not like people who disregard their instructions. That means that, in the short term, those certificates aren't getting revoked, despite the requirement imposed by root stores on DigiCert that the certificates must be revoked. DigiCert is in a real rock / hard place situation here: revoke and get punished by the courts, or don't revoke and potentially (though almost certainly not, in the circumstances) face removal from trust stores (which would kill, or at least massively hurt, their business). The reason that Alegeus gives for requesting the restraining order is that "[t]o Reissue and Reinstall the Security Certificates, Alegeus must work with and coordinate with its Clients, who are required to take steps to rectify the certificates. Alegeus has hundreds of such Clients. Alegeus is generally required by contract to give its clients much longer than 24 hours notice before executing such a change regarding certification." In the filing, Alegeus does acknowledge that DigiCert is a voluntary member of the Certification Authority Browser Forum (CABF), which has bylaws stating that certificates with an issue in their domain validation must be revoked within 24 hours. This is a misstatement of the facts, though. It is the BRs, not the CABF bylaws, that require revocation, and the BRs apply to all CAs that wish to be included in browser and OS trust stores, not just those that are members of the CABF. In any event, given that Alegeus was aware that DigiCert is required to revoke certificates within 24 hours, one wonders why Alegeus went ahead and signed agreements with their customers that require a lengthy notice period before changing certificates. What complicates the situation is that there is apparently a Master Services Agreement (MSA) that states that it constitutes the entire agreement between the parties, and that MSA doesn't mention certificate revocation anywhere relevant. That means it's not quite so cut-and-dried that DigiCert does, in fact, have the right to revoke those certificates. I'd expect a lot of "update to your Master Services Agreement" emails to be going out from DigiCert (and other CAs) in the near future to clarify this point. Not being a lawyer, I can't imagine which way this case might go, but there's one thing we can be sure of: some lawyers are going to be able to afford that trip to a tropical paradise this year.

The Security Issues The requirement for revocation within 24 hours is an important security control in the WebPKI ecosystem. If a certificate is misissued to a malicious party, or is otherwise compromised, it needs to be marked as untrustworthy as soon as possible. While revocation is far from perfect, it is the best tool we have. In this court filing, Alegeus has claimed that they are unable to switch certificates with less than 24 hours notice (due to contractual SLAs). This is a pretty big problem, because there are lots of reasons why a certificate might need to be switched out Very Quickly. As a practical example, someone with access to the private key for your SSL certificate might decide to use it in a blog post. Letting that sort of problem linger for an extended period of time might end up being a Pretty Big Problem of its own. An organisation that cannot respond within hours to a compromised certificate is playing chicken with their security.

The Takeaways Contractual obligations that require you to notify anyone else of a certificate (or private key) changing are bonkers, and completely antithetical to the needs of the WebPKI. If you have to have them, you're going to want to start transitioning to a private PKI, wherein you can do whatever you darn well please with revocation (or not). As these sorts of problems keep happening, trust stores (and hence CAs) are going to crack down on this sort of thing, so you may as well move sooner rather than later. If you are an organisation that uses WebPKI certificates, you've got to be able to deal with any kind of certificate revocation event within hours, not days. This basically boils down to automated issuance and lifecycle management, because having someone manually request and install certificates is terrible on many levels. There isn't currently a completed standard for notifying subscribers that their certificates need premature renewal (say, due to needing to be revoked), but the ACME Renewal Information Extension is currently being developed to fill that need. Ask your CA if they're tracking this standards development, and when they intend to have the extension available for use. (Pro-tip: if they say "we'll start doing development when the RFC is published", run for the hills; that's not how responsible organisations work on the Internet.)
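For a rough flavour of what ACME Renewal Information looks like on the wire (a hedged sketch; the endpoint URL and certificate identifier here are made up, and the extension was still a draft at the time of writing), a client periodically fetches its certificate's renewal-info resource and reads back a suggested renewal window:

curl -s https://acme.example.org/renewalInfo/CERT_ID | jq .suggestedWindow
# {
#   "start": "2024-08-01T00:00:00Z",
#   "end": "2024-08-02T00:00:00Z"
# }

If the window suddenly moves earlier than the normal renewal schedule, the client knows the certificate needs replacing ahead of time.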

The Givings If you've found this helpful, consider shouting me a refreshing beverage. Reading through legal filings is thirsty work!

21 July 2024

Russell Coker: SE Linux Policy for Dell Management

The recent issue of Windows security software killing computers has reminded me about the issue of management software for Dell systems. I wrote policy for the Dell management programs that extract information from iDRAC and store it in Linux. After the break I've pasted in the policy. It probably needs some changes for recent software; it was last tested on a PowerEdge T320, and prior to that was used on a PowerEdge R710, both of which are old hardware and use different management software to the recent hardware. One would hope that the recent software would be much better, but usually such hope is in vain. I deliberately haven't submitted this for inclusion in the reference policy because it's for proprietary software and also because it permits many operations that we would prefer not to permit. The policy is after the break because it's larger than you want on a Planet feed. But first I'll give a few selected lines that are bad in a noteworthy way:
  1. sys_admin means the ability to break everything
  2. dac_override means the ability to bypass Unix permissions
  3. mknod means the daemon creates device nodes itself, due to a lack of udev configuration
  4. sys_rawio means someone didn't feel like writing a device driver; maintaining a device driver for DKMS is hard and getting a driver accepted upstream requires writing quality code. In any case, this is a bad sign.
  5. self:lockdown is being phased out, but used to mean bypassing some integrity protections, usually related to sys_rawio or similar
  6. dev_rx_raw_memory is bad: reading raw memory allows access to pretty much everything, and executing raw memory is something I can't imagine a good use for. The Reference Policy doesn't use this anywhere!
  7. dev_rw_generic_chr_files usually means a lack of udev configuration, as udev should do that
  8. storage_raw_write_fixed_disk shouldn't be needed for this sort of thing; this software doesn't do anything that involves managing partitions
Now, without network access or other obvious means of remote control, this level of access, while excessive, isn't necessarily going to allow bad things to happen from an outside attack. But if there are bugs in the software, there's nothing to stop it from giving the worst results.
allow dell_datamgrd_t self:capability { dac_override dac_read_search mknod sys_rawio sys_admin };
allow dell_datamgrd_t self:lockdown integrity;
dev_rx_raw_memory(dell_datamgrd_t)
dev_rw_generic_chr_files(dell_datamgrd_t)
dev_rw_ipmi_dev(dell_datamgrd_t)
dev_rw_sysfs(dell_datamgrd_t)
storage_raw_read_fixed_disk(dell_datamgrd_t)
storage_raw_write_fixed_disk(dell_datamgrd_t)
allow dellsrvadmin_t self:lockdown integrity;
allow dellsrvadmin_t self:capability { sys_admin sys_rawio };
dev_read_raw_memory(dellsrvadmin_t)
dev_rw_sysfs(dellsrvadmin_t)
dev_rx_raw_memory(dellsrvadmin_t)
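If you want to audit a loaded policy for permissions like these yourself, sesearch (from the setools package) can query the allow rules; a minimal example:

sesearch --allow -s dell_datamgrd_t -c capability
sesearch --allow -s dell_datamgrd_t -c lockdown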
The best thing that Dell could do for their customers is to make this free software and allow the community to fix some of these issues.
Here is dellsrvadmin.te:
policy_module(dellsrvadmin,1.0.0)
require {
  type dmidecode_exec_t;
  type udev_t;
  type device_t;
  type event_device_t;
  type mon_local_test_t;
}
type dellsrvadmin_t;
type dellsrvadmin_exec_t;
init_daemon_domain(dellsrvadmin_t, dellsrvadmin_exec_t)
type dell_datamgrd_t;
type dell_datamgrd_exec_t;
init_daemon_domain(dell_datamgrd_t, dell_datamgrd_exec_t)
type dellsrvadmin_var_t;
files_type(dellsrvadmin_var_t)
domain_transition_pattern(udev_t, dellsrvadmin_exec_t, dellsrvadmin_t)
modutils_domtrans(dellsrvadmin_t)
allow dell_datamgrd_t device_t:dir rw_dir_perms;
allow dell_datamgrd_t device_t:chr_file create;
allow dell_datamgrd_t event_device_t:chr_file { read write };
allow dell_datamgrd_t self:process signal;
allow dell_datamgrd_t self:fifo_file rw_file_perms;
allow dell_datamgrd_t self:sem create_sem_perms;
allow dell_datamgrd_t self:capability { dac_override dac_read_search mknod sys_rawio sys_admin };
allow dell_datamgrd_t self:lockdown integrity;
allow dell_datamgrd_t self:unix_dgram_socket create_socket_perms;
allow dell_datamgrd_t self:netlink_route_socket r_netlink_socket_perms;
modutils_domtrans(dell_datamgrd_t)
can_exec(dell_datamgrd_t, dmidecode_exec_t)
allow dell_datamgrd_t dellsrvadmin_var_t:dir rw_dir_perms;
allow dell_datamgrd_t dellsrvadmin_var_t:file manage_file_perms;
allow dell_datamgrd_t dellsrvadmin_var_t:lnk_file read;
allow dell_datamgrd_t dellsrvadmin_var_t:sock_file manage_file_perms;
kernel_read_network_state(dell_datamgrd_t)
kernel_read_system_state(dell_datamgrd_t)
kernel_search_fs_sysctls(dell_datamgrd_t)
kernel_read_vm_overcommit_sysctl(dell_datamgrd_t)
# for /proc/bus/pci/*
kernel_write_proc_files(dell_datamgrd_t)
corecmd_exec_bin(dell_datamgrd_t)
corecmd_exec_shell(dell_datamgrd_t)
corecmd_shell_entry_type(dell_datamgrd_t)
dev_rx_raw_memory(dell_datamgrd_t)
dev_rw_generic_chr_files(dell_datamgrd_t)
dev_rw_ipmi_dev(dell_datamgrd_t)
dev_rw_sysfs(dell_datamgrd_t)
files_search_tmp(dell_datamgrd_t)
files_read_etc_files(dell_datamgrd_t)
files_read_etc_symlinks(dell_datamgrd_t)
files_read_usr_files(dell_datamgrd_t)
logging_search_logs(dell_datamgrd_t)
miscfiles_read_localization(dell_datamgrd_t)
storage_raw_read_fixed_disk(dell_datamgrd_t)
storage_raw_write_fixed_disk(dell_datamgrd_t)
can_exec(mon_local_test_t, dellsrvadmin_exec_t)
allow mon_local_test_t dellsrvadmin_var_t:dir search;
allow mon_local_test_t dellsrvadmin_var_t:file read_file_perms;
allow mon_local_test_t dellsrvadmin_var_t:file setattr;
allow mon_local_test_t dellsrvadmin_var_t:sock_file write;
allow mon_local_test_t dell_datamgrd_t:unix_stream_socket connectto;
allow mon_local_test_t self:sem { create read write destroy unix_write };
allow dellsrvadmin_t self:process signal;
allow dellsrvadmin_t self:lockdown integrity;
allow dellsrvadmin_t self:sem create_sem_perms;
allow dellsrvadmin_t self:fifo_file rw_file_perms;
allow dellsrvadmin_t self:packet_socket create;
allow dellsrvadmin_t self:unix_stream_socket { connectto create_stream_socket_perms };
allow dellsrvadmin_t self:capability { sys_admin sys_rawio };
dev_read_raw_memory(dellsrvadmin_t)
dev_rw_sysfs(dellsrvadmin_t)
dev_rx_raw_memory(dellsrvadmin_t)
allow dellsrvadmin_t dellsrvadmin_var_t:dir rw_dir_perms;
allow dellsrvadmin_t dellsrvadmin_var_t:file manage_file_perms;
allow dellsrvadmin_t dellsrvadmin_var_t:lnk_file read;
allow dellsrvadmin_t dellsrvadmin_var_t:sock_file write;
allow dellsrvadmin_t dell_datamgrd_t:unix_stream_socket connectto;
kernel_read_network_state(dellsrvadmin_t)
kernel_read_system_state(dellsrvadmin_t)
kernel_search_fs_sysctls(dellsrvadmin_t)
kernel_read_vm_overcommit_sysctl(dellsrvadmin_t)
corecmd_exec_bin(dellsrvadmin_t)
corecmd_exec_shell(dellsrvadmin_t)
corecmd_shell_entry_type(dellsrvadmin_t)
files_read_etc_files(dellsrvadmin_t)
files_read_etc_symlinks(dellsrvadmin_t)
files_read_usr_files(dellsrvadmin_t)
logging_search_logs(dellsrvadmin_t)
miscfiles_read_localization(dellsrvadmin_t)
Here is dellsrvadmin.fc:
/opt/dell/srvadmin/sbin/.*        --        gen_context(system_u:object_r:dellsrvadmin_exec_t,s0)
/opt/dell/srvadmin/sbin/dsm_sa_datamgrd        --        gen_context(system_u:object_r:dell_datamgrd_t,s0)
/opt/dell/srvadmin/bin/.*        --        gen_context(system_u:object_r:dellsrvadmin_exec_t,s0)
/opt/dell/srvadmin/var(/.*)?                        gen_context(system_u:object_r:dellsrvadmin_var_t,s0)
/opt/dell/srvadmin/etc/srvadmin-isvc/ini(/.*)?        gen_context(system_u:object_r:dellsrvadmin_var_t,s0)
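For reference, a module like this is typically built and loaded with the standard refpolicy development Makefile (assuming the SELinux development tooling is installed):

make -f /usr/share/selinux/devel/Makefile dellsrvadmin.pp
semodule -i dellsrvadmin.pp
restorecon -R /opt/dell/srvadmin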

4 July 2024

Arturo Borrero González: Wikimedia Toolforge: migrating Kubernetes from PodSecurityPolicy to Kyverno

Le château de Valère et le Haut de Cry en juillet 2022. Christian David, CC BY-SA 4.0, via Wikimedia Commons. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. Summary: this article shares the experience and learnings of migrating away from Kubernetes PodSecurityPolicy to Kyverno on the Wikimedia Toolforge platform. Wikimedia Toolforge is a Platform-as-a-Service, built on Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc.) in support of Wikimedia projects. We provide a set of platform-specific services, command line interfaces, and shortcuts to help with the task of setting up webservices, jobs, and things like building container images or using databases. Using these interfaces makes the underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some advanced users do interact with it directly. Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in place to ensure the performance, stability, and fairness of the system, including quotas, RBAC permissions, and, up until recently, PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the system. We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration. We needed Pods to stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the option to use third-party agents, like OpenPolicyAgent Gatekeeper, but we decided not to invest in them and went with a native, built-in mechanism instead. In 2021 it was announced that the PSP mechanism would be deprecated, and removed in Kubernetes 1.25. Even though we had been warned years in advance, we did not prioritize the migration away from PSP until we were on Kubernetes 1.24, and blocked, unable to upgrade further without taking action. The WMCS team explored different alternatives for this migration, but eventually we decided to go with Kyverno as a replacement for PSP. And with that decision began the journey described in this blog post. First, we needed a source code refactor of one of the key components of our Toolforge Kubernetes: maintain-kubeusers. This custom piece of software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on Kubernetes to accommodate each one: create a namespace, RBAC, quota, a kubeconfig file, etc. With the refactor, we introduced a proper reconciliation loop, so that the software would have a notion of what needs to be done for each account, what is missing, what to delete, what to upgrade, and so on. This would allow us to easily deploy new resources for each account, or iterate on their definitions. The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was doing more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the accounts. We used NFS as the underlying storage system for Toolforge, and it could be very slow, for reasons beyond this blog post. This was corrected in the days following the initial refactor rollout. A side note with an implementation detail: we stored a configmap in each account namespace with the state of each resource.
Storing more state on this configmap was our solution to avoid additional NFS latency. I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks instead. Prior to the refactor, several manual steps and cleanups were required when updating the definition of a resource. The process is now automated, more robust, performant, efficient and clean. So in my opinion it was worth it, even if it took more time than expected. Then we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, we tried to replicate its semantics on a 1:1 basis as much as possible in order to ease the transition. This involved things like transparent mutation of Pod resources, followed by validation. Additionally, we had a different PSP definition for each account, so we decided to create a different namespaced Kyverno policy resource for each account namespace (remember, we had 3.5k accounts). We created a Kyverno policy template that we would then render and inject for each account. For developing and testing all this, both the maintain-kubeusers and the Kyverno bits, we had a project called lima-kilo: a local Kubernetes setup replicating production Toolforge. This was used by each engineer on their laptop as a common development environment. We had planned the migration from PSP to Kyverno policies in stages, like this:
  1. update our internal template generators to make Pod security settings explicit
  2. introduce Kyverno policies in Audit mode
  3. watch how the cluster behaved with the new policies, check for any offending resources they reported, and correct them
  4. modify Kyverno policies and set them in Enforce mode
  5. drop PSP
In stage 1, we updated things like the toolforge-jobs-framework and tools-webservice. In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately. Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any administrative actions on the Kubernetes control plane, or even on the underlying virtual machines. All Toolforge users were impacted. This was a full-scale outage that required the energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred. This incident happened despite having tested beforehand in lima-kilo and in another pre-production cluster we had, called Toolsbeta. But we had not tested with that many policy resources; clearly, this was something scale-related. After the incident, I went and created 3.5k Kyverno policy resources on lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a few errors in our infrastructure, reached out to the Kyverno upstream developers for advice, and in the end adjusted the setup to accommodate our needs. I have to admit, I was briefly tempted to drop Kyverno, and even to stop pursuing an external policy agent entirely and write our own custom admission controller, out of concerns over the performance of this architecture. However, after applying all those measures, the system became very stable, so we decided to move forward. The second attempt at deploying it all went through just fine. No outage this time. When we were in stage 4, we detected another bug. We had been following the Kubernetes upstream documentation for setting securityContext to the right values. In particular, we were enforcing procMount to be set to the default value, which per the docs was "DefaultProcMount". However, that string is the name of the internal variable in the source code, whereas the actual default value is the string "Default". This caused pods to be rightfully rejected by Kyverno while we figured out the problem. I sent a patch upstream to fix it. We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP controller from the Kubernetes apiserver and deleted every individual PSP definition. Everything was very smooth in this last step of the migration. This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages, took roughly three months to complete. For me there are a number of valuable lessons from this project. For one, scale is something to consider, and test, when evaluating a new architecture or software component. Not doing so can lead to service outages or unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got a reminder the hard way. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.
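For a flavour of what these namespaced policies can look like, here is a minimal, hypothetical sketch (not the actual Toolforge template) of a Kyverno Policy that validates the procMount setting discussed above:

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: pod-security-sketch
  namespace: tool-example        # hypothetical tool namespace
spec:
  validationFailureAction: Audit # flipped to Enforce in a later stage
  rules:
    - name: require-default-procmount
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "procMount must be left at the default"
        pattern:
          spec:
            containers:
              - securityContext:
                  procMount: Default

In Audit mode, findings end up in policy reports instead of blocking workloads, so they can be reviewed with kubectl get policyreports -A before flipping a policy to Enforce.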

1 July 2024

Abhijith PA: A lazy local file sharing setup

At home, I have both a laptop and a *desktop PC. Most of my essential things, such as emails, repositories, password managers, contacts, and calendars, are synced between the two devices. However, when I need to share some documents and am too lazy to go pick up a flash drive, my only option is to push them to the Internet and download them on the other system, which is sitting ~20 meters away. Typically, I do this either through email attachments or a Matrix client. Occasionally, I think about setting up a network storage solution at home. But then I ask myself: do I really need one? In my home network, I already have a Raspberry Pi running as my Wi-Fi router, doing DNS-level ad blocking with Dnsmasq and DNS over TLS with stubby. The RPi has a 16 GB memory card, and I can mount a remote RPi directory on both machines. I use pcmanfm as my file manager. It has the ability (like every other file manager) to mount remote storage over ssh. But one annoying thing is that whenever I open the mounted directory, by default it shows the root file system of the remote device, even when I explicitly mention the path. Then I discovered sshfs. I wrote the following script, which mounts the remote directory and opens it in pcmanfm.
#!/bin/bash
# Mount the RPi's Public directory over sshfs, then open it in pcmanfm.
LOCMOUNT="/home/user/Public"
sshfs raspberrypi:Public "$LOCMOUNT"
pcmanfm "$LOCMOUNT"
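A small note: since sshfs is FUSE-based, the mount can be detached again without root once you're done:

fusermount -u /home/user/Public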
I hadn't enabled any encryption for the memory card until now, since other than some logs the RPi wasn't writing anything to it. I have set up fscrypt on the RPi storage now. And ta-da, a lazy person's local sharing solution. *Desktop - well, technically it's an old laptop with a broken keyboard and trackpad, connected to a monitor, keyboard and mouse. I didn't feel like leaving it on a shelf.

26 June 2024

Gunnar Wolf: Many terabytes for students to play with. Thanks Debian!

LIDSOL students receiving their new hard drives My students at LIDSOL (Laboratorio de Investigación y Desarrollo de Software Libre, Free Software Research and Development Lab) at Facultad de Ingeniería, UNAM asked me to help them get the hardware needed to set up a mirror for various free software projects. We have some decent servers (not new servers, but mirrors don't need to be top-performance), so a couple of weeks ago I approached the Debian Project Leader (DPL) and suggested we should buy two 16 TB hard drives for this project, as that is the most reasonable cost-per-byte point I found. He agreed, and I bought the drives. Today we had a lab meeting, and I handed over the hardware. Of course, they are very happy and thankful to the Debian project. In the last couple of weeks, they have already set up an Archlinux mirror (https://archlinux.org/mirrors/fi-b.unam.mx), and now that they have heaps of storage space, plans are underway to set up various other mirrors (of course, a Debian mirror will be among the first).

19 June 2024

Sahil Dhiman: First Iteration of My Free Software Mirror

As I'm gearing up towards setting up a Free Software download mirror in India, it occurred to me that I haven't chronicled the work and motivation behind setting up the original mirror in the first place. Also, it seems like it would be good to document things here for observing the progression, as the mirror is going multi-country now. Right now, my existing mirror, i.e. mirrors.de.sahilister.net (formerly mirrors.sahilister.in), is hosted in Germany and serves traffic for Termux, NomadBSD, Blender, BlendOS and GIMP. For a while in between, it hosted the OSMC project mirror as well. To explain first what a Free Software download mirror is, I'll quote myself from my work blog -
As most Free Software doesn't have commercial backing and requires heavy downloads, the concept of software download mirrors helps take the traffic load off of the primary server, leading to geographical redundancy, higher availability, and faster downloads in general.
So whenever someone wants to download a particular (mirrored) software and clicks download, upstream redirects the download to one of the mirror servers which is geographically (or by other parameters) near to the user, leading to faster downloads and load sharing amongst all mirrors. Since the time I got into Linux and servers, I have always wanted to help the community somehow, and mirroring seemed to be the most obvious thing. India is a country which has traditionally had fewer public download mirrors. IITB, TIFR, and some of the other public institutions used to host them for popular Linux distributions and Free Software, but they seem to be diminishing these days. In the last months of 2021, I started using Termux and saw that it had only a few mirrors (back then). I tried getting a high-capacity, high-bandwidth node on a budget, but that was hard in India in 2021-22. So after much deliberation, I decided to go where it was available and chose a German hosting provider, with the thought of adding an India node when conditions were favorable (thankfully that happened, and the India node is live now too). Termux required only 29 GB of storage, so I went ahead and started mirroring it. I raised this issue in Termux's GitHub repository in January 2022. This blog post chronicles the start of the mirror. Termux has high request counts from a mirror point of view. Each Termux client usually checks every mirror in the selected group for availability before randomly selecting one for download (the only other case is when a client has explicitly selected a single mirror using termux-repo-change). The mirror started getting thousands of requests daily due to this, but only a small percentage would actually end up selecting my mirror, so download traffic was lower. A similar thing happened with OSMC too (which I started mirroring later). With this start, I began exploring various projects that would benefit from additional mirrors. Public information from the Academic Computer Club in Umeå's mirror and Freedif's mirror stats helped to figure out storage and bandwidth requirements for potential projects. Fun fact: the Academic Computer Club in Umeå (which runs one of the prominent Debian, Ubuntu, etc. mirrors) now has a 200 Gbit/s uplink to the internet through SUNET. Later, I migrated to a different provider for better speeds and added a LibreSpeed test on the mirror server. Those were fun times. Between OSMC, Termux and LibreSpeed, I was getting almost 1.2 million hits/day on the server at its peak, crossing a TB/day of traffic for the first time. Next came Blender, which took the longest to set up, around 9-10 months. Blender had a push-trigger requirement for rsync from upstream that took quite some back and forth. It now contributes the largest amount of traffic on the mirror. On release days, the mirror does more than 3 TB/day; on normal days, it hovers around 2 TB/day. The GIMP project is the latest addition. At one point, the mirror traffic touched 4.97 TB/day. That's when I decided to drop the LibreSpeed server to focus solely on mirroring for now, keeping the bandwidth allotment for serving downloads only. The mirror project selection grew organically. I used to reach out to many projects to discuss the need for additional mirrors. Some projects outright denied the mirroring request, as Germany already has good academic mirrors boasting 20-25 Gbit/s speeds from the FTP era, which seems fair. Finding the niche was essential, to only add software which would truly benefit from additional capacity.
There were months when nothing much would happen with the mirror; rsync would continue to update the mirror while nginx kept serving the traffic. Nowadays, the mirror pushes around 70 TB/month. I occasionally check logs and vnstat, add new security stuff here and there, and pay the bills. It now saturates the Gigabit link sometimes and goes beyond that, peaking around 1.42 Gbit/s (the hosting provider seems to be upping their game). The plan is to upgrade the link to better speeds.
Yearly traffic stats (through vnstat -y)
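As an aside, the day-to-day mechanics are pleasantly boring; a minimal sketch of the kind of cron-driven rsync job involved (the upstream module and paths here are hypothetical):

#!/bin/bash
# Sync one mirrored project from its upstream rsync module.
# A lock file prevents overlapping runs if upstream is slow.
set -euo pipefail
exec 9>/var/lock/mirror-termux.lock
flock -n 9 || exit 0
rsync --archive --delete-after --partial --timeout=600 \
    rsync://upstream.example.org/termux/ /srv/mirror/termux/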
On the way, I learned quite a few things.
GeoIP map of clients from yesterday's access logs (generated from IPinfo.io)
In hindsight, the statistics look amazing: hundreds of TBs of traffic served from the mirror, month after month. That shows there's still an appetite for public mirrors in the time of commercially donated CDNs and GitHub. The world could have done with one less mirror, but it saved some time and lessened the burden for others, while providing redundancy and traffic localization with one additional mirror. And it's fun for someone like me who's into the infrastructure that powers the Internet. Now, I'll try focusing on and expanding the India mirror, which has itself started pushing almost half a TB/day. Long live Free Software and public download mirrors.

6 June 2024

Debian Brasil: MiniDebConf Belo Horizonte 2024 - a brief report

From April 27th to 30th, 2024, MiniDebConf Belo Horizonte 2024 was held at the Pampulha Campus of UFMG - Federal University of Minas Gerais, in Belo Horizonte city. This was the fifth time that a MiniDebConf (as an exclusive in-person event about Debian) took place in Brazil. Previous editions were in Curitiba (2016, 2017, and 2018) and in Brasília in 2023. We have had other MiniDebConf editions held within Free Software events such as FISL and Latinoware, as well as other online events. See our event history. Parallel to MiniDebConf, on the 27th (Saturday), FLISOL - the Latin American Free Software Installation Festival - took place. It's the largest event in Latin America promoting Free Software, and it has been held since 2005 simultaneously in several cities. MiniDebConf Belo Horizonte 2024 was a success (as were previous editions) thanks to the participation of everyone, regardless of their level of knowledge about Debian. We value the presence of both beginner users who are familiarizing themselves with the system and the official project developers. The spirit of welcome and collaboration was present during the whole event.

2024 edition numbers: During the four days of the event, several activities took place for all levels of users and collaborators of the Debian project. The final numbers for MiniDebConf Belo Horizonte 2024 show that we had a record number of participants. Of the 224 participants, 15 were official Brazilian contributors, 10 DDs (Debian Developers) and 5 DMs (Debian Maintainers), in addition to several unofficial contributors. The organization was carried out by 14 people who started working at the end of 2023, including Prof. Loïc Cerf from the Computing Department, who made the event possible at UFMG, and 37 volunteers who helped during the event. As MiniDebConf was held at UFMG facilities, we had the help of more than 10 University employees. See the list with the names of people who helped in some way in organizing MiniDebConf Belo Horizonte 2024. The difference between the number of people registered and the number of attendees at the event is probably explained by the fact that there is no registration fee, so if a person decides not to go to the event, they do not suffer financial losses. The 2024 edition of MiniDebConf Belo Horizonte was truly grand and shows the result of the constant efforts made over the last few years to attract more contributors to the Debian community in Brazil. With each edition the numbers only increase, with more attendees, more activities, more rooms, and more sponsors/supporters.

Activities The MiniDebConf schedule was intense and diverse. On the 27th, 29th and 30th (Saturday, Monday and Tuesday) we had talks, discussions, workshops and many practical activities. On the 28th (Sunday), the Day Trip took place, a day dedicated to sightseeing around the city. In the morning we left the hotel and went, on a chartered bus, to the Belo Horizonte Central Market. People took the opportunity to buy various things such as cheeses, sweets, cachaças and souvenirs, as well as to taste some local foods. After a 2-hour tour of the Market, we got back on the bus and hit the road for lunch at a restaurant serving typical Minas Gerais food. With everyone well fed, we returned to Belo Horizonte to visit the city's main tourist attraction: Lagoa da Pampulha and the Capela São Francisco de Assis, better known as Igrejinha da Pampulha. We went back to the hotel, and the day ended in the hacker space that we set up in the events room for people to chat, do packaging, and eat pizzas.

Crowdfunding For the third time we ran a crowdfunding campaign, and it was incredible how people contributed! The initial goal was to raise the amount equivalent to a gold tier, R$ 3,000.00. When we reached this goal, we defined a new one, equivalent to one gold tier + one silver tier (R$ 5,000.00). And again we achieved this goal. So we proposed as a final goal the value of gold + silver + bronze tiers, which would be equivalent to R$ 6,000.00. The result was that we raised R$ 7,239.65 (~USD 1,400) with the help of more than 100 people! Thank you very much to the people who contributed any amount. As a thank you, we have listed the names of the people who donated.

Food, accommodation and/or travel grants for participants Each edition of MiniDebConf has brought some innovation, or some different benefit for the attendees. In this year's edition in Belo Horizonte, as with DebConfs, we offered bursaries for food, accommodation and/or travel to help those people who would like to come to the event but who would need some kind of help. In the registration form, we included the option for the person to request a food, accommodation and/or travel bursary, but to do so, they would have to identify themselves as a contributor (official or unofficial) to Debian and write a justification for the request. The food bursary provided lunch and dinner every day. The lunches included attendees who live in Belo Horizonte and the region. Dinners were paid for attendees who also received accommodation and/or travel. The accommodation was at the BH Jaraguá Hotel. And the travel grants covered airplane or bus tickets, or fuel (for those who came by car or motorbike). Much of the money to fund the bursaries came from the Debian Project, mainly for travel. We sent a budget request to the former Debian leader Jonathan Carter, and he promptly approved our request. In addition to this event budget, the leader also approved individual requests sent by some DDs who preferred to request directly from him. The experience of offering the bursaries was really good because it allowed several people to come from other cities.
Photos and videos You can watch recordings of the talks and see the photos taken by several collaborators at the links below. Thanks We would like to thank all the attendees, organizers, volunteers, sponsors and supporters who contributed to the success of MiniDebConf Belo Horizonte 2024. Sponsors Gold: Silver: Bronze: Supporters Organizers

2 June 2024

Jacob Adams: What to Do When You Forget Your Root Password

Forgetting your root password would initially seem like a problem requiring a full re-install, one that you can't easily recover from without wiping everything away. Forgetting your user password can of course be solved by changing it as root, as in the following, which changes the password for the user jacob:
# passwd jacob
but the root password can only be changed by root, so you need to somehow get root access in order to do so.

Changing Root's Password with Sudo This one is probably obvious, but if you have a user with the ability to use sudo, then you can change root's password without access to the root account by running:
$ sudo passwd
which will reset the password for the root account without requiring the existing password.

Boot Directly to a Shell Getting root access to any Linux machine you have physical access to is surprisingly simple. You can just boot the machine directly into a root shell without any access control, i.e. passwords.

Why You Should Always Encrypt Your Storage [1] To boot directly to a shell you need to append the following text to the kernel command line:
init=/bin/sh
(You could use pretty much any program here, but you're putting your system into a weird state by doing this, so I'd recommend the simplest approach.)

GRUB GRUB will allow you to edit boot parameters on startup using the e key. You'll then be presented with an editor [2] that you can use to change the kernel command line by appending to the linux line. E.g. if your editor looks like this:
        load_video
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_gpt
        insmod ext2
        search --no-floppy --fs-uuid --set=root abcd1234-5678-0910-1112-abcd12345678
        echo    'Loading Linux 6.1.0-21-amd64 ...'
        linux   /boot/vmlinuz-6.1.0-21-amd64 root=UUID=abcd1234-5678-0910-1112-abcd12345678 ro  quiet
        echo    'Loading initial ramdisk ...'
        initrd  /boot/initrd.img-6.1.0-21-amd64
Then you would add init=/bin/sh like so:
        load_video
        insmod gzio
        if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
        insmod part_gpt
        insmod ext2
        search --no-floppy --fs-uuid --set=root abcd1234-5678-0910-1112-abcd12345678
        echo    'Loading Linux 6.1.0-21-amd64 ...'
        linux   /boot/vmlinuz-6.1.0-21-amd64 root=UUID=abcd1234-5678-0910-1112-abcd12345678 ro  quiet init=/bin/sh
        echo    'Loading initial ramdisk ...'
        initrd  /boot/initrd.img-6.1.0-21-amd64
Once you've edited it you can start your machine with Ctrl+x, as you can see from the prompt text under the editor.

Raspberry Pi cmdline.txt On a Raspberry Pi you'll want to append the above to the only line of the cmdline.txt file on the boot partition of the SD card. This is the first partition of the disk, the one that is FAT32. You'll need to do this on another machine, since if you had root access to edit cmdline.txt you could also just change your password. As it is a FAT32 partition on an SD card, it should be editable on any other machine that supports SD cards. For example, if your cmdline.txt looks like this:
console=serial0,115200 console=tty1 root=PARTUUID=fb33757d-02 rootfstype=ext4 fsck.repair=yes rootwait quiet
Then you would add init=/bin/sh like so:
console=serial0,115200 console=tty1 root=PARTUUID=fb33757d-02 rootfstype=ext4 fsck.repair=yes rootwait quiet init=/bin/sh

Mount Read / Write Since you're replacing the init process of the machine with a shell, no other processes will be running. Also, your root filesystem will be mounted read-only, as init is expected to remount it read-write as needed.
# mount -o remount,rw /

Change Root Password Once you've remounted the root filesystem, all that's needed is to run the passwd command.
# passwd
Since you're running the command as root you won't need to provide your existing password, and will only need to type a new password twice. Now, of course, you simply need to remember that password to ensure you don't need to do this again.

Reboot Safely You cannot follow the standard reboot process here, as you're only running one process. Therefore it's important to put your root filesystem back into read-only mode before powering off your machine:
# mount -o remount,ro /
Once you've done that you just need to hold down the power button until the machine completely powers off, or pull the plug. And then you're done! Boot the computer again and you'll have everything working as normal, with a root password you remember.
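To recap, the whole sequence from the emergency shell is just a handful of commands. (The sync here is an extra precaution rather than a required step, since remounting read-only already flushes pending writes.)
# mount -o remount,rw /
# passwd
# sync
# mount -o remount,ro /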
  1. Not that this is the only reason: anyone with physical access to your machine could also boot it into another operating system they control, or just remove your storage device and put it into another computer, or probably other things I'm not thinking of right now. You should always encrypt your devices.
  2. The editor uses emacs-like keybindings. The manual includes a list of all the options available.

1 June 2024

Ian Jackson: What your vote is worth - a back of the envelope calculation

tl;dr: Your vote really counts! Each vote in a UK General Election is worth maybe £100,000 - to you and all your fellow citizens taken together. If you really care about the welfare of everyone affected by actions of the UK government, then it's worth that to you too.

Introduction It seems a common perception that one vote, in amongst all those millions, doesn't really matter. So maybe it's not worth voting. But voting is (largely) what determines what the government does - and the government is big. It's as big as all the people. If you are the kind of person who cares about what happens to everyone in your polity, and indeed everyone its actions affect, then even your one vote is very important indeed.

A method for back of the envelope calculation It would be nice to give a quantitative estimate. Many things in our society are measured in money, so let's try taking a stab at calculating the money value of your vote. The argument I'm going to make is this: the government (by which I include the legislature), which is selected by our votes, decides how to spend the national budget. So, basically, I'm going to divide the budget by the electorate.

UK Parliament UK Parliamentary elections decide not only the House of Commons but, through that, the government. The upper house, the House of Lords, has very limited influence. So I think it's fair to regard the Parliamentary election as, simply, controlling that budget. Being lazy, I'm going to use Wikipedia data. We have the size of the electorate for 2019: 47.6 million. But your influence isn't shared with the whole electorate, only with the other people who also vote. Turnout in 2019 was 67.3%. The 2019 budget isn't listed, so I'll just average the 2018 and March 2020 figures - £842bn and £873bn - giving £857 billion. (Strictly speaking I should add up the budgets for the period of the Parliament, but that seems like a lot of effort.) There's a discrepancy in the timescale we need to account for: your vote influences the budgets for several years, depending how long it is until the next election. Taking Wikipedia's list of elections this century, there've been 7 in 24 years - an average of about 3.4 years each. So, multiplying it through, we have (£857bn * (24 / 7)) / (47.6M * 67.3%), giving a guess at the value of your UK General Election vote: £92,000.

European Parliament The 2022 budget for the European Union (Wikipedia again) was €170.6bn. The last election, in 2019, had a turnout of 198,352,638 voters. Each EU Parliament lasts 5 years. The Parliament, however, shares responsibility for the budget with the European Council, which is controlled, ultimately, by national governments. We have to pick a numerical value for the Parliament's share of the influence. Over the past years the Parliament has gradually been more willing to exercise its powers in this area. I'm going to arbitrarily call its share 50%. The calculation, then, is €170.6bn * 5 * 50% / 198M, giving a guess at the value of your EU Parliamentary Election vote: €2,150. This much smaller figure reflects simply that the EU doesn't spend very much money for a polity of its size. (Those stories in the British press giving the impression that the EU is massively wasteful are, simply, lies.) The interaction of this calculation with the Council's share of the influence, and with national budgets, is a bit of a question, but given the much smaller amounts involved it doesn't seem worth thinking about that too hard.
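As a quick check of the arithmetic, both calculations can be reproduced with bc, using exactly the Wikipedia-derived figures quoted above:
$ echo '857 * 10^9 * (24/7) / (47.6 * 10^6 * 0.673)' | bc -l
91721.68...
$ echo '170.6 * 10^9 * 5 * 0.5 / 198352638' | bc -l
2150.21...
That is roughly the £92,000 per UK vote and €2,150 per EU vote quoted above.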
Only if you care about other people as much as yourself! All of this is only true for you if you value and want to help everyone in your society. That includes immigrants, women, unemployed people, disabled people, people who are much poorer or richer than you, etc. If you think about it in purely personal terms, your vote is hardly worth anything: while the effect of your vote, overall, is very large, that effect is shared by everyone in your polity. So if you only care about yourself, voting is a total waste of time. The more selfish and xenophobic and racist and so on you are - caring only about people like yourself - the less your vote is worth. This is why voting is rightly seen as a civic duty. I just spent £30 to courier my EP vote to Den Haag. That only makes sense because I'm very willing to spend that £30 to try to improve the spending of the €2,000 or so that's my share of the EU budget.

This is a very rough analysis These calculations neglect a lot of very important things: politics isn't just about the allocation of resources. It's also about values, and bad politics can seriously harm people. Arguably many of those effects of your vote are much more important than just how the budget is set and spent. It would be interesting to see an attempt at a similar analysis that takes into account life and death questions like hate crime, traffic violence, healthcare, refugees' welfare, and so on. I'm not sure how to approach that. Maybe some real social scientists have done so? References welcome. Also, even on its own terms, this analysis is very rough and ready. We haven't modelled the ability of the government to change its tax rates; perhaps we should be multiplying GDP (or some other better measure) by the 90th-percentile total tax rate amongst countries like this one. The amount of influence that can be wielded by one vote is probably nonlinear in the size of the political faction, but I don't know in which direction. In unfair voting systems like the UK's, some people's votes are worth much more than others: in a very marginal constituency which is a target seat, your vote might be worth tens of millions; in a safe seat, it might only be worth a few thousand. And in practical terms you don't get to choose precisely the policies you want; you have to pick a party, which is sometimes very much a question of the lesser evil. So there is much I haven't modelled. But the key point stands:

Conclusion Although your vote is diluted by everyone else's votes, together we control the government, which affects us all. So if you care about the whole of society, the big numbers in the divisor and the numerator cancel out. You can think of your vote as controlling one citizen's worth of government activity.
edited 2024-06-01 09:40 Z to fix a grammar botch



26 May 2024

Russell Coker: USB-A vs USB-C

USB-A is the original socket for USB at the PC end. There are 2 variants of it: the first is for USB 1.1 to USB 2, and the second is for USB 3, which adds extra pins in a plug and socket compatible manner - you can plug a USB-A device into a USB-A socket without worrying about the speeds of each end, as long as you don't need USB 3 speeds. The differences between USB-A and USB-C are:
  1. USB-C has the same form factor as Thunderbolt and the Thunderbolt protocol can run over it if both ends support it.
  2. USB-C generally supports higher power modes for charging (like 130W for Dell laptops, monitors, and plugpacks), but there's no technical reason why USB-A couldn't do it. You can buy chargers that do 60W over USB-A, which could power one of our laptops via a USB-A to USB-C cable. So high power USB-A is theoretically possible, but generally you won't see it.
  3. USB-C has DisplayPort alternate mode which means using some of the wires for DisplayPort.
  4. USB-C sockets are more likely than USB-A sockets to support the highest speeds (super speed etc). This is not a difference in the standards, just a choice made by manufacturers.
While USB-C tends to support higher power delivery modes in actual implementations, for connecting to a PC the PC end seems to only support lower power modes regardless of the port type. I think it would be really good if workstations could connect to monitors via USB-C and provide power, DisplayPort, keyboard, mouse, etc over the same connection, but unfortunately the PC and monitor ends don't appear to support such things. If you don't need any of the benefits in the list above (i.e. you are using USB for almost anything other than connecting a laptop to a dock/monitor/charger) then USB-A will do the job just as well as USB-C. The choice of which type to use should be based on price and which ports are available. E.g. my laptop has 2*USB-C ports and 2*USB-A ports, and given that one USB-C port is almost always used for the monitor or for charging, I don't really want to use USB-C for anything else, to avoid running out of ports.

When buying USB devices you can't always predict which systems you will need to connect them to. Currently there are a lot of systems without USB-C that are working well and have no need to be replaced. I haven't yet seen a system where the majority of ports are USB-C, but that will probably happen in the next few years. Maybe in 2027 there will be PCs on sale with only two USB-A sockets, forcing people who don't want to use a USB hub to save both of them for keyboard and mouse. Currently USB-C keyboards and mice are available on AliExpress, but they are expensive and I haven't seen them in Australian stores. Most computer users don't wear out keyboards or mice, so a lot of USB-A keyboards and mice will be in service for a long time. As an aside, there are still many PCs with PS/2 keyboard and mouse ports in service, so these things don't go away quickly.

There is one corner case where USB-C is convenient: when you want to connect a mass storage device for system recovery or emergency backup, want high speed, and don't want to spend time figuring out which of the ports are super speed (which can be difficult at the back of a PC with poor lighting). With USB-C you can expect a speed of at least 5Gbit/s and don't have to worry about accidentally connecting to a USB 2 port, as is the situation with USB-A. For my own use, the only times I prefer USB-C over USB-A are for devices that connect to phones. Eventually I'll get a laptop that only has USB-C ports and this will change, but even then adaptors are possible. For someone who doesn't know the details of how things work, it's not unreasonable to just buy the newest stuff and assume it's better, as it usually is. But hopefully blog posts like this can help people make more informed decisions.

20 May 2024

Debian Brasil: MiniDebConf Belo Horizonte 2024 - a brief report

From April 27th to 30th, 2024, MiniDebConf Belo Horizonte 2024 was held at the Pampulha Campus of UFMG - Universidade Federal de Minas Gerais, in Belo Horizonte - MG. This was the fifth time that a MiniDebConf (as an in-person event exclusively about Debian) took place in Brazil. The previous editions were in Curitiba (2016, 2017 and 2018) and in Brasília in 2023. There have also been MiniDebConf editions held within Free Software events such as FISL and Latinoware, as well as other online events. See our event history. In parallel with the MiniDebConf, on the 27th (Saturday) FLISOL - Festival Latino-americano de Instalação de Software Livre took place, the largest Free Software promotion event in Latin America, held since 2005 simultaneously in several cities.

MiniDebConf Belo Horizonte 2024 was a success (as were the previous editions), thanks to the participation of everyone, regardless of their level of knowledge about Debian. We value the presence both of beginner users who are getting familiar with the system and of the project's official developers. The spirit of welcome and collaboration was present at every moment.

2024 edition in numbers During the four days of the event there were various activities for all levels of users and contributors of the Debian project. The official schedule consisted of: The final numbers for MiniDebConf Belo Horizonte 2024 show that we had a record number of participants. Of the 224 participants, 15 were official Brazilian contributors: 10 DDs (Debian Developers) and 5 DMs (Debian Maintainers), in addition to several unofficial contributors. The organization was carried out by 14 people who started working back at the end of 2023, among them Prof. Loïc Cerf of the Computer Science Department, who made the event possible at UFMG, plus 37 volunteers who helped during the event. As the MiniDebConf was held on UFMG premises, we also had the help of more than 10 University staff members. See the list with the names of the people who helped in some way to make MiniDebConf Belo Horizonte 2024 happen. The difference between the number of people registered and the number of people present is probably explained by the fact that there is no registration fee, so a person who gives up on attending suffers no financial loss. The 2024 edition of MiniDebConf Belo Horizonte was truly grand and shows the result of the constant efforts made over recent years to attract more contributors to the Debian community in Brazil. With each edition the numbers only grow: more participants, more activities, more rooms, and more sponsors/supporters.

Activities The MiniDebConf schedule was intense and diverse. On the 27th, 29th and 30th (Saturday, Monday and Tuesday) we had talks, discussions, workshops and many practical activities. On the 28th (Sunday) the Day Trip took place, a day dedicated to sightseeing around the city. In the morning we left the hotel on a chartered bus for the Mercado Central de Belo Horizonte. People took the opportunity to buy various things such as cheeses, sweets, cachaças and souvenirs, as well as to try some local foods. After a two-hour tour of the Market, we got back on the bus and hit the road for lunch at a restaurant serving typical Minas Gerais food. With everyone well fed, we returned to Belo Horizonte to visit the city's main tourist attraction: the Lagoa da Pampulha and the Capela São Francisco de Assis, better known as Igrejinha da Pampulha. We went back to the hotel and the day ended in the hacker space we set up in the events room for people to chat, do some packaging, and eat pizza.

Crowdfunding For the third time we ran a crowdfunding campaign, and it was incredible how people contributed! The initial goal was to raise the amount equivalent to a gold sponsorship tier (R$ 3,000.00). When we reached this goal, we set a new one, equivalent to a gold tier plus a silver tier (R$ 5,000.00). And again we reached it. We then proposed as a final goal the value of a gold + silver + bronze tier, equivalent to R$ 6,000.00. The result was that we raised R$ 7,239.65 with the help of more than 100 people! Many thanks to everyone who contributed, whatever the amount. As a thank-you, we list the names of the people who donated.

Food, accommodation and/or travel bursaries for participants Each edition of MiniDebConf has brought some innovation, or some new benefit for participants. In this year's edition in Belo Horizonte, as at DebConfs, we offered bursaries for food, accommodation and/or travel to help those who would like to come to the event but needed some kind of assistance. In the registration form we included an option to request a food, accommodation and/or travel bursary; to do so, the person had to identify themselves as a contributor (official or unofficial) to Debian and write a justification for the request. Number of people benefited: The food bursary provided lunch and dinner every day. Lunches included attendees who live in Belo Horizonte and the surrounding region, while dinners were covered for participants who also received an accommodation and/or travel bursary. Accommodation was at the Hotel BH Jaraguá, and the travel bursaries covered airplane or bus tickets, or fuel for those who came by car or motorbike. A good part of the money to fund the bursaries came from the Debian Project, mainly for travel. We sent a budget request to the then Debian leader, Jonathan Carter, and he promptly approved our request. In addition to this event budget, the leader also approved individual requests sent directly to him by some DDs. The experience of offering the bursaries was really good, because it allowed several people to come from other cities.
Photos and videos You can watch the recordings of the talks at the links below: And see the photos taken by several collaborators at the links below:

Thanks We would like to thank all the attendees, organizers, volunteers, sponsors and supporters who contributed to the success of MiniDebConf Belo Horizonte 2024.

Sponsors Gold: Silver: Bronze: Supporters Organization
