Search Results: "arturo"

1 April 2024

Arturo Borrero Gonz lez: Kubecon and CloudNativeCon 2024 Europe summary

This blog post shares my thoughts on attending Kubecon and CloudNativeCon 2024 Europe in Paris. It was my third time at this conference, and it felt bigger than last year s in Amsterdam. Apparently it had an impact on public transport. I missed part of the opening keynote because of the extremely busy rush hour tram in Paris. On Artificial Intelligence, Machine Learning and GPUs Talks about AI, ML, and GPUs were everywhere this year. While it wasn t my main interest, I did learn about GPU resource sharing and power usage on Kubernetes. There were also ideas about offering Models-as-a-Service, which could be cool for Wikimedia Toolforge in the future. See also:

On security, policy and authentication This was probably the main interest for me in the event, given Wikimedia Toolforge was about to migrate away from Pod Security Policy, and we were currently evaluating different alternatives. In contrast to my previous attendances to Kubecon, where there were three policy agents with presence in the program schedule, Kyverno, Kubewarden and OpenPolicyAgent (OPA), this time only OPA had the most relevant sessions. One surprising bit I got from one of the OPA sessions was that it could work to authorize linux PAM sessions. Could this be useful for Wikimedia Toolforge? OPA talk

I attended several sessions related to authentication topics. I discovered the keycloak software, which looks very promising. I also attended an Oauth2 session which I had a hard time following, because I clearly missed some additional knowledge about how Oauth2 works internally. I also attended a couple of sessions that ended up being a vendor sales talk. See also:

On container image builds, harbor registry, etc This topic was also of interest to me because, again, it is a core part of Wikimedia Toolforge. I attended a couple of sessions regarding container image builds, including topics like general best practices, image minimization, and buildpacks. I learned about kpack, which at first sight felt like a nice simplification of how the Toolforge build service was implemented. I also attended a session by the Harbor project maintainers where they shared some valuable information on things happening soon or in the future , for example:

new harbor command line interface coming soon. Only the first iteration though.
harbor operator, to install and manage harbor. Looking for new maintainers, otherwise going to be archived.
the project is now experimenting with adding support to hosting more artifacts: maven, NPM, pypi. I wonder if they will consider hosting Debian .deb packages.

On networking I attended a couple of sessions regarding networking. One session in particular I paid special attention to, ragarding on network policies. They discussed new semantics being added to the Kubernetes API. The different layers of abstractions being added to the API, the different hook points, and override layers clearly resembled (to me at least) the network packet filtering stack of the linux kernel (netfilter), but without the 20 (plus) years of experience building the right semantics and user interfaces. Network talk

I very recently missed some semantics for limiting the number of open connections per namespace, see Phabricator T356164: [toolforge] several tools get periods of connection refused (104) when connecting to wikis This functionality should be available in the lower level tools, I mean Netfilter. I may submit a proposal upstream at some point, so they consider adding this to the Kubernetes API. Final notes In general, I believe I learned many things, and perhaps even more importantly I re-learned some stuff I had forgotten because of lack of daily exposure. I m really happy that the cloud native way of thinking was reinforced in me, which I still need because most of my muscle memory to approach systems architecture and engineering is from the old pre-cloud days. That being said, I felt less engaged with the content of the conference schedule compared to last year. I don t know if the schedule itself was less interesting, or that I m losing interest? Finally, not an official track in the conference, but we met a bunch of folks from Wikimedia Deutschland. We had a really nice time talking about how wikibase.cloud uses Kubernetes, whether they could run in Wikimedia Cloud Services, and why structured data is so nice. Group photo

13 February 2024

Arturo Borrero Gonz lez: Back to the Wikimedia Foundation!

In October 2023, I departed from the Wikimedia Foundation, the non-profit organization behind well-known projects like Wikipedia and others, to join Spryker. However, in January 2024 Spryker conducted a round of layoffs reportedly due to budget and business reasons. I was among those affected, being let go just three months after joining the company. Fortunately, the Wikimedia Cloud Services team, where I previously worked, was still seeking to backfill my position, so I reached out to them. They graciously welcomed me back as a Senior Site Reliability Engineer, in the same team and position as before. Although this three-month career detour wasn t the outcome I initially envisioned, I found it to be a valuable experience. During this time, I gained knowledge in a new tech stack, based on AWS, and discovered new engineering methodologies. Additionally, I had the opportunity to meet some wonderful individuals. I believe I have emerged stronger from this experience. Returning to the Wikimedia Foundation is truly motivating. It feels privileged to be part of this mature organization, its community, and movement, with its inspiring mission and values. In addition, I m hoping that this also means I can once again dedicate a bit more attention to my FLOSS activities, such as my duties within the Debian project. My email address is back online: aborrero@wikimedia.org. You can find me again in the IRC libera.chat server, in the usual wikimedia channels, nick arturo.

14 December 2023

Arturo Borrero Gonz lez: OpenTofu: handcrafted include-file mechanism with YAML

I recently started playing with Terraform/OpenTofu almost on a daily basis. The other day I was working with Amazon Managed Prometheus (or AMP), and wanted to define prometheus alert rules on YAML files. I decided that I needed a way to put the alerts on a bunch of files, and then load them by the declarative code, on the correct AMP workspace. I came up with this code pattern that I m sharing here, for my future reference, and in case it is interesting to someone else. The YAML file where I specify the AMP workspace, and where the alert rule files live:

---
alert_files:
  my_alerts_production:
    amp:
      workspace: "production"
    files: "alert_rules/production/*.yaml"
  my_alerts_staging:
    amp:
      workspace: "staging"
    files: "alert_rules/staging/*.yaml"

Note the files entry contains a file pattern. I will later expand the pattern using the fileset() function. Each rule file would be something like this:

---
name: "my_rule_namespace"
rule_data:  
  # this is prometheus-specific config
  groups:
    - name: "example_alert_group"
      rules:
      - alert: Example_Alert_Cpu
        # just arbitrary values, to produce an example alert
        expr: avg(rate(ecs_cpu_seconds_total container=~"something" [2m])) > 1
        for: 10s
        annotations:
          summary: "CPU usage is too high"
          description: "The container average CPU usage is too high."

I m interested in the data structure mutating into something similar to this:

---
alert_files:
  my_alerts_production:
    amp:
      workspace: "production"
    alerts_data:
      - name: rule_namespace_1
        rule_data:  
          # actual alert definition here
          [..]
      - name: rule_namespace_2
        rule_data:  
          # actual alert definition here
          [..]
  my_alerts_staging:
    amp:
      workspace: "staging"
    alerts_data:
      - name: rule_namespace_1
        rule_data:  
          # actual alert definition here
          [..]
      - name: rule_namespace_2
        rule_data:  
          # actual alert definition here
          [..]

This is the algorithm that does the trick:

locals  
  alerts_config =  
    for x, y in  
      for k, v in local.config.alert_files :
      k =>  
        amp : (v.amp),
        files : fileset("", v.files)
       
        : x =>  
      amp : (y.amp),
      alertmanager_data : [
        for z in(y.files) :
        yamldecode(file(z))
      ]
     
   
 

Because the declarative nature of the Terraform/OpenTofu language, I needed to implement 3 different for loops. Each loop reads the map and transforms it in some way, passing the result into the next loop. A bit convoluted if you ask me. To explain the logic, I think it makes more sense to read it from inside out. First loop:

    for k, v in local.config.alert_files :
    k =>  
        amp : (v.amp),
        files : fileset("", v.files)
     

This loop iterates the input YAML map in key-value pairs, remapping each amp entry, and expanding the file globs using the fileset() into a temporal files entry. Second loop:

    for x, y in  
        # previous fileset() loop
        : x =>  
      amp : (y.amp),
      alertmanager_data : [
        # yamldecode() loop
      ]
     

This intermediate loop is responsible for building the final data structure. It iterates the previous fileset() loop to remap it calling the next loop, the yamldecode() one. Note how the amp entry is being rebuilt in each remap (first loop and this one), otherwise we would lose it! Third loop:

    alertmanager_data : [
        for z in(y.files) :
        yamldecode(file(z))
    ]

And finally, this is maybe the easiest loop of the 3, we iterate the temporal file entry that was created in the first loop, calling yamldecode() for each of the file names generated by fileset(). The resulting data structure should allow you to easily create resources later in a for_each loop.

20 November 2023

Arturo Borrero Gonz lez: New job at Spryker

Last month, in October 2023, I started a new job as a Senior Site Reliability Engineer at the Germany-headquartered technology company Spryker. They are primarily focused on e-commerce and infrastructure businesses. I joined this company with excitement and curiosity, as I would be working with a new technology stack: AWS and Terraform. I had not been directly exposed to them in the past, but I was aware of the vast industry impact of both. Suddenly, things like PXE boot, NIC firmware, and Linux kernel upgrades are mostly AWS problems. This is a full-time, 100% remote job position. I m writing this post one month into the new position, and all I can say is that the onboarding was smooth, and I found some good people in my team. Prior to joining Spryker, I was a Senior SRE at the Wikimedia Cloud Services team at the Wikimedia Foundation. I had been there since 2017, for 6 years. It was a privilege working there.

31 May 2023

Arturo Borrero Gonz lez: Wikimedia Hackathon 2023 Athens summary

During the weekend of 19-23 May 2023 I attended the Wikimedia hackathon 2023 in Athens, Greece. The event physically reunited folks interested in the more technological aspects of the Wikimedia movement in person for the first time since 2019. The scope of the hacking projects include (but was not limited to) tools, wikipedia bots, gadgets, server and network infrastructure, data and other technical systems. My role in the event was two-fold: on one hand I was in the event because of my role as SRE in the Wikimedia Cloud Services team, where we provided very valuable services to the community, and I was expected to support the technical contributors of the movement that were around. Additionally, and because of that same role, I did some hacking myself too, which was specially augmented given I generally collaborate on a daily basis with some community members that were present in the hacking room. The hackathon had some conference-style track and I ran a session with my coworker Bryan, called Past, Present and Future of Wikimedia Cloud Services (Toolforge and friends) (slides) which was very satisfying to deliver given the friendly space that it was. I attended a bunch of other sessions, and all of them were interesting and well presented. The number of ML themes that were present in the program schedule was exciting. I definitely learned a lot from attending those sessions, from how LLMs work, some fascinating applications for them in the wikimedia space, to what were some industry trends for training and hosting ML models. Session

Despite the sessions, the main purpose of the hackathon was, well, hacking. While I was in the hacking space for more than 12 hours each day, my ability to get things done was greatly reduced by the constant conversations, help requests, and other social interactions with the folks. Don t get me wrong, I embraced that reality with joy, because the social bonding aspect of it is perhaps the main reason why we gathered in person instead of virtually. That being said, this is a rough list of what I did:

Helped review the status of Migrate bldrwnsch from Toolforge GridEngine to Toolforge Kubernetes
Helped the maintainer of the bodh Toolforge tool get it working with a reverse proxy and started conversations about how to facilitate the use case.
Discussed persistent storage options in Toolforge, which resulted in some conversations within the WMCS team.
Had several debates with several folks on what computing abstractions we are providing, including Toolforge as a PaaS, raw Kubernetes access, or even if we should continue offering virtual machines to the community.
Patched toolforge-weld to support some stuff required by jobs-framework-cli. Also made a release of it.
Started a patch for jobs-framework-cli to use toolforge-weld. Which is still unfinished as of this writing.
Debated about how to host ML models, what to do with GPUs etc.
Suggested fellow SRE Riccardo in working on cloud-private subnet: introduce new domain to integrate with Netbox.
Uncovered a flavor definition problem in our Openstack deployment.
Reviewed many Toolforge account requests (most of them not related to the hackathon though), some quota requests and similar things.

The hackathon was also the final days of Technical Engagement as an umbrella group for WMCS and Developer Advocacy teams within the Technology department of the Wikimedia Foundation because of an internal reorg.. We used the chance to reflect on the pleasant time we have had together since 2019 and take a final picture of the few of us that were in person in the event. Technical Engagement

It wasn t the first Wikimedia Hackathon for me, and I felt the same as in previous iterations: it was a welcoming space, and I was surrounded by friends and nice human beings. I ended the event with a profound feeling of being privileged, because I was part of the Wikimedia movement, and because I was invited to participate in it.

27 April 2023

Arturo Borrero Gonz lez: Kubecon and CloudNativeCon 2023 Europe summary

This post serves as a report from my attendance to Kubecon and CloudNativeCon 2023 Europe that took place in Amsterdam in April 2023. It was my second time physically attending this conference, the first one was in Austin, Texas (USA) in 2017. I also attended once in a virtual fashion. The content here is mostly generated for the sake of my own recollection and learnings, and is written from the notes I took during the event. The very first session was the opening keynote, which reunited the whole crowd to bootstrap the event and share the excitement about the days ahead. Some astonishing numbers were announced: there were more than 10.000 people attending, and apparently it could confidently be said that it was the largest open source technology conference taking place in Europe in recent times. It was also communicated that the next couple iteration of the event will be run in China in September 2023 and Paris in March 2024. More numbers, the CNCF was hosting about 159 projects, involving 1300 maintainers and about 200.000 contributors. The cloud-native community is ever-increasing, and there seems to be a strong trend in the industry for cloud-native technology adoption and all-things related to PaaS and IaaS. The event program had different tracks, and in each one there was an interesting mix of low-level and higher level talks for a variety of audience. On many occasions I found that reading the talk title alone was not enough to know in advance if a talk was a 101 kind of thing or for experienced engineers. But unlike in previous editions, I didn t have the feeling that the purpose of the conference was to try selling me anything. Obviously, speakers would make sure to mention, or highlight in a subtle way, the involvement of a given company in a given solution or piece of the ecosystem. But it was non-invasive and fair enough for me. On a different note, I found the breakout rooms to be often small. I think there were only a couple of rooms that could accommodate more than 500 people, which is a fairly small allowance for 10k attendees. I realized with frustration that the more interesting talks were immediately fully booked, with people waiting in line some 45 minutes before the session time. Because of this, I missed a few important sessions that I ll hopefully watch online later. Finally, on a more technical side, I ve learned many things, that instead of grouping by session I ll group by topic, given how some subjects were mentioned in several talks. On gitops and CI/CD pipelines Most of the mentions went to FluxCD and ArgoCD. At that point there were no doubts that gitops was a mature approach and both flux and argoCD could do an excellent job. ArgoCD seemed a bit more over-engineered to be a more general purpose CD pipeline, and flux felt a bit more tailored for simpler gitops setups. I discovered that both have nice web user interfaces that I wasn t previously familiar with. However, in two different talks I got the impression that the initial setup of them was simple, but migrating your current workflow to gitops could result in a bumpy ride. This is, the challenge is not deploying flux/argo itself, but moving everything into a state that both humans and flux/argo can understand. I also saw some curious mentions to the config drifts that can happen in some cases, even if the goal of gitops is precisely for that to never happen. Such mentions were usually accompanied by some hints on how to operate the situation by hand. Worth mentioning, I missed any practical information about one of the key pieces to this whole gitops story: building container images. Most of the showcased scenarios were using pre-built container images, so in that sense they were simple. Building and pushing to an image registry is one of the two key points we would need to solve in Toolforge Kubernetes if adopting gitops. In general, even if gitops were already in our radar for Toolforge Kubernetes, I think it climbed a few steps in my priority list after the conference. Another learning was this site: https://opengitops.dev/. Group

On etcd, performance and resource management I attended a talk focused on etcd performance tuning that was very encouraging. They were basically talking about the exact same problems we have had in Toolforge Kubernetes, like api-server and etcd failure modes, and how sensitive etcd is to disk latency, IO pressure and network throughput. Even though Toolforge Kubernetes scale is small compared to other Kubernetes deployments out there, I found it very interesting to see other s approaches to the same set of challenges. I learned how most Kubernetes components and apps can overload the api-server. Because even the api-server talks to itself. Simple things like kubectl may have a completely different impact on the API depending on usage, for example when listing the whole list of objects (very expensive) vs a single object. The conclusion was to try avoiding hitting the api-server with LIST calls, and use ResourceVersion which avoids full-dumps from etcd (which, by the way, is the default when using bare kubectl get calls). I already knew some of this, and for example the jobs-framework-emailer was already making use of this ResourceVersion functionality. There have been a lot of improvements in the performance side of Kubernetes in recent times, or more specifically, in how resources are managed and used by the system. I saw a review of resource management from the perspective of the container runtime and kubelet, and plans to support fancy things like topology-aware scheduling decisions and dynamic resource claims (changing the pod resource claims without re-defining/re-starting the pods). On cluster management, bootstrapping and multi-tenancy I attended a couple of talks that mentioned kubeadm, and one in particular was from the maintainers themselves. This was of interest to me because as of today we use it for Toolforge. They shared all the latest developments and improvements, and the plans and roadmap for the future, with a special mention to something they called kubeadm operator , apparently capable of auto-upgrading the cluster, auto-renewing certificates and such. I also saw a comparison between the different cluster bootstrappers, which to me confirmed that kubeadm was the best, from the point of view of being a well established and well-known workflow, plus having a very active contributor base. The kubeadm developers invited the audience to submit feature requests, so I did. The different talks confirmed that the basic unit for multi-tenancy in kubernetes is the namespace. Any serious multi-tenant usage should leverage this. There were some ongoing conversations, in official sessions and in the hallway, about the right tool to implement K8s-whitin-K8s, and vcluster was mentioned enough times for me to be convinced it was the right candidate. This was despite of my impression that multiclusters / multicloud are regarded as hard topics in the general community. I definitely would like to play with it sometime down the road. On networking I attended a couple of basic sessions that served really well to understand how Kubernetes instrumented the network to achieve its goal. The conference program had sessions to cover topics ranging from network debugging recommendations, CNI implementations, to IPv6 support. Also, one of the keynote sessions had a reference to how kube-proxy is not able to perform NAT for SIP connections, which is interesting because I believe Netfilter Conntrack could do it if properly configured. One of the conclusions on the CNI front was that Calico has a massive community adoption (in Netfilter mode), which is reassuring, especially considering it is the one we use for Toolforge Kubernetes. Slide

On jobs I attended a couple of talks that were related to HPC/grid-like usages of Kubernetes. I was truly impressed by some folks out there who were using Kubernetes Jobs on massive scales, such as to train machine learning models and other fancy AI projects. It is acknowledged in the community that the early implementation of things like Jobs and CronJobs had some limitations that are now gone, or at least greatly improved. Some new functionalities have been added as well. Indexed Jobs, for example, enables each Job to have a number (index) and process a chunk of a larger batch of data based on that index. It would allow for full grid-like features like sequential (or again, indexed) processing, coordination between Job and more graceful Job restarts. My first reaction was: Is that something we would like to enable in Toolforge Jobs Framework? On policy and security A surprisingly good amount of sessions covered interesting topics related to policy and security. It was nice to learn two realities:

kubernetes is capable of doing pretty much anything security-wise and create greatly secured environments.
it does not by default. The defaults are not security-strict on purpose.

It kind of made sense to me: Kubernetes was used for a wide range of use cases, and developers didn t know beforehand to which particular setup they should accommodate the default security levels. One session in particular covered the most basic security features that should be enabled for any Kubernetes system that would get exposed to random end users. In my opinion, the Toolforge Kubernetes setup was already doing a good job in that regard. To my joy, some sessions referred to the Pod Security Admission mechanism, which is one of the key security features we re about to adopt (when migrating away from Pod Security Policy). I also learned a bit more about Secret resources, their current implementation and how to leverage a combo of CSI and RBAC for a more secure usage of external secrets. Finally, one of the major takeaways from the conference was learning about kyverno and kubeaudit. I was previously aware of the OPA Gatekeeper. From the several demos I saw, it was to me that kyverno should help us make Toolforge Kubernetes more sustainable by replacing all of our custom admission controllers with it. I already opened a ticket to track this idea, which I ll be proposing to my team soon. Final notes In general, I believe I learned many things, and perhaps even more importantly I re-learned some stuff I had forgotten because of lack of daily exposure. I m really happy that the cloud native way of thinking was reinforced in me, which I still need because most of my muscle memory to approach systems architecture and engineering is from the old pre-cloud days. List of sessions I attended on the first day:

List of sessions I attended on the second day:

List of sessions I attended on third day:

The videos have been published on Youtube.

30 January 2023

Arturo Borrero Gonz lez: Debian and the adventure of the screen resolution

I read somewhere a nice meme about Linux: Do you want an operating system or do you want an adventure? I love it, because it is so true. What you are about to read is my adventure to set a usable screen resolution in a fresh Debian testing installation. The context is that I have two different Lenovo Thinkpad laptops with 16 screen and nvidia graphic cards. They are both installed with the latest Debian testing. I use the closed-source nvidia drivers (they seem to work better than the nouveau module). The desktop manager and environment that I use is lightdm + XFCE4. The monitor native resolution in both machines is very high: 3840x2160 (or 4K UHD if you will). The thing is that both laptops show an identical problem: when freshly installed with the Debian default config, the native resolution is in use. For a 16 screen laptop, this high resolution means that the font is tiny. Therefore, the raw native resolution renders the machine almost unusable. This is a picture of what you get by running htop in the console (tty1, the terminal you would get by hitting CTRL+ALT+F1) with the default install: Linux tty console with high resolution and tiny fonts

Linux tty console with high resolution and tiny fonts

Everything in the system is affected by this:

the grub menu is unreadable. Thanksfully the right option is selected by default.
the tty console, with the boot splash by systemd is unreadable as well. There are some colors, so you at least see some systemd stuff happening in green .
when lightdm starts, the resolution keeps being very high. Can barely click the login button.
when XFCE4 starts, it is a pain to navigate the menu and click the right buttons to set a more reasonable resolution.

The adventure begins after installing the system. Each of these four points must be fixed by hand by the user. XFCE4 Point #4 is the easiest. Navigate with the mouse pointer to the tiny Applications menu, then Settings, then Displays. This is more or less the same in every other desktop operating system. There are no further actions required to persist this setting. Thanks you XFCE4. lightdm Point #3, about lightdm, is more tricky to solve. It involves running xrandr when lightdm sets up the display. Nobody will tell you this trick. You have to search for it on the internet. Thankfully is a common problem, and a person who knows what to search for can find good results. The file /etc/lightdm/lightdm.conf needs to contain something like this:

[LightDM]
[Seat:*]
# set up correct display resolution
display-setup-script=sh -c -- "xrandr -s 1920x1080"

By the way, depending on your system hardware setup, you may also need an additional call to xrandr here. If you want to plug in an HDMI monitor, chances are you require something like xrandr --setprovideroutputsource NVIDIA-G0 modesetting && xrandr --auto to instruct the NVIDIA graphic card to work will with the kernel graphic system. In my case, one of my laptops require it, so I have:

[LightDM]
[Seat:*]
# don't ask me to type my username
greeter-hide-users=false
# set up correct display resolution, and prepare NVIDIA card for HDMI output
display-setup-script=sh -c "xrandr -s 1920x1080 && xrandr --setprovideroutputsource NVIDIA-G0 modesetting && xrandr --auto"

grub Point #1 about the grub menu is also not trivial to solve, but also widely known on the internet. Grub allows you to set arbitrary graphical modes. In Debian systems, adding something like GRUB_GFXMODE=1024x768 to /etc/default/grub and then running sudo update-grub should do the magic. console So we get to point #2 about the tty1 console. For months, I ve been investing my scarce personal time into trying to solve this annoyance. There are a lot of conflicting information about this on the internet. Plenty of misleading solutions, essays about framebuffer, kernel modeset, and other stuff I don t want to read just to get my tty1 in a readable status. People point in different directions, like using GRUB_GFXPAYLOAD_LINUX=keep in /etc/default/grub. Which is a good solution, but won t work: my best bet is that the kernel indeed keeps the resolution as told by grub, but the moment systemd loads the nvidia driver, it enables 4K in the display and the console gets the high resolution. Actually, for a few weeks, I blamed plymouth. Because the plymouth service is loaded early by systemd, it could be responsible for setting some of the display settings. It actually contains some (undocummented) DeviceScale configuration option that is seemingly aimed to integrate into high resolution scenarios. I played with it to no avail. Some folks from IRC suggested reconfiguring the console-font package. Back-then unknown to me. Running sudo dpkg-reconfigure console-font would indeed show a menu to select some preferences for the console, including font size. But apparently, a freshly installed system already uses the biggest possible, so this was a dead end. Other option I evaluted for a few days was touching the kernel framebuffer setting. I honestly don t understand this, and all the solutions pointing to use fbset didn t work for me anyways. This is the default framebuffer configuration in one of the laptops:

user@debian:~$ fbset -i

mode "3840x2160"
    geometry 3840 2160 3840 2160 32
    timings 0 0 0 0 0 0 0
    accel true
    rgba 8/16,8/8,8/0,0/0
endmode
Frame buffer device information:
    Name        : i915drmfb
    Address     : 0
    Size        : 33177600
    Type        : PACKED PIXELS
    Visual      : TRUECOLOR
    XPanStep    : 1
    YPanStep    : 1
    YWrapStep   : 0
    LineLength  : 15360
    Accelerator : No

Playing with these numbers, I was able to modify the geometry of the console, only to reduce the panel to a tiny square in the console display (with equally small fonts anyway). If it was possible to scale or resize the panel in other way, I was unable to understand how to do so by reading the associated docs. One day, out of despair, I tried disabling kernel modesetting (or KMS). It indeed got me a more readable tty1, only to prevent the whole graphic stack from starting, with Xorg complaining about the lack of kernel modeset. After lots of wasted time, I decided to blame the NVIDIA graphic card. Because why not: a closed source module in my system looks fishy. I registered in their official forum and wrote a message about my suspicion on the module, asking for advice on how to modify the driver default resolution. I was hoping that something like modprobe nvidia my_desired_resolution=1920x1080 could exist. Apparently not :-( I was about to give up. I had walked every corner of the known internet. I even tried summoning the ancient gods, I used ChatGPT. I asked the AI god for mercy, for a working solution to no avail. Then I decided to change the kind of queries I was issuing the search engines (don t ask me, I no longer remember). Eventually I landed in this askubuntu.com page. The question described the exact same problem I was experiencing. Finally, that was encouraging! I was not alone in my adventure after all! The solution section included a font size I hadn t seen before in my previous tests: 16x32. More excitement! I did all the steps. I installed the xfonts-terminus package, and in the file /etc/default/console-setup I put:

ACTIVE_CONSOLES="/dev/tty[1-6]"
CHARMAP="ISO-8859-15"
CODESET="guess"
FONTFACE="Terminus"
FONTSIZE="16x32"
VIDEOMODE=

Then I run setupcon from a tty, and the miracle happened! I finally got a bigger font in the tty1 console! Turned out a potential solution was about playing with console-setup, which I had tried wihtout success before. I m not even sure if the additional package was required. This is how my console looks now: Linux tty console with high resolution but not so tiny fonts

Linux tty console with high resolution but not so tiny fonts

The truth is the solution is satisfying only to a degree. I m a person with good eyesight and can work with these bit larger fonts. I m not sure if I can get larger fonts using this method, honestly. After some search, I discovered that some folks already managed to describe the problem in detail and filed a proper bug report in Debian, see #595696 opened more than 10 years ago. 2023 is the year of linux on the desktop Nope. I honestly don t see how this disconnected pile of settings can be all reconciled together. Can we please have a systemd-whatever that homogeinizes all of this mess? I m referring to grub + kernel drivers + console + lightdm + XFCE4. Next adventure When I lock the desktop (with CTRL+ALT+L) and close the laptop lid to suspend it, then reopen it, type the login info into the lightdm greeter, then the desktop environment never loads, black screen. I have already tried the first few search results without luck. Perhaps the nvidia card is to blame this time? Perhaps poorly coupled power management by the different system software pieces? Who knows what s going on here. This will probably be my next Debian desktop adventure.

6 November 2022

Arturo Borrero Gonz lez: Home network refresh: 10G and IPv6

A few days ago, my home network got a refresh that resulted in the enablement of some next-generation technologies for me and my family. Well, next-generation or current-generation, depending on your point of view. Per the ISP standards in Spain (my country), what I ll describe next is literally the most and latest you can get. The post title spoiled it already. I have now 10G internet uplink and native IPv6 since I changed my ISP to https://digimobil.es. My story began a few months ago when a series of fiber deployments started in my neighborhood by a back-then mostly unknown ISP (digimobil). The workers were deploying the fiber inside the street sewers, and I noticed that they were surrounded by advertisements promoting the fastest FTTH deployment in Spain. Indeed, their website was promoting 1G and 10G fiber, so a few days later I asked the workers when would that be available for subscription. They told me to wait just a couple of months, and the wait ended this week. I called the ISP, and a marketing person told me a lot of unnecessary information about how good service I was purchasing. I asked about IPv6 availability, but that person had no idea. They called me the next day to confirm that the home router they were installing would support both IPv6 and Wi-Fi 6. I was suspicious about nobody in the marketing department knowing anything about any of the two technologies, but I decided to proceed anyway. Just 24 hours after calling them, a technician came to my house and 45 minutes later the setup was ready. The new home router was a ZTE ZXHN F8648P unit. By the way, it had Linux inside, but I got no GPL copyright notice or anything. It had 1x10G and 4x1G ethernet LAN ports. The optical speed tests that the technician did were giving between 8 Gbps to 9 Gbps in uplink speed, which seemed fair enough. Upon quick search, there is apparently a community of folks online which already know how to get the most out of this router by unbloking the root account (sorry, in spanish only) and using other tools. When I plugged the RJ45 in my laptop, the magic happened: the interface got a native, public IPv6 from the router. I ran to run the now-classic IPv6 browser test at https://test-ipv6.com/. And here is the result: IPv6 test

If you are curious, this was the IPv6 prefix whois information:

route6: 2a0c:5a80::/29
descr: Digi Spain Telecom S.L.U.
origin: AS57269

They were handing my router a prefix like 2a0c:5a80:2218:4a00::/56. I ignored if the prefix was somehow static, dynamic, just for me, or anything else. I ve been waiting for native IPv6 at home for years. In the past, I ve had many ideas and projects to host network services at home leveraging IPv6. But when I finally got it, I didn t know what to do next. I had a 7 months old baby, and honestly I didn t have the spare time to play a lot with the setup. Actually, I had no need or use for such fast network either. But my coworker Andrew convinced me: given the price 30 EUR / month, I didn t have any reason not to buy it. In fact, I didn t have any 10G-enabled NIC at home. I had a few laptops with 2.5G ports, though, and that was enough to experience the new network speeds. Since this write-up was inspired by the now almost-legenday post by Michael Stapelberg My upgrade to 25 Gbit/s Fiber To The Home, I contacted him, and he suggested running a few speed tests using the Ookla suite against his own server. Here are the results:

$ docker run --net host --rm -it docker.io/stapelberg/speedtest:latest -s 50092
[..]
     Server: Michael Stapelberg - Zurich (id = 50092)
        ISP: Digi Spain
    Latency:    34.29 ms   (0.20 ms jitter)
   Download:  2252.42 Mbps (data used: 3.4 GB )
     Upload:  2239.27 Mbps (data used: 2.8 GB )
Packet Loss:     0.0%
 Result URL: https://www.speedtest.net/result/c/cc8d6a78-c6f8-4f71-b554-a79812e10106

$ docker run --net host --rm -it docker.io/stapelberg/speedtest:latest -s 50092
[..]
     Server: Michael Stapelberg - Zurich (id = 50092)
        ISP: Digi Spain
    Latency:    34.05 ms   (0.21 ms jitter)
   Download:  2209.85 Mbps (data used: 3.2 GB )
     Upload:  2223.45 Mbps (data used: 2.9 GB )
Packet Loss:     0.0%
 Result URL: https://www.speedtest.net/result/c/32f9158e-fc1a-47e9-bd33-130e66c25417

This is over IPv6. Very satisfying. Bonus point: when I called my former ISP to cancel the old subscription the conversation was like:

I want to cancel the service.
What s the reason?
I got upgraded to 10G by another ISP
The speed is measured in MB, not G.
Ok, I got upgraded to 10.000 MB
That s not possible.
Well

I didn t even bother mentioning IPv6. Cheers!

3 November 2022

Arturo Borrero Gonz lez: New OpenPGP key and new email

I m trying to replace my old OpenPGP key with a new one. The old key wasn t compromised or lost or anything bad. Is still valid, but I plan to get rid of it soon. It was created in 2013. The new key id fingerprint is: AA66280D4EF0BFCC6BFC2104DA5ECB231C8F04C4 I plan to use the new key for things like encrypted emails, uploads to the Debian archive, and more. Also, the new key includes an identity with a newer personal email address I plan to use soon: arturo.bg@arturo.bg The new key has been uploaded to some public keyservers. If you would like to sign the new key, please follow the steps in the Debian wiki.

If you are curious about what that long code block contains, check this https://cirw.in/gpg-decoder/ For the record, the old key fingerprint is: DD9861AB23DC3333892E07A968E713981D1515F8 Cheers!

25 October 2022

Arturo Borrero Gonz lez: Netfilter Workshop 2022 summary

This is my report from the Netfilter Workshop 2022. The event was held on 2022-10-20/2022-10-21 in Seville, and the venue was the offices of Zevenet. We started on Thursday with Pablo Neira (head of the project) giving a short welcome / opening speech. The previous iteration of this event was in virtual fashion in 2020, two years ago. In the year 2021 we were unable to meet either in person or online. This year, the number of participants was just eight people, and this allowed the setup to be a bit more informal. We had kind of an un-conference style meeting, in which whoever had something prepared just went ahead and opened a topic for debate. In the opening speech, Pablo did a quick recap on the legal problems the Netfilter project had a few years ago, a topic that was settled for good some months ago, in January 2022. There were no news in this front, which was definitely a good thing. Moving into the technical topics, the workshop proper, Pablo started to comment on the recent developments to instrument a way to perform inner matching for tunnel protocols. The current implementation supports VXLAN, IPIP, GRE and GENEVE. Using nftables you can match packet headers that are encapsulated inside these protocols. He mentioned the design and the goals, that was to have a kernel space setup that allows adding more protocols by just patching userspace. In that sense, more tunnel protocols will be supported soon, such as IP6IP, UDP, and ESP. Pablo requested our opinion on whether if nftables should generate the matching dependencies. For example, if a given tunnel is UDP-based, a dependency match should be there otherwise the rule won t work as expected. The agreement was to assist the user in the setup when possible and if not, print clear error messages. By the way, this inner thing is pure stateless packet filtering. Doing inner-conntracking is an open topic that will be worked on in the future. Pablo continued with the next topic: nftables automatic ruleset optimizations. The times of linear ruleset evaluation are over, but some people have a hard time understanding / creating rulesets that leverage maps, sets, and concatenations. This is where the ruleset optimizations kick in: it can transform a given ruleset to be more optimal by using such advanced data structures. This is purely about optimizing the ruleset, not about validating the usefulness of it, which could be another interesting project. There were a couple of problems mentioned, however. The ruleset optimizer can be slow, O(n!) in worst case. And the user needs to use nested syntax. More improvements to come in the future. Next was Stefano Brivio s turn (Red Hat engineer). He had been involved lately in a couple of migrations to nftables, in particular libvirt and KubeVirt. We were pointed to https://libvirt.org/firewall.html, and Stefano walked us through the 3 or 4 different virtual networks that libvirt can create. He evaluated some options to generate efficient rulesets in nftables to instrument such networks, and commented on a couple of ideas: having a null matcher in nftables set expression. Or perhaps having kind of subsets, something similar to a view in a SQL database. The room spent quite a bit of time debating how the nft_lookup API could be extended to support such new search operations. We also discussed if having intermediate facilities such as firewalld could provide the abstraction levels that could make developers more comfortable. Using firewalld also may have the advantage that coordination between different system components writing ruleset to nftables is handled by firewalld itself and developers are freed of the responsibility of doing it right. Next was Fernando F. Mancera (Red Hat engineer). He wanted to improve error reporting when deleting table/chain/rules with nftables. In general, there are some inconsistencies on how tables can be deleted (or flushed). And there seems to be no correct way to make a single table go away with all its content in a single command. The room agreed in that the commands destroy table and delete table should be defined consistently, with the following meanings:

destroy: nuke the table, don t fail if it doesn t exist
delete: delete the table, but the command will fail if it doesn t exist

This topic diverted into another: how to reload/replace a ruleset but keep stateful information (such as counters). Next was Phil Sutter (Netfilter coreteam member and Red Hat engineer). He was interested in discussing options to make iptables-nft backward compatible. The use case he brought was simple: What happens if a container running iptables 1.8.7 creates a ruleset with features not supported by 1.8.6. A later container running 1.8.6 may fail to operate. Phil s first approach was to attach additional metadata into rules to assist older iptables-nft in decoding and printing the ruleset. But in general, there are no obvious or easy solutions to this problem. Some people are mixing different tooling version, and there is no way all cases can be predicted/covered. iptables-nft already refuses to work in some of the most basic failure scenarios. An other way to approach the issue could be to introduce some kind of support to print raw expressions in iptables-nft, like -m nft xyz. Which feels ugly, but may work. We also explored playing with the semantics of release version numbers. And another idea: store strings in the nft rule userdata area with the equivalent matching information for older iptables-nft. In fact, what Phil may have been looking for is not backwards but forward compatibility. Phil was undecided which path to follow, but perhaps the most common-sense approach is to fall back to a major release version bump (2.x.y) and declaring compatibility breakage with older iptables 1.x.y. That was pretty much it for the first day. We had dinner together and went to sleep for the next day. The room

The second day was opened by Florian Westphal (Netfilter coreteam member and Red Hat engineer). Florian has been trying to improve nftables performance in kernels with RETPOLINE mitigations enabled. He commented that several workarounds have been collected over the years to avoid the performance penalty of such mitigations. The basic strategy is to avoid function indirect calls in the kernel. Florian also described how BPF programs work around this more effectively. And actually, Florian tried translating nf_hook_slow() to BPF. Some preliminary benchmarks results were showed, with about 2% performance improvement in MB/s and PPS. The flowtable infrastructure is specially benefited from this approach. The software flowtable infrastructure already offers a 5x performance improvement with regards the classic forwarding path, and the change being researched by Florian would be an addition on top of that. We then moved into discussing the meeting Florian had with Alexei in Zurich. My personal opinion was that Netfilter offers interesting user-facing interfaces and semantics that BPF does not. Whereas BPF may be more performant in certain scenarios. The idea of both things going hand in hand may feel natural for some people. Others also shared my view, but no particular agreement was reached in this topic. Florian will probably continue exploring options on that front. The next topic was opened by Fernando. He wanted to discuss Netfilter involvement in Google Summer of Code and Outreachy. Pablo had some personal stuff going on last year that prevented him from engaging in such projects. After all, GSoC is not fundamental or a priority for Netfilter. Also, Pablo mentioned the lack of support from others in the project for mentoring activities. There was no particular decision made here. Netfilter may be present again in such initiatives in the future, perhaps under the umbrella of other organizations. Again, Fernando proposed the next topic: nftables JSON support. Fernando shared his plan of going over all features and introduce programmatic tests from them. He also mentioned that the nftables wiki was incomplete and couldn t be used as a reference for missing tests. Phil suggested running the nftables python test-suite in JSON mode, which should complain about missing features. The py test suite should cover pretty much all statements and variations on how the nftables expression are invoked. Next, Phil commented on nftables xtables support. This is, supporting legacy xtables extensions in nftables. The most prominent problem was that some translations had some corner cases that resulted in a listed ruleset that couldn t be fed back into the kernel. Also, iptables-to-nftables translations can be sloppy, and the resulting rule won t work in some cases. In general, nft list ruleset nft -f may fail in rulesets created by iptables-nft and there is no trivial way to solve it. Phil also commented on potential iptables-tests.py speed-ups. Running the test suite may take very long time depending on the hardware. Phil will try to re-architect it, so it runs faster. Some alternatives had been explored, including collecting all rules into a single iptables-restore run, instead of hundreds of individual iptables calls. Next topic was about documentation on the nftables wiki. Phil is interested in having all nftables code-flows documented, and presented some improvements in that front. We are trying to organize all developer-oriented docs on a mediawiki portal, but the extension was not active yet. Since I worked at the Wikimedia Foundation, all the room stared at me, so at the end I kind of committed to exploring and enabling the mediawiki portal extension. Note to self: is this perhaps https://www.mediawiki.org/wiki/Portals ? Next presentation was by Pablo. He had a list of assorted topics for quick review and comment.

We discussed nftables accept/drop semantics. People that gets two or more rulesets from different software are requesting additional semantics here. A typical case is fail2ban integration. One option is quick accept (no further evaluation if accepted) and the other is lazy drop (don t actually drop the packet, but delay decision until the whole ruleset has been evaluated). There was no clear way to move forward with this.
A debate on nft userspace memory usage followed. Some people are running nftables on low end devices with very little memory (such as 128 MB). Pablo was exploring a potential solution: introducing struct constant_expr, which can reduce 12.5% mem usage.
Next we talked about repository licensing (or better, relicensing to GPLv2+). Pablo went over a list of files in the nftables tree which had diverging licenses. All people in the room agreed on this relicensing effort. A mention to the libreadline situation was made.
Another quick topic: a bogus EEXIST in nft_rbtree. Pablo & Stefano to work in a patch.
Next one was conntrack early drop in flowtable. Pablo is studying use cases for some legitimate UDP unidirectional flows (like RTP traffic).
Pablo and Stefano discussed pipapo not being atomic on updates. Stefano already looked into it, and one of the ideas was to introduce a new commit API for sets.
The last of the quick topics was an idea to have a global table in nftables. Or some global items, like sets. Folk in the community keep asking for this. Some ideas were discussed, like perhaps adding a family agnostic family. But then there would be a challenge: nftables would need to generate byte code that works in any of the hooks. There was no immediate way of addressing this. The idea of having templated tables/sets circulated again as a way of reusing data across namespaces/families.

Following this, a new topic was introduced by Stefano. He wanted to talk about nft_set_pipapo, documentation, what to do next, etc. He did a nice explanation of how the pipapo algorithm works for element inserts, lookups, and deletion. The source code is pretty well documented, by the way. He showed performance measurements of different data types being stored in the structure. After some lengthly debate on how to introduce changes without breaking usage for users, he declared some action items: writing more docs, addressing problems with non-atomic set reloads and a potential rework of nft_rbtree. After that, the next topic was kubernetes & netfilter , also by Stefano. Actually, this topic was very similar to what we already discussed regarding libvirt. Developers want to reduce packet matching effort, but also often don t leverage nftables most performant features, like sets, maps or concatenations. Some Red Hat developers are already working on replacing everything with native nftables & firewalld integrations. But some rules generators are very bad. Kubernetes (kube-proxy) is a known case. Developers simply won t learn how to code better ruleset generators. There was a good question floating around: What are people missing on first encounter with nftables? The Netfilter project doesn t have a training or marketing department or something like that. We cannot force-educate developers on how to use nftables in the right way. Perhaps we need to create a set of dedicated guidelines, or best practices, in the wiki for app developers that rely on nftables. Jozsef Kadlecsik (Netfilter coreteam) supported this idea, and suggested going beyond: such documents should be written exclusively from the nftables point of view: stop approaching the docs as a comparison to the old iptables semantics. Related to that last topic, next was Laura Garc a (Zevenet engineer, and venue host). She shared the same information as she presented in the Kubernetes network SIG in August 2020. She walked us through nftlb and kube-nftlb, a proof-of-concept replacement for kube-proxy based on nftlb that can outperform it. For whatever reason, kube-nftlb wasn t adopted by the upstream kubernetes community. She also covered latest changes to nftlb and some missing features, such as integration with nftables egress. nftlb is being extended to be a full proxy service and a more robust overall solution for service abstractions. In a nutshell, nftlb uses a templated ruleset and only adds elements to sets, which is exactly the right usage of the nftables framework. Some other projects should follow its example. The performance numbers are impressive, and from the early days it was clear that it was outperforming classical LVS-DSR by 10x. I used this opportunity to bring a topic that I wanted to discuss. I ve seen some SRE coworkers talking about katran as a replacement for traditional LVS setups. This software is a XDP/BPF based solution for load balancing. I was puzzled about what this software had to offer versus, for example, nftlb or any other nftables-based solutions. I commented on the highlighs of katran, and we discussed the nftables equivalents. nftlb is a simple daemon which does everything using a JSON-enabled REST API. It is already packaged into Debian, ready to use, whereas katran feels more like a collection of steps that you need to run in a certain order to get it working. All the hashing, caching, HA without state sharing, and backend weight selection features of katran are already present in nftlb. To work on a pure L3/ToR datacenter network setting, katran uses IPIP encapsulation. They can t just mangle the MAC address as in traditional DSR because the backend server is on a different L3 domain. It turns out nftables has a nft_tunnel expression that can do this encapsulation for complete feature parity. It is only available in the kernel, but it can be made available easily on the userspace utility too. Also, we discussed some limitations of katran, for example, inability to handle IP fragmentation, IP options, and potentially others not documented anywhere. This seems to be common with XDP/BPF programs, because handling all possible network scenarios would over-complicate the BPF programs, and at that point you are probably better off by using the normal Linux network stack and nftables. In summary, we agreed that nftlb can pretty much offer the same as katran, in a more flexible way. Group photo

Finally, after many interesting debates over two days, the workshop ended. We all agreed on the need for extending it to 3 days next time, since 2 days feel too intense and too short for all the topics worth discussing. That s all on my side! I really enjoyed this Netfilter workshop round.

31 July 2022

Paul Wise: FLOSS Activities July 2022

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

SPTAG: print error reason

duck: add (1 2), sort indicator phrases

aptitude: fix GCC 12 FTBFS

Debian security tracker: link GitHub advisories

Debian packages website: sync contact footer

Debian sysadmin scripts: use TLS not StartTLS

Debian BTS usertags: fix some Hurd, ftp-master and release manager usertags

Debian package uploads: tldextract, sptag

Debian wiki pages: Arturo_Teslao, DebConf (23/Artwork, 24/Artwork/LogoProposals), DebianDay/2022/Netherlands/Breda, DebianRepository/SetupWithReprepro, DebianUnstable, Derivatives/Census (BOSSlinux, Ubuntu), ExternalEntities, Games/PlayIt, Hardware/Wanted, ICTL/TrustedOrganizationCriteria, KernelDKMS, LoongArch, MemberBenefits, Mobile, mtr, qa.debian.org/Join, research/publications, Ripping, SIMDEverywhere, SystemdNetworkd, Teams/Design, Teams/Publicity/Meetings/DebConf21_session

Debian sysadmin wiki: document debian-admin subscription

Wikipedia: Message-ID

Issues

Missing dependency in libgd

Test failures in SPTAG (1 2)

Broken documentation in pdb-tools

Features in SIMDe, lintian

Underlinking in libgprofng

Missing include in cwidget

Warnings in stem

Invalid syntax in ncbi-igblast

Error details needed in SPTAG

Review

Spam: reported 4 Debian bug reports and 49 Debian mailing list posts

Debian wiki: RecentChanges for the month

Debian BTS usertags: changes for the month

Debian screenshots:

approved cbonsai cmake-qt-gui diffpdf flam3-utils gbsplay kdeconnect kde-plasma-desktop network-manager paper-icon-theme plasma-desktop qtav-players wyrd

rejected mesa-vulkan-drivers (icon), mintpy/python3-mintpy (website), fpart (logo), xserver-xorg-video-sisusb/safecopy/ca-certificates/weboob-qt (selfie), vlc (photo of Microsoft Windows), chromium-browser/android-sdk (mobile phone app, private information)

Administration

Debian BTS: unarchive/reopen/triage bugs for reintroduced packages

Debian servers: check full disks, ping users of excessive disk usage, restart a hung service

Debian wiki: approve accounts

Communication

Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors The SPTAG, SIMDEverywhere, cwidget, aptitude, tldextract work was sponsored. All other work was done on a volunteer basis.

23 May 2022

Arturo Borrero Gonz lez: Toolforge Jobs Framework

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. This post continues the discussion of Toolforge updates as described in a previous post. Every non-trivial task performed in Toolforge (like executing a script or running a bot) should be dispatched to a job scheduling backend, which ensures that the job is run in a suitable place with sufficient resources. Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. The basic principle of running jobs is fairly straightforward:

You create a job from a submission server (usually login.toolforge.org).
The backend finds a suitable execution node to run the job on, and starts it once resources are available.
As it runs, the job will send output and errors to files until the job completes or is aborted.

So far, if a tool developer wanted to work with jobs, the Toolforge Grid Engine backend was the only suitable choice. This is despite the fact that Kubernetes supports this kind of workload natively. The truth is that we never prepared our Kubernetes environment to work with jobs. Luckily that has changed.

We no longer want to run Grid Engine In a previous blog post we shared information about our desired future for Grid Engine in Toolforge. Our intention is to discontinue our usage of this technology.

Convenient way of running jobs on Toolforge Kubernetes Some advanced Toolforge users really wanted to use Kubernetes. They were aware of the lack of abstractions or helpers, so they were forced to use the raw Kubernetes API. Eventually, they figured everything out and managed to succeed. The result of this move was in the form of [docs on Wikitech][raws] and a few dozen jobs running on Kubernetes for the first time. We were aware of this, and this initiative was much in sync with our ultimate goal: to promote Kubernetes over Grid Engine. We rolled up our sleeves and started thinking of a way to abstract and make it easy to run jobs without having to deal with lots of YAML and the raw Kubernetes API. There is a precedent: the webservice command does exactly that. It hides all the details behind a simple command line interface to start/stop a web app running on Kubernetes. However, we wanted to go even further, be more flexible and prepare ourselves for more situations in the future: we decided to create a complete new REST API to wrap the jobs functionality in Toolforge Kubernetes. The Toolforge Jobs Framework was born.

Toolforge Jobs Framework components The new framework is a small collection of components. As of this writing, we have three:

The REST API responsible for creating/deleting/listing jobs on the Kubernetes system.

A command line interface to interact with the REST API above.

An emailer to notify users about their jobs activity in the Kubernetes system.

There were a couple of challenges that weren t trivial to solve. The authentication and authorization against the Kubernetes API was one of them. The other was deciding on the semantics of the new REST API itself. If you are curious, we invite you to take a look at the documentation we have in wikitech.

Open beta phase Once we gained some confidence with the new framework, in July 2021 we decided to start a beta phase. We suggested some advanced Toolforge users try out the new framework. We tracked this phase in Phabricator, where our collaborators quickly started reporting some early bugs, helping each other, and creating new feature requests. Moreover, when we launched the Grid Engine migration from Debian 9 Stretch to Debian 10 Buster we took a step forward and started promoting the new jobs framework as a viable replacement for the grid. Some official documentation pages were created on wikitech as well. As of this writing the framework continues in beta phase. We have solved basically all of the most important bugs, and we already started thinking on how to address the few feature requests that are missing. We haven t yet established yet the criteria for leaving the beta phase, but it would be good to have:

Critical bugs fixed and most feature requests addressed (or at least somehow planned).

Proper automated test coverage. We can do better on testing the different software components to ensure they are as bug free as possible. This also would make sure that contributing changes is easy.

REST API swagger integration.

Deployment automation. Deploying the REST API and the emailer is tedious. This is tracked in Phabricator.

Documentation, documentation, documentation.

Limitations One of the limitations we bear in mind since early on in the development process of this framework was the support for mixing different programming languages or runtime environments in the same job. Solving this limitation is currently one of the WMCS team priorities, because this is one of the key workflows that was available on Grid Engine. The moment we address it, the framework adoption will grow, and it will pretty much enable the same workflows as in the grid, if not more advanced and featureful. Stay tuned for more upcoming blog posts with additional information about Toolforge. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

13 May 2022

Arturo Borrero Gonz lez: Toolforge GridEngine Debian 10 Buster migration

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. In accordance with our operating system upgrade policy, we should migrate our servers to Debian Buster. As discussed in the previous post, one of the most important and successful services provided by the Wikimedia Cloud Services team at the Wikimedia Foundation is Toolforge. Toolforge is a platform that allows users and developers to run and use a variety of applications with the ultimate goal of helping the Wikimedia mission from the technical side. As you may know already, all Wikimedia Foundation servers are powered by Debian, and this includes Toolforge and Cloud VPS. The Debian Project mostly follows a two year cadence for releases, and Toolforge has been using Debian Stretch for some years now, which nowadays is considered old-old-stable . In accordance with our operating system upgrade policy, we should migrate our servers to Debian Buster. Toolforge s two different backend engines, Kubernetes and Grid Engine, are impacted by this upgrade policy. Grid Engine is notably tied to the underlying Debian release, and the execution environment offered to tools running in the grid is limited to what the Debian archive contains for a given release. This is unlike in Kubernetes, where tool developers can leverage container images and decouple the runtime environment selection from the base operating system. Since the Toolforge grid original conception, we have been doing the same operation over and over again:

Prepare a parallel grid deployment with the new operating system.
Ask our users (tool developers) to evaluate a newer version of their runtime and programming languages.
Introduce a migration window and coordinate a quick migration.
Finally, drop the old operating system from grid servers.

We ve done this type of migration several times before. The last few ones were Ubuntu Precise to Ubuntu Trusty and Ubuntu Trusty to Debian Stretch. But this time around we had some special angles to consider.

So, you are upgrading the Debian release

You are migrating to Debian 11 Bullseye, no?

No, we re migrating to Debian 10 Buster

Wait, but Debian 11 Bullseye exists!

Yes, we know! Let me explain

We re migrating the grid from Debian 9 Stretch to Debian 10 Buster, but perhaps we should be migrating from Debian 9 Stretch to Debian 11 Bullseye directly. This is a legitimate concern, and we discussed it in September 2021. Back then, our reasoning was that skipping to Debian 11 Bullseye would be more difficult for our users, especially because greater jump in version numbers for the underlying runtimes. Additionally, all the migration work started before Debian 11 Bullseye was released. Our original intention was for the migration to be completed before the release. For a couple of reasons the project was delayed, and when it was time to restart the project we decided to continue with the original idea. We had some work done to get Debian 10 Buster working correctly with the grid, and supporting Debian 11 Bullseye would require an additional effort. We didn t even check if Grid Engine could be installed in the latest Debian release. For the grid, in general, the engineering effort to do a N+1 upgrade is lower than doing a N+2 upgrade. If we had tried a N+2 upgrade directly, things would have been much slower and difficult for us, and for our users. In that sense, our conclusion was to not skip Debian 10 Buster.

We no longer want to run Grid Engine In a previous blog post we shared information about our desired future for Grid Engine in Toolforge. Our intention is to discontinue our usage of this technology.

No grid? What about my tools? Traditionally there have been two main workflows or use cases that were supported in the grid, but not in our Kubernetes backend:

Running jobs, long-running bots and other scheduled tasks.

Mixing runtime environments (for example, a nodejs app that runs some python code).

The good news is that work to handle the continuity of such use cases has already started. This takes the form of two main efforts:

The Toolforge buildpacks project to support arbitrary runtime environments.

The Toolforge Jobs Framework to support jobs, scheduled tasks, etc.

In particular, the Toolforge Jobs Framework has been available for a while in an open beta phase. We did some initial design and implementation, then deployed it in Toolforge for some users to try it and report bugs, report missing features, etc. These are complex, and feature-rich projects, and they deserve a dedicated blog post. More information on each will be shared in the future. For now, it is worth noting that both initiatives have some degree of development already. The conclusion Knowing all the moving parts, we were faced with a few hard questions when deciding how to approach the Debian 9 Stretch deprecation:

Should we not upgrade the grid, and focus on Kubernetes instead? Let Debian 9 Stretch be the last supported version on the grid?

What is the impact of these decisions on the technical community? What is best for our users?

The choices we made are already known in the community. A couple of weeks ago we announced the Debian 9 Stretch Grid Engine deprecation. In parallel to this migration, we decided to promote the new Toolforge Jobs Framework, even if it s still in beta phase. This new option should help users to future-proof their tool, and reduce maintenance effort. An early migration to Kubernetes now will avoid any more future grid problems. We truly hope that Debian 10 Buster is the last version we have for the grid, but as they say, hope is not a good strategy when it comes to engineering. What we will do is to work really hard in bringing Toolforge to the service level we want, and that means to keep developing and enabling more Kubernetes-based functionalities. Stay tuned for more upcoming blog posts with additional information about Toolforge. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

4 April 2022

Arturo Borrero Gonz lez: Wikimedia Toolforge and Grid Engine

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. One of the most important and successful products provided by the Wikimedia Cloud Services team at the Wikimedia Foundation is Toolforge, a hosting service commonly known in the industry as Platform as a Service (PaaS). In particular, it is a platform that allows users and developers to run and use a variety of applications with the ultimate goal of helping the Wikimedia mission from the technical side. Toolforge is powered by two different backend engines, Kubernetes and Grid Engine. The two backends have traditionally offered different features for tool developers. But as time moves forward we ve learnt that Kubernetes is the future. Explaining why is the purpose of this blog post: we want to share more information and reasoning behind this mindset. There are a number of reasons that make Grid Engine poorly suitable to remain as execution backend in Toolforge:

There has not been a new Grid Engine release (bug fixes, security patches, or otherwise) since 2016. This doesn t feel like a project being actively developed or maintained.
The grid has poor support and controls for important aspects such as high availability, fault tolerance and self-recovery.
Maintaining a healthy grid requires plenty of manual operations, like manual queue cleanups in case of failures, hand-crafted scripts for pooling/depooling nodes, etc.
There is no good or modern monitoring support for the grid, and we need to craft and maintain several monitoring pieces for proper observability, and to be able to do proper maintenance.
The grid is also strongly tied to the underlying operating system release version. Migrating from one Debian version to the next is painful (a dedicated blog post about this will follow shortly).
The grid imposes a strong dependency on NFS, another old technology. We would like to reduce dependency on NFS overall, and in the future we will explore NFS-free approaches for Toolforge.
In general, Grid Engine is old software, old technology, which can be replaced by more modern approaches for providing an equivalent or better service.

As mentioned above, our desire is to cover all our grid-like needs with Kubernetes, a technology which has several benefits:

Good high availability, fault tolerance and self-recovery semantics, constructs and facilities.
Maintaining a running Kubernetes cluster requires little manual operations.
There are good monitoring and observability options for Kubernetes deployments, including seamless integration with industry standards like prometheus.
Our current approach to deploying and upgrading Kubernetes is independent of the underlying operating system.
While our current Kubernetes deployment uses NFS as a central component, there is support for using other, more modern, approaches for the kind of shared storage needs we have in Toolforge.
In general, Kubernetes is a modern technology, with a vibrant and healthy community, that enables new use cases and has enough flexibility to adapt legacy ones.

The relationship between Toolforge and Grid Engine has been interesting over the years. The grid has been used for quite a lot of time, we have plenty of documentation and established good practices. On the other hand, the grid is hard to maintain, imposes a heavy burden on the WMCS team and is a technology we must eventually discontinue. How to accommodate the two realities is a refreshing challenge, one that we hope to tackle together in the near future. A tradeoff exists here, but it is clear to us which option is best. So we will work on deprecating and removing Grid Engine and migrating use cases into Kubernetes. This deprecation, however, will be done with care, as we know our technical community relies on the grid for some import Toolforge tools. And some of these workflows will need some adaptation in order to be fully supported on Kubernetes. Stay tuned for more information on present and next works surrounding the Wikimedia Toolforge service. The next blog post will share more concrete details. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

20 October 2021

Arturo Borrero Gonz lez: Iterating on how we do NFS at Wikimedia Cloud Services

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. NFS is a central piece of infrastructure that is essential to services like Toolforge. Recently, the Cloud Services team at Wikimedia had been reviewing how we do NFS. The current situation NFS is a central piece of technology for some of the services that the Wikimedia Cloud Services team offers to the community. We have several shares that power different use cases: Toolforge user home directories live on NFS, and Cloud VPS users can also access dumps using this protocol. The current setup involves several physical hardware servers, with about 20TB of storage, offering shares over 10G links to the cloud. For the system to be more fault-tolerant, we duplicate each share for redundancy using DRBD. Running NFS on dedicated hardware servers has traditionally offered us advantages: mostly on the performance and the capacity fields. As time has passed, we have been enumerating more and more reasons to review how we do NFS. For one, the current setup is in violation of some of our internal rules regarding realm separation. Additionally, we had been longing for additional flexibility managing our servers: we wanted to use virtual machines managed by Openstack Nova. The DRBD-based high-availability system required mostly a hand-crafted procedure for failover/failback. There s also some scalability concerns as NFS is easy to grow up, but not to grow horizontally, and of course, we have to be able to keep the tenancy setup while doing so, something that NFS does by using LDAP/Unix users and may get complicated too when growing. In general, the servers have become too big to fail , clearly technical debt, and it has taken us years to decide on taking on the task to rethink the architecture. It s worth mentioning that in an ideal world, we wouldn t depend on NFS, but the truth is that it will still be a central piece of infrastructure for years to come in services like Toolforge. Over a series of brainstorming meetings, the WMCS team evaluated the situation and sorted out the many moving parts. The team managed to boil down the potential service future to two competing options:

Adopt and introduce a new Openstack component into our cloud: Manila this was the right choice if we were interested in a general NFS as a service offering for our Cloud VPS users.
Put the data on Cinder volumes and serve NFS from a couple of virtual machines created by hand this was the right choice if we wanted something that required low effort to engineer and adopt.

Then we decided to research both options in parallel. For a number of reasons, the evaluation was timeboxed to three weeks. Both ideas had a couple of points in common: the NFS data would be stored on our Ceph farm via Cinder volumes, and we would rely on Ceph reliability to avoid using DRBD. Another open topic was how to back up data from Ceph, to store our important bits in more than one basket. We will get to the back up topic later. The manila experiment The Wikimedia Foundation was an early adopter of some Openstack components (Nova, Glance, Designate, Horizon), but Manila was never evaluated for usage until now. Our approach for this experiment was to closely follow the upstream guidelines. We read the documentation and tried to understand the different setups you can build with Manila. As we often feel with other Openstack components, the documentation doesn t perfectly describe how to introduce a given component in your particular local setup. Here we use an admin-controller flat-topology Neutron network. This network is shared by all tenants (or projects) in our Openstack deployment. Also, Manila can use many different driver backends, for things like NetApps or CephFS that we don t use , yet. After some research, the generic driver was the one that seemed to better fit our use case. The generic driver leverages Nova virtual machines instances plus Cinder volume to create and manage the shares. In general, Manila supports two operational modes, whether it should create/destroy the share servers (i.e, the virtual machine instances) or not. This option is called driver_handles_share_server (or DHSS) and takes a boolean value. We were interested in trying with DHSS=true, to really benefit from the potential of the setup. Manila diagram

NFS idea 6, original image in Wikitech So, after sorting all these variables, we moved on with our initial testing. We built a PoC setup as depicted in the diagram above, with the manila-share component running in a virtual machine inside the cloud. The PoC led to us reporting several bugs upstream:

In some cases we tried to address these bugs ourselves:

It s worth mentioning that the upstream community was extra-welcoming to us, and we re thankful for that. However, at the end of our three-week period, our Manila setup still wasn t working as expected. Your experience may change with other drivers perhaps the ZFSonLinux or the CephFS ones. In general, we were having trouble making the setup work as expected, so we decided to abandon this approach in favor of the other option we were considering at the beginning. Simple virtual machine serving NFS The alternative was to create a Nova virtual machine instance by hand and to configure it using puppet. We have been investing in an automation framework lately, so the idea is to not actually create the server by hand. Anyway, the data would be decoupled from the instance into Cinder volumes, which led us to the question we left for later: How should we back up those terabytes of important information? Just to be clear, the backup problem was independent of the above options; with Manila we would still have had to solve the same challenge. We would like to see our data be backed up somewhere else other than in Ceph. And that s exactly where we are at right now. We ve been exploring different backup strategies and will finally use the Cinder backup API. Conclusion The iteration will end with the dedicated NFS hardware servers being stopped, and the shares being served from within the cloud. The migration will take some time to happen because we will check and double-check that everything works as expected (including from the performance point of view) before making definitive changes. We already have some plans to make sure our users experience as little service impact as possible. The most troublesome shares will be those related to Toolforge. At some point we will need to disallow writes to the NFS share, rsync the data out of the hardware servers into the Cinder volumes, point the NFS clients to the new virtual machines, and then enable writes again. The main Toolforge share has about 8TB of data, so this will take a while. We will have more updates in the future. Who knows, perhaps our next-next iteration, in a couple of years, will see us adopting Openstack Manila for good. Featured image credit: File:(from break water) Manila Skyline panoramio.jpg, ewol, CC BY-SA 3.0 This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

29 June 2021

Arturo Borrero Gonz lez: Last couple of talks

In the last few months I presented several talks. Topics ranged from a round table on free software, to sharing some of my work as SRE in the Cloud Services team at the Wikimedia Foundation. For some of them the videos are now published, so I would like to write a reference here, mostly as a way to collect such events for my own record. Isn t that what a blog is all about, after all? Before you continue reading, let me mention that the two talks I ll reference were given in my native Spanish. The videos are hosted on YouTube and autogenerated subtitles should be available, with doubtful quality though. Also, there was at least one additional private talk that I m not allowed to comment on here today. I was invited to participate in a Docker community event called Kroquecon, which was aimed at pushing the spanish-speaking Kubernetes community around the world. The event name is a word play with Kubernetes , conference and croqueta , typical Spanish food. The talk happened on 2021-04-29, and I was part of a round table about free software, communities and how to join and participate in such projects. I commented on my experience in both the Debian project, Netfilter and my several years in Google Summer of Code (3 as student, 2 as mentor). Video of the event:

The other event was the CNCF-supported Kubernetes Community Days Spain (KCD Spain). During Kroquecon I was encouraged to submit a talk proposal for this event, to talk about something related to our use of Kubernetes in the Wikimedia Cloud Services team at the Wikimedia Foundation. The proposal was originally rejected. Then, a couple of weeks before the event itself, I was contacted by the organizers with a greenlight to give it because the other speaker couldn t make it. My coworker David Caro joined me in the presentation. It was titled Conoce Wikimedia Toolforge, plataforma basada en Kubernetes (or Meet Wikimedia Toolforge, Kubernetes-based platform ). We explained what Wikimedia Cloud Services is, focusing on Toolforge, and in particular how we use Kubernetes to enable the platform s most interesting use cases. We covered several interesting topics, including how we handle multi-tenancy, or the problems we had with the etcd & ceph combo. The slides we used are available. Video of the event:

EOF

18 March 2021

Arturo Borrero Gonz lez: Debugging ip token set RTNETLINK error

At the Wikimedia Foundation they configure basically all servers with IPv4/IPv6 dual stack, at least in the control plane interface (those used for SSH management, etc). IPv6 is not supported yet on the Cloud Services dataplane (openstack), but it will in the near future. An elegant solution for this IPv4/IPv6 dual stack configuration in the control plane is to embed the IPv4 address into the IPv6 address, something like this:

IPv4: 208.80.154.23
IPv6: 2620:0:861:1:208:80:154:23
                   ^^^^^^^^^^^^^

The gateways send prefix advertisements and by default IPv6 autoconfiguration uses such routers prefixes plus the interface MAC address to build the final address. This behavior can be somewhat changed, you can explicitly set some data instead of the MAC address for the kernel to use when building the final IPv6 address. The trick is to use the ip token command, and further context and reasoning for this can be seen in the puppet manifest that the SRE team uses to configure this:

The token command explicitly configures the lower 64 bits that will be used with any autoconf
address, as opposed to one derived from the macaddr. This aligns the autoconf-assigned address with
the fixed one set above, and can do so as a pre-up command to avoid ever having another address
even temporarily, when this is all set up before boot.

For example:

user@debian:~$ sudo ip token list
token ::208:80:154:23 dev eno1
token :: dev eno2
token :: dev eno3
token :: dev eno4
user@debian:~$ sudo ip token set dev eno2 ::beef
user@debian:~$ ip token get dev eno2
token ::beef dev eno2
user@debian:~$ ip token list
token ::208:80:154:23 dev eno1
token ::beef dev eno2
token :: dev eno3
token :: dev eno4

I was installing a new server the other day and the networking service (ifupdown) was failing to bring the interface up. The issue was the ip token command failing:

root@cloudgw2002-dev:~# ip token set ::10:192:20:18 dev eno1
RTNETLINK answers: Invalid argument

RTNETLINK error reporting is awful. What could be wrong about that command? Well, I had to dig into the kernel source code to try figuring out where was that EINVAL being generated. The routine can be seen in the function inet6_set_iftoken() (source):

static int inet6_set_iftoken(struct inet6_dev *idev, struct in6_addr *token)
 
	struct inet6_ifaddr *ifp;
	struct net_device *dev = idev->dev;
	bool clear_token, update_rs = false;
	struct in6_addr ll_addr;
	ASSERT_RTNL();
	if (!token)
		return -EINVAL;
	if (dev->flags & (IFF_LOOPBACK   IFF_NOARP))
		return -EINVAL;
	if (!ipv6_accept_ra(idev))
		return -EINVAL;
	if (idev->cnf.rtr_solicits == 0)
		return -EINVAL;
[..]

I couldn t identify the reason at first glance, apparently I pass every check in there:

the token is being set, obviously
the interface is not loopback
the interface has the sysctl net.ipv6.conf.eno1.accept_ra = 1 (enabled)
the interface has the sysctl net.ipv6.conf.eno1.router_solicitations = -1 (disabled)

Something was wrong here. Upon deeper checks, I finally read the ipv6_accept_ra() function (source):

static inline bool ipv6_accept_ra(struct inet6_dev *idev)
 
	/* If forwarding is enabled, RA are not accepted unless the special
	 * hybrid mode (accept_ra=2) is enabled.
	 */
	return idev->cnf.forwarding ? idev->cnf.accept_ra == 2 :
	    idev->cnf.accept_ra;
 

That was it! Forwarding was enabled on the interface, and it shouldn t. I fixed my configuration with a simple patch for the sysctl configuration. Original issue on the Wikimedia Foundation ticketing system: Phabricator T277287

5 March 2021

Arturo Borrero Gonz lez: Openstack Neutron L3 failover issues

In the Cloud Services team at the Wikimedia Foundation we use Openstack Neutron to build our virtual network, and in particular, we rely on the neutron-l3-agent for implementing all the L3 connectivity, topology and policing. This includes basic packet firewalling and NAT. As of this writing, we are using Openstack version Train. We run the neutron-l3-agent on standard linux hardware servers with 10G NICs, and in general it works really well. Our setup is rather simple: we have a couple of servers for redundancy (note: upstream recommends having 3) and each server runs an instance of neutron-l3-agent. We don t use DVR, so all ingress/egress network traffic (or north-south traffic) flows using these servers. Today we use a flat network topology in our cloud. This means that all of our virtual machines share the same router gateway. Therefore, we only have one software-defined router. Neutron does a very smart thing: each software-defined router is implemented on a linux network namespace (netns). Each router living on its own netns, the namespace contains all IP addresses, routes, interfaces, netfilter firewalling rules, NAT configuration, etc. Additionally, we configure the agents and software-defined routers to be deployed on an high availability fashion. Neutron implements this by running an instance of keepalived (VRRP) inside each router netns. The gateway IP is therefore a virtual address that can move between the two instances of the neutron-l3-agent. In our setup we rely very heavily on IPv4 NAT, we use it for the egress traffic (SNAT) and also for Openstack floating IPs (SNAT/DNAT). Neutron uses Netfilter rules for configuring the whole NAT setup. There is, however, no apparent mechanism in the neutron-l3-agent to ensure continuity of NATed TCP connections when a failover happens. If all NATed TCP connections are flowing using neutron-l3-agent node A, and a failover happens, node B wont be able to pick up the opened TCP streams and therefore all connections will need to be re-established. This happens because the conntrack information that Netfilter uses to perform NAT is not present in node B when the failover happens. NAT is, in general, a stateful thing and should be treated that way. It seems some folks upstream are aware of this, and there are even some blueprints to introduce conntrackd alongside keepalived. The actual implementation didn t happen so far, apparently. I wonder, are we the only cloud deployment suffering this issue? Our users have been experiencing annoying connection cuts every time we failover the neutron-l3-agent for whatever reason: an upgrade, server reboot, etc. After numerous incidents I opened a ticket in our phabricator instance and started working on it. I ve been trying to improve this situation by following a couple of strategies:

running conntrackd by hand inside the software-defined router netns. This deserves its own blog post and won t comment more on this today.
setting additional sysctl configuration on the affected nents. This is the point I will be covering in this post.

To help the conntrack engine deal with NATed TCP connections I decided to try setting the sysctl key net.netfilter.nf_conntrack_tcp_be_liberal=1. But it turns out this sysctl key is not inherited from the main netns when the software-defined netns is created (i.e, is per-netns). I needed a mechanism to detect netns events (creation) and then inject the sysctl setting. Neutron apparently doesn t have any facility to hook/call commands at netns creation time. This felt like a very fun coding challenge, so I did it! I learned that netns have a /var/run/netns/<netns> entry, so I thought it would be easy to listen to file system events and react to them. I wanted to use some python, so I decided to go with the pyinotify library. I ended up creating a simple netns-events.py daemon/script. By using regexes it allows matching on netns names and running arbitrary commands on certain situations:

inmediatly when the daemon itself starts, if a matching netns exists (daemon_startup_actions)
on inotify events (inotify_actions), such as IN_CREATE.

The daemon reads a yaml configuration file like this:

---
# $NETNS env var is provided by the runner daemon
- netns_regex: ^qrouter-.*
  daemon_startup_actions:
    - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
    - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_loose=1
  inotify_actions:
    - IN_CREATE:
        - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
        - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_loose=1

We run this daemon as a systemd service, side by side with the neutron-l3-agent systemd service. Apparently the nf_conntrack_tcp_be_liberal sysctl key is only read either at module load time or netns creation time, so it is really important the netns-events.py daemon runs before neutron-l3-agent starts doing its thing. Anyway, this won t probably solve all of our problems. I m still on a massive rabbit hole with conntrackd, that I ll leave for another day. Other than this small script to set the sysctl keys I have more questions than answers. Again, I wonder what others do, or if there is something fundamentally wrong with our cloud network architecture. I ll keep researching how to deploy neutron-l3-agent in a more realiable fashion. If you have more hints, are in the same situation, or have any other insight, please let me know!

27 November 2020

Arturo Borrero Gonz lez: Netfilter virtual workshop 2020 summary

Once a year folks interested in Netfilter technologies gather together to discuss past, ongoing and future works. The Netfilter Workshop is an opportunity to share and discuss new ideas, the state of the project, bring people together to work & hack and to put faces to people who otherwise are just email names. This is an event that has been happening since at least 2001, so we are talking about a genuine community thing here. It was decided there would be an online format, split in 3 short meetings, once per week on Fridays. I was unable to attend the first session on 2020-11-06 due to scheduling conflict, but I made it to the sessions on 2020-11-13 and 2020-11-20. I would say the sessions were joined by about 8 to 10 people, depending on the day. This post is a summary with some notes on what happened in this edition, with no special order. Pablo did the classical review of all the changes and updates that happened in all the Netfilter project software components since last workshop. I was unable to watch this presentation, so I have nothing special to comment. However, I ve been following the development of the project very closely, and there are several interesting things going on, some of them commented below. Florian Westphal brought to the table status on some open/pending work for mptcp option matching, systemd integration and finally interfacing from nft with cgroupv2. I was unable to participate in the talk for the first two items, so I cannot comment a lot more. On the cgroupv2 side, several options were evaluated to how to match them, identification methods, the hierarchical tree that cgroups present, etc. We will have to wait a bit more to see how the final implementation looks like. Also, Florian presented his concerns on conntrack hash collisions. There are no real-world known issues at the moment, but there is an old paper that suggests we should keep and eye on this and introduce improvements to prevent future DoS attack vectors. Florian mentioned these attacks are not practical at the moment, but who knows in a few years. He wants to explore introducing RB trees for conntrack. It will probably be a rbtree structure of hash tables in order to keep supporting parallel insertions. He was encouraged by others to go ahead and play/explore with this. Phil Sutter shared his past and future iptables development efforts. He highlighted fixed bugs and his short/midterm TODO list. I know Phil has been busy lately fixing iptables-legacy/iptables-nft incompatibilities. Basically addressing annoying bugs discovered by all ruleset managers out there (kubernetes, docker, openstack neutron, etc). Lots of work has been done to improve the situation; moreover I myself reported, or forwarded from the Debian bug tracker, several bugs. Anyway I was unable to attend this talk, only learnt a few bits in the following sessions, so I don t have a lot to comment here. But when I was fully present, I was asked by Phil about the status of netfilter components in Debian, and future plans. I shared my information. The idea for the next Debian stable release is to don t include iptables in the installer, and include nftables instead. Since Debian Buster, nftables is the default firewalling tool anyway. He shared the plans for the RedHat-related ecosystem, and we were able to confirm that we are basically in sync. Pablo commented on the latest Netfilter flowtable enhancements happening. Using the flowtable infrastructure, one can create kernel network bypasses to speed up packet throughput. The latest changes are aimed for bridge and VLAN enabled setups. The flowtable component will now know how to bypass in these 2 network architectures as well as the previously supported ingress hook. This is basically aimed for virtual machines and containers scenarios. There was some debate on use cases and supported setups. I commented that a bunch of virtual machines connected to a classic linux bridge and then doing NAT is basically what Openstack Neutron does, specifically in DVR setups. Same can be found in some container-based environments. Early/simple benchmarks done by Pablo suggest there could be huge performance improvements for those use cases. There was some inevitable comparison of this approach to what others, like DPDK/XDP can do. A point was raised about this being a more generic and operating system-integrated solution, which should make it more extensible and easier to use. flowtable for bridges

Stefano Bravio commented on several open topics for nftables that he is interested on working on. One of them, issues related to concatenations + vmap issues. He also addressed concerns with people s expectations when migrating from ipset to nftables. There are several corner features in ipset that aren t currently supported in nftables, and we should document them. Stefano is also wondering about some tools to help in the migration. A translation layer like there is in place for iptables. Eric Gaver commented there are a couple of semantics that will not be suitable for translation, such as global sets, or sets of sets. But ipset is way simpler than iptables, so a translation mechanism should probably be created. In any case, there was agreement that anything that helps people migrate is more than welcome, even if it doesn t support 100% of the use cases. Stefano is planning to write documentation in the nftables wiki on how the pipapo algorithm works and the supported use cases. Other plans by Stefano include to work on some optimisations for faster matches. He mentioned using architecture specific instruction to speed up sets operations, like lookups. Finally, he commented that some folks working with eBPF have showed interest in reusing some parts of the nftables sets infrastructure (pipapo) because they have detected performance issues in their own data structures in some cases. It is not clear how to best achieve it, how to better bridge the two things together. Probably the ideal is to generalize the pipapo data structures and integrate it into the generic bitmap library or something which can be used by anyone. Anyway, he hopes to get some more time to focus on Netfilter stuff begining with the next year, in a couple of months. Moving a bit away from the pure software development topics, Pablo commented on the netfilter.org infrastructure. Right now the servers are running on gandi.net, on virtual machines that are being basically donated to us. He pointed that the plan is to simplify the infrastructure. For that reason, for example, FTP services has been shut down. Rsync services have been shut down as well, so basically we no longer have a mirrors infrastructure. The bugzilla and wikis we have need some attention, given they are old software pieces, and we need to migrate them to be more modern. Finally, the new logo that was created was presented. Later on, we spent a good chunk of the meeting discussing options on how to address the inevitable iptables deprecation situation. There are some open questions, and we discussed several approaches. From doing nothing at all, which means keeping the current status-quo, to setting a deadline date for the deprecation like the python community did with python2. I personally like this deadline idea, but it is perceived like a negative push by other. We all agree that the current do nothing approach is not sustainable either. Probably the way to go is basically to be more informative. We need to clearly communicate that choosing iptables for anything in 2020 is a bad idea. There are additional initiatives to help on this topic, without being too aggressive. A FAQ will probably be introduced. Eric Garver suggested we should bring nftables front and center. Given the website still mentions iptables everywhere, we will probably refresh the web content, introduce additional informative banners and similar things. There was an interesting talk on the topic of nft table ownership. The idea is to attach a table, and all the child objects, to a process. Then, we prevent any modifications to the table or the child objects by external entities. Basically, allocating and locking a table for a certain netlink socket. This is a nice way for ruleset managers, like firewalld, to ensure they have full control of what s happening to their ruleset, reducing the chances for ending with an inconsistent configuration. There is a proof-of-concept patch by Pablo to support this, and Eric mentioned he is pretty much interested in any improvements to support this use case. The final time block in the final session day was dedicated to talk about the next workshop. We are all very happy we could meet. Meeting virtually is way easier (and cheaper) than in person. Perhaps we can make it online every 3 or 6 months instead of, or in addition to, one big annual physical event. We will see what happens next year. That s all on my side!

22 November 2020

Arturo Borrero Gonz lez: How to use nftables from python

One of the most interesting (and possibly unknown) features of the nftables framework is the native python interface, which allows python programs to access all nft features programmatically, from the source code. There is a high-level library, libnftables, which is responsible for translating the human-readable syntax from the nft binary into low-level expressions that the nf_tables kernel subsystem can run. The nft command line utility basically wraps this library, where all actual nftables logic lives. You can only imagine how powerful this library is. Originally written in C, ctypes is used to allow native wrapping of the shared lib object using pure python. To use nftables in your python script or program, first you have to install the libnftables library and the python bindings. In Debian systems, installing the python3-nftables package should be enough to have everything ready to go. To interact with libnftables you have 2 options, either use the standard nft syntax or the JSON format. The standard format allows you to send commands exactly like you would do using the nft binary. That format is intended for humans and it doesn t make a lot of sense in a programmatic interaction. Whereas JSON is pretty convenient, specially in a python environment, where there are direct data structure equivalents. The following code snippet gives you an example of how easy this is to use:

#!/usr/bin/env python3

import nftables
import json
nft = nftables.Nftables()
nft.set_json_output(True)
rc, output, error = nft.cmd("list ruleset")
print(json.loads(output))

This is functionally equivalent to running nft -j list ruleset. Basically, all you have to do in your python code is:

import the nftables & json modules
init the libnftables instance
configure library behavior
run commands and parse the output (ideally using JSON)

The key here is to use the JSON format. It allows adding ruleset modification in batches, i.e. to create tables, chains, rules, sets, stateful counters, etc in a single atomic transaction, which is the proper way to update firewalling and NAT policies in the kernel and to avoid inconsistent intermediate states. The JSON schema is pretty well documented in the libnftables-json(5) manpage. The following example is copy/pasted from there, and illustrates the basic idea behind the JSON format. The structure accepts an arbitrary amount of commands which are interpreted in order of appearance. For instance, the following standard syntax input:

flush ruleset
add table inet mytable
add chain inet mytable mychain
add rule inet mytable mychain tcp dport 22 accept

Translates into JSON as such:

  "nftables": [
      "flush":   "ruleset": null  ,
      "add":   "table":  
        "family": "inet",
        "name": "mytable"
     ,
      "add":   "chain":  
        "family": "inet",
        "table": "mytable",
        "chain": "mychain"
     
      "add":   "rule":  
        "family": "inet",
        "table": "mytable",
        "chain": "mychain",
        "expr": [
              "match":  
                "left":   "payload":  
                    "protocol": "tcp",
                    "field": "dport"
                 ,
                "right": 22
             ,
              "accept": null  
        ]
     
] 

I encourage you to take a look at the manpage if you want to know about how powerful this interface is. I ve created a git repository to host several source code examples using different features of the library: https://github.com/aborrero/python-nftables-tutorial. I plan to introduce more code examples as I learn and create them. There are several relevant projects out there using this nftables python integration already. One of the most important pieces of software is firewalld. They started using the JSON format back in 2019. In the past, people interacting with iptables programmatically would either call the iptables binary directly or, in the case of some C programs, hack libiptc/libxtables libraries into their source code. The native python approach to use libnftables is a huge step forward, which should come handy for developers, network engineers, integrators and other folks using the nftables framework in a pythonic environment. If you are interested to know how this python binding works, I invite you to take a look at the upstream source code, nftables.py, which contains all the magic behind the scenes.

Next.

Search Results: "arturo"

1 April 2024

13 February 2024

14 December 2023

20 November 2023

31 May 2023

27 April 2023

30 January 2023

6 November 2022

3 November 2022

25 October 2022

31 July 2022

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Issues Missing dependency in libgd Test failures in SPTAG (1 2) Broken documentation in pdb-tools Features in SIMDe, lintian Underlinking in libgprofng Missing include in cwidget Warnings in stem Invalid syntax in ncbi-igblast Error details needed in SPTAG

Administration Debian BTS: unarchive/reopen/triage bugs for reintroduced packages Debian servers: check full disks, ping users of excessive disk usage, restart a hung service Debian wiki: approve accounts

Communication Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors The SPTAG, SIMDEverywhere, cwidget, aptitude, tldextract work was sponsored. All other work was done on a volunteer basis.

23 May 2022

We no longer want to run Grid Engine In a previous blog post we shared information about our desired future for Grid Engine in Toolforge. Our intention is to discontinue our usage of this technology.

13 May 2022

We no longer want to run Grid Engine In a previous blog post we shared information about our desired future for Grid Engine in Toolforge. Our intention is to discontinue our usage of this technology.

4 April 2022

20 October 2021

29 June 2021

18 March 2021

5 March 2021

27 November 2020

22 November 2020

Issues

Missing dependency in libgd

Test failures in SPTAG (1 2)

Broken documentation in pdb-tools

Features in SIMDe, lintian

Underlinking in libgprofng

Missing include in cwidget

Warnings in stem

Invalid syntax in ncbi-igblast

Error details needed in SPTAG

Administration

Debian BTS: unarchive/reopen/triage bugs for reintroduced packages

Debian servers: check full disks, ping users of excessive disk usage, restart a hung service

Debian wiki: approve accounts

Communication

Respond to queries from Debian users and contributors on the mailing lists and IRC