This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.
Summary: this article shares the experience and lessons learned from migrating the Wikimedia Toolforge platform away from Kubernetes PodSecurityPolicy and onto Kyverno.
Wikimedia Toolforge is a Platform-as-a-Service, built with
Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome
anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc.) in support of Wikimedia projects.
We provide a set of platform-specific services, command line interfaces, and shortcuts to help with tasks like setting up
webservices and jobs, building container images, or using databases. Using these interfaces makes the
underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some
advanced users do interact with it directly.
Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in
place to ensure the performance, stability, and fairness of the system, including quotas, RBAC permissions, and, until
recently, PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the
system. We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration. We needed Pods to
stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the
option to use third-party agents, like
OpenPolicyAgent Gatekeeper,
but we decided not to invest in them and went with a native, built-in mechanism instead.
In 2021 it was
announced
that the PSP mechanism would be deprecated, and removed in Kubernetes 1.25. Even though we had been warned years in
advance, we did not prioritize the migration away from PSP until we were on Kubernetes 1.24 and blocked, unable to
upgrade further without taking action.
The WMCS team explored different alternatives for this migration, but eventually we
decided to go with
Kyverno as a replacement for PSP. And with that decision began the
journey described in this blog post.
First, we needed a source code refactor of one of the key components of our Toolforge Kubernetes:
maintain-kubeusers. This custom piece of
software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on
Kubernetes to accommodate each one: create a namespace, RBAC, quota, a kubeconfig file, etc. With the refactor, we
introduced a proper reconciliation loop, so that the software has a notion of what needs to be done for
each account, what is missing, what to delete, what to upgrade, and so on. This allows us to easily deploy new
resources for each account, or iterate on their definitions.
The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was
doing more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the
accounts. We use NFS as the underlying storage system for Toolforge, and it can be very slow, for reasons beyond the
scope of this blog post. This was corrected in the days following the initial refactor rollout. A side note with an
implementation detail: we store a configmap in each account namespace with the state of each resource. Keeping more
state in this configmap was our solution to avoid additional NFS latency.
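As a rough illustration, such a state configmap could look something like the sketch below. The keys and value format here are made up for the example; they are not the actual maintain-kubeusers schema.

```yaml
# Hypothetical per-namespace state configmap; keys and values are
# illustrative, not the real maintain-kubeusers data layout.
apiVersion: v1
kind: ConfigMap
metadata:
  name: maintain-kubeusers
  namespace: tool-example
data:
  rbac: "created"
  quota: "created"
  kubeconfig: "created"
  kyverno-policy: "created, template-revision 2"
```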
I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks
instead. Prior to the refactor, updating the definition of a resource required several manual steps and cleanups.
The process is now automated, more robust, performant, efficient, and clean. So in my opinion
it was worth it, even if it took more time than expected.
Then, we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, we tried to replicate
its semantics on a 1:1 basis as much as possible in order to ease the transition. This involved things like
transparent mutation of Pod resources, followed by validation. Additionally, we had a different PSP definition for each
account, so we decided to create a different Kyverno namespaced policy resource for each account namespace. Remember,
we had 3.5k accounts.
We created a
Kyverno policy
template
that we would then render and inject for each account.
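To give a flavor of what such a namespaced policy can look like, here is a heavily reduced sketch, not the actual Toolforge template linked above: it mutates Pods to fill in a missing security setting and then validates it, with validationFailureAction controlling Audit versus Enforce mode.

```yaml
# Illustrative sketch only; the real Toolforge template is linked above.
apiVersion: kyverno.io/v1
kind: Policy                      # namespaced: one per tool account
metadata:
  name: toolforge-pod-policy
  namespace: tool-example
spec:
  validationFailureAction: Audit  # flipped to Enforce in a later stage
  rules:
    - name: default-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              # "+(...)" is Kyverno's add-if-not-present anchor
              +(runAsNonRoot): true
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must run as non-root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```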
For developing and testing all of this, the maintain-kubeusers and Kyverno bits alike, we had a project called
lima-kilo, a local Kubernetes setup
replicating production Toolforge. Each engineer used it on their laptop as a common development environment.
We had planned the migration from PSP to Kyverno policies in stages, like this:
1. update our internal template generators to make Pod security settings explicit
2. introduce the Kyverno policies in Audit mode
3. see how the cluster behaved with them, and correct any offending resources reported by the new policies
4. modify the Kyverno policies and set them in Enforce mode
5. drop PSP
In stage 1, we
updated things like the toolforge-jobs-framework and tools-webservice.
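The point of stage 1 was for the Pods our tooling generates to spell out their security settings explicitly instead of relying on PSP mutation. A generated Pod could then carry something along these lines; the exact fields and values emitted by toolforge-jobs-framework and tools-webservice may differ.

```yaml
# Illustrative Pod fragment with explicit security settings; not the
# exact output of the Toolforge components.
apiVersion: v1
kind: Pod
metadata:
  name: example-job
  namespace: tool-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 52503            # made-up tool account UID
    runAsGroup: 52503
  containers:
    - name: job
      image: example-image:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
```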
In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately.
Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any
administrative actions in the Kubernetes control plane, or even on the underlying virtual machines. All Toolforge users
were impacted. This was a
full-scale
outage that required the
energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.
This incident happened despite having tested beforehand in lima-kilo and in another pre-production cluster we had, called
Toolsbeta. But we had not tested with that many
policy resources. Clearly, this was something scale-related. After the incident, I went and created 3.5k Kyverno
policy resources on lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a
few errors in our infrastructure, reached out to the Kyverno upstream
developers
for advice, and in the end did the following to
adapt the setup to our needs:
- corrected the external HAProxy health checks for the Kubernetes apiservers, from checking just for open TCP ports to
  actually checking the
  /healthz
  HTTP endpoint, which more accurately reflects the health of each apiserver.
- built a more realistic development environment: in lima-kilo, we created a couple of helper scripts to create/delete
  4000 policy resources, each in a different namespace.
- greatly over-provisioned memory in the Kubernetes control plane servers, that is, more memory in the base virtual
  machines hosting the control plane. Scaling up the memory headroom of the apiserver prevents it from running out of
  memory and therefore crashing the whole system. We went from 8GB of RAM per virtual machine to 32GB. In our cluster, a
  single apiserver pod could eat 7GB of memory on a normal day, so having 8GB on the base virtual machine was clearly
  not enough. I also sent a patch proposal to the Kyverno upstream documentation suggesting they clarify the additional
  memory pressure on the apiserver.
- corrected the resource requests and limits of Kyverno to more accurately describe our actual usage.
- increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled in a
  more timely fashion (see the configuration sketch after this list).
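For the last two items, this kind of tuning lives in the Helm values used to deploy Kyverno. A sketch of the relevant fragment, assuming the layout of recent kyverno Helm charts (older charts use top-level replicaCount/resources instead), with illustrative numbers rather than our exact production values:

```yaml
# values.yaml fragment for the kyverno Helm chart (recent chart layout
# assumed; the numbers are illustrative, not our production values)
admissionController:
  replicas: 7
  container:
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        memory: 4Gi
```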
I have to admit, I was
briefly tempted to drop Kyverno, and even to stop pursuing an external policy agent entirely
and write our own custom admission controller, out of concerns over the performance of this architecture. However, after
applying all the measures listed above, the system became very stable, so we decided to move forward. The second attempt
at deploying it all went through just fine. No outage this time.
When we reached stage 4, we detected another bug. We had been following the Kubernetes upstream documentation for setting
securityContext to the right values. In particular, we were enforcing procMount to be set to the default value,
which per the docs was
DefaultProcMount .
However, that string is the name of the internal variable in the source code, whereas the actual default value is the
string
Default . This caused pods to be
rejected by Kyverno, correctly per the policy as written, while we figured out the problem. I sent a
patch
upstream to fix this problem.
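For reference, the difference boils down to one string in the container-level securityContext:

```yaml
# Container-level securityContext fragment
securityContext:
  procMount: Default          # the actual API value
  # procMount: DefaultProcMount   <- wrong: this is the name of the Go
  #                                  constant in the Kubernetes source,
  #                                  not a valid API value
```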
We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP controller
from the Kubernetes apiserver and deleted every individual PSP definition. Everything was very smooth in this last step
of the migration.
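How the controller gets unloaded depends on how the control plane is managed. On a kubeadm-style setup it amounts to removing PodSecurityPolicy from the apiserver admission plugin list, something like the following sketch, where the remaining plugin list is an assumption:

```yaml
# kubeadm ClusterConfiguration fragment (sketch; the remaining plugin
# list is an assumption, keep whatever your cluster actually uses)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # "PodSecurityPolicy" removed from the list
    enable-admission-plugins: "NodeRestriction"
```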
This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages,
took roughly three months to complete.
For me, there are a number of valuable lessons to take from this project. For one, scale is something to consider,
and test for, when evaluating a new architecture or software component. Not doing so can lead to service outages, or
unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got the reminder the hard way.