Dominique Dumont: How we solved storage API throttling on our Azure Kubernetes clusters
Hi
This issue was quite puzzling, so I'm sharing how we investigated it. I hope it can be useful for you.
My client informed me that he was no longer able to install new instances of his application.
k9s showed that only some pods could not be created: the ones that needed a persistent volume (PV). The description of these pods showed an HTTP error 429: new PVCs could not be created because we were throttled by the Azure storage API.
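For readers who do not use k9s, the same information can be pulled with plain kubectl. This is only a sketch: the pod, PVC and namespace names below are placeholders.

# Inspect a pending pod and its PVC; names and namespace are placeholders.
kubectl describe pod my-app-0 -n my-app
kubectl describe pvc data-my-app-0 -n my-app
# Or scan recent events for the throttling error.
kubectl get events -n my-app --sort-by=.lastTimestamp | grep -i '429'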
This issue was confirmed by the Azure diagnostic console on Kubernetes (menu Diagnose and solve problems → Cluster and Control Plane Availability and Performance → Azure Resource Request Throttling).
We had a lot of throttling:
This was explained by the high call rate:
The first clue was found at the bottom of the Azure diagnostic page. According to this page, throttling is done by services whose user agent is:

Go/go1.23.1 (amd64-linux) go-autorest/v14.2.1 Azure-SDK-For-Go/v68.0.0 storage/2021-09-01microsoft.com/aks-operat azsdk-go-armcompute/v1.0.0 (go1.22.3; linux)

The main information is Azure-SDK-For-Go, which means the program making all these calls to the storage API is written in Go. All our services are written in Typescript or Rust, so they are not suspect.

That leaves the controllers running in the kube-system namespace. I could not find anything suspicious in the logs of these services.
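For reference, checking those controllers looks roughly like this; the pod name is a placeholder and the grep pattern is only a guess at what a throttled Go client would log:

# List the controllers running in kube-system, then tail the logs of a candidate.
kubectl get pods -n kube-system
kubectl logs -n kube-system <controller-pod> --since=2h | grep -iE '429|throttl'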
At that point I was convinced that a component in the Kubernetes control plane was making all those calls. Unfortunately, AKS is managed by Microsoft and I don't have access to the control plane logs.
However, we realized that we had quite a lot of volumesnapshots that are created in our clusters using k8s-scheduled-volume-snapshotter:
- about 1800 on dev instead of 240
- 1070 on preprod instead of 180
- 6800 on prod instead of 2400
On preprod and prod, the number of snapshots was quite different from what we expected.
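For reference, counts like the ones above can be obtained with a kubectl one-liner; this is just a sketch to run against each cluster context:

# Count all volumesnapshots across namespaces (run once per cluster).
kubectl get volumesnapshots --all-namespaces --no-headers | wc -l
# Break the count down per namespace to see where they pile up.
kubectl get volumesnapshots --all-namespaces --no-headers \
  | awk '{print $1}' | sort | uniq -c | sort -rn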
We tried to get more information using Azure console on our snapshot account, but it was also broken by the throttling issue.
We were so puzzled that we decided to try Léodagan's advice ("tout crâmer pour repartir sur des bases saines", loosely translated as "burn everything down to start from scratch"), and we destroyed our dev cluster piece by piece while checking whether the throttling stopped.
First, we removed all our applications, no change.
Then, all ancillary components like rabbitmq and cert-manager were removed, no change.
Then, we tried to remove the namespace containing our applications. But we faced another issue: Kubernetes was unable to remove the namespace because it could not destroy some PVCs and volumesnapshots. That was actually good news, because it meant that we were close to the actual issue.
We managed to destroy the PVCs and volumesnapshots by removing their finalizers. Finalizers are a kind of marker that tells Kubernetes that something must be done before a resource is actually deleted. The finalizers were removed with a command like:
kubectl patch volumesnapshots ${volumesnapshot} \
  -p '{"metadata":{"finalizers":null}}' --type merge

Then we made our first progress: the throttling and the high call rate stopped on our dev cluster. To make sure that the snapshots were the issue, we re-installed the ancillary components and our applications. Everything was copacetic. So the problem was indeed with PVCs and snapshots.
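Clearing finalizers on many stuck objects at once can be scripted around that command; a minimal sketch, where the namespace name is a placeholder:

# Strip finalizers from every volumesnapshot in a namespace (namespace is a placeholder).
for volumesnapshot in $(kubectl get volumesnapshots -n my-app -o name); do
  kubectl patch "${volumesnapshot}" -n my-app \
    -p '{"metadata":{"finalizers":null}}' --type merge
done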
Even though we have backups outside of Azure, we weren't really thrilled at trying Léodagan's method on our prod cluster. So we looked for a better fix to try on our preprod cluster.
Poking around in PVCs and volumesnapshots, I finally found this error message in the description of a volumesnapshotcontents:
Code="ShareSnapshotCountExceeded" Message="The total number of snapshots for the share is over the limit."The number of snapshots found in our cluster was not that high. So I wanted to check the snapshots present in our storage account using Azure console, which was still broken. Fortunately, Azure CLI is able to retry
HTTP
calls when getting 429
errors. I managed to get a list of snapshots with
az storage share list --account-name [redacted] --include-snapshots \
tee preprod-list.json
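To see which shares had accumulated the most snapshots, the saved listing can be summarized with jq. This is a sketch: it assumes the entries expose name and snapshot fields, which is what the current az CLI output looks like.

# Count snapshots per share in the saved listing (base shares have no snapshot timestamp).
jq -r '.[] | select(.snapshot != null) | .name' preprod-list.json \
  | sort | uniq -c | sort -rn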
In that listing, I found a lot of snapshots dating back to 2024. These were no longer managed by Kubernetes and should have been cleaned up. That was our smoking gun.

I guess that we had a chain of events like:

- too many snapshots in some volumes
- Kubernetes control plane tries to reconcile its internal status with Azure resources and frequently retries snapshot creation
- API throttling kicks in
- client not happy
k8s-scheduled-volume-snapshotter creates new snapshots when it cannot list the old ones. So we had 4 new snapshots per day instead of one.
Since we had the chain of events, fixing the issue was not too difficult (but quite long):
- stop k8s-scheduled-volume-snapshotter by disabling its cron job (see the sketch after this list)
- delete all volumesnapshots and volumesnapshotcontents from k8s
- since the Azure API was throttled, we also had to remove their finalizers
- delete all snapshots from Azure using the az command and a Perl script (this step took several hours)
- re-enable k8s-scheduled-volume-snapshotter
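A rough sketch of these steps follows. The CronJob name and namespace are placeholders, the [redacted] account name mirrors the listing command above, and the --snapshot flag of az storage share delete is assumed from the current CLI documentation; our actual cleanup used a Perl script rather than this loop.

# 1. Suspend the snapshotter's CronJob (name and namespace are placeholders).
kubectl patch cronjob scheduled-volume-snapshotter -n snapshotter \
  -p '{"spec":{"suspend":true}}'

# 2. Delete the orphan Azure snapshots share by share, reusing the saved listing.
#    The --snapshot flag is assumed; the real cleanup was done with a Perl script.
jq -r '.[] | select(.snapshot != null) | [.name, .snapshot] | @tsv' preprod-list.json \
  | while IFS=$'\t' read -r share ts; do
      az storage share delete --account-name [redacted] \
        --name "$share" --snapshot "$ts"
    done

# 3. Re-enable the CronJob once the cleanup is done.
kubectl patch cronjob scheduled-volume-snapshotter -n snapshotter \
  -p '{"spec":{"suspend":false}}'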
Anyway, to avoid this problem in the future, we will:
- set up an alert on the number of snapshots per volume
- check with the k8s-scheduled-volume-snapshotter author how to:
  - better cope with HTTP 429 errors
  - clean up orphan snapshots