So, a weird thing happened in Kubernetes on the GKE cluster we have at the office. I figured I would do a write-up here before I forget everything, and maybe let the Kubernetes devs read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783).
We noticed some weirdness on our cluster when Jobs and CronJobs started behaving strangely. Jobs were being created but didn't seem to spawn any pods to go with them; even over an hour later, they were sitting there without a pod.
Investigating other jobs, I found a crazy large number of pods in one of our namespaces, over 900 to be exact. These pods were all completed pods from a CronJob.
The CronJob was scheduled to run every minute, and its definition had sensible values set for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit. And even if they weren't set, the defaults would (or should) have been used.
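For context, the relevant part of a CronJob spec looks roughly like this (the values below are illustrative, not our actual manifest):
apiVersion: batch/v1beta1          # the CronJob API version at the time
kind: CronJob
metadata:
  name: example-cron               # placeholder name
spec:
  schedule: "* * * * *"            # run every minute
  successfulJobsHistoryLimit: 3    # keep the last 3 successful Jobs (also the default)
  failedJobsHistoryLimit: 1        # keep the last failed Job (also the default)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task
            image: busybox         # placeholder image
            command: ["sh", "-c", "echo hello"]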
So why did we have over 900 cron pods, and why weren’t they being cleaned up upon completion?
Just in case the sheer number of pods was causing problems, I cleared out the completed ones:
kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)
But even after that, new jobs weren't spawning pods. And in fact, more CronJob pods were appearing in this namespace. So I suspended the CronJob:
kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'
But that also didn’t help, pods were still being generated. Which is weird — why is a CronJob
still spawning pods even when it’s suspended?
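If you want to double-check that the suspend flag actually took effect, something like this will show it:
kubectl get cronjob -n {namespace} {cronjob-name} -o jsonpath='{.spec.suspend}'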
So then I remembered that CronJobs actually generate Job objects. I checked the Job objects and found over 3000 of them. Okay, something is seriously wrong here; there shouldn't be 3000 Job objects for something that only runs once a minute.
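If you want to count them yourself, piping the list through wc does the trick:
kubectl get jobs -n {namespace} --no-headers | wc -l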
So I went and deleted all the CronJob-related Job objects:
kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)
This brought the pod count down, but did not help us determine why the Job objects were not spawning pods.
I decided to get Google onto the case and raised a support ticket.
Their first investigation turned up something interesting. They sent me this snippet from the master logs (redacted):
2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted
2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node
Spot the problem?
The time between “Job is created” and “Pod is created” is around 80 minutes in the first case, and about 12 minutes in the second. That's right: it took 80 minutes for the pod to be spawned.
And this is where it dawned on me what was possibly going on.
- The CronJob spawned a Job object. The Job tried to spawn a pod, and that took a significant amount of time, far more than the 1 minute between runs.
- On the next cycle, the CronJob checks whether it still has a running pod, because of its .spec.concurrencyPolicy value.
- The CronJob does not find a running pod, so it generates another Job object, which also gets stuck waiting for pod creation.
- And so on, and so on (you can watch the backlog build with the command after this list).
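Listing the Jobs sorted by creation time makes the pile-up obvious:
kubectl get jobs -n {namespace} --sort-by=.metadata.creationTimestamp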
Each time, a new Job gets added and gets stuck waiting for pod creation for an abnormally long time, which causes another Job to be added to the namespace, which also gets stuck…
Eventually the pod does get created, but by then there's a backlog of Jobs, meaning that even suspending the CronJob won't have any effect until the Jobs in the backlog are cleared or deleted (I had deleted them).
Google investigated further, and found the culprit:
Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
We had been testing up9, which had installed this webhook, so it looks like a misbehaving webhook was causing the problem. We removed the webhook and everything started working again.
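If you hit something similar, the webhook registrations live in mutatingwebhookconfigurations (and validatingwebhookconfigurations), so you can find and remove the offending one with something like this (the name below is a placeholder):
kubectl get mutatingwebhookconfigurations
kubectl delete mutatingwebhookconfiguration {webhook-config-name}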
So where does this leave us? Well, a few thoughts:
- A misbehaving/misconfigured webhook can cause a snowball effect in the cluster, causing multiple runs of a single CronJob without cleanup: the successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.
- This could break systems where the CronJob is supposed to run mutually exclusively, since the delay in pod creation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set to Forbid.
- If someone managed (whether accidentally or maliciously) to install a webhook that causes this pod-spawning delay, then adds a CronJob that runs once a minute and crafts the job to never finish, this snowball effect will cause the cluster to run out of resources and/or scale up nodes forever, or until it hits the max allowed by your configuration.