The “Snowball” Effect In Kubernetes

So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write-up here before I forget everything, and maybe allow the Kubernetes devs to read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783).

We noticed something was off on our cluster when Jobs and CronJobs started behaving strangely.

Jobs were being created but didn’t seem to spawn any pods to go with them; even over an hour later, they were still sitting there without a pod.

Investigating other Jobs, I found a crazy number of pods in one of our namespaces, over 900 to be exact. These were all completed pods from a CronJob.
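
Counting them is easy enough; something like this (using the same {namespace} placeholder as the commands further down) gives a quick tally of the completed ones:

kubectl get pods -n {namespace} --no-headers | grep -c Completed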

The CronJob was scheduled to run every minute, and its definition had sensible values set for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit. And even if they weren’t set, the defaults would (or should) have been used.
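
If you want to double-check what a CronJob’s limits are actually set to, the live object will tell you (placeholders as before); for reference, the Kubernetes defaults are 3 kept successful Jobs and 1 kept failed Job:

kubectl get cronjob -n {namespace} {cronjob-name} -o yaml | grep -i historyLimit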

So why did we have over 900 cron pods, and why weren’t they being cleaned up upon completion?

Just in case the sheer number of pods was causing problems, I cleared out the completed ones:

kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)

But even after that, new Jobs weren’t spawning pods, and in fact more CronJob pods were appearing in this namespace. So I suspended the CronJob:

kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'

But that also didn’t help; pods were still being generated. Which is weird: why is a CronJob still spawning pods even when it’s suspended?

Then I remembered that CronJobs actually generate Job objects, which in turn spawn the pods. So I checked the Jobs and found over 3,000 of them. Okay, something was seriously wrong here; there shouldn’t be 3,000 Job objects for something that only runs once a minute.
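
A rough way to see the scale of the pile-up (placeholders again) is to simply count the Jobs belonging to the CronJob:

kubectl get jobs -n {namespace} --no-headers | grep -c {cronjob-name}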

So I went and deleted all the CronJob related Job objects:

kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)

This brought the pod count down, but did not help us determine why the Job objects were not spawning pods.

I decided to get Google onto the case and raised a support ticket.

Their first investigation brought up something interesting. They sent me this snippet from the master logs (redacted):

2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted

2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node

Spot the problem?

The time between “Job is created” and “Pod is created” is roughly 76 minutes in the first case and about 12 minutes in the second. That’s right: it took well over an hour for the Pod to be spawned.

And this is where it dawned on me what was possibly going on.

  • The CronJob spawned a Job object. That Job tried to spawn a pod, and that took a significant amount of time, far more than the one minute between runs.
  • On the next cycle, the CronJob checked whether it already had a running pod, because of its .spec.concurrencyPolicy value.
  • It did not find a running pod, so it generated another Job object, which also got stuck waiting for pod generation.
  • And so on, and so on.

Each cycle, a new Job gets added and gets stuck waiting for pod generation for an abnormally long time, which causes yet another Job to be added to the namespace, which also gets stuck…

Eventually the pod does get created, but by then there’s a backlog of Jobs, meaning that even suspending the CronJob has no effect until the Jobs in the backlog are cleared or deleted (I had deleted them).
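
You can actually watch the backlog building up by listing the Jobs sorted by creation time; using the same placeholders, something like this shows a steadily growing queue of Jobs still sitting at 0/1 completions:

kubectl get jobs -n {namespace} --sort-by=.metadata.creationTimestamp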

Google investigated further, and found the culprit:

Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We were testing up9, which had installed a mutating admission webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.
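
For anyone hitting the same symptoms: the admission webhooks registered in a cluster are easy to list, and the offending configuration can then be removed. The exact name depends on whatever installed it, so {webhook-config-name} below is just a placeholder:

kubectl get mutatingwebhookconfigurations
kubectl delete mutatingwebhookconfiguration {webhook-config-name}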

So where does this leave us? Well, a few thoughts:

  • A misbehaving/misconfigured webhook can cause a snowball effect in the cluster, triggering multiple runs of a single CronJob without cleanup; the successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.
  • This could break systems where the CronJob is supposed to run mutually exclusively, since the delay in pod generation could allow two cron pods to run together, even though the CronJob has its concurrencyPolicy set to Forbid.
  • If someone managed (whether inadvertently or maliciously) to install a webhook that causes this pod-spawning delay, and then added a CronJob that runs once a minute, maliciously crafted so that it never finishes, this snowball effect would cause the cluster to run out of resources and/or keep scaling up nodes until it hits the maximum allowed by your configuration.

Apple can’t count….

My opinion of Apple and its practices has never been high. But this is just stupid.

Type “1+2+3=” into an iOS 11 device’s calculator app and you get 6 (correctly), but type it in quickly (as demonstrated in this video) and you get 24.

Sure, it’ll no doubt get patched soon, and Apple will twist the incident to prove how fast they can push out updates compared to Android. But the point remains: how did such a bug make it past testing? And what OTHER, similarly stupid bugs that have yet to be detected have also made it past testing? And what if one of those bugs was in something fundamental, something that breaks the functionality of the device? Something like the 1/1/1970 bug that would brick the device, or the infamous “effective power” bug that would annoyingly reboot someone’s phone, or the famous crashsafari site that was only meant to crash Safari but managed to crash the device too (originally, anyway).

OR, was there even ANY testing?

No, 900 million Android devices are not at risk from the ‘Quadrooter’ monster | Computerworld

You’ve probably seen articles inducing panic about the number of Android devices vulnerable to this Quadrooter bug. But read through the below first.

Another day, another overblown Android security scare. Who’s ready for a reality check?

Source: No, 900 million Android devices are not at risk from the ‘Quadrooter’ monster | Computerworld

Guys, gals, aardvarks, fishes: I’m running out of ways to say this. Your Android device is not in any immediate danger of being taken over by a super-scary malware monster.

It’s a silly thing to say, I realize, but we go through this same song and dance every few months: Some company comes out with a sensational headline about how millions upon millions of Android users are in danger (DANGER!) of being infected (HOLY HELL!) by a Big, Bad Virus™ (A WHAT?!) any second now. Countless media outlets (cough, cough) pick up the story and run with it, latching onto that same sensational language without actually understanding a lick about Android security or the context that surrounds it.

To wit: As you’ve no doubt seen by now, our latest Android malware scare du jour is something an antivirus software company called Check Point has smartly dubbed “Quadrooter” (a name worthy of Batman villain status if I’ve ever heard one). The company is shouting from the rooftops that 900 million (MILLION!) users are at risk of data loss, privacy loss, and presumably also loss of all bladder control — all because of this hell-raising “Quadrooter” demon and its presence on Qualcomm’s mobile processors.

“Without an advanced mobile threat detection and mitigation solution on the Android device, there is little chance a user would suspect any malicious behavior has taken place,” the company says in its panic-inducing press release.

Well, crikey: Only an advanced mobile threat detection and mitigation solution can stop this? Wait — like the one Check Point itself conveniently sells as a core part of its business? Hmm…that sure seems awfully coincidental.

TL;DR: A “mobile threat detection and mitigation solution” is already present on practically all of those 900 million Android devices. It’s a native part of the Android operating system called Verify Apps, and it’s been present in the software since 2012… Android has had its own built-in multilayered security system for ages now. There’s the threat-scanning Verify Apps system we were just discussing. The operating system also automatically monitors for signs of SMS-based scams, and the Chrome Android browser keeps an eye out for any Web-based boogeymen.

Hacker posts Facebook bug report on Zuckerberg’s wall — RT News

Evidently, unfriending people on Facebook won’t stop them posting on your timeline, as this article proves. And the surprising thing (or unsurprising, depending on your perspective) is that Facebook refused to acknowledge it was a bug, so the whitehat went all the way to the top and posted on Zuckerberg’s wall.

Hacker posts Facebook bug report on Zuckerberg’s wall — RT News.
