The “Snowball” Effect In Kubernetes

So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write-up here before I forget everything, and maybe let the Kubernetes devs read over it as an issue (https://github.com/kubernetes/kubernetes/issues/93783).

We noticed some weirdness occurring on our cluster when Jobs and CronJobs started behaving strangely.

Jobs were being created but never seemed to spawn any pods; even over an hour later, they were still sitting there without a pod.

Investigating other jobs, I found a crazy large number of pods in one of our namespaces, over 900 to be exact. These pods were all completed pods from a CronJob.

The CronJob was scheduled to run every minute, and the definition of the CronJob had valid values for the history — sensible values for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit were set. And even if they weren’t, the defaults would (or should) be used.

So why did we have over 900 cron pods, and why weren’t they being cleaned up upon completion?

Just in case the number of pods was causing problems, I cleared out the completed pods:

kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)
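
The same cleanup can be done without parsing the table output. This is a minimal sketch, assuming completed pods are in the Succeeded phase (which is what kubectl shows as “Completed”); the function name is made up for illustration.

```shell
# Delete every pod in a namespace whose phase is Succeeded
# (displayed as "Completed" in kubectl's STATUS column).
# Usage: delete_completed_pods <namespace>
delete_completed_pods() {
  kubectl delete pods -n "$1" --field-selector=status.phase==Succeeded
}
```

Letting the API server do the filtering avoids accidentally matching a pod whose name happens to contain “Completed”.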

But even after that, new jobs weren’t spawning pods. And in fact, more CronJob pods were appearing in this namespace. So I disabled the CronJob:

kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'

But that also didn’t help: pods were still being generated. Which is weird. Why is a CronJob still spawning pods even when it’s suspended?

So then I remembered that CronJobs actually generate Job objects. So I checked the Job objects and found over 3000 Job objects. Okay, something is seriously wrong here, there shouldn’t be 3000 Job objects for something that only runs once a minute.

So I went and deleted all the CronJob related Job objects:

kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)

This reduced the pods down, but did not help us determine why the Job objects were not spawning pods.

I decided to get Google onto the case and raised a support ticket.

Their first investigation brought up something interesting. They sent me this snippet from the Master logs (redacted):

2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted

2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node

Spot the problem?

The time between “Job is created” and “Pod is created” is around 76 minutes in the first case, and 12 minutes in the second one. That’s right, it took well over an hour for the Pod to be spawned.
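
To check the arithmetic, here is a small sketch that works out the gap between two same-day timestamps from the log. The function names are my own, and it assumes both times are on the same day with the fractional seconds dropped.

```shell
# Convert HH:MM:SS into seconds since midnight.
to_seconds() {
  h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
  # Strip one leading zero so "08" and "09" are not read as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Whole minutes between "Job is created" ($1) and "Pod is created" ($2).
delay_minutes() {
  echo $(( ( $(to_seconds "$2") - $(to_seconds "$1") ) / 60 ))
}

delay_minutes 10:05:06 11:21:16   # prints 76
delay_minutes 12:45:47 12:57:22   # prints 11 (11m35s, rounded down)
```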

And this is where it dawned on me about what was possibly going on.

  • The CronJob spawned a Job object. It tried to spawn a pod, and that took a significant amount of time, far more than the 1 minute between runs
  • The next cycle, the CronJob looks to see if it has a running pod due to the .spec.concurrencyPolicy value.
  • The CronJob does not find a running pod so generates another Job object, which also gets stuck waiting for pod generation
  • And so on, and so on.

Each time, a new Job gets added, gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace which also gets stuck…

Eventually, the pod will generate but by then there’s now a backlog of Jobs, meaning even if I suspended the CronJob, it won’t have any effect until the Jobs in the backlog are cleared or deleted (I had deleted them).
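
A quick way to see a backlog like this is to list the Jobs oldest first, along with their pod counts. This is only a sketch: the function name is hypothetical, and it assumes a Job with neither an active nor a succeeded count is one that is still waiting for its pod.

```shell
# List Jobs in a namespace, oldest first, with their pod counts.
# Usage: job_backlog <namespace>
job_backlog() {
  kubectl get jobs -n "$1" --sort-by=.metadata.creationTimestamp \
    -o custom-columns='NAME:.metadata.name,CREATED:.metadata.creationTimestamp,ACTIVE:.status.active,SUCCEEDED:.status.succeeded'
}
```

Rows showing `<none>` in both the ACTIVE and SUCCEEDED columns are the ones to look at.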

Google investigated further, and found the culprit:

Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We were testing up9 and this was using a webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.
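
The log line shows the webhook failing open after a 30s timeout, so every pod creation was presumably paying that wait. A sketch for spotting this kind of webhook before it bites, assuming you have permission to read the webhook configurations (the function name is made up):

```shell
# Show each mutating webhook configuration with its failure policy
# and timeout, the two settings that decide how a broken webhook
# affects pod creation.
list_mutating_webhooks() {
  kubectl get mutatingwebhookconfigurations \
    -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT_SECONDS:.webhooks[*].timeoutSeconds'
}
```

A long timeout combined with a policy that fails open will not reject pods, but it can slow every creation down by the full timeout.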

So where does this leave us? Well, a few thoughts:

  • A misbehaving/misconfigured webhook can cause a Snowball effect in the cluster causing multiple runs of a single CronJob without cleanup — successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.
  • This could break systems where the CronJob is supposed to be run mutually exclusively, since the delay in pod generation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set as Forbid.
  • If someone managed (whether accidentally or maliciously) to install a webhook that causes this pod spawning delay, then adds a CronJob that runs once a minute, and then maliciously crafts the job to never finish, this snowball effect will cause the cluster to run out of resources and/or scale up nodes forever or until it hits the max allowed by your configuration.

CKA Exam Passed

5 questions I could not answer, and one I could, but arguably that question was ambiguous

  1. Fix a broken cluster — kubelet was started but couldn’t connect to itself.
  2. Add node to cluster. Nodes do not have kubeadm installed.
  3. Static pod. Couldn’t find the path where the manifest yaml needed to go.

4 and 5 I can’t remember the questions but will update if I remember

Ambiguous Question:

  1. Create a pod with a persistent volume, that isn’t persistent, and doesn’t tell you how big to make the PV. I used emptyDir, but that’s not really a PV (didn’t create a PV or a PVC)

CKAD Writeup

So I did the CKAD exam and it was one of the latest exams I’ve done, starting at 22:45 and finishing at 00:45. The CKAD exam is 2 hours versus the CKA’s 3 hours

And I went into the exam feeling relatively confident. But, damn, the 2 hours goes by really quickly.

Had several questions I wasn’t able to complete or only partially complete.

Liveness and Readiness Probes

This question wanted a pod to be restarted if an endpoint returns 500. Simple enough, but there was a catch: if another endpoint returns 500, then the application is still starting, so the check should be disregarded.

I’ve done something similar in a real-life scenario by implementing this check as a curl command (I should write a blog entry on that some time).

So in the exam, I set up both the liveness and readiness checks to chain two curl commands together: if the first endpoint (/starting in this case) returned 200, it would then call the next endpoint (/healthz) and return a fail if that gave a 500.

Buuuuut, the image didn’t have curl installed so the probes failed. I could use the hack I’ve used in my image and install curl as part of the check, but time constraints wouldn’t let me.
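
For reference, the chained check looks roughly like this as a script. It is a sketch of the idea rather than what I typed in the exam: the function names and the base URL are assumptions, and it needs curl in the image, which was exactly the problem.

```shell
# Print only the HTTP status code for a URL.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Exit status 0 = healthy, non-zero = probe failure.
# Usage: probe_check <base-url>
probe_check() {
  # If /starting does not return 200, skip the health check entirely.
  if [ "$(http_code "$1/starting")" != "200" ]; then
    return 0
  fi
  # Otherwise fail the probe only when /healthz returns 500.
  [ "$(http_code "$1/healthz")" != "500" ]
}
```

In a pod spec this would go into an exec probe, for example by calling `probe_check http://localhost:8080` from a small script baked into the image (the port here is hypothetical).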

Persistent Volumes

Similar to the CKA question, there was a quirkily worded question here which wanted me to add a file to a node, create a pod that used hostPath and reserve a 1Gi PV. The documentation does not provide an example of that, just a pod with a hostPath as an internal volume: https://kubernetes.io/docs/concepts/storage/volumes/#hostpath

Network Policies

A technology I haven’t used in Kubernetes yet. They gave several policies, one that allowed “app:proxy” and one that allowed “app:db”, and wanted us to edit a pod so that it was only allowed to talk to those.

We were not allowed to modify the policies. I can’t remember whether we were allowed to create new policies for this question

But both those policies use the app label. And the pod can’t have the same label with two values (I did try)

Though thinking about it now, and after a few checks, the NetworkPolicy object describes how to restrict traffic to the pods in question — so those selectors may be related to the pods the policy is restricting. I think I should have looked inside the policies more carefully to see what it was saying on the ingress rule and see if it was saying something like “app:frontend”, and then making sure the pod was labelled accordingly.

“Ambassador” Sidecar Pattern

A big chunk of the exam time was taken up by the sidecar questions — far more time than I would have liked, to be honest.

They had a question on the adaptor pattern, using fluentd, which was fine and I got that to work, but they also had another where I had to use HAProxy to proxy requests to a different port (ambassador pattern). A useful use case, but I ran out of time to finish it. I wanted to come back and revisit it if I had time, but didn’t.

CronJobs

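Terminate a cronjob’s job if it lasts longer than 17 seconds. The CronJob spec itself only has a startup deadline (startingDeadlineSeconds), not a duration deadline, but the Job template accepts activeDeadlineSeconds (.spec.jobTemplate.spec.activeDeadlineSeconds), which terminates the Job once it has been active for that long. It could also be implemented within the command of the application itself, or by setting concurrencyPolicy to Replace so that a new run replaces any previous running version of the job.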
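
One way to cap the run time is the Job-level activeDeadlineSeconds field, set through the CronJob’s job template. A sketch using a merge patch; the function name and argument order are my own.

```shell
# Set a run-time limit (in seconds) on the Jobs a CronJob creates.
# Usage: set_job_deadline <namespace> <cronjob-name> <seconds>
set_job_deadline() {
  kubectl patch cronjob "$2" -n "$1" --type=merge \
    -p "{\"spec\":{\"jobTemplate\":{\"spec\":{\"activeDeadlineSeconds\":$3}}}}"
}
```

For the exam question this would be called with 17 as the last argument. The limit applies to Jobs created after the patch, not to ones already running.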

Thoughts

I don’t think I passed this, having so many issues is probably going to take me into the 60s mark.

LPIC-1 Expiry and Google+

Well, it was due to happen eventually, but I got an email saying my LPIC-1 certification is going to expire in 9 months, and I never got to finish LPIC-2.

Well, maybe I’ll redo it after I get my Kubernetes certifications.

Finally, while writing this post, I noticed that WordPress is now removing Google+ support because Google are shutting it down. A pity really, since I did like Google+, and while it didn’t take off, a lot of the features in G+ became general use, like Hangouts.

General Updates

So I haven’t been posting here much recently so here are some updates.

Been slowly trying to get back into running, have been slacking off WAAAAY too much lately. Tried using Aaptiv (@aaptiv), which is a fitness training app that has trainers talking you through the stuff, but there are a few problems with it.

  1. When you use a stretch/strength training routine or yoga routine, you’re reliant on them telling you what to do. There’s no video guide to show you the correct form, and that’s bad. Other apps like FitBit Coach have videos where you can copy the coach to make sure you have the right form.
  2. On Treadmill/Running routines, they talk in mph, but treadmills here in the UK go in km/h, which requires conversion (1 mph ≈ 1.6 km/h)

On a separate note, I have bought another attempt at the CKA exam, but this time bought the bundle with the Kubernetes Fundamentals Training from Linux Foundation. Let’s see how different that is to Linux Academy’s training….

 

CKA Exam: Strike #2

I took my CKA exam for the second time, and failed again. This time, however, I got much closer to the pass mark than my first time.

Things I think I fluffed on:

Cluster DNS

Pods, services and how they show up using nslookup. I got caught up in trying to figure out why my DNS wasn’t working, and I think it’s because I was trying to nslookup from outside the cluster, which obviously would not resolve the “.cluster.local” domain correctly. I forgot that you can get an interactive, in-cluster shell using:

kubectl run -i --tty busybox --image=busybox -- sh

Not to mention that nslookup {service}.svc.cluster.local won’t work: the full name includes the namespace ({service}.{namespace}.svc.cluster.local), and you have to pass -type=a to nslookup to get the IP address of the service and confirm it is resolving.
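
Putting that together, a one-shot lookup from inside the cluster could look like the sketch below. The function names and the pod name dns-check are made up, and the busybox:1.28 tag is an assumption on my part (it is the one I have seen used for DNS debugging).

```shell
# Build the fully qualified service name; namespace defaults to "default".
# Usage: service_fqdn <service> [namespace]
service_fqdn() {
  echo "$1.${2:-default}.svc.cluster.local"
}

# Run nslookup from a throwaway pod so the lookup uses cluster DNS.
# Usage: dns_check <service> [namespace]
dns_check() {
  kubectl run dns-check --rm -i --restart=Never --image=busybox:1.28 \
    -- nslookup "$(service_fqdn "$1" "$2")"
}
```

The `--rm` flag cleans the pod up afterwards, so this can be repeated without leaving debris in the namespace.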

etcd Snapshots

This got me both times. The first time I had no idea why doing a snapshot command was failing. The second time I figured out how to do the backup and how to invoke it from the pod, but still got it wrong. Now I’ve figured it out (and it was right in front of my face):

WARNING:
Environment variable ETCDCTL_API is not set; defaults to etcdctl v2.
Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.

USAGE:
etcdctl [global options] command [command options] [arguments...]

VERSION:
3.2.18

I wasn’t setting the ETCDCTL_API variable beforehand, so it was falling back to the v2 API, which doesn’t have the snapshot command:

# etcdctl
NAME:
etcdctl - A simple command line client for etcd.

WARNING:
Environment variable ETCDCTL_API is not set; defaults to etcdctl v2.
Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.

USAGE:
etcdctl [global options] command [command options] [arguments...]

VERSION:
3.2.18

COMMANDS:
backup backup an etcd directory
cluster-health check the health of the etcd cluster
mk make a new key with a given value
mkdir make a new directory
rm remove a key or a directory
rmdir removes the key if it is an empty directory or a key-value pair
get retrieve the value of a key
ls retrieve a directory
set set the value of a key
setdir create a new directory or update an existing directory TTL
update update an existing key with a given value
updatedir update an existing directory
watch watch a key for changes
exec-watch watch a key for changes and exec an executable
member member add, remove and list subcommands
user user add, grant and revoke subcommands
role role add, grant and revoke subcommands
auth overall auth controls
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
--debug output cURL commands which can be used to reproduce the request
--no-sync don't synchronize cluster information before sending request
--output simple, -o simple output response in the given format (simple, `extended` or `json`) (default: "simple")
--discovery-srv value, -D value domain name to query for SRV records describing cluster endpoints
--insecure-discovery accept insecure SRV records describing cluster endpoints
--peers value, -C value DEPRECATED - "--endpoints" should be used instead
--endpoint value DEPRECATED - "--endpoints" should be used instead
--endpoints value a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:2379,http://127.0.0.1:4001")
--cert-file value identify HTTPS client using this SSL certificate file
--key-file value identify HTTPS client using this SSL key file
--ca-file value verify certificates of HTTPS-enabled servers using this CA bundle
--username value, -u value provide username[:password] and prompt if password is not supplied.
--timeout value connection timeout per request (default: 2s)
--total-timeout value timeout for the command execution (except watch) (default: 5s)
--help, -h show help
--version, -v print the version

# ETCDCTL_API=3 etcdctl
NAME:
etcdctl - A simple command line client for etcd3.

USAGE:
etcdctl

VERSION:
3.2.18

API VERSION:
3.2

COMMANDS:
get Gets the key or a range of keys
put Puts the given key into the store
del Removes the specified key or range of keys [key, range_end)
txn Txn processes all the requests in one transaction
compaction Compacts the event history in etcd
alarm disarm Disarms all alarms
alarm list Lists all alarms
defrag Defragments the storage of the etcd members with given endpoints
endpoint health Checks the healthiness of endpoints specified in `--endpoints` flag
endpoint status Prints out the status of endpoints specified in `--endpoints` flag
watch Watches events stream on keys or prefixes
version Prints the version of etcdctl
lease grant Creates leases
lease revoke Revokes leases
lease timetolive Get lease information
lease keep-alive Keeps leases alive (renew)
member add Adds a member into the cluster
member remove Removes a member from the cluster
member update Updates a member in the cluster
member list Lists all members in the cluster
snapshot save Stores an etcd node backend snapshot to a given file
snapshot restore Restores an etcd member snapshot to an etcd directory
snapshot status Gets backend snapshot status of a given file
make-mirror Makes a mirror at the destination etcd cluster
migrate Migrates keys in a v2 store to a mvcc store
lock Acquires a named lock
elect Observes and participates in leader election
auth enable Enables authentication
auth disable Disables authentication
user add Adds a new user
user delete Deletes a user
user get Gets detailed information of a user
user list Lists all users
user passwd Changes password of user
user grant-role Grants a role to a user
user revoke-role Revokes a role from a user
role add Adds a new role
role delete Deletes a role
role get Gets detailed information of a role
role list Lists all roles
role grant-permission Grants a key to a role
role revoke-permission Revokes a key from a role
check perf Check the performance of the etcd cluster
help Help about any command

OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
--endpoints=[127.0.0.1:2379] gRPC endpoints
-h, --help[=false] help for etcdctl
--hex[=false] print byte strings as hex encoded strings
--insecure-skip-tls-verify[=false] skip server certificate verification
--insecure-transport[=true] disable transport security for client connections
--key="" identify secure client using this TLS key file
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)

And then I can run

ETCDCTL_API=3 etcdctl snapshot save snapshot.db --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

To create the snapshot.
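
The v3 command list above also has snapshot status, which is a cheap way to confirm the file is a usable snapshot. A small sketch (the function name is mine):

```shell
# Print summary information about an etcd snapshot file as a table.
# Usage: snapshot_status <snapshot-file>
snapshot_status() {
  ETCDCTL_API=3 etcdctl snapshot status "$1" --write-out=table
}
```

Running `snapshot_status snapshot.db` straight after the save would have told me in the exam whether the backup had actually worked.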

Certificate Rotation

I need to look this one up — I had no idea how to rotate the certificates

Static Pods

I’d never directly dealt with static pods before this exam, and I don’t think I had this question in my first run, so it was one I didn’t know the answer to. A bit of hunting on the k8s site led me to figure out it was a static pod question, but I couldn’t find out where the exam cluster was looking for its static pod manifests. The question told me a directory, but my yaml didn’t seem to be picked up by the kubelet.
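
In hindsight, the kubelet’s own config file is probably the place to look. This is a sketch that assumes a kubeadm-style node where the kubelet config lives at /var/lib/kubelet/config.yaml; other setups may pass the directory as a kubelet flag instead, and the function name is hypothetical.

```shell
# Print the directory the kubelet watches for static pod manifests.
# Usage: static_pod_path [kubelet-config-file]
static_pod_path() {
  awk '$1 == "staticPodPath:" { print $2 }' "${1:-/var/lib/kubelet/config.yaml}"
}
```

If this prints nothing, the path is not set in that file and the kubelet’s command line is the next thing to check.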

 

Final note

Generally, a lot of the questions from my first exam run showed up again in this run, which let me run through over half of the exam fairly quickly. I thought I was going to do better than my first run, and I did, but not by much.

Using the “change-cause” Kubernetes annotation as a changelog

Suppose you have an application you are deploying to your kubernetes cluster. For most purposes, running kubectl rollout history deployments/your-app will give you a very simple revision history.

$ kubectl rollout history deployments/awesome-app
REVISION  CHANGE-CAUSE
1         <none>

However, what if you had multiple deployments by different people? How would you know the reason for each deployment? Especially when you have something like this?

REVISION  CHANGE-CAUSE
1         <none>
2
3
4
5
...
...
100       <none>
101       <none>
102       <none>

It is possible to set a value into the change-cause field via an annotation, but that field is quite volatile: it is also filled/replaced if someone uses the --record flag when doing an apply. However, it can be put to much better use:

REVISION  CHANGE-CAUSE
11        Deploy new version of awesome-app to test environment
12        Deploy new version of awesome-app to staging environment
13        Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018
14        Deploy new version of awesome-app with integration to gitlab v0.0.0 [test]

How is this done? Pretty simply, actually. Here’s a snippet from the deploy script I use.

echo "Deploy message?"
read -r MESSAGE
if [ -z "$MESSAGE" ]; then
  MESSAGE="Deploy new version of awesome-app, $(date)"
  echo "Blank message detected, defaulting to \"$MESSAGE\""
fi
echo "Deploy updates..."
sed "s/SUB_TIMESTAMP/$(date)/g" deploy.yaml | kubectl replace -f -
kubectl annotate deployment awesome-app kubernetes.io/change-cause="$MESSAGE" --record=false --overwrite=true
kubectl rollout status deployments/awesome-app
kubectl rollout history deployment awesome-app

For lines 1 to 6, I read in a message from the terminal to populate the annotation, and if nothing is provided, a default is used.
On line 8, I replace the timestamp to trigger a change to the deployment (this can be anything, for example, changing the version tag of your docker image from awesome-app:release-1.0 to awesome-app:release-1.1)

Note that I used replace and not apply. replace will reset the deployment declaration, and since my deploy yaml does NOT contain a change-cause annotation, replace will remove the annotation.

On line 9, I annotate the deployment, making sure I don’t record it and overwrite the annotation in the event it’s there already (though those two switches might be redundant)

On line 10 I check the status of the rollout — this blocks until it is complete

On line 11, I then dump the deployment history.

This is an example of a script run:

$ ./deploy.sh
Deploy message?
[typed] Deploy new version of awesome-app with gitlab integration v0.0.0 [test]
Deploy updates...
deployment "awesome-app" replaced
deployment "awesome-app" annotated
Waiting for rollout to finish: 1 old replicas are pending termination...
deployment "awesome-app" successfully rolled out
deployments "awesome-app"
REVISION  CHANGE-CAUSE
11        Deploy new version of awesome-app, Thu 21 Jun 07:00:19 BST 2018
12        Deploy new version of awesome-app, Thu 21 Jun 07:00:52 BST 2018
13        Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018
14        Deploy new version of awesome-app with integration to gitlab v0.0.0 [test]
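
To check what is currently set without dumping the whole history, the annotation can be read back directly. A sketch, with a made-up function name; the backslash escapes the dot inside the annotation key so jsonpath treats it as one key.

```shell
# Print the current change-cause annotation of a deployment.
# Usage: change_cause <deployment-name>
change_cause() {
  kubectl get deployment "$1" \
    -o jsonpath='{.metadata.annotations.kubernetes\.io/change-cause}'
}
```

Calling `change_cause awesome-app` right after the annotate step is a quick sanity check that the message landed.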

kubectl Displaying Taints

One of the questions in my CKA exam was how to display taints with kubectl. While you can use kubectl describe, it prints a lot of other information too.

Then I found out about jsonpath and its similarity to jq.

You can display the taints with something like

for a in $(kubectl get nodes --no-headers | awk '{print $1}')
do
echo $a -- $(kubectl get nodes/$a -o jsonpath='{.spec.taints[*].key}{":"}{.spec.taints[*].effect}')
done

Sample output

ip-10-10-10-147.eu-west-2.compute.internal -- node-role.kubernetes.io/master:NoSchedule
ip-10-10-10-159.eu-west-2.compute.internal -- :

So the first one has a taint (it’s the master node) and the second one doesn’t.
(maybe I need to hack this a bit more when I have multiple taints but I’ll do that when I have some multi-tainted nodes to play with)
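
For nodes with several taints, a jsonpath range keeps each key paired with its own effect. This is a sketch I have not run against a multi-tainted cluster; the function name is my own.

```shell
# Print every node followed by its taints as key:effect pairs.
list_taints() {
  for node in $(kubectl get nodes -o name); do
    # range walks the taints one at a time, so keys and effects stay paired.
    taints=$(kubectl get "$node" \
      -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{" "}{end}')
    echo "$node -- ${taints:-<none>}"
  done
}
```

Note that `-o name` returns names in the form node/{name}, so the output lines carry that prefix.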

EDIT: Another way as provided by tdodds81

$ kubectl get nodes -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME TAINTS
ip-10-10-10-148.eu-west-2.compute.internal [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-10-10-218.eu-west-2.compute.internal
ip-10-10-10-239.eu-west-2.compute.internal
ip-10-10-10-249.eu-west-2.compute.internal
ip-10-10-10-51.eu-west-2.compute.internal

And with multi-taints, it looks like this (on a GKE cluster)

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                                                  TAINTS
gke-test-cluster-k8s-n1-highmem-2-nod-589548dd-3z1v   [map[effect:NoSchedule key:key1 timeAdded:<nil> value:value1] map[value:value2 effect:NoSchedule key:key2 timeAdded:<nil>]]
gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-9lwk   <none>
gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-lnvl   <none>
gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-nt49   <none>
gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-xgbd   <none>