Kubernetes uses Service Accounts to control who can access what within the cluster, but once a request leaves the cluster, it uses the node's default account. On GKE this is normally the default Compute Engine service account, which has a very high level of access and could result in a lot of damage if your cluster is compromised.
In this article, I will be setting up a GKE cluster using a minimal-access service account and enabling Workload Identity.
Workload Identity lets you bind a Kubernetes service account to a service account in GCP. You can then control that account's GCP permissions from within GCP -- no RBAC/ABAC messing about needed (although you will still need to mess with RBAC/ABAC if you want to restrict that service account within Kubernetes, but that's a separate article).
Let's start with the Terraform. We define three variables that we can reuse later -- the project, region and zone. Adjust these to match your own setup.
The provider block (provider "google" {..}) references those variables and also points at the credentials.json file that will be used to create the resources in your account.
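These are the same blocks that appear in the full listing at the end of this article:
variable "project" {
default = "REPLACE_ME"
}
variable "region" {
default = "europe-west2"
}
variable "zone" {
default = "europe-west2-a"
}
provider "google" {
project = var.project
region = var.region
zone = var.zone
credentials = file("credentials.json")
}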
Next we create the service account that we will bind to the cluster. This service account should contain minimal permissions as it will be the default account used by requests leaving the cluster. Only give it what is essential. You will notice I do not bind it to any roles.
resource "google_service_account" "cluster-serviceaccount" {
account_id = "cluster-serviceaccount"
display_name = "Service Account For Terraform To Make GKE Cluster"
}
Now let's define our cluster and node pool. This block can vary wildly depending on your circumstances, but I'll use a Kubernetes 1.16 single-zone cluster with an e2-medium node size and autoscaling enabled.
variable "cluster_version" {
default = "1.16"
}
resource "google_container_cluster" "cluster" {
name = "tutorial"
location = var.zone
min_master_version = var.cluster_version
project = var.project
lifecycle {
ignore_changes = [
# Ignore changes to min-master-version as that gets changed
# after deployment to minimum precise version Google has
min_master_version,
]
}
# We can't create a cluster with no node pool defined, but
# we want to only use separately managed node pools. So we
# create the smallest possible default node pool and
# immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
workload_identity_config {
identity_namespace = "${var.project}.svc.id.goog"
}
}
resource "google_container_node_pool" "primary_preemptible_nodes" {
name = "tutorial-cluster-node-pool"
location = var.zone
project = var.project
cluster = google_container_cluster.cluster.name
node_count = 1
autoscaling {
min_node_count = 1
max_node_count = 5
}
version = var.cluster_version
node_config {
preemptible = true
machine_type = "e2-medium"
# Google recommends custom service accounts that have cloud-platform scope and
# permissions granted via IAM Roles.
service_account = google_service_account.cluster-serviceaccount.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
metadata = {
disable-legacy-endpoints = "true"
}
}
lifecycle {
ignore_changes = [
# Ignore changes to node_count, initial_node_count and version
# otherwise node pool will be recreated if there is drift between what
# terraform expects and what it sees
initial_node_count,
node_count,
version
]
}
}
Let's go through a few things on the above block:
variable "cluster_version" {
default = "1.16"
}
Defines a variable we will use to describe the version of Kubernetes we want on the master and worker nodes.
The ignore_changes block here tells terraform not to pay attention to changes in the min_master_version field. This is because even though we declare we wanted 1.16 as the version, GKE will put a Kubernetes variant of 1.16 onto the cluster. For example, the cluster might be created with version 1.16.9-gke.999 -- which is different to what Terraform expects, so if you were to run Terraform again, it would attempt to change the cluster version from 1.16.9-gke.999 to 1.16, cycling through the nodes again.
A GKE cluster must be created with a node pool. However, it is easier to manage node pools separately, so this block tells Terraform to delete the default node pool once the cluster is created.
This enables Workload Identity; the identity namespace must be of the format {project}.svc.id.goog.
Now let's move onto the Node Pool definition:
resource "google_container_node_pool" "primary_preemptible_nodes" {
name = "tutorial-cluster-node-pool"
location = var.zone
project = var.project
cluster = google_container_cluster.cluster.name
node_count = 1
autoscaling {
min_node_count = 1
max_node_count = 5
}
version = var.cluster_version
node_config {
preemptible = true
machine_type = "e2-medium"
# Google recommends custom service accounts that have cloud-platform scope and
# permissions granted via IAM Roles.
service_account = google_service_account.cluster-serviceaccount.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
metadata = {
disable-legacy-endpoints = "true"
}
}
lifecycle {
ignore_changes = [
# Ignore changes to node_count, initial_node_count and version
# otherwise node pool will be recreated if there is drift between what
# terraform expects and what it sees
initial_node_count,
node_count,
version
]
}
}
This sets up autoscaling with a starting node count of 1 and a max node count of 5. Unlike with EKS, you don't need to deploy the cluster autoscaler into the cluster -- enabling this natively allows GKE to scale nodes up or down. The downside is you don't see as many messages compared to the deployed version, so it's sometimes harder to debug why a pod isn't triggering a scale-up.
resource "google_container_node_pool" "primary_preemptible_nodes" {
...
node_config {
preemptible = true
machine_type = "e2-medium"
# Google recommends custom service accounts that have cloud-platform scope and
# permissions granted via IAM Roles.
service_account = google_service_account.cluster-serviceaccount.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
metadata = {
disable-legacy-endpoints = "true"
}
}
...
}
Here we define the node config: a pool of preemptible nodes of type e2-medium. We tie the nodes to the service account defined earlier and grant only the cloud-platform scope.
The metadata block is needed because if you leave it out, GKE applies disable-legacy-endpoints = "true" anyway. Terraform then sees a difference between its state and the real node pool and respins the pool every time you run it, thinking it needs to apply the updated config.
resource "google_container_node_pool" "primary_preemptible_nodes" {
...
lifecycle {
ignore_changes = [
# Ignore changes to node_count, initial_node_count and version
# otherwise node pool will be recreated if there is drift between what
# terraform expects and what it sees
initial_node_count,
node_count,
version
]
}
}
Similar to the version field on the master, we tell Terraform to ignore some fields if they have changed.
version we ignore for the same reason as on the master -- the version deployed will be slightly different to the one we declared. initial_node_count we ignore because if the node pool has scaled up, Terraform would otherwise try to scale the nodes back down to the initial_node_count value, sending pods into Pending. node_count we ignore for much the same reason -- on a production system it will rarely still be at its initial value once the pool has scaled.
With the basic skeleton in place, we can run Terraform to set up the stack. We haven't actually bound anything to service accounts yet, but that comes later.
Let's Terraform the infrastructure:
terraform init
terraform plan -out tfplan
terraform apply tfplan
Creation of the cluster can take between 5 and 15 minutes.
Next, we need to get credentials so that kubectl can talk to the cluster.
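Assuming the cluster name and zone from the Terraform above, that looks something like this (substitute your own project id):
gcloud container clusters get-credentials tutorial --zone europe-west2-a --project YOUR_PROJECT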
Now let's do our first test. We'll use gsutil to list the GCS buckets in our project.
kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls
This will run a docker image with gsutil in it and then remove the container when the command finishes.
The output should be something like this:
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Caller does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-5nzgt -c test -i -t' command when the pod is running
deployment.apps "test" deleted
As you can see, we get a 403 -- the service account in play here (the minimal one we attached to the nodes) doesn't have permission to access Google Storage.
Now let's set up the service account we will use for binding:
resource "google_service_account" "workload-identity-user-sa" {
account_id = "workload-identity-tutorial"
display_name = "Service Account For Workload Identity"
}
resource "google_project_iam_member" "storage-role" {
role = "roles/storage.admin"
# role = "roles/storage.objectAdmin"
member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
resource "google_project_iam_member" "workload_identity-role" {
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}
This block defines the service account in GCP that we will be binding to.
resource "google_project_iam_member" "storage-role" {
role = "roles/storage.admin"
# role = "roles/storage.objectAdmin"
member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
This block assigns the Storage Admin role to the service account we just created -- essentially putting the service account in the Storage Admin group. Think of it as adding the account to a group rather than attaching a permission or role to the account itself.
resource "google_project_iam_member" "workload_identity-role" {
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}
This block adds the service account as a Workload Identity User. The member field is a bit confusing: the ${var.project}.svc.id.goog part indicates that it is a Workload Identity namespace, and the part in [...] is the Kubernetes service account we want to allow to be bound to it. This membership, together with an annotation on the Kubernetes service account (described below), allows the service account in Kubernetes to essentially impersonate the service account in GCP, as you will see in the example.
With the service account set up in Terraform, let's run the Terraform apply steps again:
terraform plan -out tfplan
terraform apply tfplan
Assuming it didn't error, we now have one half of the binding -- the GCP service account. We now need to create the service account inside Kubernetes.
You'll recall that we had a piece of data in the [...]: workload-identity-test/workload-identity-user. This is the service account we need to create. Below is the yaml for creating the namespace and the service account. Save it into the file workload-identity-test.yaml:
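A minimal version of that yaml looks like this (replace {project} with your project id; the annotation is what points the Kubernetes service account at the GCP one):
apiVersion: v1
kind: Namespace
metadata:
  name: workload-identity-test
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workload-identity-user
  namespace: workload-identity-test
  annotations:
    # The GCP service account this Kubernetes service account will impersonate
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com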
So the Kubernetes service account references the GCP service account, and the GCP service account references the Kubernetes service account.
Important Note: If you do not do the double referencing -- for example, if you forget to include the annotation on the service account or forget to put the referenced Kubernetes service account in the Workload Identity member block, then GKE will use the default service account specified on the node.
Now it's time to put it to the test. If everything is set up correctly, run the previous test again:
kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls
You should still get a 403, but with a different error message.
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Primary: /namespaces/{project}.svc.id.goog with additional claims does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-8ltvc -c test -i -t' command when the pod is running
deployment.apps "test" deleted
Let's now create the namespace and Kubernetes service account, using the file we saved in the earlier step:
$ kubectl apply -f workload-identity-test.yaml
namespace/workload-identity-test created
serviceaccount/workload-identity-user created
So now let's run the test again, but this time we specify the service account and also the namespace, since a service account is tied to the namespace it resides in. In this case, the namespace of our service account is workload-identity-test:
kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls
The output will show the buckets you have:
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
gs://backups/
gs://snapshots/
Session ended, resume using 'kubectl attach test-66754998f-sp79b -c test -i -t' command when the pod is running
deployment.apps "test" deleted
NOTE: If you're running a later version of Kubernetes or kubectl, you may get the following error:
Flag --serviceaccount has been deprecated, has no effect and will be removed in 1.24.
In that case, you need to instead use the --overrides switch:
kubectl run -it --rm -n workload-identity-test test --overrides='{ "apiVersion": "v1", "spec": { "serviceAccountName": "workload-identity-user" } }' --image gcr.io/cloud-builders/gsutil ls
Let's now change the permissions on the GCP service account to prove it's the one being used. Change this block:
resource "google_project_iam_member" "storage-role" {
role = "roles/storage.admin"
# role = "roles/storage.objectAdmin"
member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
And change the active role like so:
resource "google_project_iam_member" "storage-role" {
# role = "roles/storage.admin" ## <-- comment this out
role = "roles/storage.objectAdmin" ## <-- uncomment this
member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
Run the terraform actions again:
terraform plan -out tfplan
terraform apply tfplan
Allow a few minutes for the change to propagate then run the test again:
kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls
(See earlier if you get an error regarding the serviceaccount switch)
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 workload-identity-tutorial@{project}.iam.gserviceaccount.com does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-66754998f-k5dm5 -c test -i -t' command when the pod is running
deployment.apps "test" deleted
And there you have it: the service account in the cluster, workload-identity-test/workload-identity-user, is bound to the service account workload-identity-tutorial@{project}.iam.gserviceaccount.com in GCP and carries the permissions granted to it.
If the service account on Kubernetes is compromised in some way, you just need to revoke the permissions on the GCP service account and the Kubernetes service account no longer has any permissions to do anything in GCP.
For simplicity, here's the Terraform used for this tutorial. Replace what you need -- you can move things around and separate into other Terraform files if you wish -- I kept it in one file for simplicity.
variable "project" {
default = "REPLACE_ME"
}
variable "region" {
default = "europe-west2"
}
variable "zone" {
default = "europe-west2-a"
}
provider "google" {
project = var.project
region = var.region
zone = var.zone
credentials = file("credentials.json")
}
resource "google_service_account" "cluster-serviceaccount" {
account_id = "cluster-serviceaccount"
display_name = "Service Account For Terraform To Make GKE Cluster"
}
variable "cluster_version" {
default = "1.16"
}
resource "google_container_cluster" "cluster" {
name = "tutorial"
location = var.zone
min_master_version = var.cluster_version
project = var.project
lifecycle {
ignore_changes = [
# Ignore changes to min-master-version as that gets changed
# after deployment to minimum precise version Google has
min_master_version,
]
}
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
remove_default_node_pool = true
initial_node_count = 1
workload_identity_config {
identity_namespace = "${var.project}.svc.id.goog"
}
}
resource "google_container_node_pool" "primary_preemptible_nodes" {
name = "tutorial-cluster-node-pool"
location = var.zone
project = var.project
cluster = google_container_cluster.cluster.name
node_count = 1
autoscaling {
min_node_count = 1
max_node_count = 5
}
version = var.cluster_version
node_config {
preemptible = true
machine_type = "e2-medium"
# Google recommends custom service accounts that have cloud-platform scope
# and permissions granted via IAM Roles.
service_account = google_service_account.cluster-serviceaccount.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
metadata = {
disable-legacy-endpoints = "true"
}
}
lifecycle {
ignore_changes = [
# Ignore changes to node_count, initial_node_count and version
# otherwise node pool will be recreated if there is drift between what
# terraform expects and what it sees
initial_node_count,
node_count,
version
]
}
}
resource "google_service_account" "workload-identity-user-sa" {
account_id = "workload-identity-tutorial"
display_name = "Service Account For Workload Identity"
}
resource "google_project_iam_member" "storage-role" {
role = "roles/storage.admin"
# role = "roles/storage.objectAdmin"
member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}
resource "google_project_iam_member" "workload_identity-role" {
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}
It's taken me nearly a year, but I finally figured out one of the questions that stumped me in my CKAD (writeup: https://blenderfox.com/2019/12/01/ckad-writeup/)
In the exam, the question was to terminate a cronjob if it lasts longer than 17 seconds. There’s a startup deadline but not a duration deadline. It could be implemented within the command of the application itself, or by specifying to replace any previous running version of the jobs.
Well, I finally had that situation recently at work and wanted to terminate a cronjob if it was active more than 5 minutes, since the job shouldn't take that long. Finally found out that the answer was not in the CronJob documentation, but in the Job documentation.
CronJobs spawn a Job resource, and within the specification, you can include spec.activeDeadlineSeconds. This will terminate the job pod at that time and will consider the job as failed.
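A minimal sketch of how that looks for the 17-second case from the exam question (the name, image and command here are made up; on older clusters the CronJob apiVersion will be batch/v1beta1):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: short-job
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      # Terminate the job (and its pod) if it is still active after 17 seconds,
      # marking the job as failed
      activeDeadlineSeconds: 17
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: work
            image: busybox
            command: ["sh", "-c", "sleep 60"]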
So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write up here, before I forget everything and maybe allow the Kubernetes devs to read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783)
We noticed some weirdness occurring on our cluster when Jobs and CronJobs started behaving strangely.
Jobs were spawning but seemed not to spawn any pods to go with them; even over an hour later, they were sitting there without a pod.
Investigating other jobs, I found a crazy large number of pods in one of our namespaces, over 900 to be exact. These pods were all completed pods from a CronJob.
The CronJob was scheduled to run every minute, and the definition of the CronJob had valid values for the history -- sensible values for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit were set. And even if they weren't, the defaults would (or should) be used.
So why did we have over 900 cron pods, and why weren't they being cleaned up upon completion?
Just in case the number of pods were causing problems, I cleared out the completed pods:
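Something along these lines, with {namespace} and {cronjob-name} as placeholders -- and I suspended the CronJob at the same time:
# delete all completed pods in the namespace
kubectl delete pods -n {namespace} --field-selector=status.phase=Succeeded
# suspend the CronJob so it (in theory) stops scheduling new runs
kubectl patch cronjob {cronjob-name} -n {namespace} -p '{"spec":{"suspend":true}}'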
But that also didn't help, pods were still being generated. Which is weird -- why is a CronJob still spawning pods even when it's suspended?
So then I remembered that CronJobs actually generate Job objects. So I checked the Job objects and found over 3000 Job objects. Okay, something is seriously wrong here, there shouldn't be 3000 Job objects for something that only runs once a minute.
So I went and deleted all the CronJob related Job objects:
kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)
This reduced the pods down, but did not help us determine why the Job objects were not spawning pods.
I decided to get Google onto the case and raised a support ticket.
Their first investigation brought up something interesting. They sent me this snippet from the Master logs (redacted)
2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted
2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node
Spot the problem?
The time between "Job is created" and "Pod is created" is around 80 minutes in the first case, and 12 minutes in the second. That's right -- it took 80 minutes for the Pod to be spawned.
And this is where it dawned on me about what was possibly going on.
The CronJob spawned a Job object. The Job tried to spawn a pod, and that took a significant amount of time -- far more than the 1 minute between runs.
On the next cycle, the CronJob checks whether it already has a running pod, because of its .spec.concurrencyPolicy value.
The CronJob does not find a running pod, so it generates another Job object, which also gets stuck waiting for pod generation.
And so on, and so on.
Each time, a new Job gets added, gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace which also gets stuck...
Eventually the pod generates, but by then there's a backlog of Jobs, meaning that even suspending the CronJob won't have any effect until the Jobs in the backlog are cleared or deleted (which is why I had deleted them).
Google investigated further, and found the culprit:
Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
We were testing up9 and this was using a webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.
So where does this leave us? Well, a few thoughts:
A misbehaving/misconfigured webhook can cause a snowball effect in the cluster, causing multiple runs of a single CronJob without cleanup -- the successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.
This could break systems where the CronJob is supposed to be run mutually exclusively, since the delay in pod generation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set as Forbid.
If someone managed (whether accidentally or maliciously) to install a webhook that causes this pod-spawning delay, and then added a CronJob that runs once a minute and is crafted never to finish, this snowball effect would cause the cluster to run out of resources and/or scale up nodes until it hits the maximum allowed by your configuration.
5 questions I could not answer, and one I could -- but arguably that question was ambiguous:
1. Fix a broken cluster -- kubelet was started but couldn't connect to itself.
2. Add a node to the cluster. The nodes did not have kubeadm installed.
3. Static pod -- couldn't find the path to put the manifest yaml in.
4 and 5 I can't remember, but I will update this if I do.
Ambiguous Question:
Create a pod with a persistent volume -- except the volume isn't persistent, and the question doesn't tell you how big to make the PV. I used emptyDir, but that's not really a PV (I didn't create a PV or a PVC).
So I did the CKAD exam, and it was one of the latest-starting exams I've done, starting at 22:45 and finishing at 00:45. The CKAD exam is 2 hours versus the CKA's 3 hours.
And I went into the exam feeling relatively confident. But, damn, the 2 hours goes by really quickly.
Had several questions I wasn't able to complete or only partially complete.
Liveness and Readiness Probes
This question wanted a pod to be restarted if an endpoint returns 500. Simple enough, but there was a catch: if another endpoint returns 500, then the application is still starting, and the check should be disregarded.
I've done something similar in real life by implementing this kind of check as a curl command (I should write a blog entry on that some time).
So in the exam, I wrote both the liveness and readiness checks to chain two curl commands together: if the first endpoint (/starting in this case) returned 200, it would then hit the next endpoint (/healthz) and return a failure if that gave a 500.
Buuuuut, the image didn't have curl installed so the probes failed. I could use the hack I've used in my image and install curl as part of the check, but time constraints wouldn't let me.
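For reference, the shape of the probe I was going for was something like this (a sketch of just the probe portion of the container spec; the /starting and /healthz paths come from the question, while the port and the image having curl are assumptions):
livenessProbe:
  exec:
    command:
    - sh
    - -c
    # If /starting returns 200 the app has started, so check /healthz and fail on a 500.
    # While /starting is not 200 the app is still starting, so report healthy.
    - |
      if [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/starting)" = "200" ]; then
        code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz)
        [ "$code" != "500" ]
      fi
  periodSeconds: 10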
Persistent Volumes
Similar to the CKA question, there was a quirkily worded question here which wanted me to add a file to a node, create a pod that used hostPath and reserve a 1Gi PV. The documentation does not provide an example of that, just a pod with a hostPath as an internal volume: https://kubernetes.io/docs/concepts/storage/volumes/#hostpath
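What it seemed to want was something along these lines (a sketch; the names, path and access mode are assumptions), with the pod then mounting the claim via a persistentVolumeClaim volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: host-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data/host-pv
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: host-pv-claim
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi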
Network Policies
A technology I haven't used in Kubernetes yet. They gave several policies, one that allowed "app:proxy" and one that allowed "app:db", and wanted us to edit a pod so that it was only allowed to talk to those.
We were not allowed to modify the policies. I can't remember whether we were allowed to create new policies for this question
But both those policies use the app label. And the pod can't have the same label with two values (I did try)
Though thinking about it now, and after a few checks, the NetworkPolicy object describes how to restrict traffic to the pods it selects -- so those selectors may relate to the pods the policy is protecting. I should have looked inside the policies more carefully to see what the ingress rule was saying (for example, something like "app:frontend"), and then made sure the pod was labelled accordingly.
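In other words, something like this, where podSelector picks the pods being protected and the from block names the label the calling pods need (the labels here are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-db
spec:
  # The pods this policy protects
  podSelector:
    matchLabels:
      app: db
  ingress:
  - from:
    # Only pods carrying this label may talk to the app:db pods
    - podSelector:
        matchLabels:
          app: frontend
  policyTypes:
  - Ingress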
"Ambassador" Sidecar Pattern
A big chunk of the exam time was taken up by the sidecar questions -- far more time than I would have liked, to be honest.
They had a question on the adapter pattern, using fluentd, which was fine -- I got that to work. But there was another where I had to use HAProxy to proxy requests to a different port (the ambassador pattern). A useful use case, but I ran out of time to finish it. I wanted to come back and revisit it if I had time, but didn't.
CronJobs
Terminate a cronjob if it lasts longer than 17 seconds. There's a startup deadline but not a duration deadline. It could be implemented within the command of the application itself, or by specifying to replace any previous running version of the jobs.
Thoughts
I don't think I passed this, having so many issues is probably going to take me into the 60s mark.
Well, it was due to happen eventually, but I got an email saying my LPIC-1 certification is going to expire in 9 months, and I never got to finish LPIC-2.
Well, maybe I’ll redo it after I got my Kubernetes certifications
Finally, while writing this post, I noticed that Wordpress is now removing Google+ support because Google is shutting it down. A pity really, since I did like Google+, and while it didn't take off, a lot of its features were there before they hit general use, like Hangouts.
So I haven’t been posting here much recently so here are some updates.
I've been slowly trying to get back into running, having been slacking off WAAAAY too much lately. I tried Aaptiv (@aaptiv), which is a fitness training app with trainers talking you through the workouts, but there are a few problems with it.
When you use a stretch/strength routine or a yoga routine, you're reliant on them telling you what to do; there's no video guide to show you the correct form, and that's bad. Other apps like FitBit Coach have videos where you can copy the coach to make sure you have the right form.
On Treadmill/Running routines, they talk in mph, but treadmills here in the UK go in km/h, which requires conversion (1.0 mph = 1.6 kph)
On a separate note, I have bought another attempt at the CKA exam, but this time bought the bundle with the Kubernetes Fundamentals Training from Linux Foundation. Let’s see how different that is to Linux Academy’s training….
I took my CKA exam for the second time -- and failed again. This time, however, I got much closer to the pass mark than my first time.
Things I think I fluffed on:
Cluster DNS
pods, services and how they can show up using nslookup. I got caught up in trying to figure out why my DNS wasn’t working, and I think it’s because I was trying to nslookup from outside the cluster, which obviously would not resolve the “.cluster.local” domain correctly. I forgot that you can do an interactive, in-cluster shell using
[code lang=text]
kubectl run -i --tty busybox --image=busybox -- sh
[/code]
Not to mention that doing nslookup {service}.svc.cluster.local won't work (the namespace is part of the name, so it needs to be {service}.{namespace}.svc.cluster.local), and you have to pass -type=a to nslookup to get the IP address of the service and confirm it is resolving.
etcd Snapshots
This got me both times. The first time I had no idea why doing a snapshot command was failing. The second time I figured out how to do the backup and how to invoke it from the pod, but still got it wrong. Now I figured out (and it was right in front of my face):
[code lang=text]
WARNING:
Environment variable ETCDCTL_API is not set; defaults to etcdctl v2.
Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.
[/code]
I wasn’t using the ETCDCTL_API variable beforehand so it was falling back to V2 api, which doesn’t have the snapshot command:
[code lang=text]
# etcdctl
NAME:
etcdctl - A simple command line client for etcd.
WARNING:
Environment variable ETCDCTL_API is not set; defaults to etcdctl v2.
Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.
COMMANDS:
backup backup an etcd directory
cluster-health check the health of the etcd cluster
mk make a new key with a given value
mkdir make a new directory
rm remove a key or a directory
rmdir removes the key if it is an empty directory or a key-value pair
get retrieve the value of a key
ls retrieve a directory
set set the value of a key
setdir create a new directory or update an existing directory TTL
update update an existing key with a given value
updatedir update an existing directory
watch watch a key for changes
exec-watch watch a key for changes and exec an executable
member member add, remove and list subcommands
user user add, grant and revoke subcommands
role role add, grant and revoke subcommands
auth overall auth controls
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
--debug output cURL commands which can be used to reproduce the request
--no-sync don't synchronize cluster information before sending request
--output simple, -o simple output response in the given format (simple, extended or json) (default: "simple")
--discovery-srv value, -D value domain name to query for SRV records describing cluster endpoints
--insecure-discovery accept insecure SRV records describing cluster endpoints
--peers value, -C value DEPRECATED - "--endpoints" should be used instead
--endpoint value DEPRECATED - "--endpoints" should be used instead
--endpoints value a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:2379,http://127.0.0.1:4001")
--cert-file value identify HTTPS client using this SSL certificate file
--key-file value identify HTTPS client using this SSL key file
--ca-file value verify certificates of HTTPS-enabled servers using this CA bundle
--username value, -u value provide username[:password] and prompt if password is not supplied.
--timeout value connection timeout per request (default: 2s)
--total-timeout value timeout for the command execution (except watch) (default: 5s)
--help, -h show help
--version, -v print the version
ETCDCTL_API=3 etcdctl
NAME:
etcdctl - A simple command line client for etcd3.
USAGE:
etcdctl
VERSION:
3.2.18
API VERSION:
3.2
COMMANDS:
get Gets the key or a range of keys
put Puts the given key into the store
del Removes the specified key or range of keys [key, range_end)
txn Txn processes all the requests in one transaction
compaction Compacts the event history in etcd
alarm disarm Disarms all alarms
alarm list Lists all alarms
defrag Defragments the storage of the etcd members with given endpoints
endpoint health Checks the healthiness of endpoints specified in --endpoints flag
endpoint status Prints out the status of endpoints specified in --endpoints flag
watch Watches events stream on keys or prefixes
version Prints the version of etcdctl
lease grant Creates leases
lease revoke Revokes leases
lease timetolive Get lease information
lease keep-alive Keeps leases alive (renew)
member add Adds a member into the cluster
member remove Removes a member from the cluster
member update Updates a member in the cluster
member list Lists all members in the cluster
snapshot save Stores an etcd node backend snapshot to a given file
snapshot restore Restores an etcd member snapshot to an etcd directory
snapshot status Gets backend snapshot status of a given file
make-mirror Makes a mirror at the destination etcd cluster
migrate Migrates keys in a v2 store to a mvcc store
lock Acquires a named lock
elect Observes and participates in leader election
auth enable Enables authentication
auth disable Disables authentication
user add Adds a new user
user delete Deletes a user
user get Gets detailed information of a user
user list Lists all users
user passwd Changes password of user
user grant-role Grants a role to a user
user revoke-role Revokes a role from a user
role add Adds a new role
role delete Deletes a role
role get Gets detailed information of a role
role list Lists all roles
role grant-permission Grants a key to a role
role revoke-permission Revokes a key from a role
check perf Check the performance of the etcd cluster
help Help about any command
OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
--endpoints=[127.0.0.1:2379] gRPC endpoints
-h, --help[=false] help for etcdctl
--hex[=false] print byte strings as hex encoded strings
--insecure-skip-tls-verify[=false] skip server certificate verification
--insecure-transport[=true] disable transport security for client connections
--key="" identify secure client using this TLS key file
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
[/code]
And then I can run
ETCDCTL_API=3 etcdctl snapshot save snapshot.db --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
To create the snapshot.
Certificate Rotation
I need to look this one up – I had no idea how to rotate the certificates
Static Pods
I'd never directly dealt with static pods before this exam, and I don't think I had this question in my first run, so it was one I didn't know the answer to. A bit of hunting on the k8s site led me to figure out it was a static pod question, but I couldn't find out where the exam cluster was looking for its static pod manifests. The question told me a directory, but my yaml didn't seem to be picked up by the kubelet.
Final note
Generally, a lot of the questions from my first exam run showed up again in this run, which let me run through over half of the exam fairly quickly. I thought I was going to do better than my first run, and I did, but not by much.
Suppose you have an application you are deploying to your kubernetes cluster. For most purposes, running kubectl rollout history deployments/your-app will give you a very simple revision history.
However, what if you had multiple deployments by different people? How would you know the reason for each deployment? Especially when you have something like this?
It is possible to set a value in the change-cause field via an annotation, but that field is quite volatile: it is also filled/replaced if someone uses the --record flag when doing an apply. However, it can be utilised to make the history much more useful:
[code lang=text]
REVISION CHANGE-CAUSE
11 Deploy new version of awesome-app to test environment
12 Deploy new version of awesome-app to staging environment
13 Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018
14 Deploy new version of awesome-app with integration to gitlab v0.0.0 [test]
[/code]
How is this done? Pretty simply, actually. Here's a snippet from the deploy script I use.
[code lang=text]
echo Deploy message?
read MESSAGE
if [ -z "$MESSAGE" ]; then
MESSAGE="Deploy new version of awesome-app, $(date)"
echo Blank message detected, defaulting to "$MESSAGE"
fi
echo Deploy updates...
cat deploy.yaml | sed s/'SUB_TIMESTAMP'/"$(date)"/g | kubectl replace -f -
kubectl annotate deployment awesome-app kubernetes.io/change-cause="$MESSAGE" --record=false --overwrite=true
kubectl rollout status deployments/awesome-app
kubectl rollout history deployment awesome-app
[/code]
For lines 1 to 6, I read in a message from the terminal to populate the annotation, and if nothing is provided, a default is used.
On line 8, I replace the timestamp to trigger a change to the deployment (this can be anything, for example, changing the version tag of your docker image from awesome-app:release-1.0 to awesome-app:release-1.1)
Note that I used replace and not apply – replace will reset the deployment declaration, and since my deploy yaml does NOT contain a change-cause annotation, replace will remove the annotation.
On line 9, I annotate the deployment, making sure I don’t record it and overwrite the annotation in the event it’s there already (though those two switches might be redundant)
On line 10 I check the status of the rollout – this blocks until it is complete
On line 11, I then dump the deployment history.
This is an example of a script run:
[code lang=text]
$ ./deploy.sh
Deploy message?
[typed] Deploy new version of awesome-app with gitlab integration v0.0.0 [test]
Deploy updates…
deployment "awesome-app" replaced
deployment "awesome-app" annotated
Waiting for rollout to finish: 1 old replicas are pending termination...
deployment "awesome-app" successfully rolled out
deployments "awesome-app"
REVISION CHANGE-CAUSE
11 Deploy new version of awesome-app, Thu 21 Jun 07:00:19 BST 2018
12 Deploy new version of awesome-app, Thu 21 Jun 07:00:52 BST 2018
13 Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018
14 Deploy new version of awesome-app with integration to gitlab v0.0.0 [test]
[/code]
One of the questions in my CKA exam was how to display taints with kubectl. While you can use kubectl describe, it outputs a lot of other information too.
Then I found out about jsonpath and its similarity to jq.
You can display the taints with something like
[code lang=text]
for a in $(kubectl get nodes --no-headers | awk '{print $1}')
do
echo $a -- $(kubectl get nodes/$a -o jsonpath='{.spec.taints[].key}{":"}{.spec.taints[].effect}')
done
[/code]
So the first one has a taint (it’s the master node) and the second one doesn’t.
(maybe I need to hack this a bit more when I have multiple taints but I’ll do that when I have some multi-tainted nodes to play with)
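Untested on a multi-tainted node, but a jsonpath range should do it -- something like this in place of the single-taint lookup above:
[code lang=text]
kubectl get nodes/$a -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{" "}{end}'
[/code]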
EDIT: Another way as provided by tdodds81
[code lang=text]
$ kubectl get nodes -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME TAINTS
ip-10-10-10-148.eu-west-2.compute.internal [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-10-10-218.eu-west-2.compute.internal
ip-10-10-10-239.eu-west-2.compute.internal
ip-10-10-10-249.eu-west-2.compute.internal
ip-10-10-10-51.eu-west-2.compute.internal
[/code]
And with multi-taints, it looks like this (on a GKE cluster)
Well, exam is done – for the most part it went okay. A few questions were a bit ambiguous and there were several regarding etcd and low-level tinkering with the kubelet – which I hadn’t had too much experience with unfortunately. I’m hoping I did OK, though…
With less than 24 hours to go before my exam, I’m going to spend those last hours going through the review questions and see if I can still remember the content.
Caveat though – you will have to make sure that your node security group allows your bastion security group to talk to the nodes on the additional ports. By default, the only port that the bastions are able to talk to the node security groups on is SSH (22) only.
SONOMA, Calif., March 6, 2018 – Open Source Leadership Summit – The Cloud Native Computing Foundation® (CNCF®), which sustains and integrates open source technologies like Kubernetes® and Prometheus™, today announced that Kubernetes is the first project to graduate. To move from incubation to graduate, projects must demonstrate thriving adoption, a documented, structured governance process, and a strong commitment to community success and inclusivity.
Let’s assume you have an application that runs happily on its own and is stateless. No problem. You deploy it onto Kubernetes and it works fine. You kill the pod and it respins, happily continuing where it left off.
Let’s add three replicas to the group. That also is fine, since its stateless.
Let’s now change that so that the application is now stateful and requires storage of where it is in between runs. So you pre-provision a disk using EBS and hook that up into the pods, and convert the deployment to a stateful set. Great, it still works fine. All three will pick up where they left off.
Now, what if we wanted to share the same state between the replicas?
For example, what if these three replicas were frontend boxes to a website? Having three different disks is a bad idea unless you can guarantee they will all have the same content. Even if you can, there’s guaranteed to be a case where one or more of the boxes will be either behind or ahead of the other boxes, and consequently have a case where one or more of the boxes will serve the wrong version of content.
There are several options for shared storage, NFS is the most logical but requires you to pre-provision a disk that will be used and also to either have an NFS server outside the cluster or create an NFS pod within the cluster. Also, you will likely over-provision your disk here (100GB when you only need 20GB for example)
Another alternative is EFS, which is Amazon’s NFS storage, where you mount an NFS and only pay for the amount of storage you use. However, even when creating a filesystem in a public subnet, you get a private IP which is useless if you are not DirectConnected into the VPC.
Another option is S3, but how do you use that short of using “s3 sync” repeatedly?
One answer is through the use of s3fs and sshfs
We use s3fs to mount the bucket into a pod (or pods), then we can use those mounts via sshfs as an NFS-like configuration.
The downside to this setup is the fact it will be slower than locally mounted disks.
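As a rough sketch of the idea (the bucket name, mount paths, user and host here are placeholders, and s3fs needs credentials in a passwd file or from an instance role):
# inside the pod that owns the mount
s3fs my-bucket /mnt/shared -o passwd_file=/etc/passwd-s3fs -o allow_other
# from a pod/host that wants the NFS-like view
sshfs user@storage-pod:/mnt/shared /mnt/shared -o allow_other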
Having a master in a Kubernetes cluster is all very well and good, but if that master goes down the entire cluster cannot schedule new work. Pods will continue to run, but new ones cannot be scheduled and any pods that die will not get rescheduled.
Having multiple masters allows for more resiliency and can pick up when one goes down. However, as I found out, setting multi-master was quite problematic. Using the guide here only provided some help so after trashing my own and my company’s test cluster, I have expanded on the linked guide.
First add the subnet details for the new zone into your cluster definition – CIDR, subnet id, and make sure you name it something that you can remember. For simplicity, I called mine eu-west-2c. If you have a definition for utility (and you will if you use a bastion), make sure you have a utility subnet also defined for the new AZ
Now, create your master instance groups, you need an odd number to enable quorum and avoid split brain (I’m not saying prevent, and there are edge cases where this could be possible even with quorum). I’m going to add west-2b and west-2c. AWS recently introduced the third London AWS zone, so I’m going to use that.
Find the etcd and etcd-events pods and add them to this script. Change "clustername" to the name of your cluster, then run it. Confirm both member lists include two members (in my case etcd-a and etcd-b).
echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list
[/code]
(NOTE: the cluster will break at this point due to the missing second cluster member)
Wait for the master to show as initialised. Find the instance id of the master and put it into this script. Change the AWSSWITCHES to match any switches you need to provide to the awscli. For me, I specify my profile and region
The script will run and output the status of the instance until it shows “ok”
[code lang=shell]
AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master2instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
sleep 5s
aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
[/code]
ssh into the new master (or via bastion if needed)
edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest
Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing
Under ETCD_INITIAL_CLUSTER remove the third master definition
Run this a few times until you get a docker error saying you need more than one container name
There are two volumes mounted under /mnt/master-vol-xxxxxxxx, one contains /var/etcd/data-events/member/ and one contains /var/etcd/data/member/ but it varies because of the id.
echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list
Wait for the master to show as initialised. Find the instance id of the master and put it into this script. Change the AWSSWITCHES to match any switches you need to provide to the awscli. For me, I specify my profile and region
The script will run and output the status of the instance until it shows “ok”
[code lang=shell]
AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master3instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
sleep 5s
aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-id=$INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
[/code]
ssh into the new master (or via bastion if needed)
edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest
Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing
We DON'T need to remove the third master definition this time, since this is the third master.
Run this a few times until you get a docker error saying you need more than one container name
There are two volumes mounted under /mnt/master-vol-xxxxxxxx, one contains /var/etcd/data-events/member/ and one contains /var/etcd/data/member/ but it varies because of the id.
Kubernetes is an awesome piece of kit, you can set applications to run within the cluster, make it visible to only apps within the cluster and/or expose it to applications outside of the cluster.
As part of my tinkering, I wanted to setup a Docker Registry to store my own images without having to make them public via docker hub. Doing this proved a bit more complicated than expected since by default, it requires SSL which requires a certificate to be purchased and installed.
Enter Let's Encrypt, which allows you to get SSL certificates for free, and by using their API you can set them to renew regularly. Kubernetes has the kube-lego project, which handles this integration. So here I'll go through enabling SSL for an application (in this case a docker registry, but it can be anything).
First, let's ignore the lego project and set up the application so that it is accessible normally. As mentioned above, this is the docker registry.
I'm tying the registry storage to a PV claim, though you can modify this to use S3 instead, etc.
Once you've applied this, verify your config is correct by ensuring you have an external endpoint for the service (use kubectl describe service registry | grep "LoadBalancer Ingress"). On AWS this will be an ELB; on other clouds you might get an IP. If you get an ELB, CNAME a friendly name to it. If you get an IP, create an A record for it. I'm going to use registry.blenderfox.com for this test.
Verify by doing this. Bear in mind it can take a while before DNS records update, so be patient.
host ${SERVICE_DNS}
So if I had set the service to be registry.blenderfox.com, I would do
host registry.blenderfox.com
If done correctly, this should resolve to the ELB then resolve to the ELB IP addresses.
Next, try to tag a docker image of the format registry-host:port/imagename, so, for example, registry.blenderfox.com:9000/my-image.
Next try to push it.
docker push registry.blenderfox.com:9000/my-image
It will fail because it can’t talk over https
docker push registry.blenderfox.com:9000/my-image
The push refers to repository [registry.blenderfox.com:9000/my-image]
Get https://registry.blenderfox.com:9000/v2/: http: server gave HTTP response to HTTPS client
So let’s now fix that.
Now let’s start setting up kube-lego
Checkout the code
git clone git@github.com:jetstack/kube-lego.git
cd into the relevant folder
cd kube-lego/examples/nginx
Open up nginx/configmap.yaml and change the body-size: “64m” line to a bigger value. This is the maximum size you can upload through nginx. You’ll see why this is an important change later.
Now, look for the external endpoint for the nginx service
kubectl describe service nginx -n nginx-ingress | grep "LoadBalancer Ingress"
Look for the value next to LoadBalancer Ingress. On AWS, this will be the ELB address.
CNAME your domain for your service (e.g. registry.blenderfox.com in this example) to that ELB. If you’re not on AWS, this may be an IP, in which case, just create an A record instead.
Open up lego/configmap.yaml and change the email address in there to be the one you want to use to request the certs.
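The remaining piece is an Ingress for the registry that asks kube-lego for a certificate. A sketch of what that looks like (the Service name registry, port 9000 and the secret name are assumptions to match this example):
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: registry
  annotations:
    kubernetes.io/ingress.class: "nginx"
    # Tells kube-lego to request and renew a certificate for the hosts below
    kubernetes.io/tls-acme: "true"
spec:
  tls:
  - hosts:
    - registry.blenderfox.com
    secretName: registry-tls
  rules:
  - host: registry.blenderfox.com
    http:
      paths:
      - path: /
        backend:
          serviceName: registry
          servicePort: 9000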
docker tag registry.blenderfox.com:9000/my-image registry.blenderfox.com/my-image
docker push registry.blenderfox.com/my-image
Note we are not using a port this time as there is now support for SSL.
BOOM! Success.
The tls section indicates the host to request the cert on, and the backend section indicates which backend to pass the request onto. The body-size config is at the nginx level so if you don’t change it, you can only upload a maximum of 64m even if the backend service (docker registry in this case) can support it. I have it set here at “1g” so I can upload 1gb (some docker images can be pretty large)