Binding GCP Accounts to GKE Service Accounts with Terraform

19 March 2021 #

Kubernetes uses Service Accounts to control who can access what within the cluster, but once a request leaves the cluster, it will use a default account. Normally this is the default Google Compute Engine account in GKE, and this has extremely high level access and could result in a lot of damage if your cluster is compromised.

In this article, I will be setting up a GKE cluster using a minimal access service account and enabling Workflow Identity.

(This post is now also available on Medium)

Workflow Identity will enable you to bind a Kubernetes service account to a service account in GCP. You can then control GCP permissions of that account from within GCP -- no RBAC/ABAC messing about needed (although you will still need to mess with RBAC/ABAC if you want to restrict that service account within Kubernetes, but that's a separate article.)

What you will need for this tutorial:

A Google account
A Google Cloud account
Terraform on your local machine
kubectl on your local machine (can be installed as part of the Google Cloud SDK)
Google Cloud SDK on your local machine
A Google Cloud project setup
A service account with "Owner" permissions in your GCP project (the default compute engine account will normally work)
A credentials json file from that account -- this can be generated using
- gcloud iam service-accounts keys create credentials.json --iam-account={iam-account-email}

We will start by setting up our Terraform provider

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

We define three variables here that we can reuse later -- the project, region and zone. These variables you can adjust to match your own setup.

The provider block (provider "google" {..}) references those variables and also refers to the credentials.json file that will be used to create the resources in your account.

Next we create the service account that we will bind to the cluster. This service account should contain minimal permissions as it will be the default account used by requests leaving the cluster. Only give it what is essential. You will notice I do not bind it to any roles.

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

Now let's define our cluster and node pool. This block can vary wildly on your circumstances, but I'll use a Kubernetes 1.16 single-zone cluster, with a e2-medium node size and have autoscaling enabled

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min-master-version as that gets changed
      # after deployment to minimum precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but
  # we want to only use separately managed node pools. So we
  # create the smallest possible default node pool and
  # immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

Let's go through a few things on the above block:

variable "cluster_version" {
  default = "1.16"
}

Defines a variable we will use to describe the version of Kubernetes we want on the master and worker nodes.

resource "google_container_cluster" "cluster" {
  ...
  min_master_version = var.cluster_version
  ...
  lifecycle {
    ignore_changes = [
      min_master_version,
    ]
  }
  ...
}

The ignore_changes block here tells terraform not to pay attention to changes in the min_master_version field. This is because even though we declare we wanted 1.16 as the version, GKE will put a Kubernetes variant of 1.16 onto the cluster. For example, the cluster might be created with version 1.16.9-gke.999 -- which is different to what Terraform expects, so if you were to run Terraform again, it would attempt to change the cluster version from 1.16.9-gke.999 to 1.16, cycling through the nodes again.

Next block to discuss:

resource "google_container_cluster" "cluster" {
  ...
  remove_default_node_pool = true
  initial_node_count       = 1
  ...
}

A GKE cluster must be created with a node pool. However it is easier to manage node pool separately, so this block tells Terraform to delete the default node pool when the cluster is created.

Final part of this block:

resource "google_container_cluster" "cluster" {
  ...
  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

This enables Workload Identity and the namespace must be of the format {project}.svc.id.goog

Now let's move onto the Node Pool definition:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and 
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

Let's go over a couple of blocks again:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
 ...
}

This sets up autoscaling with a starting node count of 1 and max node count of 5. Unlike with EKS, you don't need deploy the autoscaler into the cluster. Enabling this will natively allow Kubernetes to scale nodes up or down. The downside is you don't see as many messages compared to the deployed version, so it's sometimes harder to debug why a pod isn't triggering a scaleup.

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
  ...
}

Here we define the node config, we've got this set as a pool of pre-emptible nodes, of type e2-medium. We tie the nodes to the service account defined earlier and give it only the cloud-platform scope.

The metadata block is needed as if you don't specify it, the value disable-legacy-endpoints = "true" is assumed to be applied, and will cause the node pool to be respun each time you run terraform, as it thinks it need to apply the updated config to the pool.

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }
}

Similar to the version field on the master node, we tell Terraform to ignore some fields if they have changed.

version we ignore for the same reason as on the master node -- the version deployed will be slightly different to the one we declared.
initial_node_count we ignore because if the node pool has scaled up, not ignoring this will cause terraform to attempt to scale the nodes back down to the initial_node_count value, causing pods to be sent into Pending
node_count we ignore for pretty much the same reason -- it will likely never be the initial value on a production system due to scale up.

With the basic skeleton setup, we can run Terraform to setup the stack. Yes we haven't actually bound anything to serviceaccounts, but that will come later.

Let's Terraform the infrastructure:

terraform init
terraform plan -out tfplan
terraform apply tfplan

Creation of the cluster can take between 5-15 minutes

Next, we need to get credentials and link into the cluster

gcloud beta container clusters get-credentials tutorial --zone {cluster-zone} --project {project}

gcloud beta container clusters get-credentials tutorial --region {cluster-region} --project {project}

You should get some output like this:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for tutorial.

Now you should be able to run kubectl get pods --all-namespaces to see what's in your cluster (should be nothing other than the default system pods)

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                                             READY   STATUS    RESTARTS   AGE
kube-system   event-exporter-gke-666b7ffbf7-lw79x                              2/2     Running   0          13m
kube-system   fluentd-gke-scaler-54796dcbf7-6xnsg                              1/1     Running   0          13m
kube-system   fluentd-gke-skmsq                                                2/2     Running   0          4m23s
kube-system   gke-metadata-server-fsxj6                                        1/1     Running   0          9m29s
kube-system   gke-metrics-agent-pfdbp                                          1/1     Running   0          9m29s
kube-system   kube-dns-66d6b7c877-wk2nt                                        4/4     Running   0          13m
kube-system   kube-dns-autoscaler-645f7d66cf-spz4c                             1/1     Running   0          13m
kube-system   kube-proxy-gke-tutorial-tutorial-cluster-node-po-b531f1ee-8kpj   1/1     Running   0          9m29s
kube-system   l7-default-backend-678889f899-q6gsl                              1/1     Running   0          13m
kube-system   metrics-server-v0.3.6-64655c969-2lz6v                            2/2     Running   3          13m
kube-system   netd-7xttc                                                       1/1     Running   0          9m29s
kube-system   prometheus-to-sd-w9cwr                                           1/1     Running   0          9m29s
kube-system   stackdriver-metadata-agent-cluster-level-566c4b7cf9-7wmhr        2/2     Running   0          4m23s

Now let's do our first test. We'll use gsutil to run a list of GS buckets on our project.

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

This will run a docker image with gsutil in it and then remove the container when the command finishes.

The output should be something like this:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Caller does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-5nzgt -c test -i -t' command when the pod is running
deployment.apps "test" deleted

As you can see, we get a 403. The default service account doesn't have permissions to access Google Storage.

Now let's setup the service account we will use for binding:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

Again, let's go through the blocks:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

This block defines the service account in GCP that will be binding to.

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

This block assigns the Storage Admin role to the service account we just created -- essentially it is putting the service account in the Storage Admin group. Think of it more like adding the account to a group rather than assigning a permission or role to the account.

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

This block adds the service account as a Workload Identity User. You'll notice that the member field is a bit confusing. The ${var.project}.svc.id.goog bit indicates that it is a Workflow Identity namespace and the bit in [...] is the name of the Kubernetes service account we want to allow to be bound to this. This membership and an annotation on the service account (described below) will allow the service account in Kubernetes to essentially impersonate the service account in GCP and you will see this in the example.

With the service account setup in Terraform, let's run the Terraform apply steps again

terraform plan -out tfplan
terraform apply tfplan

Assuming it didn't error, we now have one half of the binding -- the GCP service account. We now need to create the service account inside Kubernetes.

You'll recall that we had a piece of data in the [...]: workload-identity-test/workload-identity-user this is our service account that we need to create. Below is the yaml for creating the namespace and the service account. Save this into the file workload-identity-user.yaml:

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: null
  name: workload-identity-test
spec: {}
status: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com
  name: workload-identity-user
  namespace: workload-identity-test

The important thing to note is the annotation on the service account:

  annotations:
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com

The annotation references the service account created by the Terraform block:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

So the Kubernetes service account references the GCP service account and the GCP service references the Kubernetes service account.

Important Note: If you do not do the double referencing -- for example, if you forget to include the annotation on the service account or forget to put the referenced Kubernetes service account in the Workload Identity member block, then GKE will use the default service account specified on the node.

Now it's time to put it to the test. If everything is setup correct, run the previous test again:

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

You should still get the a 403 but with a different error message.

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Primary: /namespaces/{project}.svc.id.goog with additional claims does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-8ltvc -c test -i -t' command when the pod is running
deployment.apps "test" deleted

Let's now create the service accounts. This file should have been created by the earlier step:

$ kubectl apply -f workload-identity-test.yaml
namespace/workload-identity-test created
serviceaccount/workload-identity-user created

So now let's run the test again but this time, we specify the service account and also the namespace as a service account is tied to the namespace it resides in — in this case, the namespace of our service account is workload-identity-test

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

The output will show the buckets you have:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
gs://backups/
gs://snapshots/
Session ended, resume using 'kubectl attach test-66754998f-sp79b -c test -i -t' command when the pod is running
deployment.apps "test" deleted

NOTE: If you're running a later version of Kubernetes or kubectl, you may get the following error:

Flag --serviceaccount has been deprecated, has no effect and will be removed in 1.24.

In that case, you need to instead use the --overrides switch:

kubectl run -it --rm -n workload-identity-test test --overrides='{ "apiVersion": "v1", "spec": { "serviceAccount": "workload-identity-test" } }' --image gcr.io/cloud-builders/gsutil ls

Let's now change the permissions on the GCP service account to prove it's the one being used change this block:

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

And change the active role like so:

resource "google_project_iam_member" "storage-role" {
  # role = "roles/storage.admin"        ## <-- comment this out
  role   = "roles/storage.objectAdmin"  ## <-- uncomment this
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

Run the terraform actions again:

terraform plan -out tfplan
terraform apply tfplan

Allow a few minutes for the change to propagate then run the test again:

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

(See earlier if you get an error regarding the serviceaccount switch)

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 workload-identity-tutorial@{project}.iam.gserviceaccount.com does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-66754998f-k5dm5 -c test -i -t' command when the pod is running
deployment.apps "test" deleted

And there you have it, the service account in the cluster: workload-identity-test/workload-identity-user is bound to the service account workload-identity-tutorial@{project}.iam.gserviceaccount.com on GCP, carrying the permissions it also has.

If the service account on Kubernetes is compromised in some way, you just need to revoke the permissions on the GCP service account and the Kubernetes service account no longer has any permissions to do anything in GCP.

For simplicity, here's the Terraform used for this tutorial. Replace what you need -- you can move things around and separate into other Terraform files if you wish -- I kept it in one file for simplicity.

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min-master-version as that gets changed
      # after deployment to minimum precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope
    # and permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }

  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin" 
  # role   = "roles/storage.objectAdmin" 
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

CKAD

2 November 2020 #

It's taken me nearly a year, but I finally figured out one of the questions that stumped me in my CKAD (writeup: https://blenderfox.com/2019/12/01/ckad-writeup/)

In the exam, the question was to terminate a cronjob if it lasts longer than 17 seconds. There’s a startup deadline but not a duration deadline. It could be implemented within the command of the application itself, or by specifying to replace any previous running version of the jobs.

Well, I finally had that situation recently at work and wanted to terminate a cronjob if it was active more than 5 minutes, since the job shouldn't take that long. Finally found out that the answer was not in the CronJob documentation, but in the Job documentation.

CronJobs spawn a Job resource, and within the specification, you can include spec.activeDeadlineSeconds. This will terminate the job pod at that time and will consider the job as failed.

The "Snowball" Effect In Kubernetes

7 August 2020 #

So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write up here, before I forget everything and maybe allow the Kubernetes devs to read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783)

We noticed some weirdness occurring on our cluster when Jobs and CronJobs started behaving strangely.

Jobs were spawning but seemed to not spawn any pods to go with it, even over an hour later, they were sitting there without a pod to go with it.

Investigating other jobs, I found a crazy large number of pods in one of our namespaces, over 900 to be exact. These pods were all completed pods from a CronJob.

The CronJob was scheduled to run every minute, and the definition of the CronJob had valid values for the history -- sensible values for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit were set. And even if they weren't, the defaults would (or should) be used.

So why did we have over 900 cron pods, and why weren't they being cleaned up upon completion?

Just in case the number of pods were causing problems, I cleared out the completed pods:

kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)

But even after that, new jobs weren't spawning pods. And in fact, more CronJob pods were appearing in this namespace. So I disabled the CronJob

kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'

But that also didn't help, pods were still being generated. Which is weird -- why is a CronJob still spawning pods even when it's suspended?

So then I remembered that CronJobs actually generate Job objects. So I checked the Job objects and found over 3000 Job objects. Okay, something is seriously wrong here, there shouldn't be 3000 Job objects for something that only runs once a minute.

So I went and deleted all the CronJob related Job objects:

kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)

This reduced the pods down, but did not help us determine why the Job objects were not spawning pods.

I decided to get Google onto the case and raised a support ticket.

Their first investigation brought up something interesting. They sent me this snippet from the Master logs (redacted)

2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted

2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node

Spot the problem?

The time between "Job is created" and "Pod is created" around 80 minutes in the first case, and 12 minutes in the second one. That's right, it took 80 minutes for the Pod to be spawned.

And this is where it dawned on me about what was possibly going on.

The CronJob spawned a Job object. It tried to spawn a pod, and that took a significant amount of time, far more than the 1 minute between runs
The next cycle, the CronJob looks to see if it has a running pod due to the .spec.concurrencyPolicy value.
The CronJob does not find a running pod so generates another Job object, which also gets stuck waiting for pod generation
And so on, and so on.

Each time, a new Job gets added, gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace which also gets stuck...

Eventually, the pod will generate but by then there's now a backlog of Jobs, meaning even if I suspended the CronJob, it won't have any effect until the Jobs in the backlog are cleared or deleted (I had deleted them).

Google investigated further, and found the culprit:

Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We were testing up9 and this was using a webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.

So where does this leave us? Well, a few thoughts:

A misbehaving/misconfigured webhook can cause a Snowball effect in the cluster causing multiple runs of a single CronJob without cleanup -- successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.

This could break systems where the CronJob is supposed to be run mutually exclusively, since the delay in pod generation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set as Forbid.

If someone managed (whether intentionally or maliciously) to install a webhook that causes this pod spawning delay, and then adds a CronJob that runs once a minute -- and then maliciously crafts the job to never finish, this snowball effect will cause the cluster to run out of resource and/or scale up nodes forever or until it hits the max allowed by your configuration.

CKAD Exam Passed

2 December 2019 #

Whilst I was concerned about my scoring, I still passed. It scored 72%, with a pass mark of 66%

CKA Exam Passed

1 December 2019 #

5 questions I could not answer, and one I could, but arguably that question was ambiguous

Fix a broken cluster -- kubelet was started but couldn't connect to itself.
Add node to cluster. Nodes do not have kubeadm installed.
Static pod. Couldn't find where the path was to put the manifests for the yaml.

4 and 5 I can't remember the questions but will update if I remember

Ambiguous Question:

Create a pod with a persistent volume, that isn't persistent, and doesn't tell you how big to make the PV. I used emptyDir, but that's not really a PV (didn't create a PV or a PVC)

CKAD Writeup

1 December 2019 #

So I did the CKAD exam and it was one of the latest exams I've done, starting at 22:45 and finishing at 00:45. The CKAD exam is 2 hours versus the CKA's 3 hours

And I went into the exam feeling relatively confident. But, damn, the 2 hours goes by really quickly.

Had several questions I wasn't able to complete or only partially complete.

Liveness and Readiness Probes

This question wanted a pod to be restarted if an endpoint returns 500. Simple enough, but there was a catch, if another endpoint returns 500, then the application is starting, and so disregard the check.

I used similar by implementing this check as a curl command in a real life scenario (I should write a blog entry on that some time).

So in the exam, I did both the liveness and readiness checks to chain two curl commands together, if the first endpoint (/starting) in this case, returned 200, then it would do the next endpoint (/healthz) and return a fail if that gave a 500.

Buuuuut, the image didn't have curl installed so the probes failed. I could use the hack I've used in my image and install curl as part of the check, but time constraints wouldn't let me.

Persistent Volumes

Similar to the CKA question, there was a quirkily worded question here which wanted me to add a file to a node, create a pod that used hostPath and reserve a 1Gi PV. The documentation does not provide an example of that, just a pod with a hostPath as an internal volume: https://kubernetes.io/docs/concepts/storage/volumes/#hostpath

Network Policies

A technology I haven't used in Kubernetes yet. They gave several policies, one that allowed "app:proxy" and one that allowed "app:db" and wanted ius to edit a pod to only be allowed to talk to only those.

We were not allowed to modify the policies. I can't remember whether we were allowed to create new policies for this question

But both those policies use the app label. And the pod can't have the same label with two values (I did try)

Though thinking about it now, and after a few checks, the NetworkPolicy object describes how to restrict traffic to the pods in question -- so those selectors may be related to the pods the policy is restricting. I think I should have looked inside the policies more carefully to see what it was saying on the ingress rule and see if it was saying something like "app:frontend", and then making sure the pod was labelled accordingly.

"Ambassador" Sidecar Pattern

A big chunk of the exam time was taken up by the sidecar questions -- far more time than I would have liked, to be honest.

They had a question on adaptor, using fluentd, which was fine, I got that to work, but also had another where I had to use HAProxy to proxy requests do a different port (ambassador pattern). A useful use case, but I ran out of time to finish it. I wanted to come back and revisit it if I had time, but didn't.

CronJobs

Terminate a cronjob if it lasts longer than 17 seconds. There's a startup deadline but not a duration deadline. It could be implemented within the command of the application itself, or by specifying to replace any previous running version of the jobs.

Thoughts

I don't think I passed this, having so many issues is probably going to take me into the 60s mark.

CKA Exam Passed

29 November 2019 #

I've totally forgotten to write this up, but I successfully passed my CKA exam on the third attempt with a 78% score, scraping a pass.

I'll write up details of some of the questions I couldn't answer so I can come back and look them up later.

LPIC-1 Expiry and Google+

8 February 2019 #

Well, it was due to happen eventually, but I got an email saying my LPIC-1 certification is going to expire in 9 months, and I never got to finish LPIC-2.

Well, maybe I’ll redo it after I got my Kubernetes certifications

Finally while writing this post, I notice that Wordpress is now removing Google+ support because Google are shutting it down. A pity really, since I did like Google+ and while it didn’t take off, a lot of the features were in G+ because general use, like Hangouts.

General Updates

19 November 2018 #

So I haven’t been posting here much recently so here are some updates.

Been slowing trying to get back into running, have been slacking off WAAAAY too much lately. Tried using Aaptiv (@aaptiv) which is a training fitness app that has trainers talking you through the stuff, there are a few problems with it.

When you use a stretch/strength training routine or yoga routine, you're reliant on them telling you what to do, there's no video guide to show you the correct form, and that's bad. Other apps like FitBit Coach has videos where you can copy the coach to make sure you have the right form.
On Treadmill/Running routines, they talk in mph, but treadmills here in the UK go in km/h, which requires conversion (1.0 mph = 1.6 kph)

On a separate note, I have bought another attempt at the CKA exam, but this time bought the bundle with the Kubernetes Fundamentals Training from Linux Foundation. Let’s see how different that is to Linux Academy’s training….

CKA Exam: Strike #2

27 July 2018 #

I took my CKA exam for the second time – and failed again. This time. however got much closer to the pass mark than my first time.

Things I think I fluffed on:

Cluster DNS

pods, services and how they can show up using nslookup. I got caught up in trying to figure out why my DNS wasn’t working, and I think it’s because I was trying to nslookup from outside the cluster, which obviously would not resolve the “.cluster.local” domain correctly. I forgot that you can do an interactive, in-cluster shell using

[code lang=text] kubectl run -i –tty busybox –image=busybox – sh [/code]

Not to mention that doing nslookup {service}.svc.cluster.local won’t work, and you have to use -type=a to nslookup to get the ip address of the service to confirm it is resolving

etcd Snapshots

This got me both times. The first time I had no idea why doing a snapshot command was failing. The second time I figured out how to do the backup and how to invoke it from the pod, but still got it wrong. Now I figured out (and it was right in front of my face):

[code lang=text] <br />WARNING: Environment variable ETCDCTL_API is not set; defaults to etcdctl v2. Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.

USAGE: etcdctl [global options] command [command options] [arguments…]

VERSION: 3.2.18

[/code]

I wasn’t using the ETCDCTL_API variable beforehand so it was falling back to V2 api, which doesn’t have the snapshot command:

[code lang=text] <br /># etcdctl NAME: etcdctl - A simple command line client for etcd.

WARNING: Environment variable ETCDCTL_API is not set; defaults to etcdctl v2. Set environment variable ETCDCTL_API=3 to use v3 API or ETCDCTL_API=2 to use v2 API.

USAGE: etcdctl [global options] command [command options] [arguments…]

VERSION: 3.2.18

COMMANDS: backup backup an etcd directory cluster-health check the health of the etcd cluster mk make a new key with a given value mkdir make a new directory rm remove a key or a directory rmdir removes the key if it is an empty directory or a key-value pair get retrieve the value of a key ls retrieve a directory set set the value of a key setdir create a new directory or update an existing directory TTL update update an existing key with a given value updatedir update an existing directory watch watch a key for changes exec-watch watch a key for changes and exec an executable member member add, remove and list subcommands user user add, grant and revoke subcommands role role add, grant and revoke subcommands auth overall auth controls help, h Shows a list of commands or help for one command

GLOBAL OPTIONS: –debug output cURL commands which can be used to reproduce the request –no-sync don’t synchronize cluster information before sending request –output simple, -o simple output response in the given format (simple, extended or json) (default: “simple”) –discovery-srv value, -D value domain name to query for SRV records describing cluster endpoints –insecure-discovery accept insecure SRV records describing cluster endpoints –peers value, -C value DEPRECATED - “–endpoints” should be used instead –endpoint value DEPRECATED - “–endpoints” should be used instead –endpoints value a comma-delimited list of machine addresses in the cluster (default: “http://127.0.0.1:2379,http://127.0.0.1:4001”) –cert-file value identify HTTPS client using this SSL certificate file –key-file value identify HTTPS client using this SSL key file –ca-file value verify certificates of HTTPS-enabled servers using this CA bundle –username value, -u value provide username[:password] and prompt if password is not supplied. –timeout value connection timeout per request (default: 2s) –total-timeout value timeout for the command execution (except watch) (default: 5s) –help, -h show help –version, -v print the version

ETCDCTL_API=3 etcdctl

NAME: etcdctl - A simple command line client for etcd3.

USAGE: etcdctl

VERSION: 3.2.18

API VERSION: 3.2

COMMANDS: get Gets the key or a range of keys put Puts the given key into the store del Removes the specified key or range of keys [key, range_end) txn Txn processes all the requests in one transaction compaction Compacts the event history in etcd alarm disarm Disarms all alarms alarm list Lists all alarms defrag Defragments the storage of the etcd members with given endpoints endpoint health Checks the healthiness of endpoints specified in --endpoints flag endpoint status Prints out the status of endpoints specified in --endpoints flag watch Watches events stream on keys or prefixes version Prints the version of etcdctl lease grant Creates leases lease revoke Revokes leases lease timetolive Get lease information lease keep-alive Keeps leases alive (renew) member add Adds a member into the cluster member remove Removes a member from the cluster member update Updates a member in the cluster member list Lists all members in the cluster snapshot save Stores an etcd node backend snapshot to a given file snapshot restore Restores an etcd member snapshot to an etcd directory snapshot status Gets backend snapshot status of a given file make-mirror Makes a mirror at the destination etcd cluster migrate Migrates keys in a v2 store to a mvcc store lock Acquires a named lock elect Observes and participates in leader election auth enable Enables authentication auth disable Disables authentication user add Adds a new user user delete Deletes a user user get Gets detailed information of a user user list Lists all users user passwd Changes password of user user grant-role Grants a role to a user user revoke-role Revokes a role from a user role add Adds a new role role delete Deletes a role role get Gets detailed information of a role role list Lists all roles role grant-permission Grants a key to a role role revoke-permission Revokes a key from a role check perf Check the performance of the etcd cluster help Help about any command

OPTIONS: –cacert="" verify certificates of TLS-enabled secure servers using this CA bundle –cert="" identify secure client using this TLS certificate file –command-timeout=5s timeout for short running command (excluding dial timeout) –debug[=false] enable client-side debug logging –dial-timeout=2s dial timeout for client connections –endpoints=[127.0.0.1:2379] gRPC endpoints -h, –help[=false] help for etcdctl –hex[=false] print byte strings as hex encoded strings –insecure-skip-tls-verify[=false] skip server certificate verification –insecure-transport[=true] disable transport security for client connections –key="" identify secure client using this TLS key file –user="" username[:password] for authentication (prompt if password is not supplied) -w, –write-out=“simple” set the output format (fields, json, protobuf, simple, table)

[/code]

And then I can run

ETCDCTL_API=3 etcdctl snapshot save snapshot.db –cacert=/etc/kubernetes/pki/etcd/ca.crt –cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt –key=/etc/kubernetes/pki/etcd/healthcheck-client.key

To create the snapshot.

Certificate Rotation

I need to look this one up – I had no idea how to rotate the certificates

Static Pods

I’d never directly dealt with static pods before this exam, and I don’t think I had this question in my first run, so it was one I didn’t know the answer do. A bit of hunting on the k8s side led me to figure out it was a static pod question, but I couldn’t find out where the exam cluster was looking for its static pod manifests. The question told me a directory, but my yaml didn’t seem to be picked up by the kubelet.

Final note

Generally, a lot of the questions from my first exam run showed up again in this run, which let me run through over half of the exam fairly quickly. I thought I was going to do better than my first run, and I did, but not by much.

Using the "change-cause" Kubernetes annotation as a changelog

23 June 2018 #

Suppose you have an application you are deploying to your kubernetes cluster. For most purposes, running kubectl rollout history deployments/your-app will give you a very simple revision history.

[code lang=text] $ kubectl rollout history deployments/awesome-app REVISION CHANGE-CAUSE 1 <none>

[/code]

However, what if you had multiple deployments by different people. How would you know what was the reason for the deployment? Especially when you have something like this?

[code lang=text] REVISION CHANGE-CAUSE 1 <none> 2 3 4 5 … … 100 <none> 101 <none> 102 <none> [/code]

It is possible to set a value into the change-cause field via an annotation, but that field is quite volatile, it is also filled/replaced if someone uses the –record flag when doing an apply. However, it can be utilised to make it much more useful:

[code lang=text] REVISION CHANGE-CAUSE 11 Deploy new version of awesome-app to test environment 12 Deploy new version of awesome-app to staging environment 13 Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018 14 Deploy new version of awesome-app with integration to gitlab v0.0.0 [test] [/code]

How is this done? Pretty simply, actually. here’s a snippet from the deploy script I use.

[code lang=text] echo Deploy message? read MESSAGE if [ -z “$MESSAGE” ]; then MESSAGE=“Deploy new version of awesome-app, $(date)” echo Blank message detected, defaulting to "$MESSAGE" fi echo Deploy updates… cat deploy.yaml | sed s/‘SUB_TIMESTAMP’/"$(date)"/g | kubectl replace -f - kubectl annotate deployment awesome-app kubernetes.io/change-cause="$MESSAGE" –record=false –overwrite=true kubectl rollout status deployments/awesome-app kubectl rollout history deployment awesome-app [/code]

For lines 1 to 6, I read in a message from the terminal to populate the annotation, and if nothing is provided, a default is used. On line 8, I replace the timestamp to trigger a change to the deployment (this can be anything, for example, changing the version tag of your docker image from awesome-app:release-1.0 to awesome-app:release-1.1)

Note that I used replace and not apply – replace will reset the deployment declaration, and since my deploy yaml does NOT contain a change-cause annotation, replace will remove the annotation.

On line 9, I annotate the deployment, making sure I don’t record it and overwrite the annotation in the event it’s there already (though those two switches might be redundant)

On line 10 I check the status of the rollout – this blocks until it is complete

On line 11, I then dump the deployment history.

This is an example of a script run:

[code lang=text] $ ./deploy.sh Deploy message? [typed] Deploy new version of awesome-app with gitlab integration v0.0.0 [test] Deploy updates… deployment “awesome-app” replaced deployment “awesome-app” annotated Waiting for rollout to finish: 1 old replicas are pending termination… deployment “awesome-app” successfully rolled out deployments “awesome-app” REVISION CHANGE-CAUSE 11 Deploy new version of awesome-app, Thu 21 Jun 07:00:19 BST 2018 12 Deploy new version of awesome-app, Thu 21 Jun 07:00:52 BST 2018 13 Deploy new version of awesome-app, Thu 21 Jun 07:01:03 BST 2018 14 Deploy new version of awesome-app with integration to gitlab v0.0.0 [test] [/code]

kubectl Displaying Taints

2 June 2018 #

One of the questions in my CKA exam was how to display taints with kubectl. While you can use kubectl describe, it creates a lot of other information too.

Then I found out about jsonpath and it’s similarity to jq

You can display the taints with something like

[code lang=text] for a in $(kubectl get nodes –no-headers | awk ‘{print $1}') do echo $a – $(kubectl get nodes/$a -o jsonpath='{.spec.taints[].key}{":"}{.spec.taints[].effect}') done

[/code]

Sample output

[code lang=text] ip-10-10-10-147.eu-west-2.compute.internal – node-role.kubernetes.io/master:NoSchedule ip-10-10-10-159.eu-west-2.compute.internal – :

[/code]

So the first one has a taint (it’s the master node) and the second one doesn’t. (maybe I need to hack this a bit more when I have multiple taints but I’ll do that when I have some multi-tainted nodes to play with)

EDIT: Another way as provided by tdodds81

[code lang=text] $ kubectl get nodes -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints NAME TAINTS ip-10-10-10-148.eu-west-2.compute.internal [map[effect:NoSchedule key:node-role.kubernetes.io/master]] ip-10-10-10-218.eu-west-2.compute.internal ip-10-10-10-239.eu-west-2.compute.internal ip-10-10-10-249.eu-west-2.compute.internal ip-10-10-10-51.eu-west-2.compute.internal [/code]

And with multi-taints, it looks like this (on a GKE cluster)

[code lang=text] $ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints NAME TAINTS gke-test-cluster-k8s-n1-highmem-2-nod-589548dd-3z1v [map[effect:NoSchedule key:key1 timeAdded:<nil> value:value1] map[value:value2 effect:NoSchedule key:key2 timeAdded:<nil>]] gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-9lwk <none> gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-lnvl <none> gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-nt49 <none> gke-test-cluster-k8s-n1-highmem-2-nod-7a33f13b-xgbd <none> [/code]

CKA Exam

29 May 2018 #

My results are in… and I failed :(

Still, I have a few ideas where I fell….

CKA Exam

27 May 2018 #

Well, exam is done – for the most part it went okay. A few questions were a bit ambiguous and there were several regarding etcd and low-level tinkering with the kubelet – which I hadn’t had too much experience with unfortunately. I’m hoping I did OK, though…

Exam

27 May 2018 #

Not long left till my exam so spending the last few hours just going through some review questions from Linux Academy

CKAD

26 May 2018 #

While booking my CKA exam earlier, I spotted that there was a second Kubernetes certification – Certified Kubernetes Application Developer.

While the CKA is for the admin of the cluster and tests the knowledge of the infrastructure, CKAD is more for the developers.

Looks like that’ll be my next certification…

CKA

26 May 2018 #

With less than 24 hours to go before my exam, I’m going to spend those last hours going through the review questions and see if I can still remember the content.

CKA Exam

26 May 2018 #

Right, my exam is booked. For tomorrow. :)

CKA

26 May 2018 #

I’ve decided to take the plunge and for the Certified Kubernetes Administrator certification. Wish me luck! :)

Introduction to Kubernetes on Rancher 2.0

23 April 2018 #

The Rancher guys have put out an intro training video of Kubernetes on Rancher 2.0 – give it a check if you have time. :)

youtu.be/vfV6CehbN…

Tunnelling to Kubernetes Nodes & Pods via a Bastion

4 April 2018 #

A quick note to remind myself (and other people) how to tunnel to a node (or pod) in Kubernetes via the bastion server

[code lang=text] rm ~/.ssh/known_hosts #Needed if you keep scaling the bastion up/down

BASTION=bastion.{cluster-domain} DEST=$1

ssh -o StrictHostKeyChecking=no -o ProxyCommand=‘ssh -o StrictHostKeyChecking=no -W %h:%p admin@bastion.{cluster-domain}’ admin@$DEST [/code]

Run like this:

[code lang=text] bash ./tunnelK8s.sh NODE_IP [/code]

Example:

[code lang=text] bash ./tunnelK8s.sh 10.10.10.100 #Assuming 10.10.10.100 is the node you want to connect to. [/code]

You can extend this by using this to ssh into a pod, assuming the pod has an SSH server on it.

[code lang=text] BASTION=bastion.${cluster domain name} NODE=$1 NODEPORT=$2 PODUSER=$3

ssh -o ProxyCommand=“ssh -W %h:%p admin@$BASTION” admin@$NODE ssh -tto StrictHostKeyChecking=no $PODUSER@localhost -p $NODEPORT [/code]

So if you have service listening on port 32000 on node 10.10.10.100 that expects a login user of “poduser”, you would do this:

[code lang=text] bash ./tunnelPod.sh 10.10.10.100 32000 poduser [/code]

If you have to pass a password you can install sshpass on the node, then use that (be aware of security risk though - this is not an ideal solution)

[code lang=text] ssh -o ProxyCommand=“ssh -W %h:%p admin@$BASTION” admin@$NODE sshpass -p ${password} ssh -tto StrictHostKeyChecking=no $PODUSER@localhost -p $NODEPORT [/code]

Caveat though – you will have to make sure that your node security group allows your bastion security group to talk to the nodes on the additional ports. By default, the only port that the bastions are able to talk to the node security groups on is SSH (22) only.

Cloud Native Computing Foundation Announces Kubernetes as First Graduated Project

7 March 2018 #

SONOMA, Calif., March 6, 2018 – Open Source Leadership Summit – The Cloud Native Computing Foundation® (CNCF®), which sustains and integrates open source technologies like Kubernetes® and Prometheus™, today announced that Kubernetes is the first project to graduate. To move from incubation to graduate, projects must demonstrate thriving adoption, a documented, structured governance process, and a strong commitment to community success and inclusivity.

www.cncf.io/announcem…

Great news :) shows that Kubernetes is now considered more mature than previously and it definitely shows.

How to using S3 as a RWM/NFS-like store in Kubernetes

1 March 2018 #

Let’s assume you have an application that runs happily on its own and is stateless. No problem. You deploy it onto Kubernetes and it works fine. You kill the pod and it respins, happily continuing where it left off.

Let’s add three replicas to the group. That also is fine, since its stateless.

Let’s now change that so that the application is now stateful and requires storage of where it is in between runs. So you pre-provision a disk using EBS and hook that up into the pods, and convert the deployment to a stateful set. Great, it still works fine. All three will pick up where they left off.

Now, what if we wanted to share the same state between the replicas?

For example, what if these three replicas were frontend boxes to a website? Having three different disks is a bad idea unless you can guarantee they will all have the same content. Even if you can, there’s guaranteed to be a case where one or more of the boxes will be either behind or ahead of the other boxes, and consequently have a case where one or more of the boxes will serve the wrong version of content.

There are several options for shared storage, NFS is the most logical but requires you to pre-provision a disk that will be used and also to either have an NFS server outside the cluster or create an NFS pod within the cluster. Also, you will likely over-provision your disk here (100GB when you only need 20GB for example)

Another alternative is EFS, which is Amazon’s NFS storage, where you mount an NFS and only pay for the amount of storage you use. However, even when creating a filesystem in a public subnet, you get a private IP which is useless if you are not DirectConnected into the VPC.

Another option is S3, but how do you use that short of using “s3 sync” repeatedly?

One answer is through the use of s3fs and sshfs

We use s3fs to mount the bucket into a pod (or pods), then we can use those mounts via sshfs as an NFS-like configuration.

The downside to this setup is the fact it will be slower than locally mounted disks.

So here’s the yaml for the s3fs pods (change values within {…} where applicable) – details at Docker Hub here: https://hub.docker.com/r/blenderfox/s3fs/

(and yes, I could convert the environment variables into secrets and reference those, and I might do a follow up article for that)

[code]

kind: Deployment apiVersion: extensions/v1beta1 metadata: name: s3fs namespace: default labels: k8s-app: s3fs annotations: {} spec: replicas: 1 selector: matchLabels: k8s-app: s3fs template: metadata: name: s3fs labels: k8s-app: s3fs spec: containers: - name: s3fs image: blenderfox/s3fs env: - name: S3_BUCKET value: {…} - name: S3_REGION value: {…} - name: AWSACCESSKEYID value: {…} - name: AWSSECRETACCESSKEY value: {…} - name: REMOTEKEY value: {…} - name: BUCKETUSERPASSWORD value: {…} resources: {} imagePullPolicy: Always securityContext: privileged: true restartPolicy: Always terminationGracePeriodSeconds: 30 dnsPolicy: ClusterFirst securityContext: {} schedulerName: default-scheduler strategy: type: RollingUpdate rollingUpdate: maxUnavailable: 25% maxSurge: 25% revisionHistoryLimit: 10 progressDeadlineSeconds: 600

kind: Service apiVersion: v1 metadata: name: s3-service annotations: external-dns.alpha.kubernetes.io/hostname: {hostnamehere} service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: “3600” labels: name: s3-service spec: ports:

protocol: TCP name: ssh port: 22 targetPort: 22 selector: k8s-app: s3fs type: LoadBalancer sessionAffinity: None externalTrafficPolicy: Cluster

[/code]

This will create a service and a pod

If you have external DNS enabled, the hostname will be added to Route 53.

SSH into the service and verify you can access the bucket mount

[code] ssh bucketuser@dns-name ls -l /mnt/bucket/ [/code]

(This should give you the listing of the bucket and also should have user:group set on the directory as “bucketuser”)

You should also be able to rsync into the bucket using this

[code] rsync -rvhP /source/path bucketuser@dns-name:/mnt/bucket/ [/code]

Or sshfs using a similar method

[code]

sshfs bucketuser@dns-name:/mnt/bucket/ /path/to/local/mountpoint

[/code]

Edit the connection timeout annotation if needed

Now, if you set up a pod that has three replicas and all three sshfs to the same service, you essentially have an NFS-like storage.

How to move from single master to multi-master in an AWS kops kubernetes cluster

23 January 2018 #

Having a master in a Kubernetes cluster is all very well and good, but if that master goes down the entire cluster cannot schedule new work. Pods will continue to run, but new ones cannot be scheduled and any pods that die will not get rescheduled.

Having multiple masters allows for more resiliency and can pick up when one goes down. However, as I found out, setting multi-master was quite problematic. Using the guide here only provided some help so after trashing my own and my company’s test cluster, I have expanded on the linked guide.

First add the subnet details for the new zone into your cluster definition – CIDR, subnet id, and make sure you name it something that you can remember. For simplicity, I called mine eu-west-2c. If you have a definition for utility (and you will if you use a bastion), make sure you have a utility subnet also defined for the new AZ

[code lang=shell] kops edit cluster –state s3://bucket [/code]

Now, create your master instance groups, you need an odd number to enable quorum and avoid split brain (I’m not saying prevent, and there are edge cases where this could be possible even with quorum). I’m going to add west-2b and west-2c. AWS recently introduced the third London AWS zone, so I’m going to use that.

[code lang=shell] kops create instancegroup master-eu-west-2b –subnet eu-west-2b –role Master [/code]

Make this one have a max/min of 1

[code lang=shell] kops create instancegroup master-eu-west-2c –subnet eu-west-2c –role Master [/code]

Make this one have a max/min of 0 (yes, zero) for now

Reference these in your cluster config

[code lang=text] kops edit cluster –state=s3://bucket [/code]

[code lang=text] etcdClusters:

etcdMembers:
- instanceGroup: master-eu-west-2a name: a
- instanceGroup: master-eu-west-2b name: b
- instanceGroup: master-eu-west-2c name: c name: main
etcdMembers:
- instanceGroup: master-eu-west-2a name: a
- instanceGroup: master-eu-west-2b name: b
- instanceGroup: master-eu-west-2c name: c name: events [/code]

Start the new master

[code lang=shell] kops update cluster –state s3://bucket –yes [/code]

Find the etcd and etcd-event pods and add them to this script. Change “clustername” to the name of your cluster, then run it. Confirm the member lists include both two members (in my case it would be etc-a and etc-b)

[code lang=shell] ETCPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal ETCEVENTSPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal AZ=b CLUSTER=clustername

kubectl –namespace=kube-system exec $ETCPOD – etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380

kubectl –namespace=kube-system exec $ETCEVENTSPOD – etcdctl –endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists kubectl –namespace=kube-system exec $ETCPOD – etcdctl member list

kubectl –namespace=kube-system exec $ETCEVENTSPOD – etcdctl –endpoint http://127.0.0.1:4002 member list [/code]

(NOTE: the cluster will break at this point due to the missing second cluster member)

Wait for the master to show as initialised. Find the instance id of the master and put it into this script. Change the AWSSWITCHES to match any switches you need to provide to the awscli. For me, I specify my profile and region

The script will run and output the status of the instance until it shows “ok”

[code lang=shell] AWSSWITCHES="–profile personal –region eu-west-2" INSTANCEID=master2instanceid while [ “$(aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2)” != “ok” ] do sleep 5s aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2 done aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2 [/code]

ssh into the new master (or via bastion if needed)

[code lang=shell] sudo -i systemctl stop kubelet systemctl stop protokube [/code]

edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing Under ETCD_INITIAL_CLUSTER remove the third master definition

Stop the etcd docker containers

[code lang=shell] docker stop $(docker ps | grep “etcd” | awk ‘{print $1}') [/code]

Run this a few times until you get a docker error saying you need more than one container name There are two volumes mounted under /mnt/master-vol-xxxxxxxx, one contains /var/etcd/data-events/member/ and one contains /var/etcd/data/member/ but it varies because of the id.

[code lang=shell] rm -r /mnt/var/master-vol-xxxxxx/var/etcd/data-events/member/ rm -r /mnt/var/master-vol-xxxxxx/var/etcd/data/member/ [/code]

Now start kubelet

[code lang=shell] systemctl start kubelet [/code]

Wait until the master shows on the validate list then start protokube

[code lang=shell] systemctl start protokube [/code]

Now do the same with the third master

edit the third master ig to make it min/max 1

[code lang=shell] kops edit ig master-eu-west-2c –name=clustername –state s3://bucket [/code]

Add it to the clusters (the etcd pods should still be running)

[code lang=shell] ETCPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal ETCEVENTSPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal AZ=c CLUSTER=clustername

kubectl –namespace=kube-system exec $ETCPOD – etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380 kubectl –namespace=kube-system exec $ETCEVENTSPOD – etcdctl –endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists kubectl –namespace=kube-system exec $ETCPOD – etcdctl member list kubectl –namespace=kube-system exec $ETCEVENTSPOD – etcdctl –endpoint http://127.0.0.1:4002 member list

[/code]

Start the third master

[code lang=shell] kops update cluster –name=cluster-name –state=s3://bucket [/code]

The script will run and output the status of the instance until it shows “ok”

[code lang=shell] AWSSWITCHES="–profile personal –region eu-west-2" INSTANCEID=master3instanceid while [ “$(aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2)” != “ok” ] do sleep 5s aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2 done aws $AWSSWITCHES ec2 describe-instance-status –instance-id=$INSTANCEID –output text | grep SYSTEMSTATUS | cut -f 2 [/code]

ssh into the new master (or via bastion if needed)

[code lang=shell] sudo -i systemctl stop kubelet systemctl stop protokube [/code]

edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing

We DON’T need to remove the third master defintion this time, since this is the third master

Stop the etcd docker containers

[code lang=shell] docker stop $(docker ps | grep “etcd” | awk ‘{print $1}') [/code]

[code lang=shell] rm -r /mnt/var/master-vol-xxxxxx/var/etcd/data-events/member/ rm -r /mnt/var/master-vol-xxxxxx/var/etcd/data/member/ [/code]

Now start kubelet

[code lang=shell] systemctl start kubelet [/code]

Wait until the master shows on the validate list then start protokube

[code lang=shell] systemctl start protokube [/code]

If the cluster validates, do a full respin

[code lang=shell] kops rolling-update cluster –name clustername –state s3://bucket –force –yes [/code]

Enabling and using Let's Encrypt SSL Certificates on Kubernetes

19 January 2018 #

Kubernetes is an awesome piece of kit, you can set applications to run within the cluster, make it visible to only apps within the cluster and/or expose it to applications outside of the cluster.

As part of my tinkering, I wanted to setup a Docker Registry to store my own images without having to make them public via docker hub. Doing this proved a bit more complicated than expected since by default, it requires SSL which requires a certificate to be purchased and installed.

Enter Let’s Encrypt which allows you to get SSL certificates for free; and by using their API, you can set it to regularly renew. Kubernetes has the kube-lego project which allows this regular integration. So here, I’ll go through enabling an application (in this case, it’s a docker registry, but it can be anything).

First, lets ignore the lego project, and set up the application so that it is accessible normally. As mentioned above, this is the docker registry

I’m tying the registry storage to a pv claim, though you can modify this to tie to S3, instead etc.

[code lang=text]

kind: Deployment apiVersion: extensions/v1beta1 metadata: name: registry namespace: default labels: name: registry spec: replicas: 1 selector: matchLabels: name: registry template: metadata: creationTimestamp: labels: name: registry spec: volumes: - name: registry-data persistentVolumeClaim: claimName: registry-data containers: - name: registry image: registry:2 resources: {} volumeMounts: - name: registry-data mountPath: “/var/lib/registry” terminationMessagePath: “/dev/termination-log” terminationMessagePolicy: File imagePullPolicy: Always restartPolicy: Always terminationGracePeriodSeconds: 30 dnsPolicy: ClusterFirst securityContext: {} schedulerName: default-scheduler strategy: type: Recreate

kind: Service apiVersion: v1 metadata: name: registry namespace: default labels: name: registry spec: ports:

protocol: TCP port: 9000 targetPort: 5000 selector: name: registry type: LoadBalancer sessionAffinity: None externalTrafficPolicy: Cluster

[/code]

Once you’ve applied this, verify your config is correct by ensuring you have an external endpoint for the service (use kubectl describe service registry | grep “LoadBalancer Ingress”). On AWS, this will be an ELB, on other clouds, you might get an IP. If you get an ELB, CNAME a friendly name to it. If you get an IP, create an A record for it. I’m going to use registry.blenderfox.com for this test.

Verify by doing this. Bear in mind it can take a while before DNS records updates so be patient.

host $(SERVICE_DNS)

So if I had set the service to be registry.blenderfox.com, I would do

host registry.blenderfox.com

If done correctly, this should resolve to the ELB then resolve to the ELB IP addresses.

Next, try to tag a docker image of the format registry-host:port/imagename, so, for example, registry.blenderfox.com:9000/my-image.

Next try to push it.

docker push registry.blenderfox.com:9000/my-image

It will fail because it can’t talk over https

docker push registry.blenderfox.com:9000/my-image
The push refers to repository [registry.blenderfox.com:9000/my-image]
Get https://registry.blenderfox.com:9000/v2/: http: server gave HTTP response to HTTPS client

So let’s now fix that.

Now let’s start setting up kube-lego

Checkout the code git clone git@github.com:jetstack/kube-lego.git

cd into the relevant folder cd kube-lego/examples/nginx

Start applying the code base

[code lang=text] kubectl apply -f lego/00-namespace.yaml kubectl apply -f nginx/00-namespace.yaml kubectl apply -f nginx/default-deployment.yaml kubectl apply -f nginx/default-service.yaml [/code]

Open up nginx/configmap.yaml and change the body-size: “64m” line to a bigger value. This is the maximum size you can upload through nginx. You’ll see why this is an important change later.

[code lang=text] kubectl apply -f nginx/configmap.yaml kubectl apply -f nginx/service.yaml kubectl apply -f nginx/deployment.yaml [/code]

Now, look for the external endpoint for the nginx service kubectl describe service nginx -n nginx-ingress | grep “LoadBalancer Ingress”

Look for the value next to LoadBalancer Ingress. On AWS, this will be the ELB address.

CNAME your domain for your service (e.g. registry.blenderfox.com in this example) to that ELB. If you’re not on AWS, this may be an IP, in which case, just create an A record instead.

Open up lego/configmap.yaml and change the email address in there to be the one you want to use to request the certs.

[code lang=text] kubectl apply -f lego/configmap.yaml kubectl apply -f lego/deployment.yaml [/code]

Wait for the DNS to update before proceeding to the next step.

host registry.blenderfox.com

When the DNS is updated, finally create and add an ingress rule for your service:

[code lang=text]

kind: Ingress apiVersion: extensions/v1beta1 metadata: name: registry namespace: default annotations: kubernetes.io/ingress.class: nginx kubernetes.io/tls-acme: ‘true’ spec: tls:

hosts:
- registry.blenderfox.com secretName: docker-tls rules:
host: registry.blenderfox.com http: paths:
- path: “/” backend: serviceName: registry servicePort: 9000 status: loadBalancer: ingress:
- {} [/code]

Look add the logs in nginx-ingress/nginx and you’ll see the Let’s Encrypt server come in to validate:

100.124.0.0 - [100.124.0.0] - - [19/Jan/2018:09:50:19 +0000] "GET /.well-known/acme-challenge/[REDACTED] HTTP/1.1" 200 87 "-" "Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)" 277 0.044 100.96.0.3:8080 87 0.044 200

And look in the logs on the kube-lego/kube-lego pod and you’ll see the success and saving of the secret

time="2018-01-19T09:49:45Z" level=info msg="requesting certificate for registry.blenderfox.com" context="ingress_tls" name=registry namespace=default 
time="2018-01-19T09:50:21Z" level=info msg="authorization successful" context=acme domain=registry.blenderfox.com 
time="2018-01-19T09:50:47Z" level=info msg="successfully got certificate: domains=[registry.blenderfox.com] url=https://acme-v01.api.letsencrypt.org/acme/cert/[REDACTED]" context=acme 
time="2018-01-19T09:50:47Z" level=info msg="Attempting to create new secret" context=secret name=registry-tls namespace=default 
time="2018-01-19T09:50:47Z" level=info msg="Secret successfully stored" context=secret name=registry-tls namespace=default

Now let’s do a quick verify:

curl -ILv https://registry.blenderfox.com
...
* Server certificate:
*  subject: CN=registry.blenderfox.com
*  start date: Jan 19 08:50:46 2018 GMT
*  expire date: Apr 19 08:50:46 2018 GMT
*  subjectAltName: host "registry.blenderfox.com" matched cert's "registry.blenderfox.com"
*  issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3
*  SSL certificate verify ok.
...

That looks good.

Now let’s re-tag and try to push our image

docker tag registry.blenderfox.com:9000/my-image registry.blenderfox.com/my-image
docker push registry.blenderfox.com/my-image

Note we are not using a port this time as there is now support for SSL.

BOOM! Success.

The tls section indicates the host to request the cert on, and the backend section indicates which backend to pass the request onto. The body-size config is at the nginx level so if you don’t change it, you can only upload a maximum of 64m even if the backend service (docker registry in this case) can support it. I have it set here at “1g” so I can upload 1gb (some docker images can be pretty large)