Training in Quarantine - Day 315

30 May 2021 #

I've been slacking logging my walks lately, so apologies for that.

This was should have been logged yesterday. It was very hot, so much that I didn't even need to worry about wearing a jacket or anything while walking, and perhaps I should have worn sunglasses, it was that bright.

In other news, my laptop seems like it's on its way out so I have got a replacement and an upgrade, and spent Saturday trying to set it up but clonezilla seems to be having trouble registering the partition labels. I had already purchased a 2TB NVME so instead of cloning then restoring, I cross-device cloned directly, making the 2TB disk exactly the same as the 1TB, but with the space at the end of the disk which I will use for Ubuntu.

Then took the rest of the day downloading my games from Steam. Warframe and Stellaris are the main ones I've been downloading

Streaming also seems fine, with only a slight fps drop.

Training in Quarantine - Day 314

20 May 2021 #

Another wet day, but also very windy today, multiple bins had been blown over and I had to hold my hood from not being blown off my head in the wind -- haven't had to do that for a long while.

Rain started during the walk, nothing too heavy, mostly drizzle and made it back before it really turned heavy.

Training in Quarantine - Day 313

20 May 2021 #

Delayed logging from yesterday. Another walk in between rainshowers. Made it back home before it totally pelted down with hailstones too.

Training in Quarantine - Day 312

18 May 2021 #

Warmer day, but still rained, didn't get caught in any rain today during my walk, so that was good.

Training in Quarantine - Day 311

18 May 2021 #

Forgot to log this yesterday, but it was another on-off rainy day. Went for my walk in a period where the rain was easing and it was fine, but it started to rain on the way back home, so got a little wet.

Also got called by the hire car company from my accident last year (just over a year ago now) -- the insurance company on the other side are still resisting paying the hire car charges and now its going to legal action. So potentially I might end up in court.

But since the accident is a no-fault accident from my side (my car was stationary and parked, and the other driver was arrested for drink driving), I should not be getting any costs coming my way.

Downgrading LineageOS to Android 10

14 May 2021 #

LineageOS has now gone to Android 11, and like most users, I went ahead and upgraded to it. But then I started hitting lots of problems. Predominantly on location.

Android 11 changed the way location is requested and this breaks functionality in multiple apps. Ones I have noticed this issue with:

Just Eat
CityMapper
Google Maps (yes, surprisingly Google's own app had trouble with locking on)

Other location apps may have also had the same issue, but I didn't check those.

Waze did not have any issues locking on to location or tracking movement

Some non-location apps also broke. Fenix 2 (a Twitter client) and WeChat both stopped working and would not install off the Play Store, presumeably because of API differences.

I installed Plume instead (which I had previously purchased) and that installed and functioned happily.

WeChat I sideloaded by getting the apk from a mirror. That functioned okay, but could not log into Web WeChat

I decided to clean wipe and downgrade back to Android 10 (Lineage OS 17) to at least get things working again.

I formatted my SD card for Portable Storage, then took it to my laptop and saved the LOS flash zip, Open Gapps zip, and the latest Magisk.

I booted into TWRP Recovery and wiped, data, cache, system, ART and internal storage.

Switching to external storage, I then flashed LOS, OpenGapps, then Magisk.

I rebooted and let the OS do its thing until I got the welcome screen -- that's a good sign. I went through the setup but opted not to setup my Google Account yet.

Once through to the home page, I went and unlocked Developer options and enabled ADB, Local Terminal, Force Allow External Storage, and Force Close on Hold Back.

Then I plugged my phone into my pixelbook, allowed the debug connection and started up scrcpy which allows me to copy-paste text to and from the device.

I installed TitaniumBackup and the pro key so I can batch move apps to/from the storage.

The SD Card is still setup as Portable. So I formatted it as Internal. This took a few attempts as it kept erroring.

I went into Play Store and installed a few apps. A couple installed file, but the other errorred with a message:

"App requires external storage"

This was weird, I never saw that before, but checking around, I found this: https://forum.xda-developers.com/t/app-requires-external-storage.4098673/ which describes fixing the storage permissions. I ran this:

adb shell
su
restorecon -FR /data/media/0

I reinstalled the apps again, and there were no errors. Fenix 2 installed happily this time, enhancing my suspicion of some API change breaking it on Android 11.

Also found out that Strava required Google Maps so I also had to install Google Maps.

WeChat finally did install, but was then told by the app that my account cannot use Web WeChat, and I should use WeChat for Windows or Mac..... and I'm running Linux, so both of those options are not feasible.

However, I did find this: https://github.com/qo6xup6/ubuntu-deepin-wechat which is a Wine wrapper around the Windows WeChat app. This seems to work well, although I did have to update the client using the instructions on the README.md

FitBit refused to pair with my Ionic (again -- it always seems to have this trouble whenever I have to reinstall the app). I eventually resorted to factory resetting my Ionic, and re-setting it up again. It worked this time, although the pairing took a few attempts.

Surprisingly, I was then able to add my Curve card to FitBit Pay, and the SMS verification worked.

All in all, it took me from around 7am to 12:30 pm to reflash, reinstall, and setup all the apps again, and reboot to make sure the apps still worked. So around 5 hours.

Training in Quarantine - Day 310

10 May 2021 #

A very mixed bag of weather today. Windy, then sunny, then rain, then repeat.

Did my walk and decided to wrap up warm in case the wind came round again.

Just my luck, it didn't and the stayed sunny meaning I was sweltering by the time I got back from the walk.

Training in Quarantine - Day 309

8 May 2021 #

Had my MOT done today. Took surprisingly low amount of time -- I dropped the car off at 10:30 for the 11:00 appointment and they called me back at 2pm saying it was done.

They had a check-in process where you sign in, and deposit the car keys (with a tag saying which bay you had parked the car) into a small locker and that signals the mechanic you have left your keys. It's contactless so you are never near another human.

My daily walk as a result, was much later than normal.

Training in Quarantine - Day 308

6 May 2021 #

Had to shop for more milk, so went up to Tesco but they didn't have the organic milk needed. Went to Sainsbury's instead and they did.

Booked my MOT, but they said they did not have any "While you wait" slots for that date so it would be booked as a drop-off.

So instead, I booked for this weekend.

The showroom is only a single bus ride (plus a 15-minute walk) so it's not too bad.

Training in Quarantine - Day 307

6 May 2021 #

Forgot to log this yesterday.

Was another wet and windy day, didn't really do much of a walk, and just managed to walk to the local Co-op to get some milk in the pouring rain.

Toyota later called me advising my MOT on the Yaris was due next month and I should book in the MOT.

Did that today and surprised that they offer a "While you wait" MOT that you can drop the car off and wait for them to do the MOT and then pick it up afterwards.

Next available slot for that is in two weeks. Still in good time for the 9th June deadline where the MOT expires.

Training in Quarantine - Day 306

4 May 2021 #

Today's been raining, windy, sunny, rainy again and repeat all day.

Managed to get a walk in between the rain and wind phases

Finished reinstalling my apps and had a few issues with location. Waze and Google Maps had trouble locking onto GPS and found out that even though the option to allow location "only while app is running" was available, I needed to set the permission to "at any time" otherwise they would lock on.

Citymapper had no issue

Just Eat still has issues and does not have a "at any" level for location permission, so I just use the postcode for that.

Most of the apps are now working. A few will not install from Play Store, presumably they're not built to support Android 11 yet:

WeChat
Fenix 2

For Fenix, I went back to Plume (I had previously paid for the Premium version and that works fine). I installed GBoard so I can put gifs into my tweets again (one feature I really liked from Fenix)

WeChat, I ended up sideloading from apkmirror and it worked fine.

Training in Quarantine - Day 305

3 May 2021 #

Decided to finally action the upgrade notification on my phone and Lineage OS wanted to update from Android 10 to Android 11.

The upgrade went off without too many issues, but I then found out Open Gapps had no package fro Android 11, meaning I had to resort to MindTheGapps -- a minimal package that only allows Google Apps to work, but doesn't actually install any. Consequently the Google Apps I had installed via OpenGapps decided to stop working as a result. I had to uninstall and reinstall everything from Google -- Search, Notes, YouTube, Maps, Home, etc.

But fortunately everything else seemed to work. Camera MX decided to stop working, so I've gone back to CameraZoom

There seems to be a nice feature in Android 11 where if an app does not use a claimed permission after a period of time, Android will automatically remove that permission. Useful for applications which claim more permissions than they really need.

I also went round the torched fence and managed to get a pic of the new fence

Binding GCP Accounts to GKE Service Accounts with Terraform

19 March 2021 #

Kubernetes uses Service Accounts to control who can access what within the cluster, but once a request leaves the cluster, it will use a default account. Normally this is the default Google Compute Engine account in GKE, and this has extremely high level access and could result in a lot of damage if your cluster is compromised.

In this article, I will be setting up a GKE cluster using a minimal access service account and enabling Workflow Identity.

(This post is now also available on Medium)

Workflow Identity will enable you to bind a Kubernetes service account to a service account in GCP. You can then control GCP permissions of that account from within GCP -- no RBAC/ABAC messing about needed (although you will still need to mess with RBAC/ABAC if you want to restrict that service account within Kubernetes, but that's a separate article.)

What you will need for this tutorial:

A Google account
A Google Cloud account
Terraform on your local machine
kubectl on your local machine (can be installed as part of the Google Cloud SDK)
Google Cloud SDK on your local machine
A Google Cloud project setup
A service account with "Owner" permissions in your GCP project (the default compute engine account will normally work)
A credentials json file from that account -- this can be generated using
- gcloud iam service-accounts keys create credentials.json --iam-account={iam-account-email}

We will start by setting up our Terraform provider

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

We define three variables here that we can reuse later -- the project, region and zone. These variables you can adjust to match your own setup.

The provider block (provider "google" {..}) references those variables and also refers to the credentials.json file that will be used to create the resources in your account.

Next we create the service account that we will bind to the cluster. This service account should contain minimal permissions as it will be the default account used by requests leaving the cluster. Only give it what is essential. You will notice I do not bind it to any roles.

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

Now let's define our cluster and node pool. This block can vary wildly on your circumstances, but I'll use a Kubernetes 1.16 single-zone cluster, with a e2-medium node size and have autoscaling enabled

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min-master-version as that gets changed
      # after deployment to minimum precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but
  # we want to only use separately managed node pools. So we
  # create the smallest possible default node pool and
  # immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

Let's go through a few things on the above block:

variable "cluster_version" {
  default = "1.16"
}

Defines a variable we will use to describe the version of Kubernetes we want on the master and worker nodes.

resource "google_container_cluster" "cluster" {
  ...
  min_master_version = var.cluster_version
  ...
  lifecycle {
    ignore_changes = [
      min_master_version,
    ]
  }
  ...
}

The ignore_changes block here tells terraform not to pay attention to changes in the min_master_version field. This is because even though we declare we wanted 1.16 as the version, GKE will put a Kubernetes variant of 1.16 onto the cluster. For example, the cluster might be created with version 1.16.9-gke.999 -- which is different to what Terraform expects, so if you were to run Terraform again, it would attempt to change the cluster version from 1.16.9-gke.999 to 1.16, cycling through the nodes again.

Next block to discuss:

resource "google_container_cluster" "cluster" {
  ...
  remove_default_node_pool = true
  initial_node_count       = 1
  ...
}

A GKE cluster must be created with a node pool. However it is easier to manage node pool separately, so this block tells Terraform to delete the default node pool when the cluster is created.

Final part of this block:

resource "google_container_cluster" "cluster" {
  ...
  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

This enables Workload Identity and the namespace must be of the format {project}.svc.id.goog

Now let's move onto the Node Pool definition:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and 
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

Let's go over a couple of blocks again:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
 ...
}

This sets up autoscaling with a starting node count of 1 and max node count of 5. Unlike with EKS, you don't need deploy the autoscaler into the cluster. Enabling this will natively allow Kubernetes to scale nodes up or down. The downside is you don't see as many messages compared to the deployed version, so it's sometimes harder to debug why a pod isn't triggering a scaleup.

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and
    # permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
  ...
}

Here we define the node config, we've got this set as a pool of pre-emptible nodes, of type e2-medium. We tie the nodes to the service account defined earlier and give it only the cloud-platform scope.

The metadata block is needed as if you don't specify it, the value disable-legacy-endpoints = "true" is assumed to be applied, and will cause the node pool to be respun each time you run terraform, as it thinks it need to apply the updated config to the pool.

resource "google_container_node_pool" "primary_preemptible_nodes" {
  ...
  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }
}

Similar to the version field on the master node, we tell Terraform to ignore some fields if they have changed.

version we ignore for the same reason as on the master node -- the version deployed will be slightly different to the one we declared.
initial_node_count we ignore because if the node pool has scaled up, not ignoring this will cause terraform to attempt to scale the nodes back down to the initial_node_count value, causing pods to be sent into Pending
node_count we ignore for pretty much the same reason -- it will likely never be the initial value on a production system due to scale up.

With the basic skeleton setup, we can run Terraform to setup the stack. Yes we haven't actually bound anything to serviceaccounts, but that will come later.

Let's Terraform the infrastructure:

terraform init
terraform plan -out tfplan
terraform apply tfplan

Creation of the cluster can take between 5-15 minutes

Next, we need to get credentials and link into the cluster

gcloud beta container clusters get-credentials tutorial --zone {cluster-zone} --project {project}

gcloud beta container clusters get-credentials tutorial --region {cluster-region} --project {project}

You should get some output like this:

Fetching cluster endpoint and auth data.
kubeconfig entry generated for tutorial.

Now you should be able to run kubectl get pods --all-namespaces to see what's in your cluster (should be nothing other than the default system pods)

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                                             READY   STATUS    RESTARTS   AGE
kube-system   event-exporter-gke-666b7ffbf7-lw79x                              2/2     Running   0          13m
kube-system   fluentd-gke-scaler-54796dcbf7-6xnsg                              1/1     Running   0          13m
kube-system   fluentd-gke-skmsq                                                2/2     Running   0          4m23s
kube-system   gke-metadata-server-fsxj6                                        1/1     Running   0          9m29s
kube-system   gke-metrics-agent-pfdbp                                          1/1     Running   0          9m29s
kube-system   kube-dns-66d6b7c877-wk2nt                                        4/4     Running   0          13m
kube-system   kube-dns-autoscaler-645f7d66cf-spz4c                             1/1     Running   0          13m
kube-system   kube-proxy-gke-tutorial-tutorial-cluster-node-po-b531f1ee-8kpj   1/1     Running   0          9m29s
kube-system   l7-default-backend-678889f899-q6gsl                              1/1     Running   0          13m
kube-system   metrics-server-v0.3.6-64655c969-2lz6v                            2/2     Running   3          13m
kube-system   netd-7xttc                                                       1/1     Running   0          9m29s
kube-system   prometheus-to-sd-w9cwr                                           1/1     Running   0          9m29s
kube-system   stackdriver-metadata-agent-cluster-level-566c4b7cf9-7wmhr        2/2     Running   0          4m23s

Now let's do our first test. We'll use gsutil to run a list of GS buckets on our project.

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

This will run a docker image with gsutil in it and then remove the container when the command finishes.

The output should be something like this:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Caller does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-5nzgt -c test -i -t' command when the pod is running
deployment.apps "test" deleted

As you can see, we get a 403. The default service account doesn't have permissions to access Google Storage.

Now let's setup the service account we will use for binding:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

Again, let's go through the blocks:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

This block defines the service account in GCP that will be binding to.

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

This block assigns the Storage Admin role to the service account we just created -- essentially it is putting the service account in the Storage Admin group. Think of it more like adding the account to a group rather than assigning a permission or role to the account.

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

This block adds the service account as a Workload Identity User. You'll notice that the member field is a bit confusing. The ${var.project}.svc.id.goog bit indicates that it is a Workflow Identity namespace and the bit in [...] is the name of the Kubernetes service account we want to allow to be bound to this. This membership and an annotation on the service account (described below) will allow the service account in Kubernetes to essentially impersonate the service account in GCP and you will see this in the example.

With the service account setup in Terraform, let's run the Terraform apply steps again

terraform plan -out tfplan
terraform apply tfplan

Assuming it didn't error, we now have one half of the binding -- the GCP service account. We now need to create the service account inside Kubernetes.

You'll recall that we had a piece of data in the [...]: workload-identity-test/workload-identity-user this is our service account that we need to create. Below is the yaml for creating the namespace and the service account. Save this into the file workload-identity-user.yaml:

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: null
  name: workload-identity-test
spec: {}
status: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com
  name: workload-identity-user
  namespace: workload-identity-test

The important thing to note is the annotation on the service account:

  annotations:
    iam.gke.io/gcp-service-account: workload-identity-tutorial@{project}.iam.gserviceaccount.com

The annotation references the service account created by the Terraform block:

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

So the Kubernetes service account references the GCP service account and the GCP service references the Kubernetes service account.

Important Note: If you do not do the double referencing -- for example, if you forget to include the annotation on the service account or forget to put the referenced Kubernetes service account in the Workload Identity member block, then GKE will use the default service account specified on the node.

Now it's time to put it to the test. If everything is setup correct, run the previous test again:

kubectl run --rm -it test --image gcr.io/cloud-builders/gsutil ls

You should still get the a 403 but with a different error message.

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 Primary: /namespaces/{project}.svc.id.goog with additional claims does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-68bb69b777-8ltvc -c test -i -t' command when the pod is running
deployment.apps "test" deleted

Let's now create the service accounts. This file should have been created by the earlier step:

$ kubectl apply -f workload-identity-test.yaml
namespace/workload-identity-test created
serviceaccount/workload-identity-user created

So now let's run the test again but this time, we specify the service account and also the namespace as a service account is tied to the namespace it resides in — in this case, the namespace of our service account is workload-identity-test

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

The output will show the buckets you have:

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
gs://backups/
gs://snapshots/
Session ended, resume using 'kubectl attach test-66754998f-sp79b -c test -i -t' command when the pod is running
deployment.apps "test" deleted

NOTE: If you're running a later version of Kubernetes or kubectl, you may get the following error:

Flag --serviceaccount has been deprecated, has no effect and will be removed in 1.24.

In that case, you need to instead use the --overrides switch:

kubectl run -it --rm -n workload-identity-test test --overrides='{ "apiVersion": "v1", "spec": { "serviceAccount": "workload-identity-test" } }' --image gcr.io/cloud-builders/gsutil ls

Let's now change the permissions on the GCP service account to prove it's the one being used change this block:

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin"
  # role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

And change the active role like so:

resource "google_project_iam_member" "storage-role" {
  # role = "roles/storage.admin"        ## <-- comment this out
  role   = "roles/storage.objectAdmin"  ## <-- uncomment this
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

Run the terraform actions again:

terraform plan -out tfplan
terraform apply tfplan

Allow a few minutes for the change to propagate then run the test again:

kubectl run -n workload-identity-test --rm --serviceaccount=workload-identity-user -it test --image gcr.io/cloud-builders/gsutil ls

(See earlier if you get an error regarding the serviceaccount switch)

kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
AccessDeniedException: 403 workload-identity-tutorial@{project}.iam.gserviceaccount.com does not have storage.buckets.list access to the Google Cloud project.
Session ended, resume using 'kubectl attach test-66754998f-k5dm5 -c test -i -t' command when the pod is running
deployment.apps "test" deleted

And there you have it, the service account in the cluster: workload-identity-test/workload-identity-user is bound to the service account workload-identity-tutorial@{project}.iam.gserviceaccount.com on GCP, carrying the permissions it also has.

If the service account on Kubernetes is compromised in some way, you just need to revoke the permissions on the GCP service account and the Kubernetes service account no longer has any permissions to do anything in GCP.

For simplicity, here's the Terraform used for this tutorial. Replace what you need -- you can move things around and separate into other Terraform files if you wish -- I kept it in one file for simplicity.

variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

resource "google_service_account" "cluster-serviceaccount" {
  account_id   = "cluster-serviceaccount"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

variable "cluster_version" {
  default = "1.16"
}

resource "google_container_cluster" "cluster" {
  name               = "tutorial"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  lifecycle {
    ignore_changes = [
      # Ignore changes to min-master-version as that gets changed
      # after deployment to minimum precise version Google has
      min_master_version,
    ]
  }

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "tutorial-cluster-node-pool"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  version = var.cluster_version

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope
    # and permissions granted via IAM Roles.
    service_account = google_service_account.cluster-serviceaccount.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }

  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to node_count, initial_node_count and version
      # otherwise node pool will be recreated if there is drift between what 
      # terraform expects and what it sees
      initial_node_count,
      node_count,
      version
    ]
  }

}

resource "google_service_account" "workload-identity-user-sa" {
  account_id   = "workload-identity-tutorial"
  display_name = "Service Account For Workload Identity"
}

resource "google_project_iam_member" "storage-role" {
  role = "roles/storage.admin" 
  # role   = "roles/storage.objectAdmin" 
  member = "serviceAccount:${google_service_account.workload-identity-user-sa.email}"
}

resource "google_project_iam_member" "workload_identity-role" {
  role   = "roles/iam.workloadIdentityUser"
  member = "serviceAccount:${var.project}.svc.id.goog[workload-identity-test/workload-identity-user]"
}

GitLab's Default Branch Name

11 March 2021 #

GitLab is now implementing a change to make the default branch "main" instead of "master", following GitHub and Atlassian in ditching the "master/slave" namings due to their negative history.

It should be noted that this change this makes little difference to the functionality these sites provide, and to git repositories in general. Also, the default branch can be overridden.

When creating a blank initial repo in GitLab or GitHub (i.e. without a README.md file), the sites will prompt you to push code in using instructions such as this (GitLab haven't yet implemented the master --> main change yet so it still shows master on their instructions)

git clone git@gitlab.com:username/example.git
cd example
touch README.md
git add README.md
git commit -m "add README"
git push -u origin master

But there's nothing to stop you from doing something like

git push -u origin trunk

Instead of master at the time of pushing.

trunk is also one of the three folders used in Subversion Version Control as part of the recommended layout (trunk, tags, branches) -- yes, I did use svn previously, along with Mercurial, Visual SourceSafe, and even cvs.

trunk is also a more logical sounding main branch as you have branches that lead into the trunk of a repo. And the leaves could be considered to be the tags.

While it is great that the big name hosting platforms are migrating away from the master branch idea, it should be noted that you didn't have to have this default branch name originally, nor were you (or are you still) tied to using their choice of main branch name.

Heater Repairs

21 December 2020 #

We got the replacement heater installed today, mostly fine, the heat of the water is now very high, since the plumber had to replace the piping near the heater, and had to mount a plank of wood on the wall to provide additional support for the heater as the bracket where the heater will be mounted would be too high.

This heater requires electricity, compared to the previous one which did not. Meaning without power, we won't have hot water.

The plumber had to add cement around the exit flue of the heater and a few hours later I could see a crack in the cement, probably from it contracting and cracking while drying.

The plumber also had to remove an entire cupboard from the kitchen where the heater is and we'll have to remount that some other time.

Revolut: A warning to Android users

11 November 2020 #

It seems like Revolut's latest Android update (6th November) has shafted some users, including me, rendering them unable to receive the update, and making the app disappear entirely from the Play Store for those users. No notification, no warning. Just a sudden stop to updates. I had to restore from a backup I made of the app, and was then able to transfer my money out of there.

I spoke to support, and their suggestion? Use a newer device.

I guess I will be closing my Revolut account.

Training in Quarantine - Day 192

7 November 2020 #

A busy Saturday with several house viewings. One of which got cancelled due to a resident having to self-isolate due to covid.

One of the viewings today was originally written off by my folks as a "no-hope" but once they viewed inside, their tone dramatically changed.

A literal case of not judging a book by its cover.

In other news, I saw a tweet from [twitter.com/VictoriaB...](https://twitter.com/VictoriaBID:)

[twitter.com/VictoriaB...](https://twitter.com/VictoriaBID/status/1324711864765456385)

Now I work on top of Victoria Station, so I walk past the memorial plaque dedicated to the Unknown Soldier every day I commute to the office. Obviously not so much this year due to covid.

The Military Wives Choir did the song for Abide With Me using the now-common feature of a virtual choir:

[www.youtube.com/watch](https://www.youtube.com/watch?v=4J-oP1esgt4)

The virtual choir idea has been used a lot this year due to social distancing, but let's not forget, the idea dates back way further, even as far back as 2009 with Eric Whitacre's Virtual Choir project (https://www.youtube.com/user/EricWhitacresVrtlChr) which also made it into several TED talks

From 2010:

[www.ted.com/talks/eri...](https://www.ted.com/talks/eric_whitacre_a_choir_as_big_as_the_internet)

2011:

[www.ted.com/talks/eri...](https://www.ted.com/talks/eric_whitacre_a_virtual_choir_2_000_voices_strong)

And 2013:

[www.ted.com/talks/eri...](https://www.ted.com/talks/eric_whitacre_virtual_choir_live)

Training in Quarantine - Day 191 and other updates

6 November 2020 #

My last logged walk was 23rd October. I've been slacking off logging runs since then, so this is my first logged run since then, even though I have been doing near-daily runs since then, so I'm skipping through to Day 191 since I've done 10 days of walks since then.

I've also got a few other updates.

My house purchase fell through a while ago so I have been actively house hunting a lot and my past few Saturdays have been spent house viewing. Viewing during the day is tricky unless I take time off to house hunt.

Dealing with different Estate Agents is a pain, with some not even bothering to give you the time of day, let alone

I also upgraded my phone to Android 10 LineageOS and I've been having quite a few issues with internet speed and stability. I'm seriously considering forcing a downgrade back to Android 9. In the meantime, I might switch from Adoptable Storage back to portable storage to see if that helps with stability.

Oh, and it's frickin' COLD.

CKAD

2 November 2020 #

It's taken me nearly a year, but I finally figured out one of the questions that stumped me in my CKAD (writeup: https://blenderfox.com/2019/12/01/ckad-writeup/)

In the exam, the question was to terminate a cronjob if it lasts longer than 17 seconds. There’s a startup deadline but not a duration deadline. It could be implemented within the command of the application itself, or by specifying to replace any previous running version of the jobs.

Well, I finally had that situation recently at work and wanted to terminate a cronjob if it was active more than 5 minutes, since the job shouldn't take that long. Finally found out that the answer was not in the CronJob documentation, but in the Job documentation.

CronJobs spawn a Job resource, and within the specification, you can include spec.activeDeadlineSeconds. This will terminate the job pod at that time and will consider the job as failed.

Training in Quarantine - Day 181

23 October 2020 #

Got caught in the rain on my walk today. It didn't last long though, but I had made it home by the time it stopped. Seems like that's a pattern when I go for a walk....

Training in Quarantine - Day 180

20 October 2020 #

Installed more apps that were missing from my phone and set most of them up. Did my walk as normal, no issues there. Had to re-pair my phone with my car since everything was reset, though.

Training in Quarantine - Day 179

19 October 2020 #

Late out today -- my phone wanted to upgrade so I attempted it (it was an upgrade from Android 9 to Android 10), and it didn't work, and I ended up having to factory reset and install from scratch. I did have some Titanium Backup backups, but they didn't seem to work a lot of the time :/

So for the most part, I just reinstalled all the apps I remember using and logged in. For most, that was fine. But I lost the MFA codes on Google Authenticator, meaning I had to remove and setup:

AWS
LastPass
Wordpress
GitLab

all over again

AWS was quick and painless after a security check to confirm I was who I said I was and they called me on the number on the account.

Wordpress was painless too -- I was already logged in, so just removed MFA and set it up again, then logged in again. Similarly with LastPass

GitLab however, is proving to be more of a pain. They no longer accept MFA removal requests for people on the Free plan. So I wonder if they will accept me going to a subscription model so I _can_ then request the MFA removal. I think it is better anyway, since I'm hitting the 400 minute CI limit pretty regularly. The 2000 minute CI limit would be better. At least until I can get my own GitLab install working.

As for the run, yes, it was a run -- well, more of a jog, anyway. Still did the 3km lap, doing it in 20 mins rather than the 30 mins it normally takes me when I walk it.

The "Snowball" Effect In Kubernetes

7 August 2020 #

So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write up here, before I forget everything and maybe allow the Kubernetes devs to read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783)

We noticed some weirdness occurring on our cluster when Jobs and CronJobs started behaving strangely.

Jobs were spawning but seemed to not spawn any pods to go with it, even over an hour later, they were sitting there without a pod to go with it.

Investigating other jobs, I found a crazy large number of pods in one of our namespaces, over 900 to be exact. These pods were all completed pods from a CronJob.

The CronJob was scheduled to run every minute, and the definition of the CronJob had valid values for the history -- sensible values for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit were set. And even if they weren't, the defaults would (or should) be used.

So why did we have over 900 cron pods, and why weren't they being cleaned up upon completion?

Just in case the number of pods were causing problems, I cleared out the completed pods:

kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)

But even after that, new jobs weren't spawning pods. And in fact, more CronJob pods were appearing in this namespace. So I disabled the CronJob

kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'

But that also didn't help, pods were still being generated. Which is weird -- why is a CronJob still spawning pods even when it's suspended?

So then I remembered that CronJobs actually generate Job objects. So I checked the Job objects and found over 3000 Job objects. Okay, something is seriously wrong here, there shouldn't be 3000 Job objects for something that only runs once a minute.

So I went and deleted all the CronJob related Job objects:

kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)

This reduced the pods down, but did not help us determine why the Job objects were not spawning pods.

I decided to get Google onto the case and raised a support ticket.

Their first investigation brought up something interesting. They sent me this snippet from the Master logs (redacted)

2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted

2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node

Spot the problem?

The time between "Job is created" and "Pod is created" around 80 minutes in the first case, and 12 minutes in the second one. That's right, it took 80 minutes for the Pod to be spawned.

And this is where it dawned on me about what was possibly going on.

The CronJob spawned a Job object. It tried to spawn a pod, and that took a significant amount of time, far more than the 1 minute between runs
The next cycle, the CronJob looks to see if it has a running pod due to the .spec.concurrencyPolicy value.
The CronJob does not find a running pod so generates another Job object, which also gets stuck waiting for pod generation
And so on, and so on.

Each time, a new Job gets added, gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace which also gets stuck...

Eventually, the pod will generate but by then there's now a backlog of Jobs, meaning even if I suspended the CronJob, it won't have any effect until the Jobs in the backlog are cleared or deleted (I had deleted them).

Google investigated further, and found the culprit:

Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We were testing up9 and this was using a webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.

So where does this leave us? Well, a few thoughts:

A misbehaving/misconfigured webhook can cause a Snowball effect in the cluster causing multiple runs of a single CronJob without cleanup -- successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.

This could break systems where the CronJob is supposed to be run mutually exclusively, since the delay in pod generation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set as Forbid.

If someone managed (whether intentionally or maliciously) to install a webhook that causes this pod spawning delay, and then adds a CronJob that runs once a minute -- and then maliciously crafts the job to never finish, this snowball effect will cause the cluster to run out of resource and/or scale up nodes forever or until it hits the max allowed by your configuration.

Pixelbook

4 July 2020 #

Spent a big chunk of today preparing for, and attempting to upgrade my Pixelbook to Gallium OS.

I imaged it, then made a file backup of my home directory, before installing the OS, overwriting my Ubuntu, then restoring the home directory backup into the newly installed OS and then chowning the directory to me.

As a habit, I then imaged the laptop at this state.

I prepared a semi-automated script to install apps that I had installed on my Ubuntu, which included things like virt-manager, virtualbox, google-chrome and the like.

However, I soon found out that VirtualBox 6.1 seems to crash the mouse driver on reboot and the mouse pointer no longer moves and Gallium doesn't even seem to see a pointer device when you check the mouse and touchpad option. I had to revert back to the image just after the file copy.

There is always the option of installing VirtualBox 6.0 from the Ubuntu repositories rather than the Oracle repositories, which uses a different installation setup. Maybe that will result in a different outcome.

Eventually, I restored back to my original Ubuntu installation so I can retry again tomorrow.

EDIT: Retried again the next day, and found out the sound wasn't working, even on the live disk. Better find out what's the deal with that...

EDIT2: Found out that my Pixelbook model doesn't have working sound drivers on GalliumOS. I guess I will have to wait until that is fixed before using that. I guess I'm staying on Ubuntu. In the meantime, I'm going to see if I can compile a later version of the kernel to see if I can somehow get VirtualBox working better.

IFTTT

9 June 2020 #

My connection to IFTTT suddenly messed up so I had to delete my blog connection and recreate it, and create a new application password...

Very weird why this suddenly happened.