Revolut: A warning to Android users

It seems like Revolut’s latest Android update (6th November) has shafted some users, including me: on the affected devices the app no longer receives updates and has disappeared from the Play Store entirely. No notification, no warning, just a sudden stop to updates. I had to restore the app from a backup I’d made, and was then able to transfer my money out of there.

I spoke to support, and their suggestion? Use a newer device.

I guess I will be closing my Revolut account.

Training in Quarantine – Day 192

A busy Saturday with several house viewings, one of which got cancelled because a resident had to self-isolate due to covid.

One of the viewings today was originally written off by my folks as a “no-hope”, but once they saw inside, their tone changed dramatically.

A literal case of not judging a book by its cover.

In other news, I saw a tweet from https://twitter.com/VictoriaBID:

Now I work on top of Victoria Station, so I walk past the memorial plaque dedicated to the Unknown Soldier every day I commute to the office. Obviously not so much this year due to covid.

The Military Wives Choir performed Abide With Me using the now-common format of a virtual choir:

The virtual choir format has been used a lot this year because of social distancing, but let’s not forget the idea dates back much further, as far back as 2009 with Eric Whitacre’s Virtual Choir project (https://www.youtube.com/user/EricWhitacresVrtlChr), which also made it into several TED talks.

From 2010:

2011:

And 2013:

Training in Quarantine – Day 191 and other updates

My last logged walk was 23rd October. I’ve been slacking off on logging since then, even though I’ve kept up near-daily runs, so this is my first logged one in a while. Since I’ve done ten days’ worth of walks in the meantime, I’m skipping ahead to Day 191.

I’ve also got a few other updates.

My house purchase fell through a while ago, so I have been actively house hunting and my past few Saturdays have been spent on viewings. Viewing during the day is tricky unless I take time off work to house hunt.

Dealing with different Estate Agents is a pain, with some not even bothering to give you the time of day, let alone anything more.

I also upgraded my phone to LineageOS based on Android 10, and I’ve been having quite a few issues with internet speed and stability since. I’m seriously considering forcing a downgrade back to Android 9. In the meantime, I might switch from Adoptable Storage back to portable storage to see if that helps with stability.

Oh, and it’s frickin’ COLD.

CKAD

It’s taken me nearly a year, but I finally figured out one of the questions that stumped me in my CKAD (writeup: https://blenderfox.com/2019/12/01/ckad-writeup/).

In the exam, the question asked for a cronjob to be terminated if it ran for longer than 17 seconds. The CronJob spec has a startup deadline (startingDeadlineSeconds) but no duration deadline, so at the time I assumed it had to be implemented within the command of the application itself, or by telling the CronJob to replace any previously running instance of the job.

Well, I recently hit exactly that situation at work: I wanted to terminate a cronjob if it was active for more than 5 minutes, since the job shouldn’t take that long. It turns out the answer is not in the CronJob documentation, but in the Job documentation.

CronJobs spawn Job resources, and within the Job specification (i.e. under the CronJob’s jobTemplate) you can set spec.activeDeadlineSeconds. Once that deadline passes, the Job’s pods are terminated and the Job is marked as failed.
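Here’s a minimal sketch of where the field lives. The name, schedule, image and command are made up for illustration, and {namespace} is a placeholder; on clusters of that era the API group is batch/v1beta1 (it is batch/v1 on newer ones):

cat <<'EOF' | kubectl apply -n {namespace} -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob              # hypothetical name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300     # terminate the Job's pods after 5 minutes and mark the Job failed
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox           # placeholder image and command
            command: ["sh", "-c", "sleep 600"]
EOF

Note that this bounds the runtime of the whole Job; it’s separate from the CronJob’s own startingDeadlineSeconds, which only covers how late a run is allowed to start.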

Training in Quarantine – Day 179

Late out today — my phone wanted to upgrade so I attempted it (it was an upgrade from Android 9 to Android 10), and it didn’t work, and I ended up having to factory reset and install from scratch. I did have some Titanium Backup backups, but they didn’t seem to work a lot of the time :/

So for the most part, I just reinstalled all the apps I remember using and logged in. For most, that was fine. But I lost the MFA codes in Google Authenticator, meaning I had to remove and set up MFA for:

  • AWS
  • LastPass
  • WordPress
  • GitLab

all over again

AWS was quick and painless: after a security check to confirm I was who I said I was, they called me on the number on the account.

WordPress was painless too — I was already logged in, so I just removed MFA, set it up again, and then logged back in. Similarly with LastPass.

GitLab, however, is proving to be more of a pain. They no longer accept MFA removal requests for people on the Free plan, so I wonder if they will accept me moving to a subscription so I _can_ then request the MFA removal. It’s probably worth it anyway, since I’m hitting the 400-minute CI limit pretty regularly; the 2000-minute limit would suit me better, at least until I can get my own GitLab install working.

As for the run, yes, it was a run — well, more of a jog, anyway. Still did the 3km lap, doing it in 20 mins rather than the 30 mins it normally takes me when I walk it.

The “Snowball” Effect In Kubernetes

So, a weird thing occurred in Kubernetes on the GKE cluster we have at the office. I figured I would do a write-up here before I forget everything, and maybe let the Kubernetes devs read over this as an issue (https://github.com/kubernetes/kubernetes/issues/93783).

We noticed some weirdness occurring on our cluster when Jobs and CronJobs started behaving strangely.

Jobs were spawning but didn’t seem to create any pods to go with them; even over an hour later, they were sitting there without a pod.

Investigating other jobs, I found a crazily large number of pods in one of our namespaces: over 900, to be exact. These were all completed pods from a CronJob.

The CronJob was scheduled to run every minute, and the definition of the CronJob had valid values for the history — sensible values for .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit were set. And even if they weren’t, the defaults would (or should) be used.
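For what it’s worth, the limits are easy to confirm on the live object ({namespace} and {cronjob-name} being placeholders); if they are unset, the documented defaults of 3 successful and 1 failed Job apply:

kubectl get cronjob -n {namespace} {cronjob-name} \
  -o jsonpath='{.spec.successfulJobsHistoryLimit} {.spec.failedJobsHistoryLimit}{"\n"}'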

So why did we have over 900 cron pods, and why weren’t they being cleaned up upon completion?

Just in case the number of pods were causing problems, I cleared out the completed pods:

kubectl delete pods -n {namespace} $(kubectl get pods -n {namespace} | grep Completed | awk '{print $1}' | xargs)

But even after that, new jobs weren’t spawning pods, and in fact more CronJob pods were appearing in this namespace. So I suspended the CronJob:

kubectl patch cronjobs -n {namespace} {cronjob-name} -p '{"spec" : {"suspend" : true }}'

But that also didn’t help: pods were still being generated. Which is weird — why would a CronJob still spawn pods when it’s suspended?

Then I remembered that CronJobs actually generate Job objects, so I checked those and found over 3000 of them. Okay, something was seriously wrong here: there shouldn’t be 3000 Job objects for something that only runs once a minute.

So I went and deleted all the CronJob related Job objects:

kubectl delete job -n {namespace} $(kubectl get jobs -n {namespace} | grep {cronjob-name} | awk '{print $1}' | xargs)

This reduced the pod count, but did not help us determine why the Job objects were not spawning pods.

I decided to get Google onto the case and raised a support ticket.

Their first investigation brought up something interesting. They sent me this snippet from the master logs (redacted):

2020-08-05 10:05:06.555 CEST - Job is created
2020-08-05 11:21:16.546 CEST - Pod is created
2020-08-05 11:21:16.569 CEST - Pod (XXXXXXX) is bound to node
2020-08-05 11:24:11.069 CEST - Pod is deleted

2020-08-05 12:45:47.940 CEST - Job is created
2020-08-05 12:57:22.386 CEST - Pod is created
2020-08-05 12:57:22.401 CEST - Pod (XXXXXXX) is bound to node

Spot the problem?

The time between “Job is created” and “Pod is created” is roughly 76 minutes in the first case, and about 12 minutes in the second. That’s right: it took well over an hour for the pod to be spawned.

And this is where it dawned on me about what was possibly going on.

  • The CronJob spawned a Job object. That Job tried to spawn a pod, and doing so took a significant amount of time, far more than the one minute between runs.
  • On the next cycle, the CronJob checked (because of its .spec.concurrencyPolicy value) whether it already had a running pod.
  • It did not find one, so it generated another Job object, which also got stuck waiting for pod creation.
  • And so on, and so on.

Each time, a new Job gets added, gets stuck waiting for pod generation for an abnormally long time, which causes another Job to be added to the namespace which also gets stuck…

Eventually each pod does get created, but by then there’s a backlog of Jobs, meaning that even suspending the CronJob has no effect until the Jobs in the backlog are cleared or deleted (I had already deleted them).

Google investigated further, and found the culprit:

Failed calling webhook, failing open www.up9.com: failed calling webhook "www.up9.com": Post https://up9-sidecar-injector-prod.up9.svc:443/mutate?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We had been testing up9, which installs a webhook, so it looks like a misbehaving webhook was causing this problem. We removed the webhook and everything started working again.
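I don’t have the exact object names to hand any more, but finding and removing the offending webhook registration boiled down to something like this, with {webhook-config-name} as a placeholder:

# List the admission webhooks registered on the cluster
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Inspect the suspect one: which resources it intercepts, its timeoutSeconds and failurePolicy
kubectl get mutatingwebhookconfiguration {webhook-config-name} -o yaml

# Remove it
kubectl delete mutatingwebhookconfiguration {webhook-config-name}

My reading of the log above is that the webhook was “failing open”, but each admission call still waited out the 30-second timeout before doing so, and across a growing backlog of Jobs that adds up. A shorter timeoutSeconds on the webhook configuration would at least have bounded the per-call delay.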

So where does this leave us? Well, a few thoughts:

  • A misbehaving/misconfigured webhook can cause a Snowball effect in the cluster causing multiple runs of a single CronJob without cleanup — successfulJobsHistoryLimit and failedJobsHistoryLimit values are seemingly ignored.
  • This could break systems where the CronJob is supposed to be run mutually exclusively, since the delay in pod generation could allow two cron pods to spawn together, even though the CronJob has a concurrencyPolicy set as Forbid.
  • If someone managed (whether intentionally or maliciously) to install a webhook that causes this pod-spawning delay, then added a CronJob that runs once a minute and crafted the job to never finish, this snowball effect would cause the cluster to run out of resources and/or scale up nodes forever, or until it hits the maximum allowed by your configuration.

Pixelbook

Spent a big chunk of today preparing for, and attempting, an upgrade of my Pixelbook to GalliumOS.

I imaged it and made a file backup of my home directory, installed the OS over my existing Ubuntu, restored the home directory backup into the new installation, and then chowned the directory back to me.
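Roughly, the backup-and-restore dance was something like this (the paths, drive mount point and user name are illustrative, not my actual ones):

# Back up the home directory to an external drive before wiping
rsync -aAX /home/{user}/ /media/{backup-drive}/home-backup/

# ...install GalliumOS over the old Ubuntu, then restore and take ownership back
rsync -aAX /media/{backup-drive}/home-backup/ /home/{user}/
sudo chown -R {user}:{user} /home/{user}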

As is my habit, I then imaged the laptop in this state.

I prepared a semi-automated script to install apps that I had installed on my Ubuntu, which included things like virt-manager, virtualbox, google-chrome and the like.
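The script itself is nothing clever; a stripped-down sketch would look something like this (the exact package list beyond the ones mentioned above is from memory, and Chrome comes from Google’s own .deb rather than the distro repositories):

#!/bin/bash
set -e
sudo apt update
sudo apt install -y virt-manager virtualbox
# google-chrome isn't in the distro repos; grab Google's own package
wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y ./google-chrome-stable_current_amd64.deb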

However, I soon found out that VirtualBox 6.1 seems to crash the mouse driver on reboot: the pointer no longer moves, and GalliumOS doesn’t even see a pointer device when you check the mouse and touchpad settings. I had to revert to the image taken just after the file copy.

There is always the option of installing VirtualBox 6.0 from the Ubuntu repositories rather than the Oracle repositories, which uses a different installation setup. Maybe that will result in a different outcome.

Eventually, I restored back to my original Ubuntu installation so I could retry again tomorrow.

EDIT: Retried again the next day, and found out the sound wasn’t working, even on the live disk. Better find out what’s the deal with that…

EDIT2: Found out that my Pixelbook model doesn’t have working sound drivers on GalliumOS, so I will have to wait until that is fixed before switching; I’m staying on Ubuntu for now. In the meantime, I’m going to see if I can compile a later version of the kernel to see if I can somehow get VirtualBox working better.

Slow Download Speeds on Steam For Linux

I’ve been getting horrendously slow download speeds on Steam for Linux (~500k/s) compared to 5-6Mb/s on Windows, and only now found out why. There’s a ticket on GitHub for this:

https://github.com/ValveSoftware/steam-for-linux/issues/3401

In short, the client is very aggressive with its DNS requests, which often causes it to be throttled by DNS servers, leading to really slow downloads. Running dnsmasq, however, lets the lookups be cached locally, offloading the repeated requests.

Even though the instructions are for Arch, they worked for me; a Debian/Ubuntu-flavoured version of the same steps is sketched after the list:

  1. Install dnsmasq
  2. Modify /etc/dnsmasq.conf and add the line listen-address=127.0.0.1
  3. Restart the dnsmasq service (systemctl restart dnsmasq.service) or reboot your machine
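On my Ubuntu-based setup the same steps looked roughly like this. Depending on the distro you may also need to make sure the system resolver actually points at 127.0.0.1 (via /etc/resolv.conf or NetworkManager), otherwise the cache never gets used:

sudo apt install dnsmasq
echo 'listen-address=127.0.0.1' | sudo tee -a /etc/dnsmasq.conf
sudo systemctl restart dnsmasq.service

# Sanity check that the local resolver is answering (dig is in the dnsutils package)
dig @127.0.0.1 store.steampowered.com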

Enjoy the speed
