How to move from a single master to multi-master in an AWS kops Kubernetes cluster

Having a single master in a Kubernetes cluster is all very well and good, but if that master goes down the cluster cannot schedule any new work. Existing pods will continue to run, but new ones cannot be scheduled and any pods that die will not be rescheduled.

Having multiple masters gives you resiliency: the remaining masters can pick up the load when one goes down. However, as I found out, setting up multi-master was quite problematic. The guide here only got me part of the way, so after trashing my own and my company’s test cluster, I have expanded on it.

First, add the subnet details for the new zone into your cluster definition: the CIDR, the subnet ID, and a name you will remember. For simplicity, I called mine eu-west-2c. If you have utility subnets defined (and you will if you use a bastion), make sure you also define a utility subnet for the new AZ.

kops edit cluster --state s3://bucket
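For reference, the new entries in the subnets section of the spec look something like this. The CIDRs here are placeholders: pick ranges inside your VPC that do not overlap your existing subnets, and if your existing subnets are type Public rather than Private, match that instead.

  subnets:
  - cidr: 10.10.64.0/19
    name: eu-west-2c
    type: Private
    zone: eu-west-2c
  - cidr: 10.10.96.0/22
    name: utility-eu-west-2c
    type: Utility
    zone: eu-west-2c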

Now create your master instance groups. You need an odd number of masters to establish quorum and avoid split brain (I’m not saying prevent it, there are edge cases where it can still happen even with quorum). I’m going to add eu-west-2b and eu-west-2c; AWS recently introduced the third London zone, so I’m going to make use of it.

kops create instancegroup master-eu-west-2b --subnet eu-west-2b --role Master

Make this one have a max/min of 1

kops create instancegroup master-eu-west-2c --subnet eu-west-2c --role Master

Make this one have a max/min of 0 (yes, zero) for now
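Both kops create instancegroup commands should drop you into an editor on the new group’s spec; the sizing is just the minSize and maxSize fields, which look roughly like this (the rest of the spec varies with your kops version and instance types):

  spec:
    maxSize: 1
    minSize: 1

Set both to 1 for master-eu-west-2b and both to 0 for master-eu-west-2c.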

Reference these in your cluster config

kops edit cluster --state=s3://bucket
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-2a
      name: a
    - instanceGroup: master-eu-west-2b
      name: b
    - instanceGroup: master-eu-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-2a
      name: a
    - instanceGroup: master-eu-west-2b
      name: b
    - instanceGroup: master-eu-west-2c
      name: c
    name: events
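Before applying, it is worth a dry run to sanity-check what kops is about to change; the same update command without --yes only previews:

kops update cluster --state s3://bucket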

Start the new master

kops update cluster --state s3://bucket --yes

Find the etcd and etcd-events pods and put their names into the script below. Change “clustername” to the name of your cluster, then run it. Confirm that both member lists now show two members (in my case etcd-a and etcd-b).
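If you are not sure of the pod names, they are static pods in the kube-system namespace on the existing master, so something like this will list them:

kubectl --namespace=kube-system get pods | grep etcd-server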

ETCPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal
ETCEVENTSPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal
AZ=b
CLUSTER=clustername

kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380

kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list

kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list

(NOTE: the cluster will break at this point: the main etcd cluster now expects a second member, so it loses quorum and the API server becomes unavailable until the new master joins)

Wait for the new master’s EC2 instance to finish initialising. Find its instance ID and put it into this script. Change AWSSWITCHES to match whatever switches you need to pass to the AWS CLI; I specify my profile and region.

The script will run and output the status of the instance until it shows “ok”

AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master2instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
  sleep 5s
  aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2

SSH into the new master (via the bastion if needed)

sudo -i
systemctl stop kubelet
systemctl stop protokube

Edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest:
Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing
Under ETCD_INITIAL_CLUSTER, remove the third master’s entry (eu-west-2c), since that member does not exist yet
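As a rough sketch of where that leaves etcd.manifest (assuming a cluster named clustername; the exact layout depends on your kops version, and etcd-events.manifest is the same idea with the etcd-events names and port 2381):

  - name: ETCD_INITIAL_CLUSTER_STATE
    value: existing
  - name: ETCD_INITIAL_CLUSTER
    value: etcd-a=http://etcd-a.internal.clustername:2380,etcd-b=http://etcd-b.internal.clustername:2380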

Stop the etcd docker containers

docker stop $(docker ps | grep "etcd" | awk '{print $1}')

Run this a few times, until docker complains that it requires at least one argument (meaning there are no etcd containers left to stop).
There are two volumes mounted under /mnt/master-vol-xxxxxxxx: one contains /var/etcd/data-events/member/ and the other contains /var/etcd/data/member/, but which volume is which varies because of the IDs, so check before you delete.
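A quick way to see both at once:

ls -d /mnt/master-vol-*/var/etcd/data*/member

Use the two paths it prints in the rm commands below.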

rm -r /mnt/master-vol-xxxxxx/var/etcd/data-events/member/
rm -r /mnt/master-vol-xxxxxx/var/etcd/data/member/

Now start kubelet

systemctl start kubelet

Wait until the new master shows on the validate list, then start protokube
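The validate list just means the output of kops validate, run from wherever you normally run kops; keep re-running it until the new master appears and reports as ready:

kops validate cluster --name clustername --state s3://bucket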

systemctl start protokube

Now do the same with the third master

Edit the third master ig to set its min/max to 1

kops edit ig master-eu-west-2c --name=clustername --state s3://bucket

Add it to the etcd clusters (the etcd pods should still be running)

ETCPOD=etcd-server-ip-10-10-10-226.eu-west-2.compute.internal
ETCEVENTSPOD=etcd-server-events-ip-10-10-10-226.eu-west-2.compute.internal
AZ=c
CLUSTER=clustername

kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member add etcd-$AZ http://etcd-$AZ.internal.$CLUSTER:2380
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member add etcd-events-$AZ http://etcd-events-$AZ.internal.$CLUSTER:2381

echo Member Lists
kubectl --namespace=kube-system exec $ETCPOD -- etcdctl member list
kubectl --namespace=kube-system exec $ETCEVENTSPOD -- etcdctl --endpoint http://127.0.0.1:4002 member list

Start the third master

kops update cluster --name=clustername --state=s3://bucket --yes

Wait for the third master’s EC2 instance to finish initialising, as before. Find its instance ID and put it into this script. Change AWSSWITCHES to match whatever switches you need to pass to the AWS CLI.

The script will run and output the status of the instance until it shows “ok”

AWSSWITCHES="--profile personal --region eu-west-2"
INSTANCEID=master3instanceid
while [ "$(aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2)" != "ok" ]
do
  sleep 5s
  aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2
done
aws $AWSSWITCHES ec2 describe-instance-status --instance-ids $INSTANCEID --output text | grep SYSTEMSTATUS | cut -f 2

SSH into the new master (via the bastion if needed)

sudo -i
systemctl stop kubelet
systemctl stop protokube

Edit /etc/kubernetes/manifests/etcd.manifest and /etc/kubernetes/manifests/etcd-events.manifest:
Change the ETCD_INITIAL_CLUSTER_STATE value from new to existing

This time we DON’T need to remove anything from ETCD_INITIAL_CLUSTER, since this is the third master and all three members are now defined

Stop the etcd docker containers

docker stop $(docker ps | grep "etcd" | awk '{print $1}')

Run this a few times, until docker complains that it requires at least one argument (meaning there are no etcd containers left to stop).
There are two volumes mounted under /mnt/master-vol-xxxxxxxx: one contains /var/etcd/data-events/member/ and the other contains /var/etcd/data/member/, but which volume is which varies because of the IDs, so do the same ls check as before prior to deleting.

rm -r /mnt/master-vol-xxxxxx/var/etcd/data-events/member/
rm -r /mnt/master-vol-xxxxxx/var/etcd/data/member/

Now start kubelet

systemctl start kubelet

Wait until the new master shows on the validate list, then start protokube

systemctl start protokube

If the cluster validates, do a full respin (a rolling update of every node) so everything picks up the new configuration

kops rolling-update cluster --name clustername --state s3://bucket  --force --yes