I process my buildchain via Gitlab CI. Besides the actual build of the application and the frontend assets, numerous tests are executed, the code quality is checked and finally the finished website is deployed. Quite a few jobs come together there. How nice would it be if there were always enough Gitlab runners available? We need autoscaling Gitlab runners. A manual.
Gitlab-Runner: Autoscaling in the HETZNER cloud.
Why do I need a Gitlab runner?
Every job within the Gitlab CI is done by a runner. This runner is nothing special in itself; it receives a task from the Gitlab server (e.g. "Build the application", "Test syntax for validity" or "Deploy the result to the production server") and the appropriate data (usually the Gitlab repository) and processes it - the result is sent back (if desired) and is then available for subsequent jobs.
A Gitlab runner is stateless by default, but it can be set up to keep data between jobs (especially important for caching).
Now you can easily register Gitlab Runner on a server to take your first steps with Gitlab CI. But as soon as you start using this tool seriously, you will find out that jobs have to wait too long because no runner is free. But we want to process the buildchain as fast as possible and waiting is dead time.
The first reaction to a shortage is: more of it! We need more runners! But at the latest when these runners are distributed over several servers (and you get to this point faster than you think, because runners can generate a lot of load on a server) there is a new problem: runners on different servers cannot easily share a cache. And that makes the buildchain slower again. A dilemma.
There's also an economic aspect: very few of us work around the clock. Like everywhere else, there are phases when everything is on fire and you need runners without end; but then (typically at night and on weekends) it is completely quiet and there is nothing to do. And there are normal working days when much less than usual runs through the build chains.
Keeping all the runners you need at peak times available at all times is economic and ecological nonsense.
So the goal is to have many runners available whenever we need them. And if we don't need them, they should shut down. We need autoscaling!
Gitlab Runners are an excellent example of where, in my opinion, cloud computing makes sense. The runners themselves have no dependencies on each other; it's just a small software army waiting for work and getting it done. Whether there is one, three or five hundred runners doesn't matter to the buildchain.
Choosing the right cloud provider
Here there is no right or wrong in the actual sense. Basically all providers of cloud solutions are suitable for this, as long as they provide an appropriate API to start, configure and delete servers automatically.
This is of course the case with the big ones: Amazon AWS, Google Cloud, Microsoft Azure, Akamai, Digital Ocean etc. can all do this. I have chosen the cloud offering from HETZNER because, on the one hand, I have had servers with this provider for a long time and have always had good experiences with its support. On the other hand, the server location is selectable, so you can choose e.g. servers in Germany if you want to (which I do).
HETZNER's prices are very fair: the smallest server is available for just under 3,- €/month if it runs without interruption. Since our servers will be started and stopped frequently, I made a point of choosing a tariff that only charges for the pure running time, with no setup costs or the like.
For the Gitlab runners I chose the CX21 servers. These offer enough power to process the jobs, but at 1 ct/hour they are still very cheap. The CX11 servers are enough for trying things out, but with larger jobs I hit the limit of the main memory.
If you would like to test this on the HETZNER cloud, you can register here: https://hetzner.cloud/?ref=wNQZvWvSsgiI (If you register via this link, I will receive a small commission in the form of cloud credit from HETZNER)
For our orchestra of Gitlab runners to work, we need a conductor. The conductor will take the jobs from Gitlab CI and assign them to the runners. At the same time, the conductor ensures that there are always enough runners available and shuts down runners that are no longer needed.
I call this role Broker and have named the server accordingly. My broker is also a server in the HETZNER cloud, and so far a small CX11 server is sufficient. This server runs all the time.
The Broker takes over another important role for me: it provides the storage space for the shared caches. This is also the reason why I use a cloud server for the broker and do not let an already existing server take care of this service. It makes sense that Broker and Gitlab-Runner are close to each other in terms of network topology to keep latencies and transfer times low. Every second counts in a buildchain, and the money for the broker is very well invested.
To prevent the broker from being accidentally switched off or even deleted, it can be protected via cloud management.
The configuration of the broker
For the configuration of the broker I used a Gitlab runner image from mawalu. This allows the Gitlab-Runner to run neatly encapsulated in a docker container and causes few problems with the configuration. You can find the project here: https://github.com/mawalu/hetzner-gitlab-runner
The docker-compose.yml is simple:
# docker-compose.yml
version: '2'
services:
  hetzner-runner:
    image: mawalu/hetzner-gitlab-runner:latest
    mem_limit: 128mb
    memswap_limit: 256mb
    volumes:
      - "./hetzner_config:/etc/gitlab-runner"
    restart: always
With docker-compose up -d the Broker is started. Afterwards, docker-compose run hetzner-runner register registers the runner with the Gitlab server, creates a config.toml in the hetzner_config directory and connects to Gitlab CI. We will modify this config.toml in a moment.
Once started with docker-compose up -d, the broker is active. Changes to config.toml are picked up automatically within a few seconds; the broker does not need to be restarted!
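If the broker ever has to be rebuilt, the interactive registration can also be scripted. The following is only a minimal sketch: it assumes the image forwards its arguments to gitlab-runner (as the register call above suggests) and that your Gitlab version still accepts registration tokens. The URL, the token and the default image alpine:latest are placeholders, and the DRY_RUN switch is a helper invented for this example.

```shell
#!/bin/sh
# Print the command instead of executing it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Register the runner without interactive prompts.
# $1 = Gitlab URL, $2 = registration token (both placeholders).
register_broker() {
    run docker-compose run --rm hetzner-runner register \
        --non-interactive \
        --url "$1" \
        --registration-token "$2" \
        --executor "docker+machine" \
        --docker-image "alpine:latest" \
        --description "[HETZNER] Cloud-runner with autoscale"
}

# Follow the broker's log output to watch machines being created and removed.
follow_broker_logs() {
    run docker-compose logs -f hetzner-runner
}
```

Calling register_broker with DRY_RUN=1 set only prints the command, which is handy for checking the arguments before touching the Gitlab server.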
My config.toml looks like this (confidential information is removed):
[[runners]]
  name = "[HETZNER] Cloud-runner with autoscale"
  limit = 15
  url = "***********************"
  token = "*********************"
  executor = "docker+machine"
  environment = ["COMPOSER_CACHE_DIR=/composer-cache"]
  [runners.custom_build_dir]
  [runners.docker]
    tls_verify = false
    image = "marcwillmann/codeception"
    memory = "2048m"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/var/cache:/cache:rw", "/var/run/docker.sock:/var/run/docker.sock"]
    pull_policy = "if-not-present"
    shm_size = 536870912
  [runners.machine]
    IdleCount = 2
    IdleTime = 600
    MachineDriver = "hetzner"
    MachineName = "runner-%s"
    MachineOptions = ["hetzner-api-token=*************", "hetzner-image=ubuntu-18.04", "hetzner-server-type=cx21"]
    OffPeakPeriods = ["* * 0-8,19-23 * * mon-fri *", "* * * * * sat,sun *"]
    OffPeakTimezone = "Europe/Berlin"
    OffPeakIdleCount = 1
    OffPeakIdleTime = 600
  [runners.custom]
    run_exec = ""
The HETZNER API token can be obtained in the Cloud Console (https://console.hetzner.cloud/): create a new project and open Accesses -> API Tokens.
With this token, the broker may now start new servers and delete existing ones.
The configuration contains the IdleCount. The autoscaler always keeps this many runners in reserve; you can observe this very well in the cloud console. Directly after the start, 2 Gitlab runners are started there and wait for jobs. If a Gitlab CI pipeline is running and one (or usually more) of these runners is assigned a job, new servers will appear. You will need to experiment a bit with this value. If it is set too low, jobs have to wait until a new runner has been started (it takes a little while until it is ready). If it is too high, you give away money, because computing power sits around unused.
The IdleTime specifies how long a runner will be kept after it has finished a job. If a new job comes in the meantime, it will be reused. If no job comes, the runner is deleted.
I have had good experience with an IdleTime of 600 (seconds); the IdleCount in my production system is currently set to 5.
Because we all hopefully have reasonably regular working hours and don't work all night, there are also the OffPeak settings in the configuration. Here you can specify times when the pipeline is used little or not at all (after hours, on weekends). For these times you can configure a separate OffPeakIdleCount (e.g. 0 or 1) to reduce costs.
By the way: even an IdleCount=0 does not mean that you cannot work. It just means that no reserves are kept, i.e. it takes a bit longer until the job starts. But for a normal NightlyBuild, where seconds don't matter, this is no problem.
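To get a feeling for what the reserve costs, the numbers from the configuration above can be put into a small calculation. This is a rough sketch, not an invoice: it only counts the idle reserve, assumes the CX21 price of 1 ct/hour mentioned earlier, and treats hours 0-8 and 19-23 on weekdays plus the whole weekend as off-peak. The function name is made up for this example.

```shell
#!/bin/sh
# Hours per week, derived from the OffPeakPeriods in config.toml:
# peak = Mon-Fri, hours 9 to 18 (10 h x 5 days); everything else is off-peak.
PEAK_HOURS=50
OFFPEAK_HOURS=118   # 168 h per week minus 50 peak hours

# $1 = IdleCount, $2 = OffPeakIdleCount, $3 = price in cents per server hour
weekly_idle_cost_cents() {
    echo $(( (PEAK_HOURS * $1 + OFFPEAK_HOURS * $2) * $3 ))
}

weekly_idle_cost_cents 2 1 1   # prints 218: reserve with off-peak reduction
weekly_idle_cost_cents 2 2 1   # prints 336: same reserve kept around the clock
```

With these assumptions the off-peak setting cuts the cost of the idle reserve from about 3.36 € to about 2.18 € per week; servers that are actually working on jobs come on top in both cases.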
Time for improvements
So the basic setup is already running. Jobs in Gitlab are distributed to the Broker, which makes sure that there are always enough Runners available and hands them the tasks. The number of runners is scaled up and down automatically.
But the runners themselves have no memory, and the jobs are unfortunately still rather slow at the moment. That's not due to a lack of power: in the job output you can see that e.g. our application build job doesn't find a package in the composer cache. This is logical; the server for the Runner was just started and can't have a cache. And our runner doesn't know about the cache that another runner has just built. We have to change that!
Step 1: Provide space
We build a central cache that all runners can read from and write to. What place would be better suited for this than the Broker? It sits right next to the Runners in terms of network topology, is always available, and the Runners can reach it.
To ensure that there is enough space on the Broker and that it survives a Delete/Rebuild of the Broker server, I have attached a volume. This is a persistent file storage, which can easily be enlarged at any time. Currently I use a 25GB volume as gitlab-runner-cache.
The volume can be easily attached via the Hetzner Console and is then visible from the system under /dev/disk/by-id/scsi-0HC_Volume_XXXXXXX. There it behaves like a normal hard disk and can be formatted and mounted with the file system of choice (ext4 in my case).
The commands are also shown in the Hetzner Console; for completeness, here they are again at a glance:
sudo mkfs.ext4 -F /dev/disk/by-id/scsi-0HC_Volume_XXXXXXX
sudo mkdir /export
sudo mount -o discard,defaults /dev/disk/by-id/scsi-0HC_Volume_XXXXXXX /export
# tee runs with root rights; with "sudo echo ... >> /etc/fstab" the redirect would not
echo "/dev/disk/by-id/scsi-0HC_Volume_XXXXXXX /export ext4 discard,nofail,defaults 0 0" | sudo tee -a /etc/fstab
If you change the size of the volume later on, this must of course be made known to the file system, using the device path:
sudo resize2fs /dev/disk/by-id/scsi-0HC_Volume_XXXXXXX
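Because the fstab entry uses nofail, the broker also boots when the volume is missing, and anything written to /export would then silently land on the small root disk. A small check before starting services on top of it can catch that. This is a sketch under the assumption of a Linux system with /proc/mounts; both function names are made up for this example.

```shell
#!/bin/sh
# Return success if the given path is listed as a mount point.
# $1 = mount point, $2 = mounts table (defaults to /proc/mounts on Linux).
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Abort early instead of silently filling the broker's root disk.
require_cache_volume() {
    if is_mounted /export; then
        echo "/export is mounted"
    else
        echo "/export is NOT mounted, check /etc/fstab" >&2
        return 1
    fi
}
```

A call to require_cache_volume could, for example, be placed in front of whatever script starts the cache service on the broker.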
Step 2: Provide S3 storage
Unfortunately Hetzner does not yet have S3 compatible storage. But that's no problem; we use the volume just created and let the broker provide it. With minio there is a slim docker image that provides an S3 compatible service. This is also started quickly:
docker run -it -d --restart always -p 9005:9000 -v /.minio:/root/.minio -v /export:/export --name minio minio/minio:latest server /export
Once the container is running,
cat /export/.minio.sys/config/config.json | grep Key
gives us the S3 access keys.
Step 3: Set up Multi-Runner Cache
This gives us everything we need to provide our runners with a common cache. In config.toml we set this up:
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "###IP_BROKER###:9005"
      AccessKey = "***********************"
      SecretKey = "********************************************************"
      BucketName = "runner"
      Insecure = true
    [runners.cache.gcs]
and are pleased that in the future the cache will be persisted and our jobs will run much faster. This is especially noticeable in the build jobs (composer and npm); here you can sometimes speed up the execution time by an order of magnitude!
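If the cache does not seem to be used, it is worth checking that the S3 service is reachable and that the bucket named in BucketName exists. The sketch below makes two assumptions that are not part of the setup described here: that the MinIO client mc is installed on the machine you run it from, and that the bucket is not created automatically. The alias name broker and the DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# Print the command instead of executing it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = broker IP, $2 = access key, $3 = secret key
prepare_cache_bucket() {
    endpoint="http://$1:9005"
    # Liveness probe offered by MinIO; fails if the service is unreachable.
    run curl -fsS "$endpoint/minio/health/live" || return 1
    # Register the endpoint under the alias "broker" ...
    run mc alias set broker "$endpoint" "$2" "$3" || return 1
    # ... and create the bucket from config.toml unless it already exists.
    run mc mb --ignore-existing broker/runner
}
```

Port 9005 and the bucket name runner are taken from the docker run call and the config.toml above; everything else is illustrative.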
Docker on Speed
However, we have not yet reached the end of the optimisation process. Our Gitlab runners run as docker+machine type, i.e. each job is passed to a docker container which is started on the runner server. We specify which image is used in the job configuration in .gitlab-ci.yml.
Naturally, this is different for a frontend build job than for a quality gate job, and the deployment will again rely on a different container.
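To make this more tangible, here is what such a job configuration could look like. The snippet writes an illustrative .gitlab-ci.yml; the job names, the images composer:2 and node:14 and the scripts are assumptions for this example and not my actual pipeline. A job without an image key falls back to the default image from config.toml.

```shell
#!/bin/sh
# Write an illustrative .gitlab-ci.yml to the given path (default: ./.gitlab-ci.yml).
write_example_ci() {
    cat > "${1:-.gitlab-ci.yml}" <<'EOF'
stages:
  - build
  - test

build-backend:
  stage: build
  image: composer:2            # hypothetical image choice
  script:
    - composer install --no-interaction
  cache:
    key: composer
    paths:
      - vendor/

build-frontend:
  stage: build
  image: node:14               # hypothetical image choice
  script:
    - npm ci --cache .npm --prefer-offline
  cache:
    key: npm
    paths:
      - .npm/

quality-gate:
  stage: test
  # no image given: the runner's default image from config.toml is used
  script:
    - echo "run tests and code quality checks here"
EOF
}
```

The cache entries are what ends up in the shared S3 cache on the broker; each image named here has to be pulled by the runner machine before the job can start.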
And we find that pulling the docker image is now one of the most time consuming parts of any job we do. We need a solution for this as well.
Caching Docker Images
To do this, we set up a proxy on the Broker that acts like a docker registry for the Runners. If the broker has the desired image in its cache, it is delivered directly; otherwise the broker requests it from the official registry, passes it on and, of course, caches it.
Sounds complicated, but it's not:
sudo docker run -d -p 6005:5000 -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io --restart always --name registry registry:2
does the job. ¯\_(ツ)_/¯
In the config.toml of the broker we make this new repository known to the runners:
  [runners.machine]
    ...
    MachineOptions = ["hetzner-api-token=****************", "hetzner-image=ubuntu-18.04", "hetzner-server-type=cx21", "engine-registry-mirror=http://###IP_BROKER###:6005"]
    ...
and are happy about much faster pipelines.
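Whether the mirror is really used can be checked from both sides. A small sketch, assuming curl is available and that the registry's catalog endpoint lists the repositories it has cached so far; the function names and the DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# Print the command instead of executing it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = broker IP. Asks the registry on port 6005 which repositories it holds.
list_cached_images() {
    run curl -fsS "http://$1:6005/v2/_catalog"
}

# Run on a runner machine: shows which registry mirrors the docker daemon uses.
show_configured_mirrors() {
    run docker info --format '{{.RegistryConfig.Mirrors}}'
}
```

After the first pipeline has run, list_cached_images should return the images that were pulled through the broker.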
Important note: On 12/10/2020, a docker update was released which causes problems with the docker+machine Gitlab runners described above. I suspect that the bug will be fixed soon, but since most of us depend on working pipelines now, I am happy to share my findings from yesterday's working day:
Essentially, the problem is that the Gitlab Runner starts a new machine as requested, waits there for the docker stack, and fails to find it. The machine is then discarded (but, annoyingly, not deleted; it continues to run) and a new one is started, which then does the same thing. The pipelines in Gitlab are stuck and just won't run because there is no machine. The last message in Gitlab was usually the line
Preparing the "docker+machine" executor
and then nothing happened.
Debugging the whole setup systematically is not that easy, because quite a few things can go wrong in many places. I was also led astray by a network problem with the cloud servers that HETZNER reported. As it turned out, this was, as so often, only a coincidence in timing, and at least my runners were not affected by this network segment.
What finally got me on the right track was not only looking at the logs (the Gitlab broker logs to /var/log/syslog), but also shutting down the Gitlab broker and running it in the foreground instead of as a daemon:
docker-compose down
docker-compose up
There I came across the error messages
WARNING: Failed to process runner builds=1 error=failed to update executor: no free machines that can process builds executor=docker+machine
and from there to the hint that the new docker update is causing these problems, probably because of a timing problem: the Gitlab runners apparently don't wait long enough for the Docker stack.
It remains to be seen whether Docker or Gitlab will provide an update for this. To get the pipelines running again, the previous Docker version can be used. The machines are automatically provisioned, so we have to set the appropriate configuration in config.toml (see above):
MachineOptions = ["hetzner-api-token=*****", ..., "engine-install-url=https://releases.rancher.com/install-docker/19.03.9.sh"]
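Since the discarded machines keep running (and costing money), they have to be cleaned up by hand. The following sketch assumes that the hcloud command line tool is installed and authenticated for the project, which is not part of the setup described here. It relies on the server names starting with runner-, as configured in MachineName. It deletes all such servers, including healthy ones, so it should only be used while the broker is stopped.

```shell
#!/bin/sh
# Read server names on stdin and print only those created by the broker.
# The prefix follows MachineName = "runner-%s" from config.toml.
filter_runner_machines() {
    grep '^runner-' || true
}

# List candidate servers; requires the hcloud CLI with access to the project.
list_runner_machines() {
    hcloud server list -o noheader -o columns=name | filter_runner_machines
}

# Delete every listed server. Only run this while the broker is stopped,
# otherwise runners that are busy with jobs are removed as well.
delete_runner_machines() {
    list_runner_machines | while read -r name; do
        hcloud server delete "$name"
    done
}
```

It is worth running list_runner_machines first and looking at the output before calling delete_runner_machines.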
I hope this helps some of you who have built a similar setup with the help of this manual.