Running Jenkins on ECS


Running Jenkins with ECS tasks as worker nodes has been documented before, however there aren't any up-to-date examples, nor any that provide separation of the master and slaves. This post describes a fairly up-to-date deployment using the newer deployment techniques offered by AWS. Even if you're not using Jenkins, being able to create CloudFormation templates using the new EC2 launch options is helpful if you're using many spot instances, as you will most likely be experiencing instance type availability fluctuations.

Key features of this deployment

  • Both the worker nodes and the master node run on ECS: the master as a service and the slaves as dynamically added tasks
  • The master node runs on its own dedicated cluster; its file system store is mounted on, and accessible by, the master only
  • Jobs can launch and run Docker build images from within an already running container
  • The worker nodes can also spawn build agents (Docker containers) using the "new" Jenkins pipeline syntax:
    the Jenkins agent runs inside an ECS task, which in turn runs a Docker container on the same host, with the build steps executed inside it
  • It uses EC2 Launch Templates and multiple instance types with Auto Scaling Groups to maximise availability when new jobs trigger (a minimal sketch is shown at the end of this section)
  • As an additional security measure, the master node and worker nodes run on private subnets, directing all communication with Jenkins through its load balancer
This implementation is mainly based on this nice set of CloudFormation templates.
You can access the current work in progress here. Once I've finished testing the generalised version ported out of our build environment, I'll merge it and update this post.
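
To illustrate the launch template point above, here is a minimal CloudFormation sketch of an Auto Scaling group spread across several instance types via a MixedInstancesPolicy. The resource names, instance types and parameters are placeholders, and the user data that joins the instances to the ECS cluster is omitted; the templates linked above are far more complete:

# Minimal sketch: an Auto Scaling group using an EC2 Launch Template with
# several instance types, so capacity can still be found when one spot pool dries up.
Parameters:
  ImageId:
    Type: AWS::EC2::Image::Id        # an ECS-optimised AMI
  Subnets:
    Type: List<AWS::EC2::Subnet::Id>

Resources:
  WorkerLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref ImageId
        InstanceType: m5.large       # default, overridden below
        # UserData joining the ECS cluster and the instance profile are omitted here

  WorkerAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "4"
      VPCZoneIdentifier: !Ref Subnets
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandPercentageAboveBaseCapacity: 0   # run everything on spot
          SpotAllocationStrategy: lowest-price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref WorkerLaunchTemplate
            Version: !GetAtt WorkerLaunchTemplate.LatestVersionNumber
          Overrides:
            - InstanceType: m5.large
            - InstanceType: m5a.large
            - InstanceType: m4.large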

Master node deployment

The Jenkins master runs on its own dedicated ECS cluster. This is key for security, especially when using the Docker plugin or Docker directly, which can mount local filesystem paths. By keeping the master separate, only the workspace of the project being built is present on the worker node, which minimises access to credentials and data belonging to other projects.
EFS is used as the storage device for the Jenkins data directory, as an EBS volume can only be attached to one EC2 instance at a time. Even though you can now mount an EBS volume as a Docker volume, by defining the volume driver in the ECS task definition, the time taken for the EBS volume to detach is quite long. In a failover scenario, this would result in the master either not starting up on the new ECS host or taking a long time to start. EFS eliminates this problem, as it is designed to be shared across multiple EC2 instances.
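
For reference, a minimal sketch of the EFS resources involved. Subnet and security group parameters are placeholders, and the host-side mount (user data on the ECS instances) plus the task definition's volume pointing at the mount path are omitted:

# Minimal sketch: an EFS filesystem for the Jenkins home directory, with one
# mount target per private subnet so any master host can reach it.
Parameters:
  PrivateSubnetA:
    Type: AWS::EC2::Subnet::Id
  PrivateSubnetB:
    Type: AWS::EC2::Subnet::Id
  EfsSecurityGroup:
    Type: AWS::EC2::SecurityGroup::Id   # must allow NFS (TCP 2049) from the master hosts

Resources:
  JenkinsHomeFileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      Encrypted: true

  MountTargetA:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref JenkinsHomeFileSystem
      SubnetId: !Ref PrivateSubnetA
      SecurityGroups:
        - !Ref EfsSecurityGroup

  MountTargetB:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref JenkinsHomeFileSystem
      SubnetId: !Ref PrivateSubnetB
      SecurityGroups:
        - !Ref EfsSecurityGroup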
As the Jenkins master task can move between EC2 hosts during updates or failover, a load balancer is used to provide a consistent DNS name for both the worker nodes and browser access. Application Load Balancers (aka ELB v2) can't be used, as each ECS task can only be attached to one target group at a time, so only one port can be mapped to the ALB. This means a Classic Load Balancer (ELB v1) is used to expose both the HTTP port (8080, for the website and the workers) and the JNLP port (50000, for the workers).
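
A rough sketch of that Classic Load Balancer with both listeners (subnet, security group and health check values are placeholders; the scheme, internal or internet-facing, depends on how you reach the UI and is left at the default here). The master's ECS service then registers against this load balancer by name:

# Minimal sketch: a Classic Load Balancer exposing both the web/HTTP port and
# the JNLP agent port of the Jenkins master.
Parameters:
  LoadBalancerSubnets:
    Type: List<AWS::EC2::Subnet::Id>
  LoadBalancerSecurityGroup:
    Type: AWS::EC2::SecurityGroup::Id

Resources:
  JenkinsLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      Subnets: !Ref LoadBalancerSubnets
      SecurityGroups:
        - !Ref LoadBalancerSecurityGroup
      Listeners:
        - LoadBalancerPort: "8080"    # web UI and worker HTTP traffic
          InstancePort: "8080"
          Protocol: HTTP
        - LoadBalancerPort: "50000"   # JNLP port used by the worker nodes
          InstancePort: "50000"
          Protocol: TCP
      HealthCheck:
        Target: "TCP:8080"
        Interval: "30"
        Timeout: "5"
        HealthyThreshold: "2"
        UnhealthyThreshold: "5"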

Worker nodes deployment

This setup uses ECS tasks for all worker nodes; the nodes are started automatically by the ECS plugin. When a job labelled with a name matching a Jenkins ECS agent template's label is queued, a node is defined and an ECS task is submitted to run the build.
To enable easy use of the Docker plugin, this setup shares "docker.sock" from the host into the container, so that containers launched from within the worker node's container run as siblings via the host's Docker daemon. This has significant security implications, so it is not advised to keep it past the initial setup and design phase of your Jenkins server. A safer way is to use Docker in Docker (as shown here).
It's also worth deploying this stack in each account you have, to drain tasks from any EC2 spot instances that have been given a termination warning. Currently, neither the ECS agent nor the service automatically detects a pending spot instance termination, so tasks keep being allocated to the instance.
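
The core of such a stack can be sketched as an EventBridge rule catching the two-minute interruption warning. The draining Lambda itself is assumed and not shown here; its ARN is a placeholder parameter:

# Minimal sketch: catch the spot interruption warning and hand it to a
# (hypothetical) Lambda that sets the ECS container instance to DRAINING.
Parameters:
  DrainFunctionArn:
    Type: String   # ARN of the Lambda doing the draining (not shown)

Resources:
  SpotInterruptionRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Drain ECS tasks from spot instances that are about to be reclaimed
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      State: ENABLED
      Targets:
        - Arn: !Ref DrainFunctionArn
          Id: drain-spot-instance

  # The rule also needs permission to invoke the Lambda.
  DrainFunctionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref DrainFunctionArn
      Principal: events.amazonaws.com
      SourceArn: !GetAtt SpotInterruptionRule.Arn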

Jenkins Configuration

To configure the ECS worker nodes, go to Manage Jenkins -> Configure System. Under Cloud, select Add a new cloud and choose Amazon EC2 Container Service Cloud. Name it whatever you please, select the appropriate region and fill in the ARN of the Jenkins slave/worker cluster.
Add an ECS agent template, selecting the memory and CPU requirements you need. Here are some of the required fields:
  • Label: default
  • Template Name: jenkins-default
  • Docker Image: jenkins/jnlp-slave
  • Launch Type: EC2
  • Network mode: awsvpc
  • Filesystem root: /home/jenkins/agent
  • Subnets: (the subnet ids of your private subnets)
  • Security groups: (the security group id of…?)
  • Task role ARN: (optionally the ARN of a limited role for this particular build instead of using the EC2’s permissions)
  • Advanced:
    • Logging Driver: awslogs
    • Logging:
      • Name: awslogs-stream-prefix
        Value: ecs
      • Name: awslogs-region
        Value: (your region)
      • Name: awslogs-group
        Value: jenkins.slave (this log group must already exist; see the sketch after this list)
  • Container Mount Points:
    • Name: docker-sock
      Source path: /var/run/docker.sock
      Container Path: /var/run/docker.sock
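
The awslogs-group named above has to exist before the first agent starts. A one-resource CloudFormation sketch for it (the retention period is just an example):

Resources:
  JenkinsSlaveLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: jenkins.slave
      RetentionInDays: 30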

Advantages of this deployment

My goals were to:
  1. Keep Jenkins as isolated as reasonably possible, whilst still keeping it usable as a deployment tool
  2. Easily keep the operating system and the Jenkins master install up to date, with the aim of having a regularly triggered update
  3. Keep the various, sometimes complicated, worker slaves up to date, as rebuilding big AMI images can be quite slow and tedious
Currently, updating the master is a manual task: increase the desired count of the master ASG to 2, update the ECS service with "force new deployment" enabled, and finally scale the master ASG back down once done. This could easily be automated to run once a week, with a Lambda scaling up and triggering the redeployment; a sketch of the scheduled trigger is shown below. Updating the OS is a case of updating the CloudFormation template with the latest AMI ID; the group will then scale out and shut down the old instances.
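
A sketch of what that weekly trigger could look like. The Lambda performing the three steps (scale up, force a new deployment of the ECS service, scale back down) is assumed and not shown, and the usual AWS::Lambda::Permission allowing the rule to invoke it is omitted for brevity:

# Minimal sketch: run the (hypothetical) redeployment Lambda once a week.
Parameters:
  RedeployFunctionArn:
    Type: String

Resources:
  WeeklyMasterRedeploy:
    Type: AWS::Events::Rule
    Properties:
      Description: Weekly redeployment of the Jenkins master onto a fresh host
      ScheduleExpression: rate(7 days)
      State: ENABLED
      Targets:
        - Arn: !Ref RedeployFunctionArn
          Id: jenkins-master-redeploy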
