How Netflix Autoscales CI

Rahul Somasunderam

What does CI look like at Netflix

Jenkins @ Netflix

35 Jenkins controllers
~45k job definitions
~600k builds per week
650-1500 agents
1-100 executors per agent

The Spinnaker view

1 Application
35 stacks (Controller Clusters)
180 Agent Clusters
1+ ASG per cluster
All workloads on AWS

We use Spinnaker for delivery.

Each of our Jenkins controllers runs in its own cluster. We make sure there is only one instance running at a time. If you’re running Jenkins controllers, this should be no different from what you’re used to.

There are 180 different agent clusters, each with a fixed configuration: instance type, labels, region, etc.

We have more than one ASG per cluster because of how we roll out updates. We wait for the new ASG to come online, then mark the old one offline in Jenkins and wait for its builds to drain. As builds drain, we terminate instances. If we see problems, we can roll back to the old ASG. Once the new ASG has been online for a few days and the old one is down to 0 agents, we destroy the old ASG. On very rare occasions there are more than 2 ASGs in a cluster.

Clusters and ASGs

AWS has Auto Scaling Groups
Spinnaker calls them Server Groups
<Application>-<Stack>-<Detail>-v<Version>
jenkins-unstable-agent-highlander-v123

<step>

AWS has Auto Scaling Groups - ASGs for short. On each ASG, you can set a min and a max. AWS then works out what the desired size is, and adjusts the current size by either launching new instances or terminating running ones.
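As a rough illustration of the min/max/desired relationship, here is a minimal sketch in Python. The `ScalingGroup` class and its `reconcile` method are hypothetical names made up for this example; they are not the AWS API, and real ASGs launch and terminate instances asynchronously.

```python
from dataclasses import dataclass


@dataclass
class ScalingGroup:
    """Toy stand-in for an ASG (hypothetical, not the AWS API)."""
    min_size: int
    max_size: int
    current_size: int

    def reconcile(self, desired: int) -> int:
        """Clamp `desired` into [min_size, max_size] and apply it.

        Returns the instance delta: positive means launch that many
        instances, negative means terminate that many.
        """
        bounded = max(self.min_size, min(self.max_size, desired))
        delta = bounded - self.current_size
        self.current_size = bounded
        return delta


group = ScalingGroup(min_size=2, max_size=10, current_size=4)
print(group.reconcile(7))   # 3: launch three instances
print(group.reconcile(50))  # 3: capped at max_size, so only three more
print(group.reconcile(0))   # -8: floored at min_size, terminate eight
```

The point of the sketch is only that the desired size is always bounded by the configured min and max, whatever the scaling logic asks for.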

<step>

Spinnaker calls those things Server Groups.

<step>

Spinnaker has a naming convention that maps the Server Group to a more structured coordinate. At the top level, there’s an application. Within each application, you can have multiple stacks. We map these to Jenkins controllers.

Within each stack, you have multiple clusters. The naming relies on a field called detail for that. We put one kind of agent into each cluster.

For example, one cluster would have m5.2xlarge instances and run 4 executors. Another would have m5.4xlarge instances and run 1 executor.

We perform immutable deployments. If we want to update a package, we bake a new Amazon Machine Image and start rolling it out. So each cluster will have multiple Server Groups, each representing a version.

<step>

In this example:

jenkins is the application
unstable is the stack
agent-highlander is the detail
123 is the version

That whole thing is the name of the server group. The cluster is jenkins-unstable-agent-highlander. The version is not part of it.
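The breakdown above can be sketched as a small parser. This is a simplified illustration that assumes every name has all four parts in the `<Application>-<Stack>-<Detail>-v<Version>` form; Spinnaker's own naming logic handles more cases (for example, names with no stack or no detail), and `parse_server_group` and `ServerGroupName` are names invented for this sketch.

```python
import re
from typing import NamedTuple


class ServerGroupName(NamedTuple):
    application: str
    stack: str
    detail: str
    version: int

    @property
    def cluster(self) -> str:
        # The cluster name is everything except the version suffix.
        return f"{self.application}-{self.stack}-{self.detail}"


# Application and stack contain no hyphens; detail may contain hyphens.
_PATTERN = re.compile(
    r"^(?P<app>[^-]+)-(?P<stack>[^-]+)-(?P<detail>.+?)-v(?P<version>\d+)$"
)


def parse_server_group(name: str) -> ServerGroupName:
    """Split a fully qualified server group name into its coordinates."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a four-part server group name: {name!r}")
    return ServerGroupName(
        application=match["app"],
        stack=match["stack"],
        detail=match["detail"],
        version=int(match["version"]),
    )


parsed = parse_server_group("jenkins-unstable-agent-highlander-v123")
print(parsed.application, parsed.stack, parsed.detail, parsed.version)
print(parsed.cluster)  # jenkins-unstable-agent-highlander
```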

How to plan for CI infrastructure

Infinite resources

Provision capacity based on known maximum load
Multiply by a safety factor for good measure
Monitor and change the capacity as load increases
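As a back-of-the-envelope sketch of this approach, the arithmetic might look like the following. The function name and all the numbers are made up for illustration; they are not Netflix figures.

```python
import math


def agents_needed(peak_concurrent_builds: int,
                  executors_per_agent: int,
                  safety_factor: float) -> int:
    """Size a fixed fleet from the known maximum load.

    Each build occupies one executor, so the executor count is the peak
    build count scaled by the safety factor, rounded up to whole agents.
    """
    executors = peak_concurrent_builds * safety_factor
    return math.ceil(executors / executors_per_agent)


# Hypothetical inputs: 300 concurrent builds at peak, 4 executors per
# agent, and a 1.5x safety factor.
print(agents_needed(300, 4, 1.5))  # 113
```

With static provisioning, all of those agents run around the clock, whether or not the peak is actually happening.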

Infinite Patience

Plan capacity based on median load
Builds will sit in the queue for a long time

Instant resources

You will get resources as soon as you request them
Works well with containerizable builds
Not all builds can be containerized
Does not scale well with large numbers of short-lived builds

Autoscaling

Set up minimum and maximum capacity
Scale based on some metric

What Metric to use

System Metrics

CPU/Memory/Disk IO/Network throughput
- Natively supported by cloud providers and most metrics solutions
Scaling Policies are supported by cloud providers
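To make the idea of scaling on a system metric concrete, here is a minimal sketch of a proportional rule driven by average CPU. It is an assumption chosen for illustration, not the exact algorithm any cloud provider uses and not necessarily what was used for the Jenkins agents; `desired_from_cpu` and its parameters are hypothetical.

```python
import math


def desired_from_cpu(current_instances: int,
                     avg_cpu_percent: float,
                     target_cpu_percent: float,
                     min_size: int,
                     max_size: int) -> int:
    """Pick a fleet size that would bring average CPU near the target.

    Assumes load spreads evenly across instances, so the total work is
    roughly current_instances * avg_cpu_percent.
    """
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    raw = math.ceil(current_instances * avg_cpu_percent / target_cpu_percent)
    # The result is always bounded by the configured min and max.
    return max(min_size, min(max_size, raw))


print(desired_from_cpu(10, 80.0, 50.0, min_size=2, max_size=40))  # 16
print(desired_from_cpu(10, 20.0, 50.0, min_size=2, max_size=40))  # 4
```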