How Netflix Autoscales CI

Rahul Somasunderam

What does CI look like at Netflix

Jenkins @ Netflix

  • 35 Jenkins controllers

  • ~45k job definitions

  • ~600k builds per week

  • 650-1500 agents

  • 1-100 executors per agent

The Spinnaker view

  • 1 Application

  • 35 stacks (Controller Clusters)

  • 180 Agent Clusters

  • 1+ ASG per cluster

  • All workloads on AWS

Clusters and ASGs

  • AWS has Auto Scaling Groups

  • Spinnaker calls them Server Groups

  • <Application>-<Stack>-<Detail>-v<Version>

  • jenkins-unstable-agent-highlander-v123
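The naming convention above can be split back into its parts, which is handy when grouping ASGs by cluster. The following is a minimal sketch, assuming all four parts are present and that only the detail may contain hyphens; real Spinnaker names can omit the stack or detail, and the helper names here are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ServerGroupName(NamedTuple):
    application: str
    stack: str
    detail: str
    version: int


# Simplified pattern (assumption): application and stack contain no hyphens,
# the detail may contain hyphens, and the name ends in -v<digits>.
_PATTERN = re.compile(
    r"^(?P<app>[^-]+)-(?P<stack>[^-]*)-(?P<detail>.*)-v(?P<version>\d+)$"
)


def parse_server_group(name: str) -> Optional[ServerGroupName]:
    """Split a server group name into its parts, or return None if it does not fit."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ServerGroupName(
        match["app"], match["stack"], match["detail"], int(match["version"])
    )


def cluster_name(group: ServerGroupName) -> str:
    """The cluster is the server group name without the version suffix."""
    return f"{group.application}-{group.stack}-{group.detail}"


parsed = parse_server_group("jenkins-unstable-agent-highlander-v123")
```

Here `parsed.detail` is `agent-highlander`, and `cluster_name(parsed)` gives the cluster shared by every version of that ASG.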

How to plan for CI infrastructure

Infinite resources

  • Provision capacity based on known maximum load

  • Multiply by a safety factor for good measure

  • Monitor and change the capacity as load increases

Infinite patience

  • Plan capacity based on median load

  • Builds will sit in the queue for a long time
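To make the gap between the two fixed-capacity strategies concrete, here is a small sketch. The load samples and the safety factor of 1.5 are made-up illustrative values, not Netflix data.

```python
import math
import statistics


def peak_based_capacity(samples, safety_factor=1.5):
    """'Infinite resources': provision for the known maximum times a safety factor."""
    return math.ceil(max(samples) * safety_factor)


def median_based_capacity(samples):
    """'Infinite patience': provision for the median load and let builds queue."""
    return math.ceil(statistics.median(samples))


# Hypothetical executors-in-demand samples with one spike.
load = [10, 12, 15, 11, 80, 14, 13]

peak_plan = peak_based_capacity(load)      # sized for the spike, mostly idle
median_plan = median_based_capacity(load)  # cheap, but the spike has to queue
```

With these samples the peak-based plan provisions 120 executors while the median-based plan provisions 13, so one wastes capacity most of the time and the other leaves the spike waiting.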

Instant resources

  • You will get resources as soon as you request them

  • Works well with containerizable builds

  • Not all builds can be containerized

  • Does not scale well with large numbers of short-lived builds

Autoscaling

  • Set up minimum and maximum capacity

  • Scale based on some metric

What Metric to use

System Metrics

  • CPU/Memory/Disk IO/Network throughput

    • Natively supported by cloud providers and most metrics solutions

  • Scaling Policies are supported by cloud providers

Not very useful for CI

Queue Depth

  • Queue depth seems adequately proportional to load.

  • However, it is a trailing metric: it only rises once builds are already waiting.

Agent Utilization

  • For each agent, count the idle, busy, and offline executors.

  • Sum these up by ASG.

  • Compute utilization as $(\text{busy} + \text{offline}) / (\text{busy} + \text{offline} + \text{idle})$
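The steps above can be sketched as follows. The `AgentExecutors` record and the function name are hypothetical; in practice the counts would come from Jenkins and the result would be published to a metrics service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentExecutors:
    asg: str       # the ASG this agent belongs to
    idle: int
    busy: int
    offline: int


def utilization_by_asg(agents):
    """Sum executor counts per ASG, then compute (busy + offline) / total."""
    totals = defaultdict(lambda: [0, 0, 0])  # asg -> [idle, busy, offline]
    for agent in agents:
        counts = totals[agent.asg]
        counts[0] += agent.idle
        counts[1] += agent.busy
        counts[2] += agent.offline

    result = {}
    for asg, (idle, busy, offline) in totals.items():
        total = idle + busy + offline
        # An ASG with no executors at all is reported as 0.0 (assumption).
        result[asg] = (busy + offline) / total if total else 0.0
    return result


agents = [
    AgentExecutors("jenkins-unstable-agent-highlander-v123", idle=2, busy=5, offline=1),
    AgentExecutors("jenkins-unstable-agent-highlander-v123", idle=2, busy=0, offline=0),
]
utilization = utilization_by_asg(agents)
```

Offline executors count toward the numerator because they cannot take work, so an ASG full of offline agents looks fully utilized rather than idle.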

Measuring Agent Utilization

An agent’s ASG

When launching agents, use labels to specify the placement of the agent.
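One way to recover an agent's ASG from its labels is sketched below. Jenkins labels are whitespace-separated strings; the `asg:` prefix is a made-up convention for illustration, since the exact label format is not given here.

```python
from typing import Optional

# Hypothetical convention: one label of the form "asg:<server group name>".
ASG_LABEL_PREFIX = "asg:"


def asg_from_labels(label_string: str) -> Optional[str]:
    """Return the ASG named in an agent's label string, or None if absent."""
    for label in label_string.split():
        if label.startswith(ASG_LABEL_PREFIX):
            return label[len(ASG_LABEL_PREFIX):]
    return None


asg = asg_from_labels("linux docker asg:jenkins-unstable-agent-highlander-v123")
```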

[Figure: AgentHighlighted]

Capturing Metrics

We wrote a custom plugin that plays well with Atlas. You could write one for whatever your metrics capturing service is.

Autoscaling

How to Autoscale

AWS offers 2 ways to scale

  • Target Tracking

  • Step Scaling
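The difference between the two policy types can be illustrated with a simplified model. This is a sketch of the general idea, not AWS's exact algorithm: target tracking is approximated as proportional resizing toward a target metric value, and step scaling as a lookup of fixed adjustments keyed by how far the metric is above the alarm threshold. All numbers and function names are illustrative.

```python
import math


def _clamp(value, minimum, maximum):
    """Keep the desired capacity inside the ASG's min/max bounds."""
    return max(minimum, min(value, maximum))


def target_tracking_capacity(current, metric, target, minimum, maximum):
    """Resize proportionally so the metric would land near the target."""
    desired = math.ceil(current * metric / target)
    return _clamp(desired, minimum, maximum)


def step_scaling_capacity(current, metric, threshold, steps, minimum, maximum):
    """Apply the adjustment of the step whose interval contains the breach.

    steps is a list of (lower, upper, adjustment) tuples, where lower and
    upper are offsets from the threshold and upper=None means unbounded.
    """
    breach = metric - threshold
    for lower, upper, adjustment in steps:
        if breach >= lower and (upper is None or breach < upper):
            return _clamp(current + adjustment, minimum, maximum)
    return current  # metric is below the threshold: no step applies


# Utilization expressed in percent to keep the arithmetic exact.
steps = [(0, 10, 2), (10, None, 5)]  # small breach: +2 agents, large breach: +5

tracked = target_tracking_capacity(current=10, metric=90, target=60, minimum=1, maximum=50)
small_step = step_scaling_capacity(10, 75, 70, steps, minimum=1, maximum=50)
large_step = step_scaling_capacity(10, 95, 70, steps, minimum=1, maximum=50)
```

Target tracking needs only a target value, while step scaling gives explicit control over how aggressively to react to each size of breach.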

When to scale up

[Figure: ScalingPolicy]

How to scale up

[Figure: ScalingPolicy2]

When to scale down

[Figure: ScaleDown]

How to scale down

[Figure: ScaleDownProgress]

Recap

What we learnt

  • This improved the support experience

  • This improved the experience for spiky workloads

Thank you!