Rahul Somasunderam
35 Jenkins controllers
~45k job definitions
~600k builds per week
650-1500 agents
1-100 executors per agent
1 Application
35 stacks (Controller Clusters)
180 Agent Clusters
1+ ASG per cluster
All workloads on AWS
AWS has Auto Scaling Groups
Spinnaker calls them Server Groups
<Application>-<Stack>-<Detail>-v<Version>
jenkins-unstable-agent-highlander-v123
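As an illustration of the naming convention above, here is a minimal sketch that splits a server group name into its parts. The `ServerGroupName` type and `parse_server_group` function are hypothetical names, and the sketch assumes all four components are present and that only the detail may contain hyphens; Spinnaker's actual naming rules are more permissive.

```python
import re
from typing import NamedTuple


class ServerGroupName(NamedTuple):
    application: str
    stack: str
    detail: str
    version: int


def parse_server_group(name: str) -> ServerGroupName:
    """Split <Application>-<Stack>-<Detail>-v<Version> into its components.

    Assumption: application and stack contain no hyphens; detail may.
    """
    match = re.fullmatch(r"(?P<body>.+)-v(?P<version>\d+)", name)
    if match is None:
        raise ValueError(f"no -v<Version> suffix in {name!r}")
    parts = match.group("body").split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected application, stack and detail in {name!r}")
    application, stack, detail = parts
    return ServerGroupName(application, stack, detail, int(match.group("version")))


parsed = parse_server_group("jenkins-unstable-agent-highlander-v123")
print(parsed)
```

Here `jenkins` is the application, `unstable` the stack, `agent-highlander` the detail, and `123` the version.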
Provision capacity based on known maximum load
Multiply by a safety factor for good measure
Monitor and change the capacity as load increases
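The arithmetic behind this static approach can be sketched as follows. The function name and the numbers in the usage line are made up for illustration and are not figures from this deployment.

```python
import math


def static_agent_count(peak_concurrent_builds: int,
                       safety_factor: float,
                       executors_per_agent: int) -> int:
    """Agents needed to cover the known peak load, padded by a safety factor."""
    executors_needed = peak_concurrent_builds * safety_factor
    return math.ceil(executors_needed / executors_per_agent)


# Hypothetical inputs: 1200 concurrent builds at peak, 1.5x safety, 4 executors per agent.
print(static_agent_count(1200, 1.5, 4))
```

The result is paid for around the clock, whether or not the peak ever arrives.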
Plan capacity based on median load
Builds will sit in the queue for a long time
You get resources as soon as you request them
Works well with containerizable builds
Not all builds can be containerized
Does not scale well with large numbers of short-lived builds
Set up minimum and maximum capacity
Scale based on some metric
CPU/Memory/Disk IO/Network throughput
Natively supported by cloud providers and most metrics solutions
Scaling Policies are supported by cloud providers
Not very useful for CI
Queue Depth seems adequately proportional.
However, it is a trailing metric.
For each agent, find [idle, busy, offline] executors.
Sum these up by ASG.
Compute utilization as $(\text{busy} + \text{offline}) / (\text{busy} + \text{offline} + \text{idle})$
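The three steps above can be sketched as below. The `AgentExecutors` record and `utilization_by_asg` function are hypothetical; how the executor counts are read from Jenkins and how each agent is mapped to its ASG are left out and assumed to happen elsewhere.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class AgentExecutors(NamedTuple):
    asg: str      # ASG (server group) the agent belongs to
    idle: int
    busy: int
    offline: int


def utilization_by_asg(agents: Iterable[AgentExecutors]) -> Dict[str, float]:
    """Sum executor states per ASG, then compute (busy + offline) / total."""
    totals = defaultdict(lambda: [0, 0, 0])  # [idle, busy, offline]
    for agent in agents:
        counts = totals[agent.asg]
        counts[0] += agent.idle
        counts[1] += agent.busy
        counts[2] += agent.offline

    result = {}
    for asg, (idle, busy, offline) in totals.items():
        total = idle + busy + offline
        # An ASG with no executors at all is reported as 0.0 (an assumption).
        result[asg] = (busy + offline) / total if total else 0.0
    return result


sample = [
    AgentExecutors("jenkins-unstable-agent-highlander-v123", idle=2, busy=5, offline=1),
    AgentExecutors("jenkins-unstable-agent-highlander-v123", idle=2, busy=0, offline=0),
]
print(utilization_by_asg(sample))
```

Counting offline executors in the numerator treats them as unavailable capacity, so an ASG with many offline executors reads as more heavily utilized and will tend to scale up.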
When launching agents, use labels to specify the placement of the agent.
We wrote a custom plugin that plays well with Atlas. You could write one for whatever metrics service you use.
AWS offers 2 ways to scale
Target Tracking
Step Scaling
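To contrast the two policy types, here is a simplified model of how each could turn a utilization reading into a desired agent count. This is a sketch for intuition, not AWS's actual algorithm: the proportional formula for target tracking is an assumption, and the step boundaries and adjustments in the usage lines are made-up values.

```python
import math
from typing import List, Optional, Tuple

# (lower bound, upper bound or None for open-ended, agents to add)
Step = Tuple[float, Optional[float], int]


def clamp(value: int, min_size: int, max_size: int) -> int:
    return max(min_size, min(max_size, value))


def target_tracking_desired(current: int, utilization: float, target: float,
                            min_size: int, max_size: int) -> int:
    """Resize proportionally so utilization moves toward the target."""
    # round() guards against floating-point noise pushing ceil() up by one.
    desired = math.ceil(round(current * utilization / target, 9))
    return clamp(desired, min_size, max_size)


def step_scaling_desired(current: int, utilization: float, steps: List[Step],
                         min_size: int, max_size: int) -> int:
    """Apply the fixed adjustment of whichever step the utilization falls in."""
    for lower, upper, change in steps:
        if utilization >= lower and (upper is None or utilization < upper):
            return clamp(current + change, min_size, max_size)
    return current  # no step matched: leave capacity unchanged


# Hypothetical policy: add 2 agents at 70-85% utilization, 5 agents above 85%.
steps = [(0.70, 0.85, 2), (0.85, None, 5)]
print(target_tracking_desired(10, 0.75, 0.5, min_size=1, max_size=100))
print(step_scaling_desired(10, 0.9, steps, min_size=1, max_size=100))
```

In this model, target tracking reacts in proportion to how far utilization is from the target, while step scaling lets you choose the size of each reaction explicitly.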
This improved the support experience
This improved the experience for spiky workloads