Flexible Continuous Integration for iOS

Michael Bachand
The Airbnb Tech Blog

How Airbnb leverages AWS, Packer, and Terraform to update macOS on hundreds of CI machines in hours instead of days

By: Michael Bachand, Xianwen Chen

At Airbnb, we run a comprehensive suite of continuous integration (CI) jobs before each iOS code change is merged. These jobs ensure that the main branch remains stable by executing critical developer workflows like building the iOS application and running tests. We also schedule jobs that perform periodic tasks like reporting metrics and uploading artifacts.

Many of our iOS CI jobs execute on Macs, which allows running developer tools provided by Apple. CI jobs for all other platforms at Airbnb execute in containers on Amazon EC2 Linux instances. To fulfill the macOS requirement of iOS CI jobs, we have historically maintained alternate CI infrastructure outside of AWS specifically for iOS development. The introduction of Macs to AWS provided an opportunity for us to rethink our approach to iOS CI.

We designed the next iteration of our iOS CI system in late 2021, finished the migration to the new system in mid 2022, and polished the system through the end of 2022. CI for iOS and all other platforms at Airbnb already leveraged Buildkite for dispatching jobs. Now, we deploy iOS CI infrastructure to AWS using Terraform, which helps align CI for iOS with CI for other platforms at Airbnb.

In this article, we’re excited to share with you details of the flexible and easy-to-maintain iOS CI system that we’ve implemented with Amazon EC2 Mac instances.

Historically, we ran Airbnb iOS CI on physical Macs. We enjoyed the speed of running CI without virtualization, but we paid a substantial maintenance cost to run CI jobs directly on physical hardware. An iOS infrastructure engineer individually logged into over 300 machines to perform administrative tasks like enrolling the Mac in our MDM (Mobile Device Management) tool and upgrading macOS. Manual maintenance requirements limited the scalability of the fleet and consumed engineer time that could be better spent on higher-value projects.

A screenshot of a macOS desktop with many open VNC sessions to remote Mac machines.
An engineer remotely updates several physical Macs to macOS Big Sur. EC2 macOS AMIs have eliminated this manual work.

Our old CI machines were rarely restarted and too often drifted into a bad state. When this happened, the best-case scenario was that an engineer could log into the machine, diagnose what configuration drift was causing issues, and manually bring the machine back to a good state. More commonly, we shut down the corrupted machine so that it could no longer accept new CI jobs. Periodically, we asked the vendor who managed our physical Macs to restore the corrupted machines to a clean installation of macOS. When the machines eventually came back online, we manually re-enrolled each machine in MDM to bring our fleet back to its full capacity.

Updating to a new version of Xcode was quite error-prone as well. We strive to roll out new Xcode versions regularly since many iOS engineers at Airbnb follow Swift and Xcode releases closely and are eager to adopt new language features and IDE improvements. However, the fixed capacity of our Mac fleet made it difficult for us to verify iOS CI jobs thoroughly against new versions; any machine allocated to testing a new version of Xcode could no longer accept CI jobs from the previous Xcode version. The risk of tackling each Xcode update was increased by the fact that rolling back to a previous version of Xcode across our fleet was not practical.

When evaluating AWS, we were excited by the possibility of launching instances from Amazon Machine Images (AMIs). An AMI is a snapshot of an instance’s state, including its file system contents and other metadata. Amazon provides base AMIs for each macOS version and allows customers to create their own AMIs from running instances.

AMIs allow us to add new instances to our fleet without human intervention. An EC2 Mac bare-metal instance launched from a properly configured AMI is immediately ready to accept new work after initialization. When updating macOS, we no longer need to log into every machine in our fleet. Instead, we log into a single instance launched from the Amazon base AMI for the new macOS version. After performing a handful of manual configuration steps, like enabling automatic login, we create an Airbnb base AMI from that instance.

Initially, we powered our EC2 Mac fleet with manually created AMIs. An engineer would configure a single instance and create an AMI from that instance’s state. Then we could launch any number of additional instances from that AMI. This was a major improvement over managing physical machines since we could spin up an entire fleet of identical instances after configuring only a single instance successfully.

Now, we build AMIs using Packer. Packer programmatically launches and configures an EC2 instance using a template defined in the HashiCorp configuration language (HCL). Packer then creates an AMI from the configured EC2 instance. A Ruby wrapper script invokes Packer consistently and performs helpful validations like checking that the user has assumed the correct AWS role. We check the HCL template code into source control, and all changes to our Packer template and companion scripts are made via GitHub pull requests.

Timing statistics for creating a new Arm AMI with Packer. This command ran on an EC2 mac2.metal instance.

We initially ran Packer from developer laptops, but the laptop needed to be awake and online for the duration of the Packer build. Eventually, we created a dedicated pipeline to build AMIs in the cloud. A developer can trigger a new build on this pipeline with a few clicks. A successful build will produce freshly baked and verified AMIs for both the x86 and Arm (Apple Silicon) CPU architectures within a few hours.

Our new CI system leveraging these AMIs consists of many environments, each of which can be managed independently. The central AWS component of each CI environment is an Auto Scaling group, which is responsible for launching the EC2 Mac instances. The number of instances in the Auto Scaling group is determined by the desired capacity property on the group and is bounded by min and max size properties.
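To make the relationship between desired capacity and the min and max size properties concrete, here is a minimal Python sketch. It is a toy model and not a call into any AWS API: the class name `CapacityBounds`, its `clamp` method, and the numbers are all hypothetical, and clamping is simply one way a caller might keep a request inside the bounds.

```python
from dataclasses import dataclass


@dataclass
class CapacityBounds:
    """Hypothetical stand-in for an Auto Scaling group's size settings."""

    min_size: int
    max_size: int

    def clamp(self, requested: int) -> int:
        # Keep the requested instance count inside [min_size, max_size].
        return max(self.min_size, min(self.max_size, requested))


# Illustrative values only; the article does not state real fleet sizes.
bounds = CapacityBounds(min_size=0, max_size=40)
above_ceiling = bounds.clamp(55)  # request exceeds max_size
inside_bounds = bounds.clamp(12)  # request is already valid
below_floor = bounds.clamp(-3)    # request is below min_size
```

In this model, the desired capacity chosen by a scaling component can move freely, but only within the range that the environment’s owner has configured.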

An Auto Scaling group creates new instances using a launch template. The launch template specifies the configuration of each instance, including the AMI, and allows a “user data” script to run when the instance is launched. Launch templates can be versioned, and each Auto Scaling group is configured to launch instances from a specific version of its launch template.

Although the introduction of environments has made our CI topology more complex, we find that complexity manageable when our infrastructure is defined in code. All of our AWS infrastructure for iOS CI is specified in Terraform code that we check into source control. Each time we merge a pull request related to iOS CI, Terraform Enterprise will automatically apply our changes to our AWS account. We’ve defined a Terraform module that we can call whenever we want to instantiate a new CI environment.

Calling a Terraform module to create a CI environment of Arm Mac Minis with Xcode 14.2 installed.

An internal scaling service manages the desired capacity of each environment’s Auto Scaling group. This service, a modified fork of buildkite-agent-scaler, increases the desired capacity of an environment’s Auto Scaling group as CI job volume for that environment increases. We specify a maximum number of instances for each CI environment in part because On-Demand EC2 Mac Dedicated Hosts currently have a minimum host allocation and billing duration of 24 hours.
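The scale-out behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the logic of buildkite-agent-scaler or of our internal fork: the function name, its parameters, the idea of sizing the fleet from waiting plus running jobs, and the `agents_per_instance` setting are all hypothetical.

```python
import math


def scaled_out_capacity(
    current: int,
    waiting_jobs: int,
    running_jobs: int,
    max_size: int,
    agents_per_instance: int = 1,  # assumption: one agent per Mac instance
) -> int:
    """Return a new desired capacity that never drops below `current`."""
    # Instances needed to serve every waiting and running job at once.
    needed = math.ceil((waiting_jobs + running_jobs) / agents_per_instance)
    # Respect the environment's ceiling, and only ever scale out here;
    # in this sketch, capacity goes down elsewhere (for example when
    # instances are terminated).
    return max(current, min(needed, max_size))


# Illustrative numbers only.
busy = scaled_out_capacity(current=5, waiting_jobs=18, running_jobs=6, max_size=20)
quiet = scaled_out_capacity(current=5, waiting_jobs=1, running_jobs=2, max_size=20)
```

Capping the result at `max_size` matters for Mac hosts in particular, because every extra Dedicated Host that a spike allocates is billed for at least 24 hours.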

A diagram showing the relationship between CI environments, the scaling service, and Buildkite.
A sketch of Airbnb’s new iOS CI system.

Each CI environment has a unique Buildkite queue name. Individual CI jobs can target instances in a specific environment by specifying the corresponding queue name. Jobs will fall back to the default CI environment when no queue name is explicitly specified.
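The fallback rule for queue names can be expressed as a small Python function. The queue names below are invented for illustration and are not our real queue names; the function only models the routing decision and does not talk to Buildkite.

```python
from typing import Optional

# Hypothetical queue name for the default CI environment.
DEFAULT_QUEUE = "ios-default"


def resolve_queue(requested: Optional[str]) -> str:
    """Pick the queue for a job, falling back to the default environment."""
    # A missing or empty queue name means the job did not target
    # a specific environment.
    return requested if requested else DEFAULT_QUEUE


# A job that opts into a specific (hypothetical) environment.
targeted = resolve_queue("ios-arm64-xcode-14-3")
# A job that specifies nothing lands on the default environment.
untargeted = resolve_queue(None)
```

Because targeting is opt-in per job, a new environment can receive a handful of jobs for testing while all other jobs continue to run on the default environment.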

CI Environments Are Highly Flexible

With this new Terraform setup, we’re able to support an arbitrary number of CI environments with minimal overhead. We create a new CI environment per CPU architecture and version of Xcode. We can even duplicate these environments across multiple versions of macOS when performing an operating system update across our fleet. We use dedicated staging environments to test CI jobs on instances launched from a new AMI before we roll out that AMI broadly.

When we are no longer regularly using a CI environment, we can specify a minimum capacity of zero when calling the Terraform module, which will set the same value on the underlying Auto Scaling group. Then the Auto Scaling group will only launch instances when its desired capacity is increased by the scaling service. In practice, we tend to delete older environments from our Terraform code. However, even once an environment has been wound down, reinstating that environment is as simple as reverting a few commits in Git and redeploying the scaling service.

Rotation of Instances Increases CI Consistency

To minimize the opportunity for EC2 instances to drift, we terminate all instances each night and replace them daily. This way, we can be confident that our CI fleet is in a known good state at the start of each day.

When an instance is terminated, the underlying Dedicated Host is scrubbed before a new instance can be launched on that host. We terminate instances at a time when CI demand is low to allow for the EC2 Mac scrubbing process to complete before we need to launch fresh instances on the same hosts. When an instance terminates itself overnight, it will decrement the desired capacity of the Auto Scaling group to which it belongs. As engineers start pushing commits the next day, the scaling service will increment the desired capacity on the appropriate Auto Scaling groups, causing new instances to be launched.
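The nightly rotation can be pictured as two opposing adjustments to one number. The Python sketch below simulates that bookkeeping only; the `Environment` class, its method names, and the sizes are hypothetical, and nothing here launches or terminates real instances.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical model of one CI environment's Auto Scaling group."""

    desired_capacity: int
    max_size: int
    min_size: int = 0

    def instance_self_terminates(self) -> None:
        # Lowering desired capacity alongside the termination means the
        # group in this model has no reason to launch a replacement.
        self.desired_capacity = max(self.min_size, self.desired_capacity - 1)

    def scaler_requests_instance(self) -> None:
        # The scaling service raises capacity as jobs arrive, up to the cap.
        self.desired_capacity = min(self.max_size, self.desired_capacity + 1)


env = Environment(desired_capacity=3, max_size=4)

# Overnight: every running instance terminates itself.
for _ in range(3):
    env.instance_self_terminates()
overnight_capacity = env.desired_capacity

# Next morning: five jobs arrive, but the environment is capped at four.
for _ in range(5):
    env.scaler_requests_instance()
morning_capacity = env.desired_capacity
```

The quiet period between the two loops is what gives the host scrubbing process time to finish before fresh instances are needed.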

A chart showing CI capacity relative to job volume over more than one week.
Instances terminate themselves overnight. We reduce our maximum capacity over weekends. The spikes in job volume that increased capacity on the 2nd, 6th, and 7th have been hidden by smoothing in the chart.

When an instance does experience configuration drift, we can disconnect that instance from Buildkite with one click. The instance will remain running but will no longer accept new CI jobs. An engineer can log into the instance to investigate its state until the instance is eventually terminated at the end of the day. To keep overall CI capacity stable, we can manually add an additional instance to our fleet, or a replacement will be launched automatically if we terminate the instance early.

We Ship Xcode Versions More Quickly

We appreciate the new capabilities of our upgraded CI system. We can lease additional Dedicated Hosts from Amazon on demand to weather unexpected spikes in CI usage and to test software updates thoroughly. We roll out new AMIs gradually and can roll back painlessly if we encounter unexpected issues.

A chart showing CI capacity relative to job volume for two simultaneous versions of Xcode.
CI jobs shift from Xcode 14.1 to 14.2. On the 24th, we temporarily increased 14.2 capacity to accommodate a spike in jobs.

Together, these capabilities give Airbnb iOS developers access to Swift language features and Xcode IDE improvements more quickly. In fact, with the tailwind of our new CI system, we have seen the pace at which we update Xcode increase by over 20%. As of the time of writing, we have internally rolled out all available major and minor versions of Xcode 14 (14.0–14.3) as they’ve been released.

Our new CI system ran over 10 million minutes of CI jobs in the last three months of 2022. After upgrading to EC2, we spend meaningfully fewer hours on maintenance despite a growing codebase and consistently high job volume. Our newfound ability to scale CI to meet the evolving needs of the Airbnb iOS community justifies the increased complexity of the rebuilt system.

After the migration to AWS, iOS CI benefits more from shared infrastructure that’s already being used successfully within Airbnb. For example, the new iOS CI architecture enabled us to avoid implementing an iOS-specific solution for automatically scaling capacity. Instead, we leverage the aforementioned fork of buildkite-agent-scaler that Airbnb engineers had already converted to an internal Airbnb service complete with a dedicated deployment pipeline. Additionally, we used existing Terraform modules that are maintained by other teams to integrate with IAM and SSM.

We’ve found that EC2 Mac instances launched from custom AMIs provide many of the benefits of virtualization without the performance penalty of executing within a virtual machine. We consider AWS, Packer, and Terraform to be essential technologies for building a flexible CI system for large-scale iOS development in 2023.