Farewell to the Era of Cheap EC2 Spot Instances
AWS spot instances have long been a source of cheap, preemptible compute. Are macroeconomic conditions changing that?
AWS Spot instances offer a discount off on-demand rates in exchange for the risk of instance preemption (running instances stopped) and unavailability (no capacity to start instances). I have long held the belief that spot instances are underpriced, as most tenants underestimate their tolerance for preemption or overestimate its likelihood. In reality, instance preemption is rare in most instance families, and tolerating a large set of different instance types (such as by taking advantage of equivalent instances) can allow drastic savings over on-demand with minimal negative impact. A well-designed spot placement strategy has generally been able to achieve much of the max savings (“up to 90%”) that Amazon claims. In effect, spot compute leverages a market inefficiency in which other tenants pay too much.
It appears, however, that this is changing, and quickly. As macroeconomic conditions change and businesses look to trim costs, cloud bills are on the chopping block. It has recently been reported that Amazon is trying to help customers cut costs through more efficient deployments. Third-party data can give us some idea where this optimization is occurring: on-demand instance usage by Vantage customers dropped to its lowest level ever (around 30%). Amazon is likely pushing a similar trend globally, encouraging customers to switch to savings plans/reserved instances and spot instances.
[Figure: t4g.nano spot prices]
At the same time, if you’re already using spot instances you may have noticed your savings start to dry up. I know I did! So naturally I went out and tried to measure it. Is this spike in spot prices just a blip on the radar, or have macroeconomic conditions also had an effect on spot pricing?
Spot price trends
Any good experiment has a hypothesis. Today, we’re testing the hypothesis that there’s only so much spot savings to go around. By this I mean in aggregate: as more tenants move to spot instances, will Amazon bring in less money, or is there some fixed amount of overall spot discount that is “split” between customers? The last couple of months suggest that the hypothesis holds, and that as more businesses use spot instances the savings will continue to diminish. For example:
This figure shows aggregate spot ratios (the mean ratio of spot price to on-demand price, across availability zones/instance types) within each region over the past year. These values shift with supply and demand constantly, but historically have kept within a tight range. That is, until the past few months. Since the start of 2023, spot ratios have spiked as much as 55% (in us-east-1). In four of the largest AWS regions[1], prices have skyrocketed, and others may not be far behind.
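To make the "spot ratio" metric concrete, here is a minimal sketch of how such an aggregate could be computed. The record layout and the sample prices are invented for illustration, and the sketch assumes every (availability zone, instance type) pair is weighted equally, which may differ from how the figure was actually produced.

```python
from collections import defaultdict
from statistics import mean


def aggregate_spot_ratios(records):
    """Return the mean spot/on-demand price ratio for each region.

    `records` is an iterable of hypothetical tuples:
    (region, availability_zone, instance_type, spot_price, on_demand_price)
    """
    ratios_by_region = defaultdict(list)
    for region, _zone, _instance_type, spot, on_demand in records:
        if on_demand <= 0:
            continue  # skip rows without a usable on-demand price
        ratios_by_region[region].append(spot / on_demand)
    return {region: mean(ratios) for region, ratios in ratios_by_region.items()}


# Invented prices for illustration only; these are not measurements.
sample = [
    ("us-east-1", "us-east-1a", "t4g.nano", 0.002, 0.004),
    ("us-east-1", "us-east-1b", "t4g.nano", 0.004, 0.004),
    ("eu-west-1", "eu-west-1a", "t4g.nano", 0.001, 0.004),
]
for region, ratio in sorted(aggregate_spot_ratios(sample).items()):
    print(f"{region}: {ratio:.2f}")
```

A ratio of 1.0 means spot costs as much as on-demand, so a rising ratio is the same thing as a shrinking discount.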
Regardless of the cause, these increased prices are surely putting pressure on cloud customers’ compute budgets. Next, we’ll take a deeper dive to try to pin down why they’re occurring, and offer some ideas on what to do about it.
Inferring demand: preemption rates
Without more details, one might suspect that this is just Amazon playing with pricing levers. One way to distinguish a pure pricing change from a genuine rise in demand is to look at the underlying infrastructure: if spot instance demand is actually increasing, then we would expect to see an increase in instance preemptions.
Thankfully, our research group manages a large-scale deployment of t4 spot instances. Between October 2022 and today, we’ve provisioned 5.5 million spot instances on AWS across all regions. We use these to study how cloud providers allocate IP addresses and how that allocation affects tenant security. Each server runs for 10 minutes before shutting down, but we also track when instances get preempted. Here’s a plot of preemption rates for spot instances in regions with price spikes vs. without:
We see a near quadrupling in preemption rates in the span of just a few months! Note that our spot pools are diversified across instance families, availability zones, and in this case even regions! The data suggest a marked increase in aggregate spot instance demand. Note also that these preemption rates are measured over the 10-minute windows that instances are deployed for, so the odds of a longer-running instance being preempted at some point in a given month are much higher!
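To see how a small per-window rate compounds over a month, here is a rough sketch. It assumes preemptions in successive 10-minute windows are independent and equally likely, which is a simplification, and the 0.1% rate below is a made-up input rather than a value from our data.

```python
def prob_preempted(per_window_rate, window_minutes=10, horizon_days=30):
    """Probability of at least one preemption over the horizon.

    Assumes every window is independent with the same preemption
    probability (a simplifying assumption, not a measured property).
    """
    windows = horizon_days * 24 * 60 / window_minutes
    survive_all_windows = (1 - per_window_rate) ** windows
    return 1 - survive_all_windows


# Hypothetical per-window rate of 0.1%, chosen only for illustration.
monthly = prob_preempted(0.001)
print(f"Chance of at least one preemption in 30 days: {monthly:.1%}")
```

Even a per-window rate that looks negligible adds up to a near-certain preemption for an instance that runs all month, because a 30-day month contains 4,320 such windows.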
What instance families are most affected?
Of course, an increase in demand may not affect all instance types equally. Let’s look at spot price changes by instance family. In this table, note that prices are normalized as a fraction of the on-demand cost:
|Family|Price 2023-01-01|Price 2023-05-01|Change|
There’s quite a bit of range here, though the overall trend is upward. The price reductions are more of an exception that proves the rule, as older families (such as t1) have compute performance too poor to be price competitive.
Managing spot instance cost
Overall, I think we’ve entered a race to the bottom in terms of spot discounts. For now, you can likely still reduce your bills by diversifying across equivalent instances in other families. The largest price win you can get would be to move to a region that has not (yet!) seen price spikes, but this is probably not feasible for a lot of architectures.
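As a rough illustration of diversifying across equivalent instances, the sketch below filters a list of spot pools by minimum vCPU and memory requirements and keeps the cheapest few. The `SpotPool` type, the `cheapest_pools` helper, and all prices are hypothetical; in practice the prices would come from your own spot price history data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotPool:
    instance_type: str
    availability_zone: str
    vcpus: int
    memory_gib: float
    spot_price: float  # USD per hour (invented values below)


def cheapest_pools(pools, min_vcpus, min_memory_gib, count=3):
    """Return up to `count` cheapest pools that meet the resource floor."""
    eligible = [
        p for p in pools
        if p.vcpus >= min_vcpus and p.memory_gib >= min_memory_gib
    ]
    return sorted(eligible, key=lambda p: p.spot_price)[:count]


# Instance sizes are real; the prices are made up for illustration.
pools = [
    SpotPool("t4g.medium", "us-east-1a", 2, 4, 0.0150),
    SpotPool("t3a.medium", "us-east-1a", 2, 4, 0.0120),
    SpotPool("t3.medium", "us-east-1b", 2, 4, 0.0135),
    SpotPool("m6g.medium", "us-east-1a", 1, 4, 0.0100),  # too few vCPUs
    SpotPool("c6g.large", "us-east-1b", 2, 4, 0.0300),
]
for pool in cheapest_pools(pools, min_vcpus=2, min_memory_gib=4, count=2):
    print(pool.instance_type, pool.availability_zone, pool.spot_price)
```

Note that "equivalent" depends on the workload: t4g and c6g are Arm-based Graviton families, so they only count as substitutes for t3 or t3a if your software runs on both architectures.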
On the plus side, spot prices can’t keep rising forever. Under Amazon’s new spot pricing model, it appears that spot prices rarely if ever exceed on-demand prices. If your architecture is designed to accept preemption and the cost of provisioning replacement instances isn’t too high (e.g., compute time to start a new instance), you probably won’t end up worse off than if you used on-demand instances. Using spot instances can also be seen as a sort of built-in chaos monkey, since even on-demand instances can fail. However, as the discount on spot instances erodes, it may make less and less sense to invest in the resilient design needed to use them.
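One way to reason about that trade-off is to fold the replacement overhead into an effective spot price and compare it with on-demand. The model below is a deliberately simple sketch: it assumes each preemption wastes a fixed number of paid minutes on restart, and all of the input numbers are hypothetical.

```python
def effective_spot_price(spot_price, preemptions_per_hour, restart_minutes):
    """Spot price per hour of useful work, after restart overhead.

    Simplified model (an assumption, not AWS billing logic): every
    preemption costs `restart_minutes` of paid but unproductive time.
    """
    lost_fraction = preemptions_per_hour * restart_minutes / 60
    if not 0 <= lost_fraction < 1:
        raise ValueError("restart overhead must consume less than all time")
    return spot_price / (1 - lost_fraction)


# Hypothetical inputs: prices in USD per hour, chosen for illustration.
on_demand = 0.040
spot = 0.030
effective = effective_spot_price(spot, preemptions_per_hour=0.5, restart_minutes=6)
print(f"effective spot price: {effective:.4f} USD/h")
print("spot still cheaper" if effective < on_demand else "on-demand is cheaper")
```

As the raw spot price approaches the on-demand price, even a modest restart overhead is enough to tip the comparison the other way.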
Spot instances have offered an incredible price advantage to those who are willing to architect around preemptible instances, but the party may be coming to a close. While there are still some savings to be had, organizations should make their architecture decisions in light of these decaying benefits, and more strongly consider approaches with more guaranteed savings (such as Savings Plans) when they have predictable usage.
[1] By size of IPv4 range