Announcement

Restoring the promise of the public cloud for AI

August 13, 2024
Jared Quincy Davis

It’s hard to imagine how the internet would have developed had its pioneers not converged on the layered approach represented by the OSI or TCP/IP models. Building a web app would be extremely difficult if React developers had to concern themselves with notions of packet loss, buffer overflow, and congestion control. 

Unfortunately, analogous abstraction layers are absent for most practitioners at the forefront of AI. They are bogged down by challenges far removed from their core research, including GPU failure, cluster contention, and opaque procurement processes.¹ Long-standing labs like Google DeepMind and OpenAI have invested years into building teams and tools to abstract away this complexity, but such infrastructure is exclusively the domain of these institutions.

We’re launching Foundry Cloud Platform to bring these cutting-edge abstraction mechanisms and unprecedented compute elasticity to every AI practitioner. Join our limited-access release.

Foundry Cloud Platform gives AI practitioners self-serve access to state-of-the-art GPU compute, enabling them to train, fine-tune, and serve models without drawn-out procurement processes and long-term, inflexible contracts. On Foundry, practitioners can reserve compute elastically for as little as 3 hours. Perhaps more importantly, the platform is backed by a monitoring and orchestration system that eliminates the distractions of managing infrastructure, making it simple for researchers to execute and optimize their workloads.
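To make the shape of this concrete, here is a purely hypothetical sketch of what a short, self-serve reservation request might look like through a client library. This is not Foundry’s actual SDK or API; every name and parameter below is illustrative only.

```python
# Hypothetical client sketch: NOT Foundry's actual SDK or API. It only
# illustrates the shape of an elastic reservation: an exact GPU count,
# a short duration, and no long-term contract.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ReservationRequest:
    gpu_type: str        # e.g. "H100" (illustrative)
    gpu_count: int       # reserve exactly what you need
    start: datetime
    duration: timedelta  # elastic: as little as 3 hours

def submit(req: ReservationRequest) -> None:
    # A real client would send this to the platform; here we only enforce
    # the one constraint stated above (a 3-hour minimum) and echo it back.
    if req.duration < timedelta(hours=3):
        raise ValueError("elastic reservations start at 3 hours")
    print(f"requesting {req.gpu_count}x {req.gpu_type} for {req.duration}")

submit(ReservationRequest("H100", 64, datetime.now(), timedelta(hours=12)))
```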

Reserve GPUs without worrying about failure rates

The non-negligible failure rates that accompany the scale and sophistication of today’s state-of-the-art systems are a prevalent yet rarely discussed challenge in AI development. We’ve heard many horror stories of AI teams spending months and millions of dollars to procure thousands of GPUs, only to discover that, once they account for failure rates, they can use only 75% of them.

“The likelihood of a supercomputer running for weeks on end is approximately zero. The reason for that is that there are so many components working at the same time, so the probability of them working continuously is very low… [an] ability to keep the utilization of the supercomputer high, especially when you just spent $2B building it, is super important.”

 – Jensen Huang, CEO, NVIDIA

We want Foundry customers to know that when they reserve 1000 GPUs for X hours (or days, or weeks), they will get 1000 working GPUs for all X hours.

Developers shouldn’t need to guess how many extra nodes to set aside in a “healing buffer” or think about hardware monitoring and failover, so we automate these standard practices. Through Foundry Cloud Platform, customers can reserve exactly the capacity they need for as little as 3 hours and be confident that they can leverage all of it. We algorithmically maintain pools of healing buffer nodes so that if a reserved node fails, we can replace it immediately, and often proactively, keeping instance failures from ever becoming our customers’ problem.
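As a back-of-the-envelope illustration of why automated buffer sizing matters, the sketch below computes how many spare nodes a simple binomial failure model calls for. The model (independent failures at a fixed per-node probability) and every number in it are our own illustrative assumptions, not Foundry’s actual algorithm:

```python
# Illustrative buffer sizing under a naive binomial failure model; this is
# an assumption of ours, not Foundry's algorithm.
from math import comb

def buffer_size(nodes: int, p_fail: float, target: float = 0.999) -> int:
    """Smallest buffer b such that, of nodes + b provisioned nodes, at
    least `nodes` survive the run with probability >= target."""
    for b in range(nodes + 1):
        total = nodes + b
        # P(at most b failures among `total` independent nodes)
        p_ok = sum(
            comb(total, k) * p_fail**k * (1 - p_fail) ** (total - k)
            for k in range(b + 1)
        )
        if p_ok >= target:
            return b
    return nodes  # degenerate fallback

# 125 8-GPU nodes (1000 GPUs), 2% chance any node fails during the run:
print(buffer_size(125, 0.02))  # single-digit spares, far below 25% headroom
```

The point of the exercise: a statistically sized, shared buffer needs far fewer spares than the crude 25% headroom implied by the horror stories above, which is exactly the waste that pooling buffers across customers avoids.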

Elastically access Foundry buffer nodes with compelling economics

Rather than let our pool of held-aside healing buffer nodes sit idle, we let customers access it via preemptible spot instances at prices that are typically 12-20x lower than those offered by other GPU clouds and traditional hyperscalers. While spot instances are not traditionally well-suited for large pre-training runs because they are preemptible, they offer compelling economics for live inference, batch inference, fine-tuning, and hyperparameter optimization workloads, which are more dynamic and horizontally scalable.
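A key technique that makes these workloads a good fit for spot capacity is checkpoint-and-resume: clouds generally deliver a termination signal (typically SIGTERM) shortly before reclaiming a preemptible instance, so a job can persist its state and resume on a replacement node. A minimal sketch of the pattern, with a placeholder checkpoint format and training loop:

```python
# Generic checkpoint-on-preemption pattern (not Foundry-specific).
# The checkpoint path and "training step" are placeholders; in practice
# you would write real model state (e.g. torch.save) to persistent storage.
import os
import signal
import time

PREEMPTED = False
CKPT = "checkpoint.txt"  # point this at storage that outlives the instance

def on_sigterm(signum, frame):
    # Spot preemption usually arrives as SIGTERM with a short grace period.
    global PREEMPTED
    PREEMPTED = True

signal.signal(signal.SIGTERM, on_sigterm)

def save(step: int) -> None:
    with open(CKPT, "w") as f:
        f.write(str(step))

def load() -> int:
    return int(open(CKPT).read()) if os.path.exists(CKPT) else 0

step = load()              # resume where the last instance left off
while step < 10_000 and not PREEMPTED:
    time.sleep(0.01)       # stand-in for one training/inference step
    step += 1
    if step % 100 == 0:
        save(step)         # periodic checkpoint

if PREEMPTED:
    save(step)             # final checkpoint before the node is reclaimed
```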

Foundry Cloud Platform further offers a number of convenience features that help customers abstract away the potential complexities associated with preemptible compute (a sketch of how these pieces can fit together follows the list):

  • Hosted Kubernetes clusters

  • Spot disk state saving

  • Auto-mounting persistent storage

  • And more (to be announced)
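As a rough illustration of how these pieces can combine, the sketch below uses the official Kubernetes Python client to submit a batch job that targets spot capacity and mounts persistent storage for checkpoints. The node label, claim name, and image are placeholders of ours, not Foundry defaults:

```python
# Illustrative use of the official Kubernetes Python client
# (pip install kubernetes). Label, claim, and image names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the hosted cluster

container = client.V1Container(
    name="trainer",
    image="registry.example.com/finetune:latest",           # placeholder
    args=["python", "train.py", "--resume", "/data/ckpt"],  # resume from storage
    volume_mounts=[client.V1VolumeMount(name="data", mount_path="/data")],
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="finetune"),
    spec=client.V1JobSpec(
        backoff_limit=10,  # reschedule the pod if its spot node is reclaimed
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="OnFailure",
                node_selector={"capacity-type": "spot"},  # placeholder label
                containers=[container],
                volumes=[client.V1Volume(
                    name="data",
                    persistent_volume_claim=(
                        client.V1PersistentVolumeClaimVolumeSource(
                            claim_name="shared-data")),  # placeholder claim
                )],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```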

We’re previewing many of these convenience capabilities today, and we look forward to announcing more over the coming weeks and months, expanding the viability of spot instances to a wider range of workloads, up to and including long-running pretraining jobs.

Delivering on the original promise of the cloud

Not so long ago, cloud computing as a construct was underappreciated. Now the cloud is ubiquitous, and we rarely question it or think back on the purpose it was originally intended to fulfill. Early pioneers and engineers noted that the “killer idea” of the cloud was that “fast is free.”² If you had a process that ran on 10 machines for 10 days, you could run it on 100 machines for 1 day or 10k machines for 15 minutes. In the cloud, the cost would be the same, but the work would finish as much as 1000 times faster!
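The arithmetic is worth making explicit: total machine-time, and therefore cost at a fixed hourly rate, is identical in all three configurations; only the wall-clock time changes.

```python
# "Fast is free": machine-days (and cost at a fixed rate) stay constant.
total_machine_days = 10 * 10  # the original job: 10 machines for 10 days
for machines in (10, 100, 10_000):
    minutes = total_machine_days / machines * 24 * 60
    print(f"{machines:>6} machines finish in {minutes:>8.1f} minutes "
          f"({total_machine_days} machine-days either way)")
```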

That was a radical idea; however, to make it possible, you’d need the cloud, as well as the software to render processes “horizontally scalable.” 

“Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1000 servers for one hour costs no more than using one server for 1000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.” 

– Above the Clouds, Armbrust et al. (2009)

Unfortunately, this founding promise of elasticity goes unmet in today’s AI cloud context. Companies are forced into fixed, long-term contracts, even though AI workloads are burstier and more dynamic than most of those in the prior web era of cloud computing.

Although there are non-negligible technical and economic reasons why true elasticity is hard to deliver in the AI cloud context, Foundry is endeavoring to do so and has made a number of breakthroughs to advance this cause for our customers.

What’s next for Foundry

Compute is the largest investment for many of our customers and is of existential import, so we work hard to ensure that each one has a phenomenal experience. We are therefore deliberately and selectively rolling out access to Foundry Cloud Platform as we continue dramatically scaling up our capacity to meet demand over the coming weeks and months. As the availability and economics of the platform normalize across a larger set of customers, we will accelerate the rollout.

If you would like to access the platform during this phased rollout, you can request access here.


––

Footnotes

¹ Training great LLMs entirely from ground zero in the wilderness as a startup (https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness); The Llama 3 Herd of Models, section 3.3.4 (https://ai.meta.com/research/publications/the-llama-3-herd-of-models/)
² Above the Clouds: A Berkeley View of Cloud Computing (https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf)

Foundry Technologies Inc. © 2024
