Today’s most sophisticated artificial intelligence systems are capable of impressive feats, whether it’s directing cars through city streets or writing humanlike prose. But they share a common bottleneck: hardware. The development of state-of-the-art systems often requires enormous computing power. For example, creating DeepMind’s protein structure prediction system AlphaFold took a cluster of hundreds of GPUs. Further underscoring the challenge, one source estimates that developing AI startup OpenAI’s language-generating GPT-3 system on a single GPU would have taken 355 years.
New techniques and chips designed to accelerate certain aspects of AI system development promise to reduce hardware requirements, and in some cases already have. But developing with these techniques requires expertise that can be hard to come by for small businesses. At least that’s the claim of Varun Mohan and Douglas Chen, the co-founders of the infrastructure startup Exafunction. Emerging from stealth today, Exafunction is developing a platform to take the complexity out of using hardware to train AI systems.
“Improvements [in AI] are often underpinned by large increases in…computational complexity. As a result, companies are forced to make large investments in hardware to reap the benefits of deep learning. It’s very difficult because the technology improves so quickly and the workload increases rapidly as deep learning proves valuable within a company,” Chen told TechCrunch in an email interview. “Specialized accelerator chips needed to run large-scale deep learning computations are rare. Effective use of these chips also requires esoteric knowledge not common among deep learning practitioners.”
With $28 million in venture capital, including $25 million from a Series A funding round led by Greenoaks with participation from Founders Fund, Exafunction aims to address what it sees as a symptom of the shortage of AI expertise: idle hardware. GPUs and the aforementioned specialized chips used to “train” AI systems – that is, to feed them the data they use to learn to make predictions – are often underutilized. Because they complete certain AI workloads so quickly, they sit idle waiting for other components of the hardware stack, like processors and memory, to catch up.
Lukas Biewald, founder of AI development platform Weights & Biases, reports that almost a third of his company’s customers average less than 15% GPU utilization. Meanwhile, in a 2021 survey commissioned by Run:AI, which competes with Exafunction, only 17% of companies said they were able to achieve “high utilization” of their AI resources, while 22% said their infrastructure mostly sat idle.
The costs add up. According to Run:AI, 38% of companies had an annual budget for AI infrastructure (including hardware, software, and cloud fees) exceeding $1 million as of October 2021. OpenAI is estimated to have spent $4.6 million to train GPT-3.
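To make the cost of idle hardware concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simple model in which the price of a GPU-hour is paid whether or not the GPU is busy, so the effective cost of an hour of useful work is the price divided by the utilization. The hourly price and the function name are hypothetical, chosen only for illustration; the 15% figure is the utilization level cited above.

```python
def effective_cost_per_useful_hour(hourly_price: float, utilization: float) -> float:
    """Cost of one hour of *useful* GPU work under a simple model:
    the full hourly price is paid regardless of how busy the GPU is."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


# Hypothetical price for one GPU-hour, in dollars (an assumption, not a quoted rate).
HYPOTHETICAL_HOURLY_PRICE = 3.00

for utilization in (0.15, 0.50, 0.90):
    cost = effective_cost_per_useful_hour(HYPOTHETICAL_HOURLY_PRICE, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per useful GPU-hour")
```

Under this model, a GPU running at 15% utilization costs six times as much per useful hour as one running at 90%, whatever the underlying price.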
“Most deep learning companies go into business so they can focus on their core technology, without spending their time and bandwidth worrying about resource optimization,” said Mohan via email. “We believe there is no significant competitor that solves the problem we are focused on, which is to abstract away the challenges of managing accelerated hardware like GPUs while delivering superior performance to customers.”
Seed of an idea
Prior to co-founding Exafunction, Chen was a software engineer at Facebook, where he helped build tooling for devices like the Oculus Quest. Mohan was a tech lead at autonomous delivery startup Nuro, responsible for managing the company’s autonomy infrastructure teams.
“As our deep learning workloads [at Nuro] grew in complexity and demands, it became apparent that there was no clear solution to scale our hardware accordingly,” Mohan said. “Simulation is a strange problem. Perhaps paradoxically, as your software improves, you need to simulate even more iterations in order to find edge cases. The better your product, the more you have to search for flaws. We learned how hard it was and spent thousands of engineering hours trying to extract more performance from the resources we had.”
Exafunction customers connect to the company’s managed service or deploy Exafunction’s software in a Kubernetes cluster. The technology dynamically allocates resources, moving computation onto “cost-effective hardware” such as spot instances when available.
Mohan and Chen hesitated when asked about the inner workings of the Exafunction platform, preferring to keep those details under wraps for now. But they explained that at a high level, Exafunction leverages virtualization to run AI workloads even with limited hardware availability, which apparently leads to better utilization rates while lowering costs.
Exafunction’s reluctance to reveal information about its technology, including whether it supports cloud-hosted accelerator chips like Google’s Tensor Processing Units (TPUs), is cause for some concern. But to dispel doubts, Mohan, without naming names, said Exafunction already manages GPUs for “some of the most sophisticated autonomous vehicle companies and organizations at the cutting edge of computer vision.”
“Exafunction provides a platform that decouples workloads from acceleration hardware like GPUs, ensuring extremely efficient utilization – reducing costs, accelerating performance, and allowing businesses to fully benefit from hardware… [The platform] allows teams to consolidate their work on a single platform, without the challenges of assembling a disparate set of software libraries,” he added. “We expect that [Exafunction’s product] will be deeply market-friendly, doing for deep learning what AWS did for cloud computing.”
Mohan may have grand plans for Exafunction, but the startup isn’t alone in applying the concept of “smart” infrastructure allocation to AI workloads. Beyond Run:AI, whose product also creates an abstraction layer to optimize AI workloads, Grid.ai offers software that allows data scientists to train AI models on hardware in parallel. For its part, Nvidia sells AI Enterprise, a suite of tools and frameworks that enables enterprises to virtualize AI workloads on Nvidia-certified servers.
But Mohan and Chen see a huge addressable market despite the crowded field. During a conversation, they positioned Exafunction’s subscription-based platform not only as a way to break down barriers to AI development, but also as a way to enable companies facing supply chain constraints to “unleash more value” from the hardware they have on hand. (In recent years, for a range of different reasons, GPUs have become a hot commodity.) There’s always the cloud, but, according to Mohan and Chen, it can drive up costs. One estimate found that training an AI model using on-premises hardware is up to 6.5x cheaper than the lowest-cost cloud-based alternative.
“While deep learning has virtually endless applications, two of the ones we’re most interested in are autonomous vehicle simulation and large-scale video inference,” Mohan said. “Simulation is at the heart of all software development and validation in the autonomous vehicle industry…Deep learning has also led to exceptional advances in automated video processing, with applications in a wide range of industries. [But] although GPUs are essential to autonomous vehicle companies, their hardware is often underutilized, despite their price and scarcity. [Computer vision applications are] so computationally demanding, [because] each new video stream is actually a firehose of data, with each camera producing millions of images per day.”
Mohan and Chen say the Series A capital will be used to expand Exafunction’s team and “deepen” the product. The company will also invest in optimizing AI system runtimes “for the most latency-sensitive applications” (e.g., autonomous driving and computer vision).
“While we are currently a strong and agile team focused primarily on engineering, we plan to rapidly expand the size and capabilities of our organization in 2022,” Mohan said. “Across virtually every industry, it’s clear that as workloads become more complex (and more enterprises want to take advantage of deep learning insights), the demand for compute far exceeds [supply]. While the pandemic has highlighted these concerns, this phenomenon, and its associated bottlenecks, is poised to worsen in the years to come, especially as cutting-edge models become exponentially more demanding.”