TACC Turns to “Horizon” System for Leadership-Class Computing Facility

Speaking at the Ken Kennedy Institute’s 2022 Energy High Performance Computing Conference, Dan Stanzione, Executive Director of the Texas Advanced Computing Center (TACC), provided an update on TACC’s upcoming Leadership-Class Computing Facility (LCCF) – a huge NSF-funded expansion of its supercomputing capabilities that will launch with a new flagship supercomputer, “Horizon.”

“We have a lot of computers; they’re a bit big,” Stanzione explained, highlighting systems like Frontera (23.5 petaflops Linpack) and Stampede2 (10.6 petaflops Linpack) that rank on the Top500 list and support thousands of project teams. “Collectively we have about 20,000 machines, we have over a million cores, we have a thousand GPUs, we provide about seven billion core hours per year to our user community.” And it is growing all the time: Stanzione said the artwork had just been added to the new Lonestar-6 system the day before the conference.

New art on Lonestar-6.

All of this computing power supports projects across NSF research areas. “Generally, if it’s unclassified and it’s open science and it’s in an academic institution, we probably support it at this point,” Stanzione said. And the demand for more remains: “[At] the start of the center, we had requests for about five times the computation that we were producing. We have about 80,000 times more computing available now … and we get about five times more requests for time than we can produce.”

“So apparently,” he explained, “computing demand is invariant and in no way depends on the size of the computer you buy. So obviously you should buy a bigger one.”

Assuming a leadership(-class) role

Enter the LCCF and Horizon. Back in mid-2018, TACC won an NSF award to create Frontera (which Stanzione says is “technically the first phase of the LCCF”), laying the groundwork for a longer-term leadership-class computing strategy. The following year, it emerged that TACC planned to follow Frontera with a 10× more powerful computer around 2024. In 2020, TACC started to talk more about these plans, calling the facility the “Leadership-Class Computing Facility,” shifting its target slightly to 2025, and releasing concept art for an expanded data center.

An LCCF concept. Image courtesy of TACC.

The main argument for the LCCF, Stanzione said, is to “provide a more sustainable way to invest in computing than one-off system competitions every four years.” On several occasions, Stanzione referred to the National Center for Atmospheric Research (NCAR), an NSF-funded center focused on weather and climate research. TACC, he said, wants “to be an anchor to the NSF computing environment like NCAR is to atmospheric research.”

Stanzione explained that the plan for the LCCF is to “actually begin construction – fitting out a data center – … two years from today” (March 1, 2024), pending Congressional budget approval. TACC then aims to deploy the LCCF flagship system in the first half of 2025, with scientific users on the system by the second half of 2025. After that: ten years of center support, from 2026 to 2036, with “maybe an upgrade in there,” Stanzione said.

“We’ve never had more than four years of funding on an NSF system before as an initial commitment,” he pointed out, citing NSF rules regarding the duration and renewal of its grants. “The reason I say we wanted to build an NCAR-like facility is because it has been funded [by the NSF] there in Colorado since 1957, so … maybe my real challenge is figuring out, personally, how I’m going to get around this limit that technically exists.”

The flagship system, Stanzione said, will be called Horizon, “if DOE does not steal this name and I don’t have to change it again when they set up a system.” (Regarding the LCCF itself: “I’ll give it a better name and logo before it’s finished, but that will come later in the funding cycle.”)

The current, seemingly short-lived logo of the LCCF.

On the horizon

“You probably all wanted me to say which machine [we are] choosing,” Stanzione said, “and I’m not going to, because honestly I haven’t made a final decision, because the best way to go wrong is to tell people four years in advance what your computer is going to be like and what it’s going to do. … Two years before we deploy, ask me what it’s going to be like.”

That said, Stanzione went into detail about the process of setting goals and gaining support for Horizon.

“[Frontera] gets old very quickly, so we need a follow-up, and the only written instructions are ‘10 times faster!’” he said. “And we’ve had I can’t tell you how many hours of conversations about what 10× actually means in this context – is it application throughput? Is it the maximum application performance? Does it have something to do with flops?” (“I argued not,” he said.)

He then broke down where they expect the 10× to come from.

“10× in this time frame is difficult for several reasons,” he said. “One: if we look at the baseline of Frontera, if we just relied on improving vendor performance … at best, over that five-year period, we’ll get maybe 3× out of that,” mainly thanks to increases in memory bandwidth. “But three is not 10.”

Next, he said: “Buy more – that one almost always works. … So we’re going to double the budget from what Frontera was, and I got away with that, then.” This brings it to 6×, leaving Horizon in need of a further speedup of about one and two-thirds (10/6 ≈ 1.67×). The rest, Stanzione said, will be accomplished through improvements in algorithms and software methods.
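The speedup budget Stanzione describes can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are hypothetical, and it assumes (as the article's 6× figure implies) that the vendor gain and the budget increase simply multiply.

```python
# Illustrative sketch of the speedup budget described in the talk.
# All names are hypothetical; the figures are the ones quoted in the article.
TARGET_SPEEDUP = 10.0     # the "10 times faster" goal relative to Frontera
VENDOR_GAIN = 3.0         # "maybe 3x" from vendor improvements (a best case)
BUDGET_MULTIPLIER = 2.0   # doubling the budget relative to Frontera


def remaining_software_speedup(target, vendor_gain, budget_multiplier):
    """Return the factor still needed from algorithms and software.

    Assumes the hardware and budget gains multiply, which is a simplification.
    """
    hardware_gain = vendor_gain * budget_multiplier
    return target / hardware_gain


hardware_gain = VENDOR_GAIN * BUDGET_MULTIPLIER
software_gain = remaining_software_speedup(
    TARGET_SPEEDUP, VENDOR_GAIN, BUDGET_MULTIPLIER
)

print(f"Hardware plus budget: {hardware_gain:.0f}x")
print(f"Still needed from software: {software_gain:.2f}x")
```

Under these assumptions, hardware and budget together account for 6×, and software must contribute roughly a further 1.67×. If the vendor gain falls short of the best-case 3×, the software share grows accordingly.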

Putting things to code

To that end, last spring, TACC called on the research community to submit problems that they believe are central to future research and representative of problem designs in the supercomputing space. TACC received 140 proposals, narrowed them down to 30, then selected 21 to move forward for further review, seeking to achieve meaningful speedups “and see where we can go.” The projects, funded to the tune of approximately $300,000 each, span mathematics, physical sciences, engineering, geosciences, life sciences and even social sciences.

“If all the codes were well-designed and really good, I would worry a lot about our ability to make improvements and hit the numbers,” Stanzione said, continuing: “Not all codes are really good, well-packaged, well-engineered codes. Most of them are what we would call ‘software.’”

This code research, he added, would also help TACC sell the benefits of LCCF and Horizon to decision makers. “We know that if we build a bigger machine, interesting things are going to happen,” he said. “But I can’t sell hundreds of millions of investment dollars on ‘cool stuff will happen’.”

Now back to the hardware

On the hardware front, Stanzione showed a long list of vendors being evaluated. “We did a lot of assessments on site, we did others with partners,” he said. “We looked at a wide variety of processor technologies … we looked at various Arms, we looked at NextSilicon … we looked at a bunch of networking options … we looked at a lot of other more exotic stuff with partner sites.”

A non-exhaustive list of hardware being evaluated by TACC for Horizon

  • Processors: AMD; Fujitsu Arm; Intel; NEC; NextSilicon; Nvidia
  • Networking: Cornelis; Nvidia; Rockport
  • File systems: BeeGFS; DAOS; VAST
  • Node disaggregation: GigaIO; Liqid
  • AI/quantum: Quantum (via Stanford); Graphcore (via Argonne National Laboratory); Cerebras (also via Argonne); SambaNova (also via Argonne)

(“Argonne buys all these [exotic] chips,” Stanzione said, “so I just call Rick Stevens [associate laboratory director at Argonne] and say ‘how are you?’ rather than trying to replicate what they’re doing on that.”)

Stanzione also said they plan to add around 10% capacity to the core system to ensure there is room for smaller testbeds and research projects when the main system is busy. “We’ll define the size of the system,” he said, “and then I’ll lie about it, because we’ll buy more than that. … We’ll have that part of the system that does the 10× part, and then we’ll have a bunch of extra racks to deal with all these other use cases that don’t count for capabilities.”

Complicating these efforts, Stanzione said, an “anonymous program manager” asked him, “What is the peak flops of the entire system if you include all of these items?”

“That’s the question you weren’t supposed to ask!” cried Stanzione. “And now I have to make up an answer because no one knows what the 2025 peak flops of any of these processors are, let alone which ones are achievable. So I wrote an answer and sent it back.”

The hardware will be supported by an additional 15 MW of electrical capacity, making TACC a 25 MW facility. And, regarding that “maybe an upgrade in there” from earlier: “The idea is basically that we would buy a second system halfway through that [2026-2036] lifetime,” Stanzione said, “so it’s two five-year lifetimes.” He added that they would prefer to avoid a radical change in architecture between the two systems.

For now, however, Stanzione and TACC are embroiled in what he called the “terrifying amount of project management” needed to begin building the new facility in two years.
