Dell Technologies Interview: Univ. of Liverpool’s hybrid HPC strategy boosts scientific computing with a burst

[SPONSORED CONTENT] In a recent Dell Technologies interview on this site, we talked about HPC-as-a-Service with R Systems, provider of on-demand HPC resources and technical expertise in partnership with Dell HPC Cloud Services. Now, in this interview, we have a variation within this HPC segment: moving to the cloud when an on-premises cluster needs a resource boost.

Faced with this situation is the University of Liverpool's Advanced Research Computing group within the IT Services Department. The group, led by Cliff Addison, uses the Dell-based "Barkla" Linux cluster for its scientific computing needs. When the group's needs exceed Barkla's capacity, the university works with Dell Technologies and UK-based Alces Flight, which designs and builds HPC environments for scientists, engineers and researchers. Alces Flight and Dell designed a bursting capability to Amazon Web Services, with a priority on creating a transparent environment that Advanced Research Computing scientists could easily adopt and access.

In this interview, Addison explains, among other things, how AWS capacity was used when the COVID-19 pandemic hit.

Doug Black: Hello everyone, I'm Doug Black, editor at insideHPC, and today, as part of our interview series on behalf of Dell Technologies, we're speaking with Cliff Addison, head of Advanced Research Computing at the University of Liverpool. Cliff, welcome.

Cliff Addison: Hello, or good afternoon, depending on the time of day. But yes, okay.

Black: So please give us an overview of the HPC system that the university has set up with Dell's integration partner, Alces Flight. Now, a key aspect of the system, as I understand it, is that it bursts into Amazon Web Services for additional compute and storage resources. Is that correct?


Addison: That's largely correct. What we did... I'll take a step back. In 2017, when we put out a call for tenders, we had a number of researchers who had grants that they wanted to use to buy equipment. We needed something that was demonstrably high impact, and we also needed an environment that could scale as our research computing requirements changed. And we were also looking for something that would basically provide good computing power out of the box.

And Dell responded to that with a partnership with Alces Flight, also working with Amazon Web Services, to provide us with a very solid on-site system, with very competitive hardware and a very good configuration that our researchers adopted immediately.

Also, we started with a lot of credits from AWS so we could begin working with the cloud, and Alces Flight used their expertise to set up a fairly seamless Barkla cloud environment where we could switch from the on-premises system to the cloud system with the same users, the same storage and a very familiar environment for researchers. So the researchers didn't really need to worry about a different environment in the cloud; it was very similar to what they already had. And those characteristics together were a very strong advantage. And I'll talk a bit later about some of the ways that has worked for us.
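(The interview doesn't describe the tooling behind this bursting capability, so the snippet below is a rough, hypothetical sketch only: it shows the general shape of the "scale out" half of such a setup, using AWS's boto3 SDK to launch extra compute instances and tag them as burst nodes. The region, AMI ID, instance type, subnet, security group and tag values are placeholders, not Liverpool's or Alces Flight's configuration.)

```python
"""Hypothetical burst 'scale out' hook: launch extra EC2 compute nodes when the
on-premises cluster runs out of capacity, tagging them so companion tooling can
find (and later terminate) them. All identifiers below are placeholders."""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumed region


def launch_burst_nodes(count, instance_type="c5.24xlarge"):
    """Start `count` cloud compute nodes and return their instance IDs."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder: a pre-built cluster node image
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        SubnetId="subnet-0abc1234",        # placeholder: the cluster's VPC subnet
        SecurityGroupIds=["sg-0abc1234"],  # placeholder
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": "hpc-burst"}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]


if __name__ == "__main__":
    print(launch_burst_nodes(4))
```

In a real deployment this step would be driven by the cluster's scheduler and paired with shared storage and user accounts, so that, as Addison describes, researchers see essentially the same environment on both sides.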

Black: Okay, now let’s move on to the work that your organization does. What’s new in the Advanced Research Computing Group at the University of Liverpool?

Addison: Computational chemistry at Liverpool has always been a major user of our facilities. And 10-15 years ago, that meant large-scale parallel molecular dynamics and…calculations. But what has happened over the years, and it's consistent with several other groups, is that they've moved to a very sophisticated workflow environment where they do detailed studies occasionally, but they're driven by a huge number of quick test surveys with a bit of machine learning to help guide things.

And so, instead of just doing a lot of calculations, we see them doing a mix of very fast cycles of investigation and machine learning, and then doing detailed calculations on certain aspects of molecules that look promising. And that's one of the general trends we're seeing.

Additionally, with the COVID-19 outbreak, several specific requirements arose. And again, the Barkla cloud environment, with bursting to AWS, was fundamental to getting started. One of our groups was doing deep learning to try to detect COVID in CT images and X-rays, and they just didn't have the resources available. We applied to AWS and got research credits, and then, again with the Alces Flight environment, those researchers were able to seamlessly access AWS, do some of their data analysis and data cleaning on the local cluster, and then very transparently switch to GPU nodes on AWS to perform the detailed computation. And it worked extremely well; we were able to present the results at the Supercomputing 2020 conference. And they have just submitted their results to an online journal, and the paper is being accepted.

Black: So Cliff, you all started with the Barkla cluster back in 2017. Tell us about the system's growing capabilities, in terms of nodes, and the updates you're currently working on.

Addison: Well, we bought the system with good expandability in mind. We started with 96 Skylake nodes, each with 40 cores, and we were able to expand that over time to 140 nodes today. I'm sure that many of the research groups that have worked on it were very pleased with this result.

But recently another research group came to us and said they would like improved GPU capability for their PhD students, and that they would probably also need some fast storage to sit behind it. I was able to contact Dell and Alces Flight, and they came back with some ideas in terms of (NVIDIA) A100 nodes and fast NVMe storage. When our researchers looked at the options, they were very pleased. We've just decided on a mix of configurations, and Dell and Alces Flight will now put that together. Hopefully we'll have that later in the year.

Black: Nice. Really interesting. So now, with the pandemic, and more working and learning from home, what impact has that had on your team?

Addison: Well, that’s interesting, our team did well. We are able to get good remote access to our on-site services. And again, cloud hooks are basically done through this on-premises system so that we can access the cloud whenever we need it. It was research – it’s a struggle, because of course one of the main lessons learned is that home broadband isn’t as fast as a good university network. And so we’ve had researchers trying to download large app packages that are 10 gigabytes in size to run on their home systems. And we kept saying better not to do that, better to use our facilities on campus. And don’t do the heavy math on your home systems. And finally, I think we succeeded. So once people agreed to make better use of on-premises systems, it worked well, but our researchers took a while to get used to it, especially when dealing with large datasets.

Black: So, generally speaking, how important is the AWS bursting capability to you? And do you have any advice for other HPC site managers?

Addison: One of the things we found was that we loved AWS, we loved the folks at AWS. The environment has a reasonably steep learning curve, and you need to use it quite a bit to get familiar with it. But Alces Flight, as a third party, provided a very seamless environment, and there are several other companies capable of doing similar things. I would encourage HPC groups to look to partner with someone who has that expertise rather than trying to reinvent things on their own. It makes a huge difference being able to let someone else handle this: setting things up, doing the accounting for you, doing the node setup, making sure that when nodes aren't in use or are down, you don't pay for them, that sort of thing. It really makes it a much more enjoyable experience.
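(As a hypothetical illustration of the "don't pay for idle nodes" housekeeping Addison mentions, the sketch below uses boto3 and CloudWatch CPU metrics to find burst-tagged instances that appear idle and terminate them. The tag name, threshold and region are assumptions; a provider such as Alces Flight would typically drive this from the scheduler's view of node usage rather than CPU metrics alone.)

```python
"""Hypothetical cost-control script: find cloud nodes tagged as burst capacity,
check recent CPU utilisation via CloudWatch, and terminate any that look idle so
they stop accruing charges. Tag, threshold and region are assumptions."""
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")               # assumed region
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

IDLE_CPU_PERCENT = 5.0            # below this average CPU, treat the node as idle
LOOKBACK = timedelta(minutes=30)  # how far back to look at utilisation


def idle_burst_instances():
    """Return IDs of running burst nodes whose recent CPU use is negligible."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:Role", "Values": ["hpc-burst"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    now = datetime.now(timezone.utc)
    idle = []
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=now - LOOKBACK,
                EndTime=now,
                Period=300,
                Statistics=["Average"],
            )
            points = stats["Datapoints"]
            if points and max(p["Average"] for p in points) < IDLE_CPU_PERCENT:
                idle.append(inst["InstanceId"])
    return idle


if __name__ == "__main__":
    targets = idle_burst_instances()
    if targets:
        ec2.terminate_instances(InstanceIds=targets)  # stop paying for idle capacity
        print("Terminated:", targets)
```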

Black: Yeah, that kind of smooth transition back and forth between cloud and on-premises. That's absolutely a big key, so people don't constantly struggle to learn a new user interface.

Addison: That’s right. But also from a local perspective, we often have issues with understaffing in terms of HPC staff and we don’t really have the extra capacity to do a lot of the first-hand cloud management that would be needed to such a good environment. Thus, being able to work through a third party makes our lives considerably easier. We can focus on helping users, we don’t have to worry directly about management and accounting. We’re able to do that through a third party, and we found that to be a big, big win.

Black: Awesome. Alright, Cliff, it was a pleasure talking with you. We've been speaking with Cliff Addison of the Advanced Research Computing group at the University of Liverpool. Thank you so much.

Addison: Thank you so much.
