How CDW is Building Energy-Efficient HPC Clusters of Tomorrow

In this blog, CDW’s HPC experts discuss the rise of high-performance computing in Canada and how organizations can build, operate and maintain energy-efficient HPC clusters of tomorrow.

CDW Expert
What's Inside
  • The Canadian HPC landscape

    Canada’s share of global computing capacity measured in Peta FLOPS (10^15 floating-point operations per second) is just 0.7 percent. While this is lower than the nations with large supercomputing investments, adoption continues to soar in Canada.

  • Key operational challenges of an HPC cluster

    As the cluster grows bigger and more complex, it becomes harder to ensure that it stays energy efficient. This can lead to greater costs for the facility and even disturb performance. Let’s understand these challenges in detail.

  • A new era in HPC cluster management

    Organizations with current or upcoming HPC investments must focus on five core priorities to drive better energy efficiency and cluster performance.

  • Building HPC clusters of tomorrow with CDW

    CDW experts are aware of the roadblocks organizations face before they can bring their HPC cluster to life, which is why we have built an all-encompassing HPC service that meets varied HPC needs in a single offering.


Modern high-performance computers serve as the bedrock of groundbreaking research and innovation in Canada. Compared to a commercial desktop PC, which usually comes with eight to 16 cores, an HPC cluster can have more than 100,000 cores. Such behemoth compute power is used for intensive tasks such as training foundational AI models or running extremely complex calculations.

As Canada gains AI momentum, more organizations are on the path to building new HPC clusters to transform key industries such as financial services, healthcare and education. But given the sheer size of an HPC setup, running and operating one can present energy, cooling and cost challenges.

In this blog, CDW’s HPC experts discuss the rise of high-performance computing in Canada and how organizations can build, operate and maintain energy-efficient HPC clusters of tomorrow.

The Canadian HPC landscape

Canada houses two of the world’s most powerful supercomputers, named after Canadian scientists Anne Barbara Underhill and André Robert.

Yet, Canada’s share of global computing capacity measured in Peta FLOPS (10^15 floating-point operations per second) is just 0.7 percent. While this is lower than the nations with large supercomputing investments, adoption continues to soar in Canada.

Key HPC drivers

Canada has a rapidly growing community of skilled AI experts and a booming startup ecosystem that continue to advance the country’s innovation potential. According to Crunchbase, Canada has more than 1,500 AI companies registered on the platform, with a net investment of $8.3 billion.

CDW’s 2024 Canadian Hybrid Cloud Report also found that 55 percent of surveyed organizations plan to invest in AI in the next 12 months. The financial services sector leads the way with 70 percent of organizations investing in AI, followed by healthcare at 61 percent.

To meet the infrastructure and innovation demands of this rapid AI adoption, a corresponding rise in HPC investments is also expected.

AI initiatives are driving demand for GPU-based clusters

Another interesting development is the rising demand for GPU-based clusters that offer high performance parallel processing capabilities. These clusters are well-suited for AI and deep learning workloads with applications in financial services, healthcare and energy sectors.

More organizations are securing GPU-based clusters for AI initiatives that consume between 0.5 and 4 megawatts of power, similar to a petascale HPC setup. This trend is likely to further ramp up demand for HPC-like high-density compute architectures at the grassroots level.

Key operational challenges of an HPC cluster

A traditional mid-range HPC facility found in a research institute can hold around 600 to 1,000 servers. This requires a heavy inflow of electrical power to run the servers, keep them cool and manage their upkeep.

As the cluster grows bigger and more complex, it becomes harder to ensure that it stays energy efficient. This can lead to greater costs for the facility and even disturb performance. Let’s understand these challenges in detail.

Managing megawatts of power

Managing power and energy efficiency in an HPC cluster is challenging due to the sheer computational demands and the complex interplay between hardware, cooling and software optimization.

A typical HPC setup can consume anywhere between five and 20 megawatts of power, enough to light up 20,000 homes. These setups also have power densities that can reach as high as 12 kW per square foot.

Moreover, balancing workloads across such a large system to minimize idle power consumption requires intricate electrical engineering.

The interconnected nature of these systems, where storage, networking and computational nodes must work in harmony, further complicates energy management, as inefficiencies in one area can ripple throughout the entire cluster.​
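
To give a feel for the scale involved, a back-of-the-envelope power estimate for a mid-range cluster might look like the sketch below. The node count, per-node wattage and PUE value are illustrative assumptions, not figures from this article:

```python
# Rough cluster power estimate (all figures are illustrative assumptions).
def cluster_power_mw(nodes: int, watts_per_node: float, pue: float) -> float:
    """Total facility draw in megawatts: IT load multiplied by PUE
    (Power Usage Effectiveness = total facility power / IT power)."""
    it_load_w = nodes * watts_per_node
    return it_load_w * pue / 1_000_000

# Example: 800 dual-socket nodes at ~1.2 kW each, facility PUE of 1.4
total = cluster_power_mw(nodes=800, watts_per_node=1200, pue=1.4)
print(f"{total:.2f} MW of facility power")
```

Even this modest configuration lands in the low-megawatt range before any GPUs are added, which is why load balancing and idle-power reduction matter so much.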

Handling excessive heat

HPC systems consist of thousands of high-performance CPUs and GPUs, each of which can consume hundreds of watts, generating vast amounts of heat.

Efficiently cooling these components without using excessive energy is difficult. Traditional air cooling can prove inefficient at this magnitude, so more advanced cooling methods like liquid immersion or direct liquid cooling are needed.

The structural design and climate conditions of the facility also influence what kind of cooling options can be readily leveraged. This may further increase cooling energy overheads and make it challenging to curb cooling costs.

Scaling is an uphill battle

Upgrading and scaling an HPC setup is challenging, not only due to hardware compatibility and system integration, but also because of physical limitations like floor loading.

HPC systems require heavy, dense server racks, each of which can weigh over a thousand pounds when fully populated. Expanding an HPC cluster means adding more racks, potentially exceeding the structural limits of the data centre floor, which is not always designed to handle such concentrated weight.
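
A quick feasibility check along these lines can be sketched in a few lines. The rack weight, footprint and floor rating below are hypothetical values for illustration:

```python
# Hypothetical floor-loading check for a rack expansion plan.
def load_margin(floor_rating_lb_per_sqft: float,
                rack_footprint_sqft: float,
                rack_weight_lb: float) -> float:
    """Margin (lb/sq ft) one fully populated rack leaves against the
    rated floor load over its own footprint. Negative means overloaded."""
    rack_load = rack_weight_lb / rack_footprint_sqft
    return floor_rating_lb_per_sqft - rack_load

# A 2,000 lb rack on a ~6.7 sq ft footprint vs a 250 lb/sq ft raised floor:
margin = load_margin(250, 6.7, 2000)
print(f"margin: {margin:.0f} lb/sq ft")  # negative: the floor is overloaded
```

Real structural assessments spread the load across the floor grid and account for point loads from casters, but even this toy check shows why dense racks can exceed what an ordinary raised floor was designed for.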

Additionally, new hardware components, such as more advanced CPUs and GPUs, may not integrate seamlessly with older infrastructure, leading to performance imbalances and potential bottlenecks.

A new era in HPC cluster management

Organizations with current or upcoming HPC investments must focus on five core priorities to drive better energy efficiency and cluster performance:

  • AI-friendly architecture design
  • Improved cooling methods that consume less energy
  • Systematic power management that can balance load
  • Advanced algorithms for optimizing workloads
  • New-age processor design

We explore the latest technical advancements across these priorities that promise better results for HPC owners.

AI-ready HPC solutions

At the core, any organization that wants to build an HPC cluster for AI projects must ensure the servers, operating software and processors can work together to handle complex AI workloads.

Modern HPC clusters now come with AI-ready capabilities for handling large amounts of data and support ML tools to aid AI development. These clusters simplify compute-intensive tasks such as training a large language model from scratch while also keeping energy consumption in check.

By incorporating AI-ready design elements, even ambitious projects can be achieved within budget, power and infrastructure constraints.   

Our partners at Dell Technologies offer scalable HPC storage architectures that can simplify the complexities associated with AI projects. Their validated designs for HPC storage offer the following benefits:

  • Simplified monitoring and management: Deploy scalable HPC storage for data-intensive projects with a storage architecture that is easier to maintain and operate
  • Highly available storage: Improve HPC storage availability with the integration of Dell EMC servers that fulfill redundancy and interoperability requirements while avoiding single points of failure
  • Storage efficiency: Use data consolidation and retention methods to drive better storage efficiency for HPC systems, bringing down costs while improving performance

Hewlett Packard Enterprise (HPE) also partners with CDW to offer scalable and flexible HPC solutions through its ProLiant Compute platforms, designed to enhance AI workload throughput and efficiency.

These platforms are backed by decades of HPE expertise and an extensive partner ecosystem, providing comprehensive support from design to implementation and management. This is particularly beneficial for customers with limited in-house AI resources, addressing skill and knowledge gaps effectively.

  • HPE’s ProLiant Gen11, Gen10 Plus and Gen10 servers offer improved performance and security across a range of workloads. These servers cater to various industries, including financial services, manufacturing, healthcare, life sciences and retail.
  • HPE’s enterprise computing solution for generative AI is optimized for both edge and data centre deployments, featuring a highly scalable architecture and GPU-enabled applications to maximize AI outcomes and inferencing performance.

Liquid cooling

Unlike traditional air-cooling methods, liquid cooling can be far more efficient at dissipating heat.

This technique involves using liquids, often water or specialized coolants, to absorb heat directly from the components, reducing the energy needed for cooling and improving overall efficiency.

Technologies like direct-to-chip cooling and immersion cooling are becoming popular in HPC data centres due to their ability to handle high heat loads with less energy consumption​.
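
One way to reason about the savings is through PUE: the closer the facility's PUE is to 1.0, the less energy goes to cooling and other overhead. The PUE values below are illustrative assumptions, not measured figures:

```python
# Illustrative annual cooling/overhead energy under two PUE assumptions.
def annual_overhead_mwh(it_load_mw: float, pue: float) -> float:
    """Non-IT (mostly cooling) energy per year in MWh for a given IT load."""
    overhead_mw = it_load_mw * (pue - 1.0)
    return overhead_mw * 24 * 365  # megawatts -> MWh over a year

air = annual_overhead_mwh(5.0, 1.6)      # assumed air-cooled facility
liquid = annual_overhead_mwh(5.0, 1.15)  # assumed direct liquid cooling
print(f"air: {air:,.0f} MWh/yr, liquid: {liquid:,.0f} MWh/yr")
print(f"saved: {air - liquid:,.0f} MWh per year")
```

Under these assumptions, a 5 MW IT load saves on the order of tens of thousands of MWh per year, which is why liquid cooling features so prominently in new HPC builds.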

Dynamic power management

Advanced software solutions can dynamically manage power consumption by adjusting the voltage and frequency of processors based on workload demand.

This is done via a process known as dynamic voltage and frequency scaling (DVFS). By lowering the power usage during low-computation periods, energy waste is minimized without affecting performance significantly.
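
The decision logic behind DVFS can be sketched as a simple policy that maps recent utilization to a frequency step. This is a toy model with hypothetical P-states; real governors such as Linux's schedutil are far more sophisticated:

```python
# Toy DVFS policy: pick the lowest frequency step that still covers demand.
FREQ_STEPS_MHZ = [1200, 1800, 2400, 3000]  # hypothetical P-states

def pick_frequency(utilization: float, current_mhz: int = 3000) -> int:
    """Choose a frequency so the core would run at roughly 80% busy
    at the new speed, saving power during low-computation periods."""
    needed_mhz = utilization * current_mhz / 0.8
    for step in FREQ_STEPS_MHZ:
        if step >= needed_mhz:
            return step
    return FREQ_STEPS_MHZ[-1]  # demand exceeds all steps: run flat out

print(pick_frequency(0.30))  # light load -> a low frequency step
print(pick_frequency(0.95))  # heavy load -> the top frequency step
```

The key trade-off is responsiveness: scaling down too aggressively hurts latency-sensitive jobs, so production governors smooth utilization over a window before changing states.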

AI-led workload optimization

AI algorithms can be used to predict workload patterns and optimize resource allocation, ensuring that computing resources are not overprovisioned. These systems can automatically tune power usage by identifying low-energy states or shifting workloads to underutilized resources to improve overall efficiency​.
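
A simplified version of the consolidation idea: pack jobs onto as few nodes as possible so that idle nodes can drop into a low-power state. The first-fit sketch below uses hypothetical core counts and is nothing like a production scheduler:

```python
# First-fit job consolidation: fewer active nodes means more nodes can idle.
def consolidate(job_core_counts, cores_per_node=64):
    """Assign each job to the first active node with enough free cores;
    power on a new node only when no existing node fits the job."""
    nodes = []  # free cores remaining on each active node
    for cores in job_core_counts:
        for i, free in enumerate(nodes):
            if free >= cores:
                nodes[i] -= cores
                break
        else:
            nodes.append(cores_per_node - cores)  # open a new node
    return len(nodes)

jobs = [32, 16, 48, 8, 24, 40]  # per-job core requests
print(consolidate(jobs), "nodes active")  # vs 6 if each job got its own node
```

AI-led schedulers go further by forecasting when each job's demand will peak, but the energy lever is the same: every node left out of the packing can be powered down.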

Intel partners with CDW for various HPC solutions including the 5th Gen Intel Xeon Scalable Processors, equipped with Intel HPC Engines. These processors help achieve optimized cluster performance by integrating purpose-built accelerators. With their HPC-optimized design, they can drive improved performance and power efficiency for various HPC workloads, such as simulation and modeling.

These accelerators help resolve I/O bottlenecks, process specific workloads faster and offload tasks from the CPU, thereby preserving headroom for more demanding computations.

Intel HPC Engines also include Intel Advanced Vector Extensions 512 (Intel AVX-512), which condense and fuse common computing operations into fewer steps, accelerating general computing, AI processing and mathematically intense HPC workloads. This makes HPC more accessible and cost-effective, enabling more organizations to leverage supercomputing resources for scientific discovery, engineering simulations and complex system modeling.

Energy-efficient processors

New generations of processors, CPUs and specialized AI chips are designed to be more energy-efficient than traditional architectures. These chips are optimized for parallel computing tasks common in HPC setups and consume less power per computation, making them ideal for large-scale deployments.

AMD, our HPC technology partner, offers processors that can help meet power and cooling objectives. Its EPYC processors built on the Zen 4c architecture are engineered for density and efficiency, allowing up to 128 cores on a single processor.

This design enhances performance per watt, enabling industries like healthcare and life sciences to achieve greater computational power per server rack without increasing power consumption or cooling costs. The efficient thermal management of these cores leads to improved energy efficiency, which can significantly reduce electricity costs for running and cooling servers, a major factor in the total cost of ownership for data centres.

Additionally, the 4th generation AMD EPYC processors, compatible with the Socket SP5 platform, offer a scalable solution for modern digital initiatives such as generative AI applications.

Building HPC clusters of tomorrow with CDW

Building an HPC cluster is an enormous and intricate task. It doesn’t end with choosing the right technology; it takes extreme sophistication to make all the moving parts work in perfect harmony. We’re talking about thousands of CPU cores and hundreds of servers in one facility.

CDW experts are aware of the roadblocks organizations face before they can bring their HPC cluster to life, which is why we have built an all-encompassing HPC service that meets varied HPC needs in a single offering.

With previous successes in the Canadian research, education and healthcare industries, we have supported organizations with:

  • HPC expert solution engineers for end-to-end HPC cluster design and architecture
  • Hands-on HPC installation, from configuring server racks to power outlets
  • Access to top HPC technology providers in Canada with enterprise offerings
  • Consultation and advice from a broad ecosystem of HPC partners
  • Recommendations and design support to implement best practices and ensure compliance with regulations  

This uniquely positions CDW to not only fulfill the hardware requirements for your HPC cluster but also collaborate with you to build and install it from the ground up. Our capabilities can help you simplify procurement, access engineering support and future-proof your cluster right from the beginning.