By Vince Hill, cPacket Networks
High performance computing (HPC) requires an extremely high-performance network with ultra-low latency to move large files between HPC nodes quickly. IT and network operations (NetOps) teams in industries such as financial services, oil and gas, animation/3D rendering and pharmaceutical research need to monitor their networks in exacting detail to ensure they can support HPC workloads. But monitoring latency and other metrics at HPC-class performance levels creates a new set of challenges, including monitoring packets at 40Gbps and 100Gbps speeds, measuring latency at millisecond and nanosecond intervals, and detecting minuscule “microbursts” of traffic before they cause performance issues.
Let’s dig into those challenges in more detail.
Monitoring Packets at 10Gbps or Greater
As network speeds increase to 40 or 100Gbps, network monitoring tools, packet capture appliances and packet brokers will struggle to keep up unless they are specifically built for this use case. A general-purpose CPU architecture can’t capture packets at over 10Gbps without hardware assistance. The high-resolution measurement that HPC networks require often necessitates measuring key performance indicators (KPIs) like latency and jitter on each box, rather than at a central point (more on this later). This adds an extra layer of processing, which in turn adds delay. The packet broker must be powerful enough to acquire, process and distribute packets, absorb this extra processing delay, and still avoid slowing down the network. NetOps teams must ensure their monitoring hardware is designed for this high-speed, high-performance scenario.
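To see why hardware assistance becomes necessary at these speeds, consider the per-packet time budget at line rate. The sketch below is purely illustrative; the arithmetic uses standard Ethernet minimum-frame overheads and is not tied to any particular vendor’s hardware.

```python
# Back-of-the-envelope per-packet time budget at common line rates, assuming
# minimum-size Ethernet frames: 64 bytes + 8 bytes preamble/SFD + 12 bytes
# inter-frame gap = 84 bytes (672 bits) on the wire.
WIRE_BITS_PER_MIN_FRAME = 84 * 8

for gbps in (10, 40, 100):
    pps = gbps * 1e9 / WIRE_BITS_PER_MIN_FRAME   # packets per second at line rate
    ns_per_packet = 1e9 / pps                    # time available to handle each packet
    print(f"{gbps:>3} Gbps: {pps / 1e6:6.2f} Mpps, {ns_per_packet:5.1f} ns per packet")
```

At 100Gbps this works out to roughly 149 million packets per second, or under 7 nanoseconds per minimum-size packet, which is why capture and brokering at these rates is typically offloaded to purpose-built hardware.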
HPC workloads require extremely low network latency, usually less than a millisecond. Monitoring tools must therefore measure latency at a finer granularity than the workload can tolerate (for example, if the HPC workloads cannot tolerate more than 2 milliseconds of latency, the monitoring tools must measure it in 1 millisecond intervals). This might seem obvious, but not all monitoring solutions are built for an ultra-low latency use case. NetOps teams must make sure their chosen solution is up to the challenge.
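As a minimal sketch of that granularity requirement, the example below buckets latency samples into 1 millisecond intervals and flags any interval that violates a 2 millisecond tolerance. The sample data is invented for illustration; in practice the latency values would be derived from packet timestamps taken at two measurement points.

```python
from collections import defaultdict

# Hypothetical latency samples: (timestamp_seconds, latency_seconds) pairs.
samples = [(0.0003, 0.0004), (0.0012, 0.0009), (0.0021, 0.0025), (0.0034, 0.0006)]

TOLERANCE = 0.002   # workload tolerates at most 2 ms of latency
INTERVAL  = 0.001   # so latency is evaluated in 1 ms buckets

worst = defaultdict(float)
for ts, latency in samples:
    bucket = int(ts / INTERVAL)
    worst[bucket] = max(worst[bucket], latency)   # track the worst latency per bucket

for bucket, latency in sorted(worst.items()):
    if latency > TOLERANCE:
        print(f"interval {bucket * INTERVAL * 1000:.0f} ms: worst latency "
              f"{latency * 1000:.2f} ms exceeds {TOLERANCE * 1000:.0f} ms tolerance")
```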
It’s unusual – even for HPC networks – to run at full capacity all the time. The network will have an average throughput and a maximum throughput that it will occasionally hit in short bursts. Because of this, packet capture solutions also have two speeds – a sustained capture speed that they can run at indefinitely, and a “burst” speed, which they can run at for up to a minute. 40/60Gbps sustained and 100Gbps burst is common for high-performance packet capture devices, so NetOps teams should make sure their chosen solution meets these standards.
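A rough sizing check along these lines might look like the sketch below. The function name and the 40Gbps sustained / 100Gbps burst / 60 second defaults are illustrative placeholders drawn from the figures above, not a real product specification.

```python
def can_capture(avg_gbps, peak_gbps, burst_seconds,
                sustained_gbps=40, burst_gbps=100, max_burst_seconds=60):
    """Rough sizing check for a packet capture device (illustrative numbers only)."""
    if avg_gbps > sustained_gbps:
        return False   # steady-state traffic exceeds the sustained capture rate
    if peak_gbps > burst_gbps:
        return False   # peaks exceed even the burst capture rate
    if peak_gbps > sustained_gbps and burst_seconds > max_burst_seconds:
        return False   # bursts last longer than the device can absorb
    return True

# Example: 30 Gbps average traffic with 80 Gbps peaks lasting up to 20 seconds.
print(can_capture(avg_gbps=30, peak_gbps=80, burst_seconds=20))   # True
```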
A more complex issue sits at the other end of the scale: traffic bursts that are so short – lasting a few milliseconds – they can slip past monitoring solutions that aren’t granular enough. If a monitoring device measures throughput every 10 milliseconds, and traffic spikes past the network’s maximum allowable load for just 2 milliseconds in between measurements, the spike won’t be detected. But during those 2 milliseconds, some packets will get dropped. In industries like finance, where trades can be lost over milliseconds and just a few network packets, these microbursts can have serious consequences. HPC workloads that tend to generate “bursty” traffic will need high-resolution metrics, as discussed earlier, plus the ability to analyze microbursts and determine their cause.
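A minimal microburst detector can be sketched as follows, assuming per-packet timestamps and frame lengths are already available (for example, parsed from a pcap file). The packet records here are synthetic: a 2 millisecond burst is injected on top of light background traffic to show how 1 millisecond windows catch what a coarser measurement would miss.

```python
from collections import defaultdict

WINDOW_S = 0.001                                            # 1 ms measurement window
LINK_GBPS = 10
BYTES_PER_WINDOW_LIMIT = LINK_GBPS * 1e9 * WINDOW_S / 8     # 1.25 MB per 1 ms at 10 Gbps

# Hypothetical packet records (timestamp_seconds, length_bytes).
packets = [(i * 0.00001, 800) for i in range(1000)]              # background traffic
packets += [(0.005 + i * 0.000001, 1500) for i in range(2000)]   # synthetic 2 ms burst

bytes_per_window = defaultdict(int)
for ts, length in packets:
    bytes_per_window[int(ts / WINDOW_S)] += length

for window, total in sorted(bytes_per_window.items()):
    if total > BYTES_PER_WINDOW_LIMIT:
        gbps = total * 8 / WINDOW_S / 1e9
        print(f"microburst at {window * WINDOW_S * 1000:.0f} ms: "
              f"{gbps:.1f} Gbps in a 1 ms window")
```

Averaged over a 10 millisecond interval, the same traffic would appear well under the link’s capacity, which is exactly how microbursts evade coarse-grained monitoring.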
To solve these issues, IT and NetOps teams should make sure their chosen solution can monitor the network at speeds approaching 100Gbps, measure latency to a sufficiently granular level, and analyze the smallest of microbursts. Granular, lossless network visibility will help ensure that their networks meet the high standards that HPC workloads demand.
Vince Hill is senior technical marketing manager at cPacket Networks.