The Data Center’s Traffic Cop: AI Clears Digital Gridlock

NVIDIA researchers created an AI model that can unsnarl traffic jams in computer networks, and it’s coming soon to a data center near you.
by Rick Merritt

Gal Dalal wants to ease the commute for those who work from home — or the office.

The senior research scientist at NVIDIA, who is part of a 10-person lab in Israel, is using AI to reduce congestion on computer networks.

For laptop jockeys, a spinning circle of death — or worse, a frozen cursor — is as bad as a sea of red lights on the highway. Like rush hour, it’s caused by a flood of travelers angling to get somewhere fast, crowding and sometimes colliding on the way.

AI at the Intersection

Networks use congestion control to manage digital traffic. It’s basically a set of rules embedded into network adapters and switches, but as the number of users on networks grows, their conflicts can become too complex to anticipate.

AI promises to be a better traffic cop because it can see and respond to patterns as they develop. That’s why Dalal is among many researchers around the world looking for ways to make networks smarter with reinforcement learning, a type of AI that rewards models when they find good solutions.

But until now, for several reasons, no one’s come up with a practical approach.

Racing the Clock

Networks need to be both fast and fair so no request gets left behind. That’s a tough balancing act when no one driver on the digital road can see the entire, ever-changing map of other drivers and their intended destinations.

And it’s a race against the clock. To be effective, networks need to respond to situations in about a microsecond, or one-millionth of a second.

To smooth traffic, the NVIDIA team created new reinforcement learning techniques inspired by state-of-the-art computer game AI and adapted them to the networking problem.

Part of their breakthrough, described in a 2021 paper, was coming up with an algorithm and a corresponding reward function for a balanced network based only on local information available to individual network streams. The algorithm enabled the team to create, train and run an AI model on their NVIDIA DGX system.
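To make the idea of a reward built only from local information concrete, here is a minimal sketch in Python. It assumes each flow can observe just its own round-trip time and sending rate. The function name, the target value and the specific formula (round-trip inflation multiplied by the square root of the rate) are assumptions chosen for illustration, not details confirmed in this article.

```python
import math


def local_reward(rtt_us: float, base_rtt_us: float, rate_gbps: float,
                 target: float = 1.0) -> float:
    """Hypothetical per-flow reward computed from local signals only.

    rtt_us      -- round-trip time the flow currently measures
    base_rtt_us -- round-trip time on an empty network (assumed known)
    rate_gbps   -- the flow's own sending rate
    target      -- illustrative operating point; an assumption, not a tuned value
    """
    if base_rtt_us <= 0 or rate_gbps < 0:
        raise ValueError("base_rtt_us must be positive and rate_gbps non-negative")
    # How much queuing delay the flow sees relative to an idle network.
    rtt_inflation = rtt_us / base_rtt_us
    # Combine delay and rate so fast flows are pushed harder to back off.
    signal = rtt_inflation * math.sqrt(rate_gbps)
    # Reward peaks at zero when the signal sits exactly on the target.
    return -(signal - target) ** 2


# A flow seeing no extra delay scores better than one seeing doubled delay.
uncongested = local_reward(rtt_us=10.0, base_rtt_us=10.0, rate_gbps=1.0)
congested = local_reward(rtt_us=20.0, base_rtt_us=10.0, rate_gbps=1.0)
```

Because the function reads nothing beyond what a single flow can measure, every flow could in principle evaluate it independently, which is the property the paragraph above describes.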

A Wow Factor

Dalal recalls the meeting where a fellow Nvidian, Chen Tessler, showed the first chart plotting the model’s results on a simulated InfiniBand data center network.

“We were like, wow, ok, it works very nicely,” said Dalal, who wrote his Ph.D. thesis on reinforcement learning at Technion, Israel’s prestigious technical university.

“What was especially gratifying was we trained the model on just 32 network flows, and it nicely generalized what it learned to manage more than 8,000 flows with all sorts of intricate situations, so the machine was doing a much better job than preset rules,” he added.

Reinforcement learning (purple) outperformed all rule-based congestion control algorithms in NVIDIA’s tests.

In fact, the algorithm delivered at least 1.5x better throughput and 4x lower latency than the best rule-based technique.

Since the paper’s release, the work’s won praise as a real-world application that shows the potential of reinforcement learning.

Processing AI in the Network

The next big step, still a work in progress, is to design a version of the AI model that can run at microsecond speeds using the limited compute and memory resources in the network. Dalal described two paths forward.

His team is collaborating with the engineers designing NVIDIA BlueField DPUs to optimize the AI models for future hardware. BlueField DPUs aim to run an expanding set of communications jobs inside the network, offloading tasks from overburdened CPUs.

Separately, Dalal’s team is distilling the essence of its AI model into a machine learning technique called boosting trees, a series of yes/no decisions that’s nearly as smart but much simpler to run. The team aims to present its work later this year in a form that could be immediately adopted to ease network traffic.
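For a rough sense of what distilling a model into boosted yes/no decisions can look like, here is a toy sketch in plain Python. It is not the team's method. The teacher policy is a made-up stand-in for a trained model, every name is hypothetical, and the student is a simple form of gradient boosting that uses one-question trees (often called stumps).

```python
def teacher_policy(rtt_inflation: float) -> float:
    """Made-up stand-in for a trained model: returns a rate multiplier."""
    if rtt_inflation < 1.5:
        return 1.2
    return 0.5 if rtt_inflation < 3.0 else 0.2


def fit_stump(xs, residuals):
    """Find the one yes/no question (x < t?) that best fits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # assumes xs holds at least two distinct values


def distill(xs, ys, rounds=50, lr=0.5):
    """Boosting loop: each stump corrects part of what the ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lr * lm, lr * rm))
        preds = [p + (lr * lm if x < t else lr * rm) for x, p in zip(xs, preds)]
    return base, stumps


def student_policy(model, x: float) -> float:
    """Evaluate the distilled model: a sum of simple yes/no decisions."""
    base, stumps = model
    return base + sum(lo if x < t else hi for t, lo, hi in stumps)


# Query the teacher on sample inputs, then train the student to imitate it.
xs = [1.0 + 0.1 * i for i in range(40)]
ys = [teacher_policy(x) for x in xs]
model = distill(xs, ys)
```

Evaluating the student takes only comparisons and additions, which hints at why a tree-based form is attractive when compute and memory are scarce.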

A Timely Traffic Solution

To date, Dalal has applied reinforcement learning to everything from autonomous vehicles to data center cooling and chip design. When NVIDIA acquired Mellanox in April 2020, the NVIDIA Israel researcher started collaborating with his new colleagues in the nearby networking group.

“It made sense to apply our AI algorithms to the work of their congestion control teams, and now, two years later, the research is more mature,” he said.

It’s good timing. Recent reports of double-digit increases in Israel’s car traffic since pre-pandemic times could encourage more people to work from home, driving up network congestion.

Luckily, an AI traffic cop is on the way.