Rebeca Moen
Mar 16, 2026 21:42
NVIDIA expands DGX Spark to support 4-node configurations, enabling local inference of 700B-parameter models and near-linear fine-tuning performance scaling.
NVIDIA has expanded its DGX Spark desktop AI platform to support up to four nodes, quadrupling available memory to 512 GB and enabling local inference of models with up to 700 billion parameters. The upgrade, announced alongside the NemoClaw agent toolkit, positions DGX Spark as a serious contender for enterprises looking to run autonomous AI agents without cloud dependencies.
The scaling numbers tell the story. Token generation throughput jumps from 18,400 tokens per second on a single node to 74,600 on four nodes, a clean 4x improvement for fine-tuning workloads. For inference tasks, time per output token drops from 269ms to 72ms when scaling from one to four nodes using tensor parallelism.
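As a sanity check, the speedups fall out of simple division over the figures quoted above (a minimal sketch; the benchmark numbers are NVIDIA’s, the script is mine):

```python
# Back-of-envelope check on the published scaling figures.
single_node_tps = 18_400   # fine-tuning token throughput, 1 node
quad_node_tps = 74_600     # fine-tuning token throughput, 4 nodes

single_node_tpot_ms = 269  # inference time per output token, 1 node
quad_node_tpot_ms = 72     # inference time per output token, 4 nodes

print(f"Fine-tuning speedup: {quad_node_tps / single_node_tps:.2f}x")        # ~4.05x
print(f"Inference speedup:   {single_node_tpot_ms / quad_node_tpot_ms:.2f}x") # ~3.74x
```

Fine-tuning throughput lands slightly above 4x, while tensor-parallel inference lands near 3.7x; the gap is consistent with the per-token communication cost that tensor parallelism adds during decoding.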
Why This Matters for AI Agent Development
Autonomous agents are memory hungry. NVIDIA’s benchmarks show agents routinely processing 30K-120K token context windows, with complex requests hitting 250K tokens. That is roughly equivalent to reading two full novels before responding to a single query.
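A rough KV-cache estimate shows where that memory goes. The architecture parameters below are illustrative assumptions, not the published specs of any model named in this article:

```python
# Rough KV-cache sizing for long agent contexts.
# All architecture parameters here are illustrative assumptions.
layers = 80          # assumed transformer depth
kv_heads = 8         # assumed grouped-query KV heads
head_dim = 128       # assumed head dimension
bytes_per_val = 2    # fp16/bf16

def kv_cache_gb(context_tokens: int) -> float:
    # 2x for keys and values, per layer, per KV head, per token.
    total = 2 * layers * kv_heads * head_dim * bytes_per_val * context_tokens
    return total / 1e9

for ctx in (30_000, 120_000, 250_000):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache")
```

Even before model weights are loaded, a single 250K-token context can claim tens of gigabytes of cache under these assumptions, which is why pooling 512 GB across four nodes changes what is feasible locally.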
The DGX Spark handles this through what NVIDIA calls the Grace Blackwell Superchip, which runs multiple subagents in parallel. Running four concurrent subagents takes only 2.6x more time than running one, while prompt processing throughput triples. For developers building multi-agent systems, that is the difference between waiting minutes rather than hours for complex reasoning chains.
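The fan-out pattern itself is easy to express. Here is a minimal sketch using Python’s asyncio; run_subagent is a hypothetical stand-in for a real subagent call, with a sleep in place of actual model work:

```python
import asyncio
import time

async def run_subagent(task_id: int) -> str:
    """Hypothetical stand-in for one subagent's reasoning call."""
    await asyncio.sleep(1.0)  # simulate a model round trip
    return f"subagent {task_id} done"

async def main() -> None:
    start = time.perf_counter()
    # Fan out four subagents concurrently rather than sequentially.
    results = await asyncio.gather(*(run_subagent(i) for i in range(4)))
    elapsed = time.perf_counter() - start
    print(results, f"in {elapsed:.2f}s")

asyncio.run(main())
```

In this toy version the four calls finish in roughly the time of one because the stand-in only waits; on real hardware concurrent subagents contend for compute, which is where the 2.6x figure, rather than an ideal 1x, comes from.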
Four Topology Options
NVIDIA outlined specific use cases for each configuration. A single node handles inference up to 120B parameters and local agentic workloads. Two nodes support models up to 400B parameters. Three nodes in a ring topology are optimized for fine-tuning larger models. The full four-node setup with a RoCE 200 GbE switch creates what NVIDIA calls a “local AI factory” capable of running state-of-the-art 700B parameter models.
Models explicitly called out as benefiting from multi-node stacking include Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B, all popular choices for the OpenClaw autonomous agent runtime that ships with NemoClaw.
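Encoded as a lookup, the brackets NVIDIA describes look like this (a hypothetical helper built from the figures above, not an NVIDIA API):

```python
# Illustrative mapping of the published DGX Spark topology brackets.
# Figures come from the announcement; the helper itself is hypothetical.
TOPOLOGIES = {
    1: "inference up to 120B parameters, local agentic workloads",
    2: "inference up to 400B parameters",
    3: "ring topology, fine-tuning larger models",
    4: "full 'local AI factory' with RoCE 200 GbE switch, up to 700B",
}

def recommend_nodes(model_params_b: int) -> int:
    """Pick the smallest node count whose inference ceiling fits the model."""
    if model_params_b <= 120:
        return 1
    if model_params_b <= 400:
        return 2
    return 4

print(recommend_nodes(397))  # Qwen3.5 397B -> 2 nodes
print(recommend_nodes(700))  # 700B-class models -> 4 nodes
```

Qwen3.5 397B, for instance, fits under the two-node inference ceiling, while 700B-class models need the full four-node fabric.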
The Cloud Bridge
Perhaps the most practical addition is Tile IR, a kernel portability layer that lets developers write code once on DGX Spark and deploy it to Blackwell B200/B300 data center GPUs with minimal changes. Roofline analysis shows kernels scale effectively relative to each platform’s theoretical peak, meaning optimizations made locally carry over to cloud deployments.
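The roofline model behind that analysis is straightforward: attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth). The sketch below uses placeholder peak figures, not official specs for either platform:

```python
# Classic roofline model: attainable = min(peak compute, intensity * bandwidth).
# All peak figures below are placeholder assumptions, not official specs.

def roofline(intensity: float, peak_tflops: float, bw_tb_s: float) -> tuple[float, str]:
    """Attainable TFLOP/s and the limiting resource for a kernel."""
    memory_bound = intensity * bw_tb_s
    if memory_bound < peak_tflops:
        return memory_bound, "memory-bound"
    return peak_tflops, "compute-bound"

platforms = {
    "desktop (assumed)":     (100.0, 0.3),  # peak TFLOP/s, memory TB/s
    "data center (assumed)": (2000.0, 8.0),
}

intensity = 300.0  # FLOPs per byte for some hypothetical transformer kernel

for name, (peak, bw) in platforms.items():
    tflops, regime = roofline(intensity, peak, bw)
    print(f"{name}: {tflops:.0f} TFLOP/s, {regime}, {tflops / peak:.0%} of peak")
```

A kernel that sits near its roofline on both machines is portable in the sense Tile IR targets: the same code extracts a comparable fraction of whatever peak each platform offers.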
This addresses a real pain point. Teams prototype on local hardware, then spend weeks rewriting for production cloud infrastructure. The cuTile Python DSL and TileGym’s preoptimized transformer kernels aim to eliminate that friction.
For enterprises weighing AI infrastructure investments, the expanded DGX Spark capabilities offer a middle path between pure cloud dependency and building out dedicated data center capacity. The ability to run 700B parameter models locally, with a clear upgrade path to cloud scale, makes the economic calculation more interesting than it was six months ago.
Image source: Shutterstock