Artificial intelligence is forcing radical changes in the physical design of cloud infrastructure. As the parameter counts of language models grow exponentially, traditional general-purpose networks have reached the limits of their performance. As noted by Benny Siman-Tov (Senior Director of Product Management) and Arjun Singh (Engineering Fellow) at Google, legacy designs struggle with massive scale, explosive bandwidth demands, strict low-latency requirements (critical for real-time response generation), and “synchronized bursts”: microsecond-scale traffic spikes that can clog buffers and slow down the entire cluster.
Google’s answer to these issues is Virgo Network – a “mega-scale” data center network that serves as the foundation for the company’s new AI Hypercomputer. Instead of versatility, the focus here is on specialization, adhering to the “campus as a computer” philosophy.
The new infrastructure is divided into three layers that work together but can evolve independently. The traditional Jupiter network now handles only north-south (vertical) traffic, providing access to storage. The main emphasis, however, is on east-west (horizontal) communication, which is split in two: a dedicated RDMA network that optimizes traffic between the accelerators themselves (GPUs/TPUs), and a tightly integrated “Scale-up” domain for chips within a single rack.
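The three-way split can be pictured as a simple routing decision. The sketch below is only an illustration of the description above, not Google's implementation: the `Endpoint` type, its `kind` and `rack` fields, and the `pick_fabric` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Fabric(Enum):
    SCALE_UP = "scale-up domain (chips within a single rack)"
    RDMA_SCALE_OUT = "dedicated RDMA network (east-west)"
    JUPITER = "Jupiter network (north-south)"


@dataclass(frozen=True)
class Endpoint:
    kind: str  # hypothetical label: "accelerator" or "storage"
    rack: str  # hypothetical rack identifier


def pick_fabric(src: Endpoint, dst: Endpoint) -> Fabric:
    """Choose the layer that would carry traffic between two endpoints."""
    both_accelerators = src.kind == "accelerator" and dst.kind == "accelerator"
    if both_accelerators and src.rack == dst.rack:
        return Fabric.SCALE_UP        # chips in the same rack
    if both_accelerators:
        return Fabric.RDMA_SCALE_OUT  # accelerator to accelerator across racks
    return Fabric.JUPITER             # anything involving storage


tpu_a = Endpoint("accelerator", "rack-1")
tpu_b = Endpoint("accelerator", "rack-2")
disk = Endpoint("storage", "rack-9")
print(pick_fabric(tpu_a, tpu_b).value)
print(pick_fabric(tpu_a, disk).value)
```

Because each kind of traffic maps to exactly one layer, each layer can be tuned or replaced without touching the other two, which is the point of the separation.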
This separation of tasks enabled the creation of a flat (two-tier), non-blocking topology based on high-radix switches. The benefits of deploying Virgo Network are impressive:
- Massive scale: The network can link up to 134,000 chips (TPU 8t) into a single system, offering a staggering bidirectional bandwidth of 47 petabits per second.
- Performance leap: The new network delivers up to 4 times more bandwidth per accelerator compared to the previous generation.
- Latency reduction: Latency between TPUs has dropped by 40 percent, which is crucial for keeping responses to user queries fluid.
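A back-of-the-envelope calculation puts these figures in perspective. The chip count and the 47 petabits per second come from the list above; everything else is an assumption made for illustration. It assumes the bandwidth is shared evenly across chips, and it uses the textbook two-tier leaf-spine model, in which every leaf switch splits its ports evenly between chips and spines. Google has not published Virgo's switch radix, and a real deployment may use several parallel network planes, so the radix estimate is not a description of the actual hardware.

```python
CHIPS = 134_000     # from the article
FABRIC_BPS = 47e15  # 47 petabits per second, from the article


def per_chip_gbps(total_bps: float, chips: int) -> float:
    """Even share of the fabric bandwidth per chip, in gigabits per second."""
    return total_bps / chips / 1e9


def two_tier_max_endpoints(radix: int) -> int:
    """Endpoints a non-blocking two-tier leaf-spine fabric can attach.

    Each leaf uses half its ports for endpoints and half for uplinks, and
    each spine port reaches one leaf, so there are at most `radix` leaves.
    """
    return radix * (radix // 2)


def min_radix_for(endpoints: int) -> int:
    """Smallest even switch radix whose two-tier fabric fits `endpoints`."""
    radix = 2
    while two_tier_max_endpoints(radix) < endpoints:
        radix += 2
    return radix


print(round(per_chip_gbps(FABRIC_BPS, CHIPS), 1))  # 350.7
print(min_radix_for(CHIPS))                        # 518
```

Under these assumptions each chip's even share is roughly 350 Gb/s, and a single-plane two-tier fabric of this size would need switches with about 518 ports, which shows why a flat topology depends on high-radix switches.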
With hardware connecting hundreds of thousands of chips, component failures are statistically inevitable. To prevent a single broken element or a slowed node (a “straggler”) from halting the entire training process, Virgo Network relies on advanced fault isolation and deep telemetry. The network monitors systems with sub-millisecond precision, which allows stalls to be detected almost immediately, buffers to be managed, and mean time to recovery (MTTR) to be minimized.
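To illustrate what straggler detection involves, here is a minimal sketch that flags nodes whose step time is far above the cluster median. This is a generic heuristic, not Google's method: the function name, the node names, the sample timings, and the 1.5x threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_stragglers(step_times_ms: dict[str, float], slowdown: float = 1.5) -> list[str]:
    """Return nodes whose step time exceeds `slowdown` times the median."""
    if not step_times_ms:
        return []
    typical = median(step_times_ms.values())
    return sorted(
        node for node, elapsed in step_times_ms.items()
        if elapsed > slowdown * typical
    )


# Hypothetical per-node timings for one training step, in milliseconds.
samples = {"node-a": 10.1, "node-b": 9.8, "node-c": 10.3, "node-d": 19.7}
print(find_stragglers(samples))  # ['node-d']
```

Comparing against the median rather than the mean keeps one very slow node from shifting the baseline it is judged against.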
As the company’s engineers argue, Virgo Network is an evolutionary step essential for entering the “agentic AI era.”

