Layer 2 data center networks scaled to 100,000 ports
San Diego, CA - University of California computer scientists have created software that they believe will allow massive data centers to logically function as single, plug-and-play networks.
The software system — PortLand — is a fault-tolerant, layer 2 data center network fabric capable of scaling to 100,000 nodes and beyond. It's fully compatible with existing hardware and routing protocols, they say.
Critically, it removes the reliance on a single spanning tree, natively leveraging multipath routing and improving fault tolerance.
"With PortLand, we came up with a set of algorithms and protocols that combine the best of layer 2 and layer 3 network fabrics," said Amin Vahdat, a computer science professor at UC San Diego's Jacobs School of Engineering. "Today, the largest data centers contain over 100,000 servers. Ideally, we would like to have the flexibility to run any application on any server while minimizing the amount of required network configuration and state."
Vahdat and his team revisited the trade-offs between layer 2 (Ethernet) networks, which forward on MAC addresses, and layer 3 networks, which route on IP addresses.
Their result, they say, is a system of algorithms and protocols that eliminates the scalability and routing-path limitations of existing layer 2 approaches and avoids the administrative and virtualization headaches caused by implementing layer 3 networks in data centers.
One of PortLand's key innovations is its location discovery protocol. Switches automatically learn their location within the data center topology and then assign "Pseudo MAC" (PMAC) addresses to each of the servers they connect to. These PMAC addresses — rather than MAC addresses — are used for packet forwarding.
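To make the idea of a location-encoding address concrete, here is a minimal sketch in Python. The article does not describe how a PMAC is laid out, so the field names and widths below (pod, position, port and vmid packed into 48 bits) are assumptions chosen for illustration, as are the names `Location`, `encode_pmac` and `decode_pmac`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical position of a host in the topology (assumed fields)."""
    pod: int       # assumed 16 bits: which pod the edge switch sits in
    position: int  # assumed 8 bits: the switch's position within the pod
    port: int      # assumed 8 bits: switch port the host is attached to
    vmid: int      # assumed 16 bits: distinguishes VMs behind one port


def encode_pmac(loc: Location) -> str:
    """Pack a Location into a 48-bit, MAC-formatted string."""
    fields = (("pod", loc.pod, 16), ("position", loc.position, 8),
              ("port", loc.port, 8), ("vmid", loc.vmid, 16))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    raw = (loc.pod << 32) | (loc.position << 24) | (loc.port << 16) | loc.vmid
    return ":".join(f"{byte:02x}" for byte in raw.to_bytes(6, "big"))


def decode_pmac(pmac: str) -> Location:
    """Recover the Location that a PMAC string encodes."""
    raw = int(pmac.replace(":", ""), 16)
    return Location(pod=raw >> 32,
                    position=(raw >> 24) & 0xFF,
                    port=(raw >> 16) & 0xFF,
                    vmid=raw & 0xFFFF)


pmac = encode_pmac(Location(pod=2, position=1, port=3, vmid=1))
print(pmac)               # 00:02:01:03:00:01
print(decode_pmac(pmac))
```

Because the address itself says where the host is, a switch in a scheme like this could forward on a prefix of the address rather than keeping one table entry per host.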
Servers still send out an ARP request, which asks for the MAC address that corresponds to the IP address of the computer with which they want to communicate. But now, instead of broadcasting this request to the entire network, the switch that receives the ARP request queries a directory service, which returns a PMAC address. When new machines are added, or when virtual machines are moved, new PMAC addresses are automatically generated.
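The lookup described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PortLand's implementation: `DirectoryService`, `EdgeSwitch` and their methods are hypothetical names, the directory is a plain in-memory dictionary, and the IP and PMAC values are made up. The article does not say what happens when the directory has no entry, so the sketch simply returns `None` in that case.

```python
from typing import Dict, Optional


class DirectoryService:
    """Hypothetical in-memory stand-in for the directory service."""

    def __init__(self) -> None:
        self._ip_to_pmac: Dict[str, str] = {}

    def register(self, ip: str, pmac: str) -> None:
        # Called when a machine is added or a VM moves; the newest PMAC wins.
        self._ip_to_pmac[ip] = pmac

    def lookup(self, ip: str) -> Optional[str]:
        # Returns None on a miss; miss handling is not covered by the article.
        return self._ip_to_pmac.get(ip)


class EdgeSwitch:
    """Hypothetical switch that intercepts ARP requests from its servers."""

    def __init__(self, directory: DirectoryService) -> None:
        self.directory = directory

    def handle_arp_request(self, target_ip: str) -> Optional[str]:
        # Ask the directory instead of broadcasting to the whole network.
        return self.directory.lookup(target_ip)


directory = DirectoryService()
switch = EdgeSwitch(directory)

# A server comes up and its (made-up) PMAC is registered.
directory.register("10.0.1.2", "00:02:01:03:00:01")
before_move = switch.handle_arp_request("10.0.1.2")

# The same VM moves elsewhere and is given a new (made-up) PMAC.
directory.register("10.0.1.2", "00:05:00:01:00:01")
after_move = switch.handle_arp_request("10.0.1.2")

print(before_move, after_move)
```

The sender's ARP request is unchanged in this sketch; only the switch's handling of it differs, which is consistent with the claim that servers need no modification.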
"We have replaced broadcast with a server lookup. And we are forwarding based on PMAC addresses rather than MAC addresses. On the last hop, the egress hop, the switch rewrites the PMAC to be its actual MAC address," said Vahdat.
"An important thing here is that all the switches are off the shelf — unmodified 'merchant silicon'," added Vahdat.
A full prototype is currently running on a network in the Department of Computer Science and Engineering at UC San Diego's Jacobs School of Engineering.