Applying the Tenets of Availability - Example 1
In earlier blog posts I talked about the key tenets of availability: the value of simplicity where something is not redundant, and the key characteristic (active-active) of good redundancy. I thought it would be interesting to discuss a couple of examples of these tenets in practice.
In this post I will cover the first: how a critical issue was addressed as Nortel looked at building data networks with the level of availability necessary to support telephony services, approaching that of traditional vertically designed systems. This was a key issue, as a wiring closet switch with an MTBF of 10 years and an MTTR of 4 hours would have about 24 minutes a year of downtime. That is almost 5 times the downtime typical of a redundant-processor PBX, all of it attributable to this one device. Obviously, when the core switch is added in the campus, the numbers become worse due to the size and complexity of that platform.
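The 24-minute figure can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are my own, and the PBX reference point of "five nines" (99.999%) is an assumption chosen because it is consistent with the "almost 5 times" comparison, not a figure stated in this post.

```python
HOURS_PER_YEAR = 365 * 24
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: time working / (time working + time in repair)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Wiring closet switch from the text: MTBF of 10 years, MTTR of 4 hours.
closet_availability = availability(mtbf_hours=10 * HOURS_PER_YEAR, mttr_hours=4)
closet_downtime = downtime_minutes_per_year(closet_availability)

# ASSUMPTION: a redundant-processor PBX is taken to be a "five nines" system.
PBX_AVAILABILITY = 0.99999
pbx_downtime = downtime_minutes_per_year(PBX_AVAILABILITY)

ratio = closet_downtime / pbx_downtime

print(f"closet switch: {closet_downtime:.1f} minutes/year of downtime")
print(f"assumed PBX:   {pbx_downtime:.1f} minutes/year of downtime")
print(f"ratio:         {ratio:.1f}x")
```

Under that assumption the closet switch alone comes out at roughly 24 minutes a year against a little over 5 minutes for the PBX, which is where the "almost 5 times" comes from.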
A key to availability was to make the core switches redundant. At the time there were two ways of doing this, as shown in the chart: first, you could use spanning tree protocols to connect an L2 closet switch to the L3 core switches, or you could use L3 protocols by making the wiring closet switch a "router" and using BGP or OSPF to manage the links. Both of these solutions had significant (if not fatal) drawbacks.
The spanning tree method created a simple closet device, but the uplinks were an active-passive (hot-standby) system. This reduced the average usable capacity of the investment by 50%, and the time to complete a switchover was many seconds. In fact, the switchover time was so long it exceeded the average human patience to remain on a call (are you there??). Successive efforts in the spanning tree space have tried valiantly, with limited success, to solve this problem.
On the other hand, turning the wiring closet device into an L3 router created active-active redundancy, but it had a significant impact on the complexity of the non-redundant devices. It also extended the routing table into those devices. As significant operational issues can occur in managing the complexity of large L3 environments, this impacted the reliability of that device and of the overall system. As there are typically 10-20 times as many wiring closets as cores in a network, the order-of-magnitude increase in complexity that comes from making each of them a router adds up to a 100-200x increase in complexity overall. Ultimately, this results in decreased availability.
So the Nortel engineers set out to develop an alternative. Through some brilliant thinking, they realized that the industry standard 802.3ad link aggregation (Multi-Link Trunking) could be "split" between two redundant core switches. By connecting the two core switches together with a "redundant path", the closet switches could send packets over both links, thinking they were talking to a single core switch. A failure of an uplink or card would quickly (today in less than 100 msecs) cause a switchover to the available path. This design is shown in the figure. By focusing on meeting the architectural goals, the engineers created a unique (and patented) solution that has never been equaled. It combines the key attributes of active-active redundancy (always on, quick switchover, easy configuration) with the attribute of simplicity in the wiring closet. While we added L3 DiffServ CoS to simplify packet markings (eliminating the conversion of L2 preference bits to DiffServ), the design eliminated the need for full-blown router complexity and the addition of many additional endpoint router domains. This technology, known as "Split-MLT", is a key ingredient in creating a data network that is truly available.
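To make the active-active behavior concrete, here is a toy model of what the closet switch sees: one logical trunk whose two member links happen to land on different core switches. This is a sketch under my own assumptions, not Nortel's implementation. The class name, the link names, and the use of a CRC32 hash over the source and destination addresses to pick a link are all hypothetical choices for illustration.

```python
import zlib


class SplitTrunk:
    """Toy model of a closet switch's single logical trunk.

    The closet switch neither knows nor cares that its member links
    terminate on two different core switches.
    """

    def __init__(self, links):
        self.links = list(links)
        self.up = {link: True for link in self.links}

    def active_links(self):
        return [link for link in self.links if self.up[link]]

    def fail(self, link):
        """Mark a member link (or the card/switch behind it) as down."""
        self.up[link] = False

    def pick_link(self, src, dst):
        """Choose a member link for a flow by hashing its addresses.

        ASSUMPTION: CRC32 over "src->dst" is only a stand-in for whatever
        distribution algorithm a real switch uses.
        """
        active = self.active_links()
        if not active:
            raise RuntimeError("no uplink available")
        key = f"{src}->{dst}".encode()
        return active[zlib.crc32(key) % len(active)]


# Hypothetical flows from closet hosts to one server.
flows = [(f"10.0.0.{i}", "10.1.0.1") for i in range(1, 51)]

trunk = SplitTrunk(["core-A", "core-B"])
before = [trunk.pick_link(src, dst) for src, dst in flows]
print("links in use before failure:", sorted(set(before)))

# Lose the uplink to core-A: every flow moves to the surviving link,
# with no protocol reconvergence modeled or needed in the closet.
trunk.fail("core-A")
after = [trunk.pick_link(src, dst) for src, dst in flows]
print("links in use after failure: ", sorted(set(after)))
```

The point of the sketch is the simplicity on the closet side: all links carry traffic while they are up, and a failure only shrinks the set of links to choose from. The hard part, keeping the two core switches in step over the path between them, is deliberately not modeled here.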
It is interesting to note that many of the same thoughts went into creating Nortel's leadership in Metro Ethernet and PBT. But we will save that for a later blog.