paulengr

Redundancy causing network disruptions?

5 posts in this topic

We just set up RSTP-based redundancy on a ring of 7 Cisco switches connected by 1 Gbps fiber. While getting everything configured and connected, we noticed various disruptions to the existing PLCs (all L61s with 3-5 remote IO cards running EtherNet/IP). For example:

1. Disconnect the link farthest from the root. Since that link is marked "blocking", there was no effect.
2. Also disconnect the link next to the blocking link, forming an "island". No effect except to devices on the island (interruption).
3. Reconnect the island using the formerly blocked link. PLCs on the connected network (not part of the island) are momentarily disrupted.

#3 is the one that concerns me.

Reading through the RSTP documents, the expectation seems to be that RSTP restores connectivity on a ring in roughly X + Y*Z seconds, where X is the time to detect a failed link, Y is the round-trip time for a proposal/agreement handshake between two adjacent switches, and Z is the number of switches in the ring; this doubles for a root failure. So with X = 5 ms, Y = 5 ms, and a 7-node ring, I should see about 40 ms recovery, or 80 ms if the root fails.

As a secondary effect, the forwarding tables on every switch whose ports change direction get flushed until the usual learning mechanisms (IGMP snooping for multicast, source-address learning for unicast) rebuild them, so the affected ports flood traffic for a period of time. Given the limitations of the hardware in use (some 20-COMM-E cards at 600 packets/second max throughput; the ENBT at about 5,000 packets/second), that flood can easily overload the affected PLCs and IO.

According to the EtherNet/IP documentation, missing packets for a period of 3 RPIs means the connection is considered dead and must be restarted. RPIs are currently set to 20 milliseconds. So in my case, if everything on the RSTP side goes as planned, the 3rd IO packet (60 milliseconds) will just make it in time provided it doesn't get clobbered by anything else, and since no ports reversed direction, traffic should be back to normal in about 40 milliseconds. Traffic coming from the island, however, might cause disruption, since it will be flooded until the next IGMP/multicast update and the combined load could overload my IO devices (especially the drives).

Am I reading this correctly? Does anyone have experience with, or measurements of, how RSTP recovers and how quickly Cisco switches handle it?
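Here is the back-of-envelope I'm working from, as a Python sketch (the 5 ms detection and proposal/agreement figures are assumptions, not measured values):

    # Back-of-envelope: RSTP recovery estimate vs. the EtherNet/IP
    # connection timeout. Detection and handshake times are assumptions.

    def rstp_recovery_ms(detect_ms, handshake_ms, ring_nodes, root_failure=False):
        """X + Y*Z: detection time plus one proposal/agreement round trip
        per node in the ring; doubled if the root bridge itself fails."""
        total = detect_ms + handshake_ms * ring_nodes
        return 2 * total if root_failure else total

    rpi_ms = 20                 # requested packet interval on the IO connections
    timeout_ms = 3 * rpi_ms     # connection declared dead after 3 missed RPIs

    for root_fail in (False, True):
        recovery = rstp_recovery_ms(5, 5, 7, root_failure=root_fail)
        print(f"root failure={root_fail}: ~{recovery} ms recovery, "
              f"{timeout_ms - recovery:+} ms margin against the {timeout_ms} ms timeout")

That gives me the +20 ms margin for a normal link failure and the -20 ms shortfall for a root failure that I described above.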

Paul... Just came across this topic. You are reading it correctly, and you are not alone. We have enhanced RSTP with a solution called "Fast Ring". Where faster recovery is required, Phoenix Contact has optimized our managed switch line to recover in exactly the scenario you describe: Fast Ring restores both unicast traffic and multicast groups in approximately 200 ms or less, even with very large topologies. Our solution also works in more than a simple ring topology, for example partial and full mesh, interlocking rings, etc. Phoenix Contact switches running Fast Ring can also interconnect with other STP-capable switches on the IT side, such as Cisco. For more technical details, see our product page: http://www.phoenixcontact.com/automation/32119_32466.htm or call our free tech support at 800 322-3225. Edited by Phoenix Contact USA

Can't comment on your question, Paul, as you're on the bleeding edge ahead of most of us -- as usual. Just wanted to say your post is well written and raises a very valid point, if I follow you correctly. Please don't drop this idea, and do let us all know how it shakes out. Sooner or later it's a bridge we'll all have to cross.

Paul, do you think it's the additional broadcast/multicast traffic that is causing the interruption to the PLCs on the connected network when the "island" is reconnected? If so, I think broadcast traffic can be suppressed/limited in some of the Cisco switches. Edited by chelton
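As a rough check (Python sketch; the total connection count is a made-up placeholder, not your real number), flooded multicast alone could exceed the device limits you quoted:

    # Rough estimate of flooded traffic while the switches relearn their
    # multicast groups: if unknown multicast is flooded to every port,
    # each device briefly sees the sum of all IO connections' packet rates.
    # The connection count below is a placeholder.

    rpi_ms = 20
    pps_per_connection = 1000 / rpi_ms        # ~50 packets/s in each direction

    device_limit_pps = {"20-COMM-E": 600, "ENBT": 5000}

    flooded_connections = 20                  # assumed total connections being flooded
    flood_pps = flooded_connections * pps_per_connection

    for device, limit in device_limit_pps.items():
        verdict = "overloaded" if flood_pps > limit else "ok"
        print(f"{device}: ~{flood_pps:.0f} pps of flooded traffic vs {limit} pps limit -> {verdict}")

On those assumed numbers the ENBT rides it out but the 20-COMM-E drives do not, which would line up with what you're seeing.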

I know you are probably working on this first and foremost, but why does the link go down in the first place? Are the blocked ports static or dynamic? Does the protocol at some point decide that the cost of the active link is so high that it needs to switch to the blocked one? I would think that with a Gb link you would not run into that, but I have seen some protocols measure cost this way, and if the default threshold is too low the backup route momentarily looks more attractive and this toggling occurs. If you configure the blocked link statically, though, I would guess that should not happen. When I saw it occur, it was in a regular Cisco network where they let the switches elect the root and block a port dynamically. Then the load on the end link went above whatever threshold triggered the math into thinking the other link was better, so it switched, and then it would bounce back and forth....
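If you want to sanity-check what the ring should be using, the IEEE 802.1D-2004 "long" defaults are purely a function of link speed, not live load (quick sketch below; the seven-link gigabit ring is just an assumption to match Paul's setup):

    # IEEE 802.1D-2004 "long" default path costs depend only on link speed,
    # not on utilization, so an all-gigabit ring should show identical costs
    # on every link unless a port has been overridden by hand.
    # (Many Cisco switches still default to the older "short" values,
    # where 1 Gb/s = 4.)

    LONG_PATH_COST = {        # link speed in Mb/s -> recommended path cost
        10: 2_000_000,
        100: 200_000,
        1_000: 20_000,
        10_000: 2_000,
    }

    ring_links_mbps = [1_000] * 7          # assumed: seven 1 Gb fiber links
    costs = [LONG_PATH_COST[s] for s in ring_links_mbps]
    print(costs)   # equal costs -> root and blocked port decided by bridge/port IDs

So if the costs are all equal and nobody has touched the priorities, the flapping I described would have to be coming from somewhere else.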

