Strange Ethernet/IP "blips"
Nov 2 2009, 08:19 PM | Post #1
Propeller Head | Group: MrPLC Member | Posts: 1,003 | Joined: 27-March 05 | From: North Carolina | Member No.: 7,183 | Country: United States
It's happened twice now. I've been running Ethernet/IP networks for a few years and most recently started using them for I/O.
We have a 1756-L61 with 5 remote I/O nodes. Two are remote racks. The other three are 22-COMM-E's (PowerFlex 40's). At the end points, we have N-Tron "dumb" switches. These are connected to a Cisco 3570 switch. There are also two Wonderware servers, connected via several more switches, talking to the PLC through the same 3570. Well, we did. I substituted a Sixnet switch for the N-Tron at the PLC end simply because it's the first managed switch I could lay my hands on, so that I could monitor the ports there more closely.
However, twice now in the past 2 or 3 weeks, all communications have ceased for about 10 seconds and then reconnected as if nothing ever happened. The remote racks and drives are all programmed to fail safe when this happens (shut down). The trending data on the servers indicates that there was never a loss of communication between the servers and the PLC. All the port monitoring data says that all ports were OK with no loss of communication. Nothing in the logs indicates a power failure (which, if nothing else, would show up as a loss of communication). It's as if at the CIP level the links were all severed and then recovered for no apparent reason.
The really strange part is the server communication. If we had lost communication with the PLC altogether, I would narrow the cause down to the Cisco switch, the N-Tron (now Sixnet) switch, the ENBT card, or the cabling between all of these. But during these events I'm getting data updates at the servers, with tags that are changing values, which suggests that all of the above hardware is indeed communicating. It's as if all producer/consumer traffic suddenly imploded in a puff of logic and then got regenerated.
Has anyone ever seen anything like this with Ethernet/IP? I'm waiting for my license for RSLogix Pro to get here because, if nothing else, I'd really, really like to load up RSNetWorx, poke around, and see if there are any obvious configuration issues that were missed.
Nov 3 2009, 10:30 AM | Post #2
Sparky | Group: MrPLC Member | Posts: 78 | Joined: 22-September 04 | Member No.: 2,296 | Country: United States
Not all Ethernet switches and hubs are alike. I have learned, the hard way, that you should test them offline. N-Tron is a solid, reliable product. My past experience with Sixnet was not good, but I have not used them for years, so I won't comment further.
The best way to troubleshoot is with a managed switch. Take a diagram of your network and look for bottleneck points. All devices should be configured for a fixed transaction speed and not auto-negotiate. The fixed speed should be that of the slowest device on the network. Place the managed switch such that you can monitor most of the transactions and look for the device that is generating the most requests. The switch is one place I would look first.
Then, have you performed any software upgrades? Look at the update rates on the devices and see if something has changed. I once redownloaded the program to an L61 and lost the RPI settings, so check those to make sure they haven't been changed or lost.
This is general testing I have used. Some of it may not apply, and without a hardware layout and configuration settings it is difficult to point you in a specific direction. There is some monitoring software available online that looks at packets and transactions; some of it costs money. Good luck, as AB Ethernet is not easy to troubleshoot and may take some time and patience.
Nov 3 2009, 03:49 PM | Post #3
Sparky | Group: MrPLC Member | Posts: 68 | Joined: 10-October 07 | Member No.: 26,111 | Country: United States
I had a problem where, when data got to a certain traffic level, the Cisco switch would shut the port down momentarily. This was caused by Voice over IP settings on the Cisco switch. There was nothing physically wrong with the network, just the amount of traffic.
Did you have your Cisco firmware upgraded?
Nov 3 2009, 10:24 PM | Post #4
Propeller Head | Group: MrPLC Member | Posts: 1,003 | Joined: 27-March 05 | From: North Carolina | Member No.: 7,183 | Country: United States
QUOTE
I had a problem where, when data got to a certain traffic level, the Cisco switch would shut the port down momentarily. This was caused by Voice over IP settings on the Cisco switch. There was nothing physically wrong with the network, just the amount of traffic. Did you have your Cisco firmware upgraded?
I have no control over the Cisco garbage. The port shutdown garbage is the reason that Cisco cannot be used in an industrial controls network as far as I'm concerned, no matter how much AB tries to be their marketing arm. My next move is to bypass it.
That being said, as soon as I turned on IGMP querying (guess what IT wasn't doing?), the packet rates dropped significantly. Apparently even the Cisco stuff was configured with IGMP snooping, just not generating queries.
I have drives with 22-COMM-E boards, which only have packet limits of about 500-600 packets/second. The rest are ENBT's, which have packet throughputs up to about 5500 packets/second, and they're reporting loads averaging around 1100 packets/second. I'm somewhat concerned about the total bandwidth limits in the Cisco switch, as they have everything VLAN'd properly as far as packet isolation goes but not bandwidth control/isolation. That requires a whole other conversation.
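For anyone wanting to sanity-check numbers like these, here is a minimal Python sketch of a headroom calculation. The capacity and load figures are the ones quoted in this post. The `2 * connections / RPI` estimate is the usual rule of thumb for cyclic EtherNet/IP I/O (one packet each way per RPI), and the connection count and RPI in the example are made up for illustration. It ignores any multicast flooded in from other producers, which is the part that IGMP querying took care of here.

```python
def io_packets_per_second(connections: int, rpi_ms: float) -> float:
    """Rule-of-thumb cyclic I/O load: one packet each way per connection per RPI."""
    return 2 * connections * 1000.0 / rpi_ms


def utilization(load_pps: float, capacity_pps: float) -> float:
    """Fraction of a module's packet capacity that a given load consumes."""
    return load_pps / capacity_pps


# Figures quoted in the post.
ENBT_CAPACITY_PPS = 5500
COMM_E_CAPACITY_PPS = 500  # low end of the 500-600 range
ENBT_REPORTED_LOAD_PPS = 1100

enbt_util = utilization(ENBT_REPORTED_LOAD_PPS, ENBT_CAPACITY_PPS)

# Hypothetical drive setup (not from the thread): 1 connection at a 5 ms RPI.
drive_load = io_packets_per_second(connections=1, rpi_ms=5.0)
drive_util = utilization(drive_load, COMM_E_CAPACITY_PPS)

print(f"ENBT utilization:  {enbt_util:.0%}")
print(f"Drive utilization: {drive_util:.0%} at {drive_load:.0f} packets/second")
```

The point of the second calculation is that a drive adapter with a low packet limit can sit close to its ceiling on its own traffic alone, so it takes very little extra flooded traffic to push it over.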
Nov 4 2009, 10:36 AM | Post #5
Sparky | Group: MrPLC Member | Posts: 22 | Joined: 9-March 05 | Member No.: 6,474
QUOTE
All devices should be configured for a fixed transaction speed and not auto-negotiate. The fixed speed should be the slowest device on the network. Place the managed switch such that you can monitor most of the transactions and look for the device that is generating the most requests. The switch is one place I would look first.
I agree about the Sixnet device, but the above... NONOONONONONONONONONONONO.
Unless you are one of the 0.015% of super-networking gurus AND have a solid reason, NEVER change a device from AutoNegotiate. That is a recipe for problems.
For one thing, you have to guarantee that AN will stay turned off forever on each end of the wire between devices. Switch dies? Someone replaces it and leaves it at default? Broken, damaged connection. Dropping in an unmanaged device where you can't set AN? Broken connection.
The bad part is that an AN mismatch may not show up for anywhere from minutes to months, but it will. Turning off AN will usually result (no matter how carefully you try to configure the endpoint devices) in the wire mode being set forever to half duplex instead of full. For modern devices (less than about 7 years old), turning off AN can also break auto-MDIX (automatic crossover detection).
Ten years ago or so, when Fast Ethernet was first becoming popular, AN devices had some issues. That has changed.
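To make the half-duplex point concrete, here is a small Python sketch of how the two ends of a link settle on a duplex mode. It is a deliberately simplified model, not a description of any particular switch: it assumes both ends agree on speed, that an auto-negotiating port advertises full duplex, and that an auto-negotiating port facing a forced partner falls back to half duplex, which is the standard parallel-detection behavior. The `Port` class and function names are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    autoneg: bool
    forced_duplex: Optional[str] = None  # "full" or "half"; only used when autoneg is False


def resolved_duplex(local: Port, partner: Port) -> str:
    """Duplex mode the local port ends up using (simplified model)."""
    if not local.autoneg:
        return local.forced_duplex  # a forced port uses whatever it was told
    if partner.autoneg:
        return "full"  # both negotiate; assumes both advertise full duplex
    return "half"  # partner is silent, so parallel detection falls back to half


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of the wire disagree about duplex."""
    return resolved_duplex(a, b) != resolved_duplex(b, a)


forced_full = Port(autoneg=False, forced_duplex="full")
auto = Port(autoneg=True)

print(duplex_mismatch(auto, auto))                # both ends negotiating
print(duplex_mismatch(forced_full, forced_full))  # both ends forced the same way
print(duplex_mismatch(forced_full, auto))         # replacement switch left at default
```

The third case is the "someone replaces the switch and leaves it at default" scenario: the forced end runs full duplex while the new end runs half, and the link still comes up, which is why the problem can hide for a long time.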
Nov 4 2009, 12:02 PM | Post #6
Propeller Head | Group: MrPLC Member | Posts: 1,069 | Joined: 31-January 02 | Member No.: 35
If you are getting an accumulation of physical errors like noise, or getting auto-negotiate errors, you're going to see that in the embedded diagnostics of the 1756-ENBT and the 22-COMM-E, so check there.
Use a web browser to examine the status of the 1756-ENBT's I/O connections. If they all have the same uptime, that tells us something, because they all failed and re-started at about the same time.
The web browser will also show you the CPU utilization of the 1756-ENBT. Because EtherNet/IP is a stack-based protocol, it is possible to just overload the module with traffic and have it drop I/O connections. That's one of the reasons I use ControlNet for I/O.
It's possible for the Atmel backplane chip issue from a few years ago to behave this way, breaking one or more I/O connections at the backplane level and not faulting the -ENBT or the controller. Double-check for Atmel backplane bridge chips with -64 at the end of them and use the RA Knowledgebase document to check for the remanufacturing indicators.
If you can mirror the port that the 1756-ENBT is using and capture the failure as it happens, you'll see whether the 1756-ENBT just stopped transmitting, or if it stopped receiving from the field devices, or if something else happened that caused traffic to cease to that port. I keep a 1756-L1 handy just to use the serial port to stop my Wireshark or TShark captures via ASCII and SerialKeys (there's a how-to in the Download section).
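For the mirrored-port capture, one way to wait out a fault that only shows up every few weeks is a TShark ring buffer, so the disk never fills and the last stretch of traffic is always on hand when the capture is stopped. The Python sketch below only builds the command line. The interface name and output file are placeholders, the sizes are arbitrary, and the capture filter assumes the standard EtherNet/IP ports (TCP 44818 for explicit messaging, UDP 2222 for I/O). This is not the SerialKeys how-to mentioned above, just an illustration of the capture side.

```python
from typing import List

# Standard EtherNet/IP ports: TCP 44818 (explicit messaging), UDP 2222 (implicit I/O).
ENIP_CAPTURE_FILTER = "tcp port 44818 or udp port 2222"


def build_ring_capture_cmd(interface: str, out_file: str,
                           file_kb: int = 10240, files: int = 20) -> List[str]:
    """Build a tshark command that captures into a fixed-size ring of files."""
    return [
        "tshark",
        "-i", interface,               # capture interface attached to the mirror port
        "-f", ENIP_CAPTURE_FILTER,     # capture filter (BPF syntax)
        "-b", f"filesize:{file_kb}",   # roll to the next file after this many kB
        "-b", f"files:{files}",        # keep at most this many files, then overwrite
        "-w", out_file,                # base name for the ring buffer files
    ]


# Placeholder interface and file name; adjust for the capture PC.
cmd = build_ring_capture_cmd("eth1", "enbt_mirror.pcapng")
print(" ".join(cmd))
```

The list can be handed to `subprocess.Popen(cmd)` and the process terminated when the outage is detected, which leaves the most recent files covering the event.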
Nov 4 2009, 09:52 PM | Post #7
Propeller Head | Group: MrPLC Member | Posts: 1,003 | Joined: 27-March 05 | From: North Carolina | Member No.: 7,183 | Country: United States
QUOTE
If you are getting an accumulation of physical errors like noise, or getting auto-negotiate errors, you're going to see that in the embedded diagnostics of the 1756-ENBT and the 22-COMM-E, so check there.
Did that. No increases in error counts.
QUOTE
Use a web browser to examine the status of the 1756-ENBT's I/O connections. If they all have the same uptime, that tells us something, because they all failed and re-started at about the same time.
Did that. That's what happened. The only strange thing is that the drives all went offline almost exactly 5 minutes earlier and then came back.
QUOTE
The web browser will also show you the CPU utilization of the 1756-ENBT. Because EtherNet/IP is a stack-based protocol, it is possible to just overload the module with traffic and have it drop I/O connections. That's one of the reasons I use ControlNet for I/O.
Hence the reason that the switches are getting upgraded now to something where I can limit the packet rates below the point where the module overloads.
My #1 suspicion right now is that the Cisco switch (which doesn't have any kind of VLAN bandwidth allocation settings) may have become overloaded handling a broadcast storm on another VLAN and thrown away enough packets on the I/O network to cause it to lose communication.
My second thought: without IGMP support, I've been seeing traffic of around 1100 packets/second at the ENBT modules at the processor (half of it HMI-generated) and a little over 500 packets/second at the remote I/O racks, so I suspect that the drives became overloaded very slowly over time and finally choked. What I don't know is what happened next. If everything works the way it should, then ONLY the drives should drop off the network, reboot (or fault or something), and then come right back, and the ENBT cards should be unaffected. That's what SHOULD happen. With IGMP now enabled and working correctly, nothing is anywhere near that kind of packet throughput.
However, I'm now wondering if those same drives did something very bad while they blew their memories and caused a momentary outage while they choked all over the network.
QUOTE
It's possible for the Atmel backplane chip issue from a few years ago to behave this way, breaking one or more I/O connections at the backplane level and not faulting the -ENBT or the controller. Double-check for Atmel backplane bridge chips with -64 at the end of them and use the RA Knowledgebase document to check for the remanufacturing indicators.
Hmm... these were all bought brand new in the last 2 months. I forgot to check for that.
QUOTE
If you can mirror the port that the 1756-ENBT is using and capture the failure as it happens, you'll see whether the 1756-ENBT just stopped transmitting, or if it stopped receiving from the field devices, or if something else happened that caused traffic to cease to that port. I keep a 1756-L1 handy just to use the serial port to stop my Wireshark or TShark captures via ASCII and SerialKeys (there's a how-to in the Download section).
At this point I was looking for ideas before I went crazy with Wireshark and waited weeks again for it to happen. That is of course the next best thing. I hadn't thought of doing this before, and I left the L61 serial port open (don't really need it except for troubleshooting), so this might be a good time to try using it if I get a third outage.
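The uptime comparison discussed earlier in this post is easy to do by hand for five connections, but a short Python sketch shows the idea for larger systems: turn each connection's uptime into a restart time and group the ones that fall close together. The connection names, uptimes, and read time below are invented for illustration (chosen to mimic drives dropping about 5 minutes before the racks), and in practice the uptimes would be typed in from the module's web page.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def group_restarts(uptimes: Dict[str, timedelta], read_at: datetime,
                   tolerance: timedelta = timedelta(seconds=15)) -> List[List[str]]:
    """Group connections whose restart times (read_at - uptime) are close together."""
    restarts = sorted((read_at - up, name) for name, up in uptimes.items())
    groups: List[List[tuple]] = []
    for when, name in restarts:
        # Extend the current group if this restart closely follows the previous one.
        if groups and when - groups[-1][-1][0] <= tolerance:
            groups[-1].append((when, name))
        else:
            groups.append([(when, name)])
    return [[name for _, name in group] for group in groups]


# Hypothetical uptimes as read from the web page at one moment.
read_at = datetime(2009, 11, 4, 21, 0, 0)
uptimes = {
    "rack_a": timedelta(seconds=3600),
    "rack_b": timedelta(seconds=3603),
    "drive_1": timedelta(seconds=3902),
    "drive_2": timedelta(seconds=3900),
    "drive_3": timedelta(seconds=3898),
}

events = group_restarts(uptimes, read_at)
print(events)
```

Two separate groups, earliest first, would point to two separate events rather than one network-wide drop, which is worth knowing before deciding where to put the capture.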