Conor

Ethernet issues


Hi guys, I have two networks on my site. One is a ControlNet and the other is Ethernet. The ControlNet works fine, but since we would need to bring down the Keeper to make any changes or add new devices, and that would mean stopping processes, we started putting any newer equipment onto our Ethernet network. We have mainly PCs, PLCs and PanelViews connected to the Ethernet, with fibre links over the longer distance runs. I have replaced most of the older unmanaged switches with managed switches (as per advice on MrPLC). I also had a few older SLC PLCs and one PLC-5 (with a sidecar Ethernet card) on 10 Mb half duplex. I have changed all of these to autonegotiate and they are now running at 100 Mb full duplex, except for one older SLC whose firmware is too old to allow it; I will be installing a new processor in that one soon. The other PLCs are mainly CompactLogix with one ControlLogix. The PCs are running Wonderware InTouch SCADA applications.

Last week we had a problem where I was dropping connections between one area and another. I suspected an old Westermo SD-321 fibre-to-Ethernet converter, so I changed it out, but we were still getting the problem. We then found a problem with a new L35E that had been installed on the network. This processor needed to be re-flashed, and that cured the problem (solution given by Rockwell tech support). The processor is connected to a 1783-EMS08T. Before the re-flash, when we went through the web server to connect to this switch, the port connected to the PLC was showing a speed of 100 Mb, half duplex, yet when we checked the processor properties online with RSLogix 5000 it was set to autonegotiate and was reading 100 Mb, full duplex. After the re-flash the Ethernet crashing errors stopped happening, and the managed switch's web server then showed the port at 100 Mb, full duplex.

We kind of stumbled upon this problem, and my question is: what would be the best way to troubleshoot a problem like this? I have Wireshark installed on my PC but am not 100% sure how to use it. Would Wireshark find this problem? Thanks, Conor

It probably helps to understand first what half/full duplex means, and to a degree what 10BASE-T and 100BASE-TX are. During the initial connection to the switch, a device can do "nothing" and just start communicating Ethernet packets. This default comes up as 10BASE-T, which is 10 Mbps, half duplex only. Or the device can send some control signals and negotiate for a faster speed (currently up to 10GBASE-T, which is 10 Gbps over twisted-pair copper wiring). It can also negotiate whether to operate in half or full duplex, which is also a bit of history.

Historically, Ethernet was designed to operate over coaxial cabling (thick 10BASE5 or thin 10BASE2). All devices were connected to the same physical medium, although boosters/repeaters (hubs) were allowed. Each device had to follow strict "listen before transmit" rules, and just like on an RS-485 network (DeviceNet, for instance), collisions (two transmitters starting at the same time) are a fact of life. Although Ethernet was originally spec'd at 10 Mbps, in reality, due to the collision problem, the absolute theoretical maximum throughput when considering two-way traffic was only 3.5 Mbps, far below the other standard in common use at the time, ARCNET. ARCNET sets up a transmit schedule so that the collision problem never happens, but in exchange timing is much more critical and delays are longer (though consistent) because each transmitter has to wait for its assigned time slot even if it has nothing to send. The ODVA has published a specification for transmitting CIP packets over ARCNET, which you probably know as ControlNet.

As more devices were loaded onto a given Ethernet, they had to share the same 3.5 Mbps and things got slower and slower overall. To combat this, you could add a switch. Ethernet switches were "store-and-forward"...basically, they electrically isolated two Ethernets, but the individual Ethernet LANs didn't know about the switch (switches are transparent, unlike routers). The trouble with switches at the time was that they were SLOW, and expensive. A hub is little more than an electrical "follower" circuit with multiple outputs; there is no processor or anything resembling much of a logic circuit, which makes them very cheap. Eventually switches got so fast that the store-and-forward nature became the speed bottleneck, and "cut-through" switches came out that would begin forwarding a packet right after the header was received, even before the full packet came through. This was very important on 10 Mbps Ethernet...at 100 Mbps or more, it's not so important anymore. Switches eventually began to be constructed with VLSI technology, and this brought the price down to the point where hubs have all but disappeared off the market.

Two things happened next. First, the old coaxial stuff was really expensive. In response, common PBX telephone cabling (CAT 3) and connectors (RJ-45) were adapted, which drastically cut the cost of Ethernet over even the "ultra cheap thin" (10BASE2) standard, and the new standard was called 10BASE-T (T = twisted pair). Later, a faster option was created which compressed 100 Mbps signalling down to occupy only about 33 MHz over the same cabling, although a better cable was eventually designed (CAT 5, and the enhanced spec, CAT 5E, capable of up to about 100 MHz) because the CAT 3 cabling was marginal at best with the new standard. This standard was called 100BASE-T.
During this time, switches also went way up in speed and came way down in price, to the point where it really didn't make sense to use hubs anymore. In fact, a hub doesn't make sense at 100 Mbps and above because of the negotiation that has to happen and because the complicated signalling pattern can't be achieved with simple digital circuits alone (you need VLSI-level logic).

Very shortly after that, another negotiation option was added. Up until this point, all Ethernet traffic was still designed to be backwards compatible with the old 10BASE2 and 10BASE5 standards. However, the signals on the twisted-pair system were designed so that all transmitted packets happened on one pair and all received packets happened on a different pair. With hubs, whenever a node transmits, it also sees its own transmission (as well as those of any other transmitter) on the receive wiring. But with a switch, this isn't necessary. As soon as you turn off the "reflected" signal, it becomes possible to both transmit and receive a packet at the same time, and collisions completely disappear. 100BASE-T became 100BASE-TX, which was automatically capable of 200 Mbps (100 Mbps in each direction, full duplex), although we usually still call it "100 Mbps".

It is of course possible to turn off autonegotiation and set specific settings (such as 100 Mbps, full duplex) and just let the nodes "deal with it", whatever the consequences. This is true of everything except the faster (1 Gbps and higher) standards, which REQUIRE autonegotiation (at least by the standard). Occasionally you may find a device which manages to fail to negotiate properly, and it becomes necessary to force one particular setting. In addition, although the CAT 5E standard is really good, in an industrial setting wiring isn't always in the pristine condition we'd like it to be, and sometimes we have no choice but to force a device to "downshift" to the slower but more reliable 10 Mbps standard to take advantage of its ability to tolerate very poor cabling.

OK...now getting on to your next question: why did this particular L35E "kill" the network? There are many ways in which something could have "gone wrong" with any particular device. One of the nastiest is called "spamming". This happens when a device has some sort of hardware/software fault that causes it to either continuously or intermittently transmit bogus packets, especially broadcast packets. Unicast packets are bad enough...the receiving device has to deal with the excess traffic. Broadcast packets are even worse because EVERY device in the broadcast domain has to deal with them. Based on your description of the port being "stuck" in 100 Mbps, half-duplex mode, my best guess is that somehow this device (or the switch it was tied to) started spamming in some way.

A switch can usually shovel around 20,000 packets per second or so over 100 Mbps Ethernet. Your typical PC has enough processor horsepower to easily digest those rates. However, AENT and ENBT modules are limited to about 5K packets per second, and the little Ethernet cards on AB drives are limited to about 600 packets per second. I'm not sure where your SLC and PLC-5 processors lie, but my guess is probably slower than 5K packets/second. So a device which is spamming can do upwards of 20K packets/second.
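To put numbers on that in the field, a minimal sketch along these lines will break a capture (taken from a mirrored port, for example) down by source MAC address and estimate the packet rate from each talker. This assumes the scapy library is installed; the file name and the 5000 packets/second threshold are only illustrative, so substitute the limit of your slowest device:

```python
# Rough per-source packet-rate check on a capture taken from a mirrored port.
# Assumes scapy is installed ("pip install scapy"); "suspect.pcap" and the
# 5000 pps threshold are illustrative -- use the limit of your slowest device.
from collections import Counter
from scapy.all import rdpcap, Ether

PPS_LIMIT = 5000  # roughly what an ENBT/AENT module can digest

packets = rdpcap("suspect.pcap")
if len(packets) < 2:
    raise SystemExit("Not enough packets to estimate rates")

duration = float(packets[-1].time) - float(packets[0].time)
per_source = Counter(p[Ether].src for p in packets if Ether in p)

for mac, count in per_source.most_common(10):
    rate = count / duration if duration > 0 else float("inf")
    flag = "  <-- above the limit, likely spamming" if rate > PPS_LIMIT else ""
    print(f"{mac}: {count} packets, ~{rate:,.0f} packets/sec{flag}")
```

A source sitting near the top of that list at thousands of packets per second, when it should only be exchanging I/O with one PLC, is your prime suspect.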
What is the response if a device sees way too many packets per second? Some will simply start sampling the packet stream (say, reading every 5th packet), causing even more retransmits of the legitimate traffic (and perceived errors and slowness). Others (ENBT modules and drives) simply start crashing and restarting.

...which brings us to the next question: what to do about it. Remember all those managed switches you bought? One of the features you need to be looking for is rate limiting, and you need to turn it on and rate limit the traffic to your devices. You may also want to think about broadcast domains and VLANs (controlling who can broadcast to whom) to logically divide up your network. And you may want to consider QoS (giving some ports priority over other ports when communicating with a particular device), so that you can protect the I/O of, say, a ControlLogix processor from other traffic, and perhaps your whole control network from the office network (if they are tied together). Switches can also often prioritize one entire VLAN over another, giving you other options. It's all about traffic management. None of this in particular "solves" any particular problem; all you are doing is hardening your network to AVOID and localize future problems. If a simple remote Flex I/O AENT card started spamming with broadcast traffic at its full rate of 5K packets per second, it wouldn't just shut down the attached PLC. With an integrated network, you'd have problems across your whole site. And even if you did put two ENBT cards in the ControlLogix chassis to avoid this problem, you'd still have a tough time telling precisely WHICH attached device was the problem in the first place, especially with a large I/O network. That's why it's best to use managed switches extensively and set up the rate limiting feature so that no one device can overwhelm all the legitimate traffic...this makes troubleshooting the specific problem device much easier and keeps things running in spite of a fault. It's the same principle you use when designing control systems with multiple PLCs, or setting up circuit breakers to shut down only the small area where the fault is located.

As to the value of Wireshark...yep, you can find this sort of thing easily. Carry around the smallest managed switch (4 or more ports) you can get your hands on. Hopefully you can buy a laptop with two Ethernet ports (rare) or plug in a decent-speed USB-to-Ethernet adapter so that you have two Ethernet ports. Program one port to mirror the traffic from the other ports, and carry this switch around in your tool bag with your laptop. Whenever you have a problem and you need to do anything in the field, first examine the cabling and look for pinched/damaged cables. Next, look at each switch/hub in the cabinet. If it's a hub, replace it with a switch. Carefully take a large hammer (16 oz. usually works well) and destroy the hub before putting it in the trash; this avoids the possibility of someone coming in behind you and picking it out of the trash. Now take your managed switch, plug it into the cables where you want to "tap in", and look around. The first thing to do is go to the switch's diagnostic screens and check whether the switch is indicating connection problems (watch the counters for bad packets...the type doesn't really matter, except that we expect collisions on half duplex, so don't get too concerned about those unless they get to be more than 10% of the total packet rate).
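If your managed switches expose the standard IF-MIB counters over SNMP, you can also pull the same numbers the diagnostic web pages show without clicking through every port. This is only a minimal sketch under some assumptions: pysnmp (the classic synchronous hlapi) is installed, the switch answers SNMP v2c with the "public" community, and the address and port index below are made up for illustration. Collision counters, where a switch supports them, live in the EtherLike-MIB and should obey the same roughly-under-10% rule of thumb.

```python
# Read basic receive counters for one switch port over SNMP (standard IF-MIB)
# and flag a port where bad/dropped packets are a suspicious share of traffic.
# Assumes pysnmp is installed and the switch speaks SNMP v2c with community
# "public"; the switch address and ifIndex are illustrative only.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

SWITCH = "192.168.1.10"   # hypothetical switch management address
IF_INDEX = 3              # hypothetical port number (ifIndex)

# Standard IF-MIB objects, numeric OIDs with the ifIndex appended
OIDS = {
    "ifInUcastPkts": f"1.3.6.1.2.1.2.2.1.11.{IF_INDEX}",
    "ifInDiscards":  f"1.3.6.1.2.1.2.2.1.13.{IF_INDEX}",
    "ifInErrors":    f"1.3.6.1.2.1.2.2.1.14.{IF_INDEX}",
}

def snmp_get(oid: str) -> int:
    error_indication, error_status, _, var_binds = next(
        getCmd(SnmpEngine(), CommunityData("public"),
               UdpTransportTarget((SWITCH, 161)), ContextData(),
               ObjectType(ObjectIdentity(oid))))
    if error_indication or error_status:
        raise RuntimeError(f"SNMP read failed for {oid}")
    return int(var_binds[0][1])

counters = {name: snmp_get(oid) for name, oid in OIDS.items()}
print(counters)

good = max(counters["ifInUcastPkts"], 1)
bad = counters["ifInErrors"] + counters["ifInDiscards"]
if bad / good > 0.10:
    print("Bad/dropped packets exceed ~10% of traffic -- suspect cabling, "
          "a duplex mismatch, or a device flooding the port")
```

Polling the same counters a few minutes apart tells you whether the errors are old history or still climbing.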
This test is only for finding PHYSICAL or ELECTRICAL problems...basically, noise or other problems such as a pinched/damaged cable. You also need to know what the maximum speed of your device is. If it is dropping down from its maximum capability to 10 Mbps, the first thing to suspect is a cabling issue. Also carefully examine the ports when you plug your stuff in, because frequently you'll find cases where someone has tried to force a cable into a port upside down or some other crazy way and bent the little pins in the RJ-45 connector. Also examine the cables to see if the terminations are proper. Remember...this is 100 MHz cabling! If you don't terminate the cables properly, including NOT unravelling all the little twists, you won't get a good impedance match. The end of the cable turns into an antenna: any interfering signals get injected into it, and your intended signals, instead of travelling down the cable and into the Ethernet port, go shooting out into space where they don't do anyone any good. Good cable terminations are just as critical with copper as they are with fibre, and most electricians don't realize that communication wiring is vastly different from 110 VAC wiring other than the voltage, which causes them to give it no respect at all. This is where I find about 99% of the problems, which is why I focus here first. While you're at it, even without a managed switch, carefully look at the web pages of your PLCs and I/O modules, because they have much of the same statistical information available even without Wireshark.

Next, fire up Wireshark. The first thing to do after collecting some packets is to go over to the statistics page and look at what you see. Notice if there's anything obviously out of whack. For instance, if you see a huge percentage of broadcast traffic or a very large amount of overall traffic, then you definitely have a problem. If you notice multicast I/O packets arriving at your port when you aren't even mirroring, you have a problem. If after 3 minutes you don't find a single IGMP query packet and you have EtherNet/IP I/O, you have a problem. Now, using Wireshark, carefully look at the sources and destinations and notice if anything looks amiss. The key thing with Wireshark is that it's a scanner and one heck of a filter. It's a question of understanding what you expect to see and what you don't, and carefully looking through the traffic to understand why something might or might not be working correctly. At this stage in the game, we're really past looking for hardware problems; we're looking for configuration problems.
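For those capture-side checks, here is a minimal sketch (assuming scapy is installed; the capture file name is illustrative) that reports the broadcast and multicast share of a capture and warns if no IGMP membership query shows up in the first 3 minutes:

```python
# Quick sanity checks on a saved capture, mirroring the checklist above:
# broadcast share, multicast traffic reaching the capture port, and whether
# an IGMP membership query shows up within the first 3 minutes.
# Assumes scapy is installed; "capture.pcap" is an illustrative file name.
from scapy.all import rdpcap, Ether, IP

packets = rdpcap("capture.pcap")
if not packets:
    raise SystemExit("Empty capture")

start = float(packets[0].time)
broadcast = multicast = 0
igmp_query_seen = False

for p in packets:
    if Ether not in p:
        continue
    dst = p[Ether].dst.lower()
    if dst == "ff:ff:ff:ff:ff:ff":
        broadcast += 1
    elif dst.startswith("01:00:5e"):      # IPv4 multicast MAC range
        multicast += 1
    # IGMP is IP protocol 2; a first payload byte of 0x11 is a membership query
    if IP in p and p[IP].proto == 2 and float(p.time) - start <= 180:
        raw = bytes(p[IP].payload)
        if raw and raw[0] == 0x11:
            igmp_query_seen = True

total = len(packets)
print(f"Broadcast: {broadcast}/{total} ({100 * broadcast / total:.1f}%)")
print(f"Multicast: {multicast}/{total} ({100 * multicast / total:.1f}%)")
if not igmp_query_seen:
    print("No IGMP membership query in the first 3 minutes -- check that an "
          "IGMP querier is enabled if you have EtherNet/IP I/O on this network")
```

It's the same information the Wireshark statistics screens give you, just boiled down to the three symptoms above.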
Here's an example of something that can be found with Wireshark. About 2 years ago, I installed a remote rack of 1756 I/O connected to another 1756 chassis with the PLC, without incident. Then we added another PLC with even more remote I/O. Then we added a THIRD PLC. Now, about once every 2 weeks at roughly 8-9 AM, all THREE PLCs hiccuped for about 10 seconds and then went back to work. The web pages on the ENBT modules showed that the connections just mysteriously dropped and then restarted. Managed switches revealed nothing wrong with the cabling, and neither did the ENBT modules, drives, etc. What was going on? Firing up Wireshark, without even mirroring another port, I noticed...EtherNet/IP I/O packets. What? And after 3 minutes, I never saw an IGMP QUERY packet. Aha...turning the IGMP querier on made all the problems go away.

In a similar fashion, a new CompactLogix L32E was installed with a large pump in a remote pumping station. The remote pump had a block of Flex I/O with an AENT module attached to simplify the wiring. Almost immediately, all devices on the wireless LAN that the remote pumping station was attached to started having problems. The wireless LAN was configured in "infrastructure" mode, so there was no screening of broadcast traffic, and it was overloading the wireless system. Now it gets even more curious. The switch that was used was an unmanaged AB Stratix switch. We temporarily installed a Hirschmann managed switch into a spare port and turned on IGMP queries so that it could manage the traffic. Guess what happened? Nothing. It appears that this particular AB Stratix switch does NOT properly snoop IGMP packets in spite of the labelling. So we had to shut down the pumping station and SWAP the switches. Once the Hirschmann switch was in place of the Stratix switch, all the wireless traffic problems disappeared. Again, all the troubleshooting was done with Wireshark in this case.

I've also had a network where I did not see any "problems" per se on the control network at all, but we kept having "hiccups", especially at the end of the month. It wasn't until enlisting IT that we found it. We had VLAN'd the whole network so that the control network, office PCs, and the VoIP phones were all on separate networks, but there was no packet prioritization set up. It turned out that the accounting department would do huge file transfers into Excel spreadsheets when closing things out at the end of the month. The phones could tolerate one or two dropped packets here and there, but the control system that we had wasn't so tolerant. It was only by looking at the statistics pages over several days and monitoring for dropped-packet activity that we eventually narrowed down the problem. The solution was to set up QoS (priorities) based on the VLANs. We set up the control network with a maximum throughput of 30 Mbps but gave it priority #1. Voice was #2 with a 60 Mbps bandwidth limit. PC traffic was #3 but with no bandwidth limit. That way, when the network wasn't loaded, everyone got as much as they wanted/needed. But when bandwidth became critical, the higher-priority traffic got as much as it needed, except that the bandwidth limits kicked in just in case something was very wrong in one of those networks, so that it didn't wipe out everything else (we didn't want the phones to go out while troubleshooting a control network issue). Again, we couldn't have figured this out without Wireshark. Granted, we should have "known better", but it's a case of live-and-learn.

Hi Paul, as usual it is a real pleasure reading your detailed response. The problem we were having came back again yesterday, and I disconnected the same processor from the network. I did this by disconnecting the outgoing lead on the switch. We left it out overnight to prove that this was the problem, and the network stayed up. The switch that we disconnected was a Stratix 6000, so this morning we connected a spare Stratix 8000 that we had, to prove the switch. We haven't had any network issues since, so I think the problem all along was the Stratix 6000. As you stated, Paul, it must have been "spamming" the network. I will have a look at the AB website for firmware updates for my 6000 switch. Thanks again Paul, you have given me some very good advice, which I will start using as soon as I re-read your reply to fully digest it. Conor

Paul, that is an outstanding post. I have marked a few of these over the years as golden nuggets (so I can search for them), but that one qualifies, in my opinion, as a chest of gold coins... Thank you.

Outstanding reply Paul.

x2 Good stuff Paul. Thanks

To add to Paul's post, there can be an issue where a switch's redundancy is turned on (typically using a manufacturer-specific proprietary signalling technique) and another switch on the segment is set to use spanning tree (which prevents loops from creating packet storms); the combination will produce uncertain results. As an example, a Hirschmann switch set to "redundant ring" with another Hirschmann switch within the ring set with spanning tree turned on will simply shut down the segment between them...sometimes. If the switches aren't from the same manufacturer, who knows how either would behave. Just more fuel for the thought process...
