waterboy

Can these errors be caused by a running PLC program?

37 posts in this topic

I am out of ideas. I know I am grasping at straws here, but I don't have enough PLC programming experience to rule it out, so I have to ask. I am using iFix 4.0 and 5.0 for HMI, with the GE OPC and ABR drivers through RSLinx 2.51 and 2.54, to PLC5, 1 SLC, CompactLogix and ControlLogix PLCs. I have intermittent but frequent connection losses to multiple PLCs around the plant, causing the HMI screens to show ??? in place of the data. So far they clear themselves up in just a few seconds, but this is only getting worse. The OPC driver in every machine will show errors similar to this:
AB_ETH-1\0.(10.10.28.17).1.0 : F07_Air_Alarm_Reset : 03/13/09 - 20:01:44 : 00h
Items: F07_Air_Alarm_Reset
Type: READ
Mfg: Allen-Bradley
PlcType: Logix5000
Desc: ControlLogix Optimized packet being reinitialized 74bf620, connection 4813329, packet's connection 4813329, packet's read connection 0, packet's state --Creating Optimized Packet
- and -
Desc: CIP Messaging Error: a message timed out waiting for a response
- and -
AB_ETH-1\0.(10.10.28.42).1.0 : @IsPresent : 03/13/09 - 21:01:09 : EE00036h
Items: @IsPresent
Type: READ
Mfg: Allen-Bradley
PlcType: Logix5000
Desc: Control Net Service cannot be performed while Object is in current state
Number of occurrences: 20
Last Error occurred: 03/23/09 - 07:15:18
- and I just love this one -
Desc: A Timeout occurred where the engine didn't generate a timeout in the time required.
Every single SCADA does this, so I can't imagine how it could be a configuration issue on a SCADA. Even a machine I purposely haven't touched for a year does this. I really thought it was RSLinx, but how could this happen on so many machines like this, and what could cause it? I examined the HTTP interface to the PLCs and their CPUs are running at about 5%. There are ZERO errors in the Ethernet Statistics. So I spent the entire afternoon with my packet sniffer at each PLC in question and at one of the SCADA servers. All I found (and fixed) was a single HDX port mismatch on the switch, but that port isn't even in the PLC circuit. I examined the managed switch through which everything flows and it has no errors noted. I am going to borrow one and replace it as a test, but I am not hopeful. I can see nothing wrong with the Ethernet layer. I think I saw 2 TCP retransmits and a bunch of immediately answered keep-alives, but nothing else. I'm truly stumped.
So the question becomes: since it appears that all traffic that actually gets generated is being properly handled, is there anything within a CLX program that can make RSLinx, or one or more PLCs, stop responding like this? Something like a program sending a message to itself that would consume all the CPU resources until it failed, yet produce no external traffic? (Told ya, straws!) I welcome ALL ideas, I've got none left.
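
Since the errors hit different PLCs at different times, it may help to tally them by PLC address and by hour before guessing at a cause. The following is only a sketch: the pattern is modelled on the two sample header lines quoted in this post, the real driver log layout may differ, and `tally_errors` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern modelled on the two sample header lines quoted in the post.
# ASSUMPTION: the real driver log uses the same layout throughout.
HEADER = re.compile(
    r"\((?P<ip>\d+\.\d+\.\d+\.\d+)\)\.\d+\.\d+\s*:\s*"
    r"(?P<item>\S+)\s*:\s*"
    r"(?P<date>\d{2}/\d{2}/\d{2})\s*-\s*(?P<time>\d{2}:\d{2}:\d{2})"
)

def tally_errors(log_text):
    """Count error headers per PLC IP and per (date, hour) bucket."""
    by_plc = Counter()
    by_hour = Counter()
    for m in HEADER.finditer(log_text):
        by_plc[m["ip"]] += 1
        by_hour[(m["date"], m["time"][:2])] += 1
    return by_plc, by_hour

# The two header lines from the post, used as sample input.
sample = (
    r"AB_ETH-1\0.(10.10.28.17).1.0 : F07_Air_Alarm_Reset : 03/13/09 - 20:01:44 : 00h "
    r"AB_ETH-1\0.(10.10.28.42).1.0 : @IsPresent : 03/13/09 - 21:01:09 : EE00036h"
)
by_plc, by_hour = tally_errors(sample)
print(by_plc)
print(by_hour)
```

If one address or one time of day dominates the counts, that points at a specific PLC, switch port, or scheduled activity rather than at the SCADA configuration as a whole.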

Do you have Ethernet I/O on this network also? How much bandwidth are you using? What's the "scan" time of your HMI tags, and how many tags are there? Do you have your drivers optimized? If you have your driver configured for 10 tags here, 10 tags there, and maybe 5 over here, you can get slow response. It is much better to have 10 reads of 100 tags than 100 reads of 10 tags. Each read/write carries a large amount of overhead.
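
To make the "10 reads of 100 tags" point concrete, here is a rough cost model. It is a sketch only: the 5 ms per-request overhead and 0.01 ms per-tag cost are invented for illustration, not measured values for any driver, and `poll_cost_ms` is a hypothetical helper.

```python
import math

def poll_cost_ms(total_tags, tags_per_read, overhead_ms=5.0, per_tag_ms=0.01):
    """Estimated time to poll every tag once, with a fixed cost per request.

    overhead_ms and per_tag_ms are made-up illustrative figures.
    """
    reads = math.ceil(total_tags / tags_per_read)
    return reads * overhead_ms + total_tags * per_tag_ms

few_big = poll_cost_ms(1000, 100)    # 10 reads of 100 tags
many_small = poll_cost_ms(1000, 10)  # 100 reads of 10 tags
print(f"10 x 100 tags: {few_big:.0f} ms, 100 x 10 tags: {many_small:.0f} ms")
```

Whatever the real per-request figure is, the total grows with the number of requests rather than the number of tags, which is why fewer, larger reads tend to win.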

No Ethernet I/O. Bandwidth on each machine is less than 1%. Scan time of the tags is generally 1 sec; some are longer but none are less than that. You are correct about how the drivers are configured: the groups I created are tied to the area involved rather than the speed at which they should be updated. I get what you are saying, and somehow I will change all my tags and groups to reflect that. But... speed isn't the issue. The screens update just fine the vast majority of the time. The issue I face is that suddenly RSLinx can't or won't communicate with a PLC, for no obvious reason. Then 10 seconds later it's fine again, without any user interaction. They don't all do it at the same time, or even on the same datapoints.

Has there been a recent installation of something new somewhere that you may not be aware of? I had four conveyor lines all next to each other, all on the same 480 volt feed. There was construction going on next door, and whenever the construction crew fired up the electric air hammer, all four of the conveyor lines would jerk and change speed, causing the product to fall all over the floor. I saw this accidentally while walking over to get an extension cord for something else. When the air hammer started again (I could hear it but not see it around the brick wall), the conveyors changed speed again. So, just as an experiment, I had the construction crew plug their electric hammer into another feed; they must have used 400 feet of extension cord just to get to another outlet that was being fed from some other switchgear. Guess what: no more speed changes when the hammers went off. I had them plug the hammer back into the outlet they had been using and connected an oscilloscope. Wow, what a bunch of noise spikes!!! Maybe you can isolate your PLC power or check it out with a scope. It's just good general practice to design an electrical panel, especially one with electronics in it, with noise filters or chokes. I use Sola transient protection, STV25K-10S... Hope this helps.
Edited by jimdi4

There is lots of construction going on here, but they have been instructed to use generators for their equipment for precisely that reason. Additionally, this problem also happens during nights and weekends when they aren't here. I located a doc on the Rockwell site that speaks to this exact behavior. It says that when the HMI attempts to read a specific bit within a floating point address (i.e. addresses like [TopicName]FloatAddress.1 or [TopicName]FloatAddress[0].1), it can cause the "Optimized packet being reinitialized..." message and disrupt messages. It also states that a ripple effect can affect other addresses. Now, I might have typoed one when entering them originally, maybe, but none exist now. So this KB article appears to describe the issue, but since I am not reading any bits from a floating point address, what else can it be?

I was on the phone with Tier 1 of Rockwell support and they haven't been much help so far. They are supposed to call back today with more ideas. In the meantime I decided to focus on the PLC CPU to see how hard it was working and what it was doing. I ran across the Logix 5000 Task Monitor and I see that the CPU usage is around 94% on most of these PLCs. Occasionally I see a downward spike, and one of those spikes coincided with a comms dropout. I can't say it's cause and effect, but I know that 94+% usage can't be good, so it's worth pursuing. 2 other ControlLogix PLCs are showing between 7-15% CPU usage. After reading the documentation for the tool, there is some question about the validity of the CPU reading. It states that some calculation is needed depending on whether the task is continuous or not. Also necessary is adjusting the "Rate" in the tool(?), and at that point I need to subtract a "Null Task" value which doesn't appear anywhere. Can someone put that into English and tell me how much I can trust the value I see? And if these values are true, what should I look for in the PLC to throttle them down? I wouldn't think they should be that busy. I attached a screenshot of the tool running on 2 of these PLCs.

Anyone... anyone...?

This may be a long shot. I had an issue using GE screens (Cimplicity) and found that the touchscreen driver was updating based on what the screen was currently showing and some other random factor GE could not explain, never mind that I had remapped all of the data in the CLGX to prevent this. The long and the short of it was that GE was not optimizing the transactions, so every request to the ControlLogix was different and not optimized. The ControlLogix Ethernet card could not buffer the data ahead of time based on a repetitive transaction, so there were delays getting the data off the backplane and routing it. Bandwidth had little effect. To find it, I started pulling users off the circuit until the comms stabilized and then added them back to narrow the search. GE's driver was part of the issue. This was two years ago and they may have patched it by now. What I found was that the ControlLogix ENBT card supports both AB CIP and AB Ethernet transactions, and the combination of both plus changing requests on the same network slowed everything down. I am not sure if iFix allows prioritizing transactions or has set updates. Good luck. I have felt your pain and it is not fun. Keep fighting.

The latest product notifications from Rockwell contain a link to a Knowledgebase article (Answer ID 33121) which deals with this error. http://rockwellautomation.custhelp.com/cgi...ated=1127966400 This might help. Andybr

Since you are pouring a lot of time and effort into resolving this issue with no smoking gun yet, I suggest you step back and break the problem up into pieces. Verify there is not anything going on in your power. If you have a Dranetz, put it to use. If you do not own one, your department can expense a rental for 30 days for a few bucks. I was a backup controls engineer on a multi-million dollar project. After commissioning, we had intermittent loss of annealing for many loads that had to be scrapped at $0.5M/load (very costly). The root cause (found 2 months later) was a noisy sump pump: its startup noise injected enough ripple into the main controller power supply to cause it to intermittently stop. With more engineers than one can imagine poring over every last detail for weeks, this one slipped right through our fingers. By accident, someone was near the sump pump when it started and the machine stopped. After verifying good power, start dropping some of the network communications off (if possible) to find the break point for the comms failure.

Thanks, I have seen that article and while it describes the error perfectly, I am not addressing any floats at the bit level. I searched all over my SCADA looking for them after seeing that article.

Thanks for the reply KW. I have a Fluke 433 that I can (and will) connect to monitor the power; however, all these PLCs have a UPS and surge suppression. I am fixated on the discovery that the Logix 5000 Task Monitor shows them all dropping CPU usage from 60% (it got that low after I adjusted the timeslice to 50% and ticked the Comms priority button) to 30% randomly for various periods of time, but not all together nor for the same duration. Hypothetically, let's suppose for a minute that I am hammering these CompactLogix PLCs with SCADA requests, say 20000 tags a second (I'm not; worst case it's more like 1000 per second). Is that too much Ethernet comms, and will it affect the CPU usage like this? How would I witness the actual effect that the SCADA system has on it? There is some discussion that perhaps the SCADA is asking too much of it. I would like to determine if that is true.
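
One way to replace the "20000 versus 1000 tags a second" guess with a figure is to add up the configured poll groups for a PLC. The group sizes and intervals below are hypothetical, and this only estimates the request side; it says nothing about what that load costs the controller.

```python
def tags_per_second(groups):
    """Sum tag_count / scan_interval over (tag_count, scan_interval_s) pairs."""
    total = 0.0
    for count, interval_s in groups:
        if interval_s <= 0:
            raise ValueError("scan interval must be positive")
        total += count / interval_s
    return total

# HYPOTHETICAL poll groups for one PLC: (number of tags, scan interval in seconds)
groups = [(600, 1.0), (300, 2.0), (500, 5.0)]
rate = tags_per_second(groups)
print(f"{rate:.0f} tag reads per second requested")
```

Comparing this figure per PLC against the Task Monitor readings, before and after disabling a group, would give at least a rough picture of how much of the CPU load follows the SCADA polling.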

How about the grounds? Is it possible that you may be getting noise through your ground connection? Are you just having problems with the Compacts? I remember a utility available from Rockwell that allows you to check for issues on an EtherNet/IP system, taking into account loading, managed switches, etc. Can't remember the name right now. I will try to get the name of it for you. Russell

I'll look forward to seeing that. Having no experience with these, I am really curious about how they divide up their time, why they don't sit idle more often, whether connections should be unconnected or connected, and whatever else is making the CPU work so hard. It really isn't doing much for the process. Most of the time it's simply reading a level, modulating a single valve with a PID from that level, and reading a few AIs for transmission to the SCADA system. It must be doing more things than we are telling it to do.

I am with PLCMentor... check your grounding. Even though you have UPSs on the PLC power supplies, there could be something funky happening on your power delivery/grounding. Ground impedance changing instantly can wreak havoc on high-speed digital circuits. Is this control network tied to the company network? Could the IT group have made changes about the time this started? Have you looked at segmenting out the Ethernet comms to limit the broadcast traffic? Programming routing tables in managed switches to control comms traffic? Have you run Wireshark to gather statistics on Ethernet comms? In a nutshell, when a system has been running and then starts to sputter, my money is always on answering the starting question "what changed in my process?". Either the setup was perfect before the root-cause incident, or the setup was doomed to fail once it was loaded beyond a certain point.

Here is the Ethernet planning tool I mentioned: visit http://www.rockwellautomation.com/solution.../resources.html, select the "choosing an architecture" tab, then download the "EtherNet/IP Capacity Tool". Russell

Russell, thanks for that link, I have bookmarked it. I tried the tool briefly, having not read the manual yet, but it appears to deal only with PLC to PLC communication using Produced/Consumed tags. That's not what we are doing. We poll the PLCs from SCADA, and the comms between PLCs are simply messaged, and there aren't that many of those.

Since you say you are having problems with the OPC driver, just to restate the problem: the HMI uses GE OPC to read/write to the PLCs over your Ethernet network. Can you instead use CIP messaging from the PLCs to read from and write to the OPC server's data tables, rather than letting the OPC server read and write data in the PLCs? I had this same issue with a Prosoft device that polled the PLC. The Prosoft device read and wrote at an RPI rate and caused some bandwidth issues on the network. When I switched to using a MSG instruction to read/write to the Prosoft device, my problems went away. Maybe the GE OPC is broadcasting messages to everyone instead of unicasting to your particular PLC. Just a thought from past experience.

And I appreciate all of it. This is both interesting and maddening.

If the problem is with servicing communications, then there are a few things that you can do.
Change the System Timeslice under the controller properties, Advanced menu. The default value is 20%. Look at the example I attached as a bmp file. I have had this as high as 35% before it affected the scan time of the processor, but it all depends on how much programming is in the processor. If the scan time is affected by the programming in the continuous task, do this!!!
A more radical change is to move all the programming to a timed interrupt task with an interval matching your present scan rate. Example: if your scan time is, say, 7 mSec under the continuous task and the time between each continuous task scan is, say, 20 mSec, you can put your program in a timed interrupt task at 20 mSec. Understand what I am saying?
timeslice.bmp
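
To get a feel for the trade-off in the System Timeslice setting, the sketch below uses a deliberately simplified model: it assumes no periodic or event tasks and that the controller always uses its full overhead slice for communications, so the continuous-task scan stretches by 1 / (1 - timeslice). Real controllers can hand unused overhead time back, so treat the result as a worst-case estimate. The 7 ms figure is the example scan time from this post, and `stretched_scan_ms` is a hypothetical helper.

```python
def stretched_scan_ms(exec_ms, timeslice_pct):
    """Continuous-task scan time if the overhead slice is always fully used.

    Simplified model (assumption): no periodic/event tasks, and the
    communications queue never runs empty.
    """
    if not 0 <= timeslice_pct < 100:
        raise ValueError("timeslice_pct must be in [0, 100)")
    return exec_ms / (1 - timeslice_pct / 100)

for pct in (20, 35, 50):
    print(f"{pct}% slice: {stretched_scan_ms(7.0, pct):.2f} ms scan")
```

Under this model, raising the slice gives communications more time at the direct expense of scan time, which is why the right value depends on how much logic sits in the continuous task.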

I do, and I will see if I can do that. What about RPI? Will that have much impact? There are 11 modules in the rack.

RPI updates are handled independently, on a separate processor chip in a ControlLogix. This is the I/O task that you are not aware of until you do a ControlNet configuration or look at the RPI settings on the local rack. Again, the I/O RPI update is done on a separate processor in the ControlLogix!!!! Get this PDF; it will explain this in detail until you have a headache!
TasksSchedule 1756-pm005_-en-p.pdf

Too late for that! Thanks, I'll look into it.

I can't locate a document by that name on the Rockwell site.

There's a lot of confusion going on in this thread. First let's get some things straight. The ControlLogix (1756) processor does indeed have two CPUs. This is NOT the case in a CompactLogix. All of the CompactLogix members use the same single CPU for both I/O and scanning. This is one of the reasons why the CompactLogix series is so much less expensive (one less CPU, and no dual-ported memory controller).
From iFIX, you can actually do two different types of message transfers: unsolicited and solicited. There is a huge difference, especially with the CLX, in how they are implemented. In solicited-type communication, the actual protocol is effectively identical to MSG blocks. iFIX sends a request to RS-Linx at exactly the scan rate that you programmed into iFIX. RS-Linx then sends a command to the PLC to request a specific section of memory, and the PLC responds.
With unsolicited communication, things change. With a SLC, PLC-5 or Micrologix, RS-Linx does essentially the same thing with regard to the PLC, BUT it doesn't send data back to the iFIX OPC driver unless RS-Linx detects a change of state. With a CLX processor, RS-Linx registers the information with the PLC itself. The CLX PLC then keeps track of when the data updates, or of the requested update rate, and sends over data as needed without any request at all from RS-Linx. Obviously, depending on how fast the data is changing, this can be more or less burdensome on the CLX processor. In addition, all these requests take up memory (and processor time) on the CLX.
The best way to use unsolicited mode is with UDT's. The granularity of how unsolicited mode works on a CLX is the "data item". If you have 20 DINT's, then it will need to monitor all 20 of them. It expends one packet for EACH DINT that changed value, every time. If you put all your data into reasonably sized UDT's, then whenever data in the UDT changes, it will send over ONE packet containing everything in that UDT (assuming that you keep the UDT size down to roughly 400-500 bytes).
Now, keep in mind also that current Ethernet/IP hardware is capable of doing 100 Mbps. Note that I said HARDWARE. That is, the NIC can keep up. However, the Ethernet/IP adapter is limited to closer to 2000 packets/second or so, and the I/O cards (if you are doing Ethernet/IP I/O) are limited to about 1000-1200 packets/second. This is MUCH LESS than a full 100 Mbps data stream. Regardless of the interaction between the CPUs in the PLC, this alone might explain what is going on.
Why is this an issue? Well, if you have a switch that does IGMP snooping, then the devices attached to that switch will see only the packets that they are registered for (the multicast packets they are listening for). If there is unsolicited (aka producer/consumer) traffic on the network, there are no issues. HOWEVER, if you don't have IGMP snooping in some of your switches, then when that traffic passes through those switches, it turns into full-blown BROADCAST traffic. And all general PC-type chatter which is broadcast in nature is also hitting your PLC constantly. If you exceed the PLC's ability to keep up with that traffic, you will overload it and cause the PLC Ethernet devices to either drop packets or sometimes lock up (I/O devices usually do the latter!). If you are only looking at bandwidth displays in Wireshark or something like that, you might not realize you are hitting this limitation even though the bandwidth is only a fraction of 100 Mbps, because most PCs can easily keep up with a full 100 Mbps (more processor horsepower is available for handling packets).
Also, keep in mind that RS-Linx itself is NOT the greatest of OPC servers. Once you get above about 3000 data items or so, or if you have very heavy polling rates, RS-Linx will start to stutter and generally act up. You can usually detect this because RS-Linx will start to use a lot of CPU time, visible in Windows Task Manager, when you get close to the limit. The solution is to switch to Kepware, which tends to be much more scalable, less expensive, and more stable. Matrikon is more stable but also not as scalable.
Also, RS-Linx is known to be extremely flaky when used as an OPC server for CLX in general. If you lose packets for any reason, you have to close the connection and reopen it to get RS-Linx to reconnect. I can't even predict why or when this would happen. It seemed to be random and uncontrollable. I think in the end I was simply detecting dropped packets and RS-Linx was recovering very poorly. There are ways to detect that this is going on with RS-Linx, but it's a real pain and doesn't work all that reliably either. I went through a lot of this pain when I switched to a CLX processor with Cimplicity PE. I just found that in general I have to live with Cimplicity "blinking" for about 10-20 seconds about 2-3 times per day on average. Nothing you can do about it except switch to Kepware or Matrikon. No amount of calls to AB tech support ever resolved why a non-AB OPC server works stably day in and day out with OPC while RS-Linx is very unstable, or why RS-Linx works just fine with non-CLX processors but gives you so much trouble with a CLX processor.
So based on what you are describing, I suggest that some I/O optimization is in order. I'd start by creating some UDT's and grouping bunches of data together in some sort of logical fashion. Take a good look at the scan times in iFIX. Try using unsolicited communication as well to reduce the overall packet load where unsolicited mode makes sense (read up on this before trying it... it can also cause the PLC to choke as it attempts to spam the PC when it is dealing with rapidly changing data). Check your switches, and generally how you have everything configured, to decide if IGMP packets are causing an issue. Use Wireshark to check not only bandwidth but, more importantly, packets per second.
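
The gap between bandwidth and packets per second is easy to see with a little arithmetic. The sketch below is illustrative only: the 100-byte frame size and the capture timestamps are invented, the 2000 packets/second figure is the rough adapter limit quoted in this post, and both function names are hypothetical.

```python
from collections import Counter

def bandwidth_fraction(packets_per_s, bytes_per_packet, link_bps=100_000_000):
    """Fraction of link bandwidth consumed by a given packet rate."""
    return packets_per_s * bytes_per_packet * 8 / link_bps

def peak_packets_per_second(timestamps):
    """Largest number of packets falling in any whole-second bucket."""
    buckets = Counter(int(t) for t in timestamps)
    return max(buckets.values(), default=0)

# 2000 packets/s of 100-byte frames (the frame size is an ASSUMED figure)
frac = bandwidth_fraction(2000, 100)
print(f"{frac:.1%} of a 100 Mbps link")

# HYPOTHETICAL capture timestamps in seconds, e.g. exported from a sniffer
times = [0.10, 0.20, 0.95, 1.01, 1.02, 1.03, 1.50, 2.70]
print(peak_packets_per_second(times), "packets in the busiest second")
```

With small frames, a device could be at its packet-handling limit while a bandwidth graph still shows only a percent or two of utilization, so the per-second packet count toward each PLC is the number worth watching.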
