paulengr

Redundant or not to redundant, that is the question

Q: If you had a choice between installing multiple ControlLogix processors, one for each process line, with no redundancy, or alternatively installing a redundant pair with Ethernet ring switches (so that there are redundant Ethernet links everywhere), which would be the better option? Assume there are 4 process lines here. The way I see it, one processor per line means you never lose production on more than one line at a time. A redundant pair means you never lose anything, as long as you don't lose both processors simultaneously. Any second opinions? Any experience with processor redundancy?

If the processes are critical enough that downtime absolutely cannot be tolerated, I would go with the two processors, but use dual ControlNet, not Ethernet. Edited by jstolaruk

Redundancy requires two identical sets of hardware, processor, rack, power supply, and communication cards, with the same hardware series, revisions, and firmware revisions in each system.

From my point of view there are two entirely separate questions here: 1) Should you run 4 production lines from the same controller? 2) Should you use a redundant controller?

I would highly recommend that any production line that can run independently gets its own controller. Putting all 4 lines at risk every time you make a change is reason enough to separate them, and a single point of failure for all 4 lines is bad too. Performance is another reason: unless the processor demands of your lines are trivial, sharing a processor imposes limits on performance as well.

Controller failure is an extremely rare thing. Why is it that most people start improving the reliability of a control system by making one of its most reliable parts redundant? If that is all you do, it usually has the opposite effect (more failure points, more failure modes, etc.). I've seen some strange things done in the name of redundancy. Another important point is that redundancy takes some of the performance of your processor away from the application: keeping both processors' memory synchronized takes processor resources, so a redundant pair of processors is actually less capable of running multiple lines than a single processor of the same model.

Bottom line: I'd recommend 4 single processors for your 4 production lines. If any of the lines are so mission-critical that you really need zero downtime, then each one of those systems deserves its own robust design. That may include redundant processors, but it would also include redundant power supplies, networks, and I/O. Edited by mellis

Just to clarify jstolaruk's statement: you must use ControlNet. Redundancy over Ethernet is not yet supported.

In my previous position, before where I am now, I worked for an SI who did a number of redundant ControlNet systems. What follows are my observations about CLGX redundancy.

1. I have never understood how processor or network redundancy improves much of anything when the field I/O is not redundant. A single photo-eye failure is still your worst enemy.

2. The biggest pitfall of the CLGX implementation is scan time. One system we were asked to "save" had a scan time of 52 ms average and 98 ms maximum in V11 on an L55M14. The designer had resigned because he could not get the product tracking buffer to work. It turned out the resolution of his "encoder by photo-eye" was one pulse every 60 ms, so any scan over 60 ms meant lost tracking pulses (see the sketch after this post). Applying the principles in the Rockwell "Redundancy Bible", we dropped scan time to 20 ms average and 42 ms maximum. Miraculously, it worked.

3. Given your situation with 4 process lines, I'd advise 4 CPUs. By the time you add the cost of two coax runs and two complete "base" processors, there won't be a lot of cost difference between the redundant option and your 4-system plan.

IF YOU DO GO REDUNDANT, WATCH THE FOLLOWING:
A. Get Rockwell's guidelines for redundant code writing and follow them. THIS IS A MUST!!
B. Pay attention to scan time. Adding one rung can blow scan time out of the water surprisingly fast.
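
To see why those 98 ms scans lost pulses, here is a back-of-envelope sketch. The 60 ms pulse period comes from the post above; the sampling logic is illustrative Python, not actual PLC code:

```python
PULSE_PERIOD_MS = 60  # "encoder by photo-eye" resolution from the post

def pulses_counted(scan_time_ms: float, run_time_ms: int = 60_000) -> int:
    """Count rising edges seen by logic that samples the input once per scan."""
    counted = 0
    last_level = 0
    t = 0.0
    while t < run_time_ms:
        # Model the photo-eye as a square wave: high for the first half
        # of each 60 ms period, low for the second half.
        level = int((t % PULSE_PERIOD_MS) < PULSE_PERIOD_MS / 2)
        if level and not last_level:  # one-shot edge detect, once per scan
            counted += 1
        last_level = level
        t += scan_time_ms
    return counted

actual = 60_000 // PULSE_PERIOD_MS        # 1000 pulses really occurred
print(pulses_counted(20), "of", actual)   # 20 ms scans: every pulse is seen
print(pulses_counted(98), "of", actual)   # 98 ms scans: many pulses swallowed
```

Any scan longer than the pulse period can swallow two pulses between samples, and the one-shot in the tracking code counts only one of them.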

I've gone through everything and these are my conclusions:

1. I was only mildly concerned about the PLCs themselves. I have hardly EVER seen a PLC crash; it mostly seems to be an I/O problem. Aside from a few major faults (mostly hardware related), the vast majority of the major faults I've witnessed PLC-wise are software problems, such as screwing up a PID loop on a PLC-5 (playing with a PID loop on a PLC-5 is playing with fire in general). The rest were related to hardware problems. Very occasionally, I'll have a PLC-5 go brain dead for unknown reasons (I suspect code problems, but since it rejects the program and erases itself, you can't reconstruct what happened). For these reasons, it seems more sensible to build in support to defend the control network from faults rather than the PLC.

2. Since major non-hardware faults are limited to a single task, and given that AB pushes the ControlLogix platform as "network-centric" rather than "PLC-centric", it seemed like redundant platforms would be much more common and popular. In this model, you'd treat PLCs as essentially "units of processor horsepower" instead of the center of the control network.

3. I am not aware of "ring switches" for Arcnet (a.k.a. ControlNet), so that totally blows away the idea of having any sort of redundancy in the network.

4. Loss of communication with I/O is the most common problem, but it is also very easily overcome. If you design everything from the ground up with the key words "fail safe" in mind and design your control software to react accordingly to loss of I/O communication, you will have a fail-safe system (a sketch of the idea follows this post). If not, your mileage will vary.

5. So there are two ways I could go about doing things. I could spend the money on ring switches (if redundant PLCs could work over Ethernet) and two of everything else; then I'd be able to defend against PLC hardware failure as well as network failures NOT involving I/O. In either model, if something went wrong with a particular production line, software-wise or hardware-wise, the reaction would be the same (since tasks localize software faults). If I put in 4 processors, I spend about the same amount of money, and I'd now be vulnerable to PLC hardware failures (again, admittedly rare) as well as inter-PLC network failures. Since ControlNet is the only redundancy-capable network available, that pretty much nixes the idea of going this route: putting in 8 processors and the associated hardware to defend against an event that is already infrequent is bordering on NASA levels of sanity, and the system in question is not a nuclear reactor or a super-collider.
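
On point 4, the fail-safe reaction to comms loss can be as simple as a watchdog per I/O connection. A minimal sketch, where the names and the 1-second timeout are assumptions for illustration (real systems would use the connection status bits the I/O modules already provide):

```python
import time

COMM_TIMEOUT_S = 1.0  # assumed I/O connection timeout

class RemoteRack:
    """Tracks the last time a remote I/O rack answered."""
    def __init__(self, name: str):
        self.name = name
        self.last_reply = time.monotonic()

    def heartbeat(self) -> None:
        """Call whenever a valid I/O update arrives from this rack."""
        self.last_reply = time.monotonic()

    def healthy(self) -> bool:
        return (time.monotonic() - self.last_reply) < COMM_TIMEOUT_S

def resolve_outputs(rack: RemoteRack, commanded: dict[str, bool]) -> dict[str, bool]:
    """Pass commands through while the rack is alive; on comms loss, force
    every output to its fail-safe (de-energized) state."""
    if rack.healthy():
        return commanded
    return {point: False for point in commanded}  # fail safe: everything off
```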

Paul, you are aware of and have considered the following layout, haven't you?

1. One CLGX PLC per process, for 4 processes. {This assumes that PLC MTBF approaches infinity and that an MTTR of 30 minutes to 1 hour is acceptable.}
2. One 1756-CNBR, with all I/O in racks using 1756-CNBR or 1771-ACNR15. This gives you redundancy between your PLC and I/O cards. You can even configure critical points to have 2 or 3 sensors, each through a different I/O card, with logic to resolve who is right and who is wrong. This is less costly than processor redundancy and gives a measure of protection.
3. You can place multiple 1756-EN2T or ENBT cards in the 1756 rack and let your SCADA application automatically choose whichever pathway is alive.

Just food for thought.

To get familiar with the Rockwell redundancy system, read this literature: http://literature.rockwellautomation.com/i...um523_-en-p.pdf

I was just PM'd and asked what this publication was, so in its current 2009 format, look for publication 1756-UM523F-EN-P, found here: http://literature.rockwellautomation.com/i...um523_-en-p.pdf. I searched for "redundancy manual" on the AB Literature page to find it.

Based on my experience since I originally posed the question, I would not take the redundant processor route per se anyway.

1. You are taking what is the most reliable part of the system and making it much more complicated, thus lowering the reliability on a per-processor basis. The overall system reliability is increased, but by a very tiny margin.

2. The GuardLogix processor is two CLX processors plus some glue logic to detect failures. The resulting processor, in spite of responding to discrepancy problems by faulting, is over an order of magnitude more reliable. Only one copy of the program is needed, and you don't lose features such as online programming, unlike the redundancy approach. If one processor faults, the whole system shuts down, but regardless, the reliability of the entire system is much higher than a standard CLX processor, which is still a far cry from standard PLCs, which are in turn at least an order of magnitude better than PC-based equipment.

3. Regardless, the I/O is where the limitations are. Outputs are always the weakest link, by an order of magnitude. In almost all cases, the only way to increase actual reliability is to add redundancy in the outputs.

I've just been involved in writing the software for a redundant CLX system (I wasn't involved in the hardware design). Anyone setting out on one of these projects must get the A-B technotes and follow the guidelines. These basically consist of ensuring all intelligent cards are flashed at the same firmware level, selecting ControlNet addresses correctly so there is always a keeper, and positioning the cards correctly in remote racks.

While the system has redundant processors and a redundant ControlNet network, and it all sounds great to the people buying the equipment, I still have only one input looking at each sensor and only one output driving each actuator. In my experience, this is where the faults happen that stop a system; I can't remember any occasion where I have had a processor fail (apart from out of the box). So unless you have two sets of inputs and outputs and two sets of sensors and actuators, how have I improved the situation?

One thing which may be of use to those performing online edits on a critical system is the ability to decide what to do with test edits on a switchover. If you select to cancel edits on a switchover, then when you crash the running program through some bad code, the other processor takes over with the unedited code. If you've ever done that on a live critical system, you will know it could be a good thing!

So my thoughts are a bit mixed. I can see benefits for me, in that I can avoid the risk of crashing a system, but the system we have installed does not truly address all the issues which can occur with a control system. The cynic in me says it's basically a way of getting more money in. Edited by ratcliffe_ic

This may or may not do it. Let's consider three possibilities.

1. The sensor fails, but the fault is undetectable. This would be the case if, for instance, you have a digital input that spends 90% of its life in the "0" state and faults to "0". This type of fault is undetectable in software. In the real world, it eventually becomes detectable, but not in a good way (failure on demand). These all eventually become detectable (case 2), but this gets you in the right frame of reference for thinking about the problem.

2. The sensor fails in a detectable way. This is the mirror image of item 1. An example is a flame detector: these typically fail in such a way that they detect flame when in fact no flame is present. Self-monitoring flame detectors actually use this property by periodically opening and closing an internal shutter.

3. The sensor fails, has diagnostic support, and the failure mode happens to be one the diagnostics can detect. For instance, AB sells diagnostic I/O cards for 1756 and 1794 I/O; you can detect shorts and opens with the diagnostic digital inputs. Something similar can be done in almost all cases with analog I/O, and a simple auxiliary contact feedback allows detecting a lot of contactor faults. In fact, the auxiliary contact plays prominently into safety relays.

OK, now let's consider the 1oo2 case. If we have, say, two digital inputs and one reads "1" while the other reads "0", and we've waited sufficiently long (a timer) to verify that this isn't simply a switching delay, what do we do? We can't tell which state is correct, so we can't do fault tolerance (the ability to ignore faulty hardware). We can only pick one and hope we picked the right sensor, or fault. So process reliability increases, but so does downtime. Not what we intended to do...

Round 2. This time, we have diagnostic sensors. Now we can not only detect that we have a fault (case 1) but (hopefully) even detect which sensor it is. In addition, many of the previously undetectable faults become detectable ahead of time. The overall result is that we usually detect more faults, and sooner, increasing reliability over the first case; and we can often tell which sensor faulted, which decreases downtime by an order of magnitude as well.

Round 3. We have three sensors. Now we can use 2oo3 voting (best 2 out of 3). This allows us to use even non-diagnostic sensors and operate reliably. Diagnostics still boost overall reliability and decrease downtime, but the increase is smaller than in round 2, where we were operating with bare-minimum hardware. Triple-redundancy schemes are necessary to achieve another order-of-magnitude increase in reliability. This scenario can of course play itself out even further with 3oo5, 4oo7, etc., if you have lots of money.

I'm currently working on a fluidized bed system with 7 temperature sensors in the original design. The original design had electricians come in and swap wires any time they wanted to switch sensors. I've provided enough thermocouple readers that all of them can be read simultaneously. A simple median calculation (sort the readings from low to high and use the middle value as the correct one) logically implements 4oo7: up to 3 sensors can fail in any way with no direct impact on the process. But that's the only scenario I can recall where a 4oo7 case actually legitimately came up. Regardless, the outputs are usually the least reliable function.
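
To make these voting schemes concrete, here is a minimal sketch. This is illustrative Python; in a real ControlLogix application it would be ladder or structured text, and the function names are mine, not a library API:

```python
def vote_1oo2(a: bool, b: bool, discrepancy_timed_out: bool) -> bool | None:
    """1oo2 without diagnostics: once the discrepancy timer has expired,
    a disagreement is a detected fault, but we cannot tell which sensor
    is wrong -- so we can only flag it (None), not tolerate it."""
    if a == b:
        return a
    return None if discrepancy_timed_out else a  # may just be switching delay

def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """2oo3 majority vote: any single sensor can fail in any way without
    disturbing the process, even with non-diagnostic sensors."""
    return (a and b) or (a and c) or (b and c)

def median_vote(readings: list[float]) -> float:
    """Median of an odd number of analog readings. With 7 thermocouples
    this implements 4oo7: up to 3 can fail arbitrarily and the middle
    value is still a good one."""
    ordered = sorted(readings)
    return ordered[len(ordered) // 2]

# One dead reader (0.0) and one railed reader (1350.0) are voted out:
print(median_vote([802.0, 799.5, 1350.0, 801.2, 0.0, 800.1, 798.8]))  # 800.1
```
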
In this case, double redundancy, such as two blocking valves or two relays in series, provides the same benefit as 1oo2 sensors with diagnostics: you get fully half the chance of something going wrong, as long as there are no major "common mode" issues. An example is pneumatically operated valves that have to be energized to operate: a pressure loss in the compressed air supply disables them both. In practice, when doing safety and reliability calculations, it usually turns out that the processor, wiring, etc., account for about 10% of the failures, the sensors another 40%, and the actuators about 50%. That's in the process industry, and those are generic numbers; I can conjure up tons of examples where this is not the case.

I've had experience with hundreds of PLCs over my career, mostly because the plants I've worked in have 20+ of them, all PLC-5, SLC, ControlLogix, etc. (not many of the micro-PLC type things such as Omron). Overall, I can count on one hand the number of actual hardware PLC failures not due to external problems such as bad grounds and transients, high ambient temperatures, or running water into the controller. Every one of them resulted in a processor faulting after a semi-random interval, although it might take days or weeks to repeat itself. The result was always the same: red light on, processor memory blank. I verified these situations only by taking the processor off the plant floor, running it on the bench, and waiting for weeks to see if it happened again. I'm not attributing it to the processor actually losing its memory, either; I suspect that whatever was going on, the PLC's diagnostics recognized corrupted memory and deleted the program. I've seen this happen about 3 times. That's with hundreds of PLCs out there in my experience. It is exceedingly rare, and I only believe it after a "swap processor, run on bench for 6 months" type test.

I've seen the same SCENARIO play itself out a lot more often than that, though, and there are lots of ways for it to happen. You can send bogus MSG blocks to some PLC-5's and cause this. Some PLCs have been known to be susceptible to crappy network connections or traffic. I've seen it happen with the right poor-grounding situation or transients. Bad power supplies are another typical animal. The most common problem is allowing the battery to die (don't believe the indicator lights!). All of these are avoidable, and any plant so concerned about reliability that redundant processors are being contemplated should first make sure to address these all-too-common causes.

If your concern is coding errors, there is something else you can do. First, there is no harm in running code in "test" mode for a period of time, even days; I've done just that for a long period to verify that everything was working properly. Second, you should strongly consider writing a simulator. They are simple to write, and you can usually code them as a separate task in the same program; then you run the "debugging" version in RS-Emulate or SoftLogix whenever you desire. It also handily doubles as a trainer, and I've found that code quality (bug-free code) drastically increases with a simulator: all those little typos tend to get caught right away. It also helps end the arguments about whether it was operator error or programmer error, because you can actually put the operator on the simulator and challenge them to duplicate the event.
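
A simulator task can be as simple as reading back the outputs the control code just wrote and fabricating the inputs the real process would produce. A minimal sketch (illustrative Python; the belt speed, eye position, and tag names are invented for the example):

```python
def simulate_conveyor(io: dict, dt_s: float) -> None:
    """Fake one scan of a conveyor: while the motor output is on, advance
    the part and trip the downstream photo-eye when it arrives."""
    if io["motor_run"]:                        # output written by control code
        io["part_position_m"] += 0.5 * dt_s    # assumed belt speed: 0.5 m/s
    io["photo_eye"] = io["part_position_m"] >= 2.0  # assumed eye at 2 m

# Exercise the simulator for 100 scans at 50 ms each (5 s of belt travel).
io = {"motor_run": True, "part_position_m": 0.0, "photo_eye": False}
for _ in range(100):
    simulate_conveyor(io, dt_s=0.05)
print(io["photo_eye"])  # True: the part has reached the eye
```
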
Third, break your code out into separate, self-contained programs. You should be doing this anyway. Now here's the tricky part: write a fault task. If the fault task detects a code error, turn on the inhibit bit for the offending program and clear the fault. The downside is that the fault light won't turn on for code errors (you need to provide your own indicator for this). The upside is that when a code error does cause a fault, the offending program shuts down but the rest of the PLC continues to operate: if the PLC runs lots of processes, every process EXCEPT the faulty one carries on unharmed. A side benefit is that you can easily decode the task, program, and routine from the data returned in the fault record (even though AB doesn't document what's in a lot of the fault data). The one thing I haven't been able to experimentally determine is the rung/instruction location from the returned fault data. I suspect the returned data includes a position within the code stream, but since I don't have access to the op codes or other "raw" ControlLogix data, I can't discern what the remaining data means. You can't retain this data either (since the goal is to clear the fault), so you will only be able to determine task, program, and routine after the fact on a trapped fault, not line and instruction.
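
Roughly, the fault-trapping logic looks like this. This is a sketch in Python pseudocode: in a real ControlLogix system it would live in the fault routine and use GSV/SSV on the fault record and program objects, and the object and attribute names here are invented for illustration.

```python
MAJOR_FAULT_TYPE_PROGRAM = 4  # program faults, e.g. array subscript out of range

def fault_handler(fault_record, programs, code_fault_lamp) -> None:
    """Trap a program fault: inhibit the offending program, light our own
    indicator, log what we can decode, then clear the fault so the rest
    of the controller keeps running."""
    if fault_record.fault_type != MAJOR_FAULT_TYPE_PROGRAM:
        return  # let anything else fault the controller normally
    offender = programs[fault_record.program_name]  # decoded from fault data
    offender.inhibited = True   # only this program stops scanning
    code_fault_lamp.on = True   # our own light; the fault LED stays off
    print(f"Trapped T{fault_record.fault_type}:C{fault_record.fault_code} in "
          f"{fault_record.program_name}/{fault_record.routine_name}")
    fault_record.clear()        # controller stays in Run mode
```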

On, for example, a production machine with a repeating cycle, it is very easy to build the necessary diagnostics into your program to provide timeouts and give maintenance personnel the necessary prompts via operator interfaces, reducing fault-finding time. My application is a back-up generator which is used rarely and definitely falls into the "failure on demand" scenario. The generator is run once per week for testing, so there is an opportunity to pick up any faults, but if something fails between tests and the generator is required, it's no use having two processors when an output has failed. Yes, it would be great to write a simulator and fully test the code, but sometimes when the pressure is on, you know what can happen.

Redundancy is always used in glass furnaces. If one TC goes down, you might not be able to control a forehearth very well, but your furnace won't go down, freezing the glass and costing millions upon millions of dollars to replace the furnace. Plus, there are multiple forehearths that can still make product. That said, here's an interesting story. I was doing a job at a glass plant, making changes (live) to a redundant system which used indirect addressing for most of its code. Once I got the changes into one processor, we decided to switch processors to make the changes in the other. Well, every time we switched, the processor faulted. As it turned out, there is a timer in the old BCM (Backup Control Module, I believe) that times the delay to switch processors and start running code. The time delay was set well below... I think maybe it was below scan time... but at any rate, it was set way too fast. This rushed the switch and faulted the processor. Moral of the story (which has already been discussed in this thread): the PLC was more reliable than the redundancy itself, due to the redundant system's complexity. That said #2: Rockwell's redundancy has come a long way from the BCM and is much easier and more reliable. Edited by IamJon

Jon makes an interesting point. At the time of my post in 2006, the majority of my automation experience had been in serial processes such as material and baggage handling or component assembly (TV). I will support and agree that in a situation where a batch loss or system interruption carries radical cost factors, redundancy makes sense. But CPU redundancy alone is the weakest approach, even for those situations.

We are in the midst of a major upgrade on a kiln right now. This one fits the bill as the classic "process" application, and it's a burner management scenario. If you screw it up, you can pump oil and/or coal into the vessel at a time when it does not burn. At that point, there is only one way to recover safely: shut down and clean everything out (with shovels and jackhammers) and eat the 12-day turnaround doing it. Mind you, a split-second problem here or there is no big deal; it's not like you're going to instantly go from safe to bomb, unlike, say, safety systems in presses and other machinery operations, which spell the difference between 10 fingers and toes and missing personnel.

The critical problem here is that it's a fluidized bed system. With most industrial heaters, you operate in a temperature range well above the autoignition temperature: if you put fuel in and there's oxygen present, it is guaranteed to burn. These systems operate well below that point, because the oxidation conditions favor combustion so much that, in a sense, it has a lot more in common with smoldering coals than open flames. So normal operation in a fluidized bed system is much closer to the threshold of having a high enough temperature to get any kind of combustion at all.

The second major problem is that a fluidized bed system naturally likes to plug up, since it is in the nature of the system to be very dusty, and whatever doesn't plug up tends to be eroded. So sensor redundancy in the extreme is the way to go. The third problem is a consequence of all that dust: housekeeping is a major duty around the area, and it is highly mechanized. The trouble is that all the fiber and CAT 5 (even armored) in the world cannot be made tough enough to withstand the clean-up crews, and if you lose a communication cable, it might be longer than even a few minutes before communication can be restored.

Hence, in this scenario, PLC redundancy is pretty much a waste of time. Sensor redundancy is almost a must-have. If anything, network cable redundancy is a very nice-to-have.

Wow, get the myrrh and unwrap the burial shroud, cause this thread has been resurrected! Didn't realize it was so old :)
