QUOTE (ratcliffe_ic @ Sep 16 2009, 04:14 AM)

While the system has redundant processors and a redundant controlnet network and it all sounds great to the people buying the equipment, I still only have one input looking at a sensor and only one output driving an actuator. In my experience this is where the faults happen that stop a system, I can't remember any occasion where I have had a processor fail (apart from out of the box).
So unless you have 2 sets of inputs and outputs and 2 sets of sensors and actuators, how can I have improved the situation?
This may or may not do it. Let's consider 3 possibilities.
1. The sensor fails but the fault is undetectable. This would be the case if, for instance, you have a digital input that spends 90% of its life in the "0" state and faults to a "0". This type of fault would be undetectable in software. In the real world, it would eventually become detectable but not in a good way (failure on demand). These all eventually become detectable (case 2) but it gets you in the right frame of reference for thinking about this.
2. The sensor fails in a detectable way. This would be for instance the mirror image of item #1. An example is a flame detector. These typically fail in such a way that they detect flame when in fact no flame is present. Self-monitoring flame detectors actually use this property by periodically opening and closing an internal shutter.
3. The sensor fails but has diagnostic support and the failure mode happens to be one that the diagnostics can detect. For instance, AB sells diagnostic I/O cards for 1756 and 1794 I/O. You can detect shorts and opens with the digital diagnostic inputs. Something similar can be done in almost all cases with analog I/O, and having a simple auxiliary contact feedback allows for detecting a lot of contactor faults. In fact the auxiliary contact plays prominently into safety relays.
OK, now let's consider the 1oo2 case. If we have say 2 digital inputs and one reads "1" while the other reads "0", and we've waited for sufficiently long (a timer) to verify that this isn't simply a switching delay, what do we do? We can't tell which state is correct, so we can't do fault tolerance (ability to ignore faulty hardware). We can only either pick one and hope we picked the right sensor, or fault. So process reliability increases but so does downtime. Not what we intended to do...
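The discrepancy check described above can be sketched roughly as follows. This is only an illustration in Python, not PLC code; the class name, the 0.5 second tolerance, and the caller-supplied timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscrepancyMonitor:
    """Flags two redundant digital inputs that disagree for too long.

    max_disagree_s is a hypothetical tolerance for normal switching delay.
    """
    max_disagree_s: float = 0.5
    _since: Optional[float] = None  # time at which the inputs first disagreed

    def update(self, a: bool, b: bool, now: float) -> bool:
        """Return True once the inputs have disagreed for the full tolerance."""
        if a == b:
            # Inputs agree again: reset the discrepancy timer.
            self._since = None
            return False
        if self._since is None:
            # First scan on which the inputs disagree: start timing.
            self._since = now
        return (now - self._since) >= self.max_disagree_s
```

Note that the monitor can only report that the pair disagrees. It has no way to say which of the two inputs is wrong, which is exactly the limitation of non-diagnostic sensors in this arrangement.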
Round 2. This time, we have diagnostic sensors. Now we can not only detect that we have a fault (case 1) but we can even detect (hopefully) which one it is. In addition, many of the previously undetectable faults become detectable ahead of time. The overall result is that we usually detect more faults and sooner, increasing reliability over the first case, plus we can often detect which sensor faulted, which decreases downtime by an order of magnitude as well.
Round 3. We have 3 sensors. Now we can use 2oo3 (best 2 out of 3). This allows us to use even non-diagnostic sensors and operate reliably. Diagnostics still boost overall reliability and decrease downtime, but the increase is less than what we can do in round 2 where we are operating with bare minimum hardware. Triple redundancy type things are necessary to achieve another order of magnitude increase in reliability.
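A 2oo3 vote on digital inputs is just a majority function. Here is a minimal sketch in Python; the function name and the "a"/"b"/"c" labels are made up for illustration, and a real implementation would normally add a discrepancy timer before alarming on a dissenting sensor.

```python
from typing import List, Tuple


def vote_2oo3(a: bool, b: bool, c: bool) -> Tuple[bool, List[str]]:
    """Return the majority value and the labels of any sensors that disagree."""
    # Majority of three: true when at least two inputs are true.
    voted = (a and b) or (a and c) or (b and c)
    # Any input that differs from the majority is a suspect sensor.
    dissenters = [name for name, value in (("a", a), ("b", b), ("c", c))
                  if value != voted]
    return voted, dissenters
```

The process keeps running on the voted value, while the dissenter list tells maintenance which sensor to look at, even without diagnostic hardware.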
This scenario can of course play itself out even further with 3oo5, 4oo7, etc., if you have lots of money. I'm currently working on a fluidized bed system with 7 temperature sensors in the original design. The original design had electricians come in and swap wires any time that they wanted to switch sensors. I've provided enough thermocouple readers that all of them can be read simultaneously. A simple median calculation (sort the readings from low to high, use the middle value as the correct one) logically implements 4oo7 (up to 3 sensors can fail in any way with no direct impact on the process). But that's the only scenario that I can recall where a 4oo7 case actually legitimately came up.
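The median selection mentioned above looks something like this in Python. It is a sketch only: the function name and the sample readings are invented for the example, and the real system would do this in the controller.

```python
from typing import Sequence


def median_select(readings: Sequence[float]) -> float:
    """Sort the readings from low to high and return the middle one.

    Requires an odd number of readings so that a single middle value exists.
    """
    if len(readings) % 2 == 0:
        raise ValueError("median selection needs an odd number of readings")
    return sorted(readings)[len(readings) // 2]


# Hypothetical example: 4 healthy thermocouples near 850 degrees, plus
# 3 failed ones reading 0.0, 1372.0 and -270.0.
readings = [851.0, 849.5, 850.2, 0.0, 1372.0, 850.8, -270.0]
selected = median_select(readings)
```

With 7 readings, at most 3 bad values can sit on either side of the middle position after sorting, so the selected value always falls within the range of the healthy sensors as long as no more than 3 have failed.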
Regardless, the outputs are usually the least reliable function. In this case, double redundancy such as two blocking valves or two relays in series provides the same benefit as 1oo2 sensors with diagnostics...you get fully half the chance of something going wrong as long as there are no major "common mode" issues, an example being pneumatically operated valves that have to be energized to operate. A pressure loss in the compressed air supply disables them both.
In practice, when doing safety and reliability calculations, it usually turns out that the processor, wiring, etc., account for 10% of the failures. The sensors are another 40%. And the actuators make up about 50%. That's in the process industry and those are generic numbers. I can conjure up tons of examples where this is not the case. I've had experience with hundreds of PLC's over my career, mostly because the plants I've worked in have 20+ of them, all PLC-5, SLC, ControlLogix, etc. Not many of the micro-PLC type things such as Omron. Overall, I can count on one hand the number of actual hardware PLC failures that were not due to external problems such as bad grounds and transients, high ambient temperatures, or running water into the controller. Every one of them resulted in a processor faulting after a semi-random interval, although it might take days or weeks to repeat itself. The result was always the same...red light on, processor memory blank. I verified these situations only by taking the processor off the plant floor and running it on the bench and waiting for weeks to see if it happened again. I'm not attributing it to the processor actually losing its memory either...I suspect that in these situations, whatever was going on, the PLC's diagnostics probably recognized corrupted memory and deleted the program.
I've seen this happen about 3 times. That's with hundreds of PLC's out there in my experience. It is exceedingly rare. And I only believe it with a "swap processor, run on bench for 6 months" type test. I've seen the same SCENARIO play itself out a lot more than that. There are lots of ways for it to happen, too. You can send bogus MSG blocks to some PLC-5's and cause this to happen. Some PLC's have been known to be susceptible to crappy network connections or traffic. I've seen it happen with the right poor grounding situation or transients. Bad power supplies are another typical animal. The most common problem is allowing the battery to die (don't believe the indicator lights!). All of these are avoidable, and any plant which is so concerned about reliability that redundant processors are being contemplated should first make sure to address these all-too-common causes.
If your concern is coding errors, there is something else you can do. First, there is no harm in running code in "test" mode for a period of time, even days. I've done just such a thing for a long period of time to verify that everything is working properly.
Second, you should probably strongly consider writing a simulator. They are simple to write, and you can usually code them as a separate task in the same program. Then you run the "debugging" version in RS-Emulate or SoftLogix whenever you desire. It also handily doubles as a trainer, and I've found that code quality (bug free code) drastically increases with a simulator. All those little typos tend to get caught right away. It also helps end the arguments about whether it was operator error or programmer error because you can actually put the operator on the simulator and challenge them to duplicate the event.
Third, break your code out into separate self-contained programs. You should be doing this anyway. Now here's the tricky part. Write a fault task. If the fault task detects a code error, then turn on the inhibit bit for that program and clear the fault. The downside is that the fault light won't turn on for code errors (you need to provide your own light for this). The upside is that in the event that a code error does cause a fault, the offending program shuts down but the rest of the PLC can continue to operate. If you have a PLC running lots of processes, then all the processes EXCEPT the faulty one can continue to operate unharmed. Another upside is that you can easily decode the task, program, and routine from the returned fault data (even though AB doesn't document what's in a lot of the fault data). The one thing I haven't been able to experimentally determine is the rung/instruction location from the returned fault data. I suspect that the returned data includes a position within the code stream but since I don't have access to the op codes or other "raw" ControlLogix data, I can't discern what the remaining data means. You can't maintain this data either (since the goal is to clear the fault) so you will only be able to determine task, program, and routine after the fact on a trapped fault, not line and instruction.