Conor

SLC 5/04 Watchdog error

22 posts in this topic

Hi guys, I came in this morning and had an error on one of my SLCs. The error was "The watchdog timer expired." I looked at the major error codes, and they say to increase the watchdog timeout value in the status file. I have checked the scan times on the PLC, and S:3 is set to 160. From reading the manual, I take it that this is 1.6 sec? I checked S:20 and S:21 on the "Errors" tab. These point towards the EOP rung on my first (Main) routine. Just before this last rung is an Unlatch for S:5/0. I take it this is a math overflow unlatch. Does anyone have any ideas why this error may have happened? I don't recall seeing this error before. Thanks, Conor

Was the program running fine before, I mean for a long time? If so, that almost certainly excludes a programming error, unless somebody has modified something in the code, maybe adding loops or conditional jumps... Once that is checked, the suspect is a hardware issue, so you may need to check which modules you have in the chassis, especially if there are comms modules, like RIO... The status file's pointer to LAD2 doesn't help here; a watchdog error always faults at that point, as far as I can remember... - fuzzy logic

Hi fuzzy, Yes, the program was running fine, for a long time. The rack configuration is as follows:
1747-L541C 16K
1746-IB16
1746-IB16
1747-SDN
1746-OW16
1746-OW16
1746-NI8
1746-NI4
1746-NIO4I
Conor

It could be that the watchdog was set too close to the actual scan time. I typically leave the watchdog setting at the default value. If memory serves me correctly, that value is 500. Units are milliseconds, not seconds.

Any excessive jumping or nested loops that are sometimes executed and sometimes not? If this program doesn't have anything like that to vary its scan time, then I'm with Fuzzy Logic that you've got a hardware problem. Program scan time is extremely consistent when the same routines are called in the same manner every time.

Hi guys, To expand a little: this processor was on one air blower. We have 9 blowers in total, 4 pairs and one stand-by in case one of the pairs goes down. I checked the watchdog timeout setting on all of them and it was 160. I checked the scan times on all of the processors and they are all about 3 1/2 ms. Also, Michael, I checked the help files and S:3 is in multiples of 10 ms. The default is 10 (100 ms) and this can go up to 250 (2.5 secs). Conor
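
The unit conversion in this post can be checked with a few lines of arithmetic. This is only a sketch in Python (not SLC code): the function names are made up for illustration, and the 10 ms multiplier and the upper limit of 250 are taken from the help-file figures quoted above rather than verified independently.

```python
# Sketch only: converts the watchdog setting quoted in this thread to milliseconds.
# Assumption: 10 ms per count and a maximum of 250, as reported in the post above.
WATCHDOG_UNIT_MS = 10
WATCHDOG_MAX_SETTING = 250


def watchdog_timeout_ms(setting: int) -> int:
    """Return the watchdog timeout in ms for a raw setting (hypothetical helper)."""
    if not 0 < setting <= WATCHDOG_MAX_SETTING:
        raise ValueError(f"setting must be 1..{WATCHDOG_MAX_SETTING}, got {setting}")
    return setting * WATCHDOG_UNIT_MS


def headroom_factor(setting: int, scan_ms: float) -> float:
    """How many times longer the timeout is than a typical scan."""
    return watchdog_timeout_ms(setting) / scan_ms


if __name__ == "__main__":
    for raw in (10, 160, 250):
        print(raw, "->", watchdog_timeout_ms(raw), "ms")
    print("headroom at 3.5 ms scan:", round(headroom_factor(160, 3.5)))
```

With the values reported here, a setting of 160 works out to 1600 ms against a typical scan of about 3.5 ms, so a normal scan is several hundred times shorter than the timeout.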

It is strange that the watchdog was set so high when the default would have worked fine. So the next question is: was it raised because of excessive scan time while developing some other part of the code, or has this happened before? I would reset the max scan time and check on it in a few days. Is there any way you can post your program, or are you pretty sure that the program flow is straightforward and consistent?

See attached. It is a vendor's program. CEN3750E.RSS

OK, this is a complete swag considering I don't even know what this machine does, but aside from PLC hardware the only other thing that pops out is ladder 9, specifically rungs 42-47 and 52-57. It is a JMP/LBL combination that jumps backwards from rung 47 to 42 and from 57 to 52 in the program. That in itself is fine, but you might look further there to see if there are any external variables that might cause excessive looping. The first thing I would look at is the Q9:2 calculation bypass look. Is B3:6/1 staying off except once in a blue moon? Its code is above, but I didn't have time to look at it. After that, does it consistently loop through the code? What external variables could affect it?

I'm finished for the day and I don't have the RSLogix 500 software on my laptop at home (yet!?!?). I'll check it out tomorrow morning and get back to you.

Do you know if this watchdog fault occurred right after power up? I have two SLCs with good-sized DeviceNet networks that generate a first scan time of up to 962 ms! They then run until power is cycled without ever getting above 40 ms... Not sure why, but some smart cards definitely can impact first scan time. I discovered this when one of them faulted (watchdog) with a scan time of 1010 ms with the watchdog set to 100 (1000 ms). I have also seen a failing module cause a watchdog timeout in the past. I believe it was an analog card, not sure which p/n.
The JMP/LBL loop that jumps backwards is definitely a prime suspect, but it looks nearly bulletproof, since the FFU is unconditional and the EM bit is the condition for the JMP.
EDIT: Wait a minute, will the FFU do anything without the false-to-true transition? The FFU will empty the stack, so the EM bit will go true and skip the JMP at 100 or fewer cycles, as long as nothing external to this loop is changing the value of R6:0.POS. Now, if some external code or device writes to this control element position, this could definitely extend the loop or even make it infinite and cause your fault; however, since you have several of these identical machines running the same code, that seems less likely. The other two JMPs go forward, so I don't think they can cause this. With my quick review, I think the code is probably okay, and that you have a bad piece of hardware triggering the problem.
EDIT: A watchdog value of 160 is 1600 ms or 1.6 seconds, and that is where I have my SLCs with DeviceNet set so they don't fall over every power cycle.
Edited by OkiePC
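
The argument about the backward JMP can be illustrated with a toy model. The following Python sketch is not the vendor's ladder logic: it only assumes a loop that unloads one element per pass and exits once the position reaches zero, and the `external_write` hook is a hypothetical stand-in for any other code or device that modifies the position while the loop runs.

```python
from typing import Callable, Optional


def passes_until_empty(
    start_pos: int,
    external_write: Optional[Callable[[int, int], int]] = None,
    max_passes: int = 10_000,
) -> Optional[int]:
    """Count loop passes until the position reaches 0.

    Returns None if the loop has not finished after max_passes, which is the
    situation a watchdog timer exists to catch.
    """
    pos = start_pos
    passes = 0
    while pos > 0:  # stands in for "empty bit false, so jump back"
        if passes >= max_passes:
            return None
        pos -= 1  # one element unloaded per pass
        passes += 1
        if external_write is not None:
            # Hypothetical outside interference with the position value.
            pos = external_write(passes, pos)
    return passes


if __name__ == "__main__":
    print(passes_until_empty(100))                      # bounded by the stack depth
    print(passes_until_empty(100, lambda n, pos: 100))  # position keeps being rewritten
```

Left alone, the loop ends after at most as many passes as there are elements. If something keeps rewriting the position, the model never reaches zero and only the pass cap stops it.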

Hi Paul, The machine was running away fine when this happened. I will check out the code now.
I checked the code. B3:6/1 is coming on/off every minute or so, when 100 samples are taken. I can't find anywhere that R6:0.POS is being written to externally. I will check more. Some of the code in these PLCs is done with indirect addressing. Conor
Edited by Conor

Hi guys, The major thing about this problem was that when the processor faulted, the other machine did not start; it thought that the faulted machine was running fine. When the operators looked at the SCADA it showed running, and of course no one went down and physically checked whether or not it was actually running. I can't force zeros into the values when the processor is faulted, can I? The program won't be cycling (am I correct here?). These PLCs are monitored by a PLC 5. I am going to have to look at the heartbeats from each of these PLCs and set off an alarm when one of the heartbeats is not detected. Conor

Depending on the SCADA, it should be able to report a comms loss, and should be able to clear any stale values within a few seconds of the event, unless the SCADA is FactoryTalk, in which case it may take several minutes.
EDIT: Actually, if this is a runtime fault the above may be wrong... the SLC still talks over Ethernet, it just ain't gonna run the LAD (including any MSG stuff). The SCADA then needs to explicitly monitor the fault status registers and assign those as faults in its alarm system...
The other machine thinking this one was running is a big-time issue. That signal should possibly be hardwired, or at least use a heartbeat counter, so that a failure on either end of comms will be detectable by the non-faulted SLC. For heartbeat, to me nothing beats a counter. Follow the Leader by exchanging two INTs or DINTs.
In the Leader, this logic (pseudocode):
    IF Leader.HeartbeatCount = Follower.HeartbeatCount THEN
        Leader.HeartbeatCount++   // Trap overflow where required
In the Follower, this logic:
    Follower.HeartbeatError = Leader.HeartbeatCount - Follower.HeartbeatCount
    Follower.HeartbeatCount = Leader.HeartbeatCount
    Switch (Follower.HeartbeatError)
        Case 0: Break
        Case 1: Break
        Case default: // Log number of heartbeats missed (Follower.HeartbeatError), or take required action.
Edited by OkiePC
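
The leader/follower pseudocode above could be prototyped off-line roughly as follows. This is a Python simulation, not ladder logic. The rollover value of 32767 assumes a 16-bit signed integer, and the `stale_scans` counter with its `is_stale` check is an addition that is not in the post: it is one possible way to notice that the leader's count has stopped changing.

```python
from dataclasses import dataclass

COUNTER_MAX = 32767  # assumption: 16-bit signed integer rollover point


@dataclass
class Leader:
    count: int = 0

    def scan(self, follower_count: int) -> None:
        # Advance only once the follower has echoed the current value.
        if self.count == follower_count:
            self.count = 0 if self.count >= COUNTER_MAX else self.count + 1


@dataclass
class Follower:
    count: int = 0
    error: int = 0
    stale_scans: int = 0  # hypothetical addition, not part of the original pseudocode

    def scan(self, leader_count: int) -> int:
        # Modulo arithmetic keeps the difference at 1 across a rollover.
        self.error = (leader_count - self.count) % (COUNTER_MAX + 1)
        self.count = leader_count
        # Count consecutive scans in which the leader's value did not change.
        self.stale_scans = self.stale_scans + 1 if self.error == 0 else 0
        return self.error

    def is_stale(self, limit: int) -> bool:
        return self.stale_scans >= limit


if __name__ == "__main__":
    leader, follower = Leader(), Follower()
    for _ in range(5):
        leader.scan(follower.count)
        follower.scan(leader.count)
    # Leader stops scanning (for example, its processor has faulted).
    for _ in range(3):
        follower.scan(leader.count)
    print(follower.count, follower.is_stale(3))
```

The stale limit would need to be chosen generously, since a follower that scans faster than the leader will see an unchanged count on some scans even when everything is healthy.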

Greetings Conor ... naturally I haven't had time to go COMPLETELY through your program – but here is ONE thing that really jumps out at me ... you have documentation for T4:0 which specifies it to be the: "Power Up Reset Timer" ... you're using XIC instructions for that timer's Done bit in multiple places – but (and here's the fishy part) you don't have a TON instruction for T4:0 anywhere in your program ... the Done bit has a status of ONE – which means that all of the XIC instructions based on that timer will be TRUE just as soon as the processor enters the Run mode ... I'm just guessing here, but I'd say this is something that needs looking into ... the data table for the timer in question shows a setting of 5 seconds – and presumably this was a time allowance for SOMETHING in your system to "warm up" or something else along those lines ... maybe that missing "warm up" period is part of your problem ... then again – maybe not ... if it were me, I'd check this program against the others that you've mentioned and see if this "missing timer" setup is unique ... going a little bit further ... as near as I've been able to tell so far (and my time has been limited) the purpose of all of this "collecting sample readings" in Ladder File #9 is to control the status of bit B3:6/6 – which is documented as the: "Surge detection bit" ... but ... it looks like the only thing in your program capable of writing that bit to a ONE status is a single OTL instruction ... the fishy part is that the OTL rung is being rendered ineffective by a "homebrew" Always False condition ... in other words, I don't think that this program (in its current configuration) is ever going to declare a "surge" condition ... so ... my first question is: why are you making the processor jump BACKWARDS through two separate loops – if you're not going to make use of the information being developed within those loops? ... 
suggestion: if these ideas don't pan out, can you post another program file – one from an identical system that is NOT giving problems? ... maybe a comparison of the two programs would be fruitful ... anyway ... I wish I had more time to play – but I've got other time-critical stuff to work on ... I hope this helps ... Edited by Ron Beaufort

As an aside for Paul; I'll bet you a case of donuts that your systems with DeviceNet scanners that run so slowly make extensive use of direct references to the M0 and M1 files. Every single time you refer to those files you take about a 1-ms scantime penalty. It doesn't matter if it's an XIC or OTE or a MOV or COP; each instruction referencing M0 or M1 requires a module access interrupt. I always copy the entire M0 and M1 files to Integer arrays at the beginning and end of the program.
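
The copy-once approach described above can be illustrated with a small model. This Python sketch is not SLC code: `FakeModule` is a made-up stand-in for a specialty module that simply counts how many times it is accessed, and treating a whole-file copy as a single access is an assumption of the model, not a measured figure.

```python
class FakeModule:
    """Hypothetical stand-in for a module whose data is slow to reach."""

    def __init__(self, words):
        self._words = list(words)
        self.accesses = 0

    def read_word(self, index):
        self.accesses += 1  # every direct reference costs one module access
        return self._words[index]

    def read_block(self):
        self.accesses += 1  # assumption: a whole-file copy counts as one access
        return list(self._words)


def direct_scan(module, indices):
    """Reference the module directly each time a value is needed."""
    return [module.read_word(i) for i in indices]


def buffered_scan(module, indices):
    """Copy the file once, then work from the local image."""
    image = module.read_block()
    return [image[i] for i in indices]


if __name__ == "__main__":
    wanted = [0, 5, 5, 7, 31]
    a, b = FakeModule(range(32)), FakeModule(range(32))
    direct_scan(a, wanted)
    buffered_scan(b, wanted)
    print("direct accesses:", a.accesses, "buffered accesses:", b.accesses)
```

Both functions return the same values; the difference is only in how often the module is touched, which is where the per-reference scan time penalty would come from.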

Thanks for all your help, guys. It will be Monday at this stage before I can look at the code. Also, the machine and the code are from a vendor; of course I would write code like this. I wrote some code yesterday so that my main site alarm PanelView will bring up an alarm when a processor is faulted. I have one alarm for each. Conor

Ken, that is something we have dug into and cleaned up a bit, but I never understood why it was only the first scan that is affected. Does the SLC hold up the scan until all the configured inputs have responded? We only have the minimal number of active M0/M1 COP instructions. After removing some of the OEM code that was intended to allow the HMI to edit drive parameters, my first scan time did drop a bit. The few references we have left are just COP instructions, and they occur every scan. There are about 40 AB160 series drives on the scanlist, and there is also a RIO scanner for two robots and a couple of racks of Flex I/O, and we are doing two M1 copies for status and alarms from that scanner... but why does it only kill the first scan, and then run at 30-45 ms from then on?
Edited by OkiePC

Hi Ron, I have checked all of the other PLCs and none of them have a TON instruction for T4:0. I have looked through the documentation from the vendor and can't find anything about this either. I will do a bit more digging. Of course, all of the machines have run fine since the problem last week. I have written code to monitor the heartbeat from each of these PLCs as well. To get it back to my alarm PanelView I have to go through DeviceNet to one PLC5, then a message across ControlNet to another PLC5, then an Ethernet message to an SLC which the PanelView is looking at. Conor

well, this won't do anything to "fix" your problem – but it might help to shed some light on what's causing the intermittent watchdog fault ... I'd suggest adding a new "tattle tale" timer into each of your two recursive loops ... notice that you'll need to add an RES (Reset) instruction just ABOVE each of your existing "loop" labels (Q9:3 and Q9:4) ... you'll also need to add a TON (Timer On Delay) instruction just BELOW each of those same labels ... naturally you'll want to use a separate address for each of the two new timers ... the basic idea is covered in the rung comments – but feel free to post any questions that you might have ... naturally if the processor ever does fault out again, you'll want to take a look at the Accumulator values stored in each of these two timers BEFORE you reset the fault ... personally I'm not convinced that these loops are actually causing the watchdog problem – but the little trick shown here should go a long way toward eliminating the loops from your list of potential culprits ...
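
The "tattle tale" idea above can be modelled off-line as follows. This is a Python sketch of the concept only, not the rungs from the attachment: the class name is invented for illustration, and the injectable clock exists purely so the behaviour can be checked without waiting in real time.

```python
import time


class TattleTimer:
    """Records how long execution has stayed inside a loop since the last reset."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the sketch can be tested deterministically
        self._start = None
        self.acc_ms = 0.0

    def reset(self):
        # Plays the role of the RES placed just above the loop label.
        self._start = self._clock()
        self.acc_ms = 0.0

    def update(self):
        # Plays the role of the TON just below the label, evaluated on every pass.
        if self._start is None:
            self.reset()
        self.acc_ms = (self._clock() - self._start) * 1000.0


if __name__ == "__main__":
    timer = TattleTimer()
    timer.reset()
    for _ in range(3):
        time.sleep(0.01)  # stand-in for the work done on each pass of the loop
        timer.update()
    print(f"time spent in loop so far: {timer.acc_ms:.0f} ms")
```

If execution were ever stuck in the loop, the last accumulated value would show how long it had been there, which is the information to read before clearing the fault.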

Thanks Ron. I will try this out and let you know

Ganpat, this thread is seven years old and addresses a fault on a different model of controller with a different cause and solution.

Please create a new thread to address your faulted MicroLogix 1200 controller.

In general, "single digit" fault codes are caused by faulty hardware or by electrical noise. When I've seen Fault 02h on MicroLogix 1200 controllers, the first place I look is for a heavy green ground wire.

Edited by Ken Roach
