Sleepy Wombat

Control System Troubleshooting - Tricks of the trade


This topic has come up from time to time, and I was fortunate enough to receive an industry email outlining what I know I do when faced with a troubleshooting problem, but have never had the time or inclination to put into words at such length. So, original credit goes to Fluke Networks for a well-written complimentary booklet entitled FrontLine LAN Troubleshooting Guide, and to a blog by Steve Mackay from IDC training. I have added my own spin on things and I hope that you find the article interesting, to say the least. Enjoy.

As you are all members or guests of this forum, you all have a vested interest in, and sometimes an obsession with, the controls industry. You probably don't get paid enough for your efforts and knowledge, but that is another debate. To be truly good at what you do, though, requires more than just studying theory. More often than not, as you make your way through the controls world, you will be required to troubleshoot control issues when a problem arises. This could be while commissioning or installing a new piece of machinery, or while fixing an existing piece of plant equipment. These problems can stem from faulty control gear or from other people's mistakes. Here is a little process that you can use to help you overcome these problems when they arise.

Keep your mind completely open when tackling the problem, and avoid preconceived ideas, as these can throw you off track. You have to realize anything is possible. We have all seen some so-called techy's approach to a problem: pulling out component after component at random, caveman fashion, in a rush to fix it. At the other end of the scale, we have also seen the thoughtful philosopher's approach: pondering the problem, spending ages investigating it and doing very little to fix it.

On the other hand, if you narrow your focus too easily you can extend the time taken to find the actual problem, with the chance that you might fix it without knowing exactly what caused it. If there's one thing I hate, it is fixing something without really knowing why.

The suggested steps for general troubleshooting are as follows. At times it will be tempting to leave some out, and if it is a simple fix maybe you can, but I would never skip the test and retest part.

1. Identify the exact issue and question the history leading up to the problem

When someone reports a problem to you, you can bet your bottom dollar that it may not be the actual problem. Seen through the eyes of a user, the report of the situation may not reflect reality. Ensure you get a careful explanation and, if possible, a demonstration of the problem. It is your job to ascertain what the real problem is.

Ask a simple question such as "Has any maintenance/repair/adjustment work been recently done on this machine?" At least half of the time, the answer to that question will determine where to start. The answer indicated the correct place to look on two of the three times last week that Bubba asked for Engineering help, and once already this week.

Often a problem presents intermittently; these are generally the worst type to come across. Intermittent problems are the ones I like to term "F**k, I've never seen that before!" moments. We have all seen them. You might not use the same expression as me, but I still wish I had a dollar for every time I have said it. Word of advice: don't walk away from it presuming it has gone forever. You can be assured that it will rear its ugly head again, and unless you are exceptionally good at telling a few white lies to cover your tracks, your name will be mud.

The problem could also be a combination of different issues. As an example with PID loops and tuning: a loop may seem sluggish, but the real cause could be high-frequency signals (an aliasing problem) that are better dealt with through filtering than by interfering with the PID constants.

2. Reproduce the problem

It is best to reproduce the problem where possible. You can then observe the full sequence of events, view the error messages, watch the machine or process crash (in this case, please make sure that the safety of you, those around you and finally the machine will not be compromised) and analyse other variables that may be affecting it. If the problem is intermittent, you may need to train the operator to do basic diagnostics and to record what happens leading up to and including the fault, whether as verbal descriptions, written notes or quantitative data.

For example, a network card wouldn't perform erratically until the afternoon sun had warmed up a control room and subsequently heated the card up. Without this knowledge it would be difficult to reproduce. Another example is the office network slowing down to a crawl at 2pm every day for 30 minutes: the network Mr Fixit had decided to set up automatic backups for that time, daily. Once this is understood, the investigation of the problem becomes fairly straightforward.

3. Localise, isolate and zone in

Now you have to zone in on the equipment or software module that is responsible for the problem. The trick is to penetrate the thicket of equipment, spaghetti wiring and/or code and find the precise element causing the problem. Remember that seemingly unrelated elements can cause problems.

It is also vitally important to identify exactly what happened before the problem occurred. Was a communications card changed out and the IP address not updated on the server? In the case of an analogue card, are all of the settings, software and hardware DIP switches the same as on the one replaced? Was there a sudden power surge? Was the RTU exposed to excessive heat?

4. Make a plan

Ensure that you assess carefully what is required. Beware the law of unexpected consequences and examine the "what if" scenarios. The process of fixing something may cause other unexpected problems. When going through your plan, step by step, to best remedy the problem, you may find that other issues appear that you hadn't considered. It is worth reflecting on each item of the fix to test for these unexpected consequences. For example, in replacing a valve, you may find the loop controller needs to be tuned again, as the parameters are slightly different. Or a replaced instrument has subtly different ranges, which require updating in the PLC code and SCADA configuration.

5. Trace your steps

Ensure that when you fix the problem, you know exactly what you have done in case you need to retrace your steps later to put the equipment back into its original state. Writing was not invented for nothing. It doesn't cost all that much time initially, but it will be a godsend when trying to put everything back together instead of relying on your own memory.

6. Test and retest

Test and retest over a period of time before accepting that the problem has been fixed. If there is any doubt about whether the problem has been fixed or not, there is no doubt: it is, most probably, still a problem. Many leave this step out, and the result is irritating for everyone when the whole process has to start again.

Then, importantly, gain acceptance from the end user and confirm he or she is happy with the fix and that it all works satisfactorily. I deal with a number of different clients, and they all sign a service report upon completion of a job.

7. Document for an absolute moron

People who come after you may not be aware of what you have done and how you have solved the problem. The problem may reappear, or something similar may happen to another piece of equipment. So document in infinite detail for someone who may have no knowledge of what you have done. This is something we are not so enthused about, but it is critical to the process. Naturally, ensure the documented fix is easily accessible by anyone, and not hidden in an arcane folder in some dark and dusty corner or, in some cases, tucked away on a server somewhere.

8. Communicate with the client or user

Sometimes the user is not convinced the problem has been fixed. Your job is to communicate honestly what you have done and why the problem has been fixed. Don't treat the user as a complete idiot, and don't overcomplicate the explanation either. This is important for your credibility and their acceptance.

So, there you have it. I hope that you have found this a useful guide, and good luck in future troubleshooting.

Regards,
Sleepy Wombat

Contributors: Alaric, pdl
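A footnote on the PID example in step 1, for anyone who has not met aliasing before. The sketch below is only an illustration, not anything from the original booklet: it shows that a fast disturbance sampled too slowly looks exactly like a slow one, which is why the loop can appear sluggish when the tuning is not at fault. The function names `alias_frequency` and `sample_sine`, and the 9 Hz and 10 Hz figures, are made up for the example.

```python
import math

def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a sine after sampling (standard folding formula)."""
    return abs(signal_hz - sample_rate_hz * round(signal_hz / sample_rate_hz))

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Samples of a unit sine at freq_hz, taken at sample_rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# Illustrative numbers only: a 9 Hz disturbance read by a 10 Hz scan.
fast = sample_sine(9.0, 10.0, 20)
slow = sample_sine(1.0, 10.0, 20)

# Sample for sample, the 9 Hz signal is an inverted 1 Hz signal,
# so the controller cannot tell the two apart.
samples_match = all(math.isclose(f, -s, abs_tol=1e-9)
                    for f, s in zip(fast, slow))

print(alias_frequency(9.0, 10.0))  # 1.0
print(samples_match)               # True
```

Once the signal has been sampled the two cases cannot be separated in software, so the filtering generally has to happen ahead of the input (or the input has to be sampled faster) before anyone touches the PID constants.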

Question #1: Has any maintenance/repair/adjustment work been recently done on this machine? At least half of the time the answer to that question will determine where to start. The answer indicated the correct place to look on two of the three times last week that Bubba asked for Engineering help and once already this week. Edited by Alaric

Thanks Alaric for the contribution. In my haste to paraphrase the original document I forgot to add that very simple question in step 1, even though I have been at pains to point out that very starting point in previous forum posts of mine. I have amended the document to reflect this and added you as a contributor. I look forward to adding any other suggestions from other members so that we can all use this as a reference document, as I believe that it is a very important and useful subject.

Regards,
Sleepy

First of all, great topic Sleepy. I would add the fact that you have to realize anything is possible. Be open to everything and don't jump to conclusions before you have thoroughly checked them. If you narrow your focus too easily you can be sure it will take you a lot longer to find the actual problem, with the chance that you might even fix it without knowing exactly what caused the problem. If there's one thing I hate, it is fixing something without really knowing why.

Thanks pdl. The open mind bit was already in there, so I highlighted it a bit more as it is important, and paraphrased your comments. Thanks mate.

Good topic Sleepy! Here's my 2 cents...

From #1: "Often a problem presents intermittently; these are generally the worst type to come across. Intermittent problems are the ones I like to term "F**k, I've never seen that before!" moments."

Actually, the more correct quote is "Oh f***, not that again!", because intermittent problems tend to go undiagnosed for a while: we have all become accustomed to hiccups when we shouldn't be. Definitely don't walk away from the problem, but it may take some time to recreate the conditions that caused it.

I had a seasonal temperature problem that hit random variable frequency drives (VFDs). After watching it for 2 years I could almost time WHEN the fault was going to happen by the outdoor temperature, but I couldn't tell which drives were going to be affected. It always happened in the early Spring when the temperature first got above 85 deg F, but never during the rest of the year. With two 90 deg F days this month, I was able to observe the problem again. The cooling tower's VFD overheated and I found a tiny dead fan buried in the middle of a very dirty heatsink.

After I had sworn heartily while replacing the fan, the senior electrician "remembered" that there was a Spring Cleaning PM where all the drive heatsinks get spray cleaned with canned air. I went back to the maintenance office and looked at the PM records. Yep, Spring Cleaning is scheduled for May & June. Guess what PM is getting moved to March & April and is being done right now?

So I didn't walk away from it, but it did take time and the right questions to get to the root cause and figure out how to prevent it in future. Actually, it will take until next Spring to see if moving the PM truly works, but I should know before the end of July whether cleaning the heatsinks earlier helps, based on the reduced number of overtemp faults.

From #7: "People who come after you may not be aware of what you have done and how you have solved the problem."

Someone else must have noticed this "phenomenon" several years before my time and instituted the Spring Cleaning PM as their attempt at a documented fix. It also makes sense that I wasn't seeing more drive overheating in the Summer, because the heatsinks had been cleaned by that time. All I'm doing is adjusting the timing of Spring Cleaning to have it done before the outside temperature gets above that critical level.

Now if I could just find an easy way for all of the drives to call for a Spring Cleaning PM after XXXX many hours of the heatsink fan running, I'd have a real winner!
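For what it's worth, the run-hours idea could be sketched roughly like this. It is only a sketch under assumptions: it supposes something (a PLC, a SCADA script, a gateway) can poll a "fan running" bit from each drive at a known interval, and the class name `FanRunHourMonitor` and its fields are invented for the example. The interval itself is left as a parameter because the post leaves it as "XXXX"; the 2.0 hours in the usage lines is there purely to keep the demo short.

```python
from dataclasses import dataclass

@dataclass
class FanRunHourMonitor:
    """Accumulates heatsink fan run time and flags when a cleaning PM is due."""
    pm_interval_hours: float   # site-specific; not given in the thread
    run_hours: float = 0.0

    @property
    def pm_due(self) -> bool:
        return self.run_hours >= self.pm_interval_hours

    def update(self, fan_running: bool, elapsed_seconds: float) -> bool:
        """Call once per poll; only time with the fan running is counted."""
        if fan_running:
            self.run_hours += elapsed_seconds / 3600.0
        return self.pm_due

    def reset_after_pm(self) -> None:
        """Call when the heatsink has actually been cleaned."""
        self.run_hours = 0.0

# Demo with a deliberately tiny, made-up interval.
monitor = FanRunHourMonitor(pm_interval_hours=2.0)
first = monitor.update(fan_running=True, elapsed_seconds=3600)    # 1 h run
idle = monitor.update(fan_running=False, elapsed_seconds=7200)    # fan off
second = monitor.update(fan_running=True, elapsed_seconds=3600)   # 2 h run
print(first, idle, second)  # False False True
```

Counting fan hours rather than calendar days would tie the PM to how hard each drive has actually been working, though dirt build-up also depends on the environment, so the interval would still need tuning per area.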

One thing I've noticed a lot of people do (sort of a rehash of pdl's thoughts, but a little different): they get ideas on what could be wrong, but don't really pursue them all, usually for time's sake. Or they may start to look into an issue, nothing looks out of the ordinary, so they back off and look elsewhere, when the problem is just a little deeper than they went. This usually happens when there is a group of engineers trying to solve the same problem.

I've found that it is most efficient to pursue each idea in full as you think of it. Take a leadership role, or assign someone as the head of the group. As people think of things, if it's agreed that something is a possible source of the problem, the head needs to take the initiative to pursue it then and there. This could be made more efficient by listing a couple of possibilities, addressing them in full, then going back to the drawing board if neither works. Either way is fine; which one is better depends on the scope of the problem or project.

This one is huge. Problem reporting should be symptom centric, not diagnosis centric. Problem solving is diagnosis centric, but not problem reporting.

Our own company is a classic example of why it can be a problem. We have a system where the operator enters a problem code for the machine in the OEE software, and the system sends an email to the maintenance group with a copy to the engineering group. The problem is that the production management group selected the cause codes, and they selected diagnosis-centric cause codes. The operator enters the code when he reports the problem, making him the front line diagnostic person instead of a problem reporter.

As they track the cause codes, they get a completely inaccurate picture of the kinds of problems that really occur. The most commonly reported cause code is "PLC problem", which has become the de facto catch-all. The reality is that it is virtually never a PLC problem; it's a problem with a sensor, an actuator, or operator error. Bubba and Co. have learned never to trust what the email says, so instead of showing up with the right diagnostic and repair tools, they first make a trip to the machinery to find out what is really happening, wasting time.

Management looks at the reports and, based on a poorly designed cause code list, incorrectly decides that an even stronger emphasis on PLC training is needed for Bubba and Co. Meanwhile the training that Bubba and Co. most need is neglected, and management remains oblivious about where they really should be focusing common problem abatement projects.

Naturally we can't get it changed. Have I ever mentioned that Scott Adams is secretly a mole who works for us?
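To make the distinction concrete, here is a rough sketch of what a symptom-centric report might look like as a data structure. It is not how our OEE software works; every name in it (`Symptom`, `Cause`, `ProblemReport`, the code lists, the machine names) is hypothetical. The point is only that the operator picks what he saw, and the cause field stays empty until maintenance has actually diagnosed the fault.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Symptom(Enum):
    """What the operator observed. Illustrative list only."""
    WILL_NOT_START = "machine will not start"
    STOPS_MID_CYCLE = "machine stops mid-cycle"
    ALARM_ON_HMI = "alarm shown on HMI"

class Cause(Enum):
    """Entered by maintenance after diagnosis, never by the operator."""
    SENSOR = "sensor"
    ACTUATOR = "actuator"
    OPERATOR_ERROR = "operator error"
    PLC = "PLC"

@dataclass
class ProblemReport:
    machine: str
    symptom: Symptom
    operator_notes: str = ""
    cause: Optional[Cause] = None   # unknown until someone has diagnosed it

    def close(self, cause: Cause) -> None:
        self.cause = cause

def cause_counts(reports):
    """Tally confirmed causes only; undiagnosed reports are left out."""
    return Counter(r.cause for r in reports if r.cause is not None)

reports = [
    ProblemReport("Press 4", Symptom.STOPS_MID_CYCLE, "stopped twice this shift"),
    ProblemReport("Press 4", Symptom.ALARM_ON_HMI),
    ProblemReport("Filler 2", Symptom.WILL_NOT_START),
]
reports[0].close(Cause.SENSOR)
reports[1].close(Cause.SENSOR)
confirmed = cause_counts(reports)
```

With the two fields kept apart, the email to maintenance carries the symptom, and the statistics management looks at are built from confirmed causes rather than from the operator's guess.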

I am trying to add to this great thread and not turn it into a “my experience” type thread, but several times in my career I have experienced what everybody else has added so far, and a few things that have not been mentioned. Some I have experienced have been touched on by IamJon. So here is my 2 cents.

Having a supervisor losing his temper at 3 am and constantly asking “how long will it take to fix” does not help with your train of thought. This can be solved by standing up in front of everybody and politely telling him, “When I find the fault, you will be the first to know how long it will take to fix. Until then get the F*** out of my face and let me do my job.”

Along the same lines, we (the mechanic and I) had a problem with an old injection molder (it had a breadboard-wired type PLC). After a solid 6 hours of troubleshooting we had narrowed it down to a pressure switch. At that point both our managers showed up, asking and suggesting this and that. 8 hours later and into the next day, after re-checking everything, I walked away from the managers and got the mechanic to try the machine. At the right time I shorted out the pressure switch, and the machine ran. Come to find out, an operator had adjusted the head pressure valve and didn’t inform anyone. Bottom line: managers can cause more problems than they solve, so if you are sure of yourself, stand up for yourself. This machine also soon got an upgrade to a new PLC.

Adding to “ask the operator”: I have found this to work 2 ways. If you know the operator and he sounds like he knows the machine, then he can be a source of valuable information. On the other hand, some operators don’t care about the machine or know it, even after 10 years of running it. When an LVDT has a bad connection causing fluctuation in the PLC values, fixing the light bulb that has been out for 3 months (and that nobody mentioned until the machine went down) is not high on my list.

As for the “it fixed itself” problems, I have always been honest with supervisors and management by telling them: it’s fixed, but I haven’t done anything, so it may come back. I have found that if they know you are just as frustrated as they are at not finding anything, they can be more understanding and co-operative if it returns. This I guess is OK when you are on site every day in the same plant, but I can understand it being harder if you are an on-call tech and the company has to keep paying each time you come out. That is where “document everything” comes in, along with seeing whether you can instruct or educate the operators to gather extra information if it happens again.

And finally, EDUCATION. I practically had to take a hammer and chisel to some operators to teach them that PLC program logic does not re-program itself. I have lost track of how many times I was told a machine did something because the program was wrong, when the machine had run for 5 years with no changes to the program. Then it turns out to be “oh look, this sensor is bad”, or “why is this setting at this?”, when that setting is not even in the program.

Great thread Sleepy, hope some more contribute.

ssomers, IamJon, Alaric and forqnc, thank you all for your feedback, much appreciated. I am a bit busy right now with work and running a local sports club in my spare time, but I will endeavour to add your valuable contributions as well.

Thanks,
Sleepy
