Thursday, March 17, 2016

Troubleshooting a Linux/Solaris core file

Troubleshooting a crash that occurred at site
Now the customer reports that the binary crashed at site and is yelling that he lost about x% of revenue because of the crash.
Oops.. and it crashed on a weekend too.
Don't worry. Nothing can be done at the time of the crash, so just sit back and relax.
Ask the site for the following:
1) Core file
2) Last log
3) pstack output
4) pflags output
5) Version or label of the binary that was running

Items 3 and 4 are available only if the OS is Solaris.
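
The checklist above can be turned into a small script that the site runs once after the crash. This is only a sketch: the function name collect_crash_artifacts, the output file names and the way the version label is passed in are all assumptions, not part of any standard tool. pstack and pflags are run only when the OS reports itself as SunOS.

```shell
#!/bin/sh
# Sketch: gather the items from the checklist into one directory.
# All names below are hypothetical; adapt them to your own setup.
collect_crash_artifacts() {
    core_file=$1       # 1) core file written by the crashed process
    log_file=$2        # 2) last application log
    version_label=$3   # 5) version or label of the running binary
    out_dir=$4         # directory that receives the collected files (no trailing slash)

    mkdir -p "$out_dir" || return 1
    cp "$core_file" "$log_file" "$out_dir"/ || return 1
    printf '%s\n' "$version_label" > "$out_dir/version.txt"

    # 3) and 4): pstack and pflags are Solaris tools and accept a core file.
    if [ "$(uname -s)" = "SunOS" ]; then
        pstack "$core_file" > "$out_dir/pstack.txt" 2>&1
        pflags "$core_file" > "$out_dir/pflags.txt" 2>&1
    fi

    # Pack everything into one archive that is easy to send to the lab.
    tar -cf "$out_dir.tar" -C "$(dirname "$out_dir")" "$(basename "$out_dir")"
}
```

A call would look like: collect_crash_artifacts ./core ./app.log "label_from_site" ./crash_bundle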

Now build the same version of the binary in your local lab and run gdb to pinpoint where the crash occurred.

$ cd /home/test/binary
$ gdb binary_name core

This shows the stack frame where the crash happened.
For more detail, enter the backtrace command, which displays the complete stack.
Enter the "list" command to display the 10 lines around the crash location.
Check which parameters are passed to the system call and why one of them is NULL.
Review the code and check in which cases that parameter can get a NULL value.
Then reply to the customer that the issue occurs because XXX is getting NULL for YYY data, and ask for some more time to analyse it further.
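
The same gdb steps can also be run without typing them by hand, which is handy when the output has to be attached to a ticket. The sketch below assumes gdb is installed in the lab. The function names, the command file and the report file are hypothetical. "frame 0", "info args" and "info locals" are additions to the commands mentioned above, used to select the crashing frame and print its parameters and local variables.

```shell
#!/bin/sh
# Sketch: write the gdb commands to a file, then run gdb in batch mode.
write_gdb_commands() {
    cmd_file=$1   # file that will hold the gdb commands
    cat > "$cmd_file" <<'EOF'
backtrace
frame 0
list
info args
info locals
EOF
}

analyse_core() {
    binary=$1      # binary rebuilt in the lab, same version as at site
    core_file=$2   # core file received from site
    report=$3      # text file that receives the gdb output

    write_gdb_commands "$report.gdb" || return 1
    # -batch makes gdb exit after the commands; -x reads them from the file.
    gdb -batch -x "$report.gdb" "$binary" "$core_file" > "$report" 2>&1
}
```

For the example above the call would be: analyse_core binary_name core crash_report.txt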
The above gdb commands work only if you are lucky. Most of the time, the stack for a site-reported issue shows only ???????? :-(.

If the stack contains only ?????, you cannot do anything with it. But if the crash keeps repeating, you can give a debug build to the site and have them run it. Let the binary crash and write all its logs to a file. Those logs will help with troubleshooting. If it is a one-off case and the customer is very aggressive about the crash, take the help of your senior managers and let them handle it.
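
Running the debug build at site can be wrapped in a few lines of shell so that nothing is lost when it crashes. This is a rough sketch under some assumptions: the function name run_debug_build and the log file are hypothetical, the binary is assumed to write its messages to stdout/stderr, and the shell is assumed to support "ulimit -c" (bash, ksh and dash do).

```shell
#!/bin/sh
# Sketch: run the debug build with core dumps enabled and keep all output.
# The body runs in a subshell so the ulimit change does not leak out.
run_debug_build() (
    debug_binary=$1   # debug build handed over to site
    log_file=$2       # file that collects everything the binary prints
    shift 2           # remaining arguments are passed to the binary

    # Ask for core files of any size; warn if the limit cannot be raised.
    ulimit -c unlimited 2>/dev/null ||
        echo "warning: could not raise the core size limit" >&2

    "$debug_binary" "$@" >> "$log_file" 2>&1
    status=$?

    # A status above 128 normally means the process was killed by a signal.
    echo "$(date) exit status: $status" >> "$log_file"
    exit "$status"
)
```

Once it crashes again, ask the site for the new core file together with the log file written here.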
