Draft Abend Hunting Wiki page
This page quotes instructions for using the NetWare support forums to analyze an abend. Post to novell.support.netware.5x.abend-hangs or novell.support.netware.6x.abend-hangs, as appropriate.
Describe the Problem
A quick summary of what happened to your system helps set the stage. "My system abended! Help!" doesn't give the volunteers as much to go on as does "My system was running backups using Brand X last night, and we found it halted by abends when we came in the next morning".
To diagnose an abend, the volunteers usually need the complete text of abend 0 or abend 1, including the modules list and any stack dumps or hex dumps. The abend log is normally found at sys:system/abend.log. If it is not there, look in c:\nwserver\abend.log.
If there is no abend 0 or 1 in your log, you have a better chance of recording the abend if you set the recovery option "set auto restart after abend = 0".
This causes the server to halt when it abends and ask you how to continue. In this case, reply "X", which writes the abend.log file and exits.
Multiple abends mean that your server had an abend, and in trying to handle that abend it ran into another abend condition. No abend log will be available, but the top module entries listed on the screen may be useful. Sometimes multiple abends can be reduced to the original abend by setting your recovery options to "set auto restart after abend = 0".
Here is an example screen of a multiple abend event; this one is a variation on the NW65SP6 classic Requestr.nlm issue.
Note that if your server has a feature such as HP ASR that automatically reboots it upon a hang, you must disable that feature, or you will not get the chance to write your abend.log file in the way described above.
HP-ASR uses a setting in the BIOS: AutoServerRestart.
(Tip of the hat to Marcel)
Sometimes the module listing is not available or additional information is needed. When asked, please download fconfig16.exe from the file finder at support.novell.com. Extract Config.nlm from it and copy it to SYS:SYSTEM.
At the console, run LOAD CONFIG /jumba1se and wait for the message that the output file CONFIG.TXT has been created (on NW 6.x this message appears only on the Logger screen). Please post that file in your forum thread, with any sensitive information -- e.g. serial number, public IP addresses, SNMP community strings, remote access passwords, etc. -- edited out. Thank you.
(Tip of the hat to Andrew)
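As a rough aid for the editing step above, the public IP addresses in CONFIG.TXT could be masked with a short script before posting. This is only a sketch: the function name redact_public_ips and the placeholder x.x.x.x are made up for illustration, it handles IPv4 only, and it knows nothing about serial numbers, SNMP community strings, or passwords, which still have to be edited out by hand. A four-part version number that happens to look like an address would also be masked.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; the ipaddress module does the
# real validation, so "999.1.1.1" is matched here but left alone below.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_public_ips(text: str, placeholder: str = "x.x.x.x") -> str:
    """Mask public IPv4 addresses; private ranges are left readable.

    Hypothetical helper, not part of any Novell tool.
    """

    def _replace(match: re.Match) -> str:
        candidate = match.group(0)
        try:
            address = ipaddress.IPv4Address(candidate)
        except ipaddress.AddressValueError:
            return candidate  # not a valid address, keep the text as it was
        # Private addresses (10.x, 192.168.x, ...) reveal little and are
        # often useful to the volunteers, so only public ones are masked.
        return candidate if address.is_private else placeholder

    return IPV4_CANDIDATE.sub(_replace, text)
```

Read CONFIG.TXT into a string, pass it through redact_public_ips, and check the result by eye before posting it.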
CPU Hog Abends
CPU Hog abends can be a bit tricky. The thread shown in the abend log might be the thread that is misbehaving, or it might be an innocent bystander that the Hog Wrangler code happened to grab. If the thread is abending in a spinlock routine (such as LOADER.NLM|kSpinLock) or a mutex routine (such as SERVER.NLM|kMutexLock), then it is most likely the guilty routine, and you can address the issue by looking for updates or asking for advice on settings that may avoid the problem.
If that is not the case, then it gets harder. A first thing to try is disabling hyperthreading features (usually a BIOS option, but the console command "STOP PROCESSORS" or the NRM equivalent can be used if the server is already booted). Hyperthreading can be a cause of CPU Hog abends because the spinlock routines don't work well on a virtual CPU.
If hyperthreading is already disabled on the server in question, you'll have to do some detective work -- either examining the other screens at the time of the abend to see what was going on, or using Monitor/NRM to watch for the hog.
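The first check in this section, whether the abending thread sits in a spinlock or mutex routine, can be sketched as a small helper. The function name hog_thread_likely_guilty is made up for illustration, and the only routine names it knows are the two examples quoted above, so treat it as a starting point rather than a complete list.

```python
# Routine names taken from the examples in the text; real abend logs may
# show other lock routines that are not listed here.
LOCK_ROUTINES = ("kSpinLock", "kMutexLock")


def hog_thread_likely_guilty(abend_location: str) -> bool:
    """Return True if the location names a spinlock or mutex routine.

    Expects a string shaped like MODULE.NLM|Routine. If there is no "|",
    the whole string is searched instead.
    """
    routine = abend_location.rpartition("|")[2].lower()
    return any(name.lower() in routine for name in LOCK_ROUTINES)
```

A True result points at the thread in the log; a False result means the thread may be a bystander, and the hyperthreading and detective-work steps above apply.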
An NMI almost always indicates a hardware problem. The usual cause is either a memory fault (parity error or non-correctable ECC error), or a problem on the PCI bus -- often an adapter detects a bus error, and that turns into the NMI.
These sorts of issues are hard to catch with most readily available diagnostics, and several people have reported in the forums that replacing memory fixed *their* problem even though no diagnostic turned anything up.
If you've got lots of add-in adapters, try removing one to see whether the problem is just bus loading or bus contention.
Occasionally, the motherboard itself will be faulty and will produce NMIs under certain activity patterns.
Tip for Hangs
The following advice is offered when the server has not abended, but appears to be hung:
Next time it happens, try to get into the debugger with shift-shift-alt-esc; this often works even when no other keys do. If nothing happens, you almost certainly have a hardware fault.
If you get to a "#" prompt then you are in the debugger, and the cause of the hang could be software or hardware. Type
- ? <enter>
to see what module the hang happened in. If it always hangs in the same module, the problem is likely to be with that module; if it is a different module each time, a hardware fault is more likely. Type
- Q <enter>
to exit the debugger to DOS.
(Tip of the hat to Andrew)
Note: This technique may find the system in certain house-keeping activities, which may give an appearance of "always different modules".
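If you note down the module reported by the debugger each time the server hangs, the same-module versus different-modules comparison above can be sketched as a simple tally. The function name summarize_hangs and its return strings are invented for illustration, and in light of the note about house-keeping activity the result is a hint, not a diagnosis.

```python
from collections import Counter


def summarize_hangs(modules: list[str]) -> str:
    """Compare the modules recorded across several hangs.

    `modules` holds one module name per hang, as written down by hand.
    """
    if len(modules) < 2:
        return "need at least two hangs to compare"
    # Normalize case and stray spaces so NSS.NLM and nss.nlm count together.
    counts = Counter(name.strip().upper() for name in modules)
    if len(counts) == 1:
        module = next(iter(counts))
        return f"always {module}: suspect that module"
    tally = ", ".join(f"{name} x{n}" for name, n in counts.most_common())
    return f"different modules ({tally}): hardware is more likely"
```

The module names passed in are whatever you recorded at the "#" prompt; the sketch does not read anything from the server itself.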
High Utilization is not an abend, but it can be just as distressing. The basic approach is to use Monitor.nlm or NRM to identify the busiest threads, and then work out how to deal with them from there. See also High Utilization.
Room for Corrections
Looking forward to the sysops' notes on this page.