Draft Abend Hunting Wiki page
This page quotes instructions for using the NetWare support forums to analyze an abend. Post to novell.support.netware.5x.abend-hangs or novell.support.netware.6x.abend-hangs, as appropriate.
Describe the Problem
A quick summary of what happened to your system helps set the stage. "My system abended! Help!" doesn't give the volunteers as much to go on as does "My system was running backups using Brand X last night, and we found it halted by abends when we came in the next morning".
To diagnose an abend, the volunteers usually need the complete text of abend 0 or abend 1, including the modules list and any stack dumps or hex dumps. The abend log is normally found at sys:system/abend.log. If it is not there, look in c:\nwserver\abend.log.
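When a log holds several records, a small script can pull out the one the volunteers ask for. The Python below is only a sketch, not a Novell tool. It assumes each record's header line has the form "Abend 1 on P00: ..." (as in the auditing example quoted later on this page) and that a record runs until the next such header. The function name is made up for illustration, and any lines the log prints before the header line are not captured.

```python
import re

# Assumed header form, e.g. "Abend 1 on P00: Server-5.70.06-4348: ..."
ABEND_HEADER = re.compile(r"^Abend (\d+) on P\d+:", re.MULTILINE)

def first_abend_record(log_text):
    """Return the text of the first record numbered 0 or 1, or None."""
    headers = list(ABEND_HEADER.finditer(log_text))
    for i, match in enumerate(headers):
        if int(match.group(1)) in (0, 1):
            # The record ends where the next header starts (or at end of log).
            end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
            return log_text[match.start():end].rstrip()
    return None
```

Read the copied abend.log into a string and pass it in; a result of None means the log has no abend 0 or 1.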
If there is no abend 0 or 1 in your log, you have a better chance of recording the abend if you set your recovery options with "set auto restart after abend = 0".
This causes the server to halt upon abend and ask you how to continue. In this case, reply "X", which writes the abend.log file and exits.
Multiple abends mean that your server had an abend, and in trying to handle that abend it ran into another abend condition. No abend log will be available, but the top modules listed on the screen may be useful. Sometimes multiple abends can be reduced to the original abend by setting your recovery options to "set auto restart after abend = 0".
Here is an example screen of a multiple abend event; this one is a variation on the NW65SP6 classic Requestr.nlm issue.
Note that if you have the HP ASR feature, or something similar that automatically reboots your server upon a hang, you have to disable it; otherwise the server will reboot before you can capture your abend.log file in the way described above.
HP-ASR uses a setting in the BIOS: AutoServerRestart.
(Tip of the hat to Marcel)
Sometimes the module listing is not available or additional information is needed. When asked, please download fconfig16.exe from the file finder at support.novell.com. Extract Config.nlm from it and copy that to SYS:SYSTEM.
On the console do LOAD CONFIG /jumba1se, and wait until the output file CONFIG.TXT gets created (on NW 6.x this message only appears on the Logger screen). Some sysops show this command as "LOAD CONFIG /sma1bjue" -- the difference is merely a matter of mnemonics, and is not a comment on Cajun food flavors.
Please post that file in your forum thread, with any sensitive information -- e.g. serial number, public IP addresses, SNMP community strings, remote access passwords, etc. -- edited out. Thank you.
(Tip of the hat to Andrew)
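Part of the clean-up of CONFIG.TXT can be scripted. The Python below is a sketch that masks only public IPv4 addresses; serial numbers, community strings and passwords still have to be removed by hand. The function name and the placeholder text are assumptions. Note that a four-part version string can look like an address and get masked too, so check the result before posting.

```python
import ipaddress
import re

# Dotted-quad candidates; four-part version strings may also match.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def mask_public_ips(text, placeholder="x.x.x.x"):
    """Replace public IPv4 addresses in text, leaving private ones alone."""
    def replace(match):
        try:
            addr = ipaddress.ip_address(match.group(0))
        except ValueError:
            return match.group(0)  # not a valid address; leave untouched
        return match.group(0) if addr.is_private else placeholder
    return IPV4_CANDIDATE.sub(replace, text)
```

Private ranges such as 10.x.x.x and 192.168.x.x are left in place because they are often useful to the volunteers.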
CPU Hog Abends
CPU Hog abends can be a bit tricky, because the thread shown in the abend log might be the thread that is misbehaving, or it might be an innocent bystander that was simply grabbed by the Hog Wrangler code. If the thread is abending in a spinlock routine (such as LOADER.NLM|kSpinLock) or a mutex routine (such as SERVER.NLM|kMutexLock), then it is most likely the guilty routine, and you can address the issue by looking for updates or asking for advice on settings that may avoid the problem.
If that is not the case, then it gets harder. A first thing to try is disabling hyperthreading features (usually a BIOS option, but the console command "STOP PROCESSORS" or the NRM equivalent can be used if the server is already booted). Hyperthreading can be a cause of CPU Hog abends because the spinlock routines don't work well on a virtual CPU.
If hyperthreading is already disabled on the server in question, you'll have to do some detective work -- either examining the other screens at the time of the abend to see what was going on, or using Monitor/NRM to watch for the hog.
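The spinlock/mutex check described at the start of this section can be roughed out in a few lines of Python. This is a sketch only: it assumes routines appear in the log in the "MODULE.NLM|routine" form used in the examples above, and it treats any routine name containing "SpinLock" or "Mutex" as a lock routine, which is a guess rather than a documented rule. The function name is hypothetical.

```python
import re

# Matches references such as "LOADER.NLM|kSpinLock" or "SERVER.NLM|kMutexLock".
LOCK_ROUTINE = re.compile(r"\b[\w.]+\|\w*(?:SpinLock|Mutex)\w*", re.IGNORECASE)

def lock_routines_in(abend_text):
    """Return the lock-routine references found in an abend record."""
    return LOCK_ROUTINE.findall(abend_text)
```

An empty result does not clear the thread; it only means this simple pattern found nothing, and the detective work described above still applies.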
An NMI almost always indicates a hardware problem. The usual cause is either a memory fault (parity error or non-correctable ECC error), or a problem on the PCI bus -- often an adapter detects a bus error, and that turns into the NMI.
These sorts of issues are hard to catch with most readily available diagnostics, and several people have reported in the forums that replacing memory fixed *their* problem even though no diagnostic turned anything up.
If you've got lots of add-in adapters, try removing one to see if it is just a bus-loading issue or bus-contention.
Occasionally, the motherboard itself will be faulty and will produce NMIs under certain activity patterns.
Double Fault Abends
Double faults are similar to multiple abends in that an original error occurs but cannot be handled because of some additional error. The difference is that the additional error is detected by hardware before starting or while trying to start the abend handler for the original fault.
Quite frequently, the additional fault is stack overflow -- there was not enough room in the stack segment for the original fault record or other overhead of invoking the handler. In this case, the first step in troubleshooting is to make sure that memory issues have been eliminated. However, this is not always sufficient to prevent double faults, as the special conditions of abend handling may place extra burden on stacks, especially if there is already an interrupt handler active on the same stack.
A corrupted redzone means somebody was writing to memory that didn't belong to them, either because they had a stale or invalid pointer, or because they were overflowing a buffer.
Redzone abends do not necessarily implicate the application which is running at the time. So you can't just look at the stack trace and say "Hey, it was XYZ.NLM because that is what was running at the time."
Each block of memory allocated to an application comes with a leading and trailing "redzone", which is a special signature in memory just before and just after the memory allocated to the application. If the application writes outside its allocated region, the redzone signature is overwritten. The abend comes later, when the memory is returned and the OS inspects the redzones.
However, because of the way buffers can be passed around between modules in the same memory space, it's hard to identify exactly which one caused the problem, and harder still to identify a code path that causes the problem.
NB: When applications do not live in protected memory, virtually any other application (also not in protected memory), including the majority of the NetWare OS and drivers, could conceivably have overwritten the memory. So redzone abends can be caused by totally unrelated misbehaving applications.
(tip of the hat to Robert Charles Mahar)
If you run NetWare Auditing and get an abend like
"Abend 1 on P00: Server-5.70.06-4348: Kernel detected an attempted context switch when it was not allowed"
then you should be checking the load order of anything that uses File System hooks.
The auditing components MUST be loaded last. Otherwise you get this oddly specific abend.
There is a TID on this very issue. And before you say "But Bob, in AUTOEXEC.NCF I have the auditing loaded AFTER the AV software..." this is only true if you never unload/reload these modules.
The modules listing from the ABEND log (Forum Example) shows the actual load order with most recent at the top of the list. Any other FSHooks consumer (PSCAN) must be loaded PRIOR to the auditing, i.e. listed BELOW the auditing pieces in the ABEND log.
Virus scanners pose special problems if they unload/reload themselves to swap out signatures or get updates. The only way to accommodate that is to ensure that the auditing pieces are unloaded prior to the auto update and reloaded after.
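The load-order rule can be checked mechanically once the module names have been copied out of the abend log in the order they appear. The Python below is a sketch under assumptions: the function name is made up, the caller supplies the sets of auditing modules and other FS-hooks consumers in upper case, and "AUDITX.NLM" in the usage note is a placeholder, not a real module name.

```python
def auditing_loaded_last(modules_top_down, audit_modules, hook_modules):
    """True if every auditing module sits above every other FS-hooks consumer.

    modules_top_down: module names in abend-log order (most recent first).
    """
    names = [m.upper() for m in modules_top_down]
    audit_pos = [i for i, m in enumerate(names) if m in audit_modules]
    hook_pos = [i for i, m in enumerate(names) if m in hook_modules]
    if not audit_pos or not hook_pos:
        return True  # nothing to compare
    # Lower index = nearer the top of the list = loaded more recently.
    return max(audit_pos) < min(hook_pos)
```

For example, auditing_loaded_last(["AUDITX.NLM", "PSCAN.NLM"], {"AUDITX.NLM"}, {"PSCAN.NLM"}) reports the order as correct, while the reverse listing does not.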
(tip of the hat to Robert Charles Mahar)
Tip for Hangs
The following advice is offered when the server has not abended, but appears to be hung:
Next time it happens, try to get into the debugger with shift-shift-alt-esc (also known as the four-finger salute) -- this often works even if no other keys do. If nothing happens, you almost certainly have a hardware fault.
If you get to a "#" prompt then you are in the debugger, and the cause of the hang could be software or hardware. Type
- ? <enter>
to see what module the hang happened in. If it always hangs in the same module, the problem is likely to be with that module; if it hangs in a different module each time, a hardware fault is more likely. Type
- Q <enter>
to exit the debugger to DOS.
(Tip of the hat to Andrew)
Note: This technique may find the system in certain house-keeping activities, which may give an appearance of "always different modules".
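If you note down the module reported at each hang, the same-module versus different-module comparison can be tallied with a few lines of Python. This is a sketch with a made-up function name, and given the note above about house-keeping activity, its output is a hint rather than a diagnosis.

```python
from collections import Counter

def summarize_hangs(modules_seen):
    """Summarize which module each recorded hang was found in."""
    counts = Counter(m.upper() for m in modules_seen)
    if not counts:
        return "no data"
    if len(counts) == 1:
        return "same module every time: " + next(iter(counts))
    return "varies across %d modules" % len(counts)
```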
High Utilization is not an Abend, but it can be just as distressing. Basically, the approach is to use Monitor.nlm or NRM to identify the busiest threads, and then figure out how to deal with them from there. See also High Utilization.
Room for Corrections
Looking forward to the sysops' notes on this page.