Draft Abend Hunting Wiki page
This page quotes instructions for using the Netware support forums to analyze an abend. Posts should be to novell.support.netware.5x.abend-hangs or novell.support.netware.6x.abend-hangs as appropriate. The 6x forums are also accessible under the OES forum folders: OES NW Abend Forum. For older systems, see NW5.x Abend Forum; the 4.x forums are nearby for those on an accelerated update path.
Describe the Problem
A quick summary of what happened to your system helps set the stage. "My system abended! Help!" doesn't give the volunteers as much to go on as does "My system was running backups using Brand X last night, and we found it halted by abends when we came in the next morning".
It can also be useful to give at least a quick description of your server configuration. The volunteers often like to know what applications are installed, and sometimes the hardware vendor and model information is important. Is this a file-and-print server in a small office, a high-traffic webserver, a Groupwise postal resource, or a Border Manager guardian? Is the system clustered?
A Note on Console Prompts
Many of us have experienced something like what this user describes:
since recovery of my raid my novell server name now has a <2> after its name; example: PL800 <2>
Timothy Fitch gives the explanation: "Server abended.. Twice... The <x> indicates the number of recovered abends." And as Edward van der Maas notes, if you see this, you should be checking your abend.log and preparing to restart your server (running in a recovered abend state may make later abends more likely, due to resource issues).
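If you collect console captures from several servers, a few lines of script can pull the recovered-abend count out of the prompt. This is only a sketch: the function name is made up, and it assumes the prompt looks like the "PL800 <2>" example quoted above.

```python
import re

def recovered_abend_count(prompt):
    """Return the number inside <...> in a console prompt, or 0 if absent.

    Assumes the prompt format quoted on this page, e.g. "PL800 <2>".
    """
    match = re.search(r"<(\d+)>", prompt)
    return int(match.group(1)) if match else 0

if __name__ == "__main__":
    print(recovered_abend_count("PL800 <2>"))
```

Anything above zero means it is time to read abend.log and plan a restart.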
To diagnose an abend, the volunteers usually need the complete text of abend 0 or abend 1, including the modules list and any stack dumps or hex dumps. The abend log is normally found at sys:system/abend.log. If it is not there, look in c:\nwserver\abend.log.
If there is no abend 0 or 1 in your log, you can improve your chances of recording the abend by setting your recovery options with "set auto restart after abend = 0".
This causes the server to halt upon abend and ask you how to continue. In this case, reply "X", which writes the abend.log file and exits.
Once you've collected the abend.log file, you can use a text editor to trim a copy to just the relevant entries, and paste those lines into your message. Sometimes a very long entry may need to be attached as a file, or as a zip file -- see the forum hints on how to do that if you are using the web interface.
Note: older systems sometimes accumulate enough entries in the logfile to make the file big and awkward even for local examination. You may delete or rename the file (if you have another abend, a new log file will be created). Take a look also at the Cool Solutions abend log file filter utility -- see the link in the references section below.
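For a big log, the trimming step can be roughed out in a script. The sketch below is not the Cool Solutions filter; it is a hypothetical helper that assumes each entry can be recognized by a header line shaped like the sample quoted further down this page ("Abend 1 on P00: Server-5.70.06-4348: ..."). Real logs carry banner lines before that header, and this sketch leaves them attached to the end of the preceding entry, so check the result by eye before posting.

```python
import re

# Assumed header shape, modeled on the sample line quoted on this page.
ABEND_HEADER = re.compile(r"^\s*Abend (\d+) on P\d+:\s*(.*)$")

def split_abend_entries(log_text):
    """Return a list of (abend_number, summary, entry_text) tuples.

    Each entry runs from one "Abend N on Pxx:" header line up to the
    line before the next header (or the end of the text).
    """
    entries = []
    current = None
    for line in log_text.splitlines():
        match = ABEND_HEADER.match(line)
        if match:
            if current is not None:
                entries.append(current)
            current = [int(match.group(1)), match.group(2).strip(), [line]]
        elif current is not None:
            current[2].append(line)
    if current is not None:
        entries.append(current)
    return [(num, summary, "\n".join(lines)) for num, summary, lines in entries]

def first_abends(log_text, wanted=(0, 1)):
    """Keep only the entries whose abend number is in `wanted`."""
    return [entry for entry in split_abend_entries(log_text) if entry[0] in wanted]
```

Read your copy of abend.log into a string, pass it to first_abends(), and paste the entry_text of what comes back. A log that spans several reboots may well contain more than one "Abend 1".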
Multiple abends mean that your server had an abend, and in trying to handle that abend it ran into another abend condition. No abend log will be available, but the top module entries listed on the screen might be useful. Sometimes multiple abends can be reduced to the original abend by setting your recovery options to "set auto restart after abend = 0".
Here is an example screen of a multiple abend event; this one is a variation on the NW65SP6 classic Requestr.nlm issue.
Note that if you have the HP ASR feature, or something similar that automatically reboots your server upon a hang, you have to disable it or you will not get a chance to collect your abend.log file in the way described above.
HP-ASR uses a setting in the BIOS: AutoServerRestart.
(Tip of the hat to Marcel)
(And Massimo's variation, with another tip o' the hat) Another option is to set ASR Timeout to its largest value (30 minutes, for example), which should give plenty of time for an operator to select the 'X' option on the abend screen, and still bring the system back up if no operator is present. This is a tradeoff between system availability and the need for information for diagnostic purposes.
These options might appear in various menus depending on system and firmware revisions. For instance, on an HP DL380 G4: select "setup menus" with F10, select "server availability", and edit the ASR Status and ASR Timeout fields (enabled/disabled for the status, and choices between 10 minutes and 30 minutes for the timeout).
Sometimes the module listing is not available or additional information is needed. When asked, please download fconfig17.zip from the file finder at support.novell.com. Extract Config.nlm from it and copy that to SYS:SYSTEM.
On the console do LOAD CONFIG /jumba1se, and wait until the output file CONFIG.TXT gets created (on NW 6.x this message only appears on the Logger screen). Some sysops show this command as "LOAD CONFIG /sma1bjue" -- the difference is merely a matter of mnemonics, and is not a comment on Cajun food flavors.
Please post that file in your forum thread, with any sensitive information (e.g. serial number, public IP addresses, snmp community strings, remote access passwords) edited out. Thank you.
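A first pass at the editing-out can be scripted, as long as you still read the result before posting. The sketch below is an assumption-laden helper, not part of Config.nlm: the keyword list is hypothetical and should be adjusted to whatever labels your CONFIG.TXT actually uses. It blanks every IPv4-looking string, private addresses included, so it over-redacts rather than under-redacts.

```python
import re

# Matches anything shaped like a dotted quad; four-part version numbers
# would be caught too.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical labels; adjust to match what your CONFIG.TXT contains.
SENSITIVE_KEYWORDS = ("serial number", "community", "password")

def redact_line(line):
    """Blank the rest of a line after a sensitive label, else mask IPs."""
    lowered = line.lower()
    for keyword in SENSITIVE_KEYWORDS:
        if keyword in lowered:
            end = lowered.index(keyword) + len(keyword)
            return line[:end] + " [REDACTED]"
    return IPV4.sub("[IP REDACTED]", line)

def redact_text(text):
    """Apply redact_line to every line of a CONFIG.TXT-style text."""
    return "\n".join(redact_line(line) for line in text.splitlines())
```

Run redact_text() over a copy of the file, never the original, and search the output yourself for anything the keyword list missed.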
(Tip of the hat to Andrew)
CPU Hog Abends
CPU Hog abends can be a bit tricky. The thread shown in the abend log might be the thread that is misbehaving, or it might just be a thread grabbed by the Hog Wrangler code, and so an innocent bystander. If the thread is abending in a spinlock routine (such as LOADER.NLM|kSpinLock) or a mutex routine (such as SERVER.NLM|kMutexLock), then it is most likely the guilty routine, and you can address the issue by looking for updates or asking for advice on settings that may avoid the problem.
If that is not the case, then it gets harder. A first thing to try is disabling hyperthreading features (usually a BIOS option, but the console command "STOP PROCESSORS" or the NRM equivalent can be used if the server is already booted). Hyperthreading can be a cause of CPU Hog Abends because the spinlock routines don't work well on a virtual cpu.
If hyperthreading is already disabled on the server in question, you'll have to do some detective work -- either examining the other screens at the time of the abend to see what was going on, or using Monitor/NRM to watch for the hog.
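The spinlock/mutex check described at the top of this section is easy to apply mechanically to a saved abend entry. The sketch below is only a first-pass filter under that one rule of thumb; the function name is made up, and the two routine names are simply the examples given above, not a complete list.

```python
# Lock routines named on this page as signs that the abending thread
# is probably the real hog. Not an exhaustive list.
LOCK_ROUTINES = ("LOADER.NLM|kSpinLock", "SERVER.NLM|kMutexLock")

def hog_thread_verdict(entry_text):
    """Classify one CPU Hog abend entry using the lock-routine rule of thumb."""
    upper = entry_text.upper()
    for routine in LOCK_ROUTINES:
        if routine.upper() in upper:
            return "likely guilty: abended in " + routine
    return "inconclusive: thread may be an innocent bystander"
```

An "inconclusive" verdict is where the hyperthreading check and the detective work come in.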
An NMI almost always indicates a hardware problem. The usual cause is either a memory fault (parity error or non-correctable ECC error), or a problem on the PCI bus -- often an adapter detects a bus error, and that turns into the NMI.
These sorts of issues are hard to catch with most readily available diagnostics, and several people have reported in the forums that replacing memory fixed *their* problem even though no diagnostic turned anything up.
If you've got lots of add-in adapters, try removing one to see if it is just a bus-loading issue or bus-contention.
Occasionally, the motherboard itself will be faulty and will produce NMIs under certain activity patterns.
Double Fault Abends
Double faults are similar to multiple abends in that an original error occurs but cannot be handled because of some additional error. The difference is that the additional error is detected by hardware before starting or while trying to start the abend handler for the original fault.
Quite frequently, the additional fault is stack overflow -- there was not enough room in the stack segment for the original fault record or other overhead of invoking the handler. In this case, the first step in troubleshooting is to make sure that memory issues have been eliminated. However, this is not always sufficient to prevent double faults, as the special conditions of abend handling may place extra burden on stacks, especially if there is already an interrupt handler active on the same stack.
A corrupted redzone means somebody was writing to memory that didn't belong to them. This can happen either because they had a stale or invalid pointer, or because they were overflowing a buffer.
Redzone abends do not necessarily implicate the application which is running at the time. So you can't just look at the stack trace and say "Hey, it was XYZ.NLM because that is what was running at the time."
Each block of memory allocated to an application comes with a leading and trailing "redzone", a special signature in memory just before and just after the block. If the application writes outside its block, the redzone signature gets overwritten. The abend comes later, when the memory is returned and the OS inspects the redzones.
However, because of the way buffers can be passed around between modules in the same memory space, it's hard to identify exactly which one caused the problem, and certainly harder to identify a code path that causes the problem.
NB: When applications do not live in protected memory, virtually any other application (also not in protected memory) including the majority of the Netware OS and drivers, could conceivably have overwritten the memory. So redzone abends can be caused by totally unrelated misbehaving applications.
(tip of the hat to Robert Charles Mahar)
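A toy model may make the redzone mechanism easier to picture. This is not how the Netware allocator is implemented; the signature bytes, class name, and layout are all invented for illustration. It shows the two points made above: the damage is only noticed when a block is freed, and the block that trips the check need not belong to the culprit.

```python
REDZONE = b"\xde\xad\xbe\xef"   # made-up signature, for illustration only

class ToyHeap:
    """Toy allocator: every block is wrapped in leading/trailing redzones."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.next_free = 0
        self.blocks = {}          # offset of user data -> user size

    def alloc(self, size):
        rz = len(REDZONE)
        start = self.next_free
        if start + size + 2 * rz > len(self.memory):
            raise MemoryError("toy heap exhausted")
        user = start + rz
        self.memory[start:user] = REDZONE                    # leading redzone
        self.memory[user + size:user + size + rz] = REDZONE  # trailing redzone
        self.blocks[user] = size
        self.next_free = user + size + rz
        return user

    def free(self, user):
        """The redzones are inspected only now, when the block is returned."""
        size = self.blocks.pop(user)
        rz = len(REDZONE)
        lead_ok = self.memory[user - rz:user] == REDZONE
        trail_ok = self.memory[user + size:user + size + rz] == REDZONE
        if not (lead_ok and trail_ok):
            raise RuntimeError("redzone corrupted for block at %d" % user)

heap = ToyHeap(64)
a = heap.alloc(8)     # owned by "module A"
b = heap.alloc(8)     # owned by "module B"

# "Module A" overruns its 8-byte buffer by 6 bytes, trampling its own
# trailing redzone and the first 2 bytes of B's leading redzone.
heap.memory[a:a + 14] = b"X" * 14

try:
    heap.free(b)      # B did nothing wrong, yet B's free trips the check
except RuntimeError as err:
    print(err)
```

In the toy, "module B" is what is running when the corruption is noticed, which is exactly why the stack trace alone does not name the guilty party.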
If you run Netware Auditing and get an abend like
"Abend 1 on P00: Server-5.70.06-4348: Kernel detected an attempted context switch when it was not allowed"
then you should be checking the load order of anything that uses File System hooks.
The auditing components MUST be loaded last. Otherwise you get this oddly specific abend.
There is a TID on this very issue. And before you say "But Bob, in AUTOEXEC.NCF I have the auditing loaded AFTER the AV software..." this is only true if you never unload/reload these modules.
The modules listing from the ABEND log (Forum Example) shows the actual load order with most recent at the top of the list. Any other FSHooks consumer (PSCAN) must be loaded PRIOR to the auditing, i.e. listed BELOW the auditing pieces in the ABEND log.
Virus scanners pose special problems if they unload/reload themselves to swap out signatures or get updates. The only way to accommodate that is to ensure that the auditing pieces are unloaded prior to the auto update and reloaded after.
(tip of the hat to Robert Charles Mahar)
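The load-order rule can be checked against a modules listing with a short script. In this sketch the function name and the auditing module name AUDIT_A.NLM are hypothetical placeholders (substitute the real auditing modules on your server); PSCAN.NLM stands in for the FSHooks consumer mentioned above. The list is assumed to be in abend-log order, most recently loaded first.

```python
def hook_users_loaded_after_auditing(modules_top_first, auditing, hook_users):
    """Return the FS hook users that were loaded after an auditing module.

    modules_top_first: module names in abend-log order (most recent first).
    auditing, hook_users: sets of upper-case module names.
    An empty result means the load order looks fine.
    """
    names = [name.upper() for name in modules_top_first]
    audit_positions = [i for i, name in enumerate(names) if name in auditing]
    if not audit_positions:
        return []
    # The auditing module lowest in the list was loaded earliest; any hook
    # user listed above it was loaded after it, which breaks the rule.
    lowest_audit = max(audit_positions)
    return [name for i, name in enumerate(names)
            if name in hook_users and i < lowest_audit]

if __name__ == "__main__":
    listing = ["PSCAN.NLM", "AUDIT_A.NLM", "SERVER.NLM"]   # made-up listing
    print(hook_users_loaded_after_auditing(listing, {"AUDIT_A.NLM"}, {"PSCAN.NLM"}))
```

A non-empty result on a listing taken after a scanner auto-update is the reload situation described above.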
Tip for Hangs
The following advice is offered when the server has not abended, but appears to be hung:
Hangs can be caused either by a very uncooperative process tying up the processor in a way that escapes the CPU hog timer, or by a hardware problem. The first check after finding you can't get a response at the console is to try to enter the (built-in) system debugger. If you're not familiar with how to do that, it involves the "4-finger salute" -- the 2 SHIFT keys, the ESC key, and an ALT key are all used together ("depressed at the same time", which sounds pretty gloomy). This is usually done with 2 hands -- left hand for left SHIFT and ESC, right hand for right ALT and right SHIFT.
If you get a debugger prompt, you can then begin to try to see what was going on and who might be unresponsive.
The most important debugger command at this point is the "v" command, which cycles you through the server screens so you can examine them for clues.
The next most important is the "?" command, which tells you what code you are breaking into. That may or may not be the guilty code. Tip of the hat to Andrew: If it always hangs in the same module, it is likely to be a problem with that module; if it is always different modules, it is likely to be a hardware fault.
Note: This technique may find the system in certain house-keeping activities, which may give an appearance of "always different modules".
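If you keep a note of what the "?" command reports at each hang, Andrew's rule of thumb can be tallied with a few lines. This is only a sketch: the function name, the 80% threshold, and the three-sample minimum are arbitrary assumptions, and house-keeping activity can skew the tally as noted above.

```python
from collections import Counter

def hang_pattern(modules_seen, threshold=0.8):
    """Summarize the modules reported by the debugger across several hangs.

    modules_seen: one module name per hang, as noted from the "?" command.
    threshold: assumed share of hangs that counts as "always the same".
    """
    if len(modules_seen) < 3:
        return "too few samples"
    name, count = Counter(m.upper() for m in modules_seen).most_common(1)[0]
    if count / len(modules_seen) >= threshold:
        return "suspect module: " + name
    return "no repeat offender: consider hardware"
```

Treat the answer as a hint for where to look next, not a diagnosis.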
The "q" command will exit to DRDOS so that you can restart the system. If you decide to wait out the hang a little while longer, the "g" command resumes Netware.
Another note: If you are maintaining a cranky system at a remote location, you may not be able to have someone at the physical server keyboard. Setting up that system with rdbhost.nlm (and rdb.exe on your workstation) may allow you to try the above tip; make sure you set up your rdb configuration file properly to only allow the proper systems access to the feature.
And another: It is possible to examine the list of processes from the debugger, but it's a tedious job, and the ranking is not by "busiest" as it would be in Monitor.nlm or NRM. You probably Do Not Want To Go There.
See also the following section for the pointer to the page for "High Utilization".
High Utilization is not an Abend, but it can be just as distressing. Basically, the approach is to use Monitor.nlm or NRM to identify the busiest threads, and then figure out how to deal with them from there. See also High Utilization.
This TID is slightly off-topic, but may be useful if you are having trouble booting your system after an abend.
Room for Corrections
Looking forward to the sysops' notes on this page.