System problems fall into several categories. The first category is difficult
to describe and even more difficult to track down. For lack of a better word, I
am going to use the word "glitch." Glitches are problems that occur infrequently
and under circumstances that are not easily repeated. They can be caused by
anything from users with fat fingers to power fluctuations that change the
contents of memory.
Next are special circumstances in software that the CPU detects
while it is executing an instruction. I discussed these briefly in the section on
kernel internals. These problems are traps, faults, and exceptions,
including such things as page faults. Many of these events are normal parts of
system operation and are therefore expected. Other events, like following an
invalid pointer, are unexpected and will usually cause the process to terminate.
What if the kernel itself causes a trap, fault,
or exception? As I mentioned in the section on kernel internals,
there are only a few cases when the kernel
is allowed to do this. If this is not one of those cases, the situation is
deemed so serious that the kernel
must stop the system immediately to prevent any further damage.
This is a panic.
When the system panics, using its last dying breath, the kernel runs a
special routine that prints the contents of the internal registers onto the
console. Despite the way it sounds, if your system is going to go down, this is
the best way to do it. The rationale behind that statement is that when the
system panics in this manner, at least there is a record of what happened.
If the power goes out on the system, it is not really a system problem, in
the sense that it was caused by an outside influence, similar to someone
pulling the plug or flipping the circuit breaker (which my father-in-law did to
me once). Although this kind of problem can be remedied with a
UPS, the first time the system goes down before the UPS is
installed can make you question the stability of your system. There is no record
of what happened and, unless you know the cause was a power outage, it could have
been anything.
Another annoying situation is when the
system just "hangs." That is, it stops completely and does not react to any
input. This could be the result of a bad hard disk controller, bad
RAM, or an improperly written or corrupt
device driver. Because there is no record of what was happening, trying to
figure out what went wrong is extremely difficult, especially if this happens
only intermittently.
Because a system panic
is really the only time you can easily track down the problem, I will start
there. The first thing to think about is that as the system goes down, it does
two things: writes the registers to the console screen and writes a memory image
to the dump device. The fact that it does this as it's
dying makes me think that this is something important, which it is.
The first thing to look at is
the instruction pointer. This is actually composed of two registers: the CS
(code segment) and EIP (instruction pointer) registers. Together they point to
the instruction that the kernel was executing at the time
of the panic.
By comparing the EIP
of several different panics, you can make some assumptions about
the problem. For example, if the EIP
is consistent across several different panics, this usually indicates
a software problem, because the system was
executing the same piece of code every time it panicked.
On the other hand, if the EIP
consistently changes, then probably no one piece of code
is the problem and a hardware problem is more likely. This could be bad
RAM or something else. Keep
in mind, however, that a hardware problem could cause repeated EIP
values, so this is not a hard and fast rule.
The problem with this approach is that the kernel
is generally loaded the same
way all the time. That is, unless you change something, it will occupy the same
area of memory. Therefore, it's possible that bad RAM
makes it look as though there is a bad driver. The way to
verify this is to change where the kernel
is physically loaded. You can do this by rearranging the
order of your memory chips.
Keep in mind that this technique may not tell you which
SIMM is bad, but only indicate that you may have a bad SIMM.
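To make the EIP comparison described above a little more concrete, here is a
minimal sketch in Python. It assumes you have copied the EIP values from
several panic screens by hand. The function name classify_panics and the 60
percent threshold are my own inventions for illustration, not anything the
kernel provides, and the answer is only a first hint, because bad hardware can
also produce repeated EIP values.

```python
from collections import Counter


def classify_panics(eips, threshold=0.6):
    """Guess whether repeated panics point at software or hardware.

    eips: EIP values copied by hand from the panic register dumps.
    threshold: hypothetical cut-off, the share of panics that must hit
    the same address before the pattern counts as "consistent".
    """
    if len(eips) < 2:
        return "not enough panics to compare"
    # Find the address that shows up most often and how often it does.
    address, hits = Counter(eips).most_common(1)[0]
    if hits / len(eips) >= threshold:
        return f"mostly the same EIP ({address:#010x}): suspect software first"
    return "EIP keeps changing: suspect hardware first"


# Made-up values purely for illustration.
print(classify_panics([0xC0123456, 0xC0123456, 0xC0123456, 0xC0200000]))
print(classify_panics([0xC0101000, 0xC0234ABC, 0xC0188F10, 0xC02FF004]))
```

If the first case still shows up after you have rearranged or swapped the
memory, the software explanation becomes a lot more convincing.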
The only sure-fire test is to swap out
the memory. If the problem goes away with new RAM
and returns with the old RAM, you have a bad SIMM.
Getting to the Heart of the Problem
Okay, so we know what types of problems
can occur. How do we correct them? If you have a contract with a consultant,
this might be part of that contract. Take a look at it and read it. Sometimes
the consultant is not even aware of what is in his or her own contract. I have
talked to customers who have had consultants charge them for maintenance or
repair of hardware, insisting that it was an extra service. However, the
customer could whip out the contract and show the contractor that these services
were included. If you
are not fortunate enough to have such an expensive support contract, you will obviously have to
do the detective work yourself. If the printer catches fire, it is pretty
obvious where the problem is. However, if the printer just stops working,
figuring out what is wrong is often difficult. Well, I like to think of problem
solving the way Sherlock Holmes described it in The Seven-Per-Cent
Solution (and other places, such as The Sign of the Four):
"Eliminate the impossible and whatever is left over, no matter how
improbable, must be the truth."
Although this sounds like a basic enough statement, it is
often difficult to know where to begin to eliminate things. In simple cases, you
can begin by eliminating almost everything. For example, suppose your system
was hanging every time you used the tape drive. It would be safe at this point
to eliminate everything but the tape drive. So, the next big question is whether
it is a hardware problem or not.
Potentially, the portion of the
kernel containing the tape driver is corrupt. In this case, simply rebuilding
the kernel may be enough
to correct the problem, provided that when you relink, you
link in a new copy of the driver. If that is not
sufficient, then restoring the driver from the distribution media is the next
step. However, depending on your access to the media, checking the hardware
might be easier.
If this tape drive requires its own controller and you have access to another
controller or tape drive, you can swap components to see whether the behavior
changes. However, just as you don't want to install multiple pieces of hardware
at the same time, you don't want to swap multiple pieces. If you do and the
problem goes away, how do you know whether it was the controller or the tape
drive? If you swap out the tape drive and the problem goes away, that would
indicate that the problem was in the tape drive. However, does the first
controller work with a different tape drive? You may have two problems at once.
If you don't have access to other equipment
that you can swap, there is little that you can do other than verify that it is
not a software problem. I have had at least one case while in tech support in
which a customer called in, insisting that our driver was broken because he
couldn't access the tape drive. Because the tape drive worked under
DOS and the tape drive was listed as supported, either the
documentation was wrong or something else was. Relinking the
kernel and replacing the driver had no effect. We checked
the hardware settings to make sure there were no conflicts, but everything
looked fine. It turned out that we
had been testing it using tar the whole time because tar is quick and easy when
you are trying to do tests. When we ran a quick test using cpio, the tape drive
worked like a champ. When we tried outputting tar to a file, it failed. Once we
replaced the tar binary, everything worked correctly.
If the software behaves correctly, a hardware conflict is the next possibility. Conflicts
only appear when you add something to the system. If you have been running for
some time and suddenly the tape drive stops working, then it is unlikely that
there are conflicts; unless, of course, you just added some other piece of
hardware. If problems arise after you add hardware, remove it from the
kernel and see whether the problem goes away. If it doesn't
go away, remove the hardware physically from the system.
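Before you start pulling cards, it can help to write the settings down and
look for duplicates. The following is a minimal sketch in Python of that
check. The device names and values are made up, and the function
find_conflicts is hypothetical: it only flags settings that are exactly
identical, so it will not notice overlapping I/O address ranges, and on some
buses a shared interrupt is perfectly legitimate.

```python
from collections import defaultdict


def find_conflicts(devices):
    """Return (resource, value, [device names]) for every shared setting."""
    users = defaultdict(list)
    for name, settings in devices.items():
        for resource, value in settings.items():
            if value is not None:  # None means the device does not use it
                users[(resource, value)].append(name)
    return [(res, val, names)
            for (res, val), names in users.items()
            if len(names) > 1]


# Hypothetical settings, as you might copy them from jumpers or setup screens.
devices = {
    "tape controller": {"irq": 5, "dma": 1, "io_base": 0x300},
    "network card": {"irq": 5, "dma": None, "io_base": 0x300},
    "sound card": {"irq": 7, "dma": 1, "io_base": 0x220},
}

for resource, value, names in find_conflicts(devices):
    print(f"{resource} {value}: shared by {', '.join(names)}")
```

Anything the sketch reports is a candidate to reconfigure or remove first.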
Another issue that people often forget is cabling. It has happened to me a number of
times that I had a new piece of hardware and, after relinking and rebooting, something else
didn't work. After removing it again, the other piece still didn't work. What
happened? When I added the hardware, I loosened the cable on the other piece.
Needless to say, pushing the cable back in fixed my problem.
I have also seen cases in which the cable itself is bad. One support engineer
reported a case to me in which just pin 8 on a serial cable was bad. Depending on
what was being done, the cable might work. Needless to say, this problem was not
easy to track down.
Potentially, the connector on the
cable is bad. If you have something like SCSI,
on which you can change the order of the devices on the cable
without much hassle, this is a good test. If you switch hardware and the
problem moves from one device to the other, this could indicate one of two
things: either the termination or the connector is bad.
If you do have a hardware problem, it is often the result of a
conflict. If your system has been running for a while and you just added
something, it is fairly obvious what is causing the conflict. If you have
trouble installing, it is not always as clear. In such cases, the best thing is
to remove everything from your system that is not needed for the install. In
other words, strip your machine to the "bare bones" and see how far you get.
Then add one piece at a time so that once the problem re-occurs, you know you
have the right piece.
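The "bare bones" procedure above is really a small loop, and it can be
sketched in Python as follows. This is only an illustration of the order of
the steps: find_culprit and the system_works callback are hypothetical names,
and in real life "system_works" is you installing the part and trying the
install again.

```python
def find_culprit(components, system_works):
    """Add components one at a time; return the first one that breaks things.

    components: the optional hardware, in the order you plan to put it back.
    system_works: hypothetical callback that tests the machine with the given
    list of installed components and returns True if everything works.
    """
    installed = []
    if not system_works(installed):
        # Even the stripped-down machine fails, so the extras are not to blame.
        return "bare-bones system"
    for part in components:
        installed.append(part)
        if not system_works(installed):
            return part
    return None  # everything is back in and the problem never showed up


# Simulated machine in which the sound card is the troublemaker.
parts = ["network card", "sound card", "tape controller"]
print(find_culprit(parts, lambda installed: "sound card" not in installed))
```

Because only one piece changes per step, a failure can be pinned on the piece
you just added.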
As you try to track down
the problem yourself, examine the problem carefully. Can you tell whether there
is a pattern to when and/or where the problem occurs? Is the problem related to
a particular piece of hardware? Is it related to a particular software package?
Is it related to the load that is on the system? Is it related to the length of
time the system has been up? Even if you can't tell what the pattern means, the
support representative probably has one or more pieces of information to help
track down the problem. Did you just add a new piece of hardware or software? Does
removing it correct the problem? Did you check to see whether there are any
hardware conflicts such as base address,
interrupt vectors, and DMA channels?
I have talked to customers who were having trouble with one particular
command. They insisted that it did not work correctly and that there was therefore a
bug in either the software or the doc. Because they were reporting a bug, we
allowed them to speak with a support engineer even though they did not have a
valid support contract. They kept saying that the documentation was bad because
the software did not work the way it was described in the manual. After pulling
some teeth, I discovered that the doc the customers were using was for a product that
was several years old. In fact, there had been three releases since then. They
were using the latest software, but the doc was from the older release. No
wonder the doc didn't match the software.
- Collect information
- Instead of a simple
list, I suggest you create a
mind map. Your brain works in a non-linear fashion, and unlike a simple list,
a mind map helps you gather and analyse information the way your brain actually works.
- Work methodically and stay on track
- Unless you have a very specific reason, don't jump to some other area before you
complete the one you are working on. It is often a waste of time, not because that
other area is not where the problem is, but rather because "finding yourself" again in the
original test area almost always requires a little bit of extra time ("Now where was
I?"). Let your test results in one area guide you to other areas even if that
means jumping somewhere else before you are done. But make sure you have a reason.
- Split the problem into pieces
- Think of a chain that has a broken link. You can tie the end onto something, but
when you pull, nothing happens. Each link needs to be examined individually. Also, the
larger the pieces, the easier it is to overlook something.
- Keep track of where you have been
- "Been there, done that." Keep a record of what you have done/tested and what the
results were. This can save a lot of time with complex problems that have many different
parts.
- Listen to the facts
- One key concept I think you need to keep in mind is that appearances can be
deceiving. The way the problem presents itself on the surface may not be the real
problem at all. Especially when dealing with complex systems like Linux or
networking, the problem may be buried under several different layers of "noise".
Therefore, you should try not to make too many assumptions and, if you do, verify those
assumptions before you go wandering off on the wrong path. Generally, if you can
figure out the true nature of the problem, then finding the cause is usually very
easy.
- Be aware of all limitations and restrictions
- Maybe what you are trying to do is not possible given the current configuration or
hardware. For example, maybe there is a firewall rule which prevents two machines
from communicating. Maybe you are not authorized to use resources on a specific
machine. You might be able to see a machine using some tools (e.g. ping) but not
with others (e.g. traceroute).
- Read what is in front of you
- Pay particular attention to error messages. I have had "experienced" system
administrators report problems to me and say that there was "some error message" on
the screen. It's true that many errors are vague or come from the last link in
the chain, but more often than not they provide valuable information. This also
applies to the output of commands. Does the command report the information
you expect it to?
- Keep calm
- Getting upset or angry will not help you solve the problem. In fact, just the
opposite is true. You begin to be more concerned with your frustration or anger
and forget about the true problem. Keep in mind that if the hardware
or software were as buggy as you now think it is, the company would be out of
business. (Obviously, that statement does not apply to Microsoft products.)
It's probably one small point in the doc that you skipped over (if you
even read the doc) or something else in the system is conflicting. Getting upset
does nothing for you. In fact (speaking from experience), getting upset can
cause you to miss some of the details for which you're looking.
- Recreate the problem
- As in many branches of science, you cause something to happen and then examine
both the cause and results. This not only verifies your understanding of the situation,
it also helps prevent wild goose chases. Users with little or no technical
experience tend to overdramatize problems. This often results in comments like
"I didn't do anything. It just stopped working." By recreating the problem yourself,
you ensure that the problem does not exist between the chair and the keyboard.
- Stick with known tools
- There are dozens (if not hundreds) of network tools available. The time to learn
about their features is not necessarily when you are trying to solve a business
critical problem. Find out what tools are already available and learn how to use
them. I would also recommend using the tools that are available on all machines (or
at least as many as possible). That way you don't need to spend time learning the
specifics of each tool.
- Don't forget the obvious
- Cables can accidentally get kicked out or damaged. I have seen cases where
the cleaning crew turned off a monitor and the next day the user reported
that the computer didn't work because the screen was blank.
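For the advice above about keeping track of where you have been, even a plain
text file will do. If you prefer something you can search, the following
minimal Python sketch appends each test to a CSV file and lets you ask whether
a test has already been run. The function names and the column layout are my
own choices for illustration, not part of any standard tool.

```python
import csv
from datetime import datetime


def record_test(log_path, area, action, result):
    """Append one line to the log: timestamp, area, what was done, outcome."""
    with open(log_path, "a", newline="") as f:
        stamp = datetime.now().isoformat(timespec="seconds")
        csv.writer(f).writerow([stamp, area, action, result])


def already_tested(log_path, area, action):
    """Return True if this exact test is already in the log."""
    try:
        with open(log_path, newline="") as f:
            return any(row[1:3] == [area, action] for row in csv.reader(f))
    except FileNotFoundError:
        return False  # no log yet, so nothing has been tested
```

A call such as record_test("trouble.csv", "tape drive", "swapped cable",
"no change") after every step takes a few seconds and answers the "Now where
was I?" question later.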