FRU(1M) FRU(1M)
fru - Field replacement unit analyzer for Challenge/Onyx systems
fru [-a] namelist corefile
fru is a hardware state analyzer that provides board replacement
information based on system crash dumps. The output provided by fru
displays what system boards, if any, are the most likely suspects that
might have induced a hardware failure.
fru can be run on any namelist and corefile specified on the command
line. namelist contains symbol table information needed for symbolic
access to the system memory image being examined. This will typically be
the unix.N kernel copied into /var/adm/crash, where N is the number of
the crash dump you are analyzing. corefile is a file containing the
system memory image. This will typically be the vmcore.N.comp file
copied into /var/adm/crash by savecore(1) when the machine reboots after
a system panic. If the memory image being analyzed is from a system core
dump (vmcore.N.comp), then namelist must be a copy of the unix file that
was executing at the time (unix.N).
Note that fru cannot be run against live systems, as there is no system
board replacement information available while the system is running
properly.
The fru command has the following options listed below. By default, all
information will be sent to the standard output:
-a Print the entirety of the error dump buffer, which might not
be complete in the kernel console buffer of the core dump.
If fru finds a hardware error state, it will try and report a confidence
level on each system board (and in some cases, the components on a
board). When fru reports a confidence level, it means that it has some
measure of confidence that the board reported has a problem. Typically
each board in the system will be assigned a 10% confidence level if it
reports anything into a hardware error state. Note that there are only a
few levels of confidence, and it is important to recognize what the
percentages mean:
10% The board was witnessed in the hardware error state only.
30% The board has a possible error, with a low likelihood.
40% The board has a possible error, with a medium likelihood.
70% The board has a *probable* error, with a high likelihood.
90% The board is a *definite* problem.
Given that there is the possibility of multiple boards being reported,
care should be taken before when replacing a board on the system. For
example, if two boards are reported at 10%, that is not enough confidence
Page 1
FRU(1M) FRU(1M)
that the boards listed are bad. If there is one board at 70% or better,
however, there is a good likelihood that the board listed is a problem,
and should be replaced. Boards at 30% to 40% are questionable, and should
be reviewed based on the frequency of the failure of the specific board
(in the same slot) between system crashes.
The objective is to catch real hardware problems, rather than just
replacing boards on systems where there isn't a problem.
Here is some sample output from a fru analysis on a system crash dump:
# fru -a /var/adm/crash/unix.0 /var/adm/crash/vmcore.0.comp
---------------------------------------------------------------
FRU ANALYZER (2.2):
++ MEMORY BANK: leaf 1 bank 0 (B)
++ on the MC3 board in slot 3: 90% confidence.
++ END OF ANALYSIS
---------------------------------------------------------------
HARDWARE ERROR STATE:
+ IP19 in slot 2
+ CC in IP19 Slot 2, cpu 3
+ CC ERTOIP Register: 0x10
+ 4: Parity Error on Data from D-chip
+ MC3 in slot 3
+ MA EBus Error register: 0x4
+ 2: My EBus Data Error
+ MA Leaf 1 Error Status Register: 0x2
+ 1: Read Uncorrectable (Multiple Bit) Error
+ MA Leaf 1 Bad Memory Address: 0x3fb27380
+ Slot 3, leaf 1, bank 0 (B)
+ IO4 board in slot 15
+ IA EBUS Error Register: 0x201
+ 0: Sticky Error
+ 9: DATA_ERROR Received
In this example, it would be a good idea to have the memory in leaf 1,
bank 0 (B) changed, and have the MC3 examined (unless the memory and
board in that slot has been replaced before, in which case further
analysis of the hardware on the machine should be completed.)
Please also note that it is possible the system problem being reported
might be something unknown to the version of fru you are currently
running with. There might also be some bugs within fru that SGI is
unaware of that will keep field replacement unit analysis from being
completed.
PPPPaaaaggggeeee 2222 [ Back ]
|