Troubleshooting a Computer System: Simulation Assessment
Background Information
The Enterprise Services division at Sun Microsystems provides a broad range of services, but many of them have a common component based in the skill area of troubleshooting computer systems which are failing in some measure. New support personnel often have some programming or hardware assembly experience, but many do not have extensive experience with troubleshooting, or formal training in that skill area. New support agents can easily feel intimidated in their first experiences with customers, and may resort to a trial and error approach, rather than a systematic analysis of symptoms for probable causes. This creates unnecessary expense in the service process, and adds an undesirable delay to the solution of the customer's problem.
Two levels of training are currently provided to employees of Sun Enterprise Services in the area of troubleshooting:
Analytic Troubleshooting System training, provided by Kepner-Tregoe
Product-specific troubleshooting training, e.g. "Core File Analysis"
The ATS course provides a generalized troubleshooting methodology based on asking systematic questions ("When did the failure occur?" "What changed just before the failure occurred?"). Although the training provided at Sun is slightly customized toward using this methodology in a computer systems environment, it provides no assessment and few valid case studies. The product-specific training available assumes a tightly constrained context, does not transfer well from one product or technology area to another, and is not available for all products. What is missing is a middle layer of generalized computer systems troubleshooting training and assessment.
Problem Statement
Troubleshooting computer systems requires basic knowledge of "how the system works." This involves knowing what the basic parts of a computer system are, including operating system, file structure, processes, interprocess communication, and hardware (CPU, RAM, storage, etc.), and how these components relate to and interact with one another. Further, knowledge of how a failure of any individual component is likely to affect the entire system is required. Finally, the skilled troubleshooter needs to be able to determine what corrective action to take to be able to resolve the problem and restore the system to correct functioning.
This assessment method is intended to be used for three purposes:
To identify stronger candidates in the area of troubleshooting during the hiring process
To identify existing support agents who do not have sufficient general computer system troubleshooting skills, for the purposes of providing remedial instruction before work assignment or before attending a more specialized class
To establish that students who did not previously possess adequate troubleshooting skills have now acquired them, as a result of instruction
Learning Outcome/Objectives
This assessment will verify that the participant is able to:
Note significant symptoms of a failing computer system
Use common Unix commands and other simple operations to acquire additional symptomatic data
Correlate symptoms of failure with probable causes
Form hypotheses about causes of failure
Test hypotheses of causes of failure
Suggest appropriate hardware-based remedies
Implement simple appropriate software-based remedies
Suggest more complex appropriate software-based remedies
Simulation Environment Specification
Simulation Technology Used
Because the support agents for whom this assessment is applicable constitute a very large group (approximately 6000 support agents) and are distributed in several hundred large and small offices around the world, working a variety of shifts, it is desirable that this assessment method be automated. Ideally, it will be paired with standalone web-based instruction, which can serve either as an instructional method before taking the assessment, as a remedial method after taking the assessment, or both.
The assessment will consist of a simple but flexible text-based simulation environment, adapted from that used for the construction of text-based games, also known as "interactive fiction." The proposed development tool is Inform. (Please see http://www.gnelson.demon.co.uk/inform.html for more information about this programming language and interactive fiction construction.) This simulation environment will be stored on a server and accessed by a telnet connection. This telnet connection may be automated and embedded on a web page in the form of a Java applet, allowing simulations to be easily included in web-based training, and also allowing scores to be recorded and tracked in online learning management systems. (A slightly customized version of a standard telnet applet should be used to enable this score recording.)
Simulation Parameters
The simulation environment will present an interface similar to a standard C-shell, which accepts Unix commands, but also includes a simple natural language parser, and will simulate the effects of the following system features:
Operating System
File System
Serial Port Configuration
Two working processes (widget_check, widget_graph)
One interim data file (widget_data)
Interprocess Communication between widget_check and widget_graph
CPU
Memory (RAM)
Storage (Disk)
Serial Port Hardware (widget monitors)
Simulated System Functionality
The functionality of the simulated system, as described to the participant, will be the following:
The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.
System Failures
The system is designed to be able to simulate any one of the following failures (described here by cause):
Corrupt data or executable file
Missing data or executable file
Incorrect permissions set on a data or executable file or on a directory, or an incorrect umask
Bug in executable file
Bad CPU
Bad memory module
Bad storage device
Bug in Operating System
Serial port not configured correctly
Serial Cables not connected to monitoring device correctly
It is also possible to simulate multiple failures for advanced problems.
Symptoms
Each type of failure presents certain signature symptoms, some of which are readily apparent, and some of which will require use of one or more tools or methods to identify. Most symptoms are individually ambiguous, and will not indicate a single cause of failure when viewed in isolation. Symptoms include:
Graph appears, but contains no data
No graph appears
System generates a "segmentation fault" error
System panics
System generates a "permission denied" error
ls indicates no data files or missing executable files
truss indicates an executable cannot access a file
pkgchk indicates file corruption
serial port configuration file does not match manual instructions
System generates "parity errors"
System generates /dev/dsk errors
System generates "red state exception"
Available tools and operations
The student is presented with the simplest level of description first ("The customer reports that the system does not work"). A single lead clue is also provided, e.g. "The customer reports that the system has never worked following installation." The student can then query and modify the system, using a variety of tools and methods. Unix commands are in courier. Natural language operations are presented in plain (proportional) text.
ls
more or cat
cd
truss
pkgchk
umask
chmod
Attempt to run program
Read software documentation
Read source
Read /var/adm/messages
Insert debug lines in source (a limited number of options are provided)
Make correction to source (again, a limited number of corrections are available)
Reinstall package
Check bugs database (a simulated bugs database is provided which includes actual bugs used in the simulations, as well as a number of distractors)
Write a test utility (a test utility is provided)
Write a test data file (a test data file with constant values is provided)
Replace hardware parts (CPU, memory, storage devices, cables, sensors)
Test a known "good" widget and a known "bad" widget
Check cables for connection to serial port and widget monitoring sensors
Ask customer when the problem started
stty (Examine serial port configuration)
Apply a patch
Not all of these operations are appropriate or useful to every problem, and some incur expense as well as delay. To simulate the real-world support environment, a limited number of operations is allowed for each assessment attempt, keyed to the type and difficulty of problem. Progress toward the goals of identifying relevant symptoms and solving the problem is tracked by the simulation software, using a scoring system. Partial solutions are possible.
Feedback
The system can operate in two modes, assessment and instruction. Both modes support multiple levels of difficulty. In the assessment mode, limited scaffolding is provided, primarily in the form of simple positive and negative feedback as the participant approaches or strays from profitable lines of inquiry. In the instruction mode, scaffolding is provided in the form of prompting appropriate to the difficulty level and the number of attempts made by the student to solve a given problem. In both modes, a summative score is also presented at the end.
To provide positive feedback, as the participant uses appropriate tools which bring them closer to the goal of identifying the cause of the failure and correcting it, the system provides observations such as "The customer seems impressed with your troubleshooting skill." As the participant comes close to solving the problem, they may see a comment such as "The customer and another colleague look cheerful and start talking about lunch plans." Solution of the problem generates a message such as "The customer congratulates you and thanks you for solving the problem so quickly."
As in real life, however, the "customer" is also ready to criticize if the participant takes overly long to chase down the problem. Early indications that the participant is approaching the allowed number of operations are mild. "The customer seems slightly impatient." Later, as the operations limit approaches, "The customer and a supervisor have a serious discussion just out of your hearing." The final comment indicating failure to resolve the problem within the customer's patience might be "The customer asks you to call in a more experienced support agent to resolve the problem." (The equally realistic "The customer tosses you and the broken system out the loading dock door" was rejected as being too harsh for a learning environment.)
If the instruction mode is being used, the system can additionally offer hints, such as "Have you checked the permissions on the files and directories?" These hints are based on operations known to be of benefit in solving the specified problem, which the student has not yet tried.
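The hint selection described above amounts to a filter over the operations known to help with the current problem. The following Python sketch is illustrative only: the function name, the operation keys, and all hint wording except the permissions hint quoted above are assumptions, not part of this specification.

```python
# Hypothetical hint table for one problem type.
# Only the "permissions" hint text is quoted from the specification;
# the other keys and wording are illustrative assumptions.
HINTS = {
    "permissions": "Have you checked the permissions on the files and directories?",
    "truss": "Have you traced the programs to see where they stop?",
    "bugs": "Have you searched the bugs database for the error message?",
}


def pending_hints(hints, tried):
    """Return hints for beneficial operations the student has not yet tried,
    in the order the hint table lists them."""
    return [text for op, text in hints.items() if op not in tried]


# A student who has already run truss is offered the two remaining hints.
for hint in pending_hints(HINTS, tried={"truss"}):
    print(hint)
```

Keeping the hints in a per-problem table means each simulated failure can carry its own list of beneficial operations without changing the selection logic.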
Following either the resolution of or the failure to resolve the problem, the system presents the total score and the highest possible score, and indicates steps which are judged most useful in solving this type of problem. Note that there are several possible solution paths for all of the problems described above.
Sample Problem
For the purposes of this specification, a single problem has been selected to describe in detail. In the actual implementation, all of the problem types described above should be simulated, and a problem should be selected at random when the participant starts the assessment.
Incorrect permissions set on a directory
This is a common problem in software installations. It might be classified as a bug in the installation script.
Initial Symptom
The description provided above is displayed for the participant, with some context:
You have been requested to make an on-site service call to Widget Plus Enterprises, Inc. to repair their E450 running WidgetMon software and peripherals.
The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.
The customer, Wally Jones, reports: "We installed the system according to directions, but it never has worked."
You have arrived at the company site, and Wally has escorted you to the WidgetMon system.
The system then provides a prompt:
What would you like to do next?
Additional Symptoms in response to commands:
The following list of possible operations describes the results and feedback of each item. Helpful items are marked in boldface. Detrimental items are underlined. These are items which have a negative impact on customer satisfaction in the given situation.
ls
Both executable files are shown, but no data files.
more or cat
The executable files display typical binary "junk."
cd
A limited number of directories are available to explore, including the executable installation directory and the device directory. The information needed is all in the starting directory (the installation directory).
truss
Truss output is provided. This requires some skill to read.
The widget_check program halts after being unable to create an output file.
The widget_graph program halts after being unable to read an input file.
pkgchk
The output indicates that the package is correctly installed.
umask
This command indicates that the customer has a default umask setting which causes directories to be created which do not allow file creation.
Attempt to run program
A "widget_data: Permission Denied" error message is generated.
chmod
Provided the participant performs this operation on the directory the software is installed in, to make the directory allow file creation, this will solve the problem. Other attempts to chmod the executables will produce the logical results, none of which are likely to be helpful.
Read software documentation
The software documentation is provided as a URL. The participant will find a general description of the product, diagrams showing the correct setup of the cables and monitors and how to position a widget between the monitors, and installation instructions. At the beginning of the installation instructions is a note to be sure the directory created for the installation has appropriate permissions to allow the creation of files.
Read source
The source of the two executables is displayed.
Read /var/adm/messages
The error widget_data: Permission Denied appears repeatedly in the messages file.
Insert debug lines in source (a limited number of options are provided)
The debug lines indicate that the widget_check program is getting as far as reading the first pair of values and comparing them, then trying to write the result to a file, before terminating.
Make correction to source (again, a limited number of corrections are available)
The corrections introduce bugs to the software. This causes a loss of points.
Reinstall package
This could reasonably catch the error, but only if the support agent correctly reads the documentation. Therefore, simply commanding the simulation to reinstall the package will not provide any useful feedback.
Check bugs database (a simulated bugs database is provided as a set of web pages which includes actual bugs used in the simulations, as well as a number of distractors)
A bug has been logged against the installation script, recommending that it be modified to automatically create the install directory and set the permissions appropriately. The bug report also contains a description of the necessary workaround of using chmod to alter the directory permissions. The bug will be found on a search containing the error string or enough other keywords related to the problem, although a search which does not include the error string probably will not eliminate distractors.
Write a test utility (a test utility is provided)
The test utility tests the system call which accesses the serial port. The system call appears to be working correctly.
Write a test data file (a test data file with constant values is provided)
The support agent will be unable to save the file in the directory, due to the incorrect permissions on the directory.
Replace hardware parts (CPU, memory, storage devices, cables, sensors)
No changes to the system state will result from these activities. Since cost is incurred whenever parts are used, this operation lowers the participant's score.
Test a known "good" widget and a known "bad" widget
Neither widget provides any output, but the error message is generated. See Attempt to run program, above.
Check cables for connection to serial port and widget monitoring sensors
The cables seem to be securely connected.
Ask customer when the problem started
Wally reiterates that the system has never functioned correctly since installation.
Examine serial port configuration (stty)
Serial port configuration matches that described in the manual.
Apply a patch
There is no patch available for this product. You update the recommended OS patches. There is no change in system behavior. Since installing patches without identifying a need for them generally reduces system stability, it is a customer dissatisfier, and causes a loss of points.
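The umask and chmod results above rest on simple permission-bit arithmetic, which can be sketched in Python. This is an illustration under stated assumptions, not part of the simulation: the function names are hypothetical, and the umask value 037 is chosen only because it yields the drwxr----- directory mode shown in the sample transcript later in this document. The specification does not say what the customer's umask actually is.

```python
import stat


def directory_mode(umask, requested=0o777):
    """Mode a newly created directory receives: the requested permission
    bits with every bit set in the umask cleared."""
    return requested & ~umask


def can_create_files(mode, who):
    """True if the given class ('owner', 'group' or 'other') has both write
    and search (execute) permission on a directory with this mode."""
    shift = {"owner": 6, "group": 3, "other": 0}[who]
    bits = (mode >> shift) & 0o7
    return bits & 0o3 == 0o3  # write (2) and execute (1) both set


# Assumed umask of 037: group loses write/execute, others lose everything.
mode = directory_mode(0o037)
print(stat.filemode(stat.S_IFDIR | mode))   # drwxr-----
print(can_create_files(mode, "group"))      # False

# Equivalent of "chmod g+wx" on the installation directory.
fixed = mode | stat.S_IWGRP | stat.S_IXGRP
print(stat.filemode(stat.S_IFDIR | fixed))  # drwxrwx---
print(can_create_files(fixed, "group"))     # True
```

The sketch only computes mode bits; it does not touch the file system, so it behaves the same regardless of which user runs it.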
Scoring Mechanism
Correct solution of the problem (using chmod to repair the permissions on the installation directory) is worth 300 points. An alternate allowable solution is using umask to change the default permission settings, then reinstalling the package. The allowed number of operations for this exercise is 6. Each move fewer than 6 used is worth an additional bonus of 50 points. Note that a solution in one move, worth 550 points, would be a lucky guess, because the signature symptom of this error, the "Permission Denied" error message, is not reported in the initial problem description. Two moves would be almost as unlikely.
Partial solutions: if the problem has not been solved, each executed operation marked in boldface in the above list earns 50 points.
Items which are underlined in the list above cause a subtraction of 50 points each from the score.
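The scoring rules above can be sketched as a small Python function. This is a minimal illustration, not the simulation's implementation. The function and constant names are hypothetical, and the sketch makes two assumptions the specification leaves open: a helpful operation earns its 50 points only once even if repeated, while every detrimental operation costs 50 points each time it is performed.

```python
SOLUTION_POINTS = 300        # from the specification
BONUS_PER_UNUSED_MOVE = 50   # from the specification
HELPFUL_POINTS = 50          # from the specification
PENALTY_POINTS = 50          # from the specification
ALLOWED_MOVES = 6            # limit for this exercise


def score_attempt(moves, solved, helpful, detrimental):
    """Score one attempt.

    moves       -- ordered list of operation names the participant performed
    solved      -- True if the problem was solved within the move limit
    helpful     -- set of operation names marked helpful for this problem
    detrimental -- set of operation names marked detrimental for this problem
    """
    if len(moves) > ALLOWED_MOVES:
        raise ValueError("more moves than the exercise allows")

    # Assumption: each occurrence of a detrimental operation is penalized.
    penalty = PENALTY_POINTS * sum(1 for m in moves if m in detrimental)

    if solved:
        unused = ALLOWED_MOVES - len(moves)
        base = SOLUTION_POINTS + BONUS_PER_UNUSED_MOVE * unused
    else:
        # Assumption: a repeated helpful operation is counted only once.
        base = HELPFUL_POINTS * len(set(moves) & set(helpful))

    return base - penalty


# Illustrative operation sets; the names are placeholders.
helpful = {"ls", "truss", "umask", "run program"}
detrimental = {"replace hardware", "apply patch"}

print(score_attempt(["chmod"], True, helpful, detrimental))                # 550
print(score_attempt(["ls", "truss", "apply patch"], False, helpful, detrimental))  # 50
```

The first call reproduces the one-move maximum of 550 points described above (300 for the solution plus five unused moves at 50 points each).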
The conclusion of the exercise provides the following feedback:
You scored [score] out of a maximum possible 550 points. (The solution is worth 300 points, with bonuses possible for a speedy resolution.)
[Insert specific feedback dependent on score:]
450-550: You solved the problem so quickly, I think you must have seen this problem before. (Experience is a wonderful aid in troubleshooting.) How about trying another one?
350-400: Excellent work! Wally and his manager are very pleased with your efforts. They suggest a quick lunch before you return to the office.
300: Good work! Wally and his manager are pleased that you were able to solve their problem. Wally's manager, Marge, suggests that Wally treat you to a Café Latté at the Starbucks outside the company cafeteria.
200-250: You didn't quite solve this one, but you were close. A quick call to a more senior support agent provides the following suggestions:
[Only techniques not used by the participant are listed below. Links to web instruction detailing how to use each of these operations are provided with each item.]
Use ls to see if the data files are present
Use truss to see where the processes are quitting
Check the umask setting
Note the "permission denied" error messages
Check the bugs database
Try writing a test file, or test a known good widget
After reviewing the data with you, the other support agent identifies the problem as incorrect access settings on the install directory, and suggests that you use chmod to fix the problem. The program runs correctly after this.
50-150: You seemed to have trouble with this problem. A more senior support agent was called in, and looked at what you had done. "You had some good ideas, but you could have gained more information if you had tried one of the following:
[Only techniques not used by the participant are listed below. Links to web instruction detailing how to use each of these operations are provided.]
Use ls to see if the data files are present
Use truss to see where the processes are quitting
Check the umask setting
Note the "permission denied" error messages
Check the bugs database
Try writing a test file, or test a known good widget"
After reviewing the data with you, the other support agent identifies the problem as incorrect access settings on the install directory, and suggests that you use chmod to fix the problem. The program runs correctly after this.
-300 to 0: This was a tough one, wasn't it? A more senior support agent was called in and resolved the problem. It turned out to be a permissions problem. You might like to review permissions problems and how to identify them. [A link is provided to web instruction overviewing how to identify permissions problems as a class.]
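Selecting the score-dependent feedback above is a lookup over the lower bound of each score range. The Python sketch below shows one way to do it. The band labels and function name are hypothetical placeholders for the feedback texts listed above, and the sketch assumes that scores are always multiples of 50, as the scoring rules imply.

```python
# (lowest score in the band, hypothetical label for its feedback text),
# listed from the highest band to the lowest.
BANDS = [
    (450, "solved_very_quickly"),   # 450-550
    (350, "solved_quickly"),        # 350-400
    (300, "solved"),                # 300
    (200, "nearly_solved"),         # 200-250
    (50, "had_trouble"),            # 50-150
]
LOWEST_BAND = "review_permissions"  # 0 and below


def feedback_band(score):
    """Return the label of the feedback band that a score falls into."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return LOWEST_BAND


print(feedback_band(550))   # solved_very_quickly
print(feedback_band(200))   # nearly_solved
print(feedback_band(-300))  # review_permissions
```

The sketch keys on the score alone, as the ranges above do; it does not separately check whether the problem was actually solved.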
Example of usage and scoring:
[The participant accesses a web page which displays a Java telnet applet. The applet automatically takes the participant's student identifying information from the training system and passes this to the simulation server, logging the student in and starting the appropriate simulation. Text in Courier Italics indicates system output to the participant. Text in Courier Bold indicates student input. Proportional (plain) text indicates a description or comment.]
You have been requested to make an on-site service call to Widget Plus Enterprises, Inc. to repair their E450 running WidgetMon software and peripherals.
The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.
The customer, Wally Jones, reports: "We installed the system according to directions, but it never has worked."
You have arrived at the company site, and Wally has escorted you to the WidgetMon system.
What would you like to do next?
Attempt to run program
A "widget_data: Permission Denied" error message is generated.
What would you like to do next?
Examine serial port configuration or stty
Serial port configuration matches that described in the manual. [Preferably, the actual stty output of a working system would be shown here.]
What would you like to do next?
ls
%ls -al
drwxr----- 2 root staff 3072 Feb 28 08:15 ./
drwxrwxrwx 103 root staff 7680 Feb 28 08:19 ../
-rwx--x--x 1 wally staff 2968 Feb 7 2000 widget_check
-rwx--x--x 1 wally staff 2968 Feb 7 2000 widget_graph
[Both executable files are shown, but no data files.]
The customer seems impressed with your troubleshooting skill. What would you like to do next?
Read software documentation
See http://www.widgetmon.com/manual.html
[Unfortunately, the participant misses the permissions clue.]
What would you like to do next?
Reinstall package
There is no change in system behavior.
The customer seems slightly impatient. What would you like to do next?
[In the Instruction mode, the system would offer a hint at this point, selected from appropriate actions that the participant has not yet tried or that the participant has tried but seems not to have taken advantage of. Here, an example would be:
Note the "permission denied" error messages. This is a signature symptom of a specific type of problem.]
truss
[Truss output is provided. This is lengthy and requires some skill to read, and is not included here. Salient characteristics:
The widget_check program halts after being unable to create an output file.
The widget_graph program halts after being unable to read an input file.]
The customer asks you to call in a more experienced support agent to help resolve the problem.
You scored 200 out of a maximum possible 550 points. (The solution is worth 300 points, with bonuses possible for a speedy resolution.)
You didn't quite solve this one, but you were close. A quick call to a more senior support agent provides the following suggestions:
Note the "permission denied" error messages. This is a signature symptom of a permissions problem.
Check the umask setting
Check the bugs database
Try writing a test file, or test a known good widget
After reviewing the data with you, the other support agent identifies the problem as incorrect access settings on the install directory, and suggests that you use chmod to fix the problem. The program runs correctly after this.
Click on the suggestions above if you'd like to review how to perform these procedures.
Try again?