Troubleshooting a Computer System: Simulation Assessment


Background Information


The Enterprise Services division at Sun Microsystems provides a broad range of services, but many of them share a common component: the skill of troubleshooting computer systems that are failing to some degree. New support personnel often have some programming or hardware assembly experience, but many have neither extensive troubleshooting experience nor formal training in that skill area. New support agents can easily feel intimidated in their first experiences with customers, and may resort to a trial-and-error approach rather than a systematic analysis of symptoms for probable causes. This creates unnecessary expense in the service process, and adds an undesirable delay to the solution of the customer's problem.


Two levels of training are currently provided to employees of Sun Enterprise Services in the area of troubleshooting:


  1. Analytic Troubleshooting System training, provided by Kepner-Tregoe

  2. Product-specific troubleshooting training, e.g. "Core File Analysis"


The ATS course provides a generalized troubleshooting methodology based on asking systematic questions ("When did the failure occur?" "What changed just before the failure occurred?"). Although the training provided at Sun is slightly customized toward using this methodology in a computer systems environment, it provides no assessment and few valid case studies. The product-specific training available assumes a tightly constrained context, does not transfer well from one product or technology area to another, and is not available for all products. What is missing is a middle layer of generalized computer systems troubleshooting training and assessment.


Problem Statement


Troubleshooting computer systems requires basic knowledge of "how the system works." This involves knowing what the basic parts of a computer system are, including operating system, file structure, processes, interprocess communication, and hardware (CPU, RAM, storage, etc.), and how these components relate to and interact with one another. Further, knowledge of how a failure of any individual component is likely to affect the entire system is required. Finally, the skilled troubleshooter needs to be able to determine what corrective action to take to be able to resolve the problem and restore the system to correct functioning.


This assessment method is intended to be used for three purposes:


  1. To identify stronger candidates in the area of troubleshooting during the hiring process

  2. To identify existing support agents who do not have sufficient general computer system troubleshooting skills, for the purposes of providing remedial instruction before work assignment or before attending a more specialized class

  3. To establish that students who did not previously possess adequate troubleshooting skills have now acquired them, as a result of instruction



Learning Outcome/Objectives


This assessment will verify that the participant is able to:



Simulation Environment Specification


Simulation Technology Used


Because the support agents for whom this assessment is applicable constitute a very large group (approximately 6000 support agents) and are distributed in several hundred large and small offices around the world, working a variety of shifts, it is desirable that this assessment method be automated. Ideally, it will be paired with standalone web-based instruction, which can serve either as an instructional method before taking the assessment, as a remedial method after taking the assessment, or both.


The assessment will consist of a simple, but flexible text-based simulation environment, adapted from that used for the construction of text-based games, also known as "interactive fiction." The proposed development tool is Inform. (Please see http://www.gnelson.demon.co.uk/inform.html for more information about this programming language and interactive fiction construction.) This simulation environment will be stored on a server and accessed by a telnet connection. This telnet connection may be automated and embedded on a web page in the form of a Java applet, allowing simulations to be easily included in web-based training, and also allowing scores to be recorded and tracked in online learning management systems. (A slightly customized version of a standard telnet applet should be used to enable this last.)




Simulation Parameters


The simulation environment will present an interface similar to a standard C-shell, which accepts Unix commands, but also includes a simple natural language parser, and will simulate the effects of the following system features:



Simulated System Functionality


The functionality of the simulated system, as described to the participant, will be the following:


The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.


System Failures


The system is designed to be able to simulate any one of the following failures (described here by cause):



It is also possible to simulate multiple failures for advanced problems.


Symptoms


Each type of failure presents certain signature symptoms, some of which are readily apparent, and some of which will require use of one or more tools or methods to identify. Most symptoms are individually ambiguous, and will not indicate a single cause of failure when viewed in isolation. Symptoms include:



Available tools and operations


The student is presented with the simplest level of description first ("The customer reports that the system does not work."). A single lead clue is also provided, e.g. "The customer reports that the system has never worked following installation." The student can then query and modify the system, using a variety of tools and methods. Unix commands are shown in Courier. Natural language operations are presented in plain (proportional) text.



Not all of these operations are appropriate or useful to every problem, and some incur expense as well as delay. To simulate the real-world support environment, a limited number of operations is allowed for each assessment attempt, keyed to the type and difficulty of problem. Progress toward the goals of identifying relevant symptoms and solving the problem is tracked by the simulation software, using a scoring system. Partial solutions are possible.


Feedback


The system can operate in two modes, assessment and instruction. Both modes support multiple levels of difficulty. In assessment mode, limited scaffolding is provided, primarily in the form of simple positive and negative feedback as the participant approaches or strays from profitable lines of inquiry. In instruction mode, scaffolding is provided in the form of prompting appropriate to the difficulty level and the number of attempts made by the student to solve a given problem. In both modes, a summative score is also presented at the end.


To provide positive feedback, as the participant uses appropriate tools which bring them closer to the goal of identifying the cause of the failure and correcting it, the system provides observations such as "The customer seems impressed with your troubleshooting skill." As the participant comes close to solving the problem, they may see a comment such as "The customer and another colleague look cheerful and start talking about lunch plans." Solution of the problem generates a message such as "The customer congratulates you and thanks you for solving the problem so quickly."


As in real life, however, the "customer" is also ready to criticize if the participant takes overly long to chase down the problem. Early indications that the participant is using up the allowed number of operations are mild: "The customer seems slightly impatient." Later, as the operations limit approaches: "The customer and a supervisor have a serious discussion just out of your hearing." The final comment indicating failure to resolve the problem within the customer's patience might be "The customer asks you to call in a more experienced support agent to resolve the problem." (The equally realistic "The customer tosses you and the broken system out the loading dock door" was rejected as being too harsh for a learning environment.)
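The escalating customer comments described above could be selected with logic along the following lines. This is only a sketch: the function name and the thresholds (half of the budget used, one operation remaining) are assumptions made for illustration, since the specification does not state exactly when each message appears.

```python
from typing import Optional

# Messages quoted from the specification text.
MILD = "The customer seems slightly impatient."
SERIOUS = ("The customer and a supervisor have a serious discussion "
           "just out of your hearing.")
FAILED = ("The customer asks you to call in a more experienced support "
          "agent to resolve the problem.")


def impatience_message(used: int, allowed: int) -> Optional[str]:
    """Return a negative-feedback comment for the current operation count.

    The thresholds below are hypothetical; the specification only says the
    comments grow more severe as the operations limit approaches.
    """
    if used >= allowed:
        return FAILED          # budget exhausted without a solution
    if allowed - used == 1:
        return SERIOUS         # one operation left (assumed threshold)
    if used * 2 >= allowed:
        return MILD            # half of the budget spent (assumed threshold)
    return None                # too early for any complaint


if __name__ == "__main__":
    for used in range(1, 7):
        print(used, impatience_message(used, allowed=6))
```

A real implementation would probably also weigh whether the most recent operation was helpful or detrimental, which this sketch ignores.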


If the instruction mode is being used, the system can additionally offer hints, such as "Have you checked the permissions on the files and directories?" These hints are based on operations known to be of benefit in solving the specified problem, which the student has not yet tried.
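A minimal sketch of this hint selection follows, assuming the simulation keeps a table of beneficial operations for each problem. The table contents, operation names, and the second hint text are hypothetical; only the permissions hint is taken from the text above.

```python
from typing import Dict, Optional, Set

# Hypothetical table for the permissions problem: operation -> hint text.
# Ordered from most to least useful, so the first untried entry wins.
BENEFICIAL_HINTS: Dict[str, str] = {
    "ls -al": "Have you checked the permissions on the files and directories?",
    "run program": "Have you tried running the program to see what it reports?",
}


def next_hint(tried: Set[str],
              hints: Dict[str, str] = BENEFICIAL_HINTS) -> Optional[str]:
    """Return the hint for the first beneficial operation not yet tried."""
    for operation, hint in hints.items():
        if operation not in tried:
            return hint
    return None  # every beneficial operation has already been used


if __name__ == "__main__":
    print(next_hint({"run program"}))
```

Keeping the table ordered lets the designer decide which hint is offered first without changing the selection logic.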


Following either the resolution of or the failure to resolve the problem, the system presents the total score and the highest possible score, and indicates steps which are judged most useful in solving this type of problem. Note that there are several possible solution paths for all of the problems described above.


Sample Problem


For the purposes of this specification, a single problem has been selected to describe in detail. In the actual implementation, all of the problem types described above should be simulated, and a problem should be selected at random when the participant starts the assessment.


Incorrect permissions set on a directory


This is a common problem in software installations. It might be classified as a bug in the installation script.


Initial Symptom


The description provided above is displayed for the participant, with some context:


You have been requested to make an on-site service call to Widget Plus Enterprises, Inc. to repair their E450 running WidgetMon software and peripherals.


The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.


The customer, Wally Jones, reports: "We installed the system according to directions, but it never has worked."


You have arrived at the company site, and Wally has escorted you to the WidgetMon system.


The system then provides a prompt:


What would you like to do next?


Additional Symptoms in response to commands:


The following list of possible operations describes the results and feedback of each item. Helpful items are marked in boldface. Detrimental items, which have a negative impact on customer satisfaction in the given situation, are underlined.


Both executable files are shown, but no data files.


The executable files display typical binary "junk."


A limited number of directories are available to explore, including the executable installation directory and the device directory. The information needed is all in the starting directory (the installation directory).


Truss output is provided. This requires some skill to read.

The widget_check program halts after being unable to create an output file.

The widget_graph program halts after being unable to read an input file.


The output indicates that the package is correctly installed.


This command indicates that the customer has a default umask setting which causes directories to be created which do not allow file creation.


A "widget_data: Permission Denied" error message is generated.


Provided the participant performs this operation on the directory the software is installed in, to make the directory allow file creation, this will solve the problem. Other attempts to chmod the executables will produce the logical results, none of which are likely to be helpful.


The software documentation is provided as a URL. The participant will find a general description of the product, diagrams showing the correct setup of the cables and monitors and how to position a widget between the monitors, and installation instructions. At the beginning of the installation instructions is a note to be sure the directory created for the installation has appropriate permissions to allow the creation of files.


The source of the two executables is displayed.


The error widget_data: Permission Denied appears repeatedly in the messages file.

The debug lines indicate that the widget_check program is getting as far as reading the first pair of values and comparing them, then trying to write the result to a file, before terminating.


The corrections introduce bugs to the software. This causes a loss of points.


This could reasonably catch the error, but only if the support agent correctly reads the documentation. Therefore, simply commanding the simulation to reinstall the package will not provide any useful feedback.


A bug has been logged against the installation script, recommending that it be modified to automatically create the install directory and set the permissions appropriately. The bug report also contains a description of the necessary workaround of using chmod to alter the directory permissions. The bug will be found on a search containing the error string or enough other keywords related to the problem, although a search which does not include the error string probably will not eliminate distractors.


The test utility tests the system call which accesses the serial port. The system call appears to be working correctly.


The support agent will be unable to save the file in the directory, due to the incorrect permissions on the directory.


No changes to the system state will result from these activities. Since cost is incurred whenever parts are used, this operation lowers the participant's score.


Neither widget provides any output, but the error message is generated. See Attempt to run program, above.


The cables seem to be securely connected.


Wally reiterates that the system has never functioned correctly since installation.


Serial port configuration matches that described in the manual.


There is no patch available for this product. You update the recommended OS patches. There is no change in system behavior. Since installing patches without identifying a need for them generally reduces system stability, it is a customer dissatisfier, and causes a loss of points.
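The umask and chmod results in the list above rest on simple permission-bit arithmetic, which can be sketched as follows. The umask value 037 is an assumption chosen because it yields the directory mode drwxr----- shown in the sample session; the specification does not give the actual value. The sketch also assumes the WidgetMon programs run as a group member rather than as the directory owner or the superuser.

```python
import stat


def directory_mode(umask: int) -> int:
    """Mode of a newly created directory: 0o777 with the umask bits cleared."""
    return 0o777 & ~umask


def can_create_files(mode: int, who: str) -> bool:
    """Creating a file requires write AND search (execute) permission
    on the containing directory for the given class of user."""
    write_bit, search_bit = {
        "owner": (stat.S_IWUSR, stat.S_IXUSR),
        "group": (stat.S_IWGRP, stat.S_IXGRP),
        "other": (stat.S_IWOTH, stat.S_IXOTH),
    }[who]
    return bool(mode & write_bit) and bool(mode & search_bit)


if __name__ == "__main__":
    broken = directory_mode(0o037)                   # assumed customer umask
    print(stat.filemode(stat.S_IFDIR | broken))      # drwxr-----
    print(can_create_files(broken, "group"))         # False: the failure
    repaired = broken | stat.S_IWGRP | stat.S_IXGRP  # effect of chmod g+wx
    print(can_create_files(repaired, "group"))       # True: the repair
```

The same arithmetic explains why the alternate solution works: with a less restrictive umask, a reinstalled directory is created with the write and search bits already set.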


Scoring Mechanism


Correct solution of the problem (using chmod to repair the permissions on the installation directory) is worth 300 points. An alternate allowable solution is using umask to change the default permission settings, then reinstalling the package. The allowed number of operations for this exercise is 6. Each of the 6 allowed moves left unused is worth an additional bonus of 50 points. Note that a solution in one move, worth 550 points (300 plus five 50-point bonuses), would be a lucky guess, because the signature symptom of this error, the "Permission Denied" error message, is not reported in the initial problem description. Two moves would be almost as unlikely.


Partial solutions: if the problem has not been solved, each operation executed that is marked in boldface in the above list earns 50 points.


Items which are underlined in the list above cause a subtraction of 50 points each from the score.
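The scoring rules above can be summarized in a short sketch. The function and constant names are hypothetical, and two points the specification leaves open are treated as assumptions: penalties for underlined items apply whether or not the problem is solved, and the score is not floored at zero.

```python
SOLUTION_POINTS = 300     # correct solution of the problem
ALLOWED_OPERATIONS = 6    # operation budget for this exercise
STEP_POINTS = 50          # bonus, partial credit, and penalty unit


def score(solved: bool, moves_used: int,
          helpful: int = 0, detrimental: int = 0) -> int:
    """Compute the summative score for one attempt.

    helpful     -- number of boldface (helpful) operations executed
    detrimental -- number of underlined (detrimental) operations executed
    """
    total = -STEP_POINTS * detrimental          # assumed to apply always
    if solved:
        unused = max(0, ALLOWED_OPERATIONS - moves_used)
        total += SOLUTION_POINTS + STEP_POINTS * unused
    else:
        total += STEP_POINTS * helpful          # partial credit only
    return total


if __name__ == "__main__":
    print(score(solved=True, moves_used=1))                  # 550
    print(score(solved=True, moves_used=4, detrimental=1))   # 350
    print(score(solved=False, moves_used=6, helpful=3))      # 150
```

A one-move solution reaches the stated maximum of 550 points, which is a quick consistency check on the bonus rule.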


The conclusion of the exercise provides the following feedback:


You scored [score] out of a maximum possible 550 points. (The solution is worth 300 points, with bonuses possible for a speedy resolution.)


[Insert specific feedback dependent on score:]




[Only techniques not used by the participant are listed below. Links to web instruction detailing how to use each of these operations are provided with each item.]


[Only techniques not used by the participant are listed below. Links to web instruction detailing how to use each of these operations are provided.]



Example of usage and scoring:


[The participant accesses a web page which displays a Java telnet applet. The applet automatically takes the participant's student identifying information from the training system and passes this to the simulation server, logging the student in and starting the appropriate simulation. Text in Courier Italics indicates system output to the participant. Text in Courier Bold indicates student input. Proportional (plain) text indicates a description or comment.]


You have been requested to make an on-site service call to Widget Plus Enterprises, Inc. to repair their E450 running WidgetMon software and peripherals.


The WidgetMon system is used to apply a test to new widgets as they are produced in the factory. Two sets of measurement values are obtained via the serial port using the widget monitors, attached to the serial ports. These values are processed by widget_check, and the result is saved in the widget_data file. This file is read by widget_graph, to obtain a running measure of quality.


The customer, Wally Jones, reports: "We installed the system according to directions, but it never has worked."


You have arrived at the company site, and Wally has escorted you to the WidgetMon system.


What would you like to do next?


A "widget_input_1: Permission Denied" error message is generated.


What would you like to do next?


Serial port configuration matches that described in the manual. [Preferably, the actual stty output of a working system would be shown here.]


What would you like to do next?


%ls -al

drwxr----- 2 root staff 3072 Feb 28 08:15 ./

drwxrwxrwx 103 root staff 7680 Feb 28 08:19 ../

-rwx--x--x 1 wally staff 2968 Feb 7 2000 widget_check

-rwx--x--x 1 wally staff 2968 Feb 7 2000 widget_graph


[Both executable files are shown, but no data files.]


The customer seems impressed with your troubleshooting skill. What would you like to do next?



See http://www.widgetmon.com/manual.html

[Unfortunately, the participant misses the permissions clue.]


What would you like to do next?


There is no change in system behavior.


The customer seems slightly impatient. What would you like to do next?


[In the Instruction mode, the system would offer a hint at this point, selected from appropriate actions that the participant has not yet tried or that the participant has tried but seems not to have taken advantage of. Here, an example would be:


Note the "permission denied" error messages. This is a signature symptom of a specific type of problem.]


[Truss output is provided. This is lengthy and requires some skill to read, and is not included here. Salient characteristics:

The widget_check program halts after being unable to create an output file.

The widget_graph program halts after being unable to read an input file.]


The customer asks you to call in a more experienced support agent to help resolve the problem.


You scored 200 out of a maximum possible 550 points. (The solution is worth 300 points, with bonuses possible for a speedy resolution.)


You didn't quite solve this one, but you were close. A quick call to a more senior support agent provides the following suggestions:




Click on the suggestions above if you'd like to review how to perform these procedures.


Try again?