When walking through a troubleshooting wizard, I'm often asked questions that the computer could answer better than I could. Take, for example, the age-old "I can't print!" problem. Take a second right now to speculate wildly about why I can't print; I'll give you the answer later.
If I were using any one of the many troubleshooting wizards available, the first question it would ask is "Is the printer plugged in and turned on?". Now, I've never actually clicked "No" to this question, but I'm sure the wizard would then suggest that I plug the printer in and turn it on. I fear even my Mom would find this advice insulting.
With the visible lack of a "Duh" button, I click "Yes".
But why? Shouldn't a wizard be able to figure that out for itself? Where does my configuration say my printer is located anyway? On USB, hanging off my parallel port, or on the network somewhere?
Well, if it's USB, why doesn't the wizard check for it there? If it can't find it, try sending out signals along the bus to see if anything responds. Reset it if necessary. Check to see if the printer tries to get configured but fails (not enough power left on the bus?).
If it's parallel, try sending some of that funky "what kind of printer are you?" signalling that can be done through the parallel port. If nothing responds, try resetting my port and asking again.
If it's on the network, try pinging it. No response? Try pinging my DNS server. No response? Try pinging my default gateway. No response? Try pinging my network card.
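The fallback chain for a networked printer could be sketched roughly like this. It is only an illustration: `first_reachable`, the hop labels, the addresses, and the injected `probe` callable are all made-up names, and a real Healer would plug in an actual ping or port probe where the stand-in is.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_reachable(hops: Iterable[Tuple[str, str]],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the label of the first hop that answers, or None if none do.

    `hops` is ordered the way the text orders them: the printer first,
    then the DNS server, the default gateway, and the local network card.
    """
    for label, address in hops:
        if probe(address):
            return label
    return None


# Hypothetical addresses, for illustration only.
HOPS = [
    ("printer", "192.0.2.50"),
    ("dns server", "192.0.2.53"),
    ("default gateway", "192.0.2.1"),
    ("network card", "192.0.2.10"),
]


def fake_probe(address: str) -> bool:
    """Stand-in probe: pretend only the gateway and the local card answer."""
    return address in {"192.0.2.1", "192.0.2.10"}


nearest_working = first_reachable(HOPS, fake_probe)
```

If the gateway answers but the printer and the DNS server do not, the fault most likely sits somewhere beyond the gateway rather than in my own machine.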
In my case, my printer would have been found plugged in through USB, turned on and eagerly awaiting something to print.
Next, the troubleshooter would ask "Is there paper in the printer?". A crazy question to ask, considering that just about every modern-day printer can send an "out of paper" signal to the computer.
As you can see, I'm not too hot on these troubleshooting wizards.
Giving Wizards a Spellbook
My proposal to improve the situation is a "Healing Wizard". A framework that allows a user to say "Help! Something is wrong with my computer! I can't print!" and the computer digs around inside of itself to figure out what the cause may be.
The Healer would check into the setup of the printer.
Is it the right one? Is it configured properly? Do the drivers I've picked match the one the computer thinks is the right one for it? Are there any known problems with this printer and the versions of software I've got installed?
According to the version/model/make codes the printer sends back, the computer can easily figure out which printer drivers I should be using. Based on what bus it's on, it should be able to figure out the rest.
The Healer could go on, checking daemons. Are lpd/CUPS/etc set up correctly? Are they running? If not, perhaps they should be added to the right runlevel.
Can the daemon be connected to? Perhaps my localhost isn't configured? Perhaps I've messed up my firewall rules and in attempts to block external access to lpd's TCP port, I've blocked it for myself as well.
How about permissions? Are the permissions on everything correct? On the /dev/printer device? Can I open it? How about on /var/spool/lpd?
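A permissions check of this kind might look something like the sketch below. The helper names `check_access` and `printing_permission_report` are invented for illustration, and the two default paths are simply the ones mentioned above; a real system may keep its device and spool elsewhere.

```python
import os


def check_access(path: str, mode: int) -> str:
    """Return 'ok', 'missing', or 'denied' for the current user.

    `mode` is any combination of os.R_OK, os.W_OK, and os.X_OK.
    """
    if not os.path.exists(path):
        return "missing"
    return "ok" if os.access(path, mode) else "denied"


def printing_permission_report(device: str = "/dev/printer",
                               spool: str = "/var/spool/lpd") -> dict:
    """Check the print device and spool directory (paths are assumptions)."""
    return {
        device: check_access(device, os.W_OK),
        # Creating files in a directory needs write and search permission.
        spool: check_access(spool, os.W_OK | os.X_OK),
    }
```

Each answer is something the Healer can act on directly, instead of asking Mom to run "ls -l" and read the output back.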
These are all things a Healer could check for, but would be a nightmare for a traditional troubleshooting wizard to ask Mom to check.
For me, my drivers match, my configuration is okay, lpd is running just fine, I can connect to the TCP port, all my permissions are correct. Heck, I was able to print yesterday!
Modules of Insight
In theory, it should be possible to write tiny, re-usable modules in any scripty type language (Python, Perl, PHP, etc).
Each check would test one very specific thing on a machine and return whether the check was successful.
For example, "Is localhost configured properly?", "Is the network with IP x.x.x.x up and running?", "Is DNS resolving?", "Can I ping default gateway?", "Can I ping the DNS server?", "Is it a full moon?", "Is /foo full?", "Is DRI enabled?", "Is my videocard model XYZ?", "Is my serial port in use?", "Is application XYZ running?"
These modules could be glued together and re-used to check for any number of things.
Further, specific modules could be permanently disabled in environments where they provide false positives (e.g. the default gateway silently drops ICMP packets).
As the portfolio of available checks grows, the odds that the user's problem isn't automatically fixable diminish.
Whether or not the Healer is able to solve my problem, it should send feedback to a central server somewhere so that developers can know that a) their particular check has come in handy, and b) people are having a particular problem.
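As a rough sketch of how such modules might be registered, glued together, and switched off, consider the following. Everything here is hypothetical: the `check` decorator, the `CHECKS` registry, `run_checks`, and the two sample checks are illustrative names, not an existing framework.

```python
import socket
from typing import Callable, Dict, Iterable, Optional

# Registry mapping a human-readable name to a zero-argument check function.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(name: str):
    """Decorator that registers a function as a named check."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = func
        return func
    return register


@check("localhost resolves")
def localhost_resolves() -> bool:
    try:
        socket.gethostbyname("localhost")
        return True
    except OSError:
        return False


@check("default gateway answers ping")
def gateway_answers_ping() -> bool:
    # Stand-in body: a real module would send an ICMP echo request here.
    return False


def run_checks(names: Iterable[str],
               disabled: frozenset = frozenset()) -> Dict[str, Optional[bool]]:
    """Run the named checks; disabled checks are reported as None (skipped)."""
    results: Dict[str, Optional[bool]] = {}
    for name in names:
        if name in disabled:
            results[name] = None
            continue
        try:
            results[name] = bool(CHECKS[name]())
        except Exception:
            # A check that crashes is treated as a failed check.
            results[name] = False
    return results


results = run_checks(CHECKS, disabled=frozenset({"default gateway answers ping"}))
```

On a network whose gateway drops ICMP, the ping check is listed in `disabled` once and simply shows up as skipped from then on.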
Also, it should store the problem locally. If a particular user is having the same problem, over and over, the troubleshooter may be able to offer a better, but more drastic fix over the long haul or automatically email someone.
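One simple way to keep that local history is a small counter file, sketched below. The `ProblemLog` class, its JSON file format, and the `escalate_after` threshold are assumptions made up for this example.

```python
import json
import os
from collections import Counter


class ProblemLog:
    """Count how often each user hits each problem, persisted as JSON."""

    def __init__(self, path: str, escalate_after: int = 3):
        self.path = path
        self.escalate_after = escalate_after
        self.counts = Counter()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.counts.update(json.load(handle))

    def record(self, user: str, problem: str) -> bool:
        """Store one occurrence; return True once it is time to escalate."""
        key = f"{user}:{problem}"
        self.counts[key] += 1
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.counts, handle)
        return self.counts[key] >= self.escalate_after
```

When `record` returns True, the Healer could offer the more drastic fix or email someone; the same file doubles as the log a system administrator would skim.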
System administrators can also quickly check the troubleshooter log to see what kind of problems users are having (and what kinds of changes it has made).
Not For Everyone
Healers aren't for everyone. There are people out there who have such disturbingly odd configurations that no wizard on the planet could begin to guess what to do to fix things. Odds are though, that these people aren't the ones in the market for an automatic troubleshooter anyway.
There are some obvious security considerations when using a Healer, but none that I can imagine being uncontrollable. The same is true of Healers running rampant on a system configuration and breaking more than they fix. These problems will come up during development, but they are far from insurmountable.
Back To Me
Well, this troubleshooting round has come to an end. If you guessed that the partition holding /var/spool/lpd was full, you'd be right! My Healer noticed that /var/spool/lpd was on the same partition as /var/log and suggested I delete older log files as a temporary workaround. I clicked okay, and all those "messages.N.gz" files disappeared. Sure, I'm not going to have the logs of the local skript kiddie breaking into my machine, but I can print.
If this keeps happening, my wizard would probably recommend symlinking "/var/spool/lpd" into something in my "/home" where I've got far too much room. Sure, it's a hack, but I'd much rather the system do it for me, and log that it has done this hack, than not be able to print.
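The diagnosis in this story boils down to two small checks, which could be sketched as follows. The function names and the 99% threshold are assumptions for illustration; only the two default paths come from the story itself.

```python
import os
import shutil


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some virtual filesystems report no capacity
        return 0.0
    return usage.used / usage.total


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True when both paths live on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def diagnose_spool(spool: str = "/var/spool/lpd",
                   logs: str = "/var/log",
                   threshold: float = 0.99) -> str:
    """Report whether the spool is full and whether pruning logs could help."""
    if usage_fraction(spool) < threshold:
        return "spool has room"
    if same_filesystem(spool, logs):
        return "spool is full; old logs share the filesystem and could be pruned"
    return "spool is full"
```

Note that the sketch only diagnoses; actually deleting log files should stay behind the "okay" button.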
Idcmp has been coding for far too long in far too many languages and far too many platforms. Idcmp stays crunchy in milk and has heard just about every possible variation on what Idcmp stands for, including the rare individual who still remembers. Idcmp finds it weird to talk about himself in the third person in italics and doesn't actually own a printer.
This is really a special case of a general problem: we code software in the way most convenient to us, coding only what is essential to please the people paying us. Alan Cooper has some humorous things to say about this in "The Inmates Are Running the Asylum". They want a help file, so we give them a help file, knowing that the people requesting it will never read it and the people needing it are never asked anyway.
In 25 years, I have never actually been helped by on-line help ... in any O/S. Maybe I've just had bad luck, but I always find the help describes what is going on when everything is going well, and if everything is going well, why would I be asking for help? At least open source software (and admittedly some Microsoft software) will contain a link to go online where you might find a human who can offer help through a peer-support website --- considering the fine track record of Mandrake-Forum, I am surprised they haven't folded that forum into the Mandrake Linux help system.
We've known about expert systems for twenty years, and while they failed miserably at a lot of tasks, I'm surprised we don't have multiple "help-wizard" expert systems on SourceForge. I'm surprised that HTML-Help became a standard but CLIPS did not.
Years ago I saw a presentation of a very useful expert system that would watch /var/log/syslog and report on trends. It was amazing both for what it could do and for the utter simplicity of it. For example, every sysadmin knows the telltale signs that a disk is failing, yet there are no free software systems that monitor logs and make decisions based on patterns of entries; they all report on each line of the log in total ignorance of even the line immediately before it. In addition to your Healer wizards, I'd love to see some Mystic wizards who invoke the Help/Healer because the need is imminent.
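To make the "patterns of entries" idea concrete, here is a minimal sketch that looks at a sliding window of lines rather than one line at a time. It is not the system from that presentation; `find_bursts`, its parameters, and the sample log lines are all invented for illustration.

```python
import re
from collections import deque
from typing import Iterable, Iterator


def find_bursts(lines: Iterable[str], pattern: str,
                window: int = 50, min_hits: int = 5) -> Iterator[int]:
    """Yield the 0-based index of each matching line at which `pattern`
    has matched at least `min_hits` times within the last `window` lines."""
    regex = re.compile(pattern)
    recent = deque()  # indices of matching lines still inside the window
    for index, line in enumerate(lines):
        if regex.search(line):
            recent.append(index)
        # Drop matches that have slid out of the window.
        while recent and recent[0] <= index - window:
            recent.popleft()
        if len(recent) >= min_hits and recent[-1] == index:
            yield index


# Hypothetical log excerpt: three disk errors in quick succession.
sample = ["ok"] * 3 + ["kernel: hda: I/O error"] * 3 + ["ok"]
alerts = list(find_bursts(sample, r"I/O error", window=10, min_hits=3))
```

A single I/O error is noise; several within a short window is the kind of trend a Mystic could use to summon the Healer before the disk dies.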
Just as a humorous aside, I had a no-printing bug many years ago, back when I was a starving researcher, that I will never forget. It was Epson dot-matrix days, and the print head just went really weird, like it was sort of jammed, sort of just schizoid. I poked around in the software drivers (Win3.0 in those days) and couldn't find anything, so I turned on the office light thinking maybe some paper shards had lodged under the print drive mechanism. As soon as I turned on the light, I could see hundreds of tiny cockroaches fleeing the printer! One of the most surreal experiences of my life (while not on anything) --- The "jam" was their brethren who had the misfortune to have had their eggs laid in the mechanism and were due to hatch that day when I had a rare need to print.
Many of the tests you suggest that your Healer perform sound a lot like autoconf tests. We've gotten used to scripts that test the properties of a system to get a software package to build, or to figure out if any important prerequisite is missing. It would seem to make sense in wider situations.
However, healers should not be made to be too smart. There may be a reason that some service has been disabled, so the Healer should ask before offering to restore it.
I mean that in a good way. Healing wizards sound like a very Unix approach to the age-old problem of diagnosis - and diagnosis could really use a kick in the ass on all platforms. A Unix-like solution would be the right thing.
It is kind of like the relatively modular /sbin/service set of scripts for Unix daemon control (I admit I'm a Red Hat wuss) ... it is a much better approach to the problem of controlling daemons. Compare the service scripts to the monolithic signalled approach Microsoft uses with NT services - where the start / stop criteria have to be coded into the service. Monolithology sucks, especially when the sources are not accessible.
This concept would make a great distro-funded project, or a great basis for a more user-centric new distro. The only huge problem in implementing the bits would be the relative non-standardness of the bazillion distros and *nix variants. Though, if it were modular, the bits could be written around the differences. And open source makes the bits accessible for porting to other distros.
If someone were to start a project like this, I'd certainly donate some effort. I'd like to see a Unix based system one-up the closed source operating systems ... and it wouldn't be tough. Ever tried to figure out why printing stopped on a Win32 network? Now that's a magical trip.
Take a step back and characterize the problem of diagnosing printer failure. What makes it hard to do? It is not hard because it is a 2^N algorithm, or because there is a hard real-time constraint on diagnosing the problem. It is a knowledge problem: the troubleshooting system needs a lot of knowledge about printer makes and models, print driver configuration, and compatibility between drivers and the OS. There is a lot of knowledge needed in the tool, but the troubleshooting tool will not have complete knowledge because new printers and drivers and OS versions are coming on the market all the time.
How would you implement a solution to a problem where knowledge is the part that makes it hard? You could use a normal 3rd generation procedural language and code a bunch of if-statements. You could organize your design around decision trees and tables. Or, you could make use of some of the techniques used in AI to do knowledge-based systems: expert systems, neural nets, Bayesian nets, etc.
I believe that the Microsoft printer troubleshooting wizard uses Bayesian nets. This seems to be a related paper. Unfortunately, these approaches have their drawbacks. I think they often seem flaky and hard to debug.
In my own work on ArgoUML (an open source UML tool), I faced similar problems. I wanted the tool to catch, explain, and automatically help users resolve common UML design problems. Rather than use a generic engine with a purely abstract knowledge representation, I decided to do something closer to what the original poster suggested: my design critics are simply Java methods that return true when they detect a problem and false when that particular aspect of the design seems fine. The downside is that it can be a lot of simple code. The advantage is that it is simple code, and anyone who knows Java can add a new design critic. For those users who want to help build the knowledge base but don't want to develop Java code, I proposed a graphical notation and several other low-tech methods of feedback.
I see a general trend toward more knowledge-based tools. I am sure different approaches will be used. It may be a slow trend, but I think it is a sure one.
The choice of printer setup as an example is amusing. Printer setup (at least for non-Postscript printers) in Linux is woefully bad, but there are already so many easy wins available for people who want to make it better, without having to create some amazing automatic diagnostic system. User-intelligible error messages (that don't require, for example, the user having read the IPP RFCs) would be a good start.
I picked printing as my example as many people have suffered some odd problem relating to printing at least once in their life (cockroaches or not).
jbuck's association of a Healer to an autoconf check is an excellent parallel to draw, as through years of refinement developers have come to rely implicitly on autoconf to check their system and do The Right Thing (tm). Approaching developers with this analogy would definitely ease developer buy-in.
I've played with ArgoUML and was fascinated by the critics approach. Having worked on a project where many junior developers had tools (Rose) thrust upon them without the proper experience to do good modelling, creating dozens upon dozens of unreadable diagrams, I can see how a mature collection of critics would be amazing for UML modelling.
In a normal troubleshooting environment, the knowledge system wants to minimize the questions it asks the user (à la the old 'Animal' game). With a Healer, it can automatically check everything (à la 'xmaze'), making debugging easier (as the flow is more linear).
If it were possible to use a form of introspection into each Healing module, a simple GUI modelling tool (using GEF or Dia as its basis) could be created to ease the design.
Lots of cool things people can do. :-)