The first step in my plan to take over the world is complete. I whipped up a little script that (after much pain) dumped the first five function names (more or less) from all crashes in Bugzilla (more or less) into a new table, where I can start mining them for data and experience.
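For the curious, here is a rough sketch of the extraction idea in Python. It is not the actual script. It assumes the stack traces are GDB-style backtraces pasted into comments, with frame lines like `#3  0x0805f2a1 in some_function (arg=1) at file.c:42`. The regex, the `top_functions` name, and the choice to skip unresolved `??` frames are all my own assumptions for illustration.

```python
import re

# Assumed GDB-style frame line, e.g.
#   "#3  0x0805f2a1 in some_function (arg=1) at file.c:42"
# The address part is optional, since some frames are printed without one.
FRAME_RE = re.compile(
    r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_][\w:]*|\?\?)\s*\("
)


def top_functions(comment_text, limit=5):
    """Return up to `limit` function names from the first frames found."""
    names = []
    for line in comment_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # not a frame line (or a "<signal handler called>" frame)
        name = match.group(1)
        if name == "??":
            continue  # unresolved symbol: useless for matching (assumption)
        names.append(name)
        if len(names) == limit:
            break
    return names
```

Calling `top_functions(comment_body)` on a comment containing a backtrace returns a short list of names, topmost frame first. Skipping the `??` frames is one reason the result is only "more or less" the first five functions.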
The long-term idea is that when you submit a bug, instead of the current time-consuming manual stack matching after submission, or a very slow match on the bodies of the 400K comments, a very quick query on this new table at submit time could say 'actually, we think this is a duplicate of bug XXXXX; look familiar?' Less spam for everyone, less work for bugsquad, and more accurate data on which things are reported most often.
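To make the "very quick query" concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the real database, and the table name `crash_signatures`, its columns, and the trick of joining the function names into one indexed string are all hypothetical choices of mine, not the real schema.

```python
import sqlite3


def make_table(conn):
    """Create the (hypothetical) signature table and an index on it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crash_signatures ("
        "bug_id INTEGER PRIMARY KEY, signature TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_signature "
        "ON crash_signatures (signature)"
    )


def signature(function_names):
    """Join the first five function names into a single comparable string."""
    return "|".join(function_names[:5])


def record_crash(conn, bug_id, function_names):
    """Store (or overwrite) the signature for one bug."""
    conn.execute(
        "INSERT OR REPLACE INTO crash_signatures (bug_id, signature) "
        "VALUES (?, ?)",
        (bug_id, signature(function_names)),
    )


def find_possible_duplicates(conn, function_names):
    """Return ids of existing bugs whose signature matches exactly."""
    rows = conn.execute(
        "SELECT bug_id FROM crash_signatures "
        "WHERE signature = ? ORDER BY bug_id",
        (signature(function_names),),
    )
    return [row[0] for row in rows]
```

The point of the design is that an exact match on one indexed column is a cheap lookup, whereas searching the text of every comment is not. Anything fuzzier than an exact match would need a different scheme.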
Short-term there are three things I should do. First is to resurrect and clean up the old simple-dup-finder, which ran on the same principle as this experiment, so at worst the new version is about as accurate as the old one, and a hell of a lot faster. Second is to figure out exactly how accurate it is. I think I should be able to do some queries that will be pretty revealing of how this auto-matching compares with human matching; I just have to figure out exactly how to do the comparison to get meaningful data. Third is to think about how to make this permanent. Since it all lives in a separate table, I think the upgrade risk is low: even if the table gets nuked in an upgrade, we could just re-run the script that created it in the first place. The UI implications, and how to handle them, are more complex. At least at first these scripts will all live in a separate directory and not interact with the rest of the UI, so for the time being it isn't a big issue. We'll see, of course.
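I haven't settled on how to do the accuracy comparison, but one possible starting point is to ask: of the pairs that humans already marked as duplicates, how many share a signature? The sketch below is just that idea. The `agreement_rate` name and both input shapes (a dict of bug id to signature string, and a list of human-marked duplicate pairs) are assumptions for illustration.

```python
def agreement_rate(signatures, human_duplicates):
    """Fraction of human-marked duplicate pairs whose signatures match.

    signatures: dict mapping bug_id -> signature string (hypothetical shape)
    human_duplicates: list of (duplicate_id, original_id) pairs
    Returns None when no pair has a signature on both sides.
    """
    comparable = [
        (dup, orig)
        for dup, orig in human_duplicates
        if dup in signatures and orig in signatures
    ]
    if not comparable:
        return None
    matched = sum(
        1 for dup, orig in comparable if signatures[dup] == signatures[orig]
    )
    return matched / len(comparable)
```

This only measures how many human-confirmed duplicates the auto-matching would have caught. It says nothing about false positives, meaning bugs that share a signature but aren't really duplicates, so it would be half of a meaningful comparison at best.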