Older blog entries for awu (starting at number 5)

I migrated some of the gfx links below from graphics.cs.uiuc.edu to a new host:

http://awu.textdriven.com/word/index.php?cat=3

New libre source for data mining on large data sets:

CloSpam -- closed frequent pattern mining, released under the University of Illinois/NCSA Open Source License.

Given a database of transactions, we would like to find frequent patterns in those transactions. For example, if we are running a large retailer, and maintain terabytes of customer transactions, we would like to find what items are bought together. In the data mining community, this is known as "frequent pattern mining".

The number of possible frequent patterns in a large database is exponential, so it becomes important to avoid checking every candidate.

CloSpam is a fast implementation of CLOSET, a recent data mining research paper in SIGMOD by Jian Pei, Jiawei Han, and Runying Mao.
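To make "frequent pattern mining" concrete, here is a toy pair-counter. This is nothing like CLOSET's closed-pattern machinery -- just a sketch of the problem statement, with made-up grocery items:

```cpp
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy frequent-pair miner: counts how often each pair of items appears
// together in a transaction, and keeps pairs meeting a minimum support
// threshold.  Real algorithms (Apriori, CLOSET, ...) prune the search
// space far more cleverly; this only illustrates what is being computed.
std::map<std::pair<std::string, std::string>, int>
frequent_pairs(const std::vector<std::set<std::string>>& transactions,
               int min_support) {
    std::map<std::pair<std::string, std::string>, int> count;
    for (const auto& t : transactions)
        for (auto i = t.begin(); i != t.end(); ++i)
            for (auto j = std::next(i); j != t.end(); ++j)
                ++count[{*i, *j}];          // *i < *j since sets are ordered
    // Drop everything below the support threshold.
    for (auto it = count.begin(); it != count.end();)
        it = (it->second < min_support) ? count.erase(it) : std::next(it);
    return count;
}
```

With three transactions {milk, bread}, {milk, bread, eggs}, {bread, eggs} and minimum support 2, this keeps exactly the pairs (bread, milk) and (bread, eggs). CloSpam's contribution is doing this kind of counting without enumerating the exponential candidate space.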

A proposal for trustworthy e-mail. I mentioned this idea when I (re?)thought of it, sending an e-mail to M-manda a short while ago (I guess I shouldn't really use M- here, since most people don't speak native (X)Emacs).

Sendmail creator EricA [1] suggests that "a small number of people are polluting a great medium", e-mail. He argues that spam makes economic sense, at least for the senders, and points out that current approaches put the burden on the shoulders of the recipients, rather than the senders.

Instead of arguing for direct permission-based mail, EricA believes that the only viable long-term solution is to "make spammers pay more than we do".

I suggest an alternative, which may or may not be feasible, or new, for that matter. A trustworthy e-mail can be sent from Bob to Carol iff there exists a path between them, on some social network, of length at most k, for some small constant k.

That is, instead of having to give permission to every possible sender, you only have to give permission to the smaller 1-ring of people you trust directly. Trust is then transitive, to some degree, even though it may decay exponentially with path length.

All other e-mails may be received, but perhaps get filed under "shady".

I imagine that you can also specify different levels of trust. You could say, I trust Paul to send me e-mails, but not as a gateway from other people. We can also define greater aggregate entities worthy of trust. Communities, organizations, and countries, for example.

Implementation-wise, this requires an integration of social networking software and e-mail systems (or at least interop), which is no small feat for Internet-scale topologies, but maybe, just maybe it could even work.
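The k-hop rule itself is easy to prototype in isolation. A minimal sketch, assuming a toy in-memory adjacency map (the graph and names are illustrative; a real system would need distributed, signed trust assertions, not one process's view of the network):

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of the k-hop trust check: an e-mail from `sender` is trusted by
// `me` iff breadth-first search over the social graph reaches the sender
// within k hops.
using Graph = std::map<std::string, std::vector<std::string>>;

bool trusted(const Graph& g, const std::string& me,
             const std::string& sender, int k) {
    std::set<std::string> seen = {me};
    std::queue<std::pair<std::string, int>> q;   // (person, distance)
    q.push({me, 0});
    while (!q.empty()) {
        auto [who, d] = q.front(); q.pop();
        if (who == sender) return true;          // path of length d <= k found
        if (d == k) continue;                    // don't expand past k hops
        auto it = g.find(who);
        if (it == g.end()) continue;
        for (const auto& next : it->second)
            if (seen.insert(next).second) q.push({next, d + 1});
    }
    return false;
}
```

Exponential decay of trust, or per-person "not as a gateway" flags, would amount to weighting or pruning edges in this search rather than treating every hop equally.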

Back to finishing up conference paper to be submitted to SIGxxx on social networks.

--
1: Wikipedia quote: "There is some sort of perverse pleasure in knowing that it's basically impossible to send a piece of hate mail through the Internet without its being touched by a gay program. That's kind of funny."

24 Dec 2003 (updated 24 Dec 2003 at 04:46 UTC) »

This post by raph on multidimensional interpolation points to a pretty good survey on scattered data interpolation, which is a problem I have tried to tackle in the past (and present, in a way). It would have been nice to see this survey back then :).

It's been a while since I thought about such issues, but there are interesting connections here to various problems in computational geometry and graphics.

raph also points out that the survey does not give clear guidance, one way or another, which is quite understandable given the range of problems people solve with scattered data interpolation.
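For a flavor of the simplest end of that design space, Shepard's inverse-distance weighting is only a few lines. A sketch (the exponent and sample layout here are illustrative defaults, not a recommendation from the survey):

```cpp
#include <cmath>
#include <vector>

// Shepard's inverse-distance weighting, one of the simplest scattered-data
// interpolation schemes: the value at (x, y) is a weighted average of all
// samples, with weight 1 / d(x, x_i)^p.  p = 2 is a common default.
struct Sample { double x, y, value; };

double shepard(const std::vector<Sample>& samples, double x, double y,
               double p = 2.0) {
    double num = 0.0, den = 0.0;
    for (const auto& s : samples) {
        double d2 = (x - s.x) * (x - s.x) + (y - s.y) * (y - s.y);
        if (d2 == 0.0) return s.value;           // query is exactly a sample
        double w = 1.0 / std::pow(d2, p / 2.0);  // w = 1 / d^p
        num += w * s.value;
        den += w;
    }
    return num / den;
}
```

Its well-known weakness -- flat spots at the samples and global influence of every point -- is exactly the kind of trade-off that keeps any survey from giving one-size-fits-all guidance.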

--

I've been looking through some of raph's recent posts from the perspective of data mining. Data mining is a relatively new field that tries to extract knowledge from large amounts of data. It has roots in database technology, statistics, machine learning, visualization, and algorithmics.

I'm curious how work in data mining could be applied, if it hasn't been already, to work on trust metrics. In particular, what little I've seen of such work seems often to be concerned with directed attacks, trying to prove that a system can robustly survive in the face of would-be danger, or at least gracefully degrade.

I'm not sure if work on outlier detection mainly happens in data mining or networking, but there are many difficult problems to solve, it would seem. I have heard, for example, that PayPal tries to detect attacks that involve some notion of a clique or chain. Similarly, insurance companies want to detect if a group of cars forms a path of destruction, literally, in that some nefarious gang of thieves has arranged a chain of crashes to bilk the insurance companies.

I have my own intuitions about how problems might be solved, but nothing concrete yet.

I'm not quite sure what to think about "trust". As someone who is not familiar with the literature, when I see this word, I think of sensitivity and perturbation analysis from scientific computing (numerical analysis). Given some system, how much effect can small perturbations in the input have on the resultant output? I think of robust data streams and adversary arguments from theoretical computer science.

In particular, my first thought would be to employ hierarchies as a natural way of understanding how humans trust. Human trust is not so infallible a thing, but perhaps it is a start. Then again, I reiterate my lack of knowledge of the field. I merely give my first impressions.

My initial reaction would be to be wary of trust metric systems that are completely automated, but this is probably too vague a claim. In practice, there is some human intervention.

I do find recent work, on finding ways to differentiate humans from computers, to be an interesting slant on things. Even funnier, perhaps, is that again people just bypass such mechanisms, hiring pasty-faced teens to read scribbly numbers or, in general, recognize hidden patterns, something computers can't quite do well yet.

The joke that is often made here is that computers are not adapting to us; we are adapting to them. We modulate our credit behavior so that some computer algorithm thinks well of us, and here we are putting our teens to work, having them live up to their full human potential by clearing the path for spam meisters everywhere.

Libre source:

Cache-Oblivious Implicit van Emde Boas Binary Search Trees, released today under the University of Illinois/NCSA Open Source License.

This is a very short bit of C++ code that demonstrates how to derive and implement cache-oblivious implicit van Emde Boas BSTs.

It has not been tested very much, but I have proven both the search and construction algorithms for iVEB BSTs correct (though the proofs have not been independently verified by anyone but me). It was written to facilitate writing a proof or two, rather than for actual use as a cache-oblivious data structure.

As such, it was written without performance or elegance in mind, but rather with the goals of comprehensibility and readability.
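For readers who just want the idea of the layout, here is a sketch (not the released code) that emits a perfect BST in van Emde Boas order: cut the tree at the middle level, lay out the top half first, then each bottom subtree, recursively. Split conventions vary between papers; this version gives the top recursion floor(h/2) levels:

```cpp
#include <vector>

// BFS indices (1-based, root = 1, children 2i and 2i+1) of a perfect
// binary tree of height h, listed in van Emde Boas layout order.
static std::vector<long> veb_order(int h) {
    if (h == 1) return {1};
    int h_top = h / 2, h_bot = h - h_top;
    std::vector<long> order = veb_order(h_top);     // top tree first
    std::vector<long> bot = veb_order(h_bot);       // shape of each bottom tree
    // Bottom-tree roots are the BFS nodes at depth h_top, left to right.
    for (long r = 1L << h_top; r < (1L << (h_top + 1)); ++r)
        for (long j : bot) {
            long d = 0;                             // depth of j in its subtree
            while ((2L << d) <= j) ++d;
            order.push_back(r * (1L << d) + (j - (1L << d)));
        }
    return order;
}

// Place sorted[lo..hi] into a BFS-indexed array: each root gets the median.
static void fill_bfs(const std::vector<int>& sorted, long lo, long hi,
                     long bfs, std::vector<int>& at) {
    if (lo > hi) return;
    long mid = (lo + hi) / 2;
    at[bfs] = sorted[mid];
    fill_bfs(sorted, lo, mid - 1, 2 * bfs, at);
    fill_bfs(sorted, mid + 1, hi, 2 * bfs + 1, at);
}

// Build the vEB-order array for sorted input of size 2^h - 1.
std::vector<int> veb_layout(const std::vector<int>& sorted, int h) {
    std::vector<int> at(sorted.size() + 1);         // 1-based BFS array
    fill_bfs(sorted, 0, (long)sorted.size() - 1, 1, at);
    std::vector<int> out;
    for (long b : veb_order(h)) out.push_back(at[b]);
    return out;
}
```

For seven sorted keys 1..7 this produces the order 4, 2, 1, 3, 6, 5, 7: the root, then the whole left subtree contiguously, then the whole right subtree, which is what gives the layout its cache-oblivious locality. Searching without pointers requires recomputing these index offsets along a root-to-leaf path, which is where the real subtlety of the released code lies.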

Personal notes on awu:

Grew up on LI, about to finish a MS in CS at UIUC, with a focus on graphics, with connections to data mining, machine learning, and vision.

Previous work:

WinDrunk, with UIUC ACM WinDevils 1999.

A Virtual 3D Blackboard, work with Mubarak Shah and Niels da Vitoria Lobo. Published in IEEE International Conference on Automatic Face and Gesture Recognition 2000, Grenoble.

Libre source: basic vision code in C/C++.

Project Earthlight: Tangible user interfaces and machine vision for a first-person lightsabre fighting game, EOH 2000. (Yes, that guy does look suspiciously like Shinomori Aoshi :)

SashXB for Linux, Extreme Blue program @ IBM Cambridge, MA, 2000. LGPL, hosted at GNOME.org.

Work on computer ports of Red Faction, THQ/Volition's first-person shooter, 2001.

Dance Dance Convolution, DDR clone written in x86 assembly.

Summer 2002, developer division SDE intern at MSFT, graphics and visual design technology.

Libre graphics code:

awu gfx:

- Machete 3D, a multiresolution mesh editor and signal processor.

- Fast Furrier Transform -- dynamic fur visualization.

- Terrify -- modern, simple, efficient, terrain rendering and level of detail system.
