Older blog entries for raph (starting at number 190)

Ok, the response to my bit on Soap security was pretty negative. I didn't communicate what I was trying to say very well. Apologies.

Soap is probably not any worse than CGI or other Web things. But my point is that it's not any better, either. I guess it all depends on whether you think that's good enough. (I don't.)

Firewalls are pretty useless. The main security implication of going through port 80 is that it eliminates the need to explicitly open a Soap-specific port in the firewall, on the off chance it wasn't left open anyway. So talking about Soap and firewalls isn't very enlightening. In any case, I believe Dave when he says that Soap wasn't explicitly designed to go through firewalls.

PhinisheD

This afternoon, I helped Heather print out the final paper copies of her dissertation. Tomorrow, she hands it in for binding in the library and microfilming. The entire time I've known her, this has been a goal to work towards. Now it's actually done.

For historical reasons, the thesis is in troff format. I consider myself a smart guy, but fixing things was a usability nightmare. I had to rely on a mix of man pages, "helpful" web tutorials, a book on troff, and the groff s.tmac source. Even then, I ended up hand-tweaking a bunch of stuff. Ugh.

I helped format some Chinese hanzi as illustrations. I feel absolutely terrible about the fact that I used a Japanese font even though the context was Chinese, but didn't want to spend the time to redo all the eps's. Fortunately, all of the hanzi are quite similar visually to the corresponding Japanese kanji, so it's pretty unlikely that anyone will notice, much less care. rillian accuses me of perpetuating Unicode FUD, though. I guess I'll just have to live with that on my conscience.

Kids

Alan is resisting doing homework with me, and also resisting reading. I'm not sure why. He has a lot of anxiety, and these days seems to be expressing it a lot more.

Max loves cooking. Tonight he helped me make a stir-fry. He had a little difficulty stirring, and decided that he needed a "big" spatula. When I told him that it was a big spatula, he asked for "giant". I thought it was cute, anyway :)

It's fun to play the "can you say X" game with both kids. For Max, the edge is currently "peanut butter jelly belly". A week or so ago, he had trouble saying this, usually dropping one of the words. Alan has no trouble with "uvulopalatopharyngoplasty". Finding harder words is going to be challenging :)

Expensive infrastructure

I'll try my best not to make David McCusker's head explode tonight.

His IronDoc and my Athshe are both examples of "expensive infrastructure." To do a good job implementing the ideas requires a lot of complex code.

David again makes the crucial distinction between interface complexity and implementation complexity. Ideally, you have a system which is easy to use, but might do very smart, sophisticated things under the hood. In the best case, doing things well at a lower level actually makes things simpler in the higher levels.

Filesystems are the perfect example. The basic interface is pretty simple. In fact, files and directories are the basic concepts exported to the UI, even in systems designed for novice users. Yet, you have implementations such as XFS, which weighs in at 140 kloc or so.

Simpler implementations are possible, of course. In fact, the only thing that could possibly justify this kind of complexity is to be a lot better. And if you compare a modern filesystem implementation against, say, the original Minix filesystem in Linux (about 2000 lines of code), there is no contest. Performance, scalability, and robustness have all improved dramatically over the years.

Yet, that original 2 kloc implementation was important. If you had told Linus that he had to implement a highly scalable journalling filesystem before he could release the Linux kernel, the project would have been dead in the water. As the kernel became more popular, however, the payoff for more sophisticated implementations became clearer. For one, a better filesystem implementation improves matters for all of the applications (totalling many millions of lines of code, no doubt) running on Linux, without making those apps any more complex.

What is the "right level" of complexity for something like IronDoc? I'm not sure. It's not popular enough yet for the amortized payoff argument to work. This is a bit of a Catch-22; it won't become popular until it's reasonably good. The usual way around this dilemma is to build a much simpler prototype first.

I think this kind of "expensive infrastructure" is a good place for highly talented programmers to spend their time. It is a theme which runs through quite a bit of my work, including, today, Ghostscript and the continuing evolution of its graphics library.

Soap and security

Security experts such as Bruce Schneier and Alan Cox are rightfully concerned about the security vulnerabilities of new systems built on protocols such as Soap. At his ETCon talk, Schneier quipped that Soap is "firewall-friendly" in the same way that a bullet is "skull-friendly".

Dave Winer's response is pigheaded. In fact, his new "standard response" says almost nothing about security.

Dave has asserted that getting through firewalls was not an explicit goal of Soap. Could be; doesn't matter. The fact is that it will be used to get through firewalls. More precisely, firewall configurations based on Web assumptions will be used to filter traffic that breaks these assumptions. In general, you have to look pretty deeply inside a Soap message to figure out how much (if any) damage it can do.

Firewalls are only part of the problem. Expect a flowering of attacks that simply request things through Soap unanticipated by the designers of the system. With highly dynamic scripting languages, it's even harder to imagine all the ways that a clever attacker might put together the pieces made available to him; I am sure that the SOAP::Lite security hole was merely the first of its kind to come to light.

In theory, it should be possible to create secure services based on Soap. In practice, it's just not going to happen. As Schneier argued so well in his talk, increasing security is expensive. In addition to the labor and expertise needed for audits, paying attention to security slows down development, curbs features, and generally makes software less competitive in the market. The nature of the software industry guarantees that products will be developed and shipped with security holes.

I'm not saying that Soap is a bad thing. Indeed, for many applications, the advantages surely outweigh the disadvantages. Also, if you try to build those same applications with competing techniques, there will no doubt be security vulnerabilities from those, as well. What I am saying is that people should be conscious of the security problems that history and reflection tell us will occur.

I met Dave briefly at ETCon. I'm sorry we didn't have a chance to chat more.

Trees, again

A general format/protocol/API suite for trees such as Athshe is a pretty complicated piece of code. McCusker's blob spec alone is daunting in its complexity. I'm trying to get a sense of where the complexity is coming from. Obviously, some of it is optimization, for example trying to maximize leaf occupancy, or to minimize the number of mutated nodes in a copy-on-write case.

McCusker asks what I mean about network protocols and file formats. His observation that disk access and network latencies are similar is spot-on. I am not really talking about tuning the optimization. My main concern is concurrency. A network protocol must deal with locking and change notification. Transactions would be nice too.

Consider the read-only case first. Concurrency becomes a non-issue. In fact, a perfectly good protocol is to request blocks from a btree file format. If the block size divided by the bandwidth is comparable to the request latency, then you're keeping the connection reasonably full (I'm not worried about small factors).

If you do care, you can hide the latency in a number of ways too, including prefetching. If the client task is to fetch a complete subtree, it can pipeline several requests, hiding almost all latency without the risk of fetching unneeded blocks. However, this requires asynchrony in the client API, a source of complexity. Such is the tradeoff.
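Here's a minimal sketch of what that pipelining might look like, assuming a hypothetical connection object with send_request()/read_reply() and interior blocks that carry a "children" list of block ids; the point is just that requests for a whole subtree can be issued before earlier replies arrive, so per-request latency mostly disappears.

```python
# Sketch: pipelined fetch of a complete subtree over a block protocol.
# 'conn' and the block layout (a dict with a "children" list) are
# assumptions for illustration, not part of any real format.

from collections import deque

def fetch_subtree(conn, root_id, window=8):
    """Fetch every block under root_id, keeping up to 'window' requests in flight."""
    pending = deque()          # block ids requested but not yet received
    to_request = [root_id]     # block ids known but not yet requested
    blocks = {}

    while to_request or pending:
        # Keep the pipe full: issue requests up to the window size.
        while to_request and len(pending) < window:
            block_id = to_request.pop()
            conn.send_request(block_id)
            pending.append(block_id)
        # Consume one reply; any children it names get queued for request.
        block_id = pending.popleft()
        block = conn.read_reply()
        blocks[block_id] = block
        to_request.extend(block.get("children", []))
    return blocks
```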

I'm not going to go into the details of adding network writes tonight. A simple locking discipline, with caching in the kernel, might work well in a concurrent local disk case. The client drills down from the root every time it accesses the tree, but these requests will usually hit in the cache. If you did something similar over the network, latency could kill you.

One approach is for the server to invalidate blocks (possibly) in the client's cache. The response to a write lock request guarantees that all such invalidations have been sent (we assume message ordering). The bandwidth for the invalidations is trivial compared with the blocks themselves. The main cost is the latency round-trip for the write lock. Also, contention for the write lock could be a serious problem.
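The client side of that locking discipline could be as simple as the sketch below; the message names (fetch_block, request_write_lock, the "invalidated" list in the reply) are invented, but they capture the key property that all invalidations arrive no later than the lock grant.

```python
# Sketch: client side of a write-lock protocol with cache invalidation.
# Message names and reply shape are invented for illustration.

class CachingClient:
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}          # block id -> block contents

    def read_block(self, block_id):
        if block_id not in self.cache:
            self.cache[block_id] = self.conn.fetch_block(block_id)
        return self.cache[block_id]

    def begin_write(self):
        # One latency round-trip. The reply carries every invalidation the
        # server has queued for us, so our cached path from the root is
        # trustworthy for the duration of the lock.
        reply = self.conn.request_write_lock()
        for block_id in reply["invalidated"]:
            self.cache.pop(block_id, None)
```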

A much more sophisticated approach is to optimistically send write requests under the assumption that the relevant blocks are still valid. The server checks, and only commits if so. Invalidations are still useful as a way of improving the chances of a valid commit, but are not needed for correctness. In any case, cleaning up in case the optimism was unwarranted is a challenge.
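On the server side, the optimistic check might look like this; per-block version numbers and the commit message shape are assumptions for illustration, not a spec.

```python
# Sketch: server-side validation for optimistic writes. The client sends
# the versions its edit was based on; the server commits only if none of
# those blocks have changed in the meantime.

def try_commit(store, commit):
    """commit = {"based_on": {block_id: version}, "writes": {block_id: data}}"""
    for block_id, version in commit["based_on"].items():
        if store.current_version(block_id) != version:
            return False        # stale: client must refetch and retry
    for block_id, data in commit["writes"].items():
        store.write_block(block_id, data)   # bumps the block's version
    return True
```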

Cache invalidation is related to change notification, which is the stronger property of actually notifying the application when a node has changed. Change notification is essential for the Model/View paradigm, as it is what triggers recomputation of the View. Doing it generally and efficiently is a hard problem, but one that has been studied in detail. Many new designs include some ad-hoc form of change notification. Polling is very simple and robust, but is not very efficient, especially if you care about notification latency. Thus, you see crude things such as the <cloud> interface in RSS (and OPML). The prospect of solving the problem well is a good motivation for Athshe.
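To pin down what "notifying the application" means, here is a toy per-node notification scheme: views register callbacks on nodes, and a mutation walks up the tree firing them. The API is invented; a real design has to worry about granularity, batching, and ordering, which is where the hard part lives.

```python
# Sketch: per-node change notification for Model/View. Invented API.

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}
        self.listeners = []      # callbacks interested in this subtree

    def add_listener(self, callback):
        self.listeners.append(callback)

    def set_child(self, name, child):
        self.children[name] = child
        self._notify(name)

    def _notify(self, name):
        # Fire listeners on this node and every ancestor, so a view
        # watching any enclosing subtree sees the change.
        node = self
        while node is not None:
            for cb in node.listeners:
                cb(self, name)
            node = node.parent
```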

I'll write about these ideas in more detail later.

20 May 2002 (updated 20 May 2002 at 20:18 UTC) »
Doctor Mommy

Heather's PhD graduation ceremony was today. She has two of three signatures, and still needs to submit an archival copy, but otherwise is basically done. I am so proud of her.

The valedictory speech by Daniel Hernandez (also known as the editor of the Daily Cal) was very good. Prof. Christ's speech was also good. Prof. Adelman's speech was not bad, but had a somewhat defensive tone regarding the importance of English vs other more ostensibly practical pursuits, including a potshot at computer science.

Heather's father and youngest sister were there, as were the kids and I. Also, Marty and Jan. Jan seemed offended that I shook her hand in parting (as opposed to a hug). Oh well.

Low tech trust vs. high tech trust

I had an interesting discussion with Roger Dingledine and Luke (hacker-name armitage) over lunch at ETCon about the importance of manual input into attack-resistant trust metrics. Advogato uses the explicit,

[shit - this part of my entry got deleted - anyone have a cached version? Add "retrieve previous version" to m/ wishlist.]

Outline follows:

Trust metrics need manual input. Advogato uses manually-entered trust links. Google uses hyperlinks, which are usually manually entered.

"Low tech trust" is a pattern of manual information flow across trust links. Example: links to recommended content, represented by "hot lists" in the early days of the Web, blogrolling today. You can find a well recommended blog by starting with one you know, and clicking blogroll links. Google's PageRank calculates something similar.

Example: personal music recommendations. Automated recommendation systems (Firefly, Amazon) don't work as well as just getting recommendations from friends.

Both examples are attack-resistant, borrowing the analysis from trust metrics. In blog rolls, trust graph is explicit. In music recommendations, trust graph is implicit in flow of recommendations (they only flow between friends).

Low tech trust is good. Automated computation is harder, but opens interesting doors. Example: manual clicking on links is good at finding high quality random site, but Google is good at finding high quality site relevant to search term (and quickly, too). Killer apps for trust metrics will probably follow the same pattern.
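For concreteness, the "Google calculates something similar" point can be read as a PageRank-style random walk over the blogroll graph. Here is a minimal power-iteration sketch; the damping factor and the toy graph are illustrative, not Google's actual parameters.

```python
# Sketch: PageRank-style scoring of a blogroll graph by power iteration.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each blog to the list of blogs it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[n] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# A reader clicking blogroll links at random tends to land on high-rank blogs.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```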

Trees

Why am I so fascinated by trees?

One of my pie-in-the-sky projects is Athshe, a somewhat nebulous data format for trees. The writeup isn't that good, apologies. In any case, I've been kicking the ideas around in my head for quite some time now, and I still think it's worthwhile.

I'll probably write about some of the more detailed pieces later, but this time I want to get the large-scale picture across.

I think formats for tree-structured data have many intertwingled aspects. It might help to consider XML and filesystems as two special cases.

Abstract tree language

This is basically the language of what's allowed. Obviously, there are trees. The children of a node may be primarily sequential (as in XML), or primarily mapped by name (as in filesystems). What's in a leaf is also important - Unicode text in XML, arbitrary binary in filesystems.

It's good for the abstract tree language to be simple. Often, if you want more complex stuff, you find you can map it into simpler types fairly easily.

Athshe has exactly four types in the language: list, map, bytes, and tag (similar to MIME Content-Type). It's an appealing set, but I'm not sure it's quite right. In particular, I get the feeling that references (perhaps analogous to the # part of URL's) may be important too.
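Here's a sketch of what those four types might look like as plain Python data, assuming the obvious mapping (list to list, map to dict, bytes to bytes, and a small wrapper carrying a MIME-like type string for tag). This is just my reading of the idea, not a definitive Athshe spec.

```python
# Sketch: the four Athshe value types as Python data. The Tag wrapper
# and this particular mapping are assumptions for illustration.

from dataclasses import dataclass
from typing import Union

@dataclass
class Tag:
    content_type: str            # MIME-like, e.g. "image/png"
    value: "Value"

Value = Union[list, dict, bytes, Tag]

# Example tree: a map whose leaves are tagged or raw byte strings.
doc: Value = {
    "title": Tag("text/plain;charset=utf-8", "Hello".encode()),
    "children": [b"raw leaf", {"nested": b"another leaf"}],
}
```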

Serialization

Once you have a tree in hand, you want to serialize it, so you can store it in a file or move it over the net as one chunk. Also, because parsing is so well understood, it's often more useful to talk about the serialization than the abstract tree language.

For a given tree language, there may be many different serializations. In the case of filesystems, there's tar, pax, zip, and many others. XML is mostly just XML, but there's stuff like wbxml too. ASN.1 is another example of an abstract tree language with many different encodings.
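To illustrate how many serializations can share one abstract language, here is a toy length-prefixed encoding of the four types, reusing the Tag wrapper from the sketch above. The type codes and framing are made up; they have nothing to do with tar, wbxml, or any real format.

```python
# Sketch: one possible (invented) serialization of list/map/bytes/tag.
# Each value is a one-byte type code, a 4-byte big-endian length, and a body.

import struct

def dump(value):
    if isinstance(value, bytes):
        return b"B" + struct.pack(">I", len(value)) + value
    if isinstance(value, list):
        body = b"".join(dump(v) for v in value)
        return b"L" + struct.pack(">I", len(body)) + body
    if isinstance(value, dict):
        body = b"".join(dump(k.encode()) + dump(v) for k, v in value.items())
        return b"M" + struct.pack(">I", len(body)) + body
    if isinstance(value, Tag):                       # Tag from the sketch above
        body = dump(value.content_type.encode()) + dump(value.value)
        return b"T" + struct.pack(">I", len(body)) + body
    raise TypeError(value)
```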

Access API

If you want to actually process trees, you'll need an API to access the tree data. In the case of XML, the API of choice is DOM. In the case of filesystems, it's the traditional Unix open/read/write/close API, or a variation.

The access API has many implications for performance. For one, grain size is important. For another, it's important for nodes to be very, very cheap, but active node references (open file descriptors) can be much less cheap. DOM doesn't quite get this right.
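One way to read the "cheap nodes, less cheap active references" point: a node reference can be nothing more than a path into the tree, while an open cursor pins blocks in a cache and must be explicitly closed. The class names and the pin_blocks() call below are invented for illustration.

```python
# Sketch: cheap node references vs. heavier open cursors. Invented API.

class NodeRef:
    """Just an address into the tree; creating millions of these is fine."""
    def __init__(self, tree, path):
        self.tree, self.path = tree, tuple(path)

    def child(self, name):
        return NodeRef(self.tree, self.path + (name,))

    def open(self):
        return Cursor(self.tree, self.path)

class Cursor:
    """Pins the blocks along the path in the cache; must be closed."""
    def __init__(self, tree, path):
        self.blocks = tree.pin_blocks(path)    # hypothetical tree method

    def read(self):
        return self.blocks[-1].payload

    def close(self):
        for b in self.blocks:
            b.unpin()
```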

Another important API consideration is how calls map across the process boundary. For files, most calls cross the process boundary, which generally implies that files are not good for very small grains. However, note that fread() and fwrite() manage to cache a goodly fraction of file accesses within the process.

This is a good place for complexity to set in. There are all kinds of important issues in real systems only incidentally related to the abstract data language, such as asynchrony (nonblocking I/O), integrity (see fsync), access control, and so on. Applications may also want generalized undo (versioning), copy-on-write, transactions, and more.

Finally, the access API is a reasonably good place for change notification to happen. Unix has been traditionally deficient in this area; it's a standard feature in NT. In the case of XML, DOM has it, but with flaws.

Protocol

It used to be enough to access a tree on a local disk or in memory. Now, you usually want to do it over the network too. For this, you need a protocol.

Protocols for remote access to filesystems are a very rich, fertile, and increasingly mature area. People use this stuff heavily, though not quite yet universally. Performance is important, and there are lots of clever ideas. A major performance goal is to cut down on the number of network boundary crossings.

In the XML world, the most important protocol is HTTP, which basically gives you the entire tree as one chunk. You can also run DOM over CORBA, as described in the DOM spec, but the performance would be horrible. There's no good general way to access a large tree incrementally.

Random access file format

Filesystem trees typically don't get manipulated as tarfiles. Usually, the filesystem implementation stores the tree on disk in some random access format. Only a small part needs to be in RAM, and it's pretty quick to get to an arbitrary place in the tree. Again, this is a mature area. In the case of filesystems, it's almost a solved problem.

XML doesn't really have a random access file format. There are such things as "hierarchical databases", but they are not popular.

The main idea of Athshe is that a simple abstract data language makes the problem of building a high-performance suite of tools more tractable. The major performance theme is scalability. Operations on a small local tree should be almost as fast as a pure in-memory implementation. On the other end of the scale, it should be possible to access huge remote trees efficiently. Chunk size in your process and network boundary crossings should match the chunk size of the application; in particular, traversing a complete subtree should resemble transferring it as a serialization.

I find it very difficult to communicate why this is such a good thing. But see a recent thread on wmf's discussion page, which reminded me of my Athshe ideas.

MLP

AaronSw's ETCon schedule with weblog links is useful and interesting. One day soon, most conferences will be annotated like this, probably in realish time.

17 May 2002 (updated 17 May 2002 at 23:07 UTC) »

I'll start by responding to a few threads here, namely simonstl on ASN.1, and sej on EPS/PDF.

ASN.1

simonstl recently brought up ASN.1 as a possible alternative to XML. It's worth looking at, if for no other reason than to be aware of history so you don't have to repeat it. The advantages are clear: it is much more compact than XML for binary-oriented data, and it is quite mature. However, the associated schema language isn't especially nice (schema language design is a very hard problem), and it carries quite a bit of historical baggage.

In particular, implementations have tended to be horrible. In the olden days, most implementations were "compilers" that autogenerated C code to marshal and unmarshal ASN.1 data in and out of C structures. This mostly-static approach loses a lot of flexibility, of course. Code quality is another issue. I am unaware of any general ASN.1 tool that approaches the quality of open source XML tools such as DV's excellent libxml (although it's possible things have improved since I last looked).

Another major piece of baggage is X.509, a truly horrid data format for public key certificates using ASN.1 as the binary syntax. Most sane people who have encountered it run away screaming. See Peter Gutmann's x.509 style guide for a more detailed critique.

So it's worth looking at ASN.1, but it's certainly no magic bullet.

Incidentally, ASN.1 is one of many attempts to define low-level binary data formats for tree structured data. I hear good things about bxxp/beep, but haven't studied it carefully enough to critique it. Off the top of my head, there's sexp, which is the binary data format for SDSI (the name reflects its kinship with Lisp S-expressions). Further, most RPC mechanisms define binary encodings. Don't forget IIOP (Corba), XDR (Sun RPC), and so on. I'm sure all have advantages and disadvantages.

What exactly are the goals? As I see it, the main problems with XML are bloat (for general data; XML bloat in marked up text is quite acceptable) and complexity (XML is only medium-bad here). Binary formats help a lot with the former, and may or may not with the latter. But there are a lot of other considerations, including:

  • Quality of implementations (XML wins big).

  • Associated schema languages (messy; importance varies).

  • Suitability for dynamic mutation (XML's DOM is a reasonable, if inefficient, API for local tree manipulation; most other formats don't have anything similar, and their more rigid nature would make it harder).

  • Scaling. XML itself doesn't scale particularly well to huge trees, but is usually implemented in conjunction with URI's, which do. A binary data format can either help or hinder this goal.

Update: ASCII vs Binary, by David Reed.

Transparency, PostScript, PDF

sej brings up the desire to handle vector objects with transparency. Basically, PostScript can't do it, and probably never will. PDF 1.4, on the other hand, has very rich transparency capabilities. It's a bit more complex, and caution is needed when dealing with a standard controlled by a highly proprietary organization such as Adobe, but it has compelling technical advantages.

Here are two signs that the rest of the world is moving to PDF: pdflatex appears to be far more vital than dvips, and Mac OS X uses PDF for its imaging and printing metafile.

ETCon

I went to the O'Reilly Emerging Technology Conference (formerly known as the P2P conference) on Wednesday. It was great fun. I met a lot of people, some of whom I've known for years (such as Jim McCoy). Others I've known online, but met for the first time in person (Kevin Burton, Aaron Swartz, Dave Winer).

I particularly enjoyed hanging out with Jim, Roger Dingledine, Bram Cohen, Zooko, and Wes Felter. It is such a luxury to be able to interact with people on such a deep level. In my opinion, the best people to associate with are those you can teach and learn from. Thanks all of you for making my Wednesday a day of intense teaching and learning.

A lot of people have heard of my work, and are reading my thesis. I think Roger is largely to blame for this; a number of people mentioned hearing about it from him. I had great conversations with the MusicBrainz and nocatauth folks, both of which would be killer applications for attack-resistance.

I have this uneasy feeling there's something important that I left out. Also, one risk of naming names is that people might infer something from a name being left out. Not to worry - I genuinely enjoyed meeting everybody I met at ETCon.

McCusker

Reading David McCusker's log fills me with sadness. The personal troubles he's going through resonate deeply for me, as I've been through similar things. I'm in a much happier space now, for which I consider myself very fortunate.

His technical work is also quite compelling. When we meet, one of the main things I want to talk about is whether to join forces on some of our pie-in-the-sky dream projects. Collaboration can be a nightmare when the other person doesn't understand what you're trying to do, or particularly share your goals. I get the feeling that working with him might be otherwise.

David talks about being busy and not having much time for things. I can certainly relate to that. He does, of course, have time to meet with me, whether he realizes it or not. Other things will of course slide, so it's worth considering their importance in the long run.

11 May 2002 (updated 11 May 2002 at 23:00 UTC) »
Runtimes

I've been thinking quite a bit about runtimes recently. It's an interesting topic in general, but I also have a more specific interest: I'm designing Fitz, a next-generation 2D graphics library, and I want this library to be as useful as possible. So I want to avoid gratuitous runtime incompatibility.

Bertrand Meyer's Polyglot Programming is a very insightful discussion of some of the consequences of Microsoft's Common Language Runtime (CLR). Here's a particularly interesting quote:

The language openness of .NET is a welcome relief after the years of incessant Java attempts at language hegemony. For far too long, the Sun camp has preached the One Language doctrine. The field of programming language design has a long, rich history, and there is no credible argument that the alpha and omega of programming, closing off any future evolution, was uttered in Silicon Valley in 1995. Microsoft's .NET breaks this lock.

Absolutely. What the CLR provides instead is a runtime hegemony. It seems to be a fairly high quality attempt, and is well on its way to achieving huge market share. The idea of a runtime hegemony provides almost all the advantages of a language hegemony, with the added advantage of providing a much easier migration path for components written in non-native languages to be assimilated into the runtime. Do note the echoes between Bertrand's discussion of libraries and Paul Graham's in Being Popular.

It remains to be seen, I believe, whether it's worth writing new CLR code in any language other than C#. It's a nice enough language, and you have the advantage of not having to worry about the language-to-runtime mapping Bertrand mentions towards the end of his essay. But other languages might turn out to provide compelling advantages over C# for this environment, and the CLR will, as Bertrand argues, support them.

There is now a flowering of thought, design, and implementation of runtimes on the free software side. This world is resistant to hegemony. Instead of one dominant runtime, we will see lots of different approaches. In the short term, this is a serious disadvantage compared with CLR, because it doesn't provide a good story for people (such as myself) who just want to write software. In the long term, I think, it could lead to much stronger working knowledge about how to knit together systems out of disparate components.

I'll list here a few selected projects I find interesting, with some comments focussed on runtime and integration.

  • CPython. The CPython runtime is carefully layered on top of the C runtime, supporting the highly dynamic Python, while allowing full access to the wealth of C libraries. In fact, CPython + C can be seen as an "aggregate language", actually a fairly compelling platform. Other language implementations, such as Ruby, fall into this class as well.

  • Pyrex. Pyrex is essentially a hybrid language with syntax and semantics fairly similar to Python, but able to call C directly. Thus, we essentially have a 3-language aggregate, consisting of CPython + Pyrex + C.

  • Lisp. Lisp is a mature language, with mature, high quality implementations. Most Lisp implementations provide a Foreign Function Interface (FFI), which is capable of directly calling C. However, the details of the FFI tend to differ from implementation to implementation, so you can't really have a Lisp + C aggregate, but rather a family of CMUCL + C, AutoLisp + C, etc. See Design Issues for Foreign Function Interfaces for a much deeper discussion. Also note Arc, which could well become the most compelling Lisp dialect.

  • Parrot. See this interview with Dan Sugalski for much more info on why Parrot is interesting.

  • Ruby/Python. This project attempts to knit Ruby and Python together. I'm including it largely because it illustrates some technical difficulties in integrating runtimes: the last posted version integrates Python 1.5 (which uses reference counting) with Ruby (which is fully gc'ed). It's relatively straightforward to integrate N runtimes when only one is gc'ed - it is the master, and all others are slaves. It's far harder to have more than one, and Ruby/Python seems not to have grappled with the problem yet.

  • JVM (Java) + JNI. This is still a reasonable approach, but free implementations are still not very good. They will get there in time. Note also Jython, which knits the Python language into the JVM runtime.

  • Swig, an "interface language" for easy integration of C and scripting languages. One of the most fascinating aspects of Swig is that a single interface can generate wrappers for many different scripting languages.

  • Corba. Most people these days use Corba to autogenerate low quality network protocols, layered on top of IIOP. However, one of the original goals of Corba was to facilitate integration between a number of different languages. It is a mature standard, and fairly well implemented even in the free world, but is fairly painful in practice.

  • Mono and DotGNU, free clones of Microsoft's CLR.

These projects have widely varying goals and ambitions. Some will no doubt succeed, others fail. I expect our community to learn a lot in the process. And perhaps, some day, programmers can sit down and write code without facing difficult decisions about which runtimes to be compatible with, and which to be incompatible with.

Update: Stacking them up: a Comparison of Virtual Machines by John Gough. A well-written, technical comparison of JVM and CLR.

A poem on awakening

Twelve people emerge from sleep camp
Untroubled by dreams
Completely oblivious

3 May 2002 (updated 4 May 2002 at 00:45 UTC) »
Kademlia

I think it was Zooko who pointed me to Kademlia. It's good. In particular, it's a lot less fragile than Chord because of its XOR metric.
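The XOR metric itself is tiny: the distance between two node ids is just their bitwise XOR, read as an integer, which is what makes the routing symmetric and the lookups less brittle. A minimal illustration (the node ids here are toy values):

```python
# Sketch: Kademlia's XOR distance and picking the closest known nodes.

def xor_distance(a, b):
    return a ^ b

def closest_nodes(target, known, k=3):
    """Return the k known node ids nearest to target under the XOR metric."""
    return sorted(known, key=lambda n: xor_distance(target, n))[:k]

print(closest_nodes(0b1011, [0b0001, 0b1000, 0b1110, 0b1010]))
```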

Of course, it's not attack resistant yet. I'm sending mail to them to point to Chapter 7 of my thesis, which is actually very similar in a lot of ways to what they're doing. My problem is that I haven't gotten my ideas published yet.

Mail

I see more and more recognition that email is becoming unusable because of spam. The recognition is a very good thing, largely because it funds and motivates research into spam-resistant messaging. There's now a ton of work on the problem, exploring many different directions. Also encouraging is the fact that both free software and commercial products and services are well represented.

Making email resistant to spam is a hard problem. I believe that attack resistant trust is a very good approach. In particular, I believe that the stamp trading network in Chapter 7 of my thesis is likely to do a good job filtering out the spam without trimming the legitimate messages. Of course, the only way to test this hypothesis is to build a prototype. I don't have the time right now to do so myself, and so far I haven't been able to interest anyone else in it either.

Until quite recently, I'd believed that email would have to be completely replaced by a better system. The existing email infrastructure, of course, not only lacks any kind of attack resistance, but also lacks any form of authentication, which is needed in the next level down. Thus, I reasoned, to solve these problems, you have to build something new.

But a few days ago, in a conversation with Jeff Darcy, I realized that this is not necessarily the case. Highly sophisticated email filtering clients will, almost inevitably, implement a p2p network with good trust properties. Vipul's Razor is already a significant step in the right direction.

Such a client could then easily implement a real attack resistant trust metric. If both the sender and receiver clients participate in compatible p2p networks, then approval for the message could be expedited. Otherwise, the message would have to take its chances.

Assuming that this trust-based message approval works well, then in time it will probably get a lot of penetration. When that happens, you could probably just turn off untrusted smtp delivery.

But how do you know in advance whether a trust system will work well for email? Building it and fielding it in the existing email infrastructure is really hard. Plus, there are lots of different ideas on how to do this (my stamp trading network is but one). How do you know which one is best?

To me, the answer is clear: build a prototype. Such a prototype would not have to work with the existing email infrastructure, or attract a huge user base (just large enough to test whether it is spam resistant). Once the prototype proves the ideas, then the difficult work of integrating into the existing email network can begin.

Subpixel positioning

I have a hacked up version of Ghostscript that does unhinted antialiasing with subpixel positioning. See before and after screenshots.

Now that I've posted the screenshots, I'll have to make the patch really work. I have code for making the cache subpixel-aware, but it's not 100% yet (the "after" screenshot was made with caching disabled). Also, I'll need to add a few configurable parameters, especially to turn off hinting by default in aa mode (right now, it's compile time).
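The cache change mostly amounts to folding the quantized subpixel phase into the cache key. Here's a sketch of the idea; the four-phase quantization and the render_glyph() call are placeholders for illustration, not Ghostscript internals.

```python
# Sketch: making a glyph cache subpixel-aware by including the quantized
# fractional pen position in the key. Not actual Ghostscript code.

PHASES = 4

def glyph_key(font, glyph_id, size, x):
    frac = x - int(x)                       # fractional pen position
    phase = int(frac * PHASES) % PHASES     # quantize to one of 4 phases
    return (font, glyph_id, size, phase)

cache = {}

def get_glyph(font, glyph_id, size, x):
    key = glyph_key(font, glyph_id, size, x)
    if key not in cache:
        phase = key[-1] / PHASES
        cache[key] = render_glyph(font, glyph_id, size, phase)  # hypothetical renderer
    return cache[key]
```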

Denver

Denver was fun. We went up to Winter Park on Monday, and the kids got to play in the snow a bit.

Alan tested, as we expected, highly gifted. Less expected was the large discrepancy between verbal skills (off the scale) and visual/spatial skills (mazes and the like), at which he is basically average. They recommended an optometry consult. That will be interesting.

When we got back, I think Max made his first pun. He said, "miss spider", which either meant that he missed his favorite spider book, or that he was referring to more of the title words. Of course, it's possible I'm just reading the pun into what he said, but even so I wouldn't be surprised if it were for real. All the people around him love language and delight in puns, and he's a quick study.

MLP

Desmond Tutu: Apartheid in the Holy Land. This is one of the sanest things I've read, about an insane conflict.

Maze

David McCusker posted a very nice maze recently. I solved it this morning, and it took a gratifyingly long time - I'm used to solving mazes in my head.

Someone posted a link to Olin Shivers's maze page. Olin's and David's mazes are both mazes, but there the similarity ends. David's maze, hand-drawn, has a distinctly organic flavor. More than that, it has an architecture. The different locations in the maze are places. In Olin's mazes, they're just grid coordinates.

In any case, David posted a challenge to me, to come up with an image processing algorithm to represent images in the line widths. I think the right algorithm is Ostromoukhov's Artistic Screening. I don't have an implementation handy, but I faked something similar in the Gimp. The results weren't that satisfying - I was hoping for more crispness in the image. Here's the source image, for reference.
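What I faked in the Gimp can be approximated by something like the following: sample the source image along each maze line and make the stroke width proportional to local darkness. This is a crude stand-in, not Ostromoukhov's artistic screening, and the array layout is an assumption.

```python
# Sketch: modulate line width by local image darkness (a crude stand-in
# for artistic screening). 'gray' is a 2D array of luminance in [0, 1],
# indexed as gray[y][x].

def widths_along_segment(gray, x0, y0, x1, y1, samples=50, max_width=4.0):
    """Width at evenly spaced points along a line segment; darker = wider."""
    widths = []
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(x0 + t * (x1 - x0)))
        y = int(round(y0 + t * (y1 - y0)))
        darkness = 1.0 - gray[y][x]
        widths.append(max_width * darkness)
    return widths
```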

Life

I fly tomorrow to meet the kids in Denver. Kinda surprising how much I miss them after only a couple of days apart.

I'm happy these days. I'll write up and post a bit about my philosophy of "humble elitism" which I think is partly responsible. An example will do for now: plain old elitism is reading Slashdot, being irritated at how stupid the comments are, and feeling all superior. Humble elitism is reading better blogs, such as Hack the Planet.

Thesis

I spent a good part of the day going over potential references in my queue. Here are my conclusions:

Sybil attack

The author presents a bad trust metric, shows that it is not attack resistant, and seems to imply that a centralized identity service (a la VeriSign) is needed. Obviously, I don't agree. He gets a brief mention in Chapter 1.

Self-Organization and Identification of Web Communities

Interesting. Their community-finding algorithm is very similar to Advogato's, so much so that it really feels like convergent evolution. However, they're not trying to make it attack resistant, and, indeed, it's not. I wrote a page or so at the end of Chapter 3 describing the differences and their effects on attack resistance.

Poblano (Jxta)

The paper they have up there is white, but with writing on the pages (see alancoxonachip). They present a trust metric in fairly vague terms, but there's no reason to believe it's any good. No cite.

The new text is, as always, available in the thesis snapshot.
