Older blog entries for RyanMuldoon (starting at number 20)

I took a look at www.canonicaltomes.org - it is a very cool idea. It reminded me of a project I wanted to do 3 or 4 years ago, but have still yet to get off the ground, and haven't seen anyone else really do. The project would be a central compilation of every public domain work that has been digitized. The goal would be to provide a nice, navigable, searchable interface to all the extremely useful research materials out there.

It would have to have the following features:

  • A Yahoo-like category interface for browsing casually
  • Each work would have extensive metadata, covering a standard like Dublin Core
  • A search interface that lets you perform all the basic searches, but also searches by metadata, so you can dynamically regroup works to your liking
  • A nice gdict-like desktop application that is a search gateway
  • Palm/WAP interface?
  • Cross-referencing
  • Text prettification, so you don't get stuck with ascii if you don't want to....some nice HTML or XML with stylesheets would be nice
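To make the metadata features above concrete, here is a minimal Python sketch of a single work described with Dublin Core-style fields, plus the kind of field-based search the interface could offer. The record, the field names' exact spellings, and the search function are invented for illustration; a real catalog would sit on a proper database.

```python
# Hypothetical sketch: one public-domain work described with Dublin
# Core-style fields, and a trivial search over those fields.

work = {
    "dc:title": "On the Origin of Species",
    "dc:creator": "Darwin, Charles",
    "dc:date": "1859",
    "dc:language": "en",
    "dc:subject": ["biology", "evolution"],
    "dc:type": "Text",
}

def search(catalog, field, value):
    """Return all works whose metadata field matches or contains value."""
    results = []
    for record in catalog:
        entry = record.get(field)
        if entry == value or (isinstance(entry, list) and value in entry):
            results.append(record)
    return results

print([w["dc:title"] for w in search([work], "dc:subject", "evolution")])
# ['On the Origin of Species']
```

The point is that once every work carries a standard field set, "regroup works to your liking" is just a filter over those fields.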

I'd imagine that the technical side would be the easy part. It would basically just have to be a big database with a well thought-out schema. The hard part is definitely organizing the content, attaching the metadata, and finding it all. Also, it would be good for it to be mirrored. Eventually it should be able to act as part of a distributed filesharing system. It would be an invaluable research tool.

With things like GNUpedia, and other similar efforts to create free-license encyclopedias, it seems like a much more worthwhile effort would be something like what I describe above. An encyclopedia is only useful once there is a collection of works to reference. This would probably go further toward accomplishing what RMS wanted to get done: there is already no copyright on this material, so no competing interest can do anything about it. Once there is a community around it, it can be extended in all sorts of directions.

Of course, I think that the Library of Congress should provide such a resource, but the person running it seems to disagree with me. Ah well. Maybe one day when I have free time I'll try and get something like this started.

Ankh: Thanks for the suggestions for query languages. After spending a bit of time looking into it, I have found the following query languages that I will want to review: XPath, XQuery, SQL, OQL, and WebQL. One potential problem that I will have to review is any licensing issues and patents. WebQL looks like part of a proprietary product. I want to make sure what I do is unencumbered by patents. It will be beneficial to look at it for ideas, though, I'm sure. The XPath and XQuery systems look very interesting - the only thing that would concern me if I chose one of them explicitly is that I want to treat the fact that the metadata is stored as XML as an implementation detail. I would like a seamless transition to a method that can support resource forks in filesystems, as well as files that store their own metadata. But I do want to use the XML DOM as the way I deal with the data itself. It seems well-designed. Another thing to consider is the XML Fragment Interchange spec, which looks ideal for simple metadata exchange systems. I'll definitely need to do a lot of reading on this. I understand XML well, but there are a lot of related technologies that I need to familiarize myself with. I really want to do this correctly......leveraging as many standards as possible. I need a sane public interface, a well-defined metadata set, and a well-defined query language. I really want to be able to support things like subqueries and unions. The ultimate goal is to provide a solid foundation for things like a much more powerful peer-to-peer filesharing system, superior search engines, and virtual folders on your desktop. I think if I can get those three things specced out right, the rest will fall into place for anyone to pick up and build on top of. But before I can get there, I need to keep doing a lot of research. And hopefully the other people involved in this project will bring as much to the table as I think they will. We'll see how it goes.
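As a small illustration of the XPath direction, Python's standard library can already run a limited subset of XPath over an XML metadata document. The toy catalog and element names below are invented; a real system would need a much richer query layer with subqueries and unions.

```python
import xml.etree.ElementTree as ET

# A made-up XML metadata catalog, purely for illustration.
doc = ET.fromstring("""
<catalog>
  <work><title>Leviathan</title><creator>Hobbes</creator></work>
  <work><title>Ethics</title><creator>Spinoza</creator></work>
</catalog>
""")

# ElementTree supports simple XPath predicates like [tag='text'],
# enough to express "titles of works whose creator is Spinoza".
titles = [t.text for t in doc.findall("work[creator='Spinoza']/title")]
print(titles)  # ['Ethics']
```

Even this tiny subset shows why treating the XML storage as an implementation detail matters: the query expression names metadata fields, not file locations.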

Most of my thinking time has been spent on metadata. I've been trying to do as much research on the issue as possible. The more I think about it, the more I think that it will be feasible to develop a metadata manager that will be forward-compatible. But there is still a lot of work left. I want to begin hashing out an XML schema for the format of the metadata. That in and of itself will be a large undertaking. Then comes figuring out a query language. I want to model it after SQL. I am probably going to take a look at OAF's query system....it is SQL-like, and is designed to query similar kinds of information, so it will probably be a good starting point. Then comes the public interface specification for the metadata manager. It's definitely going to be a project with a long timeline. Hopefully soon we will be able to outline a formal roadmap and plan exactly what we need to do. Hopefully it will end up being used by a lot of people at some point. ;-)

I've been thinking more about metadata, and how it could realistically be stored and organized. Of course, "realistically" is fairly subjective. My current thinking is pretty much based on the Semantic Web ideas...using RDF files to store metadata on files. That right there poses a couple of problems. One: should the filesystem expose these files? I say no, at least not directly. The user should be able to modify information contained in these files, but only through utilities that are designed for it. If someone can just see them as files, and open them in emacs, then the metadata contained can be compromised (due to how I want to organize the metadata.....see below). The other problem is whether or not these files should have metadata themselves....again, I say no. These should not be treated like normal files. They should be more like the Mac's resource forks. Of course, this immediately requires a new filesystem API that knows about this, as well as protocol-level support. Hence my subjective notion of realism. The one good thing is that protocols can probably be updated in a way that is backwards-compatible.

Now, you may ask, what are these RDF files going to contain? My thought on the matter is this: there should be a standard set of metadata fields to work with. Ideally, these should be based on the Dublin Core work in this area. This is a good start, but I would like to go one step further: use namespaces to specify standard metadata fields by MIME type. The Dublin Core stuff would most likely be the supertype */*, and then new metadata fields can be introduced for text/*, audio/*, image/*, video/*, and application/*. The annoying one is application/*, because there is no real continuity in the members of that set. Maybe it should be left out....I don't know. My other thought on breaking down these metadata namespaces is that the more specific areas of metadata should be typed fields - only certain keywords should be usable. This greatly eases implementation issues, as there is no need for fuzzy logic in associating similar words. It also quickly establishes a lowest common denominator that everyone can work with. This is why I don't think metadata files should be treated normally.
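Here is a rough Python sketch of what I mean by MIME-typed metadata namespaces with controlled vocabularies: a */* base field set that gets extended per supertype, where typed fields only accept known keywords. All the field names and vocabularies below are invented placeholders, not a proposed standard.

```python
# Hypothetical per-MIME-type metadata schemas. "None" means free-form
# text; a set means a fixed keyword vocabulary (a "typed" field).
SCHEMAS = {
    "*/*":     {"title": None, "creator": None, "date": None},
    "audio/*": {"genre": {"classical", "jazz", "rock", "spoken"}},
    "text/*":  {"language": {"en", "fr", "de", "la"}},
}

def allowed_fields(mime_type):
    """Merge the base */* schema with the supertype schema."""
    supertype = mime_type.split("/")[0] + "/*"
    fields = dict(SCHEMAS["*/*"])
    fields.update(SCHEMAS.get(supertype, {}))
    return fields

def validate(mime_type, metadata):
    """Reject unknown fields and out-of-vocabulary keyword values."""
    fields = allowed_fields(mime_type)
    for key, value in metadata.items():
        if key not in fields:
            raise ValueError(f"unknown field {key!r} for {mime_type}")
        vocab = fields[key]
        if vocab is not None and value not in vocab:
            raise ValueError(f"{value!r} not in vocabulary for {key!r}")

validate("audio/mpeg", {"title": "Song", "genre": "jazz"})  # accepted
```

The fixed vocabularies are what buy the lowest common denominator: every tool agrees on the legal values, so no fuzzy matching is needed.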

With this base, filesystems and OSes have a ton of room to innovate. One thing that I would like to see is the development of association graphs - so files that you use together regularly are associated with each other. Another side benefit to this is that a heuristic could be developed for dynamically adding metadata to files with incomplete metadata, based on how the file is being used alongside other metadata-rich files. Also, searching that is purely based on metadata should be really fast, as all the metadata files are small, easily indexable, and in a standard format. I'm sure that there are a ton of other things that could be built on top of this basic framework. I'd be interested in anyone's thoughts on this system....especially my initial thinking that metadata files should be treated differently than normal files.
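The association-graph idea could start very simply: count how often pairs of files appear in the same work session, and treat pairs above some threshold as associated. A toy sketch follows; the session data and the threshold of 2 are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up usage sessions: each set is the files open together once.
sessions = [
    {"thesis.tex", "refs.bib", "notes.txt"},
    {"thesis.tex", "refs.bib"},
    {"notes.txt", "budget.ods"},
]

# Count co-occurrences of every file pair across sessions.
pair_counts = Counter()
for files in sessions:
    for pair in combinations(sorted(files), 2):
        pair_counts[pair] += 1

# Pairs seen together at least twice are considered associated.
associated = [pair for pair, n in pair_counts.items() if n >= 2]
print(associated)  # [('refs.bib', 'thesis.tex')]
```

A daemon doing this continuously could then copy metadata from the rich file of a strongly associated pair onto the poor one, which is the propagation heuristic described above.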

dirtyrat: You're right, there is a lot of infrastructure work to be done. As I've said before, it would be a much easier problem to solve if there weren't any legacy compatibility issues. We could build metadata right into all the filesystems and protocols and/or file formats. Then all that would need to be done would be to develop the features that take advantage of the extra information. It then quickly becomes more of a straight HCI issue. But as it stands now, it is a pretty massive engineering problem, a computer science problem, and a HCI problem. Not simple. ;-) It is really foundational, which is a curse and a blessing. The curse is of course that it takes a lot of work to graft this onto existing infrastructure. The blessing is that once it is there, there are all sorts of cool things that can be done relatively cheaply. Hopefully this (huge) benefit will get people inspired to work on the issues. We'll see how it goes.

I really hate how all the books that I want to buy are in the $60-$200 range. Of course, all of these books are in Philosophy, CS, Semantics, or HCI. All of these fields find it reasonable to have very high prices on books. I can appreciate the fact that upon reading these books, I am theoretically more marketable/smart, but man, it is just a lot of money. My book list is somewhere in the $600 range right now. And that is after having gotten ~$200 worth of books in the past couple months. I am being sucked dry. :( Hopefully I'll learn enough to justify spending so much money. I hope that the authors are actually getting most of that money. Somehow I doubt it. Ah well. I'll live.

As for my article, I was hoping for a bit more conversation, but that's ok. I think that it is partially because it is a fairly specific area of study that not too many people necessarily think about. Or maybe because I'm completely off-base. ;-) I am finding it difficult to find people to do some research with. The Semantic Web stuff seems very cool, but I have no idea how I can get involved with that. Maybe I'll see if I can figure it out. Hmm..that's about all for today.

Hmmm......so far no one seems interested in my article. Hopefully that will change. ;-) I spent a bit of time today thinking more about doing a moniker-based clipbook/scrapbook program. The annoying thing is that I am becoming more and more aware of the hard parts of the X cut and paste model. And the most elegant solution to the problem would cut out all non-gnome/bonobo apps. So I can't do that. ;-) Hopefully I'll be able to solve my problems in a non-hackish way. It will just require more thinking. Unfortunately, my thinking time is increasingly split between a number of different problems/projects. And right now, my thoughts on GUI/semantics stuff is what wins out most of the time. Hopefully it will prove to be a worthwhile endeavor.

I read the articles linked from slashdot today about Napster....I have been thinking more and more that the appropriate route for developing industries on the Internet is to make the base infrastructure free (as in speech and beer), and then allow companies to build value on top of this. To some extent this happens, but not nearly enough. High-speed internet access is completely artificially expensive. Really, this should be something that the government (or some sanctioned non-profit institution) takes care of, not large telephony companies. Why? If we really want the internet to be an "information superhighway" we need to treat it like our real highways. They are maintained by the government, covered by taxes. We also have government-subsidized mass transportation (buses and subways). But then, we also have companies making a lot of money on top of this infrastructure - taxi companies, limo services, car companies, gas stations, fast food restaurants, etc. These businesses could not really have existed (or at least prospered) without this free infrastructure to build upon. We should do the same with the Internet. All base protocols, file formats, OSes, and distribution mechanisms should be open and free. Companies can then build services on top of them by creating unique content, providing a better Internet experience somehow, delivering existing content to you in a more convenient way, etc. And probably all sorts of things I can't even think of. But the important thing is that the government should recognize its role as an infrastructure provider. One of the most important things a government does is keep standards for weights and measures. Why is this important? Because it ensures that business transactions between economic entities remain fair. It is the same case for the Internet. A free market only works when everyone has a level playing field and a low barrier to entry.
The Internet was originally thought of (when it first was allowed to be used for commercial purposes) as the ultimate leveler. I think that this is becoming less and less true. I really hope that the free software world will slowly but surely help bring this about. Of course, the most obvious attack on something like this is: "I don't want to pay more taxes! The government shouldn't stick its nose in other people's profit margins!" This is almost a valid complaint. ;-) When you think about it, you are already paying "taxes" on everything. You pay the telephony companies for the right to use their bandwidth. You pay the software companies for the right to use their software. Soon, you'll pay the distribution companies to use their system. I'd rather have a consolidated tax. Furthermore, it would be less - because the government is not a for-profit institution. There would still be a lot of room to innovate and make new businesses on top of this public infrastructure. It just seems like the most sensible thing to do. (Of course, I think the same thing of the Health Industry, but that is a whole other diary entry). After ranting, I feel better. ;-)

jmg: As for versioning on files, I agree, it does take up extra disk space. However, I have two thoughts on this. First, consider the percentage of files on your hard drive that you actually edit on a regular basis. My guess is that it is ~10% at most. Then, consider that most files that are frequently edited are text/* MIME types. Which means that it should be easy to do what CVS does, and store diffs for each version. It should only add marginally to disk space. For non-text files, this is probably more of a problem, as I don't think that you can get anything meaningful from a diff. But my guess is still that binary data is edited much less frequently than ASCII, so the penalty is probably not that great. And if there's an ability to delete versions, I doubt that it would be a very costly feature (at least in the general case).
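For text files the CVS-style approach really is cheap. As a sketch, Python's standard difflib can produce a delta from which either version can be reconstructed, so only the delta needs to be kept between versions (the file contents below are invented):

```python
import difflib

# Two versions of a small text file, as lists of lines.
v1 = "hello world\nsecond line\n".splitlines(keepends=True)
v2 = "hello there\nsecond line\nthird line\n".splitlines(keepends=True)

# Store only the delta; difflib.restore() can rebuild either side,
# so the filesystem never needs two full copies.
delta = list(difflib.ndiff(v1, v2))

assert list(difflib.restore(delta, 1)) == v1  # the old version
assert list(difflib.restore(delta, 2)) == v2  # the new version
```

For mostly unchanged files the delta stays near the size of the edit itself, which is the "marginal disk space" argument in miniature.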

I haven't played with BeOS in a number of years, but as I am learning more about it, I would like to take a more extensive look at it. What it has for metadata is very cool. The GNOME project is trying to build some of that into the 1.4/2.0 platform (vFolders in Evolution and eventually Nautilus, and emblems in Nautilus that you can search by). This is a good start, but a couple of things strike me: first, the usefulness drastically increases if it is implemented at a more fundamental level. Second, while I can do vFolders with all of my MP3s, I would rather be able to ask a media player to find all the "mellow" MP3s I have, or all the MP3s similar to the one I am currently playing. The big problem is that this level of metadata is VERY dynamic, and needs an Agent/daemon to pay attention to what I'm doing, and constantly add more information as it figures out my usage patterns. The future vision that Hans Reiser has (that I mentioned in my last diary entry) could provide the foundation for at least how and where to put metadata. The problem still remains though - if I send you an MP3 across the network, how does that metadata get transported? Also, even if the metadata can be transferred with the file, what metadata is per-user, and what is general? In the case of MP3s, I think everyone can pretty much agree what "mellow" is, but what MP3s I would group together is probably different from what someone else would. I'm very interested in what other people are doing in this field, so I'm glad to hear your thoughts on the matter. What I would ultimately like to see will at best take several years.....I'm trying to figure out how this could be done in small, manageable steps. I don't want to work on something that ends up only being of academic interest....I want to work on something that I (and hopefully others) can use and benefit from. So that means tacking stuff on top of UNIX. ;-) How to go about this is a tricky question.

jmg: Thanks for the pointer to FreeBSD extended attributes. I'll definitely take a look. However, what really blew me away is found at www.reiser.org -> Future Vision. What is outlined there seems like a much better way of retrieving information. What I have yet to resolve is how to initially save a file with that system....what initial metadata is attached? For text, it is an easy problem. But for media files, it becomes more complicated. You can get the standard modified time, etc., and what program edited it...but you need to be able to attach more meaningful data to it initially. Also, another feature I'd love to see built into a filesystem is versioning....I want CVS for the masses. It would be really nice to be able to query the filesystem for files modified within a given date range, and get the versions that were saved in that range, rather than just the newest version. Such a thing would be very useful for anyone working on documents over a long period of time. I'd like to see the hierarchical filesystem that we have today be around only for legacy purposes. But that is a long way off. ;-)
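The date-range version query could look something like this in miniature: a version history filtered by modification date, returning every version saved in the range rather than only the newest. The history, filenames, and dates below are all invented for illustration.

```python
from datetime import date

# A made-up version history: (filename, saved-on date, contents).
versions = [
    ("report.txt", date(2001, 1, 5), "draft 1"),
    ("report.txt", date(2001, 1, 20), "draft 2"),
    ("report.txt", date(2001, 2, 3), "final"),
]

def versions_in_range(history, start, end):
    """Return every saved version whose date falls in [start, end]."""
    return [(name, when, body) for name, when, body in history
            if start <= when <= end]

hits = versions_in_range(versions, date(2001, 1, 1), date(2001, 1, 31))
print([body for _, _, body in hits])  # ['draft 1', 'draft 2']
```

A versioned filesystem would answer the same question from its own metadata; this just shows the shape of the query and why it returns multiple versions of one file.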
