Older blog entries for RyanMuldoon (starting at number 22)

slef: I don't think that we'd have to have a regress of metametadata......if the metadata format were standardized, and there were a query language built on top of it, there would be no need for additional description. I guess though, that a schema for the metadata itself is in a way metametadata, but it should all work out nicely. ;-)

Last night I got fed up with XMMS and how it displayed track information in a tasklist, so I changed it from being "XMMS - tracknum. track (time)" to being "track (time) - tracknum - XMMS", which lets you get a good deal more information at a quick glance. I sent a patch to the xmms people....hopefully it is incorporated.

The thought of extending metadata to services is cool. It has a lot of potential.

Quote of the day:
"The mind of man is capable of anything - because everything is in it, all the past as well as all the future. What was there after all? Joy, fear, sorrow, devotion, valour, rage - who can tell? - but truth - truth stripped of its cloak of time. Let the fool gape and shudder - the man knows, and can look on without a wink."
--Joseph Conrad, Heart of Darkness

Ankh: I'd definitely be interested in getting in touch with some people with similar goals. I think that it is ultimately essential to the health of the Internet that public domain works are made easily available, and also that searching is vastly improved.

I'll take this opportunity to rant a bit. First, I am becoming increasingly disillusioned by the world wide web. I used to think that it was the coolest thing since sliced bread. But, overcommercialization is killing it. It is becoming harder and harder to find the actual information that I want, because searching was tacked on as an afterthought. I think that the world-wide web can be broken down into 4 categories:

  • Community sites: Things like Advogato, Slashdot, K5, etc. Sites that are too broad, or that have too large a userbase, are really showing it. Slashdot used to be great, but now it is barely worth going to, except that it is habit. Advogato in my mind strikes a great balance. I think the 2 big reasons for that are its focus (free software developers) and the trust metric. The trust metric is great, and has a ton of uses.
  • Services: Things like expedia.com, ticketmaster.com, and buy.com are all pushing a "web application" and are pretty useful. But I think that they can probably be revised in terms of metadata to be much cooler.
  • News sites: salon.com, cnn.com, etc. These are all useful, but it is annoying that I have to go to the front page to see if there are even any articles I want to read. Another candidate for metadata magic, or client pull.
  • Research Material/Public domain works: This stuff was the original purpose of the WWW. It is really lacking though, because the nature of research is that you have to be able to find it. But, as I said before, searching is kind of weak right now. This is also a huge candidate for metadata magic.

The problem is that people are trying to make the desktop more like the web, where I think the opposite should be true. Web sites should be seen like any document or application. Mozilla should not be an environment in which I do everything. It should be a rendering engine for the content that I asked for. I think that a nice unified search system should be how I find what content I want. Same with things I want to buy. News should be client-pulled for me and put into my desktop environment (like a "News" subdirectory in the gnome menu). Why "browse" unless I'm trying to kill time? It seems kind of dumb.

Now my rant will break out from just technological complaints to general intellectual property complaints. I completely agree with Ankh that people should be focusing their investments on museums, libraries, and other public repositories, rather than hogging important works to themselves. I can understand the joy of owning an original painting, or a first edition (and would definitely love to be in a position to be able to afford such things one day), but I'd like to think that it would be better to give or loan such things to museums, and just buy the print for my own enjoyment. Some things are too important to be held privately. However, museums and libraries need to shape up. They don't display anywhere near 20% of their holdings. What isn't on display is packed in crates where no one can enjoy it, research it, or do anything with it. This isn't in line with the function of a museum. I think that they have an obligation to supply electronic versions of everything they have. Imagine the boon to research that this would represent. Or even just personal enrichment. It would be an admittedly enormous task, but even doing things piece by piece would be beneficial. The argument that this would discourage people from actually seeing the real thing is foolish. I am thrilled that I can go to webmuseum and look at Van Gogh's amazing paintings, but that just makes me want to see the real things even more. And, when I do get the chance to see them, I appreciate it that much more. All of this stuff should be readily available.

I am thrilled to see organizations fund projects like ibiblio.org - it is an excellent collection of knowledge. But, while browsing it yesterday, I couldn't help but think how great it would be if all of that information had accompanying metadata. And then the development of a distributed filesharing system that has places like ibiblio.org as permanent nodes. It would be truly great. It frustrates me that the technology is there, but it is just not happening yet. Hopefully I can help make it happen a bit faster. Incidentally, a filesharing system that uses servers like ibiblio.org as permanent nodes would be virtually impossible to stop - the government couldn't help but fund such an effort eventually. It would be a quantum leap in the usefulness of computers and the internet. Being able to do cross-referencing on the fly would be cool as well, but I could live if that were a later feature. All a project like this needs is a lot of people willing to spend a little time adding metadata to things. After a while, it will be easy to maintain. To some extent, computers would be able to generate some of the metadata for us, leaving us to fill in the blanks as we have time. A guy can dream. ;-)
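
To give a rough idea of what I mean by computers generating some of the metadata for us, here is a little Python sketch that guesses what it can from the file itself and leaves the rest blank for a person to fill in. The field names are only loosely borrowed from Dublin Core, and the whole thing is purely illustrative.

    import mimetypes
    import os
    import time

    def guess_metadata(path):
        """Produce a starter metadata record for a file; a human fills in the rest."""
        mime, _ = mimetypes.guess_type(path)
        info = os.stat(path)
        return {
            "format": mime or "application/octet-stream",   # rough Dublin Core "format"
            "date": time.strftime("%Y-%m-%d", time.gmtime(info.st_mtime)),
            "extent": info.st_size,                          # size in bytes
            "title": None,      # left blank for a person to fill in later
            "creator": None,    # likewise
        }

    print(guess_metadata("/etc/hostname"))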

I took a look at www.canonicaltomes.org - it is a very cool idea. It reminded me of a project I wanted to do 3 or 4 years ago, but have yet to get off the ground or see anyone else really do. The project would be a central compilation of every public domain work that has been digitized. The goal would be to provide a nice, navigable, searchable interface to all the extremely useful research materials out there.

It would have to have the following features:

  • A Yahoo-like category interface for browsing casually
  • Each work would have extensive metadata, covering a standard like Dublin Core
  • A search interface that lets you perform all the basic searches, but also searches by metadata, so you can dynamically regroup works to your liking
  • A nice gdict-like desktop application that is a search gateway
  • Palm/WAP interface?
  • Cross-referencing
  • Text prettification, so you don't get stuck with plain ASCII if you don't want it....some nice HTML or XML with stylesheets would be nice

I'd imagine that the technical side would be the easy part. It would basically just have to be a big database with a well thought-out schema. The hard part is definitely organizing the content, attaching the metadata, and finding it all. Also, it would be good for it to be mirrored. Eventually it should be able to act as part of a distributed filesharing system. It would be an invaluable research tool.
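
For the curious, here is a rough sketch of the sort of schema I have in mind, using SQLite from Python. The table and column names are just my guesses at a Dublin Core-ish layout, not a finished design.

    import sqlite3

    conn = sqlite3.connect("publicdomain.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS work (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        creator   TEXT,
        date      TEXT,     -- date of original publication
        language  TEXT,
        subject   TEXT,     -- coarse Yahoo-like category, for casual browsing
        format    TEXT,     -- MIME type of the digitized text
        source    TEXT      -- URL of the archive holding the actual file
    );
    CREATE INDEX IF NOT EXISTS work_subject ON work (subject);
    CREATE INDEX IF NOT EXISTS work_creator ON work (creator);
    """)

    conn.execute(
        "INSERT INTO work (title, creator, date, language, subject, format, source) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("Heart of Darkness", "Joseph Conrad", "1899", "en",
         "Fiction/Novellas", "text/html", "http://example.org/conrad/hod.html"),
    )
    conn.commit()

    # A metadata search: regroup works dynamically, here by creator.
    for row in conn.execute("SELECT title, date FROM work WHERE creator = ?",
                            ("Joseph Conrad",)):
        print(row)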

With things like GNUpedia, and other similar efforts to create free-license encyclopedias, it seems like a much more worthwhile effort would be to work on something like what I describe above. An encyclopedia is only useful after there is a collection of works to reference. This would probably go further toward accomplishing what RMS wanted to get done: there is already no copyright on this material, so no competing interest can do anything about it. Once there is a community around it, it can be extended in all sorts of directions.

Of course, I think that the Library of Congress should provide such a resource, but the person running it seems to disagree with me. Ah well. Maybe one day when I have free time I'll try and get something like this started.

Ankh: Thanks for the suggestions for query languages. After spending a bit of time looking into it, I have found the following query languages that I will want to review: XPath, XQuery, SQL, OQL, and WebQL. One potential problem that I will have to review is licensing and patent issues. WebQL looks like part of a proprietary product. I want to make sure what I do is unencumbered by patents. It will be beneficial to look at it for ideas though, I'm sure. The XPath and XQuery systems look very interesting - the only thing that I would be concerned about if I chose one of them explicitly is that I want to treat the fact that the metadata is stored as XML as an implementation detail. I would like a seamless transition to a method that can support resource forks in filesystems, as well as files that store their own metadata. But I do want to use the XML DOM as the way I deal with the data itself. It seems well-designed. Another thing to consider is the XML Fragment Interchange spec, which looks ideal for simple metadata exchange systems. I'll definitely need to do a lot of reading on this. I understand XML well, but there are a lot of related technologies that I need to familiarize myself with. I really want to do this correctly......leveraging as many standards as possible. I need a sane public interface, a well-defined metadata set, and a well-defined query language. I really want to be able to support things like subqueries and unions. The ultimate goal is to provide a solid foundation for things like a much more powerful peer-to-peer filesharing system, superior search engines, and virtual folders on your desktop. I think if I can get those 3 things specced out right, the rest will fall into place for anyone to pick up and build on top of. But before I can get there, I need to keep doing a lot of research. And hopefully the other people involved in this project will bring as much to the table as I think they will. We'll see how it goes.
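
To make the XPath option a little more concrete, here is the sort of query I have in mind, run with Python's ElementTree over a made-up Dublin Core record. The record and its values are purely illustrative, and in the real system the fact that it is XML would stay hidden behind the metadata manager's interface.

    import xml.etree.ElementTree as ET

    # A made-up metadata record; in practice this would live behind the manager.
    record = """
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Heart of Darkness</dc:title>
      <dc:creator>Joseph Conrad</dc:creator>
      <dc:type>Text</dc:type>
      <dc:language>en</dc:language>
    </metadata>
    """

    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    root = ET.fromstring(record)

    # Simple XPath-style queries: pull out individual fields.
    title = root.find("dc:title", ns).text
    creator = root.find("dc:creator", ns).text
    print(title, "--", creator)

    # Or walk every field in the record.
    for field in root:
        print(field.tag, "=", field.text)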

Most of my thinking time has been spent on metadata. I've been trying to do as much research on the issue as possible. The more I think about it, the more I think that it will be feasible to develop a metadata manager that will be forward-compatible. But there is still a lot of work left. I want to begin work on hashing out an XML schema for the format of the metadata. That in and of itself will be a large undertaking. Then comes figuring out a query language. I want to model it after SQL. I am probably going to take a look at OAF's query system....it is SQL-like, and is designed to query similar kinds of information. So it will probably be a good starting point. Then comes the public interface specification for the metadata manager. It's definitely going to be a project with a long timeline. Hopefully soon we will be able to outline a formal roadmap, and plan exactly what we need to do. Hopefully it will end up being used by a lot of people at some point. ;-)
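
Just to pin down what I mean by an SQL-like query language over metadata, here is a toy example in Python. The query syntax in the comment and the little matching function are invented for illustration; they are a stand-in for whatever the real grammar ends up being.

    # Something in the spirit of:
    #   SELECT file WHERE creator = 'Joseph Conrad' AND language = 'en'

    def query(records, **conditions):
        """Toy stand-in for the real query language: exact-match AND over fields."""
        return [r for r in records
                if all(r.get(field) == value for field, value in conditions.items())]

    records = [
        {"title": "Heart of Darkness", "creator": "Joseph Conrad", "language": "en"},
        {"title": "Lord Jim", "creator": "Joseph Conrad", "language": "en"},
        {"title": "Madame Bovary", "creator": "Gustave Flaubert", "language": "fr"},
    ]

    print(query(records, creator="Joseph Conrad", language="en"))

    # Unions fall out naturally: run two queries and merge the results.
    union = query(records, language="fr") + query(records, creator="Joseph Conrad")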

I've been thinking more about metadata, and how it could realistically be stored and organized. Of course, "realistically" is fairly subjective. My current thinking is pretty much based on the Semantic Web ideas...using RDF files to store metadata on files. That right there poses a couple of problems: One, should the filesystem expose these files? I say no, at least not directly. The user should be able to modify information contained in these files, but only through utilities that are designed for it. If someone can just see them as files, and open them in emacs, then the metadata contained can be compromised (due to how I want to organize the metadata.....see below). The other problem is whether or not these files should have metadata themselves....again, I say no. These should not be treated like normal files. They should be more like Mac's resource forks. Of course, this immediately requires a new filesystem API that knows about this, as well as protocol level support. Hence my subjective notion of realism. The one good thing is that it is likely that protocols can be updated in a way that is backwards-compatible.
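
As a sketch of what one of these per-file RDF records might look like, here is how I picture a small utility writing and reading one with the rdflib library (assuming rdflib is available). The sidecar naming convention and the choice of fields are placeholders; in the real design the filesystem would keep these records out of the normal namespace entirely.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    DATA_FILE = "heart_of_darkness.txt"        # the ordinary file the user sees
    SIDECAR = DATA_FILE + ".meta.rdf"          # hypothetical hidden metadata record

    def write_metadata():
        """What the metadata-editing utility would do; users never touch the RDF directly."""
        g = Graph()
        g.bind("dc", DC)
        subject = URIRef("file://" + DATA_FILE)
        g.add((subject, DC.title, Literal("Heart of Darkness")))
        g.add((subject, DC.creator, Literal("Joseph Conrad")))
        g.add((subject, DC.language, Literal("en")))
        g.serialize(destination=SIDECAR, format="xml")

    def read_title():
        g = Graph()
        g.parse(SIDECAR, format="xml")
        return g.value(URIRef("file://" + DATA_FILE), DC.title)

    write_metadata()
    print(read_title())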

Now, you may ask, what are these RDF files going to contain? My thought on the matter is this: there should be a standard set of metadata fields to work with. Ideally, these should be based on the Dublin Core work in this area. This is a good start, but I would like to go one step further: use namespaces to specify standard metadata fields by MIME type. The Dublin Core stuff would most likely be the supertype */*, and then new metadata fields can be introduced for text/*, audio/*, image/*, video/*, and application/*. The annoying one is application/*, because there is no real continuity in the members of that set. Maybe it should be left out....I don't know. My other thought on breaking down these metadata namespaces is that the more specific areas of metadata should be typedefed fields - there should only be certain keywords that can be used. This greatly eases implementation issues, as there is no need for fuzzy logic in associating similar words. It also quickly establishes a lowest common denominator that everyone can work with. This is why I don't think metadata files should be treated normally.
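
A toy sketch of the namespace-by-MIME-type idea: a table mapping type patterns to the extra fields they allow, plus controlled keyword lists for the "typedefed" fields. The field names and keyword lists here are invented purely for illustration.

    # Fields allowed per MIME-type namespace; */* is the Dublin Core baseline.
    FIELDS = {
        "*/*":     {"title", "creator", "date", "language", "subject"},
        "text/*":  {"encoding", "wordcount"},
        "audio/*": {"album", "tracknum", "duration", "genre"},
        "image/*": {"width", "height", "colorspace"},
    }

    # "Typedefed" fields: only these keywords are legal, so no fuzzy matching is needed.
    VOCABULARIES = {
        "genre":      {"classical", "jazz", "rock", "spoken"},
        "colorspace": {"rgb", "cmyk", "grayscale"},
    }

    def allowed_fields(mime_type):
        """All fields valid for a MIME type: the */* baseline plus its supertype's set."""
        supertype = mime_type.split("/")[0] + "/*"
        return FIELDS["*/*"] | FIELDS.get(supertype, set())

    def validate(mime_type, field, value):
        if field not in allowed_fields(mime_type):
            return False
        vocab = VOCABULARIES.get(field)
        return vocab is None or value in vocab

    print(validate("audio/mpeg", "genre", "jazz"))    # True
    print(validate("audio/mpeg", "genre", "polka"))   # False: not in the vocabulary
    print(validate("image/png", "tracknum", "3"))     # False: wrong namespace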

With this base, filesystems and OSes have a ton of room to innovate with new features. One thing that I would like to see is the development of association graphs - so files that you use together regularly are associated together. Another side benefit to this is that a heuristic could be developed for dynamically adding metadata to files with incomplete metadata, based on how the file is being used with other metadata-rich files. Also, searching that is purely based on metadata should be really fast, as all the files are small, easily indexable, and in a standard format. I'm sure that there are a ton of other things that could be built on top of this basic framework. I'd be interested in anyone's thoughts on this system....especially my initial thinking that metadata files should be treated differently than normal files.
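
Here is a rough sketch of the association-graph idea: count how often pairs of files are in use at the same time, and when a file's record is missing a field, borrow the most common value from its strongest associates. The threshold and the borrowing rule are obviously just stand-ins for a real heuristic.

    from collections import Counter
    from itertools import combinations

    cooccurrence = Counter()   # (file_a, file_b) -> how often they were used together

    def record_session(open_files):
        """Call this whenever a set of files is in use at the same time."""
        for a, b in combinations(sorted(open_files), 2):
            cooccurrence[(a, b)] += 1

    def neighbours(path, threshold=3):
        """Files strongly associated with the given one."""
        result = []
        for (a, b), count in cooccurrence.items():
            if count >= threshold and path in (a, b):
                result.append(b if a == path else a)
        return result

    def guess_missing_field(path, field, metadata):
        """Borrow the most common value of a field among a file's strong associates."""
        values = Counter(metadata[n][field] for n in neighbours(path)
                         if field in metadata.get(n, {}))
        return values.most_common(1)[0][0] if values else None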

dirtyrat: You're right, there is a lot of infrastructure work to be done. As I've said before, it would be a much easier problem to solve if there weren't any legacy compatibility issues. We could build metadata right into all the filesystems and protocols and/or file formats. Then all that would need to be done would be to develop the features that take advantage of the extra information. It then quickly becomes more of a straight HCI issue. But as it stands now, it is a pretty massive engineering problem, a computer science problem, and an HCI problem. Not simple. ;-) It is really foundational, which is a curse and a blessing. The curse is of course that it takes a lot of work to graft this onto existing infrastructure. The blessing is that once it is there, there are all sorts of cool things that can be done relatively cheaply. Hopefully this (huge) benefit will get people inspired to work on the issues. We'll see how it goes.

I really hate how all the books that I want to buy are in the $60-$200 range. Of course, all of these books are in Philosophy, CS, Semantics, or HCI. All of these fields find it reasonable to have very high prices on books. I can appreciate the fact that upon reading these books, I am theoretically more marketable/smart, but man, it is just a lot of money. My book list is somewhere in the $600 range right now. And that is after having gotten ~$200 worth of books in the past couple months. I am being sucked dry. :( Hopefully I'll learn enough to justify spending so much money. I hope that the authors are actually getting most of that money. Somehow I doubt it. Ah well. I'll live.

As for my article, I was hoping for a bit more conversation, but that's ok. I think that it is partially because it is a fairly specific area of study that not too many people necessarily think about. Or maybe because I'm completely off-base. ;-) I am finding it difficult to find people to do some research with. The Semantic Web stuff seems very cool, but I have no idea how I can get involved with that. Maybe I'll see if I can figure it out. Hmm..that's about all for today.

Hmmm......so far no one seems interested in my article. Hopefully that will change. ;-) I spent a bit of time today thinking more about doing a moniker-based clipbook/scrapbook program. The annoying thing is that I am becoming more and more aware of the hard parts of the X cut and paste model. And the most elegant solution to the problem would cut out all non-gnome/bonobo apps. So I can't do that. ;-) Hopefully I'll be able to solve my problems in a non-hackish way. It will just require more thinking. Unfortunately, my thinking time is increasingly split between a number of different problems/projects. And right now, my thoughts on GUI/semantics stuff are what win out most of the time. Hopefully it will prove to be a worthwhile endeavor.

I read the articles linked from slashdot today about Napster....I have been thinking more and more that the appropriate route for developing industries on the Internet is to make the base infrastructure free (as in speech and beer), and then allow companies to build value on top of this. To some extent this happens, but not nearly enough. High-speed internet access is completely artificially expensive. Really, this should be something that the government (or some sanctioned non-profit institution) takes care of, not large telephony companies. Why? If we really want the internet to be an "information superhighway" we need to treat it like our real highways. They are maintained by the government, covered by taxes. We also have government-subsidized mass transportation (buses and subways). But then, we also have companies making a lot of money on top of this infrastructure - taxi companies, limo services, car companies, gas stations, fast food restaurants, etc. These businesses could not really have existed (or at least prospered) without this free infrastructure to build upon. We should do the same with the Internet. All base protocols, file formats, OSes, and distribution mechanisms should be open and free. Companies can then build services on top of them by creating unique content, providing a better Internet experience somehow, delivering existing content to you in a more convenient way, etc. And probably all sorts of things I can't even think of. But the important thing is that the government should recognize its role as an infrastructure provider. One of the most important things a government does is maintain standards for weights and measures. Why is this important? Because it ensures that business transactions between economic entities remain fair. It is the same case for the Internet. A free market only works when everyone has a level playing field and a low barrier to entry. The Internet was originally thought of (when it first was allowed to be used for commercial purposes) as the ultimate leveler. I think that this is becoming less and less true. I really hope that the free software world will slowly but surely help bring this about. Of course, the most obvious attack on something like this is: "I don't want to pay more taxes! The government shouldn't stick its nose in other people's profit margins!" This is almost a valid complaint. ;-) When you think about it, you are already paying "taxes" on everything. You pay the telephony companies for the right to use their bandwidth. You pay the software companies for the right to use their software. Soon, you'll pay the distribution companies to use their system. I'd rather have a consolidated tax. Furthermore, it would be less - because the government is not a for-profit institution. There would still be a lot of room to innovate and make new businesses on top of this public infrastructure. It just seems like the most sensible thing to do. (Of course, I think the same thing of the health industry, but that is a whole other diary entry). After ranting, I feel better. ;-)
