Open[Source]ing the Doors for Contributor-Run Digital Libraries
[A draft of an invited article for the digital library issue of Communications of the Association for Computing Machinery (CACM), due to be published in May 2001. Your comments are invited and appreciated.]
By Paul Jones
University of North Carolina - Chapel Hill
What if you could wave a wand, in this very Harry Potter decade, and make libraries - at least digital libraries - more open, easier to manage, cheaper, and even more eclectic and democratic? What if content contributors could submit, catalog, index, manage, rate, and rank materials in large collections themselves? I believe that, thanks to the innovations of the Open Source community and, perhaps more importantly, the Free Software community, we can have a contributor-run library at this very moment.
In fact, there are several very successful examples from which we can draw not only best practices, but also - that grail of the programmer - working code. But better still, these projects are also examples of vibrant, lively, noisy, democratic communities.
The first step in contributor-run libraries is to allow people to contribute. This may sound obvious, but many collections try to control or 'gatekeep' from the outset. Our experience with the Linux Software Archive at MetaLab.unc.edu, which began in 1992, was that by removing nearly all barriers to submission and instituting instead some simple verification procedures, we were able to accept (and later distribute) some very high-quality software with a very low rejection rate. Submissions are accepted by a simple FTP upload to a secure area. Along with the software, we require some basic metadata, called the Linux Software Map (see http://metalab.unc.edu/pub/Linux/LSM-TEMPLATE), to identify the author and title and to describe the software. There are only 12 fields in all, and only four are required. Our rejection rate due to missing or improper metadata is a low 4.5%, although we have contributors from every corner of the globe. (See Greenberg, J. Facilitating Author-Generated Metadata: Lessons Learned from an Analysis of Linux Software Maps (LSMs). [Draft report in progress.]) For a full description of the MetaLab Linux Software Archives and contributor demographics, see Dempsey, Weiss, Jones and Greenberg, A Quantitative Profile of a Community of Open Source Linux Developers [forthcoming in Communications of the ACM], or in draft at http://metalab.unc.edu/osrt/develpro.html.
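The submit-then-verify workflow can be sketched in a few lines. This is a hypothetical illustration, not the archive's actual code; the field names follow the LSM template style, but the particular set of required fields shown here is an assumption for the example.

```python
# Hypothetical sketch of LSM-style metadata checking. The archive's real
# verification procedures are not shown in the article; this only
# illustrates "accept unless required metadata is missing".
REQUIRED = {"Title", "Version", "Entered-date", "Primary-site"}  # assumed subset

def parse_lsm(text):
    """Parse a simple 'Field: value' metadata block into a dict."""
    entry = {}
    for line in text.splitlines():
        if ":" in line:
            field, _, value = line.partition(":")
            entry[field.strip()] = value.strip()
    return entry

def validate_lsm(entry):
    """Return the missing required fields (an empty list means accept)."""
    return sorted(REQUIRED - entry.keys())

sample = """Title: example-tool
Version: 1.0
Entered-date: 2001-01-15
Primary-site: metalab.unc.edu /pub/Linux/utils
Description: A small example entry
"""
print(validate_lsm(parse_lsm(sample)))  # [] -> submission accepted
```

A submission with only a Title would be bounced back to the contributor with the list of missing fields, which keeps the rejection rate low without a human gatekeeper.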
What this experience tells us is that opening the doors to contributors may not be as scary as we may have been led to believe. Of course, digital libraries don't have the same shelf space problems as physical ones. But the fact that the metadata and the attendant organizational assistance taken directly from contributors are reliable and immediately useful is encouraging.
But others have found that encouraging contributors to rank and comment on the contributions of others adds great value and creates a favorable environment for a noisy, active, democratic community to develop and grow. Large book wholesalers, including Amazon (http://www.amazon.com/) and Barnes and Noble (http://www.bn.com/), add value to their offerings by collecting and ranking both user comments and comments on those comments.
Other sites, most notably Slashdot.org (see http://www.slashdot.org/), have instituted reward systems so that valued contributors and commenters accrue "karma" points, which allow them to act as moderators of discussions and to rank comments and stories. Devices such as "karma" points serve as a hedge against trolls, group takeovers, fakers, and the like.
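The karma mechanism reduces to a simple invariant: moderation power is earned by past contribution. Slashdot's real system is considerably more elaborate (moderation points expire, meta-moderation audits the moderators); the threshold and point values below are invented for illustration.

```python
# Minimal karma sketch: contributors earn points when their comments are
# moderated up, and only contributors above a threshold may moderate.
# The threshold of 5 is an assumed value, not Slashdot's actual rule.
from collections import defaultdict

KARMA_THRESHOLD = 5

karma = defaultdict(int)

def moderate(moderator, author, delta):
    """Let an established contributor adjust another author's karma."""
    if karma[moderator] < KARMA_THRESHOLD:
        raise PermissionError("not enough karma to moderate")
    karma[author] += delta

karma["alice"] = 10           # an established contributor
moderate("alice", "bob", +1)  # alice rewards bob's useful comment
print(karma["bob"])           # 1
```

A brand-new account cannot moderate at all, which is exactly the hedge against trolls and fakers the article describes.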
More sophisticated structures such as Advogato's "trust metric" (see http://www.advogato.org/trust-metric.html) and other schemes to evaluate "reputation capital" offer an even stronger and more reliable community structure for ensuring rich and useful ranking and evaluation.
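Advogato's actual metric computes a network flow over the graph of peer certifications, which bounds the damage a cluster of fake accounts can do. The sketch below is a much-simplified stand-in (a depth-limited walk rather than a flow computation) using an invented certification graph, but it shows the core idea: trust radiates outward from a trusted seed, so accounts vouched for only by a distant or suspect node never gain standing.

```python
# Simplified trust-propagation sketch. NOT Advogato's real algorithm,
# which uses network flow; this depth-limited traversal only illustrates
# that trust must be reachable from a seed through certifications.
from collections import deque

certs = {                           # hypothetical "who certifies whom" graph
    "seed": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["carol", "mallory"],
    "carol": [],
    "mallory": ["sock1", "sock2"],  # fake accounts certified by one member
    "sock1": [], "sock2": [],
}

def trusted(graph, seed, max_depth=2):
    """Accept accounts reachable from the seed within max_depth hops."""
    seen, queue = {seed}, deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for peer in graph.get(node, []):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, depth + 1))
    return seen

print(sorted(trusted(certs, "seed")))  # sock1/sock2 are excluded
```

Even in this toy version, mallory's sock-puppet accounts stay outside the trusted set because only one hop of certification reaches them.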
By giving contributors and readers access to tools for evaluating, ranking, and managing the collections, we are not just off-loading work; we are building communities of intellectual discourse. Strong community members are recognized by reputation capital and trust metrics and are rewarded. (For a good discussion of reputation capital in the Internet environment, see Rishab Aiyer Ghosh's "Cooking pot markets: an economic model for the trade in free goods and services on the Internet," First Monday, Vol. 3, Issue 3, March 1998, http://www.firstmonday.org/issues/issue3_3/ghosh/.)
Digital libraries can give back to contributors as well. By sharing collected information, contributors can see which items (manuscripts, songs, and software) are most in demand, in the form of top-ten lists or most-recommended lists. This not only enhances referral services but also helps new contributors understand what is considered a 'good' item.
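Turning usage data back into a browsable top-ten list is a small amount of code. The item names and log format below are invented for illustration; any download or recommendation log with one entry per event would work the same way.

```python
# Sketch: distill a raw download log into the top-N list the article
# describes. The log entries here are made up.
from collections import Counter

download_log = ["gzip-1.2.4", "lynx-2.8", "gzip-1.2.4",
                "sendmail-8.9", "gzip-1.2.4", "lynx-2.8"]

def top_items(log, n=10):
    """Return the n most-requested items with their counts."""
    return Counter(log).most_common(n)

print(top_items(download_log, 3))
# [('gzip-1.2.4', 3), ('lynx-2.8', 2), ('sendmail-8.9', 1)]
```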
More sophisticated sites for contributors, such as SourceForge for Open Source software developers (see http://sourceforge.net/), provide the tools that a project needs to get going on its own. Roadblocks to developers are removed by offering FTP and WWW hosting, list services, project status pages, version control software, backups, and discussion forums. By supplying these simple tools, SourceForge became one of the largest collections of Open Source projects in the world within a matter of months. While SourceForge directs its energy toward software developers, their needs are similar to those of contributor communities in any medium or genre.
What makes the tools described so far of particular interest to digital library projects is that they are Open Source and Free (issued under the Free Software Foundation's General Public License) for the most part. In the great tradition of public libraries, the tools and sites can be shared, built upon, and adjusted to local or particular circumstances. The tools and the concepts they use have been proven useful and effective in live and vocal communities. They have produced real and effective collections and more importantly real and effective communities in the best democratic sense.
By adopting not only the Open Source tools, but also the Open Source philosophy, which encourages community interaction and contributor involvement, digital libraries can open new horizons to new communities as well as greatly improve traditional services.
For more on the Open Source and Free Software philosophies, see DiBona, Chris, et al., Open Sources: Voices from the Open Source Revolution, O'Reilly and Associates, 1999, and Richard Stallman and the Free Software Foundation's philosophy pages at http://www.gnu.org/philosophy/.
Paul, your work rocks -- thanks for putting these ideas out there. Hopefully there are some librarians listening. :)
There are a few things I do my best to hammer home when giving talks about free software to librarians; perhaps you will find them useful as you prepare this piece or do PR for ibiblio or whatever.
First is that free software is every bit as colossal a contribution to our world (including libraries and library services) as the massive private funding done to build library buildings (at least in the USA) around the turn of the 20th century. Your point about "the great tradition of public libraries" makes this very clear. What the capital investment in libraries did was give people a shared place to go for shared information resources. It didn't fill the libraries with books or state any collection policies; those choices and funding responsibilities were left to the communities. In that same way, free code gives any community the tools they need to build shared information spaces they control. The difference here is that rather than coming from the donated riches of a small pool of uber-wealthy industrialists, the contribution today comes from all the distributed hackers and content creators. It's a mistake to do anything but jump at the opportunities and precedents you mention and build what you're building, given what we all have to work with.
Second is that the virtualization of content collections (ie that I can more easily find a song on a stranger's hard drive than in my public library) makes it infinitely easier to build big collections than filling library stacks with physical content. Nobody reading this needs to be convinced of this probably but I mention it because...
...your arguments for author-contributed metadata are dead on, especially with the increasing availability of self-describing content. But a huge boost to all of this would come from librarians if we (librarians, that is) reengineered our reference metadata to better fit this world of virtualized collections. For instance, could you imagine if the data in MARC records -- the moral equivalent of catalog cards -- were in a better-structured data access environment? We only need to look at the NCBI dbs or freedb or imdb to see what can happen when people design metadata systems right. If the name authority databases (giant cross-referencing for author names, including pseudonyms and such) managed by distributed catalogers at a handful of privileged institutions were freed into an environment like what you're proposing, we could construct unprecedented tools for tying information together. This is what we're trying to do with the jake project, which is a fairly small niche but is about to scale pretty big pretty fast. All this is to say that yeah, we can do it without the active participation of our venerable libraries, but man it would be easier with their help. And data.
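The name-authority idea is concrete enough to sketch: variant names and pseudonyms resolve to a single canonical record, so works scattered across collections can be tied together. The entries below are illustrative, not copied from any real authority file.

```python
# Hypothetical sketch of a freed name-authority lookup: pseudonyms and
# variant forms resolve to one canonical record. Entries are invented
# for illustration.
authority = {
    "Mark Twain": "Clemens, Samuel Langhorne, 1835-1910",
    "Samuel Clemens": "Clemens, Samuel Langhorne, 1835-1910",
    "S. L. Clemens": "Clemens, Samuel Langhorne, 1835-1910",
}

def canonical(name):
    """Resolve a name to its authority record, or pass it through."""
    return authority.get(name, name)

# Two very different strings now identify the same author:
print(canonical("Mark Twain") == canonical("S. L. Clemens"))  # True
```

With such a table open to distributed contributors, every collection that adopts it gets cross-referencing for free.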
Just my $.02. Rock on, gregorsamsa...
What ibiblio needs now is a way to do all of this! We're working on building the tools for creating "The Public's Library", but contributions are always welcome. What works? What kind of user rating system would you recommend? How about backends, such as slash or scoop? Anything that we're missing? This is a great opportunity for free software to continue to make free software work. Let us know what's out there.
I'm glad Dan brought up freedb and imdb. Though these systems address a more constrained problem space than ibiblio seeks to conquer, they are certainly testament to the power of distributed contribution. In a sense, they really represent the same eyeballs that make "all bugs shallow". If you think of holes and inaccuracies in a data set as bugs, anyway.
I've always felt that these techniques would be great for a database of song lyrics and variations. Since any single host would be sued into oblivion immediately, perhaps this would be a great test case for systems such as freenet.
you may say, rightly, that even those two are examples of narrower audiences than, say, a general public library. but it's part of my business to see just how far we can push the ideas. will we need to partition the various areas of the library so that someone might be an expert in military history, a journeyer (to use advo-terms) in southeast asian culture, a novice in java, and a beginner in poetry? or should we have trusted people in general (a la slashdot) who accumulate points (like karma) based on good works?
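The partitioned-expertise option is easy to model: give each user a level per topic and gate moderation on it. The level names below borrow loosely from the Advogato ladder mentioned above; the topics, profile, and threshold are invented for the sketch.

```python
# Sketch of per-topic certification: a user may be trusted in one area
# and a beginner in another. Names and threshold are assumptions.
LEVELS = ["beginner", "novice", "journeyer", "expert"]

profiles = {
    "pat": {"military history": "expert",
            "southeast asian culture": "journeyer",
            "java": "novice",
            "poetry": "beginner"},
}

def may_moderate(user, topic, minimum="journeyer"):
    """Allow moderation in a topic only at or above the minimum level."""
    level = profiles.get(user, {}).get(topic, "beginner")
    return LEVELS.index(level) >= LEVELS.index(minimum)

print(may_moderate("pat", "military history"))  # True
print(may_moderate("pat", "java"))              # False
```

The site-wide karma alternative is the same code with a single topic, which is one way to see the two designs as points on a spectrum rather than opposites.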
Here at the Field Museum, we have databases for just about every collection... and given that we have over 21 million specimens (dead plants, dead animals, bones, tissue samples, cultural artifacts, etc.), we have many databases, and a whole heck of a lot of data to curate and annotate.
The most extensive database is the Anthropology db - they've had a digital record of all artifacts since the mid-1970s and are well ahead of any other department in this place. The second most developed db is the one accompanying our photo collection (we have in the neighborhood of 900,000 photos, negatives, lantern slides, 2x2 slides...). It is this latter db that has the potential to be annotated by anybody in the Museum. The photo archivist is planning to permit annotation access to experts within the Museum, with the goal of having the most complete, current, and correct data possible. She is not an expert in geology, so why should she know what an Archeopteryx looks like (let alone how to spell it)?
Given our serious staffing constraints (especially with regard to the photo archive), the ability for experts to annotate will improve the turn-around time for error correction.
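That workflow - experts propose corrections, the archivist accepts them into the master record - can be sketched simply. Everything here (record IDs, field names, the two-step accept model) is invented for illustration, not the Museum's actual system.

```python
# Hypothetical sketch of expert annotation with archivist review:
# experts attach proposed corrections without getting write access to
# the master record, and the archivist applies them in one step.
records = {"photo-001234": {"subject": "Archeopteryx", "annotations": []}}

def annotate(photo_id, expert, field, proposed_value):
    """An expert proposes a correction to one field of a photo record."""
    records[photo_id]["annotations"].append(
        {"by": expert, "field": field, "value": proposed_value})

def apply_annotations(photo_id):
    """The archivist accepts all pending corrections into the record."""
    rec = records[photo_id]
    for note in rec["annotations"]:
        rec[note["field"]] = note["value"]
    rec["annotations"].clear()

annotate("photo-001234", "paleontology-curator", "subject", "Archaeopteryx")
apply_annotations("photo-001234")
print(records["photo-001234"]["subject"])  # Archaeopteryx
```

Keeping proposals separate from the master record preserves the archivist's final say while still shortening the error-correction turnaround.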
Now we just need to beef up our IT infrastructure so this Museum-wide annotation can be possible.
The biggest problem today is that some bits are currently illegal. There is no algorithm to determine whether your bits are illegal, but some are.
You can't know by looking at them if the bits are ok or not. There are many different types of illegal bits, ranging from libel in England, to proprietary software, to credit card numbers obtained by a waiter in a restaurant.
How can a digital library accept submissions when illegal information could be hiding in anything? Steganography is the art of hiding data in images or in the whitespace of a justified text document. The illegal bits could be encrypted and the key published elsewhere at a later date. No amount of diligence can protect your library from harassment and attack.
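To see why diligence cannot help, consider how little it takes to hide data in whitespace. The toy scheme below (one bit per line, carried by the presence or absence of a trailing space) is just one illustrative encoding; real steganographic schemes are far subtler, which is the comment's point.

```python
# Illustration of whitespace steganography: a trailing space encodes a
# 1, no trailing space encodes a 0. The visible text is unchanged.
def embed(lines, bits):
    """Hide one bit per line in trailing whitespace."""
    return [line + (" " if bit else "") for line, bit in zip(lines, bits)]

def extract(lines):
    """Recover the hidden bits from trailing whitespace."""
    return [1 if line.endswith(" ") else 0 for line in lines]

cover = ["The quick brown fox", "jumps over", "the lazy dog", "again"]
hidden = [1, 0, 1, 1]
print(extract(embed(cover, hidden)))  # [1, 0, 1, 1]
```

A reviewer reading the cover text sees nothing amiss, so a library can never certify that an accepted document carries no hidden payload.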
Digital libraries are a great idea, but in the current age people still believe that some bits should not be knowable. I wish I could be more convincing when I tell them they are wrong.