Older blog entries for markpasc (starting at number 14)

I don't think I've mentioned it, but my current project is winget, a Windows port of GNU wget with an actual Windows interface. This is about the biggest thing I've done to date, being an actual software project, so I'm very pleased with even just how much I've done so far!

Kit 1.1.6 is out. It adds a Radio to the Past form bit to the weblog post page, and incorporates a couple of minor fixes I'm going to let the Kit page claim I released as 1.1.5.

I've not been spending a lot of time in Radio-land lately, and will have to carefully consider it, since I may be ditching Windows in the not too distant future. I've invested enough in Radio that I should probably keep using it, but sunk time is a bad decision-making factor.

24 May 2002 (updated 24 May 2002 at 06:24 UTC) »

The next version of Stapler is chock-full (chockful?) of HTTP headery goodness.

So find some more bugs so I can put it out.

The headers in question are If-Modified-Since and User-Agent. Stapler identifies itself to the server as Stapler/x.y.z, and remembers the Last-Modified and Date headers (actually, all of them) so it can parrot them back for a 304 Not Modified as the spec suggests. Voila:

x.y.z.w - - [24/May/2002:01:58:29 -0400] "GET / HTTP/1.0" 304 - "-" "Stapler/2.0.1"
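For anyone wanting to see the shape of that exchange, here is a minimal sketch in Python. It is not Stapler's actual code (Stapler runs inside Radio); the function names and the fallback from Last-Modified to Date are assumptions made for illustration.

```python
import urllib.error
import urllib.request

USER_AGENT = "Stapler/2.0.1"  # version string borrowed from the log line above


def conditional_headers(remembered):
    """Build request headers from the headers remembered from the last fetch."""
    headers = {"User-Agent": USER_AGENT}
    # Assumption: prefer Last-Modified, fall back to the Date the server sent.
    stamp = remembered.get("Last-Modified") or remembered.get("Date")
    if stamp:
        headers["If-Modified-Since"] = stamp
    return headers


def fetch_if_changed(url, remembered):
    """Return (body, headers), or (None, remembered) when the server says 304."""
    request = urllib.request.Request(url, headers=conditional_headers(remembered))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), dict(response.headers)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep what we already have
            return None, remembered
        raise
```

On a first fetch `remembered` would be an empty dict, so only the User-Agent header goes out and the server has nothing to compare against.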

Next step would be to honor robots.txt files. Suppose I should put a referrer in, too, hmm. Might also be nice to say I'm using HTTP/1.1, but I'm not sure if I can.
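Honoring robots.txt could look roughly like the sketch below, which leans on Python's standard `urllib.robotparser`. The helper name `may_fetch` and the sample rules are made up for the example; how Stapler itself would do this is not decided here.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for a hypothetical site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

allowed = may_fetch(SAMPLE_RULES, "Stapler/2.0.1", "http://example.com/index.html")
blocked = may_fetch(SAMPLE_RULES, "Stapler/2.0.1", "http://example.com/private/page.html")
```

Parsing the text directly keeps the check separate from the fetching, so the same conditional-GET machinery could be used to retrieve robots.txt itself.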

23 May 2002 (updated 23 May 2002 at 05:43 UTC) »
Radio to the Whatever

It's rather depressing to find such a showstopping bug in Kit's Radio to the Past tool. I hadn't heard about it and didn't realize it was there, so that means no one whatsoever used the thing and had the decency to drop me a note about it. After all the noise in the groups about it, I figured someone might at least try the thing... but not so.

I've started planning for the next version of Stapler, in which everything old is new again under a different name and in a different place. Meanwhile the version of Stapler on my desktop and the one on the website are different, so I release the former as a "bugfix" version, 1.7.4.

One big idea (as in "What's the big idea?") will cause most of the change and provide a convenient excuse for the rest: eliminating the source-feed dichotomy. Since this is quite a big change, the next version of Stapler will, at least for now, be numbered 2.0 (0 as in "oh, boy").

Most sources required a corresponding feed, which I obviously realized since I added a "Make feed for this source" button not too long ago. However, the entire difference is a holdover from Stapler's original purpose being a feed of web comics, one of the few cases where it's better to have multiple sources in one feed.

So out go sources vs feeds--but you'll still be able to do the same thing, of course. (I'm not giving up my web comics feed yet.) Stapler 2.0 will allow users to disable writing feeds to disk independently of toggling their actual updating, and will include an "aggregate" scraper that aggregates the items of other feeds--presumably ones with disk writing turned off--into one feed. Literally, where you had a feed for one source because of Stapler's design, you'll have one feed, and where you aggregated four sources into one feed, for some value of four, you'll have 4+1 feeds, only one of which has disk-writing enabled.
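The 4+1 arrangement can be sketched in a few lines of Python. Everything here (the `Feed` class, its field names, the `aggregate` helper) is hypothetical and only illustrates the idea; it says nothing about how Stapler stores its data in Radio.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feed:
    name: str
    extractor: Callable[[], List[str]]  # returns the feed's current items
    write_to_disk: bool = True          # can be switched off independently of updating
    items: List[str] = field(default_factory=list)

    def update(self):
        self.items = self.extractor()


def aggregate(feeds):
    """Build an extractor that merges the items of other feeds, in order."""
    def extract():
        merged = []
        for feed in feeds:
            merged.extend(feed.items)
        return merged
    return extract


# Four comics feeds that update but never touch the disk...
comics = [
    Feed(f"comic{i}", extractor=lambda i=i: [f"comic{i} strip"], write_to_disk=False)
    for i in range(4)
]
# ...plus one aggregate feed that does get written.
combined = Feed("web comics", extractor=aggregate(comics))

for feed in comics + [combined]:
    feed.update()
```

The member feeds have to update before the aggregate does, which is why `combined` comes last in the loop.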

So maybe it's not such a hot idea, having a sourcefeed that can be sourcelike or feedlike or both; but it seems like a good idea at the moment.

In addition to that change, some things are changing name to make for (I hope) clearer nomenclature. Instead of the antiquated and scary "scraper," feeds will have "extractors." Instead of having "document types," feeds will have "formats." Those are the name changes I foresee now, but I'm sure one or two more will sneak in.

Oh, and the "ByNumbers" extractor becomes "By selector." Duh.

Ideally, of course, I would write a script that converts a 1.7.4 StaplerData table to a 2.0 one. In fact, that's how I refined the new data model, figuring out how to turn the old into the new. But I'd really rather not, since it's complicated, and anyone with custom scrapers or document types will have work to do anyway. (But then, I suppose that's actually very few people, so perhaps it is worthwhile.)
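As a rough idea of what such a conversion might involve, here is a Python sketch over plain dicts. The layout of both tables is invented for the example (the real StaplerData structure is not shown in this post); only the renames and the source-to-feed rule come from the plan described above.

```python
# Rename mentioned in the post; the dict itself is illustrative.
RENAMED_EXTRACTORS = {"ByNumbers": "By selector"}


def convert(old):
    """Turn a hypothetical 1.7.4-style table into a hypothetical 2.0-style one."""
    new_feeds = {}
    # Every old source becomes a feed with disk writing off by default.
    for name, source in old["sources"].items():
        scraper = source["scraper"]
        new_feeds[name] = {
            "extractor": RENAMED_EXTRACTORS.get(scraper, scraper),
            "format": source["documentType"],
            "writeToDisk": False,
        }
    for name, feed in old["feeds"].items():
        members = feed["sources"]
        if len(members) == 1:
            # A feed that existed only to publish one source collapses into it.
            new_feeds[members[0]]["writeToDisk"] = True
        else:
            # A real aggregate becomes the "+1" feed (assumes no name clash).
            new_feeds[name] = {
                "extractor": "aggregate",
                "members": list(members),
                "writeToDisk": True,
            }
    return new_feeds


old_table = {
    "sources": {
        "a": {"scraper": "ByNumbers", "documentType": "html"},
        "b": {"scraper": "ByNumbers", "documentType": "html"},
        "c": {"scraper": "custom", "documentType": "html"},
    },
    "feeds": {"comics": {"sources": ["a", "b"]}, "c feed": {"sources": ["c"]}},
}
new_table = convert(old_table)
```

Custom scrapers and document types pass through with their old names, which is exactly the part that would still need work by hand.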

As is apparent, 2.0 is still very much in the planning stage, though it would be nice to have a copy to release 17 May, since that's the day I released version 1.0.1 last year. (I'm not sure when I released 1.0; I guess I could look it up in my blog archives, but I can't be arsed just now.) Just a heads up for y'all who actually care.

Huh, so Kit is "popular" now:

Mark Paschal released Kit 1.0.1, a popular set of interfaces and utilities for Radio 8.

That's good to know.

Kit 0.9.6. Two bugs fixed and a feature.

I mentioned I was installing Debian 2.2. It was actually pretty easy, because of some combination of the CD-ROM drive actually working, more experience since I installed Linux last, not having (read: bothering) to repartition the disk, and Debian being awesome.

I'm trying to install LiveJournal server next, but one of the early steps is to make sure one's CPAN module is up to date, and it caught me using the perl 5.005 that came with Debian 2.2. Yadda yadda, now I'm trying to find a server from which I can just apt-get it.

I talked more about my Linux-reinstalling experience here, here, here, and here.

17 Apr 2002 (updated 17 Apr 2002 at 18:04 UTC) »
The Google API

Yeah, I've not mentioned it yet, but suddenly several articles herd my thinking thataway.

Adam Vandenberg discusses XHTML (look for "See, there's this Web, and it has these 'standards'"). His argument is eminently reasonable, though I've done my share of rah-rahing for XHTML and whatnot.

Here's the slightly-snuck assumption:

Well, people are people and computers are computers, and Webpages are primarily meant for communicating with other people and not communication with machines.

Web pages have been for people to communicate with people, but the whole point of XML, CSS, and XHTML is that web documents should be communicable to machines. For example, if I only had to specify particular paths along the DOMs of XHTML documents, Stapler would be much simpler software (an alarm clock, database, web fetcher, and the path walker). Also, machines have to communicate this content to people; that's all well and good if you have a standard way of doing that, such as the visual web browser, but what if the human can't see? The machine needs to be able to understand enough about the content to convert it between different media--so that's how the accessibility argument relates.
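To make the "path walker" idea concrete, here is a small Python sketch using the standard library's ElementTree and its limited XPath support. The sample page, the path, and the `extract_items` name are all invented for the example; it only works because the input is well-formed XHTML, which is the whole point of the argument.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; "x" is just a local prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_items(xhtml_text, path):
    """Walk one path through a well-formed XHTML document and return the text found."""
    root = ET.fromstring(xhtml_text)
    return ["".join(el.itertext()).strip() for el in root.findall(path, XHTML_NS)]


SAMPLE_PAGE = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="nav"><h2>Menu</h2></div>
    <div class="news"><h2>First headline</h2><h2>Second headline</h2></div>
  </body>
</html>
"""

headlines = extract_items(SAMPLE_PAGE, "./x:body/x:div[@class='news']/x:h2")
```

With tag soup instead of XHTML, `ET.fromstring` would simply refuse to parse, and the scraping would have to fall back on something messier.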

It's a good argument and certainly nothing to ignore, but the important part is:

The web browser as a universal client is still a very powerful idea. ... [N]on-HTML Internet APIs... are going to complement web browsing, not replace it.

I certainly don't read all the web in RSS. Even if I could add everything in there, would I? Probably not, though I would read more there than most people.

So, first off, HTML isn't going away any time soon. Meanwhile, this week's Disenchanted article is specifically on Google's SOAP API... by way of construction toys:

Where have all the young and amateur engineers gone? Apparently to computers, where the philosophy of olde-time Lego, Meccano and Heathkit is in super-overdrive.

This philosophy is all about building personal projects with easily understandable, easily connectable, pre-made parts, and the world of software is now awash with hundreds of thousands of them.

The article is a comprehensive guide to where the Google SOAP API came from, and while it doesn't explicitly say that this is only throwing the doors open to the web services world, it's so. Here I unveil my cynicism (or, perhaps, optimism): specifically, I agree with Aaron Straup Cope that the Google API isn't earth-shattering in and of itself. Gee, people can put "top-ten Google hits for foo" search boxen on their Radio pages. Couldn't you do that before?

Yeah, but it's qualitatively easier now. After all the moaning about how no one is deploying web services, this throws the door wide open to them, full stop. Now that Google's done it, will Dictionary.com do it? Aaron's weblog yields an example of the utility of such a service, even though you could do that with a more complex API too.

(Aside: probably not, since the revenue model remains to be seen. Might they start selling product placement in example usage text?)

I'd like to think this is, as I said, optimism. Maybe I'm a victim of the hype, but if this is only the beginning of web services, there are going to be so many even more amazing services, and they're all in the future, awaiting invention.

9 Apr 2002 (updated 9 Apr 2002 at 13:24 UTC) »

Stapler grousing

Should I be bitter that RssDistiller gets more noise than Stapler? (I am, at times.) Should I have chosen a better name, one that more obviously screams "I TURN STUFF INTO RSS!"? Is it a design and documentation issue? Is it because Stapler isn't pretty like RssDistiller, with its tabbed interface and eVectors' bumblebee colors?

Am I wrong that it's difficult to specify what one wants out of a page (i.e., how hard is that in RssDistiller)? Was I wrong to have a feeds concept that aggregates sources? Should I reimplement feeds as a special source scraper?

Obviously I should figure out how to share sources, since RssDistiller does that. I should probably make Stapler not autonumber new feeds and sources. I haven't worked on the refined interface yet (and if I make a radical change to the feeds thing, I shouldn't, yet).
