Older blog entries for jmason (starting at number 34)

3.0.0 works great. Must release it soon...

Well, after all that I've decided to release sitescooper 2.3.0 as 3.0.0 just to make a slightly bigger noise about it. ;)

Just did the first CVS checkin of it (the ci was delayed due to a problem I had with the sitescooper CVS at sourceforge).

Let's hope I haven't forgotten any cvs adds or I'll be getting angry mails tomorrow morning!

Sitescooper 2.3.0 news: works perfectly on Windows, first time! I'm impressed. Well, when I say perfectly, I mean everything works apart from socketpair(), fork() and the like, which are only used when the -preload fork preloading mode is on. But that's purely an optional tweak that relies on some severe UNIXisms, so I'm not really surprised by that ;)
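For anyone wondering what those UNIXisms look like, here is a minimal C sketch of the socketpair()/fork() pattern: the parent hands a request to a forked worker over a socket pair and reads back the reply. This is not sitescooper's code (sitescooper is Perl); ask_worker() and the "done:" reply prefix are made up for illustration, and the worker just echoes the request instead of fetching anything.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: pass one request to a forked worker and collect
 * its reply as a NUL-terminated string. Returns the reply length, or -1
 * on error. A real worker would fetch a URL; this one only echoes. */
ssize_t ask_worker(const char *request, char *reply, size_t reply_len)
{
    int sv[2];
    pid_t pid;
    ssize_t total = -1;

    if (reply_len == 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: read the request until EOF, then answer on the same socket. */
        char buf[256];
        size_t len = 0;
        ssize_t n;

        close(sv[0]);
        while (len < sizeof buf &&
               (n = read(sv[1], buf + len, sizeof buf - len)) > 0)
            len += (size_t)n;
        if (write(sv[1], "done:", 5) != 5 ||
            write(sv[1], buf, len) != (ssize_t)len)
            _exit(1);
        close(sv[1]);
        _exit(0);
    }

    /* Parent: send the request, signal EOF, then read the whole reply. */
    close(sv[1]);
    if (write(sv[0], request, strlen(request)) == (ssize_t)strlen(request)) {
        ssize_t n;

        shutdown(sv[0], SHUT_WR);
        total = 0;
        while ((size_t)total < reply_len - 1 &&
               (n = read(sv[0], reply + total,
                         reply_len - 1 - (size_t)total)) > 0)
            total += n;
        reply[total] = '\0';
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

Neither socketpair() nor fork() exists in that form on plain Windows, which is why a mode built on them has to stay optional there.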

Shock horror -- clueful article at ZDNN!

Just added output templates to 2.3.0 -- which means that the output generated by sitescooper is now sedded into a template, rather than gradually built up with static strings and a stackload of conditionals inside the code.

This is a bit nicer and should help with i18n'ing the output for non-English speakers.
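The general idea, sketched in C rather than sitescooper's Perl: keep the fixed text in a template and substitute the variable parts into it. The __TITLE__ placeholder syntax and the fill_template() name are assumptions for this example, not sitescooper's actual template format.

```c
#include <string.h>

/* Hypothetical helper: copy tmpl into out, replacing every occurrence
 * of key with value. Returns 0 on success, -1 if out is too small or
 * key is empty. */
int fill_template(const char *tmpl, const char *key, const char *value,
                  char *out, size_t out_len)
{
    size_t key_len = strlen(key);
    size_t val_len = strlen(value);
    size_t used = 0;

    if (key_len == 0 || out_len == 0)
        return -1;

    while (*tmpl != '\0') {
        if (strncmp(tmpl, key, key_len) == 0) {
            /* Placeholder found: emit the value, skip past the key. */
            if (used + val_len >= out_len)
                return -1;
            memcpy(out + used, value, val_len);
            used += val_len;
            tmpl += key_len;
        } else {
            /* Ordinary character: copy it through unchanged. */
            if (used + 1 >= out_len)
                return -1;
            out[used++] = *tmpl++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

With this shape, translating the output means shipping a different template file, not touching the conditionals in the code.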

volsung: my personal preference for open()/close()-style APIs is as follows:

1. define a struct containing the state info for the API (let's call it API_t). You could even keep the function table and the state included in that struct, for a real C++-like feeling ;)

2. provide an API_new(API_t *) function that initialises it into a sane, unopened state

3. provide an API_open(API_t *, ...) function that opens it

4. provide API_close(API_t *) to close it

5. provide API_free(API_t *) to free any dynamic stuff after use.

This is very C++-like in usage, and works pretty well IMHO. The problem with returning a plain int as the handle, open()-style, is that you then get stuck maintaining a lookup table of int-to-state-struct mappings. Yuck, and AFAICS not really necessary or elegant for user-mode library code...
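Here's a minimal sketch of those five steps in C. The struct fields and the file-backed behaviour are just assumptions to give the functions something to do; only the API_t, API_new, API_open, API_close and API_free names come from the list above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* 1. State for the (hypothetical) API lives in a caller-owned struct. */
typedef struct API_t {
    FILE *fp;      /* NULL while unopened */
    char *path;    /* heap copy of the last opened path, or NULL */
    int   is_open;
} API_t;

/* 2. Initialise into a sane, unopened state. */
void API_new(API_t *api)
{
    assert(api != NULL);
    api->fp = NULL;
    api->path = NULL;
    api->is_open = 0;
}

/* 3. Open: returns 0 on success, -1 on failure (state left unchanged). */
int API_open(API_t *api, const char *path)
{
    char *copy;
    FILE *fp;

    if (api->is_open)
        return -1;
    copy = malloc(strlen(path) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, path);
    fp = fopen(path, "r");
    if (fp == NULL) {
        free(copy);
        return -1;
    }
    free(api->path);   /* drop the path kept from any earlier open */
    api->path = copy;
    api->fp = fp;
    api->is_open = 1;
    return 0;
}

/* 4. Close: returns 0 on success, -1 if it was not open. */
int API_close(API_t *api)
{
    if (!api->is_open)
        return -1;
    fclose(api->fp);
    api->fp = NULL;
    api->is_open = 0;
    return 0;
}

/* 5. Free any dynamic stuff; closes first if the caller forgot to. */
void API_free(API_t *api)
{
    if (api->is_open)
        API_close(api);
    free(api->path);
    api->path = NULL;
}
```

The caller declares an API_t, passes its address to each call, and never sees an int handle, so the library needs no lookup table.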

What do you think?

Released 2.2.9 -- but without all the redirect fixes. 2.3.0 will become the devel version pretty soon, and that can handle all the tricky redirects nicely.

Running low on sitescooper-hacking time again. This will delay 2 things: getting the fixes for the redirect problems out of 2.3.0 and backported into 2.2.9, so that 2.2.9 can be released; and finalising 2.3.0 so that that can be released as a beta.

Too much stuff to do. Argh...

snicker: Dialog with an Internet Toaster (found at More Like This).

jaz hits the nail on the head in the SOAP discussion:

whereas the normal use of HTTP is geared toward passing information between two mutually distrusting applications, one of which is the client and one the server, the point of SOAP is to use HTTP to distribute one application across a network.

That's the bit those pesky network admins won't like about SOAP.

I wonder if SOAP supports a server-to-client callback mechanism? Now that is a whole new world of pain, security-wise...

Finally got around to doing a site file for Advogato. Thanks to David Desrosiers for reminding me, indirectly...

Mad thought, from Jim Gettys on the handhelds mailing list:

A moral from early 1990's work on the X server: we made it consume less than 40% of the data consumed by X11R1, while making it faster. Touching memory has gotten quite expensive. It is usually cheaper to recompute something than reference a precomputed value now.

I know CPU speeds have zoomed ahead of memory accesses etc., but I must admit I never thought of that...

Advogato stuff: just noticed that Adam Shostack is here too. Groovy -- I've picked up so much cool security advice from him over the years reading firewalls, bugtraq etc., and his text on writing secure code has proved very handy on several occasions. So I went to certify him -- and found I already had a while ago. ;)

Turns out that he'd also replied to a comment I made about the TV-free lifestyle a while back, but I hadn't noticed (presumably the comment had scrolled off the recentlog.html). That would be my #1 change request for Advogato -- a way to "reply" to other people's diary comments so that the original poster was notified somehow.

jmason--yeah, so TV-free can mean pubs. I'll claim that pubs are a superior form of entertainment any day. They involve human interaction, rather than sitting in front of the idiot box. So, perhaps people drink to excess? I know that I regret far less time spent drinking with friends and buddies than I regret time spent in front of the TV.

Perhaps the next morning is an exception.

I agree with the comment BTW, although with the (escalating) price of booze here in Ireland it's become harder to justify (never mind the hangover factor!).

I definitely find TV a waste of time, and generally wind up in front of the computer -- but that has implications for my tendonitis, which is bad. So that's TV=bad, pub=expensive, hacking=sore... need a new pastime I guess ;)

small-linux-related question: someone on the handhelds.org mailing list mentioned grope, a compiler-optimisation work-in-progress of Nat Friedman's which sounds very interesting -- but I can't find any sign of it on the net. Or Nat's new homepage wherever that may be. nat.org seems to point to www.helixcode.com. http://primates.helixcode.com/~nat/ is empty. A mystery!

2.3.0 update: it's working quite nicely now. I'm not sure when I want to make it the "official" dev version -- I'll probably wait 'til after 2.2.9 is released, and that still needs the 2 redirect bug fixes.

I've just done a little speed testing to check out how well the parallelised URL retrieval works. Check these figures:

2.3.0: 46.51user 1.17system 3:36.99elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (700major+25507minor)pagefaults 0swaps

2.2.9: 44.68user 1.62system 6:49.73elapsed 11%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (653major+29522minor)pagefaults 0swaps

That's running sitescooper 2.3.0alpha vs 2.2.9, creating HTML output, retrieving all the "palm" category of sites.

The 2.3.0 there is running 4 preforked HTTP client processes. It's using 10 percentage points more CPU (21% vs 11%, not too bad), slightly more major page faults (700 vs 653) but fewer minor ones, and about 1.4 more seconds of user+system CPU time (47.68 vs 46.30) -- but it finishes loading the sites in roughly half the time of 2.2.9 (3:37 vs 6:50).

I think that's a big thumbs up for the preloading and parallelised code then!
