Older blog entries for slamb (starting at number 55)

9 Feb 2007 (updated 9 Feb 2007 at 08:43 UTC) »
ncm, here are five tidbits you probably didn't know about me:

  1. I first learned about the Internet by dialing ISCABBS through my dad's dual-speed 300/1200 baud modem when I was 10.

  2. My official major in college was first electrical engineering, then computer science, but my favorite classes and professors were in physics. If I hadn't already loved computing for nine years before finding the physics-for-majors classes, I'd be on a different path today.

  3. When I drove from Iowa City to the Bay Area with a carload of belongings, I showed up on the doorstep of good friends I'd met through ISCABBS. I'd never been to Northern California or met them in person, and I lived with them for the next year.

  4. I was once nearly thrown in a Tanzanian jail. The Arusha and Serengeti offices of Tanzanian Immigration and Customs disagreed on the legality of my crossing the border by bicycle from Kenya, so far from an official border post. I probably shouldn't have taken the advice of a Kenyan who'd spent three days in Egyptian military custody for entering without a visa.[*]

  5. During that crossing, my friends and I went through a dozen inner tubes and two dozen patches before running out twelve kilometers from our destination and walking. Three days cycling from Karen, Kenya to Loliondo, Tanzania - much of it on a rocky, hilly, winding earthen path that diverged wildly from our map - will do that. We had to bum a 450 km ride to Arusha, where we could get replacements. Probably the stupidest thing I've ever done, but the trip was one of the greatest experiences of my life. The people were kind, generous, and amazing in the truest sense of the word.

[*] "What did you see?" "Camels and sand. Camels and sand. What do you see?" But all worked out for him, too - he met his wife in Egypt.

5 Feb 2007 (updated 5 Feb 2007 at 22:00 UTC) »
ncm, fejj

The data structure you're talking about is called a Bloom filter. It's one of those structures I see and think "oh, that'll come in handy sometime," but then it never does. The nastiest limitation is that you can't ever remove anything from it. So if your web crawler should eventually rescan the same URL (as I hope it would), it'd be unsuitable for tracking which URLs you've seen. I confess that I don't really understand what the interview question is getting at.
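For the curious, the structure itself is tiny. Here's a minimal sketch in Python - the array size, hash count, and hash-splitting scheme are arbitrary choices of mine, nothing canonical:

    import hashlib

    class BloomFilter(object):
        """Sets k bit positions per item in an m-bit array."""
        def __init__(self, m=1 << 20, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, item):
            # carve k positions out of one cryptographic hash
            h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
            for i in range(self.k):
                yield (h >> (32 * i)) % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    seen = BloomFilter()
    seen.add("http://example.com/")
    assert "http://example.com/" in seen   # never a false negative
    # Membership tests can return false positives, and there's no remove():
    # clearing one URL's bits could silently erase part of another URL's.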

1 Feb 2007 (updated 1 Feb 2007 at 07:18 UTC) »
nconway, re: pgmemcache

Interesting project, but I don't understand how to use it correctly with transactions.

I can see that you and/or Sean Chittenden put some thought into this - your README.pgmemcache suggests using an AFTER UPDATE trigger, which runs within the transaction making the change. It further suggests deleting the stale entry from the cache rather than replacing it, because the latter is not "transaction safe". I take that to mean that if the transaction gets rolled back, the cache will incorrectly hold the new value.

But how do records get added? If they're added on retrieval (as would make sense for a cache), it seems like there's still a potential problem:


conn a          conn b
------          ------
                begin work;
begin work;
update;
(remove entry)
commit work;
                select;
                (add entry)

PostgreSQL has MVCC, so I think this schedule is possible, even at set transaction isolation level serializable. At the end, the cache is stale and there doesn't seem to be a mechanism to correct it.

The readme linked to this presentation (PDF), which suggests an ON COMMIT trigger. I think an AFTER COMMIT trigger could mostly prevent this from happening, provided that the add does nothing if there's already an entry in place. But it introduces new problems:

  1. if there's a transient network outage affecting the memcached server, the entry can't be replaced, and it's too late to fail the transaction.
  2. other connections may briefly see a stale entry. It will usually be quickly corrected, so it may not be a problem for many applications.
  3. if memcached's LRU ejection happens between the replace and the conditional add, the stale entry will be added back in.

The only easy solution I see is to lock the row on any cache miss or update, keeping your current procedure otherwise. I guess that's not so bad, really: no real performance hit, just select ... for share instead of plain select ....
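Concretely, here's a sketch of that read path using psycopg2 and the python-memcached client - the users table and key scheme are invented for illustration:

    import memcache
    import psycopg2

    mc = memcache.Client(["127.0.0.1:11211"])

    def get_user(conn, user_id):
        key = "user:%d" % user_id
        row = mc.get(key)
        if row is not None:
            return row
        cur = conn.cursor()
        # "for share" blocks a concurrent update of this row (and thus its
        # cache-invalidating trigger) until we commit, so the entry added
        # below can't go stale the way the schedule above shows.
        cur.execute("select name, email from users where id = %s for share",
                    (user_id,))
        row = cur.fetchone()
        mc.add(key, row)   # add, not set: never clobber a fresher entry
        conn.commit()      # releases the row lock
        return row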

ekashp, re: advogato recentlog quality

There are two things you can do to make advogato's recentlog more focused on open source development:

  • Set a good example. Talk about something you've contributed to open source recently. Maybe people will get the hint and do something trendy like syndicating http://their.blog.here/?tag=opensource instead of http://their.blog.here/, thus keeping their open source stuff while suppressing their animé commentary.
  • Use advogato's "interestingness" metric. It will suppress stuff under a threshold, and it will even have some effect on entries you haven't rated at all. I don't know the exact implementation, but for entries you haven't rated, the metric's supposed to guess based on ratings from people whose diaries you've rated highly. My recentlog says "9 entries suppressed at threshold 3" at the bottom, and the first number will be higher next time the metric is recomputed.

4 Jan 2007 (updated 4 Jan 2007 at 23:07 UTC) »
apenwarr, re: scheduling

Thanks for your entry on Schedulator a while back. At work, we have been struggling with scheduling. I'd suggested making more use of Bugzilla to track issues and tasks instead of just bugs. Then we had some management changes. Finally, we saw your entry. Looks like it was the tipping point. Your code is useless to us (for starters, we use Linux, Bugzilla, MediaWiki, and Python rather than Windows, FogBugz, GratefulTavi, and C#), but the documentation is wonderful.

We're looking at using Bugzilla's existing time tracking features and adding better reporting. Your slip factor, automatic due dates, and load factors will feature prominently, as will Joel's original vs. current estimates. (Our existing schedules have had a certain "we've always been at war with Eurasia" feel, so there's no way to review historical slip factors.) We have a lot of other balls in the air, but hopefully we'll be sending some patches back upstream eventually.

1 Jan 2007 (updated 1 Jan 2007 at 21:10 UTC) »

The spammers are getting more subtle. Case in point: 733tguru. A legitimate[*] post about MySQL threading performance on Linux vs. FreeBSD, with a few spammy links stuck in the middle.

[*] - though stolen, badly formatted, and rather outdated; it refers to FreeBSD 5.x as under development. I'm running 6.1-RELEASE.

lkcl, SQLite

when you get up to 30mbyte files, such as those used by embedded.org, the access time will drive you nuts.

People use SQLite databases much larger than that without problems, so your blanket statement is false - there's more to it than that. You said this is a monotone database? Maybe the schema isn't normalized well, monotone isn't creating the right indexes, its queries are ill-considered, or it's doing unusual things like storing huge blobs in the database. (The last seems likely, since it's a version control system. I don't know if anyone has bothered to optimize this case in SQLite.) Has anyone benchmarked an equivalent PostgreSQL database for comparison?

Admittedly, there are ways SQLite could be causing your performance problem. I don't think its query planner is very sophisticated. And worse, SQLite doesn't support any concurrency between writers - a writer locks the whole database file - though there's talk on their wiki of implementing finer-grained locking.
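That writer exclusion is easy to demonstrate from Python's sqlite3 module (a sketch; timeout=0 just makes the failure immediate instead of waiting for the lock):

    import sqlite3

    a = sqlite3.connect("demo.db", timeout=0)
    b = sqlite3.connect("demo.db", timeout=0)
    a.execute("create table if not exists t (x)")
    a.commit()
    a.execute("insert into t values (1)")    # a now holds the write lock
    try:
        b.execute("insert into t values (2)")
    except sqlite3.OperationalError as e:
        print(e)                             # "database is locked"
    a.commit()                               # only now can b write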

p.s. salmoni: if you get a crash in the middle of a write to a flat-file, then, just like ms-access, your data is corrupted. don't use it.

I assert otherwise, with the exception that your application corrupting SQLite's data structures can of course make anything happen. It sounds like you've encountered a problem with a pretty complex application that happens to use SQLite. Link to your bug report with a small demo application and I'll believe you when you say that SQLite is at fault.

27 Nov 2006 (updated 27 Nov 2006 at 22:56 UTC) »
salmoni, SQLite

SQLite's reliability is quite good. If you set pragma synchronous = full (the default in SQLite 3.0) and pragma fullfsync = 1 (for OS X, where they've decided to make fsync(2) dangerously worthless and nonstandard), SQLite is an ACID database. All bets are off if you corrupt its memory, which is possible since it runs in-process. Otherwise, it's the real deal.
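In Python terms, that's just the following ("data.db" is a placeholder):

    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.execute("pragma synchronous = full")   # fsync at the right moments
    conn.execute("pragma fullfsync = 1")        # OS X: F_FULLFSYNC flushes
                                                # the drive's write cache too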

9 Nov 2006 (updated 9 Nov 2006 at 21:45 UTC) »
robogato

I was one who pushed phpgurru over the spam threshold. It's a shame his last diary post was lost, because that's what convinced me he was a spammer. He said something about knowing a guy who will provide you content for your website so you get more AdWords money or something, and linked to the guy's spammy website.

ncm, C++ thread cancellation

That's interesting! But I bet against you - they'll fail. I'd say pthread cancellation has six problems that Java's interrupt() does not:

  1. widely inconsistent, broken implementations [1]
  2. it's a C standard that proposes a C++ way (exception stacks) rather than a C way (ECANCELED)
  3. inability to pass through language boundaries, like through the Python interpreter
  4. not reversible [2], making it worthless for canceling a single task in a long-running worker thread (see the sketch after this list)
  5. allows what I describe in the sigsafe documentation as the "after race" [3]. The standard says: [4]
    ... if the thread is suspended at a cancellation point and the event for which it is waiting occurs before the cancellation request is acted upon, it is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution.
  6. (for the people who implemented it via a C++ exception) crashes if cancellation happens in a destructor while an exception is being handled. (close(2) is a cancellation point, so this is likely! The solution I've seen is for the user to always disable cancellation before calling it in this case, but they won't. It'll just crash.)
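To make #4 concrete, here's the kind of reversible, per-task cancellation I mean, sketched in Python with names of my own invention. Java's interrupt() supports this pattern; pthread cancellation can't, because once a cancel is acted upon, the whole thread unwinds:

    import threading

    class CancelledError(Exception):
        pass

    class Worker(threading.Thread):
        """One long-running thread; any single task can be cancelled."""
        def __init__(self, queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.cancel_requested = threading.Event()

        def checkpoint(self):
            # tasks call this at loop iterations and blocking points
            if self.cancel_requested.is_set():
                raise CancelledError()

        def run(self):
            while True:
                task = self.queue.get()
                if task is None:
                    return
                self.cancel_requested.clear()  # reversible: reset per task
                try:
                    task(self.checkpoint)
                except CancelledError:
                    pass   # this task dies; the thread lives on for the next

    # another thread cancels whatever task is running with:
    #     worker.cancel_requested.set()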

The C++ people will probably avoid #2, #3, and #4, and introduce #7:

  7. no standard way of signaling cancellation to underlying C code that makes blocking system calls

Given that most everyone using C++ wraps some third-party C library or makes some direct system call, I think #7 is particularly important. C++ just can't pull it off without C. The pthread situation needs to be fixed first. #2 just can't be fixed, short of them admitting they screwed up, ditching it, apologizing to everyone who tried to use it, and going with ECANCELED. I'd be just a tiny bit surprised to see a standards body do such a thing...

I think the screwed-up pthread cancellation is a consequence of premature standardization. After working with DSL Forum (worst standards body ever), I'm quite familiar with the problem. They stamped approval on a cool new thing before there were three useful, compatible implementations, so the problems weren't discovered until it was too late to change them.

So what hope is there for people who want to use cancellation? Well, you can easily build it on top of sigsafe. Unfortunately, sigsafe's only practical for people who make direct system calls (e.g., people who write servers instead of gtk-based GUI clients), and you either need to (1) port it to each platform you will use (takes me a few hours for each) or (2) have a "correct" path and a "racy but portable" path.

I think portability is overrated - everyone should use new versions of Linux, FreeBSD, or (maybe) OS X. Trying to port to old versions or to a million different systems is a colossal waste of time. [5] But my view is unpopular; thus the lack of major applications using sigsafe.

7 Nov 2006 (updated 8 Nov 2006 at 02:40 UTC) »
ncm

You were in Boston before, right? Welcome to the Bay Area...

robogato

Thank you! In response to your solving one of my long-standing complaints about the recentlog, I'll try to post something interesting to it.

Load test grapher

Background: I've been slowly playing with a suite of software for looking at performance. It's one of those problems I think few people look at because, even though there's real computer science involved, any project involving the word "test" gets sneered at by most people as mere QA work. (Another project I've been playing with is a monitoring system for sysadmins; it suffers the same way. Sysadmin tools suck.) It's possible there's a little more computer science involved than necessary, because lately I've been wanting to do a little less software engineering and a little more computer science...

No release, documentation, or even a name yet, but there's a little bit of code in my Subversion repository here.

Today's problem: I made a graph of transaction duration vs. time of a quick Apache load test.

It's worthless. You have basically no idea how long the average transaction takes because the scale is blown by a few outliers. I tried a couple obvious things to solve this:

  • switching to log scale
  • setting ymax to be some function of the mean and standard deviation

but wasn't happy with the result. What I really wanted was what network people do for billing: look at the 95th percentile. The obvious way to do that is to sort the data and pick element .95n. But my program works with huge data sets - this graph was generated from 1.3 million transactions. Everything else uses (nearly) linear time and not much disk. I don't want to sort stuff.

Off to the ACM Digital Library:

  1. The P^2 algorithm for dynamic calculation of quantiles and histograms without storing observations.
  2. Space-efficient online computation of quantile summaries

These are really cool algorithms for estimating quantiles like the median or 95th percentile. The first has no defined error bounds but uses O(1) space and O(1) time per observation. The second returns a value whose rank r' is within [r - εN, r + εN] (with ε chosen in advance). It runs in a single pass using O((1/ε) log(εN)) space. Neither requires you to know N ahead of time.
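The P^2 marker update is short enough to sketch. This is my own transcription from the paper, written against the QuantileEstimator interface in the test below - it's illustrative, not the tested code in my repository:

    class QuantileEstimator(object):
        """P^2 estimate of the p-quantile in O(1) space."""
        def __init__(self, p):
            self.p = p
            self.q = []                              # marker heights
            self.n = [0, 1, 2, 3, 4]                 # actual positions
            self.np = [0, 2*p, 4*p, 2 + 2*p, 4]      # desired positions
            self.dn = [0, p/2.0, p, (1 + p)/2.0, 1]  # desired increments

        def append(self, x):
            if len(self.q) < 5:              # first five: just keep sorted
                self.q.append(x)
                self.q.sort()
                return
            if x < self.q[0]:                # find cell k: q[k] <= x < q[k+1]
                self.q[0] = x
                k = 0
            elif x >= self.q[4]:
                self.q[4] = x
                k = 3
            else:
                k = 0
                while x >= self.q[k + 1]:
                    k += 1
            for i in range(k + 1, 5):        # markers above the cell shift right
                self.n[i] += 1
            for i in range(5):
                self.np[i] += self.dn[i]
            for i in (1, 2, 3):              # nudge interior markers toward target
                d = self.np[i] - self.n[i]
                if ((d >= 1 and self.n[i + 1] - self.n[i] > 1) or
                        (d <= -1 and self.n[i - 1] - self.n[i] < -1)):
                    d = 1 if d > 0 else -1
                    qp = self._parabolic(i, d)
                    if not (self.q[i - 1] < qp < self.q[i + 1]):
                        qp = self._linear(i, d)
                    self.q[i] = qp
                    self.n[i] += d

        def _parabolic(self, i, d):
            q, n = self.q, self.n
            return q[i] + d / float(n[i + 1] - n[i - 1]) * (
                (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / float(n[i + 1] - n[i])
                + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / float(n[i] - n[i - 1]))

        def _linear(self, i, d):
            return self.q[i] + d * (self.q[i + d] - self.q[i]) \
                / float(self.n[i + d] - self.n[i])

        def get(self):
            if len(self.q) == 5:
                return self.q[2]             # the middle marker is the estimate
            s = sorted(self.q)               # fewer than five samples: exact
            return s[int(self.p * (len(s) - 1))]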

I implemented the first one. A really simple test:


    def testShuffledUniformSequence(self):
        r = random.Random()
        e = QuantileEstimator(0.5)          # estimate the median
        numbers = range(100)                # 0..99: the true median is 49.5
        r.shuffle(numbers)
        for i in numbers:
            e.append(i)
        self.assertTrue(40 < e.get() < 60)  # deliberately loose bounds

Though there aren't well-defined error bounds, it seems to do quite well on my data set. That range of 40 to 60 in the test case is crazy liberal; I just wanted to catch my blatant implementation errors.

I fed that into my graph tool to bound the graph at five times the 95th percentile.

Much better.

Now I have a general-purpose statistics class I can use all over the place. Maybe I'll plug it into Axamol SQL Library's SQL statement timing to get the median and 95th percentile.
