Older blog entries for apenwarr (starting at number 539)

The problem with stealing movies is you *can't* pay for them after

Imagine you go into a clothing store and steal a pair of pants. And you get away with it.

Not long after, you realize that, you know, these pants are *really great*. And never mind that, these physical object thingies really do cost people time and effort to produce. You feel guilty. So what do you do?

Well, you sure don't go back to the store and pay retroactively, that's for sure. Not only will you single yourself out and look like a weirdo, but you risk having the police called on you anyway. After all, you just admitted to a crime.

Nonphysical objects like movie downloads aren't quite the same. It doesn't cost the creator a dime if you make your own copy. Obviously it still cost them to create it, so they'll need to get paid somehow. And let's be honest, if someone creates a product - even a digital one that costs nothing to copy - that improves the lives of hundreds of millions of people, they deserve more than just a pittance. They deserve millions of dollars. Maybe hundreds of millions, which is still only a couple of dollars per person. We all know this is true. Even if we think the producers and marketers are morons who seem intent on making our lives worse instead of better, we know that somebody who made this awesome entertaining stuff deserves to get paid. A lot.

Copying a TV series from a friend, watching a few episodes, enjoying it, and then deciding to buy it is easy - and not something that will get you into trouble, because nobody will know but you and your friend. But somehow you'll still feel like a weirdo if you go out to a store and buy it just to relieve your guilt.

Why is that? Why don't people pay for stuff they stole, even if they love it, even if it won't get them into trouble?

Of course I don't know for sure. But I have a couple of thoughts. First of all, going out to the store and buying a DVD is silly; it wastes time and energy, plus you end up with a useless DVD, which is an obsolete form factor much less convenient than the one (file on hard disk) that you got for free. What are you going to do with the DVD? Either throw it away (not likely), sell it (defeats the purpose), or keep it on your shelf forever. Dumb.

You could go buy it on iTunes, I guess. But then you're sponsoring a wannabe monopolist, plus you know you're paying some middleman at Apple for their bandwidth, which you're not even planning to use. You already downloaded the thing.

You could buy it online and have it shipped to you - but then you end up with the useless physical DVD, plus you pay for shipping the useless physical DVD. (Even if the shipping is "free," you know it's hidden in the price somewhere.)

And you also might figure the price is too high. $47 for Season 5 of House, MD? You don't have that kind of money just lying around. Plus you didn't even watch the whole season.

Another reason you don't pay is simple: you just never quite get around to it. And by the time the guilt is overwhelming, you've watched so many shows from so many seasons of so many series that you can't even keep track of what you owe anymore, other than it's now way more money than you have lying around.

What if there was a service that could help?

PayUp.com: "Because you should."

(Note: payup.com is not a real domain name as of this writing. It's just an example.)

Imagine there was a plugin integrated into, say, Boxee or XBMC, that would track what you watched - and give you a dollar value based on the cost of the DVDs. If you have a season of House, MD and a season of Firefly, and you watch half of each one, your total would come to ... half the cost of the two added together.

Then what if, on the first of each month, the plugin pops up a list of the shows you watched but haven't paid for, calculates a bill, and offers to send money to the manufacturers on your behalf?

The amount of money could default to the auto-calculated amount. Or you could assign a monthly budget, and divide it between the shows you watched. Or you could manually override it and send whatever amount you wanted. Or you could have it round up to buy the whole DVD after you watch 50% of it, or whatever.

Then, after you approve the payment, it debits your credit card and the cards of all the other users, and buys a bunch of DVDs from Amazon.com (or whatever) based on what the users chose to pay for.

So if you pay for 10% of House, MD, Season 5, and nine other people do too, the service buys 1 copy on your behalf. Amazon ships the DVD to Payup.com, which promises never to open the DVD, and just stores it in the warehouse forevermore (or just destroys it). Maybe they charge 5-10% of the purchase price for being the middle man. Maybe they can get a bulk distribution deal somewhere and not charge any extra at all (since, after all, there's no more "free" shipping and handling charge).
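Here's a rough sketch of the arithmetic, assuming prices are kept in cents and "how much you watched" is a simple fraction. Nothing here is a real Payup API: the function names and the Firefly price are made up for illustration, and only the $47 figure comes from this post.

```python
from fractions import Fraction
import math

def monthly_bill(watched):
    """Prorated bill in cents: DVD price times the fraction you watched.

    `watched` maps a title to (price_in_cents, fraction_watched).
    """
    total = sum(Fraction(price) * frac for price, frac in watched.values())
    return round(total)

def copies_to_buy(pledges):
    """Whole DVDs the service can buy from pooled fractional pledges."""
    return math.floor(sum(pledges))

half = Fraction(1, 2)
bill = monthly_bill({
    "House, MD Season 5": (4700, half),   # $47, from the post
    "Firefly": (3000, half),              # hypothetical price
})
print(bill)                                   # half of the two prices added together
print(copies_to_buy([Fraction(1, 10)] * 10))  # ten people paying 10% each
```

Exact fractions are used so that ten pledges of 10% add up to exactly one DVD instead of 0.9999-something.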

A typical cost for cable TV is $60 a month or more; that's more than a full season of House, MD, which is more TV than I can handle in a month. (Of course, some people watch much more TV than me.) What if you cancelled your TV subscription and just assigned $60 a month for payup fees instead?

This won't make your movie stealing any more legally sound. But morally? That's up to you.

Business Considerations

The Payup.com service, of course, would be a bit of a strange entity. Think of it as Netflix without the distribution. Sure, business people think distribution is everything, but that's old fashioned thinking. This is the 21st century. In the 21st century, distribution is free and everybody can do it, so you'd better find some other way to add value.

Think about that. Distribution is free and everybody can do it. Of course, it's not really free yet. It seems free to you and me, because we pay our monthly internet bill, and as long as we don't go over the limit, we don't pay extra. And we can copy DVDs for our friends for less than $0.50 each. But if you're Apple or Amazon or Hulu or Google, distribution sure the heck isn't free. Outside of licensing, it's probably your #1 cost, your most complicated technical overhead, and it hurts your business every time you have to pay for it.

So distribution isn't valuable. What's valuable is getting money back to the copyright holders. Because when copyright holders don't get paid, all we have left is YouTube, and God help us all.

What if you could build a business around not the distribution, but just the paying for licenses? Okay, sure, BitTorrent sharing of movies is illegal and all, but that's kind of a side issue. It's not like your service is making it easier to steal stuff. That's already easy. You're just making it easy to give money to the copyright holders. You're totally taking the moral high ground here. You're turning a net loss - piracy - into a net gain - vastly reduced distribution costs (zero) and paying customers.

The Pirate Bay deserved to lose their court case because as much as they tried to pretend otherwise, they totally lacked the moral high ground, and were playing legal tricks to try to dodge that simple fact. The Pirate Bay could only reduce, never increase, income for copyright holders. Payup.com will be different; not only will it directly provide income to copyright holders, but it'll make services like The Pirate Bay retroactively less evil. Suddenly not all the piracy it caused would be a loss of income.

Of course the copyright holders won't see it that way. Your Payup.com will probably get sued. But when that happens, how can you lose? You're not doing anything wrong. Basically, you're accepting donations from users and forwarding them on to the copyright holders. You're not copying content, indexing content, linking to content, sharing content, or anything of the sort. In fact, you have nothing to do with the content whatsoever. That can't be illegal, can it? (Disclaimer: I really don't know. Talk to a lawyer.)

What about your users? Are they at risk? After all, sending you 10% of the cost of House, MD seems to suggest that they copied the other 90% without paying... or that they copied any of it without a license. But is it really true? Is it sufficient evidence for the police to demand your customer list and start arresting people? I don't know. If it's framed as just donations - I really like this show, and I want to pay you extra for it, but I really don't want another useless DVD - then maybe not.

Is the only reason this won't work, the same reason you can't go back to the store and pay for your stolen pants? If so, that's pretty sad.

Or maybe people just won't pay for stuff if you don't force them to. I don't want to live in such a world. Maybe I'm just a naive Canadian and/or a communist. But I'd pay for this service if it existed. Firefly and Futurama need more donations than the cost of a mere DVD.

Syndicated 2010-02-07 22:56:01 from apenwarr - Business is Programming

A git-subtree tutorial

Jakub Suder has a nice tutorial on how to use my git-subtree tool to manage git repositories that track other projects in subdirs.

It's quite nicely written, and unlike my own documentation, it has pretty diagrams.

Syndicated 2010-02-04 19:43:11 from apenwarr - Business is Programming

25 Jan 2010 (updated 26 Jan 2010 at 18:07 UTC)

More on prorogation

In response to my previous post about prorogation, someone emailed me this comment:

    My participation in our democracy is limited to voting when there's an election and mostly ignoring everything else. [...] Despite that, I can tell from the rumbling that there's something unusual about this particular prorogation. If it was just normal boring governing, nobody would be talking about it.

I actually think this comment is very insightful, because it gets right to the heart of the issue I was trying to address. Other people sent responses that were more like, "But what gives them the right to manipulate it in such an evil way?" It's almost the same question, but those comments were not insightful. They missed the fact that there may not be any manipulation at all, and therefore turned it from a question of fact into a question of opinion.

I'm not a professional journalist. Yesterday's attempt at factual reporting took all the restraint I had. So this post will surely not meet my own high standards for journalistic integrity. I'm just some guy on the Internet. You've been warned.

How I came to know what "prorogation" means

Full disclosure: I personally don't like the Conservatives. I think if they had a majority government, Canada would be worse off. I think the fact that Harper does most of his public relations through a "spokesperson" is a total embarrassment. I also think our current batch of other political leaders, with the possible exception of Gilles Duceppe, are even worse, and I sorely miss the days when Jean Chretien used to beat people up with statues. But I wouldn't vote for Ignatieff, precisely because he pulls the current kind of crap. I wish I still lived in Quebec so I'd at least have a party worth voting for.

That's the perspective with which, not knowing anything about the current prorogation debate (or even what "prorogation" means), I returned to Canada from my vacation in Mexico and was met by an angry Internet mob complaining about our upcoming dictatorship.

See above: "My participation in our democracy is limited to voting when there's an election and mostly ignoring everything else." Me too. You know why? Because I think the system works. That's the beauty of representative democracy. But I figured, okay guys, I'm pretty smart, I can figure this stuff out. If there's really a dictatorship coming, I want to be on the winning team. So I thought I'd better look into it.

The actual mob of complainers were no help. They all figured that someone else knew what was bad about what was going on, or else they figured that just suspending parliament at all made the government a dictatorship.

This left me to do my own "research" (a word that is in quotes because I did all this "research" in bed using my laptop).

First stop: The National Post (via Google), in an article titled, "Thousands turn out at rallies to protest proroguing of Parliament."

    Intermission: A note on how to read political news
    All news sources are biased. The first thing you have to do is identify the bias. Both the National Post and the Globe and Mail are Liberal-friendly and anti-Conservative. How can you tell? Just read any headline about politics and watch the trend. Other tips: 1) all quotes from politicians are weasel words; don't trust them. 2) using quotations from any individual allows the newspaper to avoid fact checking; it is always a fact that "person X said Y," no matter how false Y may be. 3) in constructions like "estimates pegged the turnout at more than 3000 people" note that nobody in particular is being quoted; they are reporting that some random person estimated more than 3000 people. Do not fool yourself into thinking that they don't pull tricks like this. They have to research, write, edit, and publish a fat sheaf of paper like the National Post every single day. Corners will be cut.

...shockingly, the National Post headline is phrased in an anti-government way (since the government is Conservative). Reading on, we see lots of quotes from political leaders (ignore them; rule 1). The whole article is also really a big quote from an angry mob of protesters (no fact checking was done; rule 2), since it merely reports that there was an angry mob, not that what the mob was angry about even exists. And we don't know if the mob was really 10 people or a million people (rule 3).

So that article, although it used many words, was in fact 100% pure unenlightening.

Nevertheless, I could feel myself thinking anti-Conservative thoughts despite the total lack of facts. I thought I'd better go find something biased in the opposite direction so that I could balance things out a little. I realized that I couldn't think of any actual Canadian mainstream newspapers that are Conservative-friendly; I'm probably forgetting something obvious.

So I resorted to blogs, which made things easier. Next stop: Alberta Ardvark (via Google). I have to admit that I assumed they were Conservative-friendly just because they're in Alberta, which means I'm a racist. However, I was not disappointed.

That article was the usual politico-blogger nonsense, giving handy advice on how to win a political argument not by arguing about the issue, but instead by turning the conversation around to character assassination whenever possible. (In this case, it's about the fact that Ignatieff supported prorogation last year, so he's not allowed to be against this particular prorogation this year; that makes him inconsistent, therefore a liar, so it doesn't matter what he says, etc.) Nicely done as always, Internet. But there was one intriguing quote: "...the ones who have been convinced by the media that this prorogation is not a routine event..."

Wait... routine event? Surely this is Conservative propaganda. I was intrigued, so I followed the link.

It points out that Chretien prorogued parliament 4 times. And Pierre Elliott Trudeau, supposed Canadian hero... 11 times?! Holy crap! Who's the evil one around here, again?

My brain's magical pattern-detectors then kicked in and I thought: hey, wait a minute. Those prorogation counts seem to be roughly proportional to the amount of time a particular Prime Minister was in office. Perhaps there's a pattern here.

And then, at the bottom of the article: "In our 143 years of existence as Canada, Parliament has been prorogued 105 times."

Oh dear.

Maybe I'd better go learn what prorogation is. The answer is: it's an all-too-fancy word for the end of every parliamentary session.

And it's a word I didn't know the meaning of until now. A word that every news article and blog entry I've read so far has not bothered to define. A word that, in some tenses, has the word "rogue" in it.

Okay, this sucks. Reading biased articles in a search for truth is getting me nowhere. Is there not any news source that will just give me the facts and not try to spin it the way they want? Well, no, I guess there isn't. But there's something close: the CBC.

Because it would be weird if they didn't, the above-linked CBC article has quotes from politicians; try to ignore them (rule 1). If you fail to do so, you will discover that the first apparent use of the word "despotic" in this context was by Ralph Goodale, Liberal House Leader, whom you should not vote for because he is thus automatically a lunatic.

However, the non-politician-quote parts of the article seem to be well balanced and, notably, identify several reasons why Harper might have wanted to prorogue parliament right now as opposed to some other time.

So what have we learned? First, that prorogation of parliament is totally normal, and that the length of the just-ended session isn't even unusually short; and second, that Harper might very well have chosen this particular date in order to benefit himself or his party. Gasp! Let us look at these possible reasons in more detail.

The CBC's suggested reasons for the current prorogation

"Muzzle parliamentarians amid controversy over the Afghan detainees affair." Don't know about you, but parliament has been shut down for a whole month already and I haven't noticed any of those parliamentarians not talking. I wish. But maybe you have a point; when parliament next starts up, I bet the opposition parties will have completely forgotten about the whole thing, despite the obvious political leverage they could gain from bringing it up. Harper has totally outfoxed them on this one.

"To consult with Canadians, stakeholders and businesses as it moves into the 'next phase' of its economic action plan amid signs of economic recovery." Well, I guess theoretically, if you're in parliament you don't have as much time for consulting with Canadians. But don't we have Royal Commissions for that or something? Maybe they just wanted a longer Christmas holiday. (Aside: the reason the prorogation "doesn't start until January 25th" is that they've all been on holidays since sometime in December. Seriously.)

"Strategically, prorogation also prevents question period criticisms from the opposition parties during the Olympics." Hey, not bad. Avoid the bad PR for Canada from discussing our idiotic foreign affairs policies at the same time as we're in the global spotlight. Critically, this allows Mr. Harper, who (let's be honest) doesn't look all that lovable, to hide in the cellar for the whole time the Olympics are on, letting someone cuter represent us to the world. This seems to be a wise strategy no matter which side of the fence you're on.

"By proroguing Parliament, he is unilaterally making a decision to stop any kind of disclosure from happening." As if information can't be disclosed just because nobody's making any laws right now. Remember: parliament is the legislative branch of the government. It's for making new laws. No other part of the government is suspended just because parliament is. (Note: see update below.)

"Gilles Duceppe wrote that prorogation has become 'a tradition for Harper.'" Duceppe has an awesome sense of humour. I had to read this one a few times before I realized that he managed to give them a sound bite while simultaneously making fun of the fact that prorogation is totally normal, ie. a "tradition."

"By the time Parliament resumes, Harper would have had time to ask Jean to name five new senators, which would give the Conservatives a majority on the newly formed Senate committees and greater control for passing their own legislation." (Notably: nobody was quoted saying this. CBC had to look it up on their own.) "Soudas confirmed the prime minister will seek to fill the Senate vacancies between now and March." This one is actually a great example of a real political reason to prorogue parliament; to get more control of the senate in time for the next session. But the Canadian senate system is designed (on purpose) to work like that. That's why the current senate is mostly Liberal even though our elected representatives are mostly non-Liberal. Senators are appointed for life, at which time the Prime Minister selects new ones. No surprises here.

(In case you don't like that system: the only party in favour of senate reform in Canada is the Conservatives. They'd rather you could elect your senators. How "anti-democratic" of them. I actually think such reform would be a change for the worse, but that's just my opinion.)

"Shortly after Soudas' announcement, the government sent out an email saying it would reintroduce, in original form, the consumer safety bill and the anti-drug-crime law that the Tories claimed the Liberals 'gutted' in the Senate." This shows significant political maneuvering. However, bills take multiple rounds through both houses before they (might) get passed anyway, so this isn't as bad as it sounds. The "gutted" version might never have been passed anyway. Also, interestingly, this was pointed out in an email from a Conservative MP. Apparently they don't think it's evil. At least not evil enough to cover up.

Conclusions

Guys, I did my homework. But I'm just not seeing it. The actual facts are:

1) Prorogation is perfectly normal and the recent parliamentary session wasn't abnormally short.

2) We won't have any more new laws getting made for a month or so longer than usual. (Remember: they were on vacation until January 25th anyway.) But being unable to do stuff doesn't make you a despot, it makes you a eunuch.

3) If Harper is really evil, the first thing to happen in the new parliament in March is that there will be a vote of non-confidence followed by an election. If this doesn't happen, it's because the angry non-Conservative parties didn't actually believe he was evil either.

4) There are some valid political reasons why it's better for the Conservatives if they prorogue parliament right now instead of later. However, they aren't very exciting reasons.

5) There is at least one actual reason (Harper is scary-looking and the Olympics are coming) that it's better for Canada if they prorogue parliament right now.

6) All mainstream media that I read - which was quite a bit - failed to properly define the term "prorogation" or to mention that it's perfectly normal. This seems a rather critical thing to know. Its omission suggests to me that they're trying to make news out of non-news.

Epilogue

"If it was just normal boring governing, nobody would be talking about it."

Unfortunately not true.

That's textbook mob mentality: he must be guilty, because otherwise my friends wouldn't be burning down his house.

The only cure for mob mentality is thinking for yourself.

Updates

Some helpful people have emailed me to clarify or correct or question various parts of the above.

Information release on the Afghan torture investigation: make no mistake, this investigation, and the demand for release of information related to it, is very important. It will also be delayed (for about one month) because of prorogation. Because of "parliamentary immunity," the interesting testimony won't be released during the delay. However, you need to think about two key points: first, will the end result of the investigation be any different because of a one-month delay? And second: will the Conservatives benefit because of the delay? Keep in mind that if the results came out now, they would be largely overshadowed by news about the Olympics. If they come out later, the Olympics will be over, the opposition parties could force an election, and the results would be headlining right as we're thinking about who to vote for. And yet the Conservatives have chosen the latter, not the former.

Precise timing of prorogation: Several people responded by claiming that it's not prorogation that's the problem, it's the particular timing of Harper's use of prorogation. This is an insidious line of argument because it's impossible to disprove; if Harper had prorogued parliament back on the 9th of September (9/9/9 is the British equivalent of 9/1/1), you could have accused him of using numerology to choose prorogation dates, and it would be impossible to refute that claim, even if it had been a perfectly sensible date to end the parliamentary session. Thus, to demonstrate any wrongdoing, you really have to be more specific about why the current timing is so evil. I have discussed several possible reasons above. Please feel free to suggest more. But "the timing is evil!" is not specific enough.

With that in mind, however, a lot of bad statistics are being spread with regard to the lengths of various parliamentary sessions. Here are all of the sitting days per session since 1980:

Turner and Trudeau (Liberal): 591, 116
Mulroney (Conservative): 308, 389, 11, 308, 271
Chretien (Liberal): 283, 164, 243, 133, 211, 143 (avg: 196)
Martin (Liberal): 55, 159
Harper (Conservative): 175, 117, 13, 128 (avg. excluding outlier: 140)

Eyeballing it, Harper's numbers are very slightly lower than typical (except for last year's prorogation, which was indeed an interesting event). Ignoring the outlier (13), Harper's average is just 29% lower than Chretien's. However, Harper has also managed to hold together the longest-running (by a large margin) minority government in Canadian history. I find it unsurprising that the effort required to do so would result in somewhat shorter sessions.
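For anyone who wants to check the eyeballing, here is a small sketch that recomputes the two averages and the percentage from the sitting-day figures listed above. The variable names are mine; the numbers are copied from the list.

```python
sitting_days = {
    "Chretien": [283, 164, 243, 133, 211, 143],
    "Harper": [175, 117, 13, 128],
}

def mean(values):
    return sum(values) / len(values)

chretien_avg = mean(sitting_days["Chretien"])
# Leave out the 13-day session, the outlier discussed above.
harper_avg = mean([d for d in sitting_days["Harper"] if d != 13])

# How much lower Harper's average is, as a fraction of Chretien's.
shortfall = 1 - harper_avg / chretien_avg

print(round(chretien_avg), round(harper_avg), f"{shortfall:.0%}")
```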

Syndicated 2010-01-25 01:16:10 (Updated 2010-01-26 18:07:15) from apenwarr - Business is Programming

Proroguing parliament

"Prorogation" is the term we use to describe the end of any session of parliament in Canada.

Or as the Canadian Parliament web site says, "Each session of a Parliament ends with the prorogation of Parliament by the Governor General, on the advice of the Prime Minister."

Since Canada's confederation in 1867 (143 years ago), parliament has prorogued 105 times. That's an average of about 1.4 years per session (some are longer, some are shorter).

See the Government of Canada's complete list of prorogations.

Use of the term "despotic" in this context makes you automatically a lunatic.

That is all.

Syndicated 2010-01-24 07:11:07 from apenwarr - Business is Programming

The Google Phone

    ...even Microsoft never was brazen enough to pull something like this. Even Microsoft had some tiny bit of shame. Google is a different beast altogether. They're like nothing anyone has ever seen in our business. Not only are they not ashamed - they think they're the good guys!

    -- Fake Steve Jobs

Dot dot dot.

Syndicated 2010-01-06 23:55:58 from apenwarr - Business is Programming

4 Jan 2010 (updated 4 Jan 2010 at 18:03 UTC)

bup 0.01: It backs things up

I just spent a few days of my Christmas vacation writing a new program, bup.

bup is a program that backs things up. It's short for "backup." Can you believe that nobody else has named an open source program "bup" after all this time? Me neither. It also has almost no other meanings.

Despite its unassuming name, bup is pretty cool. To give you an idea of just how cool it is, I wrote you this poem:

Bup is teh awesome
What rhymes with awesome?
I guess maybe possum
But that's irrelevant.

Hmm. Did that help? Maybe prose is more useful after all.

Reasons bup is awesome

bup has a few advantages over other backup software:

  • It uses a rolling checksum algorithm (similar to rsync) to split large files into chunks. The most useful result of this is you can back up huge virtual machine (VM) disk images, databases, and XML files incrementally, even though they're typically all in one huge file, and not use tons of disk space for multiple versions.
  • It uses the packfile format from git, so you can access the stored data even if you don't like bup's user interface.
  • Unlike git, it writes packfiles *directly* (instead of having a separate garbage collection / repacking stage) so it's fast even with gratuitously huge amounts of data.
  • Data is "automagically" shared between incremental backups without having to know which backup is based on which other one - even if the backups are made from two different computers that don't even know about each other. You just tell bup to back stuff up, and it saves only the minimum amount of data needed.
  • Even when a backup is incremental, you don't have to worry about restoring the full backup, then each of the incrementals in turn; an incremental backup *acts* as if it's a full backup, it just takes less disk space.
  • It's written in python (with some C parts to make it faster) so it's easy to extend and maintain.

Super quick example

(The README actually has a more detailed example.)

Try making a remote backup:

    tar -cvf - /etc | bup split -r myserver: -n my-etc -vv

Try restoring your backup:

    bup join -r myserver: my-etc | tar -tf -

(On myserver) look at how much disk space your backup took:

    du -s ~/.bup

Make another backup (yes, that's exactly the same command):

    tar -cvf - /etc | bup split -r myserver: -n my-etc -vv

Look how little extra space your second backup used on top of the first:

    du -s ~/.bup

Restore your *first* backup over again (the ~1 is git notation for "one older than the most recent"):

    bup join -r myserver: my-etc~1 | tar -tf -

What's next?

I have lots of plans for this lovely program, in the event that I actually get time to implement them. But if you think it's cool, please feel free to git clone it, hack away, and send some patches! Read the README for a list of some deficiencies in the current release.

I'm sure there are also more deficiencies that I don't know about, of course.

(Previous poetry-related adventures.)

Syndicated 2010-01-04 05:01:39 (Updated 2010-01-04 18:03:30) from apenwarr - Business is Programming

Version control of really huge files

So let's say you've got a database with 100k rows of 1k bytes each. That comes to about 100 megs, which is a pretty small database by modern standards.

Now let's say you want to store the dumps of that database in a version control system of some sort. 100 megs is a pretty huge file by the standards of version control software. Even if you've only changed one row, some VCS programs will upload the entire new version to the server, then do the checksumming on the server side. (I'm not sure of the exact case with svn, but I'm sure it will re-upload the whole file if you check it into a newly-created branch or as a new file, even if some other branch already has a similar file.) Alternatively, git will be reasonably efficient on the wire, but only after it slurps up mind-boggling amounts of RAM trying to create a multi-level xdelta of various revisions of the file (and to do that, it needs to load multiple revisions into memory at once). It also needs you to have the complete history of all prior backups on the computer doing the upload, which is kind of silly.

Neither of those alternatives is really very good. What's a better system?

Well, rsync is a system that works pretty well for syncing small changes to giant files. It uses a rolling checksum to figure out which chunks of the giant file need to be transferred, then sends only those chunks. Like magic, this works even if the sender doesn't have the old version of the file.

Unfortunately, rsync isn't really perfect for our purposes either. First of all, it isn't really a version control system. If you want to store multiple revisions of the file, you have to make multiple copies, which is wasteful, or xdelta them, which is tedious (and potentially slow to reassemble, and makes it hard to prune intermediate versions), or check them into git, which will still melt down because your files are too big. Plus rsync really can't handle file renames properly - at all.

Okay, what about another idea: let's split the file into chunks, and check each of those blocks into git separately. Then git's delta compression won't have too much to chew on at a time, and we only have to send modified blocks...

Yes! Now we're getting somewhere. Just one catch: what happens if some bytes get inserted or removed in the middle of a file? Remember, this is a database dump: it's plaintext. If you're splitting the file into equal-sized chunks, every chunk boundary after the changed data will be different, so every chunk will have changed.
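Here's a small sketch of that failure mode. The "dump" is a made-up stand-in (1000 numbered rows of text) and the 256-byte chunk size is arbitrary; the only thing it's meant to show is what a single inserted byte does to fixed-size chunks.

```python
import hashlib

def fixed_chunks(data: bytes, size: int):
    """Split data into equal-sized chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_ids(chunks):
    """Identify each chunk by the SHA-1 of its content."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}

# Hypothetical plaintext dump: 1000 rows, each with a unique row number.
old = b"".join(b"row %06d some payload\n" % i for i in range(1000))
# The "edit": one extra byte inserted near the start of the file.
new = old[:10] + b"X" + old[10:]

old_ids = chunk_ids(fixed_chunks(old, 256))
new_ids = chunk_ids(fixed_chunks(new, 256))
shared = len(old_ids & new_ids)
print("chunks in old dump:", len(old_ids))
print("chunks shared after a one-byte insert:", shared)
```

Every chunk after the insertion point holds the same rows as before, just shifted by one byte, so none of the old chunk hashes match anymore.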

This sounds similar to the rsync+gzip problem. rsync really sucks by default on .tar.gz files, because if a single byte changes, every compressed byte after that will be different. To solve this problem, they introduced gzip --rsyncable, which uses a clever algorithm to "resync" the gzip bytestream every so often. And it works! tar.gz files compressed with --rsyncable change only a little if the uncompressed data changes only a little, so rsync goes fast. But how do they do it?

Here's how it works: gzip keeps a rolling checksum of the last, say, 32 bytes of the input file. (I haven't actually looked at what window size gzip uses.) If the last n bits of that checksum are all 0, which happens, on average, every 2^n bytes or so, then toss out the gzip dictionary and restart the compression as if that were the beginning of the file. Using this method, a chunk ends every time we see a conforming 32-byte sequence, no matter what bytes came before it.
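A rough sketch of that splitting rule, under a few assumptions of my own: the 32-byte window and the 6-bit mask are picked for the demo, and zlib.crc32 stands in for whatever checksum gzip really uses. To keep it short it recomputes the checksum of the window at every position instead of rolling it, so it's slow, but the boundaries it picks depend only on the last 32 bytes seen.

```python
import zlib

WINDOW = 32  # bytes of context the checksum looks at (assumed)
BITS = 6     # split when the low 6 bits are zero: chunks of ~64 bytes on average

def split_chunks(data: bytes, window: int = WINDOW, bits: int = BITS):
    """Cut data wherever the checksum of the last `window` bytes ends in zeros."""
    mask = (1 << bits) - 1
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        # Naive version: checksum the window from scratch at every position.
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last boundary
    return chunks

# Hypothetical plaintext dump, then the same dump with one byte inserted.
old = b"".join(b"row %06d some payload\n" % i for i in range(1000))
new = old[:10] + b"X" + old[10:]

old_chunks = set(split_chunks(old))
new_chunks = set(split_chunks(new))
print("distinct chunks in old dump:", len(old_chunks))
print("chunks reused after a one-byte insert:", len(old_chunks & new_chunks))
```

Since a boundary is decided purely by the window's content, the boundaries after the edit land on the same bytes as before, and only the chunk or two around the inserted byte come out different.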

So here's my trick: instead of doing this algorithm in gzip, I just do it myself in a standalone program. Then I write each chunk to a file, and create an index file that simply lists the filenames of the required chunks (in order). Naturally, I name each chunk after its SHA1 hash, so we get deduplication for free. (If we create the same chunk twice, it'll have the same name, so it doesn't cost us any space.)
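The storage side is simple enough to sketch in a few lines. This is not the code from my proof of concept; the function names and the directory layout (one file per chunk, index as a list of names) are just assumptions for illustration.

```python
import hashlib
import os
import tempfile

def store_chunks(chunks, directory):
    """Write each chunk to a file named after its SHA-1; return the ordered index."""
    index = []
    for chunk in chunks:
        name = hashlib.sha1(chunk).hexdigest()
        path = os.path.join(directory, name)
        if not os.path.exists(path):  # same content, same name: already stored
            with open(path, "wb") as f:
                f.write(chunk)
        index.append(name)
    return index

def reassemble(index, directory):
    """Rebuild the original file by concatenating the chunks the index lists."""
    parts = []
    for name in index:
        with open(os.path.join(directory, name), "rb") as f:
            parts.append(f.read())
    return b"".join(parts)

with tempfile.TemporaryDirectory() as d:
    index = store_chunks([b"aaa", b"bbb", b"aaa"], d)
    print("index entries:", len(index))
    print("files on disk:", len(os.listdir(d)))
    assert reassemble(index, d) == b"aaabbbaaa"
```

The index keeps one entry per chunk in file order, while the directory only ever holds one copy of each distinct chunk.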

...and to be honest, I got a little lazy when it came to creating the chunks, so I just piped them straight to git hash-object --stdin -w, which stores and compresses the objects and prints out the resulting hash codes.
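One detail worth knowing if you go this route: the hash git prints for a blob isn't the plain SHA-1 of the chunk's bytes. In a normal SHA-1 repository, git hashes a short header ("blob", the length, and a NUL byte) followed by the content. A minimal sketch that reproduces the ID without calling git (the function name is mine):

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known ID of the empty blob.
assert git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(git_blob_id(b"hello\n"))
```

Deduplication still works exactly the same way, since identical chunks get identical headers and therefore identical IDs.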

An extremely preliminary experimental proof-of-concept implementation of this file splitting algorithm is on github. It works! My implementation is horrendously slow, but it will be easy to speed up; I just wrote it as naively as possible while I was waiting for the laundry to finish.

Future Work

For our purposes at EQL Data, it would be extremely cool to have the chunking algorithm split based only on primary key text, not the rest of the row. We'd also name each file based on the first primary key in the file. That way, typical chunks will tend to have the same set of rows in them, and git's normal xdelta stuff (now dealing with a bunch of small files instead of one huge one) would be super-efficient.

It would also be entertaining to add this sort of chunking directly into git, so that it could handle huge files without barfing. That would require some changes to the git object store and maybe the protocol, though, so it's not to be taken lightly.

And while we're dreaming, this technique would also be hugely beneficial to a distributed filesystem that only wants to download some revisions, rather than all of them. git's current delta compression works great if you always want the complete history, but that's not so fantastic if your full history is a terabyte and one commit is 100 GB. A distributed filesystem is going to have to be able to handle sparse histories, and this chunking could help.

Prior Art

I came up with this scheme myself, obviously heavily influenced by git and rsync. Naturally, once I knew the right keywords to search for, it turned out that the same chunking algorithm has already been done: A Low-Bandwidth Network Filesystem. (The filesystem itself looks like a bit of a dead end. But they chunk the files the same way I did and save themselves a lot of bandwidth by doing so.)

Syndicated 2010-01-03 11:37:35 from apenwarr - Business is Programming

"No Waiting Room" and Queuing Theory

I just read Nat Friedman's post, No Waiting Room, about his experiences in a hospital in Germany vs. one in Boston:

    I hadn't been asked to sit in a waiting room for 8 hours like the time I had a concussion in Boston.

    A few hours later, after my abdominal ultrasound was clear, I poked around the hallway near the main entrance just to confirm what I'd been wondering about.

    There was no waiting room.

He then goes on to relate this experience to health insurance models in the U.S. vs. Germany, how people respond to them, the number of ER vs. scheduled visits, and so on.

Now, there are surely lots of things to say about the difference between U.S. and German (and Canadian, for that matter) health insurance systems. But I don't think this actually provides a data point either way.

In any system, long-but-not-growing queues are a sign of a very specific kind of failure. People like to oversimplify and assume that a long queue means there aren't enough doctors (for example), but that's not right; if there are more people arriving at the hospital than are being treated, the queue would keep growing. But as far as I know, the queue at the hospital in Boston wasn't ever-growing,1 it was just really really long. That's a totally different condition. An ever-growing queue is a bandwidth problem; a long-but-constant queue is a latency problem.2
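A toy model makes the distinction easier to see. All the numbers here are invented (10 patients an hour, an 80-person backlog), and it treats arrivals and treatment as perfectly steady, which no real hospital is. It's only meant to show the difference between the two conditions.

```python
def simulate(arrivals_per_hour, served_per_hour, backlog, hours):
    """Return the wait, in hours, facing a new arrival at the end of each hour."""
    waits = []
    queue = backlog
    for _ in range(hours):
        queue += arrivals_per_hour
        queue -= min(queue, served_per_hour)
        waits.append(queue / served_per_hour)
    return waits

# Enough capacity, but an old backlog that never drains: a constant 8-hour wait.
print(simulate(10, 10, backlog=80, hours=24)[-1])
# Same capacity, no backlog: nobody waits.
print(simulate(10, 10, backlog=0, hours=24)[-1])
# Not enough capacity: the wait keeps growing hour after hour.
print(simulate(12, 10, backlog=0, hours=24)[-1])
```

The first two cases have identical "bandwidth" and wildly different waits; only the third one would be fixed by adding doctors.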

Somewhere in this system is an ineffective queue. What's the optimal route through the hospital queue, and how many people follow non-optimal routes? Does the optimal route involve filling out paperwork for two hours? If so, then your minimum stay is two hours. Does a typical route involve three layers of nurses pre-checking you so they can put you into the right doctor's queue? Each of those layers adds a delay (and probably another delay as you wait your turn for the following stage). Is the traffic bursty, i.e. do more people arrive at certain times of day, so that you get an 8-hour queue built up at rush hour, and it empties out by 4am, so the "average bandwidth" is sane but the peak bandwidth is crazy? (Surely people don't get sick only during certain hours, so why would it happen?)

Interestingly, this sort of problem is almost always solvable without spending a lot of money. Upgrading bandwidth requires money; improving latency only requires cleverness. It's a variant of Just in Time manufacturing, where instead of "excess inventory" you have "patients in the waiting room." Companies that do JIT - German manufacturers come to mind - produce efficiency without spending more money.

I'm not an expert in hospitals; not in the slightest. In fact, I live in Canada, so I can't provide useful input on the U.S. healthcare system or the German one, having never experienced either.

But I know a law of the universe when I see one.

Footnote

1 Why I think the queues aren't overflowing: because if they did, you'd be turning away customers. You might complain about the U.S. health care system being run by profit-seeking corporations and individuals, and I'm sure that has its problems. But if there's one thing profit-seeking entities never want to do, it's turn away paying customers. If the problem were insufficient doctors, for example, you'd be much more likely to see it in socialized systems like Canada's, where doctors' salaries are capped and there's no supply-and-demand effect.

2 I wrote about BitTorrent and transmit queues a while ago, and that's another example of the same phenomenon. No matter how fast your network link is, BitTorrent can destroy your performance if you don't speed-limit it; if you do speed limit your BitTorrent, say to 90% of your max bandwidth, whatever your max bandwidth might be, then your latency won't suffer at all. That's the magic of queuing theory. It matters for hospitals just like it matters for networks and highways.

Syndicated 2009-12-29 00:47:55 from apenwarr - Business is Programming

Advice to people implementing dynamic dns with easydns.com

Forget about using one of the (many) totally idiotic dyndns clients for Linux. The correct solution is to just call curl from a cron job.

See the description of the easydns.com "protocol". Where by "protocol" we mean a single HTTP GET request.

Note that if you use myip=1.1.1.1 (why not 0.0.0.0?) the server will figure out your IP automatically, which is basically always what you want. Having your (aforementioned idiotic) dyndns client try to guess it for you, when it's almost always behind a NAT router, is totally useless.

I hate computers.

Syndicated 2009-12-27 23:09:29 from apenwarr - Business is Programming

The disappointingly ongoing success of WvDial

For those who don't know, I'm one of the original two authors of WvDial (Oh yes, that's a real Wikipedia link!)... back in 1998. WvDial and its spinoff C++ library, WvStreams (notably not a Wikipedia link), were the absolute first code written as part of Weaver, our startup's commercial product, which became Nitix (Wikipedia again, but article is outdated) and later Lotus Foundations (Wikipedia "stub" article).

The Wikipedia articles make a pretty good proxy for the comparative success of those programs. Another silly statistic I like is Twitter search (Lotus Foundations seems most popular in Japan at the moment).

Yes, yes, so WvDial remains popular. Yay us, right?

See, that's the funny thing. First of all, wvdial maintenance has been almost zero (not quite zero1) for most of the last ten years. WvDial has no GUI; it's purely a command-line tool. WvDial is a modem dialer and nobody uses modems anymore. WvDial is written in C++, which is generally unpopular for small tools and makes it ABI-unstable and bigger than you'd like. WvDial has no unit tests and so everyone is (rightly) afraid to change its rotting guts. And perhaps most oddly of all, none of the problems WvDial originally set out to solve are problems today.

This story is not a story of WvDial's awesomeness. It's a story of the open source world's pure unadulterated fabulous nonstop 12-year marathon of suck.

I tried Twitter a while back and I'm pretty much over it, but I do still subscribe to some Twitter search RSS feeds just in case anything interesting happens (it never does). Every single day, several people recommend wvdial to their friends. With comments like "wvdial always works, let me help you set it up." In multiple languages.

Speaking of suck, that Twitter search is how I learned that Ubuntu's most recent release(s?) have dropped wvdial from the CD in favour of NetworkManager. You can still download wvdial from the Ubuntu repo, except... whoops. You can't get online without wvdial. Because speaking of suck, NetworkManager apparently does. (I've never tried it; I haven't used a modem in years.) So we have people who can't get online, downloading the .deb files on another computer and moving them via sneakernet over to their laptop. And so on. People can't live without the thing. I would say they love it, but I still hope that isn't it.

Why WvDial was created

We originally created wvdial back in the late 1990's because setting up pppd 'chat' scripts was too annoying. At the time, a lot of dialup Internet providers required you to answer some menu items, log in, etc before you could start an actual PPP session, and every ISP did it differently... so everyone needed a different chat script. It was all gross.

The point of WvDial is that it would read the stupid inconsistent menus and prompts for you and answer them. For the first few versions, it failed almost every time. But we got lots of feedback from lots of people all over the Internet, and we fixed it. Eventually, it got to the point where basically nobody wrote to us anymore asking why wvdial didn't work with their ISP, even though thousands and thousands of people were downloading wvdial every month. (And that's just from us; we didn't have any stats on who installed it from a real Linux distro.)

But here's the thing. Windows 95 and later have a built-in dial-up PPP feature that doesn't even support anything like chat scripts. All they do is dial up and start PPP, and if your ISP can't handle it, well, that's too bad. Oh, there were ways of working around it, but they were gross, and it didn't take long before most ISPs abandoned their menu structures and just did things the easy way. In WvDial, we called the Windows 95 behaviour "stupid mode" since it was the opposite of wvdial's evolved, intelligent conversation style.

Unsurprisingly, some combination of Worse is Better and "everybody runs Windows" conspired to make wvdial's cutesy menu-guessing features completely obsolete. WvDial is, therefore, completely obsolete. Or so you'd think.

People keep using WvDial because everything else is worse

Why do people keep using WvDial? Well, it's hard to tell from a bunch of foreign-language 140-character Twitter posts. But I've observed the following: a) it's not for modems, it's for coupling to data networks via your cell phone and bluetooth that emulates a modem; and b) just things like detecting the right serial port, baud rate, and init strings, and redialing when you get disconnected,2 are apparently still big problems.

This all makes me extremely sad. First of all, pretending your advanced cell phone network is a modem, and thus bringing up questions like "should I use touch tone or pulse dialing?" and "do you think 19200 bps is fast enough?" and "what happens if I get a busy signal?" is a joke. Secondly, the fact that someone couldn't fudge up a bluetooth-cellphone interface for Linux that's more reliable than wvdial, even though wvdial was never designed for this use case at all, just really scares me. I mean, yes, I realized at the time that nobody had ever designed a tool like WvDial on any OS... but surely all the individual parts had been done before? You know, modem detection? Max baud rate detection? Init string detection? And now, more than 10 years later, surely someone has put the still-necessary parts together into a more sensible package? You know, maybe one with a GUI?

Nope.

I can't quite figure out what people like about wvdial so much; why it's the fallback that people still recommend to all their friends when Ubuntu/Gnome's default (GUI, at least, thank God) dialer can't take the pressure. I suspect it's just the fact that wvdialconf tries all your serial ports one by one to guess which one is right, rather than making you do it for yourself.

Perhaps I'll never know.

Happy birthday, WvDial.

(See also: other things that are still not dead.)

Footnotes

1 I haven't been involved in wvdial maintenance for quite a long time. Thanks to Patrick Patterson, Simon Law, William Lachance, and possibly others (Jim Morrison?) for keeping up with the maintenance over the years.

2 Okay, I admit it. WvDial's redialing support is pretty frickin' awesome. If you think about it, it's actually impossible to implement a redial backoff timer correctly using chat scripts. So I don't think anybody else has even tried.

Syndicated 2009-12-24 01:34:27 from apenwarr - Business is Programming
