Older blog entries for apenwarr (starting at number 371)

1 Feb 2008 (updated 6 Mar 2009 at 16:58 UTC)

2008-01-31: Git is the next Unix

When I first heard about git, I was suspicious that there could be anything special about it, but after watching Linus' talk about it, I was... even more suspicious. I tried it anyway.

When I tried it, I realized something right away: what made git awesome was actually none of the things Linus had talked about, not really. Those things were more like... symptoms of the underlying awesomeness. Yes, git is fast. Yes, it is distributed. Yes, it is definitely not CVS. Those things are all great, but they miss the point.

What actually matters is that git is a totally new way to operate on data. It changes the game. git has been described as "concept-heavy", because it does so many things so differently from everything else. After some reflection, I realized that this is far truer than I could see at first. git's concepts are not only unusual, they're revolutionary.

Come on, revolutionary? It's just a version control system!

Actually it's not. Git was originally not a version control system; it was designed to be the infrastructure so that someone else could build one on top. And they did; nowadays there are more than 100 git-* commands installed along with git. It's scary and confusing and weird, but what that means is git is a platform. It's a new set of nouns and verbs that we never had before. Having new nouns and verbs means we can invent entirely new things that we previously couldn't do.
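
If you've never looked below the porcelain, here's a rough sketch of what "building on top of git" means: a handful of plumbing commands, driven here from Python, that store content, give it a filename, snapshot a tree, wrap it in a commit, and point a branch at it. The directory name, file name and message are all made up, and a real tool would of course do far more than this:

    import os, subprocess

    REPO = "plumbing-demo"   # a throwaway directory; the name is invented

    # Fake an identity so commit-tree works even with no git config at all.
    ENV = dict(os.environ,
               GIT_AUTHOR_NAME="demo", GIT_AUTHOR_EMAIL="demo@example.com",
               GIT_COMMITTER_NAME="demo", GIT_COMMITTER_EMAIL="demo@example.com")

    def git(*args, data=None):
        """Run one git command inside the demo repo; return its stripped output."""
        return subprocess.check_output(["git"] + list(args), cwd=REPO,
                                       input=data, text=True, env=ENV).strip()

    os.makedirs(REPO, exist_ok=True)
    git("init", "-q")
    blob = git("hash-object", "-w", "--stdin", data="hello, platform\n")      # store content
    git("update-index", "--add", "--cacheinfo", "100644", blob, "hello.txt")  # give it a path
    tree = git("write-tree")                           # snapshot the index as a tree
    commit = git("commit-tree", tree, data="a commit made of nothing but plumbing\n")
    git("update-ref", "refs/heads/master", commit)     # and now it's a branch
    print("blob:", blob, "tree:", tree, "commit:", commit)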

Git is a new kind of filesystem, and it's faster than any filesystem I've ever seen: git checkout is faster than cp -a. It even has fsck.

Git stores revision history, and it stores it in less space than any system I've ever seen or heard of. Often, in less space than the original objects themselves!

Git uses rsync-style hash authentication on everything, as well as a new "tree of hashes" arrangement I haven't seen before, to enforce security and reliability in amazing ways that make the idea of "guaranteed identical every time" not something to strive for, but something that's always irrevocably built in.

Git names everything using globally unique identifiers that nobody else will ever accidentally use, so that being distributed is suddenly trivial.
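
Concretely, an object's name is just the SHA-1 of its bytes plus a tiny type-and-size header, so any two copies of the same content end up with the same name on every machine, with no registry handing out IDs. A minimal sketch of the blob case (in real life you'd just call git hash-object):

    import hashlib

    def git_blob_name(data: bytes) -> str:
        """The name git gives this content when stored as a blob object."""
        header = b"blob %d\0" % len(data)    # object header: type, size, NUL
        return hashlib.sha1(header + data).hexdigest()

    # git hash-object, fed the same bytes, prints the identical id anywhere.
    print(git_blob_name(b"hello, world\n"))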

Git is actually the missing link that has prevented me from building the things I've wanted to build in the past.

I wanted to build a distributed filesystem, but it was too much work. Now it's basically been done... in userspace, cross-platform.

At NITI we built a file backup system around a pretty clever data structure that sped up file accesses. But we never got around to implementing sub-file deltas, because we couldn't figure out a structure that would handle them both quickly and space-efficiently. The git developers did. To build your own backup system that's much better than ours, just store the data in git instead.
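
The whole "backup program" then shrinks to something like this sketch, run nightly from cron. The path is made up, and it assumes git already knows your name and email:

    import subprocess

    BACKUP_DIR = "/home/me/documents"   # invented path: the directory to protect

    def git(*args):
        subprocess.check_call(["git", "-C", BACKUP_DIR] + list(args))

    git("init", "-q")                   # harmless no-op after the first run
    git("add", "-A", ".")               # pick up new, changed and deleted files
    git("commit", "-q", "--allow-empty", "-m", "nightly snapshot")
    git("gc", "--quiet")                # repack, so snapshots delta against each other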

On top of our backup system we made a protocol for synchronizing changes up to a remote repository. Our protocol was sort of okay; git's is much better, and it will surely improve a lot in the months ahead. (Currently git requires you to sync *everything* if you want to sync *anything*, but that's an implementation restriction, not a design or protocol restriction. See shallow clones for just the beginning of this.)
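
For reference, a shallow clone looks like this; the URL is made up, and you get the latest revision plus only as much history as you ask for:

    import subprocess

    subprocess.check_call([
        "git", "clone", "--depth", "1",         # only the most recent history
        "git://example.com/big/project.git",    # invented URL
        "project-shallow",
    ])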

Someone else I know built a hash-indexed backup system to efficiently store incremental backups from a large number of systems on a single set of disks. Git does the same, only even better, and supports sub-file deltas too.

We made a diskless workstation platform called Expression Desktop (now very dead). Knowing disks were cheap and getting cheaper, we wanted to make it "diskful" eventually, automatically syncing itself from a central server... but able to guarantee that it matched the server's files exactly. We couldn't find a protocol to do it. git is that protocol.

I built a system on top of Nitix, called Versabox, that let you install a Linux system on top of a Nitix system without virtualization. I wanted a way to make it easy to install software into that Linux environment, then repackage the entire thing as an all-in-one installer kit, but have the archive contain both the original package and the new content; that way you could upgrade either part without touching the other. To do that I invented a new file format and tool, called versatar. It works, and we use it at my new company. But git would do it much better, and includes digital signatures too for free.

Numerous people have written diff and merge systems for wikis; TWiki even uses RCS. If they used git instead, the repository would be tiny, and you could make a personal copy of the entire wiki to take on the plane with you, then sync your changes back when you're done.

When Unix pipes were invented, suddenly it was trivially easy to do something that used to be really hard: connect the output of one program to the input of the next. Pipes were the fundamental insight that shaped the face of Unix. Programs didn't have to be monolithic.

With git, we've invented a new world where revision history, checksums, and branches don't make your filesystem slower: they make it faster. They don't make your data bigger: they make it smaller. They don't risk your data integrity; they guarantee integrity. They don't centralize your data in a big database; they distribute it peer to peer.

Much like Unix itself, git's actual software doesn't matter; it's the file format, the concepts, that change everything.

Whether they're called git or not, some amazing things will come of this.

Syndicated 2008-02-01 01:33:37 from apenwarr - Business is Programming

2008-01-21: "git checkout" is faster than "cp -a"

"git checkout" is faster than "cp -a"

It's true. I've determined this experimentally. And it makes sense, too: if you've used "git-repack" on your repository, then you have a nice, compressed, sequential file that contains all the data you're going to read. So you read through it sequentially, and write into the disk cache. Up to a certain size, there's no disk seeking necessary! And beyond that size, you're still only seeking occasionally to flush the write cache, so it's about as fast as it gets.

Compare to "cp -a", where for each file you have to read the directory entry, the inode, and the contents, each of which is in a different place on disk. The directory is sequential, so it's probably read all at once and doesn't need a seek. But you still have about two seeks per file copied, which is awful.

Even if your disk cache already contains the entire source repository, copying files requires more syscalls (= slow) than reading large sequential blocks of a single huge file. In other words, even with no disk access involved, git-checkout is still faster than "cp -a". Wow.
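
If you want to reproduce the comparison, a rough harness looks something like this. The repository path is invented, /tmp is assumed to be empty, and for the cold-cache case you'd have to drop the disk cache (or reboot) between runs:

    import shutil, subprocess, time

    REPO = "/home/me/src/some-big-project"   # invented path to a repacked repository

    def timed(label, fn):
        start = time.time()
        fn()
        print("%-14s %.2fs" % (label, time.time() - start))

    timed("git checkout", lambda: subprocess.check_call(
        ["git", "-C", REPO, "checkout-index", "-a", "-f", "--prefix=/tmp/by-git/"]))
    timed("cp -a (ish)", lambda: shutil.copytree(
        REPO, "/tmp/by-cp", symlinks=True,
        ignore=shutil.ignore_patterns(".git")))   # skip .git so both copy the same files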

In related news, check out this funny mailing list discussion from 2005, in which Linus defends his crazy ideas about merging. It reminds me of the famous "Linux is obsolete" discussion from back when Minix was clearly going to rule the world. Actually, it reminds me rather disturbingly of that, and the results we see now are very similar.

Here's an excellent discussion of some of the brilliant design behind git.

Yes, I have become a true believer. The UI consistency needs work, though. The feature list grew really really fast, and it shows.

Syndicated 2008-01-19 23:44:08 from apenwarr - Business is Programming

2008-01-18: And that, as they say, is that

Goodbye, NITI, hello, IBM!

For those just joining us: I founded NITI but I don't work there anymore.

However, I can now safely say from firsthand experience that while some people are demonstrably evil, at least some VCs are actually not. I suppose anti-VC sentiments are a form of racism; evilness and incompetence are traits that turn out to be independent from VCness.

Syndicated 2008-01-19 19:26:11 from apenwarr - Business is Programming

2008-01-19: The democrats are throwing the election

I'm a Canadian. I do my very best to ignore American politics. Still, news of what's going on to the south can't help but permeate my thick consciousness occasionally.

Even then, only a tiny bit gets through. In this case, I have barely managed to absorb two facts: that President Bush's approval rating is a dismal 30% or so, and that the two frontrunners for the Democratic presidential nomination are a woman and a coloured guy whose name rhymes with "Osama." Unfortunately, those two facts overwhelm everything else.

People. Seriously. This is not an accident!

I'm (I think) neither racist nor sexist. But I can do statistics. Nobody who is not white, and nobody who is not a man, has ever been President of the United States. And here we are, with such a huge, widespread dislike for the Bush regime that the Democrats are virtually guaranteed a win - if only they can produce a halfway viable candidate.

And yet they don't - spectacularly. Remember 2004? One of the only things that I noticed about that election was the "flip flopper" scandal, in which they successfully demonized the Democratic candidate because he changed his mind that one time.

But don't worry. I'm sure the spin machine will totally leave the wife-of-the-guy-who-almost-got-impeached-for-cheating-on-her-but-she-apparently-forgave-him and/or the probable-Muslim-terrorist-sympathizer(*) alone when the time comes. It's a shoo-in!

(*) Disclaimer: I have absolutely nothing against Muslims. Despite what you may have heard, only a vanishingly tiny number of Muslims are terrorists. Other religions and atheists also produce terrorists. In fact, I have no idea if the guy is even a Muslim. But none of that will matter, when the time comes, because his skin isn't white and his name rhymes with Osama. Watch.

Syndicated 2008-01-19 19:07:02 from apenwarr - Business is Programming

2008-01-16: This post is not about Macbook Air

Yes, this is the so-called "blogosphere," and yes, people in said "blogosphere" tend to start meme-of-the-moment posts with statements like "I promised I wouldn't talk about such-and-such in this blog, but..."

I've done the same.

But not this time.

This time, I simply didn't write a post about the advantages and disadvantages of the feature selection in the Macbook Air. Or whether I plan to buy one, or how cool or not cool it is.

See how much restraint I have?

Syndicated 2008-01-16 17:32:13 from apenwarr - Business is Programming

2008-01-13: DemoCampCUSEC2 in Montreal

If all goes well, I'll be presenting at DemoCampCUSEC2 in Montreal. I was a little late signing up, but the organizers claim there's still time. I hope so.

I should be demonstrating my wild combination of Nitix, VMware, and a few other things, showing how to get an entire database-driven Windows application, including the Windows it runs on, deployed in 15 minutes or less. Come watch!

Syndicated 2008-01-14 03:00:48 from apenwarr - Business is Programming

2008-01-09: Why I Never Hire Brilliant Men

And now for something completely different: an article from 1924 called "Why I Never Hire Brilliant Men."

I find it's a very pleasant read. The soft tone is something we've sadly lost in our modern world of hyper-sensationalized "Top 5 blah blah" blog headlines. I'd like to be able to write like he did; no such luck, but at least I haven't resorted to the "top 5" yet.

As a bonus, the article also makes many good points about hiring.

Syndicated 2008-01-03 23:19:37 from apenwarr - Business is Programming

2008-01-07: More biases

I've written several times before about different kinds of statistical biases. I care a lot about that since, next to actual incorrect facts, the most common source of wrong decisions seems to be a misguided use of so-called statistics.

Here are two great articles about bias. The first is about the Anchor Bias:

    They spun a roulette wheel and when it landed on the number 10 they asked some people whether the number of African countries was greater or less than 10 percent of the United Nations. Most people guessed that estimate was too low. Maybe the right answer was 25 percent, they guessed.

    The psychologists spun their roulette wheel a second time and when it landed on the number 65, they asked a second group whether African countries made up 65 percent of the United Nations. That figure was too high, everyone agreed. Maybe the correct answer was 45 percent.

Isn't that amazing?

I claim, by the way, that people like Ayn Rand and Richard Stallman *have* to exist simply because they help de-anchor-bias others. "100% of software should be free?! Holy cow, you're crazy. Maybe more like 90%."

Meanwhile, Peter Norvig, who is (if I understand correctly; I'm offline as I write this) one of the Google researchers working on their PageRank statistics, wrote a great article about different kinds of bias in both experimental design and the interpretation of results.

It's long, but scroll down to section I4 and find the surprising answer to this question (via Eliezer Yudkowsky):

    1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammograms. 9.6% of women without breast cancer will also get positive mammograms. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

It is not a trick question, but my answer was completely wrong. Think about it, then follow the link and check your answer in section I4.
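
Once you've committed to a guess, checking it is a single application of Bayes' theorem to the numbers quoted above:

    p_cancer         = 0.01    # 1% of screened women actually have breast cancer
    p_pos_if_cancer  = 0.80    # 80% of them get a (true) positive mammogram
    p_pos_if_healthy = 0.096   # 9.6% of the healthy ones get a (false) positive too

    p_positive = p_pos_if_cancer * p_cancer + p_pos_if_healthy * (1 - p_cancer)
    print("P(cancer | positive) = %.1f%%" % (100 * p_pos_if_cancer * p_cancer / p_positive))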

Syndicated 2008-01-03 23:19:36 from apenwarr - Business is Programming

2008-01-05: Welcome to 2008, Part 3: Environmentalism Update

Please note the following changes in environmental terminology. Remember, if you get these mixed up, you'll look old-fashioned.

We used to refer to "the hole in the ozone layer." This hole was reputedly caused by certain chemicals (like our dear departed otherwise-non-toxic freon, now replaced by mildly toxic alternatives) which, when released into the atmosphere, would break down ozone molecules and take them out of circulation. The ozone layer is responsible for "absorbing" certain kinds of dangerous radiation from the sun and turning them into "harmless" heat.

At the same time, there were warnings about an excess of "greenhouse gases" and the related problem of acid rain. At the time, the majority of activism was toward reducing emissions of various nasty particles like carbon monoxide, methane, and sulphur. Natural Gas was described as the "clean alternative fuel", because all it releases (when burned efficiently) is carbon dioxide.

Greenhouse gases work like this: the sun's radiation is partly absorbed by the earth and partly radiated back out. Greenhouse gases tend to absorb that outgoing radiation, trapping it in the atmosphere instead of letting it escape, thus increasing the temperature.

Ironically, ozone is a greenhouse gas. The "hole in the ozone layer" prevents certain types of radiation from being absorbed and safely converted into harmless heat. Other greenhouse gases absorb other wavelengths of radiation, converting it into dangerous heat. Got it? Good.

We don't talk about the ozone layer or greenhouse gases anymore. Instead, we talk about "carbon emissions," by which we mostly mean "carbon dioxide emissions." Carbon dioxide is what you produce when you breathe. After you clean up your artificial pollution-spewing devices, carbon dioxide is pretty much all that comes out. Other than its contribution as a greenhouse gas, it is harmless.

So the question is: why do we hear so much now about "carbon emissions" instead of "greenhouse gases" in general, or acid rain, or the ozone layer? Is it good news, and the other problems are mostly solved? Or do we as a society just fixate randomly on the most recent problem that someone famous has made a movie about?

Syndicated 2008-01-03 23:19:35 from apenwarr - Business is Programming
