Older blog entries for badger (starting at number 81)

Adel Gadllah (dragoo1) ran my script on his computer with a couple of other compressors: pbzip2 (a parallel implementation of bzip2) and pigz (a parallel implementation of gzip). His computer is a quad core with 6GB of RAM, a definite upgrade from the machine I tested on (dual core with 1GB of RAM). The results are quite interesting.

Since no new algorithms were introduced, just new implementations, the compression ratios didn't change much. But the times for the parallel implementations stand out: pbzip2 runs faster than gzip, and pigz -9 runs faster than lzop -1! If compression were the only process being run on the machine, the parallel implementations would definitely be worthwhile.
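For reference, swapping the parallel implementations into the script below is mostly a matter of changing the compressor variables, plus teaching the extension-matching case statement about the new program names. A hedged sketch:


# Hedged sketch: pigz and pbzip2 accept the same -1..-9 and -d flags
# and produce the same .gz/.bz2 files as their serial counterparts
GZIP='pigz'
BZIP='pbzip2'
# the case statement keyed on the program name would also need entries like:
#   pigz*)   ext='.gz'  ;;
#   pbzip2*) ext='.bz2' ;;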

Well, after reading this message from notting about speeds and sizes of xz compression at various levels, I got curious about how gzip fits into the picture. So I wrote a little script to do some naive testing, found a 64MB text file (an SQL database dump), and ran a benchmark. First, the script, so you can all see what horrible assumptions I'm making:


#!/bin/sh

LZOP='lzop -U'
GZIP='gzip'
BZIP='bzip2'
XZ='xz'

TESTFILE='/var/tmp/test.dump'

for program in "$LZOP" "$GZIP" "$BZIP" "$XZ" ; do
    case $program in
        gz*) ext='.gz'  ;;
        bz*) ext='.bz2' ;;
        xz*) ext='.xz'  ;;
        lz*) ext='.lzo' ;;
        *)   echo 'error! No configured compressor extension'
             exit
             ;;
    esac

    COMPRESSEDFILE="$TESTFILE$ext"

    for lvl in `seq 1 9` ; do
        c_time=`/usr/bin/time -f '%E' 2>&1 $program -$lvl $TESTFILE`
        c_size=`ls -l $COMPRESSEDFILE | awk '{print $5}'`
        d_time=`/usr/bin/time -f '%E' 2>&1 $program -d $COMPRESSEDFILE`
        printf '%-10s %10s %10s %10s\n' "$program -$lvl" $c_time $c_size $d_time
    done
done

As you can see, I'm not flushing caches between runs or doing anything fancy to make this a truly rigorous test. I'm also running this on my desktop (although I wasn't actively doing anything on that machine, it was logged into a normal X session with all the wakeups and polling that implies). I also only used a single input file for data; binary files or tarballs with a mixture of text, images, and executables could certainly give different results. Grab the script and try this out on your own sample data. And if you get radically different results, post them!
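If you want to get a little more rigorous on Linux, the kernel can be told to drop its page cache between runs so each compressor reads the test file cold from disk. A minimal sketch (requires root):


# Flush dirty pages to disk, then drop the page cache, dentries, and inodes
sync
echo 3 > /proc/sys/vm/drop_caches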


Compressor   Compress     Size   Decompress
----------   --------   -------  ----------
none [*]_     0:00.43   67348587    0:00.00


lzop -U -1    0:00.57   16293912    0:00.35
lzop -U -2    0:00.62   16292914    0:00.40
lzop -U -3    0:00.62   16292914    0:00.34
lzop -U -4    0:00.57   16292914    0:00.42
lzop -U -5    0:00.57   16292914    0:00.42
lzop -U -6    0:00.67   16292914    0:00.41
lzop -U -7    0:13.53   12824930    0:00.30
lzop -U -8    0:39.71   12671642    0:00.32
lzop -U -9    0:41.92   12669217    0:00.28

gzip -1       0:01.96   11743900    0:01.02
gzip -2       0:02.04   11397943    0:00.92
gzip -3       0:02.77   11054616    0:00.89
gzip -4       0:02.59   10480013    0:00.82
gzip -5       0:03.42   10157139    0:00.78
gzip -6       0:05.44    9972864    0:00.77
gzip -7       0:06.71    9703170    0:00.76
gzip -8       0:13.64    9592825    0:00.91
gzip -9       0:15.89    9588291    0:00.76

bzip2 -1      0:20.17    7695217    0:04.73
bzip2 -2      0:21.68    7687633    0:03.69
bzip2 -3      0:23.48    7709616    0:03.63
bzip2 -4      0:26.00    7710857    0:03.69
bzip2 -5      0:25.45    7715717    0:04.09
bzip2 -6      0:26.95    7716582    0:03.95
bzip2 -7      0:28.13    7733192    0:04.23
bzip2 -8      0:29.71    7756200    0:04.36
bzip2 -9      0:31.39    7809732    0:04.50  [@]_

xz -1         0:08.21    7245616    0:01.86
xz -2         0:10.75    7195168    0:02.23
xz -3         0:59.45    5767852    0:01.90
xz -4         1:01.75    5739644    0:01.83
xz -5         1:09.70    5705752    0:02.60
xz -6         1:46.23    5443748    0:02.09
xz -7         1:50.37    5431004    0:02.19
xz -8         2:02.41    5417436    0:02.19
xz -9 [#]_    2:18.12    5421508    0:02.55

.. [*] Time to copy the file.
.. [@] What's up with bzip2? Why does the size increase with higher levels?
.. [#] Note, xz -9 is unfair on two counts: 1) it pushed me into swap.
   2) As for the size, xz had this output during that run::

       Adjusted LZMA2 dictionary size from 64 MiB to 35 MiB to not
       exceed the memory usage limit of 397 MiB
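(If that memory cap is what hurt xz -9 here, newer xz builds let you raise the limit explicitly. The exact option spelling has varied between releases, so treat this as a sketch:)


# Raise the memory limit so -9 can keep its full 64 MiB dictionary
xz -9 --memlimit=1GiB /var/tmp/test.dump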

My conclusions based upon entirely too little data :-)

  • If you want transparent compression, use lzop at one of the lower compression settings. I got 25% of the size at 100 MB/s with lzop -2.
  • Do not use lzop with -7 or higher. If you want more compression than -2/3/4/5/6 give (the algorithm is currently the same for all of them), use gzip instead. You'll get better compression at better speed.
  • The only reason to use bzip2 is if you need a smaller size than gzip gives and you can't deploy xz on the remote side. If you don't need the smaller size, or the remote side can handle xz, then bzip2 is a waste. This applies to distributing source code tarballs in two formats, for instance: if you're going to release in two formats, use tar.gz and tar.xz instead of tar.gz and tar.bz2 (see the sketch after this list).
  • xz gets the smallest size, but it's versatile in other ways too: xz -2 is faster than gzip -9 and compresses better.
  • gzip beats xz at decompression, but not nearly as badly as it beats bzip2.
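For the dual-format release mentioned in the list, with a hypothetical myproject-1.0 source tree and a reasonably recent GNU tar (one that understands -J for xz; older tars can pipe through xz instead):


# Hypothetical release directory; produce both formats from the same tree
tar -czf myproject-1.0.tar.gz myproject-1.0/
tar -cJf myproject-1.0.tar.xz myproject-1.0/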

So thanks to cdfrey, I'm a little closer on two fronts.

First, the problem as given has a solution for hack #2 but apparently not hack #1. Here's the new sequence of commands:


git checkout base_url
git log
# Manually find the last commit in staging before I branched
git rebase --onto master [COMMIT ID FROM ABOVE]
git checkout master
git merge base_url

So no more patches, yay! However, you probably notice that we still have to use git log to find the branchpoint. After some discussion of this, it seems that if we have merged from the feature branch back to the branch it came from, there's no way around it: git does not maintain a history of where a branch came from and where it merged back; it holds onto the heads and then follows the chain of commits backwards. So once we've merged, there's no branch point anymore... the trees are the same.

However, we did figure out a potential way to implement our workflow in the future. Instead of branching from staging, the feature branch should start off branching from master. After it's been worked on, it gets merged to staging. But since it started off from master, that should still leave the feature branch with a clear path of changes to apply to master. Once the changes have been tested in staging, we can merge the feature branch into master and it's then "okay" for the branchpoint to disappear since the work is completed.
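In command form, that future workflow would look something like this (some_feature is a hypothetical branch name, not one from my repo):


git checkout -b some_feature master   # branch from master, not staging
# ... commit the feature work ...
git checkout staging
git merge some_feature                # test the feature in staging
git checkout master
git merge some_feature                # no staging-only commits come along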

Okay, git lovers, I have an incredibly simple problem but so far the only working solution is a kludge. I'm hoping someone can tell me what the elegant way to solve this problem is.

I'm working with three branches keeping configuration information for our environment. master is where our production configs live. staging is a branch where we merge changes and test them in our staging environment. Once tested, they get cherrypicked to master.

base_url is where I've been working on a new change that spans several commits. It was branched off of staging. After completion, I merged the changes into the staging branch and tested. So far so good.

Now I want to merge my branch into master. How do I do that?

Here's an idealized diagram of the branch relationships. In reality, sometimes changes go into master before staging.


master       staging   base_url
  |             |  merge |  ___
  | cherrypicks +<-------+   ^
  +<------------+        |   |
  |    (cp)     |        |  How do I merge these to master?
  +<------------+ branch |   |
  |    (cp)     +------->+  _V_
  +<------------+
  |   branch    |
  +------------>+
  |
  |
/srv/puppet

So far everything I've tried with git rebase or git merge seems to send changes into master that were already in the staging branch before I branched base_url. I don't want that. I did my changes on a separate branch precisely so I could later merge just my changes into both staging and master.

Here's the kludge that did work:


git checkout base_url
git log
# Manually find the last commit in staging before I branched
git format-patch [COMMIT ID FROM ABOVE]
git checkout master
git am [patch 0001 0002 0003....etc]

The two things that I find too hacky about this solution are:

  1. using git log to find the branch point. git should know where I branched from staging... I just need some way to tell it that I want the changes from the branch point forward, and it should find the proper commit id itself (a possible helper is sketched after this list).
  2. generating patches and then applying them. git should be able to do this without round-tripping through temporary files like this. The deltas are in the repo; why pull them out into text files?
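(A possible helper for hack #1, hedged: before the feature branch has been merged back, git merge-base can compute the branch point. Once base_url has been merged into staging, as in my repo, it just returns base_url's own head, so it doesn't solve my case.)


git merge-base staging base_url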

I have copies of my repository before I "fixed" it with the patch commands. So send me your recipes and I'll see if any of them work. Once we have a winner, I'll post strategies that worked and ones that didn't.

Of course, even after I know how to do this, there's still all sorts of follow on questions -- like, what happens if this new feature took a long time and I needed to remerge the base_url branch with staging in the middle?

a.badger@gmail.com or abadger1999 on irc.freenode.net

Do not buy Swingline stapler model #545xx

My wife was having problems stapling today, so I looked inside this one and found that the staples had fallen over inside the stapler. Instead of forming the upside-down "U", all ready for the teeth to punch into the paper being stapled, the staples were positioned with the teeth to the front and the base of the "U" facing the spring-loaded rear of the stapler.

This is just poor design from some misguided engineer trying to cut costs. Every other stapler I've seen has some sort of platform down the middle of the staple feed chamber. The base of the "U" rests supported on that platform, so the staples don't depend on standing upright on their feet. Getting rid of the platform means the staples can fall over when the stapler is loaded or if the spring's tension is off.

Or perhaps it isn't poor design -- a little experimentation showed that the stapler has one feature sure to please a company exec, provided it's an exec of Swingline: it's nearly impossible to load small quantities of staples into this design. With nothing to support them, they just fall over and slide underneath the other staples.

Had a productive evening planned out but didn't get to do any of it because of a chicken emergency. First time I've actually seen "it gave a spasm that threw its whole body in the air and died." Parents get back tomorrow night so hopefully I can start working long hours again starting next week.


On 07/21/2009 04:24 AM, Dimitris Glezos wrote:
> For me, Fedora isn't so much what we think it is -- it's
> what our community wants it to be. And if a part of our community
> wants to try new things out, given that the resources needed won't be
> unmanageable, we should encourage them to do so.
> 
Posted on the Fedora Advisory Board list

I think that this is very, very true and something that we need to keep in mind as we go about defining what Fedora is. Thanks, Dimitris, for phrasing that so succinctly!

24 Jun 2009 (updated 24 Jun 2009 at 17:31 UTC)
FISL in the Morning

The Fedora booth has been well populated by Fedora Ambassadors from all around Latin America from Brazil to Mexico. For someone from the insular world of the United States, it's awe-inspiring to watch the ambassadors in action. Even though some speak Spanish and others Portuguese, they cheerfully work out their differences in language and laughingly toss jokes at one another. A line of potential Fedora users stretches out from the booth, entertained to watch interviews of Fedora ambassadors and developers as they wait to sign up for a FAS account and get the Brazilian Fedora 11 spin.

The conference is huge, and very much oriented toward free software. I attended LinuxWorld in San Francisco once and the crowd was roughly this size, but the type of attendee is very different. Where LinuxWorld seemed to have an abundance of businessmen looking to buy or sell a solution to someone else, FISL seems populated mostly by enthusiasts eager to meet up with fellow contributors to the projects they're involved in. A very nice crowd to watch and try to interact with despite my limited Portuguese.

People who lean toward the DAG as *recording* history will prefer Mercurial or Bazaar. People who tend to see the DAG as a tool for *presenting* changes will prefer git.

-- Stephen J. Turnbull on python-dev

While this doesn't express my main issues with git (which are UI driven -- although the UI might be irretrievably broken because of these features... I dunno :-), it does capture the reason I don't jump up and down when contemplating git's "advanced features".
