Older blog entries for badger (starting at number 54)

Thanks to Mike McGrath and a host of other Fedora Infrastructure people, the new Fedora Hosted server has better support for Bazaar Repositories than the old one! Under the old setup we had sftp for authenticated checkout and http for anonymous checkout. In the new scheme we have smart servers replacing both of those.

Here's a bit of preliminary timing to show you the difference. Note that these times have several inaccuracies and should only be used as an extremely general indication of speed. A few caveats that I know of:

  • These times don't test push, merge, and pull speed, which are the networked operations that you'll be using most frequently. They only show the initial branch operation (which, however, is the first impression that most people get of an SCM's speed).
  • rsync doesn't understand the bzr repository structure so I have to download the entire repository with rsync instead of just the branch I'm interested in. In this case, the difference is 4876K for rsync vs 4556K for the bzr protocol.
  • Although I have a bazaar repository set up on the server, I don't have one set up on the client. In normal operation, I'd already have most of the data checked out in a repository in a different branch, so a new branch would have to transfer a lot less data.
  • This test is performed with bzr's dirstate-tags branch format. The latest bzr has a new "pack" format that is supposed to be faster. We currently have 1.0rc3 on the server and I'm running 0.92 on the client. I'll likely be experimenting with the new format after I get 1.0 installed on both sides.

Baseline rsync

This runs rsync and then does a lightweight checkout in the directory (leaving us with a branch and working tree just like a bzr branch operation).


rsync -a rsync://bzr.fedorahosted.org/bzr/packagedb/ .
cd fedora-packagedb-devel
bzr checkout --lightweight 


real 0m16.548s

bzr over http

This is equivalent to what we have had to use on the old server:


bzr branch http://bzr.fedorahosted.org/bzr/packagedb/fedora-packagedb-devel
Branched 258 revision(s).

real 1m2.844s

Ouch!

A fresh branch using the smart server


bzr branch bzr://bzr.fedorahosted.org/bzr/packagedb/fedora-packagedb-devel
Copying repository content as tarball...
Branched 258 revision(s).


real 0m23.192s

Much better! This is only 7 seconds slower than using rsync.

bzr+ssh smart server for rw access:


bzr branch bzr+ssh://bzr.fedorahosted.org/bzr/packagedb/fedora-packagedb-devel
Copying repository content as tarball...
Branched 258 revision(s).


real 0m25.501s

The ssh protocol adds a few seconds to the total time but it's also quite speedy.
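
To put the four runs side by side, here is a small sketch that converts the "real" values reported above into seconds and compares each one against the rsync baseline. It is written for Python 3, and the helper name to_seconds and the dictionary labels are made up for this illustration; they are not part of bzr or of the test setup.

```python
import re

def to_seconds(real):
    """Convert a `time` value such as '1m2.844s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError("unrecognised time value: %r" % real)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# 'real' times copied from the runs above.
timings = {
    "rsync + lightweight checkout": "0m16.548s",
    "bzr over http": "1m2.844s",
    "bzr smart server": "0m23.192s",
    "bzr+ssh smart server": "0m25.501s",
}

baseline = to_seconds(timings["rsync + lightweight checkout"])
for label, real in timings.items():
    seconds = to_seconds(real)
    # Show the absolute time, the gap to rsync, and the ratio to rsync.
    print("%-30s %7.3fs  %+7.3fs vs rsync  (%.2fx)"
          % (label, seconds, seconds - baseline, seconds / baseline))
```

Since each figure comes from a single run, the ratios are only as trustworthy as the caveats listed at the top allow.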

One interesting point is that the smart server times are much better than the ones I posted in October. This could be because we're dealing with a different server on a different network or it could be that we've gone from using bazaar-0.18 to bazaar-1.0rc3 on the server. I'll have to do more timings after we get settled into the new machine to tell more.

My last week was largely spent working on Fedora Infrastructure tasks: fixing a lot of annoyances, creating scripts to automate tasks, and helping to get the new hosted infrastructure up and running. This week has started off the same but I'm hoping to get some work done on some new authentication code for python-fedora before next week starts. J5, Ricky Zhou, and I have been talking about what it would take to get single sign on, certificate authentication, and openid working and I want to start modifying code to make that happen.

Note: I did get to cross one of my PackageDB tasks off without doing much of anything -- porting to RHEL5 turned out to only be a matter of getting a few packages branched for EPEL. No coding required.

Appalachian Trail Memories

I dozed off after dinner last night and dreamt of a hiker named Raindancer (after some googling, I'm pretty sure I never hiked with a Raindancer but it seems to have been a suitable mental prod). On waking, I felt compelled to scour the Internet for people from my 1999 hike of the Appalachian Trail. It was somewhat depressing to do. There are a lot of memories of the 1999 hikers on the web but not much that can be used to reconnect with them. So, for Lank, Butterfly, Hummingbird, Shortcut, Flame, Irie, Moose of the New FBI, Llama, Wadi, Lu the Wanderer, Walking Pine, Hobbit, Algae, Beaner, Wild Turkey, Mouse's Motor, Rockfish, Buzzard, Jojo Smiley, T-Bone, Captain, Monifa, Wide Brim, Dingo, Dragonfly, Cassiopia and Freak Dog, Goat, Tenderfoot, and all the hikers that may try to reconnect to me at some unspecified future date, Anonymous Badger is still thinking of you -- a.badger+gmail.

Giflib 4.1.6

The story so far...

I posted a new giflib-4.1.6 release over the weekend since it was brought to my attention that several segfaults occurred when corrupted gif images were fed into the utilities. At the same time, I posted a notice that I was no longer going to be updating the libungif package, since the Unisys LZW patent has expired.

Today I was wondering if many people still used giflib so I had a quick look at the sourceforge statistics:

Downloads for project giflib


Date            Downloads  Bandwidth
12 Nov 2007 *   1,693      819.3 MB
11 Nov 2007     2,568      1.2 GB
10 Nov 2007     187        80.6 MB
09 Nov 2007     147        74.1 MB

Definite bump from the release. Cool! But wait, there's more:

Downloads for Package giflib


Date            Downloads  Bandwidth
12 Nov 2007 *   57         30.3 MB
11 Nov 2007     150        75.2 MB
10 Nov 2007     60         22.9 MB
09 Nov 2007     37         19.0 MB

Only a hundred extra people are downloading giflib now? Where are the extra two thousand downloads coming from?

Downloads for Package libungif


Date            Downloads  Bandwidth
12 Nov 2007 *   1,784      862.5 MB
11 Nov 2007     2,418      1.1 GB
10 Nov 2007     127        57.7 MB
09 Nov 2007     110        55.0 MB

Hmm....

So despite the fact that I sent a note that libungif is now unmaintained and the fact that I did not make a new release for it this time around, everyone is rushing to download the old libungif package instead of the new giflib.
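
For a rough sense of scale, the per-package numbers above can be turned into a libungif share for each day. The sketch below (Python 3, with a made-up helper name libungif_share) also checks whether the two packages add up to the project total. They do for 9 to 11 November, but not for the row marked with an asterisk; my guess is that the asterisk marks a day whose figures were still being tallied, though the statistics page does not say so.

```python
# (date, giflib package downloads, libungif package downloads, project total)
# copied from the SourceForge tables above.
rows = [
    ("12 Nov 2007 *", 57, 1784, 1693),
    ("11 Nov 2007", 150, 2418, 2568),
    ("10 Nov 2007", 60, 127, 187),
    ("09 Nov 2007", 37, 110, 147),
]

def libungif_share(giflib, libungif):
    """Fraction of the two packages' downloads that went to libungif."""
    return libungif / (giflib + libungif)

for date, giflib, libungif, project_total in rows:
    adds_up = (giflib + libungif == project_total)
    print("%-14s libungif share: %5.1f%%  packages sum to project total: %s"
          % (date, 100 * libungif_share(giflib, libungif), adds_up))
```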

This is a little worrisome as the segfaults fixed in giflib are still present in the libungif package everyone's downloading. So here's my intarweb plea for the day:

If you actually find libungif useful, please contact me about becoming a maintainer for it. Otherwise, please switch to giflib!

Spent the night without electricity due to thunderstorms rolling through the area. It's amazing how central to my life this one little bit of technology is. No computers, no movies, no reading after dark...

More Bazaar Timings

I spent today getting Bazaar set up on the Fedora SCM server so that Bazaar repositories for hosted will work. Unfortunately, the SCM server has an old python2.3 install rather than the python2.4 needed to run Bazaar. Not a big deal, we can still run Bazaar repositories... we just can't use the smart server. My immediate question was: how much will that hurt the time it takes to download repositories?

To answer that, I made some highly inaccurate comparisons[1] checking out the packageDB repository to see how much it would suffer from moving from fedorapeople to the official repo:

First I ran:


bzr init-repo people-packagedb
bzr init-repo scm-packagedb

to set up a local shared repository for each set of branches I was going to check out. Since it's normal for a developer to use a shared repository to save space and download time, I decided I should test the time we'll lose for checking out an initial branch and for checking out a related branch into the shared repository.

For the initial branch in each repository I checked out the stable branch of the packagedb from fedorapeople into people-packagedb and the stable branch from the SCM server into scm-packagedb:


time bzr branch bzr+ssh://toshio@fedorapeople.org/home/fedora/toshio/public_html/bzr-repo/packagedb/fedora-packagedb-stable
Branched 103 revision(s).

real 1m11.957s

time bzr branch sftp://toshio@hg.fedoraproject.org/bzr/hosted/packagedb/fedora-packagedb-stable
Branched 103 revision(s).

real 2m20.937s

As expected, the smart server on fedorapeople helped tremendously. The initial checkout took about twice as long from the new server without the smart server.

Next I branched the devel branch from each repository into the shared repositories. Currently stable is pretty close to devel, so they're only a few revisions apart.


time bzr branch bzr+ssh://toshio@fedorapeople.org/home/fedora/toshio/public_html/bzr-repo/packagedb/fedora-packagedb-devel
Branched 210 revision(s).

real 0m11.511s

time bzr branch sftp://toshio@hg.fedoraproject.org/bzr/hosted/packagedb/fedora-packagedb-devel
Branched 210 revision(s).

real 0m9.882s

Wow. Unexpectedly, the SCM server with no smart server branched the revisions faster than the fedorapeople server. This could be due to a lot of factors, including random differences in my connection to the two sites. What it does show is that the large difference that exists when making an initial branch practically disappears when we check out a related branch into a shared repository.

[1]: Remember, this comparison used two separate servers in two different geographical areas. This data isn't good for much more than telling us that the Fedora SCM server will be much slower than fedorapeople on initial checkout but comparable on subsequent checkouts of related branches.
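
One way to reduce the noise in numbers like these is to repeat each run a few times and look at the spread rather than a single value. Here is a minimal sketch of such a harness, assuming Python 3; time_command is a hypothetical helper, and the command it times below is a harmless stand-in rather than a real bzr invocation.

```python
import subprocess
import sys
import time

def time_command(argv, runs=3):
    """Run argv several times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For a real measurement,
    # substitute something like ["bzr", "branch", URL, TARGET_DIR] and remove
    # TARGET_DIR between runs so that every run starts from scratch.
    durations = time_command([sys.executable, "-c", "pass"])
    print("min %.3fs  max %.3fs" % (min(durations), max(durations)))
```

This measures wall-clock time only, like the "real" line from the shell's time, so network conditions still dominate the result.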
20 Oct 2007 (updated 20 Oct 2007 at 18:29 UTC) »
Unicode and Python

I stumbled across http://www.amk.ca/python/howto/unicode a few weeks ago and have found it an absolute must-read for anyone who wants to understand unicode in python. In particular, this section contains a few tips for writing Unicode aware apps that must be taken to heart. Here's the cardinal rule:

Software should only work with Unicode strings internally, converting to a particular encoding on output.

For debugging python apps that are throwing UnicodeDecodeError or UnicodeEncodeError, this means a first step is to print type(VARIABLE) and see which of your strings are <type 'unicode'> and which are <type 'str'>.

All the ones that are str are bugs.

Trace the variable through your code to where it is entering the system (via a database call, user input, read from file, etc) and at that point, convert from str to python's unicode type.

Also note the second half of the rule. When you want to write your output to a file or send it over the network or even display it to the user you have to be sure to convert the data to a specific encoding of unicode that the recipient of the data expects. For network protocols, this might be ascii or utf-8. For many end-users on Linux these days this is utf-8.

Here's a contrived example that takes a utf-8 encoded string from the user, processes it as a number and mathematical symbol, and then outputs the answer as utf-8.


#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import sys
import math
print 'Enter [operator] [number]: ',
problem = sys.stdin.readline().strip()
# problem is a utf-8 encoded 'str' type at this point
# type(problem) will tell you that this is 'str'
problem = problem.decode('utf-8')
# type(problem) will now tell you that this is 'unicode'
operator, number = problem.split()
number = float(number)
# Square root
if operator == u'\u221A':
    answer = math.sqrt(number)
# Cube root
elif operator == u'\u221b':
    answer = math.pow(number, 1.0/3.0)
# Fourth root
elif operator == u'\u221c':
    answer = math.pow(number, 1.0/4.0)
else:
    raise Exception('Unknown operator')
answerLine = problem + ' = ' + unicode(answer)
print answerLine.encode('utf-8')


$ ./unicode-root.py
Enter [operator] [number]: √ 16
√ 16 = 4.0

Note that the header line: # -*- coding: utf-8 -*- would have allowed me to directly use the unicode characters u'√', u'∛' , and u'∜' instead of their \u escape codes but I wanted to use something that would render correctly in any web browser.
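
The same decode-on-input, encode-on-output pattern carries over to Python 3, where the text type is called str and the encoded type is bytes. The following is a rough Python 3 counterpart of the script above, not a drop-in replacement for it; the function name solve and the ROOTS table are invented for this sketch.

```python
import math
import sys

# Root symbols mapped to the exponent they stand for.
ROOTS = {
    "\u221a": 1.0 / 2.0,   # square root
    "\u221b": 1.0 / 3.0,   # cube root
    "\u221c": 1.0 / 4.0,   # fourth root
}

def solve(raw):
    """Decode UTF-8 bytes, work on text internally, return UTF-8 bytes."""
    problem = raw.decode("utf-8").strip()    # bytes -> str at the boundary
    operator, number = problem.split()
    if operator not in ROOTS:
        raise ValueError("Unknown operator")
    answer = math.pow(float(number), ROOTS[operator])
    line = problem + " = " + str(answer)     # everything here is text
    return line.encode("utf-8")              # str -> bytes on the way out

if __name__ == "__main__":
    # Write the encoded bytes directly so the terminal's locale settings
    # don't get a chance to re-encode them.
    sys.stdout.buffer.write(solve("\u221a 16".encode("utf-8")) + b"\n")
```

Keeping the conversion in one function makes it easy to see that nothing between the decode and the encode ever touches raw bytes.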

13 Oct 2007 (updated 13 Oct 2007 at 18:40 UTC) »

Thanks to some prodding by Seth Vidal, I've merged Descriptions from the yum package repositories into the Package Database.

Package Descriptions

This is most exciting because the code to implement this opens up numerous possibilities for integrating with other repository information, either by incorporating it directly into the pages or by finding out what's available and linking to the relevant repoview pages.

Using setuptools/distutils to install Applications

In Fedora Infrastructure we've started switching from deploying our TurboGears applications directly from source control to packaging them in rpms. Since many of these packages are only useful within Fedora Infrastructure, we don't have to make sure they are paragons of packaging virtue. Still, it got me thinking about how we are going to package TurboGears applications correctly in Fedora. There are several TurboGears applications (loggerhead and stickum) that I'd like to see make it into Fedora and they won't be able to do that if they don't follow the Packaging Guidelines.

The Problem

When you create a TurboGears project via tg-admin quickstart, it creates an environment for you to quickly get started programming a new python project. This includes a setuptools-based setup.py file to distribute your project when the time comes. This is nice for developers who want to quickly create a tarball for others to try out or to deploy their web application to the corporate server, but it has some problems for people attempting to package the files for a Linux distribution. In system packaging, distro packagers strive to follow the FHS's guidelines for where files belong. In a TurboGears application the FHS might have you use the following directories for the different file types produced by TurboGears:

  • python's sitelib: a public library to interface programmatically with the app
  • python's sitelib or /usr/share/APPNAME: a private library of things that the program uses. In a C program, these files would be compiled together into a single executable. In python, you have to install all the files somewhere
  • /usr/share/APPNAME: static data files
  • /var/lib/APPNAME: variable data files
  • /etc: config files
  • /etc/rc.d/init.d: init script
  • /usr/share/doc: documentation
  • /usr/sbin: programs that are only run by the superuser or system
  • /usr/bin: programs for use by the general public

Since setuptools and TurboGears mix static data (raw CSS, javascript, and images) with the application's private library, I decided that the easiest way to follow the FHS would be to install the private library and data files to /usr/share/APPNAME. This allows us to sidestep the need to separate the static data into /usr/share while keeping the private library in python's sitelib. There are two things to note when doing this:

  1. This won't work if the private library contains compiled modules, only if it's pure python.
  2. If some of your TG app's library is actually public, you need to create two python modules instead of one. One python module will be public and reside in the sitelib where other modules can import it freely to have access to the public API. The other portion will contain the portions necessary to running the TurboGears app. As a very rough guide, your TurboGears controllers and the view (templates and static data) need to be in the private library. The model and any helper classes/functions can be placed in the public interface if they need to be used by other programs.

There's one further twist on this. Although the above directories are correct for the FHS as written today, in Fedora we like to make our spec files easy to adapt to future changes. One way that we do this is to use macros for the directory hierarchies, so that if a hierarchy later moves, only the macro needs to change. For instance, we currently use %{_datadir} instead of /usr/share and %{_sysconfdir} in place of /etc. For a macro change to actually make the package build for the new directory structure, any references to specific directories within the built files have to be set at build time to the values of our macros.
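
As an illustration of why the indirection helps, the layout can be written as a function of the macro values instead of as hard-coded paths. The sketch below is Python 3 and purely hypothetical: install_dirs and DEFAULT_MACROS are names invented here, and the defaults are simply the values I'd expect these rpm macros to expand to on Fedora today.

```python
import posixpath

# Assumed defaults: what these rpm macros expand to on Fedora today.
DEFAULT_MACROS = {
    "_datadir": "/usr/share",
    "_sysconfdir": "/etc",
    "_localstatedir": "/var",
    "_bindir": "/usr/bin",
    "_sbindir": "/usr/sbin",
}

def install_dirs(appname, macros=None):
    """Return a mapping of file kind -> install directory for appname."""
    m = dict(DEFAULT_MACROS)
    m.update(macros or {})
    return {
        "private library and static data": posixpath.join(m["_datadir"], appname),
        "variable data": posixpath.join(m["_localstatedir"], "lib", appname),
        "config files": m["_sysconfdir"],
        "init script": posixpath.join(m["_sysconfdir"], "rc.d", "init.d"),
        "documentation": posixpath.join(m["_datadir"], "doc"),
        "system programs": m["_sbindir"],
        "public programs": m["_bindir"],
    }

for kind, path in install_dirs("fedora-packagedb").items():
    print("%-32s %s" % (kind, path))
```

Passing a different value for one macro, for example install_dirs("fedora-packagedb", {"_datadir": "/opt/share"}), moves every path derived from it without touching the rest of the table.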

An Initial Solution

With the problem outlined, I was able to write some code for the PackageDB that implemented the correct FHS layout. This consisted of several changes:

  1. setup.py -- contains the most important changes. By overriding the default setuptools/distutils commands for build, build_scripts, install, and install_lib I'm able to pass in command line arguments to install config files and the private library to different directories. This allows us to modify the built scripts to know where to look for those files.
  2. start-pkgdb.in -- had a large makeover. This was start-pkgdb.py. I removed the ".py" so it would be suitable for installation in a bin directory and added ".in" so that the custom setup.py methods had a means to decide it needed to have variables defined at build time. Inside the script we had to make a few changes:
    • Changed the way pkg_resources finds TurboGears. This is actually to fix brokenness in TurboGears' start-APPNAME template.
      
      + __requires__ = 'TurboGears[future]'
        import pkg_resources
      - pkg_resources.require('TurboGears')
      
    • Set the directory the configuration file will be installed in:
      
      CONFDIR='@CONFDIR@'
      
    • Set the path that the private library resides in:
      
      PKGDBDIR=os.path.join('@DATADIR@', 'fedora-packagedb')
      sys.path.append(PKGDBDIR)
      
    • Look for the configuration file in the proper directory:
      
      update_config(configfile=os.path.join(CONFDIR,'pkgdb.cfg'),modulename="pkgdb.config")
      
  3. And finally, fedora-packagedb.spec is used to build the package into an rpm.
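
The build-time substitution of the @CONFDIR@ and @DATADIR@ markers can be pictured with a few lines of Python. This is only a sketch of the idea, written for Python 3: substitute is a hypothetical helper, not the code in the PackageDB's setup.py, and the values passed in are example directories.

```python
import re

def substitute(template_text, values):
    """Replace @NAME@ tokens in template_text with entries from values."""
    def lookup(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than ship a script with a leftover marker.
            raise KeyError("no value supplied for @%s@" % name)
        return values[name]
    return re.sub(r"@([A-Z_]+)@", lookup, template_text)

# Two lines in the style of start-pkgdb.in, as quoted above.
template = (
    "CONFDIR='@CONFDIR@'\n"
    "PKGDBDIR=os.path.join('@DATADIR@', 'fedora-packagedb')\n"
)
print(substitute(template, {"CONFDIR": "/etc", "DATADIR": "/usr/share"}))
```

A custom build_scripts command could apply something like this to every script whose name ends in ".in" and write the result without the suffix.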

Room for Improvement

Although this set of changes successfully installs the files in FHS compatible directories, there is plenty of room for improvement.

  • I had to resort to moving the start-pkgdb script from /usr/bin to /usr/sbin in the spec. setup.py should have a method to mark a script as needing to be installed for sysadmin use (/usr/sbin) rather than end-user use (/usr/bin).
  • The path substitutions done on the start-pkgdb script from setup.py are a bit ad hoc. It might be nicer to use genshi or another text templating language to define the script that needs to have substitutions made.
  • The list of file types listed above needs to be incorporated into distutils setup() function somehow. Then we can mark files as application_package(), configFile(), initscript(), etc and have install_* classes to handle them.

Don't Believe a Word of It

Yesterday, someone wrote on several Fedora channels that bzr was still slow and thus wasn't in the running to compete with Mercurial and git. They said that it took them 10 minutes to branch the bzr repo for awn trunk. (Repo information here)

Something is wrong with that person's bzr installation (or perhaps, their connection to the repository is slow.)

Using bzr 0.18 (which is one behind the current release ATM) against launchpad should be nowhere near that painful:


$ time bzr co http://bazaar.launchpad.net/~awn-core/awn/trunk avant-window-navigator
                
real    2m39.057s
user    0m2.410s
sys     0m0.601s

Now, this is quite a bit faster than the 10 minutes claimed by the person on #fedora-admin and #fedora-devel, but it gets even better than this.

bzr was designed to serve repositories over plain http so that you could easily run a bzr repository with just web space on a random server. However, it has also had a "smart server" since 0.11 (released in September of 2006). Since Launchpad's smart server is currently for developers only (must have an ssh key there), I uploaded the awn branch to fedorapeople and timed branching there.

Here's the time it takes to do a plain http based branch similar to the one I did on launchpad earlier:


$ time bzr branch http://fedorapeople.org/~toshio/awn
Branched 112 revision(s). 
real    3m43.649s

Here's the time it takes to do a fresh checkout using the smart server:


$ time bzr branch bzr+ssh://fedorapeople.org/home/fedora/toshio/public_html/awn
Copying repository content as tarball...
Branched 112 revision(s).
real    0m35.212s

bzr may not be able to compare to the all-C git in terms of speed, but it isn't a 10 minute monster either.
