Open Source metrics

Posted 8 May 2000 at 20:21 UTC by cdegroot

The first edition of the Orbiten Free Software Survey analyzes some metrics from 25 million lines of code by over 12,000 authors (both institutional and individual authors). Nice data, and maybe your name is on the top producers list as well...


Nice, but not representative, posted 8 May 2000 at 22:45 UTC by gsutter » (Master)

The idea of this survey is interesting, but their implementation leaves a lot to be desired. They have chosen such Linux-centric sources that the information really isn't relevant at all for non-Linux or non-Linux-related projects.

I understand that this is their first survey, but even when talking about the code base for their next one, they demonstrate severe lack of clue by mentioning only SourceForge, OpenBSD, and Perl CPAN. What about the other BSD projects? What about the original UCB BSD code? Minix? Anything not having to do with Linux?

This metric is a Linux metric, not an Open Source metric, and the authors should change their documents to reflect that.

Re: Nice, but not representative, posted 9 May 2000 at 02:54 UTC by joey » (Master)

As you say yourself, they are planning to expand to at least 2 projects not at all related to linux, namely CPAN and OpenBSD. Anyway, given that the third author listed in their data is "the regents of the university of california", perhaps this data tells us how important BSD code is to linux?

Based on how they collect data, I doubt it makes a lot of sense to scan two linux distributions, or two similar BSDs, and combine that data. I'd imagine this would skew the data by giving items that appeared in two sources double weight in at least some cases.

Anyway, they provide source code, so stop whining about their data sources and generate your own statistics if you so desire.

That said, I don't like the quality of their data much. I don't know enough about statistics to tell if it is useful statistically, but it has so many nasties in it! For example, I scanned the source of rpm. The results:

0. kristof.depraetere@rug.ac.be: 804800 (48.346%) 
1. free software foundation, inc: 414249 (24.885%) 
2. red hat software: 124317 (7.468%) 
3. gord@gnu.ai.mit.edu: 55267 (3.32%) 
4. drepper@gnu.ai.mit.edu: 34228 (2.056%) 
5. drepper@cygnus.com: 33875 (2.034%) 
...

This is just plain wrong (TM). The person listed as #1 here appears in the rpm source exactly once: He wrote the README.amiga file! For some unknowable reason, their data gatherer (CODD) seems to have decided this means he is the primary author of rpm, and credited him for everything that had no other names attached. This although there is a README file that lists the two real primary authors.

My second example is number 4 on their list of top authors, Gordon Matzigkeit. Gordon, their data would have you believe, wrote 1.2% of all code they scanned, and is a participant in 267 projects. And you thought Alan Cox was busy!

Well, if you recognize Gordon's name, you'll remember what project he is perhaps best known for: libtool. Now, packages that use libtool happen to include some rather long (autogenerated) files in them that have Gordon's name attached. So for every package that uses libtool, Gordon gets credited with about 8 thousand lines of code. What a sweet deal!
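One way a scanner could avoid this is to skip files that a tool copies into every package. What follows is only a minimal sketch: the file-name list and the `credit_by_file` mapping are hypothetical and are not taken from CODD.

```python
import os

# Hypothetical list of file names that libtool drops into packages.
# This is an assumption for illustration, not CODD's actual behaviour.
TOOL_SUPPLIED = {"ltconfig", "ltmain.sh"}

def is_tool_supplied(path):
    """Return True if the file name matches a known tool-supplied file."""
    return os.path.basename(path) in TOOL_SUPPLIED

def filter_credits(credit_by_file):
    """Drop per-file credits for tool-supplied files.

    credit_by_file maps a file path to an (author, line_count) tuple.
    """
    return {
        path: credit
        for path, credit in credit_by_file.items()
        if not is_tool_supplied(path)
    }
```

Matching on file names is crude, since a package could ship its own file called ltconfig, but it would at least keep one copied script from being counted once per package.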

So I don't trust these statistics very much, especially their finding that "The top 1271 authors, 10% of the total, accounted for 72.3% of the total code base. [...] Free software development may be distributed, but it is most certainly very top heavy.".
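For anyone who wants to check a figure like that against their own scan, the share is easy to recompute. A minimal sketch, assuming a hypothetical `lines_by_author` mapping from author to credited line count (the survey's real data format may differ):

```python
def top_share(lines_by_author, fraction=0.10):
    """Share of all lines credited to the top `fraction` of authors."""
    counts = sorted(lines_by_author.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one author, even for tiny inputs.
    top_n = max(1, int(len(counts) * fraction))
    return sum(counts[:top_n]) / total
```

Deleting a suspect author from the mapping and calling `top_share` again shows how much a single mis-credit moves the result.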

Not so reliable, posted 9 May 2000 at 02:55 UTC by hp » (Master)

It shows me as a major contributor to "gnuclear" and nothing else - I don't even know what gnuclear is. ;-)

Re: not so reliable, posted 9 May 2000 at 03:17 UTC by joey » (Master)

Gnuclear was built by using GnomeHello as a template. Havoc wrote GnomeHello. They haven't removed many of the gnomehello traces, and thus Havoc shows up as a primary author. The actual author only shows up in 2 files, and so he's near the bottom of the list.

Of course, Gordon is listed as contributing more than Havoc, thanks to ltconfig again. :-)

It is weird that Havoc didn't show up in anything else they scanned.

poor Python representation, posted 9 May 2000 at 06:57 UTC by dalke » (Journeyer)

It doesn't know anything about who wrote Python, and puts Guido as contributor to sox and xanim. Zope isn't there, nor is PIL. Lemburg's mx* packages are present.

And then there's the lack of chem and bio software. No RasMol, TINKER, HMMer, or other free packages. Problem there is these aren't usually advertised on freshmeat or other 'common' places.

They need to clean up their names. They have "international business machines" and "international business machines incorporated" and "international business machines corp" and "international business machines, inc" and "ibm corporation" and "ibm deutschland entwicklung gmbh, ibm corporation". Okay, that last one probably legally is a different company.
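A rough sketch of that kind of cleanup, assuming a hand-maintained alias table and suffix list (both hypothetical, and only big enough to cover the variants listed above):

```python
import re

# Hypothetical alias table and suffix list, chosen only to cover the
# variants quoted above; a real survey would need much longer ones.
ALIASES = {"international business machines": "ibm"}
SUFFIXES = ("incorporated", "corporation", "corp", "inc")

def normalize_author(name):
    """Collapse spelling variants of an organisation name to one key."""
    key = re.sub(r"[.,]", " ", name.lower())  # drop punctuation
    words = key.split()
    # Strip trailing corporate suffixes such as "inc" or "corp".
    while words and words[-1] in SUFFIXES:
        words.pop()
    key = " ".join(words)
    return ALIASES.get(key, key)
```

With these tables the first five variants all map to "ibm", while the "ibm deutschland entwicklung gmbh" entry stays separate, which seems right if it really is a different legal entity.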

Re: gnuclear, posted 9 May 2000 at 12:59 UTC by lkcl » (Master)

actually, gnuclear is an open source research project into cold fusion.

looks like the codd does a search by Copyright somename in the .c and .h files. it also looks like they used freshmeat.net and any srpms they could find: they reference samba 2.0.5a which was released at least four or maybe even five months ago, yet i created pam_ntdom on freshmeat.net only appx two months ago. so, something weird going on, there.
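a search like that could be sketched roughly as below. the pattern is a guess at the kind of match described above, not codd's real rule, and `copyright_holders` is a made-up name.

```python
import re

# Assumed pattern: "Copyright", an optional "(C)", optional years, then a name.
# This is a guess for illustration, not CODD's actual matching rule.
COPYRIGHT_RE = re.compile(
    r"copyright[ \t]+(?:\(c\)[ \t]*)?(?:[\d,\- \t]+)?"
    r"(?P<holder>[^\d\n*/][^\n*/]*)",
    re.IGNORECASE,
)

def copyright_holders(source_text):
    """Return the lower-cased names found on copyright lines."""
    holders = []
    for match in COPYRIGHT_RE.finditer(source_text):
        holder = match.group("holder").strip(" .,")
        if holder:
            holders.append(holder.lower())
    return holders
```

a file with no copyright line gives an empty list, which is exactly the "uncredited" case that then has to be guessed from somewhere else.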

pleased to see they released source, i imagine they'd get lambasted otherwise by ppl they're researching!

Re: Nice, but not representative, posted 9 May 2000 at 17:22 UTC by joey » (Master)

I received the following mail from the authors of the study in response to my comments, and I'm posting it here since they couldn't.

hi,

cees pointed me to the thread on ofss at advogato, and since i can't post there i thought i'd respond to your comments directly.

> This is just plain wrong (TM). The person listed as #1 here
> appears in the rpm source exactly once: He wrote the README.amiga
> file! For some unknowable reason, their data gatherer (CODD)
> seems to have decided this means he is the primary author of rpm,
> and credited him for everything that had no other names attached.
> This although there is a README file that lists the two real
> primary authors.

thanks. that's a bug; as a last resort, if there is too much uncredited code in a package, we look for names in readme files. i guess we should find a way to decide which README is the real one!

> My second example is number 4 on their list of top authors,
> Gordon Matzigkeit. Gordon, their data would have you believe,
> wrote 1.2% of all code they scanned, and is a participant in 267
> projects. And you thought Alan Cox was busy!
>
> Well, if you recognize Gordon's name, you'll remember what
> project he is perhaps best known for: libtool. Now, packages that
> use libtool happen to include some rather long (autogenerated)
> files in them that have Gordon's name attached. So for every
> package that uses libtool, Gordon gets credited with about 8
> thousand lines of code. What a sweet deal!

well... even rms wasn't able to tell me why gord would have showed up so frequently! in fact most of his credits are being split with the FSF - i.e. his name turns up, presumably thanks to libtool, in lots of FSF code.

as you pointed out, CODD source code is available and our methodology is clearly documented, so if you think there's a problem you're welcome to fix it! we can fix the gordon bug and adjust our readme file scan (the next survey will look at documentation files more closely), but as such there will always be bugs - or a huge "uncredited" share - unless authors claim credit in a more organised way. maybe not LSM forms... but at least a "written by" or "author:" line somewhere in the comments?

> So I don't trust these statistics very much, especially their
> finding that "The top 1271 authors, 10% of the total, accounted
> for 72.3% of the total code base. [...] Free software development
> may be distributed, but it is most certainly very top heavy.".

well despite the bugs in assigning credits, they are not statistically significant. (i.e. if you remove gord, it would be someone else at about 1%) the "top heavy" graphical charts remain pretty much the same shape when we generate them for smaller projects with much better documentation where we check the authors, and are also similar to paul jones's findings for the linux apps at sunsite using LSM author info. certainly removing some anomalies (if only we had author credits for all the FSF code...) would flatten the curve out a bit, but not very much. which is not that surprising, really; the majority of contributors write bug fixes, or small amounts of code.

best, rishab

Another inaccuracy, posted 11 May 2000 at 03:17 UTC by jamesh » (Master)

This survey doesn't seem to detect duplicate code very well. One very frequently duplicated piece of code is GNU gettext. If you use GNU gettext for internationalisation, that adds about 140k of source to your package, which for small packages can be quite a large percentage. So it looks like there is a lot of FSF code in a lot of packages, but it is the same code each time.
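A simple way to catch exact copies would be to hash each file and credit each distinct body only once. This is just a sketch under that assumption; the `(author, content)` input format is hypothetical, and it would miss copies that differ by even one byte.

```python
import hashlib

def count_unique_lines(files):
    """Count lines per author, counting each distinct file body once.

    files is an iterable of (author, content) pairs, where content is
    the file's text. Identical content seen again is skipped.
    """
    seen = set()
    totals = {}
    for author, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same bytes already credited once
        seen.add(digest)
        totals[author] = totals.get(author, 0) + len(content.splitlines())
    return totals
```

Different gettext releases would still be counted separately, so this only removes the most obvious double counting.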

I am not saying the FSF hasn't made big contributions, but this form of double counting probably affects the results quite a bit.

Another thing, posted 11 May 2000 at 03:29 UTC by jamesh » (Master)

I think I count as at least two people in the survey (my name and my email address). I think it is because I sometimes include my email address on the copyright line along with my name, and sometimes don't. Future surveys should try to match these up.
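One possible approach is to learn the name/email pairing from any credit that contains both, then fold bare email credits into the name. A minimal sketch, with made-up helper names and a deliberately loose email pattern:

```python
import re

# Loose email pattern; good enough for a sketch, not a full address parser.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+")

def split_credit(credit):
    """Split a credit string into (name, email); either may be ''."""
    match = EMAIL_RE.search(credit)
    email = match.group(0).lower() if match else ""
    name = EMAIL_RE.sub("", credit).strip(" <>(),").lower()
    return name, email

def build_aliases(credits):
    """Map each email to a name, using credits that contain both."""
    aliases = {}
    for credit in credits:
        name, email = split_credit(credit)
        if name and email:
            aliases[email] = name
    return aliases

def canonical(credit, aliases):
    """Return one key for a credit, preferring the name when known."""
    name, email = split_credit(credit)
    return name or aliases.get(email, email)
```

This only works when at least one copyright line somewhere carries both the name and the address; an address that never appears next to a name stays a separate author.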
