Older blog entries for peat (starting at number 14)

    Still thinking about some comments made by schoen (Software as science)

    I grew up reading very enthusiastic accounts of the work of very idealistic scientists, who mostly believed that they were working on a shared enterprise which by right belonged to all of humanity. The cool thing is that there have actually been a lot of scientists who believed that, and who lived that way.

    There have been great strides in the hard sciences and many of the natural sciences (geography, biostatistics, bioinformatics) thanks chiefly to the intuition of the researchers and the availability of the technology capable of carrying out the tasks required.

    From what I've seen, technology in the sciences is still by and large captive tech. Most large instrumentation requires software using undocumented interface specs. Sure, it may use a standard DB9 RS232 connector, but good luck finding the command set documented anywhere without an NDA or a substantial amount of cash (or both). Unfortunately, even among the scientists I know, one in particular insists on keeping his computer model private; this precludes peer review and slows down development. Others will typically release binaries for Win32 (most ecologists seem to be Win32-dependent) but none of the source.

    There have been some tremendous exceptions (examples here and elsewhere), in and out of ecology - which is great. As long as practicing scientists work within the spirit of open source, I think we may just see ecology explode just as the computing world did as more and more people embraced open source.

    The younger crowd of grad students seem more inclined to use Linux and free software, so there is hope. Unfortunately, the general technical literacy level is low enough to worry me. We go to schools, colleges, and universities to learn to think critically, to analyse, to delve and write. Courses in methodologies, critical thought, and the like are offered. As important as these are, they don't seem to go far enough, and in most biology departments I've been to, computer courses appear to be too far removed from the actual science at hand. And hell, I don't want to take a course on learning how to use Word, thanks.

    Perhaps the largest impediment to 'free science' is communication; most scientific journals are increasingly expensive. At up to $3000 per institutional subscription, it doesn't take long before many libraries carry fewer journals. As these costs increase, the next logical step is to start an "open source, peer review online journal"; indeed, this has already started. Unfortunately, they're not all that print friendly (granted, they're easier for a braille terminal to read, compared to a pdf). I think it will take numerous far-sighted individuals to pull off an SGML-based, open source journal; but I also think it will happen soon enough.

    Back to analysing those pesky data.

    Hm. Almost three weeks since the last entry. I've been checking in every so often before going to bed lately (and unfortunately usually too tired or too uncoordinated to write anything resembling a coherent thought), and noticed the discussion thread on the similarities between open source and science.

    Over the last few years, I've seen great examples of how some forward-looking people are working together, inside and outside of the university environment, to do better research and make a difference. Sadly, I've also found these people to be few and far between. Unfortunately, I've seen too many instances where a PI or possible collaborator would actively try to squelch fruitful discussions among grad students, post-docs, other profs, etc. The primary goal thus far seems to be getting as many good publications out as primary author, as quickly and as often as possible. Although it's a nice ego boost, the main reason for keeping a high publication rate is primarily financial (okay, prestige is there as well, let's be honest :) Maintaining this "competitive advantage" often appears to be the unwritten standing order, and this is especially seen in the infighting between people on large projects.

    There also tends to be a closed-mindedness, particularly about technology and how it can impact the way science is "done," for lack of a better term. Frankly, I don't see ecology / environmental science as having a data problem so much as a data organization problem: the data need to be organized in such a way as to make it:

    • easy to submit data for inclusion and prior verification (this is a big one)
    • possible to ensure effective access to the data
    • possible to ensure effective USE of the data

    I've done some work on this already, and have had good results - won't go too far into details; it would be boring and I need to finish this thesis. Rest assured, tho, that these will be "published" later under the GPL - some is already available at the SGPL project site, more will follow (and the much needed porting will happen soon, any gnumeric hackers out there? :).

    Open source has opened up some tremendous potential for science. Perhaps the biggest contribution, though, is to start getting scientists to think in the "Unix" frame of mind, or at least gain an appreciation of the Unix philosophy - copious small, specialized and reusable tools rather than a few large applications. I can't speak for anyone else, but thanks to some people, I've come to see a raft of new possibilities that only a few years ago I couldn't even dream of. The key to this epiphany was not to feel that I needed to create new software or programs, but rather to look carefully at how existing software CAN be put to different or interesting uses...

    • using a spreadsheet as an effective interface to a data source for complex, focused calculations
    • using a web server as an efficient tool for data analysis and visualization
    • using a search engine as a personal cataloging system for online journal articles
    • using a repository and good markup techniques to facilitate keeping local lab and study documentation up to date.

    The latter is usually an underappreciated and undervalued aspect of any endeavor, scientific or otherwise, and I've gained a lot of respect for those people or groups working on good docs.

    Even with all of this great open source software available, there is still a very considerable price to pay for gaining this perspective. Pretty much everyone I know working with Linux and Unix in general for their ecological research feels pretty isolated, because picking up *nix means that they no longer have any peers in a research world dominated by Win- and Mac-users. The energy (well spent, admittedly) in climbing the learning curve means that many in this situation (myself included) are perceived as being more interested in the technology than in doing science.

    Interestingly, in our case, we can easily work around this lack-of-peer-support problem by using that venerable geek tool - IRC - to maintain and develop our virtual peer group. Not only does this bring together some pretty competent *nix folk, but we get the added benefit of working in a very diverse community of researchers, and a place to talk with others about research and possible collaborations.

    Moral of the story: Hug an ecological *nix geek today. :)

    From the ongoing-saga-of-a-quack-gone-to-the-dogs dept:

      Finished the data prep work, at least enough to get data to analyse. 1058 lines of SQL, three weeks or so of learning "proper" use of the query language. Making a lot of mistakes in the process (but hey, they're *my* mistakes :) All in all, things are well.

      Tomorrow, esox will get its installation of R upgraded at last, and I'll finally install grace as well, and start playing.

      In the meanwhile, it's way past my bedtime.

      Gamble of the ages. Suit me up, I'm ready to go. - Tom Cochrane

    More dreams of databases and lakes.

    Not much to report lately, mostly been fixing some SQL that seems to keep breaking. I have the underlying data ready, so this should end today. I'm kinda sad about that, because I'm starting to get all kinds of ideas about how ecological databases can be used. I have several functions set out already that I want to port from VBA to Postgres, simple things like temperature conversions, oxygen saturation, etc.
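
    As a concrete illustration, here's what a couple of those simple conversions might look like. This is a sketch in Python rather than the actual VBA or Postgres code, and the function names are illustrative, not taken from the real codebase; in Postgres each would become a small CREATE FUNCTION definition so it can be called directly inside queries.

```python
# Illustrative examples (not the actual VBA routines) of the kind of
# simple, stateless conversion functions worth pushing into the database.

def celsius_to_fahrenheit(t_c):
    """Convert a temperature in degrees Celsius to degrees Fahrenheit."""
    return t_c * 9.0 / 5.0 + 32.0

def fahrenheit_to_celsius(t_f):
    """Convert a temperature in degrees Fahrenheit to degrees Celsius."""
    return (t_f - 32.0) * 5.0 / 9.0
```

    Oxygen saturation would follow the same pattern: a pure function of a few measured values, with no state, which is exactly what makes these easy to move into the db.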

    Having these functions inside the db proper makes sense, mostly because they can have utility in large scale data workup (like I'm doing right now). From a design standpoint, however, other functions should not be inside the db proper at all. I'm thinking of some of the more complex functions that can be brought to bear on data subsets. Much of the data we deal with has spatial and temporal structure, usually both (even if one is only implied), so conventional SQL breaks down for complex calculations. Besides, for profile analysis, Octave or something similar is most appropriate.

    Interfaces to data are also important. I had a really warped idea of using infobot or another bot as the basis for an information retrieval interface, albeit a very simplistic one, for a db. SQL is often overkill, not to mention confusing, for simple queries. I can see something like this happening:

      <pete>fish, list tables to me
      <fish>sending /msg to pete
      (I get a list of tables pasted as a /msg)
      <pete>fish, list years for lake 'pete' in temperature table
      <pete>fish, list dates for lake 'pete' for year '1996' in temperature table
      and so on...

    Granted, this is a little contrived, because now the user wants to drill down further and further in order to get at data, and avoiding SQL syntax at that point is not wise. Also, at this granularity, adding a generic user to a given table and letting said person "play" in the tables (read only, of course) is probably more intelligent. This latter approach is lacking somewhat because it means that only one person can see the data, rather than everyone on channel, which was the intent behind the 'fish' infobot mods.
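
    For what it's worth, the command-to-SQL translation the bot would need can be sketched very simply: a list of regex patterns, each mapped to a read-only query template. Everything here is an assumption for illustration - the table and column names (lakename, year, obs_date) are made up, and a real version would need to whitelist table names rather than splice user input into SQL.

```python
import re

# Sketch of the 'fish' idea: map a few fixed English command patterns
# onto read-only SQL. Column names (lakename, year, obs_date) are
# hypothetical; a real bot must validate table names before splicing
# them into a query.
COMMANDS = [
    (re.compile(r"list tables"),
     lambda m: "SELECT tablename FROM pg_tables WHERE schemaname = 'public'"),
    (re.compile(r"list years for lake '(\w+)' in (\w+) table"),
     lambda m: "SELECT DISTINCT year FROM %s WHERE lakename = '%s'"
               % (m.group(2), m.group(1))),
    (re.compile(r"list dates for lake '(\w+)' for year '(\d+)' in (\w+) table"),
     lambda m: "SELECT DISTINCT obs_date FROM %s WHERE lakename = '%s' AND year = %s"
               % (m.group(3), m.group(1), m.group(2))),
]

def to_sql(command):
    """Return the SQL for a recognized command, or None if unrecognized."""
    for pattern, build in COMMANDS:
        m = pattern.search(command)
        if m:
            return build(m)
    return None
```

    The bot would then run the resulting query as a read-only database user and paste the rows back to the channel (or as a /msg).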

    The other nice thing about this approach is that the bot logs all of its communications, so finding out what people are trying to do (which, of course, almost NEVER matches the spec of the system, 'cuz <cynical>Users Don't Read </cynical> :) provides hints for altering the query model.

    Other data does not lend itself well to being viewed textually, in that this spatial / temporal structure remains hidden until seen graphically. Oxygen and temperature profiles are good examples of this sort of data. I had some initial work done on an interface for profile data, but this was put aside due to lack of time. My recent departure from the Windows/VB world brings the opportunity to do this 'properly' (read: reimplementing it using X / OpenGL), and better yet, there are open source examples of distributed data visualization apps I can draw on for this.

    Cool. I can hardly wait. ;)

    ADMiSSeS is mostly recovered now. Woo!

    After having made what, in retrospect, turned out to be some pretty silly decisions when reintroducing link data, I managed to miss two fields in my primary data tables. These are fixed now, and I've learned a fair bit about the way postgres handles date types. In particular, I got thrown when the elephant[1] figured out that some of the data was taken during eastern DAYLIGHT time, and not eastern STANDARD time. heh. mumble mumble 3 am mumble. Oh well, at least the 900 lines of SQL ran without breaking. I really hope to finish the data work up tomorrow.

    Had a neat conversation tonight with Jody and Miguel about future directions of gnumeric. There had apparently been discussion about separating the front end (interface) from the back end logic (core functionality) at some point in the future. This would be great for a few reasons...

  1. it would expose the core logic behind gnumeric to other code, allowing it to be extended in all kinds of sundry ways. So long as the strict separation of data from code is maintained (unlike much VBA code), it should keep the avenues of exploitation reduced. I have a list of things I want to implement once this is available (most of the SGPL and PLT code, for starters), so I look forward to future developments.
  2. it would permit the development of a text-based front end to gnumeric. I can think of a few reasons why a text-mode interface would be useful for a spreadsheet - broader possible use (esp. on more mature hardware), for one. The longer I'm in this field, though, the more people I meet who are using specialized peripherals like eyetrackers, speech synthesizers and especially braille displays. Adding speech synth support to an app is a great idea, save that as an interface it is rather clunky when dealing with complex data (try running festival on math notation, f'rinstance :). In many cases, braille displays are more useful for intricate work.
  3. That said, some functionality exposed via the gui would likely be lost, at least initially, to text-mode users. However, given (1) above, some functions otherwise primarily presented via the gui can be presented via "scripting" (probably the wrong term to use here, but it's 3 am, and I'm tired. so there. :)

gnumeric already rocks. I can hardly wait to see how it develops from here.

What else... Not much. Too tired to write much more, and hopefully what's already typed is somewhat intelligible. 03:15 hours, waaaay past my bed time.

[1] The elephant being the mascot for postgres, of course

    ADMiSSeS is sick, and that is *not* good. :(

    Found a not-so slight problem with my research database (which I'm using as a prototype for a project I intend to release under the GPL later). Fixing it will likely break the 1 kloc of SQL I've already written to do my thesis calcs, but...

    So far, the main data table's been fixed, but all of the tables linking to it have to be updated. Rather than tempting fate and trying to fix everything at once (while hungry and distracted), I decided to take a few hours off and go downtown. Walking through my usual haunts just north of downtown usually clears my head. Had dinner downtown at a little deli, bought a book of poetry (a first) written by a friend (his first book), then headed up to the grad pub for an hour and read. David O'Meara's Storm Still is a wonderful book, btw (insert shameless plug here).

    Alas, it looks like tonight is a wash for database work. I've had a couple of pints of beer this eve (Hey, do I turn down a friend offering a pint of Guinness?? I think not! :) and coming in the door tonight, my housemate hands me a glass of Jamaican rum.

    So much for productivity.

    Given that DROP TABLE queries don't forgive, I'm off to bed to sleep off this feeling, and perhaps I'll find tomorrow a better and more productive day. My copy of Desai's Intro to DBMS awaits.

    Still here, still alive, still kicking.

    Recap: After having dinner with a colleague, I realized why I couldn't finish my thesis: Having followed my supervisor's advice (read: orders) to take his approach to analysing my data, the results were indeterminate. After 3 years and $200K spent on equipment, salaries, etc, and the only answer I could give was 'we can't find an effect using this approach', I finally accepted that I couldn't submit a thesis that could be easily co-opted for political and other ends. Back to the drawing board.

    After my NT box crashed (boom) and took my research db (and two years of my work) with it, I had little choice but to redo all my calculations from scratch using postgres. Relearning everything I thought I knew (from originally using access, natch) is somewhat painful, the kind of 'character building' activity that our parents told us about years ago, but we never truly believed. So far, I've successfully reimported the three years of data from our study catchments, and have ~80% of the data prep queries working properly. The remaining 20% are touchy in that they depend somewhat on the nature of the data returned by the earlier queries, so it's time consuming. On the other hand, I have 1 kloc of beautifully commented SQL. :)

    My biggest criticisms of postgres are the lack of left outer join (tho I know it's coming in the next couple of months) and the lack of native crosstabulation capabilities (TRANSFORM / PIVOT predicates), crosstabs being queries having aggregation at both the row and column level. That said, some research and more error and trial later,

    crosstabulation queries are possible !!!

      SELECT lakename,
        COUNT(CASE WHEN year = 1996 THEN depth ELSE NULL END) AS SY1996,
        COUNT(CASE WHEN year = 1997 THEN depth ELSE NULL END) AS SY1997,
        COUNT(CASE WHEN year = 1998 THEN depth ELSE NULL END) AS SY1998
      FROM crosstab
      GROUP BY lakename

      lakename|sy1996|sy1997|sy1998
      --------+------+------+------
      C12a    |     0|     0|     1
      C23a    |     1|     0|     1
      C24a    |     7|     5|     7
      C29a    |     2|     0|     3
      C2a     |     4|     3|     5
      C40a    |    20|    18|    16
      C44a    |    11|     9|    10
      C48b    |     5|     1|     0
      C9b     |    11|    11|     7
      FBP10a  |     5|     8|     3
      FBP9b   |     0|     6|    11
      FP15b   |     7|     9|     5
      FP24a   |     4|     4|     6
      FP27a   |    14|    15|     9
      FP2a    |     5|     5|     3
      FP30a   |     6|     7|     6
      FP31a   |     8|     7|     5
      FP32a   |     5|     5|     4
      N106a   |     4|     4|     2
    Now, granted, this isn't nearly as nice as:

      TRANSFORM count(depth)
      SELECT lakename, year
      FROM crosstab
      GROUP BY lakename
      PIVOT ON year

    ... But even SQL Server 7 doesn't do that either. :)

    Still, it works, at least on this limited scale. I'd hate to have used this approach on some species diversity work I did a while ago, on a few massive datasets (well, massive by ecology standards :) of 50-200k rows, and about 50-80 different species of interest. Usually the species names are used as the column heads, so that means a table with about 50-80 cols and x rows (one row / plot / date). Something's gonna have to give here, cuz I may have another one to do after I finish. Something in either Perl or Python to build the table. Hmmm. <rubs hands>
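
    In that spirit, here's a rough sketch of what such a generator might look like in Python - an assumption, not code I've actually written for the thesis. It rebuilds the COUNT(CASE ...) trick above, one generated column per pivot value; string pivot values (like species names) would additionally need quoting, and the alias prefix would change.

```python
def crosstab_sql(table, row_col, pivot_col, value_col, pivot_values):
    """Build a crosstab query using the COUNT(CASE ...) trick,
    generating one output column per pivot value. Assumes numeric
    pivot values; strings would need quoting and alias-safe names."""
    cols = ",\n  ".join(
        "COUNT(CASE WHEN %s = %s THEN %s ELSE NULL END) AS SY%s"
        % (pivot_col, v, value_col, v)
        for v in pivot_values)
    return "SELECT %s,\n  %s\nFROM %s\nGROUP BY %s" % (row_col, cols, table, row_col)

# Regenerates the hand-written three-year query above:
query = crosstab_sql("crosstab", "lakename", "year", "depth", [1996, 1997, 1998])
```

    With 50-80 species, the generated query is ugly but entirely mechanical, which is exactly the point.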

    And in the meanwhile, back to our regularly scheduled data work up.

Hi. peat isn't here right now. His two fish, acip and esox are writing this for him as he snores in the other room.

peat spent much of the day with one of the developers from the TINY (Tiny's Independence 'N Yet) Linux effort, working on a good distro for low-power hardware (386 and better). Have a look at TINY - it's a REALLY good idea!

We were both quite happy to see that peat was able to get most of his presentation for the upcoming Linux Expo Ameriques conference ready. (At this rate, he might actually be ready for the presentation before he gives it. Now that would be a swell change!) If you're in Montreal this coming Wednesday, you might even be able to catch him presenting in the Linux in Education track.

In the meanwhile, we'll finish this diary entry here and let him sleep. We'd write more, but the keys are getting rather slippery. What do you expect? We're fish, ya know...

It's been a neat day.

Some time ago, I wrote some fairly simple routines to front-end a spreadsheet to a database to simplify some fairly complex calculations. The goal here was to develop a way to rapidly get at info that would otherwise take far too long to calculate manually, or would require a phenomenal handle on SQL. I'm not yet such a person, but that may yet happen.

I found myself re-writing one of the main functions merging disparate data, and dammit, if the rewritten version wasn't a heckuva lot more logical than the first one. Trouble was, tho, that both versions were really, really expensive in disk calls. At two am this morning, I realized that I wouldn't get the numbers if I relied on my current machine only. By tonight, it will have crunched 18 k rows of the data set, and there are twice as many records in the set.

A couple of emails last night secured some time on a couple of kick-ass boxes locally, and I spent most of the day going from place to place to set up the db and program. End result: I have my 36 k rows crunched, and can finish things up tonight. At this rate, I might even sleep :P I'm going to have to, I can't really work as it stands.

This solution ("More power! Ar! Ar! Ar!") worked fine this time, but this is clearly not workable for the future. The problem with the spreadsheet approach is that it will make repeated calls for a similar - but not identical - subset of data, in no particular order - caching may not be effective in this case. I have to think about this more carefully. After sleep. :)

Another late night. Happy April Fools morning. :)

I finally have an update for the lake hacking thing, and had written two or three paragraphs to describe it. Or, at least I thought I had written two or three paragraphs, until I re-read them. That's usually the sign that I rely on to tell me that it's time for bed.

I'll post something at some point, I promise. In the meanwhile, I'll get some sleep.
