Older blog entries for rufius (starting at number 43)

Things I Learned Today…

I’ve been writing a bioinformatics program to test naive Bayesian classification of k-mers/oligonucleotides. I started with some code I was given that was in Perl, wrote some in Python and then moved to Java. In that time I learned a lot about optimizing Python and Java with respect to string manipulations.

Today I was working with a program to build k-mer distributions in a format that an SVM (Support Vector Machine) can read and process. This requires building huge strings and putting them all in a file line by line. The files are usually over 50 MB so they’re fairly sizable.

Doing this process was fine as long as I was using k-mers of length 6 or less (4^6 = 4096), so lines were no longer than 4096 entries. I noticed a fair slowdown when I built a data set with 7-mers but didn’t think much of it. When I tried with 8-mers a little while ago, it was painfully slow. Turns out doing the following with really big strings is bad joojoo:

String line = "";
for (int i = 0; i < 20000000; i++)
{
    for (int k = 0; k < i; k++)
        line = line + i;
    line = line + " | ";
}

Obviously I’m not doing exactly that but you get the idea. Basically your string concatenation starts off really fast but as the string gets bigger and bigger, it gets slower and slower. Though I don’t claim to know the inner workings of the String class, my best guess is that every time you concat a string to the end of another string, the JVM realloc’s (as in the C version) the memory to make room for the added information. I may not be right, but from just thinking about it halfway, that’s the best I’ve got.

To alleviate these situations, this is my solution:
StringBuilder str_bldr = new StringBuilder();
for (int i = 0; i < 20000000; i++)
{
    for (int k = 0; k < i; k++)
        str_bldr.append(i);
    str_bldr.append(" | ");
}
String line = str_bldr.toString();

As you can see above, I’m using a class called StringBuilder. Again, no claim of knowledge, but it probably just acts as a Vector/ArrayList (not sure if it’s synchronized): you append items, and toString iterates the array and returns one big string.
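For anyone who wants to see the difference rather than take it on faith, here is a minimal sketch that builds the same line both ways and times each. Two facts about the standard library back up the guesses above: Java Strings are immutable, so every `line = line + x` allocates a new String and copies everything built so far, and StringBuilder appends into a growable internal buffer without synchronization (StringBuffer is the synchronized counterpart). The class name `ConcatDemo`, the method names and the entry count are made up for illustration, and the timings are a rough indication, not a proper benchmark.

```java
public class ConcatDemo {
    // Repeated concatenation: each + allocates a new String and copies
    // all of the characters built so far, so total work grows quadratically.
    static String buildWithConcat(int entries) {
        String line = "";
        for (int i = 0; i < entries; i++) {
            line = line + i;
            line = line + " | ";
        }
        return line;
    }

    // StringBuilder appends into one growable buffer and only produces
    // a String once, when toString() is called.
    static String buildWithBuilder(int entries) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < entries; i++) {
            sb.append(i).append(" | ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int entries = 20000; // hypothetical size, small enough to finish quickly

        long t0 = System.nanoTime();
        String a = buildWithConcat(entries);
        long t1 = System.nanoTime();
        String b = buildWithBuilder(entries);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("concat:  " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("builder: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Raising `entries` should make the gap between the two widen quickly, which matches the slowdown going from 6-mers to 8-mers.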

To most this is probably amateur business but I figure it’s useful for others to know in case they ever wondered. Even if I am a fairly seasoned programmer, I’ve got new things to learn and so does everyone else.

Syndicated 2008-11-11 21:39:22 from blog.zacbrown.org - just run away, now.

Live Mesh

I’ve been exploring Windows again, partly to refresh my memory of the operating system’s inner workings and partly to see how Vista has come along since it was first released. It’s certainly better than it was when it came out but definitely not the best I’ve ever seen.

I think the most positive I’ve ever been towards an operating system when it first came out was either Ubuntu 7.10 (Gutsy Gibbon) or Windows XP. Not sure which but both stick out in my mind as exceptionally good operating systems.

To get back to the topic, I have been working on a group project with a close friend (also a future full-time Microsoft employee) and we needed to share some files. He suggested Live Mesh, which at first glance looks a lot like Groove, except that it’s free and it’s got a bit of a different slant to it. It appears to be designed more for ad hoc sharing than the way Groove works.

However, besides being an easy way to share some files, it actually has a pretty impressive setup. In the future it’ll allow you to sync not just your Windows desktop/laptop but also Mac computers and mobile devices. This is all fine and good but the most impressive things I saw were the clean integration of Live Mesh with the Windows Explorer interface and the conflict resolution. Conflict resolution is something I tend to group with serious revision control software (e.g. bazaar, git, mercurial) but Live Mesh has a pretty decent system set up for these problems.

The web interface for Live Mesh itself is fairly decent as well. It’s got a Vista look’n’feel to it and is fairly snappy. I don’t envision myself using it much but it’s worth mentioning: if you ever need to get at files you shared, you can get them that way.

Syndicated 2008-11-10 22:38:01 from blog.zacbrown.org - just run away, now.

Windows Vista and more fun

So recently I’ve managed to pull off getting an internship (and hopefully a job afterwards) with Microsoft in the WEX group. For the uninitiated, WEX is short for Windows Experience and they’re primarily in charge of the “face” of Windows. I’ll be working as an SDET (Software Development Engineer in Test) on Windows 7.

In response to this news as well as a partition of Windows XP that died on me, I installed Windows Vista. I had tried Windows Vista once before during its beta as well as right after its release. I wasn’t impressed then; a lot of it annoyed me, so I stuck to Linux and Windows XP. Since Microsoft has already made the transition internally to Vista I figured it’d be a good time to get familiar with it before I get there next summer.

In the past I’ve had problems with not having access to a decent command line interface in Windows. In the past it also wasn’t an issue Microsoft considered very serious. When PowerShell came out I began to use that but still found deficiencies in it, since there was no support for tabs and it’s only an *ok* terminal compared to the options on Linux. So I decided this time around I would find a terminal emulator that would give me tabs or I’d write my own. Fortunately someone else wrote one for me so I’m off the hook.

A project called “Console” is on SourceForge by a guy named Marko Bozikovic. It provides a tabbed interface with a nice copy/paste setup that’s more in line with the rest of Windows than the goofy setup provided in “cmd” or PowerShell. It even allows you to specify the console you’d like to use, so in my case I use PowerShell. Now Marko doesn’t provide an actual installer so I took it upon myself to use a wizard with NSIS to create a very simple one. Hopefully someone will find it useful, as I prefer to have an actual installed program when I can on Windows.

There are two installers. The first one has the MSVCRT dll included. This is a runtime provided by Visual Studio, so if in doubt, download this one: Console-2.0-beta141-mvscrt-setup.exe. The other one does not package the MSVCRT runtime; if you do in fact have Visual Studio (Pro or Free) you should already have it, and here’s the link: Console-2.0-beta141-setup.exe.

Now I won’t be keeping this up to date in any sort of consistent capacity. It’ll basically be updated whenever I update it for my own computer. I’ll probably eventually put up an actual page and/or link on my main site.

Other than looking for a console emulator, Vista has been an ok experience. I’m not overjoyed with it but there are nice things about it. Stuff that comes to mind includes 1) quick association with access points rather than 30 seconds of associating in Linux, 2) better battery life (+1 hour or more), 3) cohesiveness in terms of interface and functionality and 4) a (finally) usable start menu that allows me to search.

I’ll probably intermittently add things in further posts as I start playing more with the operating system.

Syndicated 2008-11-09 19:51:22 from blog.zacbrown.org - just run away, now.

More results, new and improved software

Switched from Python to Java (sigh) to improve speed. Python (besides being dynamic) has worthless threading in comparison to Java. The Java version is faster by a lot: runs went from 4 days on just Archaea to something in the range of 12 hours with Archaea + Bacteria (625 genomes).

Ran into problems with threading but learned a lot while doing it. Will definitely make this process faster. Runs are still slow, but there is not much room for improvement without moving to a cluster.

Loading full relative distributions for 9-mers is currently not possible without more consideration of the program. Maybe switching from HashMap<String, Double> to an array of doubles (double[]) will save some space. Need to investigate that further, we’ll see.
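To make the double[] idea concrete: with a four-letter alphabet, a k-mer can be read as a base-4 number, so it can index directly into an array of 4^k doubles. For 9-mers that is 4^9 = 262,144 entries, about 2 MB at 8 bytes per double, with no String key, boxed Double or map entry object per k-mer. The sketch below is only an illustration under assumptions: the class and method names are hypothetical, the A/C/G/T ordering is arbitrary, and windows containing any other character are simply skipped, which may not be how the real program treats them.

```java
public class KmerIndex {
    // Map a base to a 2-bit code; returns -1 for anything outside A/C/G/T.
    static int baseCode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // Read a k-mer as a base-4 number in [0, 4^k); -1 if it has an unknown base.
    // An int index is enough for k up to 15.
    static int indexOf(String kmer) {
        int index = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int code = baseCode(kmer.charAt(i));
            if (code < 0) return -1;
            index = index * 4 + code;
        }
        return index;
    }

    // Relative frequency of every k-mer in the sequence, stored in a flat
    // array of 4^k doubles instead of a HashMap<String, Double>.
    static double[] relativeDistribution(String sequence, int k) {
        double[] freq = new double[1 << (2 * k)]; // 4^k slots
        int total = 0;
        for (int start = 0; start + k <= sequence.length(); start++) {
            int idx = indexOf(sequence.substring(start, start + k));
            if (idx >= 0) {
                freq[idx]++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] dist = relativeDistribution("ACGTACGT", 2);
        System.out.println("slots: " + dist.length);
        System.out.println("P(AC) = " + dist[indexOf("AC")]);
    }
}
```

The array position doubles as the k-mer identity, so writing a distribution out in a fixed column order needs no sorting or key lookups.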

Results for piece sizes 36, 100, and 200 (excluding 8-mers) for 3 through 8-mers as follows:

Still waiting on a run to finish for 200 piece size 8-mers, then will run taxonomic classifier. Unsure of how well that will go in efforts to match data in db to data from files. Not sure the “species” match up between the two in the right way.

Syndicated 2008-11-06 17:24:29 from blog.zacbrown.org - just run away, now.

Old data bad, New data good, Program too slow

So the last set of data posted is definitely incorrect. Found flaws in the scripts’ function to generate relative distributions. Also modified the original identification script to work with classifying organisms.

The data for correct identification below…

The data for phylogenetic classification below…

Full bacterial and bacterial+archaeal analysis will be harder as the current program is too slow. Rewriting parts to make the process faster. Possibly working in OCaml to do this.

Syndicated 2008-10-16 17:43:11 from blog.zacbrown.org - just run away, now.

More genomics…

After some misunderstanding, now have a program that does what is needed. Seems slow, and memory constraints make loading higher-level distributions (k-mer size > 9) difficult.

Started a run last night (~18:00) on 625 genomes (50 Archaea, 525 Bacteria), still running. Got no significant results from 3-5-mers, now running on 6-9-mers.

Have a completed run from just doing Archaea, results not so great, around 1.1-1.6% success in identification with 10000 samplings. See graph below:

Syndicated 2008-10-02 18:00:59 from blog.zacbrown.org - just run away, now.

Genend - Update 1

Moved from Perl to Python. Extensive use of Perl in larger files proved to be hard to organize for myself, was having trouble keeping straight what I was doing. Also don’t like the Perl object/class system, more at home with Python’s.

Current progress includes a custom database object for interfacing with a SQLite database (and possibly PostgreSQL/MySQL/Firebird if it gets too slow). Everything except ‘updates’ to an entry is done. Database object is about as simple as it gets, using a list of tuples for adding k-mers and a large tuple for taxonomy.

Started working on an object that will take in a directory full of genomes, the output directory and a number for the number of threads to run and it will pool objects to process files. Will have a threadable object that accesses BioPython libraries to parse the genome files. Important question for queueing threads is whether SQLite will like concurrent access to the same database. Need to figure out how to handle inserts so that there isn’t fragmentation. There should be little fragmentation as each file and species will be unique.

For next week:

Finish up database object and threading objects. Do preliminary run to start building genomes. Determine largest feasible genome before laptop machine (2×2.4GHz w/ 4 GB RAM) will puke. If it proves to do so before getting too high in the phyla, will need to start writing some string operation libraries in C to deal with static length strings.

Syndicated 2008-09-18 18:19:16 from blog.zacbrown.org - just run away, now.

Updates

Just got back from LA. My stint with Google is over :(. It was a great experience despite my awful apartment.

Managed to get myself some confidence in my abilities and am now working on two OSS projects.

Back to the grind of school now. At least I’ll be starting a new research project.

Syndicated 2008-08-19 14:11:54 from blog.zacbrown.org - just run away, now.

Oppose the Orphaned Works Act of 2008!

I won’t go into full detail of its evil here, as you can read and get the gist of it here: http://blamcast.net/articles/orphaned-works-open-source-copyright.

Essentially, it means any company can take a piece of software (among other things) and “claim” they looked for the author and then use it without obeying the GPL license on it. When the copyright holder sues on grounds of infringement, the people that violated the copyright merely have to provide “proof” that they looked for an author and could not locate one.

The only way to really prevent the infringement if the bill passes is to register with some sort of copyright registry, which no doubt costs money.

Oppose it! You can go here: http://capwiz.com/illustratorspartnership/home/ to find out how to easily write your congressman/woman.

Syndicated 2008-07-05 17:05:05 from blog.zacbrown.org - just run away, now.

Do something nice today.

To be honest, I’m long overdue on saying something. I am alive, just barely, but I am.

I came across this article today. It made me realize that society has become disenchanted with… well society itself. Do something nice today for someone you don’t even know. Seriously.

http://www.npr.org/templates/story/story.php?storyId=89164759

Syndicated 2008-03-28 22:14:09 from blog.zacbrown.org - just run away, now.
