Older blog entries for rufius (starting at number 51)

Genend Update

The server I was running the computations on hard-locked sometime during the winter break. Apparently it ran out of disk space while another user was running simulations on it. I wasn’t able to access the machine until I returned to Miami.

Since I had no access to a machine with large amounts of memory, I spent some time trying to figure out what was wrong with the training software. I still wasn’t able to find the problem; I must be missing something simple.

Upon return to Miami, did the following:

  • Fixed the server; apparently it ran out of disk space because of log files created by another user’s run.
  • Researched building a database for taxonomies.
  • Built a database using the BioSQL schema after discovering that Genbank files track phylogeny through recursive ranks.
  • Wrote a Python script to fetch the Genbank file for each of the 625 fasta-format genomes and load it into the BioSQL database.
  • Began revising taxonomic classifier, ~80% done.

Next things to do:

  • Run the taxonomic classifier.
  • While waiting for the taxonomic classifier results, tear apart the training classifier and figure out what’s wrong.

Syndicated 2009-02-02 17:02:11 from Zac Brown

Ruby…

In the past I’ve vehemently argued against using Ruby. My encounters with it had shown a shoddy VM and decent libraries, but my greatest grievance was the lack of any clear standard. It was really just whatever Matz felt like, or at least that was my understanding. Others may correct me if I’m wrong.

Despite my distaste for Ruby I always sensed that at some point I would pick it up when it became clear that the language was stabilizing, the VM was ready for the big leagues and when I actually had some time to make sure my past encounters weren’t just issues of my own ignorance (which I’m prone to as much as anyone else).

That said, I am picking up some Ruby now and am enjoying it. Unlike most these days, my primary interest wasn’t in Rails/Merb or I guess what is now “Rails 3” but rather _why’s Shoes framework. I have long been irritated by GUI programming. In almost every language it feels incredibly… unwieldy (I think that’s the word I want to use). It seemed as though no matter the language and how graceful it generally is to work with, there were always things that made its GUI toolkit interfaces ugly to work with.

Shoes on the other hand has been quite enjoyable. It’s minimal and still has a ways to go, but in a couple of days I was able to learn enough Ruby and enough of how the Shoes framework works to write a pretty simple application to solve a long-standing issue that my mother has had with trying to copy her music to her MP3 player. She’s not well suited to navigating multiple Windows Explorer windows and copy/pasting her way to victory. With that, I wrote a really simple Shoes app that basically shows two panes, the “all the music you have” pane and the “music that’s on your MP3 player” pane. The first pane only shows files not already on her MP3 player, so you just click the file you want and it will “move” the file in the window to the other pane as well as actually copying the file.

If anyone is interested in this brain dead app, I’ll post it later once I’ve attached all the license information for it, etc.

Syndicated 2009-01-14 02:18:37 from Zac Brown

Vista Media Streaming

I recently built a new media center PC. I’ve wanted one for a while, along with a computer capable of playing the “Latest and Greatest” computer games. The specs look something like this:

  • Processor: AMD Athlon X2 6000+ (2 x 3.10 GHz)
  • Motherboard: Foxconn A7GM-S - a good motherboard, has integrated HDMI though I have a vid card with that too
  • Video Card: Power Color AX4830 (ATI Radeon 4830 with 512 MB DDR3 + HDMI audio/video)
  • Memory: 4 GB, soon to be 8 GB whenever I get my hands on x64 Vista or Windows 7
  • TV Card: Hauppauge HVR-2250 - dual digital/analog tuner with an MCE remote
  • HDD: Western Digital Caviar 640GB w/ 16MB cache
  • Operating System: Windows Vista Ultimate (32-bit) using Vista Media Center to record the TV stuff. Gotta say, that’s some brain dead simple software to set up with my TV card.

Obviously not the best computer on the market but it can play Fallout 3 on Ultra High settings and I can’t seem to really slow it down. I anticipate an upgrade this summer whenever the novelty of the Phenom II chip comes down a bit. This new computer fits both niches nicely though I did find an interesting caveat with streaming my recorded TV to other computers in the house.

I don’t always like to be in the study watching TV, partly because my Dad works in there as well, so unless it’s fairly late at night I can’t watch things I’ve recorded. In these cases I head to my room with my laptop. My laptop has Vista Enterprise on it and it immediately picked up the media sharing of my desktop/htpc.

However, whenever I tried to play an episode of South Park I found I was only getting audio, no video output whatsoever. My first suspicion was that the files were too big to be transferred over wireless at a rate that could sustain video, but after thinking about that a moment I realized that was idiotic, as it would have shown the video in some choppy fashion.

After some cryptic searching on Google I found that I was missing the DScaler MPEG Filter for Windows Media Player. After installing this codec, I was getting video playback fine. It seems odd to me that, with all this media sharing between different computers that Vista does so well, it doesn’t include the codecs to play back the PVR files recorded by Vista Ultimate or Home Premium.

So with that, I make a humble request to the group in charge of Media Center or the Media Codecs (in fact the team I’ll be on next summer as an Intern):

Dear Future Employer:

Will you please include this codec on all versions of the operating system in the future. It would make life a lot nicer for those of us not using the same version of Windows on all our computers. Kthxbai.

With love,

Your humble (future) minion

Syndicated 2009-01-06 16:57:36 from Zac Brown

To Do for December 2008 - Revisited

This is my brief todo list for December 2008 revisited, also known as the first vacation I’ve had since I started college.

  1. Build the new PC I bought and get Windows Vista Ultimate (Super Fantastic) Media Center running.
  2. Turn 21 and become a drunkard overnight (hah). (T-minus 2 hours)
  3. Reinstall the laptop to repartition the operating systems. In the process of this, also install Ubuntu 8.10. (Still haven’t done this… will wait till I’m done with a few things.)
  4. Play a lot of video games. (Bought the two Guild Wars expansions, Factions and Nightfall)
  5. Sleep… this hasn’t been consistently done in a long time. (Still need more of this)
  6. Learn C++ better, especially proper template design. (Heh, this is work, haven’t started on that yet.)
  7. Finish leftover things for my bioinformatics research work. That is, build a database for the organisms for doing phylogenetic classification. Maybe play more with SVM’s…  (Same as 6)
  8. Sleep more. (Working on it)
  9. Play more games. (Working on this too)
  10. Setup the new Roku Soundbridge M1001 I bought for my parents for Christmas. (Have to return the one I ordered and wait till they have more of these available….)

There are probably other things that should be on here… like my senior project work. I suspect I’ll start that Tuesday as tomorrow is my birthday and I will be eating and drinking merrily.

Syndicated 2008-12-22 04:00:16 from Zac Brown

To Do for December 2008

This is my brief todo list for December 2008, also known as the first vacation I’ve had since I started college.

  1. Build the new PC I bought and get Windows Vista Ultimate (Super Fantastic) Media Center running.
  2. Turn 21 and become a drunkard overnight (hah).
  3. Reinstall the laptop to repartition the operating systems. In the process of this, also install Ubuntu 8.10.
  4. Play a lot of video games.
  5. Sleep… this hasn’t been consistently done in a long time.
  6. Learn C++ better, especially proper template design.
  7. Finish leftover things for my bioinformatics research work. That is, build a database for the organisms for doing phylogenetic classification. Maybe play more with SVM’s… 
  8. Sleep more.
  9. Play more games.
  10. Setup the new Roku Soundbridge M1001 I bought for my parents for Christmas.

Syndicated 2008-12-05 16:36:48 from Zac Brown

Archaea Classification Continued

After having thoroughly examined the code for a couple of days and tried the code with fragments sampled with replacement, I’ve convinced myself that the code is correct. After thinking about it, it occurred to me that the relative k-mer distribution profiles for larger k-mers (7,8,9) might be skewed by even very small sampling without replacement.

I went ahead and took the difference between the relative distributions for Pyrobaculum calidifontis for 4 different cases:

  • 8-mers - 100% genome vs 99.5% genome
  • 8-mers - 100% genome vs 67% genome
  • 4-mers - 100% genome vs 99.5% genome
  • 4-mers - 100% genome vs 67% genome

Since 4-mers showed little variation between training and full genomes, I felt that was a good base for “lack of difference” in the distributions. Here’s the data:

As can be seen, the variation in relative distributions for the 4-mers is very small, generally no larger than +/- 0.002, and that’s with training on 67% of the genome. Meanwhile, the 8-mers show significant variation: with training on 67% of the genome there is a variation of up to nearly +/- 0.2, which entirely changes a profile. Even with 99.5% training, it shows variation in the hundredths place, which is enough to skew the profile. This was tested on several organisms; Pyrobaculum calidifontis just happens to be my pick.

That, to me, explains why this technique might not be applicable the way it’s currently designed, as the profiles for the organisms don’t match as well. Of course the other side of this is that since every one of the genomes’ profiles would be skewed, wouldn’t that even it out? Without some serious statistical analysis (and time), I can’t say for sure.

Here also is a comparison of distributions:

From this, it can be seen that sampling with replacement (100 pieces) is pretty close to sampling 95% of the genome with replacement. Those are two separate pieces of software, which is what leads me to believe the software is written correctly.
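
For anyone who wants to reproduce the kind of comparison described above, here is a minimal sketch of the idea in Java. It is not the actual research code: the class and method names (KmerProfileSketch, relativeProfile, maxAbsDifference) are made up for illustration, the sequences are toy strings, and the "training" sample is simply a prefix of the sequence rather than randomly drawn fragments. It builds a relative k-mer distribution for two sequences and reports the largest absolute difference between them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of the original software.
class KmerProfileSketch {

    // Relative frequency of every overlapping k-mer in seq:
    // raw count divided by the number of k-sized windows.
    static Map<String, Double> relativeProfile(String seq, int k) {
        Map<String, Double> profile = new HashMap<>();
        int windows = seq.length() - k + 1;
        if (windows <= 0) {
            return profile; // sequence shorter than k: empty profile
        }
        for (int i = 0; i < windows; i++) {
            profile.merge(seq.substring(i, i + k), 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : profile.entrySet()) {
            e.setValue(e.getValue() / windows);
        }
        return profile;
    }

    // Largest absolute difference between two profiles; a k-mer missing
    // from one profile counts as frequency 0 there.
    static double maxAbsDifference(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double max = 0.0;
        for (String key : keys) {
            double diff = Math.abs(a.getOrDefault(key, 0.0) - b.getOrDefault(key, 0.0));
            max = Math.max(max, diff);
        }
        return max;
    }

    public static void main(String[] args) {
        String full = "ACGTACGTACGGTTACACGT";            // toy stand-in for a whole genome
        String sample = full.substring(0, 13);           // toy stand-in for a training subset
        for (int k : new int[] {2, 4}) {
            double d = maxAbsDifference(relativeProfile(full, k), relativeProfile(sample, k));
            System.out.println(k + "-mers: max |difference| = " + d);
        }
    }
}
```

With real genomes the number of possible k-mers is 4^k, so an 8-mer profile spreads the same number of windows over 65536 bins instead of 256, which is one plausible reason the larger k-mers are more sensitive to what gets left out.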

Syndicated 2008-12-05 15:21:55 from Zac Brown

Genend Update 2.33421

Still having problems loading full data sets into memory for Bacteria + Archaea genomes. Need to come up with a good way to do this with the 67/80/90% runs. Right now, I can only do it with Archaea.

The results for the run strike me as being somewhat odd. You’ll see below…

Despite having gone over the algorithm repeatedly I’ve been unable to find a fault in it. As near as I can tell it’s doing exactly what I thought it should be. I thought it was odd that the results for 3-mers through 6-mers are about the same regardless of whether I train on more or less of the genome (training on 50% showed almost identical results as well). The oddest thing is that the results drop off after peaking at either 6-mers or 7-mers. That’s the part that makes no sense to me. I’m not sure what to make of it.

Maybe I’m missing something obvious. I’ll switch to something else for a bit and come back to it.

Syndicated 2008-11-20 18:06:10 from Zac Brown

WEX: Devices and Media - SDET

The title of this blog post is the official team I’ll be joining next May at Microsoft as an Intern (and hopefully full-time after that). It turned out that after my interviews at the beginning of November, every team expressed interest in having me join (I didn’t really think they all would).

The teams I’d chosen to interview with before I flew up were FNO (Find and Organize), CoreUX, and DNM (Devices & Media). Originally my interests in each group were roughly in the aforementioned order. That is I was most interested in working for FNO and least interested in DNM. As I spent more time learning about DNM and what they do, it became apparent that I’d learn more there than I would in any other group.

Each group was interesting in its own ways. FNO has a very young group of developers and is a very high energy group. They own the Explorer and Desktop interfaces with anything that has to do with file manipulation included in that. They also own the indexing service used for desktop search. Had I chosen to work with that group, I probably would have tried to get in on the indexing side of things. It’s a lot of coding (what I like) and it’s at the core of my interest in that group.

CoreUX on the other hand owns the start/taskbar, the window framing, sidebar, and so on. Things that make Windows look like… well Windows. The team members I met with were all very encouraging and a group of really interesting individuals. Their manager, John Cable, was the guy I interviewed with during my first round interviews and was indispensable to me through the whole process in helping me make decisions about my time with Microsoft.

Finally, DNM manages the pipelines that serve up audio/video to the screen and speakers, interfacing with devices like the Zune, cellphones, bluetooth devices and things like the Roku (look it up, it’s sweet). They are a “foundation” team, meaning that a lot of other groups in WEX build on top of what they provide. For example, CoreUX is in control of Windows Media Player which has to use the media technologies supported/owned by DNM. This type of exposure to different technologies inside Microsoft as well as outside (like Roku) is what attracts me to the team. They get a lot of face time with a lot of products which means there will never be too little for me to learn.

Since I will be at Microsoft to learn, I figure picking a group like DNM is a good way to learn a lot. That’s not to say I wouldn’t learn anything in the other groups. I just feel that at the point that I am now with my education, my weakest points are in the areas that DNM is focused on, so in the end it would provide me with the most “bang for my/their buck” in my time at Microsoft. Hopefully that time will be a long time as the culture is very attractive.

Syndicated 2008-11-19 19:20:15 from Zac Brown

Things I Learned Today…

I’ve been writing a bioinformatics program to test some naive Bayes classification of k-mers/oligonucleotides. I started with some code I was given that was in Perl, wrote some in Python and then moved to Java. In that time I learned a lot about optimizing Python and Java with respect to string manipulations.

Today I was working with a program to build k-mer distributions in a format that an SVM (Support Vector Machine) can read and process. This requires building huge strings and putting them all in a file line by line. The files are usually larger than 50 MB, so they’re fairly sizable.

Doing this process was fine as long as I was using k-mers of length 6 or less (4^6 = 4096 possible k-mers), so lines that are no longer than 4096 entries. I noticed a fair slowdown when I built a data set with 7-mers (4^7 = 16384 entries) but didn’t think much of it. When I tried with 8-mers (4^8 = 65536 entries) a little while ago, it was painfully slow. Turns out doing the following with really big strings is bad joojoo:

String line = "";
for (int i = 0; i < 20000000; i++)
{
    for (int k = 0; k < i; k++)
        line = line + i;
    line = line + " | ";
}

Obviously I’m not doing exactly that but you get the idea. Basically your string concatenation starts off really fast but as the string gets bigger and bigger, it gets slower and slower. Java Strings are immutable, so every time you concatenate onto the end of a string, the JVM allocates a brand new String and copies the entire existing contents into it (unlike C’s realloc, it can never grow the old one in place). The total amount of copying therefore grows roughly with the square of the final length.

To alleviate these situations, this is my solution:
StringBuilder str_bldr = new StringBuilder();
for (int i = 0; i < 20000000; i++)
{
    for (int k = 0; k < i; k++)
        str_bldr.append(i);
    str_bldr.append(" | ");
}
String line = str_bldr.toString();

As you can see above, I’m using a class called StringBuilder. It keeps a single resizable character buffer: each append writes into spare capacity at the end (growing the buffer only when it fills up), and toString copies the buffer once into the final String. It is not synchronized; StringBuffer is the synchronized equivalent.

To most this is probably amateur business but I figure it’s useful for others to know in case they ever wondered. Even if I am a fairly seasoned programmer, I’ve got new things to learn and so does everyone else.
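
To make the difference concrete, here is a small sketch that builds the same line both ways and times them. It is not my actual program: the class and method names (ConcatSketch, joinWithConcat, joinWithBuilder) and the made-up counts are just for illustration, and the size is kept at 4^7 entries so the slow version still finishes quickly. Timings will vary by machine and JVM, so none are quoted here.

```java
// Hypothetical demo class, not part of the original program.
class ConcatSketch {

    // Naive version: every pass through the loop allocates a new String
    // and copies everything built so far.
    static String joinWithConcat(int[] counts) {
        String line = "";
        for (int c : counts) {
            line = line + c + " | ";
        }
        return line;
    }

    // StringBuilder version: appends go into one growable buffer and the
    // contents are copied a single time in toString().
    static String joinWithBuilder(int[] counts) {
        StringBuilder sb = new StringBuilder(counts.length * 4); // rough capacity hint
        for (int c : counts) {
            sb.append(c).append(" | ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] counts = new int[16384]; // 4^7 entries, one per possible 7-mer
        for (int i = 0; i < counts.length; i++) {
            counts[i] = i % 10; // made-up counts, just to have something to print
        }

        long t0 = System.nanoTime();
        String a = joinWithConcat(counts);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(counts);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("concat ms:   " + (t1 - t0) / 1_000_000);
        System.out.println("builder ms:  " + (t2 - t1) / 1_000_000);
    }
}
```

The capacity hint passed to the StringBuilder constructor is optional; it only saves a few buffer resizes along the way.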

Syndicated 2008-11-11 21:39:22 from blog.zacbrown.org - just run away, now.

Live Mesh

I’ve been exploring Windows again, partly from a need to refresh my brain on the inner workings of the operating system and partly to take a little time to test out Vista again, now that some time has passed since it was first released. It’s certainly better than it was when it came out but definitely not the best I’ve ever seen.

I think the most positive I’ve ever been towards an operating system when it first came out was either Ubuntu 7.10 (Gutsy Gibbon) or Windows XP. Not sure which but both stick out in my mind as exceptionally good operating systems.

To get back to the topic, I have been working on a group project with a close friend (also a future full-time Microsoft employee) and we needed to share some files. He suggested Live Mesh, which at first glance looks a lot like Groove, except that it’s free and it’s got a bit of a different slant to it. It appears to be designed more for ad hoc sharing rather than the way Groove works.

However, besides being an easy way to share some files, it actually has a pretty impressive setup. In the future it’ll allow you to sync not just your Windows desktop/laptop but also sync to Mac computers as well as mobile devices. This is all fine and good but the most impressive thing I saw was clean integration of Live Mesh with the Windows Explorer interface and conflict resolution. Conflict resolution is something I tend to group with serious revision control software (e.g. bazaar, git, mercurial) but Live Mesh has a pretty decent system set up for these problems.

The web interface for Live Mesh itself is fairly decent as well. It’s got a Vista look’n’feel to it and is fairly snappy. I don’t envision myself using it much, but it’s worth mentioning that if you ever need to get at files you shared, you can reach them that way.

Syndicated 2008-11-10 22:38:01 from blog.zacbrown.org - just run away, now.
