More results, new and improved software
Switched from Python to Java (sigh) to improve speed. Besides being dynamically typed, Python has worthless threading compared to Java. The Java version is a lot faster: runs that took 4 days on just Archaea now take something in the range of 12 hours on Archaea + Bacteria (625 genomes).
Ran into problems with threading but learned a lot while sorting them out. Threading will definitely make this process faster. Runs are still slow, though, and there is not much room for improvement without moving to a cluster.
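For reference, a minimal sketch of the kind of threading Java makes easy: a fixed-size thread pool that processes one genome per task. This is not the actual program. The class name ParallelGenomes, the method names, and the per-genome work (a simple GC fraction) are placeholders made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGenomes {
    // Placeholder per-genome work (assumption): fraction of G/C bases.
    public static double gcFraction(String seq) {
        if (seq.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / seq.length();
    }

    // Submit one task per genome and collect results in input order.
    public static List<Double> processAll(List<String> genomes, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String genome : genomes) {
                Callable<Double> task = () -> gcFraction(genome);
                futures.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("genome task failed", e);
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        List<String> genomes = new ArrayList<>();
        genomes.add("GGCC");
        genomes.add("ATAT");
        genomes.add("ACGT");
        System.out.println(processAll(genomes, 2));
    }
}
```

Each task only reads its own genome and returns its own result, so nothing is shared between threads and no locking is needed.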
Loading the full relative distributions for 9-mers is not possible right now without rethinking parts of the program. Switching from HashMap<String, Double> to an array of doubles (double[]) might save some space. Need to investigate that further, we’ll see.
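A rough sketch of what the array version could look like, assuming a plain A/C/G/T alphabet: each base gets a 2-bit code, so a k-mer becomes an integer in [0, 4^k) that indexes straight into a double[]. For 9-mers that is 4^9 = 262,144 entries, or about 2 MiB of doubles per distribution, with no String keys or boxed Double objects. The class name KmerIndex and its methods are hypothetical, and skipping windows that contain other characters (such as N) is an assumption, not necessarily what the real program does.

```java
public class KmerIndex {
    // Map a nucleotide to a 2-bit code; -1 for anything else (e.g. N).
    public static int baseCode(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // Encode a k-mer as an index in [0, 4^k); -1 if it has a non-ACGT base.
    public static int encode(String kmer) {
        int index = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int code = baseCode(kmer.charAt(i));
            if (code < 0) {
                return -1;
            }
            index = (index << 2) | code;
        }
        return index;
    }

    // Relative frequency of every k-mer in seq, stored in a flat array.
    // Windows containing non-ACGT characters are skipped (assumption).
    public static double[] relativeDistribution(String seq, int k) {
        double[] freq = new double[1 << (2 * k)]; // 4^k slots
        int total = 0;
        for (int i = 0; i + k <= seq.length(); i++) {
            int idx = encode(seq.substring(i, i + k));
            if (idx >= 0) {
                freq[idx]++;
                total++;
            }
        }
        if (total > 0) {
            for (int j = 0; j < freq.length; j++) {
                freq[j] /= total;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] dist = relativeDistribution("ACGTACGT", 3);
        System.out.println(dist.length);         // number of possible 3-mers
        System.out.println(dist[encode("ACG")]); // relative frequency of ACG
    }
}
```

The trade-off is that the array always holds all 4^k slots, even for k-mers that never occur, whereas the map only stores the ones it has seen.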
Results for piece sizes 36, 100, and 200 for 3- through 8-mers (8-mers not yet included for piece size 200) are as follows:
Still waiting on the run for 8-mers at piece size 200 to finish, then will run the taxonomic classifier. Unsure how well that will go when it comes to matching data in the db to data from the files. Not sure the “species” match up between the two in the right way.