Archaea Classification Continued
After examining the code thoroughly for a couple of days and rerunning it with fragments sampled with replacement, I've convinced myself that the code is correct. It then occurred to me that the relative k-mer distribution profiles for larger k (7, 8, 9) might be skewed when fragments are sampled without replacement, even if only a very small part of the genome is left out.
I went ahead and took the difference between the relative distributions computed from the whole genome and from a subsample of it, for Pyrobaculum calidifontis, in four cases:
- 8-mers - 100% genome vs 99.5% genome
- 8-mers - 100% genome vs 67% genome
- 4-mers - 100% genome vs 99.5% genome
- 4-mers - 100% genome vs 67% genome
As can be seen, the variation in the relative distributions for the 4-mers is very small, generally no larger than +/- 0.002, and that's with training on only 67% of the genome. The 8-mers, meanwhile, show significant variation: with training on 67% of the genome the variation reaches nearly +/- 0.2, which entirely changes a profile. Even with 99.5% training, there is variation in the hundredths place, which is enough to skew the profile. This was tested on several organisms, but Pyrobaculum calidifontis just happens to be my pick.
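A comparison of this kind can be sketched roughly as follows. This is a minimal illustration, not the code used for the results above: the function names (`kmer_profile`, `sample_fragments`, `max_abs_difference`), the fragment length, and the plain count/total normalization are all assumptions, and the random sequence is only a stand-in for a real genome. The magnitudes it prints should therefore not be expected to match the numbers quoted above.

```python
import random
from collections import Counter
from itertools import product


def kmer_profile(seqs, k):
    """Relative frequency of every possible ACGT k-mer across a list of sequences.

    Assumption: 'relative distribution' means count / total count.
    """
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # ignores k-mers with other symbols
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def sample_fragments(genome, fraction, frag_len, rng):
    """Cut the genome into non-overlapping fragments and keep a random
    subset of them, sampled without replacement."""
    frags = [genome[i:i + frag_len] for i in range(0, len(genome), frag_len)]
    n_keep = max(1, round(fraction * len(frags)))
    return rng.sample(frags, n_keep)


def max_abs_difference(p, q):
    """Largest absolute per-k-mer difference between two profiles."""
    return max(abs(p[m] - q[m]) for m in p)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for a genome; replace with a real sequence.
    genome = "".join(rng.choice("ACGT") for _ in range(20000))
    for k in (4, 8):
        full = kmer_profile([genome], k)
        for fraction in (0.995, 0.67):
            frags = sample_fragments(genome, fraction, frag_len=100, rng=rng)
            diff = max_abs_difference(full, kmer_profile(frags, k))
            print(f"{k}-mers, 100% vs {fraction:.1%}: max |diff| = {diff:.6g}")
```

The intuition this is meant to expose is that there are only 4^4 = 256 possible 4-mers but 4^8 = 65,536 possible 8-mers, so each 8-mer is backed by far fewer observations and its estimated frequency moves much more when part of the genome is left out.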