After some misunderstanding, now have a program that does what is needed. It seems slow, and memory constraints make loading higher-level distributions (k-mer size > 9) difficult.
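Note on the memory issue: a dense table over the DNA alphabet has 4^k entries per genome, so the per-genome distribution grows fourfold with each increase in k. A minimal sketch of one way to keep this manageable, assuming the distributions are plain normalized k-mer frequencies, is to store only the k-mers actually observed. The function names `kmer_distribution` and `dense_table_entries` are hypothetical and are not taken from the actual program.

```python
from collections import Counter

DNA = set("ACGT")


def kmer_distribution(sequence, k):
    """Return observed k-mer frequencies as a sparse dict (assumed representation)."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= DNA:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def dense_table_entries(k):
    """Number of entries a dense table over {A, C, G, T} would need for a given k."""
    return 4 ** k


if __name__ == "__main__":
    for k in (9, 10, 12):
        print(k, dense_table_entries(k))
```

The sparse dict only helps while the number of distinct k-mers seen is well below 4^k. For long genomes and small k the table fills up anyway, so a fixed-size array may be the better choice there.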
Started a run last night (~18:00) on 625 genomes (50 Archaea, 525 Bacteria); it is still running. Got no significant results from 3- to 5-mers, now running on 6- to 9-mers.
Have a completed run on Archaea only. Results are not so great: around 1.1-1.6% success in identification with 10,000 samplings. See graph below: