One option is to use a Bayesian classifier such as Reverend. The Reverend home page suggests this naive language detector:
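from reverend.thomas import Bayes

# Train one pool per language on a handful of very common words.
guesser = Bayes()
guesser.train('french', 'le la les du un une je il elle de en')
guesser.train('german', 'der die das ein eine')
guesser.train('spanish', 'el uno una las de la en')
guesser.train('english', 'the it she he they them are were to')

# Ask the classifier which language unseen text most resembles.
guesser.guess('they went to el cantina')
guesser.guess('they were flying planes')

# Training is incremental: more samples can be added at any time.
guesser.train('english', 'the rain in spain falls mainly on the plain')

# Persist the trained classifier to disk.
guesser.save('my_guesser.bay')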
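Training with a larger, more sophisticated token set would strengthen the results. For more information on Bayesian classification, see here and here.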
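To make the Bayesian idea concrete, here is a minimal sketch of a word-count naive Bayes guesser in plain Python. It is not Reverend's implementation: the class name `TinyBayes`, the whitespace tokenization, the uniform prior over languages, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Hypothetical stand-in for a Bayesian guesser (not Reverend's code)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token -> occurrences
        self.vocab = set()                  # every token seen in training

    def train(self, label, text):
        # Assumption: lowercase whitespace tokens are good enough here.
        tokens = text.lower().split()
        self.counts[label].update(tokens)
        self.vocab.update(tokens)

    def guess(self, text):
        """Return (label, probability) pairs, most likely label first."""
        if not self.counts:
            return []
        tokens = text.lower().split()
        vocab_size = len(self.vocab)
        log_scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Uniform prior; add-one smoothing keeps unseen tokens
            # from forcing a probability of zero.
            log_scores[label] = sum(
                math.log((counter[tok] + 1) / (total + vocab_size))
                for tok in tokens
            )
        # Convert log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        ranked = [(k, w / norm) for k, w in weights.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)


guesser = TinyBayes()
guesser.train('french', 'le la les du un une je il elle de en')
guesser.train('german', 'der die das ein eine')
guesser.train('spanish', 'el uno una las de la en')
guesser.train('english', 'the it she he they them are were to')

best_label, best_prob = guesser.guess('they were flying planes')[0]
print(best_label, round(best_prob, 2))
```

With so few training words the probabilities are close together, which is why a larger token set matters. Mixed input such as 'they went to el cantina' ranks both English and Spanish near the top in this sketch.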
Duplicate: http://stackoverflow.com/questions/257125/human-language-of-a-document – 2008-12-21 02:13:35