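My first thought: for a 50-page job, you will probably save more time by just doing it manually. But if you have a good data scientist on your team, you could try gensim. The state of the art for comparing two different phrases is word embeddings. You can think of an embedding as a conversion of each word into a high-dimensional vector (typically 200 to 1000 dimensions), learned by training on millions of documents. Phrases whose vectors point in similar directions are treated as similar in meaning.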
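To make the "words as vectors" idea concrete, here is a minimal sketch of how two vectors are usually compared, using cosine similarity. The 4-dimensional vectors in `toy_vectors` are invented purely for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" (made-up numbers, not from any trained model).
toy_vectors = {
    "human": [0.9, 0.1, 0.0, 0.2],
    "user": [0.8, 0.2, 0.1, 0.3],
    "tree": [0.0, 0.9, 0.8, 0.1],
}

sim_human_user = cosine_similarity(toy_vectors["human"], toy_vectors["user"])
sim_human_tree = cosine_similarity(toy_vectors["human"], toy_vectors["tree"])

print("human vs user:", round(sim_human_user, 3))
print("human vs tree:", round(sim_human_tree, 3))
```

With these made-up numbers, "human" scores much closer to "user" than to "tree", which is the kind of ordering you would hope a trained model produces.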
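For example, if your string is "Human computer interaction", you would be looking for something like this, where each pair is a document number and its similarity score, sorted from most to least similar: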
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
From: https://radimrehurek.com/gensim/tut3.html