2015-02-12 81 views

Text mining with Scala

I have a .txt file with the following data:

L666371 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Lord Chelmsford seems to want me to stay back with my Basutos. 
L666370 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ I'm to take the Sikali with the main column to the river 
L666369 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Your orders, Mr Vereker? 
L666257 +++$+++ u9030 +++$+++ m616 +++$+++ DURNFORD +++$+++ Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot 
L666256 +++$+++ u9034 +++$+++ m616 +++$+++ VEREKER +++$+++ Colonel Durnford... William Vereker. I hear you 've been seeking Officers? 

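I want to import the text file into Scala (which I have done) and then extract all of the dialogue text from it to work on. After that I want to tokenize it, lowercase it, ignore word forms, and separate out punctuation. Finally I want to count the words as unigram, bigram, and trigram counts, with the results sorted by highest count.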

Can someone tell me how to implement this? I have the following attempt, but it does not seem to work:

import scala.io.Source

// Read the whole file as one string; the codec name goes in the second argument list
val s = Source.fromFile("movie_lines.txt")("ISO-8859-1")
val str = try s.mkString finally s.close()

// Matches a run of capital letters, then any one character, then "!"
val Pattern = "([A-Z]+.!)".r

Pattern.findAllIn(str).foreach { x => println(x) }

println("\n This is the result\n")

Can anyone answer this? – 2015-02-22 05:12:59

Answer


You can use the Epic library from the ScalaNLP suite to preprocess the text (tokenization), and then parse, tag, and extract entities.
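
If pulling in a library is more than you need, the counting part can also be sketched with the Scala standard library alone. The following is only a rough sketch and does not use Epic. It assumes every line has five fields separated by `+++$+++`, as in the sample above, and that the utterance is the fifth field. The names `NGramCounts`, `utterance`, `tokenize`, and `ngramCounts` are made up for this example, and the regex tokenizer is a simple stand-in for a real one: it keeps apostrophes inside words and does no stemming or lemmatization, so "ignore word forms" is not covered here.

```scala
object NGramCounts {
  // Field separator from the sample data, quoted so it is not read as a regex
  private val Separator = java.util.regex.Pattern.quote("+++$+++")

  /** Returns the utterance (assumed to be the fifth field) if the line has five fields. */
  def utterance(line: String): Option[String] = {
    val fields = line.split(Separator, -1).map(_.trim)
    if (fields.length == 5) Some(fields(4)) else None
  }

  /** Lowercases the text and splits it into word tokens and single punctuation tokens. */
  def tokenize(text: String): Vector[String] =
    """[\p{L}\p{N}']+|[^\s\p{L}\p{N}']""".r.findAllIn(text.toLowerCase).toVector

  /** Counts n-grams inside each token sequence, sorted by descending count, then by n-gram. */
  def ngramCounts(sentences: Seq[Seq[String]], n: Int): Seq[(String, Int)] =
    sentences
      .flatMap(_.sliding(n).filter(_.length == n).map(_.mkString(" ")))
      .groupBy(identity)
      .map { case (gram, occurrences) => (gram, occurrences.size) }
      .toSeq
      .sortBy { case (gram, count) => (-count, gram) }

  def main(args: Array[String]): Unit = {
    val source = scala.io.Source.fromFile("movie_lines.txt")("ISO-8859-1")
    // Force the lazy line iterator into a Vector before the file is closed
    val sentences =
      try source.getLines().flatMap(line => utterance(line).toList).map(tokenize).toVector
      finally source.close()
    for (n <- 1 to 3) {
      println(s"Top ${n}-grams:")
      ngramCounts(sentences, n).take(10).foreach { case (gram, count) => println(s"$count\t$gram") }
    }
  }
}
```

N-grams are counted per utterance here, so a bigram never spans two different lines of dialogue. If you would rather count across the whole file, pass a single concatenated token sequence to `ngramCounts` instead.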