Scala - Spark word count: why doesn't sliding work?
I want to count the frequency of each bigram.
So I wrote:
val intputFile = "bible+shakes.nopunc"
val sentences = sc.textFile(intputFile)
val bigrams = sentences.map(sentence => sentence.trim.split(' ')).flatMap(wordList =>
for (i <- List.range(0, (wordList.length - 2))) yield ((wordList(i), wordList(i + 1)), 1)
)
val bigrams2 = sentences.map(sentence => sentence.trim.split(' ')).flatMap(wordList =>
wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)}
)
They seem to have the same type.
scala> bigrams
res11: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[7] at flatMap at <console>:28
scala> bigrams2
res12: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[11] at flatMap at <console>:28
But when I do this:
scala> bigrams.collect
res13: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1), ((version,textfile),1), ((in,the),1), ((the,beginning),1), ((beginning,god),1), ((god,created),1), ((created,the),1), ((the,heaven),1), ((heaven,and),1), ((and,the),1), ((and,the),1), ((the,earth),1), ((earth,was),1), ((was,without),1), ((without,form),1), ((form,and),1), ((and,void),1), ((void,and),1), ((and,darkness),1), ((darkness,was),1), ((was,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((the,deep),1), ((deep,and),1), ((and,the),1), ((the,spirit),1), ((spirit,of),1), ((of,god),1), ((god,moved),1), ((moved,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((and,god),1), ((god,said),1), ((...
scala> bigrams2.collect
16/10/05 10:17:52 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 20)
scala.MatchError: [Ljava.lang.String;@3224ea91 (of class [Ljava.lang.String;)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$apply$1.apply(<console>:29)
scala> bigrams2.take(5)
res25: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1))
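Evaluating the second version with collect results in an error, even though take(5) works.
Why? How can I fix it? I would much prefer the second approach.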
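The MatchError above can be reproduced without Spark. The following is a minimal sketch, assuming the input file contains at least one line with fewer than two words (a blank line is enough); the sample lines and the helper names `tokens` and `bigramsOf` are made up for illustration. On a one-element array, `sliding(2)` yields a single window of length 1, which `case Array(x, y)` cannot match. Using `collect` instead of `map` skips such windows.

```scala
// Plain Scala collections; no Spark is needed to see the behaviour.
// The sample lines are hypothetical, not taken from the real input file.
val lines = List("holy bible authorized", "textfile", "")

// Tokenise the same way as in the question.
def tokens(line: String): Array[String] = line.trim.split(' ')

// On a one-element array, sliding(2) yields one window of length 1.
// Mapping `case Array(x, y)` over that window throws scala.MatchError.
val shortWindow: List[List[String]] =
  tokens("textfile").sliding(2).map(_.toList).toList

// collect applies the partial function only where it is defined,
// so windows that are too short are dropped instead of failing.
def bigramsOf(line: String): List[((String, String), Int)] =
  tokens(line).sliding(2).collect { case Array(x, y) => ((x, y), 1) }.toList

val safeBigrams: List[((String, String), Int)] = lines.flatMap(bigramsOf)
```

In the Spark version, the same change would presumably be to replace `map` with `collect` inside the `flatMap` of `bigrams2`. It would also explain why `take(5)` succeeds: `take` evaluates only as many elements as it needs, so it may never reach a short line, whereas the RDD's `collect` evaluates every line.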
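You should mention that an 'Option' can be thought of as a sequence that is either empty or has one item.
@ShihaoXu Actually, 'Option' is not really a sequence that is either empty or has one item. Explaining the actual definition of 'Option' to Scala beginners (or functional programming beginners) is not feasible, so people tell them to think of 'Option' this way. You can think of 'Option' like that for now, but remember that it is not strictly true. Once you have grasped the functional programming concept called 'Monad', you will understand 'Option' better.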
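The "empty or one item" intuition from these comments can be sketched as follows. This is only an illustration: the helper `firstPair` and the sample data are hypothetical. The conversion to a list is written out explicitly with `toList`, which keeps the point that an `Option` is not itself a sequence.

```scala
// Some behaves like a one-element collection, None like an empty one.
val some: Option[Int] = Some(3)
val none: Option[Int] = None
val asLists: (List[Int], List[Int]) = (some.toList, none.toList)

// Hypothetical helper: the first bigram of a word list, if there is one.
def firstPair(words: Array[String]): Option[(String, String)] = words match {
  case Array(x, y, _*) => Some((x, y))
  case _               => None
}

// Flattening the results drops the None cases, just as flattening
// a list of lists drops the empty lists.
val pairs: List[(String, String)] =
  List(Array("holy", "bible"), Array("textfile"))
    .flatMap(words => firstPair(words).toList)
```

The standard library also provides an implicit conversion from `Option` to `Iterable`, which is why `flatMap` over a collection usually accepts a function returning an `Option` directly.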