2017-02-22 85 views

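Using a Scala filter inside a Spark map operation

I have a small dataset of tweets, and I want to remove the usernames from them. That means dropping every word that starts with @, but in the last map() operation of the code below I get java.lang.StringIndexOutOfBoundsException: String index out of range: 0. In that map operation I split a sentence into words and then call filter on the resulting Scala collection rather than on a Spark RDD, so I wonder whether the problem is related to that. If I comment out .filter(_(0) != '@'), everything works fine.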

val logFile = "tweets10.csv" 
val config = new SparkConf().setMaster("local").setAppName("Spark App") 
val sc = new SparkContext(config) 

val logData = sc.textFile(logFile, 2).cache() 


val tweets = logData.mapPartitionsWithIndex((index, line) => if (index == 0) line.drop(1) else line) 
           .map(_.split(",")(1).replace("\"", "")) 
           .map(line => line.split(" ") 
            .filter(_(0) != '@') 
            .reduce((x,y) => x + " " + y)) 

Dataset:

"","text","favorited","favoriteCount","replyToSN","created","truncated","replyToSID","id","replyToUID","statusSource","screenName","retweetCount","isRetweet","retweeted","longitude","latitude" 
"1","RT @WDD: Check today how you can join World Diabetes Day: htts/EIQ1Za0R0t. Eyes on #diabetes htts/rN3VJYC7T0",FALSE,0,NA,2016-09-07 20:12:03,FALSE,NA,"773614831018643457",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","un_ncd",27,TRUE,FALSE,NA,NA 
"2","RT @JDRFUK: With his #Rio2016 medal in hand Team GB gymnast @louissmith1989 puts type 1 #diabetes in the picture! htts:/OKkPtQLuvi",FALSE,0,NA,2016-09-07 20:10:44,FALSE,NA,"773614501853880320",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","sg0809",2,TRUE,FALSE,NA,NA 
"3","RT @CleanairCA: Speaking of the things in the air you breath... 
    #asthma #diabetes #copd #lungcancer #smog #losangeles #HeartDisease htts:/",FALSE,0,NA,2016-09-07 20:09:03,FALSE,NA,"773614075284746240",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","tt85207533",9,TRUE,FALSE,NA,NA 
"4","So - tonight's #tweetchat is about FOOD - ""#Diabetes and Diets"" (aka - stuff we eat) #gbdoc",FALSE,1,NA,2016-09-07 20:08:28,FALSE,NA,"773613929515941888",NA,"<a href=""htt://www.tchat.io"" rel=""nofollow"">tchat.io</a>","theGBDOC",0,FALSE,FALSE,NA,NA 
"5","Learn the most important things you can do to prevent #diabetes here: htts:/eHu5pesgKw.",FALSE,0,NA,2016-09-07 20:07:00,FALSE,NA,"773613560320495617",NA,"<a href=""htt://sproutsocial.com"" rel=""nofollow"">Sprout Social</a>","MountainPointMC",0,FALSE,FALSE,NA,NA 
"6","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/Ul0vwRpqbw htts:/YU77iuudeR",FALSE,0,NA,2016-09-07 20:06:09,FALSE,NA,"773613345480007680",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","CureExchange",0,FALSE,FALSE,NA,NA 
"7","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/wEjrW9f9b1 htts:/iHlSpbwzZl",FALSE,0,NA,2016-09-07 20:06:08,FALSE,NA,"773613341826805760",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","GuineaHenWeed",0,FALSE,FALSE,NA,NA 
"8","Linda Yip hopes to find better ways to diagnose, treat &amp; prevent #diabetes: htts:/tmjgnEFUkZ #WIMmonth htts:/xL25me7ckK",FALSE,0,NA,2016-09-07 20:05:14,FALSE,NA,"773613114533171200",NA,"<a href=""htts://about.twitter.com/products/tweetdeck"" rel=""nofollow"">TweetDeck</a>","StanfordDeptMed",0,FALSE,FALSE,NA,NA 
"9","A Farm Stand In South Dallas Is Fighting #Diabetes With Common Sense And Vegetables htts:/l9pWvnAA5W",FALSE,0,NA,2016-09-07 20:05:08,FALSE,NA,"773613090378166273",NA,"<a href=""htt://www.hootsuite.com"" rel=""nofollow"">Hootsuite</a>","DiabetesDallas",0,FALSE,FALSE,NA,NA 
"10","Hi #gbdoc Paul here, #t1d #teampump and #cgm - 4.5 years with #diabetes now!",FALSE,0,NA,2016-09-07 20:04:25,FALSE,NA,"773612908693614592",NA,"<a href=""htt://itunes.apple.com/us/app/twitter/id409789998?mt=12"" rel=""nofollow"">Twitter for Mac</a>","t1hba1c",0,FALSE,FALSE,NA,NA 

Answer


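Without knowing exactly what the dataset contains, I'm going to go out on a limb and guess that the split produces empty strings. For example, a leading space or two consecutive spaces in the text make split(" ") return an empty token, and indexing an empty string at position 0 throws exactly this StringIndexOutOfBoundsException. Using a Scala collection's filter inside a Spark map is fine in itself. Add an extra check for empty strings: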

_.split(" ") 
.filter(word => word != "" && word(0) != '@') 
.reduce((x,y) => x + " " + y)
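
To try the idea without a Spark cluster, here is a minimal plain-Scala sketch of the same filtering step. The object and function name (StripMentions, stripMentions) are made up for illustration and do not appear in the original code. It also swaps reduce for mkString, which is my own substitution: reduce throws on an empty collection, which would happen for a tweet made up only of @-mentions.

```scala
object StripMentions {
  // Hypothetical helper (not from the original post): removes every
  // space-separated token that starts with '@'.
  def stripMentions(text: String): String =
    text.split(" ")
      // nonEmpty drops the empty tokens produced by repeated or leading spaces;
      // startsWith is safe on any string, unlike word(0).
      .filter(word => word.nonEmpty && !word.startsWith("@"))
      // mkString returns "" for an empty array, whereas reduce would throw.
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    // The double space after "@WDD:" yields an empty token when split.
    println(stripMentions("RT @WDD:  Check today"))
  }
}
```

Assuming the rest of the pipeline stays as in the question, a function like this could be passed to the final map in place of the inline split/filter/reduce chain. Note that dropping empty tokens also collapses repeated spaces into one.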