
I'm new to Spark and Scala, and I want to find the maximum salary in each department using a pair RDD (Scala groupBy and max).

Dept,Salary 
Dept1,1000 
Dept2,2000 
Dept1,2500 
Dept2,1500 
Dept1,1700 
Dept2,2800 

I have implemented the code below:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 


object MaxSalary { 
    val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]")) 

    case class Dept(dept_name : String, Salary : Int) 

    val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(",")) 
    val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt))) 
    val a = ??? // stuck here: how do I group recs by dept and take the max Salary?
}

I'm stuck on how to implement the group-by and max step. I'm using a pair RDD.

Thanks

Answers


This can be solved if you use a Dataset here:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

case class Dept(dept_name: String, Salary: Int)

val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._

val file = "resources/ip.csv"
val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt)).toDS()

recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()

Output:

+---------+------------+
|dept_name|max_solution|
+---------+------------+
|    Dept2|        2800|
|    Dept1|        2500|
+---------+------------+

I get the error 'value toDS is not a member of org.apache.spark.rdd.RDD[MaxSalary.Dept]' – Ajay


Did you use import spark.implicits._? –


It's not a member... Do I need to write it differently? It returns the error 'not found: value sqlContext' – Ajay
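
For anyone hitting the same toDS error: in Spark 2.x the usual fix is to build a SparkSession, import its implicits, and define the case class at the top level rather than inside a method, so Spark can derive an encoder for it. A minimal sketch, assuming Spark 2.x (the object name and the header filter are my own illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// Must be top-level (not nested in a method) so an encoder can be derived.
case class Dept(dept_name: String, Salary: Int)

object MaxSalaryDS {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Max Salary")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._ // brings in .toDS() and the $"col" syntax

    val recs = spark.sparkContext
      .textFile("file:///home/user/Documents/dept.txt")
      .map(_.split(","))
      .filter(_(0) != "Dept") // skip the header row
      .map(r => Dept(r(0), r(1).toInt))
      .toDS()

    recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()

    spark.stop()
  }
}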


If you don't want to create a DataFrame:

val emp = sc.textFile("file:///home/user/Documents/dept.txt")
  .mapPartitionsWithIndex((idx, rows) => if (idx == 0) rows.drop(1) else rows) // skip the header line
  .map { line => val fields = line.split(","); (fields(0), fields(1).toInt) }
val maxSal = emp.reduceByKey(math.max(_, _))

This should give you:

Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
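
Worth noting: reduceByKey is the right tool here because the per-key max is combined on each partition before the shuffle, whereas groupByKey would ship every salary across the network first. For comparison, a sketch of the groupByKey equivalent (same result, less efficient; the val name is mine):

// Same result as reduceByKey(math.max(_, _)), but all salaries are
// shuffled across the cluster before the max is taken.
val maxSalGrouped = emp.groupByKey().mapValues(_.max)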