2016-05-30 346 views

Spark SQL: get the month from week number and year

I have a DataFrame with "Week" & "Year" columns and need to calculate the month from them, as follows:

Input:

+----+----+
|Week|Year|
+----+----+
|  50|2012|
|  50|2012|
|  50|2012|
+----+----+

Expected output:

+----+----+-----+
|Week|Year|Month|
+----+----+-----+
|  50|2012|   12|
|  50|2012|   12|
|  50|2012|   12|
+----+----+-----+

Any help would be appreciated. Thanks.


What about weeks that span 2 months? Isn't week a weak variable to derive a month from? –

Answer


Thanks to @zero323, who pointed me away from the sqlContext.sql query; I converted the query as shown below:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import static org.apache.spark.sql.functions.*;

public class MonthFromWeekSparkSQL {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        List<Row> myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012));
        JavaRDD<Row> myRDD = sc.parallelize(myList);

        List<StructField> structFields = new ArrayList<StructField>();

        // Create StructFields
        StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true);
        StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true);

        // Add StructFields into list
        structFields.add(structField1);
        structFields.add(structField2);

        // Create StructType from StructFields. This will be used to create DataFrame
        StructType schema = DataTypes.createStructType(structFields);

        DataFrame df = sqlContext.createDataFrame(myRDD, schema);

        // Build a "yyyy w" string, parse it with unix_timestamp,
        // cast to timestamp, and extract the month.
        DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
                .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("timestamp")))
                .drop("yearAndWeek");

        df2.show();
    }
}

You essentially create a new column formatted as year and week with the pattern "yyyy w", then use unix_timestamp to convert it into a timestamp from which the month can be extracted.
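The core of the trick, parsing a "yyyy w" string into a date and reading the month back, can be sketched in plain Java, independent of Spark (a hypothetical standalone illustration; under the hood unix_timestamp uses the same SimpleDateFormat-style patterns):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class WeekToMonth {

    public static void main(String[] args) throws ParseException {
        // Parse "year week" with the same "yyyy w" pattern used in the answer,
        // then read the month back from the resulting date.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy w", Locale.US);
        Calendar cal = Calendar.getInstance(Locale.US);
        cal.setTime(fmt.parse("2012 50"));
        int month = cal.get(Calendar.MONTH) + 1; // Calendar.MONTH is 0-based
        System.out.println(month); // week 50 of 2012 falls in December
    }
}

Note that exact week numbering depends on the locale (first day of week, minimal days in first week), which is one reason the week-to-month mapping can shift near month boundaries, as the comment above points out.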

PS: It seems the cast behavior is incorrect in Spark 1.5; in that case it is more robust to do .cast("double").cast("timestamp") instead.
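The reason the double cast works is that unix_timestamp returns seconds since the epoch, and casting that number to a timestamp just interprets it as such. A minimal plain-Java sketch of that interpretation (the epoch value below is illustrative; it corresponds to 2012-12-09T00:00:00Z, within week 50 of 2012):

import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class EpochToMonth {

    public static void main(String[] args) {
        // A numeric epoch value, as produced by unix_timestamp (illustrative).
        double epochSeconds = 1355011200d;
        // Interpreting the number as seconds since the epoch, as cast("timestamp") does.
        LocalDateTime dt = LocalDateTime.ofEpochSecond((long) epochSeconds, 0, ZoneOffset.UTC);
        System.out.println(dt.getMonthValue());
    }
}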


In my case, it just adds a time without changing the month and year. Please take a look at the gist https://gist.github.com/nareshbab/7d945ccaaae07ca743dec0ea07bb50c0 – nareshbabral


You didn't copy the code correctly, so please check your code! – eliasah


Thanks, it works now – nareshbabral