Spark SQL: get the month from week number and year. I have a DataFrame with "Week" & "Year" columns and need to calculate the month from them, as shown below:
Input:
+----+----+
|Week|Year|
+----+----+
|  50|2012|
|  50|2012|
|  50|2012|
+----+----+
Expected output:
+----+----+-----+
|Week|Year|Month|
+----+----+-----+
|  50|2012|   12|
|  50|2012|   12|
|  50|2012|   12|
+----+----+-----+
Any help would be appreciated. Thanks
Thanks to @zero323, who pointed me to the sqlContext.sql query; I converted that query to the DataFrame API as shown below:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import static org.apache.spark.sql.functions.*;

public class MonthFromWeekSparkSQL {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        // Sample rows with (week, year)
        List<Row> myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012));
        JavaRDD<Row> myRDD = sc.parallelize(myList);

        // Create StructFields
        List<StructField> structFields = new ArrayList<StructField>();
        StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true);
        StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true);

        // Add StructFields into list
        structFields.add(structField1);
        structFields.add(structField2);

        // Create StructType from StructFields. This will be used to create the DataFrame
        StructType schema = DataTypes.createStructType(structFields);
        DataFrame df = sqlContext.createDataFrame(myRDD, schema);

        // Build a "yyyy w" string, parse it with unix_timestamp and extract the month
        DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
                .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("timestamp")))
                .drop("yearAndWeek");

        df2.show();
    }
}
You basically create a new column with the year and the week formatted as "yyyy w", then convert it with unix_timestamp into a timestamp from which the month can be extracted, as you can see above.
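For reference, a rough sqlContext.sql equivalent of the same trick might look like the following (a sketch only; the temp table name "weeks" is made up here, and it assumes the same DataFrame df built above):

// Register the DataFrame so it can be queried with plain SQL (table name is hypothetical)
df.registerTempTable("weeks");
DataFrame viaSql = sqlContext.sql(
        "SELECT week, year, " +
        "month(CAST(unix_timestamp(concat(year, ' ', week), 'yyyy w') AS timestamp)) AS month " +
        "FROM weeks");
viaSql.show();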
PS: It seems the cast behaviour is not correct in Spark 1.5, so in that case it is more general to do .cast("double").cast("timestamp")
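For example, the month column from the answer above would then be built roughly like this (a sketch applying the workaround to the same DataFrame df):

// Workaround for the Spark 1.5 cast behaviour mentioned above:
// go through double before casting the unix timestamp to timestamp
DataFrame df3 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
        .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w")
                .cast("double").cast("timestamp")))
        .drop("yearAndWeek");
df3.show();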
In my case it just adds the time without changing the month and year. Please have a look at this gist https://gist.github.com/nareshbab/7d945ccaaae07ca743dec0ea07bb50c0 – nareshbabral
You didn't copy the code correctly, so please check your code! – eliasah
Thanks, it works now – nareshbabral
What about weeks that span 2 months? Isn't a week a weak variable to derive a month from? –