
Answers


You can use the from_unixtime function.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: your existing SparkContext

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Your DataFrame; transaction_date is assumed to be a Unix timestamp in seconds.
val df = ???

df.select('transaction_number, hour(from_unixtime('transaction_date)) as 'hour)
  .groupBy('hour)
  .agg(count('transaction_number) as 'transactions)

Result:

+----+------------+
|hour|transactions|
+----+------------+
|  10|        1000|
|  12|        2000|
|  13|        3000|
|  14|        4000|
| ...|         ...|
+----+------------+

Thanks a lot, I will try it out and update... – Shankar


Here I would like to give some pointers on the approach; for the complete code, see the reference below.

Time Interval Literals: Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like "Find all transactions that have happened during the past hour". An interval literal is constructed using the following syntax: INTERVAL value unit
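
For example, a minimal Scala sketch of the "past hour" query from the quote above, assuming Spark 1.5+ and a registered table named transactions with a timestamp column transaction_date (both names are placeholders for your own schema):

// Hypothetical table/column names; interval literals require Spark SQL 1.5+.
val lastHour = sqlContext.sql(
  """SELECT *
    |FROM transactions
    |WHERE transaction_date >= current_timestamp() - INTERVAL 1 HOUR""".stripMargin)

lastHour.count() // number of transactions in the past hour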


Below is the Python way. You can adapt the example to your requirements, i.e. use the transaction date as the start time with a corresponding end time, and the transaction number instead of the id column.

# Import functions. 
from pyspark.sql.functions import * 
# Create a simple DataFrame. 
data = [ 
    ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1), 
    ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2), 
    ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)] 
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"]) 
df = df.select(
    df.start_time.cast("timestamp").alias("start_time"), 
    df.end_time.cast("timestamp").alias("end_time"), 
    df.id) 
# Get all records that have a start_time and end_time in the 
# same day, and the difference between the end_time and start_time 
# is less or equal to 1 hour. 
condition = \ 
    (to_date(df.start_time) == to_date(df.end_time)) & \ 
    (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time) 
df.filter(condition).show() 
+---------------------+---------------------+---+
|           start_time|             end_time| id|
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|  2|
+---------------------+---------------------+---+

Using this approach, you can then apply group functions to find the total number of transactions in your case.

The above is Python code; what about Scala?

The expr function used above works in Scala as well.
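
For instance, a rough Scala equivalent of the Python filter above, followed by the aggregation mentioned earlier, could look like the sketch below (assuming a DataFrame with start_time, end_time and transaction_number columns; all names are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Placeholder: your DataFrame with start_time, end_time and transaction_number columns.
val df: DataFrame = ???

// Keep pairs on the same day that are at most one hour apart,
// then count transactions per start hour.
df.filter(to_date($"start_time") === to_date($"end_time") &&
          $"start_time" + expr("INTERVAL 1 HOUR") >= $"end_time")
  .groupBy(hour($"start_time") as "hour")
  .agg(count($"transaction_number") as "transactions")
  .show()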

Also take a look at spark-scala-datediff-of-two-columns-by-hour-or-minute, which covers the following:

import org.apache.spark.sql.functions._

val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
val df2 = df1
  .withColumn("diff_secs", diff_secs_col)
  .withColumn("diff_mins", diff_secs_col / 60D)
  .withColumn("diff_hrs", diff_secs_col / 3600D)
  .withColumn("diff_days", diff_secs_col / (24D * 3600D))