2017-12-27 371 views
1

我使用PySpark V1.6.1创建一个数据帧,我想用另外一个来创建一个数据帧:Pyspark:如何使用其他数据框

  • 转换已在不同的三个值中的一个结构体的列
  • 从字符串转换的时间戳DATATIME
  • 使用时间戳
  • 更改列名和类型
其余创建更多的列

现在正在使用.map(func)创建一个使用该函数的RDD(它从原始类型的一行转换并返回一个新行)。但是这是创建一个RDD,我不会这样做。

有没有更好的方法来做到这一点?

谢谢!

+0

如果你已经能够创建RDD,您可以轻松地将其转化为DF。 –

+0

但我不想创建一个RDD,我想避免使用RDD,因为它们是python的性能瓶颈,我只是想做DF转换 – frm

+1

请提供一些代码,说明您已经尝试过,以便我们可以提供帮助。同时,查看spark数据框的'pyspark.sql.functions.udf'和'withColumn()'方法。 – pault

回答

1

希望这有助于!

from pyspark.sql.functions import unix_timestamp, col, to_date, struct 

#### 
#sample data 
#### 
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05','abc', '1'], 
         [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'], 
         [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\ 
    toDF(["age","name","sex","datetime_in_strFormat","initial_col_name","col_in_strFormat"]) 

#create 'struct' type column by combining first 3 columns of sample data - (this is built to answer query #1) 
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\ 
    drop('age', 'name', 'sex') 
df.show() 
df.printSchema() 

#### 
#query 1 
#### 
#Convert a field that has a struct of three values (i.e. 'struct_col') in different columns (i.e. 'name', 'age' & 'sex') 
df = df.withColumn('name', col('struct_col.name')).\ 
    withColumn('age', col('struct_col.age')).\ 
    withColumn('sex', col('struct_col.sex')).\ 
    drop('struct_col') 
df.show() 
df.printSchema() 

#### 
#query 2 
#### 
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to datetime (i.e. 'datetime_in_tsFormat') 
df = df.withColumn('datetime_in_tsFormat', 
        unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss').cast("timestamp")) 
df.show() 
df.printSchema() 

#### 
#query 3 
#### 
#create more columns using above timestamp (e.g. fetch date value from timestamp column) 
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat'))) 
df.show() 

#### 
#query 4.a 
#### 
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name) 
df = df.withColumnRenamed('initial_col_name', 'new_col_name') 
df.show() 

#### 
#query 4.b 
#### 
#Change column type (e.g. string type in 'col_in_strFormat' is coverted to double type in 'col_in_doubleFormat') 
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double")) 
df.show() 
df.printSchema() 

的样本数据:

+---------------------+----------------+----------------+------------+ 
|datetime_in_strFormat|initial_col_name|col_in_strFormat| struct_col| 
+---------------------+----------------+----------------+------------+ 
| 12-21-2006 11:00:05|    abc|    1| [25,Prem,M]| 
| 05-30-2007 10:05:00|   asdf|    2| [20,Kate,F]| 
| 12-30-2017 01:00:01|   qwerty|    3|[40,Cheng,M]| 
+---------------------+----------------+----------------+------------+ 
root 
|-- datetime_in_strFormat: string (nullable = true) 
|-- initial_col_name: string (nullable = true) 
|-- col_in_strFormat: string (nullable = true) 
|-- struct_col: struct (nullable = false) 
| |-- age: long (nullable = true) 
| |-- name: string (nullable = true) 
| |-- sex: string (nullable = true) 

最终输出数据:

+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+ 
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat| 
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+ 
| 12-21-2006 11:00:05|   abc|    1| Prem| 25| M| 2006-12-21 11:00:05|   2006-12-21|    1.0| 
| 05-30-2007 10:05:00|  asdf|    2| Kate| 20| F| 2007-05-30 10:05:00|   2007-05-30|    2.0| 
| 12-30-2017 01:00:01|  qwerty|    3|Cheng| 40| M| 2017-12-30 01:00:01|   2017-12-30|    3.0| 
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+ 

root 
|-- datetime_in_strFormat: string (nullable = true) 
|-- new_col_name: string (nullable = true) 
|-- col_in_strFormat: string (nullable = true) 
|-- name: string (nullable = true) 
|-- age: long (nullable = true) 
|-- sex: string (nullable = true) 
|-- datetime_in_tsFormat: timestamp (nullable = true) 
|-- datetime_in_dateFormat: date (nullable = true) 
|-- col_in_doubleFormat: double (nullable = true) 
+0

嗨Prem !,那段代码就是我一直在寻找的!谢谢! – frm

+0

很高兴它帮助! – Prem

相关问题