from pyspark import SparkContext, SparkConf 
from pyspark.sql import SparkSession 
import gc 
import pandas as pd 
import datetime 
import numpy as np 
import sys 



APP_NAME = "DataFrameToCSV" 

spark = SparkSession\ 
    .builder\ 
    .appName(APP_NAME)\ 
    .config("spark.sql.crossJoin.enabled","true")\ 
    .getOrCreate() 

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2] 

dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"] 

#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0] 
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0] 

dataFrameArr = np.column_stack((group_ids,dates,event)) 
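# note: np.column_stack promotes the mixed int/str lists to a single string dtype,
# so every column in the DataFrame built below holds strings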

df = pd.DataFrame(dataFrameArr,columns = ["group_ids","dates","event"]) 

The Python code above will run on a Spark cluster on gcloud Dataproc. I want to save the pandas DataFrame as a CSV file in a gcloud storage bucket at gs://mybucket/csv_data/.

How do I do that?

Answer


So, I figured out how to do this. Continuing from the code above, here is the solution:

# Reuse the SparkContext behind the SparkSession created above
sc = SparkContext.getOrCreate()

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

# Convert the pandas DataFrame to a Spark DataFrame, then coalesce to a single
# partition and write it (with a header row) to the bucket
sparkDf = sqlCtx.createDataFrame(df)
sparkDf.coalesce(1).write.option("header","true").csv('gs://mybucket/csv_data')
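
Two notes on the write. Spark treats gs://mybucket/csv_data as an output directory, so the result is a single part-*.csv file inside that directory (because of coalesce(1)), not a file literally named csv_data. Also, since the question already builds a SparkSession (spark), the extra SparkContext/SQLContext step can be skipped; a minimal equivalent sketch, assuming the same spark session and bucket path:

# Convert the pandas DataFrame with the existing SparkSession and write one CSV part file
sparkDf = spark.createDataFrame(df)
sparkDf.coalesce(1).write.option("header", "true").csv("gs://mybucket/csv_data")

If the target directory already exists from an earlier run, adding .mode("overwrite") before .csv(...) lets the write replace it.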