2017-07-26

PySpark type conversion issue from date to string

I am using PySpark 2.1. Below is the content of my DataFrame:

expecteddays,date 

139,30.JUl.2017 

134,01.NOV.2018 

My output should look like this:

138,30.JUL.2017,<30/SEP/2018,4/FEB/2019> 

The last column is populated by my functions dateRangebetween and get_date, shown below.

Here is my code:

from datetime import datetime, timedelta
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession, types
from pyspark.sql.functions import concat, explode, udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

maintenance_final_join = spark.read.csv('/user/NaveenSri/adh_dev_engg/test.csv', header=True)

def get_date(dateFormat="%d-%m-%Y", addDays=0, timeNow=0):
    # print('inside get date', timeNow)
    if addDays != 0:
        anotherTime = timeNow + timedelta(days=addDays)
    else:
        anotherTime = timeNow
    return anotherTime.strftime(dateFormat)

def dateRangebetween(expectedDate, estimatedDays):
    output_format = '%d-%m-%Y'
    dateRangeList = []
    j = 2
    # print('inside Date range', expectedDate)
    rangeEnddate = datetime.strptime(get_date(output_format, 730, expectedDate), '%d-%m-%Y').date()
    # print('rangeEnddate---', rangeEnddate)
    calculatedDate = datetime.strptime(get_date(output_format, estimatedDays, expectedDate), '%d-%m-%Y').date()
    # print('calculatedDate----', calculatedDate)

    while calculatedDate <= rangeEnddate:
        # print(calculatedDate)
        # print(estimatedDays)
        dateRangeList.append(calculatedDate)
        calculatedDate = datetime.strptime(get_date(output_format, estimatedDays, calculatedDate), '%d-%m-%Y').date()

    # print('-----', datetime.strptime(get_date(output_format, estimatedDays, calculatedDate), '%d-%m-%Y').date())
    return dateRangeList

dateRange = udf(dateRangebetween, types.ArrayType(types.StringType())) 
addDays=182 
result = maintenance_final_join.withColumn('Part_Dates',dateRange(maintenance_final_join.Expected,maintenance_final_join.estimateddays)).show() 

After running it I get this error:

TypeError: coercing to Unicode: need string or buffer, datetime.timedelta found 

Answer

First, could you please fix your indentation? Your dateRangebetween() function is hard to read as posted.

However, your problem is this:

dateRangeList.append(calculatedDate) 
calculatedDate = datetime.strptime(get_date(output_format,estimatedDays, 
     calculatedDate), '%d-%m-%Y').date() 

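Your calculatedDate is a datetime.date object. You append that object (not its string representation) to dateRangeList and return it. Then, in your main program, you register the function as a UDF declared to return an array of strings, while it actually returns a list of date objects.

I assume your intent is to use the string representation. If you change the append to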

dateRangeList.append(calculatedDate.strftime("......")) 

and insert the correct format string in place of the dots, you will at least be dealing with string objects rather than dates.
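A minimal sketch of that change in plain Python, under a few assumptions that are not confirmed by the question: the date column arrives as a string such as 30.JUl.2017 (spark.read.csv without a schema reads every column as a string), the day count also arrives as a string, and the name date_range_between, the INPUT_FORMAT constant and the horizon_days parameter are hypothetical. Adding a timedelta to an unparsed string may be what raises the TypeError shown in the question, so the sketch parses both inputs before doing any date arithmetic.

```python
from datetime import datetime, timedelta

# Assumed from the sample rows in the question (e.g. 30.JUl.2017); adjust if the data differs.
INPUT_FORMAT = "%d.%b.%Y"
# Same output format the question's code uses.
OUTPUT_FORMAT = "%d-%m-%Y"


def date_range_between(expected_date, estimated_days, horizon_days=730):
    """Return formatted date strings, one every estimated_days days after
    expected_date, up to horizon_days after expected_date."""
    # CSV columns are strings, so parse both inputs explicitly.
    start = datetime.strptime(expected_date, INPUT_FORMAT).date()
    step = timedelta(days=int(estimated_days))
    if step <= timedelta(0):
        # A non-positive step would never reach the end of the range.
        return []

    range_end = start + timedelta(days=horizon_days)
    dates = []
    current = start + step
    while current <= range_end:
        # Append the string representation, not the date object.
        dates.append(current.strftime(OUTPUT_FORMAT))
        current += step
    return dates


print(date_range_between("30.JUl.2017", "139"))
```

Because every element of the returned list is a string, the function matches a UDF declared as udf(date_range_between, types.ArrayType(types.StringType())). The column names passed to that UDF are left to the caller, since the sample header (expecteddays, date) and the names used in the question's code (Expected, estimateddays) do not agree.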


Thanks a lot, Hannu, that worked. Thank you for the suggestion.