2016-09-30

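Python: iterating over a large dataset and deleting evaluated data

I am working with a dataset of 10,000 customers, each with data for months 1-12. For each customer, I compute correlations between different values over that 12-month period.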

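Currently my output correlation file has more rows than my original file. I realize this comes from an iteration error where I try to delete the already-evaluated rows from the original dataset.

My desired result is a dataset with 10,000 entries, one per customer, each holding the various correlations for that customer's year.

I have bolded (starred) the part where I think the error is.

Here is my current code: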

for x_customer in range(0,len(overalldata),12): 

     for x in range(0,13,1): 
       cust_months = overalldata[0:x,1] 

       cust_balancenormal = overalldata[0:x,16] 

       cust_demo_one = overalldata[0:x,2] 
       cust_demo_two = overalldata[0:x,3] 

       num_acct_A = overalldata[0:x,4] 
       num_acct_B = overalldata[0:x,5] 

       out_mark_channel_one = overalldata[0:x,25] 
       out_service_channel_two = overalldata[0:x,26] 
       out_mark_channel_three = overalldata[0:x,27] 
       out_mark_channel_four = overalldata[0:x,28] 


    #Correlation Calculations 

       #Demographic to Balance Correlations 
       demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0] 
       demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0] 


       #Demographic to Account Number Correlations 
       demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0] 
       demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0] 
       demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0] 
       demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0] 

       #Marketing Response Channel One 
       mark_one_corr_acct_a = numpy.corrcoef(num_acct_A, out_mark_channel_one)[1, 0] 
       mark_one_corr_acct_b = numpy.corrcoef(num_acct_B, out_mark_channel_one)[1, 0] 
       mark_one_corr_balance = numpy.corrcoef(cust_balancenormal, out_mark_channel_one)[1, 0] 

       #Marketing Response Channel Two 
       mark_two_corr_acct_a = numpy.corrcoef(num_acct_A, out_service_channel_two)[1, 0] 
       mark_two_corr_acct_b = numpy.corrcoef(num_acct_B, out_service_channel_two)[1, 0] 
       mark_two_corr_balance = numpy.corrcoef(cust_balancenormal, out_service_channel_two)[1, 0] 

       #Marketing Response Channel Three 
       mark_three_corr_acct_a = numpy.corrcoef(num_acct_A, out_mark_channel_three)[1, 0] 
       mark_three_corr_acct_b = numpy.corrcoef(num_acct_B, out_mark_channel_three)[1, 0] 
       mark_three_corr_balance = numpy.corrcoef(cust_balancenormal, out_mark_channel_three)[1, 0] 

       #Marketing Response Channel Four 
       mark_four_corr_acct_a = numpy.corrcoef(num_acct_A, out_mark_channel_four)[1, 0] 
       mark_four_corr_acct_b = numpy.corrcoef(num_acct_B, out_mark_channel_four)[1, 0] 
       mark_four_corr_balance = numpy.corrcoef(cust_balancenormal, out_mark_channel_four)[1, 0] 


       #Result Correlations For Exporting to CSV of all Correlations 
       result_correlation = [(demo_one_corr_balance),(demo_two_corr_balance),(demo_one_corr_acct_a),(demo_one_corr_acct_b),(demo_two_corr_acct_a),(demo_two_corr_acct_b),(mark_one_corr_acct_a),(mark_one_corr_acct_b),(mark_one_corr_balance), 
             (mark_two_corr_acct_a),(mark_two_corr_acct_b),(mark_two_corr_balance),(mark_three_corr_acct_a),(mark_three_corr_acct_b),(mark_three_corr_balance),(mark_four_corr_acct_a),(mark_four_corr_acct_b), 
             (mark_four_corr_balance)] 
       result_correlation_nan_nuetralized = numpy.nan_to_num(result_correlation) 
       c.writerow(result_correlation) 

     **result_correlation_combined = emptylist.append([result_correlation]) 
     cust_delete_list = [0,x_customer,1] 
     overalldata = numpy.delete(overalldata, (cust_delete_list), axis=0)** 

To expand: when I feed in a file of 10 customers, each with 12 months of data, I get an output file of 130 rows, when it should have only 10.

Answers


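This may not fully solve your problem, but I think it is relevant.

When you call .append on a list object (empty or otherwise), the method modifies the list in place and returns None. So for the line result_correlation_combined = emptylist.append([result_correlation]), the value of result_correlation_combined will be None regardless of whether emptylist is empty or not.

Here is a simple example of what I mean; since no data was provided, I just made up some numbers.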

>>> empty_list = [] 
>>> result_correlation = [] 

>>> for j in range(10): 
     result_correlation.append(j) 

>>> result_correlation 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

>>> result_correlation_combined = empty_list.append(result_correlation) 
>>> print(result_correlation_combined) 
None 
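
For contrast, here is a minimal sketch of the working pattern: call the method on its own line and read the list afterwards. The names `combined_rows` and `flat_values` and the numbers are made up for illustration. It also shows that `append` adds the whole list as a single element, whereas `extend` and `+=` add its elements one by one.

```python
result_correlation = [0.1, 0.2, 0.3]  # made-up numbers, not real correlations

# append: each call adds the whole list as ONE element (one row per call)
combined_rows = []
combined_rows.append(result_correlation)  # mutates in place, returns None
combined_rows.append(result_correlation)

# extend and += : each call adds the individual elements (a flat list)
flat_values = []
flat_values.extend(result_correlation)
flat_values += result_correlation

print(combined_rows)  # [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]]
print(flat_values)    # [0.1, 0.2, 0.3, 0.1, 0.2, 0.3]
```

For one row of correlations per customer, the `append` form is probably the one you want.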

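So, instead, create result_correlation_combined as a list first and then run result_correlation_combined.append(result_correlation), result_correlation_combined += result_correlation, or even result_correlation_combined.extend(result_correlation), each on its own line without assigning the return value. Note that these are not quite equivalent: append adds the whole list as a single element (one row per customer), while += and extend add its elements individually. See if this gives you the answer you are looking for. If not, come back.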
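
Separately, the 130 rows mentioned in the comment are consistent with c.writerow sitting inside the inner `for x in range(0,13,1)` loop: 10 customers times 13 passes gives 130 rows. The slices `overalldata[0:x, ...]` also always start from the top of the array rather than from the current customer's block. Below is a minimal sketch of one alternative, assuming the rows are ordered so that each customer occupies 12 consecutive rows: slice each block by its start index and leave the array untouched, so nothing needs deleting. The function name `customer_correlations`, the `column_pairs` argument, and the random test data are hypothetical. Mapping undefined correlations to 0.0 is meant to mirror the `numpy.nan_to_num` call in the question.

```python
import numpy as np

MONTHS_PER_CUSTOMER = 12  # assumption: every customer has exactly 12 consecutive rows


def customer_correlations(data, column_pairs, months=MONTHS_PER_CUSTOMER):
    """Return one list of correlations per customer block of `months` rows."""
    rows = []
    for start in range(0, len(data), months):
        block = data[start:start + months]  # only this customer's rows
        row = []
        for col_a, col_b in column_pairs:
            a, b = block[:, col_a], block[:, col_b]
            if a.std() == 0 or b.std() == 0:
                # correlation is undefined for a constant column; use 0.0,
                # as nan_to_num would
                row.append(0.0)
            else:
                row.append(float(np.corrcoef(a, b)[1, 0]))
        rows.append(row)  # exactly one output row per customer
    return rows


# Made-up data: 10 customers x 12 months, 4 columns of random values
rng = np.random.default_rng(0)
data = rng.normal(size=(10 * MONTHS_PER_CUSTOMER, 4))
pairs = [(0, 1), (0, 2), (1, 3)]  # hypothetical column pairs to correlate

rows = customer_correlations(data, pairs)
print(len(rows))  # 10
```

Each entry of `rows` can then be passed to c.writerow once per customer, outside any per-month loop.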