2017-04-13 64 views

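Looping through a Python array to match multiple conditions in a second array: is there a fast method?

I'm a beginner in Python and would like to know whether there is a faster way to write this code, so please forgive my ignorance. I have 2 Excel sheets. The first (Results) has about 30,000 rows of unique user ids, followed by 30 question columns whose cells are empty. The second (Answers) has about 400,000 rows and 3 columns: the first column holds the user id, the second the question, and the third the user's answer to that question. What I want is essentially an INDEX/MATCH array formula from Excel: fill the empty cells in sheet 1 with the answers from sheet 2 by matching on both the user id and the question.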

[Screenshots: Results sheet, Answers sheet]

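Right now I have written some code, but it took about 2 hours to process just 4 columns from sheet 1. I'm trying to figure out whether the way I'm doing this fails to take full advantage of NumPy's functionality.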

import pandas as pd 
import numpy as np 

# Need to take in data from 'answers' and merge it into the 'results' data.
# This requires matching on the 'id' in column 1 of 'answers' and the
# 'question' in column 2 of 'answers'.
results = pd.read_excel("/Users/data.xlsx", 'Results') 
answers = pd.read_excel("/Users/data.xlsx", 'Answers') 

answers_array = np.array(answers) ######### 

# Create a list of questions being asked that will be matched to column 2 in answers. 
# Just getting all the questions I want 
column_headers = list(results.columns) 
formula_headers = []    ######### 
for header in column_headers: 
    formula_headers.append(header) 
del formula_headers[0:13] 

# Create an empty array with ids in which the 'merged' data will be fed into 
pre_ids = np.array(results['Id']) 
ids = np.reshape(pre_ids, (pre_ids.shape[0], 1)) 
ids = ids.astype(str) 

zero_array = np.zeros((ids.shape[0], len(formula_headers))) 
ids_array = np.hstack((ids, zero_array)) ########## 


for header in range(len(formula_headers)):
    question_index = formula_headers[header]
    for user in range(ids_array.shape[0]):
        user_index = ids_array[user, 0]
        location = answers_array[(answers_array[:, 0] == int(user_index)) & (answers_array[:, 1] == question_index)]
        # This location formula is what I feel is messing everything up,
        # or it could be because of the nested loops.
        # If the user id and question can't be found in the answers array:
        if location.size == 0:
            ids_array[user][header + 1] = ''
        else:
            row_location_1 = np.where(np.all(answers_array == location[0], axis=1))
            row_location = int(row_location_1[0][0])
            ids_array[user][header + 1] = answers_array[row_location][2]

print(ids_array)

Answers

Instead of populating the first dataframe from the second, we can pivot the second dataframe.

answers.set_index(['id', 'question']).answer.unstack() 
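
To make the pivot concrete, here is a minimal sketch on a made-up table. The column names `id`, `question`, and `answer` follow the snippet above; the sample values are invented for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (user, question) pair.
answers = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "question": ["Q1", "Q2", "Q1", "Q2"],
    "answer": ["yes", "no", "no", "yes"],
})

# Index by (id, question), then move the 'question' level into the columns.
# Pairs with no matching row (e.g. user 2, Q2) come out as NaN.
wide = answers.set_index(["id", "question"])["answer"].unstack()
print(wide)
```

The result has one row per user id and one column per question, which is the shape the Results sheet is after, and it is built in a single vectorized pass rather than a nested loop.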

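If you need the rows and columns to match the results dataframe, you can add the reindex_like method. Note that it aligns on index and column labels, so it presumes results is indexed by user id and its columns are the question names.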

answers.set_index(['id', 'question']).answer.unstack().reindex_like(results) 
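
As a rough illustration of what `reindex_like` does here, the sketch below uses a hypothetical `results` frame that is already indexed by user id with one column per question. That layout is an assumption: if the id is stored in an ordinary column (such as `Id` in the question), it would probably need something like `results.set_index('Id')` first.

```python
import pandas as pd

# Hypothetical long-format answers (invented values).
answers = pd.DataFrame({
    "id": [1, 1, 2],
    "question": ["Q1", "Q2", "Q1"],
    "answer": ["yes", "no", "no"],
})

# Hypothetical results frame: indexed by user id, one empty column per question.
results = pd.DataFrame(
    index=pd.Index([1, 2, 3], name="id"),
    columns=["Q1", "Q2", "Q3"],
)

wide = answers.set_index(["id", "question"])["answer"].unstack()

# Conform the pivoted table to the row and column labels of results.
# Labels missing from the pivot (user 3, question Q3) are filled with NaN.
filled = wide.reindex_like(results)
print(filled)
```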

If you have duplicates:

cols = ['id', 'question'] 
answers.drop_duplicates(cols).set_index(cols).answer.unstack() 
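
The sketch below shows why the `drop_duplicates` step matters, using invented data in which one (id, question) pair appears twice. With repeated pairs, `unstack` cannot build the table and raises a `ValueError`; keeping the first row per pair is one possible policy, not necessarily the right one for every dataset.

```python
import pandas as pd

# Hypothetical data: user 1 answered Q1 twice.
answers = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "question": ["Q1", "Q1", "Q2", "Q1"],
    "answer": ["yes", "maybe", "no", "no"],
})

cols = ["id", "question"]

# Count rows whose (id, question) pair repeats an earlier row.
n_dupes = int(answers.duplicated(cols).sum())

# unstack() cannot reshape an index with repeated entries, so it raises.
try:
    answers.set_index(cols)["answer"].unstack()
    unstack_failed = False
except ValueError:
    unstack_failed = True

# Keeping the first row per pair makes the index unique again.
wide = answers.drop_duplicates(cols).set_index(cols)["answer"].unstack()
print(n_dupes, unstack_failed)
print(wide)
```

Checking `answers.duplicated(cols).sum()` before pivoting is a cheap way to find out whether this step is needed at all.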

Hmm, the problem is that column 1 of the Answers sheet has repeated user ids, one row for each question a user answered.

@MiriamAlh Yes, that's why I set the index on both 'id' and 'question'. – piRSquared

@MiriamAlh Do you have sample data I can demonstrate with? It's very hard to talk about a dataset I can't see. – piRSquared