2015-10-16 103 views
3

import pandas as pd 
import numpy as np 
import random 

labels = ["c1","c2","c3"] 
c1 = ["one","one","one","two","two","three","three","three","three"] 
c2 = [random.random() for i in range(len(c1))] 
c3 = ["alpha","beta","gamma","alpha","gamma","alpha","beta","gamma","zeta"] 
DF = pd.DataFrame(np.array([c1,c2,c3])).T 
DF.columns = labels 

Pandas: most efficient way to make a dictionary of dictionaries from DataFrame columns

The DataFrame looks like this:

      c1               c2     c3
0    one   0.440958516531  alpha
1    one   0.476439953723   beta
2    one   0.254235673552  gamma
3    two   0.882724336464  alpha
4    two    0.79817899139  gamma
5  three   0.677464637887  alpha
6  three   0.292927670096   beta
7  three  0.0971956881825  gamma
8  three   0.993934915508   zeta

The only way I can think of to build the dictionary is:

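D_greek_value = {}

# One full pass over DF for every distinct value in c3
for greek in set(DF["c3"]):
    D_c1_c2 = {}
    for i in range(DF.shape[0]):
        row = DF.iloc[i, :]
        # Access by column label; positional row[2] is deprecated on a labeled Series
        if row["c3"] == greek:
            D_c1_c2[row["c1"]] = row["c2"]
    D_greek_value[greek] = D_c1_c2
D_greek_value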

The resulting dictionary looks like this:

{'alpha': {'one': '0.67919712421', 
    'three': '0.67171020684', 
    'two': '0.571150669821'}, 
'beta': {'one': '0.895090207979', 'three': '0.489490074662'}, 
'gamma': {'one': '0.964777504708', 
    'three': '0.134397632659', 
    'two': '0.10302290374'}, 
'zeta': {'three': '0.0204226923557'}} 
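
A side note on the quoted numbers above: they are strings because `np.array([c1, c2, c3])` has to pick a single dtype for mixed strings and floats, and it picks a string dtype. A minimal sketch of one way to keep `c2` numeric, assuming the same three lists as in the question, is to build the DataFrame from a dict of columns instead:

```python
import random

import pandas as pd

c1 = ["one", "one", "one", "two", "two", "three", "three", "three", "three"]
c2 = [random.random() for _ in c1]
c3 = ["alpha", "beta", "gamma", "alpha", "gamma", "alpha", "beta", "gamma", "zeta"]

# Each column keeps its own dtype when the frame is built column by column,
# so c2 stays float64 instead of being coerced to strings.
DF = pd.DataFrame({"c1": c1, "c2": c2, "c3": c3})
print(DF.dtypes)
```

With this construction the nested dictionary holds floats rather than their string representations.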

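I don't want to rely on c1 coming in blocks ("one" being together every time). I'm doing this on a csv that is a few hundred MB, and I feel like I'm doing it all wrong. If you have any ideas, please help!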

Answers

4

IIUC, you can leverage groupby to handle most of the work:

>>> import pprint
>>> result = df.groupby("c3")[["c1","c2"]].apply(lambda x: dict(x.values)).to_dict() 
>>> pprint.pprint(result) 
{'alpha': {'one': 0.440958516531, 
      'three': 0.677464637887, 
      'two': 0.8827243364640001}, 
'beta': {'one': 0.47643995372299996, 'three': 0.29292767009599996}, 
'gamma': {'one': 0.254235673552, 
      'three': 0.0971956881825, 
      'two': 0.79817899139}, 
'zeta': {'three': 0.993934915508}} 

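Some explanation. First we group by c3 and select the columns c1 and c2. This gives us the groups we want to turn into dictionaries: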

>>> grouped = df.groupby("c3")[["c1", "c2"]] 
>>> grouped.apply(lambda x: print(x,"\n","--")) # just for display purposes 
     c1     c2 
0 one 0.679926178687387 
3 two 0.11495090934413166 
5 three 0.7458197179794177 
-- 
     c1     c2 
0 one 0.679926178687387 
3 two 0.11495090934413166 
5 three 0.7458197179794177 
-- 
     c1     c2 
1 one 0.12943266757277916 
6 three 0.28944292691097817 
-- 
     c1     c2 
2 one 0.36642834809341274 
4 two 0.5690944224514624 
7 three 0.7018221838129789 
-- 
     c1     c2 
8 three 0.7195852795555373 
-- 

Given one of these subframes, say the next-to-last one, we need to figure out a way to turn it into a dictionary. For example:

>>> d3 
     c1  c2 
2 one 0.366428 
4 two 0.569094 
7 three 0.701822 

If we try dict or to_dict, we don't get what we want, because the index and the column labels get in the way:

>>> dict(d3) 
{'c1': 2  one 
4  two 
7 three 
Name: c1, dtype: object, 'c2': 2 0.366428 
4 0.569094 
7 0.701822 
Name: c2, dtype: float64} 
>>> d3.to_dict() 
{'c1': {2: 'one', 4: 'two', 7: 'three'}, 'c2': {2: 0.36642834809341279, 4: 0.56909442245146236, 7: 0.70182218381297889}} 

But we can sidestep this by dropping down to the underlying data with .values, which can then be passed to dict:

>>> d3.values 
array([['one', 0.3664283480934128], 
     ['two', 0.5690944224514624], 
     ['three', 0.7018221838129789]], dtype=object) 
>>> dict(d3.values) 
{'three': 0.7018221838129789, 'one': 0.3664283480934128, 'two': 0.5690944224514624} 
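
As a hedged aside, the same pairing can be sketched without going through `.values`: `dict` accepts any iterable of (key, value) pairs, so zipping the two columns works as well. The small frame below is a hand-built stand-in for `d3`, with rounded values chosen for illustration:

```python
import pandas as pd

# Stand-in for the d3 subframe shown above (values rounded for readability)
d3 = pd.DataFrame(
    {"c1": ["one", "two", "three"], "c2": [0.366428, 0.569094, 0.701822]},
    index=[2, 4, 7],
)

# Each row of the 2-column array is treated as a (key, value) pair.
via_values = dict(d3.values)

# Equivalent pairing built directly from the two columns.
via_zip = dict(zip(d3["c1"], d3["c2"]))

print(via_values)
print(via_zip)
```

Both forms ignore the index and the column labels, which is exactly what is needed here.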

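So if we apply this, we get a Series with the c3 keys as the index and the dictionaries we want as the values, which we can turn into a dictionary using .to_dict():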

>>> result = df.groupby("c3")[["c1", "c2"]].apply(lambda x: dict(x.values)) 
>>> result 
c3 
alpha {'three': '0.7458197179794177', 'one': '0.6799... 
beta  {'one': '0.12943266757277916', 'three': '0.289... 
gamma {'three': '0.7018221838129789', 'one': '0.3664... 
zeta      {'three': '0.7195852795555373'} 
dtype: object 
>>> result.to_dict() 
{'zeta': {'three': '0.7195852795555373'}, 'gamma': {'three': '0.7018221838129789', 'one': '0.36642834809341274', 'two': '0.5690944224514624'}, 'beta': {'one': '0.12943266757277916', 'three': '0.28944292691097817'}, 'alpha': {'three': '0.7458197179794177', 'one': '0.679926178687387', 'two': '0.11495090934413166'}} 
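
To make the path from `.apply` to `.to_dict()` concrete, here is a minimal self-contained sketch that splits the one-liner into named steps. The fixed `c2` values and the names `grouped` and `per_group` are illustrative choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": ["one", "one", "one", "two", "two", "three", "three", "three", "three"],
    "c2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "c3": ["alpha", "beta", "gamma", "alpha", "gamma", "alpha", "beta", "gamma", "zeta"],
})

# Step 1: one (c1, c2) subframe per distinct value of c3.
grouped = df.groupby("c3")[["c1", "c2"]]

# Step 2: apply calls the lambda on each subframe; each call returns a plain
# dict, and pandas collects them into a Series indexed by the c3 value.
per_group = grouped.apply(lambda sub: dict(sub.values))

# Step 3: Series.to_dict() maps index labels to values, giving the dict of dicts.
result = per_group.to_dict()
print(result)
```

So the lambda builds the inner dictionaries, and `.to_dict()` only supplies the outer one.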
+1

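Very nice. I wonder whether this is faster than what I posted. I'd expect 'groupby' to be very fast, but the lambda may slow it down. I'm too lazy to time it, though. –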

+2

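@StevenRumbalski: Me too. :-) I tried to see whether I could get the same result using vectorized operations, but bounced off; someone else may come up with something cleverer. But I think you've put your finger on the big problem (too many iterations), and everything beyond that is minor. – DSM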

+0

@DSM I know how to use a lambda function for sorting, but what exactly is happening from ".apply" to ".to_dict()"? –

3

You are iterating over the DataFrame once for every unique Greek letter. It's better to iterate just once.

Since you want a dict of dicts, you can use a collections.defaultdict with dict as the default constructor for the nested dicts:

from collections import defaultdict 

result = defaultdict(dict) 
for dx, num_word, val, greek in DF.itertuples(): 
    result[greek][num_word] = val 

Or use a plain dict and call setdefault to create the nested dicts:

result = {} 
for dx, num_word, val, greek in DF.itertuples(): 
    result.setdefault(greek, {})[num_word] = val
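
Since the question mentions a csv of a few hundred MB, the single-pass idea can also be sketched chunk by chunk, so the whole file never has to sit in memory at once. This is only a sketch under assumptions: the column names `c1`, `c2`, `c3`, the tiny in-memory `csv_text` standing in for the real file, and the chunk size are all made up for illustration.

```python
import io
from collections import defaultdict

import pandas as pd

# Stand-in for the real file; note that the c1 values are not grouped in blocks.
csv_text = """c1,c2,c3
three,0.6,alpha
one,0.1,alpha
two,0.5,gamma
one,0.2,beta
three,0.9,zeta
one,0.3,gamma
two,0.4,alpha
three,0.7,beta
three,0.8,gamma
"""

result = defaultdict(dict)

# read_csv with chunksize yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each row is visited exactly once, whatever order the rows arrive in.
    for num_word, val, greek in zip(chunk["c1"], chunk["c2"], chunk["c3"]):
        result[greek][num_word] = val

result = dict(result)  # plain dict, so missing keys raise KeyError again
print(result)
```

If the same (c3, c1) pair occurs more than once, the last value seen wins, which matches the behavior of the loops above.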