2017-10-17 61 views
1

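Processing a list of strings: remove duplicates and add up the corresponding values

I have a CSV containing about 10K (10,000) rows that look like this: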

1: ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'] 
... 
N: ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70'] 

I need to combine the duplicated names and sum their values, for example:

['Andhra Pradesh-133', 'Meetai-5781'] // 5781 = 1358 + 2146 + 2277 

Can anyone suggest a fast way to do this?

Answers

0

Use a list comprehension with groupby:

from itertools import groupby 
import pandas as pd 


df = pd.DataFrame({'a':[['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'], 
         ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']]}) 


data = [] 
for x in df['a']: 
    b = [a.split('-') for a in x] 
    L = [t for k, g in groupby(b, key=lambda x: x[0]) 
     for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(L) 

print (data) 

[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']] 

df['b'] = data 
print (df) 

                                                   a  \
0  [Andhra Pradesh-133, Meetai-1358, Meetai-2146,...
1    [Andhra Pradesh-20, Rajasthan-60, Rajasthan-70]

                                    b
0   [Andhra Pradesh-133, Meetai-5781]
1  [Andhra Pradesh-20, Rajasthan-130]

Edit:

data = [] 
for line in open('file.csv'): 
    #strip new-line characters, split by [ and get second list 
    items = line.strip('\r\n" ]').split('[')[1] 
    #split lines, remove whitespace 
    items = [item.strip("' ") for item in items.split(',')] 
    #split to sublist 
    items = [a.split('-') for a in items] 
    #sum splitted sublists 
    items = [t for k, g in groupby(items, key=lambda x: x[0]) 
       for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(items) 

print (data) 
[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']] 

Edit: if the input is a file

Solution:

You need to split on the first occurrence of [ and then strip [] as well:

data = [] 
for line in open('file.csv'): 
    #strip new-line characters, split by [ and get second list 
    items = line.strip('\r\n" ]').split('[', 1)[1] 
    #split lines, remove whitespace 
    items = [item.strip("'[] ") for item in items.split(',')] 
    #split to sublist 
    items = [a.split('-') for a in items] 
    print (items) 
    #sum splitted sublists 
    items = [t for k, g in groupby(items, key=lambda x: x[0]) 
       for t in [k + '-' + str(sum((int(j) for i, j in g)))]] 
    data.append(items) 
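
As a possible alternative to the manual stripping above, here is a minimal sketch that recovers the list with ast.literal_eval from the standard library. It assumes each line contains exactly one Python-style list literal, e.g. 1: ['Andhra Pradesh-133', ...]; the real file layout is not shown in the question, and parse_row is a hypothetical helper name.

```python
import ast


def parse_row(line):
    """Return the list literal embedded in one line (hypothetical helper)."""
    # Assumption: the list starts at the first '[' and ends at the last ']'.
    start = line.index('[')
    end = line.rindex(']') + 1
    return ast.literal_eval(line[start:end])


sample = "1: ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277']\n"
row = parse_row(sample)
print(row)
```

The parsed row can then be passed to the same groupby step as before.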
+0

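I have a small doubt here. If I consider x = ['Panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51'], the groupby function does not seem to work. With b = [a.split('-') for a in x] and for k, g in groupby(b, key=lambda x: x[0]):, the 'uttar Pradesh' entries are not grouped together even though they are the same. Can you help me understand what I am missing?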

+0

I think the problem is the double '[['. I edited the answer. – jezrael

+0

Apologies for the typo. The list I am trying to process is x = ['panjim-20','Uttar Pradesh-23185','Gujurat-1013','Uttar Pradesh-51'].
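
A note on the list in this comment: itertools.groupby only merges items that sit next to each other, so the two 'Uttar Pradesh' entries, separated by 'Gujurat-1013', end up in different groups. A minimal sketch of one way around this is to sort by name before grouping. This changes the order of the output, and the variable names here are only illustrative.

```python
from itertools import groupby

x = ['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51']

# rsplit with maxsplit=1 splits on the last '-' only, so a '-' inside a name survives
pairs = [item.rsplit('-', 1) for item in x]
# groupby only merges adjacent items, so bring equal names together first
pairs.sort(key=lambda p: p[0])

result = [name + '-' + str(sum(int(value) for _, value in group))
          for name, group in groupby(pairs, key=lambda p: p[0])]
print(result)
# ['Gujurat-1013', 'Uttar Pradesh-23236', 'panjim-20']
```

If the original order matters, accumulating into a dictionary avoids the sort.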

0

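I would create a dictionary for each row. Parse each entry into its string and its number, either by splitting or with a regular expression. The string, e.g. 'Andhra Pradesh', is the key and the value is an integer. Add each number to the value of the dictionary entry identified by its string.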
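
The description above could be sketched roughly as follows. This is only one way to read it: combine_row is a hypothetical helper name, and the sketch relies on dictionaries preserving insertion order (Python 3.7+) so that names come out in order of first appearance.

```python
def combine_row(row):
    """Sum the numbers per name within one row (hypothetical helper)."""
    totals = {}
    for item in row:
        # split on the last '-' so the name itself may contain a '-'
        name, number = item.rsplit('-', 1)
        totals[name] = totals.get(name, 0) + int(number)
    return [name + '-' + str(total) for name, total in totals.items()]


rows = [
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'],
    ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70'],
]
combined = [combine_row(row) for row in rows]
print(combined)
# [['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']]
```

Unlike groupby, this does not require equal names to be adjacent within a row.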

0

Not sure this is the fastest way to do it, but this works for me:

data = [ 
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'], 
    ['Andhra Pradesh-20','Rajasthan-60','Rajasthan-70'] 
] 

# running total per name, accumulated across all rows 
values = {} 
for row in data: 
    for x in row: 
        tokens = x.split('-') 
        values[tokens[0]] = values.get(tokens[0], 0) + int(tokens[1]) 

# build the output once, after all rows have been processed 
out = [x + '-' + str(y) for x, y in values.items()] 

print(out) # prints: ['Andhra Pradesh-153', 'Meetai-5781', 'Rajasthan-130'] 
0

In pandas, you can do:

In [3475]: L = ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'] 

In [3476]: s = (pd.DataFrame(x.split('-') for x in L) 
        .assign(v=lambda x: x[1].astype(int)) 
        .groupby(0)['v'].sum()) 

In [3478]: (s.index + '-' + s.values.astype(str)).tolist() 
Out[3478]: ['Andhra Pradesh-133', 'Meetai-5781'] 

Details

In [3480]: pd.DataFrame(x.split('-') for x in L) 
Out[3480]: 
                0     1
0  Andhra Pradesh   133
1          Meetai  1358
2          Meetai  2146
3          Meetai  2277

Column 1 holds strings (object dtype), so we use assign to add a column v by typecasting it to int:

In [3481]: pd.DataFrame(x.split('-') for x in L).assign(v=lambda x: x[1].astype(int)) 
Out[3481]: 
                0     1     v
0  Andhra Pradesh   133   133
1          Meetai  1358  1358
2          Meetai  2146  2146
3          Meetai  2277  2277

In [3479]: s 
Out[3479]: 
0 
Andhra Pradesh     133
Meetai            5781
Name: v, dtype: int32 