2017-05-26 37 views

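I have a pandas.DataFrame with redundant column names, because the source files (.csv) named their columns inconsistently. Concatenating those files produced a DataFrame in which the inconsistently named columns are mostly NaN: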

Bike # Bikenumber Bike# SubscriberType SubscriptionType 
NaN  NaN  W20848  NaN    Subscriber 
NaN  NaN  W20231  NaN    Subscriber 
NaN  NaN  W00785  NaN    Subscriber 
NaN  NaN  W00126  NaN    Subscriber 
NaN  NaN  W20929  NaN    Casual 

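Is there a way to create a new column and fill it from whichever of several columns has a value? And if more than one of those columns is not NaN, can I choose which column the value is taken from?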

Bike# Bikenumber Bike # Selected_Num 
number1 number2  NaN  number2 

When I fill from a single column, I can get this:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#']) 
print(sample) 

    Bike # Bikenumber Bike# SubscriberType SubscriptionType Bike_Num 
    NaN  NaN  W20848  NaN    Subscriber  W20848 
    NaN  NaN  W20231  NaN    Subscriber  W20231 
    NaN  NaN  W00785  NaN    Subscriber  W00785 
    NaN  NaN  W00126  NaN    Subscriber  W00126 
    NaN  NaN  W20929  NaN    Casual   W20929 

This fails:

sample['Bike_Num'] = sample['Bike #'].fillna(sample['Bike#'], sample['Bikenumber']) 

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). 

Wouldn't it be better to clean the data up at the time it is read from the CSV? How is the data being read from the csv files?


@StephenRauch: I read ~20 csv files from a directory with a `for` loop and concatenate them with `total_df = pd.concat(dfs, ignore_index=True)`.


Are you using `pandas.read_csv`? And do I understand correctly that you basically have a list of synonyms for some of the column names?

1 Answer

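I suggest fixing this at the time the CSVs are read, rather than trying to untangle the columns afterwards. One way to do that is to run each CSV file through a small parser before handing it to pandas.

The parser takes an open file handle for the csv and a dict that maps each desired column name to its possible synonyms.

Code: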

import csv

def read_my_csv(file_handle, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synonyms = {syn: name
                for name, syns in column_map.items() for syn in syns}

    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and map columns to desired names
    header = next(reader)
    header = [synonyms.get(c, c) for c in header]

    # yield the header
    yield header

    # yield the remaining rows
    for row in reader:
        yield row

Test Code:

import pandas as pd 
import csv 

column_map = { 
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'), 
    'Sub_Num': ('SubscriberType', 'SubscriptionType'), 
} 

# the 'U' mode flag was removed in Python 3.11; newline='' is what the csv module expects
with open("sample.csv", newline='') as f:
    generator = read_my_csv(f, column_map) 
    columns = next(generator) 
    df = pd.DataFrame(generator, columns=columns) 

print(df) 

Sample.csv:

Bike #,SubscriptionType 
W20848,Subscriber 
W20231,Subscriber 
W00785,Subscriber 
W00126,Subscriber 
W20929,Casual 

Results:

Bike_Num  Sub_Num 
0 W20848 Subscriber 
1 W20231 Subscriber 
2 W00785 Subscriber 
3 W00126 Subscriber 
4 W20929  Casual 
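Since the question mentions roughly 20 files joined with `pd.concat`, here is a minimal sketch of how the parser might be applied to several files before concatenating. The helper `load_frame` and the in-memory `fake_files` are hypothetical stand-ins (real code would open the files on disk), and the sketch assumes no single file contains two synonyms for the same column, since that would produce duplicate column names.

```python
import csv
import io

import pandas as pd


def read_my_csv(file_handle, column_map):
    # map every synonym to its desired column name
    synonyms = {syn: name
                for name, syns in column_map.items() for syn in syns}
    reader = csv.reader(file_handle)
    # first yield the renamed header, then the untouched data rows
    header = next(reader)
    yield [synonyms.get(c, c) for c in header]
    yield from reader


def load_frame(file_handle, column_map):
    # hypothetical helper: build one DataFrame from one open csv file
    rows = read_my_csv(file_handle, column_map)
    columns = next(rows)
    return pd.DataFrame(list(rows), columns=columns)


column_map = {
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'),
    'Sub_Num': ('SubscriberType', 'SubscriptionType'),
}

# hypothetical stand-ins for two files with differently named columns
fake_files = [
    io.StringIO("Bike #,SubscriptionType\nW20848,Subscriber\n"),
    io.StringIO("Bikenumber,SubscriberType\nW20929,Casual\n"),
]

dfs = [load_frame(f, column_map) for f in fake_files]
total_df = pd.concat(dfs, ignore_index=True)
print(total_df)
```

Because every frame already uses the desired names, the concatenated result should have only `Bike_Num` and `Sub_Num`, with no mostly-NaN duplicates.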

Solution #2

A cleaner, though not nearly as fun, solution is to rename the columns before doing the concat:

Code:

def fix_column_names(df, column_map):
    # reverse the column mapping dict to use for synonym lookup
    synonyms = {syn: name
                for name, syns in column_map.items() for syn in syns}

    # rename columns in place
    df.columns = [synonyms.get(c, c) for c in df.columns]

Test Code:

import pandas as pd 
import csv 

column_map = { 
    'Bike_Num': ('Bike #', 'Bikenumber', 'Bike#'), 
    'Sub_Num': ('SubscriberType', 'SubscriptionType'), 
} 

df = pd.read_csv('sample.csv', header=0) 
fix_column_names(df, column_map) 
print(df) 
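If the frames have already been concatenated, as in the question, the columns can also be merged after the fact. As far as I know, `fillna` accepts only one fill value, and in the pandas versions of that time its second positional parameter was `method`, which is probably why passing a second Series raised the `ValueError`. A minimal sketch, assuming made-up sample values and a hypothetical `priority` list that decides which column wins when several are not NaN:

```python
import numpy as np
import pandas as pd

# made-up data: some rows have values in more than one column
sample = pd.DataFrame({
    'Bike #': [np.nan, 'W00001', np.nan],
    'Bikenumber': [np.nan, 'W00002', 'W00003'],
    'Bike#': ['W20848', np.nan, np.nan],
})

# hypothetical priority order: earlier columns win over later ones
priority = ['Bikenumber', 'Bike #', 'Bike#']

# start from the highest-priority column and fill its gaps
# from each lower-priority column in turn
result = sample[priority[0]]
for col in priority[1:]:
    result = result.fillna(sample[col])

sample['Bike_Num'] = result
print(sample)
```

Reordering `priority` changes which value is kept for rows such as the second one, where two columns are filled.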

Very nice, thank you so much! I'm still quite new to python, and this is an approach I hadn't considered. Love it! :)