0

I need to perform this set of pandas DataFrame operations on several files and then merge results across those files.

import pandas as pd

df1 = pd.read_csv("~/pathtofile/sample1.csv")
some_df = pd.read_csv("~/pathtofile/metainfo.csv")
df1 = df1.sort_values('col2')  # sort_values returns a new frame, so assign it back
df1 = df1[df1.col5 != 'N'] 
df1['new_col'] = df1['col3'] - df1['col2'] + 1 
f = lambda row: '{col1}:{col2}-{col3}({col4})'.format(**row) 
df1.astype(str).apply(f,1) 
df4 = df1.assign(Unique=df1.astype(str).apply(f,1)) 
# print(df4) 
##merge columns 
df44 = df4.merge(some_df, left_on='genes', right_on='name', suffixes=('','_1')) 
df44 = df44.rename(columns={'id':'id_new'}).drop(['name_1'], axis=1) 
# print(df44) 
df44['some_col'] = (df44['some_col'] + ':E' +
    df44.groupby('some_col').cumcount().add(1).astype(str).str.zfill(3))
print(df44) 
##drop unwanted columns adapted from http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe 
df4 = df44 
df4.drop(df4.columns[[3,7,9,11,12,13]], axis=1, inplace=True) 

df4 = df4[['col0', 'col1', 'col2', 'col4', 'col5', 'col6', 'col8']] 
df4 
df4.to_csv('foo.csv', index=False) 

The code above handles just one file. A few questions: 1) I have ~15 files; how do I run this set of commands on all 15 of them? 2) and write the output to 15 different CSVs? 3) then merge certain columns from all 15 DFs and make a matrix (the example below merges just 3 DFs):

sample1 = df4.set_index('col1')["col4"]
sample2 = df5.set_index('col1')["col4"]
sample3 = df6.set_index('col1')["col4"]
concat = pd.concat([sample1,sample2,sample3], axis=1).fillna(0) 
# print(concat) 
concat.reset_index(level=0, inplace=True) 
concat.columns = ["newcol0", "col1", "col2", "col3"] 
concat.to_csv('bar.csv', index=False) 

Is there a better way to do this than copy-pasting the block 15 times?

+0

Yes, make a script and generalize your operations into functions –

+0

Hi @DmitryPolonskiy, could you show a snippet of how to do that? – novicebioinforesearcher

+0

You don't know how to write a script? –

Answer

1

OK, I just quickly put this together for the code above. I would suggest learning how to write scripts and how to generalize things. I haven't cleaned up the code or dealt with the redundancies; I'll leave that to you. If the code you posted works, this should work from the command line.

import sys 
import pandas as pd 

def load_df(input_file): 
    df = pd.DataFrame(pd.read_csv(input_file)) 
    return df 

def perform_operations(df): 
    df = df.sort_values('col2')  # assign the result; sort_values is not in-place
    df = df[df.col5 != 'N'] 
    df['new_col'] = df['col3'] - df['col2'] + 1 
    f = lambda row: '{col1}:{col2}-{col3}({col4})'.format(**row) 
    df.astype(str).apply(f,1) 
    df4 = df.assign(Unique=df.astype(str).apply(f,1)) 
    return df4 

def merge_stuff(df, df1): 
    df44 = df.merge(df1, left_on='genes', right_on='name', suffixes=('','_1')) 
    df44 = df44.rename(columns={'id':'id_new'}).drop(['name_1'], axis=1) 
    return df44 


def group_and_drop(df): 
    df['some_col'] = (df['some_col'] + ':E' +
        df.groupby('some_col').cumcount().add(1).astype(str).str.zfill(3))
    df4 = df 
    df4.drop(df4.columns[[3,7,9,11,12,13]], axis=1, inplace=True) 
    return df4 

def write_out_csv(df): 
    df = df[['col0', 'col1', 'col2', 'col4', 'col5', 'col6', 'col8']] 
    df.to_csv('foo.csv', index=False) 


def main(): 
    file_1 = sys.argv[1] 
    file_2 = sys.argv[2] 
    df = load_df(file_1) 
    df1 = load_df(file_2) 
    df4 = perform_operations(df) 
    df44 = merge_stuff(df4, df1) 
    grouped = group_and_drop(df44) 
    write_out_csv(grouped) 

if __name__ == '__main__': 
    main() 
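
The functions above process one sample file at a time. To cover parts 1) and 2) of the question without copy-pasting the script ~15 times, a minimal sketch along the same lines could loop over the inputs; the glob pattern, the per-file output names, and the `results` dict are assumptions for illustration, not part of the original post:

import glob
import os

import pandas as pd

meta_df = pd.read_csv("~/pathtofile/metainfo.csv")
# Assumption: the ~15 sample files all match this pattern.
sample_files = sorted(glob.glob(os.path.expanduser("~/pathtofile/sample*.csv")))

results = {}
for path in sample_files:
    df4 = perform_operations(load_df(path))              # functions defined above
    grouped = group_and_drop(merge_stuff(df4, meta_df))
    grouped = grouped[['col0', 'col1', 'col2', 'col4', 'col5', 'col6', 'col8']]
    # One output CSV per input file (the naming scheme is an assumption).
    sample_name = os.path.splitext(os.path.basename(path))[0]
    grouped.to_csv(sample_name + '_out.csv', index=False)
    results[sample_name] = grouped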
+0

Thanks for the help, I will work through this and learn... thank you very much – novicebioinforesearcher

+1

In case you don't know how this works: from the command line you would run something like 'python name_of_script.py location_of_first_csv location_of_second_csv' –
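
For part 3) of the question (one matrix built from 'col1'/'col4' of every processed file), a short sketch in the same spirit, assuming the `results` dict produced by the loop sketched above:

# Align every sample's col4 values on col1 and put them side by side,
# one column per sample, filling missing entries with 0 as in the question.
series = {name: df.set_index('col1')['col4'] for name, df in results.items()}
matrix = pd.concat(series, axis=1).fillna(0)
matrix.reset_index(inplace=True)
matrix.to_csv('bar.csv', index=False)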