2016-09-20 64 views
1

我一直在尝试这一段时间,并且卡住了。这里是问题:使用其他数据框替换数据帧中的值并将字符串作为键与熊猫

我正在处理一些关于我在CSV文件中的文本的元数据。它看起来像这样:

enter image description here

真正的表是更长,更复杂,但是也遵循同样的逻辑:每一行是一个文本,每列是文本的不同方面。我在一些专栏中有很多变化,我希望它能够更简单地进行改造。例如,从叙述角度改变同源性和自动性的价值观到非异源性的价值观。我定义在另一个CSV文件中这种新的模式称为关键字,看起来像这样:

enter image description here

正如你所看到的,元数据的每一列成为行新模式的关键字,在旧值在term_value列中,新值在new_model列中。

所以我需要映射或使用熊猫来替换这个值。这是我得到了现在:

import re 
import pandas as pd 

df_metadata = pd.read_csv("/metadata.csv", encoding="utf-8", sep=",") 
df_keywords = pd.read_csv("/keywords.csv", encoding="utf-8", sep="\t") 

for column_metadata,value_metadata in df_metadata.iteritems(): 

    if str(column_metadata) in list(df_keywords.loc[:,"term_type"]): 

     df_metadata.loc[df_metadata[column_metadata] == value_metadata, column_metadata] = df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]] 

和Python总是给这个错误回:

"ValueError: Series lengths must match to compare"

我认为这个问题是在的第二部分的value_metadata与LOC替换,我这里的意思:

df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]] 

我不明白的是为什么value_metadata作品在此命令的第一部分,但在第二个没有做......

请,我将不胜感激任何帮助。也许有比遍历数据框更简单的方法...我对任何建议都非常开放。最好的问候, 何塞

回答

1

你可以先通过slicers和由旧环map新的价值创造df_keywordsMultiindex更快的选择:

df_keywords.set_index(['term_type','term_value'], inplace=True) 

idx = pd.IndexSlice 
#first maping in column narrative-perspective 
print (df_keywords.loc[idx['narrative-perspective',:]].to_dict()['new_model']) 
{'heterodiegetic': 'heterodiegetic', 'other/mixed': 'n-heterodiegetic', 
'homodiegetic': 'n-heterodiegetic', 'autodiegetic': 'n-heterodiegetic'} 

#column names for replacing  
L = ['narrative-perspective','narrator','protagonist-gender'] 
for col in L: 
    df_metadata[col] = 
    df_metadata[col].map(df_keywords.loc[idx[col,:]].to_dict()['new_model']) 

print (df_metadata) 
    idno author-name narrative-perspective  narrator protagonist-gender 
0 ne0001  Baroja  n-heterodiegetic third-person    male 
1 ne0002  Galdos  heterodiegetic third-person    n-male 
2 ne0003  Galdos  n-heterodiegetic third-person    male 
3 ne0004  Galdos  n-heterodiegetic third-person    n-male 
4 ne0005  Galdos  heterodiegetic third-person    n-male 
5 ne0006  Galdos  heterodiegetic third-person    male 
6 ne0007  Sawa  heterodiegetic third-person    n-male 
7 ne0008 Zamacois  heterodiegetic third-person    n-male 
8 ne0009  Galdos  heterodiegetic third-person    n-male 
9 ne0011  Galdos  n-heterodiegetic n-third-person    male 

to_dict可以被遗漏的,然后通过Series地图:

df_keywords.set_index(['term_type','term_value'], inplace=True) 
idx = pd.IndexSlice 

#first maping in column narrative-perspective 
print (df_keywords.loc[idx['narrative-perspective',:]]['new_model']) 
term_value 
autodiegetic  n-heterodiegetic 
heterodiegetic  heterodiegetic 
homodiegetic  n-heterodiegetic 
other/mixed  n-heterodiegetic 
Name: new_model, dtype: object 

L = ['narrative-perspective','narrator','protagonist-gender'] 
for col in L: 
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]]['new_model']) 

print (df_metadata) 
    idno author-name narrative-perspective  narrator protagonist-gender 
0 ne0001  Baroja  n-heterodiegetic third-person    male 
1 ne0002  Galdos  heterodiegetic third-person    n-male 
2 ne0003  Galdos  n-heterodiegetic third-person    male 
3 ne0004  Galdos  n-heterodiegetic third-person    n-male 
4 ne0005  Galdos  heterodiegetic third-person    n-male 
5 ne0006  Galdos  heterodiegetic third-person    male 
6 ne0007  Sawa  heterodiegetic third-person    n-male 
7 ne0008 Zamacois  heterodiegetic third-person    n-male 
8 ne0009  Galdos  heterodiegetic third-person    n-male 
9 ne0011  Galdos  n-heterodiegetic n-third-person    male 
+0

哇,很多很多谢谢! :) 有用!你会提出什么建议作为最好的方法来创建元数据的列表,以重塑取决于两个文件的输入?因为现在我有这三列,但明天我可能有20 ...我做到了这一点,它的工作原理,但肯定你有更好的方式: ' list_ = []; (df_keywords.loc [:,“term_type”])中的str(column_name): list_.append(column_name); print(list_); ' 我不会让评论的代码渲染好:(对不起! –

+0

我认为最简单的是['drop_duplicates'](http://pandas.pydata.org/pandas-docs/stable/)生成/ pandas.Series.drop_duplicates.html) - 'print(df_keywords.term_type.drop_duplicates()。tolist())' – jezrael

+0

很多很多谢谢!祝你有美好的一天! –