Python: comparing strings in columns of two different text files


I have two text files, “animals.txt” and “colors.txt”, shown below. The two strings on each line are separated by a tab.

“animals.txt”

12345 dog 

23456 sheep 

34567 pig

“colors.txt”

34567 pink 

12345 black 

23456 white

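I want to write Python code that:

For every line in “animals.txt”, takes the string in the first column (12345, then 23456, then 34567)
Compares this string with the strings in the first column of “colors.txt”
If it finds a match (12345 == 12345, etc.), writes to two output files:

OUTPUT1, containing the line from animals.txt plus the value in the second column of colors.txt that corresponds to the query value (12345):

12345 dog black 
23456 sheep white 
34567 pig pink

OUTPUT2, containing a list of the values in the second column of colors.txt that correspond to the query values (12345, then 23456, then 34567):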

black 
white 
pink

Source

2012-07-17 user1532389

What have you tried? – Dhara 2012-07-17 16:57:06

Do you need to use Python? If you are using bash and your inputs are not already sorted, do this:

$ join -t $'\t' <(sort animals.txt) <(sort colors.txt) > output1 
$ cut -f 3 output1 > output2

If your shell does not support process substitution, sort the input files first and then run:

$ join -t '<tab>' animals.txt colors.txt > output1 
$ cut -f 3 output1 > output2

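where <tab> is an actual tab character. Depending on your shell, you may be able to enter it by pressing Ctrl-V followed by the Tab key. (Or use a different delimiter for cut.)

Source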

2012-07-17 17:03:23

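Your sorting is wrong: “animals.txt” is already sorted; it is “colors.txt” that needs sorting. Note that in bash you can use '$'\t'' to represent a tab. Since only one file needs sorting, you can run 'sort colors.txt | join -t $'\t' animals.txt -'. – 2012-07-17 17:09:58

@sven Thanks for pointing out '$'\t''. – 2012-07-17 17:21:52

If order doesn't matter, this becomes a very simple problem: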

with open('animals.txt') as f1, open('colors.txt') as f2:
    # Map each ID (first column) to the animal name (second column).
    animals = {}
    for line in f1:
        # Strip the newline first so it does not end up inside the value.
        animal_id, animal_type = line.rstrip('\n').split('\t')
        animals[animal_id] = animal_type

    # animals = dict(map(str.split, f1)) would work instead of the above loop if there are no multi-word entries.

    # Map each ID to the color name in the same way.
    colors = {}
    for line in f2:
        color_id, color_name = line.rstrip('\n').split('\t')
        colors[color_id] = color_name

    # colors = dict(map(str.split, f2)) would work instead of the above loop if there are no multi-word entries.
    # Thanks @Sven for pointing this out.

common = set(animals) & set(colors)  # set intersection: IDs present in both files
with open('output1.txt', 'w') as f1, open('output2.txt', 'w') as f2:
    for i in common:  # iterate over sorted(common, key=int) instead to sort by ID
        f1.write("%s\t%s\t%s\n" % (i, animals[i], colors[i]))
        f2.write("%s\n" % colors[i])

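You might be able to do this a bit more elegantly with a defaultdict: append to a list each time you encounter a particular key, then test that the list has length 2 before writing the output. I'm not convinced that approach is any better, though.

Source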
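The defaultdict idea mentioned above could look roughly like the following. This is only a sketch: the helper name join_by_id and the in-memory sample lines (including the unmatched 99999 entry) are made up for illustration, and it assumes each ID occurs at most once per file, since a duplicate ID within one file would also produce a list of length 2.

```python
from collections import defaultdict


def join_by_id(animal_lines, color_lines):
    """Group the second-column values of both inputs under their shared ID."""
    grouped = defaultdict(list)
    # Animals are processed first, so each list reads [animal, color].
    for line in list(animal_lines) + list(color_lines):
        key, value = line.rstrip('\n').split('\t')
        grouped[key].append(value)
    # An ID that occurs in both inputs has collected exactly two values.
    return {key: values for key, values in grouped.items() if len(values) == 2}


# Hypothetical sample data; 99999 has no color and should be dropped.
animal_lines = ["12345\tdog\n", "23456\tsheep\n", "34567\tpig\n", "99999\tcat\n"]
color_lines = ["34567\tpink\n", "12345\tblack\n", "23456\twhite\n"]

matches = join_by_id(animal_lines, color_lines)
for key in sorted(matches, key=int):
    print("\t".join([key] + matches[key]))
```

Compared with the two-dictionary version, this saves one dictionary but hides the "present in both files" test inside a length check, which is probably why it is not obviously an improvement.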

2012-07-17 17:21:39 mgilson

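You could also do 'animals = dict(map(str.split, f1))'. – 2012-07-17 17:38:57

@SvenMarnach - Good point. For some reason I don't tend to use that to create dictionaries very often. One caveat is that it's a bit fragile when it comes to animals with spaces in their names (e.g. "brown spotted lizard"). My original version, which used a bare 'split', had a similar problem. I've updated. – mgilson 2012-07-17 17:46:31

Assuming every line in the input files is structured exactly as in the example: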
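To make the caveat above concrete, here is a small sketch with made-up sample lines. It assumes tab-separated input as in the question; the variable names are illustrative only.

```python
lines = ["12345\tdog\n", "23456\tbrown spotted lizard\n"]

# Splitting on the tab only keeps a multi-word name in one piece.
by_tab = dict(line.rstrip('\n').split('\t') for line in lines)

# A bare split() breaks on every run of whitespace, so the second line
# yields four fields instead of two and dict() rejects it.
try:
    by_whitespace = dict(map(str.split, lines))
except ValueError as exc:
    by_whitespace = None
    print("bare split failed:", exc)

print(by_tab["23456"])
```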

with open("c:\\python27\\output1.txt", "w") as out1, \
        open("c:\\python27\\output2.txt", "w") as out2:

    # Pair every animals.txt row with every colors.txt row and keep the pairs whose IDs match.
    for outline in [animal[0] + "\t" + animal[1] + "\t" + color[1]
                    for animal in [line.strip('\n').split("\t")
                                   for line in open("c:\\python27\\animals.txt", "r").readlines()]
                    for color in [line.strip('\n').split("\t")
                                  for line in open("c:\\python27\\colors.txt", "r").readlines()]
                    if animal[0] == color[0]]:

        out1.write(outline + '\n')
        # The color is everything after the last tab.
        out2.write(outline[outline.rfind('\t') + 1:] + '\n')

I think this will do it for you.

Maybe not the most elegant, fast, or clear approach, but it's short. Technically, I believe this is 4 lines.

Source
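For comparison, the same join can be spelled out with a dictionary lookup, which reads colors.txt only once and keeps the output in animals.txt order. This is a sketch rather than a drop-in replacement: the function name join_files is invented here, the demo writes the question's sample data to a temporary directory instead of the paths used above, and it assumes every non-blank line has exactly two tab-separated fields.

```python
import os
import tempfile


def join_files(animals_path, colors_path, out1_path, out2_path):
    """Write the joined rows and the matching colors, in animals.txt order."""
    # Read colors.txt once into an ID -> color lookup table.
    with open(colors_path) as f:
        colors = dict(line.rstrip('\n').split('\t') for line in f if line.strip())

    with open(animals_path) as f, \
            open(out1_path, 'w') as out1, open(out2_path, 'w') as out2:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            animal_id, animal = line.rstrip('\n').split('\t')
            if animal_id in colors:
                out1.write("%s\t%s\t%s\n" % (animal_id, animal, colors[animal_id]))
                out2.write(colors[animal_id] + "\n")


# Demo on the sample data from the question, in a temporary directory.
workdir = tempfile.mkdtemp()
names = ("animals.txt", "colors.txt", "output1.txt", "output2.txt")
paths = {name: os.path.join(workdir, name) for name in names}

with open(paths["animals.txt"], "w") as f:
    f.write("12345\tdog\n23456\tsheep\n34567\tpig\n")
with open(paths["colors.txt"], "w") as f:
    f.write("34567\tpink\n12345\tblack\n23456\twhite\n")

join_files(paths["animals.txt"], paths["colors.txt"],
           paths["output1.txt"], paths["output2.txt"])

with open(paths["output1.txt"]) as f:
    output1 = f.read()
with open(paths["output2.txt"]) as f:
    output2 = f.read()
print(output1)
print(output2)
```

The lookup table avoids re-reading colors.txt for every animal, which matters only once the files grow beyond a handful of lines.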

2012-07-17 18:08:37 selllikesybok

I would use pandas:

from pandas import read_table
animals, colors = read_table('animals.txt', index_col=0), read_table('colors.txt', index_col=0) 
df = animals.join(colors)

Result:

animals.join(colors)
Out[73]:
      animal  color
id
12345    dog  black
23456  sheep  white
34567    pig   pink

Then write the colors to a file, in ID order:

df.color.to_csv(r'out.csv', index=False)

If you can't add column headers to the text files, you can add them on import:

animals = read_table('animals.txt', index_col=0, names=['id','animal'])

Source

2012-07-23 02:04:26 mrjoh3
