2016-06-10 98 views
1

Remove duplicates from a file on Linux, keeping only the last occurrence

Input file:

5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,,user,,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,C 
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H 
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,,user,,f660818af5625b3be61fe12489689601,50328589469,,,30002,C 
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H 
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,,user,,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,C 
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H 
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,,user,,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,C 
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H 
0,,OR1,1000,Nawras,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C 

Desired output:

5,,OR1,1000,UY,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H 
5,,OR2,2000,UY,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H  
5,,OR1,1000,UY,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H  
0,,OR5,5000,UY,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H 
0,,OR1,1000,UY,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C* 

Code used:

for i in `cat file | awk -F, '{print $13}' | sort | uniq` 
do 
grep $i file | tail -1 >> TESTINGGGGGGG_SV 
done 

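This took a very long time because the file is very large, and column 13 alone has 65 million unique values.

So I need a solution that goes through the values of column 13 and outputs, for each value, the line where it last occurs in the file.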

+0

perl -F, -le '$seen{$F[12]} = $_; END { print $seen{$_} for sort keys %seen }' – melpomene

Answers

1

awk to the rescue!

awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}' file 

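This requires the input to be sorted on field 13.

If you can run the script successfully, please post the timings.

If sorting is not possible, another alternative is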
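If the file is not already sorted on field 13, one way to feed the one-liner above is a stable sort on that field, so that lines sharing a key keep their original relative order and the last one is still the last occurrence in the file. The following is a minimal sketch, not part of the original answer: the function name keep_last_sorted and its optional column argument are hypothetical, and the -s (stable) option is assumed to be available, as it is in GNU sort. Note that the output comes out ordered by key, not in file order.

```shell
# keep_last_sorted FILE [COL]
# Hypothetical helper: stable-sort FILE on comma-separated column COL
# (default 13), then print the last line of each run of equal keys.
keep_last_sorted() {
    file=$1
    col=${2:-13}
    # -s keeps equal-key lines in their original order (assumes a sort
    # that supports -s, e.g. GNU sort); LC_ALL=C gives plain byte ordering.
    LC_ALL=C sort -s -t, -k"$col","$col" "$file" |
        awk -F, -v c="$col" '
            NR > 1 && p != $c { print p0 }   # key changed: emit previous line
            { p = $c; p0 = $0 }              # remember current key and line
            END { if (NR) print p0 }         # emit the final group
        '
}
```

Called as keep_last_sorted file, it keeps one line per distinct value of field 13.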

tac file | awk -F, '!a[$13]++' | tac 

This reverses the file, keeps the first entry for each value of $13, and reverses the result back.
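To make the behavior concrete, here is the same pipeline wrapped in a small function, with the key column as a parameter so it can be tried on a toy file. This is only a sketch: the name keep_last_tac and the column argument are hypothetical additions, and tac is assumed to be available (it ships with GNU coreutils).

```shell
# keep_last_tac FILE [COL]
# Hypothetical helper: for each distinct value of comma-separated column
# COL (default 13), keep only the last line, preserving file order.
keep_last_tac() {
    # Reverse the file, keep the first line seen per key (which is the
    # last occurrence in the original), then restore the original order.
    tac "$1" | awk -F, -v c="${2:-13}" '!seen[$c]++' | tac
}
```

For a three-column file containing a,k1,first / b,k2,first / c,k1,second, calling keep_last_tac on column 2 prints b,k2,first followed by c,k1,second. The awk stage still keeps one array entry per distinct key, so memory grows with the number of unique values in the column.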

0

Here is a solution that should work:

awk -F, '{rows[$13]=$0} END {for (i in rows) print rows[i]}' file 

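Explanation:

  • rows is an associative array indexed by field 13 ($13); the element indexed by $13 is overwritten every time a duplicate of field 13 appears; its value is the whole line $0

However, this is inefficient in terms of memory because of the space needed to hold the array.

An improvement on the above solution, still without sorting, that stores only the line numbers in the associative array: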

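awk -F, '{rows[$13]=NR} END {for (i in rows) print rows[i]}' file | while read -r lN; do sed "${lN}q;d" file; done

Explanation:

  • rows is as before, but the values are line numbers instead of whole lines
  • awk -F, '{rows[$13]=NR} END {for (i in rows) print rows[i]}' file outputs a list of the line numbers of the lines to look up
  • sed "${lN}q;d" file prints line number lN of file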
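The line-number idea can also be done without starting one sed process per line, by letting awk read the file twice. The sketch below is an alternative technique, not the answerer's code: the function name keep_last_two_pass and its column argument are hypothetical. The first pass records the last line number for each key, and the second pass prints a line only when its line number matches the recorded one, so the output stays in file order.

```shell
# keep_last_two_pass FILE [COL]
# Hypothetical helper: two passes over FILE with a single awk process.
keep_last_two_pass() {
    file=$1
    col=${2:-13}
    awk -F, -v c="$col" '
        NR == FNR { last[$c] = NR; next }   # pass 1: last line number per key
        last[$c] == FNR                     # pass 2: print only those lines
    ' "$file" "$file"
}
```

The array still holds one entry per distinct key, so with tens of millions of unique values memory use remains significant, but only a line number is stored per key and the file is scanned twice in total.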
+1

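Have you thought about how much memory your program uses? 65 million unique records. If each record is 50 bytes, that comes to about 3 GB of raw data, not counting what awk needs to maintain the array structure. Calculate it yourself: perl -le 'print 65_000_000 * 50/1024/1024/1024' – andlrc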
