2017-08-30

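Combining certain columns of multiple files based on a specific column, without removing duplicate names

I need to merge the values of column 2 from several files based on column 7 of every file. Following Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote this code: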

awk 'FNR==1 { ++numFiles} 
!seen[$7]++ { keys[++numKeys] = $7 } 
{ a[$7,numFiles] = $2 } 
END { 
for (keyNr=1; keyNr<=numKeys; keyNr++) { 
    key = keys[keyNr] 
    printf "%s", key 
    for (fileNr=1;fileNr<=numFiles;fileNr++) { 
     printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA") 
    } 
    print "" 
} } ' file1.txt file2.txt file3.txt > combined.txt 

Input file 1:

+-------+-----------------+----------+-------------+----------+-------------+-------------+
| ID    | adj.P.Val_file1 | P.Value  | t           | B        | logFC       | Gene.symbol |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| 36879 | 1.66E-09        | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA          |
| 33623 | 1.66E-09        | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA          |
| 23271 | 2.70E-09        | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB          |
| 67    | 2.70E-09        | 2.40E-13 | -11.829044  | 19.892   | -3.06680932 | BB          |
| 33207 | 1.21E-08        | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC          |
| 24581 | 1.81E-08        | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC          |
| 32009 | 3.25E-08        | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC          |
+-------+-----------------+----------+-------------+----------+-------------+-------------+

Input file 2:

+-------+-----------------+----------+------------+-----------+------------+--------------+
| ID    | adj.P.Val_file2 | P.Value  | t          | B         | logFC      | Gene.symbol  |
+-------+-----------------+----------+------------+-----------+------------+--------------+
| 40000 | 5.43E-13        | 1.21E-17 | 17.003819  | 29.155646 | 2.4805744  | FGH          |
| 32388 | 1.15E-11        | 5.12E-16 | 14.920047  | 25.829874 | 2.2497567  | FGH          |
| 33623 | 6.08E-11        | 4.43E-15 | -13.8115   | 23.870549 | -2.8161587 | ASD          |
| 25002 | 6.08E-11        | 5.40E-15 | 13.713018  | 23.689571 | 2.2164681  | ASD          |
| 33207 | 2.03E-10        | 2.29E-14 | -13.009752 | 22.36291  | -2.8787392 | ASD          |
| 13018 | 2.03E-10        | 2.71E-14 | 12.929201  | 22.207038 | 3.0181585  | ASD          |
| 5539  | 2.24E-10        | 3.48E-14 | 12.810902  | 21.976634 | 3.0849706  | ASD          |
+-------+-----------------+----------+------------+-----------+------------+--------------+

Desired output:

+-------------+-----------------+-----------------+
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 |
+-------------+-----------------+-----------------+
| AA          | 1.66E-09        | NA              |
| AA          | 1.66E-09        | NA              |
| BB          | 2.70E-09        | NA              |
| BB          | 2.70E-09        | NA              |
| CC          | 1.21E-08        | NA              |
| CC          | 1.81E-08        | NA              |
| CC          | 3.25E-08        | NA              |
| FGH         | NA              | 5.43E-13        |
| FGH         | NA              | 1.15E-11        |
| ASD         | NA              | 6.08E-11        |
| ASD         | NA              | 6.08E-11        |
| ASD         | NA              | 2.03E-10        |
| ASD         | NA              | 2.03E-10        |
| ASD         | NA              | 2.24E-10        |
+-------------+-----------------+-----------------+

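The problem is that column 7 contains duplicate names, and the code only picks up the first occurrence of each name. I want a result for every occurrence of a duplicated name. I tried removing each line of the code in turn to understand what it does, but could not come up with a solution.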

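Please post sample input and expected output so that the question is useful to readers.

I hope the example above helps. The columns in the files are tab-separated, but when I press the Tab key here to separate them it opens the tag dialog, so please go by the example above. The first and second files have the same column headers, namely: ID \t, adj.P.Val_file1 \t, P.Value \t, t \t, B \t, logFC \t, Gene.symbol, while the desired output file should only have: Gene.symbol \t, adj.P.Val_file1 \t, adj.P.Val_file2

Not sure where rows 8 and 9 of your expected output come from?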

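Answer

Finally figured out the answer myself!

I just had to remove !seen[$7]++ from my code: with it in place, only the first occurrence of any duplicated name in column 7 (in general, column n) is considered.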
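For readers who want one output row per input row, here is a minimal sketch of one way to get there. It is not the exact code from the question: because a[$7,numFiles] is keyed by name, it can hold only one value per name and file, so this version indexes by input row instead. It assumes the files are tab-separated with a single header line each. The array names (hdr, gene, val, src) and the tiny sample files are made up for illustration, using a few rows from the tables in the question.

```shell
# Build two small tab-separated sample files (a subset of the rows in the question).
printf 'ID\tadj.P.Val_file1\tP.Value\tt\tB\tlogFC\tGene.symbol\n'            >  file1.txt
printf '36879\t1.66E-09\t7.02E-14\t-12.3836337\t21.00111\t-2.60060826\tAA\n' >> file1.txt
printf '33623\t1.66E-09\t7.39E-14\t-12.3599517\t20.95461\t-2.53106808\tAA\n' >> file1.txt
printf '23271\t2.70E-09\t2.30E-13\t-11.8478184\t19.93024\t-2.15050984\tBB\n' >> file1.txt

printf 'ID\tadj.P.Val_file2\tP.Value\tt\tB\tlogFC\tGene.symbol\n'            >  file2.txt
printf '40000\t5.43E-13\t1.21E-17\t17.003819\t29.155646\t2.4805744\tFGH\n'   >> file2.txt
printf '33623\t6.08E-11\t4.43E-15\t-13.8115\t23.870549\t-2.8161587\tASD\n'   >> file2.txt

awk -F'\t' -v OFS='\t' '
# First line of each file: count the file, remember its column-2 header, skip the line.
FNR == 1 {
    ++numFiles
    hdr[numFiles] = $2
    if (numFiles == 1) keyHdr = $7
    next
}
# Every data line becomes its own output row, so duplicate names are kept.
{
    ++numRows
    gene[numRows] = $7          # name from column 7
    val[numRows]  = $2          # value from column 2
    src[numRows]  = numFiles    # which file this row came from
}
END {
    # Header row: key column name followed by one value column per file.
    printf "%s", keyHdr
    for (f = 1; f <= numFiles; f++) printf "%s%s", OFS, hdr[f]
    print ""
    # Data rows: the value goes in the column of its own file, NA elsewhere.
    for (r = 1; r <= numRows; r++) {
        printf "%s", gene[r]
        for (f = 1; f <= numFiles; f++)
            printf "%s%s", OFS, (src[r] == f ? val[r] : "NA")
        print ""
    }
}' file1.txt file2.txt > combined.txt
```

With these sample files, combined.txt should contain a header line followed by five data rows, with both AA rows preserved. Note that this sketch never matches names across files; a name that appears in two files simply gets separate rows, which is what the desired output in the question shows.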
