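Combining certain columns from multiple files based on a specific column, without removing duplicate names
I have to merge the values of column 2 from several files, keyed on column 7 of every file. Following Ed Morton's answer to a similar question (Combining certain columns of several tab-delimited files based on first column), I wrote this code: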
awk 'FNR==1 { ++numFiles}
!seen[$7]++ { keys[++numKeys] = $7 }
{ a[$7,numFiles] = $2 }
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (fileNr=1;fileNr<=numFiles;fileNr++) {
printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "NA")
}
print ""
} } ' file1.txt file2.txt file3.txt > combined.txt
Input file 1:
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| ID | adj.P.Val_file1 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
| 36879 | 1.66E-09 | 7.02E-14 | -12.3836337 | 21.00111 | -2.60060826 | AA |
| 33623 | 1.66E-09 | 7.39E-14 | -12.3599517 | 20.95461 | -2.53106808 | AA |
| 23271 | 2.70E-09 | 2.30E-13 | -11.8478184 | 19.93024 | -2.15050984 | BB |
| 67 | 2.70E-09 | 2.40E-13 | -11.829044 | 19.892 | -3.06680932 | BB |
| 33207 | 1.21E-08 | 1.35E-12 | -11.0793461 | 18.32425 | -2.65246816 | CC |
| 24581 | 1.81E-08 | 2.41E-12 | -10.8325542 | 17.79052 | -1.87937753 | CC |
| 32009 | 3.25E-08 | 5.05E-12 | -10.5240537 | 17.11081 | -1.46505166 | CC |
+-------+-----------------+----------+-------------+----------+-------------+-------------+
Input file 2:
+-------+-----------------+----------+------------+-----------+------------+--------------+
| ID | adj.P.Val_file2 | P.Value | t | B | logFC | Gene.symbol |
+-------+-----------------+----------+------------+-----------+------------+--------------+
| 40000 | 5.43E-13 | 1.21E-17 | 17.003819 | 29.155646 | 2.4805744 | FGH |
| 32388 | 1.15E-11 | 5.12E-16 | 14.920047 | 25.829874 | 2.2497567 | FGH |
| 33623 | 6.08E-11 | 4.43E-15 | -13.8115 | 23.870549 | -2.8161587 | ASD |
| 25002 | 6.08E-11 | 5.40E-15 | 13.713018 | 23.689571 | 2.2164681 | ASD |
| 33207 | 2.03E-10 | 2.29E-14 | -13.009752 | 22.36291 | -2.8787392 | ASD |
| 13018 | 2.03E-10 | 2.71E-14 | 12.929201 | 22.207038 | 3.0181585 | ASD |
| 5539 | 2.24E-10 | 3.48E-14 | 12.810902 | 21.976634 | 3.0849706 | ASD |
+-------+-----------------+----------+------------+-----------+------------+--------------+
Desired output:
+-------------+-----------------+-----------------+
| Gene.symbol | adj.P.Val_file1 | adj.P.Val_file2 |
+-------------+-----------------+-----------------+
| AA | 1.66E-09 | NA |
| AA | 1.66E-09 | NA |
| BB | 2.70E-09 | NA |
| BB | 2.70E-09 | NA |
| CC | 1.21E-08 | NA |
| CC | 1.81E-08 | NA |
| CC | 3.25E-08 | NA |
| FGH | NA | 5.43E-13 |
| FGH | NA | 1.15E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 6.08E-11 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.03E-10 |
| ASD | NA | 2.24E-10 |
+-------------+-----------------+-----------------+
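The problem is that column 7 contains duplicate names, and the code only keeps one value per name per file: the key list records a name only the first time it appears, and a[$7,numFiles] is overwritten by each later row with the same name. I want a result row for every occurrence of a duplicated name. I tried removing the code line by line to understand it, but could not come up with a solution.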
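For what it's worth, here is a minimal sketch of one way to keep every row, assuming the files are tab-delimited with exactly one header line each. Instead of indexing by gene name, it stores each data row under a running row number together with the number of the file it came from. The file names sample1.txt, sample2.txt and combined_sample.txt, the few sample rows, and the array names (key, val, src, header) are made up for illustration; they are not from the original code.

```shell
# Hypothetical sample files (a few rows taken from the tables above), tab-delimited.
printf 'ID\tadj.P.Val_file1\tP.Value\tt\tB\tlogFC\tGene.symbol\n36879\t1.66E-09\t7.02E-14\t-12.3836337\t21.00111\t-2.60060826\tAA\n33623\t1.66E-09\t7.39E-14\t-12.3599517\t20.95461\t-2.53106808\tAA\n23271\t2.70E-09\t2.30E-13\t-11.8478184\t19.93024\t-2.15050984\tBB\n' > sample1.txt
printf 'ID\tadj.P.Val_file2\tP.Value\tt\tB\tlogFC\tGene.symbol\n40000\t5.43E-13\t1.21E-17\t17.003819\t29.155646\t2.4805744\tFGH\n33623\t6.08E-11\t4.43E-15\t-13.8115\t23.870549\t-2.8161587\tASD\n' > sample2.txt

awk -F'\t' -v OFS='\t' '
FNR == 1 {                      # first line of each file is its header
    ++numFiles
    keyName = $7                # e.g. "Gene.symbol"
    header[numFiles] = $2       # e.g. "adj.P.Val_file1"
    next
}
{                               # every data row is kept, duplicates included
    ++numRows
    key[numRows] = $7
    val[numRows] = $2
    src[numRows] = numFiles     # which file this row came from
}
END {
    # header line: key column name, then one column per input file
    printf "%s", keyName
    for (f = 1; f <= numFiles; f++) printf "%s%s", OFS, header[f]
    print ""
    # one output line per input row; "NA" in the columns of the other files
    for (r = 1; r <= numRows; r++) {
        printf "%s", key[r]
        for (f = 1; f <= numFiles; f++)
            printf "%s%s", OFS, (src[r] == f ? val[r] : "NA")
        print ""
    }
}' sample1.txt sample2.txt > combined_sample.txt
```

Because rows are never matched across files, a name that occurs in two files gets separate lines for each file rather than one merged line. That matches the desired output shown above, but it would need rethinking if values for the same name should ever share a line.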
Please post sample input and the expected output so that the question is useful to readers. –
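I hope the example above helps. The columns in the files are tab-separated, but when I tried to separate them here by pressing the Tab key it opened the tags dialog, so please go by the example above. The first and second files have the same column headers, namely: ID \t adj.P.Val_file1 \t P.Value \t t \t B \t logFC \t Gene.symbol, while the desired output file should contain only: Gene.symbol \t adj.P.Val_file1 \t adj.P.Val_file2 –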
Not sure where lines 8 and 9 of your expected output come from? –