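I have a question for all you awk/sed/perl experts: how can I remove x entries that contain the same string and keep only one entry, with a modified header? I have a file in the following format: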
>GALHOMG00000016026_1 GALHOMT00000016026_1 GALHOMP00000016026_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
>HUMHOMG00000262990_1 HUMHOMT00000262990_1 HUMHOMP00000262990_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
>TGUHOMG00000002432_1 TGUHOMT00000002432_1 TGUHOMP00000002432_1 JH556633.1:35740-45316 1
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
I would like to turn the file into the following:
>JH556633.1:35740-45316
MPKKKTGARKKAENRREREKQIRASRANIDLAKHPCNASMECDKCQRRQKNRAFCYFCNS
VQKLPICAQCGKTKCMMKSSDCVIKHAGVYSTGLAMVGAICDFCEAWVCHGRKCLSTHAC
TCPLADAECIECERSVWDHGGRIFACSFCHDFLCEDDQFEHQASCQVLEAETFKCVSCNR
LGQHSCLRCKACFCGDHVRSKVFKQEKGKEPPCPKCGHETQQTKDLSMSTRSLKFGRQTG
GEDADGASGYDAYWKNLSSSKPGDAGDREDEYDEYEAEDDDEDDNDEGGKDSDTETTDLF
SNLNLGRTYASGYAHYEEPED
I know I can change what I call the header (the lines starting with >) like this:
awk 'NF > 1{$0=">"$4}; {print $0}' file.fa > file2.fa
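My question is: how do I remove the other two entries? The file may also contain cases where the character sequences of the entries (i.e., everything except the header line) are not identical. In that case, I would like to append a suffix based on the number of entries that share the same identifier (e.g., in this example, JH556633.1-1:35740-45316 for the first, JH556633.1-2:35740-45316 for the next, or something similar). The point is to make identical headers (lines starting with >) distinct, while keeping the original character sequences when they differ.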
If anyone has an idea for how to solve this, I would really appreciate the help. Thanks!
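One possible approach, sketched here in awk under a few assumptions: the identifier is always the fourth whitespace-separated field of the header (as in the one-liner above), two entries count as duplicates when their sequences match after joining the wrapped lines, and the whole file fits in memory. The function name `dedup_fasta` is made up for illustration. Entries are buffered and printed at the end, because the script can only tell whether an identifier needs a numeric suffix after it has seen every entry.

```shell
# Hypothetical helper: collapse identical FASTA entries and rename headers.
# Assumes the identifier is field 4 of each header line.
dedup_fasta() {
    awk '
    # Store the entry collected so far, unless this (id, sequence) pair was seen.
    function flush(   key) {
        if (id == "") return
        key = id SUBSEP seq
        if (!(key in seen)) {
            seen[key] = 1
            n++
            rec_id[n] = id
            rec_body[n] = body      # sequence lines with their original wrapping
            variants[id]++          # number of distinct sequences for this id
        }
    }
    /^>/ {
        flush()
        # Fall back to the first field (minus ">") if there is no fourth field.
        id = (NF >= 4) ? $4 : substr($1, 2)
        seq = ""; body = ""
        next
    }
    {
        seq = seq $0
        body = body $0 "\n"
    }
    END {
        flush()
        for (i = 1; i <= n; i++) {
            name = rec_id[i]
            if (variants[name] > 1) {
                # Same id, different sequences: insert -1, -2, ... before the colon.
                k = ++used[name]
                p = index(name, ":")
                if (p > 0)
                    name = substr(name, 1, p - 1) "-" k substr(name, p)
                else
                    name = name "-" k
            }
            printf ">%s\n%s", name, rec_body[i]
        }
    }
    ' "$@"
}
```

It could be called as `dedup_fasta file.fa > file2.fa`. For the example file this should leave a single entry headed `>JH556633.1:35740-45316`; if entries with that identifier had different sequences, they would come out as `>JH556633.1-1:35740-45316`, `>JH556633.1-2:35740-45316`, and so on, in order of first appearance.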
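OK! Got it! – BashN3wb 2014-09-27 05:17:13
Do you mean the lines that follow a line starting with '>' (the greater-than sign)? – 2014-09-27 07:02:00
Please show us your attempt at solving the problem (not just the awk command you posted, which only handles the first line). – 2014-09-27 07:28:59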