2013-05-14 65 views
0

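Write partly tab-delimited data to a MySQL database

I have a MySQL database table with 7 columns (chr, pos, num, iA, iB, iC, iD) and a file of 40,000,000 lines, each holding one dataset. Every line has 4 tab-separated columns: the first three columns always contain data, and the fourth column can contain up to four different key=value pairs separated by semicolons.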

chr pos num info 
1  10203 3  iA=0.34;iB=nerv;iC=45;iD=dskf12586 
1  10203 4  iA=0.44;iC=45;iD=dsf12586;iB=nerv 
1  10203 5  
1  10213 1  iB=nerv;iC=49;iA=0.14;iD=dskf12586 
1  10213 2  iA=0.34;iB=nerv;iD=cap1486 
1  10225 1  iD=dscf12586 

The key=value pairs in the info column appear in no particular order. I am also not sure whether a key can occur twice (I hope not).

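I want to write the data to the database. The first three columns are no problem, but extracting the values from the info column puzzles me, because the key=value pairs are unordered and not every key has to be present in a line. For a similar dataset (with an ordered info column) I used a Java program with regular expressions, which let me (1) check and (2) extract the data, but now I am stuck.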

How can I solve this task, preferably with a bash script or directly in MySQL?

+0

What? – HamZa 2013-05-14 08:03:11

+3

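Sorry, this can be done in almost any language :p What I would do is the following: loop over each line and split it on `\t+` (one or more tabs). Split the last field on `;`, then split each part on `=`. Now you have the values of *info*; you only need to build the logic around them, create a query, and execute it. – HamZa 2013-05-14 08:08:30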

+0

@R_User, did you get an answer? – svante 2013-09-10 13:11:34

Answers

2

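You did not mention how you want to write the data, but the awk example below shows how to get each individual key and value in every line. Instead of the printf, you can use your own logic to write the data.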

[[bash_prompt$]]$ cat test.sh; echo "###########"; awk -f test.sh log 
{
    # Only rows with a non-empty info column ($4) have pairs to parse.
    if (length($4)) {
        split($4, array, ";");
        print "In " $1, $2, $3;
        for (element in array) {
            # The key is everything before "=", the value everything after it.
            key = substr(array[element], 1, index(array[element], "=") - 1);
            value = substr(array[element], index(array[element], "=") + 1);
            printf("found %s key and %s value for %d line from %s\n", key, value, NR, array[element]);
        }
    }
}
###########
In 1 10203 3
found iD key and dskf12586 value for 1 line from iD=dskf12586
found iA key and 0.34 value for 1 line from iA=0.34
found iB key and nerv value for 1 line from iB=nerv
found iC key and 45 value for 1 line from iC=45
In 1 10203 4
found iB key and nerv value for 2 line from iB=nerv
found iA key and 0.44 value for 2 line from iA=0.44
found iC key and 45 value for 2 line from iC=45
found iD key and dsf12586 value for 2 line from iD=dsf12586
In 1 10213 1
found iD key and dskf12586 value for 4 line from iD=dskf12586
found iB key and nerv value for 4 line from iB=nerv
found iC key and 49 value for 4 line from iC=49
found iA key and 0.14 value for 4 line from iA=0.14
In 1 10213 2
found iA key and 0.34 value for 5 line from iA=0.34
found iB key and nerv value for 5 line from iB=nerv
found iD key and cap1486 value for 5 line from iD=cap1486
In 1 10225 1
found iD key and dscf12586 value for 6 line from iD=dscf12586
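
The question also asks whether a key might occur twice in one line. The same splitting approach can check that before anything is written to the database. The following is only a minimal sketch: the function name `check_info` is made up for illustration, and it assumes the input is strictly tab-separated, that the only valid keys are iA, iB, iC and iD, and that an optional header row starts with `chr`.

```shell
# check_info (hypothetical helper): read tab-separated rows on stdin and
# report duplicate keys, unknown keys and malformed pairs in column 4.
check_info() {
    awk -F '\t' '
    BEGIN {
        # Assumption: these four keys are the only valid ones.
        known["iA"] = 1; known["iB"] = 1; known["iC"] = 1; known["iD"] = 1
    }
    NR == 1 && $1 == "chr" { next }      # skip the header row if present
    {
        split("", seen)                  # portable way to empty the array
        n = split($4, pairs, ";")
        for (i = 1; i <= n; i++) {
            if (pairs[i] == "") continue
            eq = index(pairs[i], "=")
            if (eq < 2) {
                printf "line %d: malformed pair \"%s\"\n", NR, pairs[i]
                continue
            }
            key = substr(pairs[i], 1, eq - 1)
            if (!(key in known)) printf "line %d: unknown key %s\n", NR, key
            if (key in seen)     printf "line %d: duplicate key %s\n", NR, key
            seen[key] = 1
        }
    }'
}
```

Running it as `check_info < file` prints nothing when every info column is clean, so an empty result suggests the data can be loaded without further checks.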
2

This builds on @abasu's awk solution: it generates INSERT statements and also handles the unordered key=value pairs.

parse.awk:

NR>1 {
    # Reset every info column to NULL for each row (row 1 is the header).
    col["iA"]=col["iB"]=col["iC"]=col["iD"]="null";

    if(length($4)) {
        split($4,array,";");
        for(element in array) {
            # keyval[1] is the key, keyval[2] the value; pair order does not matter.
            split(array[element],keyval,"=");
            col[keyval[1]] = "'" keyval[2] "'";
        }
    }
    print "INSERT INTO tbl VALUES (" $1 "," $2 "," $3 "," col["iA"] "," col["iB"] "," col["iC"] "," col["iD"] ");";
}

Test/run:

$ awk -f parse.awk file 
INSERT INTO tbl VALUES (1,10203,3,'0.34','nerv','45','dskf12586'); 
INSERT INTO tbl VALUES (1,10203,4,'0.44','nerv','45','dsf12586'); 
INSERT INTO tbl VALUES (1,10203,5,null,null,null,null); 
INSERT INTO tbl VALUES (1,10213,1,'0.14','nerv','49','dskf12586'); 
INSERT INTO tbl VALUES (1,10213,2,'0.34','nerv',null,'cap1486'); 
INSERT INTO tbl VALUES (1,10225,1,null,null,null,'dscf12586');
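
With 40,000,000 rows, running one INSERT statement per row may be slow. One possible alternative, sketched here under assumptions, is to let awk write a seven-column tab-separated file and bulk-load it with MySQL's `LOAD DATA` statement, which by default expects tab-separated fields and reads `\N` as NULL. The function name `to_tsv`, the file name `data.tsv` and the database name `mydb` are illustrative; `tbl` is the table name used above, and its column order is assumed to be chr, pos, num, iA, iB, iC, iD.

```shell
# to_tsv (hypothetical helper): convert the 4-column input on stdin into
# 7 tab-separated columns, writing \N wherever a key is missing.
to_tsv() {
    awk -F '\t' -v OFS='\t' '
    NR == 1 && $1 == "chr" { next }      # skip the header row if present
    {
        # "\\N" in awk is the two characters \N, which LOAD DATA reads as NULL.
        col["iA"] = col["iB"] = col["iC"] = col["iD"] = "\\N"
        n = split($4, pairs, ";")
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            if (eq > 1)
                col[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
        }
        print $1, $2, $3, col["iA"], col["iB"], col["iC"], col["iD"]
    }'
}

# Possible usage (not executed here; assumes local_infile is enabled on
# both the client and the server):
#   to_tsv < file > data.tsv
#   mysql --local-infile=1 mydb -e "LOAD DATA LOCAL INFILE 'data.tsv' INTO TABLE tbl"
```

Unlike the generated INSERT statements, the values are not quoted here, so a value that itself contains a tab or a backslash would need escaping first.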