2014-12-27 64 views

Merging lines in Pig

I want to write a Pig script for the problem below.

The input is:

AAA,,, 
,BBB,, 
,,,DDD 
AAA,,, 
,BBB,, 
,,CCC, 
,,,DDD 
AAA,,, 
,BBB,, 
,,,DDD 

The output should be:

AAA,BBB,,DDD 
AAA,BBB,CCC,DDD 
AAA,BBB,,DDD 

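I tried the approach from "Merge two lines in Pig", but if I split the bag with BagSplit(3,$1) the output is incorrect, because my output has to merge the first three lines, then the next four lines, then the next three lines.

The input may grow, but the important point is that the last line of each group will always be ,,,DDD.

Can anyone help me?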

Answers

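Your input data has to be split into groups of different lengths (3, 4, 3), so the BagSplit function will not work in this case. Could you try the approach below? The repeated TOTUPLE part of relation E could be factored out with macros, but that would make the script more confusing, so I have not optimized it for now.

input.txt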

AAA,,, 
,BBB,, 
,,,DDD 
AAA,,, 
,BBB,, 
,,CCC, 
,,,DDD 
AAA,,, 
,BBB,, 
,,,DDD 

PigScript:

A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4); 
B = RANK A; 
C = GROUP B ALL; 
D = FOREACH C { 
       firstRecord = FILTER B BY rank_A<=3;    /* store first 3 records*/ 
       secondRecord= FILTER B BY rank_A>3 AND rank_A<=7; /* store next 4 records */ 
       thirdRecord = FILTER B BY rank_A>7;     /* store next 3 records */ 
       GENERATE firstRecord,secondRecord,thirdRecord; 
       } 

/* Convert each split bag (firstRecord, secondRecord and thirdRecord) into a string,
   then remove the literal text 'null' and the '_' delimiter added by BagToString.
   The pattern 'null|_' is an alternation; the character class '[null|_]' would
   also delete any single 'n', 'u', 'l' or '|' inside real values. */ 
E = FOREACH D GENERATE FLATTEN(TOBAG(
             TOTUPLE(REPLACE(BagToString(firstRecord.f1),'null|_',''), 
               REPLACE(BagToString(firstRecord.f2),'null|_',''), 
               REPLACE(BagToString(firstRecord.f3),'null|_',''), 
               REPLACE(BagToString(firstRecord.f4),'null|_','')), 
             TOTUPLE(REPLACE(BagToString(secondRecord.f1),'null|_',''), 
               REPLACE(BagToString(secondRecord.f2),'null|_',''), 
               REPLACE(BagToString(secondRecord.f3),'null|_',''), 
               REPLACE(BagToString(secondRecord.f4),'null|_','')), 
             TOTUPLE(REPLACE(BagToString(thirdRecord.f1),'null|_',''), 
               REPLACE(BagToString(thirdRecord.f2),'null|_',''), 
               REPLACE(BagToString(thirdRecord.f3),'null|_',''), 
               REPLACE(BagToString(thirdRecord.f4),'null|_','')) 
             ) 
           ); 
DUMP E; 

Output:

(AAA,BBB,,DDD) 
(AAA,BBB,CCC,DDD) 
(AAA,BBB,,DDD) 

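But I don't know the lengths in advance. They could be 3, 3, 4, but the last line of each group will always be ,,,DDD. Please let me know how we can solve this. – Viru 2014-12-28 11:59:47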

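If you don't know the lengths in advance, this will be difficult to solve in plain Pig. One option is to write a custom UDF and shape the output inside the UDF. – 2014-12-29 06:41:43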
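
As a rough illustration of what such a UDF could do, here is a minimal Python sketch of the merging logic only. It is not a tested Pig UDF: the function name `merge_records` and the `width` parameter are hypothetical, and it assumes the rows reach the function in their original file order (for example, ordered by the `rank_A` column produced by RANK) and that a non-empty fourth field always marks the end of a group.

```python
def merge_records(lines, width=4):
    """Merge consecutive sparse rows into one record per group.

    A group is closed as soon as a row has a non-empty last column
    (the ',,,DDD' terminator row described in the question).
    """
    merged = []
    current = [""] * width
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        # Pad or trim so every row has exactly `width` columns.
        fields = (fields + [""] * width)[:width]
        # Copy each non-empty value into its column of the open record.
        for i, value in enumerate(fields):
            if value:
                current[i] = value
        # A filled last column ends the group, whatever its length.
        if fields[-1]:
            merged.append(",".join(current))
            current = [""] * width
    # Rows after the last terminator (an unfinished group) are dropped.
    return merged


sample = [
    "AAA,,,", ",BBB,,", ",,,DDD",
    "AAA,,,", ",BBB,,", ",,CCC,", ",,,DDD",
    "AAA,,,", ",BBB,,", ",,,DDD",
]
for record in merge_records(sample):
    print(record)
```

Because the group boundary is detected from the data rather than from fixed rank ranges, the same logic handles groups of 3, 3, 4 or any other lengths. Wiring it into Pig (registering the script and declaring an output schema) is left out here, since those details depend on the Pig version in use.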