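Your input data has to be split into groups of different lengths (3, 4, 3), so the BagSplit function will not work in this case. Can you try the approach below? The repeated part of relation E (the TOTUPLE calls) could be factored out with macros, but that would make the script harder to follow, so I have not optimized it for now.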
input.txt:
AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (f1, f2, f3, f4);
B = RANK A;
C = GROUP B ALL;
D = FOREACH C {
    firstRecord  = FILTER B BY rank_A <= 3;                /* first 3 records */
    secondRecord = FILTER B BY rank_A > 3 AND rank_A <= 7; /* next 4 records */
    thirdRecord  = FILTER B BY rank_A > 7;                 /* last 3 records */
    GENERATE firstRecord, secondRecord, thirdRecord;
}
/* Convert each split bag (firstRecord, secondRecord, thirdRecord) into a string
   and remove the 'null' and '_' substrings.
   Note: 'null|_' is an alternation; the character class '[null|_]' would also
   delete any n, u, l or | character in the data. */
E = FOREACH D GENERATE FLATTEN(TOBAG(
        TOTUPLE(REPLACE(BagToString(firstRecord.f1), 'null|_', ''),
                REPLACE(BagToString(firstRecord.f2), 'null|_', ''),
                REPLACE(BagToString(firstRecord.f3), 'null|_', ''),
                REPLACE(BagToString(firstRecord.f4), 'null|_', '')),
        TOTUPLE(REPLACE(BagToString(secondRecord.f1), 'null|_', ''),
                REPLACE(BagToString(secondRecord.f2), 'null|_', ''),
                REPLACE(BagToString(secondRecord.f3), 'null|_', ''),
                REPLACE(BagToString(secondRecord.f4), 'null|_', '')),
        TOTUPLE(REPLACE(BagToString(thirdRecord.f1), 'null|_', ''),
                REPLACE(BagToString(thirdRecord.f2), 'null|_', ''),
                REPLACE(BagToString(thirdRecord.f3), 'null|_', ''),
                REPLACE(BagToString(thirdRecord.f4), 'null|_', ''))
    )
);
DUMP E;
Output:
(AAA,BBB,,DDD)
(AAA,BBB,CCC,DDD)
(AAA,BBB,,DDD)
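But I don't know the lengths in advance. They could be 3, 3, 4, but the last line of each group should always be ,,,DDD. Please let me know how we can solve this. – Viru 2014-12-28 11:59:47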
If you don't know the lengths in advance, this will be difficult to solve. One option is to write a custom UDF and shape the output inside the UDF. – 2014-12-29 06:41:43
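To make the custom-UDF idea a bit more concrete, here is a minimal sketch of the grouping logic in plain Python. It is not wired into Pig and is not a tested UDF. It assumes, based on the comment above, that every group ends with a row whose fourth field is non-empty (the ,,,DDD line), and that each row has the same number of fields. The names `merge_groups`, `parse_lines` and `terminal_index` are hypothetical.

```python
def parse_lines(lines):
    """Split comma-separated lines into lists of fields, skipping blank lines."""
    return [line.rstrip("\n").split(",") for line in lines if line.strip()]


def merge_groups(rows, terminal_index=3):
    """Merge consecutive sparse rows into one tuple per group.

    Assumption: a group ends at the first row whose field at
    terminal_index is non-empty (the ",,,DDD" row in the question).
    Empty strings and None are both treated as missing values.
    """
    merged = []
    current = None
    for row in rows:
        if current is None:
            current = [""] * len(row)
        # Copy every non-empty field of this row into the running group.
        for i, value in enumerate(row):
            if value:
                current[i] = value
        # The terminal row closes the group.
        if row[terminal_index]:
            merged.append(tuple(current))
            current = None
    if current is not None:
        # Trailing rows without a terminal row are kept as a partial group.
        merged.append(tuple(current))
    return merged


if __name__ == "__main__":
    sample = [
        "AAA,,,", ",BBB,,", ",,,DDD",
        "AAA,,,", ",BBB,,", ",,CCC,", ",,,DDD",
        "AAA,,,", ",BBB,,", ",,,DDD",
    ]
    for record in merge_groups(parse_lines(sample)):
        print(record)
```

Because the group boundary comes from the data rather than from fixed rank ranges, the same function handles 3,4,3 and 3,3,4 layouts. Inside Pig, the rows would still have to reach the UDF in their original order (for example as a bag sorted by the rank column), which this sketch does not cover.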