0
我想通过拆分出两个元组(或任何它在Pig中调用的),基于col2
中的条件,并在操作col2
后,将其转换为另一列,比较两个操纵的元组并进行额外的排除操作。Apache Pig:在另一个上过滤一个元组?
REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);
--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;
is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;
分裂和操纵是很容易的部分。这是它变得复杂的地方。 。 。
merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};
如果我能想出这条线(S),其余将下降到位。
merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited into 'file';
这里的I/O的例子:
col1 col2 manipulated
This qWerty W
Is qweRty R
An qwertY Y
Example qwErty E
Of qwerTy T
Example Qwerty Q
Data qWerty W
isnt
E
Y
col1 col2
This qWerty
Is qweRty
Of qwerTy
Example Qwerty
Data qWerty
请提供示例输入和输出。我不清楚你想要做什么。 –
你走了。看看这是否有助于更好地解释它。 – Jonathan
任何人?我需要这个尽快,请。 – Jonathan