2015-10-01 36 views
1

我期待在Pig中实现以下功能。我有一套这样的样本记录。Apache PIG - GROUP BY

enter image description here

注意,EFFECTIVEDATE列有时是空白,为同一客户ID也不同。

现在,作为输出,我想每一个客户ID记录,其中EFFECTIVEDATE是MAX。所以,对于上面的例子,我希望记录突出显示,如下所示。

enter image description here

我目前使用的猪做它的方式是这样的:

customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray); 

--Group customer data by CustomerID 
customerdata_grpd = GROUP customerdata BY CustomerID; 

--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate. 
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate; 

--Join the above with the original data so that we get the other details like CustomerName, Age etc. 
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate); 

finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate; 

我基本上是分组的原始数据,以发现具有最大EFFECTIVEDATE记录。然后我再次使用原始数据集将这些“分组”记录加入到最大生效日期,但这次我还会获得其他数据,例如客户名称,年龄和性别。这个数据集非常庞大,所以这种方法花费了很多时间。有更好的方法吗?

回答

3

输入:

1,John,28,M,1-Jan-15 
1,John,28,M,1-Feb-15 
1,John,28,M, 
1,John,28,M,1-Mar-14 
2,Jane,25,F,5-Mar-14 
2,Jane,25,F,5-Jun-15 
2,Jane,25,F,3-Feb-14 

猪脚本:

customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray); 

customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date; 

customer_data_grp_id = GROUP customer_data_fmt BY id; 

req_data = FOREACH customer_data_grp_id { 
     customer_data_ordered = ORDER customer_data_fmt BY date DESC; 
     req_customer_data = LIMIT customer_data_ordered 1; 
     GENERATE FLATTEN(req_customer_data.id) AS id, 
       FLATTEN(req_customer_data.name) AS name, 
       FLATTEN(req_customer_data.gender) AS gender, 
       FLATTEN(req_customer_data.effective_date) AS effective_date; 
}; 

输出:

(1,John,M,1-Feb-15) 
(2,Jane,F,5-Jun-15) 
+0

感谢穆拉利。那效果很好.. – user3031097