2017-03-09 81 views

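I have the following requirement: a Pig query for the set operation A - B (set difference).

I have a large file containing rows of data in JSON format: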

{
    "_length": "88",
    "_id": "1",
    "_store": {
        "meta": {
            "value": {
                "uid": "sam"
            }
        }
    }
}
{
    "_length": "22",
    "_id": "2",
    "_store": {
        "meta": {
            "value": {
                "uid": "uncle"
            }
        }
    }
}

....

I have another file containing the following:

{
    "uid": "sam",
    "zid": "121212121"
}
{
    "uid": "aborted",
    "zid": "9989821"
}

....

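Now I need to generate a new file from the first file, containing all records whose uid is not in the second file.

I am new to Pig and would like to know what kinds of JOIN or SET operations are supported.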


Please see here: http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html

Answers


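I think Elephant Bird can help you here. I have never tried anything quite like this, but since your JSON is nested, you could use Elephant Bird to read the two files into two relations, then join them to get the result you want.

Here are a couple of links that should help you get started with Elephant Bird.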

ElephantBird ERROR 1070: --- > class not getting read

https://github.com/twitter/elephant-bird
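
As a rough sketch of that suggestion (not tested against the asker's data): a left outer join followed by a null filter keeps only the records from the first file whose uid has no match in the second. It assumes each JSON record sits on a single line, which as far as I know is what Elephant Bird's JsonLoader expects, and that the Elephant Bird jars are already registered. The file names records.json and uids.json, the relation names, and the output path are hypothetical.

```pig
-- Assumes the Elephant Bird jars have already been REGISTERed.
-- 'records.json' and 'uids.json' are hypothetical names for the two files.
records = LOAD 'records.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
uids = LOAD 'uids.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Left outer join on the nested uid; records without a match get a null right side.
joined = JOIN records BY json#'_store'#'meta'#'value'#'uid' LEFT OUTER, uids BY json#'uid';

-- Keep only the records whose uid does not appear in the second file.
unmatched = FILTER joined BY uids::json IS NULL;
result = FOREACH unmatched GENERATE records::json;

STORE result INTO '/tmp/uid_difference' USING JsonStorage();
```

If the second file is small enough to fit in memory, adding USING 'replicated' to the JOIN may avoid a reduce phase.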


I am already using Elephant Bird, and I figured out that I need two 'replicated' joins combined with FILTERs to handle this case. – user1619355


Would you mind posting the code? It might help someone in the future! – ANI


Below are the sample files and the corresponding intermediate and final results:

cat ids_test.json 
{"A":"a1","B":"a2"} 

cat part-test 
{"content":"both_A_a1_B_a2","meta":{"A":"a1","B":"a2"}} 
{"content":"only_B_a2","meta":{"A":"","B":"a2"}} 
{"content":"only_A_a1","meta":{"A":"a1","B":""}} 
{"content":"both_A_b1_B_b2","meta":{"A":"b1","B":"b2"}} 
{"content":"only_A_c1","meta":{"A":"c1","B":""}} 

cat /tmp/j1/part-m-00000 
{"user_data::json":{"meta":"{B=a2, A=a1}","content":"both_A_a1_B_a2"},"ids::json":{"B":"a2","A":"a1"}} 
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null} 
{"user_data::json":{"meta":"{B=, A=a1}","content":"only_A_a1"},"ids::json":{"B":"a2","A":"a1"}} 
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null} 
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null} 

cat /tmp/j1_filter/part-m-00000 
{"user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"ids::json":null} 
{"user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"ids::json":null} 
{"user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"ids::json":null} 

cat /tmp/j2/part-m-00000 
{"J1_FILTER::user_data::json":{"meta":"{B=a2, A=}","content":"only_B_a2"},"J1_FILTER::ids::json":null,"ids::json":{"B":"a2","A":"a1"}} 
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"},"J1_FILTER::ids::json":null,"ids::json":null} 
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"},"J1_FILTER::ids::json":null,"ids::json":null} 

cat /tmp/results/part-m-00000 
{"J1_FILTER::user_data::json":{"meta":"{B=b2, A=b1}","content":"both_A_b1_B_b2"}} 
{"J1_FILTER::user_data::json":{"meta":"{B=, A=c1}","content":"only_A_c1"}} 

Here is the script:

-- Load both files as untyped maps (the Elephant Bird jars must already be registered).
user_data = LOAD 'part-test' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
ids = LOAD 'ids_test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Pass 1: left outer join on key A; rows without a match get a null ids::json.
J1 = JOIN user_data BY json#'meta'#'A' LEFT OUTER, ids BY json#'A' USING 'replicated';

rmf /tmp/j1
STORE J1 INTO '/tmp/j1' USING JsonStorage;

-- Keep only the rows whose A value did not match any id.
J1_FILTER = FILTER J1 BY ids::json IS NULL;

rmf /tmp/j1_filter
STORE J1_FILTER INTO '/tmp/j1_filter' USING JsonStorage;

-- Pass 2: repeat the join on key B for the rows that survived pass 1.
J2 = JOIN J1_FILTER BY user_data::json#'meta'#'B' LEFT OUTER, ids BY json#'B' USING 'replicated';

rmf /tmp/j2
STORE J2 INTO '/tmp/j2' USING JsonStorage;

-- Keep only the rows whose B value did not match any id either.
J2_FILTER = FILTER J2 BY ids::json IS NULL;

-- Project the original record back out of the doubly joined row.
RESULTS = FOREACH J2_FILTER GENERATE J1_FILTER::user_data::json;
--filtered_ids = FOREACH user_data_MINUS_ids GENERATE user_data AS data;
--DUMP filtered_ids;
rmf /tmp/results
STORE RESULTS INTO '/tmp/results' USING JsonStorage;
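
As an alternative sketch, not part of the script above: the same "rows with no match" test can also be written with COGROUP and the built-in IsEmpty, shown here for key A only on the same sample files. The relation names keyed_data, keyed_ids, grouped, unmatched_a, and results_a, and the output path, are made up for illustration, and this has not been run against the sample data.

```pig
-- Assumes the Elephant Bird jars have already been REGISTERed.
user_data = LOAD 'part-test' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
ids = LOAD 'ids_test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- Pull the join key into its own field on each side.
keyed_data = FOREACH user_data GENERATE json#'meta'#'A' AS a, json;
keyed_ids = FOREACH ids GENERATE json#'A' AS a;

-- One group per key value, holding a bag of rows from each side.
grouped = COGROUP keyed_data BY a, keyed_ids BY a;

-- An empty ids bag means no id carries this A value.
unmatched_a = FILTER grouped BY IsEmpty(keyed_ids);

-- Unnest the surviving records back into one row each.
results_a = FOREACH unmatched_a GENERATE FLATTEN(keyed_data.json);

STORE results_a INTO '/tmp/results_a' USING JsonStorage();
```

Unlike a 'replicated' join, which holds the smaller relation in memory, COGROUP goes through a regular shuffle, so it may be the safer choice when the id file is too large for memory. Repeating the same steps on key B over results_a would mirror the second pass of the script.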