2014-12-02

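Hive with a Python transform function: "cannot recognize input near 'transform'" error

I have a Hive table that tracks the status of objects as they move through the stages of a process. The table looks like this: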

hive> desc journeys; 
object_id   string          
journey_statuses array<string> 

Here is a typical record (journey_statuses was produced with collect_list in Hive 0.13):

12345678 ["A","A","A","B","B","B","C","C","C","C","D"] 

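The records and statuses in the table are ordered (if order did not matter, I would use collect_set). For each object_id, I want to abbreviate the journey by collapsing consecutive repeated statuses, returning the journey statuses in the order in which they appear.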

I wrote a quick Python script that reads from standard input:

#!/usr/bin/env python
import ast
import itertools
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Parse the list literal safely instead of calling eval() on raw input.
    input_list = ast.literal_eval(line)
    # groupby yields one key per run of consecutive equal items, in order,
    # so empty and single-element lists are handled as well.
    result = [status for status, _ in itertools.groupby(input_list)]
    print(result)

I plan to use this in a Hive transform call. It seems to work when run locally:

$ echo '["A","A","A","B","B","B","C","C","C","C","D"]' | python abbreviate_list.py 
['A', 'B', 'C', 'D'] 

However, when I add the file and try to run it inside Hive, I get an error:

hive> add file abbreviateList.py;                   
Added resource: abbreviateList.py 

hive> select 
    > object_id, 
    > transform(journey_statuses) using 'python abbreviateList.py' as journey_statuses_abbreviated 
    > from journeys; 
NoViableAltException(... wall of Java error messages ...) 
FAILED: ParseException line 3:2 cannot recognize input near 'transform' '(' 'journey_statuses' in select expression 

Can you see what I am doing wrong?

Answer


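Apparently you cannot select other fields alongside a transform if they are not part of it (in your example, object_id). This other SO question seems to address the issue indirectly: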

How can select a column and do a TRANSFORM in Hive?

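In theory, if you need object_id in the output, you could modify your Python script to accept it as an additional input field and pass it straight through as another output field.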
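The pass-through idea could look roughly like the sketch below. It is untested against Hive and rests on two assumptions: that each input line holds object_id and the status list separated by a tab, and that the list arrives as a bracketed literal like the one in the local test from the question. Hive may serialize array columns differently, so check what the script actually receives. The function names abbreviate, process_line and main are made up for illustration.

```python
import ast
import itertools
import sys


def abbreviate(statuses):
    """Collapse runs of consecutive equal statuses, keeping their order."""
    return [status for status, _ in itertools.groupby(statuses)]


def process_line(line):
    """Turn 'object_id<TAB>list literal' into 'object_id<TAB>abbreviated list'."""
    # Assumption: fields are tab-separated and the second field is a list
    # literal such as ["A","A","B"]; adjust the parsing if Hive sends the
    # array in another form.
    object_id, raw_statuses = line.rstrip("\n").split("\t", 1)
    statuses = ast.literal_eval(raw_statuses)
    # object_id is passed through unchanged as the first output field.
    return "%s\t%s" % (object_id, abbreviate(statuses))


def main(stream=sys.stdin):
    """Read rows from the stream and print one transformed row per line."""
    for line in stream:
        if line.strip():
            print(process_line(line))
```

To use this as the transform script, end the file with a call to main(). The query would then put both columns inside the transform instead of selecting object_id next to it, along the lines of: select transform(object_id, journey_statuses) using 'python abbreviateList.py' as (object_id, journey_statuses_abbreviated) from journeys;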