
This is what I did to tokenize my data in Pig. My Pig script tokenizes the input using a Python UDF:

--set the debug mode 
SET debug 'off' 
-- Register the Python UDF 
REGISTER /home/hema/phd/work1/coding/myudf.py USING streaming_python AS myudf 

RAWDATA = LOAD '/home/hema/temp' USING TextLoader() AS content; 
LOWERCASE_DATA = FOREACH RAWDATA GENERATE LOWER(content) AS con; 
TOKENIZED_DATA = FOREACH LOWERCASE_DATA GENERATE myudf.special_tokenize(con) AS conn; 
DUMP TOKENIZED_DATA; 

My Python UDF:

from pig_util import outputSchema 
import nltk 

@outputSchema('word:chararray') 
def special_tokenize(input): 
    # tokenize the line with NLTK (note: this returns a Python list, 
    # while the declared output schema is a single chararray) 
    tokens = nltk.word_tokenize(input) 
    return tokens 

The code works fine, but the output is messy. How do I remove the unwanted underscores and vertical bars? The output looks like this:

(|{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_) 
(|{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_) 
(|{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_) 
(|{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_) 
(|{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_) 
(|{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_) 
(|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_) 
(|{_|(_best|)_|,_|(_,|)_|}_) 
(|{_|(_svetoslav|)_|}_) 

The raw data:

AdditionalContext in NameFinder 
Is there any possibility to use additionalContext with the NameFinderME.train? If so, how? If there isn't maybe this should be an issue to be added in the future releases? 
I would REALLY greatly appreciate if someone can help (give me some sample code/show me) how to add POS tag features while training and testing NameFinder. 
If the incoming data is just tokens with NO POS tag information, where is the information taken then? A new file? Run a POS tagging model before training? Or? 
And what is the purpose of the resources (i.e. Collection.<String,Object>emptyMap()) in the NameFinderME.train method? What should be ideally included in there? 
I just can't get these things from the Java doc API. 
in advance! 
Best, 
Svetoslav 

I want a list of the tokens as my final output. Thanks in advance.
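
As a side note (my own hedged sketch, not from the original thread): the UDF above returns a Python list while its @outputSchema declares a single chararray. Declaring a bag schema instead should let the token list come back as a proper Pig bag of one-field tuples, assuming streaming_python accepts the same bag-schema syntax as Jython UDFs:

from pig_util import outputSchema 
import nltk 

# Hypothetical variant: declare a bag of single-field tuples so the returned 
# Python list maps onto a Pig bag instead of being forced into a chararray. 
# (Assumes streaming_python supports Jython-style bag schemas.) 
@outputSchema('tokens:bag{t:tuple(word:chararray)}') 
def special_tokenize_bag(input): 
    # one single-element tuple per token -> one bag item per token 
    return [(token,) for token in nltk.word_tokenize(input)] 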


@cricket_007 I have posted my raw data as an edit. I don't think NLTK is generating the underscores and vertical bars; the same word_tokenize() method works fine when I run it in the Grunt shell.


OK, follow-up question: what is your expected output? (And a side note: you are already passing the string to Python, so why do the extra map-reduce work of lowercasing it in Pig?)


I expect a tuple of tokenized strings as output, e.g. ('additionalcontext', 'in', 'namefinder'). Actually, I want to do all the preprocessing in Pig. Pig's built-in TOKENIZE function doesn't tokenize the way I want it to, which is why I want to use NLTK.

Answers

from pig_util import outputSchema 
import nltk 
import re 

@outputSchema('word:chararray') 
def special_tokenize(input): 
    # split camelCase words ("additionalContext" -> "additional Context") 
    temp_data = re.sub(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', " ", input) 
    # encode to utf-8 before tokenizing, then join the tokens into a 
    # single comma-separated string to match the declared chararray schema 
    tokens = nltk.word_tokenize(temp_data.encode('utf-8')) 
    final_token = ','.join(tokens) 
    return final_token 

There was a problem with the encoding of the input; changing it to utf-8 solved it. (Joining the token list into a single comma-separated string also makes the return value match the declared chararray schema.)
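
The joining behaviour can be checked outside Pig. Here is a quick local sanity check of the function above (my own sketch; the sample sentence is made up, and Python 2 is assumed, since streaming_python runs CPython and the .encode('utf-8') call only behaves this way on Python 2 strings):

# Quick local test of special_tokenize outside Pig (hypothetical input). 
if __name__ == '__main__': 
    print(special_tokenize("Is there any possibility to use additionalContext?")) 
    # prints: Is,there,any,possibility,to,use,additional,Context,? 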


Use REPLACE to strip the '_' and '|' characters, then tokenize with TOKENIZE.

-- note: REPLACE uses Java regex, so the vertical bar has to be escaped 
NEW_TOKENIZED_DATA = FOREACH TOKENIZED_DATA GENERATE REPLACE(REPLACE($0,'_',''),'\\|',''); 
TOKENS = FOREACH NEW_TOKENIZED_DATA GENERATE TOKENIZE($0); 
DUMP TOKENS; 
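
Since TOKENIZE returns a bag of token tuples, a FLATTEN step (standard Pig; a hypothetical follow-up, not part of the original answer) gives one token per row if a flat list is wanted afterwards:

-- Hypothetical follow-up: FLATTEN expands the bag so each token gets its own row. 
FLAT_TOKENS = FOREACH TOKENS GENERATE FLATTEN($0); 
DUMP FLAT_TOKENS; 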

Are you asking me to tokenize twice? Why do these underscores and vertical bars end up in the bag returned by the UDF in the first place? Can't I avoid them instead of replacing them and doing a second round of tokenization?


I'm not sure where they come from. You could simply tokenize the input to get the tokens rather than using a UDF. The script I posted works on your output, not on the raw data.


The built-in tokenizer doesn't tokenize the text the way I want it to. That's why I'm using a UDF with NLTK.