2014-09-26 40 views

I'm looking for a good pattern for loading an expensive resource, such as a pickled Python object, on the server. Here is what I came up with; I have tested it by piping input files into the script directly from bash, but I have not yet run it on a Hadoop cluster. For you Hadoop wizards --- am I handling the I/O correctly so that this will work as a Python streaming job? I suppose I will set something up on Amazon to test it, but it would be nice if someone just knows the answer off the top of their head.

Hadoop streaming --- expensive shared resource (COOL)

You can test it by running cat file.txt | the_script or ./a_streaming_program | the_script:

#!/usr/bin/env python 

import sys 
import time 

def resources_for_many_lines(): 
    # load slow, shared resources here 
    # for example, a shared pickle file 

    # in this example we use a 1 second sleep to simulate 
    # a long data load 
    time.sleep(1) 

    # we will pretend the value zero is the product 
    # of our long slow running import 
    resource = 0 

    return resource 

def score_a_line(line, resources): 
    # put fast code to score a single example line here 
    # in this example we will return the value of resource + 1 
    return resources + 1 

def run(): 
    # read stdin and score the model over a streaming data set 
    resources = resources_for_many_lines() 
    while 1: 
        # read a line of input 
        line = sys.stdin.readline() 

        # end if the pipe closes 
        if line == '': 
            break 

        # score the line 
        print score_a_line(line, resources) 
        # print right away instead of waiting for the buffer to fill 
        sys.stdout.flush() 

if __name__ == "__main__": 
    run()
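In place of the 1-second sleep, a real pickled resource could be loaded once before the streaming loop. A minimal sketch (the file path and the dictionary contents here are made up purely for illustration):

```python
import os
import pickle
import tempfile

def resources_for_many_lines(path):
    # load the pickled object once, before any lines are read
    with open(path, 'rb') as f:
        return pickle.load(f)

# build a throwaway pickle file just to demonstrate the round trip
fd, path = tempfile.mkstemp(suffix='.pkl')
with os.fdopen(fd, 'wb') as f:
    pickle.dump({'weights': [0.1, 0.2, 0.3]}, f)

resource = resources_for_many_lines(path)
print(resource['weights'])
os.remove(path)
```

The key point is that the pickle is deserialized exactly once per mapper process, no matter how many input lines flow through stdin afterwards.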

Answer


This looks fine to me; I would just test it. I often load yaml or sqlite resources in my mappers.

You typically will not be running very many mappers in your job, so even if you spend a couple of seconds loading something from disk, it is usually not a big problem.
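A sqlite-backed variant of the same pattern might look like the sketch below: open the connection once, then do one fast lookup per line. The table name, column names, and the in-memory database are assumptions for illustration only:

```python
import sqlite3

def load_resource(db_path):
    # open the database once, before the streaming loop starts
    return sqlite3.connect(db_path)

def score_a_line(line, conn):
    # one fast indexed lookup per input line
    row = conn.execute("SELECT score FROM scores WHERE key = ?",
                       (line.strip(),)).fetchone()
    return row[0] if row is not None else 0

# demonstrate with an in-memory database standing in for the shared file
conn = load_resource(":memory:")
conn.execute("CREATE TABLE scores (key TEXT PRIMARY KEY, score REAL)")
conn.execute("INSERT INTO scores VALUES ('a', 1.5)")
for line in ["a\n", "missing\n"]:
    print(score_a_line(line, conn))
```

The per-line work stays cheap because the connection (and the OS page cache behind it) is reused across every line the mapper sees.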