2014-09-26 40 views

I'm looking for a good pattern for loading an expensive resource, such as a pickled Python object, on the server. Here is what I came up with; I have tested it by piping input files into the script directly from bash, but I have not yet run it on a Hadoop cluster. For you Hadoop wizards --- am I handling the I/O correctly so that this will work as a Python streaming job? I suppose I will set something up on Amazon to test it, but it would be nice if someone just knows the answer off the top of their head.

Hadoop streaming --- expensive shared resource (COOL)

You can test it by running cat file.txt | the_script or ./a_streaming_program | the_script:

#!/usr/bin/env python 

import sys 
import time 

def resources_for_many_lines(): 
    # load slow, shared resources here 
    # for example, a shared pickle file 

    # in this example we use a 1 second sleep to simulate 
    # a long data load 
    time.sleep(1) 

    # we will pretend the value zero is the product 
    # of our long slow running import 
    resource = 0 

    return resource 

def score_a_line(line, resources): 
    # put fast code to score a single example line here 
    # in this example we will return the value of resource + 1 
    return resources + 1 

def run(): 
    # read stdin and score the model over a streaming data set 
    resources = resources_for_many_lines() 
    while 1: 
        # read a line of input 
        line = sys.stdin.readline() 

        # end if the pipe closes 
        if line == '': 
            break 

        # score the line 
        print score_a_line(line, resources) 
        # print right away instead of waiting for the buffer to fill 
        sys.stdout.flush() 

if __name__ == "__main__": 
    run()
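In place of the 1-second sleep, a real pickled resource could be loaded once before the streaming loop. A minimal sketch (the file path and the dictionary contents here are made up purely for illustration):

```python
import os
import pickle
import tempfile

def resources_for_many_lines(path):
    # load the pickled object once, before any lines are read
    with open(path, 'rb') as f:
        return pickle.load(f)

# build a throwaway pickle file just to demonstrate the round trip
fd, path = tempfile.mkstemp(suffix='.pkl')
with os.fdopen(fd, 'wb') as f:
    pickle.dump({'weights': [0.1, 0.2, 0.3]}, f)

resource = resources_for_many_lines(path)
print(resource['weights'])
os.remove(path)
```

The key point is that the pickle is deserialized exactly once per mapper process, no matter how many input lines flow through stdin afterwards.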

Answer


This looks fine to me; I would just test it. I often load yaml or sqlite resources in my mappers.

You typically will not be running very many mappers in your job, so even if you spend a couple of seconds loading something from disk, it is usually not a big problem.
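A sqlite-backed variant of the same pattern might look like the sketch below: open the connection once, then do one fast lookup per line. The table name, column names, and the in-memory database are assumptions for illustration only:

```python
import sqlite3

def load_resource(db_path):
    # open the database once, before the streaming loop starts
    return sqlite3.connect(db_path)

def score_a_line(line, conn):
    # one fast indexed lookup per input line
    row = conn.execute("SELECT score FROM scores WHERE key = ?",
                       (line.strip(),)).fetchone()
    return row[0] if row is not None else 0

# demonstrate with an in-memory database standing in for the shared file
conn = load_resource(":memory:")
conn.execute("CREATE TABLE scores (key TEXT PRIMARY KEY, score REAL)")
conn.execute("INSERT INTO scores VALUES ('a', 1.5)")
for line in ["a\n", "missing\n"]:
    print(score_a_line(line, conn))
```

The per-line work stays cheap because the connection (and the OS page cache behind it) is reused across every line the mapper sees.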