Hadoop streaming --- expensive shared resource (COOL)

I'm looking for a good pattern for loading an expensive resource once, such as a pickled Python object, on the server side. Here's what I came up with; I've tested it by piping input files and slow-running programs directly into the script in bash, but I haven't run it on a Hadoop cluster yet. For you Hadoop wizards --- am I handling the io correctly so that this will work as a Python streaming job? I guess I'll spin something up on Amazon to test, but it would be nice if someone just knew off the top of their head.
You can test it via:
cat file.txt | the_script
or
./a_streaming_program | the_script
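Once the local pipe test passes, the same script can be submitted as a streaming job. A sketch of the launch command, assuming the script is named the_script.py and the jar/HDFS paths shown here are placeholders for your own:

```shell
# Ship the script to the cluster and use it as the mapper.
# Paths to the streaming jar and to the input/output HDFS
# directories are illustrative and depend on your install.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper the_script.py \
    -file the_script.py
```

Each mapper process pays the expensive load once at startup, then scores every line that Hadoop streams to its stdin.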
#!/usr/bin/env python
import sys
import time


def resources_for_many_lines():
    # Load slow, shared resources here --
    # for example, a shared pickle file.
    # In this example a 1 second sleep simulates a long data load.
    time.sleep(1)
    # We will pretend the value zero is the product
    # of our long, slow import.
    resource = 0
    return resource


def score_a_line(line, resources):
    # Put fast code to score a single example line here.
    # In this example we return the value of resources + 1.
    return resources + 1


def run():
    # Read stdin and score the model over a streaming data set.
    resources = resources_for_many_lines()
    while True:
        # Read a line of input.
        line = sys.stdin.readline()
        # End when the pipe closes.
        if line == '':
            break
        # Score the line.
        print(score_a_line(line, resources))
        # Print right away instead of waiting.
        sys.stdout.flush()


if __name__ == "__main__":
    run()
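The one-second sleep stands in for the real load. A minimal sketch of the same pattern with an actual pickled object; the file contents, the "offset" key, and the per-line scoring rule are all illustrative:

```python
import io
import pickle
import tempfile

# Create an illustrative pickle file standing in for the expensive resource.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump({"offset": 10}, f)
    pickle_path = f.name


def resources_for_many_lines(path):
    # The expensive, once-per-process step: unpickle the shared object.
    with open(path, "rb") as fh:
        return pickle.load(fh)


def score_a_line(line, resources):
    # Fast per-line work: here, line length plus the pickled offset.
    return len(line.rstrip("\n")) + resources["offset"]


resources = resources_for_many_lines(pickle_path)  # paid once
for line in io.StringIO("ab\nxyz\n"):              # stands in for sys.stdin
    print(score_a_line(line, resources))
# prints 12 then 13
```

The key property is unchanged from the script above: the unpickle happens once before the read loop, so its cost is amortized over every line the mapper scores.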