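Python architecture question

I'm creating a distributed crawling Python app. It consists of a master server and associated client apps that run on client servers. The purpose of each client app is to run against a target website and extract specific data. The clients need to go "deep" within the site, behind multiple levels of forms, so each client is geared specifically to one website.

Each client app looks something like this: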
    main:
        parse initial url
        call function level1 (data1)

    function level1 (data1)
        parse the url, for data1
        use the required xpath to get the dom elements
        call the next function
        call level2 (data2)

    function level2 (data2)
        parse the url, for data2
        use the required xpath to get the dom elements
        call the next function
        call level3 (data3)

    function level3 (data3)
        parse the url, for data3
        use the required xpath to get the dom elements
        call the next function
        call level4 (data4)

    function level4 (data4)
        parse the url, for data4
        use the required xpath to get the dom elements

    at the final function..
        --all the data is output, and eventually returned to the server
        --at this point the data has elements from each function...
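
For illustration only, the level chain above could be written as one loop over an explicit work list, so that the number of child calls per level is just data. This is a rough sketch, not the actual client: `crawl`, `extract` and `fake_extract` are hypothetical names, and `extract` stands in for "fetch the page and apply this level's XPath".

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for "fetch a page and run the XPath for this level".
# It maps (level, url) to the child URLs found on that page.
Extractor = Callable[[int, str], List[str]]


def crawl(start_url: str, extract: Extractor, max_level: int = 4) -> List[Dict[str, str]]:
    """Walk level1..max_level depth-first, carrying data from every level."""
    results = []
    # Each stack entry is (level, url, data collected on the way down).
    stack = [(1, start_url, {})]
    while stack:
        level, url, carried = stack.pop()
        data = dict(carried)
        data[f"level{level}"] = url
        if level == max_level:
            # Final level: the record now has an element from each function.
            results.append(data)
            continue
        # The number of children varies per page; each one becomes a new task.
        for child in extract(level, url):
            stack.append((level + 1, child, data))
    return results


def fake_extract(level: int, url: str) -> List[str]:
    """Pretend every page links to exactly two child pages (no network access)."""
    return [url + "/a", url + "/b"]


records = crawl("root", fake_extract)
print(len(records), "records; first:", records[0])
```

Because every pending call is an entry in `stack`, the same entries could later be handed to threads or to other boxes without changing the per-level logic.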
My question: since the number of calls the current function makes to its child function varies, I'm trying to figure out the best approach.
Each function essentially fetches a page of content, and then parses the page using a number of different XPath expressions, combined with different regex expressions depending on the site/page.
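
As a hedged illustration of that fetch-then-parse step, the per-site XPath and regex pairs could be kept in a small rule table. Everything named here (`RULES`, `parse_page`, the sample markup) is made up for the example. It uses the standard library's `xml.etree.ElementTree`, which only supports a limited XPath subset and needs well-formed markup; a real scraper would more likely use a full HTML parser such as lxml.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: an XPath to pick elements, a regex to clean them.
RULES = [
    {"field": "price", "xpath": ".//span[@class='price']", "regex": r"(\d+\.\d{2})"},
    {"field": "sku", "xpath": ".//td[@id='sku']", "regex": r"SKU-(\w+)"},
]


def parse_page(markup, rules=RULES):
    """Apply each rule's XPath, then its regex, and collect the captured values."""
    root = ET.fromstring(markup)
    record = {}
    for rule in rules:
        values = []
        for element in root.findall(rule["xpath"]):
            text = "".join(element.itertext())
            match = re.search(rule["regex"], text)
            if match:
                values.append(match.group(1))
        record[rule["field"]] = values
    return record


# Stand-in for a fetched page; a real client would download this.
SAMPLE = """<html><body>
<table><tr><td id="sku">SKU-A17</td></tr></table>
<span class="price">Price: $12.50</span>
<span class="price">Was $15.00</span>
</body></html>"""

print(parse_page(SAMPLE))
```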
If I run a client on a single box as a sequential process, it'll take a while, but the load on the box is rather small. I've thought of implementing the child functions as threads spawned from the current function, but that could be a nightmare, and could quickly bring the "box" to its knees!
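
One way to keep threads from overwhelming the box is to cap how many run at once with a fixed-size pool, rather than spawning a thread per child call. The sketch below assumes that approach; `fetch_and_parse`, `run_children` and `MAX_WORKERS` are hypothetical, and the sleep simulates waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

MAX_WORKERS = 4  # assumed cap; tune to what the box can take

active = 0       # fetches running right now
peak_active = 0  # highest value `active` ever reached
lock = threading.Lock()


def fetch_and_parse(url):
    """Hypothetical stand-in for one level's fetch + XPath work."""
    global active, peak_active
    with lock:
        active += 1
        peak_active = max(peak_active, active)
    time.sleep(0.01)  # simulates waiting on the network
    with lock:
        active -= 1
    return url.upper()


def run_children(urls):
    """Run all child calls, but never more than MAX_WORKERS at a time."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch_and_parse, urls))


results = run_children([f"u{i}" for i in range(20)])
print(len(results), "pages done; peak concurrency:", peak_active)
```

Since the work is mostly waiting on remote pages, a small pool tends to give most of the speedup while keeping load on the box bounded.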
I've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, so that each client/function could be run directly from the master. This requires a bit of a rewrite, but it has a number of advantages: a good deal of redundancy, and speed. It would also detect if a section of the process was crashing and restart from that point. But I'm not sure it would be any faster...
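
A rough sketch of the master-driven idea, run in a single process for simplicity: the master holds a queue of (level, payload) tasks, and a task that crashes is retried on its own instead of restarting the whole crawl. All names (`run_master`, `handlers`, `MAX_ATTEMPTS`) are assumptions; in a real deployment the `handlers[level](payload)` call would be a message sent to a client box.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit per task


def run_master(start, handlers, max_attempts=MAX_ATTEMPTS):
    """Dispatch (level, payload) tasks; retry a failed task, not the whole crawl.

    handlers[level](payload) returns a list of payloads for level + 1.
    The last level's return values are treated as final results.
    """
    pending = deque([(1, start, 0)])  # (level, payload, attempts so far)
    results, failed = [], []
    last_level = max(handlers)
    while pending:
        level, payload, attempts = pending.popleft()
        try:
            output = handlers[level](payload)
        except Exception:
            # Re-queue only this task, so work resumes from the failing point.
            if attempts + 1 < max_attempts:
                pending.append((level, payload, attempts + 1))
            else:
                failed.append((level, payload))
            continue
        if level == last_level:
            results.extend(output)
        else:
            pending.extend((level + 1, item, 0) for item in output)
    return results, failed


calls = {"count": 0}


def level1(url):
    return [url + "/a", url + "/b"]


def flaky_level2(url):
    """Fails on its first call only, to simulate a crashing section."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated crash")
    return [url + "#done"]


results, failed = run_master("site", {1: level1, 2: flaky_level2})
print(results, failed)
```

Whether this ends up faster depends on how many client boxes pull from the queue; the sketch only shows the retry and resume behaviour.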
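I'm writing the parsing scripts in Python..
So... any thoughts/comments would be appreciated...
I can go into a great deal more detail, but I didn't want to bore anyone!
Thanks!
Tom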
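You might want to remove the "code indentation" from the latter half of your question, since it contains no code. – viksit 2010-04-19 19:52:58
Also, please use capital letters, especially for the personal pronoun (I). If your question is easy to read, you'll get good answers. If your question is hard to read (for example, lowercase 'i' everywhere), people will stop trying to parse it and move on. – 2010-04-19 20:06:46
Seriously, why delete this? Valid question, valid answers. How about choosing the answer that helped you most? – Will 2011-01-14 18:26:14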