2017-08-01 144 views

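Trying Scrapy + Splash

I've been playing with Scrapy and Splash and have run into some problems. When I ran my spider I kept getting HTTP errors, so I tried looking at Splash directly in a browser. First I ran "sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600 -v3" to start Splash, then I went to localhost:8050. The web UI opened correctly and I could enter code. Here is the basic function I tried to run: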

function main(splash, args)
    assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
    splash.resource_timeout = 30.0
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        --png = splash:png(),
        --har = splash:har(),
    }
end

I tried to render http://boingboing.net/blog with this function and got an "invalid_hostname" Lua error. Here is the log:

2017-08-01 18:26:28+0000 [-] Log opened. 
2017-08-01 18:26:28.077457 [-] Splash version: 3.0 
2017-08-01 18:26:28.077838 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2 
2017-08-01 18:26:28.077900 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609] 
2017-08-01 18:26:28.077984 [-] Open files limit: 65536 
2017-08-01 18:26:28.078046 [-] Can't bump open files limit 
2017-08-01 18:26:28.180376 [-] Xvfb is started: ['Xvfb', ':1937726875', '-screen', '0', '1024x768x24', '-nolisten', 'tcp'] 
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root' 
2017-08-01 18:26:28.226937 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles 
2017-08-01 18:26:28.301002 [-] verbosity=3 
2017-08-01 18:26:28.301116 [-] slots=50 
2017-08-01 18:26:28.301202 [-] argument_cache_max_entries=500 
2017-08-01 18:26:28.301530 [-] Web UI: enabled, Lua: enabled (sandbox: enabled) 
2017-08-01 18:26:28.302122 [-] Site starting on 8050 
2017-08-01 18:26:28.302219 [-] Starting factory <twisted.web.server.Site object at 0x7ffa08390dd8> 
2017-08-01 18:26:32.660457 [-] "172.17.0.1" - - [01/Aug/2017:18:26:32 +0000] "GET / HTTP/1.1" 200 7677 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:27:18.860020 [-] "172.17.0.1" - - [01/Aug/2017:18:27:18 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5656 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:27:19.038565 [pool] initializing SLOT 0 
libpng warning: iCCP: known incorrect sRGB profile 
libpng warning: iCCP: known incorrect sRGB profile 
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text 
See the manual page for dbus-uuidgen to correct this issue. 
2017-08-01 18:27:19.066765 [render] [140711856519656] viewport size is set to 1024x768 
2017-08-01 18:27:19.066964 [pool] [140711856519656] SLOT 0 is starting 
2017-08-01 18:27:19.067071 [render] [140711856519656] function main(splash, args)\r\n assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend 
2017-08-01 18:27:19.070107 [render] [140711856519656] [lua_runner] dispatch cmd_id=__START__ 
2017-08-01 18:27:19.070270 [render] [140711856519656] [lua_runner] arguments are for command __START__, waiting for result of __START__ 
2017-08-01 18:27:19.070352 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=() 
2017-08-01 18:27:19.070424 [render] [140711856519656] [lua_runner] send None 
2017-08-01 18:27:19.070496 [render] [140711856519656] [lua_runner] send (lua) None 
2017-08-01 18:27:19.070657 [render] [140711856519656] [lua_runner] got AsyncBrowserCommand(id=None, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'}) 
2017-08-01 18:27:19.070755 [render] [140711856519656] [lua_runner] instructions used: 70 
2017-08-01 18:27:19.070834 [render] [140711856519656] [lua_runner] executing AsyncBrowserCommand(id=0, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'}) 
2017-08-01 18:27:19.071141 [network] [140711856519656] GET https://code.jquery.com/jquery-3.1.1.min.js 
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method 
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method 
2017-08-01 18:27:19.082150 [pool] [140711856519656] SLOT 0 is working 
2017-08-01 18:27:19.082298 [pool] [140711856519656] queued 
2017-08-01 18:28:39.151814 [network-manager] Download error 3: the remote host name was not found (invalid hostname) (https://code.jquery.com/jquery-3.1.1.min.js) 
2017-08-01 18:28:39.152087 [network-manager] Finished downloading https://code.jquery.com/jquery-3.1.1.min.js 
2017-08-01 18:28:39.152202 [render] [140711856519656] [lua_runner] dispatch cmd_id=0 
2017-08-01 18:28:39.152268 [render] [140711856519656] [lua_runner] arguments are for command 0, waiting for result of 0 
2017-08-01 18:28:39.152339 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'invalid_hostname'),) 
2017-08-01 18:28:39.152400 [render] [140711856519656] [lua_runner] send PyResult('return', None, 'invalid_hostname') 
2017-08-01 18:28:39.152468 [render] [140711856519656] [lua_runner] send (lua) (b'return', None, b'invalid_hostname') 
2017-08-01 18:28:39.152582 [render] [140711856519656] [lua_runner] instructions used: 79 
2017-08-01 18:28:39.152642 [render] [140711856519656] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:2: invalid_hostname',) 
2017-08-01 18:28:39.152816 [pool] [140711856519656] SLOT 0 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'invalid_hostname', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:2: invalid_hostname', 'line_number': 2} 
    /app/splash/browser_tab.py:1180:_return_reply 
    /app/splash/qtrender_lua.py:901:callback 
    /app/splash/lua_runner.py:27:return_result 
    /app/splash/qtrender.py:17:stop_on_error_wrapper 
    --- <exception caught here> --- 
    /app/splash/qtrender.py:15:stop_on_error_wrapper 
    /app/splash/qtrender_lua.py:2257:dispatch 
    /app/splash/lua_runner.py:195:dispatch 
    ] 
2017-08-01 18:28:39.152883 [pool] [140711856519656] SLOT 0 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48> 
2017-08-01 18:28:39.152944 [render] [140711856519656] [splash] clearing 0 objects 
2017-08-01 18:28:39.153026 [render] [140711856519656] close is requested by a script 
2017-08-01 18:28:39.153304 [render] [140711856519656] cancelling 0 remaining timers 
2017-08-01 18:28:39.153374 [pool] [140711856519656] SLOT 0 done with <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48> 
2017-08-01 18:28:39.153997 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "invalid_hostname", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: invalid_hostname", "line_number": 2}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 107916, "qsize": 0, "path": "/execute", "timestamp": 1501612119, "fds": 18, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend", "resource_timeout": 0, "uid": 140711856519656, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 80.11527562141418, "method": "POST", "_id": 140711856519656, "load": [0.46, 0.51, 0.54]} 
2017-08-01 18:28:39.154127 [-] "172.17.0.1" - - [01/Aug/2017:18:28:38 +0000] "POST /execute HTTP/1.1" 400 325 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:28:39.154237 [pool] SLOT 0 is available 

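If I try again without loading jQuery first, I get a "network5" Lua error instead (the log shows the request being aborted when the 30-second resource_timeout expires). The log for that run follows: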

2017-08-01 18:31:07.110255 [-] "172.17.0.1" - - [01/Aug/2017:18:31:06 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5658 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:31:07.489653 [pool] initializing SLOT 1 
2017-08-01 18:31:07.490576 [render] [140711856961016] viewport size is set to 1024x768 
2017-08-01 18:31:07.490692 [pool] [140711856961016] SLOT 1 is starting 
2017-08-01 18:31:07.490829 [render] [140711856961016] function main(splash, args)\r\n --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend 
2017-08-01 18:31:07.493641 [render] [140711856961016] [lua_runner] dispatch cmd_id=__START__ 
2017-08-01 18:31:07.493782 [render] [140711856961016] [lua_runner] arguments are for command __START__, waiting for result of __START__ 
2017-08-01 18:31:07.493865 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=() 
2017-08-01 18:31:07.493937 [render] [140711856961016] [lua_runner] send None 
2017-08-01 18:31:07.494010 [render] [140711856961016] [lua_runner] send (lua) None 
2017-08-01 18:31:07.494270 [render] [140711856961016] [lua_runner] got AsyncBrowserCommand(id=None, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'}) 
2017-08-01 18:31:07.494416 [render] [140711856961016] [lua_runner] instructions used: 166 
2017-08-01 18:31:07.494502 [render] [140711856961016] [lua_runner] executing AsyncBrowserCommand(id=0, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'}) 
2017-08-01 18:31:07.494576 [render] [140711856961016] HAR event: _onStarted 
2017-08-01 18:31:07.494697 [render] [140711856961016] callback 0 is connected to loadFinished 
2017-08-01 18:31:07.495031 [network] [140711856961016] GET http://boingboing.net/blog 
2017-08-01 18:31:07.495617 [pool] [140711856961016] SLOT 1 is working 
2017-08-01 18:31:07.495741 [pool] [140711856961016] queued 
2017-08-01 18:31:37.789845 [network-manager] timed out, aborting: http://boingboing.net/blog 
2017-08-01 18:31:37.790154 [network-manager] Finished downloading http://boingboing.net/blog 
2017-08-01 18:31:37.791064 [render] [140711856961016] mainFrame().urlChanged http://boingboing.net/blog 
2017-08-01 18:31:37.796078 [render] [140711856961016] mainFrame().initialLayoutCompleted 
2017-08-01 18:31:37.796343 [render] [140711856961016] loadFinished: RenderErrorInfo(type='Network', code=5, text='Operation canceled', url='http://boingboing.net/blog') 
2017-08-01 18:31:37.796420 [render] [140711856961016] loadFinished: disconnecting callback 0 
2017-08-01 18:31:37.796518 [render] [140711856961016] [lua_runner] dispatch cmd_id=0 
2017-08-01 18:31:37.796576 [render] [140711856961016] [lua_runner] arguments are for command 0, waiting for result of 0 
2017-08-01 18:31:37.796640 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'network5'),) 
2017-08-01 18:31:37.796699 [render] [140711856961016] [lua_runner] send PyResult('return', None, 'network5') 
2017-08-01 18:31:37.796765 [render] [140711856961016] [lua_runner] send (lua) (b'return', None, b'network5') 
2017-08-01 18:31:37.796883 [render] [140711856961016] [lua_runner] instructions used: 175 
2017-08-01 18:31:37.796943 [render] [140711856961016] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:5: network5',) 
2017-08-01 18:31:37.797093 [pool] [140711856961016] SLOT 1 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'network5', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:5: network5', 'line_number': 5} 
    /app/splash/browser_tab.py:533:_on_content_ready 
    /app/splash/qtrender_lua.py:702:error 
    /app/splash/lua_runner.py:27:return_result 
    /app/splash/qtrender.py:17:stop_on_error_wrapper 
    --- <exception caught here> --- 
    /app/splash/qtrender.py:15:stop_on_error_wrapper 
    /app/splash/qtrender_lua.py:2257:dispatch 
    /app/splash/lua_runner.py:195:dispatch 
    ] 
2017-08-01 18:31:37.797158 [pool] [140711856961016] SLOT 1 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828> 
2017-08-01 18:31:37.797217 [render] [140711856961016] [splash] clearing 0 objects 
2017-08-01 18:31:37.797310 [render] [140711856961016] close is requested by a script 
2017-08-01 18:31:37.797430 [render] [140711856961016] cancelling 0 remaining timers 
2017-08-01 18:31:37.797491 [pool] [140711856961016] SLOT 1 done with <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828> 
2017-08-01 18:31:37.798067 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "network5", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network5", "line_number": 5}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 113372, "qsize": 0, "path": "/execute", "timestamp": 1501612297, "fds": 21, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n --assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend", "resource_timeout": 0, "uid": 140711856961016, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 30.308406591415405, "method": "POST", "_id": 140711856961016, "load": [0.39, 0.42, 0.49]} 
2017-08-01 18:31:37.798190 [-] "172.17.0.1" - - [01/Aug/2017:18:31:37 +0000] "POST /execute HTTP/1.1" 400 309 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:31:37.798294 [pool] SLOT 1 is available 

If I also comment out the resource_timeout line, I get a "network3" Lua error (again an invalid hostname, just reported differently this time).
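For reference, the three error names above can be decoded with a small helper. This is only a sketch: it assumes that the number in Splash's "networkN" errors is Qt's QNetworkReply::NetworkError code, which matches the logs above (code 3 is reported as "the remote host name was not found" and code 5 as "Operation canceled"). The function name describe_splash_error is made up for illustration.

```python
import re

# Subset of Qt's QNetworkReply::NetworkError values.
# Assumption: Splash's "networkN" error names carry these codes.
QT_NETWORK_ERRORS = {
    1: "connection refused",
    2: "remote host closed the connection",
    3: "host not found",
    4: "timeout",
    5: "operation canceled",
    6: "SSL handshake failed",
}


def describe_splash_error(name):
    """Return a short human-readable hint for a Splash Lua error string."""
    # splash:autoload reported the DNS failure under this name in the log.
    if name == "invalid_hostname":
        return "host not found"
    match = re.fullmatch(r"network(\d+)", name)
    if match:
        code = int(match.group(1))
        return QT_NETWORK_ERRORS.get(code, "unknown network error %d" % code)
    return "not a network error: " + name
```

Read this way, "invalid_hostname" and "network3" are the same failure (name resolution), and "network5" is the request being canceled once resource_timeout expires.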

Any idea what I'm doing wrong?

Answers

0

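It turns out this wasn't a Scrapy/Splash problem at all; it was a Docker/IP routing/network administration problem. The network administrator had set things up so that I could only make HTTP requests through a specific destination. Adding "--net=host" to my docker run command seems to have solved the problem. This webpage was very helpful.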
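A quick way to check whether name resolution is the culprit is to try resolving the hosts from the same network environment Splash runs in. The snippet below is a minimal sketch using only the Python standard library; the helper name can_resolve is made up for illustration, and the result is only meaningful if you run it inside the container (or another container on the same Docker network), not just on the host.

```python
import socket


def can_resolve(hostname, port=443):
    """Return True if the local resolver finds an address for hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved (no DNS, blocked, typo...).
        return False


# Example usage (performs real DNS lookups, so it is left commented out):
#   for host in ("code.jquery.com", "boingboing.net"):
#       print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

If the hosts resolve on the host machine but not inside the container, the problem is in the Docker network setup rather than in the Lua script, which is consistent with "--net=host" making the errors go away.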

0

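Try changing

function main(splash, args)
    ...
    assert(splash:go(args.url))
    ...

to

function main(splash)
    ...
    assert(splash:go(splash.args.url))
    ...

At least, that is how the default script read when I opened the Splash UI on port 8050 at the time. With this change, your script works for me.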
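To try the modified script outside the web UI, it can be posted to Splash's /execute endpoint as JSON. The following is a rough sketch, not a tested client: it assumes Splash is listening on http://localhost:8050 as in the question, the helper name build_execute_request is made up, and the Lua source is a trimmed version of the script above without the jQuery autoload.

```python
import json
import urllib.request

# Trimmed version of the script from the question, using splash.args.url.
LUA_SOURCE = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    return {html = splash:html()}
end
"""


def build_execute_request(target_url, splash_base="http://localhost:8050"):
    """Build (but do not send) a POST request for Splash's /execute endpoint."""
    # Extra keys such as "url" become available to the script as splash.args.
    payload = {"lua_source": LUA_SOURCE, "url": target_url}
    return urllib.request.Request(
        splash_base + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (needs a running Splash instance):
#   req = build_execute_request("http://boingboing.net/blog")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read().decode("utf-8"))["html"][:200])
```

Building the request separately from sending it makes it easy to inspect exactly what Splash will receive before involving the network.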