2015-08-08 63 views
0

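Mechanize does not load the full web page after the search form is submitted

Why am I not getting the products on [this][1] page when I search for " " (a single space) in the search form? I can only see the menu, not the search result products.

Ruby code: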

require 'nokogiri' 
require 'mysql2' 
require 'logger' 
require 'mechanize' 
agent = Mechanize.new{|a| a.log = Logger.new(STDERR) } 
agent.user_agent_alias = 'Windows Mozilla' 
agent.read_timeout = 60 
def add_cookie(agent, uri, cookie) 
  uri = URI.parse(uri) 
  Mechanize::Cookie.parse(uri, cookie) do |cookie| 
    agent.cookie_jar.add(uri, cookie) 
  end 
end 
login_page = agent.get "http://www.example.com.mx/login.php?location=%2F" 
login_form = login_page.form_with(:method => 'POST') 
email_field = login_form.field_with(name: "correo_ingresar") 
password_field = login_form.field_with(name: "password") 
email_field.value = '[email protected]' 
password_field.value = 'password' 
home_page = login_form.submit 
myarray = home_page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/) 
myarray.each{|line| add_cookie agent, 'http://www.example.com.mx', "#{line[0]}=#{line[1]}"} 
add_cookie(agent, 'http://www.example.com.mx', "forzar_existencias=1; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "articulos_mostrar=50; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "forz_existencias=1; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "no_actualiza=1; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "orden_mostrar=8; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "page=1; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "precio_inicio=0; path=/; domain=www.example.com.mx") 
add_cookie(agent, 'http://www.example.com.mx', "location=%2Farticulos.php%3Fbuscar%3D%2B; path=/; domain=www.example.com.mx") 

search_form = home_page.forms.first 
search_field = search_form.field_with(name: "buscar") 
search_field.value = ' ' 
search_results = search_form.submit 
resultados = 'http://example.com.mx/articulos.php?buscar=+' 

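I installed the Live HTTP Headers add-on for Firefox, along with Firebug. When I enter a single space and click the search button on the [web page][1], I get the following output in Live HTTP Headers.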

http://example.com.mx/articulos.php?buscar=+ 

GET /articulos.php?buscar=+ HTTP/1.1 
Host: example.com.mx 
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 
Accept-Language: en-US,en;q=0.5 
Accept-Encoding: gzip, deflate 
Referer: http://example.com.mx/articulos.php?buscar=+ 
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1 
Connection: keep-alive 

HTTP/1.1 200 OK 
Date: Sat, 08 Aug 2015 04:29:40 GMT 
Server: Apache 
x-powered-by: PHP/5.4.30 
Cache-Control: no-cache, no-store, must-revalidate 
Pragma: no-cache 
Expires: 0 
Keep-Alive: timeout=5, max=100 
Connection: Keep-Alive 
Transfer-Encoding: chunked 
Content-Type: text/html 
---------------------------------------------------------- 
http://www.google-analytics.com/collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832 

GET /collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832 HTTP/1.1 
Host: www.google-analytics.com 
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0 
Accept: image/png,image/*;q=0.8,*/*;q=0.5 
Accept-Language: en-US,en;q=0.5 
Accept-Encoding: gzip, deflate 
Referer: http://example.com.mx/articulos.php?buscar=+ 
Connection: keep-alive 

HTTP/1.1 200 OK 
Pragma: no-cache 
Expires: Mon, 07 Aug 1995 23:30:00 GMT 
Access-Control-Allow-Origin: * 
Last-Modified: Sun, 17 May 1998 03:00:00 GMT 
x-content-type-options: nosniff 
Content-Type: image/gif 
Date: Wed, 29 Jul 2015 12:33:33 GMT 
Server: Golfe2 
Content-Length: 35 
Age: 834969 
Alternate-Protocol: 80:quic,p=0 
Cache-Control: private, no-cache, no-cache=Set-Cookie, proxy-revalidate 
---------------------------------------------------------- 
http://example.com.mx/resultados.php 

POST /resultados.php HTTP/1.1 
Host: example.com.mx 
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0 
Accept: */* 
Accept-Language: en-US,en;q=0.5 
Accept-Encoding: gzip, deflate 
Content-Type: application/x-www-form-urlencoded; charset=UTF-8 
X-Requested-With: XMLHttpRequest 
Referer: http://example.com.mx/articulos.php?buscar=+ 
Content-Length: 204 
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1 
Connection: keep-alive 
Pragma: no-cache 
Cache-Control: no-cache 

opcion=&buscar=+&page=1&articulos_mostrar=10&orden_mostrar=1&seccion=&linea=&sublinea=&forz_existencias=1&precio_inicio=0&precio_final=20000&location=%252Farticulos.php%253Fbuscar%253D%252B&no_actualiza=1 

HTTP/1.1 200 OK 
Date: Sat, 08 Aug 2015 04:29:42 GMT 
Server: Apache 
x-powered-by: PHP/5.4.30 
Keep-Alive: timeout=5, max=99 
Connection: Keep-Alive 
Transfer-Encoding: chunked 
Content-Type: text/html 
---------------------------------------------------------- 

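The question is: how do I get the full list of products to show up on the page so that I can start scraping? Even when the request carries a referer, the products are not loaded automatically. [This][2] is the generated HTML: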

+0

Stepping back: what are you actually interested in? Product prices? Ordering things automatically? – Felix

+0

I am interested in getting all of the products – ingalcala

Answer

0

Here are two solutions, but only one of them uses POST, which is what you asked about in your question:

require 'mechanize' 

agent = Mechanize.new 
agent.get("http://www.sistemasaplicados.com.mx/") 
agent.page.forms.first.field_with(name: "buscar").value = ' ' 
result_page = agent.page.forms.first.submit 

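The other option is to encode your search term and use it directly in a simple GET request (encoded in the URL) with nokogiri. In your specific case, searching for "160GB" leads to the URL http://www.sistemasaplicados.com.mx/articulos.php?buscar=160GB , which you can fetch with a plain GET.

By the way, you don't necessarily need Mechanize for all of this, unless you want to automatically place orders on your account or something similar. I assume you are doing this for the benefit of sistemasaplicados; otherwise I would consider it impolite, and it would bring you bad karma.

Update: When checking manually what is going on, first look at what happens when JavaScript is disabled (in this case, there are no results). Then look at what happens using the browser's "Inspector", "Console" or "Developer Tools" (usually opened by pressing F12). In your case, a POST request to resultados.php is made. I found this in the Firefox developer tools, on the "Network" tab. You can also find the relevant parameters in that POST request.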
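As a rough sketch of the GET variant, the search term can be form-encoded into the query string with Ruby's standard library. The helper names `build_search_url` and `fetch_search_page` are made up for this illustration, and the sketch uses `Net::HTTP` rather than Mechanize so that it has no gem dependencies. Whether this page alone contains the products is a separate matter.

```ruby
require 'net/http'
require 'uri'

# Base URL taken from the example in this answer.
SEARCH_BASE = 'http://www.sistemasaplicados.com.mx/articulos.php'

# Hypothetical helper: form-encode the term into the "buscar" query parameter.
def build_search_url(term)
  "#{SEARCH_BASE}?#{URI.encode_www_form(buscar: term)}"
end

# Hypothetical helper: fetch the raw HTML with a plain GET (no cookies, no JavaScript).
def fetch_search_page(term)
  Net::HTTP.get(URI(build_search_url(term)))
end

build_search_url('160GB') # => "http://www.sistemasaplicados.com.mx/articulos.php?buscar=160GB"
build_search_url(' ')     # => "http://www.sistemasaplicados.com.mx/articulos.php?buscar=+"
```

The returned HTML string can then be handed to a parser, for example `Nokogiri::HTML(fetch_search_page('160GB'))`, assuming the nokogiri gem is installed.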
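To make that concrete, here is a minimal sketch that rebuilds the XHR request seen in the captured headers from the question. The parameter names and values are copied from that capture; the helper names (`results_params`, `build_results_request`, `fetch_results_html`) are invented for this example, and using `Net::HTTP` instead of Mechanize is an assumption made to keep the request explicit. It is untested against the live site, and a logged-in session would presumably also need its cookies sent along.

```ruby
require 'net/http'
require 'uri'

# Endpoint and referer as they appear in the captured request.
RESULTS_URL = URI('http://example.com.mx/resultados.php')

# Form fields copied from the captured POST body; only term and page vary.
def results_params(term, page: 1)
  search_path = "/articulos.php?#{URI.encode_www_form(buscar: term)}"
  {
    'opcion' => '', 'buscar' => term, 'page' => page.to_s,
    'articulos_mostrar' => '10', 'orden_mostrar' => '1',
    'seccion' => '', 'linea' => '', 'sublinea' => '',
    'forz_existencias' => '1', 'precio_inicio' => '0', 'precio_final' => '20000',
    # The capture shows this value percent-encoded twice on the wire, so it is
    # encoded once here and once more by set_form_data.
    'location' => URI.encode_www_form_component(search_path),
    'no_actualiza' => '1'
  }
end

# Build (but do not send) the POST request with the XHR headers from the capture.
def build_results_request(term, page: 1)
  req = Net::HTTP::Post.new(RESULTS_URL.request_uri)
  req['X-Requested-With'] = 'XMLHttpRequest'
  req['Referer'] = "http://example.com.mx/articulos.php?#{URI.encode_www_form(buscar: term)}"
  req.set_form_data(results_params(term, page: page))
  req
end

# Send the request and return the HTML fragment containing the products.
def fetch_results_html(term, page: 1)
  Net::HTTP.start(RESULTS_URL.host, RESULTS_URL.port) do |http|
    http.request(build_results_request(term, page: page)).body
  end
end
```

With Mechanize, the same hash can be passed to `agent.post(url, params, headers)` instead, which has the advantage of reusing the cookies already stored in the agent after login.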

+0

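Hi Felix, I did contact the site's webmaster, and there is no problem with doing this. The purpose is to update my e-commerce store with their in-stock products – ingalcala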

+0

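I am already running the search, but the output I get only shows the menu and the filter menu on the left side of the page. How do I get the products to appear? – ingalcala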