2012-08-01

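I have an XPath query whose results I collect into an array and output using Axlsx. I need to tidy up my output under certain conditions, one of which is the 'Software included' entry. How can I check whether an array element exists and, if it does, change its output?

My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1

My code sample is below: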

clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'

selector = "//td[text()='%s']/following-sibling::td"

data = clues.map do |clue|
    xpath = selector % clue
    [clue, doc.at(xpath).text.strip]
end

Axlsx::Package.new do |p|
    p.workbook.add_worksheet do |sheet|
        data.each { |datum| sheet.add_row datum }
    end
    p.serialize 'output.xlsx'
end

My current output format:

[image not preserved]

My desired output format:

[image not preserved]

Answer

If you can rely on the data always using ';' as a delimiter, give this a go:

data = [] 
clues.each do |clue| 
    xpath = selector % clue 
    details = doc.at(xpath).text.strip.split(';') 
    data << [clue, details.pop] 
    details.each { |detail| data << ['', detail] } 
end 

This generates the data; it goes before the Axlsx::Package.new block.
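The splitting step can also be tried on its own, without a network request or a spreadsheet. The following is a minimal sketch, not part of the original answer: `rows_for` is a hypothetical helper name, and the sample text ('Item A; Item B ;Item C') is made up for illustration. Unlike the snippet above, which uses `pop` and so puts the last detail on the first row, this version keeps the details in their source order and tolerates a missing cell.

```ruby
# Hypothetical helper (not part of Axlsx or Nokogiri): turns one clue and its
# raw cell text into spreadsheet rows, one row per ';'-separated detail.
def rows_for(clue, text)
    # to_s lets a nil text (for example, when no element was found) become ''.
    details = text.to_s.split(';').map(&:strip).reject(&:empty?)
    return [[clue, '']] if details.empty?

    # The clue name goes on the first row only; later rows leave column one blank.
    details.each_with_index.map { |detail, i| [i.zero? ? clue : '', detail] }
end

# Made-up sample text, standing in for doc.at(xpath).text
rows_for('Software included', 'Item A; Item B ;Item C').each { |row| p row }
```

Each returned row can be passed straight to `sheet.add_row`, as in the question's Axlsx block.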

In answer to your comment/question: you would have to do something like this ;)

require 'rubygems' 
require 'nokogiri' 
require 'open-uri' 
require 'axlsx' 

class Scraper

    def initialize(url, selector)
        @url = url
        @selector = selector
    end

    # Custom parsers, keyed by clue name.
    def hooks
        @hooks ||= {}
    end

    def add_hook(clue, p_roc)
        hooks[clue] = p_roc
    end

    # Build the rows (the clue name appears only on the first row of each group),
    # then write the spreadsheet.
    def export(file_name)
        Scraper.clues.each do |clue|
            if detail = parse_clue(clue)
                output << [clue, detail.pop]
                detail.each { |datum| output << ['', datum] }
            end
        end
        serialize(file_name)
    end

    private

    def self.clues
        @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                    'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
                    'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                    'Warranty', 'Software included', 'Product color']
    end

    def doc
        @doc ||= begin
            # URI.open comes from open-uri; Kernel#open no longer accepts URLs on Ruby 3.0+.
            Nokogiri::HTML(URI.open(@url))
        rescue
            raise ArgumentError, 'Invalid URL - Nothing to parse'
        end
    end

    def output
        @output ||= []
    end

    def selector_for_clue(clue)
        @selector % clue
    end

    # Returns an array of details for the clue, or nil when no element matches.
    def parse_clue(clue)
        if element = doc.at(selector_for_clue(clue))
            # map, not each: each would return the original, unstripped strings.
            call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
        end
    end

    # Runs the hook registered for the clue, if any, and always returns an array.
    def call_hook(clue, element)
        if hooks[clue].is_a? Proc
            value = hooks[clue].call(element)
            value.is_a?(Array) ? value : [value]
        end
    end

    def package
        @package ||= Axlsx::Package.new
    end

    def serialize(file_name)
        package.workbook.add_worksheet do |sheet|
            output.each { |datum| sheet.add_row datum }
        end
        package.serialize(file_name)
    end
end

scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td") 

# define a custom action to take against any elements found. 
os_parse = Proc.new do |element| 
    element.inner_html.split('<br>').each(&:strip!).each(&:upcase!) 
end 

scraper.add_hook('Operating system', os_parse) 

scraper.export('foo.xlsx') 
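The same hook mechanism covers the 'Software included' case from the question. The sketch below is an illustration rather than part of the original answer: `SOFTWARE_PARSE` and `FakeElement` are hypothetical names, `FakeElement` is only a stand-in that responds to `inner_html` so the hook can be tried without Nokogiri or a network request, and the sample HTML is made up.

```ruby
# Stand-in for a Nokogiri element: the hook only needs #inner_html.
FakeElement = Struct.new(:inner_html)

# Hook for 'Software included': split on any <br> variant and trim each entry.
SOFTWARE_PARSE = Proc.new do |element|
    element.inner_html.split(/<br\s*\/?>/i).map(&:strip).reject(&:empty?)
end

# Made-up sample markup, standing in for the scraped table cell.
sample = FakeElement.new(' Item A<br>Item B <br/>Item C')
p SOFTWARE_PARSE.call(sample)
```

With the Scraper class above, the hook would be registered with `scraper.add_hook('Software included', SOFTWARE_PARSE)` before calling `export`; any other per-clue cleanup can go inside the Proc.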

And the final answer is... a gem.

http://rubydoc.info/gems/ninja2k/0.0.2/frames

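Hi Randy, unfortunately the ';' was added manually by me. Is there a way to perform an operation such as array.element = "Software includes" do function {}? – Ninja2k 2012-08-02 06:54:16

I've edited the answer to show one way it could be done. – randym 2012-08-02 09:02:47

Damn, that's almost a whole new gem :P Thank you, but it's over my head. Is there a way to make it really simple? Like a 4-line operation? – Ninja2k 2012-08-02 11:55:41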
