Python HTML - Get element by attribute

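I frequently read a music website that has a section where users can post their own fictional, music-related stories. There is a 91-part series (written over a period of time and uploaded in parts) whose URLs always follow this convention: http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.

I want to be able to get the formatted text from each part and put it into a single HTML file.

Conveniently, there is a link to a print version, which is formatted correctly for my purposes. All I would have to do is write a script that downloads all the parts and dumps them into a file. Not hard.

Unfortunately, the URL of the print version is: www.ultimate-guitar.com/print.php?what=article&id=95932

The only way to know which article corresponds to which ID is to look at the value attribute of a certain input tag in the original article.

What I want to do is this: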

Go to each page, incrementing through the varying numbers.

Find the <input> tag with the attribute name="rowid" and get the number in its value attribute.

Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html>, <head> and <body>) to an HTML file.

Rinse and repeat.

Is this possible? Is Python the right language? Also, which DOM/HTML/XML library should I use?

Thanks for any help.

Source

2012-02-26 Matt Hood

You can actually do this in JavaScript/jQuery without much trouble. JavaScript-style pseudocode, appending to an empty document:

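// Base URLs taken from the question; the part number or print ID is appended.
var url = "http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_";
var printUrl = "http://www.ultimate-guitar.com/print.php?what=article&id=";

for (var pageNum = 1; pageNum <= 91; pageNum++) {
    $.ajax({
        url: url + pageNum + ".html",
        async: false,  // synchronous, so the parts are appended in order
        success: function(page) {
            // Parse the downloaded part and read the rowid from it,
            // not from the current document.
            var partDoc = new DOMParser().parseFromString(page, "text/html");
            var printId = $(partDoc).find('input[name="rowid"]').val();
            $.ajax({
                url: printUrl + printId,
                async: false,
                success: function(data) {
                    // Append only the contents of the print version's <body>.
                    var printDoc = new DOMParser().parseFromString(data, "text/html");
                    $('body').append($(printDoc.body).contents());
                }
            });
        }
    });
}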

Once loading has finished, you can save the resulting HTML to a file.

Source

2012-02-26 03:19:15 beerbajay

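This would be considered cross-domain and will not work, because of browser security restrictions. – Vigrond 2012-02-26 03:51:59

Correct. It would work as a Greasemonkey script with a few modifications. – beerbajay 2012-02-26 05:04:34

With lxml and urllib2: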

import lxml.html
import urllib2  # Python 2 only

# Implement the logic to download each part first, storing the HTML strings
# in a sequence named pages.
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"

for page in pages:
    html = lxml.html.fromstring(page)
    # The print ID is the value attribute of <input name="rowid">.
    ID = html.find(".//input[@name='rowid']").value
    article = urllib2.urlopen(url % ID).read()
    article_html = lxml.html.fromstring(article)
    # Save the text content of <body> as, for example, 95932.html.
    with open(ID + ".html", "w") as html_file:
        html_file.write(article_html.find(".//body").text_content())

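Edit: after running this, it seems some pages may contain Unicode characters. One way to deal with this is article = article.encode("ascii", "ignore"), or chaining the encode method after .read(), which forces ASCII and ignores Unicode, although that is a lazy fix.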
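As an alternative to dropping characters, the downloaded bytes can be decoded explicitly and written back with a fixed encoding. The following is only a sketch for Python 3 (where urllib2 became urllib.request). It assumes the pages are UTF-8 whenever the server does not declare a charset, which has not been checked against the site, and decode_page and save_text are hypothetical helper names.

```python
def decode_page(raw, charset=None):
    """Turn the bytes returned by .read() into text without dropping characters."""
    # Assumption: pages are UTF-8 unless a charset is passed in.
    # Bytes that cannot be decoded become U+FFFD instead of raising an error.
    return raw.decode(charset or "utf-8", errors="replace")


def save_text(path, text):
    """Write text to path as UTF-8, whatever the platform default encoding is."""
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(text)


# Possible use with Python 3's urllib.request (needs network access):
#   from urllib.request import urlopen
#   with urlopen(url % ID) as response:
#       charset = response.headers.get_content_charset()
#       article = decode_page(response.read(), charset)
```

With this approach, accented letters and other non-ASCII characters in the stories survive the round trip instead of being silently removed.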

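The lxml code assumes you only want the text content of everything inside the body tag. It saves the files in the same local directory as the Python file, named storyID.html (for example "95932.html"). Change the save semantics if you like.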
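If the markup should be kept rather than only the text, and everything should end up in one document as the question asks, the steps could be sketched roughly as follows. This is a Python 3 sketch that swaps in the standard library's html.parser for lxml, so it is not the code from this answer. The names RowIdFinder, BodyExtractor, find_rowid, body_markup and collect_story are hypothetical, and fetch is an assumed callable that takes a URL and returns the page as text; it is passed in so the logic can be tried without network access.

```python
from html.parser import HTMLParser

# URL patterns as given in the question.
PART_URL = "http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_%d.html"
PRINT_URL = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"


class RowIdFinder(HTMLParser):
    """Remember the value of the first <input name="rowid"> tag."""

    def __init__(self):
        super().__init__()
        self.rowid = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "rowid" and self.rowid is None:
            self.rowid = attributes.get("value")


class BodyExtractor(HTMLParser):
    """Re-emit the markup found between <body> and </body>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.inside = True
        elif self.inside:
            self.chunks.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if self.inside:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.inside = False
        elif self.inside:
            self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.inside:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.inside:
            self.chunks.append("&#%s;" % name)


def find_rowid(page_html):
    """Return the rowid value as a string, or None if the page has none."""
    finder = RowIdFinder()
    finder.feed(page_html)
    finder.close()
    return finder.rowid


def body_markup(article_html):
    """Return everything inside <body>, without the <body> tags themselves."""
    extractor = BodyExtractor()
    extractor.feed(article_html)
    extractor.close()
    return "".join(extractor.chunks)


def collect_story(fetch, first=1, last=91):
    """Fetch each part, follow its rowid to the print version, join the bodies."""
    parts = []
    for number in range(first, last + 1):
        rowid = find_rowid(fetch(PART_URL % number))
        if rowid is None:
            continue  # no rowid input found on this part; skip it
        parts.append(body_markup(fetch(PRINT_URL % rowid)))
    return "\n".join(parts)
```

A real fetch would wrap urllib.request.urlopen and decode the response, and the string returned by collect_story could then be written to a single .html file. Comments in the source pages are dropped by this extractor, which is probably acceptable for a print version.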

Source

2012-02-26 03:46:04 Anorov
