How can I find file names in a block of text using Python?

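I have fetched a web page's HTML with Python, and now I want to find all of the .css files that are linked in the head and save each one as its own variable (I know how to do that last part). I tried partitioning, as shown below, but running it raises "IndexError: string index out of range".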

style = src.partition(".css") 
style = style[0].partition('<link href=') 
print style[2] 
c =1

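I don't think this is the right way to approach the problem, so I would appreciate some advice. Thanks in advance. Below is a sample of the kind of text I need to extract the .css files from.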

<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" /> 

<!--[if gte IE 7]><!--> 
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" /> 

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" /> 
<!-- <![endif]--> 
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" /> 
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" /> 
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" />

2012-07-26 zch

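You have already accepted an answer, and for some strange reason it seems to have upvotes. Using regular expressions to parse HTML is ugly, error-prone, easy to break and inflexible. You should use a proper HTML parser to handle HTML data (lxml.html, BeautifulSoup, etc.). HTML is structured data; it is not just "text". – 2012-07-26 22:14:20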

You should use a regular expression for this. Try the following:

/href="([^"]*\.css[^"]*)"/g

Edit

import re 
# [^"]* keeps each match inside a single quoted href value 
matches = re.findall(r'href="([^"]*\.css[^"]*)"', html) 
print(matches)
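Since the question asks for file names rather than full href values, here is a minimal sketch of one way the matches could be reduced to bare names. The helper name `css_file_names` and the trimmed-down sample string are illustrative assumptions, not part of the original answer. The pattern uses `[^"]*` so that a match cannot run past the closing quote of the attribute.

```python
import posixpath
import re
from urllib.parse import urlsplit

# A few of the <link> lines from the question, used here as sample input.
html = '''
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
'''

# Capture any double-quoted href value that contains ".css".
CSS_HREF = re.compile(r'href="([^"]*\.css[^"]*)"')


def css_file_names(markup):
    """Hypothetical helper: return the file name part of each matching href."""
    names = []
    for href in CSS_HREF.findall(markup):
        path = urlsplit(href).path            # drops the "?1342791430" query string
        names.append(posixpath.basename(path))  # keeps only the last path segment
    return names


print(css_file_names(html))
```

This still inherits the limits of the regex approach: it only sees double-quoted, lower-case `href` attributes, and it would also pick up a matching href inside an HTML comment.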

2012-07-26 21:55:19

Got it! Thanks for replying so quickly. – zch 2012-07-26 22:09:04

I have expanded my answer. Does that help? – 2012-07-26 22:10:18

Yes, very much so. Thanks again for your help. – zch 2012-07-26 22:16:27

For what it's worth, here is a version that uses lxml.html as the parsing library.

Untested

import lxml.html 
try: 
    from urllib.parse import urlparse  # Python 3 
except ImportError: 
    from urlparse import urlparse  # Python 2 

sample_html = """<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" /> 

<!--[if gte IE 7]><!--> 
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" /> 

<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" /> 
<!-- <![endif]--> 
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" /> 
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" /> 
<link href="http://dribbble.com/shots/popular.rss" rel="alternate" title="RSS" type="application/rss+xml" /> 
""" 

page = lxml.html.fromstring(sample_html) 
# Keep only the path of each href so that query strings such as "?1342791430" are dropped 
link_hrefs = (p.path for p in map(urlparse, page.xpath('//head/link/@href'))) 
for href in link_hrefs: 
    if href.rsplit('.', 1)[-1].lower() == 'css': # implement smarter error handling here 
        pass # do whatever

2012-07-26 22:29:36

My answer is along the same lines as Jon Clements' answer, but I have tested mine and added a bit of explanation.

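You should not use a regex. You can't parse HTML with a regex. The regex answer might work, but writing a robust solution is very easy with lxml. This approach returns the complete href attribute of every <link rel="stylesheet"> tag in the head, and of no other tags.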

from lxml import html 

def extract_stylesheets(page_content): 
    doc = html.fromstring(page_content)      # Parse 
    return doc.xpath('//head/link[@rel="stylesheet"]/@href') # Search

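There is no need to check the file names, because the results of the XPath search are already known to be stylesheet links, and there is no guarantee that the file name will have a .css extension anyway. The simple regex only catches one very specific form, whereas the general HTML parser solution also does the right thing in cases like the following, where the regex fails badly: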

<link REL="stylesheet" hREf = 

    '/stylesheets/print?1342791421' 
    media="print" 
><!-- link href="/css/stylesheet.css" -->
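To illustrate the point with something runnable without extra dependencies, here is a rough sketch that feeds the awkward markup above to the standard library's `html.parser` (swapped in for lxml here, purely as an assumption of convenience). The class name `StylesheetCollector` is hypothetical. The parser normalises the case of tag and attribute names, tolerates the whitespace around `=`, and treats the commented-out link as a comment.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <link> whose rel includes "stylesheet"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this method.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# The markup from the answer: odd casing, single quotes, line breaks, and a
# commented-out link that must not be reported.
awkward_html = """<link REL="stylesheet" hREf =

    '/stylesheets/print?1342791421'
    media="print"
><!-- link href="/css/stylesheet.css" -->"""

collector = StylesheetCollector()
collector.feed(awkward_html)
collector.close()
print(collector.hrefs)
```

Only the real link is reported; the href inside the comment never reaches `handle_starttag`.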

It can also easily be extended to select only the stylesheets for a particular medium.
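One possible shape for that extension is sketched below, again using the standard library's `html.parser` rather than lxml so that it runs without third-party packages. The names `LinkCollector` and `stylesheets_for_media` are hypothetical, and the media check is a simplifying assumption: it treats the `media` attribute as a plain comma-separated list and does not understand full media queries.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: record the attributes of every <link> element as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            self.links.append(dict(attrs))


def stylesheets_for_media(page_content, medium):
    """Return hrefs of stylesheet links whose media list contains `medium`."""
    collector = LinkCollector()
    collector.feed(page_content)
    collector.close()
    hrefs = []
    for link in collector.links:
        rel_tokens = (link.get("rel") or "").lower().split()
        # Assumption: media is a simple list such as "screen, projection".
        media = [m.strip().lower() for m in (link.get("media") or "").split(",")]
        if "stylesheet" in rel_tokens and medium in media and link.get("href"):
            hrefs.append(link["href"])
    return hrefs


# A few of the <link> lines from the question, used here as sample input.
sample_html = """
<link href="/stylesheets/master.css?1342791430" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/adapt.css?1342791413" media="screen, projection" rel="stylesheet" type="text/css" />
<link href="/stylesheets/print.css?1342791421" media="print" rel="stylesheet" type="text/css" />
<link href="/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed" />
"""

print(stylesheets_for_media(sample_html, "print"))
print(stylesheets_for_media(sample_html, "screen"))
```

With lxml, the same idea could presumably be expressed by adding a second predicate on `@media` to the XPath expression instead of filtering in Python.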

2013-09-29 21:24:08
