Syntax error - Python re.search (character class, caret)

I'm scraping a page with BeautifulSoup and trying to filter out the links that end in "...html#comments".

The code is below:

import urllib.request
import re
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    if i.has_key('href') and \
       re.search(base_url, i['href']) and \
       len(i['href']) > len(base_url) and \
       re.search(r'[^(comments)]', i['href']):
        print(i['href'])

Python 3.2, Windows 7 64-bit.

The script above still keeps the links that end in "#comments".

I have tried re.search([^comments], i['href']), re.search([^(comments)], i['href']) and re.search([^'comments'], i['href']). All of them throw syntax errors.

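I'm new to Python, so apologies for the banality. I can think of three explanations: (a) I don't properly understand the 'r' prefix; (b) for [^(foo)], re.search does not return the lines that exclude 'foo', but the lines that contain anything besides 'foo', so I keep my ...#comments links because ...texttexttext.html precedes #comments; or (c) Python interprets the "#" as the start of a comment, which cuts off the line that re.search should match.

I think my misunderstanding is (b).

Sorry, I know this is simple. Thanks,

Zack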


2012-03-24 Zack

You should include the exact text of the error/traceback you are getting. – Amber 2012-03-24 18:49:05

[^(comments)]

means "one character that is neither a (, nor a c, an o, an m, an e, an n, a t, an s, nor a )". That is probably not what you want.
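To make that concrete, here is a minimal sketch of how the character class behaves. The example.com URLs are made up for illustration and are not from the page in the question.

```python
import re

# Hypothetical sample links, not taken from the real page.
plain = "http://example.com/post.html"
with_comments = "http://example.com/post.html#comments"

# [^(comments)] is a negated character class: it matches ONE character
# that is not any of ( c o m e n t s ).
pattern = re.compile(r'[^(comments)]')

# Both URLs contain such a character (the leading 'h'), so both match,
# which is why the filter in the question never rejects anything.
first_plain = pattern.search(plain).group()
first_with = pattern.search(with_comments).group()

# Only a string built entirely from the listed characters fails to match.
no_match = pattern.search("comments")
```

In other words, the class tests a single character at a time and says nothing about the word "comments" as a whole.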

If your goal is a condition that holds only when the supplied string does not end in #comments, then I would use

... and not re.search("#comments$", i['href'])

or even better (why use a regex at all if it's this simple?):

... and not i['href'].endswith("#comments")
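As a rough side-by-side check of the two conditions, here is a small sketch run over a hand-written list. The hrefs are hypothetical and only there to exercise both versions.

```python
import re

# Hypothetical hrefs for illustration only.
hrefs = [
    "http://example.com/a.html",
    "http://example.com/a.html#comments",
    "http://example.com/b.html#comments-are-closed",
]

# Regex version: '$' anchors the match at the end of the string.
kept_by_regex = [h for h in hrefs if not re.search(r"#comments$", h)]

# Plain string version: same result here, without the regex engine.
kept_by_endswith = [h for h in hrefs if not h.endswith("#comments")]
```

Both lists keep the first and third link. One small difference to be aware of: '$' also matches just before a trailing newline, while endswith() does not.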

As for your other questions:

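The r'...' notation lets you write "raw strings", which means backslashes don't need to be escaped:

r'\b' means "backslash + b" (which the regex engine interprets as "word boundary")
'\b' means "backspace character"
etc.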
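A quick sketch of the difference, using a made-up word-boundary pattern as the regex example:

```python
import re

raw = r'\b'      # two characters: a backslash and the letter b
cooked = '\b'    # one character: backspace (0x08)

raw_len = len(raw)
cooked_len = len(cooked)

# As a regex, r'\bcat\b' matches "cat" only as a whole word.
whole_word = re.search(r'\bcat\b', "a cat sat") is not None
inside_word = re.search(r'\bcat\b', "concatenate") is not None
```

The r prefix only changes how Python reads the string literal. It has no effect on how the regex engine treats characters such as [ or ^.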

# has no special meaning in a regex unless you use the (?x) or re.VERBOSE option. In that case, it does start a comment inside a multi-line regex.
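Here is a small sketch of both modes. The URL and the test strings are hypothetical.

```python
import re

href = "http://example.com/post.html#comments"  # hypothetical URL

# Default mode: '#' is an ordinary character.
default_match = re.search(r"#comments$", href) is not None

# Verbose mode: an unescaped '#' starts a comment, so this pattern is
# reduced to an empty pattern, which matches any string at all.
verbose_unescaped = re.search(r"#comments$", "no hash here", re.VERBOSE) is not None

# Verbose mode with the '#' escaped: it is a literal character again.
verbose_escaped = re.search(r"""
    \#comments   # a literal '#', then the word
    $            # end of string
""", href, re.VERBOSE) is not None
verbose_escaped_miss = re.search(r"\#comments$", "no hash here", re.VERBOSE) is not None
```

So in the question's code, which does not use re.VERBOSE, the "#" is not being treated as a comment.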


2012-03-24 18:48:53

Had to leave, back now. Thanks for your answer. – Zack 2012-03-25 13:47:12

A regex is probably not the best solution here:

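import urllib.request
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(urllib.request.urlopen(base_url), 'html.parser')
links_to_follow = []
for i in soup.find_all('a'):
    href = i.get('href')
    if href is None:
        continue
    if not href.startswith(base_url):
        continue
    # Skip the links that end in '#comments'; keep everything else.
    if href.endswith('#comments'):
        continue
    links_to_follow.append(href)
    print(href)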
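The same filtering logic can also be tried without touching the network. The sketch below assumes the hrefs have already been extracted. The helper name filter_links and the sample hrefs are made up for illustration, and the length check is carried over from the question's code.

```python
def filter_links(hrefs, base_url):
    """Keep hrefs under base_url that do not end in '#comments'.

    Hypothetical helper, not part of BeautifulSoup or the original answer.
    """
    kept = []
    for href in hrefs:
        if href is None:
            continue
        # Must live under base_url and be longer than base_url itself.
        if not href.startswith(base_url) or len(href) <= len(base_url):
            continue
        if href.endswith('#comments'):
            continue
        kept.append(href)
    return kept


base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
sample = [
    None,                                   # <a> tag without an href
    base_url,                               # the index page itself
    base_url + "some-post.html",            # made-up post
    base_url + "some-post.html#comments",   # made-up comments link
    "http://example.com/elsewhere.html",    # outside base_url
]
links_to_follow = filter_links(sample, base_url)
```

With this sample, only the "some-post.html" link should remain in links_to_follow.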


2012-03-24 18:52:55 Amber
