2014-09-21 224 views
1

Extract part of a URL in Python

I am iterating over multiple URLs in a CSV file; the URLs all share the same structure:

http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21 
http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml

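and so on.

I need the category of each article (the part after the 4th slash of the path: 'AMSTERDAM-CENTRUM' and 'POLITIEK' in this case), and I want to append the categories to a list.

I am working with urllib2: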

reader=CsvUnicodeReader(open("my.csv","r")) 
for row in reader: 
    url = row[0] 
    req=urllib2.Request(url) 

Is there a way to parse the URL?

+0

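To split a URL into its parts (scheme, host, port, path, etc.) there is the ['urlparse'](https://docs.python.org/2/library/urlparse.html) module ('urllib.parse' in Python 3.x). But it looks like you are interested in a specific part of the path, so you may also need a regular expression. – 2014-09-21 21:25:14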

Answers

0

You don't need a regular expression here.

>>> import csv
>>> a=[]
>>> with open('in','r') as f: 
...  r=csv.reader(f,delimiter='/') 
...  for row in r: 
...    a.append(row[6]) 
... 
>>> a 
['AMSTERDAM-CENTRUM', 'POLITIEK'] 

Or, splitting the first column of each row on '/':

>>> a=[] 
>>> with open('in','r') as f: 
...  r=csv.reader(f) 
...  for row in r: 
...    a.append(row[0].split('/')[6]) 
... 
>>> a 
['AMSTERDAM-CENTRUM', 'POLITIEK'] 
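
The second variant can be sketched as a small Python 3 function. This is only an illustration: `SAMPLE_CSV`, `categories_from_csv`, and the use of `io.StringIO` in place of the real file are assumptions made so the example runs on its own. It also skips rows whose URL has too few segments instead of raising an IndexError.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: one URL per row, in the first column.
SAMPLE_CSV = (
    "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21\n"
    "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/"
    "VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml\n"
)


def categories_from_csv(handle, index=6):
    """Return segment `index` of the URL found in the first column of each row."""
    found = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        parts = row[0].split('/')
        if len(parts) > index:  # ignore URLs that are too short
            found.append(parts[index])
    return found


categories = categories_from_csv(io.StringIO(SAMPLE_CSV))
print(categories)
```

With a real file, `open("my.csv", newline="")` would take the place of the `io.StringIO` object.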
2

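You can use urlparse.urlparse to split the URL into its components and reliably extract the path component, then use a regular expression to extract the category part of the path that you are interested in: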

from urlparse import urlparse 
import re 


URLS = ["http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21", 
     "http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml"] 

pattern = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/.*$")


for url in URLS: 
    parsed = urlparse(url) 
    match = pattern.match(parsed.path) 
    if match: 
        category = match.group(1)
        print category

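Output:

AMSTERDAM-CENTRUM
POLITIEK

Notes on the regular expression:

  • \d* matches any digit (0-9), zero or more times
  • /(.*?)/ matches any characters between the two slashes, zero or more times and non-greedily, and captures the part between the slashes as a group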
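
To make the non-greedy note concrete, here is a rough Python 3 sketch (in Python 3 the function lives in `urllib.parse`). The helper name `category_of` and the path in `tricky_path` are made up for illustration; the path is not a real parool.nl URL.

```python
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Same idea as the pattern above, written as a raw string.
CATEGORY_RE = re.compile(r"/parool/nl/\d*/(.*?)/article/detail/")


def category_of(url):
    """Return the category segment of the URL path, or None if it does not match."""
    match = CATEGORY_RE.match(urlparse(url).path)
    return match.group(1) if match else None


url = "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"
print(category_of(url))

# Hypothetical path in which "/article/detail/" occurs twice.
tricky_path = "/parool/nl/5/POLITIEK/article/detail/1/article/detail/2"
lazy = re.match(r"/parool/nl/\d*/(.*?)/article/detail/", tricky_path).group(1)
greedy = re.match(r"/parool/nl/\d*/(.*)/article/detail/", tricky_path).group(1)
print(lazy)    # stops at the first "/article/detail/"
print(greedy)  # runs on to the last "/article/detail/"
```

Returning None for paths that do not match lets the caller decide whether to skip or report such URLs.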
0

You can use the urlparse module and pick the category by index: get the path through the path attribute, split it on '/' with split('/'), and take the fifth field, [4].

Demo:

>>> from urlparse import urlparse 
>>> your_url=['http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21','http://www.parool.nl/parool/nl/5/POLITIEK/article/detail/3751624/2014/09/20/VVD-wil-boete-van-250-euro-voor-het-naroepen-van-vrouwen.dhtml']
>>> [urlparse(ul).path.split('/')[4] for ul in your_url] 
['AMSTERDAM-CENTRUM', 'POLITIEK'] 
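
A possible Python 3 version of this, as a sketch: `path_segment` is a hypothetical helper, and the URL with a query string is invented to show one case where splitting only the parsed path behaves differently from splitting the whole URL.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def path_segment(url, index=4):
    """Return path segment `index` (segment 0 is the empty string before the first '/')."""
    segments = urlparse(url).path.split('/')
    return segments[index] if len(segments) > index else None


url = "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"
print(path_segment(url))

# Hypothetical URL with a query string: urlparse keeps the query out of the path.
with_query = "http://www.parool.nl/parool/nl/5/POLITIEK?ref=home"
print(path_segment(with_query))   # from the parsed path only
print(with_query.split('/')[6])   # plain split keeps the query attached
```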
1

If all the URLs have a similar structure, you can simply use

url.rsplit('/')[6]
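
A quick check of this one-liner, as a sketch in Python 3: without a maxsplit argument, rsplit returns the same list as split, so index 6 is still counted from the left, scheme and host included.

```python
url = "http://www.parool.nl/parool/nl/4024/AMSTERDAM-CENTRUM/article/detail/3751723/2014/09/21"

parts = url.rsplit('/')
# Segments 0-2 are 'http:', '' and the host; 3-5 are 'parool', 'nl', '4024'.
category = parts[6]
print(category)

# Without maxsplit, rsplit and split produce identical lists.
same_as_split = (parts == url.split('/'))
print(same_as_split)
```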