2017-04-15

Web scraping - how to get a specific part of a web link

I have the following link: https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk

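I have a dataset containing many links, and every link follows the same pattern. I want to extract one specific part of each link: the text that starts at the second http and ends just before the first + sign. For the link above, that is https://cooking.nytimes.com/learn-to-cook.

I don't know how to use regular expressions. I am working in Python. Please help.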

Answers

If every link follows the same pattern, you don't need a regular expression. You can use str.find() and string slicing:

link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk" 

# This finds the second occurrence of "https://" and returns the position 
second_https = link.find("https://", link.find("https://")+1) 
# Index of the end of the link 
end_of_link = link.find("+") 

new_link = link[second_https:end_of_link] 

print(new_link) 

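This returns "https://cooking.nytimes.com/learn-to-cook". It works as long as the links follow the pattern described: the part you want starts at the second https:// and ends at the + sign.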
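Since the question mentions a whole dataset of links, the same idea can be wrapped in a small helper and applied to a list. This is only a minimal sketch: the function name `extract_cached_url`, the second example link, and the choice to return `None` for links that don't match are assumptions, not part of the answer above. It also searches for the + starting from the second https://, so a + earlier in the link would not cut the result short.

```python
def extract_cached_url(link):
    """Return the text from the second "https://" up to the next "+", or None."""
    first = link.find("https://")
    if first == -1:
        return None
    # Start the second search one character past the first match
    start = link.find("https://", first + 1)
    if start == -1:
        return None
    end = link.find("+", start)
    # If there is no "+", keep everything up to the end of the string
    return link[start:end] if end != -1 else link[start:]


links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
    "https://example.com/no-second-link",  # hypothetical link that does not match
]
results = [extract_cached_url(link) for link in links]
print(results)
```

Returning `None` instead of a wrong slice makes it easy to spot links in the dataset that break the pattern.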

I would go with urlparse (the urlparse module in Python 2, urllib.parse in Python 3) and a little bit of regex:

import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk" 
parsed = urlparse(url_example) 
result = re.findall('https?.*', parsed.query)[0].split('+')[0] 
print(result) 

Output:

https://cooking.nytimes.com/learn-to-cook
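For comparison, the extraction can also be done with a single regular expression and no URL parsing. This is a rough sketch under an assumed pattern: the links are taken to contain `cache:<id>:` immediately before the target URL, and the target is taken to end at the first + or &. The names `CACHED_URL` and `cached_target` are made up for illustration.

```python
import re

# Assumed pattern: "cache:<id>:" followed by the target URL,
# which ends at the first "+" or "&"
CACHED_URL = re.compile(r"cache:[^:]+:(https?://[^+&]+)")


def cached_target(link):
    """Return the cached page's URL, or None if the link doesn't match."""
    match = CACHED_URL.search(link)
    return match.group(1) if match else None


link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
target = cached_target(link)
print(target)
```

Compiling the pattern once is a small convenience when the same expression is applied to every link in a dataset.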