2017-10-16

Python: replace \n \r \t in a list, excluding those starting with \n\n and ending with \n\r\n\t

['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']

Sample code:

import requests 
from bs4 import BeautifulSoup 
import re 
re=requests.get('http://www.abcde.com/banana') 
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser") 
title_tag = soup.select_one('.page_article_title') 
print(title_tag.text) 
list=[] 
for tag in soup.select('.page_article_content'): 
    list.append(tag.text) 
#list=([c.replace('\n', '') for c in list]) 
#list=([c.replace('\r', '') for c in list]) 
#list=([c.replace('\t', '') for c in list]) 
print(list) 

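After scraping a web page, I need to clean the data. I want to remove every "\r", "\n" and "\t", but the text also contains subtitles, and if I strip those characters the subtitles and the sentences get mixed together.

Every subtitle always starts with \n\n and ends with \n\r\n\t. Is there a way to mark the subtitles in this list so they stay distinguishable, for example as \aEtymology\a? If I replace \n\n and \n\r\n\t with \a separately, other parts that contain the same characters are affected first: for example, \n\n\r becomes \a\r. Thanks in advance!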

Answers

1

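Approach

  1. Replace the subtitles in the list with a custom string, <subtitles>
  2. Remove \n, \r, \t etc. from the list
  3. Replace the custom string with the actual subtitles

Code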

l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n'] 

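import re

# Find every subtitle together with its surrounding control characters.
regex = re.findall("\n\n.*.\n\r\n\t", l[0])
print(regex)

# Swap each subtitle for a placeholder so it survives the cleanup.
for x in regex:
    l = [r.replace(x, "<subtitles>") for r in l]

# Remove all remaining newline, tab and carriage-return characters.
rep = ['\n', '\t', '\r']
for y in rep:
    l = [r.replace(y, '') for r in l]

# Restore the subtitles in order; the count of 1 fills one placeholder per pass.
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)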

Output

['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t'] 

['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.'] 
+0

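This is very neat! It is easy for me to learn and understand. Just one question about l = [r.replace('<subtitles>', x, 1) for r in l]: what is the 1 for? When I remove it, it prints the same result. Just curious :) Thanks! – Makiyo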

+0

@Makiyo The 1 replaces only the first occurrence. If you remove the 1, all the subtitles in the output will be the same. –
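The effect of the count argument of str.replace can be checked on a small made-up string. This is only a sketch: the sample text, the bracketed subtitle names and the helper `restore` are illustrative and do not come from the scraped page.

```python
# Hypothetical miniature of the answer's restore step.
text = "intro<subtitles>body one<subtitles>body two"
subtitles = ["[Description]", "[Etymology]"]

def restore(text, subtitles, count=None):
    """Put the saved subtitles back in place of the placeholder."""
    for sub in subtitles:
        if count is None:
            # No count: every remaining placeholder gets the same subtitle.
            text = text.replace("<subtitles>", sub)
        else:
            # count=1: only the first remaining placeholder is replaced.
            text = text.replace("<subtitles>", sub, count)
    return text

with_count = restore(text, subtitles, count=1)
without_count = restore(text, subtitles)
print(with_count)     # intro[Description]body one[Etymology]body two
print(without_count)  # intro[Description]body one[Description]body two
```

When the text contains only one subtitle, both variants produce the same output, so the difference shows up only with two or more placeholders.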

0
import re  

print([re.sub(r'[\n\r\t]', '', c) for c in list]) 

I think you can use a regular expression.

+0

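I don't think this is a correct answer. His "\n\r\t" means '\n' or '\r' or '\t'. If you read it as the literal sequence "\n\r\t", then the following sentence, "starts with \n\n and ends with \n\r\n\t", would be meaningless. Check his example: there is no "\n\r\t" in it at all. –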
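The difference between the two readings can be seen on a shortened sample (the string below is a made-up stand-in for the scraped text, not the real page). A character class removes each of the three characters wherever it occurs, which is also what merges the subtitle into the neighbouring sentences; the literal three-character sequence never occurs in this sample, so substituting it changes nothing.

```python
import re

sample = "\n\r\n\tFirst sentence.\n\nDescription\n\r\n\tSecond sentence.\n"

# Character class: matches any single \n, \r or \t, wherever it occurs.
any_of = re.sub(r"[\n\r\t]", "", sample)

# Literal sequence: matches only \n immediately followed by \r and then \t.
sequence = re.sub(r"\n\r\t", "", sample)

print(repr(any_of))        # 'First sentence.DescriptionSecond sentence.'
print(sequence == sample)  # True: the sequence never occurs in the sample
```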

0

You can do this with a regular expression:

import re 
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t') 
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li] 

\g<1> is a backreference to the first group in the regex, (\w+). It lets you reuse whatever was matched there.
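A self-contained version of this idea might look like the sketch below. The list `li` is a shortened, made-up stand-in for the scraped list, and the second substitution (stripping the remaining control characters) is an added assumption about what the asker wants next; it is not part of the answer above. In the replacement template, \a is the bell character, which Python prints as \x07.

```python
import re

# Shortened stand-in for the scraped list from the question.
li = ['\n\r\n\tThey are grown in 135 countries.\n\nDescription\n\r\n\t'
      'The banana plant is large.\n\nEtymology\n\r\n\tThe word banana.\n']

subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')

# Wrap each subtitle in \a markers; \g<1> re-inserts the captured word.
marked = [subtitle.sub(r'\a\g<1>\a', s) for s in li]

# Strip the remaining newline, carriage-return and tab characters.
cleaned = [re.sub(r'[\n\r\t]', '', s) for s in marked]

print(cleaned)
# ['They are grown in 135 countries.\x07Description\x07The banana plant is large.\x07Etymology\x07The word banana.']
```

Note that (\w+) only matches single-word subtitles; a subtitle containing spaces would need a wider group such as ([^\n\r\t]+).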

+0

Hi! I tried it, but it doesn't work. I'm not sure whether I put it in the wrong place. I have just uploaded my whole code above :) – Makiyo

+0

What didn't work? Any errors? –

+0

AttributeError: 'Response' object has no attribute 'compile' – Makiyo
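This error message is consistent with the code in the question, where re=requests.get(...) rebinds the name re after import re, so re.compile is looked up on the response object instead of on the module. The sketch below reproduces the effect without network access; the class `Response` here is a hypothetical stand-in for the object returned by requests.get(), and the names `re_module` and `response` are illustrative.

```python
import re

class Response:
    """Hypothetical stand-in for the object returned by requests.get()."""
    text = "<html></html>"

re_module = re    # keep a handle on the real module
re = Response()   # same rebinding as `re=requests.get(...)` in the question

try:
    re.compile(r'\n\n(\w+)\n\r\n\t')
    message = None
except AttributeError as exc:
    message = str(exc)
print(message)  # 'Response' object has no attribute 'compile'

# Using a different name for the response leaves the module usable.
re = re_module
response = Response()
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
```

Renaming the response variable in the scraping code (for example to response) should therefore let the regex answer run unchanged.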
