Remove URLs, empty lines, and Unicode characters in Python

I need to use Python to remove URLs, empty lines, and lines containing Unicode characters from a large text file (500 MiB).

This is my file:

https://removethis1.com 
http://removethis2.com foobar1 
http://removethis3.com foobar2 
foobar3 http://removethis4.com 
www.removethis5.com 


foobar4 www.removethis6.com foobar5 
foobar6 foobar7 
foobar8 www.removethis7.com

After applying the regex, it should look like this:

foobar1 
foobar2 
foobar3 
foobar4 foobar5 
foobar6 foobar7 
foobar8

The code I came up with is this:

file = open(file_path, encoding="utf8")
self.rawFile = file.read()
rep = re.compile(r"""
    http[s]?://.*?\s
    |www.*?\s
    |(\n){2,}
    """, re.X)
self.processedFile = rep.sub('', self.rawFile)

But the output is incorrect:

foobar3 foobar4 foobar5 
foobar6 foobar7 
foobar8 www.removethis7.com

I also need to remove every line that contains at least one non-ASCII character, but I can't come up with a regex for that task.

Source

2015-09-25 Federico

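Don't do it all at once; go line by line. –

@PadraicCunningham I tried that, but it is very slow. – Federico

Do you want to change the original file's contents or create a new file? –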

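You can try encoding each line to ASCII to catch the non-ASCII lines, which I assume is what you want:

import re

rep = re.compile(r"""
    https?://\S*\s*   # URL with a scheme, plus trailing whitespace
    |www\.\S*\s*      # bare www. URL, plus trailing whitespace
    """, re.X)

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        try:
            # Raises UnicodeEncodeError if the line has any non-ASCII character.
            line.encode("ascii")
        except UnicodeEncodeError:
            continue  # drop the whole line
        line = rep.sub("", line).strip()
        if line:  # skip lines that are empty once the URLs are gone
            print(line)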

Input:

https://removethis1.com 
http://removethis2.com foobar1 
http://removethis3.com foobar2 
foobar3 http://removethis4.com 
www.removethis5.com 

1234 ā 
5678 字 
foobar4 www.removethis6.com foobar5 
foobar6 foobar7 
foobar8 www.removethis7.com

Output:

foobar1 
foobar2 
foobar3 
foobar4 foobar5 
foobar6 foobar7 
foobar8

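Or use a regex to match any non-ASCII character:

import re

rep = re.compile(r"""
    https?://\S*\s*   # URL with a scheme, plus trailing whitespace
    |www\.\S*\s*      # bare www. URL, plus trailing whitespace
    """, re.X)
non_asc = re.compile(r"[^\x00-\x7F]")  # any character outside the ASCII range

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        if non_asc.search(line):
            continue  # drop the whole line
        line = rep.sub("", line).strip()
        if line:  # skip lines that are empty once the URLs are gone
            print(line)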

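Same output as above. You can't combine the two regexes into one, because a non-ASCII match means the whole line is dropped, whereas a URL match only means the matched text is replaced.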
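Since the comments mention both speed and the choice between editing in place and writing a new file, here is a minimal sketch of the same two-step idea that streams into a new file instead of printing. The names `URL_RE`, `clean_lines`, and `clean_file` are made up for illustration, the URL pattern is an assumption about what counts as a URL, and `str.isascii()` requires Python 3.7 or later.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.,
# together with the whitespace that follows it.
URL_RE = re.compile(r"(?:https?://|www\.)\S*\s*")


def clean_lines(lines):
    """Yield cleaned lines, skipping non-ASCII lines and lines left empty."""
    for line in lines:
        if not line.isascii():   # str.isascii() needs Python 3.7+
            continue             # drop the whole line
        line = URL_RE.sub("", line).strip()
        if line:                 # drop lines that are empty after cleaning
            yield line


def clean_file(src_path, dst_path):
    """Stream src_path line by line and write the cleaned lines to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="ascii") as dst:
        for line in clean_lines(src):
            dst.write(line + "\n")
```

Calling `clean_file("test.txt", "cleaned.txt")` would leave the original file untouched. Only one line is held in memory at a time, which matters more for a 500 MiB input than the choice between `encode("ascii")` and a regex.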

Source

2015-09-25 17:43:24

-1

This will remove all the links:

(?:http|www).*?(?=\s|$)

Explanation

(?:             # non-capturing group
    http|www    # match "http" OR "www"
)
.*?             # lazily match anything until...
(?=             # ...a positive lookahead succeeds:
    \s|$        # it is followed by whitespace or the end of the line
)

Replace the newline \n with the whitespace \s, then strip all the empty lines afterwards.
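As a rough sketch of how that pattern could be used in Python: the function below applies it line by line, collapses the whitespace left behind, and drops lines that end up empty. The names `LINK_RE` and `strip_links` are made up for illustration, and the whitespace-collapsing step is my reading of the clean-up described above, not something the answer spells out.

```python
import re

# The pattern from this answer: "http" or "www", then a lazy match up to
# the next whitespace character or the end of the line.
LINK_RE = re.compile(r"(?:http|www).*?(?=\s|$)")


def strip_links(text):
    """Remove links from text and drop lines that are empty afterwards."""
    cleaned = []
    for line in text.splitlines():
        line = LINK_RE.sub("", line)
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:                       # skip lines that are now empty
            cleaned.append(line)
    return "\n".join(cleaned)
```

Note that the pattern has no word boundary, so it will also cut into any ordinary word that happens to contain "http" or "www".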

Source

2015-09-25 17:13:48

-1

Depending on how closely you want the result to match your sample text:

( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}

regex101 demo

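This bit of magic looks for leading spaces and captures them if they are present. It then looks for the http or www part, followed by everything that is not whitespace (I used [^\s]* rather than simply \S* in case you want to add more conditions to exclude). After that, it uses a regex conditional to check whether any whitespace was captured earlier. If not, it tries to capture any trailing whitespace (so that, for example, you don't remove too much between foobar4 www.removethis6.com foobar5). Alternatively, it looks for 2+ newlines.

If you replace all of that with nothing, it should give you the same output you requested.

Now, this regex is fairly strict and will probably have many edge cases where it doesn't work. It works for the OP, but if you need something more flexible, you may need to provide more details.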
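To see the conditional at work on single lines, here is a small Python sketch. It assumes the pattern reads `( +)?` with a literal space before each `+` (a bare `(+)` would not compile), and the names `PATTERN` and `remove_urls` are made up for illustration. Python's `re` module supports the `(?(1)yes|no)` conditional syntax.

```python
import re

# Group 1 optionally captures leading spaces. The conditional (?(1)|( +)?)
# consumes trailing spaces only when group 1 did not take part in the match.
PATTERN = re.compile(r"( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}")


def remove_urls(line):
    """Replace every match of PATTERN in line with the empty string."""
    return PATTERN.sub("", line)


print(remove_urls("foobar4 www.removethis6.com foobar5"))
print(remove_urls("http://removethis2.com foobar1"))
print(remove_urls("foobar3 http://removethis4.com"))
```

In the first call the space before the URL is consumed and the one after it is kept, so exactly one space survives between the two words. In the second call there is no leading space, so the trailing one is consumed instead.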

Source

2015-09-25 17:24:14 OnlineCop
