Scraping text from multiple web pages in Python

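I've been tasked with scraping all the text from every web page that a certain client of ours hosts. I've managed to write a script that scrapes the text from a single web page, and you can manually replace the URL in the code each time you want to scrape a different page. But obviously this is very inefficient. Ideally, I could have Python connect to some list containing all the URLs I need, iterate through that list, and print all the scraped text into a single CSV. I've tried to write a "test" version of this code by creating a list that is 2 URLs long and trying to get my code to scrape both URLs. However, as you can see, my code only scrapes the most recent URL in the list and does not hold on to the first page it scraped. I think this is due to a flaw in my print statement, since it always writes over itself. Is there a way to have everything I scrape held somewhere until the loop has gone through the entire list, and then print everything?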

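Feel free to totally dismantle my code. I know nothing about computer languages. I just keep getting assigned these tasks and use Google to do my best.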

import urllib.request
import re
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv' 
urlTable = ['url1','url2'] 

def extractText(string): 
    page = urllib.request.urlopen(string) 
    soup = BeautifulSoup(page, 'html.parser') 

##Extracts all paragraph and header variables from URL as GroupObjects 
    text = soup.find_all("p") 
    headers1 = soup.find_all("h1") 
    headers2 = soup.find_all("h2") 
    headers3 = soup.find_all("h3") 

##Forces GroupObjects into str 
    text = str(text) 
    headers1 = str(headers1) 
    headers2 = str(headers2) 
    headers3 = str(headers3) 

##Strips HTML tags and brackets from extracted strings 
    text = text.strip('[') 
    text = text.strip(']') 
    text = re.sub('<[^<]+?>', '', text) 

    headers1 = headers1.strip('[') 
    headers1 = headers1.strip(']') 
    headers1 = re.sub('<[^<]+?>', '', headers1) 

    headers2 = headers2.strip('[') 
    headers2 = headers2.strip(']') 
    headers2 = re.sub('<[^<]+?>', '', headers2) 

    headers3 = headers3.strip('[') 
    headers3 = headers3.strip(']') 
    headers3 = re.sub('<[^<]+?>', '', headers3) 

    print_to_file = open (data_file_name, 'w' , encoding = 'utf') 
    print_to_file.write(text + headers1 + headers2 + headers3) 
    print_to_file.close() 


for i in urlTable: 
    extractText (i)


2016-08-04 confusedanalyst

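Try this. 'w' truncates the file and positions the pointer at the beginning every time the file is opened, so each call to extractText overwrites what the previous call wrote. You want the pointer at the end of the file, which is what append mode 'a' gives you: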

print_to_file = open (data_file_name, 'a' , encoding = 'utf')
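To make the difference concrete, here is a minimal sketch that reopens a file once per page, the way the loop in the question does, first with 'w' and then with 'a'. The page strings, the temporary directory, and the helper names write_pages and read_back are placeholders invented for this illustration, not part of the original script.

```python
import os
import tempfile

def write_pages(path, pages, mode):
    """Reopen the file once per page, as the question's loop does."""
    for page_text in pages:
        with open(path, mode, encoding='utf-8') as f:
            f.write(page_text + '\n')

def read_back(path):
    """Return the full contents of the file as one string."""
    with open(path, encoding='utf-8') as f:
        return f.read()

# Placeholder strings standing in for scraped text (assumption).
pages = ['text from url1', 'text from url2']
tmp_dir = tempfile.mkdtemp()

# 'w' truncates on every open, so only the last page survives.
w_path = os.path.join(tmp_dir, 'w_mode.csv')
write_pages(w_path, pages, 'w')
w_result = read_back(w_path)

# 'a' positions the stream at the end, so every page is kept.
a_path = os.path.join(tmp_dir, 'a_mode.csv')
write_pages(a_path, pages, 'a')
a_result = read_back(a_path)

print(repr(w_result))  # 'text from url2\n'
print(repr(a_result))  # 'text from url1\ntext from url2\n'
```

One thing to keep in mind with 'a': running the script a second time appends to the output of the first run, so you may want to delete or rename the old file before each run.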

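For future reference, here are all the different read and write modes (this text comes from the C fopen(3) man page; Python's open() accepts these mode strings as well):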

The argument mode points to a string beginning with one of the following 
sequences (Additional characters may follow these sequences.): 

``r'' Open text file for reading. The stream is positioned at the 
     beginning of the file. 

``r+'' Open for reading and writing. The stream is positioned at the 
     beginning of the file. 

``w'' Truncate file to zero length or create text file for writing. 
     The stream is positioned at the beginning of the file. 

``w+'' Open for reading and writing. The file is created if it does not 
     exist, otherwise it is truncated. The stream is positioned at 
     the beginning of the file. 

``a'' Open for writing. The file is created if it does not exist. The 
     stream is positioned at the end of the file. Subsequent writes 
     to the file will always end up at the then current end of file, 
     irrespective of any intervening fseek(3) or similar. 

``a+'' Open for reading and writing. The file is created if it does not
     exist. The stream is positioned at the end of the file.
     Subsequent writes to the file will always end up at the then
     current end of file, irrespective of any intervening fseek(3) or
     similar.
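If you would rather hold everything until the loop has gone through the whole list and only then write, as the question asks, one possible approach is to collect the results in a list and write them in a single pass with the standard csv module. This is only a sketch under some assumptions: it assumes extractText is changed to return its text instead of writing it, and the names scrape_all, write_rows and fake_extract are hypothetical helpers made up for this example. fake_extract stands in for the real scraping function so the sketch runs without network access.

```python
import csv
import io

def scrape_all(urls, extract):
    """Call extract(url) for each URL and keep every result in memory."""
    rows = []
    for url in urls:
        rows.append((url, extract(url)))
    return rows

def write_rows(rows, out):
    """Write all collected rows to an open text file object in one pass."""
    writer = csv.writer(out)
    writer.writerow(['url', 'text'])  # header row
    writer.writerows(rows)

# Hypothetical stand-in for a version of extractText that returns its text.
def fake_extract(url):
    return 'text scraped from ' + url

rows = scrape_all(['url1', 'url2'], fake_extract)

# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open(data_file_name, 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the file is opened only once, after the loop, 'w' is safe here, and csv.writer takes care of quoting any commas or line breaks inside the scraped text.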


2016-08-04 19:52:25

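Thank you so much! That is exactly what I was looking for. I imagine I can apply the same principle once I get the real URL list from the client. Thanks again! – confusedanalyst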
