HTML scraper output stuck in utf-8

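I am working on a scraper for some Chinese documents. As part of the project, I am trying to scrape the body of each document into a list and then write an HTML version of the document from that list (the final version will include metadata as well as the text, along with a folder of individual HTML files for the documents).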

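I have managed to scrape the body of the document into a list and then use the contents of that list to create a new HTML document. I can even view the contents when I output the list to a csv (so far so good...). Unfortunately, the output HTML document is nothing but "\u6d88\u9664\u8d2b\u56f0\u3001\".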
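For anyone unsure what that output is: the `\uXXXX` sequences are ASCII escapes for the Chinese characters, not corrupted data. A minimal Python 3 sketch (the code posted here is Python 2; the variable names below are only illustrative) that turns the quoted escapes back into text:

```python
# The escaped text quoted above, held as plain ASCII characters.
escaped = r"\u6d88\u9664\u8d2b\u56f0\u3001"

# Interpreting the escapes recovers the original characters.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)

# Going the other way: ascii() produces the same escapes again.
round_trip = ascii(decoded).strip("'")
```

So the text survives the scrape; the problem is in how it gets written out.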

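Is there a way to encode the output so that this does not happen? Or do I just need to grow up and parse the page for real (parsing and organizing the <p> elements instead of just copying all of the existing HTML) and then build the new HTML page element by element?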

Any thoughts would be much appreciated.

from bs4 import BeautifulSoup 
import urllib 
#csv is for the csv writer 
import csv 

#initiates the list to hold the output 

holder = [] 

#this is the target URL 
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm" 

data = [] 

filename = "fullbody.html" 
target = open(filename, 'w') 

def bodyscraper(url): 
    #opens the url for read access 
    this_url = urllib.urlopen(url).read() 
    #creates a new BS holder based on the URL 
    soup = BeautifulSoup(this_url, 'lxml') 

    #finds the body text 
    body = soup.find('td', {'class':'b12c'}) 


    data.append(body) 

    holder.append(data) 

    print holder[0] 
    for item in holder: 
     target.write("%s\n" % item) 

bodyscraper(target_url) 


with open('bodyscraper.csv', 'wb') as f: 
    writer = csv.writer(f) 
    writer.writerows(holder)

Source

2017-04-09 mweinberg

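Since the source HTML is UTF-8 encoded, decoding what urllib returns before handing it to BeautifulSoup works. I have tested it, and both the HTML and the CSV output show the Chinese characters. Here is the amended code: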

from bs4 import BeautifulSoup 
import urllib 
#csv is for the csv writer 
import csv 

#initiates the list to hold the output 

holder = [] 

#this is the target URL 
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm" 

data = [] 

filename = "fullbody.html" 
target = open(filename, 'w') 

def bodyscraper(url): 
    #opens the url for read access 
    this_url = urllib.urlopen(url).read() 
    #creates a new BS holder based on the URL 
    soup = BeautifulSoup(this_url.decode("utf-8"), 'lxml') #decode the bytes urllib returns before parsing 

    #finds the body text 
    body = soup.find('td', {'class':'b12c'}) 
    target.write("%s\n" % body) #write the whole decoded body to html directly 


    data.append(body) 

    holder.append(data) 


bodyscraper(target_url) 
target.close() #close the HTML file so the output is flushed to disk 


with open('bodyscraper.csv', 'wb') as f: 
    writer = csv.writer(f) 
    writer.writerows(holder)
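
Note that the amended code also writes `body` itself rather than the list that holds it. In Python 2, formatting a list with `%s` shows the `repr` of its elements, which is probably where the `\u...` escapes in the question came from. For comparison, here is a rough Python 3 sketch of the same idea: decode the bytes once, then write text with an explicit encoding. It is not a port of the code above. `raw_bytes` is a hypothetical stand-in for what `urlopen(url).read()` would return, the temporary directory is only there so the sketch runs anywhere, and no parser is involved.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the bytes returned by urlopen(url).read().
raw_bytes = '<td class="b12c">消除贫困、</td>'.encode("utf-8")

# Decode once, at the boundary, so everything downstream is text.
fragment = raw_bytes.decode("utf-8")

out_dir = tempfile.mkdtemp()
html_path = os.path.join(out_dir, "fullbody.html")
csv_path = os.path.join(out_dir, "bodyscraper.csv")

# Write the text itself, not a list that contains it, and name the encoding.
with open(html_path, "w", encoding="utf-8") as target:
    target.write(fragment + "\n")

# In Python 3 the csv module expects a text-mode file opened with newline="".
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([[fragment]])

# Read both files back to check that the characters survived.
with open(html_path, encoding="utf-8") as f:
    html_text = f.read()
with open(csv_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.reader(f))
```

Naming the encoding on every `open` call means the result does not depend on the platform's default encoding.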

Source

2017-04-10 03:05:05

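This worked for the CSV, but the HTML still gave me (a new kind of) garbage output. However, adding the line '' at the start of the HTML gave me what I wanted. – mweinberg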
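
The exact line from the comment did not survive on this page. One common candidate is a charset declaration, since a bare HTML fragment leaves the browser to guess the encoding. The Python 3 sketch below rests on that assumption; the wrapper markup and the `fragment` value are hypothetical and are not the commenter's actual fix.

```python
import os
import tempfile

# Hypothetical fragment standing in for the scraped <td class="b12c"> element.
fragment = '<td class="b12c">消除贫困、</td>'

# Assumption: a charset declaration is the kind of line the comment refers to.
page = (
    "<!DOCTYPE html>\n"
    "<html>\n<head>\n"
    '<meta charset="utf-8">\n'
    "</head>\n<body>\n"
    "<table><tr>" + fragment + "</tr></table>\n"
    "</body>\n</html>\n"
)

path = os.path.join(tempfile.mkdtemp(), "fullbody.html")
with open(path, "w", encoding="utf-8") as target:
    target.write(page)

# Read the raw bytes back to confirm what actually reached the disk.
with open(path, "rb") as f:
    written = f.read()
```

The declaration has to match the encoding the file is actually written in, which is why both are set to UTF-8 here.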
