
Encoding Chinese characters in Python and TypeError: a bytes-like object is required, not 'str'

Hi everyone, I've written a web scraper in Python that tries to scrape word samples and so on from a dictionary-ish website for my GRE word list and put them into a CSV file. The scraped detail content contains Chinese characters.

The only problem with my script is that when I try to write those details out to the CSV file, I get either

UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)

TypeError: a bytes-like object is required, not 'str'

Here is my full code:

#!/usr/bin/python 
# -*- coding: <encoding name> -*- 

from urllib.request import urlopen as uReq 
from bs4 import BeautifulSoup as soup 

# make a word list (grabbed from the wordlist pdf, converted to Excel and extracted) 

wordList = '''Day One 
abandon 
abate 
abbreviate 
Day Two 
abate 
abbreviate 
Day Three 
abandon 
abate 
Day Four 
abandon 
abate 
abandon 
abate 
Day Five 
abandon 
abate 
Day Six 
abandon 
abate 
Day Seven 
abandon 
abate''' 

wordList = [y for y in (x.strip() for x in wordList.splitlines()) if y] 

dayIndex = 0 
dayArray = ['Day One', 'Day Two', 'Day Three', 'Day Four', 'Day Five', 'Day Six', 'Day Seven'] 

for item in wordList: 
     if item == dayArray[dayIndex]: 
       if dayIndex == 0: 
         fileName = "Word " + dayArray[dayIndex] + ".csv" 
         f = open(fileName, 'w') 
         headers = "word, separater, detail, lineSep\n" 
         f.write(headers) 
         dayIndex += 1 
       elif dayIndex == 6: 
         f.close() 
       else: 
         f.close() 
         fileName = "Word " + dayArray[dayIndex] + ".csv" 
         f = open(fileName, 'w') 
         headers = "word, separater, detail, lineSep\n" 
         f.write(headers) 
         dayIndex += 1 
     else: 
       # construct url for each word 
       myUrl = 'http://gre.kmf.com/vocab/detail/' + item 

       # opening up the connection, grabbing the page 
       uClient = uReq(myUrl) 
       page_html = uClient.read() 
       uClient.close() 

       # html parsing 
       pageSoup = soup(page_html, "html.parser") 

       # grab word container 
       container = pageSoup.findAll("div", {"class", "word-d-maintile"}) 
       contain = container[0]# actually only 1 item in the container array 

       # grab the word(should be the same as item) 
       word = contain.span.text 

       # grab word detail 
       wordDetail_container = contain.findAll("div", {"class": "word-g-translate"}) 
       wordDetail = wordDetail_container[0].text.strip()# again should be only 1 item in the array.strip() the extra spaces and useless indentation 

       # manipulate the string wordDetail(string is immutable but you know what I mean) 
       detailArray = [] 
       for letter in wordDetail: 
         if letter != '【' and letter != '例' and letter != '近' and letter != '反': 
          detailArray.append(letter) 
         elif letter == '【': 
          detailArray.append("\n\n\n" + letter) 
         else: 
          detailArray.append("\n\n" + '[' + letter + ']' + ' ') 
         newWordDetail = ''.join(detailArray) 
       #print("CUT\n") debug 
       #print(word + '\n') debug 
       #print(newWordDetail) debug 
       f.write(word +',' + '&' + ',' + newWordDetail.replace(',', 'douhao') + ',' + '$') 

The problem is in the last line. When the first error occurred, I added a ".encode('gb2312')" after newWordDetail to try to encode those Chinese characters, but once I did that I got the second error instead. I've searched online but can hardly find a solution that fits my situation.
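To make the two failures concrete, here is a minimal sketch of both (demo.csv and the sample string are only for illustration, and the fallback to the ASCII codec is an assumption based on the first traceback):

text = '放纵: carefree'

# Writing Chinese text through a handle that defaulted to the ASCII codec
# reproduces the first error:
with open('demo.csv', 'w', encoding='ascii') as f:
    f.write(text)    # UnicodeEncodeError: 'ascii' codec can't encode characters ...

# Calling .replace() with str arguments on the encoded bytes
# reproduces the second error:
text.encode('gb2312').replace(',', 'douhao')    # TypeError: a bytes-like object is required, not 'str'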

Thank you guys for saving my life!


On the 2nd line you should specify the encoding name, e.g. '# -*- coding: utf-8 -*-' – davedwards


Thanks downshift, I copied these two lines from an SO question but didn't notice the content... I've added it but the code still doesn't work :( – Hang

Answer


Your code is written as spaghetti code, which means that in some cases the file has already been closed and can no longer be written to.

f.write(word + ',' + '&' + ',' + newWordDetail.replace(',', 'douhao') + ',' + '$')
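Writing to a handle that has already been closed fails immediately; a minimal illustration (demo.csv is just a placeholder name):

f = open('demo.csv', 'w', encoding='utf-8')
f.close()
f.write('abandon')    # ValueError: I/O operation on closed file.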

That is what happens here: sometimes the script writes to a file it has already closed, so that line fails. The code below is correct; running it, I get the right content.

#!/usr/bin/env python 
# coding:utf-8 
'''黄哥Python''' 


from urllib.request import urlopen as uReq 
from bs4 import BeautifulSoup as soup 

# make a word list (grabbed from the wordlist pdf, converted to Excel and 
# extracted) 

wordList = '''Day One 
abandon 
abate 
abbreviate 
Day Two 
abate 
abbreviate 
Day Three 
abandon 
abate 
Day Four 
abandon 
abate 
abandon 
abate 
Day Five 
abandon 
abate 
Day Six 
abandon 
abate 
Day Seven 
abandon 
abate''' 

wordList = [y for y in (x.strip() for x in wordList.splitlines()) if y] 

dayIndex = 0 
dayArray = ['Day One', 'Day Two', 'Day Three', 
      'Day Four', 'Day Five', 'Day Six', 'Day Seven'] 

for item in wordList: 
    if item == dayArray[dayIndex]: 
     if dayIndex == 0: 
      fileName = "Word " + dayArray[dayIndex] + ".csv" 
      f = open(fileName, 'w') 
      headers = "word, separater, detail, lineSep\n" 
      f.write(headers) 
      dayIndex += 1 
     elif dayIndex == 6: 
      f.close() 
     else: 
      f.close() 
      fileName = "Word " + dayArray[dayIndex] + ".csv" 
      f = open(fileName, 'w') 
      headers = "word, separater, detail, lineSep\n" 
      f.write(headers) 
      dayIndex += 1 
    else: 
     # construct url for each word 
     myUrl = 'http://gre.kmf.com/vocab/detail/' + item 

     # opening up the connection, grabbing the page 
     uClient = uReq(myUrl) 
     page_html = uClient.read() 
     uClient.close() 

     # html parsing 
     pageSoup = soup(page_html, "html.parser",) 

     # grab word container 
     container = pageSoup.findAll("div", {"class", "word-d-maintile"}) 
     contain = container[0] # actually only 1 item in the container array 

     # grab the word(should be the same as item) 
     word = contain.span.text 

     # grab word detail 
     wordDetail_container = contain.findAll(
      "div", {"class": "word-g-translate"}) 
     # again should be only 1 item in the array.strip() the extra spaces and 
     # useless indentation 
     wordDetail = wordDetail_container[0].text.strip() 

     # manipulate the string wordDetail(string is immutable but you know 
     # what I mean) 
     detailArray = [] 
     for letter in wordDetail: 
      if letter != '【' and letter != '例' and letter != '近' and letter != '反': 
       detailArray.append(letter) 
      elif letter == '【': 
       detailArray.append("\n\n\n" + letter) 
      else: 
       detailArray.append("\n\n" + '[' + letter + ']' + ' ') 
      newWordDetail = ''.join(detailArray) 
     # print("CUT\n") debug 
     # print(word + '\n') debug 
     # print(newWordDetail) debug 
     # print(f) 
     try: 
      f.write(word + ',' + '&' + ',' +newWordDetail.replace(',', 'douhao') + ',' + '$') 
     except Exception as e: 
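      # silently ignore any write that fails (e.g. a write attempted after the file has been closed)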
      pass 

Here is the output; the content of one of the files is shown below.

word, separater, detail, lineSep
abandon, & ,

【考法1】n. 放纵: carefreedouhao freedom from constraint

[例] add spices to the stew with complete abandon 肆无忌惮地向炖菜里面加调料

[近] unconstraintdouhao uninhibitednessdouhao始知

【考法2】v. 放纵: to give (oneself) over unrestrainedly

[例] abandon oneself to emotion 感情用事‖abandon herself to a life of complete idleness 她放纵自己过着闲散的生活

[近] indulgedouhao surrender

【考法3】v. 放弃: to withdraw from often in the face of danger or encroachment

[例] abandon the ship/homes 弃船; 离家

[反] salvage 救援

【考法4】v. 停止做某事: to bring to an end (something planned or previously agreed to)

[例] Bad weather forced NASA to abandon the launch. 坏天气迫使NASA停止了发射。

[近] abortdouhao dropdouhao repealdouhao rescinddouhao revokedouhao call offdouhao give up

[反] keepdouhao continuedouhao maintaindouhao carry on 继续,$abate, & ,

【考法1】v. 减轻(程度或者强度): to reduce in degree or intensity

[例] abate his rage/pain 平息他的愤怒/减轻他的痛苦

[近] moderatedouhao recededouhao subsidedouhao remitdouhao wanedouhao die (away or down or out)douhao let updouhao phase downdouhao taper off

[反] intensify加强, 加剧

【考法2】v. 减少(数量), 降低(价值): to reduce in amount or value

[例] abate a tax 降低税收

[近] de-escalatedouhao depletedouhao downscaledouhao dwindledouhao ratchet (down)

[反] augmentdouhao promote 增加

【考法3】v. 停止, 撤销: to put an end to

[例] abate a nuisance 停止伤害

[近] abrogatedouhao annuldouhao invalidatedouhao nullifydouhao rescinddouhao vacate,$
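A further variation on the same idea (not part of the original answer): opening each day's file with an explicit encoding and letting the csv module quote the fields avoids both the ASCII fallback and the 'douhao' substitution. A sketch, assuming word and newWordDetail come from the same scraping loop as above:

import csv

fileName = "Word Day One.csv"    # same naming scheme as the script above
with open(fileName, 'w', encoding='utf-8-sig', newline='') as f:    # utf-8-sig lets Excel detect the encoding
    writer = csv.writer(f)
    writer.writerow(["word", "separater", "detail", "lineSep"])
    # word and newWordDetail would be produced by the scraping code above;
    # this row is only an illustrative stand-in.
    writer.writerow(["abandon", "&", "【考法1】n. 放纵: carefree, freedom from constraint", "$"])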


Hi 黄哥, yeah sure, at first I thought it could run into copyright issues so I replaced the true url, but I just figured out I can change that later, so I have now changed it to the correct url. Thanks for your help! – Hang


pageSoup = soup(page_html, "html.parser", from_encoding='gbk') –


Thanks 黄哥, let me try – Hang
