Encoding Chinese characters in Python and TypeError: a bytes-like object is required, not 'str'

Hi everyone! I've written a web scraper in Python that scrapes word samples and so on from a dictionary-ish website for my GRE word list and writes them into CSV files. The word-detail content contains Chinese characters.
The only problem with my script is that when I try to write the results out to a CSV file, I get either

UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)

or

TypeError: a bytes-like object is required, not 'str'
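A minimal sketch of how each error can arise with Chinese text, assuming a made-up sample string and file name (neither is from the actual site or script):

```python
# Minimal sketch of how each error arises with Chinese text.
# The sample string and the file name "demo.bin" are invented for illustration.
s = "放弃 abandon"

# UnicodeEncodeError: encoding text with a codec that cannot represent
# Chinese characters, e.g. the ASCII codec:
try:
    s.encode("ascii")
except UnicodeEncodeError as e:
    print(type(e).__name__)  # UnicodeEncodeError

# TypeError: a bytes-like object is required, not 'str' -- raised when
# a str is written to a file opened in binary mode:
try:
    with open("demo.bin", "wb") as f:
        f.write(s)
except TypeError as e:
    print(type(e).__name__)  # TypeError
```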
Here is my full code:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# make a word list (grabbed from the wordlist pdf, converted to Excel and extracted)
wordList = '''Day One
abandon
abate
abbreviate
Day Two
abate
abbreviate
Day Three
abandon
abate
Day Four
abandon
abate
abandon
abate
Day Five
abandon
abate
Day Six
abandon
abate
Day Seven
abandon
abate'''
wordList = [y for y in (x.strip() for x in wordList.splitlines()) if y]
dayIndex = 0
dayArray = ['Day One', 'Day Two', 'Day Three', 'Day Four', 'Day Five', 'Day Six', 'Day Seven']
for item in wordList:
    if item == dayArray[dayIndex]:
        if dayIndex == 0:
            fileName = "Word " + dayArray[dayIndex] + ".csv"
            f = open(fileName, 'w')
            headers = "word, separater, detail, lineSep\n"
            f.write(headers)
            dayIndex += 1
        elif dayIndex == 6:
            f.close()
        else:
            f.close()
            fileName = "Word " + dayArray[dayIndex] + ".csv"
            f = open(fileName, 'w')
            headers = "word, separater, detail, lineSep\n"
            f.write(headers)
            dayIndex += 1
    else:
        # construct the url for each word
        myUrl = 'http://gre.kmf.com/vocab/detail/' + item
        # open the connection and grab the page
        uClient = uReq(myUrl)
        page_html = uClient.read()
        uClient.close()
        # html parsing
        pageSoup = soup(page_html, "html.parser")
        # grab the word container
        container = pageSoup.findAll("div", {"class": "word-d-maintile"})
        contain = container[0]  # actually only 1 item in the container list
        # grab the word (should be the same as item)
        word = contain.span.text
        # grab the word detail
        wordDetail_container = contain.findAll("div", {"class": "word-g-translate"})
        wordDetail = wordDetail_container[0].text.strip()  # again only 1 item; strip() the extra spaces and useless indentation
        # rebuild the string wordDetail with line breaks before the markers
        detailArray = []
        for letter in wordDetail:
            if letter != '【' and letter != '例' and letter != '近' and letter != '反':
                detailArray.append(letter)
            elif letter == '【':
                detailArray.append("\n\n\n" + letter)
            else:
                detailArray.append("\n\n" + '[' + letter + ']' + ' ')
        newWordDetail = ''.join(detailArray)
        # print("CUT\n")        # debug
        # print(word + '\n')    # debug
        # print(newWordDetail)  # debug
        f.write(word + ',' + '&' + ',' + newWordDetail.replace(',', 'douhao') + ',' + '$')
The problem is with the last line. When the first error occurred, I added a `.encode('gb2312')` after newWordDetail to try to encode those Chinese characters, but once I did that I got the second error instead. I have searched online but could hardly find a solution that fits my situation.
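A minimal sketch of an approach that avoids both errors, assuming UTF-8 output is acceptable: open the CSV in text mode with an explicit encoding and write plain str objects, with no manual `.encode()` calls (the file name and sample strings below are invented for illustration):

```python
# Sketch: open the CSV in text mode with an explicit UTF-8 encoding and
# write str objects directly -- no manual .encode() calls needed.
# The file name and sample strings are invented for illustration.
detail = "【考法】vt. 放弃: to give up"

with open("word_day_one.csv", "w", encoding="utf-8") as f:
    f.write("word, separater, detail, lineSep\n")
    f.write("abandon" + "," + "&" + "," + detail.replace(",", "douhao") + "," + "$")

# Reading the file back with the same encoding round-trips the characters.
with open("word_day_one.csv", encoding="utf-8") as f:
    content = f.read()
print("放弃" in content)  # True
```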
Thank you for saving my life!
On the 2nd line should specify the encoding name, i.e. '# -*- coding: utf-8 -*-' – davedwards
Thanks downshift, I copied these two lines from a SO question but didn't notice the content... I've added it but the code still doesn't work :( – Hang