2017-08-30

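Problem saving unicode to SQLite

I'm currently working on a project that involves reading log files from an SMTP server and extracting meaningful information about each email that passes through. I have a table whose columns will later be used for searching: spam score, sender domain, recipient domain, timestamp, subject, and so on. Everything worked fine until I ran into some non-ASCII characters, usually in the subject field (as expected).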

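I tried decoding the str as ISO-8859-1 (which is the encoding of the file) and saving it, and I also tried encoding it back to UTF-8, and to be honest I'm a bit lost here. I've heard that working with unicode in Python 2.7 is a nightmare, but until now I had never experienced it myself.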

Anyway, let me explain. This is how I extract the subject:

if 'subject' in realInfo:
    emailDict[keywrd].setSubject(realInfo[realInfo.index('subject') +
                                          len('subject') + 1:].decode('ISO-8859-1'))

emailDict is a dictionary that holds all the emails currently being processed.
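As a minimal sketch of that extraction step in isolation: the log line and the helper name `extract_subject` below are made up for illustration, and the real log format may well differ. It slices everything after the `subject` keyword and decodes the bytes as ISO-8859-1, which gives a unicode string on Python 2 and a `str` on Python 3.

```python
def extract_subject(raw_line, keyword=b'subject'):
    """Return the decoded text after `keyword`, or None if it is absent.

    Hypothetical helper: assumes exactly one separator character follows
    the keyword, mirroring the "+ 1" in the slicing shown above.
    """
    if keyword not in raw_line:
        return None
    start = raw_line.index(keyword) + len(keyword) + 1
    return raw_line[start:].decode('ISO-8859-1')


# Made-up log line; byte 0xE9 is "e with acute accent" in ISO-8859-1.
line = b'id=42 subject Caf\xe9 menu'
subject = extract_subject(line)
```

Since every byte value is valid in ISO-8859-1, this decode step itself cannot raise, so an error further down the line has to come from somewhere else.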

And this is how I insert everything into the database:

    info = (e.getID(), str(e.getSpamScore()), str(e.getMCPScore()), " ".join(e.getFrom()), " ".join(e.getTo()), e.getStatus(), e.getTimestamp(), e.getSubject(), dumps(e))
    print repr(e.getSubject()) # DEBUG 
    print type(e.getSubject()) # DEBUG 
    self.conn.cursor().execute(u"INSERT INTO emails (emailID, SpamScore, MCPScore, FromDomain, ToDomain, status, timestamp, subject, object)" 
         " VALUES (?,?,?,?,?,?,?,?,?)", info) 
    self.conn.commit() 

I added the two print statements to help me work out where the problem is.

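'e' is an Email object, which serves as a blueprint for each email. It holds the information previously gathered by the interpreter. After that I save the most important information which, as mentioned, will be used for searching (the "object" column is the Email object itself, pickled). But as soon as a special character shows up, an exception is raised: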

u'VPXL \xffM-^W no more compromises. Better size, better life. \n' 
<type 'unicode'> 
Exception in thread Thread-25: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/ProjMail/projMail_lib.py", line 174, in refresher
    self.interpreter.start()
  File "/ProjMail/projMail_lib.py", line 213, in start
    c.save(self.emailTracker)
  File "/ProjMail/projMail_lib.py", line 56, in save
    self.saveEmails()
  File "/ProjMail/projMail_lib.py", line 62, in saveEmails
    else: self.add(key) # If it's new
  File "/ProjMail/projMail_lib.py", line 82, in add
    " VALUES (?,?,?,?,?,?,?,?,?)", info)

ProgrammingError: You must not use 8-bit bytestrings unless you use a 
text_factory that can interpret 8-bit bytestrings (like text_factory = str). 
It is highly recommended that you instead just switch your application to 
Unicode strings.   

From what I can see, the subject is unicode, so I don't understand why SQLite is complaining. Any idea what I might be doing wrong here? Thanks in advance!

Answer

The problem is not inserting the subject itself into the database; it is inserting the pickled Email instance.

>>> import pickle
>>> import sqlite3
>>> subject = u'VPXL \xffM-^W no more compromises. Better size, better life. \n'
>>> conn = sqlite3.connect(':memory:') 
>>> c = conn.cursor()        
>>> c.execute("""CREATE TABLE foo (bar text, baz text)""")         
<sqlite3.Cursor object at 0x7fab5cf280a0> 
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, 'random text')) 
<sqlite3.Cursor object at 0x7fab5cf280a0> 

>>> class Email(object):pass 
... 
>>> e = Email() 
>>> e.subject = subject 
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e))) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. 

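Pickling the Email instance produces a byte string (str) that mixes encodings internally and contains non-ASCII bytes, and passing that to execute() triggers the exception (pickling just the subject on its own does the same).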
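One way to check this yourself is to look for bytes above 0x7F in the pickle output. The following is only a sketch and not part of the session above; the shortened subject string and the variable names are illustrative. Protocol 0 is requested explicitly because it is the Python 2 default, which also makes the snippet behave the same way on Python 3.

```python
import pickle

# Shortened stand-in for the subject from the question (illustrative).
subject = u'VPXL \xff no more compromises.'

# Protocol 0 is what pickle.dumps() uses by default on Python 2.
pickled = pickle.dumps(subject, protocol=0)

# bytearray yields integers on both Python 2 and Python 3.
has_8bit = any(byte >= 0x80 for byte in bytearray(pickled))
```

If `has_8bit` is true, the pickled value is exactly the kind of 8-bit bytestring the error message is talking about, even though the subject it was built from is a proper unicode object.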

To prevent the exception, you can change the connection's text_factory attribute to str:

>>> conn.text_factory = str 
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e)))
<sqlite3.Cursor object at 0x7fab5b3343b0> 

If you would rather keep the default unicode text_factory, you can store the pickled instance in a blob column, wrapped in a buffer instance.

>>> conn.text_factory = unicode 
>>> c.execute("""CREATE TABLE foo2 (bar text, baz blob)""") 
>>> c.execute("""INSERT INTO foo2 VALUES (?, ?)""", (subject, buffer(pickle.dumps(e))))
<sqlite3.Cursor object at 0x7fab5b3343b0> 

The pickled instance can be restored on retrieval:

>>> c.execute("""SELECT bar, baz FROM foo2""") 
<sqlite3.Cursor object at 0x7fab5b3343b0> 
>>> res = c.fetchone() 
>>> res 
(u'VPXL \xffM-^W no more compromises. Better size, better life. \n', <read-write buffer ptr 0x7fab5e9706c8, size 167 at 0x7fab5e970688>) 
>>> pickle.loads(res[1]) 
<__main__.Email object at 0x7fab5b333ad0> 
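For reference, here is a hedged sketch of the same blob round trip written so that it should run on both Python 2 and Python 3. It uses `sqlite3.Binary` (an alias for `buffer` on Python 2 and `memoryview` on Python 3) instead of calling `buffer` directly. The dictionary `email_state` is just a stand-in for the pickled Email instance, and the shortened subject is illustrative.

```python
import pickle
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE foo2 (bar TEXT, baz BLOB)")

subject = u'VPXL \xff no more compromises.'
# Stand-in for the Email instance from the question (illustrative only).
email_state = {'subject': subject, 'spam_score': 5.0}

# sqlite3.Binary marks the pickled bytes as a BLOB, so sqlite3 does not
# try to treat them as text.
c.execute("INSERT INTO foo2 VALUES (?, ?)",
          (subject, sqlite3.Binary(pickle.dumps(email_state))))
conn.commit()

bar, baz = c.execute("SELECT bar, baz FROM foo2").fetchone()
# bytes() converts the Python 2 buffer to a str; on Python 3 the value
# is already bytes, so the call changes nothing.
restored = pickle.loads(bytes(baz))
```

Keeping the searchable fields in text columns and the pickle in a blob column means the text_factory can stay at its default, so text columns still come back as unicode.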

I did what you suggested, and it worked! Thank you very much!