2015-11-05 83 views

I have downloaded the mail archive from my Gmail account. I am using the following Python (2.7) code, taken from a blog, to convert the contents of the archive to CSV.

import mailbox 
import csv 
writer = csv.writer(open("clean_mail.csv", "wb"))
for message in mailbox.mbox('archive.mbox'): 
    writer.writerow([message['subject'], message['from'], message['date']]) 

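I would also like to include the body of each email (the actual message), but I don't know how. I haven't used Python before; can someone help? I have tried other suggestions from SO but could not get them to work.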

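To accomplish the same task I also tried the code below, but I get an indentation error at line 60: return json_msg. I have tried different indentation options without any improvement.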

import sys 
import mailbox 
import email 
import quopri 
import json 
import time 
from BeautifulSoup import BeautifulSoup 
from dateutil.parser import parse 

MBOX = 'Users/mymachine/client1/Takeout/Mail/archive.mbox' 
OUT_FILE = 'Users/mymachine/client1/Takeout/Mail/archive.mbox.json' 

def cleanContent(msg): 
    msg = quopri.decodestring(msg) 
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True)) 
# There's a lot of data to process, and the Pythonic way to do it is with a 
# generator. See http://wiki.python.org/moin/Generators. 
# Using a generator requires a trivial encoder to be passed to json for object 
# serialization. 
class Encoder(json.JSONEncoder): 
    def default(self, o): return list(o) 

def gen_json_msgs(mb): 
    while 1:
        msg = mb.next()
        if msg is None:
            break
            yield jsonifyMessage(msg)

def jsonifyMessage(msg): 
    json_msg = {'parts': []} 
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
    json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\ 
    .replace(' ', '').decode('utf-8', 'ignore').split(',') 

for part in msg.walk(): 
    json_part = {} 
    if part.get_content_maintype() == 'multipart':
        continue


    json_part['contentType'] = part.get_content_type() 
    content = part.get_payload(decode=False).decode('utf-8', 'ignore') 
    json_part['content'] = cleanContent(content) 

    json_msg['parts'].append(json_part) 
    then = parse(json_msg['Date']) 
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000) 
    json_msg['Date'] = {'$date' : millis} 

return json_msg 

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file) 

f = open(OUT_FILE, 'w') 
for msg in gen_json_msgs(mbox): 
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close() 

Answers

1

Try this.

import mailbox 
import csv 
writer = csv.writer(open("clean_mail.csv", "wb"))
for message in mailbox.mbox('archive.mbox'): 
    if message.is_multipart():
        content = ''.join(part.get_payload() for part in message.get_payload())
    else:
        content = message.get_payload()
    writer.writerow([message['subject'], message['from'], message['date'],content]) 

Or this:

import mailbox 
import csv 

def get_message(message):
    if not message.is_multipart():
        return message.get_payload()
    contents = ""
    for msg in message.get_payload():
        contents = contents + str(msg.get_payload()) + '\n'
    return contents

if __name__ == "__main__":

    writer = csv.writer(open("clean_mail.csv", "wb"))
    for message in mailbox.mbox("archive.mbox"):
        contents = get_message(message)
        writer.writerow([message["subject"], message["from"], message["date"], contents])

See the documentation here.
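
The snippets above join `get_payload()` results directly, which only works when every part of a multipart message is a plain string. A minimal sketch of an alternative, assuming Python 3 and assuming that only the `text/plain` parts are wanted: `Message.walk()` visits every part at any nesting depth, and `get_payload(decode=True)` undoes base64 or quoted-printable transfer encoding. The helper name `extract_text_body`, the UTF-8 fallback charset, and the sample message are illustrative choices, not part of the original answer.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_text_body(message):
    """Return the concatenated text/plain parts of an email message."""
    chunks = []
    for part in message.walk():
        # Container parts hold other parts and carry no content of their own.
        if part.is_multipart():
            continue
        if part.get_content_type() != "text/plain":
            continue
        raw = part.get_payload(decode=True)  # bytes, transfer encoding removed
        if raw is None:
            continue
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        try:
            chunks.append(raw.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the message: fall back to UTF-8.
            chunks.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


# Hypothetical nested multipart message used to exercise the helper.
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("hello from the plain part", "plain", "utf-8"))
inner.attach(MIMEText("<p>hello from the html part</p>", "html", "utf-8"))
outer = MIMEMultipart("mixed")
outer.attach(inner)
outer.attach(MIMEText("second plain part", "plain", "utf-8"))

body = extract_text_body(outer)
print(body)
```

The returned string can be passed to `writer.writerow` as the fourth column in place of `content` or `contents`.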

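Thanks a lot, Rahul. I tried the code but got this error: Traceback (most recent call last): File "sendmail.py", line 10, in <module> content = ''.join(part.get_payload() for part in message.get_payload()) TypeError: sequence item 0: expected string, list found – Apricot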

Try this and tell me the result. `content = ''.join(''.join(part.get_payload()) for part in message.get_payload())` – Rahul

I got the following message: `Traceback (most recent call last): File "sendmail.py", line 10, in <module> content = ''.join(''.join(part.get_payload()) for part in message.get_payload()) File "sendmail.py", line 10, in <genexpr> content = ''.join(''.join(part.get_payload()) for part in message.get_payload()) TypeError: sequence item 0: expected string, instance found` – Apricot
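
Both errors in these comments appear to share one cause: for a multipart message `get_payload()` returns a list of `Message` objects rather than a string, and a part can itself be multipart. A small reproduction, assuming Python 3 (where the wording of the `TypeError` differs slightly from the Python 2 output quoted above); the sample message is made up for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical message: one multipart container nested inside another.
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("plain text", "plain"))
inner.attach(MIMEText("<p>html</p>", "html"))
outer = MIMEMultipart("mixed")
outer.attach(inner)

top_level = outer.get_payload()      # list of Message objects
nested = top_level[0].get_payload()  # again a list, not a string
leaf = nested[0].get_payload()       # only a leaf part yields a str

print(type(top_level).__name__, type(nested).__name__, type(leaf).__name__)

# Joining the payloads of the top-level parts fails because one is a list.
try:
    "".join(part.get_payload() for part in top_level)
    join_failed = False
except TypeError as exc:
    join_failed = True
    print("TypeError:", exc)
```

Any fix therefore has to descend through the nested lists, either recursively or with `Message.walk()`, before joining strings.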

0

A small improvement to Rahul's snippet for multipart content:

import sys 
import mailbox 
import csv 
from email.header import decode_header 

infile = sys.argv[1] 
outfile = sys.argv[2] 
writer = csv.writer(open(outfile, "w")) 


def get_content(part):
    content = ''
    payload = part.get_payload()
    if isinstance(payload, str):
        content += payload
    else:
        for part in payload:
            content += get_content(part)
    return content


writer.writerow(['date', 'from', 'to', 'subject', 'content']) 
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['date'],
        message['from'].strip('>').split('<')[-1],
        message['to'],
        decode_header(message['subject'])[0][0],
        content
    ]
    writer.writerow(row)
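
For Python 3, a sketch of the same conversion under a few assumptions: the CSV file is opened in text mode with `newline=""`, as the `csv` module documentation recommends; headers are decoded with `email.header.make_header`; the sender address is extracted with `email.utils.parseaddr` instead of string splitting; and the body is taken from decoded `text/plain` parts, assumed to be UTF-8. The function names `header_text`, `plain_text` and `mbox_to_csv` are hypothetical.

```python
import csv
import mailbox
from email.header import decode_header, make_header
from email.utils import parseaddr


def header_text(value):
    """Decode an RFC 2047 encoded header; a missing header becomes ''."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(str(value))))
    except (LookupError, UnicodeError):
        return str(value)  # fall back to the raw header text


def plain_text(message):
    """Join the decoded text/plain parts found at any nesting depth."""
    chunks = []
    for part in message.walk():
        if part.is_multipart() or part.get_content_type() != "text/plain":
            continue
        raw = part.get_payload(decode=True)
        if raw is not None:
            # Assumption: bodies are UTF-8; undecodable bytes are replaced.
            chunks.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def mbox_to_csv(infile, outfile):
    """Write one CSV row per message in the mbox; return the row count."""
    count = 0
    box = mailbox.mbox(infile)
    try:
        with open(outfile, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["date", "from", "to", "subject", "content"])
            for message in box:
                writer.writerow([
                    header_text(message["date"]),
                    parseaddr(header_text(message["from"]))[1],
                    header_text(message["to"]),
                    header_text(message["subject"]),
                    plain_text(message),
                ])
                count += 1
    finally:
        box.close()
    return count
```

Called as `mbox_to_csv("archive.mbox", "clean_mail.csv")`, it writes a header row followed by one row per message.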