2010-07-15 93 views
4

I am very new to Python and would like to use it to parse a text file. The file has between 250 and 300 lines of the following format:

---- Mark Grey ([email protected]) changed status from Busy to Available @ 14/07/2010 16:32:36 ---- 
---- Silvia Pablo ([email protected]) became Available @ 14/07/2010 16:32:39 ---- 

For every entry in that file, I need to store the following information in another file (Excel or text):

UserName/ID Previous Status New Status Date Time 

So my result file should look like this for the entries above:

Mark Grey/[email protected] Busy Available 14/07/2010 16:32:36 
Silvia Pablo/[email protected] NaN Available 14/07/2010 16:32:39 

Thanks in advance,

Any help would be much appreciated.

+1

The NaN ........... – 2010-07-15 07:38:16

+0

Well, it's Not a Number alright :) – 2010-07-15 07:54:00

+1

Editor's note: Marcelo and Tim have given you good answers for what you are trying to do. For further reading, here is the documentation for the regular expression library that ships with Python; it may help you extend the code later: http://docs.python.org/library/re.html – 2010-07-15 07:51:17

Answers

1

Well, if I were to approach this problem, I would probably start by splitting each entry into its own separate string. This looks like it might be line-oriented, so inputfile.split('\n') would probably be adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
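For example, a minimal sketch of that idea (the data.txt file name and the exact field layout are only assumptions taken from the sample lines above):

import re

# One pattern per kind of status change; each important field is a subgroup.
changed_pat = re.compile(r"----\s+(?P<name>.+?) \((?P<email>.+?)\) changed status "
                         r"from (?P<prev>\w+) to (?P<new>\w+) @ (?P<when>.+?) ----")
became_pat = re.compile(r"----\s+(?P<name>.+?) \((?P<email>.+?)\) became "
                        r"(?P<new>\w+) @ (?P<when>.+?) ----")

entries = []
for entry in open("data.txt").read().split('\n'):   # split the input into lines
    m = changed_pat.match(entry) or became_pat.match(entry)
    if m:
        entries.append(m.groupdict())   # e.g. {'name': ..., 'prev': ..., 'new': ..., 'when': ...}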

6
import re 

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*") 
with open("data.txt") as f: 
    for line in f: 
     (name, email, prev, curr, date) = pat.match(line).groups() 
     print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date) 

This makes some assumptions about whitespace and assumes that every line matches the pattern. If you want to handle dirty input gracefully, you may need to add error checking (such as checking that pat.match() does not return None).
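For instance, a hedged variation of the loop above that reports and skips any lines the pattern does not recognise (the data.txt name is again just an assumption) might look like this:

import re
import sys

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")

with open("data.txt") as f:
    for lineno, line in enumerate(f, 1):
        m = pat.match(line)
        if m is None:                       # dirty input: report it and keep going
            sys.stderr.write("line %d did not match: %r\n" % (lineno, line))
            continue
        name, email, prev, curr, date = m.groups()
        # ... format and store the fields as before ...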

15

To get you started:

import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)
with open("inputfile") as f: 
    for line in f: 
     match = regex.match(line) 
     if match: 
      result.append([ 
       match.group("name"), 
       match.group("email"), 
       match.group("previous") 
       # etc. 
      ]) 
     else: 
      # Match attempt failed 

will give you a list of the various parts of each match. I would then suggest you use the csv module to store the results in a standard format.
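A minimal sketch of that last step, assuming result already holds the rows collected above and writing to a hypothetical output.csv (the column names simply follow the layout asked for in the question):

import csv

# 'result' is the list of field lists built in the loop above
with open("output.csv", "wb") as out:   # on Python 3, use open("output.csv", "w", newline="")
    writer = csv.writer(out)
    writer.writerow(["UserName/ID", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)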

6

The two RE patterns of interest would seem to be...:

p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) @ (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) @ (\S+) (\S+) ----$'

So I would do:

import csv, re, sys 

# assign p1, p2 as above (or enhance them, etc etc) 

r1 = re.compile(p1) 
r2 = re.compile(p2) 
data = [] 

with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f: 
    w = csv.writer(f) 
    w.writerow('UserName/ID Previous Status New Status Date Time'.split()) 
    w.writerows(data) 

They may of course need to be adjusted if the two patterns I've described don't cover every case, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatic, ad hoc text processing.

Maybe the dislike is explained by others wanting to use REs for absurd purposes such as ad hoc parsing of CSV, HTML, XML, and so on - structured text formats for which perfectly good parsers exist! And also for tasks well beyond REs' "comfort zone", which instead call for solid general parser systems like pyparsing. Or, at the other extreme, for super-simple tasks done perfectly well by plain strings (e.g., I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!).

But for the reasonably broad swath of tasks for which REs are well suited (excluding the very simplest at one extreme, and the parsing of structured or somewhat-complicated grammars at the other), there is really nothing wrong with using them, and I recommend that all programmers learn at least the basics of REs.

4

Alex mentioned pyparsing, so here is a pyparsing approach to your same problem:

from pyparsing import Word, Suppress, Regex, oneOf, SkipTo 
import datetime 

DASHES = Word('-').suppress() 
LPAR,RPAR,AT = map(Suppress,"()@") 
date = Regex(r'\d{2}/\d{2}/\d{4}') 
time = Regex(r'\d{2}:\d{2}:\d{2}') 
status = oneOf("Busy Available Idle Offline Unavailable") 

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate') 
statechange2 = 'became' + status('tostate') 
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR + 
      (statechange1 | statechange2) + 
      AT + date('date') + time('time') + DASHES) 

def convertFields(tokens): 
    if 'fromstate' not in tokens: 
     tokens['fromstate'] = 'NULL' 
    tokens['name'] = tokens.name.strip() 
    tokens['email'] = tokens.email.strip() 
    d,mon,yr = map(int, tokens.date.split('/')) 
    h,m,s = map(int, tokens.time.split(':')) 
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s) 
linefmt.setParseAction(convertFields) 

# 'text' is assumed to hold the full contents of the log file
for line in text.splitlines():
    fields = linefmt.parseString(line) 
    print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields 

Prints:

Mark Grey/[email protected] Busy  Available 2010-07-14 16:32:36 
Silvia Pablo/[email protected] NULL  Available 2010-07-14 16:32:39 

pyparsing lets you attach names to the result fields (just like the named groups in Tim Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed data - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for handling after parsing, with no extra muss or fuss.

Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:

for line in text.splitlines(): 
    fields = linefmt.parseString(line) 
    print fields.dump() 

Prints:

['Mark Grey ', '[email protected]', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36'] 
- date: 14/07/2010 
- datetime: 2010-07-14 16:32:36 
- email: [email protected] 
- fromstate: Busy 
- name: Mark Grey 
- time: 16:32:36 
- tostate: Available 
['Silvia Pablo ', '[email protected]', 'became', 'Available', '14/07/2010', '16:32:39'] 
- date: 14/07/2010 
- datetime: 2010-07-14 16:32:39 
- email: [email protected] 
- fromstate: NULL 
- name: Silvia Pablo 
- time: 16:32:39 
- tostate: Available 

I suspect that as you continue to work on this problem, you will find other variations of the input text format that specify how the user's status changed. In that case, you would just add another definition like statechange1 or statechange2 and insert it into linefmt alongside the others (a sketch follows below). I feel that pyparsing's way of structuring the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.
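For example, if a hypothetical third wording such as "---- Mark Grey ([email protected]) went Offline @ 14/07/2010 17:00:00 ----" turned up in the logs (the "went" phrasing is only an assumption for illustration), the grammar above could be extended along these lines:

# hypothetical extra variant; it fills 'tostate' so convertFields keeps working unchanged
statechange3 = 'went' + status('tostate')

# rebuild linefmt with the new alternative alongside the existing ones
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2 | statechange3) +
           AT + date('date') + time('time') + DASHES)
linefmt.setParseAction(convertFields)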

1

Thanks a lot for all of your comments - they were very useful. I wrote my code using the directory-reading functionality: it reads the data files in a directory and creates, for each user, an output file containing all of that user's status updates. The code is pasted below.

#Script to extract info from individual data files and print out a data file combining info from these files 

import os 
import commands 

dataFileDir="data/"; 

#Dictionary linking names to email ids 
#For the time being, assume no 2 people have the same name 
usrName2Id={}; 

#User id to user name mapping to check for duplicate names 
usrId2Name={}; 

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values 
infoDict={}; 

#Given an array of space tokenized inputs, extract user name 
def getUserName(info,mailInd): 

    userName=""; 
    for i in range(mailInd-1,0,-1): 

     if info[i].endswith("-") or info[i].endswith("+"): 
      break; 

     userName=info[i]+" "+userName; 

    userName=userName.strip(); 
    userName=userName.replace("  "," ");  #collapse double spaces before converting spaces to underscores
    userName=userName.replace(" ","_"); 

    return userName; 

#Given an array of space tokenized inputs, extract time stamp 
def getTimeStamp(info,timeStartInd): 
    timeStamp=""; 
    for i in range(timeStartInd+1,len(info)): 
     timeStamp=timeStamp+" "+info[i]; 

    timeStamp=timeStamp.replace("-",""); 
    timeStamp=timeStamp.strip(); 
    return timeStamp; 

#Given an array of space tokenized inputs, extract status message 
def getStatusMsg(info,startInd,endInd): 
    msg=""; 
    for i in range(startInd,endInd): 
     msg=msg+" "+info[i]; 
    msg=msg.strip(); 
    msg=msg.replace(" ","_"); 
    return msg; 

#Extract and store info from each line in the datafile 
def extractLineInfo(line): 

    print line; 
    info=line.split(" "); 

    mailInd=-1;userId="-NONE-"; 
    timeStartInd=-1;timeStamp="-NONE-"; 
    becameInd=-1;
    statusMsg="-NONE-"; 

    #Find indices of email id and "@" char indicating start of timestamp 
    for i in range(0,len(info)): 
     #print (str(i)+" "+info[i]); 
     if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")): 
      mailInd=i; 
     if(info[i]=="@"): 
      timeStartInd=i; 

     if(info[i]=="became"): 
      becameInd=i; 

    #Debug print of mail and time stamp start inds 
    """print "\n"; 
    print "Index of mail id: "+str(mailInd); 
    print "Index of time start index: "+str(timeStartInd); 
    print "\n";""" 

    #Extract IBM user id and name for lines with ibm id 
    if(mailInd>=0): 
     userId=info[mailInd].replace("(",""); 
     userId=userId.replace(")",""); 
     userName=getUserName(info,mailInd); 
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18" 
    elif(becameInd>0): 
     userName=getUserName(info,becameInd); 

    #Time stamp info 
    if(timeStartInd>=0): 
     timeStamp=getTimeStamp(info,timeStartInd); 
     if(mailInd>=0): 
      statusMsg=getStatusMsg(info,mailInd+1,timeStartInd); 
     elif(becameInd>0): 
      statusMsg=getStatusMsg(info,becameInd,timeStartInd); 

    print userId; 
    print userName; 
    print timeStamp 
    print statusMsg+"\n"; 

    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"): 
     usrName2Id[userName]=userId; 

    #Store status messages keyed by user email ids 
    timeDict={}; 

    #Retrieve user id corresponding to user name 
    if userName in usrName2Id: 
     userId=usrName2Id[userName]; 

    #For valid user ids, store status message in the dict within dict data str arrangement 
    if not(userId=="-NONE-"): 

     if not(userId in infoDict.keys()): 
      infoDict[userId]={}; 

     timeDict=infoDict[userId]; 
     if not(timeStamp in timeDict.keys()): 
      timeDict[timeStamp]=statusMsg; 
     else: 
      timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg; 


#Print for each user a file containing status 
def printStatusFiles(dataFileDir): 


    volNum=0; 

    for userName in usrName2Id: 
     volNum=volNum+1; 

     filename=dataFileDir+"/"+"status-"+str(volNum)+".txt"; 
     file = open(filename,"w"); 

     print "Printing output file name: "+filename; 
     print volNum,userName,usrName2Id[userName]+"\n"; 
     file.write(userName+" "+usrName2Id[userName]+"\n"); 

     timeDict=infoDict[usrName2Id[userName]]; 
     for time in sorted(timeDict.keys()): 
      file.write(time+" "+timeDict[time]+"\n");
     file.close();


#Read and store data from individual data files 
def readDataFiles(dataFileDir): 

    #Process each datafile 
    files=os.listdir(dataFileDir) 
    files.sort(); 
    for i in range(0,len(files)): 
    #for i in range(0,1): 

     file=files[i]; 

     #Do not process other non-data files lying around in that dir 
     if not file.endswith(".txt"): 
      continue 

     print "Processing data file: "+file 
     dataFile=dataFileDir+str(file); 
     inpFile=open(dataFile,"r"); 
     lines=inpFile.readlines(); 

     #Process lines 
     for line in lines: 

      #Clean lines 
      line=line.strip(); 
      line=line.replace("/India/Contr/IBM",""); 
      line=line.strip(); 

      #Skip header line of the file and L's sign in sign out times 
      if(line.startswith("System log for account") or line.find("signed")>-1): 
       continue; 


      extractLineInfo(line); 


print "\n"; 
readDataFiles(dataFileDir); 
print "\n"; 
printStatusFiles("out/"); 
+0

@yhw42 What did you edit in this ancient post? I am wondering. By the way, the poster hasn't been seen since August 2010 – eyquem 2011-04-16 16:33:55

+0

@eyquem: It was [shouting at me](http://stackoverflow.com/suggested-edits/32961), so I fixed the formatting. ':)' – yhw42 2011-04-16 17:09:59

+0

@yhw42 Indeed, it was awful. Thanks for the explanation – eyquem 2011-04-16 17:16:53