Python - How to split text input into separate elements

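The input will have inconsistent line breaks, so I can't use the newline as a delimiter. The incoming text will be in the following format:

IDNumber FirstName LastName Score Letter Location

IDNumber: 9 digits

Score: 0-100

Letter: A or B

Location: can be anything from an abbreviated state name to a fully spelled-out city and state. This is optional.

Example: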

123456789 John Doe 90 A New York City 987654321 
Jane Doe 70 B CAL 432167895 John 

Cena 60 B FL 473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

The elements are:

123456789 John Doe 90 A New York City 
987654321 Jane Doe 70 B CAL 
432167895 John Cena 60 B FL 
473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR

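I need to access each element individually for each person. So for the John Cena object, I need to be able to access ID: 432167895, first name: John, last name: Cena, B or A: B. I don't really need the location, but it will be part of the input.

Edit: It is worth mentioning that I am not allowed to import any modules, such as the regular expression module.

Source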

2017-04-19 Jackson Blankenship

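If the input is a string, I would start by [splitting the string on whitespace](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python).

There may be a more elegant way to do this, but here is an idea based on an example input string.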

text = "123456789 John Doe 90 A New York City 987654321 Jane Doe 70 B CAL 473829105 Donald Trump 70 E 098743215 Bernie Sanders 92 A AR"

# Split on whitespace; split() with no arguments also handles newlines
tokens = text.split()

# Store the records in a dictionary; this could then be dumped to a JSON file
data = {'output': []}
end = len(tokens)

i = 0

while i < end:
    tmp = {}
    tmp['id'] = tokens[i]
    i = i + 1
    tmp['fname'] = tokens[i]
    i = i + 1
    tmp['lname'] = tokens[i]
    i = i + 1
    tmp['score'] = tokens[i]
    i = i + 1
    tmp['letter'] = tokens[i]
    i = i + 1
    location = ""
    # Collect location words until the next ID (a purely numeric token)
    # or the end of the input; the bounds check avoids an IndexError
    while i < end and not tokens[i].isdigit():
        location = location + " " + tokens[i]
        i = i + 1

    # Drop the leading space; location stays "" when none was given
    tmp['location'] = location.strip()
    data['output'].append(tmp)

print(data)
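For the per-person access the question asks for, the same token-walking idea can be wrapped in a function and the result indexed by ID. This is only a minimal sketch without imports: `parse_records` and `by_id` are hypothetical names, and it assumes every record has exactly one first name and one last name and that a 9-digit token always starts a new record.

```python
def parse_records(text):
    """Split whitespace-separated text into one dict per person (no imports)."""
    tokens = text.split()
    keys = ('id', 'fname', 'lname', 'score', 'letter')
    records = []
    i = 0
    while i < len(tokens):
        # Assumption: the first five tokens of a record are always present
        rec = dict(zip(keys, tokens[i:i + 5]))
        i += 5
        # Everything up to the next 9-digit token is the optional location
        start = i
        while i < len(tokens) and not (len(tokens[i]) == 9 and tokens[i].isdigit()):
            i += 1
        rec['location'] = ' '.join(tokens[start:i])
        records.append(rec)
    return records


text = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Index the records by ID so one person can be looked up directly
by_id = {rec['id']: rec for rec in parse_records(text)}
cena = by_id['432167895']
print(cena['fname'], cena['lname'], cena['letter'])  # John Cena B
```

Keying on the ID rather than the name avoids collisions when two people share a name.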

Source

2017-04-19 21:54:41

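This works perfectly unless the location is not specified! Do you know how to fix that? The location element is optional.

I made an update that simply puts an empty string in the location when there is nothing there.

You could use a regular expression that requires each record to start with a 9-digit number, keeps words together where necessary, and skips the location: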

import re

# data holds the whole input text as a single string
res = re.findall(r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)", data)

The result is:

[('123456789', 'John', 'Doe', '90', 'A'), 
('987654321', 'Jane', 'Doe', '70', 'B'), 
('432167895', 'John', 'Cena', '60', 'B'), 
('473829105', 'Donald', 'Trump', '70', 'E'), 
('098743215', 'Bernie', 'Sanders', '92', 'A')]
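The tuples above are positional, so a field has to be picked out by index. One way to get named access is to pair each tuple with a list of field names. The sketch below reuses the pattern from this answer unchanged; the field names and the `people` variable are hypothetical, chosen only for illustration.

```python
import re

data = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

pattern = r"(\d{9})\s+(\S*)\s+(\S*(?:\s+\D\S*)*)\s+(\d+)\s+(\S*)"

# Hypothetical field names, in the same order as the capture groups
fields = ('id', 'fname', 'lname', 'score', 'letter')

# Turn every matched tuple into a dictionary keyed by field name
people = [dict(zip(fields, match)) for match in re.findall(pattern, data)]

print(people[2]['id'], people[2]['lname'])  # 432167895 Cena
```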

Source

2017-04-19 21:17:14 trincot

Since splitting on whitespace does not help with identifying the location, I would go straight to a regular expression:

import re 

input_string = """123456789 John Doe 90 A New York City 987654321 
Jane Doe 70 B CAL 432167895 John 

Cena 60 B FL 473829105 Donald Trump 70 E 
098743215 Bernie Sanders 92 A AR""" 

search_string=re.compile(r"([0-9]{9})\W+([a-zA-Z ]+)\W+([a-zA-Z ]+)\W+([0-9]{1,3})\W+([AB])\W+([a-zA-Z ]+)\W+") 
person_list = re.findall(search_string, input_string)

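This yields:

[('123456789', 'John', 'Doe', '90', 'A', 'New York City'), 
('987654321', 'Jane', 'Doe', '70', 'B', 'CAL'), 
('432167895', 'John', 'Cena', '60', 'B', 'FL')]

Only three of the five sample records match, because the pattern requires the letter to be A or B and a location followed by at least one more separator: the Donald Trump record (letter E, no location) and the final record (nothing after AR) are skipped.

Explanation of the groups in the regex:

ID: 9 digits (followed by at least one whitespace)
First and last name: 2 separate groups of letters (each followed by at least one whitespace)
Score: one, two or three digits (followed by at least one whitespace)
Letter: A or B (followed by at least one whitespace)
Location: a group of characters (followed by at least one whitespace)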
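Because the location is optional and the sample data contains a letter other than A or B, a looser pattern may be closer to what the question needs. The following is a hypothetical variant, not the pattern from this answer: it assumes names are single words of letters, accepts any one uppercase letter, and treats the location as optional text that ends just before the next 9-digit ID or the end of the input.

```python
import re

input_string = """123456789 John Doe 90 A New York City 987654321
Jane Doe 70 B CAL 432167895 John

Cena 60 B FL 473829105 Donald Trump 70 E
098743215 Bernie Sanders 92 A AR"""

# Hypothetical variant (assumptions listed in the text above):
# five mandatory groups, then an optional lazy location group that must be
# followed by whitespace plus a 9-digit ID, or by the end of the text
pattern = re.compile(
    r"(\d{9})\s+([A-Za-z]+)\s+([A-Za-z]+)\s+(\d{1,3})\s+([A-Z])"
    r"(?:\s+([A-Za-z ]+?))?(?=\s+\d{9}\b|\s*$)"
)

# findall returns '' for the location group when it did not take part in the match
person_list = pattern.findall(input_string)
for person in person_list:
    print(person)
```

With this variant all five sample records are returned, and a record without a location gets an empty string in the last position.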

Source

2017-04-19 21:20:40

Since you know that the ID number comes at the start of each "record" and is 9 digits long, try splitting on the 9-digit ID number:

# Assuming your file is read in as a string s: 
import re 
# Split on any whitespace (including newlines) that precedes a 9-digit ID 
records = re.split(r'\s+(?=[0-9]{9}\b)', s) 

# record_locator will end up holding your records as: {'<full name>' -> {'ID'-><ID value>, 'FirstName'-><FirstName value>, 'LastName'-><LastName value>, 'Letter'-><LetterValue>}, 'full name 2'->{...} ...} 
record_locator = {} 

field_names = ['ID', 'FirstName', 'LastName', 'Letter'] 

# Get the individual records and store their values: 
for record in records: 

    # split() with no arguments also copes with newlines inside a record 
    values = record.split()[:5] 

    # Discard the int after the name eg. 90 in the first record 
    del values[3] 

    # Create a new entry for the full name. This will overwrite entries with the same name so you might want to use a unique id instead 
    record_locator[values[1] + ' ' + values[2]] = dict(zip(field_names, values))

Then access the information:

print(record_locator['John Doe']['ID'])  # 123456789

Source

2017-04-19 21:26:00 sgrg

I think trying to split on the 9-digit number is probably the best option.

import re 

with open('data.txt') as f: 
    data = f.read() 
    results = re.split(r'(\d{9}[\s\S]*?(?=[0-9]{9}))', data) 
    results = list(filter(None, results)) 
    print(results)

This gives me these results:

['123456789 John Doe 90 A New York City ', '987654321\nJane Doe 70 B CAL ', '432167895 John\n\nCena 60 B FL ', '473829105 Donald Trump 70 E\n', '098743215 Bernie Sanders 92 A AR']
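Each chunk in that list is still one string with stray newlines in it, so a second step is needed to reach the individual fields. A minimal sketch, assuming the chunks look exactly like the list printed above and that every record has one first name and one last name; the `people` list and its keys are hypothetical names.

```python
# The chunks, copied from the output printed above
results = ['123456789 John Doe 90 A New York City ',
           '987654321\nJane Doe 70 B CAL ',
           '432167895 John\n\nCena 60 B FL ',
           '473829105 Donald Trump 70 E\n',
           '098743215 Bernie Sanders 92 A AR']

people = []
for chunk in results:
    # split() with no arguments swallows spaces and newlines alike;
    # the starred name collects whatever is left as the optional location
    id_number, first, last, score, letter, *location = chunk.split()
    people.append({
        'id': id_number,
        'fname': first,
        'lname': last,
        'score': int(score),
        'letter': letter,
        'location': ' '.join(location),
    })

print(people[2]['fname'], people[2]['lname'], people[2]['letter'])  # John Cena B
```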

Source

2017-04-19 21:31:32 davidejones
