2017-08-14 64 views

Reading a text file into a desired DataFrame format

I have a txt file that looks like this:

Alabama[edit] 
    Auburn (Auburn University, Edward Via College of Osteopathic Medicine) 
    Birmingham (University of Alabama at Birmingham, Birmingham School of 
Alaska[edit] 
    Anchorage[21] (University of Alaska Anchorage) 
    Fairbanks (University of Alaska Fairbanks)[16] 

I want to read the txt file into a DataFrame that looks like this:

state  county 
Alabama Auburn 
Alabama Birmingham 
Alaska Anchorage 
Alaska Fairbanks 

What I have so far:

import re
import pandas as pd

university_towns = open('university_towns.txt', 'r')
df_university_towns = pd.DataFrame(columns=['State', 'RegionName'])
# compile the patterns once, outside the loop
state_pattern = re.compile(r'\[edit\]')
county_pattern = re.compile(r'\(')
# loop over each line of the file object
# determine if each line is state or county.
# if the line has [edit], it's state
for line in university_towns:
    state_pattern_m = state_pattern.search(line)
    county_pattern_m = county_pattern.search(line)
    if state_pattern_m:
        # extract everything before [edit]
        print(state_pattern_m.start())
        end_position = state_pattern_m.start()
        print(line[0:end_position])
        state_name = line[0:end_position]
    if county_pattern_m:
        # extract everything before (
        pass

This code only gives me this:

State County 
Alabama Auburn 
     Birmingham 
. 
. 
. 

Answer


This should do it:

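key = None 

with open('university_towns.txt') as t:
    for line in t:
        # drop the leading indentation and the trailing newline
        line = line.strip()
        if '[edit]' in line:
            # a state line: remember the state name for the rows that follow
            key = line.replace('[edit]', '')
            continue
        if key:
            # Use a regex to extract what you need
            print(key, line.split(' ')[0])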

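I don't know exactly what your data looks like, so adjust the regex to remove the [] markers from the title (I'm guessing it is a title), and possibly use a regex in place of the '[edit]' in line check.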
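
For completeness, here is a minimal sketch of how the loop above could collect rows and build the DataFrame instead of printing. It assumes that every state line contains [edit], that the town name is everything before the first "(", and that bracketed markers such as [21] are footnotes to drop. The helper name parse_university_towns and the in-memory sample (copied from the question) are illustrative only, and the column names State and RegionName are taken from the question's code.

```python
import io
import re

import pandas as pd

# Matches any bracketed marker such as [21] or [16] (assumed to be footnotes).
BRACKETED = re.compile(r'\[[^\]]*\]')


def parse_university_towns(lines):
    """Return a DataFrame with one (State, RegionName) row per town line."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if '[edit]' in line:
            # State line: keep the text before the [edit] marker.
            state = line.split('[edit]')[0].strip()
            continue
        if state is None:
            # Ignore anything that appears before the first state line.
            continue
        # Town line: keep the text before "(" and drop bracketed markers.
        town = BRACKETED.sub('', line.split('(')[0]).strip()
        rows.append((state, town))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])


# Sample data copied from the question; a real run would pass the open file.
sample = io.StringIO(
    "Alabama[edit]\n"
    "    Auburn (Auburn University, Edward Via College of Osteopathic Medicine)\n"
    "    Birmingham (University of Alabama at Birmingham, Birmingham School of\n"
    "Alaska[edit]\n"
    "    Anchorage[21] (University of Alaska Anchorage)\n"
    "    Fairbanks (University of Alaska Fairbanks)[16]\n"
)
df = parse_university_towns(sample)
print(df)
```

With the real file, the same function could be called as parse_university_towns(open('university_towns.txt')). Collecting tuples in a list and creating the DataFrame once at the end is usually simpler than appending to an empty DataFrame row by row.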