2017-06-15 61 views
0

我正在清除“PERCENTAGE_AFFECTED”熊猫数据框的列。它包含整数范围(例如:“70-80”,“70和80”,“65至70”)。检查熊猫数据库中的字符串是否包含子字符串并删除

我想创建一个函数来清理所有这些以创建整数平均值。

这个作品>>>

def clean_split_range(row): 
# Initial value contains the current value for the PERCENTAGE AFFECTED column 
initial_perc = str(row['PERCENTAGE_AFFECTED']) 
chars = '<>!,?":;() ' 

#Remove chars in initial value 
if any(c in chars for c in initial_perc): 
    split_range =[] 
    cleanWord = "" 
    for char in initial_perc:    
     if char in chars: 
      char = "" 
     cleanWord += char 
    split_range.append(cleanWord) 
    initial_perc = ''.join(split_range) 



#Split initial_perc into two elements if "-" is found 
split_range = initial_perc.split('-') 
# If a "-" is found, split_date will contain a list with two items 
if len(split_range) > 1:   
    try: 
     final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range)))/(len(split_range))) 
    except ValueError: 
     split_range = split_range[0].split('+') 
     final_perc = split_range[0]    
    finally: 
     if str(final_perc).isalpha(): 
      final_perc = 0 

elif initial_perc.find('and') != -1: 
    split_other = initial_perc.split('and') 
    if len(split_other) > 1: 
     try: 
      final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other)))/(len(split_other))) 
     except ValueError: 
      split_other = split_other[0].split('+') 
      final_perc = split_other[0] 
     finally: 
      if str(final_perc).isalpha(): 
       final_perc = 0 

elif initial_perc.find('to') != -1: 
    split_other = initial_perc.split('to') 
    if len(split_other) > 1: 
     try: 
      final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other)))/(len(split_other))) 
     except ValueError: 
      split_other = split_other[0].split('+') 
      final_perc = split_other[0] 
     finally: 
      if str(final_perc).isalpha(): 
       final_perc = 0 



elif initial_perc.find('±') != -1: 
    split_other = initial_perc.split('±') 
    final_perc = split_other[0] 

elif initial_perc.startswith('over'): 
    split_other = initial_perc.split('over') 
    final_perc = split_other[1]  

elif initial_perc.find('around') != -1: 
    split_other = initial_perc.split('around') 
    final_perc = split_other[1] 



elif initial_perc.isalpha(): 
    final_perc = 0 

# If no "-" is found, split_date will just contain 1 item, the initial_date 
else: 
    final_perc = initial_perc 

return final_perc 

但是: 我试图简化这一因此,如果条目包含“ - ”,“和”,“到”串。我创建了我希望通过拆分和删除子(split_list)的列表:

def new_clean_split_range(row): 
# Initial value contains the current value for the PERCENTAGE AFFECTED column 
initial_perc = str(row['PERCENTAGE_AFFECTED']) 
chars = '<>!,?":;() ' 
split_list = ['-','and'] 



# Split initial_perc into two elements if "-" is found  
if any(a in initial_perc for a in split_list): 
    for a in split_list: 
     split_range = initial_perc.split(a) 
     # If a "-" is found in split_list, initial_perc will contain a list with two items 
     if len(split_range) > 1:   
      try: 
       final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range)))/(len(split_range))) 
      except ValueError: 
       split_range = split_range[0].split('+') 
       final_perc = split_range[0]    
      finally: 
       if str(final_perc).isalpha(): 
        final_perc = 0 
     else: 
      final_perc = initial_perc 



#Remove chars in initial value 
if any(c in chars for c in initial_perc): 
    split_range =[] 
    cleanWord = "" 
    for char in initial_perc:    
     if char in chars: 
      char = "" 
     cleanWord += char 
    split_range.append(cleanWord) 
    initial_perc = ''.join(split_range) 
    split_range = ''  



elif initial_perc.find('±') != -1: 
    split_other = initial_perc.split('±') 
    final_perc = split_other[0] 

elif initial_perc.startswith('over'): 
    split_other = initial_perc.split('over') 
    final_perc = split_other[1]  

elif initial_perc.find('around') != -1: 
    split_other = initial_perc.split('around') 
    final_perc = split_other[1] 









elif initial_perc.isalpha(): 
    final_perc = 0 

# If no "-" is found, split_date will just contain 1 item, the initial_date 
else: 
    final_perc = initial_perc 

return final_perc 

任何帮助将是巨大的:)

+0

请提供的“initial_perc”和所有的输入和预期输出(你mantioned只是符合) – DexJ

+0

不知道如何为你连接,但它包含整数,范围如: “70-80”, “70和80“, ”65到70“,例如: ”<1“, ”12.2 + -5.2“, “超过95”, “大约50” 预期的输出仅仅是适合的整数的估计值。 “12.2±5.2”可以是12.2; “超过95”可以简单地是95 –

+0

那么我会建议另一种解决方案,然后你的?因为它有点复杂和毛病 – DexJ

回答

0

我会建议使用正则表达式。

检查了这一点。

import re 
results = re.findall(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)", x).pop() #x is input 
print results 
#results will be tuple and you can handle it easily. 

与follwoing输入和输出,

输入
'70 .5894-80.9894'
'70和85' ,
'65到70' 选中,
'72 <> 75'

输出
('70 0.5894' ,'80 0.9894 ')
(' 70' , '85')
( '65', '70')
( '72', '75')

+0

那么如何避免类型错误? 我可以做一个列表理解/ for循环迭代这个正则表达式方法通过数据框列? –

+0

你的意思是我的类型错误,我没有得到它?是的,你可以使用for循环这个正则表达式方法 – DexJ

相关问题