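Python: extract substrings from a field by symbol location
I have been trying to clean up a field in a CSV file. The field contains a mix of numbers and characters; I read it into a pandas DataFrame and convert it to a string.
The goal is to extract the following variables from the long string: stopId, stopCode (there may be several per record), rte, and routeId. My attempts so far are shown further down.
After extracting the variables listed above, I need to merge the variables/codes with the location data for each stop/route/rte, which lives in another file.
Sample records for the field: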
- 'Web Log: Page generated for query [cid=SM&rte=50183&dir=S&day=5761&dayid=5761&fst=0%2C&tst=0%2C]'
- 'Web Log: Page generated for query [_=1407744540393&agencyId=SM&stopCode=361096&rte=7878%7eBus%7e251&dir=W]'
- 'Web Log: Page generated for query [_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452&stopCode=55489&rte=43783%7eBus%7e88&dir=S]'
The solutions I have tried are below, but I'm stuck! Comments and suggestions are appreciated.
    # Idea 1: Split the field above on '&' into a list, in a loop. This is useful, but I'll
    # have to write additional code to pull out the relevant variables.
    i = 0
    for t in data['EVENT_DESCRIPTION']:
        s = t.split('&')
        data['STOPS'][i] = [x for x in s if "Web Log" not in x]
        i += 1
    # Idea 1 next-step help: how do I pull out the necessary variables from the list in data['STOPS']?
    # Idea 2: Loop through the field and use regex matches to find the start and end
    # of each variable name. The output for stopcode_pl (and the other *_pl variables)
    # is a list of (start, end) tuples, one per occurrence in the string.
    import re
    for i in data['EVENT_DESCRIPTION']:
        stopcode_pl = [(a.start(), a.end()) for a in re.finditer('stopCode=', i)]
        stopid_pl = [(a.start(), a.end()) for a in re.finditer('stopId=', i)]
        rte_pl = [(a.start(), a.end()) for a in re.finditer('rte=', i)]
        routeid_pl = [(a.start(), a.end()) for a in re.finditer('routeId=', i)]
    # Idea 2 next-step help: how do I use the string locations of the variable names to
    # pull out the value of each variable? Is there a trick to grab the characters
    # between the end of the variable name (i.e. after its '=') and the next '&'?
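
For reference, here is a minimal sketch of two ways the "grab everything between '=' and the next '&'" step could be done. It assumes each record holds `key=value` pairs joined by `&` with no spaces, wrapped in square brackets, as the `'stopCode='` patterns in the attempt above imply. The helper names `extract_by_regex` and `extract_by_parsing`, the `WANTED` tuple, and the `sample` string are made up for illustration.

```python
import re
from urllib.parse import parse_qs

# Keys of interest, taken from the question; adjust as needed.
WANTED = ("stopId", "stopCode", "rte", "routeId")

def extract_by_regex(record, key):
    """Return every value that follows 'key=' up to the next '&' or ']'."""
    # The lookbehind stops a short key such as 'rte' from matching
    # at the tail end of some longer key name.
    pattern = r"(?<![A-Za-z])" + re.escape(key) + r"=([^&\]]*)"
    return re.findall(pattern, record)

def extract_by_parsing(record):
    """Parse the bracketed query string into {key: [values]} for WANTED keys."""
    start = record.find("[")
    end = record.rfind("]")
    if start == -1 or end == -1:
        return {}  # assumption: records without brackets carry no query
    query = record[start + 1:end]
    parsed = parse_qs(query)  # repeated keys are collected into one list
    return {k: parsed[k] for k in WANTED if k in parsed}

# Hypothetical record shaped like the samples in the question.
sample = ("Web Log: Page generated for query "
          "[_=1407744956001&agencyId=AC&stopCode=55451&stopCode=55452"
          "&stopCode=55489&rte=43783%7eBus%7e88&dir=S]")

stop_codes = extract_by_regex(sample, "stopCode")
fields = extract_by_parsing(sample)
```

The regex version returns the raw text, so `rte` keeps its `%7e` escapes, while `parse_qs` percent-decodes values (`%7e` becomes `~`). If the field sits in a DataFrame column, something like `data['EVENT_DESCRIPTION'].apply(extract_by_parsing)` should yield one dictionary per row, which may be easier to merge with the location file afterwards.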