我想导入以下文件与两个重复的数据部分来提取。第一组开始于未使用的标题(第5行)和以“ES”第5行开头的实际标题)。数据的下一部分以未使用的标题(第13行)和以“LU”(第14行)开始的实际标题开始,并且具有更多变量名称。这些文件中有很多,每个文件中都有不同数量的长度各不相同的EU和LS部分。我需要提取LS和EU数据以分离数据帧。不幸的是,这些文件与传感器阵列“保持一致”,我无法改变这种情况,并且宁愿不在Excel中做所有这些事情,但可能不得不这样做。在真实文件中,每个EU和LS集可能有数百个这样的行。导入CSV多个范围和标题
我试图修改下面的代码来索引EU部分,然后将其提取出来并清理干净,然后在LS部分完成相同的操作,但我甚至没有得到这个工作。部分原因是欧盟在两个标题行中。我确实看到使用perl脚本的代码,但从未使用过该语言。
lns = readLines("lake1.txt")
idx = grepl("EU", lns)
df = read.table(text=lns[!idx])
wd = diff(c(which(idx), length(idx) + 1)) - 1
df$label = rep(lns[idx], wd)
我是不是一定要加一个CSV文件示例中的最好方法,但在这里它是...
Garbage Text 1,,,,,,,,
Garbage Text 2,,,,,,,,
Garbage Text 3,,,,,,,,
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),ES_DARK ('uW/cm^2/nm'),ES_DARK ('uW/cm^2/nm'),ES_DARK ('uW/cm^2/nm'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,0.35441789,-0.00060208,0.10290995,87,2017015,10:42:39,1
0.032,0,-0.36023974,-0.22242269,-0.09639,109,2017015,10:42:40,10
0.032,0,0.07552711,0.01524224,-0.16756855,91,2017015,10:42:48,41
,,,,,,,,11304
,,,,,,,,11312
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,0.02288441,0.02891912,0.03595322,53,2017015,10:42:38,2
0.032,0,-0.00014323,0.00024047,0.00001585,212,2017015,10:42:38,6
0.032,0,0.00114258,0.00091736,-0.0000495,16,2017015,10:42:39,9
0.032,0,0.00020744,0.0004186,0.00027721,118,2017015,10:42:40,16
,,,,,,,,11310
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),ES ('uW/cm^2/nm'),ES ('uW/cm^2/nm'),ES ('uW/cm^2/nm'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,56.7600789,59.43147464,62.83968564,186,2017015,10:42:38,3
0.032,0,56.27202003,59.52654061,62.86815706,29,2017015,10:42:38,4
,,,,,,,,11309
,,,,,,,,11311
,,,,,,,,
INTTIME ('sec'),SAMPLE ('sec'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),LU ('uW/cm^2/nm/sr'),CHECK (''),DATETAG (NONE),TIMETAG2 (NONE),POSFRAME (NONE)
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,-0.00011611,-0.00039544,-0.00014584,3,2017015,10:42:42,20
0.032,0,-0.00032394,-0.00020563,-0.00020383,229,2017015,10:42:46,39
这是两个dataframes应该是什么样子到底:
数据帧1
ES,DELAY,344.83,348.23,351.62,SUM,NONE,NONE,COUNTS
0.032,0,0.35441789,-0.00060208,0.10290995,87,2017015,10:42:39,1
0.032,0,-0.36023974,-0.22242269,-0.09639,109,2017015,10:42:40,10
0.032,0,0.07552711,0.01524224,-0.16756855,91,2017015,10:42:48,41
0.032,0,56.7600789,59.43147464,62.83968564,186,2017015,10:42:38,3
0.032,0,56.27202003,59.52654061,62.86815706,29,2017015,10:42:38,4
数据帧2
LU,DELAY,344.37,347.75,351.13,SUM,NONE,NONE,COUNTS
0.032,0,0.02288441,0.02891912,0.03595322,53,2017015,10:42:38,2
0.032,0,-0.00014323,0.00024047,0.00001585,212,2017015,10:42:38,6
0.032,0,0.00114258,0.00091736,-0.0000495,16,2017015,10:42:39,9
0.032,0,0.00020744,0.0004186,0.00027721,118,2017015,10:42:40,16
0.032,0,-0.00011611,-0.00039544,-0.00014584,3,2017015,10:42:42,20
0.032,0,-0.00032394,-0.00020563,-0.00020383,229,2017015,10:42:46,39
我不明白你是怎么来的例子输出。为什么输出“ES”文件中不包含第9-11行,以及“0.512”值和最后一行来自输出“LU”文件? – austensen
我缩短了输出,所以不会太长。对不起,我可以添加它,但想限制帖子的使用时间。 –
这很好,只是为了确保我明白。此外,你是否打算排除行10-11(',,,,,,,, 11309') – austensen