2017-02-28 53 views
1

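Select specific rows and cells in a text file and put them into a data frame: Python or R

Either Python or R would work well for this, but could someone show me how to select the "Basic Stats" lines like the ones below? I want to put this information, together with the ROI name, into a pandas DataFrame or a data table in R.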

ROI: mrc_ranch_house [Red] 195 points 

Basic Stats  Min  Max   Mean  Stdev 
    Band 1 -20.208261 6.025762 -8.866403 5.289712 

Histogram   DN  Npts Total Percent  Acc Pct 
Band 1  -20.208261  1  1 0.5128  0.5128 
Bin=0.10287 -20.105383  0  1 0.0000  0.5128 
      -20.002504  1  2 0.5128  1.0256 
      -19.899626  0  2 0.0000  1.0256 
      -19.796747  0  2 0.0000  1.0256 
      -19.693869  0  2 0.0000  1.0256 
      -19.590990  0  2 0.0000  1.0256 
      -19.488112  0  2 0.0000  1.0256 

Stats for ROI: river_1 [Blue] 90 points      
Basic Stats  Min  Max   Mean  Stdev   
    Band 1 -20.187374 -6.694543 -12.227586 2.66464  

Histogram   DN  Npts Total Percent  Acc Pct  
Band 1  -20.187374 1 1 1.1111 1.1111 
Bin=0.05291 -20.134461 0 1 0 1.1111 
     -20.081548 0 1 0 1.1111 
     -20.028635 0 1 0 1.1111 
     -19.975722 0 1 0 1.1111 


Stats for ROI: river_2 [Blue] 96 points     
Basic Stats  Min  Max   Mean  Stdev  
    Band 1 -18.365091 -5.820825 -13.164463 2.851231  

Histogram    DN  Npts Total Percent  Acc Pct 
Band 1   -18.365091 1 1 1.0417 1.0417 
Bin=0.04919 -18.315898 0 1 0 1.0417 
     -18.266705 0 1 0 1.0417 
     -18.217512 0 1 0 1.0417 

The final output should look like this:

ROI                    Min        Max       Mean    Stdev
mrc_ranch_house -20.208261   6.025762  -8.866403 5.289712
river_1         -20.187374  -6.694543 -12.227586  2.66464
river_2         -18.365091  -5.820825 -13.164463 2.851231

... and so on.

Thanks!

+0

You can use 'readLines' to read the data in line by line. Parsing this untidy output will also take some regular expressions. –

+0

Is 'readLines' available in Python? Thanks for the edit, by the way. – JAG2024

+0

It is a base R function. See also 'gsub'. –
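For readers coming from Python, the readLines-plus-regex idea from the comments can be sketched roughly as follows. This is only a minimal sketch: the `sample` string is a hypothetical stand-in for rois_all.txt with simplified whitespace, and the pattern `ROI:\s*(\S+)` is my own choice, not something taken from the thread.

```python
import re

# Hypothetical excerpt standing in for rois_all.txt (whitespace simplified).
sample = """\
ROI: mrc_ranch_house [Red] 195 points

Basic Stats  Min  Max  Mean  Stdev
    Band 1 -20.208261 6.025762 -8.866403 5.289712

Stats for ROI: river_1 [Blue] 90 points
Basic Stats  Min  Max  Mean  Stdev
    Band 1 -20.187374 -6.694543 -12.227586 2.66464
"""

# str.splitlines() plays the role of R's readLines(); for a real file,
# open(path).read().splitlines() would give the same kind of list.
lines = sample.splitlines()

# Similar to grep() in R: positions of the lines that hold an ROI header.
roi_idx = [i for i, line in enumerate(lines) if "ROI:" in line]

# Similar to sub()/gsub() with a capture group: the token after "ROI:".
roi_names = [re.search(r"ROI:\s*(\S+)", lines[i]).group(1) for i in roi_idx]

print(roi_idx)
print(roi_names)
```

The list of header positions is what the R answers below build with `grep`/`grepl`; everything else is located relative to those positions.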

Answers

4

With R, use:

# read the text file 
txt <- readLines('https://dl.dropboxusercontent.com/u/45095175/rois_all.txt') 

# create an index for the lines that are needed 
ti <- rep(which(grepl('ROI:', txt)), each = 3) + 1:3 
# create a grouping vector of the same length (one group per ROI, 33 in this file) 
grp <- rep(seq_len(sum(grepl('ROI:', txt))), each = 3) 

# filter the text with the index 'ti' 
# and split into a list with grouping variable 'grp' 
lst <- split(txt[ti], grp) 
# loop over the list and read the text parts in as dataframes 
lst <- lapply(lst, function(x) read.table(text = x, sep = '\t', header = TRUE, 
              blank.lines.skip = TRUE)) 

# bind the dataframes in the list together in one data.frame 
DF <- do.call(rbind, lst) 
# change the name of the first column 
names(DF)[1] <- 'ROI' 

# get the correct ROI's for the ROI-column 
DF$ROI <- sub('.*: (\\w+).*$', '\\1', txt[grepl('ROI: ', txt)]) 

which gives:

> DF 
                ROI        Min        Max       Mean    Stdev
 1  mrc_ranch_house -20.208261   6.025762  -8.866403 5.289712
 2          river_1 -20.187374  -6.694543 -12.227586 2.664640
 3          river_2 -18.365091  -5.820825 -13.164463 2.851231
 4          river_3 -18.291010  -4.583666 -12.092995 3.479293
 5          river_4 -17.074295  -4.926921  -9.970926 2.897855
 6          river_5 -16.849176  -8.622208 -12.387085 2.168462
 7 adjacent_river_2 -18.987597  -7.957749 -13.392523 1.962263
 8 adjacent_river_3 -19.426531  -8.640042 -13.467425 1.888105
 9 adjacent_river_4 -20.452566  -6.830183 -12.833450 2.124761
10           bcs_1_ -23.612043  -8.221417 -16.032305 2.080695
11           bcs_2_ -24.018219 -10.648975 -16.814048 1.948863
12           bcs_3_ -23.011086  -9.106754 -15.404174 1.867498
13           red_1_ -22.313442  -7.839107 -14.768196 2.134152
14           red_2_ -22.551537  -7.236300 -14.613618 2.204253
15           red_3_ -22.057703  -7.746992 -14.483161 2.123497
16            bcs_4 -22.705107  -8.972753 -15.201623 1.817122
17            bcs_5 -24.109459 -10.113716 -15.776537 1.849163
18         glade_1_ -19.913187  -6.189866 -12.695884 3.303929
19         glade_2_ -19.812855  -4.672865 -11.995191 4.840168
20         glade_3_ -10.078033  -2.828722  -5.877417 1.941401
21           mwea_b -13.979379  -4.977155 -11.392434 2.019037
22             kaga -13.114172  -8.889531 -10.649324 1.290551
23             huku -14.206743  -7.853305 -10.608210 1.441250
24             ruai -18.643108 -12.645180 -14.54.224183
25          tumaini -19.543234 -13.164941 -15.899968 1.812876
26           nkando -19.973492  -7.040238 -11.716987 2.617544
27           jikaze -16.408030  -9.001065 -12.323898 1.942196
28        miarage_b -15.126486  -6.661448 -10.391111 1.764279
29           batian -15.269146  -9.603316 -11.962470 1.168859
30         gitaraga -17.037708  -7.495215 -10.886802 2.561877
31       wiumiririe  -9.578024  -6.225223  -7.688715 1.059796
32           chumvi -14.883148 -10.327570 -12.819469 1.231636
33 next_to_airstrip -17.242777  -5.207252 -10.601750 1.987712

The last part (from binding the list into one data frame onward) can also be done with the rbindlist function from the data.table package:

# load the 'data.table' package for the 'rbindlist' function 
library(data.table) 
# bind the dataframes in the list together to a data.table (enhanced version of a data.frame) 
DT <- rbindlist(lst) 
# change the name of the first column 
setnames(DT, 1, 'ROI') 

# get the correct ROI's for the ROI-column 
DT[, ROI := sub('.*: (\\w+).*$', '\\1', txt[grepl('ROI: ', txt)])] 
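The `sub` pattern that fills the ROI column can be checked in isolation. Below is a small Python sketch of the same regular expression; the `bcs_1_` header line (colour and point count) and the hyphenated `north-field` name are made up for illustration and do not come from the file.

```python
import re

# Same pattern as in sub('.*: (\\w+).*$', '\\1', ...). R needs doubled
# backslashes inside its string literal; a Python raw string does not.
pattern = re.compile(r".*: (\w+).*$")

headers = [
    "ROI: mrc_ranch_house [Red] 195 points",
    "Stats for ROI: river_1 [Blue] 90 points",
    "Stats for ROI: bcs_1_ [Green] 120 points",  # invented colour and count
]

# Replace the whole line by the first capture group, i.e. the ROI name.
names = [pattern.sub(r"\1", h) for h in headers]
print(names)

# \w matches letters, digits and underscores only, so a hypothetical
# name containing a hyphen would be cut off at the hyphen.
truncated = pattern.sub(r"\1", "Stats for ROI: north-field [Red] 10 points")
print(truncated)
```

Since `\w` includes the underscore, names such as `bcs_1_` come through whole. If ROI names could contain other characters, a capture group like `(\S+)` may be a safer choice.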
+0

Gah, you beat me to it, but due to https://twitter.com/romunov/status/836622674944266240 –

+0

@RomanLuštrik Sorry (and improved it further) – Jaap

+0

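Hey @Jaap! I was wondering if you could help me with this simple problem: I now have a text file with multiple "Band" rows, but I haven't been able to get your code to work. Could you check it out here, please: http://stackoverflow.com/questions/42614688/broken-r-code-to-select-specific-rows-and-cells-in-text-file-and-put-into-data-F?noredirect=1#comment72359303_42614688 – JAG2024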

1

I haven't found a single-import solution yet, because every row in data is called Band 1, but this is a good start.

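import pandas as pd 

# on_bad_lines='warn' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv(r'rois_all.txt', delimiter='\t', on_bad_lines='warn', skiprows=[0, 1]) 
data = data.dropna() 
# .loc replaces the .ix indexer, which was removed in pandas 1.0
data = data.loc[data.loc[:, 'Basic Stats'] != 'Basic Stats', :] 
data 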

Output

  Basic Stats        Min       Max       Mean     Stdev
0      Band 1 -20.208261  6.025762  -8.866403  5.289712
3      Band 1 -20.187374 -6.694543 -12.227586  2.664640
6      Band 1 -18.365091 -5.820825 -13.164463  2.851231

I have now extracted all the "Basic Stats" names, as in the example below:

names = pd.read_csv(r'rois_all.txt', delimiter='\t', on_bad_lines='warn', skiprows=[0, 1]) 

names = names.loc[names.loc[:, 'Basic Stats'] != '  Band 1'] 
names = names.loc[names.loc[:, 'Basic Stats'] != 'Basic Stats'] 
# raw string for the regex; expand=False returns a Series rather than a DataFrame
names = names.loc[:, 'Basic Stats'].str.extract(r'Stats for ROI: (.*) \[.*\] [0-9]*', expand=False) 
# the first ROI header sits in the skipped rows, so its name is added by hand
names.loc[0] = 'mrc_ranch_house' 
names = names.sort_index() 
names = names.reset_index(drop=True) 

which looks like this:

0    mrc_ranch_house
1            river_1
2            river_2

Joining data and names like this:

data = data.reset_index(drop=True)  # row labels 0, 1, 2, ... so they line up with names
data.loc[:, 'Basic Stats'] = names 

gives the required result:

       Basic Stats        Min       Max       Mean     Stdev
0  mrc_ranch_house -20.208261  6.025762  -8.866403  5.289712
1          river_1 -20.187374 -6.694543 -12.227586  2.664640
2          river_2 -18.365091 -5.820825 -13.164463  2.851231
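As an alternative to reading the file twice, the names and the numbers can be collected in one pass over the lines. The following is only a sketch under some assumptions: `sample` is a hypothetical excerpt with simplified whitespace, `parse_basic_stats` is a name made up here, and each ROI is assumed to have exactly one band row directly after its "Basic Stats" header.

```python
# Hypothetical excerpt standing in for rois_all.txt (whitespace simplified).
sample = """\
ROI: mrc_ranch_house [Red] 195 points

Basic Stats  Min  Max  Mean  Stdev
    Band 1 -20.208261 6.025762 -8.866403 5.289712

Histogram  DN  Npts  Total  Percent  Acc Pct
Band 1 -20.208261 1 1 0.5128 0.5128
Bin=0.10287 -20.105383 0 1 0.0000 0.5128

Stats for ROI: river_1 [Blue] 90 points
Basic Stats  Min  Max  Mean  Stdev
    Band 1 -20.187374 -6.694543 -12.227586 2.66464
"""


def parse_basic_stats(text):
    """Return one dict per ROI holding its Basic Stats (single-band input)."""
    rows = []
    roi = None            # name of the ROI block currently being read
    expect_stats = False  # True right after a "Basic Stats" header line
    for line in text.splitlines():
        stripped = line.strip()
        if "ROI:" in line:
            # The ROI name is the first token after "ROI:".
            roi = line.split("ROI:", 1)[1].split()[0]
        elif stripped.startswith("Basic Stats"):
            expect_stats = True
        elif expect_stats and stripped:
            # "Band 1 <min> <max> <mean> <stdev>": keep the last four tokens.
            vmin, vmax, mean, stdev = (float(v) for v in stripped.split()[-4:])
            rows.append({"ROI": roi, "Min": vmin, "Max": vmax,
                         "Mean": mean, "Stdev": stdev})
            expect_stats = False  # histogram rows that follow are ignored
    return rows


rows = parse_basic_stats(sample)
for row in rows:
    print(row)
```

If pandas is installed, `pd.DataFrame(rows)` should turn this list of dicts into a frame with the columns ROI, Min, Max, Mean and Stdev.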
+0

Could you add the part on how to select the ROI names and put that information into a column? – JAG2024

+0

This is not a complete solution. –

+0

@RomanLuštrik I said as much in the answer. I had part of the solution, and I thought posting it would be more useful to JAG2024 than posting nothing. – josh

2

Here is another ugly solution. The result is a good old regular data.frame.

rois_all <- file("https://dl.dropboxusercontent.com/u/45095175/rois_all.txt") 

xy <- readLines(rois_all) 

# find lines where ROI starts 
roin <- grep(pattern = "ROI: ", x = xy) 
roi <- xy[roin] 
roi <- gsub(".*ROI: (\\w+).*$", "\\1", roi) 

# find lines with stats 
stats <- grep(pattern = "Basic Stats", x = xy) 

# trim whitespace and collect Col 
cn <- trimws(sapply(strsplit(xy[stats][1], "\t"), "[", 2:5, simplify = FALSE)[[1]]) 

# split the stat line by \t and extract only elements 2 to 5. merge row-wise 
out <- do.call(rbind, sapply(strsplit(xy[stats + 1], "\t"), "[", 2:5, simplify = FALSE)) 
out <- as.data.frame(apply(out, MARGIN = 2, as.numeric)) 

# add ROI column extracted earlier 
out <- cbind(roi, out) 

colnames(out) <- c("ROI", cn) 

out 

                ROI        Min        Max       Mean    Stdev
 1  mrc_ranch_house -20.208261   6.025762  -8.866403 5.289712
 2          river_1 -20.187374  -6.694543 -12.227586 2.664640
 3          river_2 -18.365091  -5.820825 -13.164463 2.851231
 4          river_3 -18.291010  -4.583666 -12.092995 3.479293
 5          river_4 -17.074295  -4.926921  -9.970926 2.897855
 6          river_5 -16.849176  -8.622208 -12.387085 2.168462
 7 adjacent_river_2 -18.987597  -7.957749 -13.392523 1.962263
 8 adjacent_river_3 -19.426531  -8.640042 -13.467425 1.888105
 9 adjacent_river_4 -20.452566  -6.830183 -12.833450 2.124761
10           bcs_1_ -23.612043  -8.221417 -16.032305 2.080695
11           bcs_2_ -24.018219 -10.648975 -16.814048 1.948863
12           bcs_3_ -23.011086  -9.106754 -15.404174 1.867498
13           red_1_ -22.313442  -7.839107 -14.768196 2.134152
14           red_2_ -22.551537  -7.236300 -14.613618 2.204253
15           red_3_ -22.057703  -7.746992 -14.483161 2.123497
16            bcs_4 -22.705107  -8.972753 -15.201623 1.817122
17            bcs_5 -24.109459 -10.113716 -15.776537 1.849163
18         glade_1_ -19.913187  -6.189866 -12.695884 3.303929
19         glade_2_ -19.812855  -4.672865 -11.995191 4.840168
20         glade_3_ -10.078033  -2.828722  -5.877417 1.941401
21           mwea_b -13.979379  -4.977155 -11.392434 2.019037
22             kaga -13.114172  -8.889531 -10.649324 1.290551
23             huku -14.206743  -7.853305 -10.608210 1.441250
24             ruai -18.643108 -12.645180 -14.54.224183
25          tumaini -19.543234 -13.164941 -15.899968 1.812876
26           nkando -19.973492  -7.040238 -11.716987 2.617544
27           jikaze -16.408030  -9.001065 -12.323898 1.942196
28        miarage_b -15.126486  -6.661448 -10.391111 1.764279
29           batian -15.269146  -9.603316 -11.962470 1.168859
30         gitaraga -17.037708  -7.495215 -10.886802 2.561877
31       wiumiririe  -9.578024  -6.225223  -7.688715 1.059796
32           chumvi -14.883148 -10.327570 -12.819469 1.231636
33 next_to_airstrip -17.242777  -5.207252 -10.601750 1.987712
+0

Thanks @RomanLuštrik. This code is very clear. I now have a text file with multiple bands and tried to make it work, but so far it fails. Could you take a look, please: http://stackoverflow.com/questions/42614688/broken-r-code-to-select-specific-rows-and-cells-in-text-file-and-put-into-data-F?noredirect=1#comment72359303_42614688 – JAG2024
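On the multi-band case raised in the comments: the multi-band file is not shown in this thread, so the layout below is an assumption, namely that each "Basic Stats" header is followed by one "Band N" row per band and that the block ends at the first blank line. Under that assumption, a rough Python sketch (with a hypothetical `parse_multiband` helper and invented Band 2 values) could collect one row per ROI and band:

```python
# Hypothetical multi-band excerpt. The Band 2 values are invented for
# illustration; only the Band 1 rows are taken from the question.
sample = """\
Stats for ROI: river_1 [Blue] 90 points
Basic Stats  Min  Max  Mean  Stdev
    Band 1 -20.187374 -6.694543 -12.227586 2.66464
    Band 2 -15.0 -3.0 -9.5 1.25

Histogram  DN  Npts  Total  Percent  Acc Pct
Band 1 -20.187374 1 1 1.1111 1.1111

Stats for ROI: river_2 [Blue] 96 points
Basic Stats  Min  Max  Mean  Stdev
    Band 1 -18.365091 -5.820825 -13.164463 2.851231
    Band 2 -14.0 -2.5 -8.75 1.5
"""


def parse_multiband(text):
    """Return one dict per (ROI, band) pair found under 'Basic Stats'."""
    rows = []
    roi = None
    in_stats = False  # True while reading the band rows of a stats block
    for line in text.splitlines():
        stripped = line.strip()
        if "ROI:" in line:
            roi = line.split("ROI:", 1)[1].split()[0]
            in_stats = False
        elif stripped.startswith("Basic Stats"):
            in_stats = True
        elif in_stats and stripped.startswith("Band"):
            tokens = stripped.split()  # ["Band", "2", min, max, mean, stdev]
            vmin, vmax, mean, stdev = (float(v) for v in tokens[-4:])
            rows.append({"ROI": roi, "Band": int(tokens[1]), "Min": vmin,
                         "Max": vmax, "Mean": mean, "Stdev": stdev})
        else:
            # A blank line or the Histogram header ends the stats block,
            # so histogram rows that also start with "Band" are skipped.
            in_stats = False
    return rows


rows = parse_multiband(sample)
for row in rows:
    print(row)
```

The extra `Band` key keeps the rows apart when several bands share one ROI name. If the real file uses a different layout, the block-ending condition would need to change.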
