2017-06-13 119 views
0

Counting how many folders each folder contains in a complex folder structure?

Consider the following tree:

library(data.tree) 

acme <- Node$new("Acme Inc.") 
    accounting <- acme$AddChild("Accounting") 
     software <- accounting$AddChild("New Software") 
     standards <- accounting$AddChild("New Accounting Standards") 
    research <- acme$AddChild("Research") 
     newProductLine <- research$AddChild("New Product Line") 
     newLabs <- research$AddChild("New Labs") 
    it <- acme$AddChild("IT") 
     outsource <- it$AddChild("Outsource") 
     agile <- it$AddChild("Go agile") 
     goToR <- it$AddChild("Switch to R") 

I then want to compute the averageBranchingFactor:

averageBranchingFactor(acme) 

This yields 2.5.

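However, for various reasons, I would like to get all the individual branching factors, not just their average. For example, I need this to statistically compare two file structures and test for a significant difference in their average branching factors.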

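According to the data.tree manual, the AverageBranchingFactor() function does the following: "Calculate the average number of branches each non-leaf has". I therefore first tried the following: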

acme.df <- ToDataFrameTree(acme, "averageBranchingFactor") 
mean(acme.df$averageBranchingFactor[acme.df$averageBranchingFactor>0]) 

This yields 2.375, which then led me to try a simpler version:

mean(acme.df$averageBranchingFactor) 

This yields 0.8636364.

How do I get at the individual branching factors whose mean is 2.5?

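Ideally, I would like to create a data.frame that lists every folder, with a variable holding the branching factor of each folder. For example, say I have this very simple folder structure: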

top_level_folder 
    sub_folder_1 
    sub_folder_2 
     sub_folder_3 

An answer to this question would produce output that looks like this:

Folders            Subfolders (BranchingFactor)
top_level_folder   2
sub_folder_1       0
sub_folder_2       1
sub_folder_3       0

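I can easily generate the first column by calling list.dirs("/Users/username/Downloads/top_level/"), but I don't know how to generate the second column. Note that the second column is non-recursive, meaning that folders inside subfolders are not counted (i.e. top_level_folder contains only 2 subfolders, even though sub_folder_2 contains another folder, sub_folder_3).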

If you want to check whether your solution scales, download the Rails codebase from https://github.com/rails/rails/archive/master.zip and try it on Rails' more complex file structure.

Answers

1

You can simply loop over the folder structure and count the number of folders at each level (non-recursively):

dir.create("top_level_folder/sub_folder_2/sub_folder_3", recursive = TRUE) 
dir.create("top_level_folder/sub_folder_1") 


dirs <- list.dirs() 
branching_factor <- vector(length = length(dirs)) 
for (i in 1:length(dirs)) { 
    branching_factor[i] <- length(list.dirs(path = dirs[i], 
              full.names = FALSE, recursive = FALSE)) 
} 

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor) 
result[-1,] 

You can also use a shorter, more idiomatic, vectorized version of this code:

dirs <- list.dirs() 
branching_factor <- sapply(dirs, function(x) length(list.dirs(x, FALSE, FALSE))) 
result2 <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor, 
         row.names = NULL)[-1,] 

The result looks like this:

> head(result2[rev(order(result2[,2])),])
          Folders BranchingFactor
208      fixtures              24
122      fixtures              23
42       fixtures              18
440      core_ext              17
340 active_record              17
562         rails              16
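As a quick sanity check, the same idea can be tried on the toy structure from the question. This is only a sketch: it builds the folders inside a temporary directory (the `root` and `toy` names are made up for illustration) and uses `vapply` in place of `sapply` so that the result type is fixed.

```r
# Build the toy structure from the question inside a temporary directory.
# `root` is a hypothetical location chosen for this sketch.
root <- file.path(tempdir(), "top_level_folder")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "sub_folder_2", "sub_folder_3"), recursive = TRUE)
dir.create(file.path(root, "sub_folder_1"))

# list.dirs() is recursive by default and includes `root` itself.
dirs <- list.dirs(root)

# For each folder, count only its direct subfolders.
branching_factor <- vapply(
  dirs,
  function(d) length(list.dirs(d, full.names = FALSE, recursive = FALSE)),
  integer(1)
)

toy <- data.frame(Folders = basename(dirs),
                  BranchingFactor = branching_factor,
                  row.names = NULL)
print(toy)

# Average branching factor over the folders that have subfolders.
mean(toy$BranchingFactor[toy$BranchingFactor > 0])
```

Here top_level_folder should come out with 2 subfolders, sub_folder_2 with 1, and the other two with 0, matching the table the question asks for.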
+0

Applying your code to [https://github.com/rails/rails/archive/master.zip](https://github.com/rails/rails/archive/master.zip), 'result' is incorrect – parth

+0

The reason is: 'length(dir(path = dirs[i]))' also counts '.yml' and '.md' files – parth

+0

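You're right, thank you! See the edited version. It looks like the previous code (which used 'dir' instead of 'list.dirs' in the loop) counted all files and directories. – Gilles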

0

I list all folders recursively, then build a table of folder/subfolder pairs; from these I can count the number of subfolders per folder.

That misses the empty folders, though, so I bring them back in with a left join and then fill the NAs with zero.

library(magrittr)  # provides %>%

path <- getwd() 
all_folders <- path %>% list.dirs(full.names=TRUE,recursive=TRUE) %>% 
    data.frame(stringsAsFactors=FALSE) %>% setNames("Folders") 
# keep the last two components of each path: (parent, folder)
all_sub_folders <- all_folders$Folders %>% 
    strsplit("/") %>% 
    lapply(function(x){c(x[length(x)-1],x[length(x)])}) %>% 
    do.call(rbind,.) %>% 
    as.data.frame(stringsAsFactors=FALSE) %>% 
    setNames(c("ParentFolders","Folders")) 
# count how often each folder name appears as a parent
output <- all_sub_folders$ParentFolders %>% table %>% as.data.frame(stringsAsFactors=FALSE) %>% setNames(c("Folders","SubFolders")) 
# left join keeps folders without subfolders; their NA counts become 0
output <- merge(all_sub_folders,output,all.x = TRUE)[,c("Folders","SubFolders")] 
output$SubFolders[is.na(output$SubFolders)] <- 0 
output <- output[match(all_sub_folders$Folders,output$Folders),] 

head(output) 
#  Folders SubFolders 
# 2160 Rhome  126 
# 17 acepack   5 
# 856  help   1 
# 992  html   9 
# 1486 libs  124 
# 1130 i386   0 
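The code above matches folders by their last path component, so folders that share a name in different places (for example several `help` or `libs` folders) are pooled together. A possible variant of the same pair-counting idea, sketched here in base R under the assumption that keying on full paths is acceptable, tabulates `dirname()` of every path instead. The `root` directory and its contents are invented for the demonstration.

```r
# Hypothetical demo tree with two folders that are both named "lib".
root <- file.path(tempdir(), "pairs_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "a", "lib"), recursive = TRUE)
dir.create(file.path(root, "b", "lib", "x"), recursive = TRUE)
root <- normalizePath(root, winslash = "/")

dirs <- list.dirs(root)                 # full paths, root included

# Every folder except the root contributes one (parent, child) pair;
# the parent is simply dirname() of the child's full path.
parents <- dirname(setdiff(dirs, root))

# Using `dirs` as factor levels keeps folders that never occur as a
# parent, so they get a count of 0 without a join.
counts <- table(factor(parents, levels = dirs))

pairs <- data.frame(Folders = dirs,
                    SubFolders = as.integer(counts),
                    stringsAsFactors = FALSE)
print(pairs)
```

Because the key is the full path, the two `lib` folders keep separate counts (0 for `a/lib`, 1 for `b/lib`).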
1

Just a correction to @Gilles' solution:

path <- "SO/rails-master/" 
dirs <- list.dirs(path) 
branching_factor <- vector(length = length(dirs)) 
for (i in 1:length(dirs)) { 
    branching_factor[i] <- length(list.dirs(path = dirs[i], recursive = FALSE)) 
} 

result <- data.frame(Folders = basename(dirs), BranchingFactor = branching_factor) 

> head(result) 
     Folders BranchingFactor 
1 rails-master    14 
2  .github    0 
3 actioncable    4 
4   app    1 
5  assets    1 
6 javascripts    1 

Hope this helps.

+0

Are you correcting the solution? – histelheim

+0

@histelheim, he has now correctly updated his solution – parth

0

You can adapt my answer to your other question, using list.dirs with recursive = FALSE in place of list.files:

library(purrr) 

files <- .libPaths()[1] %>% # omit for current directory or supply alternate path 
    list.dirs() %>% 
    map_df(~list(path = .x, 
       dirs = length(list.dirs(.x, recursive = FALSE)))) 

files 
#> # A tibble: 4,457 x 2 
#>                   path dirs 
#>                   <chr> <int> 
#> 1    /Library/Frameworks/R.framework/Versions/3.4/Resources/library 314 
#> 2  /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind  4 
#> 3 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help  0 
#> 4 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html  0 
#> 5 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta  0 
#> 6  /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R  0 
#> 7  /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack  5 
#> 8 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/help  0 
#> 9 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/html  0 
#> 10 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/acepack/libs  1 
#> # ... with 4,447 more rows 

mean(files$dirs[files$dirs != 0]) 
#> [1] 2.952949 

Or in base R:

files <- do.call(rbind, lapply(list.dirs(.libPaths()[1]), function(path){ 
    data.frame(path = path, 
       dirs = length(list.dirs(path, recursive = FALSE)), 
       stringsAsFactors = FALSE) 
})) 

head(files) 
#>                  path dirs 
#> 1   /Library/Frameworks/R.framework/Versions/3.4/Resources/library 314 
#> 2  /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind 4 
#> 3 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/help 0 
#> 4 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/html 0 
#> 5 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/Meta 0 
#> 6 /Library/Frameworks/R.framework/Versions/3.4/Resources/library/abind/R 0 

mean(files$dirs[files$dirs != 0]) 
#> [1] 2.952949 
0

The averageBranchingFactor excludes leaves. Note: you can use data(acme) directly:

library(data.tree) 
data(acme) 
acme$averageBranchingFactor 
acme$count 
print(acme, abf = "averageBranchingFactor", "count") 

This will print the following:

                          levelName abf count
1  Acme Inc.                        2.5     3
2   ¦--Accounting                   2.0     2
3   ¦   ¦--New Software             0.0     0
4   ¦   °--New Accounting Standards 0.0     0
5   ¦--Research                     2.0     2
6   ¦   ¦--New Product Line         0.0     0
7   ¦   °--New Labs                 0.0     0
8   °--IT                           3.0     3
9       ¦--Outsource                0.0     0
10      ¦--Go agile                 0.0     0
11      °--Switch to R              0.0     0
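The three numbers from the question can be reproduced by hand from this printout. The sketch below copies the abf and count columns into plain vectors (no data.tree needed) and suggests where each number comes from: 2.5 is the mean of the per-node child counts over non-leaf nodes, whereas the other two average the abf column, which already holds subtree averages.

```r
# Values copied from the abf and count columns of the printed tree,
# in row order (rows 1 to 11).
abf   <- c(2.5, 2, 0, 0, 2, 0, 0, 3, 0, 0, 0)
count <- c(3,   2, 0, 0, 2, 0, 0, 3, 0, 0, 0)

# Mean number of children over the non-leaf nodes: (3 + 2 + 2 + 3) / 4
mean(count[count > 0])   # 2.5

# Mean of the subtree averages over non-leaf nodes: (2.5 + 2 + 2 + 3) / 4
mean(abf[abf > 0])       # 2.375

# Mean of the subtree averages over all 11 nodes: 9.5 / 11
mean(abf)                # 0.8636364
```

So the per-folder branching factors the question is after are the count values, not the averageBranchingFactor values.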

The averageBranchingFactor implementation holds no secrets, so you can adapt it to your needs. Just type averageBranchingFactor into your console (without parentheses):

function (node) 
{ 
    t <- Traverse(node, filterFun = isNotLeaf) 
    if (length(t) == 0) 
     return(0) 
    cnt <- Get(t, "count") 
    if (!is.numeric(cnt)) 
     browser() 
    return(mean(cnt)) 
} 

In short, we traverse the tree (excluding leaves) and get the count value of each node. Finally, we compute the mean.
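Following the same logic, the individual branching factors can be pulled out with the node's `Get` method and the same `isNotLeaf` filter. This is a minimal sketch that assumes data.tree is installed and uses the bundled acme data; the `bf` and `bf_df` names are only illustrative.

```r
library(data.tree)
data(acme)

# Number of children of every non-leaf node, named by node
bf <- acme$Get("count", filterFun = isNotLeaf)

# One row per non-leaf node with its branching factor
bf_df <- data.frame(Folders = names(bf),
                    BranchingFactor = unname(bf),
                    stringsAsFactors = FALSE)
print(bf_df)

# Their mean should agree with acme$averageBranchingFactor
mean(bf)
```

Dropping the `filterFun` argument would also return the leaves, each with a count of 0, which is closer to the data.frame layout sketched in the question.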

Hope that helps.
