Transposing identical objects

I got a strange result today.

To reproduce it, consider the following data frames:

x <- data.frame(x=1:3, y=11:13) 
y <- x[1:3, 1:2]

They should be, and in fact are, identical:

identical(x,y) 
# [1] TRUE

Applying t() to identical objects should produce the same result, but:

identical(t(x),t(y)) 
# [1] FALSE

The difference is in the column names:

colnames(t(x)) 
# NULL 
colnames(t(y)) 
# [1] "1" "2" "3"

Given this, if you stack y by column, you get what you would expect:

stack(as.data.frame(t(y))) 
# values ind 
# 1  1 1 
# 2  11 1 
# 3  2 2 
# 4  12 2 
# 5  3 3 
# 6  13 3

whereas:

stack(as.data.frame(t(x))) 
#  values ind 
# 1  1 V1 
# 2  11 V1 
# 3  2 V2 
# 4  12 V2 
# 5  3 V3 
# 6  13 V3

In the latter case, as.data.frame() finds no original column names and generates them automatically.

The culprit is as.matrix(), which is called by t():

rownames(as.matrix(x)) 
# NULL 
rownames(as.matrix(y)) 
# [1] "1" "2" "3"

One workaround is to set rownames.force (and rewrite the stack(...) call accordingly):

rownames(as.matrix(x, rownames.force=TRUE)) 
# [1] "1" "2" "3" 
rownames(as.matrix(y, rownames.force=TRUE)) 
# [1] "1" "2" "3" 
identical(t(as.matrix(x, rownames.force=TRUE)), 
      t(as.matrix(y, rownames.force=TRUE))) 
# [1] TRUE
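The rewritten stack(...) call mentioned in the parenthetical could look like the following sketch. The variable names mx and sx are illustrative only; the calls themselves are the ones already used above.

```r
x <- data.frame(x = 1:3, y = 11:13)

# Force row names during the conversion, so the transposed matrix keeps
# the column names "1", "2", "3" even though x has automatic row names.
mx <- t(as.matrix(x, rownames.force = TRUE))

# Stacking now labels the groups by the original row names
# instead of the generated V1, V2, V3.
sx <- stack(as.data.frame(mx))
levels(sx$ind)
```

With this conversion, x and y go through the same path, so the stacked result no longer depends on how the row names happen to be stored.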

My questions are:

Why does as.matrix() treat x and y differently, and
how can you tell the difference between them?

Note that other informational functions do not reveal any difference between x and y:

identical(attributes(x), attributes(y)) 
# [1] TRUE 
identical(str(x), str(y)) 
# ... 
#[1] TRUE
# (str() prints its summary and returns NULL, so this comparison is TRUE for any two objects)

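Comments on the solutions

Konrad Rudolph gave a concise and effective explanation of the behaviour above (see also mt1022 for more details).

In short, Konrad shows that:

a) x and y are internally different;
b) "identical is simply too lenient by default" to catch this internal difference.

Now, if you take a subset T of a set S that contains all the elements of S, then S and T are exactly the same object. So, if you take a data frame y that has all the rows and columns of x, then x and y should be exactly the same object. Unfortunately x \neq y!
This behaviour is not only counterintuitive but also obfuscated: the difference is not self-evident, it is purely internal, and even the default identical function cannot see it.

Another natural principle is that transposing two identical (matrix-like) objects produces identical objects. Again, this is broken by the fact that, before transposition, identical is "too lenient"; after transposition, the default identical is enough to see the difference.

IMHO this behaviour (even if it is not a bug) is a misbehaviour for a scientific language like R.
Hopefully this post will draw some attention and the R team will consider revising this misbehaviour.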

Source

2017-04-04 antonio

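It seems to be down to how the 'row.names' are defined, as they are different in 'dput(x)' and 'dput(y)'. Perhaps they get added explicitly when using '[.data.frame' – user20650

You can use dput(x) and dput(y), and you will see that the row.names are stored differently. I think it is related to automatic row.names handling (see the Details section of https://stat.ethz.ch/R-manual/R-devel/library/base/html/row.names.html for more information). I don't know why subsetting returns different row.names, though... to be honest, it smells like unexpected behaviour – digEmAll

'identical(x, y, attrib.as.set = FALSE)' seems to pick up on the difference (noting "*Note that identical(x, y, FALSE, FALSE, FALSE, FALSE) pickily tests for exact equality.*") – user20650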

identical is simply too lenient by default, but you can change that:

> identical(x, y, attrib.as.set = FALSE) 
[1] FALSE

The reason can be found by inspecting the objects in detail:

> dput(x) 
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA, 
-3L), class = "data.frame") 
> dput(y) 
structure(list(x = 1:3, y = 11:13), .Names = c("x", "y"), row.names = c(NA, 
3L), class = "data.frame")

Note the different row.names attributes:

> .row_names_info(x) 
[1] -3 
> .row_names_info(y) 
[1] 3

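From the documentation we can gather that a negative number means automatic row names (for x), whereas the row names of y are not automatic. And as.matrix treats the two cases differently.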
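As a small sketch of how this check could be wrapped up, the helper below reports whether a data frame's row names are flagged as automatic. The name has_automatic_row_names is hypothetical and not part of base R; it relies only on the sign of .row_names_info() described above.

```r
# Hypothetical helper (not part of base R): TRUE when the row names are
# flagged as "automatic", i.e. .row_names_info() returns a negative count.
has_automatic_row_names <- function(df) {
  stopifnot(is.data.frame(df))
  .row_names_info(df) < 0L
}

x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

has_automatic_row_names(x)
has_automatic_row_names(y)
```

A check like this answers the "how can you tell" part of the question without having to read dput() output by eye.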

Source

2017-04-04 16:07:51

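There is no discrepancy. The help page for 'row.names' says: "Row names of the form 1:n for n > 2 are stored internally in a compact form, ..", and as.matrix and "other functions" will "treat [such names] differently." Running trace('row.names') shows that it is called 3 times for the questioner's example (at least one of them from 'print(y)'). It also says: "'row.names' will always return a character vector (use attr(x, "row.names") if you need to retrieve an integer-valued set of row names)." –

'row.names = c(NA, 3L)' is still automatically generated row.names, just as 'row.names = c(NA, -3L)' is. The question is, why does subsetting the data change the sign (and thereby cause the difference)? – digEmAll

@digEmAll: 'c(NA, -3L)' seems to mark the object as having no explicit "row.names" (i.e. unset or set to NULL), which means that functions working with a data.frame's "row.names" should ignore this attribute. 'c(NA, 3L)' seems to mark the object as having explicit "row.names", but of the form '1:nrow(x)', so they need not be materialised. '[.data.frame' returns a subset of the data together with a subset of its "row.names" (e.g. the 'row.names' of 'x[2:3, ]' cannot be stored compactly), and the most consistent behaviour seems to be to always return an object with explicit "row.names". –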

As noted in the comments, x and y are not strictly identical. When we call t on a data.frame, t.data.frame is executed:

function (x) 
{ 
    x <- as.matrix(x) 
    NextMethod("t") 
}

We can see that it calls as.matrix, i.e. as.matrix.data.frame:

function (x, rownames.force = NA, ...) 
{ 
    dm <- dim(x) 
    rn <- if (rownames.force %in% FALSE) 
     NULL 
    else if (rownames.force %in% TRUE) 
     row.names(x) 
    else if (.row_names_info(x) <= 0L) 
     NULL 
    else row.names(x) 
...

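As @oropendola said in the comments, .row_names_info returns different values for x and y, and the function above is where that difference takes effect.

So why does y have different row names? Let's look at [.data.frame; I have added comments on the key lines: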
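To make the branch logic in the excerpt above easier to follow, here is a standalone sketch that mirrors only the row-name decision. The function name matrix_row_names is hypothetical; it is not how base R exposes this logic, just a copy of the branch shown in the excerpt.

```r
# Hypothetical stand-in mirroring the rn <- if (...) branch of
# as.matrix.data.frame: decide which row names the matrix will carry.
matrix_row_names <- function(df, rownames.force = NA) {
  if (rownames.force %in% FALSE) {
    NULL                              # caller explicitly asked for none
  } else if (rownames.force %in% TRUE) {
    row.names(df)                     # caller explicitly asked for them
  } else if (.row_names_info(df) <= 0L) {
    NULL                              # automatic row names are dropped
  } else {
    row.names(df)                     # non-automatic row names are kept
  }
}

x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

matrix_row_names(x)
matrix_row_names(y)
matrix_row_names(x, rownames.force = TRUE)
```

With the default rownames.force = NA, only the sign of .row_names_info() decides the outcome, which is why x and y diverge at this point.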

{ 
    ... # many lines of code 
    xx <- x #!! this is where xx is defined 
    cols <- names(xx) 
    x <- vector("list", length(x)) 
    x <- .Internal(copyDFattr(xx, x)) # This is where I am not sure about 
    oldClass(x) <- attr(x, "row.names") <- NULL 
    if (has.j) { 
     nm <- names(x) 
     if (is.null(nm)) 
      nm <- character() 
     if (!is.character(j) && anyNA(nm)) 
      names(nm) <- names(x) <- seq_along(x) 
     x <- x[j] 
     cols <- names(x) 
     if (drop && length(x) == 1L) { 
      if (is.character(i)) { 
       rows <- attr(xx, "row.names") 
       i <- pmatch(i, rows, duplicates.ok = TRUE) 
      } 
      xj <- .subset2(.subset(xx, j), 1L) 
      return(if (length(dim(xj)) != 2L) xj[i] else xj[i, 
                  , drop = FALSE]) 
     } 
     if (anyNA(cols)) 
      stop("undefined columns selected") 
     if (!is.null(names(nm))) 
      cols <- names(x) <- nm[cols] 
     nxx <- structure(seq_along(xx), names = names(xx)) 
     sxx <- match(nxx[j], seq_along(xx)) 
    } 
    else sxx <- seq_along(x) 
    rows <- NULL ## this is where rows is defined, as we give numeric i, the following 
    ## if block will not be executed 
    if (is.character(i)) { 
     rows <- attr(xx, "row.names") 
     i <- pmatch(i, rows, duplicates.ok = TRUE) 
    } 
    for (j in seq_along(x)) { 
     xj <- xx[[sxx[j]]] 
     x[[j]] <- if (length(dim(xj)) != 2L) 
      xj[i] 
     else xj[i, , drop = FALSE] 
    } 
    if (drop) { 
     n <- length(x) 
     if (n == 1L) 
      return(x[[1L]]) 
     if (n > 1L) { 
      xj <- x[[1L]] 
      nrow <- if (length(dim(xj)) == 2L) 
       dim(xj)[1L] 
      else length(xj) 
      drop <- !mdrop && nrow == 1L 
     } 
     else drop <- FALSE 
    } 
    if (!drop) { ## drop is False for our case 
     if (is.null(rows)) 
      rows <- attr(xx, "row.names") ## rows changed from NULL to 1,2,3 here 
     rows <- rows[i] 
     if ((ina <- anyNA(rows)) | (dup <- anyDuplicated(rows))) { 
      if (!dup && is.character(rows)) 
       dup <- "NA" %in% rows 
      if (ina) 
       rows[is.na(rows)] <- "NA" 
      if (dup) 
       rows <- make.unique(as.character(rows)) 
     } 
     if (has.j && anyDuplicated(nm <- names(x))) 
      names(x) <- make.unique(nm) 
     if (is.null(rows)) 
      rows <- attr(xx, "row.names")[i] 
     attr(x, "row.names") <- rows ## this is where the rownames of x changed 
     oldClass(x) <- oldClass(xx) 
    } 
    x 
}

We can see that y gets its row names via something like attr(x, 'row.names'):

> attr(x, 'row.names') 
[1] 1 2 3

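So when we create y with [.data.frame, it receives a row.names attribute that differs from that of x, whose row.names are automatic and are shown with a negative sign in the dput result.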
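The effect of that final assignment can be reproduced in isolation. The sketch below assumes the behaviour described in this thread (assigning 1:3 through attr<- yields a compact but non-automatic form); the variable names z and y2 are illustrative. It also shows one possible way to restore automatic row names, by assigning NULL through row.names<-.

```r
x <- data.frame(x = 1:3, y = 11:13)

# Re-assigning the expanded row names, as the end of `[.data.frame` does,
# stores them compactly but no longer flags them as automatic.
z <- x
attr(z, "row.names") <- attr(x, "row.names")
.row_names_info(x)   # negative: automatic
.row_names_info(z)   # positive: not automatic

# Assigning NULL row names resets them to the automatic form,
# after which transposition agrees with x again.
y2 <- x[1:3, 1:2]
row.names(y2) <- NULL
identical(t(x), t(y2))
```

Resetting the row names this way is an alternative to passing rownames.force = TRUE at every conversion.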

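Edit

In fact, this is already stated in the row.names manual page:

Note

row.names is similar to rownames for arrays, and it has a method that calls rownames for an array argument.

Row names of the form 1:n for n > 2 are stored internally in a compact form, which might be seen from C code or by deparsing but never via row.names or attr(x, "row.names"). Additionally, some names of this sort are marked as 'automatic' and handled differently by as.matrix and data.matrix (and potentially other functions).

So attr does not distinguish between automatic row.names (like those of x) and explicit integer row.names (like those of y), whereas as.matrix does, through the internal representation reported by .row_names_info.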
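A short sketch of that contrast, using the same x and y as in the question: the attr() accessor expands both compact forms to the same integer vector, while .row_names_info() with type = 0L returns the internal representation as stored.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# The accessor hides the internal form: both expand to 1:3.
identical(attr(x, "row.names"), attr(y, "row.names"))

# type = 0L returns the stored attribute itself, which differs in sign.
.row_names_info(x, 0L)
.row_names_info(y, 0L)
identical(.row_names_info(x, 0L), .row_names_info(y, 0L))
```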

Source

2017-04-04 15:49:43 mt1022

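It is worth noting that attr(x, "row.names") and attr(x, "row.names") = value do not show how R handles "row.names" internally. '.row_names_info' is more accurate. E.g. 'attr(x, "row.names") = 1:3' does not store '1:3' as the "row.names", but rather what '.row_names_info(x, 0)' shows. Nonetheless, any value other than 'NULL' marks the object as having user-defined "row.names", so functions (like 'as.matrix') need to, or should, take this into account. –

Sure. 'attr(x, 'row.names')' and 'attr(y, 'row.names')' give the same result! – mt1022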
