Contingency test on data with repeated measures

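I'm hoping someone can offer some guidance or help. I have a data set containing a population that was tested for infection over three years. Some individuals (not all) were sampled in more than one year, so they represent repeated measures. I want to determine whether the prevalence of infection changed over time, but I'm having trouble deciding on an appropriate test. A simple contingency test violates the assumption of independence, because individuals are repeated across years. I don't think the Cochran-Mantel-Haenszel test or McNemar's chi-square test is appropriate, but feel free to correct me if I'm wrong. Here is the data set I'm working with; the "AnID" variable is a factor identifying a single individual (so if an individual was sampled in multiple years, you will see that number repeated 2 or 3 times).

One workable approach, I think, is to randomly resample the data many times (without replacement), each time including each individual only once, and run the contingency test across years on each resample. If the null hypothesis of no difference is rejected at least 95% of the time, I could reliably claim that there is a difference. I'm not yet good enough at this to write my own code. Thanks in advance for any help you can offer.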

dput(example)
structure(list(AnID = structure(c(37L, 37L, 45L, 45L, 45L, 55L,
55L, 62L, 62L, 68L, 68L, 1L, 1L, 2L, 3L, 3L, 4L, 9L, 9L, 18L,
18L, 18L, 19L, 19L, 19L, 20L, 20L, 21L, 22L, 22L, 23L, 24L, 24L,
24L, 25L, 25L, 25L, 26L, 27L, 28L, 28L, 28L, 29L, 29L, 29L, 30L,
31L, 32L, 32L, 33L, 34L, 35L, 36L, 38L, 38L, 39L, 39L, 40L, 41L,
41L, 42L, 42L, 42L, 43L, 43L, 43L, 44L, 46L, 46L, 46L, 47L, 47L,
47L, 48L, 48L, 48L, 49L, 49L, 49L, 50L, 51L, 52L, 52L, 53L, 53L,
54L, 54L, 56L, 56L, 57L, 57L, 57L, 58L, 59L, 60L, 61L, 63L, 64L,
65L, 66L, 67L, 69L, 70L, 71L, 72L, 73L, 74L, 74L, 5L, 6L, 7L,
8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("10",
"11", "12", "13", "136", "137", "138", "139", "14", "140", "141",
"142", "143", "144", "145", "146", "147", "26", "27", "28", "29",
"30", "31", "37", "38", "39", "40", "41", "42", "43", "44", "45",
"46", "47", "48", "49", "5", "50", "51", "52", "53", "57", "58",
"59", "6", "60", "61", "62", "63", "64", "65", "66", "67", "69",
"7", "70", "71", "72", "75", "76", "77", "8", "82", "83", "84",
"85", "86", "9", "90", "94", "95", "96", "97"), class = "factor"),
year = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L,
3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L,
2L, 3L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L,
2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), ..., .Label = c("2012",
"2013", "2014"), class = "factor"),
value = c("Pos", "Pos", "Pos", "Pos", "Neg", "Neg",
"Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Pos",
"Neg", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg",
"Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos",
"Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Pos",
"Pos", "Pos", "Neg", "Neg", "Pos", "Pos", "Neg", "Pos", "Neg",
"Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos",
"Pos", "Pos", "Neg", "Neg", "Pos", "Neg",
"Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos",
"Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg",
"Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg",
"Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg",
"Pos", "Pos", "Pos", "Pos", "Pos", "Neg",
"Neg", "Pos", "Neg", "Pos", "Neg")), .Names = c("AnID", "year",
"value"), row.names = 187:306, class = "data.frame")

Source

2017-02-17 giderk

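Keep in mind that experiment/test design calls for an up-front calculation of the effective sample size, so as to maximise the probability of detecting a statistically significant difference if one exists. (For more information, see https://en.wikipedia.org/wiki/Sample_size_determination and https://en.wikipedia.org/wiki/Statistical_power.)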
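As a rough illustration of such a calculation, base R's power.prop.test() can estimate the group size needed to compare two independent proportions. The values below (26% vs. 54% positives, 80% power, 5% significance level) are purely illustrative assumptions, and the function covers only the two-group case, not a three-year comparison with repeated individuals.

```r
# Illustrative only: p1, p2, power and sig.level are assumed values,
# not figures supplied by the question.
pw <- power.prop.test(p1 = 0.26, p2 = 0.54, sig.level = 0.05, power = 0.80)

# Required number of independent individuals per group, rounded up
ceiling(pw$n)
```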

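If all of your subjects had before/after measurements (e.g. test/control), you could run McNemar's test for comparing proportions (see https://en.wikipedia.org/wiki/McNemar's_test).

However, not all users have repeated measurements, so I chose to randomly pick one year for each user; that way I get 3 samples of independent values.

Assuming dt is your data set...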
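For reference, a minimal sketch of what the fully paired case might look like with base R's mcnemar.test(). The 2 x 2 counts are hypothetical and are not derived from the question's data; each cell counts individuals by their result on the first and the second occasion.

```r
# Hypothetical paired counts (NOT from the question's data):
# rows = result on the first occasion, columns = result on the second
paired <- matrix(c(10,  5,
                   12, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(first  = c("Pos", "Neg"),
                                 second = c("Pos", "Neg")))

# McNemar's test only uses the discordant cells (Pos/Neg and Neg/Pos)
res <- mcnemar.test(paired)
res$p.value
```

This only applies when every individual has exactly two measurements, which is not the situation in the question.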

library(dplyr) 

set.seed(1) # this will help you having a specific random sampling 

dt %>%      
    mutate(Pos = ifelse(value == "Pos", 1, 0)) %>% # create a binary variable to flag positives 
    group_by(AnID) %>%        # for each user 
    sample_n(1) %>%         # get one row/value randomly 
    group_by(year) %>%        # for each year 
    summarise(N = n(),        # get number of users 
      N_Pos = sum(Pos),      # get number of positive users 
      Prc_Pos = mean(Pos)) %>%    # get percentage of positives 
    print() -> tbl1         # print and save it 

# # A tibble: 3 × 4
#     year     N N_Pos   Prc_Pos
#   <fctr> <int> <dbl>     <dbl>
# 1   2012    23     6 0.2608696
# 2   2013    27     9 0.3333333
# 3   2014    24    13 0.5416667

After looking at the yearly percentages above, you can

# run the statistical comparison of proportions 
prop.test(tbl1$N_Pos, tbl1$N) 

# 3-sample test for equality of proportions without continuity correction 
# 
# data: tbl1$N_Pos out of tbl1$N 
# X-squared = 4.3038, df = 2, p-value = 0.1163 
# alternative hypothesis: two.sided 
# sample estimates: 
#    prop 1    prop 2    prop 3
# 0.2608696 0.3333333 0.5416667

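The p-value of the proportion comparison run here (0.1163) indicates that we have no evidence of a difference between the years in the probability of being positive.

If you do find a difference, you can run pairwise comparisons between the years.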

# run pairwise comparisons 
pairwise.prop.test(tbl1$N_Pos, tbl1$N) 

# Pairwise comparisons using Pairwise comparison of proportions 
# 
# data: tbl1$N_Pos out of tbl1$N 
# 
#   1    2
# 2 0.80 -
# 3 0.29 0.45
# 
# P value adjustment method: holm

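The output here is 3 p-values (3 pairwise comparisons). As expected, they all indicate no evidence of a difference between the years.

You can wrap the above process in a function and run N simulations. Check how many of those simulations find a statistically significant result.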
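One way to wrap the procedure above in a function and repeat it might look like the sketch below. It uses base R only so it runs without extra packages. The toy data frame, the helper name one_run and the 0.05 threshold are assumptions for illustration; with real data, replace dt with the actual data frame (columns AnID, year, value).

```r
# Hypothetical toy data with the same columns as the question's data set.
# The seed is set ONCE, outside the resampling function, so every
# repetition still draws a different random subsample.
set.seed(42)
years <- c("2012", "2013", "2014")
dt <- do.call(rbind, lapply(1:60, function(id) {
  yrs <- sort(sample(years, size = sample(1:3, 1)))  # years this individual was sampled
  data.frame(AnID  = id,
             year  = yrs,
             value = sample(c("Pos", "Neg"), length(yrs), replace = TRUE))
}))
dt$year <- factor(dt$year, levels = years)

# One repetition: keep one random row per individual, then test the
# equality of the proportion of positives across years.
one_run <- function(data) {
  idx <- vapply(split(seq_len(nrow(data)), data$AnID),
                function(rows) rows[sample.int(length(rows), 1)],
                integer(1))
  sub   <- data[idx, ]
  n     <- table(sub$year)                        # individuals per year
  n_pos <- table(sub$year[sub$value == "Pos"])    # positives per year
  # prop.test may warn about the chi-squared approximation in small samples
  suppressWarnings(prop.test(as.vector(n_pos), as.vector(n))$p.value)
}

# Repeat the resampling 1000 times and collect the p-values
p_values <- replicate(1000, one_run(dt))

# Share of repetitions with a significant result at the (assumed) 5% level
mean(p_values < 0.05)
```

The share printed at the end is the quantity the question proposes to compare against 95%. Note that prop.test() stops with an error if a resample happens to contain no individuals for some year, which could occur with very small data sets.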

Source

2017-02-17 14:39:18 AntoniosK

Thanks! This works well. I've put your code in a loop to repeat the process 1000 times. – giderk

Make sure you remove 'set.seed' so that you get different random numbers each time. – AntoniosK
