2017-04-10

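R: iterating through a 50k-row data frame takes a long time

I am writing a simple program that should parse one .tsv file into multiple .csv files. The problem is that it takes very long (I think ~9 minutes for 50k rows is terrible performance). Could someone please look at my code and tell me what I am doing wrong?

I have a table that contains the name of the participant, the name of the media file, a timestamp, and some coordinate data. My data can contain one or more participants, and each participant works with two media files. I want to create one csv file for each media file that a specific participant worked with.

For example, I have 2 participants, P1 and P2, and each of them works with media files M1 and M2. So I want to create P1_M1.csv, P1_M2.csv, P2_M1.csv, P2_M2.csv.

The data look like this: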

P1 | M1 | data... 
P1 | M1 | data... 
... 
P1 | M2 | data... 
... 
P2 | M1 | data... 
... 
... 

Here is my code:

data = read.table("./data.tsv", header = T, sep = "\t", stringsAsFactors = F) # load data from tsv 

# function for creating csv file 
writeData = function(filename, d){ 
    filename = paste("./", filename, ".csv", sep = "") 
    write.csv(d, file = filename, row.names = F) 
} 

# initialize auxiliary variables 
participantName = "" 
mediaName = "" 
# initialize empty dataframe 
subdata <- data.frame(TimeStamp = numeric(), GazeLeftX = integer(), GazeLeftY = integer(), GazeRightX = integer(), GazeRightY = integer()) 

# for each row in original data... 
for(r in 1:nrow(data)) 
{ 
    # check whether the participant on the current row differs from the last participant 
    if(participantName != data[r, 'ParticipantName']){ 
        # check that the last participant is not empty (i.e. some participant was already processed) 
        if(participantName != ""){ 
            # the previous participant's work on the media file has ended, so write the data to csv 
            writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata) 
            # empty the auxiliary data frame and also mediaName 
            subdata = subdata[0,] 
            mediaName = "" 
        } 
        # a new participant was detected, so record it as the last participant 
        participantName = data[r, 'ParticipantName'] 
    } 
    # do the same checks for the media file, because the media file can change while the participant stays the same 
    if(mediaName != data[r, 'MediaName']){ 
        if(mediaName != ""){ 
            writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata) 
            subdata = subdata[0,] 
        } 
        mediaName = data[r, 'MediaName'] 
    } 
    # in every iteration append the current row to the auxiliary data frame 
    subdata = rbind(subdata, 
        data.frame(TimeStamp = data[r, 'EyeTrackerTimestamp'], 
                   GazeLeftX = data[r, 'GazeLeftX'], 
                   GazeLeftY = data[r, 'GazeLeftY'], 
                   GazeRightX = data[r, 'GazeRightX'], 
                   GazeRightY = data[r, 'GazeRightY'])) 
} 
# if there are any data left in auxiliary dataframe, save it to csv 
if(nrow(subdata) != 0){ 
    writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata) 
} 

See '?split'. Try for instance 'split(data, data[, c("ParticipantName", "MediaName")])'. – nicola

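@nicola Thank you very much, that's great. If you want, you can post it as an answer and I will mark it as the solution. Now I have only one problem: my code creates only one csv file, but that is probably just some silly mistake in my code :) – Gondil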

Answer

You are looking for ?split. Try for instance:

split(data,data[,c("ParticipantName","MediaName")],drop=TRUE) 

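This will create a list containing one data.frame for each ParticipantName - MediaName pair. If you want to write each data frame to a separate file, you can try something like this: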

res<-split(data,data[,c("ParticipantName","MediaName")],drop=TRUE) 
Map(writeData,names(res),res) 

where writeData is the function you defined.
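
One detail worth checking: when split is given several grouping columns, the list names are joined with "." by default (for example P1.M1), so the files would not follow the P1_M1.csv pattern from the question, and each file would also keep every original column. The following is a minimal sketch of one way around both points, not a definitive implementation. The toy data frame, the out_dir variable, and the sub, f, and paths names are assumptions made up for illustration; the column names are taken from the question.

```r
# Toy data standing in for data.tsv (values are invented for illustration)
data <- data.frame(
  ParticipantName     = c("P1", "P1", "P1", "P2", "P2", "P2"),
  MediaName           = c("M1", "M1", "M2", "M1", "M2", "M2"),
  EyeTrackerTimestamp = c(1, 2, 3, 4, 5, 6),
  GazeLeftX           = c(10L, 11L, 12L, 13L, 14L, 15L),
  GazeLeftY           = c(20L, 21L, 22L, 23L, 24L, 25L),
  GazeRightX          = c(30L, 31L, 32L, 33L, 34L, 35L),
  GazeRightY          = c(40L, 41L, 42L, 43L, 44L, 45L),
  stringsAsFactors    = FALSE
)

# Hypothetical output directory; replace with "." to write next to the script
out_dir <- tempdir()

# Keep only the columns the question writes out, renaming the timestamp column
sub <- data[, c("EyeTrackerTimestamp", "GazeLeftX", "GazeLeftY",
                "GazeRightX", "GazeRightY")]
names(sub)[1] <- "TimeStamp"

# Build one grouping factor per participant/media pair, joined with "_";
# drop = TRUE discards combinations that never occur in the data
f <- interaction(data$ParticipantName, data$MediaName, sep = "_", drop = TRUE)

# One data frame per pair, named e.g. "P1_M1"
res <- split(sub, f)

# Write each piece to <out_dir>/<pair>.csv
paths <- file.path(out_dir, paste0(names(res), ".csv"))
invisible(Map(function(d, path) write.csv(d, file = path, row.names = FALSE),
              res, paths))
```

Because the grouping is done once on whole columns, the data frame is never grown row by row with rbind, which is the part of the original loop that repeatedly copies data.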