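Best and most efficient way to count tokenized words
I have a CSV that starts with 3 columns: a cumulative percentage-of-cost column, a cost column, and a keyword column. The R script works on small files but dies completely (it never finishes) when I feed it the actual file, which has a million rows. Can you help me make this script more efficient? Token.Count is the column I'm having trouble creating. Thanks!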
# Token Histogram
# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)
# Limits the data frame to one HTT (HEAD/TORSO/TAIL) segment
# Change the threshold to:
# .99 for the big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# Uncomment to measure ONLY TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# Uncomment to measure ONLY TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# Uncomment to measure ONLY Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)
# Creates a vector of HTT segmentation labels, one per row
# (ifelse() already returns the full vector, so no empty data frame is needed first)
HTT <- ifelse(Mydf$CumuCost <= .8, "HEAD", ifelse(Mydf$CumuCost <= .95, "TORSO", "TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)
# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]
# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize
# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun <- function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt,
             split = "[[:space:][:punct:]]+", method = "string", n = 1L))
}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
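The loop above is slow mainly because `rbind()` copies the whole growing data frame on every iteration, so the work grows roughly quadratically with the number of rows. A minimal sketch of a vectorized alternative in base R follows. It assumes that only the number of tokens per keyword is needed, not the per-token frequencies that `textcnt` returns, and it reuses the same split pattern as the script. The helper name `count_tokens` and the small `demo` data frame are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: counts tokens per string by splitting on runs of
# whitespace/punctuation, the same separator pattern used with textcnt above.
count_tokens <- function(x, split = "[[:space:][:punct:]]+") {
  pieces <- strsplit(x, split, perl = TRUE)
  # strsplit() yields an empty first piece when a string starts with a
  # separator, so count only the non-empty pieces.
  vapply(pieces, function(tok) sum(nzchar(tok)), integer(1), USE.NAMES = FALSE)
}

# Small stand-in for Mydf, using keywords from the sample data in this thread
demo <- data.frame(
  Keyword.text = c("north+face+outlet", "kinect sensor", "x box 360 250"),
  stringsAsFactors = FALSE
)

# One call over the whole column; no loop and no rbind()
demo$Token.Count <- count_tokens(demo$Keyword.text)
print(demo$Token.Count)  # 3 2 4
```

Applied to the real data this would be `Mydf$Token.Count <- count_tokens(Mydf$Keyword.text)`, which also makes the later renaming step unnecessary. It is worth checking the result against `Myfun` on a few hundred rows before relying on it, since `textcnt` may treat edge cases differently.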
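Can you link to a chunk of sample data? Feel free to make it synthetic, just representative, so people can test their approaches and make sure they are faster. – 2010-12-10 21:12:56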
CumuCost     Cost   Keyword.text
0.004394288  678.5  north+face+outlet
0.006698245  80.05  kinect sensor
0.008738991  79.51  x box 360 250
– datayoda 2010-12-10 22:47:12
'data.frame': 74231 obs. of 5 variables:
 $ CumuCost    : num 0.00439 0.0067 0.00874 0.01067 0.01258 ...
 $ Cost        : num 1678 880 780 736 731 ...
 $ Keyword.text: chr "north+face+outlet" "kinect sensor" "x box 360 250" ...
 $ HTT         : Factor w/ 1 level "HEAD": 1 1 1 1 1 1 1 1 1 1 ...
 $ Token.Count : int 3 2 4 1 4 2 2 2 2 1 ...
– datayoda 2010-12-10 22:51:07
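For anyone who wants to time an approach without the real report, here is a rough sketch of how a synthetic file with the same three starting columns could be generated. Everything in it is an assumption made for illustration: the helper name `make_synthetic_report`, the small vocabulary (taken from the sample keywords above), the exponential cost distribution, and the choice of 1 to 4 tokens per keyword.

```r
# Hypothetical helper: builds a synthetic keyword report with n rows and the
# columns CumuCost, Cost and Keyword.text, sorted by descending cost.
make_synthetic_report <- function(n, seed = 1L) {
  set.seed(seed)
  vocab <- c("north", "face", "outlet", "kinect", "sensor",
             "x", "box", "360", "250")
  # Each keyword gets between 1 and 4 space-separated tokens
  n_tokens <- sample(1:4, n, replace = TRUE)
  keywords <- vapply(
    n_tokens,
    function(k) paste(sample(vocab, k), collapse = " "),
    character(1)
  )
  # Assumed cost distribution; sorted so the cumulative share behaves
  # like the head/torso/tail split in the script
  cost <- sort(round(rexp(n, rate = 1 / 100), 2), decreasing = TRUE)
  data.frame(
    CumuCost = cumsum(cost) / sum(cost),
    Cost = cost,
    Keyword.text = keywords,
    stringsAsFactors = FALSE
  )
}

synthetic <- make_synthetic_report(1000)
str(synthetic)
```

Raising `n` toward the real row count and wrapping a candidate in `system.time()` gives a quick comparison of approaches.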