2017-06-16 97 views

I have a test file of article titles (test$title) and their total social shares (test$total_shares). I can find the most frequently used trigrams with, for example:

library(tau) 
trigrams = textcnt(test$title, n = 3, method = "string") 
trigrams = trigrams[order(trigrams, decreasing = TRUE)] 
head(trigrams, 20) 

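However, what I would like to do is rank the top trigrams by average shares rather than by number of occurrences.

I can use grepl, for example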

library(dplyr)  # provides filter()
HowTo <- filter(test, grepl('how to create', title, ignore.case = TRUE))

to find the headlines containing any specific trigram, and then use:

summary(HowTo) 

to see the average shares of the headlines containing that trigram.
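For reference, the same one-phrase lookup can be sketched in base R without dplyr. The data frame below is made up for illustration and only stands in for the real test file:

```r
# Toy data (made up) standing in for the real test file
test <- data.frame(
    title = c("How to create a course", "Top five tools",
              "How To Create quizzes"),
    total_shares = c(120, 40, 80),
    stringsAsFactors = FALSE)

# TRUE for every headline that contains the phrase, ignoring case
hits <- grepl("how to create", test$title, ignore.case = TRUE)

# average shares of the matching headlines
mean(test$total_shares[hits])
```

This handles one phrase at a time, which is exactly why it does not scale to every trigram in the dataset.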

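But this is a time-consuming process. What I want is to compute the top trigrams in the dataset ranked by average shares. Thanks for your help.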

Here is a sample dataset: https://d380wq8lfryn3c.cloudfront.net/wp-content/uploads/2017/06/16175029/test4.csv

I tend to strip non-ASCII characters from the titles using

test$title <- sapply(test$title,function(row) iconv(row, from = "UTF-8", to = "ASCII", sub="")) 

Answers


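Right, this one was a bit tricky. I broke it down into manageable chunks and then strung them together, which means I may have missed some shortcuts, but at least it seems to work.

Oh, I forgot to mention: if you call textcnt() on the whole title vector the way you did, it produces trigrams that span the end of one title and the beginning of the next. I considered that undesirable and found a way around it.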
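To illustrate the boundary problem, here is a minimal base-R sketch. `word_trigrams()` is a hypothetical helper written only for this illustration; it is not part of tau and merely approximates what `textcnt(..., n = 3, method = "string")` does. The two titles are made up.

```r
# Hypothetical helper: all consecutive three-word sequences in one string
word_trigrams <- function(s) {
    words <- strsplit(s, " +")[[1]]
    words <- words[nzchar(words)]          # drop empty pieces
    if (length(words) < 3) return(character(0))
    starts <- seq_len(length(words) - 2)
    vapply(starts, function(i) paste(words[i:(i + 2)], collapse = " "),
        character(1))
}

titles <- c("how to create courses", "top five tools")

# Pooling the titles into one text yields trigrams that straddle the boundary
pooled <- word_trigrams(paste(titles, collapse = " "))

# Working title by title keeps every trigram inside a single headline
per_title <- unlist(lapply(titles, word_trigrams))

setdiff(pooled, per_title)
# "create courses top" "courses top five"
```

The two extra trigrams exist in no headline, which is why the code in this answer extracts trigrams one title at a time.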

library(tau) 
library(magrittr) 

test0 <- read.csv(paste0("https://d380wq8lfryn3c.cloudfront.net/", 
        "wp-content/uploads/2017/06/16175029/test4.csv"), 
        header=TRUE, stringsAsFactors=FALSE) 

test0[7467,] #problematic line 

test <- test0 
# test <- head(test0, 20) 
test$title <- iconv(test$title, from="UTF-8", to="ASCII", sub=" ") 
test$title <- test$title %>% 
    tolower %>% 
    gsub("[,/]", " ", .) %>% #replace , and / with a space 
    gsub("[^a-z ]", "", .) %>% #keep only letters and spaces 
    gsub(" +", " ", .) %>%  #shrink multiple spaces to one 
    gsub("^ ", "", .) %>%  #remove leading spaces 
    gsub(" $", "", .)   #remove trailing spaces 

test[7467,] #problematic line resolved 

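# one character vector of trigrams per headline; lapply() always returns a list 
trigrams <- lapply(test$title, 
    function(s) names(textcnt(s, n=3, method="string"))) 

# repeat each headline's share count once per trigram it contains 
# (storing the counts as list names and flattening with c() would append 
# an index to each name, e.g. "136461", and corrupt the numbers) 
shares <- rep(test$total_shares, lengths(trigrams)) 
trigrams.df <- data.frame(trigrams=unlist(trigrams), shares=shares, 
    stringsAsFactors=FALSE) 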

# aggregate shares by trigram: the share counts of identical trigrams 
# are summarized using some function (sum, mean, median etc.) 
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df, sum) 

# more than one statistic can be calculated 
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df, 
    FUN=function(x) c(mean=mean(x), sum=sum(x), nhead=length(x))) 
trigrams_share <- do.call(data.frame, trigrams_share) 
trigrams_share[[1]] <- as.character(trigrams_share[[1]]) 

# top five trigrams by average number of shares, 
# among those found in three or more headlines 
trigrams_share <- trigrams_share[order(
    trigrams_share[["shares.mean"]], decreasing=TRUE), ] 
head(trigrams_share[trigrams_share[["shares.nhead"]] >= 3, ], 5) 
#       trigrams shares.mean shares.sum shares.nhead 
# 37588    the secret to 42852.75  171411   4 
# 43607     will be a 24779.00  123895   5 
# 44945  your career elearning 23012.00  92048   4 
# 31454   raises million to 21378.67  64136   3 
# 6419 classroom elearning industry 18812.38  150499   8 

In case the link should break:

# dput(head(test0, 20)): 

test <- structure(list(
title = c("Top 3 Myths About BYOD In The Classroom - eLearning Industry", 
"The Emotional Weight of Being Graded, for Better or Worse", 
"Online learning startup Coursera raises $64M at an $800M valuation", 
"LinkedIn doubles down on education with LinkedIn Learning, updates desktop site", 
"Create Your eLearning Resume - eLearning Industry", 
"The Disruption of Digital Learning: Ten Things We Have Learned", 
"'Top universities to offer full degrees online in five years' - BBC News", 
"Schools will teach 'soft skills' from 2017, but assessing them presents a challenge", 
"Top 5 Lead-Generating Ideas for Your Content Marketing", 
"'Top universities to offer full degrees online in five years' - BBC News", 
"The long-distance learners of Aleppo - BBC News", 
"eLearning Solutions for Business", 
"6 Top eLearning Course Reviewer Tools And Selection Criteria - eLearning Industry", 
"eLearning Elevated", 
"When Teachers and Technology Let Students Be Masters of Their Own Learning", 
"Aviation Technical English online elearning course", 
"How the Pioneers of the MOOC Got It Wrong", 
"Study challenges cost and price myths of online education", 
"10 Easy Ways to Integrate Technology in Your Classroom", 
"7 e-learning trends for educational institutions in 2017" 
), total_shares = c(13646L, 12120L, 8328L, 5945L, 5853L, 5108L, 
4944L, 3570L, 3104L, 2841L, 2463L, 2227L, 2218L, 2210L, 2200L, 
2117L, 2039L, 1876L, 1861L, 1779L)), .Names = c("title", "total_shares" 
), row.names = c(NA, 20L), class = "data.frame") 

This is great, thank you very much.


Cool, it was an interesting problem to figure out. If you think it deserves it, you can mark the answer as [accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – AkselA


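Thanks, this works well and gives me the total shares per trigram. Ideally I would like the mean and median, since the total can be skewed by how often a trigram is used. Is there a simple way to see the number of occurrences of each trigram along with the mean and median?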
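One possible way to get all three figures, sketched on made-up data: pass `aggregate()` a function that returns several named statistics, then flatten the result. The `trigrams.df` below is a toy stand-in with the same two columns (`trigrams`, `shares`) as in the answer; the values are invented.

```r
# Toy data (made up): one row per trigram occurrence,
# carrying the share count of the headline it came from
trigrams.df <- data.frame(
    trigrams = c("how to create", "how to create", "how to create",
                 "top five tools", "top five tools"),
    shares = c(100, 200, 900, 50, 150),
    stringsAsFactors = FALSE)

# count, mean and median of shares for each distinct trigram
stats <- aggregate(shares ~ trigrams, data = trigrams.df,
    FUN = function(x) c(n = length(x), mean = mean(x), median = median(x)))
stats <- do.call(data.frame, stats)   # flatten the matrix column

# rank by median, which is less sensitive to a single viral headline
stats[order(stats$shares.median, decreasing = TRUE), ]
```

The flattened columns are named `shares.n`, `shares.mean` and `shares.median`, so a minimum-occurrence filter such as `stats[stats$shares.n >= 3, ]` can be applied before ranking.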