Spark request max count

I am a beginner, and I am trying to write a query that retrieves the most visited web page.

My query is the following:

mostPopularWebPageDF = logDF.groupBy("webPage").agg(functions.count("webPage").alias("cntWebPage")).agg(functions.max("cntWebPage")).show()

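With this query I only get a DataFrame containing the maximum count, but I want a DataFrame that holds both that count and the web page it belongs to.

Something like this: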

webPage   max(cntWebPage) 
google.com   2

How can I solve my problem?

Thanks a lot.

2016-11-26 JackR

In pyspark + SQL:

logDF.registerTempTable("logDF") 

mostPopularWebPageDF = sqlContext.sql("""select webPage, cntWebPage from (
              select webPage, count(*) as cntWebPage, max(count(*)) over() as maxcnt 
              from logDF 
              group by webPage) as tmp 
              where tmp.cntWebPage = tmp.maxcnt""")

Maybe I could make it cleaner, but it works. I will try to optimize it.

My result:

webPage  cntWebPage 
google.com 2

The dataset:

webPage usersid 
google.com 1 
google.com 3 
bing.com 10

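Explanation: the per-page counts are computed by grouping plus the COUNT(*) function. The maximum of all these counts is computed with a window function, so for the dataset above the intermediate DataFrame (before the maxCount column is dropped) is: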

webPage count maxCount 
google.com 2  2 
bing.com 1  2

Then we select the rows whose count equals maxCount.
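
The three steps of the explanation (count per page, take the maximum of all counts, keep the rows that match it) can be sketched in plain Python without Spark. This is only an illustration of the logic, not the Spark implementation: the `rows` list is a hypothetical in-memory copy of the sample dataset, and `most_popular` is a helper name made up for this sketch.

```python
from collections import Counter

# Hypothetical in-memory copy of the sample dataset: (webPage, usersid).
rows = [("google.com", 1), ("google.com", 3), ("bing.com", 10)]


def most_popular(rows):
    """Return sorted (webPage, count) pairs whose count equals the overall maximum."""
    # Step 1: count rows per webPage (the GROUP BY + COUNT(*) step).
    counts = Counter(web_page for web_page, _ in rows)
    if not counts:
        return []
    # Step 2: maximum of all counts (the value MAX(COUNT(*)) OVER () puts on every row).
    max_count = max(counts.values())
    # Step 3: keep only the pages whose count equals that maximum (the WHERE filter).
    return sorted((page, cnt) for page, cnt in counts.items() if cnt == max_count)


result = most_popular(rows)
print(result)  # [('google.com', 2)]
```

As in the SQL version, pages that tie for the maximum count are all returned rather than only one of them.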

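Edit: I have deleted the DSL version; it does not support window over () and the ordering was changing the result. Sorry for this mistake. The SQL version is correct.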

2016-11-26 12:34:30

Thank you very much for your help :) – JackR

@JackR If it helped you, please upvote + mark it as accepted :) –

I am upvoting this because the OP does not seem to have a clue about how to deal with things. :) – eliasah
