How to query comments Stack Overflow style?

I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments ("How does SO query comments?")

I wanted to set the record straight and ask the question in a proper technical way.

Say I have 2 tables:

 
Posts 
id 
content 
parent_id   (null for questions, question_id for answer) 

Comments 
id 
body 
is_deleted 
post_id 
upvotes 
date

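Note: I think this is how the schema is set up. Answers have a parent_id that points to the question, and questions have NULL there. Questions and answers are stored in the same table.

How do I pull out comments Stack Overflow style, very efficiently and with minimal round trips?

The rules:

A single query should pull out everything needed for rendering.
Only 5 comments need to be pulled out per answer, with preference given to upvotes.
It needs to provide enough information to tell the user that there are more comments beyond the 5 (and the actual count, e.g. "2 more comments").
Sorting is really hairy for comments, as you can see from the comments on this question. The rule is: display comments by date, but if a comment has upvotes it gets preferential treatment and is displayed at the bottom of the list. (This is hard to express in SQL.)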
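
One possible reading of the rules above, as a minimal Python sketch. The `Comment` tuple and the `visible_comments` function are hypothetical names introduced for illustration, and the tie-breaking (more upvotes first, then older comments first) is an assumption, since the rules do not pin it down.

```python
from collections import namedtuple

# Hypothetical in-memory shape of one row from the Comments table.
Comment = namedtuple("Comment", "id upvotes date")


def visible_comments(comments, limit=5):
    """Pick up to `limit` comments, preferring upvoted ones.

    Returns (shown, hidden_count), with `shown` in chronological order.
    """
    # Assumed preference: more upvotes first; ties go to the older comment.
    preferred = sorted(comments, key=lambda c: (-c.upvotes, c.date))[:limit]
    # Display order: by date.
    shown = sorted(preferred, key=lambda c: c.date)
    return shown, len(comments) - len(shown)


# Six plain comments, followed by a later comment with 3 upvotes.
comments = [Comment(i, 0, "2009-12-%02d" % i) for i in range(1, 7)]
comments.append(Comment(7, 3, "2009-12-07"))

shown, hidden = visible_comments(comments)
print([c.id for c in shown], hidden)  # [1, 2, 3, 4, 7] 2
```

Here the upvoted comment 7 displaces comments 5 and 6, and the hidden count is what a "2 more comments" link would show.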

If any denormalizations would make this better, what are they? Which indexes are critical?

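Source

2009-12-16 Sam Saffron

@Mark: SO is set up so that questions and answers live in the same table. – 2009-12-16 23:03:10

SO has questions, answers, and comments. What is a "post"? Are they questions? Answers? Both? How do I know which posts belong to which question? – 2009-12-16 23:03:56

@OMG Ponies: OK, I didn't know that. – 2009-12-16 23:04:33

Use: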

WITH post_hierarchy AS (
    SELECT p.id,
           p.content,
           p.parent_id,
           1 AS post_level
    FROM POSTS p
    WHERE p.parent_id IS NULL
    UNION ALL
    SELECT p.id,
           p.content,
           p.parent_id,
           ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)
SELECT ph.id,
       ph.post_level,
       c.upvotes,
       c.body
FROM COMMENTS c
JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date

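A few things to be aware of:

Stack Overflow displays the first 5 comments; it does not matter whether they were upvoted or not. Subsequent comments that were upvoted are displayed immediately.
You cannot accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted would only return the first five rows based on the ORDER BY clause.

Source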

2009-12-16 22:59:15

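I wouldn't bother filtering the comments in SQL (which may surprise you, because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.

It is actually quite rare for a given post to have more than five comments, so they seldom need filtering. In Stack Overflow's October data dump, 78% of posts have zero or one comment, and 97% of posts have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have more than 100 comments.

So writing complex SQL to do that kind of filtering would add complexity to every query for posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.

You could do it this way: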

SELECT q.PostId, a.PostId, c.CommentId 
FROM Posts q 
LEFT OUTER JOIN Posts a 
    ON (a.ParentId = q.PostId) 
LEFT OUTER JOIN Comments c 
    ON (c.PostId IN (q.PostId, a.PostId)) 
WHERE q.PostId = 1234 
ORDER BY q.PostId, a.PostId, c.CommentId;

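But this gives you redundant copies of the q and a columns, which is significant because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the application becomes significant.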
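
To make the duplication visible on a tiny data set, here is a sketch that runs the joined query in an in-memory SQLite database from Python. SQLite and the sample rows (one question with two answers, one comment on each post) are assumptions made purely for illustration; this is not the engine or the data used for the measurements in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, ParentId INTEGER);
    CREATE TABLE Comments (CommentId INTEGER PRIMARY KEY, PostId INTEGER);
    -- One question (1234) with two answers.
    INSERT INTO Posts VALUES (1234, NULL), (1235, 1234), (1236, 1234);
    -- One comment on the question, one on each answer.
    INSERT INTO Comments VALUES (1, 1234), (2, 1235), (3, 1236);
""")

joined = conn.execute("""
    SELECT q.PostId, a.PostId, c.CommentId
    FROM Posts q
    LEFT OUTER JOIN Posts a ON (a.ParentId = q.PostId)
    LEFT OUTER JOIN Comments c ON (c.PostId IN (q.PostId, a.PostId))
    WHERE q.PostId = 1234
    ORDER BY q.PostId, a.PostId, c.CommentId
""").fetchall()

for row in joined:
    print(row)
# (1234, 1235, 1)
# (1234, 1235, 2)
# (1234, 1236, 1)
# (1234, 1236, 3)
```

Three comments come back as four rows: the question's comment (CommentId 1) is repeated once per answer, and the question's columns appear in every row. With real text columns in the select list, that repetition is what gets shipped to the application.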

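So it is probably better not to fetch everything in that one joined query. Instead, given that the client is viewing a question with PostId = 1234, run the following: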

SELECT c.PostId, c.Text 
FROM Comments c 
JOIN (SELECT 1234 AS PostId UNION ALL 
    SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p 
    ON (c.PostId = p.PostId);

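Then sort through them in application code, collecting them by the post they refer to and filtering out the extra comments beyond the five most interesting ones per post.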
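
The application-side step described above could look roughly like the following Python sketch. The function name `group_and_trim` is hypothetical, the rows are assumed to also carry a CommentId (the query as written selects only PostId and Text), and "most interesting" is approximated here as "the first five by CommentId", which is a placeholder rather than Stack Overflow's actual rule.

```python
from collections import defaultdict


def group_and_trim(rows, limit=5):
    """Group flat (post_id, comment_id, text) rows by post and cap each group.

    Returns {post_id: (kept_comments, hidden_count)}.
    """
    by_post = defaultdict(list)
    for post_id, comment_id, text in rows:
        by_post[post_id].append((comment_id, text))

    result = {}
    for post_id, items in by_post.items():
        items.sort()  # by CommentId, i.e. roughly oldest first
        result[post_id] = (items[:limit], max(0, len(items) - limit))
    return result


# Seven comments on post 1234 and one comment on post 1235.
rows = [(1234, n, "comment %d" % n) for n in range(1, 8)]
rows.append((1235, 8, "comment 8"))

trimmed = group_and_trim(rows)
print(trimmed[1234][1], trimmed[1235][1])  # 2 0
```

The hidden count per post is what a "show N more comments" link needs, so no extra query is required for it.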

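I tested these two queries against a MySQL 5.1 database loaded with Stack Overflow's data dump from October. The first query takes about 50 seconds. The second query is practically instantaneous (after I pre-cached the indexes for the Posts and Comments tables).

The bottom line is that insisting on fetching all the data you need with a single SQL query is an artificial requirement (probably based on the misconception that the round trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is the less efficient solution. Do you try to write all your application code in a single function? :-)

Source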

2009-12-16 23:26:09

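I agree with you. My implementation is actually a slight optimization of this: I store comment_count on the posts table. On the client I pull all the posts for rendering, go through them, and then do a select * from comments where post_id in (id1, id2, id3), covering all posts with more than 0 comments. This keeps things super simple and very efficient in the general case. – 2009-12-16 23:40:50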
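
A minimal sketch of the comment_count approach from the comment above, using Python's sqlite3 module. The table layout, sample rows, and the helper name `comments_for_question` are assumptions for illustration; only the idea (a denormalized counter on posts, then one `IN (...)` lookup) comes from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, parent_id INTEGER,
                        comment_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, NULL, 2), (2, 1, 0), (3, 1, 1);
    INSERT INTO comments VALUES (10, 1, 'a'), (11, 1, 'b'), (12, 3, 'c');
""")


def comments_for_question(conn, question_id):
    # Query 1: the question and its answers, with the denormalized counter.
    posts = conn.execute(
        "SELECT id, comment_count FROM posts WHERE id = ? OR parent_id = ?",
        (question_id, question_id),
    ).fetchall()
    ids = [post_id for post_id, count in posts if count > 0]
    if not ids:
        return []  # nothing has comments, so skip the second round trip
    # Query 2: a single IN (...) lookup, only for posts that have comments.
    placeholders = ", ".join("?" for _ in ids)
    return conn.execute(
        "SELECT post_id, id, body FROM comments "
        "WHERE post_id IN (%s) ORDER BY post_id, id" % placeholders,
        ids,
    ).fetchall()


print(comments_for_question(conn, 1))
# [(1, 10, 'a'), (1, 11, 'b'), (3, 12, 'c')]
```

The counter has to be kept in sync whenever a comment is added or deleted, which is the price of this denormalization.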

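The real problem is not the query but the schema, specifically the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer?). I interpreted the requirements as "pull 5 comments per post (answer or question), giving preference to upvoted ones first and then to newer ones". I know this is not how SO shows comments, but you have to define your requirements more precisely.

Here is my query: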

declare @postId int; 
set @postId = ?; 

with cteQuestionAndReponses as (
    select post_id 
    from Posts 
    where post_id = @postId 
    union all 
    select post_id 
    from Posts 
    where parent_id = @postId) 
select * from 
cteQuestionAndReponses p 
outer apply (
    select count(*) as CommentsCount 
    from Comments c 
    where is_deleted = 0 
    and c.post_id = p.post_id) as cc 
outer apply (
    select top(5) * 
    from Comments c 
    where is_deleted = 0 
    and p.post_id = c.post_id 
    order by upvotes desc, date desc 
) as c

I have some 14K posts and 67K comments in my test tables; the query fetches the posts in 7 ms:

Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times: 
    CPU time = 0 ms, elapsed time = 7 ms.
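
OUTER APPLY is specific to SQL Server. On engines that support window functions, a similar "top 5 per post plus total count" result can be sketched with ROW_NUMBER() and COUNT(*) OVER. The example below is such a sketch, run from Python against in-memory SQLite (assumed to be version 3.25 or newer); it is a swapped-in technique, not the query measured above, and unlike OUTER APPLY it returns no row for posts that have no comments. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Comments (comment_id INTEGER PRIMARY KEY, post_id INTEGER, "
    "upvotes INTEGER, date TEXT, is_deleted INTEGER DEFAULT 0)"
)
# Post 1 gets seven comments, post 2 gets one; only comment 7 is upvoted.
rows = [(n, 1, 1 if n == 7 else 0, "2009-12-%02d" % n, 0) for n in range(1, 8)]
rows.append((8, 2, 0, "2009-12-08", 0))
conn.executemany("INSERT INTO Comments VALUES (?, ?, ?, ?, ?)", rows)

top_five = conn.execute("""
    SELECT post_id, comment_id, comments_count
    FROM (
        SELECT post_id, comment_id,
               COUNT(*) OVER (PARTITION BY post_id) AS comments_count,
               ROW_NUMBER() OVER (PARTITION BY post_id
                                  ORDER BY upvotes DESC, date DESC) AS rn
        FROM Comments
        WHERE is_deleted = 0
    )
    WHERE rn <= 5
    ORDER BY post_id, rn
""").fetchall()

print(top_five)
# [(1, 7, 7), (1, 6, 7), (1, 5, 7), (1, 4, 7), (1, 3, 7), (2, 8, 1)]
```

Each returned row carries the total comment count for its post, so the application can compute how many comments remain hidden.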

Here is the schema I tested with:

create table Posts (
post_id int identity (1,1) not null 
, content varchar(max) not null 
, parent_id int null -- (null for questions, question_id for answer) 
, constraint fkPostsParent_id 
    foreign key (parent_id) 
    references Posts(post_id) 
, constraint pkPostsId primary key nonclustered (post_id) 
); 
create clustered index cdxPosts on 
    Posts(parent_id, post_id); 
go 

create table Comments (
comment_id int identity(1,1) not null 
, body varchar(max) not null 
, is_deleted bit not null default 0 
, post_id int not null 
, upvotes int not null default 0 
, date datetime not null default getutcdate() 
, constraint pkComments primary key nonclustered (comment_id) 
, constraint fkCommentsPostId 
    foreign key (post_id) 
    references Posts(post_id) 
); 
create clustered index cdxComments on 
    Comments (is_deleted, post_id, upvotes, date, comment_id); 
go

And here is my test data generation:

insert into Posts (content) 
select 'Lorem Ipsum' 
from master..spt_values; 

insert into Posts (content, parent_id) 
select 'Ipsum Lorem', post_id 
from Posts p 
cross apply (
    select top(abs(checksum(newid(), p.post_id)) % 10) Number 
    from master..spt_values) as r 
where parent_id is NULL 

insert into Comments (body, is_deleted, post_id, upvotes, date) 
select 'Sit Amet' 
    -- 5% deleted comments 
    , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end 
    , p.post_id 
    -- up to 9 upvotes 
    , abs(checksum(newid(), p.post_id, r.Number)) % 10 
    -- comments up to 1 year old 
    , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate()) 
from Posts p 
cross apply (
    select top(abs(checksum(newid(), p.post_id)) % 10) Number 
    from master..spt_values) as r

Source

2009-12-17 01:00:57
