T-SQL中的随机加权选择

15

丹恩的答案包括一个引入平方律的自我连接。 (n*n/2)行之后的表中有n行。

更理想的是能够解析表格一次。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END, 
    @weight_point = @weight_point - [table].weight 
FROM 
    @table [table] 
ORDER BY 
    [table].Weight DESC

这将通过表格，设置@id每个记录的id值，而在同一时间递减@weight点。最终，@weight_point将消极。这意味着前面所有权重的SUM大于随机选择的目标值。这是我们想要的记录，因此从那时起，我们将@id设置为自己（忽略表中的任何ID）。

这只会贯穿表格一次，但即使所选值是第一条记录，也必须贯穿整个表格。因为平均仓位是通过表的一半（和更少，如果下令升重量）编写循环也可能会被更快......（特别是如果该权重是在公共组）：

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT @next_weight = MAX(weight) FROM @table 
SELECT @row_count = COUNT(*) FROM @table 
SET @weight_point = @weight_point - (@next_weight * @row_count) 

WHILE (@weight_point > 0) 
BEGIN 
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight 
    SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight 
    SET @weight_point = @weight_point - (@next_weight * @row_count) 
END 

-- # Once the @weight_point is less than 0, we know that the randomly chosen record 
-- # is in the group of records WHERE [table].weight = @next_weight 

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END, 
    @row_count = @row_count - 1 
FROM 
    @table [table] 
WHERE 
    [table].weight = @next_weight 
ORDER BY 
    [table].Weight DESC

来源

2009-01-18 01:26:24 MatBailie

+0

我做了一些经验性测试，发现您的解决方案对输入数据非常敏感。我的测试数据 - 权重：2，998，迭代次数：1M。重量2应该拿起约2k次。如果表中权重的顺序是升序（2,998），那么它的权重2就是大约500次。如果您反转订单，则约为2500次。如果您将`ROUND`更改为`FLOOR`，升序约为1000次，重量降低约2000倍。这是适当的概率。我已经更新了你的答案。 – 2016-09-01 19:01:46

7

您只需要对所有准备行的权重求和，然后在该总和中选择一个随机点，然后选择与该选定点协调的记录（每个记录递增地携带累加权重和）。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT TOP 1 @id = t1.id 
FROM @table t1, @table t2 
WHERE t1.id >= t2.id 
GROUP BY t1.id 
HAVING SUM(t2.weight) >= @weight_point 
ORDER BY t1.id 

SELECT @id

来源

2008-09-12 07:45:06 Dane

+1

我做了一个小您和MatBailie解决方案的基准，看起来您的时间大约是Mat的两倍。在2行和1千万次迭代的桌子上，Mat的解决方案耗时约45秒，解决方案约需85秒。 – 2016-09-01 18:40:50

3

的“逐步背着一个accumlating [原文]加权求和”部分是昂贵的，如果你有很多的记录。如果你已经有了很大的分数/重量范围（例如：范围足够宽，大多数记录重量是独一无二的，1-5颗星可能不会削减它），你可以做这样的事情来选择一个重量值。我使用VB.Net在这里展示，但是这可以很容易地在纯SQL来完成，以及：

Function PickScore() 
    'Assume we have a database wrapper class instance called SQL and seeded a PRNG already 
    'Get count of scores in database 
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]") 
    ' You could also approximate this with just the number of records in the table, which might be faster. 

    'Random number between 0 and 1 with ScoreCount possible values 
    Dim rand As Double = Random.GetNext(ScoreCount)/ScoreCount 

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores 
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1 
    rand = 1 - (rand * rand * rand) 

    'Now we need to map the (0,1] vector to [1,Maxscore]. 
    'Just find MaxScore and mutliply by rand 
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs") 
    Return MaxScore * rand 
End Function

运行它，并选择具有最大记录得分比返回的重量更轻。如果分数超过一个记录，请随机挑选。这里的优点是你不必保留任何总和，并且你可以调整适合你的口味的概率公式。但是，再次，分数越大，它的效果就越好。

来源

2008-09-12 13:41:07

2

使用随机数生成器完成此操作的方法是集成概率密度函数。使用一组离散值可以计算前缀和（所有值的总和），并将其存储起来。使用此选项，您可以选择大于随机数的最小前缀总和（聚合到日期）值。

在数据库上插入后的后续值必须更新。如果数据集的更新和大小的相对频率没有使得这样做成本过高，这意味着可以从单个s-argable（可以通过索引查找来解析的谓词）查询中获取适当的值。

来源

2008-09-18 14:57:55 ConcernedOfTunbridgeWells