2008-09-12 119 views
21

如何根据所有候选行的应用权重在T-SQL中随机选择一个表行?T-SQL中的随机加权选择

例如,我有一个表格中的一组行,其权重分别为50,25和25(合计为100,但不需要),我想随机选择其中的一个统计结果相当于相应的重量。

回答

15

丹恩的答案包括一个引入平方律的自我连接。 (n*n/2)行之后的表中有n行。

更理想的是能够解析表格一次。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END, 
    @weight_point = @weight_point - [table].weight 
FROM 
    @table [table] 
ORDER BY 
    [table].Weight DESC 

这将通过表格,设置@id每个记录的id值,而在同一时间递减@weight点。最终,@weight_point将消极。这意味着前面所有权重的SUM大于随机选择的目标值。这是我们想要的记录,因此从那时起,我们将@id设置为自己(忽略表中的任何ID)。

这只会贯穿表格一次,但即使所选值是第一条记录,也必须贯穿整个表格。因为平均仓位是通过表的一半(和更少,如果下令升重量)编写循环也可能会被更快......(特别是如果该权重是在公共组):

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT @next_weight = MAX(weight) FROM @table 
SELECT @row_count = COUNT(*) FROM @table 
SET @weight_point = @weight_point - (@next_weight * @row_count) 

WHILE (@weight_point > 0) 
BEGIN 
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight 
    SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight 
    SET @weight_point = @weight_point - (@next_weight * @row_count) 
END 

-- # Once the @weight_point is less than 0, we know that the randomly chosen record 
-- # is in the group of records WHERE [table].weight = @next_weight 

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1), 0) 

SELECT 
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END, 
    @row_count = @row_count - 1 
FROM 
    @table [table] 
WHERE 
    [table].weight = @next_weight 
ORDER BY 
    [table].Weight DESC 
+0

我做了一些经验性测试,发现您的解决方案对输入数据非常敏感。我的测试数据 - 权重:2,998,迭代次数:1M。重量2应该拿起约2k次。如果表中权重的顺序是升序(2,998),那么它的权重2就是大约500次。如果您反转订单,则约为2500次。如果您将`ROUND`更改为`FLOOR`,升序约为1000次,重量降低约2000倍。这是适当的概率。我已经更新了你的答案。 – 2016-09-01 19:01:46

7

您只需要对所有准备行的权重求和,然后在该总和中选择一个随机点,然后选择与该选定点协调的记录(每个记录递增地携带累加权重和)。

DECLARE @id int, @weight_sum int, @weight_point int 
DECLARE @table TABLE (id int, weight int) 

INSERT INTO @table(id, weight) VALUES(1, 50) 
INSERT INTO @table(id, weight) VALUES(2, 25) 
INSERT INTO @table(id, weight) VALUES(3, 25) 

SELECT @weight_sum = SUM(weight) 
FROM @table 

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0) 

SELECT TOP 1 @id = t1.id 
FROM @table t1, @table t2 
WHERE t1.id >= t2.id 
GROUP BY t1.id 
HAVING SUM(t2.weight) >= @weight_point 
ORDER BY t1.id 

SELECT @id 
+1

我做了一个小您和MatBailie解决方案的基准,看起来您的时间大约是Mat的两倍。在2行和1千万次迭代的桌子上,Mat的解决方案耗时约45秒,解决方案约需85秒。 – 2016-09-01 18:40:50

3

“逐步背着一个accumlating [原文]加权求和”部分是昂贵的,如果你有很多的记录。如果你已经有了很大的分数/重量范围(例如:范围足够宽,大多数记录重量是独一无二的,1-5颗星可能不会削减它),你可以做这样的事情来选择一个重量值。我使用VB.Net在这里展示,但是这可以很容易地在纯SQL来完成,以及:

Function PickScore() 
    'Assume we have a database wrapper class instance called SQL and seeded a PRNG already 
    'Get count of scores in database 
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]") 
    ' You could also approximate this with just the number of records in the table, which might be faster. 

    'Random number between 0 and 1 with ScoreCount possible values 
    Dim rand As Double = Random.GetNext(ScoreCount)/ScoreCount 

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores 
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1 
    rand = 1 - (rand * rand * rand) 

    'Now we need to map the (0,1] vector to [1,Maxscore]. 
    'Just find MaxScore and mutliply by rand 
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs") 
    Return MaxScore * rand 
End Function 

运行它,并选择具有最大记录得分比返回的重量更轻。如果分数超过一个记录,请随机挑选。这里的优点是你不必保留任何总和,并且你可以调整适合你的口味的概率公式。但是,再次,分数越大,它的效果就越好。

2

使用随机数生成器完成此操作的方法是集成概率密度函数。使用一组离散值可以计算前缀和(所有值的总和),并将其存储起来。使用此选项,您可以选择大于随机数的最小前缀总和(聚合到日期)值。

在数据库上插入后的后续值必须更新。如果数据集的更新和大小的相对频率没有使得这样做成本过高,这意味着可以从单个s-argable(可以通过索引查找来解析的谓词)查询中获取适当的值。