2010-11-08 50 views
6

我试图用广度优先算法来获取连接的整个集群选定用户所属的方式如下:分析大图 - 检索集群和计算最强路径

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL) 
    AS 
    BEGIN 
     -- Automatically rollback the transaction if something goes wrong. 
     SET XACT_ABORT ON 
     BEGIN TRAN 

     -- Increase performance and do not intefere with the results. 
     SET NOCOUNT ON; 

     -- Create a temporary table for storing the discovered nodes as the algorithm runs 
     CREATE TABLE #Discovered 
     (
      DiscoveredUser varchar(50) NOT NULL, -- The Node Id 
      Predecessor varchar(50) NULL, -- The node we came from to get to this node. 
      LinkStrength decimal(10,7) NULL, -- positive - from predecessor to DiscoveredUser, negative - from DiscoveredUser to predecessor 
      OrderDiscovered int -- The order in which the nodes were discovered. 
     ) 

     -- Initially, only the start node is discovered. 
     INSERT INTO #Discovered (DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered) 
     VALUES (@StartNode, NULL, NULL, 0) 

     -- Add all nodes that we can get to from the current set of nodes, 
     -- that are not already discovered. Run until no more nodes are discovered. 
     WHILE @@ROWCOUNT > 0 
     BEGIN 
      -- If we have found the node we were looking for, abort now. 
      IF @EndNode IS NOT NULL 
       IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE DiscoveredUser = @EndNode) 
        BREAK 

      -- We need to group by ToNode and select one FromNode since multiple 
      -- edges could lead us to new same node, and we only want to insert it once. 
      INSERT INTO #Discovered (DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered) 
      (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1 
      FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party 
      WHERE mc.called_party NOT IN (SELECT DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength 
      UNION 
      SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1 
      FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party 
      WHERE mc.calling_party NOT IN (SELECT DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength 
      ) 
     END; 

     -- Select the results. We use a recursive common table expression to 
     -- get the full path from the start node to the current node. 
     WITH BacktraceCTE(Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path) 
     AS 
     (
      -- Anchor/base member of the recursion, this selects the start node. 
      SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
       CAST(n. DiscoveredUser AS varchar(MAX)) 
      FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser 
      WHERE d. DiscoveredUser = @StartNode 

      UNION ALL 

      -- Recursive member, select all the nodes which have the previous 
      -- one as their predecessor. Concat the paths together. 
      SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
       CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX)) 
      FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser 
      JOIN users n ON d. DiscoveredUser = n. DiscoveredUser 
     ) 

     SELECT Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE 
     WHERE DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce 
     ORDER BY OrderDiscovered    -- a bad execution plan, but I use it for simplicity here. 

     DROP TABLE #Discovered 
     COMMIT TRAN 
     RETURN 0 
    END 

曲线图(社交网络),我目前正在分析有28M连接,并且不忽略弱连接(使用@LinkStrength设置阈值)执行运行很长时间(到目前为止,我没有得到任何结果,并且会尝试让它在一夜之间运行)。

无论如何,下一步是计算两个用户之间的最短(最强)的链接(大约有3M用户)。我正在考虑使用Djikstra算法,但我不确定是否可以在我目前使用的PC(四核CPU 2.66 GHz,4 GB RAM)上分析此类网络,并将数据存储在MS SQL Server 2008数据库中。

总之我希望得到一些答案/以下问题的建议:

  1. 是否有可能复杂如一个以上的 描述图(28M连接,3M执行 查询 用户)在描述的机器 (2.66 GHz,4GB RAM)?
  2. 如果不可能,有没有 执行时间可以减少 (例如创建具有部分 结果的表)?
  3. 您是否推荐任何其他 算法用于检测群集 并计算所描述的图的最短路径为 ?

谢谢!

+0

如果你能想到的一些好的启发由猜测在哪个环节比其他人更容易,那么你可能想尝试A * ... – fmark 2010-11-08 13:55:45

+0

你几乎可以肯定I /比CPU的O或内存绑定,添加主轴并分发临时文件和数据文件将提高SQL性能。即使进行了优化,从根本上说,您也正在执行SQL难以优化的任务,并且随着数据集的增长,无疑将继续打击I/O瓶颈。大多数时间敏感的民众在这一点上达到了地图缩小的效果。 – stephbu 2011-02-22 16:52:30

回答

0

无论您是想要一个确切的答案,无论您首先选择宽度还是深度优先都无关紧要。准确的答案将需要详尽的搜索,这将是缓慢的。

就像fmark所暗示的那样,启发式可以帮助您找到具有合理确定度的潜在最大解决方案。它会为你节省很多时间,但并不准确。

你必须选择速度或精确度,你不能真正拥有两者。它就像照片(或声音或视频)的图像压缩:对于大多数自然场景的照片,png是无损的,但不会压缩太多,jpeg压缩比较好,但有一些损失。

编辑1:我能想到的唯一的东西可以帮助你的确切搜索是稀疏矩阵的数学理论。你的问题类似于将社会关系强度矩阵提升到一系列不同的权力(权力n = n之间的人A和B之间的步骤)并找出哪些单元格对于每个(A,B)对具有最高值。这就是你在查询中所做的事情,只有一个数据库查询可能不是达到这个目的的最快方法。

虽然我无法帮助你。您可能想查看Wikipedia for Sparse Matrix

编辑2:我只是想了一件事。我看不出如何筛选出你知道在SQL查询中肯定很弱的分支,而使用量身定制的算法处理稀疏矩阵时,应该很容易找出你知道可以消除的分支,在你的力量模型上。

1

首先,使用指数

其次,您需要减少您的内存需求。这意味着首先为你的VARCHAR(50)提供一个简短的别名,比如int是4个字节而不是50个。这会让你的速度提高10倍。

declare @tmpPeople table(
    ixPerson int identity primary key, 
    UserNodeID varchar(50), 
    unique(UserNodeID, ix) -- this creates an index 
) 
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable 
declare @relationships table(
    ixParent int, 
    ixChild int, 
    unique(ixParent, ixChild), 
    unique(ixChild, ixParent) 
) 
insert @relationships(ixParent, ixChild) 
select distinct p.ixPerson, c.ixPerson 
from NormalRelationshipsTable nr 
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID 
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID 

-- OK now got a copy of all relationships, but it is a fraction of the size 
-- because we are using integers for the keys. 
-- if we need to we can insert the reverse relationships too. 

你需要编写一个你想做的事情,而不是“通用”的查询。

如果要查找两个节点之间的最短距离,可以通过从两端搜索一次来缩短搜索时间。

declare @p1 table(
ix int identity primary key, 
ixParent int, 
ixChild int, 
nDeep int, 
-- Need indexes 
unique(ixParent, ixChild, nDeep, ix), 
unique(ixChild, ixParent, nDeep, ix) 
) 
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out. 
-- define @p2 the same 

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0 
insert @p2 ..... @ixUserTo, @ixUserTo, 0 

-- Breadth first goes like this. 
-- Just keep repeating it till you get as far as you want. 
insert @p1 (ixParent, ixChild, nDeep) 
select 
p1.ixChild, r.ixChild, p1.nDeep+1 
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild 
-- may want to exclude those already in the table 
where not exists (
    select 1 from @p1 p1 
    where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild 
) 

对于“从Alice到Bob距离”你做两个并联的搜索,当Alice的搜索包含任何人也包含在Bob的搜索停止。这也将您的时间减少n^2倍,其中n是平均连接数。

如果深度过大,不要忘记停止。

0

在进行分析之前,也许这将有助于首先迁移到Graph DB。我没有亲自使用他们,但建议我尝试neo4j

HTH