我试图用广度优先算法来获取连接的整个集群选定用户所属的方式如下:分析大图 - 检索集群和计算最强路径
CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
AS
BEGIN
-- Automatically rollback the transaction if something goes wrong.
SET XACT_ABORT ON
BEGIN TRAN
-- Increase performance and do not intefere with the results.
SET NOCOUNT ON;
-- Create a temporary table for storing the discovered nodes as the algorithm runs
CREATE TABLE #Discovered
(
DiscoveredUser varchar(50) NOT NULL, -- The Node Id
Predecessor varchar(50) NULL, -- The node we came from to get to this node.
LinkStrength decimal(10,7) NULL, -- positive - from predecessor to DiscoveredUser, negative - from DiscoveredUser to predecessor
OrderDiscovered int -- The order in which the nodes were discovered.
)
-- Initially, only the start node is discovered.
INSERT INTO #Discovered (DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
VALUES (@StartNode, NULL, NULL, 0)
-- Add all nodes that we can get to from the current set of nodes,
-- that are not already discovered. Run until no more nodes are discovered.
WHILE @@ROWCOUNT > 0
BEGIN
-- If we have found the node we were looking for, abort now.
IF @EndNode IS NOT NULL
IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE DiscoveredUser = @EndNode)
BREAK
-- We need to group by ToNode and select one FromNode since multiple
-- edges could lead us to new same node, and we only want to insert it once.
INSERT INTO #Discovered (DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
(SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
WHERE mc.called_party NOT IN (SELECT DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
UNION
SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
WHERE mc.calling_party NOT IN (SELECT DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
)
END;
-- Select the results. We use a recursive common table expression to
-- get the full path from the start node to the current node.
WITH BacktraceCTE(Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path)
AS
(
-- Anchor/base member of the recursion, this selects the start node.
SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
CAST(n. DiscoveredUser AS varchar(MAX))
FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
WHERE d. DiscoveredUser = @StartNode
UNION ALL
-- Recursive member, select all the nodes which have the previous
-- one as their predecessor. Concat the paths together.
SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
)
SELECT Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
WHERE DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
ORDER BY OrderDiscovered -- a bad execution plan, but I use it for simplicity here.
DROP TABLE #Discovered
COMMIT TRAN
RETURN 0
END
曲线图(社交网络),我目前正在分析有28M连接,并且不忽略弱连接(使用@LinkStrength设置阈值)执行运行很长时间(到目前为止,我没有得到任何结果,并且会尝试让它在一夜之间运行)。
无论如何,下一步是计算两个用户之间的最短(最强)的链接(大约有3M用户)。我正在考虑使用Djikstra算法,但我不确定是否可以在我目前使用的PC(四核CPU 2.66 GHz,4 GB RAM)上分析此类网络,并将数据存储在MS SQL Server 2008数据库中。
总之我希望得到一些答案/以下问题的建议:
- 是否有可能复杂如一个以上的 描述图(28M连接,3M执行 查询 用户)在描述的机器 (2.66 GHz,4GB RAM)?
- 如果不可能,有没有 执行时间可以减少 (例如创建具有部分 结果的表)?
- 您是否推荐任何其他 算法用于检测群集 并计算所描述的图的最短路径为 ?
谢谢!
如果你能想到的一些好的启发由猜测在哪个环节比其他人更容易,那么你可能想尝试A * ... – fmark 2010-11-08 13:55:45
你几乎可以肯定I /比CPU的O或内存绑定,添加主轴并分发临时文件和数据文件将提高SQL性能。即使进行了优化,从根本上说,您也正在执行SQL难以优化的任务,并且随着数据集的增长,无疑将继续打击I/O瓶颈。大多数时间敏感的民众在这一点上达到了地图缩小的效果。 – stephbu 2011-02-22 16:52:30