How to speed up a program that finds the shortest path between two Wikipedia articles

I recently wrote a program that finds the shortest path between two Wikipedia articles. The problem is that fetching all the links on a page and adding them to the graph takes a very long time; finding the path itself is the easy part. Basically, what I am doing is this:
    import networkx as nx

    startingPage = 'Lisbon'
    target = 'Adolf Hitler'
    graph = nx.DiGraph()
    graph.add_node(startingPage)
    found = pages.import_page(graph, startingPage, target)
    while not found:
        for node in list(graph):
            if graph.out_degree(node) == 0:
                found = pages.import_page(graph, node, target)
                if found:
                    break
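The `while`/`for` loop above rescans every node in the graph on each pass just to find unexpanded ones. A breadth-first search that keeps an explicit frontier queue and a visited set avoids that rescan entirely. A minimal self-contained sketch, using a hypothetical `LINKS` table in place of the Wikipedia API:

```python
from collections import deque

# Toy link table standing in for the Wikipedia API (hypothetical data,
# only so the sketch runs on its own).
LINKS = {
    'Lisbon': ['Portugal', 'Europe'],
    'Portugal': ['Iberian Peninsula'],
    'Europe': ['World War II'],
    'World War II': ['Adolf Hitler'],
}

def bfs_path(start, target):
    """Breadth-first search that expands a frontier queue instead of
    rescanning the whole graph to find zero-out-degree nodes."""
    parents = {start: None}   # doubles as the visited set
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        if page == target:
            # Walk the parent pointers back to reconstruct the path.
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for link in LINKS.get(page, []):
            if link not in parents:   # skip already-visited pages
                parents[link] = page
                frontier.append(link)
    return None
```

Because each page is enqueued at most once, the bookkeeping is linear in the number of pages seen, and the `parents` dictionary gives back the actual path without storing a full `DiGraph`.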
And my import_page function is this one:
    import json
    import urllib.request as url
    import urllib.parse

    def import_page(graph, starting, target):
        general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
        data_str = general_str + urllib.parse.quote(starting)  # URL-encode the title
        response = url.urlopen(data_str)
        data = json.loads(response.read())
        page_id = list(data['query']['pages'].keys())[0]
        print(starting)
        if page_id == '-1':  # Check if the page doesn't exist in Wikipedia
            return False
        if 'links' not in data['query']['pages'][page_id]:  # Check if the page has no links in it
            return False
        for json_object in data['query']['pages'][page_id]['links']:
            graph.add_node(json_object['title'])
            graph.add_edge(starting, json_object['title'])
            if json_object['title'] == target:
                return True
        while 'batchcomplete' not in data:  # Keep following the continuation tokens
            continue_id = data['continue']['plcontinue']
            continue_str = data_str + '&plcontinue=' + urllib.parse.quote(continue_id)
            response = url.urlopen(continue_str)
            data = json.loads(response.read())
            for json_object in data['query']['pages'][page_id]['links']:
                graph.add_node(json_object['title'])
                graph.add_edge(starting, json_object['title'])
                if json_object['title'] == target:
                    return True
        return False
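Since import_page issues one HTTP request per page (plus one per continuation), most of the runtime is network round-trips. The MediaWiki query API accepts multiple pipe-separated titles per request (up to 50 for regular accounts), so link fetches can be batched. A sketch of building such a batched URL:

```python
import urllib.parse

API = 'https://en.wikipedia.org/w/api.php'

def batched_links_url(titles):
    """Build one links query URL for up to 50 titles, instead of
    issuing one request per page."""
    assert len(titles) <= 50  # MediaWiki limit for non-bot accounts
    params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'format': 'json',
        'titles': '|'.join(titles),  # pipe-separated title list
    }
    return API + '?' + urllib.parse.urlencode(params)
```

The response then contains one entry per page under `query.pages`, so one round-trip expands a whole frontier level rather than a single node.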
The problem is that for any two articles more than 2 or 3 links apart, it takes a huge amount of time. Any ideas on how I can speed it up?
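Another speedup that is independent of the search algorithm: the link requests are I/O-bound, so they can be issued concurrently. A minimal sketch with a thread pool, where `fetch_links` is a hypothetical stub standing in for the real API call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_links(title):
    """Stand-in stub for the real HTTP call; returns hypothetical data."""
    fake = {'Lisbon': ['Portugal'], 'Berlin': ['Germany']}
    return title, fake.get(title, [])

def fetch_all(titles, workers=8):
    """Run the (I/O-bound) link requests in parallel threads and
    collect the results into one dictionary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_links, titles))
```

With the real network call substituted in, expanding a frontier of N pages costs roughly one round-trip instead of N sequential ones, at the price of hitting the API harder (rate limits apply).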
Why do you care about the edges? Don't you just want to keep a list of active nodes and a list of visited nodes? Another question (possibly just due to my not knowing the graph library): is it valid to modify the graph while iterating over its nodes? –
Also, the way it was written here, your 'import_page' was really 'import_Adolf_Hitler_page' :) –
Yes, I've already updated it and forgot to update it here! –