2010-06-01 61 views
4
不同坐标

假设我有坐标列表:越来越最小的,由N以上的Python

data = [ 
    [(10, 20), (100, 120), (0, 5), (50, 60)], 
    [(13, 20), (300, 400), (100, 120), (51, 62)] 
] 

,我想采取的所有元组,要么出现在数据中的每个列表,或任何元组除以外的列表中的所有元组与相差3或更少。我怎样才能在Python中有效地做到这一点?

对于上面的例子,结果应该是:

[[(100, 120), # since it occurs in both lists 
    (10, 20), (13, 20), # since they differ by only 3 
    (50, 60), (51, 60)]] 

(0,5)和(300,400)将不被包括在内,因为它们没有出现在两个列表,并且不不同从列表以外的列表中的元素减去3或更少。

这怎么能算出来?谢谢。

+1

你有什么样的功能,两个元组并返回一个int? – 2010-06-01 22:46:39

+0

假设它是开始坐标和结束坐标之间的差异之和 – user248237dfsf 2010-06-01 23:04:29

+1

在我看来,“与其他列表中的所有元组不同于3或更少”意味着(10,20)不能出现在结果中,因为它比(300,400)多3。你的意思是说你应该包含一个项目,在三个内的匹配可以在另一个列表中找到?在所有其他清单中? – Pierce 2010-06-01 23:05:39

回答

0

我希望这会让你开始。任何改进将不胜感激。

出现在所有列表中很简单 - 只需取出列表中的所有元素的交集。

>>> data = [ 
...  [(10, 20), (100, 120), (0, 5), (50, 60)], 
...  [(13, 20), (300, 400), (100, 120), (51, 62)] 
... ] 
>>> dataset = [set(d) for d in data] 
>>> dataset[0].intersection(*dataset[1:]) 
set([(100, 120)]) 

具有比在同一列表中的那些元组其他出现我的曲线图/ 2D空间问题“通过在3个或更少的不同”。没有简单的算法,没有多项式算法,如果你的数据集不是很大,你可以迭代它们并将不在同一列表中的关闭点进行分组。

1

一个天真的执行会很慢:O(n^2),测试每个节点对每个节点。用树来加速。

该实现使用简单的四叉树使搜索更有效。这并没有试图平衡这棵树,所以一个非常有序的点列表可能会使它效率很低。对于很多用途来说,简单地对这个列表进行洗牌可能会使它足够好。只要确保不要传递很多按坐标排序的项目,因为这会将其减少到链接列表。

这里的优化很简单:如果我们要查找某个点的3个单位的欧几里得距离内的项目,并且我们知道子树中的所有项目都至少有3个单位,那么就没有办法该地区的积分可能少于3个单位。

此代码是公有领域。尽量不要把它作为家庭作业。

#!/usr/bin/python 
import math 

def euclidean_distance(pos1, pos2): 
    x = math.pow(pos1[0] - pos2[0], 2) 
    y = math.pow(pos1[1] - pos2[1], 2) 
    return math.sqrt(x + y) 

class QuadTreeNode(object): 
    def __init__(self, pos): 
     """ 
     Create a QuadTreeNode at the specified position. pos must be an (x, y) tuple. 
     Children are classified by quadrant. 
     """ 
     # Children of this node are ordered TL, TR, BL, BL (origin top-left). 
     self.children = [None, None, None, None] 
     self.pos = pos 

    def classify_node(self, pos): 
     """ 
     Return which entry in children can contain pos. If pos is equal to this 
     node, return None. 

     >>> node = QuadTreeNode((10, 20)) 
     >>> node.classify_node((10, 20)) == None 
     True 
     >>> node.classify_node((2, 2)) 
     0 
     >>> node.classify_node((50, 2)) 
     1 
     >>> node.classify_node((2, 50)) 
     2 
     >>> node.classify_node((50, 50)) 
     3 

     X boundary condition: 
     >>> node.classify_node((10, 2)) 
     0 
     >>> node.classify_node((10, 50)) 
     2 

     Y boundary conditoin: 
     >>> node.classify_node((2, 20)) 
     0 
     >>> node.classify_node((50, 20)) 
     1 
     """ 
     if pos == self.pos: 
      return None 
     if pos[0] <= self.pos[0]:  # Left 
      if pos[1] <= self.pos[1]: # Top-left 
       return 0 
      else:      # Bottom-left 
       return 2 
     else:       # Right 
      if pos[1] <= self.pos[1]: # Top-right 
       return 1 
      else:      # Bottom-right 
       return 3 
     assert False, "not reached" 

    def add_node(self, node): 
     """ 
     Add a specified point under this node. 
     """ 
     type = self.classify_node(node.pos) 
     if type is None: 
      # node is equal to self, so this is a duplicate node. Ignore it. 
      return 

     if self.children[type] is None: 
      self.children[type] = node 
     else: 
      # We already have a node there; recurse and add it to the child. 
      self.children[type].add_node(node) 

    @staticmethod 
    def CreateQuadTree(data): 
     """ 
     Create a quad tree from the specified list of points. 
     """ 
     root = QuadTreeNode(data[0]) 
     for val in data[1:]: 
      node = QuadTreeNode(val) 
      root.add_node(node) 

     return root 

    def distance_from_pos(self, pos): 
     return euclidean_distance(self.pos, pos) 

    def __str__(self): return str(self.pos) 

    def find_point_within_range(self, pos, distance): 
     """ 
     If a point exists within the specified Euclidean distance of the specified 
     point, return it. Otherwise, return None. 
     """ 
     if self.distance_from_pos(pos) <= distance: 
      return self 

     for axis in range(0, 4): 
      if self.children[axis] is None: 
       # We don't have a node on this axis. 
       continue 

      # If moving forward on this axis would permanently put us out of range of 
      # the point, short circuit the search on that axis. 
      if axis in (0, 2): # axis moves left on X 
       if self.pos[0] < pos[0] - distance: 
        continue 
      if axis in (1, 3): # axis moves right on X 
       if self.pos[0] > pos[0] + distance: 
        continue 
      if axis in (0, 1): # axis moves up on Y 
       if self.pos[1] < pos[1] - distance: 
        continue 
      if axis in (2, 3): # axis moves down on Y 
       if self.pos[1] > pos[1] + distance: 
        continue 
      node = self.children[axis].find_point_within_range(pos, distance) 
      if node is not None: 
       return node 
     return None 

    @staticmethod 
    def find_point_in_range_for_all_trees(point, trees, distance): 
     """ 
     If all QuadTreeNodes in trees contain a a point within the specified distance 
     of point, return True, Otherwise, return False. 
     """ 
     for tree in trees: 
      if tree.find_point_within_range(point, distance) is None: 
       return False 
     return True 

def test_naive(data, distance): 
    def find_point_in_list(iter, point): 
     for i in iter: 
      if euclidean_distance(i, point) <= distance: 
       return True 
     return False 

    def find_point_in_all_lists(point): 
     for d in data: 
      if not find_point_in_list(d, point): 
       return False 
     return True 

    results = [] 
    for d in data: 
     for point in d: 
      if find_point_in_all_lists(point): 
       results.append(point) 
    return set(results) 

def test_tree(data, distance): 
    trees = [QuadTreeNode.CreateQuadTree(d) for d in data] 
    results = [] 
    for d in data: 
     for point in d: 
      if QuadTreeNode.find_point_in_range_for_all_trees(point, trees, 3): 
       results.append(point) 
    return set(results) 

def test(): 
    sample_data = [ 
      [(10, 20), (100, 120), (0, 5), (50, 60)], 
      [(13, 20), (300, 400), (100, 120), (51, 62)] 
    ] 
    result1 = test_naive(sample_data, 3) 
    result2 = test_tree(sample_data, 3) 
    print result1 
    assert result1 == result2 

    # Loosely validate the tree algorithm against a lot of sample data, and compare 
    # performance while we're at it: 
    def random_data(): 
     import random 
     return [(random.randint(0,1000), random.randint(0,1000)) for d in range(0,500)] 
    data = [random_data() for x in range(0,10)] 

    print "Searching (naive)..." 
    result1 = test_naive(data, 3) 

    print "Searching (tree)..." 
    result2 = test_tree(data, 3) 
    assert result1 == result2 


if __name__ == "__main__": 
    test() 

    import doctest 
    doctest.testmod() 
0

@ barrycarter的直觉很有趣:减少比较的次数(其中由“比较”两点我们的意思是检查,如果他们的距离是<= 3),“虚拟切片” 2D平面到合适的“细胞”,因此每个点只需要与相邻“单元格”中的点进行比较。如果你的数据集很大,这可能真的有帮助(如果你需要一个强力解决方案,每个点都需要与基本上所有其他点进行比较)。

下面是这个想法的Python实现(因为Barry的代码草图似乎是Perl或其他),目的是为了清晰而不是速度...:

import collections 
import math 


def cellof(point): 
    x, y = point 
    return x//3, y//3 

def distance(p1, p2): 
    return math.hypot(p1[0]-p2[0], p1[1]-p2[1]) 

def process(data): 
    cells = collections.defaultdict(list) 
    for i, points in enumerate(data): 
     for p in points: 
      cx, cy = cellof(p) 
      cells[cx, cy].append((i, p)) 
    res = set() 
    for c, alist in cells.items(): 
     for i, p in alist: 
       for cx in range(c[0]-1, c[0]+2): 
        for cy in range(c[1]-1, c[1]+2): 
         otherc = cells[cx, cy] 
         for otheri, otherp in otherc: 
          if i == otheri: continue 
          dst = distance(p, otherp) 
          if dst <= 3: res.add(p) 
    return sorted(res) 

if __name__ == '__main__': # just an example 

    data = [ 
     [(10, 20), (100, 120), (0, 5), (50, 60)], 
     [(13, 20), (300, 400), (100, 120), (51, 62)] 
    ] 
    print process(data) 

当作为一个脚本来运行,这产生输出

[(10, 20), (13, 20), (50, 60), (51, 62), (100, 120)] 

。当然,以确定这是否是值得的,或者更简单的蛮力方法确实比较好,唯一可行的方法是为了让你在现实数据上运行这两种解决方案的基准测试数据 - 你的程序实际需要在现实生活中处理的数据集种类。取决于你有多少个列表,多少个点,间隔多么紧密,性能可能会有很大的不同,而且最好是测量而不是猜测! - )