2016-03-28 372 views
0

Extract city names from text using Python

I have a dataset in which one column is headed "What is your location and time zone?"

This means we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time. +10h UTC.

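or even

  • My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
  • For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

    Is there any way to extract the city, country and time zone from this?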

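    I am thinking of building an array (from an open-source dataset) of all country names (including short forms) along with city names and time zones. Then, if any word in a response matches a city, country, time zone or short form, I would write it to a new column in the same dataset and count it.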
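A minimal sketch of that lookup idea, using only the standard library. The three dictionaries are tiny hand-written stand-ins (an assumption for illustration); a real run would load them from an open dataset of countries, cities and time zones. `find_matches` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical, hand-written lookup tables mapping a lowercase key
# (full name or short form) to a canonical name.
COUNTRIES = {
    "denmark": "Denmark",
    "england": "England",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "australia": "Australia",
}
CITIES = {"london": "London", "boston": "Boston", "seoul": "Seoul"}
TIMEZONES = {"cet": "CET", "gmt": "GMT", "utc": "UTC", "edt": "EDT"}


def find_matches(text, lookup):
    """Return canonical names from `lookup` that occur in `text` as whole words."""
    found = []
    lowered = text.lower()
    for key, canonical in lookup.items():
        pattern = r"\b" + re.escape(key) + r"\b"
        if re.search(pattern, lowered) and canonical not in found:
            found.append(canonical)
    return found


rows = [
    "Denmark, CET",
    "Location is Devon, England, GMT time zone",
    "For the entire May I will be in London, United Kingdom (GMT+1).",
]

country_counts = Counter()
for row in rows:
    countries = find_matches(row, COUNTRIES)
    cities = find_matches(row, CITIES)
    zones = find_matches(row, TIMEZONES)
    country_counts.update(countries)
    print(countries, cities, zones)
```

Whole-word matching keeps a short key such as "uk" from firing inside longer words, but short forms that are also ordinary words would still need extra care.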

    Is this practical?

    =========== REPLY based on NLTK ANSWER ============

    Running the same code as alecxe, I get

    Traceback (most recent call last): 
        File "E:\SBTF\ntlk_test.py", line 19, in <module> 
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences] 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag 
        tagger = PerceptronTagger() 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__ 
        self.load(AP_MODEL_LOC) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load 
        self.model.weights, self.tagdict, self.classes = load(loc) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load 
        opened_resource = _open(resource_url) 
        File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open 
        return urlopen(resource_url) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen 
        return opener.open(url, data, timeout) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open 
        response = self._open(req, data) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open 
        'unknown_open', req) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain 
        result = func(*args) 
        File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open 
        raise URLError('unknown url type: %s' % type) 
    URLError: <urlopen error unknown url type: c> 
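The last line of the traceback suggests that a Windows path starting with a drive letter reached `urlopen`, which treats the text before the first colon as a URL scheme. Below is a minimal sketch of that behaviour using Python 3's `urllib.parse` (the traceback itself comes from Python 2.7). Both paths are made-up examples, not the real model location.

```python
from urllib.parse import urlparse

# Hypothetical Windows path, for illustration only.
windows_path = r"C:\nltk_data\taggers\example.pickle"

# The drive letter before the first colon is parsed as the URL scheme.
print(urlparse(windows_path).scheme)

# The same location written as a file URL carries an explicit "file" scheme.
file_url = "file:///C:/nltk_data/taggers/example.pickle"
print(urlparse(file_url).scheme)
```

If that reading is right, the failure lies in how the tagger model path is loaded on Windows rather than in the script itself, so trying a different NLTK version may be worth a check.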
    

    Answers

    4

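    I would use Natural Language Processing and what nltk has to offer to extract the entities.

    The example (heavily based on this gist) tokenizes each line of a file, splits it into chunks and recursively looks for the NE (named entity) label in each chunk. More explanation here.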

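    import nltk

    # Needs the NLTK tokenizer, POS tagger and NE chunker data (see nltk.download()).

    def extract_entity_names(t):
        """Recursively collect the text of every subtree labelled 'NE'."""
        entity_names = []

        # Leaves are (word, tag) tuples and have no label() method.
        if hasattr(t, 'label') and t.label:
            if t.label() == 'NE':
                # Join the words of the named-entity chunk into one string.
                entity_names.append(' '.join([child[0] for child in t]))
            else:
                for child in t:
                    entity_names.extend(extract_entity_names(child))

        return entity_names

    with open('sample.txt', 'r') as f:
        for line in f:
            # Sentence split -> word tokens -> POS tags -> NE chunks.
            sentences = nltk.sent_tokenize(line)
            tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
            tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
            # binary=True labels every entity 'NE' instead of PERSON, GPE, etc.
            chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

            entities = []
            for tree in chunked_sentences:
                entities.extend(extract_entity_names(tree))

            print(entities)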
    

    For a sample.txt containing:

    Denmark, CET 
    Location is Devon, England, GMT time zone 
    Australia. Australian Eastern Standard Time. +10h UTC. 
    My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone. 
    For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT) 
    

    it prints:

    ['Denmark', 'CET'] 
    ['Location', 'Devon', 'England', 'GMT'] 
    ['Australia', 'Australian Eastern Standard Time'] 
    ['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific'] 
    ['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT'] 
    

    The output is not ideal, but it might be a good start for you.
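One possible next step, sketched under stated assumptions, is to post-process the entity lists printed above: separate time zone names from places, drop obvious noise such as 'Location', and count places across rows. `TIMEZONE_NAMES`, `NOT_A_PLACE` and `split_entities` are hand-written, hypothetical names chosen for this illustration and are not part of nltk.

```python
from collections import Counter

# Entity lists copied from the output printed above.
extracted = [
    ['Denmark', 'CET'],
    ['Location', 'Devon', 'England', 'GMT'],
    ['Australia', 'Australian Eastern Standard Time'],
    ['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific'],
    ['London', 'United Kingdom', 'Norway', 'Israel', 'London',
     'United Kingdom', 'Boston', 'United States', 'EDT'],
]

# Assumed, hand-written helper sets; extend them for real data.
TIMEZONE_NAMES = {'CET', 'GMT', 'EDT', 'Pacific',
                  'Australian Eastern Standard Time'}
NOT_A_PLACE = {'Location'}


def split_entities(entities):
    """Split one row of entities into (places, time_zones), dropping known noise."""
    places, zones = [], []
    for name in entities:
        if name in NOT_A_PLACE:
            continue
        if name in TIMEZONE_NAMES:
            zones.append(name)
        else:
            places.append(name)
    return places, zones


place_counts = Counter()
for row in extracted:
    places, zones = split_entities(row)
    place_counts.update(set(places))  # count each place once per row
    print(places, zones)
```

Counting each place once per row keeps a respondent who mentions London twice from being counted as two respondents.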

    +2

    How does this work? It seems like witchcraft – Keatinge

    +2

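    @Racialz `nltk` is often amazing! I am far from an expert in NLP, but I have tried to add more explanation and links for further reading. Thanks for asking for details! – alecxe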

    +0

    Brilliant. I didn't know about NLTK - I'll experiment with this and then (hopefully) accept the answer :-) – GeorgeC
