Parsing tweets in Python

I am trying to parse tweet data.

My data looks like this:

59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s>&lt;a href="http://help.twitter.com/index.php?pg=kb.page&amp;id=75" rel="nofollow"&gt;txt&lt;/a&gt;</s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us &amp; canada)</t> <l>ga</l> 
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s>&lt;a href="http://alexking.org/projects/wordpress" rel="nofollow"&gt;twitter tools&lt;/a&gt;</s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us &amp; canada)</t> <l>madison, wi</l>

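I want the total number of tweets and the number of tweets related to my keywords. I have prepared the keywords in a text file. In addition, I want to get the text content of each related tweet, and the total number of tweets that contain a mention (@), a retweet (RT), or a URL (I want to save each URL in a separate file).

So I coded it like this.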

import time
import os

total_tweet_count = 0
related_tweet_count = 0
rt_count = 0
mention_count = 0
URLs = {}

def get_keywords(filepath):
    with open(filepath) as f:
        for line in f:
            yield line.split()

for line in open('/nas/minsu/2009_06.txt'):
    tweet = line.strip()

    total_tweet_count += 1

    with open('./related_tweets.txt', 'a') as save_file_1:
        keywords = get_keywords('./related_keywords.txt', 'r')

        if keywords in line:
            text = line.split('<t>')[1].split('</t>')[0]

            if 'http://' in text:
                try:
                    url = text.split('http://')[1].split()[0]
                    url = 'http://' + url

                    if url not in URLs:
                        URLs[url] = []
                    URLs[url].append('\t' + text)

                    save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                    print >> save_file_3, URLs

                except:
                    pass

            if '@' in text:
                mention_count += 1

            if 'RT' in text:
                rt_count += 1

            related_tweet_count += 1

            print >> save_file_1, text

    save_file_2 = open('./info_related_tweets.txt', 'w')

print >> save_file_2, str(total_tweet_count) + '\t' + srt(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)

save_file_1.close()
save_file_2.close()
save_file_3.close()

The keyword file looks like this:

Happy 
Hello 
Together

I think my code has many problems, but the first error is as follows:

Traceback (most recent call last): 
    File "health_related_tweets.py", line 21, in <module> 
    keywords = get_keywords('./public_health_related_words.txt', 'r') 
TypeError: get_keywords() takes exactly 1 argument (2 given)

Please help me!

2011-10-02 ooozooo

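The error message explains the problem: you pass two arguments when calling get_keywords(), but the function is defined with only one parameter. You should change your get_keywords implementation to something like: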

def get_keywords(filepath, mode):
    with open(filepath, mode) as f:
        for line in f:
            yield line.split()

Then you can use the following line without this particular error:

keywords = get_keywords('./related_keywords.txt', 'r')
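
As an alternative, sketched here rather than taken from the answer: open() defaults to read mode, so the call site could instead drop the 'r' argument and leave the one-parameter function from the question unchanged. The temporary file below is only a stand-in for related_keywords.txt so that the snippet runs on its own.

```python
import os
import tempfile

def get_keywords(filepath):
    # open() defaults to mode 'r', so no mode parameter is needed
    with open(filepath) as f:
        for line in f:
            yield line.split()

# Hypothetical stand-in for ./related_keywords.txt
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('Happy\nHello\nTogether\n')
tmp.close()

# One argument, matching the one-parameter definition: no TypeError
keywords = list(get_keywords(tmp.name))
os.remove(tmp.name)

print(keywords)  # [['Happy'], ['Hello'], ['Together']]
```

Note that each item produced by the generator is a list (the result of line.split()), not a string.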

2011-10-02 13:40:15

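Traceback (most recent call last): File "health_related_tweets.py", line 23, in <module> if keywords in line: TypeError: 'in <string>' requires string as left operand, not generator /// Now I get this error. Please help me! – ooozooo

@MINSUPARK 'get_keywords()' returns a generator, not a string, so when you call 'if keywords in line:' you get the error because 'keywords' is not a string. –

@CodyHess Then how can I fix it? Actually, I am a beginner... so I need your help! – ooozooo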

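Now you are getting this error:

Traceback (most recent call last):
    File "health_related_tweets.py", line 23, in <module>
    if keywords in line:
TypeError: 'in <string>' requires string as left operand, not generator

The reason is that keywords = get_keywords(...) returns a generator. Thinking about it logically, keywords should be a list of all the keywords, and for each keyword in that list you want to check whether it occurs in the tweet/line.

Example code: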

keywords = get_keywords('./related_keywords.txt', 'r')
has_keyword = False
for words in keywords:
    # get_keywords() yields line.split(), i.e. a list of words per line
    if any(word in line for word in words):
        has_keyword = True
        break
if has_keyword:
    # Your code here (for the case when the line has at least one keyword)
    pass

(The code above replaces if keywords in line:)
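
A self-contained sketch of the same check, in case it helps. It uses a hypothetical variant of get_keywords() that yields each keyword as a string; the version in the question yields line.split(), a list per line, and a list cannot be the left operand of `in` against a string. The keywords and the tweet line below are made up for illustration.

```python
def iter_keywords(lines):
    # Hypothetical variant of get_keywords(): yields one keyword string at a
    # time. Accepts any iterable of lines, for example an open file.
    for line in lines:
        for word in line.split():
            yield word

# Read the keywords once, before looping over tweets, and keep them in a list
keywords = list(iter_keywords(['honda\n', 'recall\n']))

line = '<t>honda just recalled 440k accords</t>'

# True as soon as one keyword occurs as a substring of the line
has_keyword = any(keyword in line for keyword in keywords)
print(has_keyword)  # True
```

Building the list once also avoids re-reading the keyword file for every tweet, which the loop in the question would do.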

2011-10-02 16:55:41 varunl

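I got another error. (Traceback (most recent call last): File "health_related_tweets.py", line 25, in <module> for keyword in keywords: File "health_related_tweets.py", line 13, in get_keywords yield line.split().lower() AttributeError: 'list' object has no attribute 'lower') I think I need to convert both the keywords and the tweets to lowercase for parsing, so I put ".lower" in my code, but it causes an error.... How should I fix it? – ooozooo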
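
On that last error: line.split() returns a list, and lower() is a string method, so it has to be applied before split() (or to each word separately). A minimal sketch of one way around it, assuming case-insensitive substring matching is what is wanted; the function name and the sample text are illustrative, not from the original code.

```python
def iter_keywords_lower(lines):
    for line in lines:
        # Call lower() on the string first, then split(): lists have no lower()
        for word in line.lower().split():
            yield word

keywords = list(iter_keywords_lower(['Happy\n', 'Hello\n', 'Together\n']))

text = 'HELLO world, new blog post'
# Lowercase the tweet text as well so both sides of the comparison match
matched = [k for k in keywords if k in text.lower()]

print(keywords)  # ['happy', 'hello', 'together']
print(matched)   # ['hello']
```

Because this is substring matching, a keyword such as 'hello' would also match inside a longer word; splitting the tweet text into words first would be stricter.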
