
I have the following code that extracts sentences from a directory of text files. How do I append the strings to a pandas DataFrame?

# -*- coding: utf-8 -*- 
import os

from nltk.tokenize import sent_tokenize 
import pandas as pd 

directory_in_str = "E:\\Extracted\\" 
directory = os.fsencode(directory_in_str) 

for file in os.listdir(directory): 
    filename = os.fsdecode(file) 
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in: 
        for line in f_in: 
            sentences = sent_tokenize(line) 

I want to build a pandas DataFrame and append the sentences to it, so that I can then compute n-gram frequency counts over the sentences, as in How to find ngram frequency of a column in a pandas dataframe?

That is, I need to append the sentences to df = pd.DataFrame([], columns=['description']) so that I can then do:

from sklearn.feature_extraction.text import CountVectorizer 
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word') 
sparse_matrix = word_vectorizer.fit_transform(df['description']) 
frequencies = sum(sparse_matrix).toarray()[0] 
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency']) 

What is the code to append the sentences to the df DataFrame?

Answer


Your extraction code needs a small change: declare sentences outside the loop, and extend it as you go.

sentences = [] 
for file in os.listdir(directory): 
    filename = os.fsdecode(file) 
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in: 
        for line in f_in: 
            sentences.extend(sent_tokenize(line)) 

Once that is done, just initialize your df like this:

df = pd.DataFrame({'description' : sentences}) 
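
A short end-to-end sketch of how this feeds into the CountVectorizer snippet from the question; ngram_freq is just a name introduced here for the resulting frequency table (it is also the name used in the comments below):

from sklearn.feature_extraction.text import CountVectorizer 

# Count unigram and bigram occurrences over all collected sentences 
word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word') 
sparse_matrix = word_vectorizer.fit_transform(df['description']) 

# Column-wise totals: one count per n-gram in the vocabulary 
frequencies = sum(sparse_matrix).toarray()[0] 

# One row per n-gram with its total frequency 
# (on scikit-learn >= 1.2, use get_feature_names_out() instead) 
ngram_freq = pd.DataFrame(frequencies, 
                          index=word_vectorizer.get_feature_names(), 
                          columns=['frequency']) 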

If I do `ngram_freq = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])` and `df.index.name = 'ngram'` and `ngram_freq[ngram_freq.ngram == 'youtube']`, I can't get the frequency count for youtube. Any idea how to do this? – Superdooperhero


@Superdooperhero Do you mean: `ngram_freq[ngram_freq.index == 'youtube']`? –


Sorry, that should be `ngram_freq.index.name = 'ngram'` – Superdooperhero
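
Putting the comment thread together, a minimal sketch of the lookup being discussed, assuming ngram_freq is the frequency table built above:

# Name the index of the frequency table 
ngram_freq.index.name = 'ngram' 

# Filter on the index, as suggested in the comments... 
print(ngram_freq[ngram_freq.index == 'youtube']) 

# ...or fetch the row directly by label (raises KeyError if 'youtube' 
# never appears in the vocabulary) 
print(ngram_freq.loc['youtube']) 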
