A fast way to create pairs

I have a large file of word/tag pairs like this, which I want to turn into a pandas DataFrame:

This/DT gene/NN called/VBN gametocide/NN

Now I want to put these pairs, together with their counts, into a DataFrame like this:

    | DT NN --
This|  1  0
Gene|  0  1
:

I tried counting the pairs in a dictionary and then putting the counts into a DataFrame, like this:

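from collections import defaultdict

import pandas as pd

with open("data.txt", "r") as file:
    train = file.read()
words = train.split()

# Count how often each word/tag pair occurs
data = defaultdict(int)
for i in words:
    data[i] += 1

matrixB = pd.DataFrame()

# Fill the DataFrame one cell at a time
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count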

But this takes a very long time (the file has 300000 of these). Is there a faster way to do this?

Source

2016-03-01 maxmijn

What's wrong with the answer you got to your other question?

from collections import Counter

import pandas as pd

with open('data.txt') as f:
    train = f.read()
# Count each (word, tag) tuple, then pivot the tags into columns
c = Counter(tuple(x.split('/')) for x in train.split())
s = pd.Series(c)
df = s.unstack().fillna(0)

print(df)

This produces:

            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
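One detail about the snippet above: `fillna(0)` fills the gaps only after the table has been built, so the counts may come back as floats. A minimal sketch of a variant, assuming you want integer counts, passes `fill_value=0` to `unstack` instead. The sample string (with `gene/NN` repeated) is made up for illustration and is not from the original file.

```python
from collections import Counter

import pandas as pd

# Made-up sample; "gene/NN" appears twice so that one count is above 1.
text = "This/DT gene/NN called/VBN gametocide/NN gene/NN"

# Count each (word, tag) pair; the tuple keys become a two-level index.
counts = pd.Series(Counter(tuple(pair.split('/')) for pair in text.split()))

# fill_value=0 fills the missing word/tag combinations while unstacking.
df = counts.unstack(fill_value=0)

print(df)
```

Whether this matters depends on what you do with the table afterwards; for plain lookups either form should work.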

Source

2016-03-01 17:53:49 Alex

Nothing, I was just still testing all of this before I saw your answer. It helped me a lot, thank you very much! – maxmijn

Great, glad it helped! – Alex

I thought this question looked very similar... why did you post it twice?

from collections import Counter

import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"

>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0)

            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
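For comparison, a similar table can probably be built with `pd.crosstab`, which counts co-occurrences of two aligned sequences. This is only a sketch on the sample sentence. The names `words`, `tags` and `table` are made up here, and splitting on the last `/` with `rsplit` is an assumption about how words containing a slash should be handled.

```python
import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"

# Assumption: split on the last '/', so a word containing '/' keeps its tag.
pairs = [pair.rsplit('/', 1) for pair in text.split()]
words = pd.Series([word for word, _ in pairs], name='word')
tags = pd.Series([tag for _, tag in pairs], name='tag')

# Rows are words, columns are tags, cells are occurrence counts.
table = pd.crosstab(words, tags)

print(table)
```

I have not timed this against the `Counter` version on a 300000-pair file, so treat it as an alternative to try rather than a guaranteed speed-up.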

Source

2016-03-01 17:53:39 Alexander
