I'm new to scikit-learn and numpy. How can I represent a dataset made up of lists/arrays of strings, for example
[["aa bb","a","bbb","à"], ["bb cc","c","ddd","à"], ["kkk","a","","a"]]
as a numpy array of dtype float?
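I think what you're looking for is a numeric representation of your words. You can use gensim to map each word to a token id and then build your numpy arrays from those ids, like this: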
import numpy as np
from gensim import corpora

toconvert = [["aa bb","a","bbb","à"], ["bb", "cc","c","ddd","à"], ["kkk","a","","a"]]

# Map every distinct string to an integer token id.
# For example, 'aa bb' could be represented as 2, 'a' as 1, etc.
tdict = corpora.Dictionary(toconvert)

# The inner lists have different lengths, so build one float array per inner list
newlist = []
for l in toconvert:
    tmplist = []
    for word in l:
        # look up the id of the word under observation
        tmplist.append(tdict.token2id[word])
    # convert to a float numpy array and append to the main list
    newlist.append(np.array(tmplist).astype(float))

# a list of three float arrays, one per inner list;
# the exact ids depend on how gensim numbers the tokens
print(newlist)

# and to see which string a given id represents:
print(tdict[0])  # e.g. 'a'
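If every inner list has the same length, as in the question's 3x4 example, the same idea can probably be done with numpy alone. This is only a sketch, not a replacement for the gensim approach: the names `data`, `vocab` and `ids` are made up for illustration, and it assumes a rectangular input (ragged lists like the `toconvert` example above would not work here).

```python
import numpy as np

# Rectangular list of strings (the question's example, assumed 3 rows x 4 columns)
data = np.array([["aa bb", "a", "bbb", "à"],
                 ["bb cc", "c", "ddd", "à"],
                 ["kkk", "a", "", "a"]])

# np.unique returns the sorted distinct strings plus, for every element of the
# flattened input, the index of that element in the sorted vocabulary
vocab, inverse = np.unique(data.ravel(), return_inverse=True)

# Put the ids back into the original 2-D shape and cast to float
ids = inverse.reshape(data.shape).astype(float)

print(ids.dtype, ids.shape)

# Map the ids back to the original strings
restored = vocab[ids.astype(int)]
```

Note that the ids are just labels: the fact that one string gets 2.0 and another 5.0 says nothing about how the strings relate to each other.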
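Thanks @datawrestler for your answer. It was very helpful.
Whaat??? Converting strings to floats? By the way, this has nothing to do with sklearn. – MMF
OK, maybe I didn't use the right terminology, but @datawrestler understood my question and gave a very useful suggestion. Thanks anyway.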