2017-10-05 214 views
0

I want to use this code to convert numeric data into a feature vector:

import numpy as np 
import pandas as pd 
import csv 

def clearRegister():
    clear_register = []
    zero = 0
    for i in range(21):
        clear_register.append(0)
    return clear_register

def header():
    clear_register = []
    name = 'c'
    entry = 1
    for i in range(21):
        clear_register.append(name+str(entry))
        entry += 1
    return clear_register

def convert(filename):
    clear_dataset = []
    clear_dataset.append(header())
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            clear_register = clearRegister()
            clear_register[(int(row["blue1"])-1)] = 1
            clear_register[(int(row["blue2"])-1)] = 1
            clear_register[(int(row["blue3"])-1)] = 1
            clear_register[(int(row["red1"])+9)] = 1
            clear_register[(int(row["red2"])+9)] = 1
            clear_register[(int(row["red3"])+9)] = 1

Here is my CSV file input:

row blue1 blue2 blue3 red1 red2 red3 lable 
0 1 5 4 6 2 8 0 
1 2 3 1 9 4 5 1 
. . . . . . . . 
3000 5 7 4 3 8 10 1 

I expect output like this (c1-c10 for blue, c11-c20 for red):

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 
1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 
. . . . . . . . . . . . . . . . . . . . . 
0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 

c11-c20 stand for 'red', and the values are all unique: if c1, c5 and c10 are 1, then c11, c15 and c20 cannot be 1.

I tried calling it like this:

df = convert("dataset.csv") 
df1 = pd.DataFrame(df) 
print(df1) 

I got this result:

Empty DataFrame 
Columns: [] 
Index: [] 

What is wrong with the code, or what is it missing?

+0

Is there a possibility that blue1 = blue2 = blue3, and likewise for red? Do you actually need a count, or is the answer always binary? – DJK

+0

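Always binary. I forgot to mention that the values do not repeat (they are unique) across both colors, so if c1 has the value 1, then c11, which is the red counterpart of c1, will not have the same value. –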

Answer

1

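Consider a pandas solution instead of manual csv handling, using loc to create the new c1-c20 columns one value at a time. Demonstrated below with random data: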

Data (only for readers of this question; the OP would use the actual CSV instead)

import numpy as np 
import pandas as pd 

pd.set_option('display.width', 1000) 
pd.set_option('display.max_columns', 25) 

np.random.seed(5005) 
df = pd.DataFrame({'row': range(3000), 
        'blue1': [np.random.randint(11) for _ in range(3000)], 
        'blue2': [np.random.randint(11) for _ in range(3000)], 
        'blue3': [np.random.randint(11) for _ in range(3000)], 
        'red1': [np.random.randint(11) for _ in range(3000)], 
        'red2': [np.random.randint(11) for _ in range(3000)], 
        'red3': [np.random.randint(11) for _ in range(3000)], 
        'lable': [0,1]*1500}) 

print(df.head()) 
# blue1 blue2 blue3 lable red1 red2 red3 row 
# 0  4  5  5  0 10  0  8 0 
# 1  7  2  2  1  3  8  8 1 
# 2  2  4  0  0  8  1  7 2 
# 3  4  5  8  1  9  8  1 3 
# 4  0  1  5  0  5  6  9 4 

Process

for i in range(1,11):  
    df.loc[(df['blue1'] == i) | (df['blue2'] == i) | (df['blue3'] == i), 'c'+str(i)] = 1 
    df.loc[(df['red1'] == i) | (df['red2'] == i) | (df['red3'] == i), 'c'+str(i+10)] = 1 

# SELECT AND RE-ORDER COLUMNS, FILL IN NANs, CONVERT TO INT TYPE 
df = df[['c'+str(i) for i in range(1,21)]+['lable']].fillna(0).astype(int) 

print(df.head())  
# c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 lable 
# 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1  0 
# 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0  1 
# 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0  0 
# 3 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0  1 
# 4 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0  0 
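
As for the empty DataFrame in the question: convert as posted never appends clear_register to clear_dataset and has no return statement, so it returns None and pd.DataFrame(None) is empty. A minimal sketch of a repaired csv-based version follows. It assumes a comma-delimited file with the header names shown in the question, values between 1 and 10, and that the lable column should be copied through unchanged (inferred from the expected output, not stated by the OP).

```python
import csv
import pandas as pd

def convert(filename):
    """Build one 0/1 row per CSV record: c1-c10 for blue, c11-c20 for red."""
    columns = ['c' + str(i) for i in range(1, 21)] + ['lable']
    rows = []
    with open(filename, newline='') as csvfile:
        for row in csv.DictReader(csvfile):
            register = [0] * 21
            for name in ('blue1', 'blue2', 'blue3'):
                register[int(row[name]) - 1] = 1   # blue value v -> column c<v>
            for name in ('red1', 'red2', 'red3'):
                register[int(row[name]) + 9] = 1   # red value v -> column c<v+10>
            register[20] = int(row['lable'])       # assumption: label is copied as-is
            rows.append(register)                  # this append was missing
    return pd.DataFrame(rows, columns=columns)     # this return was missing
```

With this version, convert("dataset.csv") already returns a DataFrame, so the extra pd.DataFrame(df) call from the question is no longer needed.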
+0

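Although the example given is 21x3000, the real dataset I need to convert has 277 columns and 39500 rows, which makes execution very slow... Anyway, I really appreciate your help. Thank you very much! –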
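
On the slowness with the larger dataset: one option that might help is to skip the per-value boolean masks and write the 1s directly with NumPy integer indexing, one pass per source column. This is only a rough sketch and has not been measured against the real data. The function name one_hot_groups and its parameters are made up for illustration, and it assumes integer-valued columns whose valid values run from 1 to n_values.

```python
import numpy as np
import pandas as pd

def one_hot_groups(df, groups, n_values=10, label='lable'):
    """Hypothetical helper: groups is a list of column-name lists,
    e.g. [['blue1', 'blue2', 'blue3'], ['red1', 'red2', 'red3']]."""
    out = np.zeros((len(df), n_values * len(groups)), dtype=np.int8)
    row_idx = np.arange(len(df))
    for g, cols in enumerate(groups):
        for col in cols:
            vals = df[col].to_numpy()
            # Ignore anything outside 1..n_values (the random demo data contains 0)
            valid = (vals >= 1) & (vals <= n_values)
            # Group g occupies output columns g*n_values .. g*n_values + n_values - 1
            out[row_idx[valid], g * n_values + vals[valid] - 1] = 1
    names = ['c' + str(i) for i in range(1, out.shape[1] + 1)]
    result = pd.DataFrame(out, columns=names, index=df.index)
    result[label] = df[label].to_numpy()
    return result
```

For the data in this question the call would be one_hot_groups(df, [['blue1', 'blue2', 'blue3'], ['red1', 'red2', 'red3']]). Whether it is actually faster on 277 columns depends on how those columns are grouped, so it is worth timing both approaches.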