2011-05-08 46 views
0

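I'm defining a function that returns a list in which element 0 is a 2D array, element 1 is the header information, and element 2 is the row names. How can I write a function that reads this from a file while allowing for a header row and a column of row names?

The file looks like this:

genes S1 S2 S3 S4 S5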

100 -0.243 -0.021 -0.205 -1.283 0.411

10000 -1.178 -0.79 0.063 -0.878 0.011

def input2DarrayData(fn):
    # define twoDarray, headerLine and rowLabels
    twoDarray = []
    # open filehandle
    fh = open(fn)
    # collect header information


    # read in the rest of the data and organize it into a list of lists
    for line in fh:
        # split line into columns and append to array
        arrayCols = line.strip().split('\t')
        # collect rowname information

        # **what goes here?**

        # convenient float conversion for each element in the list using the
        # map function. note that this assumes each element is a number and can
        # be cast as a float. see floatizeData(), which gives the explicit
        # example of how the map function works conceptually.
        twoDarray.append(map(float, arrayCols))
    # return data
    return twoDarray

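I keep getting an error saying that it can't convert the first word in the file (genes) to a float because it's a string. So my problem is figuring out how to read in just the first line.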

+0

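Can we see an example of the first 3 lines of the source file? I _think_ you're suggesting that the first line contains column headers and the subsequent lines contain numeric data. Is that correct? – 2011-05-08 22:42:31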

+0

Please show the contents of arrayCols, i.e., print it – joaquin 2011-05-08 22:43:29

+0

genes\t \t S1 S2 S3 \t \t \t S4 S5 \t -0.243 -0.021 -0.205 \t \t \t -1.283 0.411 \t -1.178 -0.79 0.063 \t \t -0.878 \t 0.011 – Eugene 2011-05-08 22:48:28

Answers

1
def input2DarrayData(fn):
    # define twoDarray, headerLine and rowLabels
    twoDarray = []
    headerLine = None
    rowLabels = []
    # open filehandle
    fh = open(fn)

    # read the header line separately so it is never cast to float
    headerLine = fh.readline()
    headerLine = headerLine.strip().split('\t')

    for line in fh:
        arrayCols = line.strip().split('\t')
        # the first column is the row label; the rest are numbers
        rowLabels.append(arrayCols[0])

        twoDarray.append(list(map(float, arrayCols[1:])))
    # return data
    return [twoDarray, headerLine, rowLabels]

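If this works for you, please read PEP-8 and refactor the variable and function names. Also, don't forget to close the file. The best way to make sure it gets closed is to use with: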

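def input2DarrayData(fn):
    """Return [twoDarray, headerLine, rowLabels] read from a tab-separated file."""
    twoDarray = []
    rowLabels = []
    # the with block closes the file automatically, even if an error occurs
    with open(fn) as fh:
        headerLine = fh.readline()
        headerLine = headerLine.strip().split('\t')
        for line in fh:
            arrayCols = line.strip().split('\t')
            rowLabels.append(arrayCols[0])
            twoDarray.append(list(map(float, arrayCols[1:])))
    # return data
    return [twoDarray, headerLine, rowLabels]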
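
For anyone who wants to try this without the original data file, here is a minimal sketch that writes the two sample rows from the question to a temporary file and unpacks the returned list. The name `read_labeled_matrix` and the temporary file are my own assumptions for illustration (a PEP-8 style rename of the function above), and the sample assumes the columns really are tab-separated.

```python
import os
import tempfile


def read_labeled_matrix(path):
    """Return [rows, header, row_labels] from a tab-separated file.

    Hypothetical PEP-8 style rename of the function above.
    """
    rows = []
    row_labels = []
    with open(path) as fh:
        # The first line holds the column names, so it is never cast to float.
        header = fh.readline().strip().split('\t')
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            cols = line.strip().split('\t')
            row_labels.append(cols[0])
            rows.append([float(value) for value in cols[1:]])
    return [rows, header, row_labels]


# Throwaway file holding the sample rows from the question (assumed tab-separated).
sample = (
    "genes\tS1\tS2\tS3\tS4\tS5\n"
    "100\t-0.243\t-0.021\t-0.205\t-1.283\t0.411\n"
    "10000\t-1.178\t-0.79\t0.063\t-0.878\t0.011\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write(sample)

try:
    rows, header, row_labels = read_labeled_matrix(tmp.name)
finally:
    os.remove(tmp.name)

print(header)
print(row_labels)
print(rows[0])
```

Unpacking the result into three names keeps the caller from having to remember which index holds what.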
+0

Awesome! That works perfectly! Thanks, programmers! – Eugene 2011-05-09 03:50:03

1

To handle the header line (the first line in the file), consume it explicitly with .readline() before iterating over the remaining lines:

fh = open(fileName) 
headers = fh.readline().strip().split('\t') 
for line in fh: 
    arrayCols = line.strip().split('\t') 
    ## etc... 
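
To check that the loop really starts at the second line, the same pattern can be tried on an in-memory file. This is only a sketch: `io.StringIO` stands in for the real file, and the shortened sample data is an assumption for illustration.

```python
import io

# In-memory stand-in for the tab-separated file from the question.
fh = io.StringIO(
    "genes\tS1\tS2\n"
    "100\t-0.243\t-0.021\n"
    "10000\t-1.178\t-0.79\n"
)

# readline() consumes the header, so iteration resumes at the first data row.
headers = fh.readline().strip().split('\t')
data_lines = [line.strip().split('\t') for line in fh]

print(headers)
print(data_lines)
```

The values in `data_lines` are still strings at this point; converting them to float is a separate step.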

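I'm not sure what data structure you want to get from the file; you seem to imply that you want a list for each row that includes the headers. Repeating the headers like that doesn't make much sense.

Assuming a fairly trivial file structure with a header line and a fixed number of columns per line, the following is a generator that yields a dictionary for each line, with the headers as keys and the column values as values: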

def process_file(filepath):
    ## open the file
    with open(filepath) as src:
        ## read the first line as headers
        headers = src.readline().strip().split('\t')
        for line in src:
            ## Split the line
            line = line.strip().split('\t')
            ## Coerce each value to a float
            line = [float(col) for col in line]
            ## Create a dictionary using headers and cols
            line_dict = dict(zip(headers, line))
            ## Yield it
            yield line_dict

>>> for row in process_file('path/to/myfile'):
...     print(row)
...
{'genes': 100.0, 'S1': -0.243, 'S2': -0.021, 'S3': -0.205, 'S4': -1.283, 'S5': 0.411}
{'genes': 10000.0, 'S1': -1.178, 'S2': -0.79, 'S3': 0.063, 'S4': -0.878, 'S5': 0.011}
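
As an aside, the standard library's csv module can produce the same kind of per-row dictionaries. This is a swapped-in alternative rather than the generator above: a rough sketch assuming the file is tab-separated, with `io.StringIO` and the shortened sample data standing in for the real file.

```python
import csv
import io

# In-memory stand-in for the tab-separated file (assumed layout).
sample = io.StringIO(
    "genes\tS1\tS2\n"
    "100\t-0.243\t-0.021\n"
    "10000\t-1.178\t-0.79\n"
)

# DictReader takes the first line as the field names and maps each
# following line to a dictionary of strings keyed by those names.
reader = csv.DictReader(sample, delimiter='\t')

# Convert every value to float, as the generator above does.
rows = [{key: float(value) for key, value in row.items()} for row in reader]

for row in rows:
    print(row)
```

Like the generator, this turns the genes column into a float as well; if it should stay a label, it would need to be pulled out of each row before the conversion.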