Using pandas read_csv() twice on an open file

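While experimenting with pandas, I noticed some odd behavior in pandas.read_csv and was wondering whether someone with more experience could explain what might be causing it.

To start, here is the basic class definition I use to create a new pandas.DataFrame from a .csv file: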

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works fine, and calling the class from my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

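However, I noticed that this process automatically uses the first row of the .csv as the pandas.DataFrame.columns index. Instead, I decided to number the columns. Since I did not want to assume that I already knew the number of columns, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then re-loading the DataFrame using range().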

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

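Keeping my __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the proper names (0 ... 499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and re-open it, like so: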

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.          #<---- +++++++
        self.csvfile.close()            #<---- Added
        # Re-open file.                 #<---- Block
        self.csvfile = open(filepath)   #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile,
                                        names=range(self.numcolumns))

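Closing the file and re-opening it returned a pandas.DataFrame with the columns correctly numbered 0 ... 499 and all 255 subsequent rows of data.

My question is: why does closing the file and re-opening it make a difference?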

2014-09-19 Grant Hulegaard

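When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator, leaving the file position at the end of the file. Subsequent calls to pd.read_csv on the same handle see the iterator as empty.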
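The single-pass behavior can be seen without pandas at all. The following is a minimal sketch, assuming a small hypothetical two-row CSV written to a temporary file (the file contents and variable names are illustrative, not from the question). It iterates the same handle twice, then rewinds it with seek(0):

```python
import os
import tempfile

# Hypothetical two-row CSV written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n")
    path = tmp.name

with open(path) as handle:
    first_pass = list(handle)    # consumes every line in the file
    second_pass = list(handle)   # nothing left: the position is at end of file
    handle.seek(0)               # rewind the handle to the start of the file
    third_pass = list(handle)    # the lines are available again

os.remove(path)

print(first_pass)                # ['a,b\n', '1,2\n']
print(second_pass)               # []
print(third_pass == first_pass)  # True
```

Closing and re-opening the file has the same effect as the seek(0) call here: the new handle starts at position 0, which is why the re-opened file in the question produced data again.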

Note that you could avoid this problem by just passing the file path to pd.read_csv:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath,
                                        names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

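PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)

2014-09-19 22:27:43 unutbu

Thanks for the information regarding file iterators. That makes sense. I will make the change to pass the 'filepath' rather than the open file. However, renaming the columns the way you suggest at the end replaces the column names, which means I lose the first row of data. – 2014-09-21 22:25:25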

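Then add header=None, so that the first row of data becomes part of the data rather than being interpreted as column names. – unutbu 2014-09-21 23:08:08

Ah yes, I forgot about header=None... I am having trouble getting this to work, but that is a separate problem. Thanks for answering my original question! I was just curious about the lower-level "behind the scenes" interaction that caused the "open file" behavior. Thanks! – 2014-09-22 03:31:56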
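For reference, the header=None suggestion from the comments can be sketched as follows. This is a minimal illustration, assuming a hypothetical three-column sample held in an io.StringIO buffer in place of the real .csv file. With header=None, pandas keeps the first row as data and numbers the columns itself, so a single read_csv call is enough:

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for the real .csv file.
sample = io.StringIO("1,2,3\n4,5,6\n")

# header=None tells pandas there is no header row: the first line stays in
# the data and the columns are numbered 0..N-1 automatically.
df = pd.read_csv(sample, header=None)

print(list(df.columns))  # [0, 1, 2]
print(len(df))           # 2
```

In the class from the question, this would presumably replace both read_csv calls with pd.read_csv(filepath, header=None), with no need to count the columns first.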
