Extract columns whose name contains a certain string

I'm trying to use Python to process data in large txt files.

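I have a txt file with more than 2000 columns, and about a third of the headers contain the word "Net". I want to extract only those columns and write them to a new txt file. Any suggestions on how I can do that?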

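I have searched around a bit but haven't been able to find anything that helps. My apologies if a similar question has been asked and solved before.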

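Edit 1: Thank you all! At the time of writing, 3 users have suggested solutions, and they all work really well. I honestly didn't think anyone would answer, so I didn't check for a day or two, and was happily surprised. I'm very impressed.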

Edit 2: I've added a picture showing what part of the original txt file can look like, in case it helps anyone in the future:

Sample from original txt-file

Source

2015-05-04 Rickyboy

Could you please attach a small sample of the file in question, to make the problem statement a bit clearer? – ZdaR

Sure! I have already gotten help, but I've now included a small picture of a sample of the file, in case it helps anyone in the future – Rickyboy

One way of doing this, without installing third-party modules like numpy/pandas, is as follows. Given an input file named "input.csv" that looks like this:

A,B,c_net,d,e_net

0,0,1,0,1

0,0,1,0,1

(remove the blank lines in between; they are only there to format this post)

The following code does what you want.

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Open both files with newline='' as the csv module recommends; the
# with-statement closes them (and flushes the output) when done
with open(input_filename, newline='') as infile, \
        open(output_filename, 'w', newline='') as outfile:
    # Instantiate a CSV reader; check that the delimiter fits your file
    reader = csv.reader(infile, delimiter=',')

    # Get the first row (assuming this row contains the header)
    input_header = next(reader)

    # Select the columns you want to keep by storing their indices.
    # The test is case-sensitive: 'net' does not match 'Net'
    columns_to_keep = []
    for i, name in enumerate(input_header):
        if 'net' in name:
            columns_to_keep.append(i)

    # Create a CSV writer to store the columns you want to keep
    writer = csv.writer(outfile, delimiter=',')

    # Construct the header of the output file
    output_header = []
    for column_index in columns_to_keep:
        output_header.append(input_header[column_index])

    # Write the header to the output file
    writer.writerow(output_header)

    # Iterate over the remainder of the input file, construct a row
    # with the columns you want to keep and write it to the output file
    for row in reader:
        new_row = []
        for column_index in columns_to_keep:
            new_row.append(row[column_index])
        writer.writerow(new_row)

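Note that there is no error handling. At least two cases should be handled. The first is checking that the input file exists (hint: look at what the os and os.path modules provide). The second is handling blank lines or rows with an inconsistent number of columns.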
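The two checks mentioned above could be sketched roughly like this. This is only one possible approach, not part of the original answer: the function name `extract_columns`, its parameters, and the choice to skip (rather than repair) malformed rows are all assumptions made for illustration.

```python
import csv
import os


def extract_columns(input_filename, output_filename, needle="net", delimiter=","):
    """Copy the columns whose header contains `needle`.

    Returns the number of data rows that were skipped.
    (Hypothetical helper; names and behaviour are assumptions.)
    """
    # First check: fail early with a clear message if the input is missing
    if not os.path.isfile(input_filename):
        raise FileNotFoundError("input file not found: " + input_filename)

    skipped = 0
    with open(input_filename, newline="") as infile, \
            open(output_filename, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter=delimiter)
        writer = csv.writer(outfile, delimiter=delimiter)

        # next() with a default avoids StopIteration on an empty file
        header = next(reader, None)
        if header is None:
            raise ValueError("input file is empty: " + input_filename)

        # Indices of the columns whose header contains the search string
        keep = [i for i, name in enumerate(header) if needle in name]
        writer.writerow([header[i] for i in keep])

        for row in reader:
            # Second check: skip blank lines and rows whose column count
            # differs from the header, and count them for the caller
            if len(row) != len(header):
                skipped += 1
                continue
            writer.writerow([row[i] for i in keep])
    return skipped
```

Calling `extract_columns('input.csv', 'output.csv')` would then write the filtered file and return how many rows were dropped. Whether silently skipping bad rows is acceptable depends on your data; raising an error instead may be the safer choice.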

Source

2015-05-04 12:08:43

Wow, thank you very much, works like a charm! Very impressed :) – Rickyboy

This could be done, for instance, with pandas:

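import pandas as pd

# Raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns)  # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
# index=False keeps the row index out of the output file
df_filtered.to_csv('new_file.txt', index=False)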

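Of course, since we don't know the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).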

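This will load the whole file into memory and then filter out the unnecessary columns. If your file is too large to be loaded into RAM at once, you can load only specific columns with the usecols argument.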
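A minimal sketch of the usecols route, assuming a whitespace-separated file as in the snippet above. `read_csv` accepts a callable for `usecols` that is evaluated against each column name, so the unwanted columns are not kept while parsing. The helper name `load_matching_columns` and its defaults are made up for this example.

```python
import pandas as pd


def load_matching_columns(path, needle="net", sep=r"\s+"):
    """Read only the columns whose header contains `needle`.

    (Hypothetical helper; adjust `sep` to match your file.)
    """
    # The callable is applied to every column name; only columns for
    # which it returns True end up in the resulting DataFrame
    return pd.read_csv(path, sep=sep, usecols=lambda name: needle in name)
```

Something like `load_matching_columns('path_to_file.txt').to_csv('new_file.txt', index=False)` would then write the reduced table. Note that the match is case-sensitive, so pass `needle="Net"` if that is how your headers are spelled.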

Source

2015-05-04 12:05:29 rth

Neat! Works perfectly – Rickyboy

You can use the pandas filter function to select columns based on a regular expression:

data_filtered = data.filter(regex='net')
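As a small illustration of that line: `filter(regex=...)` matches column labels with `re.search`, so an inline `(?i)` flag makes the match case-insensitive, which may matter if the headers say "Net" rather than "net". The sample frame and the helper name `select_net_columns` are invented for this sketch; in practice `data` would be the frame read from your file.

```python
import pandas as pd


def select_net_columns(data, pattern="(?i)net"):
    """Return only the columns whose label matches `pattern`."""
    # DataFrame.filter keeps the columns whose label matches the regex
    return data.filter(regex=pattern)


# Hypothetical sample data, standing in for the frame read from the file
sample = pd.DataFrame({"a": [0, 0], "c_Net": [1, 1], "e_net": [1, 1]})
net_only = select_net_columns(sample)
```

With the default pattern both `c_Net` and `e_net` are kept; passing `pattern="net"` gives the case-sensitive behaviour of the one-liner above.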

Source

2015-05-04 16:48:44

Nice! Once the file has been read, this simple line extracts the columns very well. Thanks! – Rickyboy
