Split a large text file into small files based on specific content

I have a large genome sequence and I need to break it up into small .txt files.

The sequence looks like this:

>supercont1.1 of Geomyces destructans 20631-21 
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA 
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG 
>supercont1.2 of Geomyces destructans 20631-21 
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA 
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG 
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG 
>supercont1.3 of Geomyces destructans 20631-21 
AGATTTT (...)

It should be split into small files named "1.1 Geomyces-destructans - 20631-21", "1.2 Geomyces ......", and so on, each containing the complete genome data for that record.

After help from @JimMischel, my code looks like this:

using System; 
using System.Collections.Generic; 
using System.ComponentModel; 
using System.Data; 
using System.Drawing; 
using System.Linq; 
using System.Text; 
using System.Windows.Forms; 
using System.IO; 

namespace genom1 
{ 
    public partial class Form1 : Form 
    { 
     public Form1() 
     { 
      InitializeComponent(); 
     } 

     string filter = "Textové soubory|*.txt|Soubory FASTA|*.fasta|Všechny soubory|*.*"; 

     private void doit_Click(object sender, EventArgs e) 
     { 
      bar.Value = 0; 

      OpenFileDialog opf = new OpenFileDialog(); 

      // filter for choosing file types 
      opf.Filter = filter; 

      string lineo = "error"; // test 

      if (opf.ShowDialog() == DialogResult.OK) 
      { 
       var lineCount = 0; 
       using (var reader = File.OpenText(opf.FileName)) 
       { 
        while (reader.ReadLine() != null) 
        { 
         lineCount++; 
        } 
       } 

       bar.Maximum = lineCount; 
       bar.Step = 1; 

       FolderBrowserDialog fbd = new FolderBrowserDialog(); 

       fbd.Description = "Vyber složku, do které chceš rozdělit načtený soubor: \n\n" + opf.FileName; // dialog desc 
       if (fbd.ShowDialog() == DialogResult.OK) 
       { 
        List<string> lines = new List<string>(); 
        foreach (var line in File.ReadLines(opf.FileName)) 
        { 
         bar.PerformStep(); 

         if (line[0] == '>') 
         { 
          if (lines.Count >= 0) 
          { 
           // write contents of lines list to file 

           //quicker replace for better file name 
           StringBuilder prep = new StringBuilder(line); 
           prep.Replace(">supercont", ""); 
           prep.Replace("of", ""); 
           prep.Replace(" ", "-"); 
           lineo = prep.ToString(); 

           // append or writeall? how to writeall lines without append? 
           //System.IO.File.WriteAllText(fbd.SelectedPath + "\\" + lineo + ".txt", lineo); 
           StreamWriter SW; 
           SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt"); 

           foreach (string s in lines) 
            { 
             SW.WriteLine(s); 
            } 

           SW.Close(); 

           // and clear the list. 
           lines.Clear(); 
          } 
         } 
         lines.Add(line); 
        } 
        // here, do the last part 
        if (lines.Count >= 0) 
        { 
         // write contents of lines list to file. 

         /* starts being little buggy here... 

         StreamWriter SW; 
         SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt"); 
         foreach (string s in lines) 
         { 
          SW.WriteLine(s); 
         } 
         SW.Close(); 

         */ 
        } 
       } 

      } 
     } 

    } 
}

Source

2012-04-16 user1337432

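If the file is small enough to fit in memory, you can call File.ReadAllText to load it into a string. Then you walk through the string and extract the text between the > characters. For example: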

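// requires: using System.IO;
string s = File.ReadAllText("filename"); 
int pos = s.IndexOf('>'); 
while (pos != -1) 
{ 
    int newpos = s.IndexOf('>', pos + 1); 
    // the last record has no following '>', so it runs to the end of the string
    int end = (newpos == -1) ? s.Length : newpos; 
    // text between this '>' and the next one (both markers excluded)
    string text = s.Substring(pos + 1, end - pos - 1); 
    // now write text to a file 

    // update current position; -1 ends the loop after the last record
    pos = newpos; 
} 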

I assume you can figure out how to name the files correctly.
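For what it's worth, here is a minimal sketch of one way to build a file name from a header line. The class and method names (`FastaNaming`, `MakeFileName`) are hypothetical, and it assumes every header starts with the `supercont` prefix and uses "of" only as a filler word, as in the sample above. The exact spacing of the names asked for in the question isn't fully clear, so this simply joins the remaining parts with hyphens.

```csharp
using System;
using System.IO;
using System.Linq;

static class FastaNaming
{
    // Hypothetical helper: turns ">supercont1.1 of Geomyces destructans 20631-21"
    // into "1.1-Geomyces-destructans-20631-21.txt".
    public static string MakeFileName(string header)
    {
        // Drop the leading '>' and any surrounding whitespace.
        string name = header.TrimStart('>').Trim();

        // Assumption: headers start with the "supercont" prefix.
        const string prefix = "supercont";
        if (name.StartsWith(prefix, StringComparison.Ordinal))
            name = name.Substring(prefix.Length);

        // Split on spaces, drop the filler word "of", rejoin with hyphens.
        var parts = name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                        .Where(p => p != "of");
        name = string.Join("-", parts);

        // Replace any character the file system would reject.
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');

        return name + ".txt";
    }
}
```

You would then combine the result with the target folder using `Path.Combine` rather than concatenating `"\\"` by hand.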

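If you can't fit the whole file in memory, you can read the file character by character or do some buffering. The problem is easier if you know that > always appears at the start of a line. Then you can write: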

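// requires: using System.Collections.Generic; using System.IO;
List<string> lines = new List<string>(); 
foreach (var line in File.ReadLines("filename")) 
{ 
    // the length check avoids an exception on empty lines
    if (line.Length > 0 && line[0] == '>') 
    { 
     if (lines.Count > 0) 
     { 
      // write contents of lines list to file; 
      // lines[0] is the header of the record being written. 
      // and clear the list. 
      lines.Clear(); 
     } 
    } 
    lines.Add(line); 
} 
// here, do the last part 
if (lines.Count > 0) 
{ 
    // write contents of lines list to file. 
}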

Source

2012-04-16 23:44:35

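This is an amazing answer! Your comment really helped me! But I still have one question (sorry): I'm a bit confused by those two ifs. Why does there need to be a "last part" here? I made some changes to my code. Could you take a look with your experienced eye? There is a problem with the generated txt files, where ">supercont1.1" has the content of ">supercont1.2", and so on. PS: Is it better to use WriteAllText or AppendText? Which one is faster? I ask because this program will read really large files. – user1337432 2012-04-17 22:43:56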

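You don't want 'lines.Count >= 0', but rather 'lines.Count > 0'. If there are no lines, there is no need to create a file. The reason for the "last part" is that the file probably won't end with a line that starts with ">" (or it might). If it doesn't, you will still have the last part of the file buffered in the 'lines' list, and you need to output it. 'File.AppendText' is fine. If this program works with very large files, you will be limited by disk speed, so any small optimizations you make in the logic won't change much. – 2012-04-18 00:01:39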

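@user1337432: You probably don't want to use 'line' to derive the file name. Instead, use 'lines[0]', which is the header that starts the buffered record. That's why I have 'lines.Count > 0' there, and also why I have the "last part". If you use 'line', your labels will be off. – 2012-04-18 00:03:57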
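To make the two points in these comments concrete, here is a minimal sketch of the grouping logic on its own, separated from the file writing. `FastaSplitter` and `ReadRecords` are hypothetical names. It assumes ">" only ever starts a header line; any lines that appear before the first header would be returned as a record with no header.

```csharp
using System;
using System.Collections.Generic;

static class FastaSplitter
{
    // Groups lines into records. Each record is a list whose first
    // element (record[0]) is the header line that started it.
    public static IEnumerable<List<string>> ReadRecords(IEnumerable<string> source)
    {
        var lines = new List<string>();
        foreach (string line in source)
        {
            // A new header closes the record buffered so far.
            if (line.Length > 0 && line[0] == '>' && lines.Count > 0)
            {
                yield return lines;
                lines = new List<string>();
            }
            lines.Add(line);
        }

        // The "last part": the final record is still buffered here,
        // because no further header arrived to flush it.
        if (lines.Count > 0)
            yield return lines;
    }
}
```

You could feed it `File.ReadLines(path)` and, for each record, derive the file name from `record[0]` and write the lines with `File.WriteAllLines`. Since each record goes to its own file exactly once, there is no need to append.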

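I'd say the simplest way is to first read the whole file using File.ReadAllText(). Then just use String.Split('>'), which returns an array that I think will be the contents of your new files.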
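A rough sketch of that idea, assuming the file fits in memory, starts with ">" and contains ">" only at the start of each record. `FastaInMemory` and `SplitRecords` are hypothetical names. Note that Split removes the separator, so the ">" has to be put back if you want it in the output.

```csharp
using System;

static class FastaInMemory
{
    // Splits the whole text on '>' and restores the marker on each chunk.
    public static string[] SplitRecords(string text)
    {
        // RemoveEmptyEntries drops the empty chunk before the first '>'.
        string[] chunks = text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < chunks.Length; i++)
            chunks[i] = ">" + chunks[i];

        return chunks;
    }
}
```

The first line of each chunk is then the header, which you can use to pick the file name before writing the chunk with `File.WriteAllText`.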

Source

2012-04-16 23:44:23
