Parsing large log files in Haskell

Suppose I have several 200 MB+ files that I need to grep through. How would I do that in Haskell?

Here is my initial program:

import Data.List 
import Control.Monad 
import System.IO 
import System.Environment 

main = do 
    filename <- liftM head getArgs 
    contents <- liftM lines $ readFile filename 
    putStrLn . unlines . filter (isPrefixOf "import") $ contents

This reads the whole file into memory before parsing it. Then I went with this:

import Data.List 
import Control.Monad 
import System.IO 
import System.Environment 

main = do 
    filename <- liftM head getArgs 
    file <- (openFile filename ReadMode) 
    contents <- liftM lines $ hGetContents file 
    putStrLn . unlines . filter (isPrefixOf "import") $ contents

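I thought that since hGetContents is lazy, it would avoid reading the whole file into memory. But running both scripts under valgrind showed similar memory usage for both. So either my script is wrong, or valgrind is wrong. I compile the scripts using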

ghc --make test.hs -prof

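What am I missing? Bonus question: I see a lot of mentions of how lazy IO in Haskell is actually a bad thing. How and why would I use strict IO?

Update:

So it looks like I was misreading the valgrind output. Using +RTS -s, here is what I get: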

7,807,461,968 bytes allocated in the heap 
1,563,351,416 bytes copied during GC 
     101,888 bytes maximum residency (1150 sample(s)) 
     45,576 bytes maximum slop 
      2 MB total memory in use (0 MB lost due to fragmentation) 

Generation 0: 13739 collections,  0 parallel, 2.91s, 2.95s elapsed 
Generation 1: 1150 collections,  0 parallel, 0.18s, 0.18s elapsed 

INIT time 0.00s ( 0.00s elapsed) 
MUT time 2.07s ( 2.28s elapsed) 
GC time 3.09s ( 3.13s elapsed) 
EXIT time 0.00s ( 0.00s elapsed) 
Total time 5.16s ( 5.41s elapsed)

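The important line is 101,888 bytes maximum residency, which says that at any given point my script was using at most 101 KB of memory. The file I was grepping through was 44 MB. So I think the verdict is: readFile and hGetContents are both lazy.

Follow-up question:

Why do I see 7 GB of memory allocated on the heap? That seems really high for a script that reads in a 44 MB file.

Update to the follow-up question

It looks like a few GB of memory allocated on the heap is not atypical for Haskell, so there is no cause for concern. Using ByteStrings instead of Strings brings the memory usage down a lot: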

81,617,024 bytes allocated in the heap 
     35,072 bytes copied during GC 
     78,832 bytes maximum residency (1 sample(s)) 
     26,960 bytes maximum slop 
      2 MB total memory in use (0 MB lost due to fragmentation)


2012-03-17 Vlad the Impala

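Hmm, are you sure the whole 'unlines' string doesn't have to be built before 'putStrLn' actually writes it? I would try something like 'Control.Monad.forM_ (filter (isPrefixOf "import") contents) $ putStrLn'. That's just a guess, though. – 2012-03-17 01:19:32

@Riccardo: No, 'unlines' can be evaluated lazily. Try 'putStr $ unlines $ map show [1..]' in 'ghci'. – ephemient 2012-03-17 01:28:20

Does -O2 magically solve the problem? – gspr 2012-03-17 07:55:13

Both readFile and hGetContents should be lazy. Try running your program with +RTS -s and see how much memory is actually used. What makes you think the whole file is being read into memory?

As for the second part of your question, lazy IO is sometimes the root of unexpected space leaks or resource leaks. That is not really the fault of lazy IO in and of itself, but determining whether it leaks requires analyzing how it is used.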
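
For comparison, here is a minimal sketch of what a non-lazy version of the question's grep might look like. It is only one possible approach, not code from the original thread: it reads one line at a time with hGetLine inside withFile, so the handle is closed deterministically when the action finishes. The helper name grepPrefix and the usage message are made up for illustration.

```haskell
import Control.Monad (unless, when)
import Data.List (isPrefixOf)
import System.Environment (getArgs)
import System.IO

-- Print every line read from the handle that starts with the given prefix.
-- Each line is read explicitly with hGetLine, so only one line needs to be
-- held at a time and no lazy stream outlives the handle.
-- (grepPrefix is a hypothetical helper name.)
grepPrefix :: String -> Handle -> IO ()
grepPrefix prefix h = loop
  where
    loop = do
      eof <- hIsEOF h
      unless eof $ do
        line <- hGetLine h
        when (prefix `isPrefixOf` line) $ putStrLn line
        loop

main :: IO ()
main = do
  args <- getArgs
  case args of
    -- withFile closes the handle when grepPrefix returns or throws.
    [filename] -> withFile filename ReadMode (grepPrefix "import")
    _          -> hPutStrLn stderr "usage: pass exactly one file name"
```

The trade-off is that the control flow is explicit: the loop decides when to read, rather than the consumer of a lazy string.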


2012-03-17 01:17:09 ephemient

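Yes, you are right :) Any ideas about my follow-up question? – 2012-03-17 05:16:24

@VladtheImpala: Don't worry about the total allocation figure; it is the *total* amount of memory allocated over the lifetime of the program. It never decreases, even when memory is freed by garbage collection, which happens all the time in Haskell; figures of several gigabytes per second are not uncommon. – ehird 2012-03-17 05:57:13

@ehird Ah OK, thanks. I just wasn't sure whether that was typical. – 2012-03-17 20:15:03

Please don't use plain Strings (especially when you are processing files larger than 100 MB). Just replace them with ByteStrings (or Data.Text):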

{-# LANGUAGE OverloadedStrings #-} 

import Control.Monad 
import System.Environment 
import qualified Data.ByteString.Lazy.Char8 as B 

main = do 
    filename <- liftM head getArgs 
    contents <- liftM B.lines $ B.readFile filename 
    B.putStrLn . B.unlines . filter (B.isPrefixOf "import") $ contents

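I bet this will be several times faster.

UPD: Regarding your follow-up question.
The drop in allocated memory is strongly connected to the magic speedup you get when switching to ByteStrings.
Since String is just a generic list, it needs extra memory for every Char: a pointer to the next element, an object header, and so on. All of that memory has to be allocated and then collected again, which takes a lot of computational power.
A lazy ByteString, on the other hand, is a list of chunks, i.e. contiguous blocks of memory (I think no less than 64 bytes each). This greatly reduces the number of allocations and collections, and also improves cache locality.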
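
To see the chunk structure described above on a real file, a small sketch along these lines could be used. It relies on toChunks from Data.ByteString.Lazy; the helper name chunkStats and the output wording are made up for illustration, and the actual chunk sizes depend on the bytestring library version, so none are claimed here.

```haskell
import Data.List (foldl')
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.Environment (getArgs)

-- Count the strict chunks in a lazy ByteString and find the largest one.
-- The fold forces both counters at every step so that already-visited
-- chunks can be garbage collected instead of being retained by thunks.
-- (chunkStats is a hypothetical helper name.)
chunkStats :: L.ByteString -> (Int, Int)
chunkStats = foldl' step (0, 0) . L.toChunks
  where
    step (n, m) c =
      let n' = n + 1
          m' = max m (S.length c)
      in n' `seq` m' `seq` (n', m')

main :: IO ()
main = do
  args <- getArgs
  case args of
    [filename] -> do
      contents <- L.readFile filename
      let (count, largest) = chunkStats contents
      putStrLn $ "chunks: " ++ show count
      putStrLn $ "largest chunk (bytes): " ++ show largest
    _ -> putStrLn "usage: pass exactly one file name"
```

Each chunk is one contiguous strict ByteString, so a file is represented by a short list of large blocks rather than one list cell per character.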


2012-03-17 08:23:46

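Absolutely agree about using ByteStrings... I just didn't want to complicate my example further by adding them. But yes, they are a huge saving in both time and memory: '81,617,024 bytes allocated in the heap', '78,832 bytes maximum residency', 'MUT time 0.08s (0.22s elapsed)'. – 2012-03-17 20:14:13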
