I am trying to read a large CSV file in Haskell and produce a word count for each column. How should I read a big CSV file?
The file has more than 4M rows.
So I chose to read one chunk at a time (5k rows per chunk), compute the word counts for each chunk, and then merge them together.
When I tested the function with 12,000 and 120,000 rows, the running time increased almost linearly. However, reading 180,000 rows took more than four times as long.
I suspect this is because memory is running short, and swapping to disk makes the function much slower.
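One way to check that hypothesis is GHC's runtime statistics flag (+RTS -s), which reports maximum residency and the time spent in garbage collection, e.g.:

loadData +RTS -s -RTS 180000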
I wrote my code in a map/reduce style, but how do I keep Haskell from holding all of the data in memory?
Below are my code and the profiling results.
import Data.Ord
import Text.CSV.Lazy.String
import Data.List
import System.IO
import Data.Function (on)
import System.Environment
splitLength :: Int
splitLength = 5000

-- Split the rows into chunks of splitLength rows each.
mySplit' :: [a] -> [[a]]
mySplit' [] = []
mySplit' xs = x : mySplit' t
  where
    x = take splitLength xs
    t = drop splitLength xs

-- Per-column counts for one chunk: transpose to columns,
-- then sort, group and tally each column.
getBlockCount :: Ord a => [[a]] -> [[(a, Int)]]
getBlockCount t =
  map (map (\x -> (head x, length x)) . group . sort) (transpose t)

-- Merge two count lists by summing the counts of equal keys.
foldData :: Ord a => [(a, Int)] -> [(a, Int)] -> [(a, Int)]
foldData lxs rxs = map combind wlist
  where
    wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
    combind [x]    = x
    combind [x, y] = (fst x, snd x + snd y)

loadTestData :: Int -> IO ()
loadTestData datalen = do
  testFile <- readFile "data/test_csv"
  let cfile     = fromCSVTable $ csvTable $ parseCSV testFile
      column    = head cfile                 -- header row
      body      = take datalen $ tail cfile  -- data rows
      -- count each chunk, then merge the chunk counts column by column
      countData = foldl1' (zipWith foldData) $ map getBlockCount $ mySplit' body
      output    = zip column $ map (reverse . sortBy (comparing snd)) countData
  appendFile "testdata" $ intercalate "\n" $ map show $ tail output

main :: IO ()
main = do
  s <- getArgs
  loadTestData $ read $ last s
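For reference, foldData is meant to merge two chunk counts by summing the counts of equal keys; an illustrative GHCi session with made-up data:

ghci> foldData [("a",2),("b",1)] [("a",3),("c",4)]
[("a",5),("b",1),("c",4)]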
Profiling results
loadData +RTS -p -RTS 12000
total time = 1.02 secs (1025 ticks @ 1000 us, 1 processor)
total alloc = 991,266,560 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 120000
total time = 17.28 secs (17284 ticks @ 1000 us, 1 processor)
total alloc = 9,202,259,064 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 180000
total time = 85.06 secs (85059 ticks @ 1000 us, 1 processor)
total alloc = 13,760,818,848 bytes (excludes profiling overheads)
You need to use a streaming library such as 'csv-conduit' or 'pipes-csv' – ErikR 2014-11-03 02:43:54
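A minimal sketch along those lines, assuming the csv-conduit package (its Data.CSV.Conduit module together with the conduit combinators) and the data/test_csv path from the question. Rows are parsed incrementally and folded into a strict Map keyed by (column index, field value), so memory use is proportional to the number of distinct values, not the number of rows:

import Conduit
import Data.ByteString (ByteString)
import Data.CSV.Conduit
import Data.List (foldl')
import qualified Data.Map.Strict as M

main :: IO ()
main = do
  counts <- runConduitRes $
       sourceFile "data/test_csv"
    .| intoCSV defCSVSettings          -- parse rows incrementally
    .| (dropC 1 >> mapC id)            -- skip the header row
    .| foldlC step (M.empty :: M.Map (Int, ByteString) Int)
  mapM_ print (M.toList counts)
  where
    -- Fold one row into the accumulator; insertWith from Data.Map.Strict
    -- evaluates the updated count, so no thunks accumulate.
    step m row = foldl' bump m (zip [0 :: Int ..] row)
    bump m key = M.insertWith (+) key 1 m

Any post-processing (sorting by count, regrouping by column) can then be done on the final, much smaller Map instead of on the rows themselves.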