2012-03-30 105 views
4

我决定自学自己如何使用Parsec,我用自己指定的玩具项目打了一个路障。Haskell:为什么我的解析器不能正确回溯?

我试图解析HTML,具体是:

<html> 
    <head> 
    <title>Insert Clever Title</title> 
    </head> 
    <body> 
    What don't you like? 
    <select id="some stuff"> 
     <option name="first" font="green">boilerplate</option> 
     <option selected name="second" font="blue">parsing HTML with regexes</option> 
     <option name="third" font="red">closing tags for option elements 
    </select> 
    That was short. 
    </body> 
</html> 

我的代码是:

{-# LANGUAGE FlexibleContexts, RankNTypes #-} 
module Main where 

import System.Environment (getArgs) 
import Data.Map hiding (null) 
import Text.Parsec hiding ((<|>), label, many, optional) 
import Text.Parsec.Token 
import Control.Applicative 

data HTML = Element { tag :: String, attributes :: Map String (Maybe String), children :: [HTML] } 
      | Text { contents :: String } 
    deriving (Show, Eq) 

type HTMLParser a = forall s u m. Stream s m Char => ParsecT s u m a 

htmlDoc :: HTMLParser HTML 
htmlDoc = do 
    spaces 
    doc <- html 
    spaces >> eof 
    return doc 

html :: HTMLParser HTML 
html = text <|> element 

text :: HTMLParser HTML 
text = Text <$> (many1 $ noneOf "<") 

label :: HTMLParser String 
label = many1 . oneOf $ ['a' .. 'z'] ++ ['A' .. 'Z'] 

value :: HTMLParser String 
value = between (char '"') (char '"') (many anyChar) <|> label 

attribute :: HTMLParser (String, Maybe String) 
attribute = (,) <$> label <*> (optionMaybe $ spaces >> char '=' >> spaces >> value) 

element :: HTMLParser HTML 
element = do 
    char '<' >> spaces 
    tag <- label 
    -- at least one space between each attribute and what was before 
    attributes <- fromList <$> many (space >> spaces >> attribute) 
    spaces >> char '>' 
    -- nested html 
    children <- many html 
    optional $ string "</" >> spaces >> string tag >> spaces >> char '>' 
    return $ Element tag attributes children 

main = do 
    source : _ <- getArgs 
    result <- parse htmlDoc source <$> readFile source 
    print result 

这个问题似乎是我的解析器不喜欢关闭的标签 - 它似乎被贪婪地假设<总是意味着一个开始标签(据我可以告诉):

% HTMLParser temp.html 
Left "temp.html" (line 3, column 32): 
unexpected "/" 
expecting white space 

我我一直在玩这个游戏,我不确定为什么它不会回溯过去的比赛。

+6

秒差距只有回溯失败。 – ehird 2012-03-30 18:08:01

+1

有时甚至不会 - - 。 Attoparsec在这方面更糟糕。 – 2012-03-31 04:29:26

回答

2

像ehird说,我需要使用try:如果你使用`try`

attribute = (,) <$> label <*> (optionMaybe . try $ spaces >> char '=' >> spaces >> value) 
--... 
attributes <- fromList <$> many (try $ space >> spaces >> attribute) 
--... 
children <- many $ try html 
optional . try $ string "</" >> spaces >> string tag >> spaces >> char '>' 
相关问题