2012-04-16 56 views
1

Parsing a string in Haskell

I have some strings that I want to parse into a list of "chunks". My strings look like this:

"some text [[anchor]] some more text, [[another anchor]]. An isolated [" 

And I'd like to get back something like this:

[ 
    TextChunk "some text ", 
    Anchor "anchor", 
    TextChunk " some more text, ", 
    Anchor "another anchor", 
    TextChunk ". An isolated [" 
] 

I've managed to write a function and types that do what I need, but they seem overly ugly. Is there a nicer way to do this?

data Token = TextChunk String | Anchor String deriving (Show) 
data TokenizerMode = EatString | EatAnchor deriving (Show) 

tokenize::[String] -> [Token] 
tokenize xs = 
    let (_,_,tokens) = tokenize' (EatString, unlines xs, [TextChunk ""]) 
    in reverse tokens 

tokenize' :: (TokenizerMode, String, [Token]) -> (TokenizerMode, String, [Token]) 
-- If we're starting an anchor, add a new anchor and switch modes 
tokenize' (EatString, '[':'[':xs, tokens) = tokenize' (EatAnchor, xs, Anchor "" : tokens) 
-- If we're ending an anchor, add a new text chunk and switch modes 
tokenize' (EatAnchor, ']':']':xs, tokens) = tokenize' (EatString, xs, TextChunk "" : tokens) 
-- Otherwise, if we've got stuff to consume, append it 
tokenize' (EatString, x:xs, TextChunk t : tokens) = tokenize' (EatString, xs, TextChunk (t ++ [x]) : tokens) 
tokenize' (EatAnchor, x:xs, Anchor t : tokens) = tokenize' (EatAnchor, xs, Anchor (t ++ [x]) : tokens) 
-- If we've got nothing more to consume, we're done. 
tokenize' (EatString, [], tokens) = (EatString, [], tokens) 
-- We'll only get here if we're given an invalid string 
tokenize' xx = error ("Error parsing .. so far " ++ show xx) 
+2

This isn't really tokenizing, it's parsing. For all your parsing needs, Parsec. – 2012-04-16 04:38:32

+0

@CatPlusPlus Agreed on parsing.. I've updated the text and title to match. – 2012-04-16 04:47:08

+0

@CatPlusPlus Can you show me how this would look using Parsec? I find the docs/tutes a bit vague for my liking. – 2012-04-16 05:02:08

Answers

11

This should work, including for single brackets:

import Control.Applicative ((<$>), (<*), (*>)) 
import Text.Parsec 

data Text = TextChunk String 
      | Anchor String 
      deriving Show 

chunkChar = noneOf "[" <|> try (char '[' <* notFollowedBy (char '[')) 
chunk  = TextChunk <$> many1 chunkChar 
anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]") 
content = many (chunk <|> anchor) 

parseS :: String -> Either ParseError [Text] 
parseS input = parse content "" input 

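Note the use of try, which lets the chunkChar parser backtrack when it meets two opening brackets. Without try, the first bracket would already have been consumed at that point, and the anchor parser would never get to see it.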
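To make the effect of try concrete, here is a minimal sketch that runs the same character parser with and without it. The names withTry, withoutTry, textThenOpen, demoWith and demoWithout are made up for this illustration and are not part of the answer; the sketch assumes the parsec package and a GHC recent enough that the Prelude exports <* and *>.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A text character: anything but '[', or a '[' not followed by another '['.
withTry :: Parser Char
withTry = noneOf "[" <|> try (char '[' <* notFollowedBy (char '['))

-- The same parser without try: when the second branch fails,
-- the '[' it read has already been consumed.
withoutTry :: Parser Char
withoutTry = noneOf "[" <|> (char '[' <* notFollowedBy (char '['))

-- Hypothetical helper: read text characters, then expect an anchor opener.
textThenOpen :: Parser Char -> Parser String
textThenOpen c = many c *> string "[["

demoWith, demoWithout :: Either ParseError String
demoWith    = parse (textThenOpen withTry)    "" "ab[[x]]"
demoWithout = parse (textThenOpen withoutTry) "" "ab[[x]]"
```

With try, the failed branch rewinds, many stops cleanly after "ab", and the opener is still there to be matched, so demoWith is a Right. Without it, the failure after consuming '[' propagates through many and demoWithout is a Left.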

4

Here is a simple version using two mutually recursive functions.

module Tokens where 

data Token = TextChunk String | Anchor String deriving (Show) 

tokenize :: String -> [Token] 
tokenize = textChunk emptyAcc 


textChunk :: Acc -> String -> [Token] 
textChunk acc []   = [TextChunk $ getAcc acc] 
textChunk acc ('[':'[':ss) = TextChunk (getAcc acc) : anchor emptyAcc ss 
textChunk acc (s:ss)  = textChunk (snocAcc acc s) ss 

anchor :: Acc -> String -> [Token] 
anchor acc []    = error $ "Anchor not terminated" 
anchor acc (']':']':ss) = Anchor (getAcc acc) : textChunk emptyAcc ss 
anchor acc (s:ss)   = anchor (snocAcc acc s) ss 


-- This is a Hughes list (also called DList) which allows 
-- efficient 'Snoc' (adding to the right end). 
-- 
type Acc = String -> String 

emptyAcc :: Acc 
emptyAcc = id 

snocAcc :: Acc -> Char -> Acc 
snocAcc acc c = acc . (c:) 

getAcc :: Acc -> String 
getAcc acc = acc [] 

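This version has a problem: it produces empty TextChunks if the input starts or ends with an anchor, or if there are two consecutive anchors in the text.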

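It is straightforward to add a check so that no TextChunk is generated when the accumulator is empty, but it makes the code about twice as long, so maybe I would reach for Parsec after all...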
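One possible shape for that check, as a sketch only and not the answerer's own code: route every text emission through a small helper that skips empty strings. The helper name textTok is hypothetical, and the accumulator is written out as a plain String -> String difference list to keep the block self-contained.

```haskell
data Token = TextChunk String | Anchor String deriving (Show, Eq)

-- Hypothetical helper: prepend a TextChunk only when it has some text.
textTok :: String -> [Token] -> [Token]
textTok "" rest = rest
textTok s  rest = TextChunk s : rest

tokenize :: String -> [Token]
tokenize = textChunk id

-- The accumulator is a difference list, as in the answer above.
textChunk :: (String -> String) -> String -> [Token]
textChunk acc []           = textTok (acc []) []
textChunk acc ('[':'[':ss) = textTok (acc []) (anchor id ss)
textChunk acc (s:ss)       = textChunk (acc . (s:)) ss

anchor :: (String -> String) -> String -> [Token]
anchor _   []           = error "Anchor not terminated"
anchor acc (']':']':ss) = Anchor (acc []) : textChunk id ss
anchor acc (s:ss)       = anchor (acc . (s:)) ss
```

Under this sketch, tokenize "[[a]][[b]]" yields only the two anchors, with no empty TextChunks around them.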

+0

If I cared about empty TextChunks, I could easily filter them out as a post-processing step. – 2012-04-17 02:51:59
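A minimal sketch of such a post-processing step, assuming the Token type from the answer; the name dropEmptyChunks is invented here for illustration:

```haskell
data Token = TextChunk String | Anchor String deriving (Show, Eq)

-- Hypothetical helper (not from the thread): remove TextChunks with no text.
-- Empty anchors are left alone, since "[[]]" is still an anchor in the input.
dropEmptyChunks :: [Token] -> [Token]
dropEmptyChunks = filter (not . isEmptyChunk)
  where
    isEmptyChunk (TextChunk "") = True
    isEmptyChunk _              = False
```

It would be applied to the tokenizer's result, for example as dropEmptyChunks . tokenize.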

+0

Thanks for the pointer about the performance of appending to lists, and for showing how a DList solves it. – 2012-04-17 02:53:56
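For reference, the two ways of adding a character on the right can be put side by side. This is only a sketch: snocList mirrors the t ++ [x] step from the question, snocAcc is the one from the answer, and buildWithAppend and buildWithAcc are hypothetical names used to drive them.

```haskell
-- Appending with (++): each step walks the whole string built so far,
-- so building n characters costs on the order of n^2 steps.
snocList :: String -> Char -> String
snocList t x = t ++ [x]

-- Difference list: each step is a single function composition,
-- and the string is materialised once at the end.
type Acc = String -> String

snocAcc :: Acc -> Char -> Acc
snocAcc acc c = acc . (c:)

buildWithAppend :: String -> String
buildWithAppend = foldl snocList ""

buildWithAcc :: String -> String
buildWithAcc cs = foldl snocAcc id cs []
```

Both functions return their input unchanged; the difference is only in how much work is done along the way.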

1

A solution using Parsec.

import Text.ParserCombinators.Parsec 

data Text = TextChunk String 
      | Anchor String 
      deriving Show 

inputString = "some text [[anchor]] some more text, [[another anchor]]." 

content :: GenParser Char st [Text] 
content = do 
    s1 <- many (noneOf "[") 
    string "[[" 
    s2 <- many (noneOf "]") 
    string "]]" 
    s3 <- many (noneOf "[") 
    string "[[" 
    s4 <- many (noneOf "]") 
    string "]]." 
    return $ [TextChunk s1, Anchor s2, TextChunk s3, Anchor s4] 


parseS :: String -> Either ParseError [Text] 
parseS input = parse content "" input 

How it works:

> parseS inputString 
Right [TextChunk "some text ",Anchor "anchor",TextChunk " some more text, ",Anchor "another anchor"] 
it :: Either ParseError [Text] 
+2

More generally, you can write 'content = many (chunk <|> anchor)' with 'chunk = TextChunk <$> many1 (noneOf "[")' and 'anchor = Anchor <$> (string "[[" *> many (noneOf "]") <* string "]]")' (using some shortcuts from Control.Applicative). This should work for any combination of text chunks and anchors. – hammar 2012-04-16 08:32:29

+0

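@hammar, that almost works, but I'm guessing it doesn't allow a '[' in the text. I've added one to my example string to make this clearer: I only want "[[stuff]]" to be an anchor, and anything else should go into a TextChunk. – 2012-04-16 08:42:40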