parsing with haskell parser combinators下载 - parsing with haskell parser combinators来源代码下载

与Haskell解析器组合分析

需要解析东西吗？从未听说过“解析器组合”吗？想学习一些haskell吗？惊人的！以下是您需要与Haskell Parser组合者一起起床和解析的一切。从这里，您可以尝试处理深奥的数据序列化格式，编译器前端，特定域的语言 - 您命名它！

构建演示
运行演示
解析器组合
版本号
SRT
练习

构建演示

本指南中包括两个演示程序。

version-number-parser解析版本号的文件。 srt-file-parser解析了SRT字幕的文件。随时尝试使用test-input/中找到的文件。

堆

下载Haskell工具堆栈，然后运行以下内容。

git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
cd parsing-with-haskell-parser-combinators
stack build

阴谋

如果使用Cabal，则可以运行以下内容。

git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
cd parsing-with-haskell-parser-combinators
cabal sandbox init
cabal --require-sandbox build
cabal --require-sandbox install

运行演示

构建了两个演示程序后，您可以这样运行。

堆

要尝试版本号码解析器，请运行以下内容。

 cd parsing-with-haskell-parser-combinators
stack exec -- version-number-parser
What is the version output file path ?
test-input/gifcurry-version-output.txt

要尝试SRT文件解析器，请运行以下内容。

 cd parsing-with-haskell-parser-combinators
stack exec -- srt-file-parser
What is the SRT file path ?
test-input/subtitles.srt

阴谋

要尝试版本号码解析器，请运行以下内容。

 cd parsing-with-haskell-parser-combinators
.cabal-sandbox/bin/version-number-parser
What is the version output file path ?
test-input/gifcurry-version-output.txt

要尝试SRT文件解析器，请运行以下内容。

 cd parsing-with-haskell-parser-combinators
.cabal-sandbox/bin/srt-file-parser
What is the SRT file path ?
test-input/subtitles.srt

解析器组合

了解解析策略的更好方法之一，解析器组合者是研究一个实施。

使用组合剂构建的解析器很容易构建，可读，模块化，结构良好且易于维护。

^{- 比较比较组合者 - 维基百科}

读取

让我们看一下在基地中发现的解析器组合库ReadP的引擎盖下。由于它位于基础上，因此您应该已经拥有它了。

请注意，熟悉ReadP后，您可能需要尝试PARSEC。它也是其他人更喜欢读取的解析器组合库。作为额外的奖励，从GHC版本8.4.1开始，它包含在GHC的引导库中。

P数据类型

 -- (c) The University of Glasgow 2002

data P a
  = Get ( Char -> P a )
  | Look ( String -> P a )
  | Fail
  | Result a ( P a )
  | Final [( a , String )]
  deriving Functor

我们将从P数据类型开始。 P a中的a取决于您（库用户），可以是您想要的。编译器会自动创建函数实例，并且有用于应用，单一， MonadFail和替代方案的手写实例。

请注意，有关函数，应用程序和单子的更多信息，请查看您的单调，应用程序和函子的轻松指南。

P是五种情况的总和类型。

从输入字符串中Get单个字符，并返回新的P 。
Look接受输入字符串的副本，并返回新的P 。
Fail指示解析器完成没有结果。
Result可能会解析和另一种P情况。
Final是两个小组的列表。第一个元组元素是输入的可能解析，第二个元组元素是Get未消耗的输入字符串的其余部分。

跑步

 -- (c) The University of Glasgow 2002

run :: P a -> ReadS a
run ( Get f)      (c : s) = run (f c) s
run ( Look f)     s     = run (f s) s
run ( Result x p) s     = (x,s) : run p s
run ( Final r)    _     = r
run _            _     = []

run是ReadP解析器的核心。当它递归贯穿我们上面看到的所有解析器状态时，它都会完成所有繁重的举动。您可以看到它需要P并返回A ReadS 。

 -- (c) The University of Glasgow 2002

type ReadS a = String -> [( a , String )]

ReadS a是String -> [(a,String)]的类型别名。因此，每当您看到ReadS a时，请思考String -> [(a,String)] 。

 -- (c) The University of Glasgow 2002

run :: P a -> String -> [( a , String )]
run ( Get f)      (c : s) = run (f c) s
run ( Look f)     s     = run (f s) s
run ( Result x p) s     = (x,s) : run p s
run ( Final r)    _     = r
run _            _     = []

run模式与P的不同情况匹配。

如果是Get ，它将用新的P呼唤自己（通过传递函数f ，在Get f ，输入字符串中的下一个字符c返回）和输入s的其余部分。
如果是Look ，它将用新的P调用自身（通过传递函数f ，在Look f中返回，输入字符串s ）和输入字符串。请注意， Look不会像Get那样从输入字符串中消耗任何字符。
如果是Result ，它将组装一个两核算 - 包含解析的结果和输入字符串的剩下的内容 - 并将其预先为递归调用的结果，该递归调用与另一个P情况和输入字符串一起运行。
如果是Final ， run将返回包含解析结果和输入字符串剩余的两个tuplace的列表。
对于其他任何内容， run返回一个空列表。例如，如果情况Fail ， run将返回一个空列表。

 > run ( Get (  a -> Get (  b -> Result [a,b] Fail ))) " 12345 "
[( " 12 " , " 345 " )]

ReadP不会公开run ，但是如果这样做，您可以这样称呼它。两者Get了'1'和'2' ，将"345"留在后面。

 > run ( Get (  a -> Get (  b -> Result [a,b] Fail ))) " 12345 "
> run ( Get (  b -> Result [ ' 1 ' ,b] Fail )) " 2345 "
> run ( Result [ ' 1 ' , ' 2 ' ] Fail ) " 345 "
> ([ ' 1 ' , ' 2 ' ], " 345 " ) : run ( Fail ) " 345 "
> ([ ' 1 ' , ' 2 ' ], " 345 " ) : []
[( " 12 " , " 345 " )]

通过每个递归电话，您可以看到我们如何达到最终结果。

 > run ( Get (  a -> Get (  b -> Result [a,b] ( Final [([ ' a ' , ' b ' ], " c " )])))) " 12345 "
[( " 12 " , " 345 " ),( " ab " , " c " )]

使用Final ，您可以在两个tuples的最终列表中包含一个解析结果。

readp_to_s

 -- (c) The University of Glasgow 2002

   readP_to_S :: ReadP a -> ReadS a
-- readP_to_S :: ReadP a -> String -> [(a,String)]
   readP_to_S ( R f) = run (f return )

虽然readP并未直接曝光run ，但它确实通过readP_to_S暴露了它。 readP_to_S引入了一个名为ReadP的newtype 。 readP_to_S接受ReadP a ，一个字符串，并返回两个tuplace的列表。

读取newtype

 -- (c) The University of Glasgow 2002

newtype ReadP a = R ( forall b . ( a -> P b ) -> P b )

这是ReadP a的定义。有一些实例，适用于函子，应用，monad， MonadFail ，替代方案和MonadPlus 。 R构造器采用一个函数，该函数采用另一个函数并返回P 。接受的功能采用您选择的任何东西a返回P 。

 -- (c) The University of Glasgow 2002

readP_to_S ( R f) = run (f return )

回想一下P是单子， return的类型是a -> ma 。因此， f是(a -> P b) -> Pb函数， return是(a -> P b)函数。最终， run获得了预期的P b

 -- (c) The University of Glasgow 2002

readP_to_S ( R f) inputString = run (f return ) inputString
--               ^ ^^^^^^^^^^                  ^^^^^^^^^^^

它在源代码中留下，但请记住， readP_to_S并run期望输入字符串。

 -- (c) The University of Glasgow 2002

instance Functor ReadP where
  fmap h ( R f) = R (  k -> f (k . h))

这是ReadP的函数实例定义。

 > readP_to_S ( fmap toLower get) " ABC "
[( ' a ' , " BC " )]

> readP_to_S (toLower <$> get) " ABC "
[( ' a ' , " BC " )]

这使我们能够做这样的事情。 fmap函数在函子上映射toLower get函数等于R Get 。回想一下， Get的类型为(Char -> P a) -> P a ReadP构造函数（ R ）接受。

 -- (c) The University of Glasgow 2002

fmap h       ( R f  ) = R (  k -> f   (k . h      ))
fmap toLower ( R Get ) = R (  k -> Get (k . toLower))

在这里，您可以看到为fmap toLower get示例。

应用P实例

查看上面，当我们只使用未终止run的Get时， readP_to_S返回[('a',"BC")] ？答案在于P的应用定义。

 -- (c) The University of Glasgow 2002

instance Applicative P where
  pure x = Result x Fail
  (<*>) = ap

return等于pure ，因此我们可以重写readP_to_S (R f) = run (f return)为readP_to_S (R f) = run (f pure) 。通过使用return或相当pure readP_to_S集Result x Fail因为最终的案例run将会遇到。如果到达， run将终止，我们将获得分析列表。

 > readP_to_S ( fmap toLower get) " ABC "

-- Use the functor instance to transform fmap toLower get.
> readP_to_S ( R (  k -> Get (k . toLower))) " ABC "

-- Call run which removes R.
> run ((  k -> Get (k . toLower)) pure ) " ABC "

-- Call function with pure to get rid of k.
> run ( Get ( pure . toLower)) " ABC "

-- Call run for Get case to get rid of Get.
> run (( pure . toLower) ' A ' ) " BC "

-- Call toLower with 'A' to get rid of toLower.
> run ( pure ' a ' ) " BC "

-- Use the applicative instance to transform pure 'a'.
> run ( Result ' a ' Fail ) " BC "

-- Call run for the Result case to get rid of Result.
> ( ' a ' , " BC " ) : run ( Fail ) " BC "

-- Call run for the Fail case to get rid of Fail.
> ( ' a ' , " BC " ) : []

-- Prepend.
[( ' a ' , " BC " )]

在这里，您可以看到从readP_to_S到解析结果的流程。

替代P实例

 -- (c) The University of Glasgow 2002

instance Alternative P where
  -- ...

  -- most common case: two gets are combined
  Get f1     <|> Get f2     = Get (  c -> f1 c <|> f2 c)

  -- results are delivered as soon as possible
  Result x p <|> q          = Result x (p <|> q)
  p          <|> Result x q = Result x (p <|> q)

  -- ...

P的Alternative实例使我们能够将解析器的流动分为左右路径。当输入不进行，一种或（很少）两种方式中的两种方式时，这很方便。

 > readP_to_S ((get >>=  a -> return a) <|> (get >> get >>=  b -> return b)) " ABC "
[( ' A ' , " BC " ),( ' B ' , " C " )]

<|>运算符或功能在解析器的流中引入叉子。解析器将穿过左右路径。最终结果将包含所有可能的解析，以及所有可能的解析。如果这两条路径都失败，则整个解析器将失败。

请注意，在其他解析器组合器实现中，使用<|>运算符时，解析器将向左或向右，但不会两者兼而有之。如果左派成功，右侧将被忽略。右侧仅在左侧失败时进行处理。

 > readP_to_S ((get >>=  a -> return [a]) <|> look <|> (get >> get >>=  a -> return [a])) " ABC "
[( " ABC " , " ABC " ),( " A " , " BC " ),( " B " , " C " )]

您可以将<|>运算符链接用于许多选项或替代方案。解析器将返回可能的分析。

读取失败

 -- (c) The University of Glasgow 2002

instance Monad ReadP where
  fail _    = R (  _ -> Fail )
  R m >>= f = R (  k -> m (  a -> let R m' = f a in m' k))

这是ReadP monad实例。注意fail的定义。

 > readP_to_S ((  a b c -> [a,b,c]) <$> get <*> get <*> get) " ABC "
[( " ABC " , " " )]

> readP_to_S ((  a b c -> [a,b,c]) <$> get <*> fail " " <*> get) " ABC "
[]

> readP_to_S (get >>=  a -> get >>=  b -> get >>=  c -> return [a,b,c]) " ABC "
[( " ABC " , " " )]

> readP_to_S (get >>=  a -> get >>=  b -> fail " " >>=  c -> return [a,b,c]) " ABC "
[]

您可以通过打电话fail导致整个解析器路径流产。由于readP不能提供生成Result或Final情况的直接方法，因此返回值将是一个空列表。如果故障路径是唯一的路径，则整个结果将是一个空列表。回想一下，当run匹配Fail时，它将返回一个空列表。

 -- (c) The University of Glasgow 2002

instance Alternative P where
  -- ...

  -- fail disappears
  Fail       <|> p          = p
  p          <|> Fail       = p

  -- ...

回到替代P实例，您可以看到两侧的故障（但并非两者都不会失败）不会使整个解析器失败。

 > readP_to_S (get >>=  a -> get >>=  b -> pfail >>=  c -> return [a,b,c]) " ABC "
[]

ReadP不使用fail ，而是提供pfail ，它允许您直接生成Fail情况。

版本号

Gifcurry是GIF制造商的Haskell构建的视频编辑，它将用于各种不同的程序。为了确保兼容性，它需要为其外壳提供的每个程序的版本编号。这些程序之一是ImageMagick。

Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org
Copyright: © 1999-2018 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenCL OpenMP

在这里，您可以看到convert --version的输出。您如何解析此捕获6、9、10和14？

查看输出，我们知道版本编号是一个数字的集合，该数字由段或破折号分隔。该定义也涵盖了日期，因此我们将确保前两个数字分为一段时间。这样，如果他们在版本编号之前放一个日期，我们就不会得到错误的结果。

1. Consume zero or more characters that are not 0 through 9 and go to 2.
2. Consume zero or more characters that are 0 through 9, save this number, and go to 3.
3. Look at the rest of the input and go to 4.
4. If the input
    - is empty, go to 6.
    - starts with a period, go to 1.
    - starts with a dash
        - and you have exactly one number, go to 5.
        - and you have more than one number, go to 1.
    - doesn't start with a period or dash
        - and you have exactly one number, go to 5.
        - you have more than one number, go to 6.
5. Delete any saved numbers and go to 1.
6. Return the numbers found.

在我们深入了解代码之前，这是我们将要关注的算法。

构建版本号解析器

parseVersionNumber
  ::  [ String ]
  ->  ReadP [ String ]
parseVersionNumber
  nums
  = do
  _         <- parseNotNumber
  num       <- parseNumber
  let nums' = nums ++ [num]
  parseSeparator nums' parseVersionNumber

parseVersionNumber是解析版本号的输入字符串的主要解析器组合。它接受字符串列表，并在ReadP数据类型的上下文中返回字符串列表。接收的字符串列表不是被解析的输入，而是到目前为止发现的数字列表。对于第一个函数调用，列表为空，因为列表尚未解析任何内容。

parseVersionNumber
  nums

从顶部开始， parseVersionNumber列出了一系列字符串，这是迄今为止发现的当前数字列表。

  _         <- parseNotNumber

parseNotNumber消耗了输入字符串中的所有数字。由于我们对结果不感兴趣，因此我们将其丢弃（ _ <- ）。

  num       <- parseNumber
  let nums' = nums ++ [num]

接下来，我们消耗所有数字的所有内容，然后将其添加到到目前为止发现的数字列表中。

  parseSeparator nums' parseVersionNumber

在处理下一个parseVersionNumber之后，它将发现的数字列表和本身传递给了parseSeparator 。

解析分离器

parseSeparator
  ::  [ String ]
  ->  ([ String ] -> ReadP [ String ])
  ->  ReadP [ String ]
parseSeparator
  nums
  f
  = do
  next <- look
  case next of
    " "    -> return nums
    (c : _) ->
      case c of
        ' . ' -> f nums
        ' - ' -> if length nums == 1 then f [] else f nums
        _   -> if length nums == 1 then f [] else return nums

在这里，您可以看到parseSeparator 。

  next <- look
  case next of
    " "    -> return nums
    (c : _) ->

look使我们可以在不消耗输入字符串的情况下获得输入字符串的剩余内容。如果什么都没有，它将返回找到的数字。但是，如果剩下一些东西，它将分析第一个字符。

      case c of
        ' . ' -> f nums
        ' - ' -> if length nums == 1 then f [] else f nums
        _   -> if length nums == 1 then f [] else return nums

如果下一个字符是一个时期，请再次致电当前的数字列表，再次致电parseVersionNumber名单。如果这是一个破折号，并且我们有一个数字，请致电parseVersionNumber列表，因为它是日期，该数字列表空白列表。如果这是破折号，我们没有一个数字，请parseVersionNumber到目前为止发现的数字列表。否则，如果我们完全没有一个数字，请致电带有空列表的parseVersionNumber ，如果我们没有一个数字，请返回数字。

解析非数字

parseNotNumber
  ::  ReadP String
parseNotNumber
  =
  munch ( not . isNumber)

parseNotNumber使用ReadP提供的munch 。给予munch ，为任何不是0到9的角色返回的谓词(not . isNumber) 。

 munch :: ( Char -> Bool ) -> ReadP String

如果输入字符串中的下一个字符满足谓词， munch连续打电话get 。如果不是这样， munch会返回所做的字符（如果有的话）。由于它只使用get ，因此Munch总是成功。

请注意， parseNumber类似于parseNotNumber 。而not . isNumber ，谓词只是isNumber 。

蒙克与许多

parseNotNumber'
  ::  ReadP String
parseNotNumber'
  =
  many (satisfy ( not . isNumber))

您可以不使用munch ，而可以使用many和satisfy方式编写parseNotNumber ，这是readp提供的。查看many的类型签名，它接受单个解析器组合器（ ReadP a ）。在这种情况下，它得到了解析器组合的satisfy 。

 > readP_to_S (satisfy ( not . isNumber)) " a "
[( ' a ' , " " )]

> readP_to_S (satisfy ( not . isNumber)) " 1 "
[]

satisfy需要一个谓词，并使用get消耗下一个字符。如果接受的谓词返回真实， satisfy将返回角色。否则， satisfy呼叫pfail并失败。

 > readP_to_S (munch ( not . isNumber)) " abc123 "
[( " abc " , " 123 " )]

> readP_to_S (many (satisfy ( not . isNumber))) " abc123 "
[( " " , " abc123 " ),( " a " , " bc123 " ),( " ab " , " c123 " ),( " abc " , " 123 " )]

使用many可以给您带来不必要的结果。最终， many引入了一个或多个Result案例。因此， many总是成功。

 > readP_to_S (many look) " abc123 "
-- Runs forever.

many会运行您的解析器，直到失败或输入输入为止。如果您的解析器永远不会失败或从未用完输入，那么many将永远不会返回。

 > readP_to_S (many (get >>=  a -> return ( read (a : " " ) :: Int ))) " 12345 "
[( [] , " 12345 " ),([ 1 ], " 2345 " ),([ 1 , 2 ], " 345 " ),([ 1 , 2 , 3 ], " 45 " ),([ 1 , 2 , 3 , 4 ], " 5 " ),([ 1 , 2 , 3 , 4 , 5 ], " " )]

对于结果中的每个索引，分析结果将是在整个输入中运行解析器索引时间的结果。

 > let parser        = get >>=  a -> return ( read (a : " " ) :: Int )
> let many' results = return results <|> (parser >>=  result -> many' (results ++ [result]))
> readP_to_S (many' [] ) " 12345 "
[( [] , " 12345 " ),([ 1 ], " 2345 " ),([ 1 , 2 ], " 345 " ),([ 1 , 2 , 3 ], " 45 " ),([ 1 , 2 , 3 , 4 ], " 5 " ),([ 1 , 2 , 3 , 4 , 5 ], " " )]

这是many的替代定义。在<|>的左侧，它返回当前解析器结果。在<|>的右侧，它运行解析器，将结果添加到当前解析器的结果中，并以更新的结果自称。这具有累积总和类型效果，其中索引i是分析器结果，将其附加到i - 1 ， i - 2 ，...和1解析器结果上。

运行版本号解析器

现在我们建造了解析器，让我们运行它。

 > let inputString =
>     " Some Program (C) 1234-56-78 All rights reserved. n 
>     Version: 12.345.6-7 n 
>     License: Some open source license. "
> readP_to_S (parseVersionNumber [] ) inputString
[([ " 12 " , " 345 " , " 6 " , " 7 " ], " n License: Some open source license. " )]

您可以看到它即使日期之前也正确提取了版本编号。

SRT

现在，让我们解析一些更复杂的东西-SRT文件。

为了发布gifcurry六，我需要解析SRT（子栏文本）文件。 SRT文件包含视频处理程序用来在视频顶部显示文本的字幕。通常，此文本是电影的对话框，该电影翻译成各种不同的语言。通过将文本与视频分开，只需要一个视频，可以节省时间，存储空间和带宽。视频软件可以交换文本而无需交换视频。将其与刻录或硬编码的字幕进行对比，其中文本成为构成视频的图像数据的一部分。在这种情况下，您需要每个字幕集合的视频。

^{内部视频©Blender Foundation | www.sintel.org}

Gifcurry可以获取SRT文件，并为您选择的视频切片刻录字幕。

7
00:02:09,400 --> 00:02:13,800
What brings you to
the land of the gatekeepers?

8
00:02:15,000 --> 00:02:17,500
I'm searching for someone.

9
00:02:18,000 --> 00:02:22,200
Someone very dear?
A kindred spirit?

在这里，您可以看到Sintel的英文字幕（©Blender Foundation | www.sintel.org）。

SRT格式

SRT也许是所有字幕格式中最基本的。

^{- SRT字幕| Matrosk}

SRT文件格式由块组成，每个字幕一个由空线组成。

块的顶部是索引。这决定了字幕的顺序。希望字幕已经井井有条，所有这些都有独特的索引，但情况并非如此。

01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63

在索引是开始时间，结束时间之后，并且可以选择矩形的可选点，字幕文本应输入。

01:04:13,000

时间戳格式为hours:minutes:seconds,milliseconds 。

注意逗号，而不是将秒数与毫秒分开的时期。

This is the actual subtitle
text. It can span multiple lines.
It may include formating
like <b>bold</b>, <i>italic</i>,
<u>underline</u>,
and <font color="#010101">font color</font>.

块的第三部分也是最后一部分是字幕文本。它可以跨越多行，并在有空线时结束。文本可以包括格式化标签，让人联想到HTML。

建造SRT解析器

parseSrt
  ::  ReadP [ SrtSubtitle ]
parseSrt
  =
  manyTill parseBlock (skipSpaces >> eof)

parseSrt是处理所有内容的主要解析器组合。它会解析每个块，直到到达文件末尾（ eof ）或输入。为了安全起见，最后一个块和文件末端之间可能会有落后的空格。为了解决这个问题，它在解析文件末尾之前（ skipSpaces >> eof ）在解析Whitespace（ skipSpaces ）的零或更多字符。如果到达eof时仍留下输入， eof将失败，这将一无所获。因此，重要的是， parseBlock只会留下任何东西，而是落后。

建造SRT块解析器

parseBlock
  ::  ReadP SrtSubtitle
parseBlock
  = do
  i      <- parseIndex
  (s, e) <- parseTimestamps
  c      <- parseCoordinates
  t      <- parseTextLines
  return
    SrtSubtitle
      { index       = i
      , start       = s
      , end         = e
      , coordinates = c
      , taggedText  = t
      }

正如我们之前所经过的那样，一个块由索引，时间戳，可能是一些坐标和一些文本组成。在此版本的parseBlock中，您会看到记录语法更急切的DO符号样式。

parseBlock'
  ::  ReadP SrtSubtitle
parseBlock'
  =
      SrtSubtitle
  <$> parseIndex
  <*> parseStartTimestamp
  <*> parseEndTimestamp
  <*> parseCoordinates
  <*> parseTextLines

这是您可以编写parseBlock另一种方式。这是应用风格。只要确保正确订单即可。例如，我本可以不小心混合了开始和结束时间戳。

构建SRT索引解析器

parseIndex
  ::  ReadP Int
parseIndex
  =
      skipSpaces
  >>  readInt <$> parseNumber

块的顶部是索引。在这里，您再次看到skipSpaces 。跳过空格之后，它解析了数字的输入并将其转换为实际整数。

readInt
  ::  String
  ->  Int
readInt
  =
  read

readInt看起来像这样。

 > read " 123 " :: Int
123
> read " 1abc " :: Int
*** Exception : Prelude. read : no parse

通常，直接使用read可能是危险的。 read可能无法将输入转换为指定类型。但是， parseNumber只会返回10个数字数字字符（ ['0'..'9'] ），因此直接使用read变得安全。

构建SRT时间戳解析器

解析时间戳比解析索引要多一点。

parseTimestamps
  ::  ReadP ( Timestamp , Timestamp )
parseTimestamps
  = do
  _   <- char ' n '
  s   <- parseTimestamp
  _   <- skipSpaces
  _   <- string " --> "
  _   <- skipSpaces
  e   <- parseTimestamp
  return (s, e)

这是解析时间戳的主要组合。

char解析您给它的角色或失败。如果失败，则parseTimestamps失败，最终导致parseSrt失败，因此索引后必须有一个newline字符。

string就像char一样，除了仅仅是一个字符，它可以解析您给它的字符串或失败。

parseStartTimestamp
  ::  ReadP Timestamp
parseStartTimestamp
  =
      char ' n '
  >>  parseTimestamp

parseTimestamps均解析了两个时间戳，但是对于应用样式（ parseSrt' ），我们仅需要一个解析器才能开始时间戳。

parseEndTimestamp
  ::  ReadP Timestamp
parseEndTimestamp
  =
      skipSpaces
  >>  string " --> "
  >>  skipSpaces
  >>  parseTimestamp

这解析了时间戳之间的所有内容，并返回末端时间戳。

parseTimestamp
  ::  ReadP Timestamp
parseTimestamp
  = do
  h  <- parseNumber
  _  <- char ' : '
  m  <- parseNumber
  _  <- char ' : '
  s  <- parseNumber
  _  <- char ' , ' <|> char ' . '
  m' <- parseNumber
  return
    Timestamp
      { hours        = readInt h
      , minutes      = readInt m
      , seconds      = readInt s
      , milliseconds = readInt m'
      }

这解析了构成时间戳的四个数字。前三个数字被结肠隔开，最后一个数字被逗号隔开。但是，要更加宽容，我们允许有一个时期而不是逗号。

 > readP_to_S (char ' . ' <|> char ' , ' ) " ... "
[( ' . ' , " .. " )]

> readP_to_S (char ' . ' <|> char ' , ' ) " ,.. "
[( ' , ' , " .. " )]

请注意，当使用<|>的char时，只有一侧可以成功（两个char Enter，一个char休假），因为char会消耗一个字符，两个字符不能占据相同的空间。

构建SRT坐标解析器

坐标是块的可选部分，但如果包括，则将与时间戳在同一行。

parseCoordinates
  ::  ReadP ( Maybe SrtSubtitleCoordinates )
parseCoordinates
  =
  option Nothing $ do
    _  <- skipSpaces1
    x1 <- parseCoordinate ' x ' 1
    _  <- skipSpaces1
    x2 <- parseCoordinate ' x ' 2
    _  <- skipSpaces1
    y1 <- parseCoordinate ' y ' 1
    _  <- skipSpaces1
    y2 <- parseCoordinate ' y ' 2
    return
      $ Just
        SrtSubtitleCoordinates
          { x1 = readInt x1
          , x2 = readInt x2
          , y1 = readInt y1
          , y2 = readInt y2
          }

option需要两个参数。如果第二个参数（解析器）失败，则返回第一个参数。因此，如果坐标解析器失败，那么parseCoordinates将Nothing 。换句话说，坐标解析器失败并不会导致整个解析器失败。该块将对其coordinates “字段” Nothing 。

parseCoordinate
  ::  Char
  ->  Int
  ->  ReadP String
parseCoordinate
  c
  n
  = do
  _  <- char ( Data.Char. toUpper c) <|> char ( Data.Char. toLower c)
  _  <- string $ show n ++ " : "
  parseNumber

该解析器允许坐标标签在大写或小写字母中。例如， x1:1 X2:2 Y1:3 y2:4将成功。

构建SRT文本解析器

由于类似于HTML的标签格式，解析文本是最相关的部分。

标签解析可能具有挑战性 - 只是问任何以正则表达方式解析它们的人。为了使我们更轻松，对于用户，我们将使用标签汤的方法。解析器将允许未串联和/或错误的嵌套标签。它还将允许任何标签，而不仅允许b ， u ， i和font 。

parseTextLines
  ::  ReadP [ TaggedText ]
parseTextLines
  =
      char ' n '
  >>  (getTaggedText <$> manyTill parseAny parseEndOfTextLines)

我们首先要匹配Newline字符。之后，我们在字幕文本字符上函数映射或fmap（ <$> ） getTaggedText直到我们到达文本行的末尾。

parseEndOfTextLines
  ::  ReadP ()
parseEndOfTextLines
  =
  void (string " nn " ) <|> eof

当我们到达两个Newline字符或文件末尾时，我们停止收集字符（ parseAny ）。这标志着块的末端。

getTaggedText
  ::  String
  ->  [ TaggedText ]
getTaggedText
  s
  =
  fst
    $ foldl
      folder
      ( [] , [] )
      parsed
  where

getTaggedText通过从左到右折叠的文本折叠，返回累积的标记文本。

    parsed
      ::  [ String ]
    parsed
      =
      case readP_to_S (parseTaggedText [] ) s of
        []      -> [s]
        r @ (_ : _) -> ( fst . last ) r

parsed返回一个或多个字符串的列表。它试图解析标签的输入文本。如果失败， parsed将返回列表中的输入字符串。否则，如果parseTaggedText成功， parse将返回最后可能的解析（ (fst . last) r ）。

    folder
      ::  ([ TaggedText ], [ Tag ])
      ->  String
      ->  ([ TaggedText ], [ Tag ])
    folder
      (tt, t)
      x
      | isTag x   = (tt, updateTags t x)
      | otherwise = (tt ++ [ TaggedText { text = x, tags = t}], t)

当folder从从左向右移动，在解析的字符串上移动时，它会检查当前字符串是否为标签。如果是标签，它将更新当前的活动标签集（ t ）。否则，它将附加另一个与活动标签集有关的标记文本。

updateTags
  ::  [ Tag ]
  ->  String
  ->  [ Tag ]
updateTags
  tags
  x
  | isClosingTag x = remove compare' tags (makeTag x)
  | isOpeningTag x = add    compare' tags (makeTag x)
  | otherwise      = tags
  where
    compare'
      ::  Tag
      ->  Tag
      ->  Bool
    compare'
      a
      b
      =
      name a /= name b

updateTags根据删除或添加给定标签（ x ）的tags更新标签，具体取决于它是关闭或打开标签。如果这都不是，它只是返回传递的标签集。如果tags已经具有相同名称的标签， add将覆盖现有标签。您可以在给出的compare'函数中看到这一点。

为了使解析器保持简单，如果找到了打开标签T ，则将T添加到标签列表中或覆盖退出T如果已经存在）。如果找到相应的关闭/T ，则将T从标签列表中删除（如果存在）。连续有两个或多个T s，一个或多个T s，一个或多个没有关闭/T ，和 /或没有打开T关闭/T 。

makeTag
  ::  String
  ->  Tag
makeTag
  s
  =
  Tag
    { name       = getTagName       s
    , attributes = getTagAttributes s
    }

makeTag从给定的s组装一个标签。每个Tag都有一个名称和零或更多属性。

parseTaggedText
  ::  [ String ]
  ->  ReadP [ String ]
parseTaggedText
  strings
  = do
  s <- look
  case s of
    " " -> return strings
    _  -> do
      r <- munch1 ( /= ' < ' ) <++ parseClosingTag <++ parseOpeningTag
      parseTaggedText $ strings ++ [r]

parseTaggedText返回输入字符串分为零件。每个作品都是标签，关闭标签或开放标签所包含的文本。将其分开后，它将其添加到其他碎片中，然后再次自动起诉。如果其余的输入字符串为空，则返回找到的字符串列表。

 > readP_to_S (string " ab " <++ string " abc " ) " abcd "
[( " ab " , " cd " )]

> readP_to_S (string " ab " +++ string " abc " ) " abcd "
[( " ab " , " cd " ),( " abc " , " d " )]

> readP_to_S (string " ab " <|> string " abc " ) " abcd "
[( " ab " , " cd " ),( " abc " , " d " )]

<++运算符的偏见意味着，如果左侧成功，它甚至不会打扰右侧。回想一下，当我们运行解析器时，我们将获得所有可能的分析的列表。所有这些可能的解析都是解析器已经穿过所有可能的路径的结果。通过使用<++ ，当且仅当左侧失败时，我们就会从左路和右路路径接收可能的解析。如果您希望通过左侧和右侧的所有可能的解析，则可以使用ReadP提供的+++操作员。 +++只是<|> ，我们在上面看到了。

parseOpeningTag
  ::  ReadP String
parseOpeningTag
  = do
  _ <- char ' < '
  t <- munch1 (  c -> c /= ' / ' && c /= ' > ' )
  _ <- char ' > '
  return $ " < " ++ t ++ " > "

开头标签是一个开头的角度支架，一些不包括前斜线的文本，以及下一个即时的闭合角括号。

parseClosingTag
  ::  ReadP String
parseClosingTag
  = do
  _ <- char ' < '
  _ <- char ' / '
  t <- munch1 ( /= ' > ' )
  _ <- char ' > '
  return $ " </ " ++ t ++ " > "

闭合标签是一个开头的角括号，前向斜线，一些文本和下一个即时的闭合角括号。

getTagAttributes
  ::  String
  ->  [ TagAttribute ]
getTagAttributes
  s
  =
  if isOpeningTag s
    then
      case readP_to_S (parseTagAttributes [] ) s of
        []    -> []
        (x : _) -> fst x
    else
      []

开放标签可以具有属性。例如， <font color="#101010"> 。每个属性都是两个键盘，钥匙值对。在上面的示例中， color将是钥匙， #101010是值。

getTagName
  ::  String
  ->  String
getTagName
  s
  =
  case readP_to_S parseTagName s of
    []    -> " "
    (x : _) -> toLower' $ fst x

这将返回小写的标签名称。

parseTagName
  ::  ReadP String
parseTagName
  = do
  _ <- char ' < '
  _ <- munch ( == ' / ' )
  _ <- skipSpaces
  n <- munch1 (  c -> c /= ' ' && c /= ' > ' )
  _ <- munch  ( /= ' > ' )
  _ <- char ' > '
  return n

标签名称是开头角括号之后的第一个非Whitespace字符字符串，可能的前向斜线以及一些可能的空格以及在其他空间和/或闭合角括号之前。

parseTagAttributes
  ::  [ TagAttribute ]
  ->  ReadP [ TagAttribute ]
parseTagAttributes
  tagAttributes
  = do
  s <- look
  case s of
    " " -> return tagAttributes
    _  -> do
      let h = head s
      case h of
        ' > ' -> return tagAttributes
        ' < ' -> trimTagname >> parseTagAttributes'
        _   -> parseTagAttributes'
  where
    parseTagAttributes'
      ::  ReadP [ TagAttribute ]
    parseTagAttributes'
      = do
      tagAttribute <- parseTagAttribute
      parseTagAttributes
        ( add
            (  a b -> fst a /= fst b)
            tagAttributes
            tagAttribute
        )

parseTagAttributes递归通过输入字符串，收集键值对。在标签（ < ）的开头，它首先在处理属性之前先修剪标签名称。当它到达闭合角括号（ > ）时，它会停止解析属性。如果标签恰好具有重复的属性（基于密钥）， add将确保仅保留最新的属性。

trimTagname
  :: ReadP ()
trimTagname
  =
      char ' < '
  >> skipSpaces
  >> munch1 (  c -> c /= ' ' && c /= ' > ' )
  >> return ()

这修剪或放弃标签名称。

parseTagAttribute
  ::  ReadP TagAttribute
parseTagAttribute
  = do
  _ <- skipSpaces
  k <- munch1 ( /= ' = ' )
  _ <- string " = " "
  v <- munch1 ( /= ' " ' )
  _ <- char ' " '
  _ <- skipSpaces
  return (toLower' k, v)

属性键是在均等符号之前的任何非Whitespace字符字符串。属性值是在均等符号和双引号之后以及下一个即时双引号之前的任何字符。

isTag
  ::  String
  ->  Bool
isTag
  s
  =
  isOpeningTag s || isClosingTag s

字符串是标签，如果它是打开标签或关闭标签。

isOpeningTag
  ::  String
  ->  Bool
isOpeningTag
  s
  =
  isPresent $ readP_to_S parseOpeningTag s

如果开放标签解析器成功，则字符串是开头标签。

isClosingTag
  ::  String
  ->  Bool
isClosingTag
  s
  =
  isPresent $ readP_to_S parseClosingTag s

如果关闭标签解析器成功，则字符串是关闭标签。

运行SRT解析器

现在我们已经组装了解析器，让我们尝试一下。

 > let srt =
>       " 1 n 
>       :0:0,1 --> 0:1:0.2  x1:1 X2:3  y1:4 y2:10 n 
>       <font color= " red " color= " blue " >This is some <b><u><i> n 
>       subtitle n 
>       </u>text.</b>  "
> readP_to_S parseSrt srt
[([ SrtSubtitle
      { index = 1
      , start = Timestamp {hours = 0 , minutes = 0 , seconds = 0 , milliseconds = 1 }
      , end   = Timestamp {hours = 0 , minutes = 1 , seconds = 0 , milliseconds = 2 }
      , coordinates = Just ( SrtSubtitleCoordinates {x1 = 1 , x2 = 3 , y1 = 4 , y2 = 10 })
      , taggedText =  [ TaggedText
                        { text = " This is some "
                        , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                 ]
                        }
                      , TaggedText
                          { text = " n subtitle n "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " b " ,    attributes = [] }
                                   , Tag {name = " u " ,    attributes = [] }
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      , TaggedText
                          { text = " text. "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " b " ,    attributes = [] }
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      , TaggedText
                          { text = "  "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      ]
      }
  ]
, " "
)]

在这里，您可以看到解析测试字符串的结果。注意测试字符串中的错误，例如使用周期而不是逗号或重复的标签属性。

练习

编写一个可以将SRT文件转换为JSON文件的程序。
使用PARSEC而不是readP重写版本号解析器。
使用PARSEC而不是readP重写SRT解析器。

版权

（c）2019 David Lettier
lettier.com

展开