parsing with haskell parser combinators下載 - parsing with haskell parser combinators來源代碼下載

與Haskell解析器組合分析

需要解析東西嗎？從未聽說過“解析器組合”嗎？想學習一些haskell嗎？驚人的！以下是您需要與Haskell Parser組合者一起起床和解析的一切。從這裡，您可以嘗試處理深奧的數據序列化格式，編譯器前端，特定域的語言 - 您命名它！

構建演示
運行演示
解析器組合
版本號
SRT
練習

構建演示

本指南中包括兩個演示程序。

version-number-parser解析版本號的文件。 srt-file-parser解析了SRT字幕的文件。隨時嘗試使用test-input/中找到的文件。

堆

下載Haskell工具堆棧，然後運行以下內容。

git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
cd parsing-with-haskell-parser-combinators
stack build

陰謀

如果使用Cabal，則可以運行以下內容。

git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
cd parsing-with-haskell-parser-combinators
cabal sandbox init
cabal --require-sandbox build
cabal --require-sandbox install

運行演示

構建了兩個演示程序後，您可以這樣運行。

堆

要嘗試版本號碼解析器，請運行以下內容。

 cd parsing-with-haskell-parser-combinators
stack exec -- version-number-parser
What is the version output file path ?
test-input/gifcurry-version-output.txt

要嘗試SRT文件解析器，請運行以下內容。

 cd parsing-with-haskell-parser-combinators
stack exec -- srt-file-parser
What is the SRT file path ?
test-input/subtitles.srt

陰謀

要嘗試版本號碼解析器，請運行以下內容。

 cd parsing-with-haskell-parser-combinators
.cabal-sandbox/bin/version-number-parser
What is the version output file path ?
test-input/gifcurry-version-output.txt

要嘗試SRT文件解析器，請運行以下內容。

 cd parsing-with-haskell-parser-combinators
.cabal-sandbox/bin/srt-file-parser
What is the SRT file path ?
test-input/subtitles.srt

解析器組合

了解解析策略的更好方法之一，解析器組合者是研究一個實施。

使用組合劑構建的解析器很容易構建，可讀，模塊化，結構良好且易於維護。

^{- 比較比較組合者 - 維基百科}

讀取

讓我們看一下在基地中發現的解析器組合庫ReadP的引擎蓋下。由於它位於基礎上，因此您應該已經擁有它了。

請注意，熟悉ReadP後，您可能需要嘗試PARSEC。它也是其他人更喜歡讀取的解析器組合庫。作為額外的獎勵，從GHC版本8.4.1開始，它包含在GHC的引導庫中。

P數據類型

 -- (c) The University of Glasgow 2002

data P a
  = Get ( Char -> P a )
  | Look ( String -> P a )
  | Fail
  | Result a ( P a )
  | Final [( a , String )]
  deriving Functor

我們將從P數據類型開始。 P a中的a取決於您（庫用戶），可以是您想要的。編譯器會自動創建函數實例，並且有用於應用，單一， MonadFail和替代方案的手寫實例。

請注意，有關函數，應用程序和單子的更多信息，請查看您的單調，應用程序和函子的輕鬆指南。

P是五種情況的總和類型。

從輸入字符串中Get單個字符，並返回新的P 。
Look接受輸入字符串的副本，並返回新的P 。
Fail指示解析器完成沒有結果。
Result可能會解析和另一種P情況。
Final是兩個小組的列表。第一個元組元素是輸入的可能解析，第二個元組元素是Get未消耗的輸入字符串的其餘部分。

跑步

 -- (c) The University of Glasgow 2002

run :: P a -> ReadS a
run ( Get f)      (c : s) = run (f c) s
run ( Look f)     s     = run (f s) s
run ( Result x p) s     = (x,s) : run p s
run ( Final r)    _     = r
run _            _     = []

run是ReadP解析器的核心。當它遞歸貫穿我們上面看到的所有解析器狀態時，它都會完成所有繁重的舉動。您可以看到它需要P並返回A ReadS 。

 -- (c) The University of Glasgow 2002

type ReadS a = String -> [( a , String )]

ReadS a是String -> [(a,String)]的類型別名。因此，每當您看到ReadS a時，請思考String -> [(a,String)] 。

 -- (c) The University of Glasgow 2002

run :: P a -> String -> [( a , String )]
run ( Get f)      (c : s) = run (f c) s
run ( Look f)     s     = run (f s) s
run ( Result x p) s     = (x,s) : run p s
run ( Final r)    _     = r
run _            _     = []

run模式與P的不同情況匹配。

如果是Get ，它將用新的P呼喚自己（通過傳遞函數f ，在Get f ，輸入字符串中的下一個字符c返回）和輸入s的其餘部分。
如果是Look ，它將用新的P調用自身（通過傳遞函數f ，在Look f中返回，輸入字符串s ）和輸入字符串。請注意， Look不會像Get那樣從輸入字符串中消耗任何字符。
如果是Result ，它將組裝一個兩核算 - 包含解析的結果和輸入字符串的剩下的內容 - 並將其預先為遞歸調用的結果，該遞歸調用與另一個P情況和輸入字符串一起運行。
如果是Final ， run將返回包含解析結果和輸入字符串剩餘的兩個tuplace的列表。
對於其他任何內容， run返回一個空列表。例如，如果情況Fail ， run將返回一個空列表。

 > run ( Get (  a -> Get (  b -> Result [a,b] Fail ))) " 12345 "
[( " 12 " , " 345 " )]

ReadP不會公開run ，但是如果這樣做，您可以這樣稱呼它。兩者Get了'1'和'2' ，將"345"留在後面。

 > run ( Get (  a -> Get (  b -> Result [a,b] Fail ))) " 12345 "
> run ( Get (  b -> Result [ ' 1 ' ,b] Fail )) " 2345 "
> run ( Result [ ' 1 ' , ' 2 ' ] Fail ) " 345 "
> ([ ' 1 ' , ' 2 ' ], " 345 " ) : run ( Fail ) " 345 "
> ([ ' 1 ' , ' 2 ' ], " 345 " ) : []
[( " 12 " , " 345 " )]

通過每個遞歸電話，您可以看到我們如何達到最終結果。

 > run ( Get (  a -> Get (  b -> Result [a,b] ( Final [([ ' a ' , ' b ' ], " c " )])))) " 12345 "
[( " 12 " , " 345 " ),( " ab " , " c " )]

使用Final ，您可以在兩個tuples的最終列表中包含一個解析結果。

readp_to_s

 -- (c) The University of Glasgow 2002

   readP_to_S :: ReadP a -> ReadS a
-- readP_to_S :: ReadP a -> String -> [(a,String)]
   readP_to_S ( R f) = run (f return )

雖然readP並未直接曝光run ，但它確實通過readP_to_S暴露了它。 readP_to_S引入了一個名為ReadP的newtype 。 readP_to_S接受ReadP a ，一個字符串，並返回兩個tuplace的列表。

讀取newtype

 -- (c) The University of Glasgow 2002

newtype ReadP a = R ( forall b . ( a -> P b ) -> P b )

這是ReadP a的定義。有一些實例，適用於函子，應用，monad， MonadFail ，替代方案和MonadPlus 。 R構造器採用一個函數，該函數採用另一個函數並返回P 。接受的功能採用您選擇的任何東西a返回P 。

 -- (c) The University of Glasgow 2002

readP_to_S ( R f) = run (f return )

回想一下P是單子， return的類型是a -> ma 。因此， f是(a -> P b) -> Pb函數， return是(a -> P b)函數。最終， run獲得了預期的P b

 -- (c) The University of Glasgow 2002

readP_to_S ( R f) inputString = run (f return ) inputString
--               ^ ^^^^^^^^^^                  ^^^^^^^^^^^

它在源代碼中留下，但請記住， readP_to_S並run期望輸入字符串。

 -- (c) The University of Glasgow 2002

instance Functor ReadP where
  fmap h ( R f) = R (  k -> f (k . h))

這是ReadP的函數實例定義。

 > readP_to_S ( fmap toLower get) " ABC "
[( ' a ' , " BC " )]

> readP_to_S (toLower <$> get) " ABC "
[( ' a ' , " BC " )]

這使我們能夠做這樣的事情。 fmap函數在函子上映射toLower get函數等於R Get 。回想一下， Get的類型為(Char -> P a) -> P a ReadP構造函數（ R ）接受。

 -- (c) The University of Glasgow 2002

fmap h       ( R f  ) = R (  k -> f   (k . h      ))
fmap toLower ( R Get ) = R (  k -> Get (k . toLower))

在這裡，您可以看到為fmap toLower get示例。

應用P實例

查看上面，當我們只使用未終止run的Get時， readP_to_S返回[('a',"BC")] ？答案在於P的應用定義。

 -- (c) The University of Glasgow 2002

instance Applicative P where
  pure x = Result x Fail
  (<*>) = ap

return等於純的純度，因此我們可以重寫readP_to_S (R f) = run (f return)為readP_to_S (R f) = run (f pure) pure ）。通過使用return或相當pure readP_to_S集Result x Fail因為最終的案例run將會遇到。如果到達， run將終止，我們將獲得分析列表。

 > readP_to_S ( fmap toLower get) " ABC "

-- Use the functor instance to transform fmap toLower get.
> readP_to_S ( R (  k -> Get (k . toLower))) " ABC "

-- Call run which removes R.
> run ((  k -> Get (k . toLower)) pure ) " ABC "

-- Call function with pure to get rid of k.
> run ( Get ( pure . toLower)) " ABC "

-- Call run for Get case to get rid of Get.
> run (( pure . toLower) ' A ' ) " BC "

-- Call toLower with 'A' to get rid of toLower.
> run ( pure ' a ' ) " BC "

-- Use the applicative instance to transform pure 'a'.
> run ( Result ' a ' Fail ) " BC "

-- Call run for the Result case to get rid of Result.
> ( ' a ' , " BC " ) : run ( Fail ) " BC "

-- Call run for the Fail case to get rid of Fail.
> ( ' a ' , " BC " ) : []

-- Prepend.
[( ' a ' , " BC " )]

在這裡，您可以看到從readP_to_S到解析結果的流程。

替代P實例

 -- (c) The University of Glasgow 2002

instance Alternative P where
  -- ...

  -- most common case: two gets are combined
  Get f1     <|> Get f2     = Get (  c -> f1 c <|> f2 c)

  -- results are delivered as soon as possible
  Result x p <|> q          = Result x (p <|> q)
  p          <|> Result x q = Result x (p <|> q)

  -- ...

P的Alternative實例使我們能夠將解析器的流動分為左右路徑。當輸入不進行，一種或（很少）兩種方式中的兩種方式時，這很方便。

 > readP_to_S ((get >>=  a -> return a) <|> (get >> get >>=  b -> return b)) " ABC "
[( ' A ' , " BC " ),( ' B ' , " C " )]

<|>運算符或功能在解析器的流中引入叉子。解析器將穿過左右路徑。最終結果將包含所有可能的解析，以及所有可能的解析。如果這兩條路徑都失敗，則整個解析器將失敗。

請注意，在其他解析器組合器實現中，使用<|>運算符時，解析器將向左或向右，但不會兩者兼而有之。如果左派成功，右側將被忽略。右側僅在左側失敗時進行處理。

 > readP_to_S ((get >>=  a -> return [a]) <|> look <|> (get >> get >>=  a -> return [a])) " ABC "
[( " ABC " , " ABC " ),( " A " , " BC " ),( " B " , " C " )]

您可以將<|>運算符鏈接用於許多選項或替代方案。解析器將返回可能的分析。

讀取失敗

 -- (c) The University of Glasgow 2002

instance Monad ReadP where
  fail _    = R (  _ -> Fail )
  R m >>= f = R (  k -> m (  a -> let R m' = f a in m' k))

這是ReadP monad實例。注意fail的定義。

 > readP_to_S ((  a b c -> [a,b,c]) <$> get <*> get <*> get) " ABC "
[( " ABC " , " " )]

> readP_to_S ((  a b c -> [a,b,c]) <$> get <*> fail " " <*> get) " ABC "
[]

> readP_to_S (get >>=  a -> get >>=  b -> get >>=  c -> return [a,b,c]) " ABC "
[( " ABC " , " " )]

> readP_to_S (get >>=  a -> get >>=  b -> fail " " >>=  c -> return [a,b,c]) " ABC "
[]

您可以通過打電話fail導致整個解析器路徑流產。由於readP不能提供生成Result或Final情況的直接方法，因此返回值將是一個空列表。如果故障路徑是唯一的路徑，則整個結果將是一個空列表。回想一下，當run匹配Fail時，它將返回一個空列表。

 -- (c) The University of Glasgow 2002

instance Alternative P where
  -- ...

  -- fail disappears
  Fail       <|> p          = p
  p          <|> Fail       = p

  -- ...

回到替代P實例，您可以看到兩側的故障（但並非兩者都不會失敗）不會使整個解析器失敗。

 > readP_to_S (get >>=  a -> get >>=  b -> pfail >>=  c -> return [a,b,c]) " ABC "
[]

ReadP不使用fail ，而是提供pfail ，它允許您直接生成Fail情況。

版本號

Gifcurry是GIF製造商的Haskell構建的視頻編輯，它將用於各種不同的程序。為了確保兼容性，它需要為其外殼提供的每個程序的版本編號。這些程序之一是ImageMagick。

Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org
Copyright: © 1999-2018 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenCL OpenMP

在這裡，您可以看到convert --version的輸出。您如何解析此捕獲6、9、10和14？

查看輸出，我們知道版本編號是一個數字的集合，該數字由段或破折號分隔。該定義也涵蓋了日期，因此我們將確保前兩個數字分為一段時間。這樣，如果他們在版本編號之前放一個日期，我們就不會得到錯誤的結果。

1. Consume zero or more characters that are not 0 through 9 and go to 2.
2. Consume zero or more characters that are 0 through 9, save this number, and go to 3.
3. Look at the rest of the input and go to 4.
4. If the input
    - is empty, go to 6.
    - starts with a period, go to 1.
    - starts with a dash
        - and you have exactly one number, go to 5.
        - and you have more than one number, go to 1.
    - doesn't start with a period or dash
        - and you have exactly one number, go to 5.
        - you have more than one number, go to 6.
5. Delete any saved numbers and go to 1.
6. Return the numbers found.

在我們深入了解代碼之前，這是我們將要關注的算法。

構建版本號解析器

parseVersionNumber
  ::  [ String ]
  ->  ReadP [ String ]
parseVersionNumber
  nums
  = do
  _         <- parseNotNumber
  num       <- parseNumber
  let nums' = nums ++ [num]
  parseSeparator nums' parseVersionNumber

parseVersionNumber是解析版本號的輸入字符串的主要解析器組合。它接受字符串列表，並在ReadP數據類型的上下文中返回字符串列表。接收的字符串列表不是被解析的輸入，而是到目前為止發現的數字列表。對於第一個函數調用，列表為空，因為列表尚未解析任何內容。

parseVersionNumber
  nums

從頂部開始， parseVersionNumber列出了一系列字符串，這是迄今為止發現的當前數字列表。

  _         <- parseNotNumber

parseNotNumber消耗了輸入字符串中的所有數字。由於我們對結果不感興趣，因此我們將其丟棄（ _ <- ）。

  num       <- parseNumber
  let nums' = nums ++ [num]

接下來，我們消耗所有數字的所有內容，然後將其添加到到目前為止發現的數字列表中。

  parseSeparator nums' parseVersionNumber

在處理下一個parseVersionNumber之後，它將發現的數字列表和本身傳遞給了parseSeparator 。

解析分離器

parseSeparator
  ::  [ String ]
  ->  ([ String ] -> ReadP [ String ])
  ->  ReadP [ String ]
parseSeparator
  nums
  f
  = do
  next <- look
  case next of
    " "    -> return nums
    (c : _) ->
      case c of
        ' . ' -> f nums
        ' - ' -> if length nums == 1 then f [] else f nums
        _   -> if length nums == 1 then f [] else return nums

在這裡，您可以看到parseSeparator 。

  next <- look
  case next of
    " "    -> return nums
    (c : _) ->

look使我們可以在不消耗輸入字符串的情況下獲得輸入字符串的剩餘內容。如果什麼都沒有，它將返回找到的數字。但是，如果剩下一些東西，它將分析第一個字符。

      case c of
        ' . ' -> f nums
        ' - ' -> if length nums == 1 then f [] else f nums
        _   -> if length nums == 1 then f [] else return nums

如果下一個字符是一個時期，請再次致電當前的數字列表，再次致電parseVersionNumber名單。如果這是一個破折號，並且我們有一個數字，請致電parseVersionNumber列表，因為它是日期，該數字列表空白列表。如果這是破折號，我們沒有一個數字，請parseVersionNumber到目前為止發現的數字列表。否則，如果我們完全沒有一個數字，請致電帶有空列表的parseVersionNumber ，如果我們沒有一個數字，請返回數字。

解析非數字

parseNotNumber
  ::  ReadP String
parseNotNumber
  =
  munch ( not . isNumber)

parseNotNumber使用ReadP提供的munch 。給予munch ，為任何不是0到9的角色返回的謂詞(not . isNumber) 。

 munch :: ( Char -> Bool ) -> ReadP String

如果輸入字符串中的下一個字符滿足謂詞， munch連續打電話get 。如果不是這樣， munch會返回所做的字符（如果有的話）。由於它只使用get ，因此Munch總是成功。

請注意， parseNumber類似於parseNotNumber 。而not . isNumber ，謂詞只是isNumber 。

蒙克與許多

parseNotNumber'
  ::  ReadP String
parseNotNumber'
  =
  many (satisfy ( not . isNumber))

您可以不使用munch ，而可以使用many和satisfy方式編寫parseNotNumber ，這是readp提供的。查看many的類型簽名，它接受單個解析器組合器（ ReadP a ）。在這種情況下，它得到了解析器組合的satisfy 。

 > readP_to_S (satisfy ( not . isNumber)) " a "
[( ' a ' , " " )]

> readP_to_S (satisfy ( not . isNumber)) " 1 "
[]

satisfy需要一個謂詞，並使用get消耗下一個字符。如果接受的謂詞返回真實， satisfy將返回角色。否則， satisfy呼叫pfail並失敗。

 > readP_to_S (munch ( not . isNumber)) " abc123 "
[( " abc " , " 123 " )]

> readP_to_S (many (satisfy ( not . isNumber))) " abc123 "
[( " " , " abc123 " ),( " a " , " bc123 " ),( " ab " , " c123 " ),( " abc " , " 123 " )]

使用many可以給您帶來不必要的結果。最終， many引入了一個或多個Result案例。因此， many總是成功。

 > readP_to_S (many look) " abc123 "
-- Runs forever.

many會運行您的解析器，直到失敗或輸入輸入為止。如果您的解析器永遠不會失敗或從未用完輸入，那麼many將永遠不會返回。

 > readP_to_S (many (get >>=  a -> return ( read (a : " " ) :: Int ))) " 12345 "
[( [] , " 12345 " ),([ 1 ], " 2345 " ),([ 1 , 2 ], " 345 " ),([ 1 , 2 , 3 ], " 45 " ),([ 1 , 2 , 3 , 4 ], " 5 " ),([ 1 , 2 , 3 , 4 , 5 ], " " )]

對於結果中的每個索引，分析結果將是在整個輸入中運行解析器索引時間的結果。

 > let parser        = get >>=  a -> return ( read (a : " " ) :: Int )
> let many' results = return results <|> (parser >>=  result -> many' (results ++ [result]))
> readP_to_S (many' [] ) " 12345 "
[( [] , " 12345 " ),([ 1 ], " 2345 " ),([ 1 , 2 ], " 345 " ),([ 1 , 2 , 3 ], " 45 " ),([ 1 , 2 , 3 , 4 ], " 5 " ),([ 1 , 2 , 3 , 4 , 5 ], " " )]

這是many的替代定義。在<|>的左側，它返回當前解析器結果。在<|>的右側，它運行解析器，將結果添加到當前解析器的結果中，並以更新的結果自稱。這具有累積總和類型效果，其中索引i是分析器結果，將其附加到i - 1 ， i - 2 ，...和1解析器結果上。

運行版本號解析器

現在我們建造了解析器，讓我們運行它。

 > let inputString =
>     " Some Program (C) 1234-56-78 All rights reserved. n 
>     Version: 12.345.6-7 n 
>     License: Some open source license. "
> readP_to_S (parseVersionNumber [] ) inputString
[([ " 12 " , " 345 " , " 6 " , " 7 " ], " n License: Some open source license. " )]

您可以看到它即使日期之前也正確提取了版本編號。

SRT

現在，讓我們解析一些更複雜的東西-SRT文件。

為了發布gifcurry六，我需要解析SRT（子欄文本）文件。 SRT文件包含視頻處理程序用來在視頻頂部顯示文本的字幕。通常，此文本是電影的對話框，該電影翻譯成各種不同的語言。通過將文本與視頻分開，只需要一個視頻，可以節省時間，存儲空間和帶寬。視頻軟件可以交換文本而無需交換視頻。將其與刻錄或硬編碼的字幕進行對比，其中文本成為構成視頻的圖像數據的一部分。在這種情況下，您需要每個字幕集合的視頻。

^{內部視頻©Blender Foundation | www.sintel.org}

Gifcurry可以獲取SRT文件，並為您選擇的視頻切片刻錄字幕。

7
00:02:09,400 --> 00:02:13,800
What brings you to
the land of the gatekeepers?

8
00:02:15,000 --> 00:02:17,500
I'm searching for someone.

9
00:02:18,000 --> 00:02:22,200
Someone very dear?
A kindred spirit?

在這裡，您可以看到Sintel的英文字幕（©Blender Foundation | www.sintel.org）。

SRT格式

SRT也許是所有字幕格式中最基本的。

^{- SRT字幕| Matrosk}

SRT文件格式由塊組成，每個字幕一個由空線組成。

塊的頂部是索引。這決定了字幕的順序。希望字幕已經井井有條，所有這些都有獨特的索引，但情況並非如此。

01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63

在索引是開始時間，結束時間之後，並且可以選擇矩形的可選點，字幕文本應輸入。

01:04:13,000

時間戳格式為hours:minutes:seconds,milliseconds 。

注意逗號，而不是將秒數與毫秒分開的時期。

This is the actual subtitle
text. It can span multiple lines.
It may include formating
like <b>bold</b>, <i>italic</i>,
<u>underline</u>,
and <font color="#010101">font color</font>.

塊的第三部分也是最後一部分是字幕文本。它可以跨越多行，並在有空線時結束。文本可以包括格式化標籤，讓人聯想到HTML。

建造SRT解析器

parseSrt
  ::  ReadP [ SrtSubtitle ]
parseSrt
  =
  manyTill parseBlock (skipSpaces >> eof)

parseSrt是處理所有內容的主要解析器組合。它會解析每個塊，直到到達文件末尾（ eof ）或輸入。為了安全起見，最後一個塊和文件末端之間可能會有落後的空格。為了解決這個問題，它在解析文件末尾之前（ skipSpaces >> eof ）在解析Whitespace（ skipSpaces ）的零或更多字符。如果到達eof時仍留下輸入， eof將失敗，這將一無所獲。因此，重要的是， parseBlock只會留下任何東西，而是落後。

建造SRT塊解析器

parseBlock
  ::  ReadP SrtSubtitle
parseBlock
  = do
  i      <- parseIndex
  (s, e) <- parseTimestamps
  c      <- parseCoordinates
  t      <- parseTextLines
  return
    SrtSubtitle
      { index       = i
      , start       = s
      , end         = e
      , coordinates = c
      , taggedText  = t
      }

正如我們之前所經過的那樣，一個塊由索引，時間戳，可能是一些坐標和一些文本組成。在此版本的parseBlock中，您會看到記錄語法更急切的DO符號樣式。

parseBlock'
  ::  ReadP SrtSubtitle
parseBlock'
  =
      SrtSubtitle
  <$> parseIndex
  <*> parseStartTimestamp
  <*> parseEndTimestamp
  <*> parseCoordinates
  <*> parseTextLines

這是您可以編寫parseBlock另一種方式。這是應用風格。只要確保正確訂單即可。例如，我本可以不小心混合了開始和結束時間戳。

構建SRT索引解析器

parseIndex
  ::  ReadP Int
parseIndex
  =
      skipSpaces
  >>  readInt <$> parseNumber

塊的頂部是索引。在這裡，您再次看到skipSpaces 。跳過空格之後，它解析了數字的輸入並將其轉換為實際整數。

readInt
  ::  String
  ->  Int
readInt
  =
  read

readInt看起來像這樣。

 > read " 123 " :: Int
123
> read " 1abc " :: Int
*** Exception : Prelude. read : no parse

通常，直接使用read可能是危險的。 read可能無法將輸入轉換為指定類型。但是， parseNumber只會返回10個數字數字字符（ ['0'..'9'] ），因此直接使用read變得安全。

構建SRT時間戳解析器

解析時間戳比解析索引要多一點。

parseTimestamps
  ::  ReadP ( Timestamp , Timestamp )
parseTimestamps
  = do
  _   <- char ' n '
  s   <- parseTimestamp
  _   <- skipSpaces
  _   <- string " --> "
  _   <- skipSpaces
  e   <- parseTimestamp
  return (s, e)

這是解析時間戳的主要組合。

char解析您給它的角色或失敗。如果失敗，則parseTimestamps失敗，最終導致parseSrt失敗，因此索引後必須有一個newline字符。

string就像char一樣，除了僅僅是一個字符，它可以解析您給它的字符串或失敗。

parseStartTimestamp
  ::  ReadP Timestamp
parseStartTimestamp
  =
      char ' n '
  >>  parseTimestamp

parseTimestamps均解析了兩個時間戳，但是對於應用樣式（ parseSrt' ），我們僅需要一個解析器才能開始時間戳。

parseEndTimestamp
  ::  ReadP Timestamp
parseEndTimestamp
  =
      skipSpaces
  >>  string " --> "
  >>  skipSpaces
  >>  parseTimestamp

這解析了時間戳之間的所有內容，並返回末端時間戳。

parseTimestamp
  ::  ReadP Timestamp
parseTimestamp
  = do
  h  <- parseNumber
  _  <- char ' : '
  m  <- parseNumber
  _  <- char ' : '
  s  <- parseNumber
  _  <- char ' , ' <|> char ' . '
  m' <- parseNumber
  return
    Timestamp
      { hours        = readInt h
      , minutes      = readInt m
      , seconds      = readInt s
      , milliseconds = readInt m'
      }

這解析了構成時間戳的四個數字。前三個數字被結腸隔開，最後一個數字被逗號隔開。但是，要更加寬容，我們允許有一個時期而不是逗號。

 > readP_to_S (char ' . ' <|> char ' , ' ) " ... "
[( ' . ' , " .. " )]

> readP_to_S (char ' . ' <|> char ' , ' ) " ,.. "
[( ' , ' , " .. " )]

請注意，當使用<|>的char時，只有一側可以成功（兩個char Enter，一個char休假），因為char會消耗一個字符，兩個字符不能佔據相同的空間。

構建SRT坐標解析器

坐標是塊的可選部分，但如果包括，則將與時間戳在同一行。

parseCoordinates
  ::  ReadP ( Maybe SrtSubtitleCoordinates )
parseCoordinates
  =
  option Nothing $ do
    _  <- skipSpaces1
    x1 <- parseCoordinate ' x ' 1
    _  <- skipSpaces1
    x2 <- parseCoordinate ' x ' 2
    _  <- skipSpaces1
    y1 <- parseCoordinate ' y ' 1
    _  <- skipSpaces1
    y2 <- parseCoordinate ' y ' 2
    return
      $ Just
        SrtSubtitleCoordinates
          { x1 = readInt x1
          , x2 = readInt x2
          , y1 = readInt y1
          , y2 = readInt y2
          }

option需要兩個參數。如果第二個參數（解析器）失敗，則返回第一個參數。因此，如果坐標解析器失敗，那麼parseCoordinates將Nothing 。換句話說，坐標解析器失敗並不會導致整個解析器失敗。該塊將對其coordinates “字段” Nothing 。

parseCoordinate
  ::  Char
  ->  Int
  ->  ReadP String
parseCoordinate
  c
  n
  = do
  _  <- char ( Data.Char. toUpper c) <|> char ( Data.Char. toLower c)
  _  <- string $ show n ++ " : "
  parseNumber

該解析器允許坐標標籤在大寫或小寫字母中。例如， x1:1 X2:2 Y1:3 y2:4將成功。

構建SRT文本解析器

由於類似於HTML的標籤格式，解析文本是最相關的部分。

標籤解析可能具有挑戰性 - 只是問任何以正則表達方式解析它們的人。為了使我們更輕鬆，對於用戶，我們將使用標籤湯的方法。解析器將允許未串聯和/或錯誤的嵌套標籤。它還將允許任何標籤，而不僅允許b ， u ， i和font 。

parseTextLines
  ::  ReadP [ TaggedText ]
parseTextLines
  =
      char ' n '
  >>  (getTaggedText <$> manyTill parseAny parseEndOfTextLines)

我們首先要匹配Newline字符。之後，我們在字幕文本字符上函數映射或fmap（ <$> ） getTaggedText直到我們到達文本行的末尾。

parseEndOfTextLines
  ::  ReadP ()
parseEndOfTextLines
  =
  void (string " nn " ) <|> eof

當我們到達兩個Newline字符或文件末尾時，我們停止收集字符（ parseAny ）。這標誌著塊的末端。

getTaggedText
  ::  String
  ->  [ TaggedText ]
getTaggedText
  s
  =
  fst
    $ foldl
      folder
      ( [] , [] )
      parsed
  where

getTaggedText通過從左到右折疊的文本折疊，返回累積的標記文本。

    parsed
      ::  [ String ]
    parsed
      =
      case readP_to_S (parseTaggedText [] ) s of
        []      -> [s]
        r @ (_ : _) -> ( fst . last ) r

parsed返回一個或多個字符串的列表。它試圖解析標籤的輸入文本。如果失敗， parsed將返回列表中的輸入字符串。否則，如果parseTaggedText成功， parse將返回最後可能的解析（ (fst . last) r ）。

    folder
      ::  ([ TaggedText ], [ Tag ])
      ->  String
      ->  ([ TaggedText ], [ Tag ])
    folder
      (tt, t)
      x
      | isTag x   = (tt, updateTags t x)
      | otherwise = (tt ++ [ TaggedText { text = x, tags = t}], t)

當folder從從左向右移動，在解析的字符串上移動時，它會檢查當前字符串是否為標籤。如果是標籤，它將更新當前的活動標籤集（ t ）。否則，它將附加另一個與活動標籤集有關的標記文本。

updateTags
  ::  [ Tag ]
  ->  String
  ->  [ Tag ]
updateTags
  tags
  x
  | isClosingTag x = remove compare' tags (makeTag x)
  | isOpeningTag x = add    compare' tags (makeTag x)
  | otherwise      = tags
  where
    compare'
      ::  Tag
      ->  Tag
      ->  Bool
    compare'
      a
      b
      =
      name a /= name b

updateTags根據刪除或添加給定標籤（ x ）的tags更新標籤，具體取決於它是關閉或打開標籤。如果這都不是，它只是返回傳遞的標籤集。如果tags已經具有相同名稱的標籤， add將覆蓋現有標籤。您可以在給出的compare'函數中看到這一點。

為了使解析器保持簡單，如果找到了打開標籤T ，則將T添加到標籤列表中或覆蓋退出T如果已經存在）。如果找到相應的關閉/T ，則將T從標籤列表中刪除（如果存在）。連續有兩個或多個T s，一個或多個T s，一個或多個沒有關閉/T ，和 /或沒有打開T關閉/T 。

makeTag
  ::  String
  ->  Tag
makeTag
  s
  =
  Tag
    { name       = getTagName       s
    , attributes = getTagAttributes s
    }

makeTag從給定的s組裝一個標籤。每個Tag都有一個名稱和零或更多屬性。

parseTaggedText
  ::  [ String ]
  ->  ReadP [ String ]
parseTaggedText
  strings
  = do
  s <- look
  case s of
    " " -> return strings
    _  -> do
      r <- munch1 ( /= ' < ' ) <++ parseClosingTag <++ parseOpeningTag
      parseTaggedText $ strings ++ [r]

parseTaggedText返回輸入字符串分為零件。每個作品都是標籤，關閉標籤或開放標籤所包含的文本。將其分開後，它將其添加到其他碎片中，然後再次自動起訴。如果其餘的輸入字符串為空，則返回找到的字符串列表。

 > readP_to_S (string " ab " <++ string " abc " ) " abcd "
[( " ab " , " cd " )]

> readP_to_S (string " ab " +++ string " abc " ) " abcd "
[( " ab " , " cd " ),( " abc " , " d " )]

> readP_to_S (string " ab " <|> string " abc " ) " abcd "
[( " ab " , " cd " ),( " abc " , " d " )]

<++運算符的偏見意味著，如果左側成功，它甚至不會打擾右側。回想一下，當我們運行解析器時，我們將獲得所有可能的分析的列表。所有這些可能的解析都是解析器已經穿過所有可能的路徑的結果。通過使用<++ ，當且僅當左側失敗時，我們就會從左路和右路路徑接收可能的解析。如果您希望通過左側和右側的所有可能的解析，則可以使用ReadP提供的+++操作員。 +++只是<|> ，我們在上面看到了。

parseOpeningTag
  ::  ReadP String
parseOpeningTag
  = do
  _ <- char ' < '
  t <- munch1 (  c -> c /= ' / ' && c /= ' > ' )
  _ <- char ' > '
  return $ " < " ++ t ++ " > "

開頭標籤是一個開頭的角度支架，一些不包括前斜線的文本，以及下一個即時的閉合角括號。

parseClosingTag
  ::  ReadP String
parseClosingTag
  = do
  _ <- char ' < '
  _ <- char ' / '
  t <- munch1 ( /= ' > ' )
  _ <- char ' > '
  return $ " </ " ++ t ++ " > "

閉合標籤是一個開頭的角括號，前向斜線，一些文本和下一個即時的閉合角括號。

getTagAttributes
  ::  String
  ->  [ TagAttribute ]
getTagAttributes
  s
  =
  if isOpeningTag s
    then
      case readP_to_S (parseTagAttributes [] ) s of
        []    -> []
        (x : _) -> fst x
    else
      []

開放標籤可以具有屬性。例如， <font color="#101010"> 。每個屬性都是兩個鍵盤，鑰匙值對。在上面的示例中， color將是鑰匙， #101010是值。

getTagName
  ::  String
  ->  String
getTagName
  s
  =
  case readP_to_S parseTagName s of
    []    -> " "
    (x : _) -> toLower' $ fst x

這將返回小寫的標籤名稱。

parseTagName
  ::  ReadP String
parseTagName
  = do
  _ <- char ' < '
  _ <- munch ( == ' / ' )
  _ <- skipSpaces
  n <- munch1 (  c -> c /= ' ' && c /= ' > ' )
  _ <- munch  ( /= ' > ' )
  _ <- char ' > '
  return n

標籤名稱是開頭角括號之後的第一個非Whitespace字符字符串，可能的前向斜線以及一些可能的空格以及在其他空間和/或閉合角括號之前。

parseTagAttributes
  ::  [ TagAttribute ]
  ->  ReadP [ TagAttribute ]
parseTagAttributes
  tagAttributes
  = do
  s <- look
  case s of
    " " -> return tagAttributes
    _  -> do
      let h = head s
      case h of
        ' > ' -> return tagAttributes
        ' < ' -> trimTagname >> parseTagAttributes'
        _   -> parseTagAttributes'
  where
    parseTagAttributes'
      ::  ReadP [ TagAttribute ]
    parseTagAttributes'
      = do
      tagAttribute <- parseTagAttribute
      parseTagAttributes
        ( add
            (  a b -> fst a /= fst b)
            tagAttributes
            tagAttribute
        )

parseTagAttributes遞歸通過輸入字符串，收集鍵值對。在標籤（ < ）的開頭，它首先在處理屬性之前先修剪標籤名稱。當它到達閉合角括號（ > ）時，它會停止解析屬性。如果標籤恰好具有重複的屬性（基於密鑰）， add將確保僅保留最新的屬性。

trimTagname
  :: ReadP ()
trimTagname
  =
      char ' < '
  >> skipSpaces
  >> munch1 (  c -> c /= ' ' && c /= ' > ' )
  >> return ()

這修剪或放棄標籤名稱。

parseTagAttribute
  ::  ReadP TagAttribute
parseTagAttribute
  = do
  _ <- skipSpaces
  k <- munch1 ( /= ' = ' )
  _ <- string " = " "
  v <- munch1 ( /= ' " ' )
  _ <- char ' " '
  _ <- skipSpaces
  return (toLower' k, v)

屬性鍵是在均等符號之前的任何非Whitespace字符字符串。屬性值是在均等符號和雙引號之後以及下一個即時雙引號之前的任何字符。

isTag
  ::  String
  ->  Bool
isTag
  s
  =
  isOpeningTag s || isClosingTag s

字符串是標籤，如果它是打開標籤或關閉標籤。

isOpeningTag
  ::  String
  ->  Bool
isOpeningTag
  s
  =
  isPresent $ readP_to_S parseOpeningTag s

如果開放標籤解析器成功，則字符串是開頭標籤。

isClosingTag
  ::  String
  ->  Bool
isClosingTag
  s
  =
  isPresent $ readP_to_S parseClosingTag s

如果關閉標籤解析器成功，則字符串是關閉標籤。

運行SRT解析器

現在我們已經組裝了解析器，讓我們嘗試一下。

 > let srt =
>       " 1 n 
>       :0:0,1 --> 0:1:0.2  x1:1 X2:3  y1:4 y2:10 n 
>       <font color= " red " color= " blue " >This is some <b><u><i> n 
>       subtitle n 
>       </u>text.</b>  "
> readP_to_S parseSrt srt
[([ SrtSubtitle
      { index = 1
      , start = Timestamp {hours = 0 , minutes = 0 , seconds = 0 , milliseconds = 1 }
      , end   = Timestamp {hours = 0 , minutes = 1 , seconds = 0 , milliseconds = 2 }
      , coordinates = Just ( SrtSubtitleCoordinates {x1 = 1 , x2 = 3 , y1 = 4 , y2 = 10 })
      , taggedText =  [ TaggedText
                        { text = " This is some "
                        , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                 ]
                        }
                      , TaggedText
                          { text = " n subtitle n "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " b " ,    attributes = [] }
                                   , Tag {name = " u " ,    attributes = [] }
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      , TaggedText
                          { text = " text. "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " b " ,    attributes = [] }
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      , TaggedText
                          { text = "  "
                          , tags = [ Tag {name = " font " , attributes = [( " color " , " blue " )]}
                                   , Tag {name = " i " ,    attributes = [] }
                                   ]
                          }
                      ]
      }
  ]
, " "
)]

在這裡，您可以看到解析測試字符串的結果。注意測試字符串中的錯誤，例如使用周期而不是逗號或重複的標籤屬性。

練習

編寫一個可以將SRT文件轉換為JSON文件的程序。
使用PARSEC而不是readP重寫版本號解析器。
使用PARSEC而不是readP重寫SRT解析器。

版權

（c）2019 David Lettier
lettier.com

展開