Matching bytestrings in Parsec - parsec

I am currently trying to use the Full CSV Parser presented in Real World Haskell. In order to I tried to modify the code to use ByteString instead of String, but there is a string combinator which just works with String.
Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?
I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.

I'm assuming you're starting with something like
import Prelude hiding (getContents, putStrLn)
import Data.ByteString
import Text.Parsec.ByteString
Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.
Something I noticed along the way:
If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.
Thus, the easy version of the CSV parser, which does exactly that:
import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (getContents)
import Text.Parsec
import Text.Parsec.ByteString
csvFile :: Parser [[String]]
csvFile = endBy line eol
line :: Parser [String]
line = sepBy cell (char ',')
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\n\r")
quotedCell :: Parser String
quotedCell =
do _ <- char '"'
content <- many quotedChar
_ <- char '"' <?> "quote at end of cell"
return content
quotedChar :: Parser Char
quotedChar =
noneOf "\""
<|> try (string "\"\"" >> return '"')
eol :: Parser String
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
parseCSV :: ByteString -> Either ParseError [[String]]
parseCSV = parse csvFile "(unknown)"
main :: IO ()
main =
do c <- ByteString.getContents
case parse csvFile "(stdin)" c of
Left e -> do Prelude.putStrLn "Error parsing input:"
print e
Right r -> mapM_ print r
But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.
But, it compiles and has complete type annotations, and therefore should make a sound starting point.
import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import Control.Monad (liftM)
import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (pack, getContents)
import qualified Data.ByteString.Char8 as Char8 (pack)
import Data.Word (Word8)
import Data.ByteString.Internal (c2w)
import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
import Text.Parsec.ByteString
import Text.Parsec.Prim (tokens, tokenPrim)
import Text.Parsec.Pos (updatePosChar, updatePosString)
import Text.Parsec.Error (ParseError)
csvFile :: Parser [[ByteString]]
csvFile = endBy line eol
line :: Parser [ByteString]
line = sepBy cell (char ',')
cell :: Parser ByteString
cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))
quotedCell :: Parser ByteString
quotedCell =
do _ <- char '"'
content <- many quotedChar
_ <- char '"' <?> "quote at end of cell"
return (ByteString.pack content)
quotedChar :: Parser Word8
quotedChar =
noneOf "\""
<|> try (string "\"\"" >> return (c2w '"'))
eol :: Parser ByteString
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
parseCSV :: ByteString -> Either ParseError [[ByteString]]
parseCSV = parse csvFile "(unknown)"
main :: IO ()
main =
do c <- ByteString.getContents
case parse csvFile "(stdin)" c of
Left e -> do Prelude.putStrLn "Error parsing input:"
print e
Right r -> mapM_ print r
-- replacements for some of the functions in the Parsec library
noneOf :: String -> Parser Word8
noneOf cs = satisfy (\b -> b `notElem` [c2w c | c <- cs])
char :: Char -> Parser Word8
char c = byte (c2w c)
byte :: Word8 -> Parser Word8
byte c = satisfy (==c) <?> show [c]
satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy f = tokenPrim (\c -> show [c])
(\pos c _cs -> updatePosChar pos c)
(\c -> if f (c2w c) then Just (c2w c) else Nothing)
string :: String -> Parser ByteString
string s = liftM Char8.pack (tokens show updatePosString s)
Probably your concern, efficiency-wise, should be those two ByteString.pack instructions, in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that instead of “making strict ByteStrings an instance of Stream with Char token type,” you make ByteStrings an instance of Stream with Word8 token type, but this won't help you with efficiency, it will just give you a headache trying to reimplement all the sourcePos functions to keep track of your position in the input for error messages.
No, the way to make it more efficient would be to change the types of char, quotedChar and string to Parser [Word8] and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:
cell :: Parser [Word8]
cell = quotedCell <|> many (noneOf ",\n\r")
quotedCell :: Parser [Word8]
quotedCell =
do _ <- char '"'
content <- many quotedChar
_ <- char '"' <?> "quote at end of cell"
return content
string :: String -> Parser [Word8]
string s = [c2w c | c <- (tokens show updatePosString s)]
You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.
If this doesn't answer your question, please say what would.

I don't believe so. You will need to create one yourself using tokens. Although the documentation for it is a bit... nonexistent, the first two arguments are a function to use to show the expected tokens in an error message and a function to update the source position that will be printed in errors.

Related

Format a string in Elm

I have a list of string and generate it to HTML dynamically with li tag. I want to assign that value to id attribute as well. But the problem is the string item has some special characters like :, ', é, ... I just want the output to include the number(0-9) and the alphabet (a-z) only.
// Input:
listStr = ["Pop & Suki", "PINK N' PROPER", "L'Oréal Paris"]
// Output:
result = ["pop_suki", "pink_n_proper", "loreal_paris"] ("loral_paris" is also good)
Currently, I've just lowercased and replace " " to _, but don't know how to eliminate special character.
Many thanks!
Instead of thinking of it as eliminating special characters, consider the permitted characters – you want just lower-case alphanumeric characters.
Elm provides Char.isAlphaNum to test for alphanumeric characters, and Char.toLower to transform a character to lower case. It also provides the higher function String.foldl which you can use to process a String one Char at a time.
So for each character:
check if it's alphanumeric
if it is, transform it to lower case
if not and it is a space, transform it to an underscore
else drop the character
Putting this together, we create a function that processes a character and appends it to the string processed so far, then apply that to all characters in the input string:
transformNextCharacter : Char -> String -> String
transformNextCharacter nextCharacter partialString =
if Char.isAlphaNum nextCharacter then
partialString ++ String.fromChar (Char.toLower nextCharacter)
else if nextCharacter == ' ' then
partialString ++ "_"
else
partialString
transformString : String -> String
transformString inputString =
String.foldl transformNextCharacter "" inputString
Online demo here.
Note: This answer simply drops special characters and thus produces "loral_paris" which is acceptable as per the OP.
The answer that was ticked is a lot more efficient than the code I have below. Nonetheless, I just want to add my code as an optional method.
Nonetheless, if you want to change accents to normal characters, you can install and use the elm-community/string-extra package. That one has the remove accent method.
This code below is inefficient as you keep on calling library function on the same string of which all of them would go through your string one char at a time.
Also, take note that when you remove the & in the first index you would have a double underscore. You would have to replace the double underscore with a single underscore.
import Html exposing (text)
import String
import List
import String.Extra
import Char
listStr = ["Pop & Suki", "PINK N' PROPER", "L'Oréal Paris"]
-- True if alpha or digit or space, otherwise, False.
isDigitAlphaSpace : Char -> Bool
isDigitAlphaSpace c =
if Char.isAlpha c || Char.isDigit c || c == ' ' then
True
else
False
main =
List.map (\x -> String.Extra.removeAccents x --Remove Accents first
|> String.filter isDigitAlphaSpace --Remove anything that not digit alpha or space
|> String.replace " " "_" --Replace space with _
|> String.replace "__" "_" --Replace double __ with _
|> String.toLower) listStr --Turn the string to lower
|> Debug.toString
|> Html.text

Print elements in list in OCaml

I want to write a tail recursive function to print elements in a string list in separate lines like this:
# printlist ["a";"b";"c"];;
a
b
c
- : unit = ()
# printlist ["hello";"thanks"];;
hello
thanks
- : unit = ()
I was able to get it to work using print_endline with no problem:
let rec printlist strlist =
match strlist with
| [] -> print_endline ""
| hd::[] -> print_endline hd
| hd :: tl -> print_endline hd ; printlist tl;;
However, as soon as I switch to printf, it doesn't work anymore. What's wrong with my printf version?
let rec printlist strlist =
match strlist with
| [] -> printf ""
| hd::[] -> printf hd
| hd :: tl -> printf "%s\n" hd ; printlist tl;;
Error: This expression has type
(unit, out_channel, unit) format =
(unit, out_channel, unit, unit, unit, unit)
CamlinternalFormatBasics.format6
but an expression was expected of type string
In essence, you're trying to use printf without a format. The first argument of printf has to be a constant string. So you should have this:
printf "%s" hd
rather than this:
printf hd
To see why this is required, imagine what would happen if some of the strings in your input contained percent characters. Things would get out of control (type-wise) quite quickly.
In addition to Jeffrey's answer, I would suggest you use the standard library more in order to write more concise code.
List.iter, for example, calls a given function on all the elements of the list:
let print_list l = List.iter (fun e -> Printf.printf "%s\n" e) l
Using partial application smartly, you can make this line even shorter and more readable:
let print_list = List.iter (Printf.printf "%s\n")
The only difference with your function is the newline after the last element.
On the other hand, instead of printing elements one after another, a more functional and idiomatic approach would be to build the whole string first, and then print it.
Lucky you, the standard library got you covered. String.concat joins the elements in a string list together in one big string. You can also specify a string to use as a separator, and you don't have to worry about the newline after the last element.
let print_list l = print_string (String.concat "\n" l)

Elm: String.toFloat doesn't work with comma only with point - what to do?

I'm very new to elm and i want to do a simple mileage counter app.
If i get "1.2" (POINT) form input - String.toFloat returns in the OK branch with 1.2 as a number.
But if i get "1,2" (COMMA) form input, then String.toFloat returns in the Err branch with "You can't have words, only numbers!"
This pretty much works like a real time validator.
The code:
TypingInInput val ->
case String.toFloat val of
Ok success ->
{ model | inputValue = val, errorMessage = Nothing }
Err err ->
{ model | inputValue = val, errorMessage = Just "You can't have words, or spaces, only numbers!" }
.
Question: So how can i force String.toFloat of "1,2" to give me 1.2 the number?
Unfortunately the source for toFloat is hardcoded to only respect a dot as decimal separator. You can replace the comma with a dot in the string prior to passing it to toFloat as a workaround.
String.Extra.replace can be used for the simple string replacement.
The implementation of String.toFloat only supports a dot as a separator.
You should replace commas first before parsing the Float
Please see the example:
import Html exposing (text)
import String
import Regex
main =
"1,2"
|> Regex.replace Regex.All (Regex.regex ",") (\_ -> ".")
|> String.toFloat
|> toString
|> text -- 1.2
In JavaScript parseFloat doesn't support comma separator either.

functional programming : understand parser combinator

I was trying to solve a problem using a parser combinator. I tried the following:
Note: the below code uses the combinator library
styleParserItalic : Bool -> Parser ( List (List Char , Style))
styleParserItalic bolded =
let
style = if bolded then Italic else Unstyled
in
(end `andThen` always ( succeed ( [] )))
<|> (string "(!ITALIC!)" `andThen` \_ -> styleParserItalic ( not bolded ) )
<|> ( anyChar `andThen` \c -> styleParserItalic bolded `andThen` \cs -> succeed ((c :: [],style) :: cs) )
I am struggling to understand how this parser runs since the styleParserItalic parser is called before the parser succeeds.
Could someone explain how the parser works when it is given a string of characters?
If someone is interested in the purpose of the parser and the full code, here is my previous question.
Here is what I have understood thus far
The parser will check first if it is the end of a line , if not it will try to parser the string (!ITALIC!) if that the case then it will call the parser with with parameter True or false (if false then will makes it true ..)
If the parser does not find the the string (!ITALIC!) it will try to parse any character then it will call the parser again.
What confuses me is that the parser will keep calling itself as long as it succeeds with parsing any character!
edit :* NOTE THE BELOW IS NOT A PART OF THE QUESTION, JUST TO SHARE THE CODE IF SOMEONE IS INTERESTED
thanks for all responses, I have updated the parser to parse Bold italic underline..., as per the below screen shot
type Style = Bold| Unstyled | Italic | Coded | Lined | Titled | Marked | Underline
styleParser : Bool ->Bool ->Bool ->Bool-> Bool-> Bool->Bool
-> Parser ( List (List Char , (Style,Style,Style,Style,Style,Style,Style)))
--(bold,italic ,code,line ,Titled,mark)
styleParser bolded italiced coded lined titled marked underlined=
let
style = (
if bolded then Bold else Unstyled
,if italiced then Italic else Unstyled
,if coded then Coded else Unstyled
,if lined then Lined else Unstyled
,if titled then Titled else Unstyled
,if marked then Marked else Unstyled
,if underlined then Underline else Unstyled
)
in
(end `andThen` always ( succeed ( [] )))
<|> (string "//" `andThen` \_ -> styleParser bolded italiced coded lined titled marked (not underlined))
<|> (string "**" `andThen` \_ -> styleParser (not bolded) italiced coded lined titled marked underlined)
<|> (string "*" `andThen` \_ -> styleParser bolded (not italiced) coded lined titled marked underlined)
<|> (string "`" `andThen` \_ -> styleParser bolded italiced (not coded) lined titled marked underlined)
<|> (string "/br" `andThen` \_ -> styleParser bolded italiced coded (not lined) titled marked underlined)
<|> (string "/*" `andThen` \_ -> styleParser bolded italiced coded lined (not titled) marked underlined)
<|> (string "{-" `andThen` \_ -> styleParser bolded italiced coded lined titled (not marked) underlined)
<|> ( anyChar `andThen` \c -> styleParser bolded italiced coded lined titled marked underlined `andThen` \cs -> succeed ((c :: [],style) :: cs) )
foldStyleHtml : List ( List Char , ( Style,Style,Style,Style,Style,Style,Style) ) -> List (Html Msg)
foldStyleHtml lst =
List.map styleToHtml lst
styleToHtml : ( List Char, (Style ,Style,Style,Style,Style,Style,Style)) -> Html Msg
styleToHtml (a,b) =
case b of
(Bold,Italic,_,_,_,_,Unstyled) -> strong [] [em [][ text (String.fromList a)]]
(Bold,Italic,_,_,_,_,Underline) -> u[][ strong [] [em [][ text (String.fromList a)]]]
(Bold,Unstyled,_,_,_,_,Underline) -> u[][ strong [] [text (String.fromList a)]]
(Unstyled,Italic,_,_,_,_,Underline) -> u[][ em [] [text (String.fromList a)]]
(Unstyled,Italic,_,_,_,_,_) -> em[] [text (String.fromList a)]
(Bold,Unstyled,_,_,_,_,_) -> strong [][ text (String.fromList a)]
(_,_,Coded,_,_,_,_) -> code [codeStyle ][text (String.fromList a)]
(_,_,_,Lined,_,_,_) -> br [][text " "]
-- (_,_,_,_,Titled,_,_) -> div [][text (String.fromList a)]
(_,_,_,_,_,Marked,_) -> mark [][text (String.fromList a)]
(_,_,_,_,_,_,Underline) -> u [][text (String.fromList a)]
(_,_,_,_,_,_,_) -> text (String.fromList a)
htmlParser : Parser (List (Html Msg))
htmlParser =
styleParser False False False False False False False `andThen` (succeed << foldStyleHtml )
runParser : Parser (List (Html Msg)) -> String -> Html Msg
runParser parser str =
case parse parser str of
(Ok htmls,_)-> div [] htmls
(Err err, _) -> div [ style [("color", "red")] ] [ text <| toString <| err]
Parser combinators (generally) consume input as they succeed. In this library, if string "(!ITALIC!)" fails, it will not consume any input. Since the <|> combinator is used, it then tries to use the next part of the code that starts with anyChar.
When anyChar succeeds, it consumes that single character and captures it in c after andThen. Then the remaining string (everything but the character captured by anyChar) is then "crawled" when the recursive call to styleParserItalic bolded is made. That second andThen captures the output of the recursive combinator into cs and prepends the captured character onto the rest of the list of characters from the recursive call.
I think the important part to remember is that the combinators consume input as they succeed and (generally) don't consume input when they fail.
First a few simplications...
(1) Every element of the inner List in the signature:
styleParserItalic : Bool -> Parser ( List (List Char , Style))
^^^^^^^^^
is just a single character. By just removing :: [] from the last line,
<|> ... `andThen` \cs -> succeed ((c ,style) :: cs) )
^^^
removed `:: []`
you can make it have this signature.
(2) Note that the bolded argument only affects the Style - it has not effect on control flow. It really should be called italic since the style appearing in the output is Italic if the argument is True and Unstyled otherwise.
Also note that once this parameter is set the True, it will remain True in
all subsequent recursive calls.
So now the algorithm is:
If at the end of the line, return the empty list.
If at (!ITALIC!), use Italic for the style in the remainder of the parse.
Otherwise, parse a character, parse the remainder of the line and concatenate the results.
An approximate Python algorithm would be something like:
def parseLine(style, line):
result = []
while line:
if line.startsWith('(!ITALIC!)'):
line = line[8:]
style = Italic
# loop around
else:
result.append( (line[0], style) )
line = line[1:]
return result

Testing functions in Haskell that do IO

Working through Real World Haskell right now. Here's a solution to a very early exercise in the book:
-- | 4) Counts the number of characters in a file
numCharactersInFile :: FilePath -> IO Int
numCharactersInFile fileName = do
contents <- readFile fileName
return (length contents)
My question is: How would you test this function? Is there a way to make a "mock" input instead of actually needing to interact with the file system to test it out? Haskell places such an emphasis on pure functions that I have to imagine that this is easy to do.
You can make your code testable by using a type-class-constrained type variable instead of IO.
First, let's get the imports out of the way.
{-# LANGUAGE FlexibleInstances #-}
import qualified Prelude
import Prelude hiding(readFile)
import Control.Monad.State
The code we want to test:
class Monad m => FSMonad m where
readFile :: FilePath -> m String
-- | 4) Counts the number of characters in a file
numCharactersInFile :: FSMonad m => FilePath -> m Int
numCharactersInFile fileName = do
contents <- readFile fileName
return (length contents)
Later, we can run it:
instance FSMonad IO where
readFile = Prelude.readFile
And test it too:
data MockFS = SingleFile FilePath String
instance FSMonad (State MockFS) where
-- ^ Reader would be enough in this particular case though
readFile pathRequested = do
(SingleFile pathExisting contents) <- get
if pathExisting == pathRequested
then return contents
else fail "file not found"
testNumCharactersInFile :: Bool
testNumCharactersInFile =
evalState
(numCharactersInFile "test.txt")
(SingleFile "test.txt" "hello world")
== 11
This way your code under test needs very little modification.
As Alexander Poluektov already pointed out, the code you are trying to test can easily be separated into a pure and an impure part.
Nevertheless I think it is good to know how to test such impure functions in haskell.
The usual approach to testing in haskell is to use quickcheck and that's what I also tend to use for impure code.
Here is an example of how you might achieve what you are trying to do which gives you kind of a mock behavior * :
import Test.QuickCheck
import Test.QuickCheck.Monadic(monadicIO,run,assert)
import System.Directory(removeFile,getTemporaryDirectory)
import System.IO
import Control.Exception(finally,bracket)
numCharactersInFile :: FilePath -> IO Int
numCharactersInFile fileName = do
contents <- readFile fileName
return (length contents)
Now provide an alternative function (Testing against a model):
numAlternative :: FilePath -> IO Integer
numAlternative p = bracket (openFile p ReadMode) hClose hFileSize
Provide an Arbitrary instance for the test environment:
data TestFile = TestFile String deriving (Eq,Ord,Show)
instance Arbitrary TestFile where
arbitrary = do
n <- choose (0,2000)
testString <- vectorOf n $ elements ['a'..'z']
return $ TestFile testString
Property testing against the model (using quickcheck for monadic code):
prop_charsInFile (TestFile string) =
length string > 0 ==> monadicIO $ do
(res,alternative) <- run $ createTmpFile string $
\p h -> do
alternative <- numAlternative p
testRes <- numCharactersInFile p
return (testRes,alternative)
assert $ res == fromInteger alternative
And a little helper function:
createTmpFile :: String -> (FilePath -> Handle -> IO a) -> IO a
createTmpFile content func = do
tempdir <- catch getTemporaryDirectory (\_ -> return ".")
(tempfile, temph) <- openTempFile tempdir ""
hPutStr temph content
hFlush temph
hClose temph
finally (func tempfile temph)
(removeFile tempfile)
This will let quickCheck create some random files for you and test your implementation against a model function.
$ quickCheck prop_charsInFile
+++ OK, passed 100 tests.
Of course you could also test some other properties depending on your usecase.
* Note about the my usage of the term mock behavior:
The term mock in the object oriented sense is perhaps not the best here. But what is the intention behind a mock?
It let's you test code that needs access to a resource that usually is
either not available at testing time
or is not easily controllable and thus not easy to verify.
By shifting the responsibility of providing such a resource to quickcheck, it suddenly becomes feasible to provide an environment for the code under test that can be verified after a test run.
Martin Fowler describes this nicely in an article about mocks :
"Mocks are ... objects pre-programmed with expectations which form a specification of the calls they are expected to receive."
For the quickcheck setup I'd say that files generated as input are "pre-programmed" such that we know about their size (== expectation). And then they are verified against our specification (== property).
For that you will need to modify the function such that it becomes:
numCharactersInFile :: (FilePath -> IO String) -> FilePath -> IO Int
numCharactersInFile reader fileName = do
contents <- reader fileName
return (length contents)
Now you can pass any mock function that takes a file path and return IO string such as:
fakeFile :: FilePath -> IO String
fakeFile fileName = return "Fake content"
and pass this function to numCharactersInFile.
The function consists from two parts: impure (reading part content as String) and pure (calculating the length of String).
The impure part cannot be "unit"-tested by definition. The pure part is just call to the library function (and of course you can test it if you want :) ).
So there is nothing to mock and nothing to unit-test in this example.
Put it another way. Consider you have an equal C++ or Java implementation (*): reading content and then calculating its length. What would you really want to mock and what would remain for testing afterwards?
(*) which is of course not the way you will do in C++ or Java, but that's offtopic.
Based on my layman's understanding of Haskell, I've come to the following conclusions:
If a function makes use of the IO monad, mock testing is going to be impossible. Avoid hard-coding the IO monad in your function.
Make a helper version of your function that takes in other functions that may do IO. The result will look like this:
numCharactersInFile' :: Monad m => (FilePath -> m String) -> FilePath -> m Int
numCharactersInFile' f filePath = do
contents <- f filePath
return (length contents)
numCharactersInFile' is now testable with mocks!
mockFileSystem :: FilePath -> Identity String
mockFileSystem "fileName" = return "mock file contents"
Now you can verify that numCharactersInFile' returns the the expected results w/o IO:
18 == (runIdentity . numCharactersInFile' mockFileSystem $ "fileName")
Finally, export a version of your original function signature for use with IO
numCharactersInFile :: IO Int
numCharactersInFile = NumCharactersInFile' readFile
So, at the end of the day, numCharactersInFile' is testable with mocks. numCharactersInFile is just a variation of numCharactersInFile'.