Sunday, April 30, 2017

A small Tetris-like clone using J and ncurses. Part 1

For me, the J programming language it's a very intriguing. It is full of ideas and concepts that I'm not familiar with. Getting to know a programming language it's not only about learning the syntax. It is learning the techniques that people use to take advantage of it what gives you more insight . This is particularly true for J.

For me the best way to learn more about a programming language is to try to solve a small problem with it. In this post I'm going to describe an attempt to write a small and incomplete Tetris-like clone using J and the ncurses library. Here's a preview of how it looks:

Tetriminos

According to the Wikipedia page for Tetris, the pieces are named Tetriminos https://en.wikipedia.org/wiki/Tetris#Gameplay. Each piece is composed of blocks. In J we can represent this pieces as matrices.

To create this matrices we use the Shape verb ($) For example:

  • The "L" tetrimino:
   ] l_tetrimino =. 2 3 $ 1 1 1 1 0 0 
1 1 1
1 0 0
  • The "O" tetrimino:
   ] b_tetrimino =. 2 2 $ 4 4 4 4 
4 4
4 4
  • The "S" tetrimino:
   ] s_tetrimino =. 2 3 $ 0 5 5 5 5 0 
0 5 5
5 5 0

Tetrimino rotation

In Tetris pressing the 'up' arrow is going to rotate the current piece. We can use matrix Transpose (|:) and Reverse (|.) verbs and compose them together using the Atop conjunction (@). Here's the definition:

rotate =: |.@|:

Here we can see how this verb works:

   l_tetrimino
1 1 1
1 0 0
   rotate l_tetrimino
1 0
1 0
1 1
   rotate rotate l_tetrimino
0 0 1
1 1 1
   rotate rotate rotate l_tetrimino
1 1
0 1
0 1
   rotate rotate rotate rotate l_tetrimino
1 1 1
1 0 0

We can apply this transformation to the other tetriminos for example:

   ] s_tetrimino =. 2 3 $ 0 5 5 5 5 0 
0 5 5
5 5 0
   rotate s_tetrimino
5 0
5 5
0 5
   rotate rotate s_tetrimino
0 5 5
5 5 0

We use different numbers for each tetrimino so we can use different colors to paint them.

Tetrimino placement

A natural way to model the game is to use a matrix representing the playing field. We use a 10 columns by 20 rows matrix for this effect. We use the Shape verb ($) to do this:

   ] game =:  20 10 $ 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

A fundamental piece of functionality that we need is a way to put a tetrimino inside this matrix. This proved to be tricky (maybe because of lack of J knowledge). We're going to incrementally create this verb.

Reading the J documentation, it seems that we can use the Amend (m } _ _ _) verb to change just a set of cells of the game matrix. Here's an example on how to use this verb

   ] sample =. 5 5 $ 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

   1 2 3 4 (1 1;2 1;1 2;2 2) } sample
0 0 0 0 0
0 1 3 0 0
0 2 4 0 0
0 0 0 0 0
0 0 0 0 0

What we are saying here is that we can to change the following cells in sample:

  • row 1 column 1 with value 1
  • row 2 column 1 with value 2
  • row 1 column 2 with value 3
  • row 2 column 2 with value 4

Now to take advantage of this verb we need to calculate the target coordinates to change the value of a tetrimino. First we start by generating coordinates for each of the cells of the tetrimino.

We're going to use the following predifined values:

   ] l_tetrimino =. 2 3 $ 1 1 1 1 0 0 
1 1 1
1 0 0
   sample
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

We start by determining how many rows and columns. We use the Shape of verb ($) to do this:

   $ l_tetrimino
2 3

Now we want to generate (0, 0);(0, 1);...(1, 2);(2, 2). With the result of Shape of we generate a sequence of numbers for each of the axis. To do this we use the Integers (i.) verbwith the number of rows and the number of columns. For example:

   NB. Get the number of rows:
   (0&{@$) l_tetrimino
2
   NB. Get the number of columns:
   (1&{@$) l_tetrimino
3
   NB. Get an integer sequence from zero to number of rows or columns
   i.(1&{@$) l_tetrimino
0 1 2
   i.(0&{@$) l_tetrimino
0 1

Now this is very cool, we can use the Table verb (/)to pair this two arrays. From the documentation:

In general, each cell of x is applied to the entire of y . Thus x u/ y is equivalent to x u"(lu,_) y where lu is the left rank of u .

This is very important!. To taken advantage of this we need to use the Append verb (,) but changing the rack to operate on each of the elements from the right argument. See this example:

   0 (,"0) 0 1 2
0 0
0 1
0 2

Now we can take advantage of this and write:

   (  (i.@(0&{@$)) (<@,"0)/ (i.@(1&{@$))) l_tetrimino 
+---+---+---+
|0 0|0 1|0 2|
+---+---+---+
|1 0|1 1|1 2|
+---+---+---+

Now this is almost what we need. We can use the Ravel veb (,) to flatten this box:


, (  (i.@(0&{@$)) (<@,"0)/ (i.@(1&{@$))) l_tetrimino
+---+---+---+---+---+---+
|0 0|0 1|0 2|1 0|1 1|1 2|
+---+---+---+---+---+---+

With this positions we can use the Amend verb to change our game matrix:

   (,l_tetrimino) positions } sample
1 1 1 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

We need something else since this technique only allows us to put the tetrimino at the top of the matrix. In order to do this we need to sum the target position to the coordinates that we generated. We can use the powerful Under verb (&.) which allows us to apply an operation to each of the cells of a box.

   (3 2)&+ &.> positions
+---+---+---+---+---+---+
|3 2|3 3|3 4|4 2|4 3|4 4|
+---+---+---+---+---+---+

We construct this operation by:

  1. using Bond conjuction (&) to tie together the position (3 2) with the plus operation (+) . That is (3 2)&+ .
  2. we apply this operation to each of the elements of the box and then assemble the box again. That is &.>

Now we can put the tetrimino in row 3, column 2.

   target_position =. 3 2
   target_position =. 2 1
   (,l_tetrimino) (target_position&+&.>positions) } sample
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
0 1 1 1 0
0 0 0 0 0

We cannot just pust the tetrimino in the target position. It may also "blend" with existing values. For example say the following game field and the following tetrminio:

   ] game
0 0 0 0 0
0 0 0 0 0
0 0 0 1 0
0 0 0 1 0
0 0 0 1 1

positions =., (  (i.@(0&{@$)) (<@,"0)/ (i.@(1&{@$))) tetrimino

   (,tetrimino) ((1 2)&+&.> positions) } game
0 0 0 0 0
0 0 1 1 0
0 0 1 0 0
0 0 1 0 0
0 0 0 1 1

Because of this we need to mix the tetrimino with the current values of the target region. We do this by extracting the values of the target position:

   ($tetrimino) $ ((1 2)&+&.> positions) { game
0 0
0 1
0 1

We can combine this array with the tetrimino and we get the desired target value:

   ] target_tetrimino =. +&tetrimino ($tetrimino) $ ((1 2)&+&.> positions) { game
1 1
1 1
1 1

   (,target_tetrimino) ((1 2)&+&.> positions) } game
0 0 0 0 0
0 0 1 1 0
0 0 1 1 0
0 0 1 1 0
0 0 0 1 1

The final verb definition looks like this:

put_in_matrix =: 3 : 0
 NB. unpack argumets
 tetrimino =. > 0 { y
 i =. > 1 { y
 j =. > 2 { y
 game =. > 3 { y

 NB. calculate target positions 
 positions =. , ((i.@(0&{)@$)(<@,"0)/(i.@(1&{)@$)) tetrimino

 NB. combine tetrimino with target section
 tetrimino =. +&tetrimino ($tetrimino) $ ((+&(i,j))&.> positions) { game
  
 NB. change matrix
 (,tetrimino) ((+&(i,j))&.> positions)} game
)

Checking if space is available

The other piece of functionality that we need is a way to verify if we can put the tetrimino in a target position. We need to verify two conditions: 1. We can put the tetrimino inside the game field. 2. There's space available in the target position.

To check the boundaries we use common comparison operators:

 NB. Verify field boundaries
 is_inside =. (xpos >: 0) *. (ypos >: 0) *. ( (xpos+tetrimino_width - 1) < game_width) *. ((ypos+tetrimino_height - 1) < game_height)

The second criteria it's more intersesting. To illustrate how we did the detection we're going to start with a predifined game field:

   game
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 1 0
0 0 0 0 0
   ] tetrimino =. 2 3 $ 2 0 0 2 2 2
2 0 0
2 2 2

The first step is to reset the values of the tetrimino to be either zero or one:

   ] tetrimino =. 0&< tetrimino
1 0 0
1 1 1

Now we extract the elements of the target position (in this example column 1, row 3)

   ypos =. 3
   xpos =. 1
   positions =. , ((i.@(0&{)@$)(<@,"0)/(i.@(1&{)@$)) tetrimino
   ($tetrimino) $ ((+&(ypos,xpos))&.> positions){ game
   
0 0 1
0 0 0
   xpos =. 1
   ypos =. 2
   positions =. , ((i.@(0&{)@$)(<@,"0)/(i.@(1&{)@$)) tetrimino
   ($tetrimino) $ ((+&(ypos,xpos))&.> positions){ game
0 0 0
0 0 1

Now we can multiply the tetrimino by the target value:

   target_segment =. ($tetrimino) $ ((+&(ypos,xpos))&.> positions){ game
   ] hits =. +/ , *&tetrimino target_segment
1

Now the variable hits contains the number of elements with a target cell value. The final predicate looks like this:

can_put_in_matrix =: 3 : 0
 NB. Unpack the arguments
 tetrimino =. 0&< > 0 { y
 tetrimino_width =. 1 { $ tetrimino
 tetrimino_height =. 0 { $ tetrimino
 ypos =. > 1 { y
 xpos =. > 2 { y
 game =. > 3 { y
 game_width  =. 1 { $ game
 game_height =. 0 { $ game

 NB. Verify field boundaries
 is_inside =. (xpos >: 0) *. (ypos >: 0) *. ( (xpos+tetrimino_width - 1) < game_width) *. ((ypos+tetrimino_height - 1) < game_height)

 NB. Check if we hit an occupied cell
 hits =. 0
 if. is_inside do.
   positions =. , ((i.@(0&{)@$)(<@,"0)/(i.@(1&{)@$)) tetrimino
   hits =. +/ , *&tetrimino ($tetrimino) $ ((+&(ypos,xpos))&.> positions){ game
 end.

 is_inside *. (hits = 0)
)

End words

As it was said in the beginning, J it's very interesting. For me there are many things to learn (you can tell that by looking at all those parenthesis in some expressions). Also there are many strategies in array languages that one needs to understand in other to write idiomatic code.

The ncurses interface will be discussed in part 2. For future posts it will be interesting to talk about concepts like the obverse (http://www.jsoftware.com/help/jforc/u_1_uv_uv_and_u_v.htm#_Toc191734413) and state machines (http://www.jsoftware.com/help/jforc/loopless_code_vii_sequential.htm#_Toc191734470) .

Code for this post can be found here: https://github.com/ldfallas/jcurtris

Monday, April 10, 2017

A simple language with indentation based blocks. Part 4: Improving error messages

Going back to our first post, we defined the type of a parser function as :

ReaderState ->  (('a * ReaderState) option )

Which means that a parser is a function from a reader state to an Option<'a> of a value and a new reader state.

Using Option<'a> is handy for writing code that may result on a value or a failure indicator. In our case it's possible that the parser fails to recognize its input. The most common example it's a program with syntax errors. For example the following program is incorrect:

if cond1:
   if x y z:
      print(x)

A parser failure may also be something we expected. This is the case when need to use failure to choose a parser from a list of possible parsers. For example the expression parser:

   pStatements := [pReturn; ifParser; pCallStatement; pAssignStatement; whileParser]

   let pStatement =
       fun state -> (List.reduce disjParser !pStatements) state

Using the bultin Option<'a> type comes handy, but it has the problem that it doesn't provide information about the parsing failure. When a parser cannot recognize some input string, the only response we get is None. That's why we introduced the ParsingResult<'a> type to replace the Option<'a> type. Here's the definition:

type ReaderState = { Data:        string;
                     Position:    int;
                     Line:        int;
                     Indentation: int list }

type ParserFailure =
    | Fatal of string * int
    | Fail

   
type ParsingResult<'a> =
     | Success of ('a * ReaderState)
     | Failure of ParserFailure

We still have two possible states: Success and Failure. However now we have a space to add more information about a syntax error . In this case Fatal has a couple of slots to specify an error message and a line number.

Failure information

The Failure has a parameter of type ParserFailure. This type is defined as follows:

   type ParserFailure =
       | Fatal of string * int
       | Fail

We use this two alternatives to represent the scenarios described at the beginning of this post:

  • Fatal : a syntax error in the input string
  • Fail : a failure to completely recognize something

We use Fatal for scenarios such as the following program which has a syntax error in the while condition ( x y ):

while x y:
   print(1)

This is the definition of the whileParser:

   let whileParser =
        whileKeyword    >>
        pExpression     >>= (fun condition ->
        colon           >>
        pBlock          >>= (fun block ->
        preturn (PWhile(condition, block))))

We can say that any failure after the while keyword is a fatal failure. However a failure detecting the while keyword is a simple failure . In order to represent this situation we're going to introduce the +>> which is going to sequence two parsers and produce a fatal failure in case that the first parser fails. The definition of this operation looks like this:

   let inline (+>>)  (parser1 : (ReaderState ->  ParsingResult<'a>))
                     (parser2 : (ReaderState ->  ParsingResult<'b>)) =
       fun input -> match (parser1 input) with
                    | Success (_, restState) -> parser2 restState
                    | Failure(f & Fatal(_, _)) -> Failure(f)
                    | Failure(_) -> Failure(Fatal("Parsing error ", input.Line))

   let inline (+>>=) (parser1 : (ReaderState ->  ParsingResult<'a>))
                     (parser2N : ('a ->  (ReaderState ->  ParsingResult<'b>))) =
       fun input -> match (parser1 input) with
                    | Success (matchedTxt, restState) -> (parser2N matchedTxt) restState
                    | Failure(f & Fatal(_)) -> Failure(f)
                    | Failure(_) -> Failure(Fatal("Parse problem ", input.Line))                    

Now with this operator we can modify the definition of whileParser to be as follows:

   let whileParser =
        whileKeyword    >>
        pExpression     +>>= (fun condition ->
        colon           +>>
        pBlock          +>>= (fun block ->
        preturn (PWhile(condition, block))))

Example

Now we can see the result of parsing a file:

> parse "if x:
- 
-       while x y:
-           print(1)
- " pStatement;;
val it : ParsingResult<PStat> = Failure (Fatal ("Parsing error ",3))

Expected failure

Now we also we parser failures for choosing from a set of possible parsers. We do that in the disjParser which chooses between two parsers. The old definition of the disjParser looks like this:

let disjParser (parser1 : (ReaderState ->  (('a * ReaderState) option )))
               (parser2 : (ReaderState ->  (('a * ReaderState) option ))) =
    fun input -> match (parser1 input) with
                 | result & Some _ -> result
                 | _ -> parser2 input

Notice that this definition uses the Option<'a> cases (Some and None) to determine if it succeeds or if it needs to give a chance to parser2. With our new ParserResult<'a> type we need to make a distinction between fatal and a "controlled" failure. We can now change the definition of this parser to be:

let disjParser (parser1 : (ReaderState ->  ParsingResult<'a>))
               (parser2 : (ReaderState ->  ParsingResult<'a>)) =
    fun input -> match (parser1 input) with
                 | success & Success(_) -> success
                 | Failure(fatal & Fatal(_, _)) -> Failure(fatal)
                 | _ -> parser2 input

Next steps

In the next post we're going to deal with code comments.

This is part #4 of an ongoing series of posts on building a parser for a small language. The first post can be found here: http://langexplr.blogspot.com/2017/01/a-simple-language-with-indentation.html and a GitHub repository with the code can be found there: https://github.com/ldfallas/fsharpparsing.

Saturday, March 25, 2017

A simple language with indentation based blocks. Part 3: Blocks

In this post we're going to add support for statements containing blocks which is the main goal of these series of posts.

Identifying blocks by using indentation

The technique we're going to use to identify blocks is based the following description from the (excellent) Python documentation:

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

This section was taken from: https://docs.python.org/3/reference/lexical_analysis.html#indentation

The IDENT and DEDENT tokens act as begin/end block indicators. They have the same function as '{' and '}' in C-based languages. The following Python grammar fragment shows how this tokens are used.

...
suite: simple_stmt | NEWLINE IDENT stmt+ DEDENT
...
if_stmt: 'if' test ':' suite ...
...

Since our little parser uses parsing combinators without a tokenizer, we need to adjust this strategy a little bit. We're going to jump right ahead to show how we define the parser for the if statement.

   let pBlock =
       newlines   >>
       indent     >>
       pStatement >>= (fun firstStat ->
       (zeroOrMore
           (newlines  >>
            indented  >>
            pStatement) [])  >>= (fun restStats ->
       dedent      >> preturn (firstStat::restStats)))

   let ifParser =
       ifKeyword   >>
       pExpression >>= (fun expr ->
       colon       >>
       pBlock      >>= (fun block ->
       (optionalP pElse []) >>= (fun optElseBlockStats ->
                     preturn (PIf(expr,
                                  block,
                                  (match optElseBlockStats with
                                   | [] -> None
                                   | elements -> Some elements))))))

Here's an example of the generated AST for a simple if statement:

> parse "if x:
-           print(x)
-           return x*x
- " ifParser;;
val it : (PStat * ReaderState) option =
  Some
    (PIf
       (PSymbol "x",
        [PCallStat (PCall ("print",[PSymbol "x"]));
         PReturn (PBinaryOperation (Times,PSymbol "x",PSymbol "x"))],null),
     {Data = "if x:
          print(x)
          return x*x
";
      Position = 45;
      Indentation = [0];})

The key element here is the definition of the pBlock parser. In the definition of this parser we try to emulate the same strategy as the Python grammar definition:

   let pBlock =
       newlines   >>
       indent     >>
       pStatement >>= (fun firstStat ->
       (zeroOrMore
           (newlines  >>
            indented  >>
            pStatement) [])  >>= (fun restStats ->
       dedent      >> preturn (firstStat::restStats)))

We define indent, dedent and indented as follows:

   let indentation =
          pGetIndentation >>=(fun indentation ->                          
                (readZeroOrMoreChars (fun c -> c = ' '))
                >>=
                (fun spaces ->
                   match (spaces.Length,
                          indentation) with
                   | (lessThan, top::_) when lessThan > top ->
                       (pSetIndentation lessThan) >> (preturn "INDENT")
                   | (lessThan, top::_) when lessThan = top ->
preturn "INDENTED"
                   | (identifiedIndentation, top::rest) ->
                          (pSetFullIndentation rest) >> (preturn "DEDENT")
                   | _ -> pfail))

   let indent = indentation >>= (fun result -> if result = "INDENT" then preturn result else pfail)
   let dedent = indentation >>= (fun result -> if result = "DEDENT" then preturn result else pfail)
   let indented = indentation >>= (fun result -> if result = "INDENTED" then preturn result else pfail)

We define each of these indentation elements as:

  • INDENTED verifies that the expected indentation level is present. It consumes this indentation. For example, if expecting 3 space indentation, it will consume and verify that there are three spaces from the current position.
  • INDENT Verifies that there's a new indentation level can be found from the current position. It changes the 'Indentation' value in the current reader state.
  • DEDENT decreases the expected indentation level from the reader.

We can test these combinators to see how they affect the state of the parser.

We start with the following fragment:

> let code = "if x:
  if y:
     return 1
  else:
     return 2
"

We can test parts of our complex parsers by specifying just pieces of others. For example let's get to the then part of the parser.

> parse code (ifKeyword >> pExpression >> colon >> newlines >> indent);;
val it : (string * ReaderState) option =
  Some
    ("INDENT", {Data = "if x:
  if y:
     return 1
  else:
     return 2
";
                Position = 8;
                Indentation = [2; 0];})

As you can see, after executing these parsers we get to position 8. For example:

> code.Substring(8);;
val it : string = "if y:
     return 1
  else:
     return 2
"

Also the indentation stack now has 2 and 0. If we execute these series of parsers again to get to the then section of the first nested if we get:

> parse code (ifKeyword >> pExpression >> colon >> newlines >> indent >>
-             ifKeyword >> pExpression >> colon >> newlines >> indent);;
val it : (string * ReaderState) option =
  Some
    ("INDENT", {Data = "if x:
  if y:
     return 1
  else:
     return 2
";
                Position = 19;
                Indentation = [5; 2; 0];})

Now we have three indentation levels on our stack and we got to position 19.

Blocks

Just as the Python grammar reuses suite production, we can reuse the pBlock parser to define other kinds of statements. For example we can define the while statement as follows:

   let whileParser =
        whileKeyword    >>
        pExpression     >>= (fun condition ->
        colon           >>
        pBlock          >>= (fun block ->
        preturn (PWhile(condition, block))))

Now we can define more interesting statements such as:

val code : string = "if x > 0:
  while x < 10:
    print(x)
    x := x + 1
"

> parse code pStatement;;
val it : (PStat * ReaderState) option =
  Some
    (PIf
       (PBinaryOperation (Gt,PSymbol "x",PNumber "0"),
        [PWhile
           (PBinaryOperation (Lt,PSymbol "x",PNumber "10"),
            [PCallStat (PCall ("print",[PSymbol "x"]));
             PAssignStat
               (PSymbol "x",PBinaryOperation (Plus,PSymbol "x",PNumber "1"))])],
        null),
     {Data = "if x > 0:
  while x < 10:
    print(x)
    x := x + 1
";
      Position = 53;
      Indentation = [0];})

Next steps

Now that we can define blocks we can focus in other parts of our little language parser. One thing we can definetly improve is error messages. For example, let's try to parse the following fragment:

> parse "notanif x y z" ifParser;;
val it : (PStat * ReaderState) option = None

Here's where the Option<'a> type is not that useful since it doesn't allow you to specify a failure value such as a error message or location.

Another thing that we need to implement is comments. Our parser combinators work directly with the text. Because of this it will be interesting to see to introduce comments support without changing all our parsers.

This is part #3 of an ongoing series of posts on building a parser for a small language. The first post can be found here: http://langexplr.blogspot.com/2017/01/a-simple-language-with-indentation.html.

Sunday, February 26, 2017

A simple language with indentation based blocks. Part 2: Expressions

The focus of the second part will be expression parsing. The desired grammar for the expression small language experiment is the following (here in pseudo BNF):

NESTED_EXPRESSION = '(' EXPRESSION ')'

PRIMARY_EXPRESSION = SYMBOL | NUMBER | STRING | NESTED

UNARY_EXPRESSION =  NOT_EXPRESSION | CALL_EXPRESSION | ARRAY_ACCESS_EXPRESSION |
                    PRIMARY_EXPRESSION

MULTIPLICATIVE_EXPRESSION = UNARY_EXPRESSION ( ('*' | '/') UNARY_EXPRESSION)*

ADDITIVE_EXPRESSION = MULTIPLICATIVE_EXPRESSION ( ('*' | '/') MULTIPLICATIVE_EXPRESSION)*

RELATIONAL_EXPRESSION = ADDITIVE_EXPRESSION ( ('<' | '>') ADDITIVE_EXPRESSION)*

EQUALITY_EXPRESSION = RELATIONAL_EXPRESSION ( ('=' | '<>') RELATIONAL_EXPRESSION)*

LOGICAL_OR_EXPRESSION = EQUALITY_EXPRESSION  ('||'  EQUALITY_EXPRESSION)* 

LOGICAL_AND_EXPRESSION = LOGICAL_OR_EXPRESSION ('&&'  LOGICAL_OR_EXPRESSION)* 

EXPRESSION = LOGICAL_AND_EXPRESSION

In this post we're going to build some key techniques for creating the code for parsing a subset of this grammar. Not every element of expression grammar will be presented since it gets repetitive at some point.

For this small experiment we're going to use the following F# types as the AST of the program:

type Operator =
    | Plus
    | Minus
    | Times
    | Div
    | And
    | Or
    | Equal
    | NotEqual
    | Assign
    | Lt
    | Gt
    

type PExpr =
    | PSymbol  of string
    | PString  of string
    | PNumber  of string
    | PBoolean of bool
    | PNot     of PExpr
    | PCall    of string * (PExpr list)
    | PNested  of PExpr
    | PArrayAccess of PExpr * PExpr
    | PBinaryOperation of Operator * PExpr * PExpr

Simple atomic expressions

We're going to start with the atomic expressions:

  • Symbols or variables (ex. x,y,z)
  • Number literals (ex 1, 1.2)
  • String literals (ex. "Hello")

We already got these parsers from the previous post:

   let pSymbol =
       concatParsers2
          (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))
          (fun initialChar ->
               concatParsers2
                  (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || System.Char.IsDigit(c)))
                  (fun suffixString -> (preturn (PSymbol (initialChar + suffixString))))
           )


   let pNumber =
              ( (optionalP (readSpecificChar '-') "") >>= (fun neg -> 
                digitP  >>= (fun firstChar -> 
                (readZeroOrMoreChars (fun c ->  System.Char.IsDigit(c))) >>= (fun chars ->
                (optionalP decimalPartP "") >>= (fun dec ->                                                                               
                preturn (PNumber (neg + firstChar + chars + dec)))))))


   let pString =
       whitespace >>
       readSpecificChar '"' >>
       readZeroOrMoreStringChars (fun previous current ->
                 (previous = '\\' && current = '"') || current <> '"')
       >>= (fun stringContents ->
            readSpecificChar '"' >> (preturn (PString stringContents)))

We can call these expressions "primary expressions", which are used in conjunction with operators to create more complex elements. We need a parser that recognizes any one these elements. We can use the disjParser from the previous post :

> let primaryExpression = (disjParser (disjParser pSymbol pNumber) pString);;

val primaryExpression : (ReaderState -> (PExpr * ReaderState) option)

> parse "10" myPrimaryExpression;;
val it : (PExpr * ReaderState) option =
  Some (PNumber "10", {Data = "10";
                       Position = 2;
                       Indentation = [0];})
> parse "x" myPrimaryExpression;; 
val it : (PExpr * ReaderState) option =
  Some (PSymbol "x", {Data = "x";
                      Position = 1;
                      Indentation = [0];})
> parse "\"hello\"" myPrimaryExpression;;
val it : (PExpr * ReaderState) option =
  Some (PString "hello", {Data = ""hello"";
                          Position = 7;
                          Indentation = [0];})

We can use List.reduce function to improve this code since we could have several parsers as the primary expression.

> let myPrimaryExpression = List.reduce disjParser [pSymbol; pNumber; pString]  ;;

val myPrimaryExpression : (ReaderState -> (PExpr * ReaderState) option)

> parse "10" myPrimaryExpression
- ;;
val it : (PExpr * ReaderState) option =
  Some (PNumber "10", {Data = "10";
                       Position = 2;
                       Indentation = [0];})
> parse "x1" myPrimaryExpression 
- ;; 
val it : (PExpr * ReaderState) option =
  Some (PSymbol "x1", {Data = "x1";
                       Position = 2;
                       Indentation = [0];})

Binary operations

Binary expressions are composed of two expressions and an operator which connects them. For example: a + b . For these operations we can define the following utility parser creation function:

   let buildExpressions (leftExpr:PExpr) (rightExprs:(Operator * PExpr) list) =
       (List.fold (fun left (op,right) -> PBinaryOperation(op, left, right)) leftExpr rightExprs)

   let pBinaryExpression operators lowerLevelElementParser  =
       lowerLevelElementParser
         >>= (fun leftTerm ->
                 (zeroOrMore
                     (operators
                        >>= (fun op ->
                               lowerLevelElementParser >>= (fun expr -> preturn (op, expr))))
                     [])
                   >>= (fun acc -> preturn (buildExpressions leftTerm acc) ))

This function allows us to build parsers for binary expressions . The same expressed in pseudo BNF may look like this:

   pBinaryExpression = lowerLevelElementParser ( operators  lowerLevelElementParser )*

Which means that we'll like to recognize an element with lowerLevelElementParser and zero or more pairs of an element recognized by operators followed by another lowerLevelElementParser. What looks strange at first is the use of zero or more which implies that this parser can recognize just one element with lowerLevelElementParser. This will be useful in the following section when we try to implement operator precedence.

As an example we can define a parser for plus as follows:

> let identifyOperator operatorChar operatorResult =
-        concatParsers
-           whitespaceNoNl
-           ((readSpecificChar operatorChar) >> (preturn operatorResult))
- ;;

val identifyOperator :
  operatorChar:char ->
    operatorResult:'a -> (ReaderState -> ('a * ReaderState) option)
          
> let plusOperator = identifyOperator '+' Plus;;

val plusOperator : (ReaderState -> (Operator * ReaderState) option)

> let myPlus = pBinaryExpression plusOperator myPrimaryExpression;;

val myPlus : (ReaderState -> (PExpr * ReaderState) option)

> parse "1+2" myPlus;;
val it : (PExpr * ReaderState) option =
  Some (PBinaryOperation (Plus,PNumber "1",PNumber "2"), {Data = "1+2";
                                                          Position = 3;
                                                          Indentation = [0];})

Arithmetic operations

Now we're going to add support for basic arithmetic operations: +, -, /, *. By using the pBinaryExpression utility function defined in the previous section we can preserve operator precedence. To do this we define the operations as follows:

let pMultiplicativeExpression = pBinaryExpression (disjParser pDivOperator pTimesOperator)  myPrimaryExpression
         
let pAdditiveExpression = pBinaryExpression (disjParser plusOperator minusOperator)  pMultiplicativeExpression

What we say here is that an additive expression is composed of addition or subtraction (disjParser plusOperator minusOperator) of multiplicative expressions. Also we said that a multiplicative expression is composed of multiplication or division (disjParser pDivOperator pTimesOperator) of primary expressions.

An example of using these operators looks like this:

> parse "1 * 2 + 3" pAdditiveExpression;;
val it : (PExpr * ReaderState) option =
  Some
    (PBinaryOperation
       (Plus,PBinaryOperation (Times,PNumber "1",PNumber "2"),PNumber "3"),
     {Data = "1 * 2 + 3";
      Position = 9;
      Indentation = [0];})
> parse "1 + 2 * 3" pAdditiveExpression;;
val it : (PExpr * ReaderState) option =
  Some
    (PBinaryOperation
       (Plus,PNumber "1",PBinaryOperation (Times,PNumber "2",PNumber "3")),
     {Data = "1 + 2 * 3";
      Position = 9;
      Indentation = [0];})

Recursion

One thing that was tricky to implement at first was recursion. At some point in the grammar we need to refer back to the top parser . For example a parentheses expression can have any other expression and its recognized as a primary expression. For example:

1 * (2 + 3)

Needs to be parsed as:

     *
    / \
   /   \
  1  (2+3)
       |
       +
      / \
     2   3

What we need to do is to define a top level pExpression parser used for recognizing all our expressions . This parser will be used to define the parenthesis or nested expression. At this point our grammar written using pseudo BNF looks like this:

pPrimaryExpression = pNumber | pSymbol | pString

pAdditiveExpression = pPrimaryExpression ( (plusOperator |  minusOperator) pPrimaryExpression )*

pMultiplicativeExpression = pAdditiveExpression ( (divOperator |  timesOperator) pPrimaryExpression )*

pTopExpression = pMultiplicativeExpression

In F# this using our set of combinators it looks like:

let myPrimaryExpression = List.reduce disjParser [pSymbol; pNumber; pString]

let pMultiplicativeExpression = pBinaryExpression (disjParser pDivOperator pTimesOperator)  myPrimaryExpression
         
let pAdditiveExpression = pBinaryExpression (disjParser plusOperator minusOperator)  pMultiplicativeExpression

let pTopExpression = pAdditiveExpression

We can use pTopExpression to parse any expression:

> parse "3+10*x-1/32" pTopExpression;;
val it : (PExpr * ReaderState) option =
  Some
    (PBinaryOperation
       (Minus,
        PBinaryOperation
          (Plus,PNumber "3",PBinaryOperation (Times,PNumber "10",PSymbol "x")),
        PBinaryOperation (Div,PNumber "1",PNumber "32")),
     {Data = "3+10*x-1/32";
      Position = 11;
      Indentation = [0];})
> parse "x" pTopExpression;;          
val it : (PExpr * ReaderState) option =
  Some (PSymbol "x", {Data = "x";
                      Position = 1;
                      Indentation = [0];})

But now we want to use pTopExpression to define one of our primary expressions:

//WARNING THE FOLLOWING CODE WILL NOT COMPILE
let readLPar =
       concatParsers whitespace (readSpecificChar '(')
let readRPar = readSpecificChar ')'

let pNested = readLPar    >>
              pTopExpression >>= (fun expr ->
              readRPar    >>  (preturn (PNested expr)))

let myPrimaryExpression = List.reduce disjParser [pSymbol; pNumber; pString; pNested]

let pMultiplicativeExpression = pBinaryExpression (disjParser pDivOperator pTimesOperator)  myPrimaryExpression
         
let pAdditiveExpression = pBinaryExpression (disjParser plusOperator minusOperator)  pMultiplicativeExpression

let pTopExpression = pAdditiveExpression

When we try to compile this code we get the following error:

                pTopExpression >>= (fun expr ->
  --------------^^^^^^^^^^^^^^

/dir/stdin(6,15): error FS0039: The value or constructor 'pTopExpression' is not defined

Which is totally correct, pTopExpression is defined after pNested. It has to because pTopExpression is defined using pAdditiveExpression has a dependency on pPrimaryExpression. Until now we only used function composition to create our parsers.

What we want to do is to define:

   number symbol string nested
     ^      ^      ^     ^  |
     |      |      |     |  |
     |      |      |     |  |
     +------+------+-----+  |
            |               |
            |               |
         primary            |
            |               |
            |               |
        additive            |
            |               |
            |               |
      multiplicative        |
            |               |
            |               |
           top  <-----------+

To solve this problem we're going to take advantage of reference variables in F#. And do the following trick:

let pTopExpressions = ref []

let pTopExpression =
       fun state -> (List.reduce disjParser !pTopExpressions) state


let pNested = readLPar    >>
              pTopExpression >>= (fun expr ->
              readRPar    >>  (preturn (PNested expr)))

let myPrimaryExpression = List.reduce disjParser [pSymbol; pNumber; pString; pNested]

let pMultiplicativeExpression = pBinaryExpression (disjParser pDivOperator pTimesOperator)  myPrimaryExpression
         
let pAdditiveExpression = pBinaryExpression (disjParser plusOperator minusOperator)  pMultiplicativeExpression


pTopExpressions := [pAdditiveExpression]

Using a reference (ref) allows us to create the recursive parser that we need . Here's an example of using it:

> parse "1*(2+3)" pTopExpression;; 
val it : (PExpr * ReaderState) option =
  Some
    (PBinaryOperation
       (Times,PNumber "1",
        PNested (PBinaryOperation (Plus,PNumber "2",PNumber "3"))),
     {Data = "1*(2+3)";
      Position = 7;
      Indentation = [0];})

Next steps

Using the same techniques presented so far we can finish our basic grammar. This is part #2 of an ongoing series of posts on building a parser for a small language. The first post can be found here: http://langexplr.blogspot.com/2017/01/a-simple-language-with-indentation.html.

Sunday, January 29, 2017

A simple language with indentation based blocks. Part 1: Parsing combinators

In this post, I'm going to start with the implementation of the parser using the F# programming language. There are several tools for creating parsers for example FParsec http://www.quanttec.com/fparsec/ or FsYacc/FsLex https://en.wikibooks.org/wiki/F_Sharp_Programming/Lexing_and_,arsing. However for this experiment I wanted to create a small set of simple parser combinators, just to learn more about them.

The contents of this post is loosely based on the definition of the Parser Monad . There are many great articles on the net about them. One of this papers is the incredibly nice Monadic Parsing in Haskell http://www.cs.nott.ac.uk/~pszgmh/pearl.pdf .

Let's start by defining the type of the state of our parser:

   type ReaderState = { Data : string;
                        Position: int;
                        Indentation: int list}

This state contains the following:

  • Data : the input string containing the program to parse
  • Position : the current position of the parser inside the Data string
  • Indentation : an indentation stack (more on that in future posts)

And now we're going to define the type of a "parser" :

ReaderState ->  (('a * ReaderState) option )

This means that a "parser" is a function that takes the reader state and returns a tuple of some parsed value ('a) and the new parsing state. Notice that parsers could fail, so the result it's wrapped in an option type https://docs.microsoft.com/en-us/dotnet/articles/fsharp/language-reference/options.

A very simple example of a parser is the following:

   let readChar(state : ReaderState) : (string * ReaderState) option =
       if state.Data.Length > state.Position then
          Some (state.Data.[state.Position].ToString(),
                { state with Position = state.Position + 1 } )
       else
          None

We can test this parser using the following function:

   let parse (input : string) (parser : (ReaderState ->  (('a * ReaderState) option ))) =
       parser {  Data = input ; Position = 0; Indentation = [0] }

For example:

> parse "hello" readChar;;        
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
                                                     Position = 1;
                                                     Indentation = [0];})

Here we can see that the result is a successful optional Some with a string with the single recognized char . Also part of the result is the new reader state .

We can also specify operations for several characters for example:

   let readZeroOrMoreChars (pred:char -> bool) (state : ReaderState) : (string * ReaderState) option =
       let
         secondPosition = readingChar state.Position pred state.Data
         in
           Some ( state.Data.Substring(state.Position,
                                       secondPosition - state.Position),
                 { state with Position = secondPosition })

We can use this function as follows:

> parse "1231abc" (readZeroOrMoreChars System.Char.IsDigit);;
val it : (string * ReaderState) option = Some ("1231", {Data = "1231abc";
                                                        Position = 4;
                                                        Indentation = [0];})

Using this function we can define a parser for whitespace:

   let whitespace = readZeroOrMoreChars System.Char.IsWhiteSpace

Connecting parsers

The parsers I've been showing so far look very simple, and they are!. The goal is to have small parsers and combine them to create more interesting ones. The next operation is called concatParsers2 and it's goal is create a new parser that will execute two parsers in sequence. That is, apply the first one and with the resulting state, execute the second one.

For example, the following parser for recognizing symbols use the readWithConditionOnChar parser with the readZeroOrMoreChars in order to create a PSymbol with the recognized element:

   let symbol =
       concatParsers2
          (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))      
          (fun initialChar ->
               concatParsers2
                  (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || System.Char.IsDigit(c)))
                  (fun suffixString -> (preturn (PSymbol (initialChar + suffixString))))
           )

In our little language a symbol is represented by :

  • a letter
  • zero or more letters or numbers

We can use the readWithConditionOnChar and readZeroOrMoreChars to archive these goals, for example:

> parse "hello" (readWithConditionOnChar  (fun c -> System.Char.IsLetter(c, 0)))- ;; 
val it : (string * ReaderState) option = Some ("h", {Data = "hello";
                                                     Position = 1;
                                                     Indentation = [0];})
> parse "hi1and2and3" (readZeroOrMoreChars (fun c -> System.Char.IsLetter(c) || - System.Char.IsDigit(c)));;
val it : (string * ReaderState) option =
  Some ("hi1and2and3", {Data = "hi1and2and3";
                        Position = 11;
                        Indentation = [0];})

Notice that we cannot use just readZeroOrMoreChars since it will recognize symbols starting with numbers (like '13hello').

We can create a small parser for recognizing a letter (readWithConditionOnChar) and we can use readZeroOrMoreChars to read the rest. We could combine these two blocks to get another parser. The key for composing our parsers is the concatParsers2 function. This function has the following signature:

> concatParsers2;;
val it :
  ((ReaderState -> ('a * ReaderState) option) ->
     ('a -> ReaderState -> ('b * ReaderState) option) -> ReaderState ->
     ('b * ReaderState) option) = <fun:clo@21-1>
                     

This signature means that concatParsers2 receives a parser that produces a value of type 'a. The second argument is a function that receives a value of type 'a (the result of the first parser) and returns a parser of another type 'b. The result of concatParsers2 is itself another parser that produces a value of type 'b.

The implementation looks like this:

   let concatParsers2 (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                      (parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) =
       fun input -> match (parser1 input) with
                    | Some (matchedTxt, restState) -> (parser2N matchedTxt) restState
                    | _ -> None

This small function lets us create parsers using small blocks. But before we do that, we need to create another useful operation.

let preturn aValue (state : ReaderState ) = Some (aValue, state)

This operation is very simple, but really important!. It lets us prepare a result. By taking a look at the signature, we can see that it matches our parser signature. We can now used it in conjunction with concatParsers2 to produce results:

>  parse "AXXXC" 
-        (concatParsers2 recognizeABC                     
-                      (fun firstChar ->
-                           (concatParsers2
-                                recognizeXs              
-                                (fun xs -> 
-                                     (concatParsers2
-                                         recognizeABC
-                                         (fun lastChar ->
-                                              preturn (firstChar, xs, lastChar)- ))))));; 
val it : ((string * string * string) * ReaderState) option =
  Some (("A", "XXX", "C"), {Data = "AXXXC";
                            Position = 5;
                            Indentation = [0];})

Given that we define recognizeABC and recognizeXs as follows:

> let recognizeABC = readWithConditionOnChar (fun c -> c = "A" || c = "B" || c= "C");;

val recognizeABC : (ReaderState -> (string * ReaderState) option)

> let recognizeXs = readZeroOrMoreChars (fun c -> c = 'X' ) ;;    

val recognizeXs : (ReaderState -> (string * ReaderState) option)

Using the concatParsers2 function seems a little verbose. We can borrow inspiration from the Haskell's Monad type class https://hackage.haskell.org/package/base-4.9.1.0/docs/Control-Monad.html#t:Monad by defining versions of the >>= operator as follows:

   let inline (>>=) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                    (parser2N : ('a ->  (ReaderState ->  (('b * ReaderState) option )))) = concatParsers2 parser1 parser2N

Now using this operator we can write:

parse "AXXXC" ( recognizeABC >>= (fun firstChar -> 
                recognizeXs  >>= (fun xs -> 
                recognizeABC >>= (fun lastChar ->
                preturn (firstChar, xs, lastChar)))))

Ignoring output from previous parsers

Another useful operation allows us to connect two parsers but

   let concatParsers (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                     (parser2 : (ReaderState ->  (('b * ReaderState) option ))) =
       fun input -> match (parser1 input) with
                    | Some (_, restState) -> parser2 restState
                    | _ -> None

Notice that it's similar to the concatParsers2 but it ignores the result of parser1. This is useful for discarting whitespace:

let whitespaceNoNl =
    readZeroOrMoreChars
        (fun c -> System.Char.IsWhiteSpace(c) && c <> '\n')

let colon  =
    concatParsers whitespaceNoNl (readSpecificChar ':')
    

As with the concatParsers2 we can use another well known operator for this operation:

   let inline (>>) (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                   (parser2 : (ReaderState ->  (('b * ReaderState) option ))) = concatParsers parser1 parser2

Useful parsing operations

We need to create more operations that help us create complex parsers. For example, we can create a small operation for optional elements:

   let optionalP (parser : (ReaderState ->  (('a * ReaderState) option ))) (defaultValue:'a) =
       fun input -> match (parser input) with
                    | result & Some _ -> result
                    | _ -> (Some (defaultValue, input))

We can use this new operator to parse numbers as follows:

   let number =
              ( (optionalP (readSpecificChar '-') "") >>= (fun neg -> 
                digitP  >>= (fun firstChar -> 
                (readZeroOrMoreChars (fun c ->  System.Char.IsDigit(c))) >>= (fun chars ->
                (optionalP decimalPartP "") >>= (fun dec ->                                                                               
                preturn (PNumber (neg + firstChar + chars + dec))))

Choosing from two options

The following operation helps us to try to use one parser and fallback to another if it didn't work:

   let disjParser (parser1 : (ReaderState ->  (('a * ReaderState) option )))
                  (parser2 : (ReaderState ->  (('a * ReaderState) option ))) =
       fun input -> match (parser1 input) with
                    | result & Some _ -> result
                    | _ -> parser2 input

A simple scenario using this parser looks like this:

> parse "39" (disjParser symbol number);; 
val it : (PExpr * ReaderState) option =
  Some (PNumber "39", {Data = "39";
                       Position = 2;
                       Indentation = [0];})
> parse "hello" (disjParser symbol number);;
val it : (PExpr * ReaderState) option =
  Some (PSymbol "hello", {Data = "hello";
                          Position = 5;
                          Indentation = [0];})

Repetitions

Identifying sequences of elements identified by a parser is very useful. We can define the following operations

   let rec oneOrMore parser accumulated =
       parser >>=
           (fun lastResult ->
              let result = lastResult::accumulated
              in disjParser (oneOrMore parser result) (preturn (List.rev result)))

   let zeroOrMore parser accumulated =
       disjParser (oneOrMore parser accumulated) (preturn accumulated)

A simple example of using this operation looks like this:

> parse "1   2 3  4" (zeroOrMore (whitespace >> number) []);;
val it : (PExpr list * ReaderState) option =
  Some
    ([PNumber "1"; PNumber "2"; PNumber "3"; PNumber "4"],
     {Data = "1   2 3  4";
      Position = 10;
      Indentation = [0];})

The next post will show how to create more interesting things with the pieces we have so far.

A simple language with indentation based blocks (part 0)

One of the first things I noticed when reading about Python, is the use of indentation for defining code blocks. For example:

def foo(x):
   while x < 20:
      if x > 10:
         print "argument is greater than 10"
         print "value: " + x
      else:
         print "the argument is less than or equal to 10"
         print "its value is : " + x

If you're only familiar to C-based language this seems a bit strange. Not using characters like '{}' or words like BEGIN/END seems fragile at first. For example one could expect something like:

def foo(x) {
   while x < 20 {
      if x > 10 {
         print "argument is greater than 10"
         print "value: " + x
      } else {
         print "the argument is less than or equal to 10"
         print "its value is : " + x
   }
}

I've always wondered how do you create a parser for this kind of a language. In the following series of posts I'm going to record my experiences writing a parser for a simple little language. This language will use a style similar to Python. I'm going to write the code using F# .

Posts

  1. Parsing combinators
  2. Expressions
  3. Blocks
  4. Error messages