Phosphate — Parser Combinators with Error Recovery

Thy gardens and thy phosphate mines,
Florida, my Florida
— Chastain V. Waugh

Current version: 0.1

Phosphate is a parser-combinator library with error recovering inspired by algebraic effect handlers.

Parsers are procedures of three arguments: an SRFI 225 DTO, a dictionary of global state, and a dictionary of dynamic state. Parsers return the new global state and zero or more values.

Global state is passed along successful parses and contains the input state of the text to parse. The global state can also be used to store parse-wide settings, such as case folding.

The dynamic state is modified and passed around as parsers are composed like Scheme parameters and is used to store the handlers and information such as nesting depth. The dynamic state is not the same as the dynamic extent, although it acts in an analagous way. Unlike the parameterize form of R7RS, parameterize/p is guaranteed to tail-call its parser argument.

Parse failures and parse errors are handled using a system like algebraic effect handlers (except they use call/cc instead of delimited continuations). A system parse failure is usually handled by the library, while a user-defined parse error is handled by a user-defined handler. The error handlers get a continuation that allows them to execute code and return values in the raised-from context.

If you port this library to a system with real delimited continuations (either Racket or a future R^NRS), you should use the name orthophosphate.

Inspired by Medeiros and Mascarenhas, “Syntax Error Recovery in Parsing Expression Grammars”, 2018, and MegaParsack.

Phosphate is written in R⁷RS and is tested to work in Chibi Scheme 0.11 and CHICKEN 5. Phosphate only requires SRFI 225 to run (requires SRFI 64 for tests).

WARNING! The API is unlikely to change, but this is an unstable release, so it might.

Tutorial

This tutorial will walk through parsing an RFC 4180 comma separated values file using all of the major features of Phosphate.

If you are looking for an actual CSV parsing library, Wolfgang Corcoran-Mathe has one for R⁶RS.

A CSV file is a sequence of records separated by CRLF, optionally followed by CRLF followed by EOF. The many-until/p parser is perfect for this. It parses a parser repeatedly until it parses some end parser, and returns the results as a list.

(define csv/p
  (many-until/p record/p
                csv-eof/p
                `((sep/p . ,crlf/p))))

The many-until/p parser takes keyword arguments in the form of an alist, which will be modified later as we get into error handling.

The crlf/p parser just checks for \r\n in the feed. This can be accomplished using the ==seq/p parser, which checks for a string, list, or vector in the stream.

(define crlf/p
  (==seq/p "\r\n"))

The csv-eof/p procedure just checks for an optional CRLF followed by EOF. This can be handled using the previously defined crlf/p, eof/p (checks that the sequence is at EOF), or/p (disjunction over parsers) and and/p (composes parsers together, returning the results of the final parser).

(define csv-eof/p
  (or/p eof/p (and/p crlf/p eof/p)))

The record/p field parses text data separated by commas. It is terminated by CRLF, but it cannot consume it because csv/p consumes it. We can check for CRLF without consuming input using lookahead/p.

(define record/p
  (many-until/p field/p
                (lookahead/p (or/p crlf/p eof/p))
                `((sep/p . ,comma/p))))

The many-until/p and the related many/p parsers will insert the values returne by the separator parser into the list of parsed results. This was not a problem for crlf/p because it used ==seq/p because it returns no results. We could use that parser again, but for the sake of education I will use the parser ==/p, which does return a value. Since we don't want the commas in the returned list of fields, we can use discard/p to discard the parsed values.

(define comma/p
  (discard/p (==/p #\,)))

A field/p is either escaped or unescaped. We could use the or/p combinator here to try one, backtrack when it does not work, and then try another, but this is wasteful. The backtrack state will be active throughout the entire parse even when we have parsed ", which cannot be in an unescaped field. Instead we will use cond/p, which will tail-call a parser when one parser succeeds. In this case, we wish to look ahead one character to see if the next character is ", and if so parse the quoted character without pushing backtracking state. Otherwise we parse the unescaped field.

(define field/p
  (cond/p
    ((lookahead/p (==/p #\")) escaped/p)
    (else unescaped/p)))

Unescaped text does not allowed control characters, double quotes, commas, or DEL. As an extension to the spec, all non-ASCII Unicode characters are supported. To apply a Scheme predicate to the current character, use satisfy/p (which when successful will return the character it matched against), inside many/p to capture zero or more of the valid characters. Since we want to return a string, we will capture the returned value of many/p using lv/p (short for let-values/p) and then use return/p to convert the list of characters into a string.

(define unescaped-char/p
  (satisfy/p (lambda (x)
                (and (not (char<? x #\x20))
                     (not (char=? x #\x22))
                     (not (char=? x #\x2C))
                     (not (char=? x #\x7F))))))
(define unescaped/p
  (lv/p (((lst) (many/p unescaped-char/p)))
    (return/p (list->string lst))))

The use of return/p here wraps the returned values such that other parses can read the value along with the current parse state returned from many/p.

Escaped text starts with a double-quote and ends with a double quote. The escaped text can have two double-quotes in it, which turn into a single double quote.

We can accomplish this using advance/p (return any value at the current input point, and advance input) and not/p (attempt to parse the parser passed to it, and fail if it succeeds). The not/p parser is a lookahead parser, so if it succeeds the parse state is at the same point as the entry to not/p.

(define escaped-char/p
  (cond/p
    ((==/p #\") (==/p #\"))
    (else advance/p)))

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\"))))))
           (return/p (list->string lst)))))

Note that in escaped-char/p, the value returned by the first (==/p #\") is discarded. The parser then tail-calls the second (==/p #\"), which will return a value if the second character is ", and fail otherwise.

To actually parse input, we use parse. This requires the DTO that describes the operations on the input stream, the input stream as a dictionary, the dynamic state as a dictionary, the parser, and any error handlers. There are no named error handlers so we handle the catch-all error #t and just return #f on error.

(define (parse-csv str)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
         csv/p
    (#t (lambda (k dto dict dyn mark . rest)
          (values dict #f)))))

The parser will return two values: the input dictionary at whatever point the parser stopped at, and the parsed CSV file or #f on error.

This completes our simple CSV parser. Since the parser is made up of procedures, we need to rearrange the definitions to make it work. A complete script is:

(import (scheme base) (phosphate))

(define comma/p
  (discard/p (==/p #\,)))

(define crlf/p
  (==seq/p "\r\n"))

(define escaped-char/p
  (cond/p
    ((==/p #\") (==/p #\"))
    (else advance/p)))

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\"))))))
           (return/p (list->string lst)))))

(define unescaped-char/p
  (satisfy/p (lambda (x)
                (and (not (char<? x #\x20))
                     (not (char=? x #\x22))
                     (not (char=? x #\x2C))
                     (not (char=? x #\x7F))))))
(define unescaped/p
  (lv/p (((lst) (many/p unescaped-char/p)))
    (return/p (list->string lst))))

(define field/p
  (cond/p
    ((lookahead/p (==/p #\")) escaped/p)
    (else unescaped/p)))

(define record/p
  (many-until/p field/p
                (lookahead/p (or/p crlf/p eof/p))
                `((sep/p . ,comma/p))))

(define csv-eof/p
  (or/p eof/p (and/p crlf/p eof/p)))

(define csv/p
  (many-until/p record/p
                csv-eof/p
                `((sep/p . ,crlf/p))))

(define (parse-csv str)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
         csv/p
    (#t (lambda (k dto dict dyn mark . rest)
          (values dict #f)))))

You can try some examples:

(parse-csv "a,b,c") ; ⇒ (("a" "b" "c"))
(parse-csv "h,e,llo\r\nw,o,rld") ; ⇒ (("h" "e" "llo") ("w" "o" "rld"))
(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not") ; ⇒ (("This is\r\n\"escaped\" "this is not"))

This parser works, but when a parse error occurs, the parser just gives up and says it failed. There is no description of the failure, and no way to recover from the failure.

(parse-csv "\"truncated") ; ⇒ #f
(parse-csv "invalid \"text\"") ; → #f

Phosphate allows us to write a parser that raises named parse errors with continuations that allows the parser to recover and continue. Where to put error handlers and how many errors handlers to install will depend on what you need. With parser combinators, it is easy to write a parser, so there is no need to write the most general parser possible.

Phosphate differentiates between parse failures, which have the mark #f, and parse errors, which have any other mark. Parse errors are meant to be handled or signalled, while parse failures are meant to backtrack. Parse failures can be converted into parse errors using as-error/p. Our current parser only has parse failures and no parse errors.

Lets start adding errors at escaped/p. The many-until/p allows us to specify certain error behavior. The parser does not use min or max to specify size constraints, so too-little-error and expected-end-error are not useful. The escaped text does not have a separator, so expected-after-sep-error is not useful either. The error we are looking for is expected-sep-or-end-error, which is raised when the end parser and the char parser fail. If this error is raied, it means that the parser hits EOF while trying to parse the escaped text.

So as a first pass we can write

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\"))
                               '((expected-sep-or-end-error
                                  . EOF-in-escaped-text)))))
           (return/p (list->string lst)))))

This will raise 'EOF-in-escaped-text when the parser hits EOF while parsing escaped text. The arguments that the handler receives are documented in the many-until/p entry.

We also want to raise an error when garbage data is found after the closing quote mark. We can do this by adding another parser at the end that looks ahead for delimiters.

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\")))
                               '((expected-sep-or-end-error
                                  . EOF-in-escaped-text)))))
           (let ((str (if (list? lst) (list->string lst) lst)))
             (cond/p
              ((or/p eof/p
                     (lookahead/p comma/p)
                     (lookahead/p crlf/p))
               (return/p str))
              (else (error/p 'garbage-after-escaped str)))))))

Here error/p will raise an error with mark 'garbage-after-escaped with the parsed string as an argument. The continuation that is passed to the handler is the continuation of the parse of error/p. Note that the body of lv/p is just an expression, so we can put a regular let in there. The if expression is used to allow for non-list returns from error handlers.

Similar considerations apply to unescaped/p:

(define unescaped/p
  (lv/p (((lst) (many/p unescaped-char/p)))
    (let ((str (if (list? lst) (list->string lst) lst)))
      (cond/p
       ((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
        (return/p str))
       (else (error/p 'garbage-after-unescaped str))))))

There is nothing to be done for field/p, as it is just a wrapper for escaped/p and unescaped/p, which do the error handling.

At record/p there is not much error handling to be done, because errors regarding missing CRLF or commas are done in the parsers we just did. However, if parsing of a field fails, we may want to throw out the whole row instead of trying to recover an individual parse. We can do this by installing a handler whenever record/p is parsed using handle-errors/p.

(define record/p
  (handle-errors/p
      (many-until/p field/p
                    (lookahead/p (or/p crlf/p eof/p))
                    `((sep/p . ,comma/p)))
    ('return-from-record (lambda (k dto dict dyn mark value)
                           (values dict value)))))

During the parse of record/p, a handler named 'return-from-record is installed. When it is invoked (for instance, with (error/p 'return-from-record value), the passed value is returned from the parse of record/p. This is because the lambda is evaluated in the continuation of the invocation of handle-errors/p.

As an artifical example, we will write a parser entry procedure that will abort a record when a malformed unescaped field is found, but will recover for a malformed escaped field.

When there is garbage after a field, we can skip over the characters in that field until we get to a synchronization point, which is , CRLF, or EOF.

When we want to abort a record, we have to parse fields until the end of a record is found, but discarding all data accumulated.

(define abort-after-field/p
  (skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
                 advance/p)))

(define abort-field/p
  (cond/p
   ((==/p #\") (and/p (skip/p (cond/p
                               ((==/p #\") (==/p #\"))
                               (else advance/p)))
                      (or/p (==/p #\") (return/p))
                      abort-after-field/p))
   (else abort-after-field/p)))

(define abort-record/p
  (many-until/p abort-field/p
                (lookahead/p (or/p eof/p crlf/p))
                `((sep/p . ,comma/p))))

(define (parse-csv str)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
    csv/p
    ('garbage-after-escaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    (return/p str))
             dto dict dyn)))))
    ('garbage-after-unescaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    abort-record/p
                    (error/p 'return-from-record #f))
             dto dict dyn)))))
    ('EOF-in-escaped-text
     (lambda (k dto dict dyn mark . rest)
       (k (lambda ()
            (values dict #f)))))))

The parse form uses handle-errors/p to install handlers, so the same considerations apply to both. The previous use of handle-errors/p returned values in the continuation that the handler was invoked in, and directly returned values without further parsing. Here we see parsers that invoke the continuation k.

The continuation passed to an error handler accepts a single value, a thunk, and will evaluate that thunk in that continuation. Hence each handler will re-enter the continuation. In the case of the first two handlers, it will re-start parsing by passing values manually to a new set of combinators. After skipping tokens to get to a synchronization point, the parsers either return values to their continuation (in the cased of 'garbage-after-escaped).

With this we have written our parser with error recovery. The complete script is:

(import (scheme base) (phosphate))

(define comma/p
  (discard/p (==/p #\,)))

(define crlf/p
  (==seq/p "\r\n"))

(define escaped-char/p
  (cond/p
   ((==/p #\") (==/p #\"))
   (else advance/p)))

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\")))
                               '((expected-sep-or-end-error
                                  . EOF-in-escaped-text)))))
           (let ((str (if (list? lst) (list->string lst) lst)))
             (cond/p
              ((or/p eof/p
                     (lookahead/p comma/p)
                     (lookahead/p crlf/p))
               (return/p str))
              (else (error/p 'garbage-after-escaped str)))))))

(define unescaped-char/p
  (satisfy/p (lambda (x)
               (and (not (char<? x #\x20))
                    (not (char=? x #\x22))
                    (not (char=? x #\x2C))
                    (not (char=? x #\x7F))))))

(define unescaped/p
  (lv/p (((lst) (many/p unescaped-char/p)))
    (let ((str (if (list? lst) (list->string lst) lst)))
      (cond/p
       ((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
        (return/p str))
       (else (error/p 'garbage-after-unescaped str))))))

(define field/p
  (cond/p
   ((lookahead/p (==/p #\")) escaped/p)
   (else unescaped/p)))

(define record/p
  (handle-errors/p
      (many-until/p field/p
                    (lookahead/p (or/p crlf/p eof/p))
                    `((sep/p . ,comma/p)))
    ('return-from-record (lambda (k dto dict dyn mark value)
                           (values dict value)))))

(define csv-eof/p
  (or/p eof/p (and/p crlf/p eof/p)))

(define csv/p
  (many-until/p record/p
                csv-eof/p
                `((sep/p . ,crlf/p))))

(define abort-after-field/p
  (skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
                 advance/p)))

(define abort-field/p
  (cond/p
   ((==/p #\") (and/p (skip/p (cond/p
                               ((==/p #\") (==/p #\"))
                               (else advance/p)))
                      (or/p (==/p #\") (return/p))
                      abort-after-field/p))
   (else abort-after-field/p)))

(define abort-record/p
  (many-until/p abort-field/p
                (lookahead/p (or/p eof/p crlf/p))
                `((sep/p . ,comma/p))))

(define (parse-csv str)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
    csv/p
    ('garbage-after-escaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    (return/p str))
             dto dict dyn)))))
    ('garbage-after-unescaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    abort-record/p
                    (error/p 'return-from-record #f))
             dto dict dyn)))))
    ('EOF-in-escaped-text
     (lambda (k dto dict dyn mark . rest)
       (k (lambda ()
            (values dict #f)))))))

Normal code is fine:

(parse-csv "a,b,c") ; ⇒ (("a" "b" "c"))
(parse-csv "h,e,llo\r\nw,o,rld") ; ⇒ (("h" "e" "llo") ("w" "o" "rld"))
(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not") ; ⇒ (("This is\r\n\"escaped\" "this is not"))

CSV files with garbage after escaped work as expected:

(parse-csv "field,\"my \"\"escaped\"\" field\" with \"garbage,next field\r\nother, field\r\n")
; ⇒ (("field "my \"escaped\" field" "next field") ("other field"))

Garbage after unescaped will handle even very mangled files:

(parse-csv "multiple, garbage \n, and \"this has a \r\n\
            in it, causing another row\", extra stuff\r\n\
            this, is OK")
; ⇒ (#f #f ("this" "is OK"))

CSV files with unterminated escaped text will return #f:

(parse-csv "a,record,with,\"EOF") ; ⇒ (("a" "record" "with" #f))

This parser will work, but what if we are parsing untrusted data? Then we might read a very large file that will cause our program to run out of memory. This is more of an issue for file formats like JSON with arbitrary nesting, but we might still want to impose limits on the length of fields, records, and files.

We can use parameterize/p for this, unlike R⁷RS's parameterize, the parser that is being parameterized will always be tail-called. This is because parser parameters are carried around in the dyn dictionary.

We will add the parameters field-max, record-max, and file-max. These are either #f or exact positive integers. The many-until/p and many/p parsers accept a 'max keyword that will limit the text to a maximum length, and they also can emit errors when this maximum length is hit without a corresponding limiter.

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((max) (ref-parameter/p 'field-max))
                ((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\")))
                               `((expected-sep-or-end-error
                                  . EOF-in-escaped-text)
                                 (max . ,max)
                                 (expected-end-error . escaped-too-big)))))
           (let ((str (if (list? lst) (list->string lst) lst)))
             (cond/p
              ((or/p eof/p
                     (lookahead/p comma/p)
                     (lookahead/p crlf/p))
               (return/p str))
              (else (error/p 'garbage-after-escaped str)))))))

(define unescaped/p
  (lv/p (((max) (ref-parameter/p 'field-max))
         ((lst) (many/p unescaped-char/p
                        `((max . ,max)))))
    (let ((str (if (list? lst) (list->string lst) lst)))
      (cond/p
       ((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
        (return/p str))
       ((lookahead/p unescaped-char/p)
        (error/p 'unescaped-too-big))
       (else (error/p 'garbage-after-unescaped str))))))

(define record/p
  (lv/p (((max) (ref-parameter/p 'record-max)))
    (handle-errors/p
        (many-until/p field/p
                      (lookahead/p (or/p crlf/p eof/p))
                      `((sep/p . ,comma/p)
                         (max . ,max)
                         (expected-end-error . record-too-big)))
      ('return-from-record (lambda (k dto dict dyn mark value)
                             (values dict value))))))

(define csv/p
  (lv/p (((max) (ref-parameter/p 'file-max)))
    (many-until/p record/p
                  csv-eof/p
                  `((sep/p . ,crlf/p)
                     (max . ,max)
                     (expected-end-error . file-too-big)))))

Here we use the ref-parameter/p parser, which will return the value associated with the key given in its argument.

The parse-csv procedure now takes limits.

(define (parse-csv str file-max record-max field-max)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
         (parameterize/p csv/p
                         'file-max file-max
                         'record-max record-max
                         'field-max field-max)
    ('garbage-after-escaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    (return/p str))
             dto dict dyn)))))
    ('garbage-after-unescaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    abort-record/p
                    (error/p 'return-from-record #f))
             dto dict dyn)))))
    ('EOF-in-escaped-text
     (lambda (k dto dict dyn mark . rest)
       (k (lambda ()
            (values dict #f)))))
    (#t (lambda (k dto dict dyn mark . rest)
          (values dict mark)))))

The parser is set up to fail as soon as a limit as reached. Exercise for the reader: modify the parser to allow for sensible error recovery after a limit has been reached.

The complete script looks like:

(import (scheme base) (phosphate))

(define comma/p
  (discard/p (==/p #\,)))

(define crlf/p
  (==seq/p "\r\n"))

(define escaped-char/p
  (cond/p
   ((==/p #\") (==/p #\"))
   (else advance/p)))

(define escaped/p
  (and/p (==/p #\")
         (lv/p (((max) (ref-parameter/p 'field-max))
                ((lst)
                 (many-until/p escaped-char/p
                               (and/p (==/p #\")
                                      (not/p (==/p #\")))
                               `((expected-sep-or-end-error
                                  . EOF-in-escaped-text)
                                 (max . ,max)
                                 (expected-end-error . escaped-too-big)))))
           (let ((str (if (list? lst) (list->string lst) lst)))
             (cond/p
              ((or/p eof/p
                     (lookahead/p comma/p)
                     (lookahead/p crlf/p))
               (return/p str))
              (else (error/p 'garbage-after-escaped str)))))))

(define unescaped-char/p
  (satisfy/p (lambda (x)
               (and (not (char<? x #\x20))
                    (not (char=? x #\x22))
                    (not (char=? x #\x2C))
                    (not (char=? x #\x7F))))))

(define unescaped/p
  (lv/p (((max) (ref-parameter/p 'field-max))
         ((lst) (many/p unescaped-char/p
                        `((max . ,max)))))
    (let ((str (if (list? lst) (list->string lst) lst)))
      (cond/p
       ((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
        (return/p str))
       ((lookahead/p unescaped-char/p)
        (error/p 'unescaped-too-big))
       (else (error/p 'garbage-after-unescaped str))))))

(define field/p
  (cond/p
   ((lookahead/p (==/p #\")) escaped/p)
   (else unescaped/p)))

(define record/p
  (lv/p (((max) (ref-parameter/p 'record-max)))
    (handle-errors/p
        (many-until/p field/p
                      (lookahead/p (or/p crlf/p eof/p))
                      `((sep/p . ,comma/p)
                         (max . ,max)
                         (expected-end-error . record-too-big)))
      ('return-from-record (lambda (k dto dict dyn mark value)
                             (values dict value))))))

(define csv-eof/p
  (or/p eof/p (and/p crlf/p eof/p)))

(define csv/p
  (lv/p (((max) (ref-parameter/p 'file-max)))
    (many-until/p record/p
                  csv-eof/p
                  `((sep/p . ,crlf/p)
                     (max . ,max)
                     (expected-end-error . file-too-big)))))

(define abort-after-field/p
  (skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
                 advance/p)))

(define abort-field/p
  (cond/p
   ((==/p #\") (and/p (skip/p (cond/p
                               ((==/p #\") (==/p #\"))
                               (else advance/p)))
                      (or/p (==/p #\") (return/p))
                      abort-after-field/p))
   (else abort-after-field/p)))

(define abort-record/p
  (many-until/p abort-field/p
                (lookahead/p (or/p eof/p crlf/p))
                `((sep/p . ,comma/p))))

(define (parse-csv str file-max record-max field-max)
  (parse (phosphate-dto)
         (char-sequence->dict "<stdin>" str)
         (phosphate-empty-dict)
         (parameterize/p csv/p
                         'file-max file-max
                         'record-max record-max
                         'field-max field-max)
    ('garbage-after-escaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    (return/p str))
             dto dict dyn)))))
    ('garbage-after-unescaped
     (lambda (k dto dict dyn mark str)
       (k (lambda ()
            ((and/p abort-after-field/p
                    abort-record/p
                    (error/p 'return-from-record #f))
             dto dict dyn)))))
    ('EOF-in-escaped-text
     (lambda (k dto dict dyn mark . rest)
       (k (lambda ()
            (values dict #f)))))
    (#t (lambda (k dto dict dyn mark . rest)
          (values dict mark)))))

You can try some of the previous examples and see that they work.

(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not" #f #f #f) ; ⇒ (("This is\r\n\"escaped\" "this is not"))

We can also try some limited parses, and see them fail:

(parse-csv "one,two,three" #f #f 4 ) ; ⇒ 'unescaped-too-big
(parse-csv "this,record,is,too,big" #f 4 #f ) ; ⇒ 'record-too-big
(parse-csv "this,file\r\nis not too\r\nbig" 3 #f #f ) ; ⇒ (("this" "file") ("is not too") ("big"))
(parse-csv "this,file\r\nis too\r\nbig" 2 #f #f ) ; ⇒ 'file-too-big

This covers all the major features of Phosphate.

API

All procedure arguments that start with dict or are described as dictionaries must be objects that are (dictionary? dto dict) when passed to the procedure.

Dictonaries

phosphate-dto
1. dto?

(phosphate)

Default DTO.

phosphate-empty-dict
1. dict

(phosphate)

Default empty dictionary.

dict-invoke
1. dto
2. dict
3. key
4. 1. arguments
1. *

(phosphate)

Invoke the procedure associated with key in dict with arguments. Raises an error when there is no mapping for key.

Parser Entry

parsedto dict dyn parser (error handler)

(phosphate)

Start a parser with dict as the input dictionary, dyn as the initial dynamic state, dto as the DTO for both, parser as the parser, and handler as the handler for each error. No handlers are installed by default, even for #f, so make sure to install one or the parser will raise an exception when the parse fails.

Fundamental Parsers

All procedure arguments that start with parser, or are described as parsers must be procedures that take three arguments (a DTO, the global dictionary, and the dynamic dictionary). A procedure that ends in /p is either a parser, a procedure that makes a parser, or a macro that makes a parser.

The number of return values of a parser is the number of values returned by the procedure minus the return dictionary: so a parser that “returns one value” will return two values: the return dictionary and some Scheme value.

return/p
1. 1. values
1. parser

(phosphate)

Creates a parser that returns its input dictionary unchanged and the values passed to it.

delay/pexpr

(phosphate)

Returns a parser that returns two values: its input dictionary unchanged, and the expr evaluated at the time of calling the parser.

ref/p
1. key
1. parser

(phosphate)

Creates a parser that returns the global dictionary unchanged, with the second value being the value associated with key in the global dictionary.

expose/p
1. dto
2. dict
3. dyn
1. dict
2. dto
3. global-dict
4. dyn

(phosphate)

A parser that returns the global dictionary unchanged, with the dto, global dictionary, and dynamic dictionary as values after it.

test/p
1. obj
1. parser

(phosphate)

Returns a parser that returns no values if obj is true, and fails if obj is false.

inject/p
1. obj
1. parser

(phosphate)

Returns a parser that will inject obj as the next object that will be parsed. This object does not increase the line or character number. The next object returned by the parser is the object that would be returned if inject/p had not been parsed.

discard/p
1. parser
1. parser

(phosphate)

Returns a parser that parses parser and discards the return values (but keeps the dictionary).

eof/p
1. dto
2. dict
3. dyn
1. dict

(phosphate)

Returns no values when the parser is at EOF, and fails otherwise.

Errors and Parse Failures

The fail mark is #f, and is used to denote a parse failure that is not necessarily a reportable parse error. This is used for backtracking.

The fallback mark is #t. When an error is raised and there is no handler for it, but there is a handler for the fallback mark, then the handler for the fallback mark is invoked.

if/pconditional/p subsequent/p alternative/p

(phosphate)

Each expression must evaluate to a parser.

Returns a parser that, when parsed, will evaluate the passed parsers.

The parser then parses conditional/p with a dynamic state handling the fail mark.

If that fails, then parse alternative/p in the tail-context of the call to if/p and the input dictionary.

Otherwise, if conditional/p returns the dictionary dict₂ and values vals, then parse (apply subsequent/p vals) with the dictionary dict₂.

This is a macro and not a procedure in order to delay the evaluation of the subsequent and alternative so that it matches the behavior of Scheme's if.

cond/pclause (else parser)

(phosphate)

clause: (expr); (expr₁ expr₂); (expr₁ => formals expr₂)

Evaluates to a parser that will try the left hand side of each clause in order. If one matches, then it binds its results to formals (or just discards them), and then tail-parses the right hand side.

If no parser matches and there is a final else clause, then that parser is tail-parsed. Otherwise the match fails.

as-error/p
1. mark
2. parser
3. 1. irritants
1. parser

(phosphate)

Returns a parser that will parse parser with a dynamic state handling the fail mark. If the parser returns normally, then return the return dictionary and values. If the parser fails with the fail mark, then raise an error with mark, the continuation of the call to as-error/p, DTO, and dictionary, with the irritants passed after it.

error/p
1. mark
2. 1. irritants
1. parser

(phosphate)

Returns a parser that aborts to the first handler marked with mark, passing as arguments:

A continuation that takes a single argument, a thunk. The thunk is called with the continuation of error/p.
The DTO.
The global state.
The dynamic state.
The mark that the error was raised with.

along with irritants as rest arguments.

fail/p
1. dto
2. dict
3. dyn
1. dict

(phosphate)

Equivalent to ((error/p #f) dto dict).

handle-errors/pparser ((name formals) handler)

(phosphate)

Parse parser with each the evaluation of name assigned in the dynamic state to a handler with formals formals and body handler in order of inclusion. The handler is called with the continuation of handle-errors/p.

The arguments to a handler depend on how the handler was invoked. In all cases with the included procedures, the handler is invoked using error/p, and so the next arguments are those arguments.

Backtracking Parsers

The or/p combinator implements infinite lookahead. To use limited lookahead, use if/p or cond/p. Unlike MegaParsack or other PEG implementations, this does not track whether or not a branch of or/p has consumed input. (This was the behavior of early versions of Phosphate.) This is done to make parts of the parser that look ahead explicit to make it easier to understand where error recovery will occur. This also gets rid of dead stack frames and allows for idiomatic code to iterate using tail-recursion.

or/p
1. 1. parser
1. parser

(phosphate)

Returns a parser that will try each parser in order. If that succeeds, return its values. Otherwise try the next parser, failing if all parsers fail.

lookahead/p
1. parser
1. parser

(phosphate)

Returns a parser that parses parser. If the parser succeeds, then the returned dictionary is the input dictionary with no values. If it fails it fails as normal.

Failure will return a continuation that is inside the dynamic extent of lookahead/p.

not/p
1. parser
1. parser

(phosphate)

Returns a parser that parses parser. If that parse succeeds, raise a parse failure with the returned dictionary. Otherwise, return zero values and the input dictionary.

Failure will return the continuation of the invocation of not/p.

Predicates

advance/p
1. dto
2. dict
3. dyn
1. dict
2. value

(phosphate)

Returns a parser that returns the value at this point and a dictionary advanced this point if there is a value at this point. EOF is a parse failure.

satisfy/p
1. obj
2. predicate
1. parser

(phosphate)

Returns a parser that returns a dictionary that is advanced one element from this point if the value at this point if it satisfies predicate, and fails otherwise.

==/p
1. obj
2. 1. equal?
1. parser

(phosphate)

Returns a parser that returns a dictionary that is advanced one element from this point if the value at this point if it is equal? (defaults to R⁷RS's equal?), to obj, and fails otherwise.

!==/p
1. obj
2. 1. equal?
1. parser

(phosphate)

Returns a parser that returns a dictionary that is advanced one element from this point if the value at this point if it does not satisfies equal? (defaults to R⁷RS's equal?), and fails otherwise.

==seq/p
1. sequence
2. 1. equal?
1. parser

(phosphate)

sequence is either a list, or a string or a vector, which are converted to lists.

Returns a parser that matches each element in sequence with the input stream. If successful, it will return a dictionary increment as many positions as there are elements of sequence, and raises a parse failure otherwise.

one-of/p
1. sequence
2. 1. equal?
1. parser

(phosphate)

sequence is a list.

Returns a parser that will return the dictionary advanced one past the current element and the element, if it is equal? (defaults to R⁷RS equal?) to any one of the elements of sequence.

Sequencing

lv/p((formal parser)) parser_final

(phosphate)

Expands to a parser that parses each parser in order, binding the returned values (without the dictionary) to the formals interpeted as multiple-values bindings. The returned dictionary from the previous parsers is used for future parsers. Then the parser_final is parsed, and its continuation is the continuation of the invocation of lv/p.

values/p
1. 1. parser
1. parser

(phosphate)

Returns a parser that parses each parser in sequence like in lv/p. The values are returned, appended from this parser.

and/p
1. 1. parser
1. parser

(phosphate)

Returns a parser that parses each parser in order, discarding all return results except the final parser, which is tail-called.

Repetition

many/p
1. parser
2. 1. kw
1. parser

(phosphate)

The keywords argument is an association list that accepts keys 'min, 'max, 'sep/p, 'too-little-error, and 'expected-after-sep-error.

Returns a parser that parses parser between min (defaults to 0) to max (defaults to #f, which is equivalent to +inf.0).

If sep/p is #f (the default), then the parser will detect the end of the repetition when parser fails to parse.

If sep/p is not #f, then it must be a parser. If the parse of sep/p fails, then the parser will halt repetitions. After the first parse, the returned parser attempts to parse sep/p. If the parse of sep/p succeeds, then it will try to parse parser. If parsing parser fails, then an error with mark expected-after-sep-error (defaults to #f) is raised with arguments:

the number of elements matched,
the minimum number of elements that can be matched,
the maximum number of elements that can be matched,
a procedure of two arguments (an integer and a list), that returns a parser that will continue to parse, where the number of successful parses is the first argument, and the accumulated list of values in reverse is the second element,
the values returned by the parse of sep/p.

The continuation of this error returns from the parse of many/p.

If the parser fails due to matching too little elements, then it will raise an error with error too-little-error, which defaults to #f. The arguments passed to too-little-error are

the number of elements matched,
the minimum number of elements that can be matched,
the maximum number of elements that can be matched,
a procedure of two arguments (an integer and a list), that returns a parser that will continue to parse, where the number of successful parses is the first argument, and the accumulated list of values in reverse is the second element,

The continuation returns the accumulated list to whatever parsed the call to many/p.

The parser will return a list of values, appended, from each parse of sep/p and parser/p.

many-until/p
1. parser
2. end/p
3. 1. kw
1. parser

(phosphate)

Returns a parser that parses parser repeatedly until it parses end/p.

The keywords argument is an association list that accepts keys 'min, 'max, 'sep/p, 'too-little-error, 'expected-after-sep-error, 'expected-sep-or-end-error, and 'expected-end-error.

The returned parser will try to parse end/p. If that succeeds, then its values are discarded and the parse either ends with too-little-error as in many/p (i.e. if min elements are not matched), or the list of elements parsed is returned.

If the maximum number of elements has been parsed, then the parser will try to parse end/p, failing with expected-end-error (defaults to #f) if it could not.

If it does not parse end/p, then it will parse sep/p like in many/p.

If neither works, then the parser raises an error of mark expected-sep-or-end-error (defaults to #f), with the same continuation and values passed to it as too-little-error.

skip/p
1. parser
1. parser

Returns a parser that matches parser zero or more times, discarding the results and returning no values.

Parameterization

parameterize/p
1. parser
2. 1. key
  2. val
1. parser

(phosphate)

Returns a parser that tail-calls parser with each key mapped to its respective val in the dynamic state of the call.

parameterize-update/p
1. key
2. updater
3. parser
1. parser

(phosphate)

Returns a parser that tail-calls parser with key mapped to (dict-ref dto dyn key updater) in the dynamic extent of the call.

ref-parameter/p
1. key
1. parser

(phosphate)

Returns a parser that returns its global dictionary unchanged with a single extra value, the value that key is mapped to in the dynamic state.

Making Dictionaries

A dictionary has (generally) the following keys:

here: The current element to match against.
eof?: Is the parser at EOF?
next: A procedure that takes a DTO and a dictionary, and returns a dictionary that is at the next element in the sequence. This must memoize the result, so that streams are lazy.
line-number: The line number (when parsing chars).
offset: The character offset (when parsing chars).
filename: Filename.

parser-init
1. dto
2. dict
1. dict

(phosphate)

Advances dict if here is not a key.

char-sequence->dict
1. filename
2. input
1. dict

(phosphate)

Input must be either a list of chars, a string, a vector of chars, or a textual input port.

Returns a dictionary that parses each character in the sequence, with line numbers and offsets.

char-position-wrapper
1. dto
2. dict
3. filename

(phosphate)

Returns a dictionary with the filename, tracking line and offset numbers starting from the line-number and offset in the dictionary, which defaults to 1 and 0 respectively.

port-like->dict
1. dto
2. dict
3. accessor
4. port
1. dict

(phosphate)

Returns a dictionary that parses input by calling accessor on port.

char-port->dict
1. dto
2. dict
3. port
1. dict

(phosphate)

Returns a dictionary that parses input by calling read-char on port.

u8-port->dict
1. dto
2. dict
3. port
1. dict

(phosphate)

Returns a dictionary that parses input by calling read-u8 on port.

generator->dict
1. dto
2. dict
3. generator
1. dict

(phosphate)

Returns a dictionary that parses input by invoking generator with no arguments.

list->dict!
1. dto
2. dict
3. list
1. dict

(phosphate)

Returns a dictionary that parses input by reading from the list in sequence. Mutating the list will cause undefined behavior.

list->dict
1. dto
2. dict
3. list
1. dict

(phosphate)

Returns a dictionary that parses input by reading from the list in sequence. The list is copied beforehand.

string->dict
1. dto
2. dict
3. string
1. dict

(phosphate)

Returns a dictionary that parses input by reading from the string in sequence.

JSON Parser

This is a set of parser combinators for RFC 8259 JSON. Instead of offering a parser that parses a single JSON document, this library offers a set of JSON parsers that can validate the input structure during parsing. For instance:

(object*/p (string/p)
           "first_name" (string/p)
           "last_name" (string/p)
           "age" (number/p))

will create a parser that will only accept objects with exactly three keys, "first_name", "last_name", and "age" (in any order), that take a string, a string, and a number respectively.

Atoms

whitespace/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Skips internal whitespace and returns no values.

true/p
1. dto
2. dict
3. dyn
1. dict
2. #t

(phosphate RFC8259)

Reads true and returns #t if the parse is successful.

false/p
1. dto
2. dict
3. dyn
1. dict
2. #f

(phosphate RFC8259)

Reads false and returns #f if the parse is successful.

null/p
1. dto
2. dict
3. dyn
1. dict
2. 'null

(phosphate RFC8259)

Reads null and returns 'null if the parse is successful.

Numbers

number/p
1. 1. max-of-part
1. parser

(phosphate RFC8259)

Returns a parser that reads JSON numbers, and returns a number parsed using number->string. The maximum number of characters for each part of the number is max-of-part (defaults to infinite).

The parser will raise the following errors:

'no-digits-after-frac: Raised when . is parsed, but no digits follow it. The continuation of this error expects a list of ASCII digits as characters.
'no-digits-after-e: Raised when e is parsed, but no digits follow it. The continuation of this error expects a list of ASCII digits as characters.

start-number/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Looks ahead to see if the input stream is at a character that starts a number, which is an ASCII digit or -.

delimits-number/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Looks ahead to see if the input stream is at a character that can follow a well formed number. This is a whitespace character, the start/end of an object, or ,.

Strings

as-string/p
1. string
1. parser

(phosphate RFC8259)

Returns a parser that parses a JSON string that is equal to string. The parser will parse a string that has the exact length of the input string, and will return a newly allocated string if the string is equal? to the string.

UTF16/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Parses a UTF-16 escape, starting with \u.

If an escape is not a surrogate pair, the parser parses one \u escape and returns the character.
If the escape is a high surrogate, and the next character is an escape for a low surrogate, then the parser will consume both escapes and returns the character reprsented by the pair.
If the first \u is not a valid unicode sequence (it is not four hexadecimal characters), then the parser raises a 'invalid-unicode-escape error, whose continuation accepts one value, an integer representing the parsed codepoint. The recommended method for handling this error is to return #xFFFD. (This is an integer, not a character.)
If the first \u is a low surrogate, or it is a high surrogate that is not followed by a low surrogate sequence, then the parser raises a 'lone-surrogate error with the first codepoint as an integer parsed. The continuation is the continuation of the parse of UTF-16/p. The recommended method for handling this error is to return #\xFFFD. (This is the character, not an integer.)

unescaped/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Parses a character that is allowed to appear unescaped in a string.

string/p
1. 1. min
  2. 1. max
1. parser

(phosphate RFC8259)

Returns a parser that parses a string with minimum length min (default 0) and maximum length max.

The parser will raise the same errors as UTF16/p. In addition, it will return

'expected-end-of-string: The end double quote " was not found, either because of EOF or because an unescaped character was found. The string parsed up to that point is passed. The continuation of this error returns the string that is returned by the parse of string/p.
'unknown-escape: An unknown escape character was found. The parser is at the unknown escape character. The continuation of this error returns a character that is inserted into the string.

Objects

object/p
1. key/p
2. value/p
3. 1. min
  2. 1. max
1. parser

(phosphate RFC8259)

Returns a parser that reads a JSON object, where all keys match key/p and all values match value/p. The object matched has length between min (default 0) and max (default infinite).

The returned value is an alist.

The following errors are raised by returned parser:

'expected-key-separator: The character that separates the key and value in the object is missing. The continuatuion takes no values and tries to parse the value.
'expected-value: The value parser failed. The key is passed as an argument. The continuation takes one value, which is the value for the key.
'object-too-small: The object parsed was smaller than the length limit. The arguments are the minimum length, maximum length, and the accumulated alist (in reverse). The continuation of this error accepts the whole alist for the parser, and tries to match }.
'expected-end-of-object: An invalid char or EOF was found in the middle of the object. The alist is passed as an argument. The continuation of this error accepts the whole alist.

object-dict/p
1. string/p
2. key-dto
3. key-dict
1. parser

(phosphate RFC8259)

Parses a JSON object that has the same keys as the keys of key-dict (in the sense of equal?). The parsed values are parsed according to the parser associated with each key in key-dict.

The errors raised by the returned parser include the errors returned by object/p, and in addition

unexpected-key: The object matched a key that is either not in the dictionary, or was already matched. The handler is passed the key, and the accumulated alist (in reverse). The continuation of this call returns the object.
unexpected-value: A parse of the value for the key raised a parse failure. The handle is passed the key. The continuation of this error returns the value that was to be parsed.

object*/p
1. string/p
2. 1. string
  2. parser
1. parser

(phosphate RFC8259)

Equivalent to calling object-dict/p with a dictionary that maps each string to each parser.

Arrays

array/p
1. parser
2. 1. min
  2. 1. max
1. parser

(phosphate RFC8259)

Create a parser that parses a JSON array. Each element in the array is parser, and the array must be between min elements (default 0) and max elements (default infinity).

The errors raised by the returned parser are the errors raised by parser, and

'array-too-small: The array parsed was smaller than the length limit. The arguments are the minimum length, maximum length, and the accumulated array (as a list in reverse). The continuation of this error accepts the whole list (in regular order as a vector) for the parser, and tries to match ].
expected-end-of-array: An unexpected character or EOF was found. The arguments to this is the array (as a vector). The continuation of this error accepts the array as a vector.

array*/p
1. parser
2. 1. value/p
1. parser

(phosphate RFC8259)

Create a parser that parses a JSON array with the same number of elements as arguments to this procedure, and where the nth element is parsed by the nth parser.

The errors raised by the returned parse are the errors raised by the parsers, the errors raised by array/p, and in addition

'unexpected-array-value: The parser that parses the value in this part of the array failed. The continuation accepts one value, the value in place of the failed parse.

A JSON File

RFC8259/p
1. dto
2. dict
3. dyn
1. dict

(phosphate RFC8259)

Parses a JSON file. The errors raised by this parser is the union of the errors raised from number/p, object/p, array/p, and string/p.

Copyright

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Index

!==/p
- defined
/p
- convention
==/p
- defined
==seq/p
- defined
RFC8259/p
- defined
UTF16/p
- defined
advance/p
- defined
and/p
- defined
array*/p
- defined
array/p
- defined
as-error/p
- defined
as-string/p
- defined
char-port->dict
- defined
char-position-wrapper
- defined
char-sequence->dict
- defined
clause
- definition
cond/p
- defined
delay/p
- defined
delimits-number/p
- defined
dict
- argument parameter definition
dict-invoke
- defined
discard/p
- defined
eof/p
- defined
error/p
- defined
expose/p
- defined
fail mark
- definition
fail/p
- defined
fallback mark
- definition
false/p
- defined
generator->dict
- defined
handle-errors/p
- defined
if/p
- defined
inject/p
- defined
list->dict
- defined
list->dict!
- defined
lookahead/p
- defined
lv/p
- defined
many-until/p
- defined
many/p
- defined
not/p
- defined
null/p
- defined
number/p
- defined
object*/p
- defined
object-dict/p
- defined
object/p
- defined
one-of/p
- defined
or/p
- defined
parameterize-update/p
- defined
parameterize/p
- defined
parse
- defined
parser
- argument parameter definition
parser-init
- defined
phosphate-dto
- defined
phosphate-empty-dict
- defined
port-like->dict
- defined
ref-parameter/p
- defined
ref/p
- defined
return/p
- defined
satisfy/p
- defined
skip/p
- defined
start-number/p
- defined
string->dict
- defined
string/p
- defined
test/p
- defined
true/p
- defined
u8-port->dict
- defined
unescaped/p
- defined
values/p
- defined
whitespace/p
- defined