Thy gardens and thy phosphate mines,
Florida, my Florida— Chastain V. Waugh
Current version: 0.1
Phosphate is a parser-combinator library with error recovering inspired by algebraic effect handlers.
Parsers are procedures of three arguments: an SRFI 225 DTO, a dictionary of global state, and a dictionary of dynamic state. Parsers return the new global state and zero or more values.
Global state is passed along successful parses and contains the input state of the text to parse. The global state can also be used to store parse-wide settings, such as case folding.
The dynamic state is modified and passed around as parsers are composed
like Scheme parameters and is used to store the handlers and information
such as nesting depth. The dynamic state is not the same as the dynamic
extent, although it acts in an analagous way. Unlike the parameterize
form of R7RS, parameterize/p is guaranteed to tail-call its
parser argument.
Parse failures and parse errors are handled using a system like
algebraic effect handlers (except they use call/cc instead of
delimited continuations). A system parse failure is usually handled
by the library, while a user-defined parse error is handled by a
user-defined handler. The error handlers get a continuation that allows them
to execute code and return values in the raised-from context.
If you port this library to a system with real delimited continuations
(either Racket or a future RNRS), you should use the name
orthophosphate.
Inspired by Medeiros and Mascarenhas, “Syntax Error Recovery in Parsing Expression Grammars”, 2018, and MegaParsack.
Phosphate is written in R7RS and is tested to work in Chibi Scheme 0.11 and CHICKEN 5. Phosphate only requires SRFI 225 to run (requires SRFI 64 for tests).
WARNING! The API is unlikely to change, but this is an unstable release, so it might.
This tutorial will walk through parsing an RFC 4180 comma separated values file using all of the major features of Phosphate.
If you are looking for an actual CSV parsing library, Wolfgang Corcoran-Mathe has one for R6RS.
A CSV file is a sequence of records separated by CRLF, optionally followed by CRLF followed by EOF. The many-until/p parser is perfect for this. It parses a parser repeatedly until it parses some end parser, and returns the results as a list.
(define csv/p
(many-until/p record/p
csv-eof/p
`((sep/p . ,crlf/p))))The many-until/p parser takes keyword arguments in the form of an alist, which will be modified later as we get into error handling.
The crlf/p parser just checks for \r\n in the feed. This can be accomplished
using the ==seq/p parser, which checks for a string, list, or vector in
the stream.
(define crlf/p
(==seq/p "\r\n"))
The csv-eof/p procedure just checks for an optional CRLF followed by EOF. This can
be handled using the previously defined crlf/p, eof/p (checks that the sequence is at EOF),
or/p (disjunction over parsers) and
and/p (composes parsers together, returning the results of the final
parser).
(define csv-eof/p
(or/p eof/p (and/p crlf/p eof/p)))
The record/p field parses text data separated by commas. It is terminated
by CRLF, but it cannot consume it because csv/p consumes it. We can check
for CRLF without consuming input using lookahead/p.
(define record/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p))))The many-until/p and the related many/p parsers will
insert the values returne by the separator parser into the list of
parsed results. This was not a problem for crlf/p because it used
==seq/p because it returns no results. We could use that parser
again, but for the sake of education I will use the parser ==/p,
which does return a value. Since we don't want the commas in the returned
list of fields, we can use discard/p to discard the parsed values.
(define comma/p
(discard/p (==/p #\,)))
A field/p is either escaped or unescaped. We could use the or/p
combinator here to try one, backtrack when it does not work, and then
try another, but this is wasteful. The backtrack state will be active
throughout the entire parse even when we have parsed ", which cannot
be in an unescaped field. Instead we will use cond/p, which
will tail-call a parser when one parser succeeds. In this case, we wish
to look ahead one character to see if the next character is ", and if
so parse the quoted character without pushing backtracking state.
Otherwise we parse the unescaped field.
(define field/p
(cond/p
((lookahead/p (==/p #\")) escaped/p)
(else unescaped/p)))Unescaped text does not allowed control characters, double quotes, commas, or DEL. As an extension to the spec, all non-ASCII Unicode characters are supported. To apply a Scheme predicate to the current character, use satisfy/p (which when successful will return the character it matched against), inside many/p to capture zero or more of the valid characters. Since we want to return a string, we will capture the returned value of many/p using lv/p (short for let-values/p) and then use return/p to convert the list of characters into a string.
(define unescaped-char/p
(satisfy/p (lambda (x)
(and (not (char<? x #\x20))
(not (char=? x #\x22))
(not (char=? x #\x2C))
(not (char=? x #\x7F))))))
(define unescaped/p
(lv/p (((lst) (many/p unescaped-char/p)))
(return/p (list->string lst))))The use of return/p here wraps the returned values such that other parses can read the value along with the current parse state returned from many/p.
Escaped text starts with a double-quote and ends with a double quote. The escaped text can have two double-quotes in it, which turn into a single double quote.
We can accomplish this using advance/p (return any value at the current input point, and advance input) and not/p (attempt to parse the parser passed to it, and fail if it succeeds). The not/p parser is a lookahead parser, so if it succeeds the parse state is at the same point as the entry to not/p.
(define escaped-char/p
(cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(define escaped/p
(and/p (==/p #\")
(lv/p (((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\"))))))
(return/p (list->string lst)))))Note that in escaped-char/p, the value returned by the first (==/p #\")
is discarded. The parser then tail-calls the second (==/p #\"), which
will return a value if the second character is ", and fail otherwise.
To actually parse input, we use parse. This requires the
DTO that describes the operations on the input stream,
the input stream as a dictionary, the dynamic state as a dictionary,
the parser, and any error handlers. There are no named error handlers
so we handle the catch-all error #t and just return #f on error.
(define (parse-csv str)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
csv/p
(#t (lambda (k dto dict dyn mark . rest)
(values dict #f)))))The parser will return two values: the input dictionary at whatever point
the parser stopped at, and the parsed CSV file or #f on error.
This completes our simple CSV parser. Since the parser is made up of procedures, we need to rearrange the definitions to make it work. A complete script is:
(import (scheme base) (phosphate))
(define comma/p
(discard/p (==/p #\,)))
(define crlf/p
(==seq/p "\r\n"))
(define escaped-char/p
(cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(define escaped/p
(and/p (==/p #\")
(lv/p (((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\"))))))
(return/p (list->string lst)))))
(define unescaped-char/p
(satisfy/p (lambda (x)
(and (not (char<? x #\x20))
(not (char=? x #\x22))
(not (char=? x #\x2C))
(not (char=? x #\x7F))))))
(define unescaped/p
(lv/p (((lst) (many/p unescaped-char/p)))
(return/p (list->string lst))))
(define field/p
(cond/p
((lookahead/p (==/p #\")) escaped/p)
(else unescaped/p)))
(define record/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p))))
(define csv-eof/p
(or/p eof/p (and/p crlf/p eof/p)))
(define csv/p
(many-until/p record/p
csv-eof/p
`((sep/p . ,crlf/p))))
(define (parse-csv str)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
csv/p
(#t (lambda (k dto dict dyn mark . rest)
(values dict #f)))))You can try some examples:
(parse-csv "a,b,c") ; ⇒ (("a" "b" "c"))
(parse-csv "h,e,llo\r\nw,o,rld") ; ⇒ (("h" "e" "llo") ("w" "o" "rld"))
(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not") ; ⇒ (("This is\r\n\"escaped\" "this is not"))This parser works, but when a parse error occurs, the parser just gives up and says it failed. There is no description of the failure, and no way to recover from the failure.
(parse-csv "\"truncated") ; ⇒ #f
(parse-csv "invalid \"text\"") ; → #f
Phosphate allows us to write a parser that raises named parse errors with continuations that allows the parser to recover and continue. Where to put error handlers and how many errors handlers to install will depend on what you need. With parser combinators, it is easy to write a parser, so there is no need to write the most general parser possible.
Phosphate differentiates between parse failures, which have the mark
#f, and parse errors, which have any other mark. Parse errors
are meant to be handled or signalled, while parse failures are meant
to backtrack. Parse failures can be converted into parse errors using
as-error/p. Our current parser only has parse failures and no
parse errors.
Lets start adding errors at escaped/p. The many-until/p allows
us to specify certain error behavior. The parser does not use min
or max to specify size constraints, so too-little-error
and expected-end-error are not useful. The escaped text does
not have a separator, so expected-after-sep-error is not useful
either. The error we are looking for is expected-sep-or-end-error,
which is raised when the end parser and the char parser fail. If this
error is raied, it means that the parser hits EOF while trying to parse
the escaped text.
So as a first pass we can write
(define escaped/p
(and/p (==/p #\")
(lv/p (((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\"))
'((expected-sep-or-end-error
. EOF-in-escaped-text)))))
(return/p (list->string lst)))))This will raise 'EOF-in-escaped-text when the parser hits EOF while
parsing escaped text. The arguments that the handler receives are
documented in the many-until/p entry.
We also want to raise an error when garbage data is found after the closing quote mark. We can do this by adding another parser at the end that looks ahead for delimiters.
(define escaped/p
(and/p (==/p #\")
(lv/p (((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\")))
'((expected-sep-or-end-error
. EOF-in-escaped-text)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p
(lookahead/p comma/p)
(lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-escaped str)))))))Here error/p will raise an error with mark 'garbage-after-escaped
with the parsed string as an argument. The continuation that is passed to
the handler is the continuation of the parse of error/p. Note that
the body of lv/p is just an expression, so we can put a regular
let in there. The if expression is used to allow for non-list returns
from error handlers.
Similar considerations apply to unescaped/p:
(define unescaped/p
(lv/p (((lst) (many/p unescaped-char/p)))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-unescaped str))))))There is nothing to be done for field/p, as it is just a wrapper for
escaped/p and unescaped/p, which do the error handling.
At record/p there is not much error handling to be done, because errors
regarding missing CRLF or commas are done in the parsers we just did.
However, if parsing of a field fails, we may want to throw out the whole
row instead of trying to recover an individual parse. We can do this by
installing a handler whenever record/p is parsed using
handle-errors/p.
(define record/p
(handle-errors/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p)))
('return-from-record (lambda (k dto dict dyn mark value)
(values dict value)))))During the parse of record/p, a handler named 'return-from-record
is installed. When it is invoked (for instance, with (error/p
'return-from-record value), the passed value is returned from the parse
of record/p. This is because the lambda is evaluated in the continuation
of the invocation of handle-errors/p.
As an artifical example, we will write a parser entry procedure that will abort a record when a malformed unescaped field is found, but will recover for a malformed escaped field.
When there is garbage after a field, we can skip over the characters in
that field until we get to a synchronization point, which is , CRLF,
or EOF.
When we want to abort a record, we have to parse fields until the end of a record is found, but discarding all data accumulated.
(define abort-after-field/p
(skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
advance/p)))
(define abort-field/p
(cond/p
((==/p #\") (and/p (skip/p (cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(or/p (==/p #\") (return/p))
abort-after-field/p))
(else abort-after-field/p)))
(define abort-record/p
(many-until/p abort-field/p
(lookahead/p (or/p eof/p crlf/p))
`((sep/p . ,comma/p))))
(define (parse-csv str)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
csv/p
('garbage-after-escaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
(return/p str))
dto dict dyn)))))
('garbage-after-unescaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
abort-record/p
(error/p 'return-from-record #f))
dto dict dyn)))))
('EOF-in-escaped-text
(lambda (k dto dict dyn mark . rest)
(k (lambda ()
(values dict #f)))))))The parse form uses handle-errors/p to install handlers, so the same considerations apply to both. The previous use of handle-errors/p returned values in the continuation that the handler was invoked in, and directly returned values without further parsing. Here we see parsers that invoke the continuation k.
The continuation passed to an error handler accepts a single value, a
thunk, and will evaluate that thunk in that continuation. Hence each
handler will re-enter the continuation. In the case of the first two
handlers, it will re-start parsing by passing values manually to a
new set of combinators. After skipping tokens to get to a synchronization
point, the parsers either return values to their continuation (in the
cased of 'garbage-after-escaped).
With this we have written our parser with error recovery. The complete script is:
(import (scheme base) (phosphate))
(define comma/p
(discard/p (==/p #\,)))
(define crlf/p
(==seq/p "\r\n"))
(define escaped-char/p
(cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(define escaped/p
(and/p (==/p #\")
(lv/p (((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\")))
'((expected-sep-or-end-error
. EOF-in-escaped-text)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p
(lookahead/p comma/p)
(lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-escaped str)))))))
(define unescaped-char/p
(satisfy/p (lambda (x)
(and (not (char<? x #\x20))
(not (char=? x #\x22))
(not (char=? x #\x2C))
(not (char=? x #\x7F))))))
(define unescaped/p
(lv/p (((lst) (many/p unescaped-char/p)))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-unescaped str))))))
(define field/p
(cond/p
((lookahead/p (==/p #\")) escaped/p)
(else unescaped/p)))
(define record/p
(handle-errors/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p)))
('return-from-record (lambda (k dto dict dyn mark value)
(values dict value)))))
(define csv-eof/p
(or/p eof/p (and/p crlf/p eof/p)))
(define csv/p
(many-until/p record/p
csv-eof/p
`((sep/p . ,crlf/p))))
(define abort-after-field/p
(skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
advance/p)))
(define abort-field/p
(cond/p
((==/p #\") (and/p (skip/p (cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(or/p (==/p #\") (return/p))
abort-after-field/p))
(else abort-after-field/p)))
(define abort-record/p
(many-until/p abort-field/p
(lookahead/p (or/p eof/p crlf/p))
`((sep/p . ,comma/p))))
(define (parse-csv str)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
csv/p
('garbage-after-escaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
(return/p str))
dto dict dyn)))))
('garbage-after-unescaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
abort-record/p
(error/p 'return-from-record #f))
dto dict dyn)))))
('EOF-in-escaped-text
(lambda (k dto dict dyn mark . rest)
(k (lambda ()
(values dict #f)))))))Normal code is fine:
(parse-csv "a,b,c") ; ⇒ (("a" "b" "c"))
(parse-csv "h,e,llo\r\nw,o,rld") ; ⇒ (("h" "e" "llo") ("w" "o" "rld"))
(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not") ; ⇒ (("This is\r\n\"escaped\" "this is not"))CSV files with garbage after escaped work as expected:
(parse-csv "field,\"my \"\"escaped\"\" field\" with \"garbage,next field\r\nother, field\r\n")
; ⇒ (("field "my \"escaped\" field" "next field") ("other field"))Garbage after unescaped will handle even very mangled files:
(parse-csv "multiple, garbage \n, and \"this has a \r\n\
in it, causing another row\", extra stuff\r\n\
this, is OK")
; ⇒ (#f #f ("this" "is OK"))CSV files with unterminated escaped text will return #f:
(parse-csv "a,record,with,\"EOF") ; ⇒ (("a" "record" "with" #f))This parser will work, but what if we are parsing untrusted data? Then we might read a very large file that will cause our program to run out of memory. This is more of an issue for file formats like JSON with arbitrary nesting, but we might still want to impose limits on the length of fields, records, and files.
We can use parameterize/p for this, unlike R7RS's parameterize,
the parser that is being parameterized will always be tail-called. This is
because parser parameters are carried around in the dyn dictionary.
We will add the parameters field-max, record-max, and file-max. These
are either #f or exact positive integers. The many-until/p and
many/p parsers accept a 'max keyword that will limit the text to
a maximum length, and they also can emit errors when this maximum length is
hit without a corresponding limiter.
(define escaped/p
(and/p (==/p #\")
(lv/p (((max) (ref-parameter/p 'field-max))
((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\")))
`((expected-sep-or-end-error
. EOF-in-escaped-text)
(max . ,max)
(expected-end-error . escaped-too-big)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p
(lookahead/p comma/p)
(lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-escaped str)))))))
(define unescaped/p
(lv/p (((max) (ref-parameter/p 'field-max))
((lst) (many/p unescaped-char/p
`((max . ,max)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
(return/p str))
((lookahead/p unescaped-char/p)
(error/p 'unescaped-too-big))
(else (error/p 'garbage-after-unescaped str))))))
(define record/p
(lv/p (((max) (ref-parameter/p 'record-max)))
(handle-errors/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p)
(max . ,max)
(expected-end-error . record-too-big)))
('return-from-record (lambda (k dto dict dyn mark value)
(values dict value))))))
(define csv/p
(lv/p (((max) (ref-parameter/p 'file-max)))
(many-until/p record/p
csv-eof/p
`((sep/p . ,crlf/p)
(max . ,max)
(expected-end-error . file-too-big)))))Here we use the ref-parameter/p parser, which will return the value associated with the key given in its argument.
The parse-csv procedure now takes limits.
(define (parse-csv str file-max record-max field-max)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
(parameterize/p csv/p
'file-max file-max
'record-max record-max
'field-max field-max)
('garbage-after-escaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
(return/p str))
dto dict dyn)))))
('garbage-after-unescaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
abort-record/p
(error/p 'return-from-record #f))
dto dict dyn)))))
('EOF-in-escaped-text
(lambda (k dto dict dyn mark . rest)
(k (lambda ()
(values dict #f)))))
(#t (lambda (k dto dict dyn mark . rest)
(values dict mark)))))The parser is set up to fail as soon as a limit as reached. Exercise for the reader: modify the parser to allow for sensible error recovery after a limit has been reached.
The complete script looks like:
(import (scheme base) (phosphate))
(define comma/p
(discard/p (==/p #\,)))
(define crlf/p
(==seq/p "\r\n"))
(define escaped-char/p
(cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(define escaped/p
(and/p (==/p #\")
(lv/p (((max) (ref-parameter/p 'field-max))
((lst)
(many-until/p escaped-char/p
(and/p (==/p #\")
(not/p (==/p #\")))
`((expected-sep-or-end-error
. EOF-in-escaped-text)
(max . ,max)
(expected-end-error . escaped-too-big)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p
(lookahead/p comma/p)
(lookahead/p crlf/p))
(return/p str))
(else (error/p 'garbage-after-escaped str)))))))
(define unescaped-char/p
(satisfy/p (lambda (x)
(and (not (char<? x #\x20))
(not (char=? x #\x22))
(not (char=? x #\x2C))
(not (char=? x #\x7F))))))
(define unescaped/p
(lv/p (((max) (ref-parameter/p 'field-max))
((lst) (many/p unescaped-char/p
`((max . ,max)))))
(let ((str (if (list? lst) (list->string lst) lst)))
(cond/p
((or/p eof/p (lookahead/p comma/p) (lookahead/p crlf/p))
(return/p str))
((lookahead/p unescaped-char/p)
(error/p 'unescaped-too-big))
(else (error/p 'garbage-after-unescaped str))))))
(define field/p
(cond/p
((lookahead/p (==/p #\")) escaped/p)
(else unescaped/p)))
(define record/p
(lv/p (((max) (ref-parameter/p 'record-max)))
(handle-errors/p
(many-until/p field/p
(lookahead/p (or/p crlf/p eof/p))
`((sep/p . ,comma/p)
(max . ,max)
(expected-end-error . record-too-big)))
('return-from-record (lambda (k dto dict dyn mark value)
(values dict value))))))
(define csv-eof/p
(or/p eof/p (and/p crlf/p eof/p)))
(define csv/p
(lv/p (((max) (ref-parameter/p 'file-max)))
(many-until/p record/p
csv-eof/p
`((sep/p . ,crlf/p)
(max . ,max)
(expected-end-error . file-too-big)))))
(define abort-after-field/p
(skip/p (and/p (not/p (or/p comma/p crlf/p eof/p))
advance/p)))
(define abort-field/p
(cond/p
((==/p #\") (and/p (skip/p (cond/p
((==/p #\") (==/p #\"))
(else advance/p)))
(or/p (==/p #\") (return/p))
abort-after-field/p))
(else abort-after-field/p)))
(define abort-record/p
(many-until/p abort-field/p
(lookahead/p (or/p eof/p crlf/p))
`((sep/p . ,comma/p))))
(define (parse-csv str file-max record-max field-max)
(parse (phosphate-dto)
(char-sequence->dict "<stdin>" str)
(phosphate-empty-dict)
(parameterize/p csv/p
'file-max file-max
'record-max record-max
'field-max field-max)
('garbage-after-escaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
(return/p str))
dto dict dyn)))))
('garbage-after-unescaped
(lambda (k dto dict dyn mark str)
(k (lambda ()
((and/p abort-after-field/p
abort-record/p
(error/p 'return-from-record #f))
dto dict dyn)))))
('EOF-in-escaped-text
(lambda (k dto dict dyn mark . rest)
(k (lambda ()
(values dict #f)))))
(#t (lambda (k dto dict dyn mark . rest)
(values dict mark)))))You can try some of the previous examples and see that they work.
(parse-csv "\"This is\r\n\"\"escaped\"\"\",this is not" #f #f #f) ; ⇒ (("This is\r\n\"escaped\" "this is not"))We can also try some limited parses, and see them fail:
(parse-csv "one,two,three" #f #f 4 ) ; ⇒ 'unescaped-too-big
(parse-csv "this,record,is,too,big" #f 4 #f ) ; ⇒ 'record-too-big
(parse-csv "this,file\r\nis not too\r\nbig" 3 #f #f ) ; ⇒ (("this" "file") ("is not too") ("big"))
(parse-csv "this,file\r\nis too\r\nbig" 2 #f #f ) ; ⇒ 'file-too-bigThis covers all the major features of Phosphate.
All procedure arguments that start with dict or are described as dictionaries must
be objects that are (dictionary? dto dict) when
passed to the procedure.
Default DTO.
Default empty dictionary.
Invoke the procedure associated with key in dict with arguments. Raises an error when there is no mapping for key.
Start a parser with dict as the input dictionary,
dyn as the initial dynamic state, dto
as the DTO for both, parser as the parser, and
handler as the handler for each error.
No handlers are installed by default, even for #f, so make sure
to install one or the parser will raise an exception when the parse
fails.
All procedure arguments that start with parser,
or are described as parsers must be procedures that take
three arguments (a DTO, the global dictionary, and the dynamic
dictionary). A procedure that ends
in /p is either a parser, a procedure that makes a
parser, or a macro that makes a parser.
The number of return values of a parser is the number of values returned by the procedure minus the return dictionary: so a parser that “returns one value” will return two values: the return dictionary and some Scheme value.
Creates a parser that returns its input dictionary unchanged and the values passed to it.
Returns a parser that returns two values: its input dictionary unchanged, and the expr evaluated at the time of calling the parser.
Creates a parser that returns the global dictionary unchanged, with the second value being the value associated with key in the global dictionary.
A parser that returns the global dictionary unchanged, with the dto, global dictionary, and dynamic dictionary as values after it.
Returns a parser that returns no values if obj is true, and fails if obj is false.
Returns a parser that will inject obj as the next object that will be parsed. This object does not increase the line or character number. The next object returned by the parser is the object that would be returned if inject/p had not been parsed.
Returns a parser that parses parser and discards the return values (but keeps the dictionary).
The fail mark is #f, and is used to denote a
parse failure that is not necessarily a reportable parse error. This is
used for backtracking.
The fallback mark is #t.
When an error is raised and there is no handler for it, but there is
a handler for the fallback mark, then the handler for the fallback mark
is invoked.
Each expression must evaluate to a parser.
Returns a parser that, when parsed, will evaluate the passed parsers.
The parser then parses conditional/p with a dynamic state handling the fail mark.
If that fails, then parse alternative/p in the tail-context of the call to if/p and the input dictionary.
Otherwise, if conditional/p returns the dictionary dict2
and values vals, then parse (apply subsequent/p vals) with
the dictionary dict2.
This is a macro and not a procedure in order to delay the evaluation of
the subsequent and alternative so that it matches the behavior of Scheme's
if.
Evaluates to a parser that will try the left hand side of each clause in order. If one matches, then it binds its results to formals (or just discards them), and then tail-parses the right hand side.
If no parser matches and there is a final else clause, then that parser
is tail-parsed. Otherwise the match fails.
Returns a parser that will parse parser with a dynamic state handling the fail mark. If the parser returns normally, then return the return dictionary and values. If the parser fails with the fail mark, then raise an error with mark, the continuation of the call to as-error/p, DTO, and dictionary, with the irritants passed after it.
Returns a parser that aborts to the first handler marked with mark, passing as arguments:
A continuation that takes a single argument, a thunk. The thunk is
called with the continuation of
The DTO.
The global state.
The dynamic state.
The mark that the error was raised with.
along with irritants as rest arguments.
Parse parser with each the evaluation of name assigned in the dynamic state to a handler with formals formals and body handler in order of inclusion. The handler is called with the continuation of handle-errors/p.
The arguments to a handler depend on how the handler was invoked. In all cases with the included procedures, the handler is invoked using error/p, and so the next arguments are those arguments.
The or/p combinator implements infinite lookahead. To use limited lookahead, use if/p or cond/p. Unlike MegaParsack or other PEG implementations, this does not track whether or not a branch of or/p has consumed input. (This was the behavior of early versions of Phosphate.) This is done to make parts of the parser that look ahead explicit to make it easier to understand where error recovery will occur. This also gets rid of dead stack frames and allows for idiomatic code to iterate using tail-recursion.
Returns a parser that will try each parser in order. If that succeeds, return its values. Otherwise try the next parser, failing if all parsers fail.
Returns a parser that parses parser. If the parser succeeds, then the returned dictionary is the input dictionary with no values. If it fails it fails as normal.
Failure will return a continuation that is inside the dynamic extent of lookahead/p.
Returns a parser that parses parser. If that parse succeeds, raise a parse failure with the returned dictionary. Otherwise, return zero values and the input dictionary.
Failure will return the continuation of the invocation of not/p.
Returns a parser that returns the value at this point and a dictionary advanced this point if there is a value at this point. EOF is a parse failure.
Returns a parser that returns a dictionary that is advanced one element
from this point if the value at this point if it satisfies
predicate, and fails otherwise.
Returns a parser that returns a dictionary that is advanced one element
from this point if the value at this point if it is
equal? (defaults to R7RS's equal?), to obj, and fails otherwise.
Returns a parser that returns a dictionary that is advanced one element
from this point if the value at this point if it does not satisfies
equal? (defaults to R7RS's equal?), and fails otherwise.
sequence is either a list, or a string or a vector, which are converted to lists.
Returns a parser that matches each element in sequence with the input stream. If successful, it will return a dictionary increment as many positions as there are elements of sequence, and raises a parse failure otherwise.
sequence is a list.
Returns a parser that will return the dictionary advanced one past the
current element and the element, if it is equal? (defaults to R7RS
equal?) to any one of the elements of sequence.
Expands to a parser that parses each parser in order, binding the returned values (without the dictionary) to the formals interpeted as multiple-values bindings. The returned dictionary from the previous parsers is used for future parsers. Then the parserfinal is parsed, and its continuation is the continuation of the invocation of lv/p.
Returns a parser that parses each parser in sequence like in lv/p. The values are returned, appended from this parser.
Returns a parser that parses each parser in order, discarding all return results except the final parser, which is tail-called.
The keywords argument is an association list that accepts keys
'min, 'max, 'sep/p, 'too-little-error, and
'expected-after-sep-error.
Returns a parser that parses parser between min (defaults
to 0) to max (defaults to #f, which is equivalent to +inf.0).
If sep/p is #f (the default), then the parser will detect the
end of the repetition when parser fails to parse.
If sep/p is not #f, then it must be a parser. If the
parse of sep/p fails, then the parser will halt repetitions.
After the first parse, the returned parser attempts to parse sep/p. If the parse of sep/p succeeds, then it will try to
parse parser. If parsing parser fails, then an error
with mark expected-after-sep-error (defaults to #f) is raised
with arguments:
the number of elements matched,
the minimum number of elements that can be matched,
the maximum number of elements that can be matched,
a procedure of two arguments (an integer and a list), that returns a parser that will continue to parse, where the number of successful parses is the first argument, and the accumulated list of values in reverse is the second element,
the values returned by the parse of sep/p.
The continuation of this error returns from the parse of many/p.
If the parser fails due to matching too little elements, then it will
raise an error with error too-little-error, which defaults to #f.
The arguments passed to too-little-error are
the number of elements matched,
the minimum number of elements that can be matched,
the maximum number of elements that can be matched,
a procedure of two arguments (an integer and a list), that returns a parser that will continue to parse, where the number of successful parses is the first argument, and the accumulated list of values in reverse is the second element,
The continuation returns the accumulated list to whatever parsed the call to many/p.
The parser will return a list of values, appended, from each parse of
sep/p and parser/p.
Returns a parser that parses parser repeatedly until it parses end/p.
The keywords argument is an association list that accepts keys
'min, 'max, 'sep/p, 'too-little-error,
'expected-after-sep-error, 'expected-sep-or-end-error, and
'expected-end-error.
The returned parser will try to parse end/p. If that succeeds, then its values are discarded and the parse either ends with too-little-error as in many/p (i.e. if min elements are not matched), or the list of elements parsed is returned.
If the maximum number of elements has been parsed, then the parser
will try to parse end/p, failing with expected-end-error
(defaults to #f) if it could not.
If it does not parse end/p, then it will parse sep/p like in many/p.
If neither works, then the parser raises an error of mark
expected-sep-or-end-error (defaults to #f), with the
same continuation and values passed to it as too-little-error.
Returns a parser that matches parser zero or more times, discarding the results and returning no values.
Returns a parser that tail-calls parser with each key mapped to its respective val in the dynamic state of the call.
Returns a parser that tail-calls parser with key mapped to
(dict-ref dto dyn key updater) in the dynamic extent of the call.
Returns a parser that returns its global dictionary unchanged with a single extra value, the value that key is mapped to in the dynamic state.
A dictionary has (generally) the following keys:
here: The current element to match against.
eof?: Is the parser at EOF?
next: A procedure that takes a DTO and a dictionary, and returns a
dictionary that is at the next element in the sequence. This must
memoize the result, so that streams are lazy.
line-number: The line number (when parsing chars).
offset: The character offset (when parsing chars).
filename: Filename.
Advances dict if here is not a key.
Input must be either a list of chars, a string, a vector of chars, or a textual input port.
Returns a dictionary that parses each character in the sequence, with line numbers and offsets.
Returns a dictionary with the filename, tracking line and offset
numbers starting from the line-number and offset in the dictionary,
which defaults to 1 and 0 respectively.
Returns a dictionary that parses input by calling accessor on port.
Returns a dictionary that parses input by calling read-char on port.
Returns a dictionary that parses input by calling read-u8 on port.
Returns a dictionary that parses input by invoking generator with no arguments.
Returns a dictionary that parses input by reading from the list in sequence. Mutating the list will cause undefined behavior.
Returns a dictionary that parses input by reading from the list in sequence. The list is copied beforehand.
Returns a dictionary that parses input by reading from the string in sequence.
This is a set of parser combinators for RFC 8259 JSON. Instead of offering a parser that parses a single JSON document, this library offers a set of JSON parsers that can validate the input structure during parsing. For instance:
(object*/p (string/p)
"first_name" (string/p)
"last_name" (string/p)
"age" (number/p))will create a parser that will only accept objects with exactly three
keys, "first_name", "last_name", and "age" (in any order), that
take a string, a string, and a number respectively.
Skips internal whitespace and returns no values.
Reads null and returns 'null if the parse is successful.
Returns a parser that reads JSON numbers, and returns a number parsed
using number->string. The maximum number of characters for each part
of the number is max-of-part (defaults to infinite).
The parser will raise the following errors:
'no-digits-after-frac
Raised when . is parsed, but no digits follow it. The continuation
of this error expects a list of ASCII digits as characters.
'no-digits-after-e
Raised when e is parsed, but no digits follow it. The continuation
of this error expects a list of ASCII digits as characters.
Looks ahead to see if the input stream is at a character that starts a
number, which is an ASCII digit or -.
Looks ahead to see if the input stream is at a character that can follow
a well formed number. This is a whitespace character, the start/end of an
object, or ,.
Returns a parser that parses a JSON string that is equal to string.
The parser will parse a string that has the exact length of the
input string, and will return a newly allocated string if the string is
equal? to the string.
Parses a UTF-16 escape, starting with \u.
If an escape is not a surrogate pair, the parser parses one
\u escape and returns the character.
If the escape is a high surrogate, and the next character is an escape for a low surrogate, then the parser will consume both escapes and returns the character reprsented by the pair.
If the first \u is not a valid unicode sequence (it is not
four hexadecimal characters), then the parser raises a
'invalid-unicode-escape error, whose continuation accepts
one value, an integer representing the parsed codepoint. The
recommended method for handling this error is to return #xFFFD.
(This is an integer, not a character.)
If the first \u is a low surrogate, or it is a high surrogate that
is not followed by a low surrogate sequence, then the parser raises
a 'lone-surrogate error with the first codepoint as an integer
parsed. The continuation is the continuation of the parse of
UTF-16/p. The recommended method for handling this error is to
return #\xFFFD. (This is the character, not an integer.)
Parses a character that is allowed to appear unescaped in a string.
Returns a parser that parses a string with minimum length min (default 0) and maximum length max.
The parser will raise the same errors as UTF16/p. In addition, it will return
'expected-end-of-string
The end double quote " was not found, either because of EOF or
because an unescaped character was found. The string parsed
up to that point is passed. The continuation of this error returns
the string that is returned by the parse of string/p.
'unknown-escape
An unknown escape character was found. The parser is at the unknown escape character. The continuation of this error returns a character that is inserted into the string.
Returns a parser that reads a JSON object, where all keys match
key/p and all values match value/p. The object matched has
length between min (default 0) and max (default infinite).
The returned value is an alist.
The following errors are raised by returned parser:
'expected-key-separator
The character that separates the key and value in the object is missing. The continuatuion takes no values and tries to parse the value.
'expected-value
The value parser failed. The key is passed as an argument. The continuation takes one value, which is the value for the key.
'object-too-small
The object parsed was smaller than the length limit. The arguments
are the minimum length, maximum length, and the accumulated alist
(in reverse). The continuation of this error accepts the whole alist
for the parser, and tries to match }.
'expected-end-of-object
An invalid char or EOF was found in the middle of the object. The alist is passed as an argument. The continuation of this error accepts the whole alist.
Parses a JSON object that has the same keys as the keys of key-dict
(in the sense of equal?). The parsed values are parsed according to
the parser associated with each key in key-dict.
The errors raised by the returned parser include the errors returned by object/p, and in addition
unexpected-key
The object matched a key that is either not in the dictionary, or was already matched. The handler is passed the key, and the accumulated alist (in reverse). The continuation of this call returns the object.
unexpected-value
A parse of the value for the key raised a parse failure. The handle is passed the key. The continuation of this error returns the value that was to be parsed.
Equivalent to calling object-dict/p with a dictionary that maps each string to each parser.
Create a parser that parses a JSON array. Each element in the array is parser, and the array must be between min elements (default 0) and max elements (default infinity).
The errors raised by the returned parser are the errors raised by parser, and
'array-too-small
The array parsed was smaller than the length limit. The arguments
are the minimum length, maximum length, and the accumulated array
(as a list in reverse). The continuation of this error accepts
the whole list (in regular order as a vector) for the parser,
and tries to match ].
expected-end-of-array
An unexpected character or EOF was found. The arguments to this is the array (as a vector). The continuation of this error accepts the array as a vector.
Create a parser that parses a JSON array with the same number of elements as arguments to this procedure, and where the nth element is parsed by the nth parser.
The errors raised by the returned parse are the errors raised by the parsers, the errors raised by array/p, and in addition
'unexpected-array-value
The parser that parses the value in this part of the array failed. The continuation accepts one value, the value in place of the failed parse.
Parses a JSON file. The errors raised by this parser is the union of the errors raised from number/p, object/p, array/p, and string/p.
© Peter McGoron 2026
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.