Input Mode:
Trace:
Display Mode | |
---|---|
Records to Display:
max records
last record
Operator Nodes:
Rule Nodes:
About APG
APG was originally developed to generate parsers directly from the Augmented Backus-Naur Form (ABNF) syntax. ABNF is the grammar syntax used to describe many, if not most, Internet technical specifications. It has been standardized by the IETF with RFC 5234 and recently updated it with RFC 7405. APG has since grown to include additional features and the syntax for those features generate additional syntax elements creating a “superset” of ABNF or SABNF.
Besides JavaScript, APG is available in the C/C++ and Java languages. The C/C++ version has a small footprint, is fast and reliable and has been the parser of choice for a number of large Telecom companies.
APG is the generator and parser that powers the new pattern-matching alternative to RegExp, apg-exp. (For an introduction to apg-exp, see this sitepoint.com tutorial.)
About the Author
The author lives on the Gulf Coast of Florida, USA, loves JavaScript and has 30+ years experience with software development of many kinds in many languages. In his spare time he loves hard-boiled detective novels, historical fiction and outdoor photography among many other things. He serves as a volunteer executive director on the boards of multiple, local charitable organizations.
February, 2017
Lowell D. Thomas
Overview↓toc
This file, apg.html, is a static, stand-alone web page interface to APG, an ABNF Parser Generator. It can be launched in any modern web browser from a local file and requires no external resources. It provides easy visual access to the grammar, any grammar errors, the rules and their dependencies, the rule attributes and the generated parser. Additionally, once a parser has been successfully generated, facilities are provided for testing it against input strings of your choice. The user interface consists of a series of panels. Each panel presents one specific result or input opportunity. Access to the panels is through the main menu, the fixed dark blue bar across the top, and sub-menus that appear in the various panels. This User’s Guide will walk you through each of the menu and sub-menu items to explain how this APG interface works.
There are a couple of sub-menu items that appear on most panels consistently and those will be described here once and for all and not specifically referenced in the other panel specifics presented below.
- Copy: Web browsers have no access to the local file system†, and since this is a stand-alone web page with no server behind it, there is no upload/download ability. Therefore, it is necessary to type or paste any input into a textarea. For output, to save the text in the textarea, it is necessary to copy and paste into a permanent file. This Copy menu action will copy the contents of the textarea onto the clipboard to facilitate this process.
- Full Screen: Often it is desirable to see the panel contents in a full screen view. This Full Screen action will pop up the panel contents in a new, full-screen view.
†The HTML5 FIle API does not have sufficiently universal support here to be considered.
Table of Contents
The Generator | The Parser | |
---|---|---|
Overview | Parser | State |
Grammar | Configure | Stats |
Rules | Input | Trace |
Attributes | Phrases |
The Generator↑toc
The first panel is the generator itself. It presents a sub-menu and a textarea.
- sub-menu:
- Generate: Click this link to generate a parser from the grammar in the textarea. If the grammar has errors the Grammar panel will open automatically displaying an annotated version of the grammar with error messages. If there are no errors the Parser panel will open automatically, displaying the generated parser in a text area.
- textarea: The grammar to be used by the generator is typed or pasted into the textarea.
Grammar↑toc
The generator will display an annotated version of the input grammar in this panel. If there are grammar errors, this panel will open automatically. However, the annotated grammar is always available here, with or without errors. The grammar is presented in tabular form with a row for each grammar line.
- line no.: the line number of the grammar line displayed
- first char: the character number of the first character of the line
- length: the number of characters in the line
- text: the ASCII text of the line. The control characters tab
(\t, 0x09)
, linefeed(\n, 0x0A)
and carriage return(\r, 0x0D
) are displayed asTAB
,LF
andCR
, respectively.
If there are errors in the grammar, a second table will follow the first, with two lines for each error. The first is the same as described above except that a red marker will precede the location of the discovered error. The second line will give a brief description of the error type.
APG generates four types of grammar errors.
- invalid characters: The first step is to validate the input grammar’s character set. SABNF grammars accept only the printing ASCII character set,
0x20-7E
plus tab(0x09)
, linefeed(0x0A)
and carriage return(0x0D)
. Any other character will generate a grammar error. Also, each line must end with a line end character†, including the last line. A missing line end will be reported as a part of this initial input validation. (Note that forgetting or missing a last line end is a common and easy-to-make error and it may seem that it would be and easy fix to insert one on the user’s behalf. However, ABNF syntax requires a line end for each line and reporting this as an error rather than inserting one and proceeding is a considered feature, not an oversight.) - syntax errors: The second step is to parse the SABNF grammar against its ABNF definition. If the input is not a valid SABNF grammar, errors will be reported here.
- semantic errors: The third step is to translate the Abstract Syntax Tree (AST) of the previous step into a parser object. Semantic, or translation errors, are reported here. A semantic error might be, for example,
%d57-48
. While this character range is syntactically correct,min > max
is a semantic error. - attribute errors: Recursive-descent parsers cannot accept left-recursive, cyclic or infinite grammars. These are characterized, respectively, in the simple examples below.
S = S "x" / "y"
S = S
S = "y" S
If any of these errors occur, the Attributes panel will automatically open with the errors appearing in the top lines, but the errors are also reported in the Grammar panel.
†Strict ABNF accepts only the carriage return, linefeed pair, CRLF
, as a line end. However, because of the ways different browsers handle textareas, it is sometimes difficult or impossible to insert this line end pair into the grammar text. Therefore, the generator always follows the SABNF rules regarding line ends. That is, it accepts any of CRLF
, LF
or CR
as a line end.
Rules↑toc
This panel displays a table of all the rule and UDT names defined in the grammar. The first column is the rule index—the order in which the rules appear in the grammar. Clicking the index
column label will toggle between ascending and descending ordering on the indexes.
The second column is the rule name. Clicking the rule
column label will toggle between ascending and descending alphabetic ordering on the names.
The third column lists all of the rule names referenced by the named rule. The first entry, for each rule, is a show/hide
link. This link will toggle the display of the rule dependencies on and off. For example, the first rule will usually reference all of the other rules. For large grammars this is a lot of uninteresting information that you may like to hide.
The sub-menu has links to show or hide all of the rule dependencies for all of the rules and UDTs simultaneously.
Note that UDTs, if present, are always listed below the rule names.
Attributes↑toc
It is well-known that recursive-descent parsers cannot parse left-recursive grammars. Left-recursion is a fatal “attribute” of the grammar. APG looks for left-recursion and six other attributes as it analyses a grammar. These attributes are displayed here in table form. The seven attributes types, with simple examples of each are:
- left recursion (fatal)
S = S "x" / "y"
- nested recursion (OK)
S = "a" S "b" / "y"
- right recursion (OK)
S = "x" S / "y"
- cyclic (fatal)
S = S
- finite (fatal if not finite)
S = "y" S
(defines only infinite strings) - empty (OK, but very important to know about)
S = "x" S / ""
- not empty (OK)
S = "x" S / "y"
Note that these are “aggregate” attributes, in that if the attribute is true it only means that it can be true, not that it will always be true for every input string. It also means that more than one attribute may be true for a given rule.
You may wonder why we would be interested in both empty
and not empty
as separate attributes. The importance is not apparent here, and won’t be explained in detail, but they are not mutually exclusive and both attributes turn out to be important to the algorithms that determine the recursiveness of a rule.
- first column, the rule index: the order in which the rules appear in the grammar. Clicking the “index” column label will toggle between ascending and descending ordering on the indexes.
- second column, the rule name: Clicking the
rule
column label will toggle between ascending and descending ordering on the names. - third column, the recursive type: Clicking the
type
column label will toggle between ascending and descending ordering on the types. - the attribute columns: each has
yes/no
entries indicating whether the attribute is true or false. Clicking on the column heading will toggle a column sort on theyes/no
values.
Fatal attributes—left-recursion, cyclic, infinite—will appear in red. A checkbox will also appear at the top. When checked, rules with attribute errors will always remain at the top of the list.
Note that UDTs, if present, are always listed below the rule names.
The Parser↑toc
If the grammar is error-free, the Generator will be successful and this panel will open automatically. The generated parser will be in the textarea and can be copied and pasted into your parser application. Numerous examples are available to show you how to use this parser in both node.js and web page applications. However, the parser can also be tested right here. The sub-menu has panels that let you try the parser against various input strings. You can view the phrases matched by the individual rules, look at the parsing statistics and a trace of the parser’s path through the parse tree. The trace is especially useful as a debugging tool for failed parses. The details are in the following discussions of the sub-menu panels.
Note that if the grammar has UDTs, parsing here will not be possible. In this case the parser is incomplete without the user-written code for the UDTs and there are no facilities here for code writing.
Parser↑toc
This panel is a textarea containing the generated parser as a JavaScript function. You might want to change the variable name before you copy it into your application, but any other changes may invalidate the parser.
To test the parser, use the Input and Configure panels.
Configure↑toc
This panel allows a degree of control over how the input string is interpreted and what tracing information is produced.
Input Mode:
Only the printing ASCII characters, 0x20-7F
plus the control characters, tab(0x09
, linefeed(0x0A)
and carriage return(0x0D)
are allowed in the input <textarea>
. Any other character, in any mode, will be flagged as an input error. For non-ASCII characters, however, an escaped mode, as well as a base 64 option, are available.
- auto (default) - the application will scan the input to determine which mode to use. If all of the characters (ignoring white space and control characters \r and \n) are limited to the base 64 character set, a dialog will allow the user to select the underlying encoding. If any escape sequences are found, `x or `u, it will choose escaped mode. Otherwise, it will select ASCII mode.
- ASCII characters only - the ASCII characters are interpreted literally. The grammar defining the parser in this case must only define ASCII sentences and phrases.
- escaped characters - escaped characters can be used to generate a string of non-ASCII character codes as input to the parser. (See the Input section for a complete description.)
Trace:
- on - Turns tracing on. Opens further tracing configuration options.
- off - (default) No tracing is done. The other tracing options are hidden.
Display Mode:
For each step the parser takes, the trace will display the portion of the string that the parser is trying to match. If any of the characters are non-ASCII, they will be displayed as Unicode values (U+HH[HH[HHHH]]
) However, the trace strings may be displayed in other formats, which may be preferred, depending on the input.
- ASCII - (default) printing characters are displayed normally. Control characters are displayed as,
TAB
,LF
andCF
. All others are displayed in Unicode format,U+HH[HH[HHHH]]
. The non-printing character representations are in small-cap italics to distinguish from possible conflicts with similar text. - decimal - all character codes are display as comma-delimited decimal integers.
- hexidecimal - all character codes are displayed as comma-delimited hex integers of the form
\xHH[HH[HHHH]]
. - Unicode - all character codes are displayed as comma-delimited Unicode integers of the form
U+HH[HH[HHHH]]
. H
- 0-9 or A-F
Records to Display:
As the parser completes its recursive-descent traversal of the parse tree, it will visit each tree node twice, once going down and again on the way up. For large grammars and input strings this can amount to hundreds of thousands of node visits. The visits themselves are not very time consuming. Modern JavaScript engines perform well for large grammars. However, when tracing the parser, a trace record is generated for each node visit and the accumulation and display of these can easily overwhelm the capacity of the browser, slowing it or even grinding it to a virtual halt. The number of records can be limited with selections here.
- max records - The maximum number of records to retain.
- last record - Normally, it is the last group of records that is of interest. If the parser fails, the failure will usually appear near the end of the parse. Use a negative value,
-1
, to indicate that the last to keep is to be the actual last record (unknown prior to parsing.) Use a positive number to indicate a specific value for the last saved record. Note that some boundary conditions apply.
Suppose the total number of records generated is 10,000.
- example 1: Choosing
max records = 5000
andlast record = -1
will display records 5001 - 10000. - example 2: Choosing
max records = 5000
andlast record = 9000
will display records 4001 - 9000. - example 2: Choosing
max records = 10000
andlast record = 5000
will display records 1-10000. The number of records takes precedence if the last record would limit the number of records to less thanmax records
(or the total number of records, if less.)
Operator Nodes:
The node visits to the non-rule name nodes are often not of interest and trace records for these are not kept by default. However, these may be turned on by checking the boxes by the operator names. The all/none
buttons can be used to turn them all on or off with one click. In either state, the check boxes can be selected individually.
- all - check all operator boxes
- none - (default) uncheck all operator boxes
Rule Nodes:
Once a parser has been successfully generated from a grammar, all rule names defined by the grammar will appear here in alphabetical order. Often there are many rule names of little or no interest. For example, if an alphanumeric name is defined, the name is usually of interest, but the individual characters are not. The rule names to trace can be selected with the checkbox by each name.
- all - (default) check all rule name boxes
- none - uncheck all rule name boxes
Note that only the checked rules will appear in the phrases displayed in the Phrases panel.
Input↑toc
Testing the parser is as simple as typing or pasting an input string into the textarea and clicking Parse Input
. The Phrases panel will open up automatically to display the parsed phrases.
While an SABNF grammar can easily define sentences and phrases with arbitrary character codes, handling anything other than the printing ASCII characters in a <textarea>
is tricky to impossible. apg.html offers two ways to handle non-ASCII characters—base 64 encoded and escaped format. (To convert your data to either of these formats, see apg-conv.)
Base 64
Data in any of the Unicode formats or 8-, 16- or 32-bit integer formats can be base 64 encoded. The base 64 data is purely ASCII and can be pasted into the input <textarea>
. When Parse Input
is clicked a dialog will open up to allow a selection of the underlying data format. The data will be base 64 decoded and replaced with an escaped format before parsing.
Escaped Format
In escaped character mode, escaping is identical to escaping in JavaScript strings except that the escape character is the grave accent (`). This choice was made because the backslash (\) causes annoying conflicts with actual JavaScript strings. Other special characters were rejected because they required the shift key. The escaped encoding rules are:
- `` – literal grave accent (`)
- `xhh – 8-bit integer
- `uhhhh – 16-bit integer
- `u{h[hhhhhhh]} – 8- to 32-bit integer, i.e. 1 to 8 hexidecimal characters
- h – hexadecimal digit, 0-9, a-f, or A-F
In escaped character mode, using the grave accent (`) in any other context is an input error.
Parse Input
This action will run the generated parser against the string in the Input panel. If the parse is successful, the Phrases panel will open automatically, allowing you to peruse the matched phrases. If not successful, the State panel will open automatically, displaying the failed state. While the State panel does have some useful information, you will probably need to turn on tracing and study the Trace panel to locate the error. It may be a grammar error; you didn’t accurately describe the intended phrases. Or it may be an error in the input string; it does not belong to the language, or set of sentences, defined by the grammar.
Phrases↑toc
This panel gives you the opportunity to peruse the phrases matched by the grammar’s rules. The sub-sub-menu has a drop-down combo box for selecting a rule name. The names are listed alphabetically, and the number in parenthesis is the number of phrases matched by the rule. Only rule names with one or more matched phrases appear. To further limit the rule names appearing here, see the Configure panel. Selecting a rule name will highlight the first matched phrase for that rule in the input string, scrolling the window if necessary.
Next to the combo box of names are the phrase controls. The have the following functions:
- ⏮ - display the first matched phrase for this rule
- ⏴ - display the previously matched phrase for this rule
phrases
- display all matched phrases for this rule- ⏵ - display the next matched phrase for this rule
- ⏭ - display the last matched phrase for this rule
Note that the empty string often qualifies as a correctly matched phrase. In this case, a green epsilon, (ε), will be inserted to make the match visible.
State↑toc
This panel shows the state of the parse in tabular form. The main thing to note is that a successful parse requires that the entire string be matched. It is quite possible for the parser to find a matched sentence before it reaches the end of the string.
- success - true if the parser ends in a
MATCH
state and the entire string was matched. - state -
MATCH
,EMPTY
(a successful match of 0 characters),NOMATCH
. - string length - the number of characters in the sub-string (the parser has facilities for parsing sub-strings, but no access to that facility is available here.)
- matched length - the number of characters matched
- max matched - the maximum number of characters matched. Approximate if the state is
NOMATCH
. - max tree depth - maximum tree depth reached in the traversal of the parse tree.
- node hits - the number of parse tree nodes processed
- input length - the number of characters in the complete input string. (May be larger than
string length
above in the case of parsing a sub-string.) - sub-string begin - index of the first character of the sub-string to parse. (Not available here.)
- sub-string end - index of the end character character of the sub-string to parse. (Not available here.)
- sub-string length - number of characters in the sub-string (end - begin). (Not available here.)
Stats↑toc
This panel displays the parsing statistics. The first table shows the number of node hits for each of the 18 operator types. The columns show the EMPTY
, MATCH
, NOMATCH
and total hits. The second table shows the same information for the RNM
operators broken down by individual rule names. The rule names are in descending order on the number of hits. Rules with 0 hits are not listed.
Trace↑toc
APG generates recursive-descent parsers. These are parsers that start at the root node of the syntax tree and proceed down the left-most branch of the tree until a terminal node is reached. The terminal node will attempt to match a character or phrase in the remaining input string. If the match is successful, the matched characters are removed from the remaining string and the parser proceeds back up the branch looking for further instructions (concatenation, repetition, etc.) from the nodes it finds along the way. If the match does not succeed, the parser will proceed up to the nearest ALT
(Alternate) node and try the next alternate branch, if any. This continues until the parser has returned to the root node. If the entire string has been consumed, the parse is a success, otherwise, not.
For APG a terminal node is one of the node operators, TLS
, TBS
or TRG
. (User-Defined Terminals will not be discussed here.) Only terminal nodes actually match and consume characters or phrases in the input string.
TLS
—Terminal Literal String. This node attempts to make a case-insensitive match of a grammar-defined phrase to the remaining characters in the input string.TBS
—Terminal Binary String. This node attempts to make a case-sensitive match of a grammar-defined phrase to the remaining characters in the input string.TRG
—Terminal Range. This node attempts to match a character within a grammar-defined range to the leading character of the remaining the input string.
Tracing is the process of generating a record of each node visit and this panel displays a text version of those records. The legend at the bottom of the Trace panel will explain the details of the data columns. See the Configure panel for instructions on how to control the selection of records displayed.
The important thing to note is that as you read down from the first record, you see the path the parser has taken through the parse tree and which characters and phrases are matched or missed along the way.
While this can be illuminating in its own right, the chief purpose of the trace is to find where and what has gone wrong with a failed parse. It may be that the grammar does not correctly describe the language specification. Or it might be that the input string fails to match the language specification—or both, of course. It’s not automatic and a bit of an art. It may take some practice. But an examination of the trace usually pinpoints the problem quickly.
One tip though, if you don’t see the error right away with the default trace options, one of the first things to try is to check the TLS
, TBS
and TRG
operators. Since these are the ones that actually consume input string characters they might be the first to allow you to see when and where a phrase is incorrectly missed or consumed.
Written with StackEdit.