Zbnf Parser

1. Article structure

linked from (parent)	similar files in same level	important links in this
Overview Java Text XML conversion →	→ This page →	→ ZBNF syntax description

I first used microprocessors in 1978, in my master's thesis. It was programmed in assembly language for the Z80 processor. Microprocessor technology was very new. At first I wrote some programs to show and change RAM content and the like.

Later, from 1982 till 1985, the Basic language became familiar, and I wrote a system in Basic which could also be edited on the fly and which stored a meta machine code. It may be seen as similar to byte code as in Java.

At that time I wanted to write a complete and complex parser using the idea of graphical syntax diagrams (https://en.wikipedia.org/wiki/Syntax_diagram). But unfortunately I had only assembly language.

In the 1990s, using the C/++ language, I tried to write this parser. But on the one hand I did not have the necessary time, and on the other hand the C/++ language, my development environment and my knowledge were not powerful enough to do so.

Only when I used Java was the idea of a parser that works on the principles of syntax graphs successful. I wrote the first version of the parser in only one weekend. Java with its container capabilities, and the previously written class org.vishia.util.StringPartScan, which had its precursor in some C++ classes, made this work possible.

The result was org.vishia.zbnf.ZbnfParser. This was in 2007, about 20 years after the first ideas.

3. Approach

ZBNF is first of all a syntax description which can also be used outside of the parser. It is similar to, or can be seen as an enhancement of, EBNF (https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) by Prof. Niklaus Wirth. The EBNF idea is based on BNF, the "Backus-Naur Form" (https://en.wikipedia.org/wiki/Backus–Naur_form), which was used to describe the syntax of programming languages in the 1950s and 1960s, especially Algol.

ZBNF includes the semantic aspect in the formal syntax definition. The 'Z' is a reversed 'S' for "Semantic". With that it is possible to store the parsed data immediately in Java instances without additional effort. That is done by the org.vishia.zbnf.ZbnfParser.

The org.vishia.zbnf.ZbnfParser works with the textual syntax script. It reads the script and stores it as Java data for faster execution.

The ZBNF syntax script can also be used to generate suitable data storage classes, see "Example storage class for data, how to get".

4. Working principle

The parser itself tests the source text in a way similar to following a syntax diagram (see https://en.wikipedia.org/wiki/Syntax_diagram or https://en.wikipedia.org/wiki/Wirth_syntax_notation; both articles are similar).

It tests all possibilities, goes back to the previous fork point if nothing matches, and in this way finds the correct path. This is not the fastest way to parse, but it works properly for all parsing requirements. Parsing speed is not the primary concern in the era of fast processors. It was an important question in the 1980s and 1990s, when processor power was lower.
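The return to the fork point can be illustrated with a small Java sketch. It is not the code of the ZbnfParser; the class BacktrackSketch and its methods are hypothetical names chosen for this illustration, and each alternative is simplified to a plain sequence of constant texts. Every alternative is tested from the same start position, so a failed attempt leaves nothing behind.

```java
import java.util.List;

/** Minimal backtracking sketch, not the ZbnfParser implementation. */
final class BacktrackSketch {

  /** Tries to match all terminals of one alternative, starting at pos.
   * @return the position after the match, or -1 if the alternative fails. */
  static int matchSequence(String input, int pos, List<String> terminals) {
    int cur = pos;
    for (String terminal : terminals) {
      if (!input.startsWith(terminal, cur)) {
        return -1;                    // mismatch: the caller goes back to the fork point
      }
      cur += terminal.length();
    }
    return cur;
  }

  /** Fork point: tests the alternatives in order, each one starting again at pos.
   * @return index of the first alternative that consumes the whole input, or -1. */
  static int matchAlternatives(String input, int pos, List<List<String>> alternatives) {
    for (int i = 0; i < alternatives.size(); i++) {
      int end = matchSequence(input, pos, alternatives.get(i));
      if (end == input.length()) {
        return i;
      }
      // otherwise nothing is kept; the next alternative starts again at pos
    }
    return -1;
  }

  public static void main(String[] args) {
    List<List<String>> alternatives = List.of(
        List.of("int", " ", "a", ";"),
        List.of("int", " ", "b", ";"));
    System.out.println(matchAlternatives("int b;", 0, alternatives));
  }
}
```

In this sketch the first alternative fails only at its third terminal, after "int" and the space have already matched. The second alternative is then tested again from position 0, which is the repeated work that makes this approach slower than a parser without backtracking.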

5. Usages

The ZbnfParser was first used to parse Java source files for translation to C code, see ../../../Java2C/index.html. Secondly, I used it to parse C/++ header files to generate reflection data for C/++ programming, see ../../../Inspc/html/Reflect.html.

Later I used this parser for all texts that I had to parse. In particular, JZtxtcmd (RWtrans.html#JZtxtcmd) also works with this parser.

The parser was written independently of other known parser principles. It works with a syntax given as text and does not need deeply nested package dependencies. Only some Java files are necessary, all contained in the 1.4 MByte vishiaBase.jar package. Nevertheless the parser is fast enough for all practical purposes. The textual syntax is first translated into prepared internal data, which is then used for parsing.

The parser is also used in some professional projects.

6. Example for a ZBNF syntax file, how to get

You can write the syntax for any syntactically evaluable text simply by hand, by looking at the text and thinking about its syntax. But sometimes some experience or the study of given patterns is useful.

6.1. Example Bill of material

A bill of material is used as an example. It may be given in any special format, in this case as a simple list, perhaps produced by an older tool.

Bill of material
================

order-number: 134.23.14
date:febr-16-2008

amount  code     description     value
---------------------------------------------
  21    1234567  Resistor        3.9 kOhm/5%
  12    1234537  Resistor        1.8 kOhm/2%
   5    1234237  Resistor        2.7 kOhm/2%
   7    1234557  Resistor         10 kOhm/5%
   1    1234127  Resistor        120 Ohm/1%
   2    1234897  Resistor        630 Ohm/1%
  34    1235771  capacitor        47 nF
  12    1235781  capacitor       100 nF
   5    1235791  capacitor       4.7 uF
---------------------------------------------

The file format is not well suited for parsing. Uniform delimiters between the columns are missing. It seems that spaces are the delimiters, but in the last column spaces are part of the text. The description is limited to 16 characters; that is the specific rule assumed for this example. It is a printout format. Nevertheless it can be parsed:

$setLinemode.

BillOfMaterial::=
  <*|order-number?>          ##skip over all characters until string "order-number"

  order-number : <#?order/@part1>\.<#?order/@part2>\.<#?order/@part3> \n  ##its a three-part number. NOTE: . must be written \.
  date : <date> \n           ##date have its own syntax.

The first line of a syntax file, which is not included in the listing above, is only used to declare the encoding of the syntax file. This information can be omitted, for example if the syntax is given as a Java String. It is more of a formality.

The $setLinemode. is an option. It defines that newline characters (0x0d and also 0x0a, in any combination) are not treated as white space, so that the line structure is respected.

The BillOfMaterial::= … is the top-level parsing component, which represents the whole input file.

As also explained in the comment, the evaluated part of the user's text starts with the text order-number. Everything before it is skipped. With the <*|TEXT?> item the parser gets the information "skip all till this text" or "search this text". This is more an instruction than a syntax description, but exactly this possibility can be seen as one of the advantages. From the point of view of syntax description it means "get all content till TEXT". Hence this content can be stored under a semantic name written after the question mark, which is left empty here.

The text order-number itself is not consumed by the preceding syntax item. It is a constant text, called a terminal in BNF and EBNF. Unlike there, terminals are not written in quotation marks here (not "order-number"). That is an important difference and an added value: without quotation marks the syntax is easier to read.

The following syntax items read numbers (<#…) and apply a specific syntax for <date>, whose definition follows later. <date> is a syntax component, in BNF sometimes called a metamorpheme or non-terminal.
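What the header lines of this syntax express can be written by hand in plain Java, as a rough sketch. It does not use the ZbnfParser; the class OrderNumberSketch and its methods are hypothetical, and only spaces are treated as skippable white space because of the line mode.

```java
/** Hand-written equivalent of the order-number rule; not ZbnfParser code. */
final class OrderNumberSketch {

  /** Skips spaces only, since newlines are significant in line mode. */
  static int skipSpaces(String s, int pos) {
    while (pos < s.length() && s.charAt(pos) == ' ') {
      pos++;
    }
    return pos;
  }

  /** @return the three parts of the order number, or null if the text does not match. */
  static int[] parseOrderNumber(String text) {
    String keyword = "order-number";
    int pos = text.indexOf(keyword);        // like <*|order-number?>: skip all up to this text
    if (pos < 0) {
      return null;
    }
    pos = skipSpaces(text, pos + keyword.length());   // the terminal order-number itself
    if (pos >= text.length() || text.charAt(pos) != ':') {
      return null;
    }
    pos = skipSpaces(text, pos + 1);
    int[] parts = new int[3];
    for (int i = 0; i < 3; i++) {           // like <#?...>\.<#?...>\.<#?...>
      int end = pos;
      while (end < text.length() && Character.isDigit(text.charAt(end))) {
        end++;
      }
      if (end == pos) {
        return null;                        // a number was expected here
      }
      parts[i] = Integer.parseInt(text.substring(pos, end));
      pos = end;
      if (i < 2) {
        if (pos >= text.length() || text.charAt(pos) != '.') {
          return null;                      // the separating dot is missing
        }
        pos++;
      }
    }
    return parts;
  }

  public static void main(String[] args) {
    String text = "Bill of material\n================\n\norder-number: 134.23.14\n";
    int[] parts = parseOrderNumber(text);
    System.out.println(parts[0] + " " + parts[1] + " " + parts[2]);
  }
}
```

The comparison shows the point of the syntax script: about thirty lines of position handling correspond to two lines of ZBNF.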

  <*|---?><*\n?>\n           ##skip over all until ---- and than until newline,than accept newline.
  { <position> \n }
  <*|---?><*\n?>             ##skip over all until ---- and than until newline.
  [{ \n}] \e                 ##skip over all newline, than end of file is expected.
.

The continuation (above) starts by searching for the ------- line, where at least three - are checked. Then the rest up to a newline is skipped, which means that the whole line is skipped. \n designates a line delimiter. This can also be \r\n or a single \r, depending on the conventions of the operating system. It is not distinguished here.

Then the positions are parsed, one per line, see below.

At the end a --------- line is expected as well. After it, all following empty lines are skipped. The \e means "end of text" and checks whether the input is really finished. Otherwise errors in the following text would not be detected, because it would not be checked.

The end of a syntax component, here the main or root component, is marked with a dot. This is the same as in older EBNF formats. Newer ones also use the ; but ZBNF uses only the . character.

##NOTE: Notes to the syntax of input text:
##The fields amount and code are read as numbers, whitespaces before and behind are skipped.
##But the description is not terminated by chars, but it is a maximum of chars.
##The description is stored with white spaces on end.
##The value is also a block without any terminating chars, except a line end with possible carriage return.
##
position::= <#?amount> <#?code> <16*?description>  <*\r\n?value> .

date::= <*\r\n?date>.

The lines above define the remaining syntax components.

First you can see some comments that may be helpful. Then the definition of <position> follows.

The #? means "parse a number". After the question mark the semantic, the meaning of the number, is written. Here the first number is designated as amount, a name which can be referred to in any explanatory text. But it is also used as the name of the variable in which the number is stored in the current data element.

The description may consist of any characters (<*?…) but has exactly 16 characters. This helps to parse print formats which have no other delimiters. The description must not be empty; the 16 characters are counted from the first character after the white space. Note that a white space in the syntax forces skipping over white spaces in the parsed input, as long as <$NoWhiteSpaces> is not given.

The rest till one of the newline characters is stored as value, but without leading spaces.

The syntax component for <date> is very simple; it takes everything up to the newline. It is only an example.
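The fields which the position rule extracts from one line can be retraced with a hand-written Java sketch. It does not use the ZbnfParser; the class PositionSketch is a hypothetical name, and the sketch assumes that the description takes 16 characters, or fewer only if the line ends earlier. A line without the two leading numbers causes an exception here, where the parser would report a syntax error.

```java
/** Hand-written illustration of the fields of one position line; not ZbnfParser code. */
final class PositionSketch {
  final int amount;
  final int code;
  final String description;
  final String value;

  PositionSketch(int amount, int code, String description, String value) {
    this.amount = amount;
    this.code = code;
    this.description = description;
    this.value = value;
  }

  static int skipSpaces(String s, int pos) {
    while (pos < s.length() && s.charAt(pos) == ' ') {
      pos++;
    }
    return pos;
  }

  static int endOfDigits(String s, int pos) {
    while (pos < s.length() && Character.isDigit(s.charAt(pos))) {
      pos++;
    }
    return pos;
  }

  static PositionSketch parseLine(String line) {
    int pos = skipSpaces(line, 0);
    int end = endOfDigits(line, pos);
    int amount = Integer.parseInt(line.substring(pos, end));   // like <#?amount>
    pos = skipSpaces(line, end);
    end = endOfDigits(line, pos);
    int code = Integer.parseInt(line.substring(pos, end));     // like <#?code>
    pos = skipSpaces(line, end);
    end = Math.min(pos + 16, line.length());
    String description = line.substring(pos, end);             // like <16*?description>
    pos = skipSpaces(line, end);
    end = pos;
    while (end < line.length() && line.charAt(end) != '\r' && line.charAt(end) != '\n') {
      end++;
    }
    String value = line.substring(pos, end);                   // like <*\r\n?value>
    return new PositionSketch(amount, code, description, value);
  }

  public static void main(String[] args) {
    String line = "  21    1234567  " + String.format("%-16s", "Resistor") + "3.9 kOhm/5%";
    PositionSketch p = parseLine(line);
    System.out.println(p.amount + "|" + p.code + "|" + p.description + "|" + p.value);
  }
}
```

As in the syntax comments, the description keeps its trailing spaces, while the value starts at its first non-space character.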

6.2. Example variable declaration in C/++ or Java

In C and C++, as in Java, variable declarations can be written in the form

int a,b,c;

These are three variables, hence three parse results, and all variables are of type int. The type is of course semantically a part of each variable. The declaration line itself is not of interest, only each individual variable definition is. The variable definition is semantically exactly the same as:

int a;
int b;
int c;

The application should not have to distinguish between the two forms.
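The intended result can be sketched in plain Java, independently of any ZBNF syntax. The class VarDeclSketch and its nested class Variable are hypothetical names, and the sketch handles only the simple case of one type word followed by a comma-separated list of names, without initializers, pointers or arrays. Both writing forms produce the same list of variables, each one carrying its own type.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the desired parse result for variable declarations; not ZbnfParser code. */
final class VarDeclSketch {

  /** One variable together with its type. */
  static final class Variable {
    final String type;
    final String name;

    Variable(String type, String name) {
      this.type = type;
      this.name = name;
    }

    @Override public String toString() {
      return type + " " + name + ";";
    }
  }

  /** Splits declarations such as "int a,b,c;" into one Variable per name. */
  static List<Variable> parse(String source) {
    List<Variable> result = new ArrayList<>();
    for (String statement : source.split(";")) {
      String s = statement.trim();
      if (s.isEmpty()) {
        continue;
      }
      String[] typeAndNames = s.split("\\s+", 2);   // first word is the type
      if (typeAndNames.length < 2) {
        throw new IllegalArgumentException("no variable name in: " + s);
      }
      for (String name : typeAndNames[1].split(",")) {
        result.add(new Variable(typeAndNames[0], name.trim()));   // the type is copied to each variable
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("int a,b,c;"));
    System.out.println(parse("int a;\nint b;\nint c;"));
  }
}
```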

The syntax definition as part of …TODO

7. Example storage class for data, how to get

//in file: src/test/java/org/vishia/zbnf/test/Test_Bom_Zbnf.java
  /**Generates the sources for destination classes for the given billOfMaterial.zbnf
   */
  static void genDstClassForBom() {
    String[] args_genJavaOutClass = 
      { "-s:src/test/files/zbnfParser/billOfMaterial.zbnf"
      , "-dirJava:$(TMP)/exmpl_ZbnfParser_Bom"
      , "-pkg:org.vishia.zbnf.test.gen"
      , "-class:Bom_Data"
      };
    GenZbnfJavaData.smain(args_genJavaOutClass);
  }

This operation shows how the generation of destination classes can be invoked from Java. The same can be done from the command line with Java; the arguments of main(String[] args) are identical.

The functionality reads the syntax file using

parser.setSyntax(syntaxFile);

After that, the parser contains a data structure

ZbnfSyntaxPrescript mainScript = parser.mainScript();

as a tree of org.vishia.zbnf.ZbnfSyntaxPrescript.

This tree contains the syntax description with the semantic aspects. Any node in this tree can be used to create a data class:

evaluateSyntax(mainScript);

8. Example for parsing with ZBNF syntax and storage class

The example is contained in