Zbnf Syntax description and parser

1. Article structure

linked from (parent)	similar files in same level	important links in this
Overview Java Text XML conversion →	→ This page →	→ ZBNF syntax description

date=2025-12-06 new Shorten writing style of syntax (ZBNF2)

I first used microprocessors in 1978, in my master thesis. It was programmed in assembly on the Z80 processor. Microprocessor technology was very new at that time. First I wrote some programs to show and change RAM content and the like, and I also wrote an assembler which works directly on the machine code stored in RAM and also allows editing that machine code.

Later, from 1982 to 1985, the Basic language became familiar, and I wrote a system in Basic which could also be edited on the fly and which stored a meta machine code. It can be seen as similar to byte code as in Java.

At that time I wanted to write a complete and complex parser using the idea of graphical syntax diagrams (https://en.wikipedia.org/wiki/Syntax_diagram). But unfortunately I had only assembly language.

In the 1990s I tried to write this parser in the C/++ language, but on the one hand I did not have the necessary time, and on the other hand the C/++ language, my development environment and my knowledge were not powerful enough to do so.

Only when I used Java was the idea of a parser which works on the principles of syntax graphs successful. I wrote the first version of the parser in a single weekend. Java with its container capabilities, and the previously written class org.vishia.util.StringPartScan, which had its precursor in some C++ classes, made this work possible.

The result was org.vishia.zbnf.ZbnfParser. This was in the year 2007, about 20 years after the first ideas.

3. Approach

ZBNF is first of all a syntax description which can also be used outside of the parser. It is similar to, or can be seen as an enhancement of, EBNF (https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) from Prof. Niklaus Wirth. The EBNF idea is based on BNF, the "Backus-Naur form" (https://en.wikipedia.org/wiki/Backus–Naur_form), which was used to describe the syntax of programming languages in the 1950s and 1960s, especially for Algol.

ZBNF includes the semantic aspect in the formal syntax definition. The 'Z' is a reversed 'S' for "Semantic". With that it is possible to store the parsed data immediately in Java instances without additional programming effort. That is done by the org.vishia.zbnf.ZbnfParser.

The org.vishia.zbnf.ZbnfParser works with the textual syntax script. It reads the input, interprets it with the ZBNF syntax script, and stores the data read from the input as Java data for further evaluation.

The ZBNF syntax script can also be used to generate suitable data storage classes, see "Generate the storage class for data, from the ZBNF syntax script".

4. Shorten writing style of syntax (ZBNF2)

date=2025-12-06

The syntax definition as established in 2007 is described in ZBNF syntax description.

The basic idea at that time was that the control characters of the syntax itself should be short and simple. They are [ ] { } < > ? for option, repetition and a syntax morphem, some of them written with ?.
The constant texts should also be written directly and simply, not as 'text' as in EBNF.
If a text character clashes with a syntax control character, the backslash is used to escape it. This means \[ is a [ in the text, etc.
Meta morphems are written in <…> with their specific syntax.
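
The backslash rule can be illustrated with a small helper which prepares a literal text for use in a classic ZBNF script. This is only a sketch: the class and method names are hypothetical, and the set of control characters is an assumption collected from the characters mentioned in this article ([ ] { } < > ? | and the dot), not taken from the parser sources.

```java
final class ZbnfLiteralEscapeSketch {
  /** Characters treated as syntax control characters in this sketch (an assumption). */
  private static final String CONTROL_CHARS = "[]{}<>?|.";

  /** Returns the text with a backslash placed before every control character. */
  static String escapeLiteral(String text) {
    StringBuilder out = new StringBuilder(text.length() + 8);
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (CONTROL_CHARS.indexOf(c) >= 0) {
        out.append('\\');   // escape the clashing character
      }
      out.append(c);
    }
    return out.toString();
  }
}
```

A text without control characters, such as ident, is returned unchanged by this helper.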

With this schema an optional repetition of an identifier is written as:

[ ident = { <$?identItem> ? , } ; ]

This is readable. A matching text is:

ident = a,b,c;
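
To show what this syntax line demands from the input, the following sketch matches the same text by hand. It is not the ZbnfParser and does not use its API; the class IdentListSketch and its methods are hypothetical names for illustration. The option [ … ] is represented by returning an empty list, the repetition { … ? , } by a loop which continues as long as the separator is found.

```java
import java.util.ArrayList;
import java.util.List;

final class IdentListSketch {
  private final String src;
  private int pos;

  private IdentListSketch(String src) { this.src = src; }

  /** Returns the identifiers, or an empty list if the optional part does not match. */
  static List<String> parse(String input) {
    IdentListSketch p = new IdentListSketch(input);
    List<String> items = new ArrayList<>();
    if (!p.literal("ident") || !p.literal("=")) {
      return items;                       // the [ ... ] option is absent
    }
    do {                                  // { <$?identItem> ? , }
      String id = p.identifier();
      if (id == null) { return new ArrayList<>(); }
      items.add(id);
    } while (p.literal(","));
    if (!p.literal(";")) { return new ArrayList<>(); }
    return items;
  }

  /** A space in the syntax means: skip white space in the input. */
  private void skipWhiteSpace() {
    while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) { pos++; }
  }

  /** Matches a constant text (terminal). */
  private boolean literal(String text) {
    skipWhiteSpace();
    if (src.startsWith(text, pos)) { pos += text.length(); return true; }
    return false;
  }

  /** Matches an identifier, corresponding to <$?...>. */
  private String identifier() {
    skipWhiteSpace();
    int start = pos;
    if (pos < src.length() && Character.isJavaIdentifierStart(src.charAt(pos))) {
      pos++;
      while (pos < src.length() && Character.isJavaIdentifierPart(src.charAt(pos))) { pos++; }
    }
    return pos > start ? src.substring(start, pos) : null;
  }
}
```

The single syntax line replaces all of this hand-written code, which is the main motivation for a parser controlled by a syntax script.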

But if the syntax contains text literals which conflict with the syntax control characters, then it is not so easy to read:

[ \[ ident = { \<<$?identItem>\> ? \| } \]]

This should be the syntax for the example string:

[ ident= <a>|<b>|<c> ]

Is it easy to read in a syntax documentation? No, it is bad. Can the parser interpret it unambiguously? Yes, it can.

That is why another writing style was used for documentation in the past. One is coloured syntax: the syntax control characters are shown with a different background colour. This is possible in a textual documentation with styles (for example in LibreOffice or in Word). Escaping the literal text is then not necessary, and it is easier to read.

For plain text, another approach has been used since 2025-12:

The syntax control characters are no longer single characters; a specific string is used for each of them. Hence they are clearly recognizable as control strings.
This means that normal characters do not need escaping, except in the really special case that the literal text equals a syntax control string. For that a specific escape is necessary, but it is rarely used.

The writing style of syntax for the example above is then:

[: [ ident = {: <$$identItem> :?: | :} ] :]

Compare it with

[ \[ident = { \<<$?identItem>\> ? \| } \]]

The new form should be easier to read.

The principle is: all syntax control strings are longer. Then they do not clash with similar characters in the text, except in really special cases. The writing rules in detail:

Simple placeholder: semantic can always be written as identifier.ident, with one or more dots in between, but only in the form of a Java (C/++) identifier with dots. An … in the following explanation also stands for such a dot-separated identifier.
- $$semantic instead of <$?semantic> for a simple identifier. Here the visually disturbing < > are omitted. The $$ is the control string which marks it as not being a constant literal. The semantic needs to be an identifier.
- $$$-:$semantic instead of <$-$:?semantic> also accepts special characters in the identifier, here the $ itself as well as : and -. The rule is: after the $$, the following non-identifier characters are these additional characters. A $ immediately after the $$ is also accepted as an additional character. But the last of these additional characters must be a $. The following semantic always starts with an identifier character, so this is unambiguous.
- ##semantic instead of <#?semantic> for a number.
- ##f#semantic instead of <#f?semantic> for the number variants.
Instead of <syntax?semantic> with the specific syntax:
- <=:: syntax?semantic ::=> instead of <syntax?semantic>. The semantic is, as before, the identifier under which the data are stored.
- All other <…?semantic> variants are built with similar rules (yet TODO: how, and describe).
Syntax control characters:
- All control strings have a space before and after. This space is not a constant string literal and does not allow white space in the input at this position. Write a second space to enforce acceptance of white space. In addition there is a colon on the inner side. The basic syntax control character (sequence) stays the same.
- ALTERNATIVE1 :|: ALTERNATIVE2 instead of ALTERNATIVE1|ALTERNATIVE2
- [: OPTION :] instead of [OPTION]
- [?…: OPTION :] instead of [<?…>OPTION], an option with the condition notated.
- [: OPTION :|: OPTION2 :] instead of [OPTION|OPTION2], the choice option.
- [: OPTION :|: OPTION2 :|] instead of [OPTION|OPTION2|], the choice option which can be omitted.
- [|: OPTION :|: OPTION2 :] instead of [|OPTION|OPTION2], the choice option which can be omitted, where the syntax after it is tested first.
- [?: OPTION should not match :] instead of [?OPTION should not match]
- [!: OPTION should match :] instead of [!OPTION should match], but it is not used to store.
- [>: IT IS EXPECTED :] instead of [>IT IS EXPECTED], the 'expected' test.
- {: REPETITION :} instead of {REPETITION}
- {: REPETITION :?: REPEAT :} instead of {REPETITION?REPEAT}
- {?name=…: REPETITION $$name.part :} instead of {<?…> REPETITION <$?part>}, a repetition which should create or use a specified data access.
- [{: OPTIONAL REPETITION :}] instead of [{OPTIONAL REPETITION}]
- [{: OPTIONAL REPETITION :?: REPEAT :}] instead of [{OPTIONAL REPETITION?REPEAT}]
- =::. instead of ., the end of the syntax item definition.
Definition of how to store:
- As shown throughout above, where {: … is written, {:?…: … can be written, where the semantic of the syntax expression is written between ? and :, or in other words the variable in which the content is stored, if the variable represents the syntax.
- [{?itemName=containerName: $$itemName.element :}] instead of [{<?containerName><$?element>}]. The ZBNF syntax does not need the itemName because it is implicit: if containerName is the name of a container, a LinkedList or Map in Java, then an instance of the element class of the container is created to store the data, or, seen from the other direction, the element of the container holds the data. But it is more obvious when the item has a name, as used here. The itemName can also be used for a gTxt script: gTxtOutputter, OutTextPreparer.

5. Working principle

The parser itself tests the source text in the same way as a syntax diagram is followed (see https://en.wikipedia.org/wiki/Syntax_diagram or https://en.wikipedia.org/wiki/Wirth_syntax_notation; both articles are similar).

It tests all possibilities, goes back to the previous fork point if nothing matches, and in this way finds the correct path. This is not the fastest way to parse, but it works properly for all parsing requirements. Parsing speed is not the primary question in the age of fast processors. It was an important question in the 1980s and 1990s, when processor power was lower.
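
The principle of going back to the fork point can be shown with a very small matcher. This is a sketch of the general backtracking technique only, not the implementation of the ZbnfParser; the class BacktrackSketch and all names in it are hypothetical. Each node gets the rest of the syntax as a continuation, so an alternative is only accepted if everything after it matches too.

```java
final class BacktrackSketch {
  /** What remains to be matched after a node has succeeded. */
  interface Continuation { boolean proceed(int pos); }

  /** A node of the syntax graph. */
  interface Node { boolean match(String in, int pos, Continuation next); }

  /** Constant text (terminal). */
  static Node lit(String text) {
    return (in, pos, next) -> in.startsWith(text, pos) && next.proceed(pos + text.length());
  }

  /** Sequence of nodes, matched one after the other. */
  static Node seq(Node... parts) {
    return (in, pos, next) -> matchFrom(parts, 0, in, pos, next);
  }

  private static boolean matchFrom(Node[] parts, int ix, String in, int pos, Continuation next) {
    if (ix == parts.length) { return next.proceed(pos); }
    return parts[ix].match(in, pos, p -> matchFrom(parts, ix + 1, in, p, next));
  }

  /** Fork point: if a choice or anything after it fails, the next choice is tried from the same position. */
  static Node alt(Node... choices) {
    return (in, pos, next) -> {
      for (Node choice : choices) {
        if (choice.match(in, pos, next)) { return true; }
      }
      return false;   // nothing matches, the caller goes back to its own fork point
    };
  }

  /** True if the whole input matches the syntax. */
  static boolean matchesWhole(Node root, String in) {
    return root.match(in, 0, pos -> pos == in.length());
  }
}
```

With the syntax seq(alt(lit("a"), lit("ab")), lit("c")) and the input abc, the first alternative a matches, but the following c does not. The matcher returns to the fork point, tries ab, and then succeeds.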

6. Usages

The ZbnfParser was first used for parsing Java language files to translate them to C code: ../../../Java2C/index.html. Secondly I used it for parsing C/++ header files to generate reflection for C/++ programming. See ../../../Inspc/html/Reflect.html.

Later, all texts which I have parsed used this parser. In particular, JZtxtcmd (RWtrans.html#JZtxtcmd) also works with this parser.

The parser was written independently of other known parser principles. It works with a syntax given as text and does not need deep package dependencies. Only some Java files, all contained in the 1.4 MByte vishiaBase.jar package, are necessary. Nevertheless the parser is fast enough for all practical purposes. The textual syntax is first translated into internal prepared data, and then it is used.

The parser is used also in some professional projects.

7. Example for a ZBNF syntax file, how to get

You can write the syntax for any syntactically evaluable text simply by hand, by looking at the text and thinking about its syntax. But sometimes some experience or a study of given patterns is useful.

7.1. Example Bill of material

As an example, a bill of material is used. It may be given in any special format, in this case as a simple list, maybe from older tools.

Bill of material
================

order-number: 134.23.14
date:febr-16-2008

amount  code     description     value
---------------------------------------------
  21    1234567  Resistor        3.9 kOhm/5%
  12    1234537  Resistor        1.8 kOhm/2%
   5    1234237  Resistor        2.7 kOhm/2%
   7    1234557  Resistor         10 kOhm/5%
   1    1234127  Resistor        120 Ohm/1%
   2    1234897  Resistor        630 Ohm/1%
  34    1235771  capacitor        47 nF
  12    1235781  capacitor       100 nF
   5    1235791  capacitor       4.7 uF
---------------------------------------------

The file format is not well suited for parsing. Uniform delimiters between the columns are missing. It seems that spaces are the delimiters, but in the last column spaces are part of the text. The description should be exactly 16 characters; that is the specific rule (for this example). It is a printout format. Nevertheless it can be parsed:

$setLinemode.

BillOfMaterial::=
  <*|order-number?>          ##skip over all characters until string "order-number"

  order-number : <#?order/@part1>\.<#?order/@part2>\.<#?order/@part3> \n  ##it is a three-part number. NOTE: . must be written \.
  date : <date> \n           ##date has its own syntax.

The first line is only used to state the encoding of the syntax file. This information can be omitted, for example if the syntax is given as a Java String. It is more of a formality.

The $setLinemode. is an option. It defines that newline characters (0x0d and also 0x0a, in any combination) are not treated as white space. The line structure should be respected.

The BillOfMaterial::= … is the top-level parsing component, which represents the whole input file.

As also explained in the comment, the user's text should start with the text order-number. Everything before it is skipped. With the <*|TEXT?> item the parser gets the information "skip everything up to this text" or "search for this text". This is more a statement than a syntax description, but exactly this possibility can be seen as one of the advantages. From the point of view of syntax description it means "get all content up to TEXT". Hence this content can be stored under the semantic after the question mark, which is left empty here.

The text order-number itself is not consumed by the syntax item before it. It is a constant text, called a terminal in BNF and EBNF. But unlike there, the terminals are not written in quotation marks (not "order-number" here). That is an important difference and benefit: without quotation marks the syntax is easier to read.

The next syntax forces the reading of numbers (<#…) and uses a specific syntax for <date>, whose definition follows. This is a syntax component, sometimes called a meta morphem or non-terminal in BNF.

  <*|---?><*\n?>\n           ##skip over all until ---- and then until newline, then accept newline.
  { <position> \n }
  <*|---?><*\n?>             ##skip over all until ---- and then until newline.
  [{ \n}] \e                 ##skip over all newlines, then end of file is expected.
.

The continuation (above) starts with searching for the ------- line, where at least three - are checked. Then the rest up to a newline is skipped. This means the whole line is skipped. \n designates a line delimiter. This can also be \r\n or a single \r, depending on the conventions of the operating system. It is not distinguished here.

Then the positions are parsed, one per line, see below.

At the end, a --------- line is expected as well. After it, all following newlines are skipped. The \e is "end of text" and checks whether the input is really finished. Otherwise errors in the following text would not be detected, because it would not be checked.
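
What the frame of this syntax does with the input can be written by hand as follows. This is a sketch for illustration, not the way the ZbnfParser works internally; the class BomFrameSketch is a hypothetical name. It assumes \n as the only line delimiter, and it ends the repetition when a line starts with ---, whereas the parser ends it when <position> no longer matches.

```java
import java.util.ArrayList;
import java.util.List;

final class BomFrameSketch {
  /** Returns the position lines between the two dashed lines. */
  static List<String> positionLines(String text) {
    int pos = text.indexOf("---");                  // <*|---?>  search the dashed line
    if (pos < 0) { throw new IllegalArgumentException("no --- line found"); }
    pos = text.indexOf('\n', pos);                  // <*\n?>    skip the rest of that line
    if (pos < 0) { throw new IllegalArgumentException("no line after ---"); }
    pos++;                                          // \n        accept the newline

    List<String> lines = new ArrayList<>();
    while (!text.startsWith("---", pos)) {          // { <position> \n }
      int end = text.indexOf('\n', pos);
      if (end < 0) { throw new IllegalArgumentException("closing --- line missing"); }
      lines.add(text.substring(pos, end));
      pos = end + 1;
    }

    int end = text.indexOf('\n', pos);              // skip the closing dashed line
    String rest = end < 0 ? "" : text.substring(end);
    if (!rest.trim().isEmpty()) {                   // [{ \n}] \e  only newlines, then end of text
      throw new IllegalArgumentException("unexpected text after the closing line");
    }
    return lines;
  }
}
```

The last check corresponds to \e: text after the closing line is reported instead of being silently ignored.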

The end of a syntax component, here the main or root component, is marked with a dot. This is the same as in older EBNF formats; newer ones also use the semicolon, but ZBNF only uses the dot.

##NOTE: Notes to the syntax of input text:
##The fields amount and code are read as numbers, white spaces before and behind are skipped.
##But the description is not terminated by chars, but it is a maximum of chars.
##The description is stored with white spaces at the end.
##The value is also a block without any terminating chars, except a line end with possible carriage return.
##
position::= <#?amount> <#?code> <16*?description>  <*\r\n?value> .

date::= <*\r\n?date>.

Above are the definitions of the syntax components.

First you can see some comments which may be helpful. Then the definition of <position> follows.

The #? means "parse a number". After the question mark the semantic, the meaning of the number, is written. Here the first number is designated as amount, which can be referred to in any explanatory text. But it is also used as the name of the variable in which the number is stored in the current data element.

The description is expected as any characters (<*?…), but exactly 16 characters. This helps to parse print formats which have no other delimiters. The description must not be empty; the first of the 16 characters is the first character after the white spaces. Note that a white space in the syntax forces skipping of white spaces in the parsed input, as long as <$NoWhiteSpaces> is not given.

The rest till one of the newline characters is stored as value, but without leading spaces.

The syntax component for <date> is very simple, it parses only all till newline. It’s only an example.
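
The effect of the <position> component on one input line can be written by hand as follows. This is a sketch for comparison only, not generated code and not the parser's implementation; the class BomPositionSketch is a hypothetical name, and it assumes that the line end is already removed from the line.

```java
final class BomPositionSketch {
  final int amount;
  final int code;
  final String description;  // exactly 16 characters, trailing spaces are kept
  final String value;        // rest of the line, leading spaces removed

  private BomPositionSketch(int amount, int code, String description, String value) {
    this.amount = amount;
    this.code = code;
    this.description = description;
    this.value = value;
  }

  /** Hand-written counterpart of the four items of the position component. */
  static BomPositionSketch parseLine(String line) {
    int pos = skipSpaces(line, 0);
    int end = skipDigits(line, pos);
    int amount = Integer.parseInt(line.substring(pos, end));   // number, stored as amount

    pos = skipSpaces(line, end);
    end = skipDigits(line, pos);
    int code = Integer.parseInt(line.substring(pos, end));     // number, stored as code

    pos = skipSpaces(line, end);
    end = pos + 16;                                            // exactly 16 characters
    if (end > line.length()) {
      throw new IllegalArgumentException("description shorter than 16 characters");
    }
    String description = line.substring(pos, end);

    pos = skipSpaces(line, end);
    String value = line.substring(pos);                        // all until the line end
    return new BomPositionSketch(amount, code, description, value);
  }

  private static int skipSpaces(String s, int pos) {
    while (pos < s.length() && s.charAt(pos) == ' ') { pos++; }
    return pos;
  }

  private static int skipDigits(String s, int pos) {
    while (pos < s.length() && Character.isDigit(s.charAt(pos))) { pos++; }
    return pos;
  }
}
```

The fixed length of 16 is what separates the description from the value, because both may contain spaces.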

7.2. Example variable declaration in C/++ or Java

In C and C++ as well as in Java, variable declarations can be written in the form

int a,b,c;

These are three variables, three parse results, and all variables are of type int. The type is of course semantically part of the variable. The definition line itself is not interesting; the individual variable definition is. The variable definition is semantically exactly the same as:

int a;
int b;
int c;

The application must not distinguish between the two writing forms.
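
The desired parse result can be sketched in Java: one data element per variable, each holding its own copy of the type, regardless of the writing form. The class VariableDeclSketch and its simple string splitting are hypothetical and only illustrate the result; they stand in for the ZBNF syntax, which is not given in this article.

```java
import java.util.ArrayList;
import java.util.List;

final class VariableDeclSketch {
  /** One parse result: a single variable together with its type. */
  static final class Variable {
    final String type;
    final String name;

    Variable(String type, String name) {
      this.type = type;
      this.name = name;
    }

    @Override public String toString() { return type + " " + name + ";"; }
  }

  /** Splits simple declarations such as "int a,b,c;" into one Variable per name. */
  static List<Variable> parse(String source) {
    List<Variable> result = new ArrayList<>();
    for (String statement : source.split(";")) {
      String trimmed = statement.trim();
      if (trimmed.isEmpty()) { continue; }
      String[] typeAndNames = trimmed.split("\\s+", 2);   // type, then the list of names
      if (typeAndNames.length < 2) {
        throw new IllegalArgumentException("type without variable name: " + trimmed);
      }
      for (String name : typeAndNames[1].split(",")) {
        result.add(new Variable(typeAndNames[0], name.trim()));  // the type is stored with each variable
      }
    }
    return result;
  }
}
```

Both writing forms shown above lead to the same three Variable elements, so the evaluating application sees no difference.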

The syntax definition as part of …TODO

8. Generate the storage class for data, from the ZBNF syntax script

//in file: src/test/java/org/vishia/zbnf/test/Test_Bom_Zbnf.java
  /**Generates the sources for destination classes for the given billOfMaterial.zbnf
   */
  static void genDstClassForBom() {
    String[] args_genJavaOutClass = 
      { "-s:src/test/files/zbnfParser/billOfMaterial.zbnf"
      , "-dirJava:$(TMP)/exmpl_ZbnfParser_Bom"
      , "-pkg:org.vishia.zbnf.test.gen"
      , "-class:Bom_Data"
      };
    GenZbnfJavaData.smain(args_genJavaOutClass);
  }

This operation shows how the generation of destination classes can be invoked from Java. It can also be invoked from the command line with Java. The arguments of main(String[] args) are identical.

The functionality reads the syntax file using

parser.setSyntax(syntaxFile);

After that, the parser contains a data structure

ZbnfSyntaxPrescript mainScript = parser.mainScript();

as a tree of org.vishia.zbnf.ZbnfSyntaxPrescript.

This tree contains the syntax description with the semantic aspects. Any node in this tree can be used to create a data class:

evaluateSyntax(mainScript);

9. Example for parsing with ZBNF syntax and storage class

The example is contained in