MATLAB Examples


Growing a Compiler, Chapter 2

format compact  % Help MATLAB save screen space

2 Grammar Executing Machine (version 1)

The Grammar Executing Machine principle has been demonstrated in Chapter 1. The version 0 driver gem0.m and MEX files iog0.c and iog0.h provide no-frills code that runs on the edge of catastophe. The slightest user error can bring all of MATLAB down in a rubble of bits. Version 0 is meant to illustrate its algorithm, as contrasted to being actually used. It is time for something more useful.

format compact  % Help MATLAB save screen space

2.1 Robust Implementation

Version 1 files iog1.c and iog1.h are identical to version 0 except that they check for obvious faults. The intent is to issue meaningful error messages when something goes wrong. These files have never crashed MATLAB. In addition version 1 has a trace facility providing a step-by-step record of parsing, searching and backtracking. The trace is activated by an optional third (flag) parameter, a string containing upper case T.

 o = GEM(i, g, 'T')

The version 1 driver (gem1.m) introduces new capabilities.

The robust implementation now supplants the base implementation here.

G   = gem1();              % Instantiate the version 1 driver.
GEM =;               % New version of GEM ready to use.
res = GEM('', 'r=;', 'T')  % Test trace facility.
SEARCH  stk:0 **ps:r  p:0 *p:;  i:0 *i:#  o:-1 *o:#
 PARSE  stk:0 **ps:r  p:3 *p:;  i:0 *i:#  o:-1 *o:#
res =

Size comparisons for the versions so far

!wc -l iog0.c  iog0.h
    136    iog0.c 
     76    iog0.h 
    212    total 
!wc -l iog1.c  iog1.h
    160    iog1.c 
    108    iog1.h 
    268    total 

Having a robust implementation available, it is time to try some more sophisticated IOGs.

2.2 Predefined character-class CFGs and IOGs.

Character classes occur a lot and are rather tedious to specify in grammatical form. Character classes here are analogous to functions isletter, isdigit and the like in C.

Eight such classes of symbols are pre-entered into gem1.m so that the GEM user never needs to define them explicitly. For example, grammar digitIOG defines phrase name D to pass digits. Read the first rule as "if you see a zero, emit a zero."

   D = '0' "0";
   D = '1' "1";
   D = '2' "2";
   D = '3' "3";
   D = '4' "4";
   D = '5' "5";
   D = '6' "6";
   D = '7' "7";
   D = '8' "8";
   D = '9' "9";

The list of pre-entered CFG and IOG character classes is

digitCFG  (defines phrase d, accept any digit)
upperCFG  (defines phrase l, accept any upper case letter)
lowerCFG  (defines phrase l, accept any lower case letter)
asciiCFG  (defines phrase a, accept any character)
digitIOG  (defines phrase D, (example above) accept and pass on any digit)
upperIOG  (defines phrase L, accept and pass on any upper case letter)
lowerIOG  (defines phrase L, accept and pass on any lower case letter)
asciiIOG  (defines phrase A, accept and pass on any character)

One or more of these grammars can be appended to any grammar to be input to GEM. For example, here is an example that accepts any digit and passes it to the output.

fprintf('%s\n', GEM('7', ['r=D;' G.digitIOG])); % MATLAB string concatenation

2.3 Whitespace and IOG nowhite

People like whitespace; GEM does not. Therefore the first useful IOG is a deblanker named nowhite.

Reading the IOG below, a g is a sequence of p. There are five definitions of p , the first two of which do no output.

Since GEM always does first things first, nowhite discards blanks and newlines, passes $V_I$ and $V_O$ entries (including literal blanks and newlines) unchanged to the output, and also as a default (last rule) passes everything else to the output. Because phrase name A was used, character class grammar asciiIOG must be appended.

   g = p g;
   g = ;
   p = ' ';
   p = '
   p = I A I;
   p = O A O;
   p = A;
   I = ''' "'"
   O = '"' """

One wants to write o=GEM(i,nowhite) to remove whitespace from i. But GEM insists on its second argument having no superfluous whitespace of which the version of nowhite above has plenty. We cannot write

    nowhite = GEM(nowhite,nowhite);

to make a whitespace-free nowhite -- chicken and egg problem.

So a de-whited nowhite has to be prepared by hand ahead of time. All of this tedious work has been done and bundled up in object gem1 which was called (above). The de-whited nowhite is available as a field of G. Here it is. A lot of nowhite is far to the right on the browser page. (Your browser presents unprintable characters however it pleases.)

fprintf('%s\n', G.nowhite);
g=pg;g=;p=' ';p='
';p=IAI;p=OAO;p=A;I='''"'";O='"'""";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A='";A='	'"	";A='
";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=''"";A=' '" ";A='!'"!";A='"'""";A='#'"#";A='$'"$";A='%'"%";A='&'"&";A='''"'";A='('"(";A=')'")";A='*'"*";A='+'"+";A=','",";A='-'"-";A='.'".";A='/'"/";A='0'"0";A='1'"1";A='2'"2";A='3'"3";A='4'"4";A='5'"5";A='6'"6";A='7'"7";A='8'"8";A='9'"9";A=':'":";A=';'";";A='<'"<";A='='"=";A='>'">";A='?'"?";A='@'"@";A='A'"A";A='B'"B";A='C'"C";A='D'"D";A='E'"E";A='F'"F";A='G'"G";A='H'"H";A='I'"I";A='J'"J";A='K'"K";A='L'"L";A='M'"M";A='N'"N";A='O'"O";A='P'"P";A='Q'"Q";A='R'"R";A='S'"S";A='T'"T";A='U'"U";A='V'"V";A='W'"W";A='X'"X";A='Y'"Y";A='Z'"Z";A='['"[";A='\'"\";A=']'"]";A='^'"^";A='_'"_";A='`'"`";A='a'"a";A='b'"b";A='c'"c";A='d'"d";A='e'"e";A='f'"f";A='g'"g";A='h'"h";A='i'"i";A='j'"j";A='k'"k";A='l'"l";A='m'"m";A='n'"n";A='o'"o";A='p'"p";A='q'"q";A='r'"r";A='s'"s";A='t'"t";A='u'"u";A='v'"v";A='w'"w";A='x'"x";A='y'"y";A='z'"z";A='{'"{";A='|'"|";A='}'"}";A='~'"~";

2.3.1 Define function scan and redefine GEM

Since de-whiting is a frequent activity, it is useful to define a MATLAB function for the purpose. At the same time, as a matter of convenience, one can redefine GEM to automatically scan its second parameter (the grammar).

scan = @(txt), G.nowhite);  % (Read the @ as lambda.)
GEM  = @(txt,c), scan(c));  % redefine MATLAB function GEM

Testing scan, the expectation is that the blanks will disappear:

scan('x Y z')
ans =

If you are interested in the new files iog1.h, iog1.c and gem1.m, you can download this File Exchange submission and look at them in the MATLAB editor.

2.4 IOG pretty, the antidote to nowhite

Returning nowhite to human-readable form can be accomplished by IOG pretty, which puts the minimal amount of blanks and newlines back. IOG pretty is an example of its own output (to avoid a few hundred lines of boring output, the character class definitions lowerIOG, upperIOG and asciiIOG are supressed in the rest of this presentation).

Note that the rule for r precisely describes itself.

idx = strfind(G.pretty, 'L=');            % where the classes begin
fprintf('%s', G.pretty(1:idx-1));         % print pretty itself
fprintf('lowerIOG upperIOG asciiIOG\n');  % just print the class names
g = r g;
g =;
r = L '=' " " "=" f ';' ";" "
f = " " p f;
f =;
p = I A I;
p = O A O;
p = L;
I = ''' "'";
O = '"' """;
lowerIOG upperIOG asciiIOG

As a final reward, pretty also serves as a syntax checker for GEM input -- non-grammars will cause a run-time error.

2.5 pretty applied to nowhite

IOG pretty can be applied to nowhite to restore it readability. Starting with white-less nowhite the pretty version can be computed and printed

pretty=@(g)GEM(g, G.pretty);
gstr = pretty(G.nowhite); idx = strfind(gstr, 'A =');
fprintf('%s', gstr(1:idx-1));
g = p g;
g =;
p = ' ';
p = '
p = I A I;
p = O A O;
p = A;
I = ''' "'";
O = '"' """;

2.6 IOG inversion

The GEM input language is symmetrical in the input and output. Systematically interchanging the input and output delimiters (' and ") turns a compiler into decompiler. That is, the inverted IOG then accepts the original output and recreates the original input. An inverted inverter is still an inverter.

idx = strfind(G.invert, 'A=');
fprintf('%s', G.invert(1:idx-1));
g = p g;
g = ;
p = ''' """ A ''' """;
p = '"' "'" A '"' "'";
p = A;

We conjecture the inverter works if and only if the IOG defines a 1-1 correspondence between input and output strings. On the other hand, We do not know under what conditions an IOG mapping is 1-1, so the insight is not immediately useful.

2.6.1 Inversion example

Right-associative sum and difference expressions can be defined naturally in a right-recursive IOG. The output of the IOG sum is the left-to-right sequence of rule applications.

fprintf('%s\n', G.sum);
g = e         "0";
e = t '+' e   "1";
e = t '-' e   "2";
e = t         "3";
t = 'x'       "4";

To be tediously explicit, the input string x+x-x is rewritten by grammar rule applications as follows:

   string    rule to be applied
   x+x-x     4
   t+x-x     4
   t+t-x     4
   t+t-t     3
   t+t-e     2
   t+e       1
   e         0
   g         QUIT

Running the sum grammar on an expression yields the predicted sequence of rules (often called the canonical parse).

parse = GEM('x+x-x', G.sum) % apply sum to 'x+x-x'
parse =

The sum grammar can be inverted and then pretty printed. Note that the only change is the interchange of " and '.

invertedsum = GEM(scan(G.sum), G.invert); % apply invert to sum
fprintf('%s', pretty(invertedsum));
g = e '0';
e = t "+" e '1';
e = t "-" e '2';
e = t '3';
t = "x" '4';

Finally, the inverted sum grammar can be applied to the canonical parse and recreate the original input.

fprintf('%s\n', GEM(parse, invertedsum)); % apply inverted sum to 4443210

At this point in the presentation a decision has to be made. We can push on using gem1 to start extending the IOGs or we can push the de-whiting and character class definitions into the C code in iog. The efficiency and convenience of not having to scan the input grammar, and not having to rummage through hundreds of grammar rules to process character classes suggests that we extend GEM immediately. The two extensions are done separately to keep the concept of "growing" clear but not much use will be made of GEM until both are complete.

The next chapter carries out the plan above.