MATLAB Examples


Growing a Compiler, Chapter 4

format compact  % Help MATLAB save screen space

4 Using the GEM capabilities

Up until now the emphasis has been on choosing new capabilities expressed first as bootstraps, then building the capabilities into GEM itself. The final built-in capability of GEM added here is multi-character input and output. To sum up the progress, the implementations have included

  • gem0 -- minimal, risky
  • gem1 -- robust implementation, primtive trace
  • gem2 -- built-in execution of grammars containing whitespace
  • gem3 -- built-in character classes asciiIOG....digitCFG, better trace
  • gem4 -- built-in multicharacter input and output symbols

If you are interested in the new files iog4.h, iog4.c and gem4.m, you can download this File Exchange submission and look at them in the MATLAB editor.

Now it is time to use those built-in capabilities.


4.1 Kleene * and Kleene + in IOGs

It is proposed that a rule such as

 g = p*;

will mean that a g is a sequence of zero or more p, and

 g = p+;

will mean that a g is a sequence of one or more p.

4.1.1 Implementing Kleene *

GEM knows nothing about the Kleene star, so progress is made by transforming regular expression grammars back to the original IOG form. The tricks are to replace each


with a new symbol (say R) and add new rules

 R = rR;  R =;

The tricks are separately applied: the resulting two IOGs are concatenated to make a grammar acceptable to GEM.

The limitation is that the upper case phrase name are going to magically appear if the grammar contains the corresponding lower case phrase name, and, as usual, names dla must be avoided if they are going to be used with the built-in meaning.

The new rules are created by IOG nostar1 which throws away the grammar and makes a few new rules. nostar1 contains 26 rules of the form

 s = 'a*' "A=aA;A=;";
 s = 'b*' "B=bB;B=;";
 s = 'z*' "Z=zZ;Z=;";

As it turns out, extra rules are generated if there is more than one occurrence of a particular a*. This is not a logical problem: backtracking cures all. But it is a catastrophic performance problem. gem4 function nodup cleans out the extras. It is not necessary for postfix below, but is necessary for pretty. Just be clear that the bootstrapping game is still being played, nodup is implemented using GEM.

The r* items are replaced in the grammar by IOG nostar2, a version of pretty containing 26 rules of the form

 s = 'a*' "A";
 s = 'b*' "B";
 s = 'z*' "Z";

Both nostar1 and nostar2 require deblanked input.

4.1.1 A new version of pretty.

To keep the ability to display IOGs, a new version of pretty is needed to handle multicharacter operators and the Kleene * and +. It is convenient to use *, in particular, to describe sequences of white space. Here is the new pretty.

fprintf('%s', GEM(G.pretty0, G.pretty, 'ALU'));
g = b* r*;
r = L b* '=' " =" f* b* ';' b* ";
f = b* " " p;
p = I I I;
p = I i "'";
p = O O O;
p = O o """;
p = L b* '*' "*";
p = L b* '+' "+";
p = L;
i = ''';
i = A i;
o = '"';
o = A o;
I = ''' "'";
O = '"' """;
b = ' ';
b = '

4.2 Using * in postfix.

The postfix grammar in the previous chapter used a functional-programming technique to achieve left-associative expression syntax; the grammar could have been instead been written using the Kleene *. Here it is:

fprintf('%s', G.postfix0);
g = e;
e = t r*;
r = '+' t "+";
r = '-' t "-";
t = f s*;
s = '*' f "*";
s = '/' f "/";
f = L;
f = D;
f = '(' e ')';

The two tricks to remove * give the following grammars which are then concatenated:

newgrammar = GEM(G.scan(G.postfix0), G.nostar1, 'ALU');
newrules   = GEM(G.scan(G.postfix0), G.nostar2, 'alu');
fprintf('%s', GEM(newgrammar, G.pretty, 'ALU'));
fprintf('%s', GEM(newrules, G.pretty, 'ALU'));
postfix = [newgrammar, newrules];
g = e;
e = t R;
r = '+' t "+";
r = '-' t "-";
t = f S;
s = '*' f "*";
s = '/' f "/";
f = L;
f = D;
f = '(' e ')';
R = r R;
R =;
S = s S;
S =;

As one can see, the Kleene * in postfix0 has been transformed away. Applying the newly generated IOG gives postfix expressions as before.

fprintf('%s\n', GEM('2*(6+a+4)-2/7', postfix, 'DL'));

4.3 Implement Kleene +

Adding Kleene operator + could be dealt with in much the same way as * . A simpler solution is to translate r+ into rr*, then use nostar1 and nostar2 to eliminate the *. The IOG noplus is a version of pretty with 26 rules of the form

 s = 'a+' "aa*";
 s = 'b+' "bb*";
 s = 'z+' "zz*";
plusexample = 'r=a+b+;a=''1''"O";b=''2''"T";';
fprintf(GEM(plusexample,G.pretty, 'ALU'));
r = a+ b+;
a = '1' "O";
b = '2' "T";
plusres = GEM(plusexample, G.noplus, 'ALU');
fprintf(GEM(plusres, G.pretty, 'ALU'));
r = a a* b b*;
a = '1' "O";
b = '2' "T";

The expectation of plusexample is that given a sequence of 1 followed by a sequence of 2, the output will be the corresponding sequence of O and T.

plusgram = [GEM(plusres,G.nostar1, 'ALU') GEM(plusres,G.nostar2, 'alu')];
fprintf('%s', GEM('11122', plusgram));

4.4 Grammar introspection

Gramars can be used to examine grammars. In particular the sets Vn, Vg, Vi and Vo (see chapter 1) can be extracted. The Vn extractor, for example, copies all of the LHS names and discards the rest. Function nodup removes the duplicates. Here Vn is used on itself.

fprintf('%s\n', G.Vn0);   % gem4 transforms away the Kleene *
res =, G.Vn, 'ULula');
fprintf('all phrase names used:    {%s}\n', res);
fprintf('Vn = {%s}', G.nodup(res, 1));
g = r*;
r = L '=' f ';';
f = p*;
p = l;
p = l '*';
p = l '+';
p = ''' a ''';
p = '"' a '"';

all phrase names used:    {grfppppp}
Vn = {grfp}

The goal symbols is

fprintf('%s\n', G.Vg)
res =, G.Vg, 'La');
fprintf('Vg = {%s}\n', res);
Vg = {g}

The sets Vi and Vo are less interesting, because the use of the built-in character classes make the sets large, and because the use of multi-character symbols makes this output hard to read. The use of the built-in character classes is not recorded in the IOGs, but rather in the call of And multi-character input and output symbols are not implemented in Vi or Vo. The choice here is to extract only the symbols explicitly defined in the IOG.

fprintf('%s\n', G.Vi0);   % gem4 transforms away the Kleene *
res =, G.Vi, 'ulaA');
fprintf('all explicit input:    %s\n', res);
fprintf('Vi = {%s}', G.nodup(res, 1));
g = r*;
r = l '=' f ';';
f = p*;
p = l;
p = l '*';
p = l '+';
p = ''' A ''';
p = '"' a '"';

all explicit input:    =;*+''""
Vi = {=;*+'"}

In the case of Vo, it is not interesting applied to itself, so instead pretty is used as input (see above). Note: the blank is there, but not obvious.

fprintf('%s\n', G.Vo0);   % gem4 transforms away the Kleene *
res =, G.Vo, 'ulaA');
fprintf('all explicit output:    %s\n', res);
fprintf('Vo = {%s}', G.nodup(res, 1));
g = r*;
r = l '=' f ';';
f = p*;
p = l;
p = l '*';
p = l '+';
p = ''' a ''';
p = '"' A '"';

all explicit output:     =;
Vo = { =;

The grammar rules themselves are supplied by pretty.

The story continues with the next chapter.