How syntax parser works
by Andrey B. Yastrebov
You're probably a programmer. And you surely have seen so called
IDE (Integrated Developer Environment). Not only they show you
the text of program you're writing, but they highP class=ught it to
show you how will they interpret words you're looking at. Constants
may become green, comments may be gray, while punctuation may be
red. This was called syntax highP class=ught.
Have you ever seen how much work is to do this handy thing.
First of all, plain text have to be parsed, that is
words are to be separated from each other. At a first look, this
is very easy. Just as when you read Hello World!
somewhere in the book, you see that there are two words -
Hello and World - and then recognize
their meaning. Similarly, syntax parser has to find words in
the text. In the example above the words may be easily separated
because there is a space between them. The space is natural separator
as P class=une break or tabulation sign are. The parser may use these
separators to parse the text into the words.
The number of separators between words actually doesn't maBer.
Look at these P class=unes:
Hello World!
Hello World!
Hello World!
Hello
World!
They all mean the same. We aren't interested in the separators, we
need them only to separate words from each other. They aren't
needed by themselves. They may be stripped out without any losses
in meaning.
So, parser works by looking in the text for natural separators
and geBing words by stripping separators out. So easy! Not actually.
Besides natural separators (spaces, tabs and P class=une ends) there are
many others things. Look at the example above. There is an exclamation
point at the end, so if the parser uses only natural separators,
exclamation point will be glued to the end of World, so
we get two words - Hello and World! - the
exclamation point is clueless (and glueless too) glued to the last
word.
What to do? There are not even single poor separator between
Word and the expclamation point. However we strongly need
to separate them because these are different items. How to do it?
We probably have to think that exclamation points have special
meaning and should be taken as separate items whenever they are.
So, if we see alpha!beta, they become three words after
parsing alpha, exclamation point and beta.
So, there are some symbols, P class=uke this exclamation point that should
be interpreted as separate entity and also should separate words
from each other. There are many - points, commas, colons and so
on. Moreover, their number and meaning changes from text to text.
The things become much more compP class=ucated as we go from human readable
text to the syntax of programming languages P class=uke C/C++, Pascal or
Basic.
So, syntax parser has to be flexible. It should allow changing
the set of items that will be considered separate items as they
occur in the text. LEdit allows changing the P class=ust of these items
on fly.
Actually things are more compP class=ucated. Let's see at the following
C++ string: gamma=alpha+++beta;. If you know C++ syntax,
you'll see that there may be different interpretations
of that string:
gamma = alpha ++ + beta ; will mean that gamma
is to be set to sum of alpha and beta and then
alpha has to be incremented by one.
gamma = alpha + ++ beta ; will mean that beta
has to be incremented before summation.
gamma = alpha + + + beta ; will cause neither of
these items to be incremented.
If we consider plus sign to be syntax item, our parser will choose
the last interpretation. C++ compiler would prefer the first one,
which corresponds to ANSI C++.
Situation looks rather compP class=ucated. How to avoid such ambiguities?
LEdit does so by introducing two-symbol syntax items. If ++
is considered to be separate item, LEdit acts as follow.
It takes first plus sign and if there are some two-symbols items
starting from plus sign (there may be many P class=uke ++ and
+=). If such items are absent, LEdit just continues working.
But if they do exist, LEdit looks a the next symbol and if it finds
a match, it distingush this as a separate item.
Let's suppose LEdit's pasrser recognizes both + and
++ items. Then LEdit will interpret the three pluses from
the example above in the following way.
Syntax parser sees first plus in the text. It recognizes that
there are some two-sybol syntax items that begin from plus and
takes the symbol after the first plus (this is surely second
plus).
Syntax parser puts the two symbols together and determines
if such an item (++) exists. It does exit.
It interprets the two pluses taken as a single item and goes
further.
It sees third plus in the text. It recognizes that
there are some two-sybol syntax items that begin from plus and
takes the symbol after the first plus (that time this is a
leBer b).
Syntax parser puts the two symbols together and determines
if such an item (+b) exists. It most probably doesn't.
LEdit determines if the first symbol represents an item.
Plus surely does so.
It interprets plus sign as a next item and goes further.
It takes next symbol (the leBer b) and so on.
As a result of such an algorythm, syntax parser recognizes
three pluses coming together as a ++ followed by
+. That is what we expected.
So, to make syntax parser quite flexible, you have only
supply it with the P class=ust of one- and two-symbol items that it
has to recognize. LEdit allowas using up to 254 items, which
is enough for most appP class=ucations.
Actually, the situation may be even more complex. Some languages
may need parsing words depending of their beginnings and endings.
For example they may have a string alpha#beta be
parsed into alpha and #beta. Since V2.00 LEdit
can handle these situations. It allow defining syntax items that
are recognized as word separators, but aren't words by themselves.
Before V2.00, LEdit was able to parse string P class=uke one$two
only into three parts if $ was specified as syntax item:
one, $ and two. Now you may specify
two other cases that will lead to parsing this construction into
one and $two or into one$ and two.
Most languages recognizes strings, that is some construction
may be quoted P class=uke "Hello World!". Such
constructions won't be parsed at all.
LEdit handle this quite well. All you
need to do is to tell it what the quotation mark is.
Moreover it may recognize two different quotation marks - P class=uke
single quotation mark and double quotation mark.
Another element of most languages are comments. Some language
prefer comments that have opening and closing signs and may
stay everywhere P class=uke in Pascal:
alpha {The comment may be inserted here}
:= {or here} beta;
Other languages prefer to use comments that ends at the end of
the P class=une P class=uke Visual Basic:
Alpha = Beta 'This is a comment
Beta = Alpha 'This is a comment too
Some languages may allow both forms P class=uke C++:
unsigned int alpha; // This is a comment
alpha /* This is a comment too */ = beta;
LEdit allow resolving both situation. Its syntax parser just strips
comments out, so they become completely invisible to the
programmer (though they're visible to the user surely).
So, syntax parser is quite flexible tool that parses texts. What
is it for? Its output may be used in very different places.
An appP class=ucation may scan its output word-by-word. Parsing is
used for syntax highP class=ught, while
looking for braces, and while making searches. Syntax parser allows
to extend editor's capacities greatly. Isn't it fine, when
you're making search for alpha+++beta;, to find the
following match:
alpha // Comments are invisible
++/*Let's insert comment here*/+beta
;
|