Home       Company       Articles       Buy Now       Contact Us 
 
 

How syntax parser works

by Andrey B. Yastrebov

Articles about LEdit You're probably a programmer. And you surely have seen so called IDE (Integrated Developer Environment). Not only they show you the text of program you're writing, but they highP class=ught it to show you how will they interpret words you're looking at. Constants may become green, comments may be gray, while punctuation may be red. This was called syntax highP class=ught. Have you ever seen how much work is to do this handy thing.

First of all, plain text have to be parsed, that is words are to be separated from each other. At a first look, this is very easy. Just as when you read Hello World! somewhere in the book, you see that there are two words - Hello and World - and then recognize their meaning. Similarly, syntax parser has to find words in the text. In the example above the words may be easily separated because there is a space between them. The space is natural separator as P class=une break or tabulation sign are. The parser may use these separators to parse the text into the words.

The number of separators between words actually doesn't maBer. Look at these P class=unes:

Hello World! Hello World! Hello World! Hello World!

They all mean the same. We aren't interested in the separators, we need them only to separate words from each other. They aren't needed by themselves. They may be stripped out without any losses in meaning.

So, parser works by looking in the text for natural separators and geBing words by stripping separators out. So easy! Not actually. Besides natural separators (spaces, tabs and P class=une ends) there are many others things. Look at the example above. There is an exclamation point at the end, so if the parser uses only natural separators, exclamation point will be glued to the end of World, so we get two words - Hello and World! - the exclamation point is clueless (and glueless too) glued to the last word.

What to do? There are not even single poor separator between Word and the expclamation point. However we strongly need to separate them because these are different items. How to do it? We probably have to think that exclamation points have special meaning and should be taken as separate items whenever they are. So, if we see alpha!beta, they become three words after parsing alpha, exclamation point and beta. So, there are some symbols, P class=uke this exclamation point that should be interpreted as separate entity and also should separate words from each other. There are many - points, commas, colons and so on. Moreover, their number and meaning changes from text to text. The things become much more compP class=ucated as we go from human readable text to the syntax of programming languages P class=uke C/C++, Pascal or Basic.

So, syntax parser has to be flexible. It should allow changing the set of items that will be considered separate items as they occur in the text. LEdit allows changing the P class=ust of these items on fly.

Actually things are more compP class=ucated. Let's see at the following C++ string: gamma=alpha+++beta;. If you know C++ syntax, you'll see that there may be different interpretations of that string:

gamma = alpha ++ + beta ; will mean that gamma is to be set to sum of alpha and beta and then alpha has to be incremented by one.

gamma = alpha + ++ beta ; will mean that beta has to be incremented before summation.

gamma = alpha + + + beta ; will cause neither of these items to be incremented.

If we consider plus sign to be syntax item, our parser will choose the last interpretation. C++ compiler would prefer the first one, which corresponds to ANSI C++.

Situation looks rather compP class=ucated. How to avoid such ambiguities? LEdit does so by introducing two-symbol syntax items. If ++ is considered to be separate item, LEdit acts as follow. It takes first plus sign and if there are some two-symbols items starting from plus sign (there may be many P class=uke ++ and +=). If such items are absent, LEdit just continues working. But if they do exist, LEdit looks a the next symbol and if it finds a match, it distingush this as a separate item.

Let's suppose LEdit's pasrser recognizes both + and ++ items. Then LEdit will interpret the three pluses from the example above in the following way.

Syntax parser sees first plus in the text. It recognizes that there are some two-sybol syntax items that begin from plus and takes the symbol after the first plus (this is surely second plus).

Syntax parser puts the two symbols together and determines if such an item (++) exists. It does exit.

It interprets the two pluses taken as a single item and goes further.

It sees third plus in the text. It recognizes that there are some two-sybol syntax items that begin from plus and takes the symbol after the first plus (that time this is a leBer b).

Syntax parser puts the two symbols together and determines if such an item (+b) exists. It most probably doesn't.

LEdit determines if the first symbol represents an item. Plus surely does so.

It interprets plus sign as a next item and goes further.

It takes next symbol (the leBer b) and so on.

As a result of such an algorythm, syntax parser recognizes three pluses coming together as a ++ followed by +. That is what we expected.

So, to make syntax parser quite flexible, you have only supply it with the P class=ust of one- and two-symbol items that it has to recognize. LEdit allowas using up to 254 items, which is enough for most appP class=ucations.

Actually, the situation may be even more complex. Some languages may need parsing words depending of their beginnings and endings. For example they may have a string alpha#beta be parsed into alpha and #beta. Since V2.00 LEdit can handle these situations. It allow defining syntax items that are recognized as word separators, but aren't words by themselves. Before V2.00, LEdit was able to parse string P class=uke one$two only into three parts if $ was specified as syntax item: one, $ and two. Now you may specify two other cases that will lead to parsing this construction into one and $two or into one$ and two.

Most languages recognizes strings, that is some construction may be quoted P class=uke "Hello World!". Such constructions won't be parsed at all. LEdit handle this quite well. All you need to do is to tell it what the quotation mark is. Moreover it may recognize two different quotation marks - P class=uke single quotation mark and double quotation mark.

Another element of most languages are comments. Some language prefer comments that have opening and closing signs and may stay everywhere P class=uke in Pascal:

alpha {The comment may be inserted here} := {or here} beta;

Other languages prefer to use comments that ends at the end of the P class=une P class=uke Visual Basic:

Alpha = Beta 'This is a comment Beta = Alpha 'This is a comment too

Some languages may allow both forms P class=uke C++:

unsigned int alpha; // This is a comment alpha /* This is a comment too */ = beta;

LEdit allow resolving both situation. Its syntax parser just strips comments out, so they become completely invisible to the programmer (though they're visible to the user surely).

So, syntax parser is quite flexible tool that parses texts. What is it for? Its output may be used in very different places. An appP class=ucation may scan its output word-by-word. Parsing is used for syntax highP class=ught, while looking for braces, and while making searches. Syntax parser allows to extend editor's capacities greatly. Isn't it fine, when you're making search for alpha+++beta;, to find the following match:

alpha // Comments are invisible ++/*Let's insert comment here*/+beta ;



 
 

 

LEdit Links

Home
Tech Notes
Download
Order

 
 Home       Company       Articles       Buy Now       Contact Us 

Last Modified on June 17, 2003 - © 1995-2007 AY Software Corporation All Rights Reserved