Introduction
ANTLR is a tool that generates lexers and parsers from a grammar description, with the generated code targeting one of several supported programming languages. It can be used to create programs which need an interpreter, or even to write a compiler for an existing or new language.
I've only used ANTLR for a short time, so I can't say I'm ready to provide a full tutorial for it (nor do I currently have the time). However, someone has already taken the time to provide video tutorials for ANTLR 3 (as well as written tutorials for ANTLR 2). These tutorials focus on using ANTLR, together with an Eclipse plugin that makes it easy to write and test ANTLR-generated code.
Disclaimer: I didn't make these videos, nor have I watched all of them yet. However, I have seen many of them and I believe that they are well done and a valuable source of information.
Here's a link to these video tutorials: ANTLR 3.x Tutorial videos on Vimeo
Here's also a link to the main ANTLR website: ANTLR Parser Generator
In this tip I'm posting some useful rules for matching various common literals.
Difficulty: Medium-hard. Defining a grammar is different from conventional programming and takes some time to get used to. An in-depth understanding of all 9 of these video tutorials is required if you want to create a complex language. However, working through just a few of them is already enough to perform some powerful tasks, and those are not too difficult.
Integer rule
This rule will match Java-style integers entered in decimal notation (base 10). This is slightly different from the rule presented in the video tutorials because a number with leading zeros, such as 00001239, is technically treated by Java as an octal number (base 8). This rule will match a plain 0, or a non-zero digit followed by any number of digits. Digit is assumed to be a helper rule defined elsewhere in the grammar that matches a single character '0'..'9'.
/**
 * Any integer literal
 */
IntegerLiteral
    : '0'
    | ('1'..'9') Digit*
    ;
Note: this rule doesn't match negative integers. It's easier to parse the negative as a negation operator and handle negative integers this way.
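As a quick way to sanity-check what the integer rule is meant to accept, here is a minimal Java sketch that mirrors it with a regular expression. This is not ANTLR output: the class and method names are made up for illustration, and it assumes Digit means '0'..'9' and that the rule accepts a non-zero digit followed by any number of digits, as described above. The evaluate method shows one possible way of treating a leading minus as a negation operator applied to the literal.

```java
import java.util.regex.Pattern;

class IntegerLiteralSketch {
    // Regex mirror of the intended rule: '0' | ('1'..'9') Digit*
    // Assumption: Digit is equivalent to '0'..'9'.
    private static final Pattern INTEGER = Pattern.compile("0|[1-9][0-9]*");

    /** Returns true if the whole text is a decimal integer literal. */
    static boolean isIntegerLiteral(String text) {
        return INTEGER.matcher(text).matches();
    }

    /** Treats each leading '-' as a negation operator applied to the literal. */
    static long evaluate(String text) {
        int i = 0;
        boolean negate = false;
        while (i < text.length() && text.charAt(i) == '-') {
            negate = !negate; // every minus sign flips the sign once
            i++;
        }
        String literal = text.substring(i);
        if (!isIntegerLiteral(literal)) {
            throw new IllegalArgumentException("Not an integer literal: " + literal);
        }
        long value = Long.parseLong(literal);
        return negate ? -value : value;
    }

    public static void main(String[] args) {
        System.out.println(isIntegerLiteral("1239"));
        System.out.println(isIntegerLiteral("00001239"));
        System.out.println(evaluate("-42"));
    }
}
```

The literal itself never carries a sign here; "-42" is handled as negation applied to the literal 42, which is the approach suggested in the note above.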
Float rule
Here's a rule for matching Java-style floating point number literals. A similar notation is used in other languages. A Java float literal can be written in decimal notation (1.23), in exponential notation (1453e-4), or with an f at the end (132f). This rule builds on the IntegerLiteral rule above, which is reused for the f-suffixed form; the exponent itself is matched as an optional minus sign followed by one or more digits.
/**
 * Any floating point literal
 */
FloatLiteral
    : ( ( '0.' Digit* | ('1'..'9') Digit* ('.' Digit*)? ) ('e' '-'? Digit+)? )
    | IntegerLiteral 'f'
    ;
Note: As with the integer rule, this rule won't directly match negative numbers (negative exponents are parsed, though). It's much easier to deal with these in the semantics by matching the negative as the negation operator.
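To see which inputs the float rule accepts, the same trick works here: a minimal Java sketch that transcribes the rule into a regular expression. The class and method names are hypothetical, Digit is assumed to be '0'..'9', and IntegerLiteral is assumed to be a plain 0 or a non-zero digit followed by any number of digits, as described in the integer section.

```java
import java.util.regex.Pattern;

class FloatLiteralSketch {
    // Assumption: Digit is equivalent to '0'..'9'.
    private static final String DIGIT = "[0-9]";
    // Assumption: IntegerLiteral is '0' | ('1'..'9') Digit*
    private static final String INTEGER = "(?:0|[1-9]" + DIGIT + "*)";

    // Literal transcription of the FloatLiteral rule:
    //   (('0.' Digit*) | (('1'..'9') Digit* ('.' Digit*)?)) ('e' '-'? Digit+)?
    //   | IntegerLiteral 'f'
    private static final Pattern FLOAT = Pattern.compile(
        "(?:(?:0\\." + DIGIT + "*|[1-9]" + DIGIT + "*(?:\\." + DIGIT + "*)?)"
        + "(?:e-?" + DIGIT + "+)?)"
        + "|(?:" + INTEGER + "f)");

    /** Returns true if the whole text is accepted by the transcribed rule. */
    static boolean isFloatLiteral(String text) {
        return FLOAT.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.23", "1453e-4", "132f", "123", "1.5f", ".5"}) {
            System.out.println(s + " -> " + isFloatLiteral(s));
        }
    }
}
```

Running this against a few inputs shows two things about the rule as written: a bare 123 is accepted by the first alternative (both the fraction and the exponent are optional), so it overlaps with IntegerLiteral, and forms such as 1.5f or .5 are not accepted. Whether that matters depends on the language you're defining.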
C-style string and character literals rules
Here is a rule I've found effective for finding C-style strings. This notation is used by Java, C#, and a wide variety of programming languages. A nearly identical rule can be used for parsing character literals.
STRING
    : '"' ( '\\' ~('\r' | '\n') | ~('\\' | '"' | '\r' | '\n') )+ '"'
    ;

CHAR
    : '\'' ( '\\' ~('\r' | '\n') | ~('\\' | '\'' | '\r' | '\n') )+ '\''
    ;
Notes: I've defined these rules to accept any character after the escape character (the backslash), even though only a limited set of escape sequences is actually valid. I've found that accepting every character after the backslash and doing semantic error handling later works better than hard-coding the allowed escape sequences in the lexer. Additionally, the character literal rule matches character literals of any length, i.e., something like this would result in a positive match:
'abcdef abd'
Again, this is something that can be better handled in the semantics rather than in the lexer rule definition.
Secondly, these rules leave on the surrounding single quotes and double quotes. It's possible to immediately trim these out, and this information is provided in video tutorial #6 (about 1/3 the way in).
Lastly, I'm not quite sure why these rules work. In my opinion the middle section should require a * (zero or more) rather than a + (one or more). I haven't figured out why this only works with the +. Perhaps at some point I will figure this out and update the information here.
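The semantic checks mentioned in the notes above (rejecting invalid escape sequences, rejecting over-long character literals, and trimming the surrounding quotes) could be sketched in Java roughly as follows. This is only an illustration and not the approach from the video tutorials: the class and method names are made up, and the set of escapes handled is a small subset I picked for the example (Unicode and octal escapes are left out).

```java
class LiteralSemantics {
    /**
     * Strips the surrounding quotes from a matched STRING or CHAR token
     * and resolves escape sequences, rejecting any escape it does not know.
     */
    static String unquote(String literal) {
        if (literal.length() < 2) {
            throw new IllegalArgumentException("Literal too short: " + literal);
        }
        StringBuilder out = new StringBuilder();
        int end = literal.length() - 1; // index of the closing quote
        for (int i = 1; i < end; i++) {
            char c = literal.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (++i >= end) {
                throw new IllegalArgumentException("Dangling backslash in: " + literal);
            }
            char escaped = literal.charAt(i);
            switch (escaped) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case 'r':  out.append('\r'); break;
                case '\\': out.append('\\'); break;
                case '\'': out.append('\''); break;
                case '"':  out.append('"');  break;
                default:
                    // The lexer accepted this escape; the semantic pass rejects it.
                    throw new IllegalArgumentException("Unknown escape: \\" + escaped);
            }
        }
        return out.toString();
    }

    /** Like unquote, but also insists that the literal holds exactly one character. */
    static char unquoteChar(String literal) {
        String body = unquote(literal);
        if (body.length() != 1) {
            throw new IllegalArgumentException(
                "Character literal must contain exactly one character: " + literal);
        }
        return body.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"hello\\tworld\""));
        System.out.println(unquoteChar("'x'"));
    }
}
```

With this split, the lexer stays permissive and something like 'abcdef abd' is reported by unquoteChar with a specific message instead of failing to lex.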