Text Processing with Regular Expressions explained
A string can be formatted or parsed according to a pattern that is searched for in the string.
To format or parse data (text or other data types), you need to be able to tell your code: pick up that data item from there, and do this to it.
How do you do this? With a search based on pattern matching. The search pattern is described by what is called a Regular Expression (regex for short). In other words, Regular Expressions can be used to process text. A piece of text (a sequence of characters) that corresponds to the search pattern is called a match.
For example, you may want to say: Find a dot (.) in a string and split the string around each dot you find. That means if there are two dots in a string, the string will be split into three pieces. Or you may want to validate user input such as an email address.
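The dot-splitting case can be sketched as follows. This is a minimal illustration, not a definitive implementation; the class name SplitOnDots and the sample input are assumptions made for the example. It relies on String.split, which takes a regular expression as its argument.

```java
import java.util.Arrays;

public class SplitOnDots {
    // Splits the input around every literal dot.
    static String[] splitOnDots(String input) {
        // A bare dot is a regex metacharacter, so it is escaped here.
        // In Java source the backslash itself must be doubled.
        return input.split("\\.");
    }

    public static void main(String[] args) {
        // Two dots in the input give three pieces.
        System.out.println(Arrays.toString(splitOnDots("www.example.com")));
    }
}
```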
Java provides support for Regular Expressions (to define search patterns) and for matching the patterns (to the text in the string) by providing the following elements:
The java.util.regex.Pattern class
The regular expression constructs
The java.util.regex.Matcher class
The simplest form of a regular expression is a string literal such as "Hello". You may want to look for certain words in a user's input. However, you can also build very sophisticated expressions using what are called Regular Expression Constructs.
Consider the following expression:
[A-Za-z0-9]
Any character in the range of A through Z, a through z, or 0 through 9 (any letter or digit) will match this pattern.
Any other character will match the pattern described by the following expression:
[^A-Za-z0-9]
Notice the ^ character. This ^ character negates the expression.
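A small sketch of the two expressions above, assuming a hypothetical class name CharClassDemo and made-up sample strings. It uses find(), which reports whether the pattern occurs anywhere in the input.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True if the text contains at least one letter or digit.
    static boolean hasAlphanumeric(String text) {
        return Pattern.compile("[A-Za-z0-9]").matcher(text).find();
    }

    // True if the text contains at least one character that is
    // NOT a letter or digit (the ^ negates the class).
    static boolean hasNonAlphanumeric(String text) {
        return Pattern.compile("[^A-Za-z0-9]").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasAlphanumeric("abc123"));     // true
        System.out.println(hasNonAlphanumeric("abc123"));  // false
        System.out.println(hasNonAlphanumeric("abc 123")); // true (the space)
    }
}
```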
Here are some important points about constructs:
Use backslash (\) as an escape character. For example, \. matches a period, whereas \\ matches a backslash. In a Java string literal each backslash must itself be doubled, so the regex \. is written as "\\.".
Use | for logical OR, ^ to match the beginning of a line, and $ to match the end of a line. Remember that ^ inside [ ] means negation.
A character class is a set of character alternatives enclosed in brackets; for example, [abc] means a, b, or c. The character - denotes a range, and the character ^ inside [ ] denotes negation (that is, all characters except those specified). For example, the character class [^a-zA-Z] means all characters that are not in the ranges a through z and A through Z.
Several character classes are predefined for you; for example, \d means any digit.
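The points above can be tried out with a short sketch like the one below. The class name EscapeAndAnchorDemo, the helper found, and the sample strings are assumptions made for illustration only. Note how each regex backslash is doubled inside the Java string literal.

```java
import java.util.regex.Pattern;

public class EscapeAndAnchorDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\.", "a.b"));            // escaped dot: a literal period
        System.out.println(found("\\\\", "C:\\temp"));      // regex \\ : a literal backslash
        System.out.println(found("^Hello", "Hello world")); // ^ anchors at the beginning
        System.out.println(found("world$", "Hello world")); // $ anchors at the end
        System.out.println(found("cat|dog", "hot dog"));    // | is logical OR
        System.out.println(found("[^a-zA-Z]", "abc"));      // ^ inside [ ] negates
    }
}
```

All calls except the last return true; the last returns false because "abc" contains only letters.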
Character Classes (brackets used as a grouping mechanism)
[ABC]         Any of the characters A, B, or C
[^ABC]        Any character except A, B, or C (negation)
[a-zA-Z]      a through z or A through Z (range)
[...&&...]    Intersection of two sets (AND)
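Intersection is the least familiar of these forms, so here is a minimal sketch. The class name IntersectionDemo and the choice of "lowercase consonants" as the example set are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class IntersectionDemo {
    // Lowercase letters that are not vowels:
    // the intersection of [a-z] and [^aeiou].
    private static final Pattern CONSONANT = Pattern.compile("[a-z&&[^aeiou]]");

    static boolean isLowercaseConsonant(char c) {
        return CONSONANT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLowercaseConsonant('b')); // true
        System.out.println(isLowercaseConsonant('a')); // false: a vowel
        System.out.println(isLowercaseConsonant('B')); // false: not in a-z
    }
}
```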
Predefined Character Classes
.     (dot) Any character if the DOTALL flag is set, else any character except the line terminators
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character
\S    A non-whitespace character
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]
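A brief sketch of the predefined classes in use. The class name PredefinedClassesDemo, its helper methods, and the sample strings are hypothetical; the String methods replaceAll, split, and matches all accept regular expressions.

```java
import java.util.Arrays;

public class PredefinedClassesDemo {
    // Removes every non-digit (\D), keeping only the digits.
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // True if the whole text consists of word characters (\w),
    // that is, letters, digits, and underscores.
    static boolean isWord(String text) {
        return text.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: 555-1234"));               // 5551234
        System.out.println(Arrays.toString(words("  one two\tthree "))); // [one, two, three]
        System.out.println(isWord("user_name1"));                      // true
        System.out.println(isWord("user-name"));                       // false: hyphen
    }
}
```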
If you are looking for a regular expression, say X, to repeat itself a number of times, you can say it in the pattern by using a quantifier immediately following X. For example, X+ means one or more X.
Greedy Quantifiers (X represents a regular expression)
X?      X, zero or one time
X*      X, zero or more times
X+      X, one or more times
X{n}    X, exactly n times
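The four quantifiers can be compared side by side with a sketch such as this one. The class name QuantifierDemo and the sample patterns are assumptions; Pattern.matches checks whether the entire input matches the expression.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True only if the ENTIRE input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("colou?r", "color"));  // true: u zero times
        System.out.println(matchesWhole("colou?r", "colour")); // true: u one time
        System.out.println(matchesWhole("ab*c", "ac"));        // true: b zero times
        System.out.println(matchesWhole("ab+c", "ac"));        // false: needs at least one b
        System.out.println(matchesWhole("ab+c", "abbbc"));     // true
        System.out.println(matchesWhole("\\d{4}", "2024"));    // true: exactly four digits
        System.out.println(matchesWhole("\\d{4}", "202"));     // false
    }
}
```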
Some Other Constructs
^        The beginning of a line
XY       Y following X
X|Y      Either X or Y
(?:X)    X, as a noncapturing group
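To see what "noncapturing" means in practice, the sketch below compares a capturing group with a noncapturing one. The class name GroupingDemo and its helpers are hypothetical; Matcher.groupCount() reports how many capturing groups the pattern defines.

```java
import java.util.regex.Pattern;

public class GroupingDemo {
    // Number of capturing groups defined by the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True only if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Both forms group "ab" so that + applies to the pair...
        System.out.println(wholeMatch("(ab)+c", "ababc"));   // true
        System.out.println(wholeMatch("(?:ab)+c", "ababc")); // true
        // ...but only the plain parentheses create a numbered group.
        System.out.println(groupCount("(ab)+c"));            // 1
        System.out.println(groupCount("(?:ab)+c"));          // 0
        // Alternation: either side may match.
        System.out.println(wholeMatch("cat|dog", "dog"));    // true
    }
}
```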
The following is a typical process for pattern matching:
- Compile the regular expression specified as a string into an instance of the Pattern class, for example, with a statement like the following:
Pattern p = Pattern.compile("[^a-zA-Z0-9]");
- Create a Matcher object that will contain the specified pattern and the input text to which the pattern will be matched:
Matcher m = p.matcher("myemail@emailaddress.com");
- Invoke the matches() method or the find() method on the Matcher object to check for a match. The matches() method succeeds only if the entire input matches the pattern, whereas find() looks for the next part of the input that matches:
boolean b = m.find();
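Put together, the three steps might look like the following sketch. The class name MatchProcessDemo and the helper findAll are assumptions made for illustration; the pattern and the input string are the ones used in the steps above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchProcessDemo {
    // Collects every piece of the input that matches the regex.
    static List<String> findAll(String regex, String input) {
        Pattern p = Pattern.compile(regex); // step 1: compile
        Matcher m = p.matcher(input);       // step 2: create the Matcher
        List<String> matches = new ArrayList<>();
        while (m.find()) {                  // step 3: find each match in turn
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "myemail@emailaddress.com";
        // Every character that is not a letter or digit.
        System.out.println(findAll("[^a-zA-Z0-9]", input)); // [@, .]

        // matches() requires the WHOLE input to match, so it fails here.
        Matcher m = Pattern.compile("[^a-zA-Z0-9]").matcher(input);
        System.out.println(m.matches()); // false
    }
}
```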
The following is a code example to validate email addresses:
import java.util.regex.*;

public class EmailValidator {
    public static void main(String[] args) {
        String email = "";
        if (args.length < 1) {
            System.out.println("Command syntax: java EmailValidator <emailAddress>");
            System.exit(0);
        } else {
            email = args[0];
        }

        // Look for email addresses starting with
        // invalid symbols: dots or @ signs.
        Pattern p = Pattern.compile("^\\.+|^\\@+");
        Matcher m = p.matcher(email);
        if (m.find()) {
            System.err.println("Invalid email address: starts with a dot or an @ sign.");
            System.exit(0);
        }

        // Look for email addresses that start with www.
        p = Pattern.compile("^www\\.");
        m = p.matcher(email);
        if (m.find()) {
            System.out.println("Invalid email address: starts with www.");
            System.exit(0);
        }

        // Look for any character other than letters, digits, @, dot, and underscore.
        p = Pattern.compile("[^A-Za-z0-9\\@\\.\\_]");
        m = p.matcher(email);
        if (m.find()) {
            System.out.println("Invalid email address: contains invalid characters");
        } else {
            System.out.println(args[0] + " is a valid email address.");
        }
    }
}
This code takes the user input as a command-line argument, so you must supply an email address when you run it.
Try a few email addresses and see what you get. Keep in mind that the checks are minimal: the code rejects only addresses that start with a dot, an @ sign, or www., or that contain characters other than letters, digits, @, dots, and underscores.
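If you would rather try several addresses in one run, the same three patterns can be wrapped in a method, as in this sketch. The class name EmailCheckDemo, the method passesChecks, and the sample addresses are assumptions made for illustration; the patterns are copied from the validator above.

```java
import java.util.regex.Pattern;

public class EmailCheckDemo {
    // The same three patterns as in EmailValidator, compiled once.
    private static final Pattern BAD_START = Pattern.compile("^\\.+|^\\@+");
    private static final Pattern WWW_START = Pattern.compile("^www\\.");
    private static final Pattern BAD_CHARS = Pattern.compile("[^A-Za-z0-9\\@\\.\\_]");

    // True if none of the three "invalid" patterns is found.
    static boolean passesChecks(String email) {
        return !BAD_START.matcher(email).find()
            && !WWW_START.matcher(email).find()
            && !BAD_CHARS.matcher(email).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "user@example.com", ".user@example.com", "@example.com",
            "www.example.com", "first-last@example.com", "plainaddress"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + passesChecks(s));
        }
    }
}
```

Only the first and last samples pass. The last one, "plainaddress", passes even though it contains no @ sign, because none of the three patterns checks for one.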