Lesson 21: Regular Expressions

Suppose your task is to confirm the format of an incoming telephone number. It must contain exactly ten digits, the first of which must be between 2 and 9. It may be punctuated with hyphens in the fourth and eighth positions, but it must be either both or neither.

You might write a method like this:

private boolean judgePhone(String thePhone) {
    if (thePhone == null) {
        return false;
    }
        
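    if (thePhone.length() == 12) {
        // Punctuated form: both hyphens must be present, at indexes 3 and 7.
        if (thePhone.charAt(3) != '-' || thePhone.charAt(7) != '-') {
            return false;
        }
        // Strip only those two hyphens; a stray hyphen elsewhere fails the digit checks below.
        thePhone = thePhone.substring(0, 3) + thePhone.substring(4, 7) + thePhone.substring(8);
    }
        
    if (thePhone.length() != 10) {
        return false;
    }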
        
    if (thePhone.charAt(0) < '2' || thePhone.charAt(0) > '9') {
        return false;
    }
        
    for (int i = 1; i < thePhone.length(); i++) {
        if (thePhone.charAt(i) < '0' || thePhone.charAt(i) > '9') {
            return false;
        }
    }
        
    return true;
}

Or you could use the power of regular expressions to write something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
private static final Pattern phoneNumberPattern = Pattern.compile("^[2-9]\\d{2}(-?)\\d{3}\\1\\d{4}$");
...
private boolean judgePhone(String thePhone) {
    if (thePhone == null) {
        return false;
    }
        
    // Match against the method parameter, thePhone.
    Matcher m = phoneNumberPattern.matcher(thePhone);
    return m.matches();
}

Regular expressions are patterns used for matching text. They’re useful for validating and manipulating formatted text such as phone numbers, email addresses, and Social Security numbers, and for filtering, for example in a FilenameFilter. Search engines and the find/replace operations in text editors make extensive use of them.

We’ll explain below how we used a regular expression in this code. First, we’ll have a look at the components of regular expressions and the types of objects that use them in Java.

Types of Regular Expression Constructs

There are several classes of regular expression components. Here is a rundown, with selected examples of each.

Characters

These are simply either literal characters, or escape sequences.

In the case of literal characters, if the character has a special meaning to the matching engine, it needs to be escaped with a “\”. For example, a question mark has special meaning. If your pattern is searching for a question mark, it has to appear in the regular expression as “\?”. And because a backslash is also an escape character in Java string literals, you need to code the String that specifies the regular expression as “\\?”. With a single backslash, “\?” is not a valid Java escape sequence, and the code will not compile.

Character	Meaning
x	The character x
\\	A single backslash (but coded in Java as “\\\\”)
\t \n \r \f \a	The tab, newline, carriage return, form feed, and alert (bell) characters, respectively (coded as “\\t”, “\\n”, etc.)

Selected Character expressions
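To see the two layers of escaping at work, here is a minimal sketch. The class name EscapeDemo and the helper countMatches() are our own names, invented for illustration. Pattern.quote() is a standard method that escapes a whole literal string for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeDemo {
    // Counts the non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Java source "\\?" is the two-character regex \? : a literal question mark.
        System.out.println(countMatches("\\?", "Who? What? Why"));
        // Java source "\\\\" is the regex \\ : a single literal backslash.
        System.out.println(countMatches("\\\\", "C:\\temp\\file.txt"));
        // Pattern.quote() turns an arbitrary string into a literal pattern.
        System.out.println(countMatches(Pattern.quote("a.b?"), "a.b? axb a.b?"));
    }
}
```

Each of the three lines prints 2. In the last case, the unquoted pattern would also have treated the period and question mark as special characters.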

Character Classes

The following represent ways to specify a set of characters any of which may (or must not) match a single character in the input.

Class	Meaning
[abc]	a, b, or c
[^abc]	Any character except a, b, or c
[a-zA-Z]	Any character from a to z or from A to Z; a hyphen indicates the range between the characters before and after
[a-d[m-p]]	The union of [a-d] and [m-p]: any letter from a to d or from m to p
[a-z&&[def]]	The intersection of [a-z] with [def]: any letter from a to z which is also in [def]–essentially, d, e, or f
[a-z&&[^bc]]	Any letter from a to z that is not b or c; same as [ad-z]
[a-z&&[^m-p]]	Any letter from a to z that is not between m and p; same as [a-lq-z]

Character classes
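The union, intersection, and subtraction forms are easier to trust once you have tried them. The sketch below tests single characters against several of the classes from the table. CharClassDemo and accepts() are hypothetical names used only for this example.

```java
final class CharClassDemo {
    // Returns true if the single character c is accepted by the character class.
    static boolean accepts(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(accepts("[a-d[m-p]]", 'n'));     // union: n lies in m-p
        System.out.println(accepts("[a-d[m-p]]", 'e'));     // e lies in neither range
        System.out.println(accepts("[a-z&&[def]]", 'e'));   // intersection keeps d, e, and f
        System.out.println(accepts("[a-z&&[^bc]]", 'b'));   // subtraction removes b and c
        System.out.println(accepts("[a-z&&[^m-p]]", 'q'));  // q lies outside m-p
    }
}
```

This prints true, false, true, false, and true, in that order.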

Predefined Character Classes

The following are expressions that represent types of characters rather than specific characters.

Predefined Character Class	Meaning
. (a period)	Any character, which may or may not match a line terminator
\d \D	A digit [0-9] or a non-digit [^0-9], respectively
\s \S	A whitespace character (tab, newline, space, carriage return, form feed) or a non-whitespace character, respectively
\w \W	A word character (letter, digit, or underscore) or a non-word character, respectively

Predefined character classes
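Here is a small sketch of the predefined classes doing everyday cleanup work. The class PredefinedDemo and its three helper methods are made-up names for illustration, not part of any library.

```java
final class PredefinedDemo {
    // Removes every non-digit (\D) from the text.
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    // Trims the text and replaces each run of whitespace (\s+) with one space.
    static String collapseWhitespace(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // True if the text consists only of word characters (\w): letters, digits, underscores.
    static boolean isWordOnly(String text) {
        return text.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("(212) 555-1234"));
        System.out.println(collapseWhitespace("  a \t b\n\nc "));
        System.out.println(isWordOnly("user_42"));
        System.out.println(isWordOnly("user-42"));
    }
}
```

The output is 2125551234, then a b c, then true, then false. The hyphen in the last input is not a word character.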

POSIX Character Classes (US-ASCII only)

POSIX stands for Portable Operating System Interface, which is a standard for Unix operating systems. Here are some of the POSIX expressions used to match characters in regular expressions.

POSIX Expression	Meaning
\p{Lower} \p{Upper} \p{Alpha}	A lowercase letter, uppercase letter, or either, respectively
\p{Digit}	A digit from 0 to 9
\p{Punct}	A punctuation mark: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Blank}	A space or a tab

Selected POSIX expressions
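The following sketch shows two of the POSIX expressions in use, and also illustrates the “US-ASCII only” caveat from the heading. PosixDemo and its methods are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PosixDemo {
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern UPPER = Pattern.compile("\\p{Upper}");

    // Counts the ASCII punctuation marks in the text.
    static int countPunctuation(String text) {
        Matcher m = PUNCT.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True if the text begins with an ASCII uppercase letter.
    static boolean startsWithUpper(String text) {
        return UPPER.matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(countPunctuation("Hello, world! (ok)"));
        System.out.println(startsWithUpper("Hello"));
        System.out.println(startsWithUpper("hello"));
        // \u00C9 is an accented capital E, which lies outside US-ASCII.
        System.out.println(startsWithUpper("\u00C9lan"));
    }
}
```

This prints 4, true, false, and false. The last result is the one to remember: by default these expressions know nothing about letters beyond the ASCII range.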

Boundary Matchers

Boundary Matcher	Meaning
^	Beginning of a line
$	End of a line
\b \B	A word boundary and a non-word boundary, respectively
\A	The beginning of the input
\G	The end of the previous match
\Z	The end of the input but for the final terminator, if any
\z	The end of the input

Boundary matchers
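Boundary matchers consume no characters; they only assert something about a position. The sketch below counts matches with and without them. BoundaryDemo and count() are names we made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Counts the non-overlapping matches of the compiled pattern in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concatenate cat.";
        // Without boundaries, the "cat" inside "concatenate" counts too.
        System.out.println(count(Pattern.compile("cat"), text));
        // \b demands a word boundary on each side.
        System.out.println(count(Pattern.compile("\\bcat\\b"), text));

        String lines = "one\ntwo\nthree";
        // By default, ^ and $ refer to the whole input, which is not a single word.
        System.out.println(count(Pattern.compile("^\\w+$"), lines));
        // With MULTILINE, ^ and $ also match at each line break.
        System.out.println(count(Pattern.compile("^\\w+$", Pattern.MULTILINE), lines));
        // \A always means the start of the input, even with MULTILINE.
        System.out.println(count(Pattern.compile("\\A\\w+", Pattern.MULTILINE), lines));
    }
}
```

The five counts printed are 3, 2, 0, 3, and 1.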

Quantifiers

Each of these expressions represents a number of repetitions of whatever precedes them. They have three forms:

Greedy (the default): the longest match possible. For example, in the pattern “A+..” matched against “AAAAA”, the expression A+ matches AAA.
Lazy (aka Reluctant): the shortest match possible. For example, in the pattern “A+?.+” matched against “AAAAA”, the expression A+? matches a single A.
Possessive: Similar to Greedy, but without the provision to “back up” to accommodate subsequent pattern characters. For example, in the pattern “A++..” matched against “AAAAA”, the match fails because having matched “AAAAA”, the engine has nothing to match the two dots at the end of the pattern.

Quantifier (Greedy/Lazy/Possessive)	Function
X? X?? X?+	X, once or not at all
X* X*? X*+	X, zero or more times
X+ X+? X++	X, one or more times
X{n}	X, exactly n times; greedy, lazy, and possessive have no meaning
X{n,} X{n,}? X{n,}+	X, at least n times
X{n,m} X{n,m}? X{n,m}+	X, at least n times but no more than m times

Quantifiers
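One way to watch the three forms behave differently is to wrap the quantified part in a capturing group and ask what it captured. This is only a sketch: QuantifierDemo and capture() are hypothetical names, and the patterns are variations on “A+” followed by dots, run against the input “AAAAA”.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns what group 1 captured if the whole input matches regex, or null if it does not.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: A+ takes all five, then gives back two for the dots.
        System.out.println(capture("(A+)..", "AAAAA"));
        // Lazy: A+? takes one A and leaves the rest to .+
        System.out.println(capture("(A+?).+", "AAAAA"));
        // Possessive: A++ takes all five and gives nothing back, so the dots cannot match.
        System.out.println(capture("(A++)..", "AAAAA"));
    }
}
```

This prints AAA, then A, then null.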

Greedy vs. Possessive

The difference between greedy and possessive may be hard to understand. Here’s an example that will help.

Suppose we match the pattern “\d+8”, a greedy match of consecutive digits followed by an “8”, against the input string “012345789”. This match will first greedily match consecutive digits in the input until it runs out of them: all the way through the “9”. But there’s still another pattern character left: the “8”. So the greedy match backs up and matches only “0” through “8”. That still doesn’t work, because the next input character is the “9”, so it backs up again and matches “0” through “7”. Now the rest of the regular expression (the literal “8”) can match the “8” in the input, and the match succeeds.

Now, suppose we match “\d++8”, a possessive match, against the same string. Again, the pattern matches consecutive digits until it runs out. But a possessive match doesn’t back up to accommodate the literal “8” in the pattern. Instead, the match fails immediately.
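The walkthrough above can be reproduced in a few lines. In this sketch, BacktrackDemo and firstMatch() are our own illustrative names. It uses find() rather than matches(), because the pattern is only expected to match part of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BacktrackDemo {
    // Returns the first substring of input that regex matches, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: \d+ backs up twice so that the literal 8 can match.
        System.out.println(firstMatch("\\d+8", "012345789"));
        // Possessive: \d++ never gives characters back, so the literal 8 has nothing to match.
        System.out.println(firstMatch("\\d++8", "012345789"));
    }
}
```

The greedy pattern prints 01234578. The possessive pattern prints null, because wherever the search starts, the digits are consumed through the final “9”.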

Group Expressions

Parts of what regular expressions match can be sequestered into groups, which are specified or referenced in various ways.

Group Expression	Function
(X)	What X matches is an unnamed capturing group
(?:X)	What X matches is a non-capturing group
(?<name>X)	What X matches is a capturing group with the name name
\n	Whatever the nth capturing group matched. Group 0, which is what the entire expression matches, can’t be used as a back reference; read it with Matcher’s group() method instead
\k<name>	Whatever the named capturing group name matched

Grouping expressions
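Here is a short sketch of named groups and both kinds of back reference. The class GroupDemo, its three patterns, and its method names are all hypothetical, chosen only to illustrate the table above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Named capturing groups for the parts of a yyyy-mm-dd date.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");
    // \1 must repeat whatever word group 1 captured.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
    // \k<quote> must repeat the same quote character that opened the quoted text.
    private static final Pattern QUOTED = Pattern.compile("(?<quote>['\"])(.*?)\\k<quote>");

    static String monthOf(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group("month") : null;
    }

    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    static String firstQuoted(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(monthOf("2024-03-15"));
        System.out.println(firstDoubledWord("it was the the best"));
        System.out.println(firstQuoted("he said \"don't\" ok"));
    }
}
```

This prints 03, then the, then don't. In the last case the apostrophe does not end the quotation, because the back reference insists on the same double quote that opened it.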

Pattern

The Pattern class compiles regular expressions for subsequent use. Here are most of its methods.

Pattern Method	Function
static Pattern compile(String regex) static Pattern compile(String regex, int flags)	Compiles the given regular expression, with the given flags if specified.
int flags()	Returns the flags used in compilation.
Matcher matcher(CharSequence input)	Returns a Matcher object that will match input against the compiled regular expression. (CharSequence is an interface implemented by String, StringBuilder, and StringBuffer, among others.)
static boolean matches(String regex, CharSequence input)	Compiles regex and matches input against it. A convenience method for one-off matches, since the regular expression is recompiled on every call.
String pattern()	Returns the regular expression from which the Pattern was compiled.

Selected methods of Pattern
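As a rough sketch of how these methods fit together, the example below compiles a pattern once and reuses it. The class PatternDemo and the hex-color pattern are invented for illustration and are not part of this lesson’s project.

```java
import java.util.regex.Pattern;

final class PatternDemo {
    // Hypothetical example: a six-digit hex color such as #FFaa00, compiled once and reused.
    static final Pattern HEX_COLOR = Pattern.compile("#[0-9a-f]{6}", Pattern.CASE_INSENSITIVE);

    // matcher() accepts any CharSequence, so a String or a StringBuilder both work.
    static boolean isHexColor(CharSequence text) {
        return HEX_COLOR.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHexColor("#FFaa00"));
        System.out.println(isHexColor(new StringBuilder("#12345")));
        // The static convenience method compiles the regex each time and takes no flags.
        System.out.println(Pattern.matches("#[0-9a-f]{6}", "#FFAA00"));
        // pattern() returns the source text of the regular expression.
        System.out.println(HEX_COLOR.pattern());
        // flags() reports the flags given to compile().
        System.out.println((HEX_COLOR.flags() & Pattern.CASE_INSENSITIVE) != 0);
    }
}
```

This prints true, false, false, #[0-9a-f]{6}, and true. The third result is false because the one-off call was made without the case-insensitive flag.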

Here are some of the flags you can pass to the compile() method. Each is a constant from the Pattern class. The flags are bit masks, so to apply multiple flags, combine them with the bitwise OR operator like so:

Pattern myPattern = Pattern.compile(".", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

Pattern Flag	Embedded Flag Expression	Function
Pattern.CASE_INSENSITIVE	(?i)	Enables case-insensitive matching.
Pattern.DOTALL	(?s)	By default, a period matches any character except end of line. This flag enables a period to match end of line as well.
Pattern.MULTILINE	(?m)	By default, ^ and $ match at the beginning and end of an entire character sequence. Specifying this flag also makes ^ match just after an end of line, and $ match just before an end of line.
Pattern.UNIX_LINES	(?d)	Usually, both \r and \n count as end of line for ., ^, and $. Adding this flag makes only \n count as end of line.

Selected Pattern flag values

In each case, starting the regular expression with one or more of the embedded flag expressions in the table above is equivalent to coding the corresponding flag value.
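The sketch below compares a pattern compiled with flag constants against the same pattern written with embedded flag expressions. FlagDemo and its three constants are hypothetical names used only here; the flag constants are combined with the bitwise OR operator.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    // No flags: case-sensitive, and the period will not match a newline.
    static final Pattern PLAIN = Pattern.compile("hello.world");
    // Flags passed to compile().
    static final Pattern BY_FLAGS =
            Pattern.compile("hello.world", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // The same two flags embedded at the start of the expression.
    static final Pattern BY_EMBEDDED = Pattern.compile("(?is)hello.world");

    public static void main(String[] args) {
        String input = "HELLO\nWORLD";
        System.out.println(PLAIN.matcher(input).matches());
        System.out.println(BY_FLAGS.matcher(input).matches());
        System.out.println(BY_EMBEDDED.matcher(input).matches());
    }
}
```

This prints false, true, and true.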

Explaining Our Phone Number Test

So where does the code at the top of this lesson come from?

We started by creating a Pattern, which is a compiled regular expression that can be used later on to create a Matcher object, which we’ll see more about later.

Our Pattern only needs to be compiled once, and will be used by every instance of our class, so we’ve made it static and final.

As for the value of the expression, first understand that although we’ve coded it as
"^[2-9]\\d{2}(-?)\\d{3}\\1\\d{4}$"
the regular expression engine sees it as
"^[2-9]\d{2}(-?)\d{3}\1\d{4}$"

Backslashes introduce predefined character classes (such as \d) and other special constructs in regular expressions. But remember that in Java string literals, backslashes also introduce special characters (e.g. \t for tab). If you want a backslash in the resulting String, you have to code it as “\\”.

In both Java and in regular expressions, if you want a “\” to mean an actual backslash and not the preface to something else, you escape it with another backslash: “\\”. But in the Java String that contains that expression, each of those backslashes needs its own escape character, so you code “\\\\” to have a regular expression that matches one “\”. By now you should be thoroughly confused.

Now let’s examine our regular expression piece by piece, and how it relates to a string to be compared to it.

Regular Expression Component	Function
^	The caret at the start of the expression indicates that the pattern must match at the start of the string. If the pattern doesn’t match until a subsequent character, there is no match.
[2-9]	For the first digit of the area code, the values in brackets indicate that the next single character in the string must be between 2 and 9. The characters in the brackets need not express a range of values; we could also have coded “[23456789]”.
\d{2}	\d indicates a single digit. The {2} indicates exactly two repetitions of “\d”–in other words, two digits. Equivalent to “\d\d”. This is the second and third digits of the area code.
(-?)	The parentheses indicate that whatever comes between them is a capturing group. The question mark indicates that whatever precedes it should occur 0 or 1 times. The hyphen is a literal character. So this is a capturing group that matches either 0 or 1 hyphens.
\d{3}	This indicates matches on three consecutive digits: the phone exchange.
\1	This indicates a match on whatever capturing group number 1, above, found: either no hyphen or one hyphen.
\d{4}	Four consecutive digits: the end of the phone number.
$	This indicates that the end of the pattern must coincide with the end of the string; if there is more string left over, it doesn’t match the pattern.

Our phone number pattern, explained
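To check the explanation against real inputs, here is a self-contained sketch that wraps the same regular expression in a small class. The class name PhoneCheck and the sample numbers are ours; the pattern is the one from the top of the lesson.

```java
import java.util.regex.Pattern;

final class PhoneCheck {
    private static final Pattern PHONE =
            Pattern.compile("^[2-9]\\d{2}(-?)\\d{3}\\1\\d{4}$");

    // True if thePhone is ten digits, starts with 2-9, and has both hyphens or neither.
    static boolean judgePhone(String thePhone) {
        return thePhone != null && PHONE.matcher(thePhone).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "212-555-1234",  // both hyphens
            "2125551234",    // no hyphens
            "212-5551234",   // first hyphen only
            "212555-1234",   // second hyphen only
            "112-555-1234",  // first digit out of range
            "212-555-123"    // too short
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + judgePhone(sample));
        }
    }
}
```

Only the first two samples come back true. The two single-hyphen samples fail because \1 has to match exactly what the (-?) group captured.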

Regular Expressions in String Methods

There are five methods in the String class in which you can use regular expressions.

Method	Function
boolean matches(String regex)	Returns true if the String instance matches the regular expression regex, and false otherwise.
String replaceAll(String regex, String replacement) String replaceFirst(String regex, String replacement)	Returns a String that replaces all (replaceAll) or the first (replaceFirst) match(es) of the regular expression regex with replacement.
String[] split(String regex) String[] split(String regex, int limit)	Splits the String around matches on the regular expression regex. If a positive limit is used, the regular expression will be applied no more than limit – 1 times, so the resulting array has at most limit elements.

You can’t specify flags on the regular expression as when you compile a Pattern, but you can get the same effects by beginning the regular expression with the desired embedded flag expressions from the table of Pattern flag values above.

Each time a regular expression is used in a method call, it needs to be compiled. These methods are fine if the regular expression is going to be used just once. Otherwise, it’s more efficient to compile a regular expression with the Pattern class and reuse the compiled expression with Matcher.
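For one-off use, the String methods are convenient. The following sketch tries each of them; StringRegexDemo and its helper methods are hypothetical names for this example only.

```java
import java.util.Arrays;

final class StringRegexDemo {
    // Splits a comma-separated line, discarding any whitespace around the commas.
    static String[] fields(String line) {
        return line.split("\\s*,\\s*");
    }

    // Splits off only the first field; the remainder is left untouched.
    static String[] firstAndRest(String line) {
        return line.split("\\s*,\\s*", 2);
    }

    // Replaces every run of digits with a single # character.
    static String maskDigits(String text) {
        return text.replaceAll("\\d+", "#");
    }

    public static void main(String[] args) {
        System.out.println("2024-03-15".matches("\\d{4}-\\d{2}-\\d{2}"));
        // An embedded flag expression stands in for Pattern.CASE_INSENSITIVE.
        System.out.println("HELLO".matches("(?i)hello"));
        System.out.println(maskDigits("a1b22c333"));
        System.out.println("a1b22c333".replaceFirst("\\d+", "#"));
        System.out.println(Arrays.toString(fields("a, b,c ,d")));
        System.out.println(Arrays.toString(firstAndRest("a, b,c ,d")));
    }
}
```

The output is true, true, a#b#c#, a#b22c333, [a, b, c, d], and [a, b,c ,d]. With a limit of 2, the expression is applied only once, so the second element keeps its commas.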

Matcher

A Matcher is an engine that performs match operations on a character sequence by interpreting a Pattern.
From the API documentation

In its simplest form, the Matcher tells us whether the entire character string passed to Pattern’s matcher() method matches the pattern.

But Matcher has some concepts it’s useful to understand:

Region: a subsequence of the character string being examined at any given time. By default, it’s the entire string.
Transparency and opacity: If the Matcher uses transparent bounds, lookahead, lookbehind, and boundary-matching constructs can see text beyond the boundaries of the current region. Otherwise, they cannot, and the Matcher is using opaque bounds; this is the default.
Groups: as seen above, parentheses can be used to define parts of the regular expression so what they match can be used later and/or delimit parts of the expression. Groups can be capturing or non-capturing, and may be assigned names, as described above.

Now let’s look at some of the methods of Matcher.

Matcher Method	Function
boolean matches()	Returns true if the entire current region matches the pattern. If the operation succeeds, more information is available via the start(), end(), and group() methods.
boolean lookingAt()	Returns true if the pattern matches the input starting at the beginning of the region. Unlike matches(), it does not require the entire region to match. If the operation succeeds, more information is available via the start(), end(), and group() methods.
Matcher region(int start, int end)	Sets the region to search, from index start (inclusive) to index end (exclusive), and returns this Matcher.
int regionStart() int regionEnd()	Returns the boundaries of the current region.
Matcher useTransparentBounds(boolean b)	Sets transparent bounds if b is true, and opaque bounds if b is false, and returns this Matcher.
boolean find()	Attempts to find the next subsequence of the input that matches the pattern. If the Matcher has been reset or there has not been a previous successful find(), the operation starts at the beginning of the current region. Otherwise, it starts after the previous successful find() operation ended. Returns true if the operation succeeds and false otherwise. If the operation succeeds, more information is available via the start(), end(), and group() methods.
boolean find(int start)	Resets this Matcher and looks for the next subsequence of the input sequence that matches the pattern, starting at the specified start index. Otherwise behaves like find() above.
boolean hitEnd()	Returns true if the end of input was reached by the last matching operation.
String group() String group(int group) String group(String name)	Returns the entire last subsequence matched by the Matcher, the subsequence captured by the indicated capturing group, or the subsequence captured by the group with the specified name, respectively, after a successful match. Capturing groups are numbered successively, starting at 1; group 0 is the entire match.
int groupCount()	Returns the number of capturing groups in the pattern. Group 0, the entire match, is not included in the count.
Matcher reset() Matcher reset(CharSequence input)	Discards all position information. If input is specified, the Pattern is matched against the new input.
int start() int start(int group) int start(String name)	Returns the start index of the matching subsequence, the indicated capturing group, or the named capturing group respectively, after a successful match.
int end() int end(int group) int end(String name)	Returns the end index (the offset just past the last character) of the matching subsequence, the indicated capturing group, or the named capturing group respectively, after a successful match.

Selected methods of Matcher
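To tie several of these methods together, here is a minimal sketch that walks through an input with find(), and then contrasts matches(), lookingAt(), region(), and reset(). MatcherDemo and findAll() are hypothetical names, and the digit pattern is just a convenient example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Lists every run of digits as "start-end:text", using find(), start(), end(), and group().
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            hits.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ab12cd345"));

        Matcher m = DIGITS.matcher("123abc");
        System.out.println(m.matches());    // the whole region must be digits
        System.out.println(m.lookingAt());  // only the start of the region must match

        // Restrict the region to indexes 3 through 5, the "123" of "abc123".
        Matcher r = DIGITS.matcher("abc123").region(3, 6);
        System.out.println(r.matches());

        // reset() with new input reuses the same Matcher on a different string.
        System.out.println(r.reset("42").matches());
    }
}
```

This prints [2-4:12, 6-9:345], then false, true, true, and true. Notice that each end index is one past the last matched character.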

Exercise

Now that you’ve seen an example of how to use regular expressions, here’s your next assignment.

Back in your project, Lesson-20-Example-01, make a copy of BroadwayReaderDOMSingles.java and call it BroadwayReaderDOMSinglesRegex.java.
Modify this class’s extractShowTag() method to use a regular expression instead of the indexOf() method on the StringBuilder.

You can download a solution here. The Java source is inside the ZIP file.

Good luck!

What You Need to Know

Regular expressions are patterns against which you can match text.
Regular expressions are useful for breaking text apart, validating text formats, and filtering.

Next topic: Working With Databases