In Java applications that do text searching and manipulation, the StringTokenizer and String classes are used heavily. This can often result in complex code and lead to a maintenance nightmare.
Often such Java applications are looking for an occurrence of a particular character or token in a String, and then trying to find a string surrounding it, validating the extracted String. A simple example is validation of a web site URL or an email address. To validate an email address, we could check for an occurrence of '@', followed by one or more '.'. This logic might be implemented in Java as shown below.
JDK 1.4 supports regular expressions in the java.util.regex package. Using this package and its supporting classes makes string search and manipulation much easier. It helps reduce the development effort and, at the same time, significantly improves the maintainability of the code. Since classes in this package are a standard part of core Java, they don't have to be distributed separately, and can be assumed to be present. We will see at the end of the article how regular expressions simplify the implementation of email validation.
String str = "administrator@admin.com";
int indexOfAtChar = str.indexOf("@");
if (indexOfAtChar > 0) {
    // Look for a '.' somewhere after the '@'.
    int indexOfDotChar = str.indexOf(".", indexOfAtChar);
    if (indexOfDotChar > 0) {
        System.out.println("Valid Email Address.");
    } else {
        System.out.println("Invalid Email Address- "
                + "Missing character '.' after '@'.");
    }
} else {
    System.out.println("Invalid Email Address- "
            + "Missing character '@'.");
}
This produces the output:
Valid Email Address.
Regular expressions have been of interest in the software industry for a number of years and have been heavily used.
Many programming languages and operating system tools support regular expressions.
This article explains the benefits of writing regular expressions using the java.util.regex package, and how to use its key components.
First of all, let's define a regular expression in simple terms: a regular expression is a pattern, a template, to be matched against a string.
Users of a command-line operating system like DOS or Unix often use a directory listing command to find a list of files in a directory. On DOS, this would be:
dir *.txt
And on Unix, it would be:
ls *.txt
Here "*.txt" is a command parameter to display the list of files with file extension 'txt', irrespective of file name.
Now, say we want to see a list of files where the filename begins with 'a'; then the DOS command will be
dir a*.*
and the Unix command will be
ls a*.*
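Here "a*.*" means a filename starting with 'a', followed by any number of characters, followed by the character '.', followed by any file extension.
These wildcard patterns are a simple relative of regular expressions: the idea of matching text against a pattern is the same, although the syntax of regular expressions is richer and differs in its details.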
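To relate the wildcard to the regular expression syntax discussed in the rest of the article, here is a minimal sketch. It assumes that "a*.*", read the way the text describes it, corresponds to the regular expression a.*\..* and the class name WildcardDemo and the file names are made up for illustration.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Assumed regex reading of the wildcard "a*.*":
    // 'a', any characters, a literal '.', then any characters.
    static final Pattern STARTS_WITH_A = Pattern.compile("a.*\\..*");

    static boolean matchesWildcard(String fileName) {
        // matches() requires the whole file name to fit the pattern.
        return STARTS_WITH_A.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWildcard("apple.txt"));  // true
        System.out.println(matchesWildcard("banana.txt")); // false
    }
}
```

Note that the wildcard '*' alone means "any characters", whereas in a regular expression the same idea is written as '.*'.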
Before we jump into how to write regular expression code using the java.util.regex package, let's first have a brief look at regular expression syntax in general.
In its simplest form, a regular expression is just a word or phrase for which to search. For example, the regular expression 'John' would match any string with the string 'John' in it. Strings like 'John', 'AJohn', and 'Decker John' all would match. Matching is case-sensitive by default, so a string like 'Ajohn' would not.
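A literal search of this kind might look as follows in Java. This is only a sketch: the class name LiteralSearchDemo and the sample strings are illustrative, and it uses Matcher.find(), which looks for the pattern anywhere in the input, rather than matches(), which requires the whole input to match.

```java
import java.util.regex.Pattern;

class LiteralSearchDemo {
    static final Pattern JOHN = Pattern.compile("John");

    // True if "John" occurs anywhere in the text.
    static boolean containsJohn(String text) {
        return JOHN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsJohn("Decker John")); // true
        System.out.println(containsJohn("AJohn"));       // true
        System.out.println(containsJohn("john smith"));  // false: case-sensitive
        // matches() must consume the entire input, so this prints false.
        System.out.println(JOHN.matcher("Decker John").matches());
    }
}
```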
In regular expressions, some characters serve special purposes. These are called metacharacters. For instance, '.' matches any single character except a newline, and the quantifier '*' matches zero or more occurrences of the element that precedes it, so '.*' matches any sequence of characters. Hence, the regular expression '.ine' matches any four-character string that ends with 'ine', including 'line' and 'nine'.
But what if you want to search for a string containing a period, say, a reference to pi? The following regular expression would not work:
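The behavior of '.' and '*' can be checked with a short sketch. The class name MetacharDemo and the pattern 'go*gle' are illustrative choices, not taken from the article.

```java
import java.util.regex.Pattern;

class MetacharDemo {
    // '.' stands for exactly one character (other than a newline).
    static final Pattern INE = Pattern.compile(".ine");
    // 'o*' stands for zero or more occurrences of 'o'.
    static final Pattern GOOGLE = Pattern.compile("go*gle");

    static boolean isFourCharIne(String word) {
        return INE.matcher(word).matches();
    }

    static boolean isGoogleLike(String word) {
        return GOOGLE.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFourCharIne("line"));  // true
        System.out.println(isFourCharIne("shine")); // false: five characters
        System.out.println(isGoogleLike("ggle"));   // true: zero 'o'
        System.out.println(isGoogleLike("google")); // true: two 'o'
    }
}
```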
3.141592
This would indeed match "3.141592", but it would also match "3x141592" and "38141592". To get around this, we can use another metacharacter, the backslash (\). The backslash indicates that the character immediately to its right is to be taken literally. Thus, to search for the string "3.141592", we would use:
3\.141592
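When this expression is written in Java source code, the backslash itself has to be escaped inside the string literal, so the pattern is typed as "3\\.141592". The following sketch (the class name EscapeDemo is made up for illustration) compares the escaped and unescaped forms:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Unescaped: '.' matches any single character.
    static final Pattern UNESCAPED = Pattern.compile("3.141592");
    // Escaped: "\\." in Java source is the regex \. (a literal period).
    static final Pattern ESCAPED = Pattern.compile("3\\.141592");

    static boolean matchesUnescaped(String s) {
        return UNESCAPED.matcher(s).matches();
    }

    static boolean matchesEscaped(String s) {
        return ESCAPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesUnescaped("3x141592")); // true
        System.out.println(matchesEscaped("3x141592"));   // false
        System.out.println(matchesEscaped("3.141592"));   // true
    }
}
```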
The entire regular expression support is contained in the package java.util.regex, which is made up of the following two main classes:

- java.util.regex.Pattern
- java.util.regex.Matcher

A typical implementation of text searching and/or manipulation using the java.util.regex package is divided into three steps:

1. Compile the regular expression into a Pattern.
2. Use the Pattern object to create a Matcher object.
3. Use the Matcher object to search and/or manipulate the character sequence.

A typical invocation sequence might be like the following example, which uses a regular expression to match 'cats', followed by any number of characters, followed by 'dogs':
Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
boolean flag=matcher.matches();
We will look at each of the above methods in detail in the next few sections.
The Pattern class provides an overloaded static factory method compile() to create Pattern instances.
- static Pattern compile(String regex): Compiles the given regular expression into a Pattern.
- static Pattern compile(String regex, int flags): Compiles the given regular expression into a Pattern with the given flags.

In the java.util.regex package, text matching is case-sensitive by default, and case-insensitive matching, when enabled, assumes that only US-ASCII characters are being matched. To modify this default behavior, you can pass flags to the compile() method. All flags are static int members of Pattern. To combine behaviors, you can bitwise OR flags together with the "|" operator.
| Flag | Purpose |
|---|---|
| CANON_EQ | Enables canonical equivalence in the search. |
| CASE_INSENSITIVE | Enables case-insensitive matching. |
| COMMENTS | Permits white space and comments in the pattern. If this flag is set, white space and embedded comments starting with # are ignored. |
| DOTALL | By default, the metacharacter '.' does not match a line terminator; with this flag, it matches any character, including a line terminator. |
| MULTILINE | Enables multiline searches. With this flag, the metacharacters '^' and '$' match just after and just before a line terminator, respectively, as well as at the beginning and end of the input sequence. |
| UNICODE_CASE | When specified along with the CASE_INSENSITIVE flag, makes case-insensitive matching consistent with the Unicode Standard. |
| UNIX_LINES | Unix lines mode. Only the '\n' line terminator is recognized in the behavior of '.', '^', and '$'. |
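As a minimal sketch of how flags are passed and combined, the example below reuses the 'cats.*dogs' pattern from earlier. The class name FlagsDemo and the helper matchesWithFlags are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Compiles the regex with the given flags and tests the whole input.
    static boolean matchesWithFlags(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "cats.*dogs";
        // No flags: matching is case-sensitive.
        System.out.println(matchesWithFlags(regex, "CATS and DOGS", 0)); // false
        System.out.println(matchesWithFlags(regex, "CATS and DOGS",
                Pattern.CASE_INSENSITIVE)); // true
        // Without DOTALL, '.' does not match the line terminator.
        System.out.println(matchesWithFlags(regex, "CATS\nand DOGS",
                Pattern.CASE_INSENSITIVE)); // false
        // Flags are combined with the bitwise OR operator.
        System.out.println(matchesWithFlags(regex, "CATS\nand DOGS",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL)); // true
    }
}
```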
Once we have a compiled Pattern, we call matcher(charsequence) on it to create a Matcher.
- Matcher matcher(CharSequence input): Creates a Matcher that will match the given input against this pattern.

java.lang.CharSequence is an interface that represents a readable sequence of characters. The String, StringBuffer, and CharBuffer classes implement this interface. Typically, we pass Strings to the matcher method:
Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
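Because matcher() accepts any CharSequence, the same compiled Pattern can be applied to other implementations as well. The sketch below, with the illustrative class name CharSequenceDemo, passes a String, a StringBuffer, and a CharBuffer to the same helper:

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

class CharSequenceDemo {
    static final Pattern PAT = Pattern.compile("cats.*dogs");

    // Accepts any CharSequence implementation, not just String.
    static boolean matches(CharSequence input) {
        return PAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("cats and dogs"));                   // true
        System.out.println(matches(new StringBuffer("cats and dogs"))); // true
        System.out.println(matches(CharBuffer.wrap("cats and dogs")));  // true
        System.out.println(matches("dogs and cats"));                   // false
    }
}
```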