Wednesday, December 29, 2010

Regular Expressions in Groovy (via Java)

The people who attended my presentations at RMOUG Training Days 2010 asked several good questions. One question asked in the Groovy presentation, and one I really wish I had included a slide on, was "Does Groovy support regular expressions?" The focus of my presentation was on using Groovy for scripting, so this was a very natural and relevant question. I answered that Groovy uses Java's regular expression support and adds a few nifty features of its own to make regular expressions easier to apply. In this blog post, I briefly summarize what I wish I had dedicated a slide to in that presentation. I do plan to cover Groovy regular expression support, with a slide, in my RMOUG Training Days 2011 presentation "Groovier Java Scripting."

One of my favorite software development quotes concerns regular expressions (see Source of the Famous 'Now You Have Two Problems' Quote for extensive background on this quote):
Some people, when confronted with a problem, think
"I know, I'll use regular expressions." Now they have two problems.

I personally have a sort of love/hate relationship with regular expressions. Regular expressions almost always allow for more concise code, but that does not always translate to "easier" or "more readable" code. There are times when I feel the regular expression solution is most concise, most elegant, and most readable and then there are the other times ...

There are a few things that make use of regular expressions seem difficult at times. If I use regular expressions regularly, I find myself increasingly fond of them. When I only use regular expressions sporadically, I don't find them as easy. Lack of familiarity with regular expression syntax due to infrequent use often makes them more difficult to write and read (I typically find them easier to write than read). Another cause of the difficulty in reading and writing regular expressions stems from their very advantage: they can be almost too concise at times (especially when not used often). Finally, even when I get comfortable with regular expressions, it can be a minor annoyance to realize again (often the hard way) that there are multiple dialects of regular expressions and the different dialects have differing syntaxes (Java's regular expression dialect is often said to be Perl-like). These differences make regular expressions less regular.

SIDE NOTE: One of the things I like about the book Regular Expressions Cookbook is that it lists which dialects (it calls them "flavors") of regular expressions work for each recipe (example) in the book. For example, Recipe 2.18 ("Add Comments to a Regular Expression") states that this particular recipe applies to the regular expression "flavors" of Java, Perl, Perl Compatible Regular Expressions, .NET, Python, and Ruby, but does not apply to the JavaScript flavor of regular expressions.

For those of us who used Java along with other languages (or Unix or Linux or vi) that supported regular expressions, it was welcome news when it was announced that Java would add regular expression support with JDK 1.4.

Although the addition of regular expressions to Java was welcome, Java's regular expression support is not always the easiest to apply due to language requirements of the Java Language Specification. In other words, Java language limitations add another layer of challenge to using regular expressions. Groovy goes a long way toward reducing this extra complexity of regular expressions in Java.

The Java Tutorials' lesson on regular expressions introduces Java's support for regular expressions via the java.util.regex package and highlights the two classes Pattern and Matcher.

The Java Pattern is described in its Javadoc documentation as "a compiled representation of a regular expression." The documentation further explains that "a regular expression, specified as a string, must first be compiled into an instance of this class [Pattern]." A typical approach for accessing a compiled Pattern instance is to use Pattern p = Pattern.compile(""); with the relevant regular expression specified within the pair of double quotes. Groovy provides a shortcut here with the ~ symbol. Prefixing a String literal in Groovy with ~ creates a Pattern instance from that String. This means that Pattern p = Pattern.compile("a*b"); can be written in Groovy as def p = ~"a*b" (example used in Javadoc for Pattern).
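To make the Java side of that comparison concrete, here is a minimal sketch of the explicit compile step. The class name PatternCompileSketch, the method matchesEntirely, and the input strings are my own illustrative choices; only the "a*b" expression comes from the Pattern Javadoc.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternCompileSketch
{
   /** Compiles the regular expression and tests whether the entire input matches it. */
   static boolean matchesEntirely(final String regex, final String input)
   {
      // Explicit compilation in Java; the Groovy shortcut would be: def p = ~"a*b"
      final Pattern p = Pattern.compile(regex);
      final Matcher m = p.matcher(input);
      return m.matches();
   }

   public static void main(final String[] arguments)
   {
      System.out.println(matchesEntirely("a*b", "aaaab"));  // true
      System.out.println(matchesEntirely("a*b", "aaaac"));  // false
   }
}
```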

The availability of ~ is a speck of syntactic sugar, but Groovy provides more regular expression sweetness than this. One of the least appealing parts of Java's regular expression support is the handling of backslashes within regular expressions. This is really more of a problem of backslash treatment in Java Strings. Groovy makes this much nicer to use by providing the ability to specify the regular expression used in a Pattern with "slashy syntax." This allows regular expressions to appear more natural than they do when the String must be made to comply with Java expectations.

I use the example provided by Regular Expressions Cookbook Recipe 3.1 ("Literal Regular Expressions in Source Code") to illustrate the advantages of Groovy in Pattern representation of a regular expression. This recipe provides the literal regular expression string [$"'\n\d/\\] for its example and explains what this represents: "This regular expression consists of a single character class that matches a dollar sign, a double quote, a single quote, a line feed, any digit between 0 and 9, a forward slash, or a backslash." The only escaping needed within the regular expression itself is the doubling of the backslash so that it represents a single literal backslash.

As the Regular Expressions Cookbook recipe explains, this regular expression is represented in Java as "[$\"'\n\\d/\\\\]". Ignoring the double quotes on either side of the Java representation, it is still clear that Java String treatment forces the regular expression String [$"'\n\d/\\] to be represented in Java as [$\"'\n\\d/\\\\]. Note that the Java representation must add a backslash in front of the double quote that is part of the regular expression to escape it, must do the same thing for the \d that represents numeric digit, and then must provide four consecutive backslashes at the end to appropriately escape and represent the two that are actually meant for the regular expression. Regular expressions can be cryptic anyway and even the slightest typo can change everything, so the extra syntax needed for the Java version is more that can go wrong.

I demonstrate this example more completely with the following simple Java code.

// Regular Expression: [$"'\n\d/\\]
//    For Java, must escape the double quote, the \d, and the \\
final String regExCookbook31RegExString = "[$\"'\n\\d/\\\\]";
final Pattern regExCookbook31Pattern = Pattern.compile(regExCookbook31RegExString);
out.println(
     "The original regular expression is: "
   + regExCookbook31Pattern.pattern());

Running the above code leads to the output demonstrated in the next screen snapshot.


Before looking at how Groovy improves on the handling of the regular expression, I round out the Java example that was started above to also include an example of Java's Matcher class in action. The Matcher is obtained from the Pattern instance and supports three types of matching: (1) matching the entire provided sequence against the regular expression pattern [Matcher.matches()], (2) matching at least the beginning portion of the provided sequence against the regular expression pattern [Matcher.lookingAt()], and (3) iterating over the provided sequence looking for one or more pattern matches [Matcher.find()].

The third approach is demonstrated in the Java Tutorial on regular expressions. It provides a "test harness" that I have adapted here:

package dustin.examples;

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

import static java.lang.System.err;

/**
 * Regular expression test harness slightly adapted from example provided in
 * Java Tutorial on regular expressions. The location of the original is
 * http://download.oracle.com/javase/tutorial/essential/regex/test_harness.html.
 */
public class RegExTestHarness
{
   /**
    * Simple executable method that provides demonstration of Java's regular
    * expression support with java.util.regex package and its classes Pattern
    * and Matcher.  Adapted from Java Tutorial regular expressions test harness.
    *
    * @param arguments Command-line arguments: none expected.
    */
   public static void main(final String[] arguments)
   {
      final Console console = System.console();
      if (console == null)
      {
         err.println("No console available; this application requires it.");
         System.exit(-1);
      }
      String regExInput;
      do
      {
         regExInput = console.readLine("%nEnter your regular expression: ");
         final Pattern pattern = Pattern.compile(regExInput);

         final String searchStringInput =
            console.readLine("Enter input string to search with regular expression: ");
         final Matcher matcher = pattern.matcher(searchStringInput);

         boolean found = false;
         while (matcher.find())
         {
            console.format(
                 "Text \"%s\" located starting at "
               + "index %d and ending at index %d.%n",
               matcher.group(), matcher.start(), matcher.end());
            found = true;
         }
         if (!found)
         {
            console.format("No match found.%n");
         }
      } while (!regExInput.isEmpty());
   }
}

The next screen snapshot demonstrates the output from this adapted Java-based regular expression test harness on the regular expression used above.


This example demonstrates the Matcher.find() method in action: it iterates over the provided input String and returns true whenever it evaluates a character that satisfies the single character regular expression. The Java code then uses other methods on Matcher (Matcher.group(), Matcher.start(), and Matcher.end()) to provide more details on the match.
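For readers without a console handy, the same find() loop can be sketched non-interactively. This is only an illustration: the class FindAllMatchesSketch, the method describeMatches, and the "text@start-end" format are hypothetical names and conventions of mine, and the input is the sample string used later in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatchesSketch
{
   /** Returns one "text@start-end" entry for each match that find() locates. */
   static List<String> describeMatches(final String regex, final String input)
   {
      final Matcher matcher = Pattern.compile(regex).matcher(input);
      final List<String> descriptions = new ArrayList<>();
      while (matcher.find())
      {
         descriptions.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
      }
      return descriptions;
   }

   public static void main(final String[] arguments)
   {
      // Java-escaped form of the Recipe 3.1 regular expression [$"'\n\d/\\]
      final String regex = "[$\"'\n\\d/\\\\]";
      final String input = "regular\\expressions$can_be_2tons\"of'fun.";
      // Prints: [\@7-8, $@19-20, 2@27-28, "@32-33, '@35-36]
      System.out.println(describeMatches(regex, input));
   }
}
```

Each entry pairs the matched character with its start index (inclusive) and end index (exclusive), which is the same information the test harness prints.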

The Matcher.find() method is the best choice when one wants to find and act upon any regular expression matches in a given expression. However, if one is only interested in whether the given expression begins with a match to the regular expression or whether the entire expression is an exact match of the regular expression, then Matcher.lookingAt() or Matcher.matches(), respectively, is likely to be preferred. The three methods in the next code listing demonstrate these two Matcher methods and provide another example of Matcher.find().

/**
 * Regular Expression: [$"'\n\d/\\]
 * For Java, must escape the double quote, the \d, and the \\
 */
private final static String REG_EX_COOKBOOK_31_REGEX_STRING = "[$\"'\n\\d/\\\\]";

/**
 * Superset String to be used in various Matcher demonstrations, but its key
 * differentiating characteristic is that it does NOT begin with a match for
 * the regular expression (does not begin with $, ", ', new line, numeric
 * digit, forward slash, or backslash).
 */
private final static String SUPERSET_STRING = "regular\\expressions$can_be_2tons\"of'fun.";

/**
 * String that should always be a match (even exact) for Regular Expressions
 * Cookbook Recipe 3.1 regular expression.
 */
private final static String EXACT_MATCH_STRING = "$";

/**
 * Superset String set up to start with a match for the Recipe 3.1 regular
 * expression pattern from Regular Expressions Cookbook.
 */
private final static String SUPERSET_STRING_STARTING_WITH_MATCH =
     EXACT_MATCH_STRING + SUPERSET_STRING;

/**
 * Demonstrate Matcher.matches().
 */
private static void demonstrateMatches()
{
   final Pattern regExCookbook31Pattern = obtainPatternForRegularExpressionsCookbookRecipe3_1();
   final String formatString = "%n%s%s%san EXACT match for regular expression %s";

   final Matcher matcher1 = regExCookbook31Pattern.matcher(SUPERSET_STRING);
   final boolean exactMatch1 = matcher1.matches();
   out.println(
      String.format(
         formatString,
         exactMatch1 ? "YES!  " : "NO :( ",
         SUPERSET_STRING,
         exactMatch1 ? " IS " : " is NOT ",
         regExCookbook31Pattern.pattern()));

   final Matcher matcher2 = regExCookbook31Pattern.matcher(EXACT_MATCH_STRING);
   final boolean exactMatch2 = matcher2.matches();
   out.println(
      String.format(
         formatString,
         exactMatch2 ? "YES!  " : "NO :( ",
         EXACT_MATCH_STRING,
         exactMatch2 ? " IS " : " is NOT ",
         regExCookbook31Pattern.pattern()));

   final Matcher matcher3 = regExCookbook31Pattern.matcher(SUPERSET_STRING_STARTING_WITH_MATCH);
   final boolean exactMatch3 = matcher3.matches();
   out.println(
      String.format(
         formatString,
         exactMatch3 ? "YES!  " : "NO :( ",
         SUPERSET_STRING_STARTING_WITH_MATCH,
         exactMatch3 ? " IS " : " is NOT ",
         regExCookbook31Pattern.pattern()));
}

/**
 * Demonstrate Matcher.lookingAt().
 */
private static void demonstrateLookingAt()
{
   final Pattern regExCookbook31Pattern = obtainPatternForRegularExpressionsCookbookRecipe3_1();
   final String formatString = "%n%s%s%sbegin with a match for regular expression %s";

   final Matcher matcher1 = regExCookbook31Pattern.matcher(SUPERSET_STRING);
   final boolean portionMatch1 = matcher1.lookingAt();
   out.println(
      String.format(
         formatString,
         portionMatch1 ? "YES!  " : "NO :( ",
         SUPERSET_STRING,
         portionMatch1 ? " DOES " : " does NOT ",
         regExCookbook31Pattern.pattern()));

   final Matcher matcher2 = regExCookbook31Pattern.matcher(EXACT_MATCH_STRING);
   final boolean portionMatch2 = matcher2.lookingAt();
   out.println(
      String.format(
         formatString,
         portionMatch2 ? "YES!  " : "NO :( ",
         EXACT_MATCH_STRING,
         portionMatch2 ? " DOES " : " does NOT ",
         regExCookbook31Pattern.pattern()));

   final Matcher matcher3 = regExCookbook31Pattern.matcher(SUPERSET_STRING_STARTING_WITH_MATCH);
   final boolean portionMatch3 = matcher3.lookingAt();
   out.println(
      String.format(
         formatString,
         portionMatch3 ? "YES!  " : "NO :( ",
         SUPERSET_STRING_STARTING_WITH_MATCH,
         portionMatch3 ? " DOES " : " does NOT ",
         regExCookbook31Pattern.pattern()));
}

/**
 * Apply Matcher.find() to determine the number of matches in the provided
 * sequences to the provided Pattern.
 */
private static void demonstrateFindToCountMatches()
{
   final Pattern regExCookbook31Pattern = obtainPatternForRegularExpressionsCookbookRecipe3_1();
   final String formatString = "%n%s contains %d matches for regular expression %s";

   final Matcher matcher1 = regExCookbook31Pattern.matcher(SUPERSET_STRING);
   int numMatches1 = 0;
   while (matcher1.find())
   {
      numMatches1++;
   }
   out.println(
      String.format(
         formatString,
         SUPERSET_STRING,
         numMatches1,
         regExCookbook31Pattern.pattern()));

   final Matcher matcher2 = regExCookbook31Pattern.matcher(EXACT_MATCH_STRING);
   int numMatches2 = 0;
   while (matcher2.find())
   {
      numMatches2++;
   }
   out.println(
      String.format(
         formatString,
         EXACT_MATCH_STRING,
         numMatches2,
         regExCookbook31Pattern.pattern()));

   final Matcher matcher3 = regExCookbook31Pattern.matcher(SUPERSET_STRING_STARTING_WITH_MATCH);
   int numMatches3 = 0;
   while (matcher3.find())
   {
      numMatches3++;
   }
   out.println(
      String.format(
         formatString,
         SUPERSET_STRING_STARTING_WITH_MATCH,
         numMatches3,
         regExCookbook31Pattern.pattern()));      
}

When the code above is executed (as part of a class and after being invoked), the output appears as shown in the next screen snapshot.


This example confirms what the Javadoc states about the Matcher.matches() and Matcher.lookingAt() methods. In particular, we see that Matcher.matches() looks for an exact match from the beginning of the provided String against the regular expression while Matcher.lookingAt() only verifies that the provided expression begins with a portion matching the regular expression and does NOT require an exact match. The Matcher.find() method finds and allows action upon any and all matches. The Methods of the Matcher Class portion of the Java Tutorial provides another example of these two methods.
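The difference among the three methods can also be condensed into a single self-contained sketch. The class MatcherMethodsSketch and its summarize method are hypothetical names introduced here for illustration; the regular expression and sample strings are the ones used in the listing above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsSketch
{
   /** Recipe 3.1 regular expression [$"'\n\d/\\] in its Java-escaped form. */
   private static final Pattern RECIPE_3_1 = Pattern.compile("[$\"'\n\\d/\\\\]");

   /** Summarizes matches(), lookingAt(), and the number of find() hits for one input. */
   static String summarize(final String input)
   {
      final boolean entire = RECIPE_3_1.matcher(input).matches();
      final boolean prefix = RECIPE_3_1.matcher(input).lookingAt();
      final Matcher finder = RECIPE_3_1.matcher(input);
      int count = 0;
      while (finder.find())
      {
         count++;
      }
      return "matches=" + entire + " lookingAt=" + prefix + " findCount=" + count;
   }

   public static void main(final String[] arguments)
   {
      final String superset = "regular\\expressions$can_be_2tons\"of'fun.";
      System.out.println(summarize(superset));        // matches=false lookingAt=false findCount=5
      System.out.println(summarize("$"));             // matches=true lookingAt=true findCount=1
      System.out.println(summarize("$" + superset));  // matches=false lookingAt=true findCount=6
   }
}
```

Only the single-character String "$" is an exact match, both Strings that start with "$" satisfy lookingAt(), and find() counts every matching character regardless of position.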

Whew! It's been a long way getting here, but I've now covered enough Java handling of regular expressions to move on to what Groovy can do with regular expressions.

I mentioned previously that Groovy doesn't require the developer to explicitly instantiate a Pattern instance to get access to one. Instead this can be done implicitly by prefixing a String with the ~ character. Groovy's goodness doesn't stop there. Groovy also supports "easier" (certainly more concise) syntax for using Java's Matcher. This Groovy syntax is more Perl-like in nature than the Java API counterpart.

The Groovy =~ operator acts something like instantiating Java's Matcher while the ==~ operator acts more like Java's Matcher.matches() (exact match) method. However, it's better than that. The Matcher instance provided by =~ is implicitly evaluated as a boolean when used in a conditional statement. Looking at Groovy code makes it more obvious what's happening.
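Before turning to the Groovy code, it may help to see roughly what these operators stand in for on the Java side. The following is a sketch under my own assumptions, not Groovy's actual implementation: the class and method names are hypothetical, and treating a Matcher in a conditional as "at least one match found" is my approximation of the Groovy behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroovyOperatorEquivalentsSketch
{
   /** Roughly what Groovy's  input =~ regex  provides: a Matcher over the input. */
   static Matcher findOperator(final String input, final String regex)
   {
      return Pattern.compile(regex).matcher(input);
   }

   /** Roughly what Groovy's  input ==~ regex  provides: true only if the entire input matches. */
   static boolean matchOperator(final String input, final String regex)
   {
      return Pattern.compile(regex).matcher(input).matches();
   }

   /** Approximation (assumption) of a Matcher used in a conditional: true if any match is found. */
   static boolean usedAsCondition(final Matcher matcher)
   {
      return matcher.find();
   }

   public static void main(final String[] arguments)
   {
      final String digit = "\\d";
      System.out.println(usedAsCondition(findOperator("2tons", digit)));  // true
      System.out.println(matchOperator("2tons", digit));                  // false
      System.out.println(matchOperator("2", digit));                      // true
   }
}
```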

NetBeans 6.9 includes Groovy support and it helps us to see that plopping the escape character-ridden Java String used for a regular expression in the Java example above directly into Groovy code is actually not allowed. Here's what NetBeans 6.9 shows for such a case (note the red squiggly line and the error message; thanks NetBeans!).


The small snippet of code shown above in the NetBeans 6.9 editor window indicates that the dollar sign ($) needs to be escaped in Groovy when specifying the regular expression using double quotes. However, the next line shows that this is not necessary when the "slashy syntax" is used instead of the double quotes. For many developers who use regular expressions, the slashy syntax may be more appealing anyway, but the fact that it doesn't require escaping of $ and the fact that it isn't as confusing when there are double quotes naturally present in the regular expression are sweet morsels.

Using the slashy syntax helped avoid the need to escape the $ or the ", but how does one handle a slash in the regular expression when using slashy syntax? It turns out that NetBeans 6.9 warns us about that.


In response to the NetBeans flagged errors, I can fix the regular expression Pattern definitions to get the following Groovy code that creates Pattern instances with both quoted strings and slashy strings:

#!/usr/bin/env groovy
/*
 * Demonstrate how Groovy simplifies regular expression handling.
 */

// Setting up Pattern instances in Groovy

def patternQuoted = ~"[\$\"'\n\\d/\\\\]"
println "It's a Pattern: ${patternQuoted.class} (quoted) for regular expression ${patternQuoted.pattern()}"
def patternSlashy = ~/[$"'\n\d\/\\]/
println "It's a Pattern: ${patternSlashy.class} (slashy) for regular expression ${patternSlashy.pattern()}"

When the above Groovy script is executed, the output looks like that shown in the next screen snapshot.


The snapshot shows the general advantage in presentation of the slashy syntax as compared to the quoted syntax. The slashy syntax required far less escaping than the quoted syntax or the equivalent Java syntax, making the String closer to the original regular expression.

There is one downside evident from this. Note that the new line \n is left in place as two characters rather than being treated as the new line character in the slashy syntax. This can be addressed by using the syntax ${"\n"} in place of the "\n" in the slashy syntax String as shown next:

def patternQuoted = ~"[\$\"'\n\\d/\\\\]"
println "It's a Pattern: ${patternQuoted.class} (quoted) for regular expression ${patternQuoted.pattern()}"
def patternSlashy = ~/[$"'${"\n"}\d\/\\]/
println "It's a Pattern: ${patternSlashy.class} (slashy) for regular expression ${patternSlashy.pattern()}"

Being required to express the newline as ${"\n"} instead of simply "\n" is less than desirable. Fortunately, I don't need a match to a newline for the examples in this post and I generally don't need them in real life use either. Even when I might, I prefer this small price to buy the advantages of the slashy syntax.
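One related point can be checked on the Java side: the Pattern Javadoc lists \n as the regular expression construct for the line feed character, so a pattern containing the two characters backslash and n should still match a line feed even though it prints differently. The sketch below uses a hypothetical class and method name of my own and a deliberately tiny character class rather than the full Recipe 3.1 expression.

```java
import java.util.regex.Pattern;

class NewlineEscapeSketch
{
   /** Reports whether the given regular expression matches a single line feed character. */
   static boolean matchesLineFeed(final String regex)
   {
      return Pattern.compile(regex).matcher("\n").matches();
   }

   public static void main(final String[] arguments)
   {
      // Java literal "[\n]": the character class holds an actual line feed character.
      System.out.println(matchesLineFeed("[\n]"));   // true
      // Java literal "[\\n]": the class holds backslash + n, which the regex engine reads as line feed.
      System.out.println(matchesLineFeed("[\\n]"));  // true
      // A plain letter n does not match a line feed.
      System.out.println(matchesLineFeed("[n]"));    // false
   }
}
```

If that holds for a given pattern, the difference between the two forms is in how the pattern is displayed rather than in what it matches.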

The next code listing demonstrates Matcher handling in Groovy:

// Setting up Matcher instances in Groovy

def findRegExCookbook31MatchesQuoted = "regular\\expressions\$can_be_2tons\"of'fun." =~ patternQuoted
println "It's a Matcher for Quoted Pattern!: ${findRegExCookbook31MatchesQuoted.class}"
println "\tNumber of matches (count): ${findRegExCookbook31MatchesQuoted.count}"
println "\tNumber of matches (size()): ${findRegExCookbook31MatchesQuoted.size()}"

def findRegExCookbook31MatchesSlashy = "regular\\expressions\$can_be_2tons\"of'fun." =~ patternSlashy
println "It's a Matcher for Slashy Pattern!: ${findRegExCookbook31MatchesSlashy.class}"
println "\tNumber of matches (count): ${findRegExCookbook31MatchesSlashy.count}"
println "\tNumber of matches (size()): ${findRegExCookbook31MatchesSlashy.size()}"

def findRegExCookbook31ExactMatchQuoted = "regular\\expressions\$can_be_2tons\"of'fun." ==~ patternQuoted
println "It's a Boolean for Quoted Pattern!: ${findRegExCookbook31ExactMatchQuoted.class}"
println "\t${findRegExCookbook31ExactMatchQuoted ? 'Exact Match!' : 'NOT Exact Match.'}"

def findRegExCookbook31ExactMatchSlashy = "regular\\expressions\$can_be_2tons\"of'fun." ==~ patternSlashy
println "It's a Boolean for Slashy Pattern!: ${findRegExCookbook31ExactMatchSlashy.class}"
println "\t${findRegExCookbook31ExactMatchSlashy ? 'Exact Match!' : 'NOT Exact Match.'}"

def findRegExCookbook31ExactMatchQuoted2 = '$' ==~ patternQuoted
println "It's a Boolean for Quoted Pattern!: ${findRegExCookbook31ExactMatchQuoted2.class}"
println "\t${findRegExCookbook31ExactMatchQuoted2 ? 'Exact Match!' : 'NOT Exact Match.'}"

def findRegExCookbook31ExactMatchSlashy2 = '$' ==~ patternSlashy
println "It's a Boolean for Slashy Pattern!: ${findRegExCookbook31ExactMatchSlashy2.class}"
println "\t${findRegExCookbook31ExactMatchSlashy2 ? 'Exact Match!' : 'NOT Exact Match.'}"

The output from running this, shown in the next screen snapshot, tells the tale.


This output confirms that the Groovy =~ operator provides a Matcher instance that is smarter than your average Matcher. That's because the Groovy Matcher (part of Groovy GDK) provides many additional utility methods including two used in the code above (getCount() used as count property and size() method). The GDK's Matcher.asBoolean() method is behind the magic that allows a Groovy Matcher to return a boolean in a conditional expression. I don't discuss it here, but the GDK does provide one small extension to the Pattern class as well.

Mr. Haki provides a nice overview of Groovy's treatment of regular expressions via Matchers in his post Groovy Goodness: Matchers for Regular Expressions. He similarly covers Groovy handling of Patterns in the post Groovy Goodness: Using Regular Expression Pattern Class.


Conclusion

The ability to use more natural-looking (at least by regular expression standards) regular expressions, the ability to use operators rather than APIs and method calls, and the extra "smarts" added to GDK's extensions of the Java RegEx library make using regular expressions in Groovy easier and more natural to the people who are probably most familiar with regular expressions: script writers. As is true with most of Groovy, Groovy's regular expression support is a reflection of Java's regular expression support. Generally speaking, anything one knows about regular expressions in Java (including the syntax supported in Java's flavor/dialect) can be applied when using regular expressions in Groovy. In several cases, though, Groovy makes it easier to use. With anything, but especially with regular expressions, easier is always better.


Additional Resources

I mentioned previously that there are numerous great online resources on Groovy's support for regular expressions. Some of them are listed here. I especially recommend the first listed online resource (Groovy: Don't Fear the RegExp) and the book I frequently cited in this post (Regular Expressions Cookbook).

Regular Expressions Cookbook

Groovy: Don’t Fear the RegExp

Groovy Regular Expressions

Groovy Tutorial 4 - Groovy Regular Expressions Basics

⇒ Groovy Goodness: Using Regular Expression Pattern Class

⇒ Groovy Goodness: Matchers for Regular Expressions

Big Collection of Regular Expressions (not specific to Groovy)

Finding Files by Name with Groovy

Online Regular Expression Test Page (uses java.util.regex)

RegexBuddy

RegexPal

Regular Expression Tool
