Monday, June 21, 2010

Determining Valid Java Regular Expression Characters for String.split

The String.split methods are convenient for splitting a String on a provided regular expression. The only trick is choosing a separator token that does not already occur naturally in the text being split. For example, "e" would not work well because it is so common in English text. A related challenge is to make sure the token is not a symbol with special meaning in regular expressions (such as "^", which anchors a match at the start of a line).

I have blogged previously on how I like to develop simple Java examples to learn from. One example that has provided me benefit multiple times is a simple Java application that I can run against a String I'm considering as a regular expression token to be used with String.split. It is shown next.

Simple Java Application For Determining Viability of Regular Expression Splitting Token

package dustin.examples;

import java.util.regex.PatternSyntaxException;
import static java.lang.System.out;

/**
 * This simple class accepts a String as a potential regular expression token
 * and demonstrates how this provided String would work as a token in a
 * {@code String.split} invocation.
 */
public class Main
{
   /** OS-independent new line. */
   public static final String NEW_LINE = System.getProperty("line.separator");

   /**
    * Main executable function for running this test.
    *
    * @param arguments Command-line arguments: single argument is expected that
    *    represents the candidate regular expression token.
    */
   public static void main(final String[] arguments)
   {
      if (arguments.length < 1)
      {
         out.println(
              NEW_LINE
            + "No argument was provided. A candidate String token must be provided."
            + NEW_LINE);
         System.exit(-1);
      }

      final String candidateToken = arguments[0];
      out.println("Provided token is: " + candidateToken);

      // Build a sample string in which the candidate token separates each word.
      final String stringWithCandidateToken =
           "Java" + candidateToken + "has" + candidateToken + "regular"
         + candidateToken + "expression" + candidateToken + "support"
         + candidateToken + ".";
      out.println("String with candidate token is: " + stringWithCandidateToken);

      try
      {
         // The candidate token is interpreted as a regular expression here.
         final String[] splitStrings = stringWithCandidateToken.split(candidateToken);
         for (final String splitString : splitStrings)
         {
            out.println(splitString);
         }
      }
      catch (PatternSyntaxException badRegExpPatternSyntax)
      {
         out.println(
              "Unable to parse " + stringWithCandidateToken + " on token "
            + candidateToken + " using String.split method - "
            + badRegExpPatternSyntax.toString());
      }
   }
}


This very simple example accepts a single string from the command line and attempts to use that string as a token for splitting a longer string into pieces. The example builds a generic string with the provided token and then splits on that token. The output of running this simple application tells how well that provided string works as a token.

The remainder of this blog post presents screen snapshots from running this simple application with some potential splitting tokens.

The first screen snapshot demonstrates running this application against a candidate token of "d". This actually works in this example because there's no "d" already in the string, but obviously would not work in most realistic situations.



In the second screen snapshot, the problem of using a common letter such as "e" as the splitting token is demonstrated. The strings do not split as desired because "e" existed in the string before tokens were applied.
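
The effect is easy to reproduce in isolation. The following is a minimal sketch only: the class name CommonLetterTokenDemo and the helper buildSample are hypothetical names introduced for illustration, and the helper simply rebuilds the same sample string as the application above.

```java
import java.util.Arrays;

/** Hypothetical demo class; not part of the application shown above. */
class CommonLetterTokenDemo
{
   /** Rebuilds the sample string used by the application above. */
   static String buildSample(final String token)
   {
      return "Java" + token + "has" + token + "regular" + token
           + "expression" + token + "support" + token + ".";
   }

   public static void main(final String[] arguments)
   {
      final String[] pieces = buildSample("e").split("e");
      // Six pieces were intended, but "regular" and "expression" already
      // contain an "e", so the split yields nine pieces (one of them empty).
      System.out.println(pieces.length + " pieces: " + Arrays.toString(pieces));
   }
}
```

The empty piece appears where the "e" that ends "regular" plus the token is immediately followed by the "e" that starts "expression".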



The third example uses "^" as the regular expression splitting token. Because "^" has significance in regular expressions (it is a zero-width anchor for the start of a line rather than a literal caret), the string is not split on the caret characters at all, as demonstrated in the next screen snapshot.



The "^" character is not the only token behaving this way. The next screen snapshot demonstrates similar behavior for "$" (end of line).
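
A small sketch of why both anchors fail as separators follows. The class name AnchorTokenDemo and the helper lastPiece are hypothetical, and the shortened sample strings are assumptions chosen for brevity. Because "^" and "$" match positions rather than characters, nothing is consumed and the input comes back unsplit.

```java
/** Hypothetical demo class; not part of the application shown above. */
class AnchorTokenDemo
{
   /** Returns the final element produced by splitting text on regex. */
   static String lastPiece(final String text, final String regex)
   {
      final String[] pieces = text.split(regex);
      return pieces[pieces.length - 1];
   }

   public static void main(final String[] arguments)
   {
      final String caretSample = "Java^has^regular";
      final String dollarSample = "Java$has$regular";

      // Both lines print true: the last piece is still the entire input,
      // so the literal "^" and "$" characters were never used as separators.
      System.out.println(lastPiece(caretSample, "^").equals(caretSample));
      System.out.println(lastPiece(dollarSample, "$").equals(dollarSample));
   }
}
```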



The "^" and "$" characters were not good choices for a regular expression-based token for splitting strings, but there are even worse choices. This is demonstrated for "*" in the next screen snapshot.



The last example, using "*" as the regular expression splitting token, led to a PatternSyntaxException. This is an unchecked exception, but my example explicitly catches it anyway.

Parentheses are also significant in regular expressions (they mark groups), so they don't work well as tokens for string splitting either, as demonstrated in the next screen snapshot.
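
For tokens like "*" and the parentheses, a quick check is whether the token compiles as a pattern at all. The sketch below is illustrative only: QuotedTokenDemo, compiles, and splitLiterally are hypothetical names. It also shows Pattern.quote, a standard java.util.regex method that the application above does not use, as one possible way to have such a token treated as literal text.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical demo class; not part of the application shown above. */
class QuotedTokenDemo
{
   /** Returns true if the token is syntactically valid as a regular expression. */
   static boolean compiles(final String token)
   {
      try
      {
         Pattern.compile(token);
         return true;
      }
      catch (PatternSyntaxException badSyntax)
      {
         return false;
      }
   }

   /** Splits text on the token treated as literal text rather than as a pattern. */
   static String[] splitLiterally(final String text, final String token)
   {
      return text.split(Pattern.quote(token));
   }

   public static void main(final String[] arguments)
   {
      // Each of these tokens is rejected when used as a pattern by itself.
      System.out.println(compiles("*") + " " + compiles("(") + " " + compiles(")"));
      // Quoted, the same token splits as a plain separator: [Java, has, regular]
      System.out.println(Arrays.toString(splitLiterally("Java*has*regular", "*")));
   }
}
```

Quoting only addresses the regular expression syntax. The token still has to be absent from the text being split.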



As long as neither the colon (":") nor the semicolon (";") is used naturally in the text being split, both work well in this scenario, as demonstrated next.



The String.split method is not limited to splitting on a single character. In the next example, I use the three characters of the Groovy Spaceship Operator (<=>) as the token (and it works fine).



Finally, what happens when the "pipe" character ("|") is used for a regular expression token to split a string? Let's see.



The pipe is like an "or" in regular expression parlance and we see that behavior manifested in the example in the just-shown screen snapshot.
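
To spell out what happens: a lone "|" is an alternation between two empty alternatives, so it matches the empty string between every pair of characters. The sketch below is a hedged illustration with hypothetical names (PipeTokenDemo, splitOn) and a shortened sample string. The backslash-escaped form "\\|" is shown as one possible alternative and is not something the application above does.

```java
import java.util.Arrays;

/** Hypothetical demo class; not part of the application shown above. */
class PipeTokenDemo
{
   /** Thin wrapper around String.split, used only to keep the demo readable. */
   static String[] splitOn(final String text, final String regex)
   {
      return text.split(regex);
   }

   public static void main(final String[] arguments)
   {
      final String sample = "Java|has|regular";

      // Unescaped, the pipe matches the empty string everywhere, so the text
      // is broken into individual characters and the pipes remain as pieces.
      System.out.println(Arrays.toString(splitOn(sample, "|")));

      // Escaped with a backslash, the pipe is a literal: [Java, has, regular]
      System.out.println(Arrays.toString(splitOn(sample, "\\|")));
   }
}
```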

The Pattern class documentation (part of the java.util.regex package) documents "regular expression constructs," but the simple application demonstrated in this post provides an easy way to determine and verify appropriate Strings to be used as tokens in splitting a string. A Groovy script version of this simple Java application is shown in my blog post Groovier Java RegEx Token Determination.
