Splitting strings on unescaped delimiters only

I've run into this problem several times in the past, in several different languages. I'll be splitting a string on a delimiter using the language's split() method, and at some point I'll discover that the delimiter also occurs within the string to be split, in an escaped form. For instance, you might have a string like this:

123, person, Tom Smith\, MD, 45202

And what you want is an array of those values like this:

@elements = ( '123', 'person', 'Tom Smith\, MD', '45202' );

In most languages, there's a split operator which usually does what's needed:

@elements = split(", ", $string);

But if you use this method on the string above, the output you'll get is:

@elements = ( '123', 'person', 'Tom Smith\', 'MD', '45202' );

The above is invalid syntax (the value would actually be written 'Tom Smith\\'), but you get the idea. In Perl, you have the handy option of using a regular expression as the delimiter for split(), but I have yet to find a working example of regex syntax that matches a character only if it is not preceded by some other character. It's probably possible.
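For what it's worth, Perl's regex syntax does include a negative lookbehind assertion, (?<!...), which matches a position only when the given pattern does not come immediately before it. Here is a minimal sketch of using it with split() on the sample string from above. The variable names are just for illustration, and note that a plain lookbehind also treats an escaped backslash followed by a comma (\\,) as an escaped comma.

```perl
use strict;
use warnings;

my $string = '123, person, Tom Smith\, MD, 45202';

# (?<!\\) is a negative lookbehind: the comma matches only when the
# character immediately before it is not a backslash.
my @elements = split(/(?<!\\), /, $string);

print join(" | ", @elements), "\n";
# prints: 123 | person | Tom Smith\, MD | 45202
```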

In the past, I've just worked around the issue: either the application's validation functions prevent the user from using the delimiter within strings, or I just use split() and pray the data I'm processing doesn't contain escaped delimiters. But there have been a few times where I didn't have a choice, and I've written methods to accomplish this task in several different languages, usually by substituting a placeholder for each escaped delimiter, splitting the new string, then replacing the placeholder with the delimiter in each value in the array. I've never liked the results of this approach. They have always felt inelegant, and they're usually overly complex, which raises the possibility that the method will fail in particular circumstances.
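For reference, the placeholder approach described above might look roughly like this in Perl. This is only a sketch: splitWithPlaceholder is a hypothetical name, and it assumes the NUL character never appears in the data, which is exactly the kind of assumption that makes the approach fragile.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the placeholder workaround.
sub splitWithPlaceholder {
	my ($delim, $escaper, $line) = @_;
	# Assumption: this character never occurs in the input
	my $placeholder = "\x00";
	# 1. Hide each escaped delimiter behind the placeholder
	$line =~ s/\Q$escaper$delim\E/$placeholder/g;
	# 2. Split on the delimiters that remain
	my @parts = split(/\Q$delim\E/, $line);
	# 3. Put the escaped delimiter back into each piece
	s/\Q$placeholder\E/$escaper$delim/g for @parts;
	return @parts;
}

my @fields = splitWithPlaceholder(',', '\\', 'a,b\,c,d');
print join(" | ", @fields), "\n";
# prints: a | b\,c | d
```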

Today I finally came up with something I like, and I imagine others might have need for something similar. It has some limitations in that the delimiter is assumed to be a single character, as is the escape character. But it works great as long as those two assumptions hold true.

The code below is in Perl, but could be tweaked to work in most languages. If you find it helpful, please let me know.

#!/usr/bin/perl
use strict;
use warnings;

my $string = 'This is an, escaped \, string containing, several \, escaped and , unescaped commas';

print "normal split: " . join(" ... ", split(",", $string)) . "\n";

print "splitOnUnescaped: " . join(" ... ", splitOnUnescaped(',', '\\', $string)) . "\n";

sub splitOnUnescaped {
	my ($delim, $escaper, $line) = @_;
	my @out;
	my @chars = split('', $line);
	my $i = 0;
	my $element = "";
	foreach my $char (@chars) {
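		# The $i == 0 guard matters: $chars[-1] would wrap around to the last character
		if (($char eq $delim) && ($i == 0 || $chars[$i-1] ne $escaper)) {
			# Unescaped delimiter: close off the current element
			push(@out, $element);
			$element = "";
		} else {
			# Ordinary character or escaped delimiter: keep it in the current element
			$element .= $char;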
		}
		$i++;
	}
	if ($element ne '') {
		push (@out, $element);
	}
	return (@out);
}
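
One case the function above does not distinguish is an escaped escape character followed by a delimiter (for example \\,), which it treats as an escaped delimiter. If your data can contain that, a variant that tracks escape state instead of peeking at the previous character is one option. This is a sketch under that assumption; splitTrackingEscapes is a hypothetical name, and it still assumes a single-character delimiter and escaper.

```perl
use strict;
use warnings;

# Hypothetical variant of splitOnUnescaped that honors escaped escapers.
sub splitTrackingEscapes {
	my ($delim, $escaper, $line) = @_;
	my @out;
	my $element = "";
	my $escaped = 0;	# true when the previous character was an active escaper
	foreach my $char (split('', $line)) {
		if ($escaped) {
			# Escaped character: keep it verbatim and clear the flag
			$element .= $char;
			$escaped = 0;
		} elsif ($char eq $escaper) {
			# Start of an escape sequence: keep the escaper, flag the next character
			$element .= $char;
			$escaped = 1;
		} elsif ($char eq $delim) {
			# Unescaped delimiter: close off the current element
			push(@out, $element);
			$element = "";
		} else {
			$element .= $char;
		}
	}
	push(@out, $element) if $element ne '';
	return @out;
}

# Input is: a, two backslashes, comma, b, backslash, comma, c
my @parts = splitTrackingEscapes(',', '\\', 'a\\\\,b\,c');
print join(" ... ", @parts), "\n";
```

Here the first comma follows an escaped backslash, so it splits, while the second comma is escaped and stays inside its element.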
