Regular Expression Builder

From Visual Basic to GNU C, this is the place to talk programming.

Moderators: SecretSquirrel, just brew it!

Regular Expression Builder

Postposted on Fri Aug 27, 2010 1:57 pm

After reading the spec for ECMAScript regular expressions I put together some code in C#4 that produces immutable, functionally pure character matcher sequences. Here is a simple example that will match a social security number.

Code: Select all
public static class Example
{
    public static readonly Matcher SsnMatcher =
        MatcherSequence.Begin()
        .Term("0123456789", min: 3, max: 3)
        .Term("-", min: 0, max: 1)
        .Term("0123456789", min: 2, max: 2)
        .Term("-", min: 0, max: 1)
        .Term("0123456789", min: 4, max: 4)
        .ToMatcher();
}

Let me know what you think.

Code: Select all
public delegate MatchState Continuation(MatchState state);
 
public delegate string Matcher(string input, int index);
 
public struct MatchState
{
    public readonly string Value;
    public readonly int Index;
    public readonly bool Success;
 
    public bool End
    {
        get { return Index == Value.Length; }
    }
 
    public char Current
    {
        get { return !End ? Value[Index] : char.MinValue; }
    }
 
 
    public MatchState(string value, int index, bool success)
    {
        Value = value;
        Index = index;
        Success = success;
    }
 
 
    public MatchState Test(bool success)
    {
        if (success)
        {
            return new MatchState(Value, Index + 1, true);
        }
        return new MatchState(Value, Index, false);
    }
 
    public MatchState Fail()
    {
        return new MatchState(Value, Index, false);
    }
}
 
public static class MatcherSequence
{
    public static Continuation Begin()
    {
        return state => state;
    }
 
    public static Matcher ToMatcher(this Continuation value)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
 
        return (input, index) =>
        {
            if (string.IsNullOrEmpty(input))
            {
                throw new ArgumentNullException("input");
            }
            else if (index < 0 || index >= input.Length)
            {
                throw new ArgumentOutOfRangeException("index");
            }
 
            var initialState = new MatchState(input, index, true);
            var finalState = value(initialState);
            if (finalState.Success)
            {
                return input.Substring(index, finalState.Index - index);
            }
            return null;
        };
    }
 
    public static Continuation Disjunction(this Continuation value, Continuation left, Continuation right)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
        else if (left == null)
        {
            throw new ArgumentNullException("left");
        }
        else if (right == null)
        {
            throw new ArgumentNullException("right");
        }
 
        return state =>
        {
            var result = value(state);
            if (result.Success)
            {
                var leftResult = left(result);
                if (!leftResult.Success)
                {
                    var rightResult = right(result);
                    if (!rightResult.Success)
                    {
                        return state.Fail();
                    }
                    return rightResult;
                }
                return leftResult;
            }
            return result;
        };
    }
 
 
    public static Continuation Term(this Continuation value, IEnumerable<char> cs, int min = 1, int max = 1, bool negate = false)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
        else if (cs == null)
        {
            throw new ArgumentNullException("cs");
        }
 
        var set = new HashSet<char>(cs);
        if (set.Count == 0)
        {
            throw new ArgumentException("At least on character is required.", "cs");
        }
 
        return state =>
        {
            var count = 0;
            var result = value(state);
 
            if (!result.Success)
            {
                return result;
            }
 
            do
            {
                if (result.End || (!set.Contains(result.Current) ^ negate))
                {
                    break;
                }
                result = result.Test(true);
            } while (++count < max || max < 0);
 
            if (count >= min && (count <= max || max < 0))
            {
                return result;
            }
 
            return state.Fail();
        };
    }
 
    public static Continuation Term(this Continuation value, Continuation next, int min = 1, int max = 1)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
        else if (next == null)
        {
            throw new ArgumentNullException("next");
        }
 
        return state =>
        {
            var count = 0;
            var result = value(state);
 
            if (!result.Success)
            {
                return result;
            }
 
            do
            {
                if (result.End)
                {
                    break;
                }
                var nextResult = next(result);
                if (result.End || !nextResult.Success)
                {
                    break;
                }
                result = nextResult;
            } while (++count < max || max < 0);
 
            if (count >= min && (count <= max || max < 0))
            {
                return result;
            }
 
            return state.Fail();
        };
    }
       
    public static Continuation IsNotFollowedBy(this Continuation value, IEnumerable<char> cs)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }
        else if (cs == null)
        {
            throw new ArgumentNullException("cs");
        }
 
        var set = new HashSet<char>(cs);
        if (set.Count == 0)
        {
            throw new ArgumentException("At least on character is required.", "cs");
        }
 
        return state =>
        {
            var result = value(state);
            if (!result.Success)
            {
                return result;
            }
 
            if (result.End || !set.Contains(result.Current))
            {
                return result;
            }
 
            return result.Fail();
        };
    }
}
Chaospandion
Gerbil Team Leader
 
Posts: 201
Joined: Thu Mar 25, 2004 8:20 pm
Location: Pottstown, PA

Re: Regular Expression Builder

Postposted on Sun Aug 29, 2010 3:04 am

You may want to consider setting the default values for max and min in the Term function to -1 and 0 respectively by default. There also seems to be no support for "." and other such meta-characters.
Olreich
Gerbil In Training
 
Posts: 7
Joined: Mon Aug 23, 2010 4:52 am

Re: Regular Expression Builder

Postposted on Sun Aug 29, 2010 4:41 pm

Olreich wrote:You may want to consider setting the default values for max and min in the Term function to -1 and 0 respectively by default.


Do you really think it should be greedy like that?

Olreich wrote:There also seems to be no support for "." and other such meta-characters.


Yep, I'll be adding those eventually. I was thinking of something like this.

Code: Select all
MatcherSequence.Begin()
.DecimalDigit(min: 3, max: 3);
Chaospandion
Gerbil Team Leader
 
Posts: 201
Joined: Thu Mar 25, 2004 8:20 pm
Location: Pottstown, PA

Re: Regular Expression Builder

Postposted on Sun Aug 29, 2010 5:45 pm

I have to ask, though: did you have a very good reason for writing your own regexp parser, or you did it as an exercise?
There is a fixed amount of intelligence on the planet, and the population keeps growing :(
morphine
Gerbil Khan
Silver subscriber
 
 
Posts: 9986
Joined: Fri Dec 27, 2002 8:51 pm
Location: Portugal (that's next to Spain)

Re: Regular Expression Builder

Postposted on Sun Aug 29, 2010 7:18 pm

morphine wrote:I have to ask, though: did you have a very good reason for writing your own regexp parser, or you did it as an exercise?


Well to be a bit pedantic my code does not do any parsing. It fluently produces a series of continuations (functions) that test portions of an string.
As to your question, I am trying to learn and develop new techniques for lexical analysis that involve pure functional programming. My end goal is to mold this in
to a series of continuations that will produce tokens that I can annotate with extra information such as errors, line number, and character column. I am also trying to prove
how simple regular expressions can be implemented (Prove to myself I mean.).

That being said I have also written a regexp parser/compiler that I integrated into the ECMAScript (JavaScript/JScript/ActionScript) implementation I am developing.
Chaospandion
Gerbil Team Leader
 
Posts: 201
Joined: Thu Mar 25, 2004 8:20 pm
Location: Pottstown, PA

Re: Regular Expression Builder

Postposted on Tue Aug 31, 2010 3:26 pm

Chaospandion wrote:Well to be a bit pedantic my code does not do any parsing. It fluently produces a series of continuations (functions) that test portions of an string.
As to your question, I am trying to learn and develop new techniques for lexical analysis that involve pure functional programming. My end goal is to mold this in
to a series of continuations that will produce tokens that I can annotate with extra information such as errors, line number, and character column. I am also trying to prove
how simple regular expressions can be implemented (Prove to myself I mean.).

I haven't had a chance to look at your code, but from this description above -- evaluating a regular expression through a series of continuations -- it sounds like you're using the theory of "regular expression derivatives", either by chance or by design. From the link above:
Modern times need non-blocking lexers. Whenever programmers write non-blocking applications, they often end up re-inventing hand-rolled continuations. Fortunately, using a concept called "the derivative of regular expressions," a programmer can roll their own non-blocking version of lex that hides all of that machinery. In fact, derivatives make it possible to do this in just a few hundred lines of code!

So, what's the insight that gives us this engineering win? The continuation of a finite state machine is equivalent to the derivative of the regular expression from which it came.
bitvector
Grand Gerbil Poohbah
 
Posts: 3234
Joined: Wed Jun 22, 2005 4:39 pm
Location: Mountain View, CA


Return to Developer's Den

Who is online

Users browsing this forum: Google [Bot] and 2 guests