Moving to Freedom, .Org(on)

Python: Regex Test Function

[Book covers: 'Learning Python', by Mark Lutz; 'Mastering Regular Expressions', by Jeffrey Friedl]

Fun with Python and regular expressions! Here’s a little regex test function I wrote in Python to help me as I work through Friedl’s regular expression book.

I’m mostly working at the interactive prompt and had been calling functions from Python’s re (the regex module) as I experimented with different regular expressions. This was good, as I spent time in help(re) and built up some muscle memory for Python regex functions, but it was becoming repetitive to keep typing the same commands to analyze the results of a match. Once I started learning about writing functions in Python, I realized it was time to enhance my regex learning experience with a simple Python function.
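For a sense of the repetition, here is a rough sketch of the kind of manual inspection described above, written as a script rather than typed at the prompt. The sample string is borrowed from the first example further down; the variable names are only illustrative.

```python
import re

text = 'Go to 4782 West 70th St.'
m = re.search(r'\d+', text)

if m:
    start, end = m.start(), m.end()     # span of the first match
    matched = m.group()                 # the matched text itself
    groups = m.groups()                 # captured groups; an empty tuple here
    print(start, end, matched, groups)  # 6 10 4782 ()
```

Typing those four calls after every experiment is what the function below is meant to replace.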

There are sophisticated regex tools out there that do much more than this, but it was fun to cobble the function together and learn more about Python in order to learn more about regex. So far it’s proven helpful in understanding how regular expressions work.

Function Definition

The function prints whether the pattern matched. For each match it prints the start and end positions, the matched part of the string, and any captured strings (groups). Finally, it does a global search and replace on the string and prints the result.

match(pattern, string[, repl])

(I would have preferred putting repl before string to match the re.sub parameter order, but I switched them so that repl could be an optional last argument.) I put the function in a file named imisc.py (interactive miscellaneous) that I import into an interactive session to make regex experimentation more convenient.
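As a quick reminder of the ordering difference mentioned above, this small sketch calls re.sub directly, with the replacement second and the subject string third. The sample text is taken from the examples in this post.

```python
import re

text = 'Go to 4782 West 70th St.'

# re.sub order: pattern, repl, string
result = re.sub(r'\d+', '_._', text)
print(result)  # Go to _._ West _._th St.

# The test function swaps the last two so repl can be left out:
#   imisc.match(r'\d+', text)          # uses the default repl
#   imisc.match(r'\d+', text, '_._')   # same thing, spelled out
```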


Examples

In this first example, capturing parentheses aren’t used in the regex, so there are no captured groups displayed. The r in r'\d+' indicates a “raw” string, which saves us from having to escape backslashes with more backslashes. Since no repl argument is given, the default _._ is used for the replacement.

>>> imisc.match(r'\d+', 'Go to 4782 West 70th St.')
a match!
1) start: 6, end: 10, str: 4782
2) start: 16, end: 18, str: 70
global replace (_._):
Go to _._ West _._th St.
>>>
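The raw-string point is easy to check for yourself. A minimal sketch comparing a raw string with its escaped equivalent:

```python
raw = r'\d+'
escaped = '\\d+'

# Both spellings produce the same three-character string: \ d +
same = (raw == escaped)
print(same)  # True

# Without the r prefix, '\n' is a single newline character;
# with it, r'\n' is a backslash followed by the letter n.
lengths = (len(r'\n'), len('\n'))
print(lengths)  # (2, 1)
```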

Next we’ll use capturing parentheses to collect strings in \1 and \2. We can see these values displayed in the match groups, and we’ll use \2 in our global replace. ((?i) is a mode switch for a case-insensitive match.)

>>> imisc.match(r'(?i)The (\w+) (\w+)\.?',
... 'The quick brown fox jumps over the lazy dog.', r'\2')
a match!
1) start: 0, end: 15, str: The quick brown
   groups: ('quick', 'brown')
2) start: 31, end: 44, str: the lazy dog.
   groups: ('lazy', 'dog')
global replace (\2):
brown fox jumps over dog
>>>
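The (?i) switch and the \2 backreference can also be tried with re.sub on its own. This sketch assumes the same sentence as above and shows that passing flags=re.IGNORECASE gives the same result as the inline switch, and that \g<2> is another way to spell \2 in the replacement.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# Inline mode switch, as in the example above
inline = re.sub(r'(?i)The (\w+) (\w+)\.?', r'\2', text)

# Same pattern without (?i), using the flags argument instead
flagged = re.sub(r'The (\w+) (\w+)\.?', r'\2', text, flags=re.IGNORECASE)

# \g<2> refers to the same group as \2
explicit = re.sub(r'(?i)The (\w+) (\w+)\.?', r'\g<2>', text)

print(inline)  # brown fox jumps over dog
print(inline == flagged == explicit)  # True
```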

Finally, some zero-width matches on “nothing”:

>>> imisc.match(r'z?', 'abc', '_')
a match!
1) start: 0, end: 0, str:
2) start: 1, end: 1, str:
3) start: 2, end: 2, str:
4) start: 3, end: 3, str:
global replace (_):
_a_b_c_
>>>
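The same four zero-width matches can be seen with re.finditer, which is a handy cross-check on the function’s output. A short sketch:

```python
import re

# Each span has start == end, so every match is the empty string
spans = [m.span() for m in re.finditer(r'z?', 'abc')]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]

# One underscore is inserted at each of those four positions
replaced = re.sub(r'z?', '_', 'abc')
print(replaced)  # _a_b_c_
```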

The Match Function

I’ll place this humble bit of code into the public domain to make it painless to share and include in your own work. I hope if my function finds its way into a larger work that you’ll do the right thing and share it under a free software license. :-)

import re

def match(pattern, string, repl='_._'):
    """Print every match of pattern in string, then a global replace with repl."""

    r = re.compile(pattern)
    m = r.search(string)

    if m:
        print('a match!')
        i = 0
        while m:
            m_start = m.start()
            m_end = m.end()

            i += 1
            print( '%d) start: %d, end: %d, str: %s' %
                   (i, m_start, m_end, string[m_start:m_end]) )

            if m.groups():           # capturing groups
                print('   groups: ' + str(m.groups()))

            if m_end == len(string): # infinite loop if
                break                #    m_start == m_end == len(string)
            elif m_start == m_end:   # zero-width match;
                m_end += 1           #    keep things moving along

            m = r.search(string, m_end)

        print( 'global replace (%s):\n%s' %
               (repl, re.sub(pattern, repl, string)) )

    else:
        print('not a match')
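For comparison, here is a sketch of a variant built on re.finditer. It is not the function above, just an alternative under the assumption that you would rather get the matches back as data than have them printed. The name find_matches is made up for this illustration.

```python
import re

def find_matches(pattern, string):
    """Return (start, end, text, groups) for each match (hypothetical helper)."""
    return [(m.start(), m.end(), m.group(), m.groups())
            for m in re.finditer(pattern, string)]

sentence = 'The quick brown fox jumps over the lazy dog.'
found = find_matches(r'(?i)The (\w+) (\w+)\.?', sentence)

# Print in roughly the same shape as the match function's output
for i, (start, end, text, groups) in enumerate(found, 1):
    print('%d) start: %d, end: %d, str: %s' % (i, start, end, text))
    if groups:
        print('   groups: ' + str(groups))
```

Here re.finditer takes care of stepping past zero-width matches, so the manual end-of-string check isn’t needed. One difference to be aware of: the loop above stops as soon as any match ends at the end of the string, so for a pattern like \d* on 'a1' it would likely not list a trailing empty match that re.sub still replaces on Python 3.7 and later.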