Regex Walk-Through: Match filename base and extension
This post is a rather lengthy analysis of a short regular expression:
(.+?)(\.[^.]*$|$)
I’m continuing to enjoy and learn a lot from Jeffrey Friedl’s Mastering Regular Expressions. One of the things that works well for me is the way he walks through examples and iteratively builds a more robust pattern for a particular task. It helps to develop logical ways of thinking about these things.
While you may not need to accomplish the particular task described in this article, I hope you might benefit by following along with the explanation. (Although it’s probably not as clear and concise as the ones in MRE.)
I recently wanted to modify a filename by inserting some text before the extension and was pleased with the regular expression I built for the task. Here it is again in all its glory:
(.+?)(\.[^.]*$|$)
The two capturing groups here will collect (with an exception explained below):
- Everything up to but not including the last dot in a filename (the filename base).
- From the last dot to the end of the filename (the extension).
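As a quick check of those two groups, here is a minimal Python sketch. It assumes the final pattern from the end of this post and Python’s re module; split_name is a hypothetical helper name I made up for illustration.

```python
import re

# The final pattern from the end of this post.
FINAL = re.compile(r'(.+?)(\.[^.]*$|$)')

def split_name(filename):
    """Return (base, extension) as captured by groups 1 and 2."""
    m = FINAL.match(filename)
    if m is None:  # e.g. an empty string, since group 1 needs one character
        raise ValueError('no match: %r' % filename)
    return m.group(1), m.group(2)

print(split_name('death-star-plans.htm'))  # ('death-star-plans', '.htm')
print(split_name('bespin-plans'))          # ('bespin-plans', '')
```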
But why not use ____ instead?
Before proceeding, we might ask: Why use a regex instead of something like Python’s os.path.splitext('test.txt')? It returns exactly what I described above, e.g. for test.txt: ('test', '.txt'). Well, I’m learning a lot about regular expressions right now, so I tend to think of possible regex solutions for string parsing challenges. (Filename nail, meet Regex hammer.) I was interested in the challenge of crafting a good pattern and had fun figuring it out. And I didn’t know about splitext offhand at the time I was doing this. So, yes, I might have used the easier method had I known better, but this way I gained some good regexperience.
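For comparison, this is roughly what the non-regex route looks like. It only uses os.path.splitext from the standard library; the sample names are the ones used in this post.

```python
import os.path

names = ('test.txt', 'death-star.plans.htm', 'bespin-plans')

# splitext returns a (root, ext) tuple; ext is empty when there is no dot.
results = {name: os.path.splitext(name) for name in names}

for name, (root, ext) in results.items():
    print(name, '->', (root, ext))
```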
Let’s say we want to split death-star-plans.htm into a filename root and an extension. We might try:
(.*)(\..*) # capturing parentheses populate to \1 and \2
This says to match zero or more characters followed by a dot followed by zero or more characters, which gives us what we want for our captured values: death-star-plans and .htm. The first .* greedily consumes the entire filename, but then the engine has to backtrack to the dot before htm in order to make the match on \.. This will also work for death-star.plans.htm, since the engine only backtracks as far as the last dot before resuming with the rest of the match (using another greedy .*).
But what if we have a filename without a dot, like bespin-plans? Then our regular expression won’t match.
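Here is a small sketch of that first attempt in Python, assuming re.match (which anchors at the start of the string). groups_or_none is a hypothetical helper, not part of the re module.

```python
import re

first_try = re.compile(r'(.*)(\..*)')

def groups_or_none(pattern, filename):
    """Return the captured groups, or None when the pattern fails."""
    m = pattern.match(filename)
    return m.groups() if m else None

print(groups_or_none(first_try, 'death-star-plans.htm'))  # ('death-star-plans', '.htm')
print(groups_or_none(first_try, 'death-star.plans.htm'))  # ('death-star.plans', '.htm')
print(groups_or_none(first_try, 'bespin-plans'))          # None: no dot to match
```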
Well, let’s just make the dot optional with a ? after it:
(.*)(\.?.*) # optional dot: \.?
Now our match works as expected for a dotless name, but as is pointed out several times in MRE, we have to be careful with regular expressions that match too much. In particular with expressions like this one where everything is optional. Since the second part is now optional, the first greedy group will always capture everything, and our expression will no longer split filenames with dots. It will always put the entire filename into the first matching group.
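To see the “matches too much” problem concretely, here is a short sketch. I am assuming the optional-dot pattern is (.*)(\.?.*), based on the (\.?.*) fragment quoted in this post.

```python
import re

optional_dot = re.compile(r'(.*)(\.?.*)')

# The greedy first group swallows the whole name; the second group is
# happy to match nothing, so no backtracking is ever forced.
dotted = optional_dot.match('death-star-plans.htm').groups()
dotless = optional_dot.match('bespin-plans').groups()

print(dotted)   # ('death-star-plans.htm', '')
print(dotless)  # ('bespin-plans', '')
```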
Perhaps \. should be required, but we can offer an alternative for dotless filenames. Suppose we try (\..*|$) as the second group. Behind a greedy first group, this works essentially the same as (\.?.*), but now we can make the first part lazy with good results:
(.*?)(\..*|$) # slothful: .*?
For dotted filenames, the engine will only consume as much as necessary to get to a dot, and then the final .* will get us through the rest of the filename. Laziness in action: At each character up to the first literal dot, the engine checks: do I have a dot? No. Do I have the end of the string? No. Damn, I have to consume another character. It will eventually reach that first dot before it reaches the end of the string, at which point the .* will kick in and finish the filename.
For dotless names, the alternative end-of-string metacharacter $ forces the lazy clause to consume the entire filename. (We’ll eventually see that we need a $ on the left side of the alternation as well, but for now let’s follow this line.)
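The “do I have a dot? do I have the end?” loop can be imitated by hand. This is only a rough emulation of what the engine does, not how it is actually implemented; lazy_split and TAIL are names I made up, and the sketch ignores the fact that . will not match a newline.

```python
import re

# The second group on its own: a dot plus the rest, or end of string.
TAIL = re.compile(r'\..*|$')

def lazy_split(filename):
    """Grow the first group one character at a time, the way .*? does."""
    for i in range(len(filename) + 1):
        m = TAIL.match(filename, i)  # can the tail match starting here?
        if m:
            return filename[:i], m.group()
    return None

print(lazy_split('death-star-plans.htm'))  # ('death-star-plans', '.htm')
print(lazy_split('bespin-plans'))          # ('bespin-plans', '')
```

For these names the hand-rolled loop agrees with what re.match(r'(.*?)(\..*|$)', name).groups() returns.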
Okay! Now we can handle death-star-plans.htm and bespin-plans.
However, maybe you already see a problem ahead with death-star.plans.htm. Our lazy .*? will only work hard enough to get to the first dot, and then \..* will greedily race ahead and use up the rest of the string, giving us death-star and .plans.htm. But we typically only want the extension to start at the last dot.
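Running the lazy pattern against all three sample names shows the problem. A minimal sketch, again assuming re.match:

```python
import re

lazy = re.compile(r'(.*?)(\..*|$)')

one_dot = lazy.match('death-star-plans.htm').groups()
no_dot = lazy.match('bespin-plans').groups()
two_dots = lazy.match('death-star.plans.htm').groups()

print(one_dot)   # ('death-star-plans', '.htm')
print(no_dot)    # ('bespin-plans', '')
print(two_dots)  # ('death-star', '.plans.htm'): split at the first dot
```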
There are many examples in MRE that show how to deal with a situation like this. Instead of an “anything goes” dot metacharacter (.) following our literal dot (\.), we can use a negated character class to specify that only characters other than a dot may follow our literal dot ([^.]):
(.*?)(\.[^.]*|$) # no dots allowed after the literal dot
(Remember that inside of a character class, literal dots don’t need to be escaped with a backslash.)
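Here is what that intermediate step does with a two-dot name. I am assuming the pattern at this stage is (.*?)(\.[^.]*|$), which is what the description above amounts to.

```python
import re

early_stop = re.compile(r'(.*?)(\.[^.]*|$)')

name = 'death-star.plans.htm'
m = early_stop.match(name)

print(m.groups())      # ('death-star', '.plans')
print(m.group(0))      # death-star.plans
print(name[m.end():])  # .htm is left over, outside the match
```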
We’re getting close, but consider how we’ve changed the expression. With the first alternative in the second group, we’re now allowing the match to finish before the end of a filename if there is more than one dot. The lazy match will still only get us to the first dot, and then the negated character class will only let us match up to the second dot. We’ll need to add another $ to ensure we consider the entire filename:
(.*?)(\.[^.]*$|$) # both alternatives must reach the end
Now the engine will either place the entire filename into our first capturing group, \1, in the case where a filename has no dots, or will keep moving past dots until it finds one that has no dots after it, placing everything to the left of that last dot in \1, and the remainder (extension) in \2.
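A quick sketch to confirm that behavior, assuming the pattern at this stage is (.*?)(\.[^.]*$|$):

```python
import re

anchored = re.compile(r'(.*?)(\.[^.]*$|$)')

names = ('death-star-plans.htm', 'death-star.plans.htm', 'bespin-plans')

# Map each sample name to its (\1, \2) captures.
splits = {name: anchored.match(name).groups() for name in names}

for name, groups in splits.items():
    print(name, '->', groups)
```

The two-dot name now splits at the last dot, giving ('death-star.plans', '.htm').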
But we’re still not quite there, or at least I’m not. With this regex, we’re matching Python’s os.path.splitext() function blow-for-blow for ordinary filenames. (Well, for several that I tested.) But for filenames that start with a dot (hidden files in GNU/Linux), the regex treats the entire filename as an extension. (Current versions of os.path.splitext() ignore a leading dot and keep it in the root, so '.bashrc' comes back as ('.bashrc', '').)
For my purposes, I’d rather treat that dot specially. I want a dot at the start of a filename to be considered part of the filename base. For that we only have to change our lazy clause from .*? to .+?. Now the first group always has to match at least one character. If that character is a dot, it can no longer be matched by the second group, and if there are no other dots, the first group will match everything up to the $ in the second group, making the whole hidden filename part of the base.
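The difference between .*? and .+? only shows up on names with a leading dot. A small sketch; the hidden filenames here are just examples I picked:

```python
import re

star = re.compile(r'(.*?)(\.[^.]*$|$)')  # first group may be empty
plus = re.compile(r'(.+?)(\.[^.]*$|$)')  # first group needs one character

star_hidden = star.match('.bashrc').groups()
plus_hidden = plus.match('.bashrc').groups()
plus_hidden_ext = plus.match('.vimrc.bak').groups()

print(star_hidden)      # ('', '.bashrc')
print(plus_hidden)      # ('.bashrc', '')
print(plus_hidden_ext)  # ('.vimrc', '.bak')
```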
Here’s the final pattern, elaborated in verbose mode (?x):
(?x)          # verbose/comment mode
(.+?)         # lazy match capture
              # start of filename into \1
(             # capture extension into \2
  \.[^.]*$    # last dot to end
  |           # or
  $           # forces \1 match to end if no dot
)             #
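To close the loop on the original task (inserting text before the extension), here is one way the verbose pattern might be used from Python. This is a sketch, not necessarily how I did it at the time: it passes re.VERBOSE as a flag rather than using an inline (?x), and insert_before_extension is a hypothetical helper name.

```python
import re

FINAL = re.compile(r"""
    (.+?)          # lazy match: start of filename into group 1
    (              # capture extension into group 2
      \.[^.]*$     # last dot to end
      |            # or
      $            # forces group 1 to run to the end if no dot
    )
""", re.VERBOSE)

def insert_before_extension(filename, text):
    """Return filename with text placed between base and extension."""
    m = FINAL.match(filename)
    if m is None:  # e.g. an empty filename
        return filename
    return m.group(1) + text + m.group(2)

print(insert_before_extension('death-star-plans.htm', '-v2'))  # death-star-plans-v2.htm
print(insert_before_extension('bespin-plans', '-v2'))          # bespin-plans-v2
```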
Not that this particular regex is “all that,” but regular expressions are so beautiful. So terse and powerful. I really enjoyed working out this pattern. (I have to confess I didn’t follow the route above for myself. My discovery process was more random. This just seemed like a plausible sequence of events.)
I’m placing the regex and associated (?x) comments into the public domain. It feels silly to say that — it’s such a small, simple thing — but I know when I think about using snippets from the web or from books, I worry someone might claim some kind of “ownership” over them. Especially for simple patterns where there aren’t a lot of alternatives. Once you see it done one way that makes sense, you shouldn’t have to make some trivial change to be different. It’s like math or facts — no one should get to claim “ownership” of something functional like this.
As for this post as a whole, however, it is licensed under the Creative Commons Share-Alike License. If you want to redistribute the tutorial or parts of it, please share and share-alike.