How to Write a RegEx Worthy of Passing Code Review | by Matthew Cannalte | Apr, 2022

Let’s make RegExes cool again


How would you react to seeing a RegEx like this:

^([0-9]{5}([\s]+|,)+)*([0-9]{5})$

in a code review?

Is it clear what this expression is trying to do, what it doesn’t do, or how to test it? Sadly, a typical RegEx is either so basic that it’s unhelpful, or so complicated that only the engineer who wrote it can understand it.

It doesn’t have to be this way! You can write effective regular expressions with just a little basic knowledge, and keep them readable for other engineers, even the opinionated ones who vow never to use them.

Let the following be a practical guide for understanding, using, and maintaining RegExes.

Regular expressions (RegEx) provide a concise, language-agnostic format for matching and parsing strings. In theoretical computer science, it can be shown that any finite state machine has an equivalent regular expression. Speaking practically, that means that a RegEx can represent an entire algorithm for processing data, all in a single, compact string.

Therefore, a RegEx should be treated like any other code. It should be written cleanly and readably, so that anyone else can come back and maintain it with minimal effort. Read on to see how we can apply programming best practices to RegExes (with examples).

Here’s a RegEx that will match any string containing 5 digits, and nothing else, such as a US Postal (zip) Code:

^[0-9]{5}$

Here’s what all these symbols mean:

  • ^ : matches the start of a string
  • [0-9] : matches any digit (specifically, a character in the set 0,1,2,3,4,5,6,7,8,9)
  • {5} : applies [0-9] exactly 5 times
  • $ : matches the end of a string

All together, our RegEx will match any string with exactly 5 digits, nothing before, and nothing after. Let’s run it, and compare the complexity to a homemade string-parsing solution:
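The comparison can be sketched roughly as follows. This is a minimal illustration in Python (an assumed language choice), and the names `ZIP_CODE_PATTERN`, `is_zip_regex`, and `is_zip_manual` are hypothetical, picked only for this example.

```python
import re

# Pattern from the text: start of string, exactly five digits, end of string.
ZIP_CODE_PATTERN = re.compile(r"^[0-9]{5}$")


def is_zip_regex(text: str) -> bool:
    """Return True if text is exactly five ASCII digits, using the RegEx."""
    # fullmatch is used so that a trailing newline is rejected; with
    # re.match, Python's `$` would also accept "12345\n".
    return ZIP_CODE_PATTERN.fullmatch(text) is not None


def is_zip_manual(text: str) -> bool:
    """Hand-rolled version: check the length, then check every character."""
    if len(text) != 5:
        return False
    for char in text:
        if char not in "0123456789":
            return False
    return True


if __name__ == "__main__":
    # Print both results side by side for a few hypothetical inputs.
    for candidate in ["12345", "1234", "123456", "12a45", ""]:
        print(repr(candidate), is_zip_regex(candidate), is_zip_manual(candidate))
```

Even for a pattern this small, the hand-written version needs a length check, a loop, and a character test, while the RegEx states the same rule in one line.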

Seems to work. Here’s the output that I got:

Now, let’s make things interesting with a much more difficult parsing problem. Suppose we want to validate a string representing a list of zip codes, each separated by either a comma or white space. We can check whether the string matches this (hideous) RegEx:

^([0-9]{5}([\s]+|,)+)*([0-9]{5})$

Here’s what that all means:

  • ([0-9]{5}) matches a zip code (shown in example 1; parentheses handle grouping)
  • [\s]+ matches one or more white space characters like spaces, tabs, etc.
  • | matches the preceding pattern or the one that follows
  • , matches a comma
  • ([\s]+|,)+ matches one or more white spaces or a comma, one or many times (this represents a valid separator or sequence of separators)
  • ([0-9]{5}([\s]+|,)+)* matches a zip code, followed by the white-spaces-or-comma separator, zero or more times
  • ^([0-9]{5}([\s]+|,)+)*([0-9]{5})$ matches a single zip code preceded by a delimiter-separated sequence of zip codes, with nothing else before or after the whole sequence

Let’s code it out, with some tests to see if it works (we’ll come back to these tests later):
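A minimal sketch of such a check in Python might look like this. The constant name `ZIP_CODE_LIST` comes from the text below; the helper `is_zip_code_list` and the specific test strings are assumptions made for illustration.

```python
import re

# The list pattern from the text, written as a raw string so that \s
# reaches the RegEx engine unchanged.
ZIP_CODE_LIST = re.compile(r"^([0-9]{5}([\s]+|,)+)*([0-9]{5})$")


def is_zip_code_list(text: str) -> bool:
    """Return True if text is a separator-delimited list of zip codes."""
    return ZIP_CODE_LIST.fullmatch(text) is not None


# Hypothetical inputs mapped to the result we expect from the pattern.
TEST_CASES = {
    "12345": True,          # a single zip code
    "12345,67890": True,    # comma separator
    "12345 67890": True,    # single white space separator
    "12345  67890": True,   # two white space characters
    "12345, 67890": True,   # comma followed by white space
    "1234567890": False,    # no separator at all
    "123456": False,        # six digits
    "12345,": False,        # trailing separator
    "": False,              # empty string
}

if __name__ == "__main__":
    for text, expected in TEST_CASES.items():
        result = is_zip_code_list(text)
        status = "ok" if result == expected else "MISMATCH"
        print(f"{text!r:>16} -> {result} ({status})")
```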

And here’s what we get:

Cool, it works! This RegEx is concise and efficient, and as far as I can tell, it’s correct. But, ~ugh~

This expression, ^([0-9]{5}([\s]+|,)+)*([0-9]{5})$, is UGLY.

At first glance, it’s completely unclear what this is supposed to do. This would not pass a rigorous code review, in my opinion. So, let’s clean it up!

Writing a truly effective RegEx requires the same best practices as writing any effective code. Here are a few rules you may recognize, but that aren’t immediately obvious when it comes to RegExes:

DRY (don’t repeat your self)

If the same pattern shows up multiple times in your RegEx, refactor it into its own variable / constant:
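In the zip code list expression, the repeated piece is the five-digit pattern. One way to factor it out in Python is sketched below; the constant name `ZIP_CODE` is an assumption, and f-string interpolation is just one of several ways to compose the pieces.

```python
import re

# The zip code sub-pattern appears twice in the original expression,
# so it is defined once here and reused.
ZIP_CODE = r"[0-9]{5}"

# rf"..." is a raw f-string: {ZIP_CODE} is interpolated, and \s is kept
# as-is. Any literal braces written directly in an f-string would need to
# be doubled, which is why {5} lives in the plain raw string above.
ZIP_CODE_LIST = rf"^({ZIP_CODE}([\s]+|,)+)*({ZIP_CODE})$"

if __name__ == "__main__":
    print(ZIP_CODE_LIST)
    print(re.fullmatch(ZIP_CODE_LIST, "12345, 67890") is not None)
```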

Use intermediate variables for readability

Using intermediate variables is technically unnecessary, and doesn’t make code any more functional or computationally efficient. However, it can save a great deal of your most expensive resource (time) by making your code more readable. It’s already becoming obvious how our RegEx works with the following improvement:

This may seem unnecessary and pedantic, but keep in mind that many people don’t use RegExes often, so it doesn’t hurt to be excessively clear.
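As a sketch of what that could look like in Python, each piece of the expression gets its own name. Only `ZIP_CODE_LIST` is named in the text; `ZIP_CODE`, `WHITE_SPACE`, `COMMA`, `SEPARATOR`, and `ZIP_CODE_THEN_SEPARATOR` are hypothetical names chosen for this example.

```python
import re

ZIP_CODE = r"[0-9]{5}"       # exactly five digits
WHITE_SPACE = r"[\s]+"       # one or more white space characters
COMMA = r","

# A separator is white space or a comma, repeated one or more times.
SEPARATOR = rf"({WHITE_SPACE}|{COMMA})+"

# A zip code that is followed by a separator (i.e. not the last one).
ZIP_CODE_THEN_SEPARATOR = rf"{ZIP_CODE}{SEPARATOR}"

# Zero or more "zip code + separator" pairs, then one final zip code.
ZIP_CODE_LIST = rf"^({ZIP_CODE_THEN_SEPARATOR})*({ZIP_CODE})$"

if __name__ == "__main__":
    # The composed string is the same expression as the one-line version.
    print(ZIP_CODE_LIST)
    print(re.fullmatch(ZIP_CODE_LIST, "12345 67890,11111") is not None)
```

The composed string is character-for-character the same expression as before, so nothing changes at run time; only the reading experience does.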

Write tests!!

All code deserves to be tested, and RegExes are no exception. It’s always good to have near 100% branch coverage with regular code. In simple terms, every conditional (if, else, while, for, …) expression should have a test where that conditional is true, and a test where it’s false. That way, every possible “branch” of code is tested.

A similar principle applies to RegExes because they’re just compact representations of finite state machines (basically, limited branch/loop algorithms). Some examples of tokens to treat like branches for testing:

  • {5} (a.k.a. a quantifier): write a test case where there are 5 of the preceding token, as well as one where there is something other than 5 of the preceding token. For example, the test case “12345” tests a valid zip with 5 digits, and “123456” tests an invalid one with 6 digits
  • | (a.k.a. an or expression): write a test case for each of the tokens on the left and right of an or expression like this. For example, the test case “12345,67890” tests the comma token after the |, and “12345 67890” tests the white space token before the | in our ZIP_CODE_LIST expression
  • + and * (a.k.a. a one-or-many or zero-or-many expression): write test cases where there are zero, one, and multiple of the tokens before one of these expressions. For example, “1234567890”, “12345 67890”, and “12345(space)(space)67890” all test the [\s]+ piece of our expression with zero, one, and two white space tokens separating two zip codes.
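The branch-style cases listed above could be organized as plain test functions, one per token. This is a sketch under assumptions: the `matches` helper and the test function names are hypothetical, and the functions use bare `assert` statements so they can be run directly or collected by a test runner such as pytest.

```python
import re

ZIP_CODE_LIST = re.compile(r"^([0-9]{5}([\s]+|,)+)*([0-9]{5})$")


def matches(text: str) -> bool:
    """Return True if the whole string is a valid zip code list."""
    return ZIP_CODE_LIST.fullmatch(text) is not None


def test_quantifier_exactly_five_digits():
    assert matches("12345")        # exactly five digits
    assert not matches("123456")   # six digits
    assert not matches("1234")     # four digits


def test_or_expression_both_sides():
    assert matches("12345 67890")  # white space side of the |
    assert matches("12345,67890")  # comma side of the |


def test_one_or_many_white_space():
    assert not matches("1234567890")  # zero separators between two zips
    assert matches("12345 67890")     # one white space character
    assert matches("12345  67890")    # two white space characters


def test_zero_or_many_leading_zip_codes():
    assert matches("12345")              # zero "zip + separator" groups
    assert matches("12345,67890")        # one group
    assert matches("12345,67890,11111")  # two groups


if __name__ == "__main__":
    test_quantifier_exactly_five_digits()
    test_or_expression_both_sides()
    test_one_or_many_white_space()
    test_zero_or_many_leading_zip_codes()
    print("all RegEx branch tests passed")
```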
