Pattern Matching in Ruby with Regex Look-arounds

My first encounter with regular expressions, or regex, felt a bit like magic. I had just started diving into programming and was looking for a way to remove all punctuation from a string. A StackOverflow page recommended using a regex that looked something like this:

/[!@#$%^&*()-=_+|;':",.<>?']/

I used the code as prescribed, and voilà, all punctuation disappeared. How the mashup of characters worked was a mystery to me, but it did the job.

Viewing something as magic, though, means most of its usefulness remains hidden, and therefore, out of reach. A recent Ruby project brought me back to regexes and helped me pull back the curtain on how they work and why learning even just the basics can be incredibly useful.

In this post, we’ll explore what a regex is and how to apply a special regex tool called look-arounds when working in Ruby.

What is a regex?

Regular expressions, or regexes, are powerful tools for pattern matching in strings. Regexes can be used in most programming languages, from JavaScript to Python, though syntax can vary a bit.

A regex is written as a set of characters that represents a pattern you’re seeking to match within a string. You’ll often see a Ruby regex written with forward slashes like this, /your_pattern/, though you can also use %r{…} literals or the Regexp.new('pattern') constructor. In this post, we’ll use the /…/ notation.
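As a quick sketch of those three notations side by side, the snippet below builds the same pattern each way. The variable names and the sample sentence are only for illustration.

```ruby
# Three ways to build the same pattern in Ruby.
slash_form   = /donut/
percent_form = %r{donut}
constructor  = Regexp.new('donut')

sentence = "A chocolate donut with sprinkles, please."

# Each object holds the same pattern source...
sources = [slash_form, percent_form, constructor].map(&:source).uniq
p sources                        # prints ["donut"]

# ...and each one finds the word in the sentence.
p sentence.match?(constructor)   # prints true
```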

When are regexes used?

Regexes are used with strings and can be especially handy when you need to perform actions like input validation or parsing. For example, regexes can help you:

  • Ensure users enter valid input types in fields
  • Remove commas from all number instances
  • Make sure all zip codes have 5 digits
  • Pull key value pairs out of a complex string
  • Split a mailing address into its individual components
  • Find and replace instances of a word
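To make a few of these concrete, here is a minimal sketch of three items from the list. The sample strings and variable names are made up for illustration, and a real validation rule would likely need more care than this.

```ruby
# Remove commas from a number written as a string.
population = "1,250,000".gsub(/,/, "")
p population        # prints "1250000"

# Check that a zip code is exactly 5 digits, from start (\A) to end (\z).
five_digit_zip = /\A\d{5}\z/
zip_valid   = "90210".match?(five_digit_zip)
zip_invalid = "9021".match?(five_digit_zip)
p zip_valid         # prints true
p zip_invalid       # prints false

# Find and replace every instance of a word.
menu = "squid salad, grilled squid"
updated_menu = menu.gsub(/squid/, "donut")
p updated_menu      # prints "donut salad, grilled donut"
```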

In my recent project, regexes provided me with a way to clean up and format data from a JSON file that was littered with HTML tags. We’ll walk through one example from this project a bit later.

How do you create a regex?

The simplest regexes contain just one pattern. The regex /e/ would match any instance of the letter ‘e’ inside a string.

It can be helpful to think of more complex regexes as compositions of these simple patterns, especially when initially learning how to write or read them.

At first glance, regexes can look like a random mishmash of characters, but the specific arrangement of each character matters. For example, inside square brackets, a period represents a literal period, while a period outside square brackets represents any single character (by default, any character except a newline).

As an example, in the string:

“A chocolate donut with sprinkles, please.”

/[.]/ only matches the . at the end of the string while /./ matches any single character in the string, from the spaces between words to the comma or period.
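You can check this yourself with #scan, which returns every match in an array. This is only a sketch, and the variable names are mine:

```ruby
order = "A chocolate donut with sprinkles, please."

# Inside square brackets, the period is a literal period.
literal_periods = order.scan(/[.]/)
p literal_periods                          # prints ["."]

# Outside square brackets, the period matches any single character.
# This string has no newlines, so every character is matched.
any_characters = order.scan(/./)
p any_characters.length == order.length    # prints true
```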

As you learn what characters mean in various regex arrangements, you’ll have an easier time building and reading regexes. To learn the basic building block characters and arrangements, check out these helpful resources:

As you build your regex toolbox, you’ll find tools to help you match all kinds of patterns, including patterns:

  • At the beginning or end of lines or strings (anchors)
  • That occur a certain number of times (quantifiers)
  • Based on a certain range of characters, like 0–5 (ranges)
  • That don’t include a specific character or sequence (negation)
  • That you can capture by name and turn into key-value pairs (named captures)
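Here is a small sketch with one example per item in the list above. The recipe string and the variable names are invented for illustration.

```ruby
recipe = "Bake for 20 min at 350 degrees"

# Anchors: \A pins the pattern to the start of the string.
starts_with_bake = recipe.match?(/\ABake/)
p starts_with_bake   # prints true

# Quantifiers: {3} asks for exactly three digits in a row.
three_digits = recipe[/\d{3}/]
p three_digits       # prints "350"

# Ranges: [0-5] matches any single digit from 0 through 5.
low_digits = recipe.scan(/[0-5]/)
p low_digits         # prints ["2", "0", "3", "5", "0"]

# Negation: [^a-z ] matches anything that is not a lowercase letter or a space.
not_lowercase = recipe.scan(/[^a-z ]/)
p not_lowercase      # prints ["B", "2", "0", "3", "5", "0"]

# Named captures: (?<name>...) labels a group so it can be read back by name.
details = recipe.match(/(?<minutes>\d+) min at (?<temperature>\d+)/)
p details.named_captures   # prints {"minutes"=>"20", "temperature"=>"350"}
```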

In Ruby, you can utilize regexes with methods like #split, #scan, #match? and #gsub to extract, edit, and otherwise manipulate strings.
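As a rough sketch of those four methods on one made-up string (the variable names are only for illustration):

```ruby
ingredients = "flour, sugar, 2 eggs, 1 cup milk"

# #split breaks the string apart wherever the pattern matches.
items = ingredients.split(/,\s*/)
p items          # prints ["flour", "sugar", "2 eggs", "1 cup milk"]

# #scan collects every match into an array.
amounts = ingredients.scan(/\d+/)
p amounts        # prints ["2", "1"]

# #match? returns true or false without building a match object.
has_eggs = ingredients.match?(/eggs/)
p has_eggs       # prints true

# #gsub replaces every match with the given text.
vague = ingredients.gsub(/\d+/, "some")
p vague          # prints "flour, sugar, some eggs, some cup milk"
```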

Look-aheads and look-behinds

Syntax Overview and Application

Look-aheads and look-behinds (or collectively, look-arounds) work as you might expect, allowing you to match portions of a string based on what comes before or after the section you want to select. They can be either positive or negative.

A positive look-ahead allows you to match a portion of a string based on a pattern that comes immediately after it in the full string. The regex ‘looks ahead’ to check what comes next. It’s written with the following syntax:

(?=your_pattern)  #positive look-ahead

A positive look-behind allows you to match a portion of a string by identifying a pattern that comes immediately before it in the full string. The regex ‘looks behind’ to check what came previously. It’s written with the following syntax:

(?<=your_pattern)  #positive look-behind

To negate a look-around and find matches that are not immediately preceded or followed by a specific pattern, replace = with ! in the syntax.

A negative look-ahead helps you match a portion of a string by checking to make sure a specific pattern does not follow it in the full string.

(?!your_pattern)  #negative look-ahead

A negative look-behind helps you match a portion of a string by checking to make sure a specific pattern does not come before it in the full string.

(?<!your_pattern)  #negative look-behind
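Here is a minimal sketch of all four look-arounds run against one made-up string. The string and variable names are assumptions for illustration. Note that the two negative examples add word boundaries, \b, so the regex can’t satisfy the condition by matching only part of a number.

```ruby
receipt = 'cost $30, tax $5, 12 items, 7 boxes'

# Positive look-ahead: digits followed by " items".
item_counts = receipt.scan(/\d+(?= items)/)
p item_counts    # prints ["12"]

# Positive look-behind: digits preceded by a dollar sign.
dollar_amounts = receipt.scan(/(?<=\$)\d+/)
p dollar_amounts # prints ["30", "5"]

# Negative look-ahead: whole numbers NOT followed by " items".
not_items = receipt.scan(/\b\d+\b(?! items)/)
p not_items      # prints ["30", "5", "7"]

# Negative look-behind: whole numbers NOT preceded by a dollar sign.
not_dollars = receipt.scan(/(?<!\$)\b\d+/)
p not_dollars    # prints ["12", "7"]
```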

The pattern you use in your positive or negative look-around is not included in the match. Let’s see how this works with an example. Suppose you want to collect only the numbers associated with minutes in the string below:

"Add in the remaining ingredients and mix for 2 min on high. Pour into a 13x9 pan and bake for 20 min on 350°."

The following regex uses a positive look-ahead. It matches one or more digits, \d+, only when they are immediately followed by a space, \s, and the characters min. This matches 2 and 20, but not 13, 9, or 350. Because the look-ahead pattern is not part of the match, only the numbers “2” and “20” will be included as matches, not “2 min” or “20 min”.

/\d+(?=\smin)/
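To see the difference the look-ahead makes, this sketch runs the regex against the recipe string with #scan, then runs a version without the look-ahead for comparison. The variable names are mine.

```ruby
instructions = "Add in the remaining ingredients and mix for 2 min on high. " \
               "Pour into a 13x9 pan and bake for 20 min on 350°."

# With the look-ahead, " min" is checked but left out of the match.
minutes = instructions.scan(/\d+(?=\smin)/)
p minutes            # prints ["2", "20"]

# Without the look-ahead, " min" becomes part of each match.
with_units = instructions.scan(/\d+\smin/)
p with_units         # prints ["2 min", "20 min"]
```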

Look-arounds in action

To illustrate how look-arounds can work with Ruby methods, we’ll walk through an example from the project I mentioned earlier that uses the method #scan.

#scan can be used with or without a block, but we’ll be using it without a block in our example. Used without a block, #scan will pull all matches and store them in an array. If no matches are found (and no block is passed), #scan will return an empty array.

your_string.scan(/regex/) #basic syntax without a block
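A short sketch of both forms, using a made-up string. In the block form, each match is passed to the block instead of being collected into a returned array.

```ruby
timing = "2 min, 20 min"

# Without a block: every match is returned in an array.
found = timing.scan(/\d+/)
p found        # prints ["2", "20"]

# Without a block and with no matches: an empty array.
nothing = timing.scan(/z/)
p nothing      # prints []

# With a block: each match is yielded to the block in turn.
as_integers = []
timing.scan(/\d+/) { |digits| as_integers << digits.to_i }
p as_integers  # prints [2, 20]
```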

Using #scan to pull matches

One challenge I ran into with the dataset used in my project involved pulling the ‘alias’ names of various fish species from hashes that looked like the example below (the desired matches here are Illex squid and Summer squid).

alias_hash = {"Species Aliases"=>"<a href=\"/species-aliases/illex-squid\" typeof=\"skos:Concept\" property=\"rdfs:label skos:prefLabel\" datatype=\"\">Illex squid</a>, <a href=\"/species-aliases/summer-squid\" typeof=\"skos:Concept\" property=\"rdfs:label skos:prefLabel\" datatype=\"\">Summer squid</a>"}

I wanted to extract the alias names and store the values in an array, but they were surrounded by HTML tags. #scan paired with two look-around regexes provided a solution:

alias_hash["Species Aliases"].scan(/(?<=">).+?(?=<)/)
=> ["Illex squid", "Summer squid"]

How does this regex work? Let’s break it down:

  • (?<=">).+ is a look-behind followed by .+, which matches one or more characters. The look-behind requires the match to start immediately after the characters ">, which only appear before the alias names in the string. Because . matches spaces as well as letters, we can select alias names that contain more than one word.
  • (?=<) is a look-ahead that ends the match at a character that’s immediately followed by a < sign.
  • The ? added after .+ to form .+? is not the ‘zero or one’ quantifier here. Placed directly after another quantifier, it makes that quantifier lazy, so .+? matches as few characters as possible and stops at the first < it reaches. Without it, the greedy .+ would match from the character following the first instance of "> all the way to the last < in the string. With it, the regex produces a separate match for each alias name.
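To compare .+ and .+? directly, this sketch runs both versions of the regex on a shortened, made-up version of the string (the href values here are placeholders, not the real data):

```ruby
links = '<a href="/x" datatype="">Illex squid</a>, ' \
        '<a href="/y" datatype="">Summer squid</a>'

# Greedy .+ runs from the first "> to the last < in the string.
greedy = links.scan(/(?<=">).+(?=<)/)
p greedy   # prints ["Illex squid</a>, <a href=\"/y\" datatype=\"\">Summer squid"]

# Lazy .+? stops at the first < after each ">.
lazy = links.scan(/(?<=">).+?(?=<)/)
p lazy     # prints ["Illex squid", "Summer squid"]
```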

Using regexes in your own projects

When browsing a bookstore recently, I found a regex textbook as thick as any Harry Potter novel. The power of regexes extends well beyond what we covered in this post, but you’ve already pulled back the curtain a bit by reading this and have deepened your understanding of them.

As you work on regexes in your own projects, check out Rubular, a Ruby-based regex editor that helps you test your regexes. There are similar editors for other languages, like Scriptular for JavaScript.

If you’ve found any tools or resources especially helpful in learning about regexes, leave a note in the comments for others!

Software engineer interested in the intersection of tech, design+art, and social innovation