Deep learning has gotten a lot easier ever since Keras came on the scene a few years ago (you should be using tf.keras). spaCy has made NLP a breeze, and the similar-sounding Scrapy even lets you assemble your own datasets. Then there are the corporate players like Apple’s Turi Create and Google’s TensorFlow 2. You also can’t forget scikit-learn, which everyone uses. Then there’s good old NLTK and OpenCV’s Python API. Resting beneath all of these tools are NumPy and SciPy. And who could live without pandas, or maybe PySpark if you’re working with huge datasets.

But there’s one library that’s more useful than any of these. It gets used when we’re collecting data sets. It’s everywhere when we’re cleaning and munging data. It rears its head during exploratory analysis and modeling. It even comes into play when we create visualizations. This is Python’s regular expression library, re, and it’s fundamental to every AI system and process.

A Story from Siri

I used to work on the Siri team at Apple. Siri is a huge project with lots of moving pieces, divided across several teams. When Siri produces speech, an appropriate response text (like “hello, I’m Siri”) is first selected and then transformed into synthesized speech. You might think that most of the text-to-speech process involves feeding the text into a cool deep learning model that produces a synthetic response. That’s usually not what happens. More often, a regular expression written by an engineer like me is used to select an appropriate response from a database of pre-recorded speech that we had a voice talent (the “real” Siri) record.

# Not actual Apple code. Please...
import re
sentence = "I hate you Siri"
reg = r'hate.*Siri'

if re.search(reg, sentence):
    # apology stands in for a hypothetical store of pre-recorded responses
    response = apology.get_random_apology()  # "sorry to hear that"

Regular expressions aren’t just critical when generating the text of Siri’s responses. They’re also used everywhere for pronunciation. How should you pronounce the word Dr in the phrase Dr Smith? Is it pronounced the same way as the Dr in Mulholland Dr? What about St Francis vs Francis St? If you were an attentive data science student, you might hope that a cool deep learning model built in TensorFlow or PyTorch could accurately predict the pronunciations of these words. But that’s not how it usually works out in real life. Data in natural language often follows a Zipfian distribution, which means there are lots of rare edge cases that are exceptions to the general rules that deep learning models learn. Unfortunately, in aggregate these exceptions occur frequently and can lead to really bad user experiences.
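
To give a flavor of what those hand-written rules look like, here’s an illustrative sketch (my own toy example, not production Siri code; the function and patterns are made up):

import re

# Toy text-normalization rule for TTS: decide how to expand "Dr" before synthesis.
# Assumption: "Dr" followed by a capitalized word is a title ("Doctor"), while
# "Dr" preceded by a lowercase word is a street ("Drive").
def expand_dr(text):
    text = re.sub(r'\bDr\b(?=\s+[A-Z])', 'Doctor', text)
    text = re.sub(r'(?<=[a-z]\s)Dr\b', 'Drive', text)
    return text

expand_dr("Dr Smith lives on Mulholland Dr")
# -> 'Doctor Smith lives on Mulholland Drive'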

Chatbot systems have dictionaries of exceptions: the things that the core model gets wrong. But once these systems are in production, it’s frequently the case that the exceptions are triggered more often than the rules! One of the earliest AIs, ELIZA, was powered entirely by regular expressions: not one hint of a convolution, dropout layer, or embedding to be found.
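
To see how far pattern-and-response rules can go, here’s a toy ELIZA-style rule (a simplified sketch of the idea, not the original ELIZA script):

import re

# Capture whatever the user says after "I feel" and reflect it back as a question.
def eliza_respond(utterance):
    match = re.search(r'\bI feel (.+)', utterance, re.IGNORECASE)
    if match:
        return f"Why do you feel {match.group(1)}?"
    return "Please, go on."

eliza_respond("I feel like nobody listens to me")
# -> 'Why do you feel like nobody listens to me?'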

Spam vs Ham

UCI’s Spambase dataset is a collection of spam and ham (non-spam) messages. You can do some featurization in sklearn and build a nifty machine learning model that will give you fairly high accuracy on the binary classification task of distinguishing spam from ham. However, you can get results that are almost as good with some intuition and a few regular expressions:

import re

class RegexModel:
    def __init__(self):
        # A few words that tend to show up in spam messages
        self.spam_re = r'free|FREE|drugs|money|viagra'

    def predict(self, message):
        if re.search(self.spam_re, message):
            return 1  # Spam
        else:
            return 0  # Ham
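
Using it takes a single call per message (the example messages here are made up):

model = RegexModel()
model.predict("You have won a FREE cruise, call now!")  # 1 (spam)
model.predict("Are we still on for lunch tomorrow?")    # 0 (ham)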

If you don’t believe me, check out this notebook where I compare the accuracy of a regex model that took 90 seconds to create and a pretty decent machine learning model that took me an entire afternoon.

Regular expression models are often a good starting point for any natural language task. Sometimes they’re also the ending point. There have been times in my career when I’ve created a simple rule-based model, presented it to my boss as a “first pass at the problem, ya know, a placeholder”, and been surprised to hear that it achieved good-enough results to go into production. Time to move on to the next project!

Odds and Ends

In the last two sections I showed that regular expressions are critical to the functioning of the world’s second-best virtual assistant (after HAL, of course) and can get pretty decent results on common ML tasks like classification. But regular expressions are used everywhere else in the AI world as well: have some messy data in a directory and want only the CSV files? Throw a regex into the mix (or maybe check out Python’s glob library too). Want to know the dates of all the medical documents you’re OCR-ing? Write a regex to match and extract the dates. Have some strings you want to match but are okay with some fuzziness? Use re (but fuzzywuzzy is cool too).
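
Here’s a rough sketch of the first two of those ideas (the directory name, filenames, and date format are invented for illustration):

import os
import re

# Keep only the CSV files in a (hypothetical) messy data directory.
csv_files = [f for f in os.listdir("data") if re.search(r'\.csv$', f)]

# Pull ISO-style dates (e.g. 2020-01-31) out of OCR'd document text.
ocr_text = "Patient seen on 2020-01-31; follow-up scheduled for 2020-02-14."
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', ocr_text)
# -> ['2020-01-31', '2020-02-14']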

Sometimes the simplest tools are the most powerful. Most software developers and data scientists are familiar with regular expressions, but regexes are under-utilized. Instead of reaching for a complex tool that might be difficult to apply and prone to failure, stick with the simple methods that are proven to work. Hopefully this blog post has made you more interested in the re library.