RegEx in ObjC: The Memoirs

by Louis Tur

Humans are great at finding patterns in their environment, even to a fault. There are far too many examples of our failures to list them all, but consider the mythical "hot streak" in sports, our tendency towards confirmation bias, the ambiguity experienced in pattern recognition, Kanizsa's Triangle, or illusory correlation (i.e. "correlation does not imply causation"). (Image from xkcd)

But to keep this in perspective: pattern recognition is an important advantage in human development and cognition. The ability to process large datasets and then generate "best guesses" in nearly an instant has very likely been a huge evolutionary advantage. And now that we have moved much of our data processing to computers, we need to be able to tell a computer to search for patterns as well as we can.

And so, regular expression syntax was developed in an effort to create a flexible way for a computer to parse strings of data. But as I found, and as quite possibly everyone who has ever had to deal with regex has found as well, translating human pattern recognition into programmatic pattern recognition has got to be one of the most frustrating exercises you could put yourself through.

Don't parse HTML with Regex?

So, apparently this is a thing: you're really not supposed to parse an HTML page using regex. But when the goal is to search for YouTube links on a webpage, you're not left with many options. So, you just try to make the experience as non-terrible as possible.

There were a few tools at my disposal that took the journey from completely impossible to probable, albeit with a high degree of error and difficulty. One of these was Regex101, which not only lets you see where a regex pattern matches, but also shows the step-by-step process of why it is failing in other cases! As a diehard believer in reverse-engineering as a learning tool, I found Regex101 to be the single most important resource I came across in my two weeks of mandatory regex hell. On top of that, Regex101 also lets you save and version control your regular expressions, AND it lets you write unit tests to verify your expressions.

Also somewhat important was Patterns for Mac ($3, Mac App Store), a regex code generator that takes your expression and converts it into ObjC code. Definitely worth the price just to make sure you're using NSRegularExpression correctly.

Lastly, the Hpple pod (an ObjC wrapper on the XPathQuery library) simplified the parsing process by first letting me filter just the a and iframe tags of a page's HTML into an NSArray. Knowing that I would only be dealing with the values stored in href and src attributes made it possible to write much simpler regular expressions.


To give you some perspective, here's the regular expression I ultimately used to check for valid YouTube and Vimeo URLs (found here):

What in the hell is it doing? At a very high level, it's looking for one of the 14 or so ways a YouTube URL can appear. And when it finds a match, you can access the extracted video ID of the URL from the second capture group.

I did, however, write out a long-form explanation of what this is doing as well (thanks to some scaffolding from this Stack Overflow answer).

Using NSRegularExpression didn't end up being all that bad after some trial and error.

An instance of NSRegularExpression (regexParser) is created to manage the parsing process. This instance gets initialized with the regex pattern to be used, along with some NSRegularExpressionOptions.
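As a rough illustration of that setup, here is a minimal sketch. The pattern is a hypothetical, heavily simplified stand-in (it only knows the youtube.com/watch?v= and youtu.be/ forms and assumes 11-character IDs), not the expression discussed in this post, and the helper name MakeRegexParser is made up for the example.

```objc
#import <Foundation/Foundation.h>

// ASSUMPTION: a simplified, hypothetical pattern for illustration only.
// It matches youtube.com/watch?v=<id> and youtu.be/<id>, nothing else.
static NSString * const kSimplifiedYouTubePattern =
    @"(?:youtube\\.com/watch\\?v=|youtu\\.be/)([A-Za-z0-9_-]{11})";

// Builds the parser once so it can be reused for every candidate link.
// Returns nil (and logs the reason) if the pattern fails to compile.
static NSRegularExpression *MakeRegexParser(void) {
    NSError *error = nil;
    NSRegularExpression *regexParser =
        [NSRegularExpression regularExpressionWithPattern:kSimplifiedYouTubePattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regexParser == nil) {
        NSLog(@"Could not compile pattern: %@", error.localizedDescription);
    }
    return regexParser;
}
```

Passing an NSError pointer is worth the extra two lines: a mistyped escape in the pattern otherwise just hands you back nil with no explanation.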

As mentioned, prior to this method call I used Hpple to get only the values contained in <a> and <iframe> tags. Since each of these tags includes at most one link, -[NSRegularExpression firstMatchInString:options:range:] satisfies my need (there are other methods, such as matchesInString:options:range:, that make it possible to locate multiple matches).
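To make the difference between the single-match and multiple-match methods concrete, here is a small sketch. The URL pattern is deliberately naive and the helper names (NaiveLinkRegex, FirstLink, CountLinks) are hypothetical, chosen only for this example.

```objc
#import <Foundation/Foundation.h>

// ASSUMPTION: a naive, hypothetical "anything that looks like a URL" pattern.
static NSRegularExpression *NaiveLinkRegex(void) {
    return [NSRegularExpression regularExpressionWithPattern:@"https?://[^\\s\"'<>]+"
                                                     options:0
                                                       error:NULL];
}

// Returns the first link found in `text`, or nil if there is none.
static NSString *FirstLink(NSString *text) {
    if (text.length == 0) { return nil; }
    NSTextCheckingResult *first =
        [NaiveLinkRegex() firstMatchInString:text
                                     options:0
                                       range:NSMakeRange(0, text.length)];
    return first ? [text substringWithRange:first.range] : nil;
}

// Counts every link in `text` using the multiple-match variant.
static NSUInteger CountLinks(NSString *text) {
    if (text.length == 0) { return 0; }
    NSArray<NSTextCheckingResult *> *matches =
        [NaiveLinkRegex() matchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
    return matches.count;
}
```

When each input string holds at most one link, the first-match call does less work and saves you from unpacking an array you know has one element.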

firstMatchInString:options:range: returns an NSTextCheckingResult, which contains the range(s) of the string that match the regex pattern. If you look at the explanation of the regex I used, you'll notice that I have multiple capture groups. You can think of a capture group as a regex query inside of a regex query (yo dawg, I heard you liked regex queries...). Capture groups allow you to capture a string that matches some regex pattern and then to use that captured string in another part of the regex. The ranges kept by NSTextCheckingResult are essentially these capture groups: the range at index 0 covers the entire match, and the capture groups follow from index 1. I arranged it such that the second capture group in my search results would contain the video ID I was looking to test (I also did this for a Vimeo-compatible version of the regex). If that range did not exist, I would send back an empty string (meaning no valid pattern had been found).
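Pulling a captured substring out of the result might look something like the sketch below. It assumes a hypothetical, simplified pattern with a single capture group, so the ID sits at range index 1 here; the function name VideoIDFromLink is also invented for the example, and the real expression keeps its ID in a different group.

```objc
#import <Foundation/Foundation.h>

// Returns the video ID captured from `link`, or @"" when nothing matches.
static NSString *VideoIDFromLink(NSString *link) {
    if (link.length == 0) { return @""; }

    // ASSUMPTION: simplified, hypothetical pattern with ONE capture group
    // and 11-character IDs. Not the expression described in the post.
    NSRegularExpression *regexParser =
        [NSRegularExpression regularExpressionWithPattern:
            @"(?:youtube\\.com/watch\\?v=|youtu\\.be/)([A-Za-z0-9_-]{11})"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];

    NSTextCheckingResult *match =
        [regexParser firstMatchInString:link
                                options:0
                                  range:NSMakeRange(0, link.length)];
    if (match == nil || match.numberOfRanges < 2) { return @""; }

    // Range 0 is the whole match; capture groups start at index 1.
    NSRange idRange = [match rangeAtIndex:1];
    // A group that did not take part in the match reports NSNotFound.
    if (idRange.location == NSNotFound) { return @""; }

    return [link substringWithRange:idRange];
}
```

Returning an empty string rather than nil keeps the calling code simple: the caller can check the length once and skip anything that did not produce an ID.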

Where it goes from there

Now armed with a list of (potential) YouTube video IDs, I do one final, definitive check using the GoogleAPI. I query the API for information regarding the video ID I pass in. If I receive a 200 response, I store information about the video (thumbnail URL, length, title, etc.) and present it to the user in a UITableView.

But don't just take my word for it: the (very well documented) repo is found here: TubulrShareExtension.

Louis Tur

"How" has been the single most used word in my literary arsenal for as long as I can remember. I've never really been satisfied knowing that something works, but only by knowing how it works.
