Humans are great at finding patterns in their environment, even to a fault. There are far too many examples of our failures to list them all, but consider the mythical "hot streak" in sports, our tendency towards confirmation bias, the ambiguity experienced in pattern recognition, Kanizsa's Triangle, or illusory correlation (i.e. "correlation does not imply causation"). (Image from xkcd)
But to keep this in perspective: pattern recognition ends up being an important advantage in human development and cognition. The ability to process large datasets and then generate "best guesses" in nearly an instant has been a huge evolutionary advantage. And now that we have moved much of our data processing to computers, we need to be able to tell a computer to search for patterns as well as we can.
And so, regular expression syntax was developed in an effort to create a flexible way for a computer to parse strings of data. But as I found, and quite possibly everyone who has ever had to deal with regex has as well, translating human pattern recognition into programmatic pattern recognition has got to be one of the most frustrating exercises you could put yourself through.
Don't parse HTML with Regex?
So, apparently this is a thing: you're really not supposed to parse an HTML page using regex. But when the goal is to search for YouTube links on a webpage, you're not left with many options. So, you just try to make the experience as non-terrible as possible.
There were a few tools at my disposal that took the journey from completely impossible to possible, with a high degree of error and difficulty. One was Regex101, which not only lets you see where a regex pattern matches, but also walks through, step by step, why it fails in other cases! As a diehard believer in reverse-engineering as a learning tool, I found this made Regex101 the single most important resource I came across in my two weeks of mandatory regex hell. On top of that, Regex101 also lets you save and version control your regular expressions, AND it lets you write unit tests to verify your expressions.
Also somewhat important was Patterns for Mac ($3, Mac App Store), a regex code generator that takes your expression and converts it into ObjC code. Definitely worth the price just to make sure you're using
Lastly, the Hpple pod simplified the parsing process by first letting me filter out just the <a href> and <iframe> tags of a page's HTML into an NSArray (it's an ObjC wrapper on the XPathQuery library). Knowing that I would only be dealing with the values stored in src tags made it possible to write much simpler regular expressions.
To give you some perspective, here's the regular expression I ultimately used to check for valid YouTube and Vimeo URLs (found here):
What in the hell is it doing? At a very high level, it's looking for one of the 14 or so ways a YouTube URL can appear. When it finds a match, you can read the extracted video ID of the URL from the second capture group.
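To make the "second capture group" idea concrete, here is a minimal sketch. The pattern below is a hypothetical, heavily simplified stand-in that only covers a handful of URL forms; it is not the regex I actually used. The names kSketchYouTubePattern and SketchCaptureGroups are made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical, simplified pattern (an assumption, NOT the regex from the post).
// Group 1: which URL form matched. Group 2: the video ID.
static NSString * const kSketchYouTubePattern =
    @"(youtu\\.be/|youtube\\.com/(?:watch\\?v=|embed/|v/))([A-Za-z0-9_-]+)";

// Returns every captured group of the first match (index 0 is the whole match),
// or an empty array when the input does not match at all.
static NSArray *SketchCaptureGroups(NSString *input) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:kSketchYouTubePattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return @[];
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:input
                          options:0
                            range:NSMakeRange(0, input.length)];
    if (match == nil) {
        return @[];
    }

    NSMutableArray *groups = [NSMutableArray array];
    for (NSUInteger i = 0; i < match.numberOfRanges; i++) {
        NSRange range = [match rangeAtIndex:i];
        // A group that did not take part in the match reports NSNotFound.
        if (range.location == NSNotFound) {
            [groups addObject:@""];
        } else {
            [groups addObject:[input substringWithRange:range]];
        }
    }
    return groups;
}
```

Calling SketchCaptureGroups(@"https://youtu.be/dQw4w9WgXcQ") gives back three strings: the whole match, the URL form, and the video ID at index 2.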
NSRegularExpression didn't end up being all that bad after some trial and error.
An instance of NSRegularExpression (regexParser) is created to manage the parsing process. This instance gets initialized with the regex pattern to be used, along with some
As mentioned, prior to this method call I used Hpple to only get values contained in <iframe> tags. Since these tags include at most one link each, using -[NSRegularExpression firstMatchInString:] satisfies my need (there are other methods that make it possible to locate multiple matches).
firstMatchInString: returns an NSTextCheckingResult, which contains the range(s) of the string that match the regex pattern. If you look at the explanation of the regex I used, you'll notice that I have multiple capture groups. You can think of a capture group as a regex query inside of a regex query (yo dawg, I heard you liked regex queries...). Capture groups allow you to capture a string that matches some regex pattern and then use that captured string in another part of the regex. The ranges kept by NSTextCheckingResult correspond to these capture groups, by index. I arranged it such that the second capture group in my search results would contain the video ID I was looking to test (I also did this for a Vimeo-compatible version of the regex). If the index did not exist, I would send back an empty string (meaning no valid pattern had been found).
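The flow described above can be sketched roughly like this. It's a minimal illustration, not the code from the repo: the helper name SketchFirstCapture and its parameters are hypothetical, and the pattern is passed in rather than hard-coded so the same helper could serve both a YouTube and a Vimeo pattern.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the repo): returns the text of capture group
// `groupIndex` from the first match of `pattern` in `input`, or an empty
// string when there is no match or the group does not exist.
static NSString *SketchFirstCapture(NSString *input,
                                    NSString *pattern,
                                    NSUInteger groupIndex) {
    NSError *error = nil;
    NSRegularExpression *regexParser =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regexParser == nil) {
        return @""; // The pattern itself failed to compile.
    }

    // Only the first match is needed, since each value holds at most one link.
    NSTextCheckingResult *match =
        [regexParser firstMatchInString:input
                                options:0
                                  range:NSMakeRange(0, input.length)];
    if (match == nil || groupIndex >= match.numberOfRanges) {
        return @""; // No match, or the pattern has no such group.
    }

    NSRange range = [match rangeAtIndex:groupIndex];
    if (range.location == NSNotFound) {
        return @""; // The group exists but did not take part in this match.
    }
    return [input substringWithRange:range];
}
```

An empty return value then means "no valid pattern found", so the caller only has to check the length of the result before moving on.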
Where it goes from there
Now armed with a list of (potential) YouTube video IDs, I do one final, definitive check using the Google API. I query the API for information about the video ID I pass in. If I receive a 200 response, I store information about the video (thumbnail URL, length, title, etc.) and present it to the user in a
But don't just take my word for it. The (very well documented) repo is found here: TubulrShareExtension.