1.0 Introduction
1.1 Text: Characters, Strings, and Words... Oh My!
2.0 Getting Started
2.1 Delimiting "Words": Space Marks vs. Metasequences
2.2 "Or": Disjunction with Parentheses and Vertical Bars
2.3 Pick A Character, But Not Just Any Character: Square Brackets
2.4 Making Literals out of Metacharacters: The Backslash (Escape Character)
2.5 Negation: The Caret in Square Brackets
2.6 Line Starts and Ends: Carets and Dollars
2.7 Summary
3.0 Quantifiers
3.1 Match Many Times or Not at All: The Asterisk
3.2 Match One Time or Not at All
3.3 Match One or More Times: The Plus Mark
3.4 Getting More Specific: Minimums and Maximums
3.5 Summary of Quantifiers
4.0 Some Useful POSIX Conventions
5.0 Conclusion and Summary
Before starting, it will help to define a few terms. Initially, definitions will be given somewhat loosely to give the reader the vocabulary necessary to understand the explanations given in this chapter. Subtleties and nuances will be drawn out and discussed where necessary.
The salmon-falls, the mackerel-crowded seas,
Fish, flesh, or fowl, commend all summer long
Whatever is begotten, born, and dies.
How many words are there in each line? Some would reply: five, eight, and six. If we defined a word typographically, by saying that a word is everything between spaces, then these figures would indeed be accurate. But aren't the "words" salmon-falls and mackerel-crowded actually two words? The answer depends on how one delimits a word.
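The two ways of counting can be compared directly. What follows is only a rough sketch in Python (not the grep family discussed in this chapter); the function names are invented for illustration, and "an unbroken run of letters" is just one possible alternative definition of a word.

```python
import re

lines = [
    "The salmon-falls, the mackerel-crowded seas,",
    "Fish, flesh, or fowl, commend all summer long",
    "Whatever is begotten, born, and dies.",
]

def count_by_spaces(line):
    """Typographic definition: a word is whatever sits between spaces."""
    return len(line.split())

def count_by_letters(line):
    """Alternative definition: a word is an unbroken run of letters."""
    return len(re.findall(r"[A-Za-z]+", line))

space_counts = [count_by_spaces(line) for line in lines]
letter_counts = [count_by_letters(line) for line in lines]
print(space_counts)   # [5, 8, 6]
print(letter_counts)  # [7, 8, 6]
```

Only the first line changes, because it is the only one containing hyphenated compounds.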
Before diving in to the business at hand--that is, how regular expressions work--a bit more needs to be said on the subject of characters. When dealing with regular expressions, one must also carefully distinguish between two types of characters: literal characters and metacharacters. Literal characters are everything from letters and numbers to various odd punctuation marks like @ or & and they stand for themselves in regular expressions. So, cat stands for cat because all of the characters are literals. It's as simple as that. Metacharacters, however, are an entirely different kettle of fish.[1] They are special characters that stand for more than themselves. For example, a period stands for any character. The number of metacharacters in regular expressions is actually quite small compared to the overall number of characters available in ASCII. This is understandable, since the metacharacters are essentially characters that have been reserved for special use in regular expressions.
1.) a
2.) an
This search would in fact produce (or, in the jargon of regular expressions, "match") every indefinite article in our text, but it would match a hell of a lot more. (In the lingo of generative grammar, it would "overgenerate".) The reason is that we have not defined word boundaries, so these regular expressions will match every "a" and every "an", even if they are not stand-alone words but are part of other words, as in "animal", "command", etc. (Another way of saying this is that a and an are sub-words, or sub-strings, of the words, or strings, animal, command, etc.) The first improvement we will have to make on our search is to restrict it to words.
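To see the overgeneration in action, here is a small sketch in Python. The sample sentence is made up for the purpose, and Python's re module merely stands in for whatever search tool you happen to be using.

```python
import re

# Invented sample text containing two genuine indefinite articles.
text = "an animal gave the man a command"

# With no word boundaries, the patterns also match inside longer words.
hits_an = re.findall("an", text)
hits_a = re.findall("a", text)
print(len(hits_an))  # 4: the article, plus the "an" in animal, man, command
print(len(hits_a))   # 7: every single letter "a" in the sentence
```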
Defining a word for computers is not a simple business (and good linguists will tell you that it's not such a simple matter for natural language either). Fortunately, there are a number of tricks for solving the problem. We will discuss two at this stage.
Trick #1: White Space in Searches
Since most words are delimited by space marks, the space marks can be included in the regular expression itself, as in (3) and (4). (Typographical Note: Since space marks are hard to see, they will be represented by #.)
3.) #a#
4.) #an#
However, it is important to bear in mind that not all words will be preceded and followed by spaces. For example, if one searches for the word "hard" in this manner, as below,
5.) #hard#
it will not turn up when part of phrases like the following:
hard-working
hard-nosed
die-hard
hardly
hardened
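A quick way to convince yourself of this is to try it. The sketch below uses Python, with a real space wherever the text writes #; the sample phrases other than those listed above are invented.

```python
import re

# Assumption: a real space stands where the text writes "#".
spaced = re.compile(" hard ")

samples = [
    "they work hard every day",
    "a hard-working clerk",
    "a die-hard fan",
    "we hardly noticed",
    "hard times ahead",
]
spaced_hits = [s for s in samples if spaced.search(s)]
print(spaced_hits)  # ['they work hard every day']
```

Note that even "hard times ahead" is missed, since a word at the very start of a line has no space in front of it.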
Trick #2: Special Metasequences (\< and \>)
The problem with (3) through (5) is that space marks in regular expressions can be hard to see and easy to neglect, which can wreak havoc with one's searches. Fortunately, there is a convention in regular expressions to represent "words": the metasequence \< (backslash plus left-pointing angle bracket) for the start of words and the metasequence \> (backslash plus right-pointing angle bracket) for the end of words, as in (6) and (7).[2]
6.) \<a\>
7.) \<an\>
There are two important caveats to be made with regard to these metasequences. First, note that \< and \> are not metacharacters but rather metasequences, composed of a backslash followed by a left-pointing or right-pointing angle bracket. (In a sense, then, the backslash makes the angle bracket a metacharacter. Later in this chapter we will see that backslashes normally do the opposite. That is to say, they normally make metacharacters literal.) Second, notice that I have repeatedly used scare quotes when talking about the word-delimiting function of \< and \>. I have done so because what \< and \> delimit does not coincide neatly with our common-sense notion of words, which are generally delimited by space marks, and in that sense, the metasequences \< and \> are not in fact word-delimiting. What they actually delimit are continuous alphanumeric strings, which is just a concise way of saying that they mark the starts and ends of any uninterrupted sequences of letters and numbers. More concretely, the following regular expressions on the left match the strings on the right,
| \<bare\> | matches | bare-chested |
| \<dad\> | matches | dad's |
| \<1.00\> | matches | $1.00 |
This happens because \< and \> do not carve up our common-sense notion of words at the joints and can therefore cause a fair bit of confusion.[3] This can be made clearer by re-examining the case of (5), where space marks were used to delimit the word "hard", to compare the results with those that would have been obtained if we used the angle bracket metasequence in lieu of space marks. What we discover is that the angle bracket metasequences work in some cases where space marks fail. Thus, the following regular expression
8.) \<hard\>
matches
hard-working
hard-nosed
die-hard
but does not match
hardly
hardened
while (5) from above (with the word hard delimited by spaces) matches neither set. So, neither (5) nor (8) will match the last set. To match such words, the following regular expressions, which say nothing about what follows hard, will work:
9.) #hard
10.) \<hard
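The behaviour of the word metasequences can be checked with a short sketch. Be warned that Python's re module does not support \< and \>; the sketch assumes that \b (a boundary between a word character and a non-word character) is an acceptable stand-in for both, which it is for these examples.

```python
import re

# Assumption: \b is used as Python's nearest equivalent of both \< and \>.
whole_word = re.compile(r"\bhard\b")  # stands in for \<hard\>
word_start = re.compile(r"\bhard")    # stands in for \<hard

phrases = ["hard-working", "hard-nosed", "die-hard", "hardly", "hardened"]
whole_hits = [p for p in phrases if whole_word.search(p)]
start_hits = [p for p in phrases if word_start.search(p)]
print(whole_hits)  # ['hard-working', 'hard-nosed', 'die-hard']
print(start_hits)  # all five phrases
```

The hyphen counts as a boundary because it is neither a letter nor a number, which is exactly why "hard-working" is matched while "hardly" is not.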
In summary, we now have two ways of searching for our indefinite articles, using space marks and using the regular expression metasequence for words (or, more precisely, alphanumeric sequences), but there is still a problem: we are running two searches when we could be running just one. Although running two separate searches and compiling the results is fairly straightforward in this case, there are many cases where this would be excessively time-consuming. Obviously, it is preferable to integrate the two searches into one.
The most obvious way in which the search for indefinite articles can be improved is by introducing the concept of "or", so that a search can say: look for one or the other--that is, for a or an. Disjunction--as the concept of "or" is referred to in logic--is accomplished in regular expressions with parentheses and the vertical bar metacharacter (|).
11.) a|an
Although some overly fussy English-speakers feel that "or" should be used exclusively for binary disjunction--that is, for either-or situations where there are only two options--the everyday English "or" is often extended to situations where there are a number of options, and the same holds true for the regular expression version of "or". So, it is also possible to have three or more possibilities, as in (12), where our search for articles is expanded to include the definite article "the".
12.) a|an|the
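Putting disjunction together with word boundaries gives a single search for all three articles. The following is a hedged sketch in Python: the sample sentence is invented, \b is assumed as a substitute for \< and \>, and the parentheses simply keep the alternatives together between the two boundaries.

```python
import re

# Invented sample text.
text = "an animal gave the man a command"

# Assumption: \b stands in for the \< and \> metasequences.
articles = re.compile(r"\b(a|an|the)\b")
found = [m.group(0) for m in articles.finditer(text)]
print(found)  # ['an', 'the', 'a']
```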
The usefulness of disjunction should be obvious. It is quite handy when one wishes to allow more than one string to fill a particular slot, as in (13).
13.) book(shop|store)
which would match either "bookshop" or "bookstore" (the British and American words, respectively, for the same thing). Note that parentheses have been used in this previous example. Parentheses are important when there might otherwise be ambiguity. In the case of (13), it is clear that what is wanted is book followed by either shop or store. However, if the parentheses were not used, as in (13'), grep would think that the search was for bookshop or store, as shown explicitly in (13").
13'.) bookshop|store
13".) (bookshop)|(store)
Disjunction is also quite handy when searching for the various forms of a given lexeme, as in (14).[4]
14.) profan(e|ity)
which would match
profane
profanity
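The effect of the parentheses is easy to verify. This sketch uses Python's re.fullmatch, which requires the whole string to match (unlike grep, which looks for a match anywhere in a line); that choice is mine, made so that the difference between the grouped and ungrouped patterns shows up cleanly.

```python
import re

grouped = re.compile(r"book(shop|store)")
ungrouped = re.compile(r"bookshop|store")

words = ["bookshop", "bookstore", "store"]
grouped_hits = [w for w in words if grouped.fullmatch(w)]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]
print(grouped_hits)    # ['bookshop', 'bookstore']
print(ungrouped_hits)  # ['bookshop', 'store']

lexeme = re.compile(r"profan(e|ity)")
lexeme_hits = [w for w in ["profane", "profanity"] if lexeme.fullmatch(w)]
print(lexeme_hits)     # ['profane', 'profanity']
```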
This brings us to another common metacharacter in regular expressions--the period (or, if you prefer, the full stop). A period is a metacharacter that serves as a stand-in for one character. It essentially means, "take your pick for this character, it can be anything". For example, the following search
15.) p.n
matches any three-character string whose first and last characters are "p" and "n", irrespective of whether the middle character is a letter, a number, or neither. Therefore, (15) matches
pin
pln
p&n
p2n
but does not match
pain
pawn
and so forth, since these words have two characters between "p" and "n" and the period in (15) will match only one.
Obviously, there are many cases where the variable portion of one's search consists of more than one character. One way of working around the built-in limitation on the period is to use more than one. To search for four-letter words beginning with "b" and ending with "n", one might use the following search
16.) b..n
which matches
boon
been
born
barn
but does not match
bin
brawn
broken
bracken
since the words in the latter set begin with "b" and end with "n" but have either fewer or more than two characters sandwiched in between. (In other words, they do begin with "b" and end with "n", but they're not four-letter words.)
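Both period examples can be tried out with the word lists given above. The sketch is in Python and uses re.fullmatch so that each word is tested as a whole, which is an assumption about how you would want to apply the patterns.

```python
import re

three_chars = re.compile(r"p.n")
four_chars = re.compile(r"b..n")

p_words = ["pin", "pln", "p&n", "p2n", "pain", "pawn"]
b_words = ["boon", "been", "born", "barn", "bin", "brawn", "broken", "bracken"]

p_hits = [w for w in p_words if three_chars.fullmatch(w)]
b_hits = [w for w in b_words if four_chars.fullmatch(w)]
print(p_hits)  # ['pin', 'pln', 'p&n', 'p2n']
print(b_hits)  # ['boon', 'been', 'born', 'barn']
```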
It is also quite useful when performing searches to allow more than one character to fill a position in a word or a string. For example, imagine searching for the word "enquire" in a collection of texts from all over the world. Since the spelling of this word is variable--"enquire" in Commonwealth countries, but "inquire" in the USA--one would want to leave two possibilities for the first character of this string. One way to accomplish this is by searching for either spelling disjunctively, as in (17).
17.) (enquire|inquire)
But there is a better way of doing this--namely, by placing the various possible characters within square brackets (effectively restricting the disjunction to a single character), as in (18).
18.) [ei]nquire
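That the square-bracket version does the same job as the disjunction can be checked with a brief sketch in Python. The two nonsense strings are added purely as negative examples.

```python
import re

alternation = re.compile(r"(enquire|inquire)")
bracketed = re.compile(r"[ei]nquire")

# "anquire" and "nquire" are invented non-words used as negative cases.
words = ["enquire", "inquire", "anquire", "nquire"]
alternation_hits = [w for w in words if alternation.fullmatch(w)]
bracketed_hits = [w for w in words if bracketed.fullmatch(w)]
print(alternation_hits)  # ['enquire', 'inquire']
print(bracketed_hits)    # ['enquire', 'inquire']
```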
In more technical terms, a string of characters (of any length) surrounded by square brackets forms a pattern that matches any single character in that string. This type of regular expression is useful in other contexts, as well. It is quite handy when dealing with variation in capitalization. If one searched for the word "hope", for example, one would be forced to deal with the fact that its first letter is capitalized at the beginning of sentences, as in "Hope springs in the heart eternal." The square bracket convention allows one to build this sort of case-insensitivity into a regular expression, as follows:
19.) [Hh]ope
(Since different regular expression parsers treat case differently, this may not be necessary in some cases, but it is best to assume case-sensitivity unless otherwise indicated.) A very similar convention involving square brackets allows a wider variety of characters to fill a given character slot by specifying a range. This is really a time-saving device, since it obviates the need to spell out every letter of the alphabet, as below
20.) [abcdefghijklmnopqrstuvwxyz]
In lieu of (20), which wastes time and space, one can instead define the range from 'a' through 'z' by placing them within square brackets and separating them with a hyphen, as below
21.) [a-z]
The regular expression in (21) will match any one character in the alphabet from 'a' through 'z'. More than one range can be placed within a pair of brackets, as seen in
22.) [a-zA-Z]
So, (22) will match any letter, regardless of case. Note that the two ranges are simply written one after the other, with no separator between them. By the same token, other characters can be included, as in the following regular expression
23.) [a-zA-Z&]
which would match a letter or an ampersand. It is important to bear in mind that non-alphabetical characters can also be used in this type of regular expression, as in the following regex:
24.) [1-5]
which matches any digit from 1 through 5. Be warned, though, that unexpected results can arise from the use of non-alphanumeric characters. For example, it is doubtful that the reader knows which characters the following regular expression will match
25.) [!-a]
This is because the average reader knows the order of the alphabet perfectly well, but not the order of all characters in ASCII (the American Standard Code for Information Interchange). In fact, any reader who has memorized the order of ASCII is unlikely to need this book (but is likely to need a tutorial on social interaction). For those who have not memorized the order of ASCII, the range specified by (25) is the following:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`a
Since including nonalphanumeric elements in ranges can produce unpredictable--or, to be more accurate, difficult-to-predict--results, it is wise to avoid them and to spell out any nonalphanumeric characters explicitly, as in (23), where an ampersand has been added to the set of all uppercase and lowercase letters.[5]
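Rather than memorizing ASCII, one can simply ask the computer which characters fall inside a range. The sketch below does so in Python for the range in (25); it assumes plain ASCII ordering, which is what Python uses for these characters.

```python
import re

# Every character from "!" to "a", in ASCII order.
expected = "".join(chr(code) for code in range(ord("!"), ord("a") + 1))

# Test each printable ASCII character against the bracketed range.
printable_ascii = "".join(chr(code) for code in range(32, 127))
in_range = re.compile(r"[!-a]")
matched = "".join(ch for ch in printable_ascii if in_range.fullmatch(ch))

print(matched)
print(matched == expected)  # True
print(len(matched))         # 65
```

The same trick works for any range whose contents you are unsure of: swap in the two endpoints and read off the result.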
At this point, a clever reader may have second-guessed me by raising the following important question: How does one search a text for characters that happen to be metacharacters, such as periods, backslashes, etc.? More concretely, how would someone search for all sentences with, say, two periods in a row? Obviously, the following search will not work
26.) ..
since it matches any two characters in a row (given the normal interpretation of periods), which means that it matches nearly everything (except lines with fewer than two characters). Fortunately, there is a convention for searches on characters that happen to be metacharacters: the use of a backslash (\) before a metacharacter that one wishes to be treated as a literal. (The backslash is also known as an escape character, since it allows an "escape" from the normal interpretation of a metacharacter.) So, to search for two periods in a row, we need to place a backslash before each period, so that each one is interpreted literally, as below:
27.) \.\.
To use a better example, consider what is involved in searching for etc. It will not do to use the following regular expression
28.) etc.
since it matches
etcetera
etching
fetch
and so forth. A much better way of searching for etc. is by converting the period from a metacharacter to a literal by placing a backslash in front of it, as follows
29.) etc\.
which will not match any of the above-listed words but will match
etc.
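The difference between the escaped and unescaped period can be seen in a few lines of Python. This is only a sketch; re.escape is a Python convenience that adds the backslashes for you and is not something grep offers.

```python
import re

unescaped = re.compile(r"etc.")  # the period matches any character
escaped = re.compile(r"etc\.")   # the period is read as a literal

words = ["etcetera", "etching", "fetch", "etc."]
unescaped_hits = [w for w in words if unescaped.search(w)]
escaped_hits = [w for w in words if escaped.search(w)]
print(unescaped_hits)  # all four strings
print(escaped_hits)    # ['etc.']

# Python can insert the backslashes automatically.
two_periods = re.escape("..")
print(two_periods)     # \.\.
```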
Unfortunately, the backslash has its limitations. It does not work with the characters (, ), <, >, or the digits 1-9, because these characters form metasequences in conjunction with the backslash (see Section ???). Furthermore, it should be noted that a backslash can be followed by a literal character (that is, a non-metacharacter), even though the backslash would in such cases be superfluous. After all, one would get exactly the same result with the character alone!
Related to the use of square brackets to enclose multiple character choices is the use of square brackets with an initial caret to enclose characters that will be rejected during matching. In other words, any character enclosed within square brackets and preceded by a caret will be taken out of the running while parsing the regular expression. The need for such a convention is not obvious, but imagine someone searching for the preposition "for" anywhere in a sentence except at the end. Since a period typically marks the end of a sentence, this person would want to search for "for" followed by anything but a period. The following regular expression would do the trick
30.) for[^.]
This would match
not called for,
for nothing
but specifically not
for.
However, the regular expression in (30) would also match
forever
forest
fortress
and so on, which is not the desired result, given that we are searching for only the preposition "for". There is another possibility which is more desirable given its specificity--namely, the one in (31).
31.) for#
This will search for "for" followed by a space (represented here by #), which normally represents the end of a word. Such a regular expression would match
for nothing
but would not match
forever
forest
fortress
not called for,
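The two approaches can be compared side by side. The following Python sketch runs both patterns over the examples from this section, with a real space standing where a space mark is intended.

```python
import re

not_period = re.compile(r"for[^.]")  # "for" followed by anything but a period
then_space = re.compile(r"for ")     # "for" followed by a real space

samples = ["not called for,", "for nothing", "for.",
           "forever", "forest", "fortress"]
not_period_hits = [s for s in samples if not_period.search(s)]
then_space_hits = [s for s in samples if then_space.search(s)]
print(not_period_hits)  # everything except 'for.'
print(then_space_hits)  # ['for nothing']
```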
At this point, we have not tackled the problems involved in the breakdown of texts into lines.
There are two metacharacters related to the starts and ends of lines: caret (^) and dollar ($). The caret is used to represent the start of a line while the dollar is used to represent the end of a line. If we were searching for the word "fast" in a text, then (32) matches every line beginning with "fast", while (33) matches every line ending with "fast".
32.) ^fast
33.) fast$
It is important to note that since no additional information has been provided, it is possible for (32) to match
faster
fastest
fastener
fastidious
and so forth, and for (33) to match
breakfast
Belfast
and so forth, since what follows fast isn't specified in (32) and what precedes fast isn't specified in (33).
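Here is a small sketch of the caret and the dollar at work, written in Python. The four lines of text are invented, and the loop over lines is meant to imitate the way grep examines its input one line at a time.

```python
import re

# Invented sample text, four lines long.
text = "faster than light\nhe ate breakfast\nfast\nnot so fast, please"

starts_with = re.compile(r"^fast")
ends_with = re.compile(r"fast$")

lines = text.splitlines()
start_hits = [line for line in lines if starts_with.search(line)]
end_hits = [line for line in lines if ends_with.search(line)]
print(start_hits)  # ['faster than light', 'fast']
print(end_hits)    # ['he ate breakfast', 'fast']
```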
Before expanding upon our repertoire of regular expressions, it may be worthwhile to recap what we have learned so far. We've been introduced to one metasequence--the backslash plus angle brackets--and a variety of metacharacters--the period, the vertical bar, square brackets (with and without hyphens and carets), parentheses, the caret and the dollar, and the backslash (or escape character, as it is sometimes known). These are summarized below:
| . | matches any character |
| | | or |
| [abc] | matches one of enclosed characters |
| [a-z] | matches one of the characters in a through z |
| [^abc] | matches any character except the enclosed ones |
| ^ | start of line |
| $ | end of line |
| () | grouping |
| \ | causes metacharacters to be read as literals; also causes < and > to be defined as word starts and ends, respectively (e.g., \<abc\>) |
Regular expressions built up from these metacharacters are fairly powerful and give searching a good deal of flexibility, but there are a number of ways in which they are lacking. In the next section, we will expand the possibilities provided by these metacharacters by introducing a number of quantifiers--that is, metacharacters which allow other metacharacters to apply once, twice, multiple times, or even not at all.
Perhaps the best way to get a handle on quantifiers is to imagine a situation where they are indispensable. Imagine that you are searching a collection of texts for every instance of a particular type of relative clause--say, so-called restrictive relative clauses. One could search for "that" or for "who", as in "the car that broke down by the side of the road" or "the man who killed the president," but this would match many strings that aren't relative clauses but nevertheless contain these words, such as
I hope that everyone understands.
That computer should be made into a fish tank.
Who was able to parse out that long regular expression?
Clearly, we need some way to restrict the scope of our search and zero in on only relative clauses, to the exclusion of everything else. One way to do this is to look for "that" or "who" when it co-occurs with "the", which is frequently the case, as we have already seen in the above-cited relative clause examples. The problem with such a search is that nearly any noun can come between the definite article and the beginning of the relative clause (either "that" or "who", or even in some cases nothing at all).
We already know how to handle the fact that relative clauses begin with one of two markers (ignoring "which", which is typically, but not exclusively, used for so-called non-restrictive clauses):
34.) the X (that|who)
What we need is the regular expression equivalent of a variable that stands for a word. And in the case of relative clauses, at least one word (which we will call the head noun) will always come between the definite article "the" and the relative clause marker ("who" or "that"). There are a number of ways in which this sort of thing can be written into one's regular expressions, but the asterisk metacharacter is in this case the most useful.
The asterisk says, essentially, find as many instances as possible. Its domain of application is the immediately preceding character or expression. Assuming that the head noun consists entirely of letters, the following regular expression should do the trick:
35.) the_[A-Za-z]*_(who|that)
What (35) says, essentially, is to find "the" followed by a space, any number of letters (lowercase or uppercase), another space, and then either "who" or "that". (Note that understrokes here stand for space marks.) This regular expression would match any of the following relative clauses
the man who mowed my lawn
the woman who won a gold medal in the Olympics
the dog that pissed on the fire hydrant
but it would not match these
the stupid boy who drives a Ferrari
the man standing over there who is wearing the ugly shirt
the recalcitrant thief with a limp who likes to rob banks
the corrupt politician's wife, who likes to buy shoes
Why? The answer is simple. Because these relative clauses involve either a) two or more words between "the" and the relative clause marker, or b) words coming after the head noun and before the relative clause marker. To solve this problem, we need to modify our regular expression slightly to get around the problems posed by (a) and (b). The basic problem is the same--namely, that although some relative clauses consist of "the" followed by a word, and then by the relative clause, as in "the boy who drives a Ferrari", many involve multiple words between the article and the relative clause, as seen in the examples above.
The point is that the regular expression in (35) will not match any of these relative clauses because what comes between the article and the relative clause marker contains non-alphanumeric characters. Probably the best way of understanding the problem is to think of the intervening material as a single string, as follows
_stupid_boy_
_man_standing_over_there_
_recalcitrant_thief_with_a_limp_
_corrupt_politician's_wife,_
In essence, we have a string composed of alphanumeric characters and space marks, apostrophes, or hyphens. The way to match such a string is to simply add these non-alphanumeric characters to the square brackets in (35), and allow it to apply repeatedly by using the asterisk, as shown in (36):
36.) the_[A-Za-z_'-]*_(who|that)
This regular expression will match the relative clauses provided above, and many more. (The one exception is the last, whose comma is not in the bracketed set; adding a comma to the set would take care of it.)
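The two relative clause patterns can be tested against the examples above. In this Python sketch a real space replaces each understroke, and the clause list is taken from the text; note that the clause containing a comma is still missed by the wider pattern unless a comma is also placed inside the square brackets.

```python
import re

# A real space replaces the understroke used in the text.
one_word = re.compile(r"the [A-Za-z]* (who|that)")
many_words = re.compile(r"the [A-Za-z '-]* (who|that)")

clauses = [
    "the man who mowed my lawn",
    "the stupid boy who drives a Ferrari",
    "the man standing over there who is wearing the ugly shirt",
    "the recalcitrant thief with a limp who likes to rob banks",
    "the corrupt politician's wife, who likes to buy shoes",
]
one_word_hits = [c for c in clauses if one_word.search(c)]
many_words_hits = [c for c in clauses if many_words.search(c)]
print(len(one_word_hits))    # 1: only the first clause
print(len(many_words_hits))  # 4: all but the clause with the comma
```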
To recap, the asterisk (or star) metacharacter makes whatever it modifies match zero or more times, which is quite handy when dealing with variation that can range, in theory, from zero to infinity. However, the asterisk has its limitations, namely that it has no way of placing a limit on the number of matches that can be made. What is often needed is a quantifier that places some sort of ceiling on the number of matches it will make. The quantifier that conforms to this desideratum is the question mark, which we will discuss in the following section.
The question mark metacharacter makes whatever it modifies match once or not at all, which amounts to saying that it makes whatever element it modifies optional. Clearly, there will always be elements in one's searches whose presence is unimportant. They can either be included or excluded without making much of a difference, and therefore could be said to be optional. For example, imagine searching for every instance of the word "unbelievable" in a transcription of an argument between Australian wharfies. It's quite likely that some instances of the f-word will be inserted into words like "unbelievable", producing "un-fucking-believable" or "unbe-fucking-lievable". (If you don't think the latter is possible, see A Fish Called Wanda.) We wouldn't want to search for three words:
unbelievable
un-fucking-believable
unbe-fucking-lievable
Instead, we want some sort of convention that will allow for the infixing of the f-word in "unbelievable" without requiring it. Fortunately, there is such a convention in regular expressions, and it involves the question mark metacharacter (?). The question mark makes optional whatever precedes it, which will be an entire string, provided it is grouped with parentheses, as in (37).[6]
37.) un(-fucking-)?be(-fucking-)?lievable
The question mark is indeed a handy device, but it must be treated with care because what it makes optional varies according to whether it is preceded by parentheses. With immediately preceding parentheses, it modifies the parenthetical expression, but without them, it modifies the preceding character, which is invaluable when dealing with optional characters. For example, American and British spellings of a number of words differ by only a single letter, as in "color" (American) versus "colour" (British). If searching for this word in a collection of mixed texts, the question mark metacharacter provides a fast and simple solution, as shown in (38).[7]
38.) colou?r
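Both uses of the question mark, on a single character and on a parenthesized group, are illustrated in the Python sketch below. The misspelling "colouur" is an invented negative example, and re.fullmatch is used so that each word is tested as a whole.

```python
import re

optional_letter = re.compile(r"colou?r")
optional_group = re.compile(r"un(-fucking-)?be(-fucking-)?lievable")

spellings = ["color", "colour", "colouur"]
letter_hits = [w for w in spellings if optional_letter.fullmatch(w)]
print(letter_hits)  # ['color', 'colour']

variants = ["unbelievable", "un-fucking-believable", "unbe-fucking-lievable"]
group_hits = [w for w in variants if optional_group.fullmatch(w)]
print(group_hits == variants)  # True
```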
The plus mark (+) essentially says to match the preceding character or expression at least once. It is like the asterisk except that it requires at least one match, whereas the asterisk is happy to match nothing at all. More concretely, consider the two spelling variants of the word for an Indian ruler:
39.) raja
40.) rajah
If you were searching for either spelling of this word, you could make the final 'h' optional with the asterisk, as in (41), but not the plus mark, as in (42), since the asterisk would be happy to find no 'h' at all (therefore matching "raja" or "rajah") whereas the plus mark would require at least one 'h' (therefore matching "rajah" but not "raja").
41.) rajah*
42.) rajah+
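The contrast between the asterisk and the plus mark can be confirmed with a short Python sketch. The string "rajahh" is an invented extra, included to show that neither quantifier puts a ceiling on the number of repetitions.

```python
import re

star = re.compile(r"rajah*")
plus = re.compile(r"rajah+")

words = ["raja", "rajah", "rajahh"]
star_hits = [w for w in words if star.fullmatch(w)]
plus_hits = [w for w in words if plus.fullmatch(w)]
print(star_hits)  # ['raja', 'rajah', 'rajahh']
print(plus_hits)  # ['rajah', 'rajahh']
```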
The most exact method of quantification would of course amount to stating the lower and upper limit of matches desired. Precisely this sort of specificity is supported by some more recent versions of grep (such as egrep on UNIX), which let one write the minimum and the maximum between curly braces, separated by a comma. The quantifiers '*' and '+' therefore correspond to {0,} and {1,}, respectively (that is, a minimum of zero or one and no maximum).
This is quite useful if you want to find a particular number of characters--say, a consonant cluster of two to three segments. This could be accomplished rather easily, as follows (where it is assumed that the language in question has only five consonants--ptklm):
43.) [ptklm]{2,3}
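Curly-brace quantifiers are also understood by Python, so the consonant cluster example can be sketched there. The test strings are invented, and the last two patterns are included only to show the open-ended forms {0,} and {1,} behaving like the asterisk and the plus mark.

```python
import re

cluster = re.compile(r"[ptklm]{2,3}")

samples = ["p", "pt", "ptk", "ptkl"]
cluster_hits = [s for s in samples if cluster.fullmatch(s)]
print(cluster_hits)  # ['pt', 'ptk']

# Open-ended ranges: {0,} behaves like * and {1,} behaves like +.
star_like = re.compile(r"rajah{0,}")
plus_like = re.compile(r"rajah{1,}")
print(bool(star_like.fullmatch("raja")))  # True
print(bool(plus_like.fullmatch("raja")))  # False
```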
To recap, we have learned four ways of quantifying regular expressions. Each quantifier behaves somewhat differently and allows for different matching possibilities.
| ? | causes previous character to match zero or one time |
| * | causes previous regex to match zero or more times |
| + | causes previous regex to match one or more times |
| {min,max} | min required, max allowed (not supported by all versions of egrep) |
The POSIX standard defines two classes of regular expressions: Basic Regular Expressions and Extended Regular Expressions. The former are used in grep and sed while the latter are used in egrep and awk. Whichever you are running, it's useful to know about "bracket expressions". Bracket expressions are like character classes, but they are significantly more powerful. (POSIX defines them for both basic and extended regular expressions, so they are not limited to egrep; fgrep, which searches for fixed strings, does not interpret them.) Recall from 2.3 that characters in brackets define a character class. Thus, (44) is a character class that contains all lowercase letters:
44.) [abcdefghijklmnopqrstuvwxyz]
Listing all the lowercase letters can be a pain in the neck, so fortunately the POSIX standard provides a way of avoiding such cumbersome statements, as shown in (45).
45.) [[:lower:]]
Instead of spelling out each and every lowercase letter, this convention allows you to say what type of characters you want to match, according to the established conventions. Note that all of the keyword character classes are preceded by "[:" and followed by ":]", as in (46), and that the keyword class must itself sit inside the square brackets of a bracket expression, which is why (45) has two pairs of brackets.
46.) [:keyword:]
The full list of these keyword character classes is given below:
| [:alnum:] | alphanumeric characters (that is, letters and numbers) |
| [:alpha:] | alphabetic characters (that is, letters, but not numbers) |
| [:blank:] | spaces and tabs |
| [:cntrl:] | control characters |
| [:digit:] | numeric characters (that is, numbers) |
| [:graph:] | printable and visible (non-space) characters |
| [:lower:] | lowercase letters (that is, a ... z) |
| [:upper:] | uppercase letters (that is, A ... Z) |
| [:print:] | printable characters (that is, [:graph:] plus the space character) |
| [:punct:] | punctuation characters |
| [:space:] | whitespace characters |
| [:xdigit:] | hexadecimal digits |
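Not every regular expression engine understands these keywords. Python's re module, for one, does not, so the sketch below builds rough ASCII-only equivalents from the standard string module. The POSIX_CLASSES table and the posix_class helper are hypothetical names invented for this illustration, the [:cntrl:] class is left out for brevity, and real POSIX classes depend on the locale in a way this table ignores.

```python
import re
import string

# Hypothetical lookup table: ASCII-only approximations of the POSIX classes.
POSIX_CLASSES = {
    "alnum": string.ascii_letters + string.digits,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "digit": string.digits,
    "graph": string.ascii_letters + string.digits + string.punctuation,
    "lower": string.ascii_lowercase,
    "upper": string.ascii_uppercase,
    "print": string.ascii_letters + string.digits + string.punctuation + " ",
    "punct": string.punctuation,
    "space": string.whitespace,
    "xdigit": string.hexdigits,
}

def posix_class(name):
    """Return a Python bracket expression for the named class (hypothetical helper)."""
    return "[" + re.escape(POSIX_CLASSES[name]) + "]"

lower_word = re.compile(posix_class("lower") + "+")
print(bool(lower_word.fullmatch("regex")))  # True
print(bool(lower_word.fullmatch("Regex")))  # False

# [:print:] includes the space character, whereas [:graph:] does not.
print(bool(re.fullmatch(posix_class("print"), " ")))  # True
print(bool(re.fullmatch(posix_class("graph"), " ")))  # False
```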
| . | matches any character |
| ? | causes previous character to match zero or one time |
| * | causes previous regex to match zero or more times |
| + | causes previous regex to match one or more times |
| | | or |
| [abc] | matches one of enclosed characters |
| [a-z] | matches one of the characters in a through z |
| () | grouping |
Created and maintained by: Stuart Robinson.