
(Reposted from www.knowmadic.com) An article about regular expressions, very comprehensive!

Posted on 2006-02-28 12:31 by 灰色
 

Table Of Contents

1.0 Introduction to regular expressions
2.0 Character classes
3.0 Regular expressions
4.0 Perl zero width assertions
5.0 Perl Regular expression modifiers
6.0 Perl Extended regular expression patterns
7.0 Perl substitute command
8.0 Regular expression discussion
9.0 Different regular expression engines


1.0 Introduction to regular expressions

See also Perl regular expression manual and <http://www.perldoc.com/>.

      prompt> perldoc perlre
    

Many other programming languages use regular expressions as well.

1.1 What are regular expressions used for

Here is the scenario: your boss in the documentation department wants a tool that checks for doubled words, e.g. "this this", a common problem in documents subject to heavy editing. Your job is to create a solution that will:
  • Accept any number of files to check and report each line of each file that has doubled words.
  • Work across lines, finding doubled words even when they sit on separate lines.
  • Find doubled words in spite of capitalization differences such as "The" and "the", and allow any amount of whitespace between the words.
  • Find doubled words even when they are separated by HTML tags, as in "it's very <I>very</I> important".

That is not an easy task! If you run such a tool on existing documents, you may be surprised to find this kind of mistake in many sources. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.

Regular expressions are the key to powerful, flexible, and efficient text processing. Regexps themselves, with a general pattern notation that is almost like a mini programming language, allow you to describe and parse text. With additional support provided by the particular tool being used, regular expressions can add, remove, isolate, and generally fold and spindle all kinds of text and data. The tool might be as simple as a text editor's search command or as powerful as a full text processing language. You have to start thinking in terms of regexps, and not in the way you are used to from your previous programming languages, because only then can you take advantage of their full power.

The host language (Perl, Python, Emacs Lisp) provides the peripheral processing support, but the real power comes from regular expressions. Using regexps well makes it possible to identify the text you want and to bypass the portions you are not interested in.

1.2 Solving real problems

1.2.1 Summary of Email mailbox

If you wanted to create a summary of the messages in your mailbox, it would be tedious to read all your 1000 mails and copy the important lines (like From: and Subject:) to a separate file by hand. What if you were on dial-up? The on-line time spent making such a summary would quickly get expensive if you had to do it multiple times. In addition, you could not do it for some other person, because you would see the contents of his mailbox. Regexps come to the rescue again. A very simple command can display a summary of such header lines immediately.

Do not worry at this point about what the code does; let's just say: wow, that simple?

      command prompt> perl -ne "print if /(From|To):/" ~/Mail/*
    

What if someone asked for that summary? There would be no need to send the 5000 lines of results when you could send that little one-liner to your friend and ask him to run it on his own mailbox.

1.2.2 Checking text in files

As a simple example, suppose you need to check a slew of files (70-150) to confirm that each file contains SetSize exactly as often as it contains ResetSize. To complicate matters, you should disregard capitalization and also accept e.g. SETSIZE. The total line count of those files could easily reach 30000 or more, and checking them by hand would give you a headache. Even using a normal "find this word" command in a text processor would be truly arduous, what with all the files and all the possible capitalizations. Regexps come to the rescue. By typing a single short command you can do the work in seconds and confirm what you want to know.

      % perl -0ne "print qq($ARGV\n) if s/ResetSize//ig != s/setSize//ig" *
    

1.2.3 Matching RFC compliant email address

Friedl finishes his book with a real-life example: parsing an Internet standard (RFC) email address according to the defined BNF grammar rules. This may well be one of the longest regular expressions you will ever see:

[Picture 1.  pic/book-regexp-email.jpg]
Picture 1. Internet RFC standard compliant email address matcher. Friedl, 1st edition, page 316.

1.3 Regular expressions as a language

Unless you have had some experience with regular expressions, you won't understand the above commands. There really is no special magic here, just a set of rules that must be digested. Once you learn how to hide a coin behind your hand, you know there is not much magic in it, just a lot of practice and learning of new skills. Like a foreign language, it will stop sounding like "gibberish" after a while.

1.4 The filename analogy

Even if your only experience is with the Win32/Windows environment, you already know that the following refers to multiple files:

      *.txt
    

With filename patterns like this (called file globs) there are a few characters that have a special meaning.

      *   => means: "MATCH ANYTHING"
      ?   => means: "MATCH ONE CHARACTER"
    

The complete example above will be parsed as

[Picture 2.  pic/lect-regexp-file-analogy.jpg]
Picture 2. How the operating system interprets metacharacters

Exercise: open a DOS command prompt and try out directory listing commands. Try listing your Word or text files by using the metacharacters (*) and (?).

The whole pattern is thus read as "Match files that start with anything and end with .txt".

Most systems provide a few additional special characters, but in general these filename patterns are limited in expressive power. This is not much of a shortcoming because the scope of the problem (to provide convenient ways to specify groups of files) is limited to filenames.

On the other hand, dealing with general text is a much larger problem. Prose and poetry, program listings, reports, lyrics, HTML, articles, code tables, word lists ...you name it. Over the years a generalized pattern language has developed which is powerful and expressive for a wide variety of uses. Each program implements and uses it differently, but in general this powerful pattern language and the patterns themselves are called Regular Expressions.

1.5 Searching text files

Finding text is the simplest use of regular expressions; many text editors and word processors allow you to search a document using some kind of pattern matching. Let's return to the original example of finding some relevant lines from a mailbox and study it in detail:

[Picture 3.  pic/lect-regexp-perl-cmd-line-ne.jpg]
Picture 3. Searching text files with Perl in Win32

An even simpler example would be searching for every line that contains a word like "cat":

      perl -ne "print if /cat/" file.txt
    

But things are not that simple. How do you know which match is the plain word "cat"? You must consider that "catalog", "caterpillar" and "vacation" contain the same three letters but differ semantically from the animal "cat". The results do not show what was actually matched and caused the line to be selected; the lines are just printed. The key point is that regular expression searching is not done on a "word" basis but on a character basis, without any knowledge of e.g. English language syntax.

Exercise: Write a simple text file, data.txt, which contains some searchable lines. Now try to find every line that contains the word "test", "Test" or "line" by using perl from the command line.
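
The character-by-character behaviour can be seen in a minimal Perl sketch. The word list and the helper name matching are made up for illustration; \b (a word boundary, one of the zero width assertions covered later in the outline) appears only as a contrast.

```perl
use strict;
use warnings;

# Hypothetical sample words, not taken from the article.
my @words = ("cat", "catalog", "caterpillar", "vacation", "dog");

# Return the items of a list that match a compiled regexp.
sub matching {
    my ($re, @items) = @_;
    return grep { $_ =~ $re } @items;
}

# /cat/ looks for the characters c-a-t anywhere, so "vacation" is a hit too.
print join(", ", matching(qr/cat/, @words)), "\n";

# With word boundaries on both sides only the plain word remains.
print join(", ", matching(qr/\bcat\b/, @words)), "\n";
```

The regexp engine has no idea that "cat" is an animal; it only compares characters at each position.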

      test
      one more test
      and yet another test
      this line here
      and more lines here
      end
      testdata
    

1.6 Grep, egrep, fgrep, perl, say what?

There is a family of Unix tools that started the era of regular expressions, known as grep(1), egrep(1), fgrep(1), sed(1) and awk(1). The first of them was grep, soon followed by the extended grep, egrep(1), which offered more patterns in its regular expression syntax. The latest step in this evolution is perl(1), which enhanced regular expressions far beyond what was imaginable before. Nowadays, whenever regular expressions are discussed, the foundation of new inventions usually lies in the Perl language. The Unix manual page concerning regular expressions is regexp(5).


2.0 Character classes

Think of the character class notation as a regular expression sub-language of its own. It has its OWN rules, which are not the same as the rules outside of character classes.

2.1 Matching list of characters

What if you want to search for all mentions of the color "grey", including the spelling "gray" that differs by one character? You can define a list of characters to match, a character class. This regexp reads: "Find character g followed by r, then try the next character with e OR a, and finally character y is required.".

[Picture 4.  pic/lect-regexp-classes-greay.jpg]
Picture 4. Using character CLASS for individual characters.
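
Since the picture is not reproduced here, the following Perl sketch shows the same idea. The regexp /gr[ea]y/ is the one the text itself uses later; the helper name has_grey and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# True when the text contains "g", "r", then "e" OR "a", then "y".
sub has_grey {
    my ($text) = @_;
    return $text =~ /gr[ea]y/ ? 1 : 0;
}

for my $text ("grey", "gray", "groy", "a gray cat") {
    printf "%-12s %s\n", $text, has_grey($text) ? "match" : "no match";
}
```

The class stands for exactly one character position, no matter how many characters are listed inside it.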

As another example, suppose you want to allow capitalization of the word's first letter. Remember that this still matches lines that contain smith or Smith embedded in another word, such as blacksmith. This is a common source of problems among new users.

      /[Ss]mith/
    

There are a few points here: 1) you can list as many characters in the class as you like, and 2) the characters can be placed in any order:

      /[0123456]/
      /[6543210]/
    

The above might be a good set of choices for finding HTML headings in a page: <H1> <H2> .. <H6> (H6 is the maximum according to the HTML 4.x specification. Refer to <http://www.w3c.org/>)

      /<H[0123456]>/          or
      /<H[6543210]>/          [the order does not matter]
    

Exercise: Pick any URL page and save it to disk and write small perl program to print all H1 or H2 headings.

There are a few rules concerning the character class: not all characters inside it are pure literals, taken "as is". A dash (-) indicates a range of characters; here is an equivalent example:

[Picture 5.  pic/lect-regexp-classes-range-dash.jpg]
Picture 5. The RANGE metacharacter

Note: One thing to remember is that regular expressions are case sensitive. Matching "a" is different from matching "A", which matters for instance when you construct the set of letters for regular 7-bit English text. (Different countries have different sets of characters, but that is a whole separate issue.)

      /[a-z]/             Ehm...
      /[a-zA-Z]/          Maybe this is what you wanted?
    

Note: Remember that the dash(-) applies only inside character class. In normal regular expression, the dash is just a regular character:

      /a-z/               Match character "a", character "-", character "z"
    

Multiple ranges can be used inside a class in any order, but each range must follow the ASCII table sequence and not run backwards:

      /[a-zA-Z0-9]/       ok
      /[A-Za-zA-Z0-9]/    Repetitive, but no problem
      /[0-9A-Za-z]/       ok
      /[9-0]/             Oops, you can't say it "backwards"
    

Exclamation point and other special characters are just characters to match:

      /[!?:,.=]/
    

Or pick your own personal set of characters. This does not match the word "help" as such; it matches any one of the four listed characters:

      /[help]/
    

2.2 Negated character classes

It is easy to list the characters you want to include, but what if you would like to match everything except a few characters? It would be impractical to list all the possible characters and leave out only some of them:

      /[ACDEFGHIJKLMNOPQ .... ]/         THIS IS NOT THE FULL REGEXP
         |
         Character "B" was left out.
    

The special meta-character caret (^), placed first inside a character class, says "do not include these".

[Picture 6.  pic/lect-regexp-classes-not-09.jpg]
Picture 6. Negated character class: excludes range 0-9
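
As a stand-in for the picture, here is a small Perl sketch of a negated class that excludes the range 0-9. The helper name has_non_digit and the sample strings are assumptions, not part of the original text.

```perl
use strict;
use warnings;

# True when the text contains at least one character that is NOT a digit.
sub has_non_digit {
    my ($text) = @_;
    return $text =~ /[^0-9]/ ? 1 : 0;
}

for my $text ("2006", "2006-02-28", "abc") {
    printf "%-12s %s\n", $text, has_non_digit($text) ? "match" : "no match";
}
```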

Exercise: Why would the following regular expression list the items below?

      % perl -ne "print if /q[^u\r\n]/" file.txt

      Iraqi
      Iraqian
      miqra
      qasida
      qintar
      qoph
      zaqqum

Exercise: Why didn't it list words like

      Quantas
      Iraq

2.3 Character class and special characters

2.3.1 The brackets ([])

As we have learned earlier, some of the characters in a character class are special; these include the range (-) and the negation (^). But remember that the characters delimiting the class itself must also be special: ( ] ) and ( [ ). So, what happens if you really need to match any of these characters? Suppose you have the text:

      "See [ref] on page 55."
    

And you need to find all text surrounded by the brackets []. You must write the class like this, although it looks funny. It works because an "empty set" is not a valid character class, so there are not really two "empty character classes" here:

[Picture 7.  pic/lect-regexp-classes-bracket.jpg]
Picture 7. How to include the brackets themselves inside a character class

Rule: ( ] ) must be at the beginning of class and ( [ ) can be anywhere in the class.

2.3.2 The dash (-)

If the dash is used to delimit a range in a character class, we have a problem when we want to match person names like "Leary-Johnson". The solution comes from remembering that the dash needs a FROM-TO pair: if we omit either side and write FROM- or -TO, the special meaning is canceled.

[Picture 8.  pic/lect-regexp-classes-dash-included.jpg]
Picture 8. POSIX regular expression syntax

Rule: the dash (-) is taken literally when it is placed either at the beginning or at the end of the character class.

2.3.3 The caret (^)

We still have one character with a special meaning: the negation operator, which excludes characters from the set. We can take (^) literally, as a plain character, by moving it out of its special position at the beginning of the class.

[Picture 9.  pic/lect-regexp-classes-caret-included.jpg]
Picture 9. How to include caret itself inside character class

Rule: caret(^) loses its special meaning, when it is not the first character in the class.
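
The three rules from sections 2.3.1 to 2.3.3 can be tried one at a time. The sketch below is only an illustration under assumptions: the hash %class, the helper count_hits and the particular classes are invented here and are not necessarily the ones shown in the pictures.

```perl
use strict;
use warnings;

# One class per rule; each takes a normally special character literally.
my %class = (
    brackets => qr/[][]/,       # "]" comes first, "[" can be anywhere
    dash     => qr/[A-Za-z-]/,  # "-" comes last, so it forms no range
    caret    => qr/[a^]/,       # "^" is not first, so it does not negate
);

# Count how many characters of $text the named class matches.
sub count_hits {
    my ($name, $text) = @_;
    my $re = $class{$name};
    my $count = () = $text =~ /$re/g;
    return $count;
}

print count_hits("brackets", "See [ref] on page 55."), "\n";
print count_hits("dash", "Leary-Johnson"), "\n";
print count_hits("caret", "a^b"), "\n";
```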

2.3.4 How to put all together

Do we dare to combine all these exceptions in one regular expression that says "I want these characters: ^, -, ] and ["? It might be an impossible, or at least time consuming, task if you didn't know the rules for these characters. With trial and error you could eventually come up with the right solution, but you would never fully understand why it works. Here is the answer.

[Picture 10.  pic/lect-regexp-classes-all-included.jpg]
Picture 10. How to include all metacharacters literally inside character class

Exercise: The above regular expression can be written differently to mean the same thing. Can you determine the right order of the characters?

Exercise: And now the final master thesis question: how do you reverse the question, "I want to match everything, except characters ^, -, ] and [" ??

2.4 POSIX locales and character classes

POSIX, short for Portable Operating System Interface, is a standard ensuring portability across operating systems. Within this ambitious standard are specifications for regular expressions and many of the traditional Unix tools use them.

One feature of the POSIX standard is the notion of a locale: settings which describe language and cultural conventions such as

  • the display of dates,
  • times
  • and monetary values
  • the interpretation of characters in the active encoding
  • and so on.

Locales aim at allowing programs to be internationalized. The locale is not a regexp-specific concept, although it can affect regular expression use. For example, when working with the Latin-1 (ISO-8859-1) encoding, the character "a" has many different variants in different languages (think of adding an accent on top of "a").

2.4.1 POSIX collating sequence

A locale can define collating sequences to describe how to treat certain characters, or sets of characters, for sorting. For example, Spanish ll as in tortilla traditionally sorts as if it were one logical character between l and m. These rules might be manifested in collating sequences named span-ll, and eszet for German ss. As with span-ll, a collating sequence can define a multi-character sequence that should be taken as a single character. This means that the dot (.) in the regular expression /torti.a/ could match "tortilla".

2.4.2 POSIX character class

A POSIX character class is one of several special meta sequences for use within a POSIX bracket expression. An example is [:lower:], which represents any lowercase letter within the current locale. For normal English that would be [a-z]. The exact list of POSIX character classes is locale dependent, but the following are usually supported (they appeared 2000-06 in perl 5.6). See the [perlre] manual page for more.

2.4.3 The overall syntax of locale regexps

      [:class:]       GENERIC POSIX SYNTAX, replace "class" with names below
      [:^class:]      PERL EXTENSION, negated class
    

2.4.4 Supported locale regexps

      [:alpha:]        alphabetic characters
      [:alnum:]        alphabetic characters and numeric characters
      [:ascii:]        any character in the ASCII character set
      [:cntrl:]        control characters
      [:digit:]  \d    digits
      [:graph:]        non-blank (no spaces or control characters)
      [:lower:]        lowercase alphabetics
      [:print:]        like "graph" but includes space
      [:punct:]        punctuation characters
      [:space:]  \s    all whitespace characters
      [:upper:]        uppercase alphabetics
      [:word:]   \w    word characters (Perl extension)
      [:xdigit:]       any hexadecimal digit, [0-9a-fA-F]
      |          |     |
      |          |     Explanation
      |         Perl's equivalent shorthand regexp syntax
      POSIX regexp syntax
    

Note: Perl uses \w to mean "word", which translates to the basic regular expression [a-zA-Z0-9_], but this is not the whole story. Perl respects the use locale directive in programs and thus allows the A-Z character range to be enlarged. See the perl manual page [perllocale] for more information.

Here is an example of how to use the basic regular expression syntax and the roughly equivalent POSIX syntax. Both match a word that starts with an uppercase letter.

[Picture 11.  pic/lect-regexp-classes-posix.jpg]
Picture 11. How to include all metacharacters literally inside character class
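
Because the picture is missing, the two regexps below are an assumption about what it showed: one uppercase letter followed by lowercase letters, written once with plain ranges and once with POSIX classes. The helper names are hypothetical. Note that a POSIX class such as [:upper:] is itself placed inside the brackets of a character class.

```perl
use strict;
use warnings;

# Basic syntax: ranges spelled out by hand.
sub capitalized_basic {
    my ($text) = @_;
    return $text =~ /[A-Z][a-z]*/g;
}

# POSIX syntax: the locale decides what counts as upper or lower case.
sub capitalized_posix {
    my ($text) = @_;
    return $text =~ /[[:upper:]][[:lower:]]*/g;
}

my $text = "Perl and Python are Languages";
print join(", ", capitalized_basic($text)), "\n";
print join(", ", capitalized_posix($text)), "\n";
```

For plain ASCII text such as the sample above, both versions find the same words; they can differ once the locale adds letters outside A-Z.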


2.4.5 POSIX character equivalents

Some locales define character equivalents to indicate that certain characters should be considered identical for sorting. The equivalent characters are listed with the [=...=] notation. For example to match Scandinavian "a" like characters, you could use [=a=].

Perl 5.6 (2001-06) recognizes this syntax but does not support it. According to the [perlre] manual page: "The POSIX character classes [.cc.] and [=cc=] are recognized but not supported and trying to use them will cause an error: Character class syntax [= =] is reserved for future extensions".


3.0 Regular expressions

3.1 Backslash notation

Backslash - special ASCII characters

You can match a space and a tab with the regular expression /[ ]/, but it is quite difficult to see that there are a SPACE and a TAB inside the class notation. Perl, like other languages, has an alternative notation for a few special characters. Here is the list:

      \t          tab                   (HT, TAB)
      \n          newline               (LF, NL)
      \r          return                (CR)
      \f          form feed             (FF)
      \a          alarm (bell)          (BEL)
      \e          escape                (ESC)
      \033        octal char
      \x1B        hex char
      \x{263a}    wide hex char         (Unicode SMILEY)
      \c[         control char (\c[ is ESC; Control-C is \cC)
      \N{name}    named char
    

Backslash - The regular expression escape character

The backslash is normally used in the meaning "take the next character as-is, no matter what its meaning is in regular expression syntax".

[Picture 12.  pic/lect-regexp-backslash.jpg]
Picture 12. Using backslash to quote regexp metacharacters

To remove e.g. the meaning "match any character except newline" from the dot (.), you would add the backslash:

      ega.att.com             /ega.att.com/       WRONG!
      megawatt.computing                          SURPRISE!
                              /ega\.att\.com/     ok
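
The surprise above can be reproduced with a short Perl sketch. The two input strings come from the example; the helper name lines_matching is an assumption.

```perl
use strict;
use warnings;

my @lines = ("ega.att.com", "megawatt.computing");

# Return the lines that match a compiled regexp.
sub lines_matching {
    my ($re, @items) = @_;
    return grep { $_ =~ $re } @items;
}

# Unescaped dots match any character, so "egawatt.com" is accepted too.
print join(", ", lines_matching(qr/ega.att.com/, @lines)), "\n";

# Escaped dots match only a literal ".".
print join(", ", lines_matching(qr/ega\.att\.com/, @lines)), "\n";
```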
    

Exercise: The backslash can be used anywhere. Consider the difference between these two. What do they match?

      /find (word)/
      /find \(word\)/

3.2 Marking start and end

In the regular expression language there are expressions to define beginning-of-line (^) and end-of-line ($). As we have seen, 'cat' is caught anywhere in the line, but we may want to anchor the match to the start of the line. The ^ and $ are particular in that they match a position, not characters. [IMPORTANT] Anchors do not match any actual characters.

[Picture 13.  pic/lect-regexp-classes-anchor-start.jpg]
Picture 13. The basic anchor: BEGINNING of input

Exercise: What's wrong with interpreting the above regular expression like this: "MATCH WORD 'CAT' AT THE BEGINNING OF LINE".?

[Picture 14.  pic/lect-regexp-classes-anchor-end.jpg]
Picture 14. The basic anchor: END of input
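
A minimal Perl sketch of both anchors follows. The regexps /^cat/ and /cat$/ are assumed to be what the two pictures show; the sample lines and the helper name lines_matching are invented for illustration.

```perl
use strict;
use warnings;

my @lines = ("cat food", "the cat", "a cat nap");

# Return the lines that match a compiled regexp.
sub lines_matching {
    my ($re, @items) = @_;
    return grep { $_ =~ $re } @items;
}

# ^ ties the match to the position at the start of the line.
print join(" | ", lines_matching(qr/^cat/, @lines)), "\n";

# $ ties the match to the position at the end of the line.
print join(" | ", lines_matching(qr/cat$/, @lines)), "\n";
```

The anchors themselves consume nothing; the three letters are still matched character by character.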

Exercise: How would you read the following expressions:

      /^cat$/
      /^$/
      /$^/        # Tip: See perlvar
      /^/
      /$/
      //
      /c^at/
      /cat^/
      /$cat/      # This is not "end-of-line" + "cat", but a variable

3.3 Matching any character

The meta-character dot (.) is shorthand for a pattern that matches any character except newline. For example, if you want to search for dates in year-month-day order such as the ISO 8601 form 2000-06-01, but also 2000/06/01 or 2000.06.01, you could construct the regular expression like this:

[Picture 15.  pic/lect-regexp-dot.jpg]
Picture 15. DOT metacharacter matches anything

A more accurate version is presented below. Notice again the different semantics inside the regexp: the dot's special meaning takes effect only outside of a character class.

              Note, the "/" must be escaped with \/ because
              it would otherwise terminate Perl Regexp / ..... /
              |
      /2000[.\/-]06[.\/-]/
    

Consider using the dot (.) only if you know that the data is in a consistent format, because otherwise it may cause trouble and match lines that you did not want. Note: The above regexp is not secure enough for production use. What's wrong with it?

Exercise: Analyze the following regular expressions. What lines would they match, and what lines not?
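
The difference between the dot and an explicit separator class can be sketched as follows. Both regexps are written out in full here (including the day part 01), which is an assumption, as are the sample strings and the helper name count_matches.

```perl
use strict;
use warnings;

my @dates = ("2000-06-01", "2000/06/01", "2000.06.01", "2000a06b01");

my $with_dot   = qr/2000.06.01/;             # any character as separator
my $with_class = qr/2000[.\/-]06[.\/-]01/;   # only ".", "/" or "-"

# Count how many items match a compiled regexp.
sub count_matches {
    my ($re, @items) = @_;
    return scalar grep { $_ =~ $re } @items;
}

print "dot:   ", count_matches($with_dot, @dates), "\n";
print "class: ", count_matches($with_class, @dates), "\n";
```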

      /./
      /^.../
      /...$/

Exercise: Write a regular expression to find any date written in year-month-day order, as in ISO 8601. This time allow anything as the date value separator:

      2002.12.01
      2001/03/15
      2001*01*03
      2000%01%10
      2000%0%31

3.4 Alternatives and grouping

When you want to choose from several possibilities, you mean the word OR. The regular expression metacharacter for it is |, as in many programming languages. When used in regexps, the choices separated by it are called alternatives, and the construct as a whole is called alternation.

[Picture 16.  pic/lect-regexp-alteration.jpg]
Picture 16. ALTERNATION metacharacter; in Perl the first alternative that matches is used

Looking back at the color matching with gr[ea]y, it could have been written using alternation:

      /grey|gray/         Both regexps are effectively the same,
      /gr[ea]y/           but this one is also faster.
    

The alternation can't be written like this, which would mean a different thing, because everything on each side of the | is tried as one complete alternative:

       Choice one: "gre"
       |
      /gre|ay/
           |
           OR choice two: "ay"
    

In order to write the regexp this way, you need to say which characters belong together and restrict the range of effect of the alternation, in this case to just "e" and "a". You need grouping () like this:

[Picture 17.  pic/lect-regexp-alteration-grouping.jpg]
Picture 17. GROUPING and ALTERNATION metacharacters: parentheses and pipe are a common combination
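
The effect of the parentheses can be checked with a small Perl sketch. The grouped form /gr(e|a)y/ is assumed to be what the picture shows; the word list and the helper name matching are invented for illustration.

```perl
use strict;
use warnings;

my @words = ("grey", "gray", "okay", "greed");

# Return the items of a list that match a compiled regexp.
sub matching {
    my ($re, @items) = @_;
    return grep { $_ =~ $re } @items;
}

# Without grouping the choice is "gre" OR "ay", so every word is a hit.
print join(", ", matching(qr/gre|ay/, @words)), "\n";

# With grouping the choice is limited to "e" OR "a" between "gr" and "y".
print join(", ", matching(qr/gr(e|a)y/, @words)), "\n";
```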

Character classes and alternation overlap, but it is important to remember that a character class matches only one character, whereas alternation separates choices of any length. To match a range of digits with alternation you would need to write a long list, compared to the elegant character class notation. It is not a good idea to use the first regexp below:

      /<H(1|2|3|4|5|6)>/      DON'T WRITE LIKE THIS
      /<H[1-6]>/              This is much better and FASTER

    

Exercise: Find all colors from following lines

      The sky is blue.
      Well, I think it's more in the reddish side.
      C'mon, I see brown there at the skyline!
      You must be all blind colored, that's definitely yellow there.
      I would say ... it also shows shades of gray.
      What? You mean strokes of white?

Exercise: Find transportation and times from these lines

      The train 07:30 to Monaco.
      The bus 09:00 to Vienna.
      The plane 15:33 to Helsinki.

Exercise: Explain following regular expressions:

      /()Hmmm()/
      /([)Hmmm(])/
      /[()]Hmmm[()]/

Exercise: Fix regular expression below to match the books correctly:

      "Your Book is here"
      "A big novel is interesting"
      "The pocket book is of size A6"

      /(an|the) (book|novel|pocket book)/

Delimiting numeric ranges with a regex is possible, but it can get quite complicated. Remember that regular expressions treat all characters as "atoms"; the engine does not know what the range 26-33 means the way we humans do. It only sees the character "2" followed by the character "6". You have to break the numeric range into individual characters and their ranges:

      /26-33/             # Wrong
      /[26-33]/           # Wrong, that's a class of "2" "6-3" and "3"
      /2[6-9]|3[0-3]/     # Right. 20-or-30 based numbers
       |      |
       |      30, 31, 32, 33
       26, 27, 28, 29
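
To check the decomposition, the sketch below wraps the same alternation in a group with ^ and $ so that the whole string has to be a number from 26 to 33. The anchoring and the helper name in_26_to_33 are additions for illustration.

```perl
use strict;
use warnings;

# True when $n, taken as a whole string, is a number from 26 to 33.
sub in_26_to_33 {
    my ($n) = @_;
    return $n =~ /^(2[6-9]|3[0-3])$/ ? 1 : 0;
}

# Print every number below 100 that the regexp accepts.
print join(" ", grep { in_26_to_33($_) } 0 .. 99), "\n";
```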
    

Exercise: Write a regular expression to correctly match a 24 hour clock. Note: 24:00 is a special case.

Exercise: Write a regular expression to correctly match a US and Great Britain style 12-hour clock, which ticks until 12:00 and requires am and pm.

Exercise: Write a regular expression to correctly match the days of a month, up to 31.

3.5 Alternatives and anchors

Alternation is usually put inside parentheses, and that is the best safeguard if you are even a bit unsure. Consider what would happen if we added the anchors beginning-of-line and end-of-line to the regexp:

[Picture 18.  pic/lect-regexp-alteration-grouping-start.jpg]
Picture 18. Parentheses are missing if the anchors were intended to span both alternatives

The regular expression engine does not interpret this as "find, at the beginning of the line, choice grey OR choice gray". The key to understanding is that each alternative is a complete and self-contained subexpression. The regexp engine reads the regexp as two separate regexps. Here the beginning-of-line anchor affects only the first alternative, not the second.

      First try this:  /^grey/
      Then try this:   /gray/
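
The two readings can be compared directly. In this sketch /^grey|gray/ is the form discussed above and /^(grey|gray)/ is the parenthesized form; the sample lines and the helper name lines_matching are assumptions.

```perl
use strict;
use warnings;

my @lines = ("grey skies", "gray skies", "a grey day", "a gray day");

# Return the lines that match a compiled regexp.
sub lines_matching {
    my ($re, @items) = @_;
    return grep { $_ =~ $re } @items;
}

# The anchor belongs to the first alternative only: "gray" may be anywhere.
print join(" | ", lines_matching(qr/^grey|gray/, @lines)), "\n";

# The parentheses make the anchor apply to both alternatives.
print join(" | ", lines_matching(qr/^(grey|gray)/, @lines)), "\n";
```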
    

Note: BE VERY CAREFUL NOT TO LEAVE AN EMPTY ALTERNATIVE AT THE END. Try and see what this regular expression would match, and think about it for a moment.

      /(this|that|)/
                 ^
                 ^
                 Oops !!

    

Exercise: Which of the choices below is equivalent to the regular expression /^this|that$/ ?

      /^(this|that)$/
      /(^this)|(that$)/
      /^(this)|(that)$/

Exercise: Explain these regular expressions.

      /a|b/
      /[a|b]/
      /[a]|[b]/

Exercise: Write the smallest regular expression possible to match the lines below:

      First Street
      1st street

Exercise: Write the smallest regular expression possible to match the lines below:

      Jul 4
      July 4th
      Jul fourth

Exercise: Make a simple summary of newsfeed/mailbox by extracting few key fields: From, Subject and Date. Use sample mailbox at mailbox-comp.lang.perl.moderated

3.6 Nested parentheses

You can have more than one set of parentheses and parentheses are numbered by counting open parentheses from left to right:

[Picture 19.  pic/lect-regexp-paren-nested.jpg]
Picture 19. Counting nested parentheses from left to right

Outside of the regular expression the captured values are available through the perl variables $1, $2 .. $N, and there is no limit of 9 groups: you can e.g. refer to $22. Parentheses used this way are called capturing parentheses. You can print the captured text in perl like this:

      % perl -ne "print $1 if /(Mike|Dan)/" file.txt
                        |      |
                        |      capture value inside ()
                        |
                        print variable where captured text is stored
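
Numbering by open parenthesis can be made visible with nested groups. The regexp and the helper name parse_time below are made up for illustration: group 1 opens first and wraps the whole time, group 2 holds the hours and group 3 the minutes.

```perl
use strict;
use warnings;

# Return (whole time, hours, minutes), or an empty list when nothing matches.
sub parse_time {
    my ($text) = @_;
    if ($text =~ /(([0-9][0-9]):([0-9][0-9]))/) {
        return ($1, $2, $3);
    }
    return;
}

my ($time, $hours, $minutes) = parse_time("The train 07:30 to Monaco.");
print "time=$time hours=$hours minutes=$minutes\n";
```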
    

Yet another classical example to match temperature values:

[Picture 20.  pic/book-regexp-2-3-nesting-parens.jpg]
Picture 20. Nesting parentheses with temperatures

3.7 Grouping parentheses and back references

Parentheses can remember the text matched by the sub-expression they enclose. For example, to remove doubled words like "the the" or "one one" you could match the first word and then refer to it for the next one. There is only one problem with simple matches: they would also find "the theory" and "one one-list". Referring to an existing match is known as backreferencing. Backreferencing uses the special notation \N, where N is 1, 2 .. 9. This special syntax can be used only inside the regular expression. Beyond \9 the notation becomes ambiguous: if you were tempted to write \22, Perl would treat it as a backreference only if the pattern really had at least 22 groups, and as an octal character escape otherwise.

[Picture 21.  pic/lect-regexp-backref-the-the.jpg]
Picture 21. Using capturing parentheses and memory \1

A more general solution to find double words might be

[Picture 22.  pic/lect-regexp-backref-the-the-general.jpg]
Picture 22. Double word searcher.
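
One possible shape for such a searcher is sketched below. It is not necessarily the regexp from the picture: the pattern, the helper name find_doubles and the sample text are assumptions. It uses \w and \s from the table in section 2.4.4, the word boundary \b, and the i modifier so that "This this" counts as a double.

```perl
use strict;
use warnings;

# Return the first word of every doubled pair found in $text.
sub find_doubles {
    my ($text) = @_;
    my @found;
    # (\w+) captures a word, \s+ allows spaces or newlines, \1 repeats it.
    while ($text =~ /\b(\w+)\s+\1\b/gi) {
        push @found, $1;
    }
    return @found;
}

my $text = "This this is the\nthe theory";
print join(", ", find_doubles($text)), "\n";
```

Because \s also matches a newline, the pair "the" and "the" is found even though the words are on separate lines, while "the theory" is left alone thanks to the final \b.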

Exercise: Match the HTML tags correctly, so that <EM> is closed with </EM> and so on.

          <I>italics</I>
          <B>bold</B>
          <STRONG>Strong like bold</STRONG>
          <EM>emphasis like italics</EM>

3.8 Problems with parentheses

You can print what a pair of parentheses matched with the perl variable $N, but counting N becomes interesting when you have marked some parentheses optional:

      /([0-9][0-9]):([0-9][0-9]) *(am|pm)?/
       |            |             |
       1st          2nd           3rd, BUT OPTIONAL, notice "?" at the end


      #!/usr/bin/perl
      use English;
      $ARG = "11:12";

      print $3 if /([0-9][0-9]):([0-9][0-9]) *(am|pm)?/;
      #     |
      #     won't print anything
      __END__
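
A defensive variant of the same match is sketched here. The regexp is the one from the program above; the helper name parse_clock and the "(none)" placeholder are assumptions. The point is that $3 is undefined when the optional group did not take part in the match, so it is tested with defined before use.

```perl
use strict;
use warnings;

# Return (hours, minutes, am/pm marker) or an empty list when nothing matches.
sub parse_clock {
    my ($text) = @_;
    return unless $text =~ /([0-9][0-9]):([0-9][0-9]) *(am|pm)?/;
    return ($1, $2, defined $3 ? $3 : "(none)");
}

print join(" ", parse_clock("11:12")), "\n";
print join(" ", parse_clock("11:12 pm")), "\n";
```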
    

3.9 Quantifiers (basic, greedy)

Remember that a quantifier affects the immediately preceding regular expression element, whatever that may be. If you group part of the regular expression, the quantifier affects the whole group: /(grey|gray)*/ would match repetitions such as greygrey, graygray, or any combination like greygray.

If you want to repeat something, it is possible to just repeat the character class, e.g. to match a three-letter lowercase word:

      /[a-z][a-z][a-z]/
    

Usually you don't know how long a word will be, and trying to cover multiple lengths by adding more alternatives quickly blurs the regexp.

      /[a-z][a-z][a-z]|[a-z][a-z]|[a-z]/
       |               |          |
       |               |          OR maybe ONE character?
       |               Try TWO characters
       Try THREE characters
    

There are three basic meta-characters that repeat the previous element, from zero or one up to as many times as needed. They are:

  • ? Match 0 or 1 times [NOTE] Will always match
  • * Match 0 or more [NOTE] Will always match
  • + Match 1 or more

[Picture 23.  pic/lect-regexp-quantifier-basic-list.jpg]
Picture 23. Basic quantifiers (repeaters)

Pay attention to the notes above. If the regular expression is not required to match, as the zero count allows, it will flag "ok" and the regular expression is happy with it. It has succeeded. Consider following examples and their differences for a while:

      /a/    /a+/    /a*/    /a?/
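
To see the difference in practice, the four patterns can be tried against one string. This is a small sketch; the sample text "xaay" and the variable names are chosen only for this illustration.

```perl
use strict;
use warnings;

my $text = "xaay";   # sample text, made up for this illustration
my %found;

for my $quantified ("a", "a+", "a*", "a?") {
    # The pattern text is interpolated into the match on purpose.
    $found{$quantified} = $text =~ /($quantified)/ ? "[$1]" : "no match";
    print "/$quantified/ gives $found{$quantified}\n";
}
```

The patterns /a*/ and /a?/ report an empty match, because zero repetitions already succeed at the very first position, before the "x".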
    

Let's take a real example from the HTML specification, which says that "whitespace are allowed inside tags". The term whitespace here refers to both the space and the tab character. Perl, on the other hand, considers whitespace to include more characters: [ \t\n\r\f]. Here are a few HTML tags. Refer to the "HTML 4.01, section 5.5" specification at <http://www.w3c.org/>.

      <H1>
      <H1 >
      <H1            >
      <      H1>
      < H1 >
    

The star (*) quantifier is suited for this kind of repeating, when there can be any number of whitespace including none:

      /< *H[1-6] *>/
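
The pattern can be checked against the sample tags above. A minimal sketch, assuming a made-up helper is_heading and one extra counter-example ("<H 1>") that is not in the original list:

```perl
use strict;
use warnings;

# Sample tags from the text, plus one that must not match
my @tags = ("<H1>", "<H1 >", "<H1            >", "<      H1>", "< H1 >", "<H 1>");

# Hypothetical helper: 1 for a heading tag with optional spaces, else 0
sub is_heading {
    my ($tag) = @_;
    return $tag =~ /< *H[1-6] *>/ ? 1 : 0;
}

for my $tag (@tags) {
    print "$tag => ", is_heading($tag) ? "match" : "no match", "\n";
}
```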
    

Exploring further, the HR horizontal rule element can contain a specifier SIZE where a line of 14 pixels wide is drawn across the screen:

      <HR SIZE=14>
    

This tag requires a bit more care in crafting the regular expression, because there must be a separation between HR and SIZE: the tag is not allowed to read "<HRSIZE=14>". We use the quantifier +, which allows any number of repetitions, but requires at least one.

      /< *HR +SIZE=14 *>/
             |
             There must be preceding space. The + will require at least 1
    

Exercise: There is still room for improvement in this regexp. Can you make it even better? (Don't concentrate on the case conversion: there is some other thing that needs to be addressed.)

Exercise: The XHTML standard takes a stricter approach by requiring that double quotes be used around an attribute's value. Modify the regexp to match both <HR SIZE=14> and <HR SIZE="14">

Master thesis study: It is possible that the HTML element includes several attributes, some of which do not contain a value part at all, like <HR id="xml" class="html" token>. Would it be possible to write a regular expression to cope with a varying number of attributes?

The expression is still inflexible with respect to the size given in the tag. It only finds size 14, when it should find HR elements with any size. We need a number that can be of varying length. A number is one or more digits. Notice the use of the (+) quantifier with the number class:

[Picture 24.  pic/lect-regexp-quantifier-basic-html-hr.jpg]
Picture 24. Combination of (*) and (+) quantifiers.
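
In runnable form the idea might look like the sketch below. The exact pattern shown in the picture is not reproduced here; the regexp and the helper name hr_size are assumptions that merely combine the (*) and (+) quantifiers as described.

```perl
use strict;
use warnings;

# Hypothetical helper: return the SIZE value of an HR tag, or undef.
sub hr_size {
    my ($tag) = @_;
    # " +" demands a separator, "[0-9]+" demands at least one digit
    return $tag =~ /< *HR +SIZE=([0-9]+) *>/ ? $1 : undef;
}

for my $tag ("<HR SIZE=14>", "<HR   SIZE=3 >", "<HRSIZE=14>", "<HR SIZE=>") {
    my $size = hr_size($tag);
    print "$tag => ", defined $size ? $size : "no match", "\n";
}
```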

All is well now, except that this regular expression won't find the basic <HR> any more, because it looks for the SIZE keyword as well. To make it complete, we must remember what the HR specification says: the SIZE part is optional.

Exercise: Extend the regular expression to find simple <HR> tag.

Exercise: If we convert the ORIGINAL regexp to A or B, which one would you think is correct: A or B?

      /[ \t]*/            ORIGINAL

      / *|\t*/            Is it like this (A)?
      /( |\t)*/           Is it like this (B)?

Exercise: How would you match subjects that are replied or forwarded multiple times?

      Subject: Hi There!
      Subject: Re: Hi There!
      Subject: Re: Re: Hi There!
      Subject: fw: Hi There!
      Subject: fw: fw: Hi There!
      Subject: fw: Re: Hi There!
      Subject: Re: fw: :Re: Hi There!

3.10 Quantifiers (basic, greedy, additional)

In addition to the basic quantifiers ?, * and +, which are part of every regular expression engine, there can be further meta-syntaxes that offer more precise control over exactly how many repetitions are matched. In Perl you will find the following extra syntaxes:
  • {n} exactly n times
  • {n,} at least n times
  • {n,m} at least n times, but no more than m times

[Picture 25.  pic/lect-regexp-quantifier-basic-additional.jpg]
Picture 25. Additional QUANTIFIERS (repeaters)

An example: let's say that you want to match decimal numbers of varying size, like 1.1 or 1.12 or 12.12, but nothing else. You know how long the numbers are, so you can use the extended quantifiers:

      /[0-9]{1,2}\.[0-9]{1,2}/
                 |
                 Notice this, remember what dot(.) would have been !!
    

Exercise: What if the numbers are always paired, like 01.10, 05.30 and 11.21?

You see a new little construct in this regular expression: an escaped dot \. The dot in regular expression notation means "any character", but we don't want that; we want the literal character "." that makes a decimal number. To prevent the dot (.) from acting as a meta-character, we must escape it, which disables its special meaning when the regular expression engine parses it. You can always add a backslash (\) in front of a punctuation meta-character to take it literally.
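
A short sketch shows the counted quantifiers and the escaped dot at work. The anchors ^ and $ are added here so that nothing else is accepted around the number; the helper name is_short_decimal and the sample values are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 when the text is N.N, N.NN, NN.N or NN.NN
sub is_short_decimal {
    my ($text) = @_;
    return $text =~ /^[0-9]{1,2}\.[0-9]{1,2}$/ ? 1 : 0;
}

for my $number ("1.1", "1.12", "12.12", "123.1", "1.123", "1x1") {
    print "$number: ", is_short_decimal($number) ? "ok" : "rejected", "\n";
}
```

With an unescaped dot the last sample, "1x1", would be accepted as well.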

3.11 Quantifiers (extended, non-greedy)

The normal quantifiers are greedy, meaning that they take as much of the text as they can. But this is not always desirable. Consider trying to match all the text between the bold tags in HTML:

      #!/usr/bin/perl
      use English;
      $ARG = "<B>Intel</B> and <B>AMD</B> are competitors.";

      #   Will print "Matched: Intel</B> and <B>AMD"

      print  "Matched: $1\n"  if /<B>(.*)<\/B>/;
      __END__
    

As you can see, the match does not stop at the first /B, but continues until the last /B; that is, the match is greedy. To overcome this, Perl has non-greedy quantifier alternatives, where an additional question mark is added to the end of the basic quantifiers:

  • ?? Match 0 or 1 times, preferring 0 [NOTE] Will always match
  • *? Match 0 or more times, as few as possible [NOTE] Will always match
  • +? Match 1 or more times, as few as possible
  • {n}? exactly n times. [NOTE] This is identical to plain {n}
  • {n,}? at least n times, as few as possible
  • {n,m}? at least n times, but no more than m times, as few as possible

[Picture 26.  pic/lect-regexp-quantifier-basic-additional-non-greedy.jpg]
Picture 26. Non-greedy (take as little as possible) quantifiers.

With non-greedy quantifier, the regular expression can match only the shortest possible portions:

      #!/usr/bin/perl
      use English;
      $ARG = "<B>Intel</B> and <B>AMD</B> are competitors.";

      #   Now correctly prints: "Matched: Intel"

      print  "Matched: $1\n"  if /<B>(.*?)<\/B>/;
      __END__
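
The two behaviours can also be compared side by side. The sketch below collects every match with the /g modifier in list context, which is explained later in section 5.5; the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

my $html = "<B>Intel</B> and <B>AMD</B> are competitors.";

my @greedy = $html =~ /<B>(.*)<\/B>/g;    # one long match
my @lazy   = $html =~ /<B>(.*?)<\/B>/g;   # one match per tag pair

print scalar(@greedy), " greedy match: $greedy[0]\n";
print scalar(@lazy), " non-greedy matches: @lazy\n";
```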
    

3.12 Perl's shorthand regular expressions

In addition to the previous backslash notations, some basic regular expression classes have short notations. Keep in mind that the Perl word class \w describes a "programming language word" rather than a spoken language word. It won't match e.g. the whole word "non-greedy", because the class doesn't include the dash character "-".

      \w  Match a "word" character (alphanumeric plus "_") [a-zA-Z0-9_]
      \W  Match a non-word character  [^a-zA-Z0-9_]
      \s  Match a whitespace character [ \t\n\r\f\v]
      \S  Match a non-whitespace character [^ \t\n\r\f\v]
      \d  Match a digit character [0-9]
      \D  Match a non-digit character [^0-9]
    

The shorthand regular expressions are easier to understand and shorter to write. Below are some basic regular expressions and their roughly equivalent short notations (note that \s+ accepts more whitespace than a single space or tab):

      /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/  <=>  /\d\d\d\d-\d\d-\d\d/
      /void[ \t][a-zA-Z_]+\(.*\)/                   <=>  /void\s+\w+\(.*\)/
      /time is [0-9][0-9]:[0-9][0-9]/               <=>  /time\s+is\s+\d\d:\d\d/
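
As a quick check of the first pair, both notations can be run over the same data. A sketch only; the sample strings are made up for this illustration.

```perl
use strict;
use warnings;

# Made-up sample data: two of the four strings contain a YYYY-MM-DD date
my @samples = ("2006-02-28", "06-02-28", "2006/02/28", "on 1999-12-31 at noon");

# grep in scalar context returns how many elements matched
my $long  = grep { /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ } @samples;
my $short = grep { /\d\d\d\d-\d\d-\d\d/ } @samples;

print "long form matched $long, short form matched $short\n";
```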

Newer versions of Perl added Unicode support and these sequences:

      \pP Match P, named property. Use \p{Prop} for longer names.
      \PP Match non-property, non-P
      \X  Match eXtended Unicode "combining character sequence",
          equivalent to (?:\PM\pM*)
      \C  Match a single C char (octet) even under utf8.
    

Two special backslash tokens are useful together with regular expressions:

      \Q          quote (disable) pattern metacharacters till \E
                  NOTE: You still have to escape Perl's special
                  variable characters \$ and \@
      \E          end marker
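
The effect is easiest to see when the text to search for comes from a variable. A minimal sketch; the strings and variable names are invented for this illustration.

```perl
use strict;
use warnings;

my $literal = "1+1=2 (really?)";                        # text to find as-is
my $text    = "They say 1+1=2 (really?) every day";

my $raw    = $text =~ /$literal/     ? 1 : 0;   # +, ( ) and ? act as metacharacters
my $quoted = $text =~ /\Q$literal\E/ ? 1 : 0;   # everything is taken literally

print "raw: $raw, quoted: $quoted\n";
```

Without \Q the pattern asks for one or more "1", then "1=2 ", then "reall" and an optional "y", which the text does not contain.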
    

Exercise: In regular expression language there are many metacharacters. Find an alternative \Q and \E notation for this regular expression:

      /\[\/path\/name\/\]/

Returning to HTML tag matching, a regular expression to match an HTML tag can now be written as:

      <B>         /<B>/
      < B>        /<\s*B>/
      <B >        /<B\s*>/
      < B >       /<\s*B\s*>/

      <           /<\s*B\s*>/                 NOTE: COMPARE TO ABOVE !!
        B         /<[ \t\r\n]*B[ \t\r\n]*>/   ... AND THE OLD STYLE
      >
    

3.13 Discussion: line ending and operating systems

You remember that the dot (.) matches every character except newline. If you wanted to match the whole line including the terminating newline, you might write:

[Picture 27.  pic/lect-regexp-dot-newline.jpg]
Picture 27. Matching complete line

However, in the Win32 operating system, the line ending (as found in C:\autoexec.bat) is a combination of two characters, CR and LF, and to refer to them you need to list the characters explicitly. To summarize the end-of-line terminators in various operating systems:

  • \r\n are used in Windows (or "Win32") operating systems.
  • \n is used in Unix or Linux systems
  • \r is used in the classic Mac operating system (before Mac OS X).

Notice that no end-of-line anchor $ is used, only a beginning-of-line anchor:

      /^(.*)[\r\n]/
            |
            The last character can be either CR or LF
    

Exercise: What's wrong with the above regular expression? Does it work for every operating system?

Exercise: Why can't you include end-of-line anchor $ in the regular expression above?


4.0 Perl zero width assertions

The term zero-width means that the regular expression element does not consume any character. It is just a marker, or a spot, where regular expression takes place. In fact the position is between characters and this is important to understand.

4.1 Beginning of line (^) and (\A)

[Picture 28.  pic/lect-regexp-anchor-start-a.jpg]
Picture 28. Two methods to match logical lines and beginning of string

The beginning-of-line assertion (^) can work between lines: with the /m modifier it matches at the start of the string and just after every embedded newline. There is also a more specialized assertion (\A) that matches only at the beginning of the string, even in a multi-line string.

Exercise: Spend a moment thinking what would plain anchor match?

      /(^)/
      /(\A)/
    

4.2 End of line ($) and (\Z) and (\z)

[Picture 29.  pic/lect-regexp-anchor-end-z.jpg]
Picture 29. Methods to match end of line

The \A and \Z won't match multiple times when the /m modifier is used, whereas ^ and $ will match at every internal line boundary. \Z still matches just before a newline at the end of the string; to match only at the very end of the string, use \z.
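
The three end anchors can be compared on a two-line string. A small sketch; the sample string and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";

my @dollar   = $text =~ /(\w+)$/mg;       # every line end under /m
my @big_z    = $text =~ /(\w+)\Z/mg;      # only before the final newline
my $little_z = $text =~ /\w\z/ ? 1 : 0;   # is the very last character a word character?
my $newline  = $text =~ /\n\z/ ? 1 : 0;   # is the very last character a newline?

print "dollar: @dollar\n";
print "big Z: @big_z\n";
print "little z: $little_z $newline\n";
```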

Exercise: Can you figure out what the following regexps mean? Can they match a line that contains a line terminator?

      /.$/
      /.\z/
      /\n$/
    

4.3 Word (\b) and non-word (\B) boundaries

A common problem is that a regular expression that matches the word you want can also match where the "word" is embedded within a larger word. The word boundary meta-character in Perl is \b. It behaves like an anchor and belongs to the set of zero-width assertions, which means that it never consumes a character. The anchor matches a position before or after a word, in effect the spot between two words if needed:

[Picture 30.  pic/lect-regexp-word-boundary.jpg]
Picture 30. WORD anchor which does not consume any characters.

Note: Pay attention to the locations of (\b), which are between the characters.

How is the word boundary defined? A regular expression does not know anything about the semantics of spoken or artificial languages, or about "words" in the sense that we have learned. We have to know the definition of a word from the perspective of the regular expression engine. In Perl a word is any combination of characters in the class [a-zA-Z0-9_].

[Picture 31.  pic/lect-regexp-word-definition.jpg]
Picture 31. How perl sees WORD positions.

In regular expression syntax, there usually is a meta-character notation to mark a word boundary, but the meta-character varies between languages and tools:

[Picture 32.  pic/lect-regexp-word-tokens.jpg]
Picture 32. Word boundary tokens in languages and tools that support regular expressions.

Here are a few examples. In Emacs the regular expression search command is named re-search-forward.

      % perl -ne "print if /\bcat\b/" file.txt
      % egrep "\<cat\>" file.txt
    

You can think of \b as a synonym for "Text must be separated" and \B as a synonym for "keep together".

[Picture 33.  pic/lect-regexp-word-boundary-non-b.jpg]
Picture 33. Matching non-word boundaries. Read \B as: Text must be kept together.
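
Both readings can be tried on one line of text. This is a sketch with a made-up sample sentence; the /g modifier in list context (see section 5.5) is used only to collect all matches.

```perl
use strict;
use warnings;

my $text = "cat concatenate cats tomcat cat-days";

my @separate = $text =~ /\bcat\b/g;   # "cat" standing alone
my @inside   = $text =~ /\Bcat\B/g;   # "cat" kept together with its neighbours

print scalar(@separate), " separated\n";
print scalar(@inside), " embedded\n";
```

The separated hits are the first word and the "cat" of "cat-days", since the dash is not a word character; the only embedded hit is inside "concatenate".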

Exercise: Why isn't cat-days matched with regexp:

      /cat\B/

Exercise: Are these regular expressions sensible?

      /\Bion/
      /\w\Bion/
      /\wion/

4.4 Continue match from last position (\G)

Note: This discussion goes beyond the average use of the regular expressions. You must study this document to understand what notations ?: and *? mean.

The \G assertion can be used to chain global matches (using m//g), as described in the section on "Regexp Quote-Like Operators" in the [perlop] manpage. \G is a zero-width assertion that matches the exact position where the previous m//g, if any, left off. The \G assertion is not supported without the /g modifier. An example demonstrates this assertion best. You can read \G as "try next full match". Here we try to find series of three digits from the full line, all of which start with "1". The sequences have been broken up for clarity in the comment. See [Friedl] page 236. Below is the obvious regular expression that comes to mind, but it does not work right, because the regular expression engine bumps along one character at a time and so finds matches that straddle the groups of three, resulting in wrong answers.

      #!/usr/bin/perl
      use English;
      $ARG = "112231121155";

      #   Correct sequence of THREES would be:
      #       112 231 121 155;
      #
      #   But our regular expression does not count in THREES

      print "Matched: $1\n" while /(1\d\d)/g;
      __END__

      -->
      Matched: 112
      Matched: 112
      Matched: 115
    

The way to solve this is to force the regular expression always to consume three digits at a time, so that the next match continues from the position where the previous one ended. In the regexp below, the (?:\d\d\d)*? part consumes nothing if the group at the current position starts with the number 1. If it does not, it consumes three digits at a time until such a group is found, which keeps the matches aligned:

      print "Matched: $1\n" while /\G(?:\d\d\d)*?(1\d\d)/g;

      -->
      Matched: 112
      Matched: 121
      Matched: 155
    

What actually happens between these two regular expressions is illustrated below.

[Picture 34.  pic/lect-regexp-anchor-g.jpg]
Picture 34. Testing group of 3 digit sequences


5.0 Perl Regular expression modifiers

5.1 Perl match operator

In Perl, the regular expression is enclosed in slashes /REGEXP/, but there is also a formal match command m/REGEXP/ where the delimiter character can be chosen. Let's take an example: matching a filename. Filenames frequently contain slashes, and the standard delimiter forces you to escape every "/" path delimiter:

      #!/usr/bin/perl
      use English;
      $ARG = "/home/bin/perl/test.pl";

      print "Path: $1 Bin: $2\n" if /^\/(.*\/)(.*)/;
      __END__
    

It can be written in more readable form by using some other delimiter:

[Picture 35.  pic/lect-regexp-match-operator.jpg]
Picture 35. Pattern-Match operator and changing the delimiter

The choice of the delimiter is up to the user, but it is recommended that a visually appealing and clear delimiter is chosen. Above, the colon contrasts well with the forward slashes.
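
In plain text, the same match with a colon as delimiter might be written as in the sketch below. The exact code of the picture is not reproduced; this is an assumed equivalent of the earlier slash-delimited example, with variable names chosen for the illustration.

```perl
use strict;
use warnings;

my $path = "/home/bin/perl/test.pl";

# Same pattern twice: default slashes need escaping, m:...: does not.
my @slashes = $path =~ /^\/(.*\/)(.*)/;
my @colons  = $path =~ m:^/(.*/)(.*):;

print "Path: $colons[0] Bin: $colons[1]\n";
```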

[Picture 36.  pic/lect-regexp-match-operator-choices.jpg]
Picture 36. Pattern-Match operator and different delimiter choice

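Some of the choices do not work as written; try them with warnings enabled. A letter, digit or underscore directly after m is read as part of an ordinary word, so Perl does not see a match operator at all and may warn "Useless use of a constant in void context". The @ works as a delimiter, but it is also the Perl array symbol, so it is easy to misread.

      m@something@;       # Works, but easily confused with an array
      m_something_;       # Error: read as the word "m_something_"
      m3something3;       # no numbers allowed
      mYsomethingY;       # no letters allowed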
    

5.2 Modifiers in matches

Matching operations can have various modifiers. Modifiers that relate to the interpretation of the regular expression inside are listed below. Modifiers that alter the way a regular expression is used by Perl are detailed in the section on "Regexp Quote-Like Operators" in the perlop manpage and the section on "Gory details of parsing quoted constructs" in the perlop manpage. The modifiers are placed after the regular expression syntax:

      /REGEXP/modifiers       General syntax

      m/REGEXP/cgimosx        The (m)atch command as delimiter change
      /REGEXP/cgimosx

      /cat/i                  Example: Do case-insensitive "cat" matching
      m,/,                    Match the "/" character in string
    

The list of modifiers as of Perl 5.6 is:

  • i Do case-insensitive pattern matching.
  • g Match globally, i.e., find all occurrences.
  • m Treat string as multiple lines.
  • s Treat string as single line.
  • x Use extended regular expressions.
  • o Compile pattern only once.
  • e Evaluate perl code (substitution s/// only)
  • c Do not reset search position on a failed match when /g is in effect.

5.3 Extended writing mode (x)

Rule: Modifier x allows the regexp to be written freely: whitespace is ignored when you use the /x modifier.

Extend your pattern's legibility by permitting whitespace and comments.

These are usually written as "the /x modifier", even though the delimiter in question might not really be a slash. Any of these modifiers may also be embedded within the regular expression itself using the (?...) construct. See below.

The /x modifier itself needs a little more explanation. It tells the regular expression parser to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a meta character introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), that you'll either have to escape them or encode them using octal or hex escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable. Note that you have to be careful not to include the pattern delimiter in the comment--perl has no way of knowing you did not intend to close the pattern early.

      /\bcat[a-z]+/;     <=>   / \b  cat [a-z]+ /x;
                                =  ==   =      =   <== spaces ignored
    

In complex regular expressions, you can't conveniently write the whole regular expression on one line, so breaking it into separate lines makes the regular expression more manageable. Notice also the use of the /i modifier together with the /x modifier.

      # "Book is here"
      #  ====
      # "A novel is interesting"
      #  =======
      # "The pocket book is A6"
      #  ==========

      print if
      /
          #  The book name can be at the start of string
          ^(book|novel|pocket)

          |            #    OR-MARKER

          #  There can also be "a" "an" or "the" before the book
          \b(an?|the)\s+(book|novel|pocket)\b

      /ix;
    

5.4 Ignore case (i)

Rule: Modifier i makes all matches case insensitive.

[Picture 37.  pic/lect-regexp-i-modifier.jpg]
Picture 37. The (i)gnore case modifier

To ignore character case in matching and treat upper and lower case the same, add this modifier to the regular expression. If `use locale' is in effect, the case map is taken from the current locale. See the [perllocale] manpage.

Exercise: Does it make difference if you would write /<hr>/i instead of /<HR>/i ?

Further reading: ignoring case and command line options

HTML tags can be written with mixed case capitalization, so <h3> and <H3> or <HR SIZE=14> and <Hr Size=14> are all legal. Making a regexp to match a simple tag is easy, /<[Hh][0-9]>/, but making a longer regexp ignore capitalization becomes more troublesome using plain regular expression atoms:

      /<[Hh][Rr] [Ss][Ii][Zz][Ee]=14/       Hm!
    

Most languages and tools that support regular expressions can perform a match in a case-insensitive manner. This feature is not part of the regular expression language itself, but something the tool or programming language provides. In Perl it is the /i modifier of this section; Perl has no command line option for ignoring case (the -i switch means "edit files in place" and would overwrite the file, see [perlrun]). From the command line prompt:

      % perl -ne "print if /cat/" file.txt
      % perl -ne "print if /[Cc][Aa][Tt]/" file.txt

      % perl -ne "print if /cat/i" file.txt
                                |
                                The /i modifier means: "ignore case"
    

With the egrep(1) tool, the -i command line option ignores case:

      % egrep -i "cat" file.txt
    

Emacs controls case sensitivity with the variable case-fold-search. The value t makes searches ignore case, like Perl's /i modifier.

      M-x set-variable RET case-fold-search RET t RET
      C-s <Search text forward>
    

5.5 Global matching (g)

Rule: Modifier g forces regexp to find all matches.

The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

[Picture 38.  pic/lect-regexp-g-modifier.jpg]
Picture 38. The (g)lobal modifier

A simple perl program demonstrates traversal of all matches:

      #!/usr/bin/perl
      use English;
      $ARG = "12 123 1234";

      print "position ", pos(), "\n"  while  /(12)/g;
      __END__

      --> position 2
      --> position 5
      --> position 9
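
The list context behaviour described above can be sketched as follows. The variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

my $text = "12 123 1234";

my @whole = $text =~ /\d+/g;        # no parentheses: the whole matches
my @pairs = $text =~ /(\d)(\d)/g;   # parentheses: only the captured pieces

print "whole: @whole\n";
print "pairs: @pairs\n";
```

The second list holds two elements per match, here 1 2 1 2 1 2 3 4, because each successful match contributes both of its groups.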
    

Exercise: Use g modifier and print all positive whole numbers

      +376 33 -21 1 -200
      -1 2 -3.3 -10

Exercise: Use g modifier and print all URLs on the line

      http://www.this.is/that.html  http://my.it/a.html
      http://come.to/a/b.htm  http://127.0.0.1/test.html

5.6 Span multiple lines (m)

Rule: Modifier m affects anchors (^) and ($).

[Picture 39.  pic/lect-regexp-m-modifier.jpg]
Picture 39. The (m)ultiline modifier

Exercise: What happens if you leave out the (g)lobal modifier and use /^1./m instead of /^1./gm

Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string. The following short program demonstrates that the beginning of line anchor moves in the string.

      #!/usr/bin/perl
      use English;

      $ARG = "12 abc\n13 abc\n";
      print "Matched: $1\n" while /^(1.)/gm;
      __END__

      -->
      Matched: 12
      Matched: 13
    

Exercise: Print the second word from each line: suppose that the input is concatenated as one big string with embedded newlines.

      #!/usr/bin/perl
      use English;

      sub Main ()
      {
          $ARG = "
          col1  col2 col3 col4
          a     1    aa   11
          b     2    bb   22
          c     3    cc   33
          ";

          <... now print col2 from every line ...>
      }

      Main()
      __END__

5.7 Single line (s)

Rule: Modifier s affects dot (.) metacharacter.

[Picture 40.  pic/lect-regexp-s-modifier.jpg]
Picture 40. The (s)ingle line modifier

Treat the string as a single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match. Used without /m, the /s modifier also leaves ^ matching only at the beginning of the string and $ only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." (dot) match any character whatsoever, while yet allowing ^ and $ to match, respectively, just after and just before newlines within the string.
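
A minimal sketch of the effect on the dot, using a made-up string with an embedded newline:

```perl
use strict;
use warnings;

my $text = "<B>bold\ntext</B>";

my $plain  = $text =~ /<B>(.*)<\/B>/  ? 1 : 0;   # dot stops at the newline
my $single = $text =~ /<B>(.*)<\/B>/s ? 1 : 0;   # dot crosses the newline

print "without /s: $plain, with /s: $single\n";
```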

5.8 Lock regular expression (o)

Rule: Modifier o affects perl variables that are used in regexps.

[Picture 41.  pic/lect-regexp-o-modifier.jpg]
Picture 41. The l(o)ck modifier

Using a variable in a regular expression match forces a re-evaluation (and perhaps re-compilation) each time through. The /o modifier locks in the regexp the first time it's used. This always happens in a constant regular expression, and in fact, the pattern was compiled into the internal format at the same time your entire program was. Use of /o is irrelevant unless variable interpolation is used in the pattern, and if so, the regexp engine will neither know nor care whether the variables change after the pattern is evaluated the very first time.

/o is often used to gain an extra measure of efficiency by not performing subsequent evaluations when you know it won't matter (because you know the variables won't change), or more rarely, when you don't want the regexp to notice if they do.

      #!/usr/bin/perl
      use English;

      $ARG = "Hello there";
      my $regexp = '(\w+)';

      print "Matched: $1\n" while /$regexp/go;
      __END__

      -->
      Matched: Hello
      Matched: there
    

5.9 Advanced: Evaluate perl code (e)

This modifier is used with the substitution operator s/// and it allows you to run arbitrary perl code in the SUBSTITUTION portion. Examples best demonstrate the usage:

      s/\b(\d+)/$1 + 10/eg;     # increase all values by 10
      s/\b(\d+)/$1 * 1.10/eg;   # increase all values by 10%

The following will format all numbers to have exactly two decimals:

[Picture 42.  pic/lect-regexp-subst-e-sprintf.jpg]
Picture 42. The (e)val modifier

      20.222      =>  20.22
      1.12        =>  1.12
      200         =>  200.00
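
One way to get this result is to call sprintf in the replacement part. The sketch below is an assumption, not necessarily the code of the picture; the helper name two_decimals is made up, and the non-capturing (?:...) group it uses is explained in section 6.1.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every number in the text with two decimals.
sub two_decimals {
    my ($text) = @_;
    # Digits with an optional fraction; /e runs sprintf on each match
    $text =~ s/(\d+(?:\.\d+)?)/sprintf("%.2f", $1)/eg;
    return $text;
}

print two_decimals("20.222 1.12 200"), "\n";
```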

5.10 Advanced: Do not reset position, continue (c)

Rule: Modifier c remembers last match position.

A failed match normally resets the search position to the beginning of the string, but you can avoid that by adding the /c modifier (e.g. m//gc). Modifying the target string also resets the search position. The perl function pos() tells the position of the last match in string.

      #!/usr/bin/perl
      use English;

      #       01234567            <= positions in a string
      $ARG = "123 12 4";
      print "match $1\n" while /(12)/gc;
      print "Match pos = ", pos(), "\n";
      __END__

      -->
      match 12
      match 12
      Match pos = 6

5.11 Advanced: Modifier (?imsx-imsx) localization

One or more embedded pattern-match modifiers. This is particularly useful for dynamic patterns, such as those read in from a configuration file, read in as an argument, or specified in a table somewhere. Consider the case where some patterns need to be case sensitive and some do not. The case-insensitive ones merely need to include (?i) at the front of the pattern. The simple way to ask for case insensitivity is:

      $pattern = "foobar";

      if ( /$pattern/i )                  # HARD CODED
      {
          # do something
      }

If the case sensitivity can be either on/off, it can be passed in a variable:

      $pattern = "(?i)foobar";            # Can be changed on the fly

      if ( /$pattern/ )
      {
          # do something
      }
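
With several patterns in a table the difference becomes visible. A sketch only: the pattern list stands in for values that might come from a configuration file, and all names are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical patterns, as they might be read from a configuration file
my @patterns = ("foobar", "(?i)foobar");
my $text     = "FooBar";

my %result;
for my $pattern (@patterns) {
    # No /i here: each pattern decides about case for itself
    $result{$pattern} = $text =~ /$pattern/ ? "match" : "no match";
    print "$pattern: $result{$pattern}\n";
}
```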


6.0 Perl Extended regular expression patterns

6.1 Non-capturing parenthesis (?:pattern)

[Picture 43.  pic/lect-regexp-paren-non-grouping.jpg]
Picture 43. Selecting what is captured with parentheses

The (?:) construct is similar to (), but it lets you control exactly which variables ($1 etc.) the matches go to. The ?: still groups subexpressions, but doesn't create a back reference as () does.

      #!/usr/bin/perl
      use English;
      $ARG = "123 456";

      print "[$1] [$2]"  if  /(123)\s+(456)/;         # [123] [456]
      print "[$1] [$2]"  if  /(?:123)\s+(456)/;       # [456] []
      __END__
    

With the non-capturing parenthesis, you can control exactly, which submatch will contain the answer.

      (?:  () () (   (?:)    )  )
      1    2  3  4   5             Normal count of submatches
           1  2  3                 Non-capturing count of submatches
    

A simple example will demonstrate this:

      #!/usr/bin/perl
      use English;
      $ARG = "123";
      print "[$3]"  if /(((12)))/;        # [12]
      print "[$3]"  if /(?:(?:(12)))/;    # [], there is no  $3, only $1
      __END__
    

6.2 Zero-width positive lookahead (?=pattern)

[Picture 44.  pic/lect-regexp-paren-positive-lookahead.jpg]
Picture 44. Positive lookahead

A zero-width positive look-ahead assertion peeks at the content which follows the real regular expression. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the Perl variable $MATCH. Here are more examples:

      /Bill(?=\s+The\s+Cat|\s+Clinton)/       Match "Bill"
      /(?=IBM|HAL)(\w+)/g;                    IBM900  HAL20001
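
The first example can be tried on a sentence with three different Bills. A sketch with a made-up sample sentence; $& is the short name of the $MATCH variable.

```perl
use strict;
use warnings;

my $text = "Bill The Cat met Bill Clinton and Bill Gates";

# Collect "Bill" only where one of the two continuations follows
my @hits = $text =~ /Bill(?=\s+The\s+Cat|\s+Clinton)/g;
print scalar(@hits), " of 3 accepted\n";

# The lookahead consumes nothing, so the match itself is just "Bill"
my $matched = $text =~ /Bill(?=\s+Clinton)/ ? $& : "";
print "matched [$matched]\n";
```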
    

Exercise: What lines are matched below?

      Joe has car
      Mike has motorcycle
      Bill has Truck

      /\b(\w+)\b.*(?=car|truck)(\w+)/

Exercise: Use positive lookahead and pick all numbers in range 30-55 from following line

      30 12 133 333 45 1 A-30     => 30 45
      11 4  42 564 1012 50        => 42 50

Exercise: Use positive or negative lookahead and pick all positive decimal numbers from the following lines. Allow decimal numbers that do not contain a leading zero (like .123)

      -1.22 1.1 22.22.22 10 2.1 +3.12  => 1.1 2.1 3.12
      .11 +.4 45 ABC.123               => .11 .4

6.3 Zero-width negative lookahead (?!pattern)

[Picture 45.  pic/lect-regexp-paren-negative-lookahead.jpg]
Picture 45. Negative lookahead

A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind. A few examples:

      /\d+(?!\.)/                 Match number, not decimal
      /(?!000)\d+/                Exclude 000
      /(?!000|255)\d{3}/          Exclude 000 and 255
    

Be aware that the match attempt moves forward when the lookahead rejects the first position:

      /(?!cat)\w+/        "cattle"
                            =====

You need to say that it should stay put at the beginning of the word, like this:

      /\b(?!cat)\w+/
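
The difference between the two patterns shows up clearly when all matches are collected. A sketch with made-up sample words:

```perl
use strict;
use warnings;

my $text = "cattle dog catalog bird";

my @loose    = $text =~ /(?!cat)\w+/g;     # may start in the middle of a word
my @anchored = $text =~ /\b(?!cat)\w+/g;   # must start at a word boundary

print "loose:    @loose\n";
print "anchored: @anchored\n";
```

Without \b the words starting with "cat" are not skipped; they merely lose their first letter.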
    

Exercise: What lines are matched below?

      /\b(?!mike|joe)\w+\s+\w+\s+\b(?!tall)\w+/i;

      Bill is tall
      Henry is tall
      Jordan is huge, literally
      Kenny is weighty
      joe is big
      mike is ok

6.4 Advanced: Zero-width positive lookbehind (?<=pattern)

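Note: this feature is experimental in Perl. For example, trying to use the backward whitespace check "(?<=\s|^)" fails in Perl 5.8, because the two alternatives have different widths and only fixed-width look-behind is supported.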

[Picture 46.  pic/lect-regexp-paren-positive-lookbehind.jpg]
Picture 46. Positive lookbehind

A zero-width positive look-behind assertion. For example, /(?<=\t)\w+/ matches a word that follows a tab, without including the tab in $MATCH. Works only for fixed-width look-behind.
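
A minimal sketch of the tab example, using a made-up tab-separated line:

```perl
use strict;
use warnings;

my $line = "name\tvalue\tunits";

# Words that directly follow a tab; the tab itself is not part of the match
my @after_tab = $line =~ /(?<=\t)\w+/g;

print "@after_tab\n";
```

The first column is skipped because no tab precedes it.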

6.5 Advanced: Zero-width negative lookbehind (?<!pattern)

Note: this feature is experimental in Perl

[Picture 47.  pic/lect-regexp-paren-negative-lookbehind.jpg]
Picture 47. Negative lookbehind

A zero-width negative look-behind assertion. For example /(?<!bar)foo/ matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
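
The sketch below records where each accepted "foo" starts. The sample string is made up, and pos() is used as in section 5.5.

```perl
use strict;
use warnings;

my $text = "foo barfoo bazfoo";

# Start offsets of every "foo" that does not directly follow "bar"
my @offsets;
while ($text =~ /(?<!bar)foo/g) {
    push @offsets, pos($text) - length("foo");
}

print "@offsets\n";
```

The "foo" at offset 7 is rejected because "bar" stands right before it; the ones at offsets 0 and 14 are accepted.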

6.6 Advanced: Zero-width Perl eval assertion (?{ code })

This extended regular expression feature is considered highly experimental.

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated. Currently, the rules to determine where the code ends are somewhat convoluted.

6.7 Advanced: Postponed expression (??{ code })

This extended regular expression feature is considered highly experimental.

This is a "postponed" regular subexpression. The code is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered as a regular expression and matched as if it were inserted instead of this construct. The code is not interpolated. As before, the rules to determine where the code ends are currently somewhat convoluted.

6.8 Advanced: Independent subexpression (?>pattern)

This extended regular expression feature is considered highly experimental.

An "independent" subexpression, one which matches the substring that a stand-alone pattern would match if anchored at the given position, and it matches *nothing other than this substring*. This construct is useful for optimizations of what would otherwise be "eternal" matches, because it will not backtrack (see the section on "Backtracking"). It may also be useful in places where the "grab all you can, and do not give anything back" semantic is desirable.

For example: ^(?>a*)ab will never match, since (?>a*) (anchored at the beginning of string, as above) will match all characters a at the beginning of string, leaving no a for ab to match. In contrast, a*ab will match the same as a+b, since the match of the subgroup a* is influenced by the following group ab (see the section on "Backtracking"). In particular, a* inside a*ab will match fewer characters than a stand-alone a*, since this makes the tail match.

An effect similar to (?>pattern) may be achieved by writing (?=(pattern))\1. The lookahead captures the same substring that a stand-alone pattern would match, and the following \1 eats the matched string; it therefore makes a zero-length assertion into an analogue of (?>...). (The difference between these two constructs is that the second one uses a capturing group, thus shifting ordinals of backreferences in the rest of a regular expression.)
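The three forms discussed above can be compared directly. The following is a minimal sketch; the helper name try_patterns and the test string "aaab" are chosen only for illustration.

```perl
use strict;
use warnings;

# Try the ordinary, the independent and the emulated form on one string.
sub try_patterns {
    my ($text) = @_;
    # Ordinary group: a* gives back one "a" so that "ab" can match.
    my $plain    = $text =~ /^a*ab/         ? "match" : "no match";
    # Independent subexpression: a* keeps every "a" it grabbed.
    my $atomic   = $text =~ /^(?>a*)ab/     ? "match" : "no match";
    # Lookahead plus backreference behaves like the independent form.
    my $emulated = $text =~ /^(?=(a*))\1ab/ ? "match" : "no match";
    return "plain: $plain, atomic: $atomic, emulated: $emulated";
}

print try_patterns("aaab"), "\n";
# plain: match, atomic: no match, emulated: no match
```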

6.9 Advanced: Conditional pattern (?(condition)yes-pattern|no-pattern)

Conditional expression. (condition) should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), or look-ahead/look-behind/evaluate zero-width assertion.
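A typical use of the integer condition is to demand a closing character only if the opening one was seen. The sketch below is an assumed example (the helper name is_balanced_number is made up); it omits the optional no-pattern.

```perl
use strict;
use warnings;

# Accept a number that is either bare or fully parenthesized.
# Group 1 is the optional "("; (?(1)\)) demands ")" only if group 1 matched.
sub is_balanced_number {
    my ($text) = @_;
    return $text =~ /^(\()?\d+(?(1)\))$/ ? 1 : 0;
}

for my $s ("42", "(42)", "(42", "42)") {
    printf "%-5s %s\n", $s, is_balanced_number($s) ? "ok" : "rejected";
}
```

The first two strings are accepted and the last two are rejected.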

6.10 Advanced: Comment (?#text)

A comment. The text is ignored. If the /x modifier enables whitespace formatting, a simple # will suffice. Note that Perl closes the comment as soon as it sees a ), so there is no way to put a literal ) in the comment.

[Picture 48.  pic/lect-regexp-paren-comment.jpg]
Picture 48. Comment inside regular expression
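Both comment styles can be put side by side. This is only a sketch; the date-like pattern and the helper year_month are made up for illustration.

```perl
use strict;
use warnings;

# The same pattern written twice: once with inline (?#...) comments,
# once with /x and ordinary # comments.
my $inline = qr/(\d{4})(?# year)-(\d{2})(?# month)/;
my $free   = qr/
    (\d{4})   # year
    -
    (\d{2})   # month
/x;

sub year_month {
    my ($re, $text) = @_;
    return $text =~ $re ? "$1/$2" : "";
}

print year_month($inline, "2003-04"), " ", year_month($free, "2003-04"), "\n";
# 2003/04 2003/04
```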


7.0 Perl substitute command

The very close counterpart to matching m// is the substitution function, which replaces the matched text. The command syntax is almost identical to the match operator.

[Picture 49.  pic/lect-regexp-subst-operator.jpg]
Picture 49. Substitute operator

You can change the delimiters like in match operator:

[Picture 50.  pic/lect-regexp-subst-operator-choices.jpg]
Picture 50. Substitute operator

If the regular expression is long, it can be divided over separate lines, just like in m//, by using the /x option. However, the substitution part must be written exactly as intended, because all the text in it constitutes the replacement, including any spaces.

[Picture 51.  pic/lect-regexp-subst-operator-x.jpg]
Picture 51. Substitute operator with free-form regexp placement
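Since the picture is not reproduced here, the following minimal sketch shows the idea under assumed names (iso_to_european is a made-up helper): the pattern part is spread over several lines with /x, while the replacement part is kept compact because every character in it counts.

```perl
use strict;
use warnings;

# Reorder a date from year-month-day to day.month.year.
sub iso_to_european {
    my ($date) = @_;
    (my $out = $date) =~ s/
        (\d{4}) - (\d{2}) - (\d{2})    # year, month, day
    /$3.$2.$1/x;
    return $out;
}

print iso_to_european("2003-04-16"), "\n";   # 16.04.2003
```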

The captured text, like $1, is available in the substitution part. The special literal commands listed below are also available; refer to the [perlop] manpage. These commands are not limited to the substitution operator: they can be used directly in strings as well.

      \l          lowercase next char
      \u          uppercase next char
      \L          lowercase till \E
      \U          uppercase till \E
      \E          end case modification (think vi)
      \Q          quote (disable) pattern metacharacters till \E
                  NOTE: You still have to escape Perl's special
                  variable characters \$ and \@
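Because these commands also work in ordinary double-quoted strings, they are easy to experiment with. The sketch below uses a made-up helper, case_demo, and a sample word.

```perl
use strict;
use warnings;

# Apply several of the escapes inside double-quoted strings.
sub case_demo {
    my ($name) = @_;
    return join "|",
        "\u$name",            # uppercase next char
        "\U$name\E rocks",    # uppercase till \E
        "\lPERL",             # lowercase next char
        "\Q$name.*\E";        # quote metacharacters till \E
}

print case_demo("perl"), "\n";   # Perl|PERL rocks|pERL|perl\.\*
```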
    

Here are some examples of the s/// command, be sure to count three delimiters in the command.

      s/\bred\b/blue/g;           # Change all colors
      s/^(.*)/\U$1/;              # Whole line to uppercase

      s/0x(\d\d)(\d\d)/0x$2$1/g;  # Swap 0x1122 HI-LO => 0x2211 LO-HI

      s/n\.a\.t\.o/N.A.T.O/       # All in big letters
      s/(\Qn.a.t.o\E)/\U$1/       # Same, but using \Q
    

Exercise: Write a program to correct person names that have been mistakenly written with the wrong capitalization on a whole line. The initial setup is:

      $ARG = "mike joe mary helen BILL";
      s/REGEXP/substitution/modifiers;
      print;

      =>

      Mike Joe Mary Helen Bill

Exercise: See [Friedl] page 46 and consider what the following substitution does.

      s/\bJeff/\bJeff/i;

Exercise: Write the programs dos2unix.pl and unix2dos.pl, which convert end-of-line markers according to the operating system. In Perl, reading a file is quite simple with the notation <>, and you only have to change the lines that are stored in $ARG. A program skeleton is below. (Tip: You can also use Emacs to convert files from one format to another. See M-x set-buffer-file-coding-system and the values undecided-dos and undecided-unix. Emacs also has a hex editor at M-x hexl-mode)

      #!/usr/bin/perl
      use English;
      sub Main ()
      {
          while ( <> )
          {
              # Change the end-of-line marker in $ARG here
              print;
          }
      }
      Main();
      __END__


8.0 Regular expression discussion

8.1 Matching numeric ranges

Regular expressions do not understand numeric ranges; they match one character at a time. The range 0-9 is seen as a single-character range, whereas 10-99 has to be expressed as two digits. Suppose we want to define a day-of-month range 1-31. We could conceptualize the range of numbers as a series of rectangles:

[Picture 52.  pic/book-regexp-date.jpg]
Picture 52. Regexp to handle dates
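One possible way to write the 1-31 range is sketched below. It is an assumption of this sketch, not a transcription of the picture: the split into three areas, the optional leading zero and the helper name is_day are all illustrative choices.

```perl
use strict;
use warnings;

# Day of month 1-31, split into three "areas":
#   0?[1-9]    1-9, optionally written 01-09
#   [12][0-9]  10-29
#   3[01]      30-31
my $day = qr/^(?:0?[1-9]|[12][0-9]|3[01])$/;

sub is_day {
    my ($text) = @_;
    return $text =~ $day ? 1 : 0;
}

my @valid = grep { is_day($_) } 0 .. 40;
print "@valid\n";    # the numbers 1 through 31
```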

Exercise: A similar table can be constructed to decide regular expressions for a 24-hour clock. Here is the initial table for the hour part. Define your own regular expression "areas"

       0  1  2  3  4  5  6  7  8  9
      00 01 02 03 04 05 06 07 08 09
      10 11 12 13 14 15 16 17 18 19
      20 21 22 23 24

8.2 Pay attention to the use of .*

Be very careful when using the "match everything" regular expression .*. If you don't save the match anywhere with parentheses, that part of the regular expression is useless:

[Picture 53.  pic/lect-regexp-note-star.jpg]
Picture 53. How the greedy star works.

Exercise: What if the regular expression had read /(.*)(\d*)/?

Consider the following regular expressions for a moment:

      /this.*/        The ".*" is redundant without grouping ()
      /this(.*)/      You can recall the text from $1

      /.*(.*)/
      /.*/
      //
    

The third thing to remember is that you should use an anchor to prevent a large number of pointless match attempts. Consider what would happen if the string does not contain a digit:

[Picture 54.  pic/lect-regexp-note-star-diagram.jpg]
Picture 54. Greedy star keeps looking over and over.

But with an anchor, you let the regular expression try only the first starting position, and the "transmission" does not move the regular expression forward character by character. You should use:

      /^(.*)(\d+)/
       |
       Do not let the transmission try multiple times
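To see how the greedy (.*) and the following (\d+) divide the text between them, the captures can be printed. This is a small sketch; the helper name split_greedy and the sample strings are made up.

```perl
use strict;
use warnings;

# Show what ends up in $1 and $2 for the anchored pattern.
sub split_greedy {
    my ($text) = @_;
    return $text =~ /^(.*)(\d+)/ ? "[$1][$2]" : "no match";
}

print split_greedy("Order 1234"), "\n";   # [Order 123][4]
print split_greedy("no digits"),  "\n";   # no match
```

The greedy star first takes the whole line and then gives back just one character, so (\d+) receives only the last digit.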
    

8.3 Variable names

Many programming languages have identifiers (variable and function names) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. Here is a solution for it. You could limit the length by replacing the * with a bounded quantifier such as {0,31}.

      /[a-zA-Z_][a-zA-Z0-9_]*/
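When the pattern is used to validate a whole string rather than to find an identifier inside a line, it needs anchors. The sketch below assumes a length limit of 32 characters in total; the limit and the helper name is_identifier are illustrative only.

```perl
use strict;
use warnings;

# Anchored so that the whole string must be an identifier;
# {0,31} caps the total length at 32 characters (an arbitrary limit).
sub is_identifier {
    my ($text) = @_;
    return $text =~ /^[a-zA-Z_][a-zA-Z0-9_]{0,31}$/ ? 1 : 0;
}

for my $s ("_tmp1", "count", "9lives", "a-b") {
    printf "%-7s %s\n", $s, is_identifier($s) ? "ok" : "rejected";
}
```

The first two names are accepted; "9lives" starts with a digit and "a-b" contains a character outside the class.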
    

8.4 A String within double quotes

The quotes at either end are to match the open and close quote of the string. Between them there can be anything .. except another quote character.

      #!/usr/bin/perl
      use English;
      $ARG = qq(Find this "word" in the line\n);

      print $MATCH if /\"([^\"]*)\"/;  # NOTE: count the double quotes "
      #                | ======  |
      #                | |       End
      #                | Anything in between
      #                Start
      __END__
    

A very good discussion of backtracking is presented in the [regexp] book, where matching double quotes is studied in detail.

Successful match of ".*"

[Picture 55.  pic/book-regexp-5-3.jpg]
Picture 55. Regexp paths

Failing attempt of ".*"!

[Picture 56.  pic/book-regexp-5-4.jpg]
Picture 56. Regexp paths

Failing attempt of "[^"]*"!

[Picture 57.  pic/book-regexp-5-5.jpg]
Picture 57. Regexp paths

8.5 Dollar amount with optional cents

From a top-level perspective this is a simple regular expression with three parts: "a literal dollar sign, a bunch of one thing, and perhaps finally another thing". This expression is a bit naive, because the optional part does not need to match, so the expression succeeds even when the cents are malformed. Once you add beginning-of-line and end-of-line anchors, the optional part becomes important. One type of value that the expression does not take into account is $.49.

      /\$[0-9]+(\.[0-9][0-9])?/
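The effect of the anchors can be seen by running both forms over a few values. This is a minimal sketch; the helper names loose_match and strict_match and the sample amounts are made up.

```perl
use strict;
use warnings;

# Return the matched text, or an empty string if there is no match.
sub loose_match {
    my ($text) = @_;
    return $text =~ /\$[0-9]+(\.[0-9][0-9])?/ ? $& : "";
}

sub strict_match {
    my ($text) = @_;
    return $text =~ /^\$[0-9]+(\.[0-9][0-9])?$/ ? $& : "";
}

for my $s ('$12', '$12.50', '$12.5', '$.49') {
    printf "%-7s loose=[%s] strict=[%s]\n", $s, loose_match($s), strict_match($s);
}
```

For $12.5 the unanchored form silently matches just $12, while the anchored form rejects the value; $.49 is rejected by both.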
    

8.6 Matching range of numbers

Matching time can be taken to varying levels of strictness. This matches 9:17 am and 12:30 pm. Note: You have to write this better for production use, because it also allows 9:99 am.

      /[0-9]?[0-9]:[0-9][0-9] +(am|pm)/
    

Looking at the hour part, it must be a two-digit number if the first digit is one. But 1?[0-9]? still allows the numbers 19 and 0, so it is better to break it into two cases, one-digit hours and two-digit hours:

      /([1-9]|1[012])/
    

The minute part is easier: the range is 00-59, that is 00-09 and 10-59. In terms of individual characters, the first digit is 0-5 and the second is 0-9.

      /[0-5][0-9]/
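Putting the hour and minute parts together gives a stricter 12-hour pattern. The sketch below is one possible combination, not the only one; the anchors, the ordering of the hour alternatives and the helper name is_12h_time are choices of this sketch.

```perl
use strict;
use warnings;

# 12-hour clock: hours 1-12 without leading zero, minutes 00-59.
sub is_12h_time {
    my ($text) = @_;
    return $text =~ /^(1[012]|[1-9]):[0-5][0-9] +(am|pm)$/ ? 1 : 0;
}

for my $s ("9:17 am", "12:30 pm", "9:99 am", "19:05 pm", "0:15 am") {
    printf "%-9s %s\n", $s, is_12h_time($s) ? "ok" : "rejected";
}
```

Only the first two times are accepted.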
    

Using the same logic, can you extend this to handle a 24-hour clock and allow a leading zero in the hour part, like 09:41?

8.7 Matching temperature values

Temperatures can have both negative and positive values, where the positive value is usually implicit and not written out with "+".

      /[-+]?[0-9]+/
    

To allow an optional decimal part, you must add a decimal point followed by the numeric part:

      /(\.[0-9]*)?/   OK
      /\.?[0-9]*/     WRONG. Can you figure what's wrong with this?
    

Putting this all together we get:

      /[-+]?[0-9]+(\.[0-9]*)?/
    

This will allow numbers like 32, -2, -3.723 and +54.4. It is not perfect, because a number like .234 is not found, and solving this leads to a rather complex regular expression. To show in detail what is matched, you could add parentheses (the single quotes keep a Unix shell from expanding $1, $2 and $3):

      % perl -ne 'print qq($1-$2-$3\n) if /([-+]?)([0-9]+)(\.[0-9]*)?/' file.dat
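The same check can be written as a small script, which also makes the weakness with .234 visible. This is a sketch only; the helper name parse_temperature and the sample values are made up.

```perl
use strict;
use warnings;

# Return "sign|integer part|fraction" for the first number found.
sub parse_temperature {
    my ($text) = @_;
    if ($text =~ /([-+]?)([0-9]+)(\.[0-9]*)?/) {
        my ($sign, $int, $frac) = ($1, $2, defined $3 ? $3 : "");
        return "$sign|$int|$frac";
    }
    return "no match";
}

print parse_temperature("-3.723"), "\n";   # -|3|.723
print parse_temperature("+54.4"),  "\n";   # +|54|.4
print parse_temperature("32"),     "\n";   # |32|
print parse_temperature(".234"),   "\n";   # |234|
```

For .234 the pattern skips the decimal point and reports 234 as an integer, which is why the value counts as not found.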
    

8.8 Matching whitespace

To match tabs and spaces you could write:

      /[     ]*/              There is TAB pressed inside the regexp
    

Perl can also use the standard escape character notation \t to represent the tab and the above can be more cleanly written as:

      /[ \t]*/
    

An even more elegant way is to use the special notation \s. Note: this also includes the characters \r and \n, which is desirable in most cases.

      /\s*/
    

8.9 Matching text between HTML tags

Matching HTML is not an easy task and sometimes requires a lot of expressive and carefully crafted regular expressions. Here is a simplified regular expression that matches text between <B> and </B>. You can use the same idea to match other tags. It is simplified, because it does not take into account whitespace inside tags, like < B >.

      /<B>[^<]*<\/B>/
          |      |
          |      The "/" must be quoted, backslashed "\/" like in </B>
          |      This is because regular expression is delimited by / .. /
          match any character as long as it is NOT "<"
    

Notice that this is almost like non-greedy matching, which you could write as below, but that would be slower.

      /<B>(.*?)<\/B>/
    

The difference between negated and non-greedy construct is best demonstrated in this example. See Friedl 1st ed. page [226].

      $ARG = "very <very> angry. <Angry!> that is.\n";
                                  ====== Negation
                    ==================== Non-Greedy

      print "Negation: $1\n"    if /<([^>]*!)>/;
      print "Non-Greedy: $1\n"  if /<(.*?!)>/;

      -->
      Negation: Angry!
      Non-Greedy: very> angry. <Angry!
    

8.10 Matching something inside parenthesis

This example is almost the same as the HTML tag matching example. We examine the mail header field "From", which can have the following syntaxes according to the Internet mail standards RFC 822 and 1036 at <http://www.cis.ohio-state.edu/hypertext/information/rfc.html>.

      From: mark@cbosgd.ATT.COM
      From: mark@cbosgd.ATT.COM (Mark Horton)
      From: Mark Horton <mark@cbosgd.ATT.COM>
    

We concentrate on the second alternative and try to pick up the person's name inside the parentheses. The regexp looks complicated, and you have to read it several times in order to grasp which parentheses mean what:

                             find anything else but NOT "()"
                group start  |    group end
                          |  |    |
      /^From:\s+(\S+)\s+\(([^()]*)\)/
                         |         |
                   literal   literal
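The pattern can be wrapped in a small helper to check what each group captures. This is a minimal sketch; the helper name from_parts is made up, and the header lines are the ones quoted above.

```perl
use strict;
use warnings;

# Return (address, name) for the "address (Name)" form, else an empty list.
sub from_parts {
    my ($line) = @_;
    return $line =~ /^From:\s+(\S+)\s+\(([^()]*)\)/ ? ($1, $2) : ();
}

my ($addr, $name) = from_parts('From: mark@cbosgd.ATT.COM (Mark Horton)');
print "$name <$addr>\n";   # Mark Horton <mark@cbosgd.ATT.COM>
```

The first alternative, which has no parenthesized name, yields an empty list.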
    

8.11 Reducing number of decimals to three (substituting)

See [Friedl] pages 47, 110-11 for a full explanation. Suppose you want to keep the first two decimals of a number, and the third one as well if it is non-zero. That is, the value 12.375000 should be stripped to 12.375, and for the value 37.500 the result 37.50 would be fine.

      s/(\.\d\d[1-9]?)\d*/$1/g;
          |
          . 3 7 5
    

Now to the discussion: what if the value is already well-formed, like 27.625? The substitution still replaces something because the match succeeds, but the effect is a no-op, because \d* does not match any additional digits. The number stays 27.625.

Hm, wouldn't it be a bit more efficient to bypass this ineffective replacement and run the command only if there are more than three decimal digits? Okay, we can change the last atom to \d+ to require an additional digit. It still seems to work fine with long numbers. But wait, what happens to the number 27.625, which has exactly three decimals? SURPRISE: the question mark does not require any digit, but the plus does. [1-9]? gives up the 5 so that \d+ can take it, and the total effect is that the number is in fact converted to 27.62!

      s/(\.\d\d[1-9]?)\d+/$1/g;
          |
          . 6 2  -     5
    

The lesson here is that an overall match always takes precedence: atoms that do not necessarily require any characters give them up if that lets the rest of the expression match.

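Is there a solution for the above? Can we make it match only if there are more than three decimal digits? One attempt is a lookahead that just peeks at the digits ahead. Note that it is only a partial fix: 27.625 is now left alone, but so is 37.500, and 27.6251 still becomes 27.62, because [1-9]? can give up the third digit to satisfy the lookahead.

      s/(\.\d\d[1-9]?)(?=\d\d)\d+/$1/g;
                      |
                      make sure that at least 2 digits follow
                      (requiring a total of at least 4)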
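The substitutions in this section can be compared side by side. The sketch below is only an illustration: the helper trim_with and the table of variants are made up, and the last variant, which uses the independent subexpression (?>...) from section 6.8 so that [1-9]? cannot give back its digit, is an assumption of this sketch rather than something taken from [Friedl].

```perl
use strict;
use warnings;

# Apply one of the substitutions to a copy of the number.
sub trim_with {
    my ($re, $number) = @_;
    (my $copy = $number) =~ s/$re/$1/g;
    return $copy;
}

my %variant = (
    star      => qr/(\.\d\d[1-9]?)\d*/,
    plus      => qr/(\.\d\d[1-9]?)\d+/,
    lookahead => qr/(\.\d\d[1-9]?)(?=\d\d)\d+/,
    atomic    => qr/(\.\d\d(?>[1-9]?))\d+/,    # not from the text
);

for my $n ("12.375000", "37.500", "27.625", "27.6251") {
    printf "%-10s", $n;
    printf(" %s=%s", $_, trim_with($variant{$_}, $n)) for qw(star plus lookahead atomic);
    print "\n";
}
```

With these inputs the plus variant turns 27.625 into 27.62, the lookahead variant leaves 37.500 alone and turns 27.6251 into 27.62, and the atomic variant gives 12.375, 37.50, 27.625 and 27.625.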
    


9.0 Different regular expression engines

9.1 Regexp engine types

There are two fundamentally different types of engines: one called DFA, Deterministic finite automaton, and one called NFA, non-deterministic finite automaton. Both engine types have been around for a long time, but the NFA type seems to be used more often.

      NFA         Tcl, Perl, Python, Emacs, ed(1), sed(1), vi(1),
                  mawk(1)
      DFA         grep(1), egrep(1), awk(1), lex(1), flex(1)
                  expr(1)

    

[Picture 58.  pic/book-regexp-6-1-engines.jpg]
Picture 58. Survey of a few common programs and their regexp engines

9.2 NFA engine relies on regexp (Perl, Emacs)

Let's consider one way an engine might match to(nite|knight|night) against the text tonight. Starting with the t, the regular expression is examined one component at a time. If the component matches, the next component is checked until all components have been checked. Faced with three possibilities, the engine just tries each one in turn. The writer of the regular expression has considerable opportunity to craft just what he wants to happen.

      after tonight           to(nite|knight|night)
            |                 |

      after tonight           to(nite|knight|night)
            ||                ||

      after tonight           to(nite|knight|night)
            |||               || |

      after tonight           to(nite|knight|night)  NOT OK, back
            ||||              || |<>

      after tonight           to(nite|knight|night)  NOT OK, back
            |||               ||      <>

      after tonight           to(nite|knight|night)
            |||               ||             |

      after tonight           to(nite|knight|night)   ..And so on
            ||||              ||             ||
    

As noted above, the writer can exercise a fair amount of control simply by changing how the regular expression is written. Perhaps less work would have been wasted had the regexp been written differently, such as:

      to(ni(ght|te)|knight)
      to(k?night|nite)
    

9.3 DFA engine reads text

The DFA tracks all matches and all possible choices during the evaluation of each component, until it can conclude that there is a match to return.

      after tonight           to(nite|knight|night)
            |                 |

      after tonight           to(nite|knight|night)
            ||                ||

      after tonight           to(nite|knight|night)
            |||               || |           |

      after tonight           to(nite|knight|night)
            ||||              || ||          ||

      after tonight           to(nite|knight|night)
            |||||             ||             |||
    

Because a DFA keeps track of the text position and of all matches in progress, it is generally faster: each character of the text is checked only once. The NFA engine wastes time attempting to match different subexpressions against the same text. In a DFA engine, the way the regular expression is written does not matter at all, because everything is evaluated at the same time.

9.4 Crafting regular expressions

The following regular expression yields completely different results depending on whether it runs on a DFA or an NFA engine, because the NFA takes "what comes first", whereas the DFA must try "all matches" and pick the longest.

      /\d\d|\d\d\d/       NFA: "12"  DFA: "123"
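Perl is an NFA engine, so the "what comes first" half of this can be checked directly. The sketch below uses a made-up helper, first_alternative_match, and the sample text "123"; the DFA result cannot be reproduced in Perl.

```perl
use strict;
use warnings;

# Return the text matched by the pattern, or an empty string.
sub first_alternative_match {
    my ($re, $text) = @_;
    return $text =~ $re ? $& : "";
}

print first_alternative_match(qr/\d\d|\d\d\d/, "123"), "\n";   # 12
print first_alternative_match(qr/\d\d\d|\d\d/, "123"), "\n";   # 123
```

Swapping the order of the alternatives is enough to change the result.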
    

Another example, see [Friedl] page 112.

      Three tournaments won?      /tour|to|tournament/
                                   |       |
                                   NFA     DFA choice
    

The NFA engine happily accepts the first match "tour" and never bothers to look into the other possible matches. What if the regexp were written differently:

      /to(ur(nament)?)?/
    

Before you try to answer the question above, you should know that both regular expressions are logically the same: there are three possibilities.

Exercise: Is the question mark greedy like a DFA, or "first come, first served" like NFA alternation?

9.4.1 DFA - the longest leftmost match

Consider the following example and remember the rule, that the longest match always wins in DFA:

      oneselfsufficient       one(self)?(selfsufficient)?
    

The DFA will try all possibilities and conclude that it can match the whole string. POSIX requires that if there are multiple possible matches that start at the same position, the one matching the most text must be returned. An NFA would only find "oneself". The longest-match requirement is demonstrated well by an example with parentheses, see [Friedl] page 117. The regular expression itself could match in numerous ways, including with nothing matched by the middle group, and still satisfy the whole condition.

      topological         /(to|top)(o|polo)?(gical|o?logical)/
    

But a POSIX DFA engine has no choice but to always match the longest, whereas the NFA would stick to taking the first alternative that does the job:

      top o logical   DFA
      to polo gical   NFA
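The NFA half of this split can be verified in Perl by printing the three groups. This is a minimal sketch with a made-up helper name, split_topological; the POSIX result cannot be reproduced with Perl's engine.

```perl
use strict;
use warnings;

# Show how Perl's NFA engine distributes the text over the three groups.
sub split_topological {
    my ($text) = @_;
    if ($text =~ /(to|top)(o|polo)?(gical|o?logical)/) {
        return join " ", map { defined $_ ? $_ : "" } ($1, $2, $3);
    }
    return "no match";
}

print split_topological("topological"), "\n";   # to polo gical
```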
    

If efficiency is an issue, remember that a POSIX NFA engine really does have to try all possible permutations of the regexp every time. Poorly written regexps can suffer extremely severe performance penalties.

A text-directed DFA is a way around the efficiency penalty that NFA backtracking incurs. The DFA engine spends extra time and memory before a match attempt to analyze the regular expression more thoroughly than an NFA. Once it starts looking at the string, it has an internal map describing all the possible character positions. Building the map, also called compiling the regular expression, can sometimes take a fair amount of time and memory, but once done for a regexp, the results can be applied to an unlimited amount of text. In contrast, NFA compilation is generally faster and takes less memory.

9.5 Differences in capabilities

An NFA engine can support many things that a DFA cannot. Among them are:
  • Capturing text with parenthesized subexpressions. Information about where in the text each set of parentheses matched is available.
  • Lookahead and lookbehind. "This expression must match before others can continue, but do not consume any of the text."
  • Non-greedy quantifiers and ordered alternation. A DFA could also implement the shortest overall match. They just haven't done it yet.

Some comparison of the size of the typical engines:

  • 1979 NFA ed(1) 350 lines of C code.
  • 1986 NFA Version 8 general regexp library 1900 lines of C code.
  • 1992 DFA sed(1) 9700 lines of C code.
  • 1992 DFA grep(1) 4500 lines of C code.
  • 199? DFA and NFA GNU egrep(1) 2.0 8300 lines of C code.

GNU egrep(1) takes a simple but effective approach. It uses a DFA when possible (due to speed), reverting to an NFA when backreferences are used. GNU awk(1) does something similar: it uses a DFA engine for simple "does it match" checks and reverts to a different engine for checks where the actual extent of the match must be known. As of 1997, there has been development work on adding capturing support to the DFA engine, and the author, Henry Spencer, has reported that the penalty is quadratic in text size, whereas it is exponential in an NFA.


Html date: 2003-04-16 14:09