15. Regular expressions
Pattern matching:
Task: from a number of strings, filter only those resembling a given example
Example string — pattern — has some special characters, describing string structure
matching is testing if the whole string can be described by pattern
searching is finding a substring that matches pattern
Shell patterns
see glob
used for filename generation before executing shell command
performed by shell, not by command called (i. e. not by ls here)
Too weak.
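A minimal sketch of filename generation, assuming a POSIX-like shell. The scratch directory and the file names are made up for the illustration; the point is that the command only ever sees the already expanded names.

```shell
# Work in a scratch directory so the listing is predictable (hypothetical files)
dir=$(mktemp -d)
cd "$dir"
touch a.txt b.txt notes.md

# The shell replaces *.txt with the matching names before running echo
echo *.txt        # prints: a.txt b.txt

# Quoting disables the expansion: echo receives the pattern itself
echo '*.txt'      # prints: *.txt

# ? stands for exactly one character
echo ?.txt        # prints: a.txt b.txt
```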
Regular expressions
Mastering Regular Expressions by Jeffrey Friedl (aka The Owl Book).
The narrowest class of formal languages in the Chomsky hierarchy.
- Can be (relatively) easily parsed
Can describe almost any pattern that is not bound to context and has no dependence between its internal parts (e. g. «a if preceded by b» or «an integer number, then that number of characters» cannot be described)
- Laconic
Warning
Writing a regexp is far easier than reading someone else's
- Atomic regexp:
- any non-special character matches exactly same character
"E" → «E»
a dot "." matches any one character
"." → «E»
"." → «:»
"." → «.»
a set of characters matches any character from the set:
"[quack!]" → «a»
"[quack!]" → «!»
"[a-z]" → «q» (any small letter)
"[a-z]" → «z» (any small letter)
"[a-fA-F0-9]" → «f» (any hexadecimal digit)
"[a-fA-F0-9]" → «D» (any hexadecimal digit)
"[abcdefABCDEF0-9]" → «4» (any hexadecimal digit)
a negative set of characters matches any character not from the set:
"[^quack!]" → «r»
"[^quack!]" → «#»
"[^quack!]" → «A»
any atomic regexp followed by "*" repeater matches a continuous sequence of substrings, including empty sequence, each matched by the regexp
"a*" → «aaa»
"a*" → «»
"a*" → «a»
"[0-9]*" → «7»
"[0-9]*" → «»
"[0-9]*" → «1231234»
".*" → any string!
any complex regexp enclosed in the special grouping parentheses "\(" and "\)" (see below)
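The atomic examples above can be tried roughly like this. It is only a sketch: the candidate strings are fed one per line, and the -x option (assumed here so that grep does matching rather than searching) makes grep accept a line only if the whole line is described by the pattern.

```shell
# A set and its negation
printf '%s\n' a '!' r '#' | grep -x '[quack!]'     # prints the lines: a  !
printf '%s\n' a '!' r '#' | grep -x '[^quack!]'    # prints the lines: r  #

# Any hexadecimal digit
printf '%s\n' f D 4 g | grep -x '[a-fA-F0-9]'      # prints the lines: f  D  4

# The "*" repeater also accepts the empty line
printf '%s\n' aaa a '' b | grep -x 'a*'            # prints: aaa, a and the empty line

# A dot is any one character, but not zero characters
printf '%s\n' E : . '' | grep -x '.'               # prints the lines: E  :  .
```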
Complex regexp
- A sequence of atomic regexps
- Matches a continuous sequence of substrings, each matched by the corresponding atomic regexp
"boo" → «boo»
"r....e" → «riddle»
"r....e" → «r re e»
"[0-9][0-9]*" → any non-negative integer
"[A-Za-z_][A-Za-z0-9_]*" → C identifier (alphanumeric sequence with «_», not starting with a digit)
- grouping parentheses can be used for repeating a complex regexp:
"\([A-Z][a-z]\)*" → «ReGeXp»
"\([A-Z][a-z]\)*" → «»
"\([A-Z][a-z]\)*" → «Oi»
The leftmost-longest rule (aka «greedy») applies:
In a successful match of a complex regexp, the leftmost atomic regexp takes the longest possible match, the second leftmost takes the longest match still possible under that condition, and so on
".*.*" → the first ".*" takes the whole string, the second one takes the empty string
"[a-z]*[0-9]*[a-z0-9]*" → «123b0c0»
"[a-z]*" → «»
"[0-9]*" → «123»
"[a-z0-9]*" → «b0c0»
"[a-d]*[c-f]*[d-h]*" → «abcdefgh»
"[a-d]*" → «abcd»
"[c-f]*" → «ef»
"[d-h]*" → «gh»
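One way to watch the greedy rule at work is to wrap every atomic part in a group and print each group in square brackets. This is a sketch that borrows sed's group recall ("\1", "\2", ...), which these notes introduce later; the bracket decoration is just an illustration device.

```shell
# [a-z]* gets nothing, [0-9]* gets the digits, the rest goes to the last part
echo '123b0c0' | sed 's/\([a-z]*\)\([0-9]*\)\([a-z0-9]*\)/[\1][\2][\3]/'
# prints: [][123][b0c0]

# Each part takes as much as it can, leaving the remainder to the next one
echo 'abcdefgh' | sed 's/\([a-d]*\)\([c-f]*\)\([d-h]*\)/[\1][\2][\3]/'
# prints: [abcd][ef][gh]

# The first .* swallows everything, the second one is left with the empty string
echo 'anything' | sed 's/\(.*\)\(.*\)/[\1][\2]/'
# prints: [anything][]
```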
- Positioning mark
"^regexp" matches only substrings located at the beginning of the line
"regexp$" matches only substrings located at the end of the line
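A small sketch of both positioning marks with grep. The helper function sample and its four lines are hypothetical; only the position of "foo" differs between them.

```shell
# Hypothetical sample input: "foo" at the start, at the end, in the middle, alone
sample() {
    printf '%s\n' 'foo bar' 'bar foo' 'bar foo bar' 'foo'
}

sample | grep '^foo'       # prints the lines: foo bar / foo
sample | grep 'foo$'       # prints the lines: bar foo / foo
sample | grep -c '^foo$'   # both marks together count only the line "foo": prints 1
```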
Regexp tools
grep: filtering strings that contain regexp:
try all examples above via grep (happily it colors all substring matches)
vim (these commands enter command-line mode)
/regexp: search forward
?regexp: search backward
less — same
- …
Search and replace
sed: stream editor; if not sure, do not go too deep into it
search and replace: s/regexp/replacement/
- e. g.
- replace once: s/regexp/replacement/
replace all (globally): s/regexp/replacement/g
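The difference between the two forms, as a short sketch on made-up input strings:

```shell
# Without a flag only the first match in each line is replaced
echo 'one two one two' | sed 's/one/1/'        # prints: 1 two one two

# The g flag replaces every match in the line
echo 'one two one two' | sed 's/one/1/g'       # prints: 1 two 1 two

# The left side is a regexp, not just a fixed string
echo 'id=42; n=7' | sed 's/[0-9][0-9]*/N/g'    # prints: id=N; n=N
```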
Group recall: every substring matched by a regexp grouped with "\("/"\)" can be inserted into the replacement string by referencing the corresponding number ("\1", "\2" etc.):
    $ cal | sed 's/2\([0-6]\)/=\1/g'
         March =0=0
    Su Mo Tu We Th Fr Sa
     1  2  3  4  5  6  7
     8  9 10 11 12 13 14
    15 16 17 18 19 =0 =1
    =2 =3 =4 =5 =6 27 28
    29 30 31
    $ echo '15 16 17 18 19 20 21' | sed 's/\(15\)\(.*\)\(20\)/\3\2\1/'
    20 16 17 18 19 15 21
    $ echo '==15 16 17 18 19 20 21==' | sed 's/\([0-9][0-9]*\).*\([0-9]\)/\1\2/'
    ==151==
    $ echo '==15 16 17 18 19 20 21==' | sed 's/\([0-9]*\).*\([0-9]\)/\1\2/'
    1==
check the examples from «Complex regexp» above with sed
Groups are numbered in the order of their opening parentheses
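A sketch of the numbering with one nested group; the input string and the "1=", "2=", "3=" labels are made up for the illustration. The outer group opens first, so it is number 1 even though the inner group closes earlier.

```shell
# \(a\(b\)\) opens first -> \1, the nested \(b\) -> \2, \(c\) -> \3
echo 'abc' | sed 's/\(a\(b\)\)\(c\)/1=[\1] 2=[\2] 3=[\3]/'
# prints: 1=[ab] 2=[b] 3=[c]
```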
vim: same as sed, plus:
":s/regexp/replacement/": replace once in the current line
":s/regexp/replacement/g": replace all (globally) in the current line
":%s/regexp/replacement/g": replace all in the whole file
":10,30s/regexp/replacement/g": replace all in lines 10 through 30
":/BEGIN/,/END/s/regexp/replacement/g": replace all in the lines from the one containing BEGIN to the one containing END
":/main(/,/^}/s/'\([^']*\)'/"\1"/g": replace all «'...'» strings (which are not valid strings in C) with «"..."» (which are) in function main()
":/main(/,/^}/s/'\([^']*\)'/"\1"/gc": do the same with confirmation of each replacement
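The line-range form is not specific to vim: sed accepts the same "/from/,/to/" address in front of the s command. A sketch on a hypothetical two-function C source, showing that only main() is touched:

```shell
# Hypothetical C source written to a scratch file
src=$(mktemp)
cat > "$src" <<'EOF'
void helper(void) {
    puts('left alone');
}
int main(void) {
    puts('hello');
    puts('world');
}
EOF

# The sed script is in shell double quotes, so only the inner " needs a backslash;
# sed itself receives: /main(/,/^}/s/'\([^']*\)'/"\1"/g
sed "/main(/,/^}/s/'\([^']*\)'/\"\1\"/g" "$src"
```

With this input, the two puts() calls inside main() should come out with double quotes, while the one in helper() keeps its single quotes.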
Extended regexp and dialects
Disadvantages of traditional regexp: it's not easy to
Search for one regexp or another: the "|" operator
Use a «one-or-more» repeater (ok, it's easy, but boring); also, the "*" repeater is dangerous, since it matches the empty string too
Use character classes like letters or spaces (also boring)
Use those "\"-s every time (most boring, in fact)
In extended regexp, "(" and ")" are grouping specials; literal parentheses are written as "\(" or "[(]"
"+" is the «one-or-more» repeater; "?" is the zero-or-one repeater
"{6}", "{6,10}", "{6,}", "{,6}" repeaters (exactly 6, from 6 to 10, at least 6, at most 6 repetitions)
character classes like "[:digit:]", "[:upper:]" etc. (used inside a set: "[[:digit:]]")
note the -E (Extended) option of grep and sed
sometimes you can use extended regexp specials with '\': '\+', '\{' etc.
- collation, equivalence classes etc.
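A sketch of the extended specials with grep -E. As before, -x is used so that the whole line has to match, and the sample strings are made up.

```shell
# "|": one regexp or another
printf '%s\n' cat dog bird | grep -E -x 'cat|dog'              # prints: cat, dog

# "+": one or more, so the empty line no longer matches
printf '%s\n' '' 7 42 4x | grep -E -x '[0-9]+'                 # prints: 7, 42

# "?": zero or one
printf '%s\n' color colour colouur | grep -E -x 'colou?r'      # prints: color, colour

# "{6,10}" together with a character class inside a set
printf '%s\n' 12345 123456 12345678901 | grep -E -x '[[:digit:]]{6,10}'   # prints: 123456

# Plain parentheses group; "\(" is a literal parenthesis here
printf '%s\n' abab ab a | grep -E -x '(ab)+'                   # prints: abab, ab
printf '%s\n' 'f(x)' 'fx' | grep -E -x 'f\(x\)'                # prints: f(x)
```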
Can be also read from vim itself by :help pattern
A lot of stuff. E. g. its own character classes such as "\w" and "\W", and its own positioning marks
Superseding engines
Regexps are context unaware: see the main() example above: how to replace only «'..'» and «'..…'» patterns, but not «'.'»?
Consider «Not '.', nor ':', but 'any'» → «Not '.', nor ':', but "any"»
E. g. "s/'\([^'][^'][^']*\)'/"\1"/g" → «Not '.", nor ":", but "any'»
- ⇒ When using a regexp you should check or restrict the context yourself
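The failure above can be reproduced directly. In this sketch the sed script is in shell double quotes, so the inner double quotes are escaped for the shell only.

```shell
# sed receives: s/'\([^'][^'][^']*\)'/"\1"/g
echo "Not '.', nor ':', but 'any'" | sed "s/'\([^'][^'][^']*\)'/\"\1\"/g"
# prints: Not '.", nor ":", but "any'
```

The regexp cannot tell an opening quote from a closing one, so it pairs the closing quote of «'.'» with the opening quote of «':'», and every later pair shifts accordingly.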