15. Regular expressions

Pattern matching:

Task: from a number of strings, select only those resembling a given example
The example string (the pattern) contains some special characters that describe the string structure
matching tests whether the whole string can be described by the pattern
searching finds a substring that matches the pattern
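
The difference between searching and matching can be tried out with grep, which is covered later in these notes. This is only a sketch: the sample strings and the helper names `search` and `match` are made up for illustration, and `-x` is the grep option that requires the whole line to match.

```shell
# Hypothetical sample data: one candidate string per line.
strings='abba
cabba
abbat
baba'

# Searching: keep every line that contains a substring matching the pattern.
search() { printf '%s\n' "$strings" | grep 'abba'; }

# Matching: keep only lines that the pattern describes as a whole (-x).
match() { printf '%s\n' "$strings" | grep -x 'abba'; }

search    # prints abba, cabba and abbat, one per line
match     # prints abba
```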

Shell patterns

see glob
used for filename generation before executing shell command
```
$ ls
a  a0  aa  aaa  aaaa  abab  abba  acb  b  b0  bb  bbb  cabba  cba
$ ls a*
a  a0  aa  aaa  aaaa  abab  abba  acb
$ ls a?b
acb
$ ls c*a
cabba  cba
$ ls [ab]*
a  a0  aa  aaa  aaaa  abab  abba  acb  b  b0  bb  bbb
$ ls *[a-z]
a  aa  aaa  aaaa  abab  abba  acb  b  bb  bbb  cabba  cba
$ ls *[^a-z]
a0  b0
```
- performed by shell, not by command called (i. e. not by ls here)
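
That the shell, and not the called command, performs the expansion can be checked with a small experiment. This is a sketch under assumptions: it works in a scratch directory created with `mktemp -d`, and the file names are made up. `echo` never sees the pattern unless it is quoted.

```shell
demo_dir=$(mktemp -d)        # scratch directory, an assumption of this sketch
cd "$demo_dir" || exit 1
touch a a0 aa b

expanded=$(echo a*)          # the shell replaces a* before echo runs
quoted=$(echo 'a*')          # quoting suppresses filename generation

echo "$expanded"             # a a0 aa
echo "$quoted"               # a*

cd / && rm -rf "$demo_dir"   # clean up the scratch directory
```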

Too weak for anything beyond filename matching.

Regular expressions

Mastering Regular Expressions by Jeffrey Friedl (aka The Owl Book).

Regular languages are the narrowest formal language class of the Chomsky hierarchy.

Can be (relatively) easily parsed
Can describe almost any possible pattern that is not bound to context and has no dependence between its internal parts (e. g. «a if preceded by b» or «an integer number, then that number of characters» cannot be described)
Laconic

Warning

Writing a regexp is far easier than reading someone else's regexp

Atomic regexp:
- any non-special character matches exactly the same character
  - "E" → «E»
- a dot "." matches any one character
  - "." → «E»
  - "." → «:»
  - "." → «.»
- a set of characters matches any character from the set:
  - "[quack!]" → «a»
  - "[quack!]" → «!»
  - "[a-z]" → «q» (any small letter)
  - "[a-z]" → «z» (any small letter)
  - "[a-fA-F0-9]" → «f» (any hexadecimal digit)
  - "[a-fA-F0-9]" → «D» (any hexadecimal digit)
  - "[abcdefABCDEF0-9]" → «4» (any hexadecimal digit)
- a negative set of characters matches any character not from the set:
  - "[^quack!]" → «r»
  - "[^quack!]" → «#»
  - "[^quack!]" → «A»
- any atomic regexp followed by the "*" repeater matches a continuous sequence of substrings (including the empty sequence), each matched by that regexp
  - "a*" → «aaa»
  - "a*" → «»
  - "a*" → «a»
  - "[0-9]*" → «7»
  - "[0-9]*" → «»
  - "[0-9]*" → «1231234»
  - ".*" → any string!
- any complex regexp enclosed in the special grouping parentheses "\(" and "\)" (see below)
Complex regexp
- A sequence of atomic regexps
  - Matches a continuous sequence of substrings, each matched by the corresponding atomic regexp
  - "boo" → «boo»
  - "r....e" → «riddle»
  - "r....e" → «r re e»
  - "[0-9][0-9]*" → any non-negative integer
  - "[A-Za-z_][A-Za-z0-9_]*" → C identifier (alphanumeric sequence with «_», not starting with a digit)
- Grouping parentheses can be used for repeating a complex regexp:
  - "\([A-Z][a-z]\)*" → «ReGeXp»
  - "\([A-Z][a-z]\)*" → «»
  - "\([A-Z][a-z]\)*" → «Oi»
- Implies the leftmost longest rule (aka «greedy»):
  - In a successful match of a complex regexp, the leftmost atomic regexp takes the longest possible match, the second leftmost takes the longest match still possible under that condition, and so on
  - ".*.*" → the leftmost ".*" takes the whole string, the next one takes the empty string
  - "[a-z]*[0-9]*[a-z0-9]*" → «123b0c0»
    - "[a-z]*" → «»
    - "[0-9]*" → «123»
    - "[a-z0-9]*" → «b0c0»
  - "[a-d]*[c-f]*[d-h]*" → «abcdefgh»
    - "[a-d]*" → «abcd»
    - "[c-f]*" → «ef»
    - "[d-h]*" → «gh»
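
The greedy splits listed above can be reproduced with sed group recall, which is described later in these notes. This is a sketch, not a reference tool: `show_split` is a hypothetical helper that expects a regexp with exactly three groups and prints each group's match in brackets. `LC_ALL=C` is set on the assumption that ranges like "[a-d]" should follow plain byte order.

```shell
# show_split REGEXP STRING
# REGEXP must contain exactly three \(...\) groups; the whole STRING has to
# match, and each group's share of it is printed in brackets.
show_split() {
    printf '%s\n' "$2" | LC_ALL=C sed "s/^$1\$/[\1][\2][\3]/"
}

show_split '\([a-d]*\)\([c-f]*\)\([d-h]*\)' abcdefgh      # [abcd][ef][gh]
show_split '\([a-z]*\)\([0-9]*\)\([a-z0-9]*\)' 123b0c0    # [][123][b0c0]
```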
Positioning mark
- "^regexp" matches only substrings located at the beginning of the line
- "regexp$" matches only substrings located at the end of the line
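
A minimal sketch of how the positioning marks narrow a search, using grep (covered in the next section). The word list and the helper names are made up for illustration.

```shell
# Hypothetical sample data: one word per line.
words='abba
cabba
bab'

at_start() { printf '%s\n' "$words" | grep '^ab'; }   # lines beginning with ab
at_end()   { printf '%s\n' "$words" | grep 'ab$'; }   # lines ending with ab
anywhere() { printf '%s\n' "$words" | grep 'ab'; }    # no positioning mark

at_start    # abba
at_end      # bab
anywhere    # abba, cabba and bab, one per line
```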

Regexp tools

grep: filtering strings that contain regexp:
```
$ cal | grep 18
16 17 18 19 20 21 22
$ cal | grep '9.*4'
 9 10 11 12 13 14 15
```
Try all the examples above with grep (happily, it colors every matched substring)
vim (these commands enter command-line mode)
- /regexp: search forward
- ?regexp: search backward
less — same
…

Search and replace

sed: a stream editor; if not sure, do not go too deep into it

search and replace: s/regexp/replacement/

e. g.

replace once

```
$ cal | sed 's/[12][23]/@@/'
     March 2020
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 @@ 13 14
15 16 17 18 19 20 21
@@ 23 24 25 26 27 28
29 30 31
```

replace all (globally)

```
$ cal | sed 's/[12][23]/@@/g'
     March 2020
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 @@ @@ 14
15 16 17 18 19 20 21
@@ @@ 24 25 26 27 28
29 30 31
```

Group recall: every substring matched by a regexp grouped with "\(" and "\)" can be inserted into the replacement string by referring to the corresponding group number ("\1", "\2" etc.):

```
$ cal | sed 's/2\([0-6]\)/=\1/g'
     March =0=0
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 =0 =1
=2 =3 =4 =5 =6 27 28
29 30 31
$ echo '15 16 17 18 19 20 21' | sed 's/\(15\)\(.*\)\(20\)/\3\2\1/'
20 16 17 18 19 15 21
$ echo '==15 16 17 18 19 20 21==' | sed 's/\([0-9][0-9]*\).*\([0-9]\)/\1\2/'
==151==
$ echo '==15 16 17 18 19 20 21==' | sed 's/\([0-9]*\).*\([0-9]\)/\1\2/'
1==
```

Check the complex regexp examples above with sed
Groups are numbered in the order of their opening parentheses:
```
$ echo 'aaabbbccc' | sed 's/\(a*\(b*\)\)/\2-\1=/'
bbb-aaabbb=ccc
$ echo '15 16 17 18 19 20 21' | sed 's/\(1.*7.*\(8.*2\)\)/\2-\1=/'
8 19 20 2-15 16 17 18 19 20 2=1
```
vim: same as sed, plus:
- ":s/regexp/replacement/": replace once in the current line
- ":s/regexp/replacement/g": replace all (globally) in the current line
- ":%s/regexp/replacement/g": replace all in the whole file
- ":10,30s/regexp/replacement/g": replace all in lines 10, 11, …, 30
- ":/BEGIN/,/END/s/regexp/replacement/g": replace all in the lines from the one containing BEGIN to the one containing END
- ":/main(/,/^}/s/'\([^']*\)'/"\1"/g": replace all «'...'» strings (which are not valid C strings) with «"..."» (which are) in the function main()
- ":/main(/,/^}/s/'\([^']*\)'/"\1"/gc": do the same with confirmation of each replacement

Extended regexp and dialects

Disadvantages of traditional regexp: it's not easy to

Search for one regexp or another (this needs the "|" operator)
Use a «one-or-more» repeater (OK, it is easy, but boring); also, the "*" repeater is dangerous, since it matches the empty string too
Use a character class like letters or spaces (also boring)
Use those "\"-s every time (most boring, in fact)

Extended regexp:

"(" and ")" are grouping specials; literal parentheses are "\(", "\)" or "[(]", "[)]"
"+" is the «one-or-more» repeater; "?" is the zero-or-one repeater
"{6}", "{6,10}", "{6,}", "{,6}" are counted repeaters (exactly 6, from 6 to 10, at least 6, at most 6 repetitions)
character classes like "[:digit:]", "[:upper:]" etc., used inside a set: "[[:digit:]]"
```
$ echo "== Var01.field01 ==" | sed -E 's/([[:alnum:].]+)/"\1"/'
== "Var01.field01" ==
```
- note the -E (Extended) option
- sometimes (e. g. in GNU tools) you can use extended regexp specials in a traditional regexp by prefixing them with "\": "\+", "\{" etc.
collation, equivalence classes etc.
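
The extended specials listed above can be tried with `grep -E`. This is a sketch: the sample lines and the helper `ere` are made up for illustration. `-E` switches grep to extended regexps in the same way it does for sed.

```shell
# Hypothetical sample data: one string per line.
lines='color
colour
colouur
cat
dog
2021-11-15'

# ere REGEXP: print the sample lines matched by an extended regexp.
ere() { printf '%s\n' "$lines" | grep -E "$1"; }

ere '^colou?r$'                               # color, colour
ere '^colou+r$'                               # colour, colouur
ere '^(cat|dog)$'                             # cat, dog
ere '^[[:digit:]]{4}(-[[:digit:]]{2}){2}$'    # 2021-11-15
```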

Vim regexp

Can also be read from vim itself with :help pattern
A lot of stuff, e. g. its own character classes (such as "\w" and "\W") and extra positioning marks (such as "\<" and "\>")

Superseding engines

Regexps are context-unaware. See the main() example above: how to replace only «'..'» and «'..…'» patterns, but not «'.'»?

Consider «Not '.', nor ':', but 'any'» → «Not '.', nor ':', but "any"»
- E. g. "s/'\([^'][^'][^']*\)'/"\1"/g" → «Not '.", nor ":", but "any'»
- ⇒ When using regexps you should check or restrict the context yourself
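
One possible way to restrict the context by hand is sketched below. It is only an illustration, not the method of these notes: one-character «'.'» items are hidden behind markers first, the longer ones are replaced, then the markers are turned back. The helper name `requote` and the markers "<<" and ">>" are assumptions, and the markers must not occur in the input.

```shell
# requote STRING: turn '...' into "..." only when the quoted text is two or
# more characters long. Assumes << and >> never occur in STRING.
requote() {
    printf '%s\n' "$1" | sed \
        -e "s/'\(.\)'/<<\1>>/g" \
        -e "s/'\([^'][^'][^']*\)'/\"\1\"/g" \
        -e "s/<<\(.\)>>/'\1'/g"
}

requote "Not '.', nor ':', but 'any'"
# Not '.', nor ':', but "any"
```

The first expression consumes every «'.'» before the second one can pair its closing quote with the next opening quote, which is exactly the context the single regexp could not see.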

Superseding engines

PCRE
Python re
...

HSE/ProgrammingOS/15_Regexp (last edited by user FrBrGeorge, 2021-11-15 12:47:19)