Introduction to Regular Expressions
Definition
According to Wikipedia, a regular expression is a sequence of characters that defines a search pattern, typically for use in find and find-and-replace operations.
Where can Regular Expressions be used?
Regular Expressions are applied to unstructured text that has some sort of loose pattern. Working out the pattern of the document and the proper Regex syntax to use can be difficult and time-consuming. Users need to be aware of the following before beginning to use Regular Expressions.
- Choose a text editor that supports regular expressions and figure out how Regex works in that software (also, make sure it's enabled in the software). LibreOffice, Sublime Text, and Notepad++ all behave somewhat differently. Online tools such as RegExr or Rubular also behave somewhat differently than the desktop tools.
- Be familiar with your text and what type of structure you want in the end.
- Familiarize yourself with the most common regular expressions or common combinations of syntax.
Regex Syntax
Regex | What It Does |
---|---|
A b 1 | literals — letters, digits, and spaces match themselves |
[Ab1] | a character class, matching one instance of any of A, b, or 1 in this case |
[a-z] | any one lowercase letter in the range a to z |
[0-9] | any one digit |
. | any single character (in most tools, except a line break) |
* | zero or more of the preceding item |
+ | one or more of the preceding item |
( ) | if contents within parentheses match, define a group for future reference |
$1 | refer to a matched group (this is the notation in LibreOffice; other notations such as \1 are sometimes used elsewhere) |
\t | tab |
^ | beginning of line |
$ | end of line |
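
As a rough illustration of the table above, the sketch below tries a few of these pieces with Python's `re` module. The sample string and variable names are made up for this example. Note that Python writes group references in a replacement as `\1`, not the `$1` used in LibreOffice, and that `^` and `$` only anchor to individual lines when `re.MULTILINE` is set.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Ab1\tcat 42\nbird 7"

# [Ab1] matches one character that is A, b, or 1.
class_hits = re.findall(r"[Ab1]", text)

# [0-9]+ matches one or more digits in a row.
digit_runs = re.findall(r"[0-9]+", text)

# ^ and $ anchor to the start and end of each line under re.MULTILINE.
first_words = re.findall(r"^[a-zA-Z]+", text, flags=re.MULTILINE)
last_numbers = re.findall(r"[0-9]+$", text, flags=re.MULTILINE)

# ( ) defines groups; \1 and \2 refer back to them in the replacement.
swapped = re.sub(r"([a-z]+) ([0-9]+)", r"\2 \1", text)

print(class_hits)    # ['A', 'b', '1', 'b']
print(digit_runs)    # ['1', '42', '7']
print(first_words)   # ['Ab', 'bird']
print(last_numbers)  # ['42', '7']
print(repr(swapped)) # 'Ab1\t42 cat\n7 bird'
```

Other tools may treat anchors, line breaks, and group references differently, so it is worth re-checking each pattern in the editor you actually plan to use.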
Regex combination syntax
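([A-Z])\w+ Finds words that start with a capital letter: one uppercase letter followed by one or more word characters
\b\w{4}\b Finds four-character words (\w matches letters, digits, and underscores, so a four-digit number also counts)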
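
To see what these two combinations actually pick out, here is a small Python sketch. The sentence is made up for this example. One assumption specific to Python: `re.findall` returns only the captured group when a pattern contains parentheses, so `re.finditer` is used to get the whole match for the first pattern.

```python
import re

# Hypothetical sentence, made up for this illustration.
sentence = "Sam Houston sent word from Texas in 1836"

# ([A-Z])\w+ : a capital letter followed by one or more word characters.
# finditer is used because findall would return only the captured letter.
capitalized = [m.group(0) for m in re.finditer(r"([A-Z])\w+", sentence)]

# \b\w{4}\b : exactly four word characters between word boundaries.
# \w also matches digits and underscores, so "1836" is included.
four_chars = re.findall(r"\b\w{4}\b", sentence)

print(capitalized)  # ['Sam', 'Houston', 'Texas']
print(four_chars)   # ['sent', 'word', 'from', '1836']
```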
There was the option of completing two tutorials to practice using Regular Expressions.
The syntax below can be tried on the sample text below, which is a section of the text used in the “Regex and Republic of Texas” tutorial. Try testing some of it using RegExr.
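a. (.+\bto\b) – matches everything on a line up to and including the last standalone word “to” (the \b word boundaries skip “to” inside longer words)
b. \r\n[^~].+ – matches a line break followed by a line that does not begin with “~”; replace with nothing to remove those lines
c. (,)( [0-9]{4})(.+) – finds the four-digit year after a comma, capturing the comma, the year, and the rest of the line as three groups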
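
The three patterns above can also be checked outside RegExr. The following Python sketch runs them against made-up lines that only imitate the shape of a letter index (sender, recipient, date, page number); the names, dates, and variable names are hypothetical and are not taken from the tutorial text.

```python
import re

# Hypothetical index line, made up for this illustration.
line = "John Doe to Jane Roe, May 4, 1837 12"

# a. Everything up to and including the last standalone "to".
up_to_to = re.search(r"(.+\bto\b)", line).group(1)

# c. A comma, then a space and four digits, then the rest of the line.
date_groups = re.search(r"(,)( [0-9]{4})(.+)", line).groups()

# b. Remove lines that do not begin with "~". The sample assumes
# Windows-style \r\n line breaks and puts the unwanted line last.
marked = "~John Doe to Jane Roe\r\n~Jane Roe to John Doe\r\nRunning header"
kept = re.sub(r"\r\n[^~].+", "", marked)

print(up_to_to)     # John Doe to
print(date_groups)  # (',', ' 1837', ' 12')
print(repr(kept))   # '~John Doe to Jane Roe\r\n~Jane Roe to John Doe'
```

Pattern b only matches when the text really uses `\r\n` line breaks. With plain `\n` line breaks it finds nothing, which is one of the ways tools and files can differ.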