The following is a simple explanation of regular expressions.

Perl regular expressions are patterns used to match text strings, e.g. 'aadgt', 'aa+dgt', 'a|d|c', '[mac]a'.
Because nucleotide and amino acid sequences are text strings, regular expressions are very useful for finding motifs within sequences.
Motifs often allow repeated or ambiguous residues at some positions. The rules and special characters used in regular expressions define the full set of strings that match the motif pattern.
The following is a description of some of these characters and examples of how they are used.
Although regular expressions seem complicated at first, they are very useful and easy to understand after going through some examples.

Special Characters
. Matches any single character (except a newline).
+ Matches "one or more occurrences of the preceding character".
* Matches "any number of occurrences of the preceding character", including 0.
? Matches "zero or one occurrence of the preceding character".
[ ] Matches any single character contained in the brackets.
[^ ] Matches any single character except those in the brackets.
{n} Matches when the preceding character, or character range, occurs exactly n times.
{n,} Matches when the preceding character occurs at least n times.
{n,m} Matches when the preceding character occurs at least n times, but no more than m times.
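
The bracket and brace forms above can be tried out with a short script. This is only a sketch in Python, whose re module uses the same syntax as Perl for these constructs. The helper name full_match and the test strings are invented for illustration. re.fullmatch requires the entire string to fit the pattern.

```python
import re

def full_match(pattern, text):
    """Return True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen for illustration only.
checks = [
    (r"a.c", "axc"),         # '.' stands for any single character
    (r"a[^yst]c", "abc"),    # 'b' is not one of the excluded characters
    (r"a[^yst]c", "atc"),    # 't' is excluded, so this fails
    (r"ad{3}f", "adddf"),    # exactly three 'd'
    (r"ad{3}f", "addf"),     # only two 'd', so this fails
    (r"ad{2,}f", "addddf"),  # at least two 'd'
    (r"ad{2,4}f", "adddddf") # five 'd' is more than the maximum of four
]

for pattern, text in checks:
    print(pattern, text, full_match(pattern, text))
```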


Here are some examples of searches.
ad+f (1 or more occurrences of 'd') would match any of the following:
adf
addf
adddf
addddddf
...

ad*f (0 or more occurrences of 'd') would match:
af
adf
addf
adddf
...

ad?f (0 or 1 occurrence of 'd') would match:
af
adf

a[yst]c would match:
atc
asc
ayc
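
The example searches above can be checked with a few lines of code. The following is a minimal sketch in Python, whose re module shares Perl's syntax for these patterns. The helper name matching_strings and the list of candidate strings are assumptions made for illustration. re.fullmatch is used so that the whole string must fit the pattern, as in the lists above.

```python
import re

def matching_strings(pattern, candidates):
    """Return the candidates that fit the pattern from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Illustrative strings; 'abc' is included as one that a[yst]c rejects.
candidates = ["af", "adf", "addf", "adddf", "atc", "asc", "ayc", "abc"]

print(matching_strings(r"ad+f", candidates))     # ['adf', 'addf', 'adddf']
print(matching_strings(r"ad*f", candidates))     # ['af', 'adf', 'addf', 'adddf']
print(matching_strings(r"ad?f", candidates))     # ['af', 'adf']
print(matching_strings(r"a[yst]c", candidates))  # ['atc', 'asc', 'ayc']
```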


Braces specify the number of occurrences of a residue.
P{1,5} would match a run of 1 to 5 consecutive P residues.

.{1,30} would match any 1 to 30 amino acids, so it can be used to find a motif within 30 amino acids of a fixed point such as the beginning of the sequence.
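
As a rough illustration of these counted repeats, here is a sketch in Python (its re module uses the same brace syntax as Perl). The sequences, the motif 'RGD', and the function name motif_near_start are invented for the example. The leading ^, which ties the pattern to the start of the string, is added here to make "within 30 amino acids of the beginning" concrete.

```python
import re

# Hypothetical sequence with a run of 3 prolines and a run of 7 prolines.
seq = "MAPPPGK" + "P" * 7 + "A"

# P{1,5} takes as many P residues as it can, up to 5 per match,
# so the run of 7 is reported as a match of 5 followed by a match of 2.
runs = [m.group() for m in re.finditer(r"P{1,5}", seq)]
print(runs)  # ['PPP', 'PPPPP', 'PP']

def motif_near_start(sequence, motif="RGD"):
    """True if `motif` is preceded by 1 to 30 residues counted from the start."""
    return re.search(r"^.{1,30}" + re.escape(motif), sequence) is not None

near = "MK" + "RGD" + "A" * 40   # motif begins after 2 residues
far = "A" * 35 + "RGD"           # motif begins after 35 residues

print(motif_near_start(near))  # True
print(motif_near_start(far))   # False
```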


Pattern Anchors
^ Match only at the beginning of the string.
$ Match only at the end of the string.


Here are examples of expressions using pattern anchors.
^mdef (e.g. a protein sequence starting with 'mdef') would match:
  • mdef
  • mdefab
  • mdefaredfadfk
but would not match:
  • edefa
  • emdefa
  • eeeemdef


kdel$ (searches for proteins ending with 'kdel', a standard ER retention signal) would match:
  • eeeekdel
  • kdel
but would not match:
  • edefkdell
  • akdeleefg
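
The anchor examples can be reproduced with a short script. This is a minimal sketch in Python, where ^ and $ behave as in Perl for single-line strings like these. The helper name search_results is an assumption for illustration. re.search looks for the pattern anywhere in the string, so the anchors alone decide where a match is allowed.

```python
import re

def search_results(pattern, strings):
    """Map each string to True or False depending on whether the pattern is found."""
    return {s: re.search(pattern, s) is not None for s in strings}

# ^mdef: the string must begin with 'mdef'.
start_tests = ["mdef", "mdefab", "mdefaredfadfk", "edefa", "emdefa", "eeeemdef"]
for s, found in search_results(r"^mdef", start_tests).items():
    print("^mdef", s, found)

# kdel$: the string must end with 'kdel'.
end_tests = ["eeeekdel", "kdel", "edefkdell", "akdeleefg"]
for s, found in search_results(r"kdel$", end_tests).items():
    print("kdel$", s, found)
```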