Java regular expressions
A regular expression is a sequence of characters that defines a search pattern for matching strings. Regular expressions can define simple patterns, such as an exact sequence, or complex ones that describe variable regions, repetitions, and alternative residues within biological sequences. Motifs described by regular expressions use the Java regular expression syntax. Common syntax components include:
- Range [A-Z] matches the characters A through Z, while [AGT] matches the characters A, G, or T.
- Union [A-D[M-P]] matches the characters A through D and M through P, while [AG[M-P]] matches the characters A, G, and M through P.
- Intersection [A-M&&[H-P]] matches the characters that lie within both A through M and H through P (that is, H through M), while [A-M&&[HGTDA]] matches the characters A through M that are also H, G, T, D, or A (that is, A, D, G, or H, because T lies outside A through M).
- Exclusion [^A-M] matches any character except A through M, while [^AG] matches any character except A and G.
- Subtraction [A-Z&&[^M-P]] matches the characters A through Z except M through P, while [A-P&&[^CG]] matches the characters A through P except C and G.
- Any character The symbol . matches any single character.
- Exact repetition pattern{n} matches pattern repeated exactly n times. The repetition applies only to the element immediately before it unless a group is used. For example, ACG{2} matches ACGG, and (ACG){2} matches ACGACG.
- Range repetition pattern{n,m} matches pattern repeated between n and m times. For example, ACT{1,3} matches ACT, ACTT, and ACTTT.
- At least repetition pattern{n,} matches pattern repeated at least n times. For example, (AC){2,} matches ACAC, ACACAC, ACACACAC, and so on.
- Group (pattern) groups components together, allowing other syntax components to apply to the entire group. For example, (ATG){2} matches ATGATG.
- Or pattern1|pattern2 matches either pattern1 or pattern2. For example, (GAA|GAG) matches either GAA or GAG.
- Start of sequence The symbol ^ matches the beginning of the sequence. For example, ^AC matches sequences starting with AC.
- End of sequence The symbol $ matches the end of the sequence. For example, GT$ matches sequences ending in GT.
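The syntax components above can be checked directly with the standard java.util.regex classes. The following is a minimal sketch, not the application's own code: the class name MotifSyntaxDemo and the helpers matchesWhole and occursIn are hypothetical names introduced for illustration. matchesWhole requires the entire string to match, while occursIn searches for the pattern anywhere in the string, which is the mode where the ^ and $ anchors matter.

```java
import java.util.regex.Pattern;

class MotifSyntaxDemo {

    // Hypothetical helper: true if the entire sequence matches the pattern.
    static boolean matchesWhole(String regex, String sequence) {
        return Pattern.matches(regex, sequence);
    }

    // Hypothetical helper: true if the pattern occurs anywhere in the sequence.
    static boolean occursIn(String regex, String sequence) {
        return Pattern.compile(regex).matcher(sequence).find();
    }

    public static void main(String[] args) {
        // Character classes: union, intersection, subtraction.
        System.out.println(matchesWhole("[A-D[M-P]]", "N"));      // true: N is in M-P
        System.out.println(matchesWhole("[A-M&&[H-P]]", "A"));    // false: A is not in H-P
        System.out.println(matchesWhole("[A-M&&[HGTDA]]", "T"));  // false: T is not in A-M
        System.out.println(matchesWhole("[A-P&&[^CG]]", "C"));    // false: C is subtracted

        // Repetition: {n} binds to the preceding element or group.
        System.out.println(matchesWhole("ACG{2}", "ACGG"));       // true
        System.out.println(matchesWhole("ACG{2}", "ACGACG"));     // false
        System.out.println(matchesWhole("(ACG){2}", "ACGACG"));   // true
        System.out.println(matchesWhole("ACT{1,3}", "ACTTT"));    // true
        System.out.println(matchesWhole("(AC){2,}", "AC"));       // false: needs at least two AC

        // Alternation and anchors.
        System.out.println(matchesWhole("(GAA|GAG)", "GAG"));     // true
        System.out.println(occursIn("^AC", "GACT"));              // false: does not start with AC
        System.out.println(occursIn("GT$", "ACGT"));              // true: ends in GT
    }
}
```

Whether a given tool applies whole-string matching or searches within the sequence is an assumption to verify against that tool; the two modes give different results for unanchored patterns.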
Examples
The following examples illustrate how Java regular expressions can be applied to motifs:
- [ACG][^AC]G{2} matches all sequences of length 4 where:
  - The first character is A, C, or G.
  - The second character is any character except A and C.
  - The third and fourth characters are G.
- G.[^A]$ matches the final three characters of a sequence where:
  - The first of these characters is G.
  - The second is any character.
  - The third is any character except A.
  - The third character is the last character of the sequence.
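To see how the two example motifs behave when searched within a longer sequence, the sketch below scans a made-up sequence and reports each match with its 0-based start position. The class name MotifSearchDemo, the helper findMotifs, and the test sequence are all hypothetical; it assumes motifs are located with Matcher.find(), which returns non-overlapping matches from left to right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MotifSearchDemo {

    // Hypothetical helper: returns every non-overlapping match as "start:matchedText",
    // where start is the 0-based index of the match in the sequence.
    static List<String> findMotifs(String regex, String sequence) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(sequence);
        while (matcher.find()) {
            hits.add(matcher.start() + ":" + matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sequence = "TAGGGTCTGGGTC"; // made-up sequence for illustration

        // First example: A, C, or G; then anything except A or C; then GG.
        System.out.println(findMotifs("[ACG][^AC]G{2}", sequence)); // [1:AGGG, 6:CTGG]

        // Second example: G, any character, a non-A character, then end of sequence.
        System.out.println(findMotifs("G.[^A]$", sequence));        // [10:GTC]

        // A sequence whose final character is A cannot match the second motif.
        System.out.println(findMotifs("G.[^A]$", "ACGTA"));         // []
    }
}
```

Because of the $ anchor, the second motif can match at most once per sequence, and only at its end; the first motif is unanchored and can match at any position.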
