Java regular expressions
A regular expression is a string that describes
or matches a set of strings according to certain syntax rules. Regular
expressions are usually used to give a concise description of a set without
having to list all of its elements. The simplest form of a regular
expression is a literal string. The syntax used for the regular
expressions is the Java regular expression syntax (see
http://java.sun.com/docs/books/tutorial/essential/regex/index.html).
Some of the most important syntax rules are listed below. They are
also shown in the help pop-up when you press Shift + F1:
- [A-Z] will match the characters A through Z (Range). You can also put single characters between the brackets: the expression [AGT] matches the characters A, G or T.
- [A-D[M-P]] will match the characters A through D and M through P (Union). You can also put single characters between the brackets: the expression [AG[M-P]] matches the characters A, G and M through P.
- [A-M&&[H-P]] will match the characters that lie both between A and M and between H and P, that is, H through M (Intersection). You can also put single characters between the brackets: the expression [A-M&&[HGTDA]] matches the characters in A through M that are also one of H, G, T, D or A, that is, A, D, G or H (T lies outside A through M).
- [^A-M] will match any character except those between A and M (Excluding). You can also put single characters between the brackets: The expression [^AG] matches any character except A and G.
- [A-Z&&[^M-P]] will match any character A through Z except those between M and P (Subtraction). You can also put single characters between the brackets: The expression [A-P&&[^CG]] matches any character between A and P except C and G.
- The symbol . matches any character.
- X{n} will match exactly n repetitions of the element X, indicated by following that element with a numerical value between the curly brackets. For example, ACG{2} matches the string ACGG and (ACG){2} matches ACGACG.
- X{n,m} will match a certain number of
repetitions of an element indicated by following that element with
two numerical values between the curly brackets. The first number is
a lower limit on the number of repetitions and the second number is
an upper limit on the number of repetitions. For example,
ACT{1,3} matches ACT, ACTT and ACTTT.
- X{n,} represents a repetition of an element
at least n times. For example, (AC){2,} matches all strings
ACAC, ACACAC, ACACACAC,...
- The symbol ^ restricts the search to the beginning of your sequence. For example, if you search through a sequence with the regular expression ^AC, the algorithm will find a match only if AC occurs at the beginning of the sequence.
- The symbol $ restricts the search to the end of your sequence. For example, if you search through a sequence with the regular expression GT$, the algorithm will find a match only if GT occurs at the end of the sequence.
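The rules above can be tried out directly with the standard java.util.regex classes. The following is a minimal sketch, not the search code used by the application: the class name RegexRules and the helper methods matchesWhole and found are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

public class RegexRules {
    // True if the whole input string matches the expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the expression matches somewhere inside the input string.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Character classes: union, intersection, exclusion, subtraction.
        System.out.println(matchesWhole("[A-D[M-P]]", "N"));     // true
        System.out.println(matchesWhole("[A-M&&[H-P]]", "K"));   // true
        System.out.println(matchesWhole("[A-M&&[HGTDA]]", "T")); // false
        System.out.println(matchesWhole("[^A-M]", "C"));         // false
        System.out.println(matchesWhole("[A-Z&&[^M-P]]", "L"));  // true

        // Repetitions: the quantifier applies to the preceding element only,
        // unless a group in parentheses is used.
        System.out.println(matchesWhole("ACG{2}", "ACGG"));      // true
        System.out.println(matchesWhole("(ACG){2}", "ACGACG"));  // true
        System.out.println(matchesWhole("ACT{1,3}", "ACTTTT"));  // false
        System.out.println(matchesWhole("(AC){2,}", "ACACAC"));  // true

        // Anchors: beginning and end of the sequence.
        System.out.println(found("^AC", "ACGT"));                // true
        System.out.println(found("^AC", "GACT"));                // false
        System.out.println(found("GT$", "ACGT"));                // true
        System.out.println(found("GT$", "GTAC"));                // false
    }
}
```

Here matchesWhole requires the entire string to match, which is convenient for checking a single character or a fixed-length motif, while found mirrors a search through a longer sequence.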
Examples
The expression [ACG][^AC]G{2} matches all strings of
length 4, where the first character is A, C or G, the second
is any character except A and C, and the third and fourth characters are
G. The expression G.[^A]$ matches all strings of
length 3 at the end of your sequence, where the first character is
G, the second is any character and the third is any character except
A.
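The two example expressions can be checked against a short sequence with plain Java. This is only an illustrative sketch: the class name RegexExamples, the helper findAll and the test sequence ATGGCGGG are made up for this example and are not part of the application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExamples {
    // Collects every non-overlapping match of the expression in the sequence,
    // scanning from left to right.
    static List<String> findAll(String regex, String sequence) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(sequence);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sequence = "ATGGCGGG"; // hypothetical test sequence

        // First character A, C or G; second anything but A or C; then GG.
        System.out.println(findAll("[ACG][^AC]G{2}", sequence)); // [ATGG, CGGG]

        // G, any character, then any character except A, at the end only.
        System.out.println(findAll("G.[^A]$", sequence));        // [GGG]
    }
}
```

Note that the second expression reports only the final three characters, even though GGC also occurs earlier in the sequence, because the $ symbol ties the match to the end.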