

PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 
java.lang.Object org.jscience.biology.alignment.PairwiseAlignmentAlgorithm org.jscience.biology.alignment.CrochemoreLandauZivUkelson
public abstract class CrochemoreLandauZivUkelson
This abstract class is the superclass of both global and local sequence alignment algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and Michal ZivUkelson (2002).
This implementation derives from the paper of M.Crochemore, G.Landau and M.ZivUkelson, A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices (available here as PDF or Postscript).
It employs LempelZiv compression techniques to speed up the classic dynamic programmin approach to sequence alignment (see NeedlemanWunsch and SmithWaterman classes). It reduces the time and space complexity from O(n^{2}) to O(n^{2}/log n). In fact, the complexity is actually O(h n^{2}/log n), where 0 <= h <= 1 is a real number denoting the entropy of the text (a measure of how compressible it is). This means that, the more compressible a sequence is, the less memory the algorithm will require, and the faster it will run.
The idea behind this improvement is to identify repetitions in the sequences and reuse the computation of their alignments. The first step is, therefore, to parse the sequences into LZ78like factors. LZ78 is a popular compression algorithm of the LempelZiv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished by the FactorSequence class (for more information about this factorisation, please refer to the specification of that class) that builds a doublylinked list of factors. Each factor is an instance of the Factor class (refer to the specification of this class for more information).
Once the sequences have been parsed, the algorithm builds a matrix of blocks, called
block table, that is vaguely similar to the dynamic programming matrix used by both
NeedlemanWunsch
and SmithWaterman
. Each block contains an
instance of an AlignmentBlock (please refer to its specification for more
information on what information is stored) and represents the alignment beteween one
factor of each sequence. This block table is, in fact, a partition of the alignment
graph.
Consider a block B which corresponds to the alignment of factor F1 = xa from sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of S1 with character a, while F2 extends a previous factor of S2 with character b. We can define the input border of B as the set of values at the left and top borders of block B, and the output border as the set of values at the right and bottom borders of B. Moreover, we can define the following prefix blocks of B:
Note that each factor has a pointer to its prefix factor, called ancestor (see the
specification of the Factor
class). This pointer makes it easy to retrieve
any of the three prefix blocks of B in constant time.
Rather than computing each value of the alignment block B, the algorithm will only compute the values on its input and output borders. This is precisely what makes it more efficient.
In this class there is a general specification of how the block table is computed
(see the computeBlockTable
method for details). The actual
method depends on the subclasses. In general, there are two phases:
In fact, for each block, only one column of the DIST matrix needs to be computed,
all other columns are actually retrieved from its prefix blocks. This is precisely what
is accomplished by the assembleDistMatrix
method found in
this class (it is general enough for both global and local alignment versions of the
algorithm.
From the DIST matrix, we obtain the OUT matrix defined as
OUT[i,j] = I[i] + DIST[i,j]
where I is the input border array. This means
that the OUT matrix is the DIST matrix updated by the input border of a block. The
output border is then computed from the OUT matrix by taking the maximum value of each
column. This class also have a general method for assembling the input border (see
assembleInputBorder
The OUT matrix is encoded by the OutMatrix class that takes as both a DIST matrix and an input border array. Note that it does not compute the OUT matrix, it just stores the necessary information to retrieve a value at any position of the matrix.
A naive approach to compute the output border of a block from the OUT matrix of size n x n would take a time proportional to n^{2}. However, it happens that, due to the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which implies that they are also totally monotone. This property allows the computation of the output border of B in linear time with the SMAWK algorithm (see the specification of the Smawk class for more information on SMAWK).
This class contains a general specification that is pertinent to both global and local versions of the algorithm. For more information on each version of, please refer to the appropriate subclass.
A note about the performance of these algorithms. Although theoretical results suggest that these algorithms are faster and consume less memory than the classical methods, in practice it is hard to realise their performance gains.
These algorithms are extremely complex and require the storage of many extra
pointers and other auxiliary data for each block (see the AlignmentBlock
class for more details). Hence, even though the space requirement is
O(n^{2}/log n), which is less than O(n^{2}), in practice, for most of
the cases these algorithms will take more time and memory space than their clasical
counterparts (we have to keep in mind that the Big Oh notation ignores all constants
involved).
Therefore, in order to realise the full power of these algorithms, they have to be used with extremly large and redundant sequences. This will allow a proper reutilisation of the computations and, maybe, provide an improvement in terms of space and run time. For instance, it is easy to devise such a sequence if we use a onecharacter alphabet because, in this case, a sequence is factorised into a series of factors that are a prefix of the next.
CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
NeedlemanWunsch
,
SmithWaterman
,
FactorSequence
,
AlignmentBlock
,
OutMatrix
,
Smawk
,
computeBlockTable()
,
assembleDistMatrix(org.jscience.biology.alignment.AlignmentBlock, int, int, int, int)
Field Summary  

protected AlignmentBlock[][] 
block_table
The block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence. 
protected static byte 
DIAGONAL_DIRECTION
A constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. 
protected static byte 
LEFT_DIRECTION
A constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. 
protected int 
num_cols
Number of columns of the block table. 
protected int 
num_rows
Number of rows of the block table. 
protected OutMatrix 
out_matrix
An instance of the OutMatrix class that encodes the OUT matrix of a
block when supplied with the DIST matrix and the input border array of a block. 
protected FactorSequence 
seq1
The first factorised sequence being aligned. 
protected FactorSequence 
seq2
The second factorised sequence being aligned. 
protected Smawk 
smawk
An instance of the Smawk class that implements the SMAWK algorithm to
compute the column maxima of a totally monotone matrix. 
protected static byte 
STOP_DIRECTION
A constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop. 
protected static byte 
TOP_DIRECTION
A constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment. 
Fields inherited from class org.jscience.biology.alignment.PairwiseAlignmentAlgorithm 

alignment, APPROXIMATE_MATCH_TAG, GAP_CHARACTER, GAP_TAG, MATCH_TAG, MISMATCH_TAG, score, score_computed, scoring, sequences_loaded, use_match_tag 
Constructor Summary  

CrochemoreLandauZivUkelson()

Method Summary  

protected int[][] 
assembleDistMatrix(AlignmentBlock block,
int dim,
int r,
int c,
int lc)
Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks. 
protected int[] 
assembleInputBorder(int dim,
int r,
int c,
int lr)
Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks. 
protected abstract PairwiseAlignment 
buildOptimalAlignment()
Retrieves an optimal alignment between the loaded sequences. 
protected void 
computeBlockTable()
Computes the block table. 
protected PairwiseAlignment 
computePairwiseAlignment()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to retrieve an
optimal alignment between the loaded sequences. 
protected int 
computeScore()
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to locate the
score of the highest scoring alignment between the two sequences in the block
table. 
protected abstract AlignmentBlock 
createBlock(Factor factor1,
Factor factor2,
int row,
int col)
Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2. 
protected abstract AlignmentBlock 
createFirstColumnBlock(Factor factor1,
Factor factor2,
int row)
Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string. 
protected abstract AlignmentBlock 
createFirstRowBlock(Factor factor1,
Factor factor2,
int col)
Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string. 
protected abstract AlignmentBlock 
createRootBlock(Factor factor1,
Factor factor2)
Computes the root block of the block table. 
protected AlignmentBlock 
getDiagonalPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the diagonal prefix of a block from the block table. 
protected AlignmentBlock 
getLeftPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the left prefix of a block from the block table. 
protected AlignmentBlock 
getTopPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the top prefix of a block from the block table. 
protected void 
loadSequencesInternal(java.io.Reader input1,
java.io.Reader input2)
Loads sequences into FactorSequence instances. 
protected abstract int 
locateScore()
Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed. 
protected void 
traverseBlock(AlignmentBlock block,
int source,
java.lang.StringBuffer gapped_seq1,
java.lang.StringBuffer tag_line,
java.lang.StringBuffer gapped_seq2)
Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border. 
protected void 
unloadSequencesInternal()
Frees pointers to loaded sequences and the the block table so that their data can be garbage collected. 
Methods inherited from class org.jscience.biology.alignment.PairwiseAlignmentAlgorithm 

getPairwiseAlignment, getScore, loadSequences, max, max, max, scoreDeletion, scoreInsertion, scoreSubstitution, setScoringScheme, unloadSequences, useMatchTag 
Methods inherited from class java.lang.Object 

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait 
Field Detail 

protected static final byte STOP_DIRECTION
protected static final byte LEFT_DIRECTION
protected static final byte DIAGONAL_DIRECTION
protected static final byte TOP_DIRECTION
protected FactorSequence seq1
protected FactorSequence seq2
protected AlignmentBlock[][] block_table
protected int num_rows
protected int num_cols
protected Smawk smawk
Smawk
class that implements the SMAWK algorithm to
compute the column maxima of a totally monotone matrix. It is used to speed up the
computation of the output border of a block.
protected OutMatrix out_matrix
OutMatrix
class that encodes the OUT matrix of a
block when supplied with the DIST matrix and the input border array of a block.
Note that it does not compute the OUT matrix itselft, it just stores the necessary
information to retrieve a value at any position of the matrix.
This object is then used to compute the output border of a block with the
Smawk
class. Note that the OutMatrix
class implements the
Matrix
interface as required by the Smawk
class.
Matrix
,
Smawk
Constructor Detail 

public CrochemoreLandauZivUkelson()
Method Detail 

protected void loadSequencesInternal(java.io.Reader input1, java.io.Reader input2) throws java.io.IOException, InvalidSequenceException
FactorSequence
instances. In case of any error,
an exception is raised by the constructor of FactorSequence
(please
check the specification of that class for specific requirements).
A FactorSequence
is an LZ78like factorisation of the sequences
being aligned.
loadSequencesInternal
in class PairwiseAlignmentAlgorithm
input1
 Input for first sequenceinput2
 Input for second sequence
java.io.IOException
 If an I/O error occurs when reading the sequences
InvalidSequenceException
 If the sequences are not validFactorSequence
protected void unloadSequencesInternal()
unloadSequencesInternal
in class PairwiseAlignmentAlgorithm
PairwiseAlignmentAlgorithm.unloadSequences()
protected PairwiseAlignment computePairwiseAlignment() throws IncompatibleScoringSchemeException
computeBlockTable
for details) and requests subclasses to retrieve an
optimal alignment between the loaded sequences. The actual product depends on the
subclasses which can produce a global (see
CrochemoreLandauZivUkelsonGlobalAlignment
) or local alignment (see
CrochemoreLandauZivUkelsonLocalAlignment
).
Subclasses are required to implement the buildOptimalAlignment
abstract method defined by this class according to their own method.
computePairwiseAlignment
in class PairwiseAlignmentAlgorithm
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
computeBlockTable()
,
buildOptimalAlignment()
protected int computeScore() throws IncompatibleScoringSchemeException
computeBlockTable
for details) and requests subclasses to locate the
score of the highest scoring alignment between the two sequences in the block
table. The result depends on the subclasses, and either a global alignment
(see CrochemoreLandauZivUkelsonGlobalAlignment
) or local alignment
score (see CrochemoreLandauZivUkelsonLocalAlignment
) will be produced.
Subclasses are required to implement the locateScore
abstract
method defined by this class according to their own method.
Note that this method calculates the similarity value only (it doesn't trace back into the block table to retrieve the alignment itself).
computeScore
in class PairwiseAlignmentAlgorithm
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.CrochemoreLandauZivUkelsonGlobalAlignment
,
CrochemoreLandauZivUkelsonLocalAlignment
,
locateScore()
protected void computeBlockTable() throws IncompatibleScoringSchemeException
There are four different cases that defines four abstract methods in this class, which subclasses must implement:
Note that each factor has a serial number which indicates its order in the list of factors of a sequence. This number will match with the row and column index of a block in the block table. For instance, if a block has factors F1 and F2 with serial numbers 12 and 53, respectively, this means that this block is found at row 12, column 53 of the block table.
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.createRootBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor)
,
createFirstColumnBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int)
,
createFirstRowBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int)
,
createBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int, int)
protected int[][] assembleDistMatrix(AlignmentBlock block, int dim, int r, int c, int lc)
block
 the block for which the DIST matrix is neededdim
 the dimension of the DIST matrixr
 the row index of this block in the block tablec
 the column index of this block in the block tablelc
 the number of columns of the alignment block
protected int[] assembleInputBorder(int dim, int r, int c, int lr)
ArrayIndexOutOfBoundsException
.
dim
 dimension of the input borderr
 row index of the block in the block tablec
 column index of the block in the block tablelr
 number of row of the block
protected void traverseBlock(AlignmentBlock block, int source, java.lang.StringBuffer gapped_seq1, java.lang.StringBuffer tag_line, java.lang.StringBuffer gapped_seq2) throws IncompatibleScoringSchemeException
block
 the block to be traversedsource
 the source of the path in the output bordergapped_seq1
 the StringBuffer to where the gapped sequence 1 is written totag_line
 the StringBuffer to where the tag_line is written togapped_seq2
 the StringBuffer to where the gapped sequence 2 is written to
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected AlignmentBlock getLeftPrefix(AlignmentBlock block)
block
 the block
protected AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
block
 the block
protected AlignmentBlock getTopPrefix(AlignmentBlock block)
block
 the block
protected abstract AlignmentBlock createRootBlock(Factor factor1, Factor factor2) throws IncompatibleScoringSchemeException
factor1
 the factor of the first sequence being alignedfactor2
 the factor of the second sequence being aligned
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createFirstRowBlock(Factor factor1, Factor factor2, int col) throws IncompatibleScoringSchemeException
factor1
 the factor of the first sequence being alignedfactor2
 the factor of the second sequence being alignedcol
 the column index of block in the block table
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createFirstColumnBlock(Factor factor1, Factor factor2, int row) throws IncompatibleScoringSchemeException
factor1
 the factor of the first sequence being alignedfactor2
 the factor of the second sequence being alignedrow
 the row index of block in the block table
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected abstract AlignmentBlock createBlock(Factor factor1, Factor factor2, int row, int col) throws IncompatibleScoringSchemeException
factor1
 the factor of the first sequence being alignedfactor2
 the factor of the second sequence being alignedrow
 the row index of block in the block tablecol
 the column index of block in the block table
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected abstract PairwiseAlignment buildOptimalAlignment() throws IncompatibleScoringSchemeException
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.protected abstract int locateScore()
IncompatibleScoringSchemeException
 If the scoring scheme is not compatible
with the loaded sequences.


PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 