org.jscience.biology.alignment
Class CrochemoreLandauZivUkelson

java.lang.Object
  extended by org.jscience.biology.alignment.PairwiseAlignmentAlgorithm
      extended by org.jscience.biology.alignment.CrochemoreLandauZivUkelson
Direct Known Subclasses:
CrochemoreLandauZivUkelsonGlobalAlignment, CrochemoreLandauZivUkelsonLocalAlignment

public abstract class CrochemoreLandauZivUkelson
extends PairwiseAlignmentAlgorithm

This abstract class is the superclass of both global and local sequence alignment algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and Michal Ziv-Ukelson (2002).

This implementation derives from the paper of M.Crochemore, G.Landau and M.Ziv-Ukelson, A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices (available here as PDF or Postscript).

It employs Lempel-Ziv compression techniques to speed up the classic dynamic programmin approach to sequence alignment (see NeedlemanWunsch and SmithWaterman classes). It reduces the time and space complexity from O(n2) to O(n2/log n). In fact, the complexity is actually O(h n2/log n), where 0 <= h <= 1 is a real number denoting the entropy of the text (a measure of how compressible it is). This means that, the more compressible a sequence is, the less memory the algorithm will require, and the faster it will run.

The idea behind this improvement is to identify repetitions in the sequences and reuse the computation of their alignments. The first step is, therefore, to parse the sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished by the FactorSequence class (for more information about this factorisation, please refer to the specification of that class) that builds a doubly-linked list of factors. Each factor is an instance of the Factor class (refer to the specification of this class for more information).

Once the sequences have been parsed, the algorithm builds a matrix of blocks, called block table, that is vaguely similar to the dynamic programming matrix used by both NeedlemanWunsch and SmithWaterman. Each block contains an instance of an AlignmentBlock (please refer to its specification for more information on what information is stored) and represents the alignment beteween one factor of each sequence. This block table is, in fact, a partition of the alignment graph.

Consider a block B which corresponds to the alignment of factor F1 = xa from sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of S1 with character a, while F2 extends a previous factor of S2 with character b. We can define the input border of B as the set of values at the left and top borders of block B, and the output border as the set of values at the right and bottom borders of B. Moreover, we can define the following prefix blocks of B:

Note that each factor has a pointer to its prefix factor, called ancestor (see the specification of the Factor class). This pointer makes it easy to retrieve any of the three prefix blocks of B in constant time.

Rather than computing each value of the alignment block B, the algorithm will only compute the values on its input and output borders. This is precisely what makes it more efficient.

In this class there is a general specification of how the block table is computed (see the computeBlockTable method for details). The actual method depends on the subclasses. In general, there are two phases:

In fact, for each block, only one column of the DIST matrix needs to be computed, all other columns are actually retrieved from its prefix blocks. This is precisely what is accomplished by the assembleDistMatrix method found in this class (it is general enough for both global and local alignment versions of the algorithm.

From the DIST matrix, we obtain the OUT matrix defined as OUT[i,j] = I[i] + DIST[i,j] where I is the input border array. This means that the OUT matrix is the DIST matrix updated by the input border of a block. The output border is then computed from the OUT matrix by taking the maximum value of each column. This class also have a general method for assembling the input border (see assembleInputBorder

The OUT matrix is encoded by the OutMatrix class that takes as both a DIST matrix and an input border array. Note that it does not compute the OUT matrix, it just stores the necessary information to retrieve a value at any position of the matrix.

A naive approach to compute the output border of a block from the OUT matrix of size n x n would take a time proportional to n2. However, it happens that, due to the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which implies that they are also totally monotone. This property allows the computation of the output border of B in linear time with the SMAWK algorithm (see the specification of the Smawk class for more information on SMAWK).

This class contains a general specification that is pertinent to both global and local versions of the algorithm. For more information on each version of, please refer to the appropriate subclass.

A note about the performance of these algorithms. Although theoretical results suggest that these algorithms are faster and consume less memory than the classical methods, in practice it is hard to realise their performance gains.

These algorithms are extremely complex and require the storage of many extra pointers and other auxiliary data for each block (see the AlignmentBlock class for more details). Hence, even though the space requirement is O(n2/log n), which is less than O(n2), in practice, for most of the cases these algorithms will take more time and memory space than their clasical counterparts (we have to keep in mind that the Big Oh notation ignores all constants involved).

Therefore, in order to realise the full power of these algorithms, they have to be used with extremly large and redundant sequences. This will allow a proper reutilisation of the computations and, maybe, provide an improvement in terms of space and run time. For instance, it is easy to devise such a sequence if we use a one-character alphabet because, in this case, a sequence is factorised into a series of factors that are a prefix of the next.

See Also:
CrochemoreLandauZivUkelsonGlobalAlignment, CrochemoreLandauZivUkelsonLocalAlignment, NeedlemanWunsch, SmithWaterman, FactorSequence, AlignmentBlock, OutMatrix, Smawk, computeBlockTable(), assembleDistMatrix(org.jscience.biology.alignment.AlignmentBlock, int, int, int, int)

Field Summary
protected  AlignmentBlock[][] block_table
          The block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence.
protected static byte DIAGONAL_DIRECTION
          A constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
protected static byte LEFT_DIRECTION
          A constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
protected  int num_cols
          Number of columns of the block table.
protected  int num_rows
          Number of rows of the block table.
protected  OutMatrix out_matrix
          An instance of the OutMatrix class that encodes the OUT matrix of a block when supplied with the DIST matrix and the input border array of a block.
protected  FactorSequence seq1
          The first factorised sequence being aligned.
protected  FactorSequence seq2
          The second factorised sequence being aligned.
protected  Smawk smawk
          An instance of the Smawk class that implements the SMAWK algorithm to compute the column maxima of a totally monotone matrix.
protected static byte STOP_DIRECTION
          A constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop.
protected static byte TOP_DIRECTION
          A constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.
 
Fields inherited from class org.jscience.biology.alignment.PairwiseAlignmentAlgorithm
alignment, APPROXIMATE_MATCH_TAG, GAP_CHARACTER, GAP_TAG, MATCH_TAG, MISMATCH_TAG, score, score_computed, scoring, sequences_loaded, use_match_tag
 
Constructor Summary
CrochemoreLandauZivUkelson()
           
 
Method Summary
protected  int[][] assembleDistMatrix(AlignmentBlock block, int dim, int r, int c, int lc)
          Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks.
protected  int[] assembleInputBorder(int dim, int r, int c, int lr)
          Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks.
protected abstract  PairwiseAlignment buildOptimalAlignment()
          Retrieves an optimal alignment between the loaded sequences.
protected  void computeBlockTable()
          Computes the block table.
protected  PairwiseAlignment computePairwiseAlignment()
          Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to retrieve an optimal alignment between the loaded sequences.
protected  int computeScore()
          Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to locate the score of the highest scoring alignment between the two sequences in the block table.
protected abstract  AlignmentBlock createBlock(Factor factor1, Factor factor2, int row, int col)
          Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2.
protected abstract  AlignmentBlock createFirstColumnBlock(Factor factor1, Factor factor2, int row)
          Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string.
protected abstract  AlignmentBlock createFirstRowBlock(Factor factor1, Factor factor2, int col)
          Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string.
protected abstract  AlignmentBlock createRootBlock(Factor factor1, Factor factor2)
          Computes the root block of the block table.
protected  AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
          This method is a shorthand to retrieve the diagonal prefix of a block from the block table.
protected  AlignmentBlock getLeftPrefix(AlignmentBlock block)
          This method is a shorthand to retrieve the left prefix of a block from the block table.
protected  AlignmentBlock getTopPrefix(AlignmentBlock block)
          This method is a shorthand to retrieve the top prefix of a block from the block table.
protected  void loadSequencesInternal(java.io.Reader input1, java.io.Reader input2)
          Loads sequences into FactorSequence instances.
protected abstract  int locateScore()
          Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed.
protected  void traverseBlock(AlignmentBlock block, int source, java.lang.StringBuffer gapped_seq1, java.lang.StringBuffer tag_line, java.lang.StringBuffer gapped_seq2)
          Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border.
protected  void unloadSequencesInternal()
          Frees pointers to loaded sequences and the the block table so that their data can be garbage collected.
 
Methods inherited from class org.jscience.biology.alignment.PairwiseAlignmentAlgorithm
getPairwiseAlignment, getScore, loadSequences, max, max, max, scoreDeletion, scoreInsertion, scoreSubstitution, setScoringScheme, unloadSequences, useMatchTag
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

STOP_DIRECTION

protected static final byte STOP_DIRECTION
A constant that indicates that the source of an optimal path has been reached in a block and that the trace back procedure to retrieve a high scoring alignment can stop.

See Also:
Constant Field Values

LEFT_DIRECTION

protected static final byte LEFT_DIRECTION
A constant that indicates that the left direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.

See Also:
Constant Field Values

DIAGONAL_DIRECTION

protected static final byte DIAGONAL_DIRECTION
A constant that indicates that the diagonal direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.

See Also:
Constant Field Values

TOP_DIRECTION

protected static final byte TOP_DIRECTION
A constant that indicates that the top direction must be followed to reach the source of an optimal path in a block during the trace back procedure to retrieve a high scoring alignment.

See Also:
Constant Field Values

seq1

protected FactorSequence seq1
The first factorised sequence being aligned.


seq2

protected FactorSequence seq2
The second factorised sequence being aligned.


block_table

protected AlignmentBlock[][] block_table
The block table, which is a matrix of alignment blocks where each block represents the alignment between one factor of each sequence.


num_rows

protected int num_rows
Number of rows of the block table. It is determined by the number of factors of the first sequence.


num_cols

protected int num_cols
Number of columns of the block table. It is determined by the number of factors of the second sequence.


smawk

protected Smawk smawk
An instance of the Smawk class that implements the SMAWK algorithm to compute the column maxima of a totally monotone matrix. It is used to speed up the computation of the output border of a block.


out_matrix

protected OutMatrix out_matrix
An instance of the OutMatrix class that encodes the OUT matrix of a block when supplied with the DIST matrix and the input border array of a block. Note that it does not compute the OUT matrix itselft, it just stores the necessary information to retrieve a value at any position of the matrix.

This object is then used to compute the output border of a block with the Smawk class. Note that the OutMatrix class implements the Matrix interface as required by the Smawk class.

See Also:
Matrix, Smawk
Constructor Detail

CrochemoreLandauZivUkelson

public CrochemoreLandauZivUkelson()
Method Detail

loadSequencesInternal

protected void loadSequencesInternal(java.io.Reader input1,
                                     java.io.Reader input2)
                              throws java.io.IOException,
                                     InvalidSequenceException
Loads sequences into FactorSequence instances. In case of any error, an exception is raised by the constructor of FactorSequence (please check the specification of that class for specific requirements).

A FactorSequence is an LZ78-like factorisation of the sequences being aligned.

Specified by:
loadSequencesInternal in class PairwiseAlignmentAlgorithm
Parameters:
input1 - Input for first sequence
input2 - Input for second sequence
Throws:
java.io.IOException - If an I/O error occurs when reading the sequences
InvalidSequenceException - If the sequences are not valid
See Also:
FactorSequence

unloadSequencesInternal

protected void unloadSequencesInternal()
Frees pointers to loaded sequences and the the block table so that their data can be garbage collected.

Specified by:
unloadSequencesInternal in class PairwiseAlignmentAlgorithm
See Also:
PairwiseAlignmentAlgorithm.unloadSequences()

computePairwiseAlignment

protected PairwiseAlignment computePairwiseAlignment()
                                              throws IncompatibleScoringSchemeException
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to retrieve an optimal alignment between the loaded sequences. The actual product depends on the subclasses which can produce a global (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment (see CrochemoreLandauZivUkelsonLocalAlignment).

Subclasses are required to implement the buildOptimalAlignment abstract method defined by this class according to their own method.

Specified by:
computePairwiseAlignment in class PairwiseAlignmentAlgorithm
Returns:
an optimal alignment between the loaded sequences
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.
See Also:
CrochemoreLandauZivUkelsonGlobalAlignment, CrochemoreLandauZivUkelsonLocalAlignment, computeBlockTable(), buildOptimalAlignment()

computeScore

protected int computeScore()
                    throws IncompatibleScoringSchemeException
Computes the block table (the result depends on subclasses, see computeBlockTable for details) and requests subclasses to locate the score of the highest scoring alignment between the two sequences in the block table. The result depends on the subclasses, and either a global alignment (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment score (see CrochemoreLandauZivUkelsonLocalAlignment) will be produced.

Subclasses are required to implement the locateScore abstract method defined by this class according to their own method.

Note that this method calculates the similarity value only (it doesn't trace back into the block table to retrieve the alignment itself).

Specified by:
computeScore in class PairwiseAlignmentAlgorithm
Returns:
the score of the highest scoring alignment between the loaded sequences
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.
See Also:
CrochemoreLandauZivUkelsonGlobalAlignment, CrochemoreLandauZivUkelsonLocalAlignment, locateScore()

computeBlockTable

protected void computeBlockTable()
                          throws IncompatibleScoringSchemeException
Computes the block table. This method is a general specification of how the block table should be computed. It creates the block table according to the number of factors of each sequence. It then goes over each position of the block table, retrieves the corresponding factors from each sequence, and repasses the information to the subclasses that will do the actual computation of each block using the scoring scheme previously set.

There are four different cases that defines four abstract methods in this class, which subclasses must implement:

Note that each factor has a serial number which indicates its order in the list of factors of a sequence. This number will match with the row and column index of a block in the block table. For instance, if a block has factors F1 and F2 with serial numbers 12 and 53, respectively, this means that this block is found at row 12, column 53 of the block table.

Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.
See Also:
createRootBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor), createFirstColumnBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int), createFirstRowBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int), createBlock(org.jscience.biology.alignment.Factor, org.jscience.biology.alignment.Factor, int, int)

assembleDistMatrix

protected int[][] assembleDistMatrix(AlignmentBlock block,
                                     int dim,
                                     int r,
                                     int c,
                                     int lc)
Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix blocks. In fact, it also stores pointers to the owner block for each column retrieved. These pointers are later used during the trace back procedure that builds an optimal alignment from the information computed in the block table. This method is general enough to suit both global and local alignment versions of the algorithm.

Parameters:
block - the block for which the DIST matrix is needed
dim - the dimension of the DIST matrix
r - the row index of this block in the block table
c - the column index of this block in the block table
lc - the number of columns of the alignment block
Returns:
the DIST matrix

assembleInputBorder

protected int[] assembleInputBorder(int dim,
                                    int r,
                                    int c,
                                    int lr)
Assembles the input border of a block by retrieving the values at the output borders of the left and top blocks. This method is general enough to suit both global and local alignment versions of the algorithm. Note that it can be used to assemble the input border of any block but the root one (it will cause an ArrayIndexOutOfBoundsException.

Parameters:
dim - dimension of the input border
r - row index of the block in the block table
c - column index of the block in the block table
lr - number of row of the block
Returns:
the block's input border

traverseBlock

protected void traverseBlock(AlignmentBlock block,
                             int source,
                             java.lang.StringBuffer gapped_seq1,
                             java.lang.StringBuffer tag_line,
                             java.lang.StringBuffer gapped_seq2)
                      throws IncompatibleScoringSchemeException
Traverses a block to retrieve a part of an optimal alignment from the specified source in the output border to an entry in the input border.

Parameters:
block - the block to be traversed
source - the source of the path in the output border
gapped_seq1 - the StringBuffer to where the gapped sequence 1 is written to
tag_line - the StringBuffer to where the tag_line is written to
gapped_seq2 - the StringBuffer to where the gapped sequence 2 is written to
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

getLeftPrefix

protected AlignmentBlock getLeftPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the left prefix of a block from the block table.

Parameters:
block - the block
Returns:
the block's left prefix

getDiagonalPrefix

protected AlignmentBlock getDiagonalPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the diagonal prefix of a block from the block table.

Parameters:
block - the block
Returns:
the block's diagonal prefix

getTopPrefix

protected AlignmentBlock getTopPrefix(AlignmentBlock block)
This method is a shorthand to retrieve the top prefix of a block from the block table.

Parameters:
block - the block
Returns:
the block's top prefix

createRootBlock

protected abstract AlignmentBlock createRootBlock(Factor factor1,
                                                  Factor factor2)
                                           throws IncompatibleScoringSchemeException
Computes the root block of the block table. See subclasses for actual implementation.

Parameters:
factor1 - the factor of the first sequence being aligned
factor2 - the factor of the second sequence being aligned
Returns:
the root block
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

createFirstRowBlock

protected abstract AlignmentBlock createFirstRowBlock(Factor factor1,
                                                      Factor factor2,
                                                      int col)
                                               throws IncompatibleScoringSchemeException
Computes a block at the first row (row zero) of the block table, which corresponds to an alignment block between one factor of sequence 2 and an empty string. See subclasses for actual implementation.

Parameters:
factor1 - the factor of the first sequence being aligned
factor2 - the factor of the second sequence being aligned
col - the column index of block in the block table
Returns:
the computed block
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

createFirstColumnBlock

protected abstract AlignmentBlock createFirstColumnBlock(Factor factor1,
                                                         Factor factor2,
                                                         int row)
                                                  throws IncompatibleScoringSchemeException
Computes a block at the first column (column zero) of the block table, which corresponds to an alignment block between one factor of sequence 1 and an empty string. See subclasses for actual implementation.

Parameters:
factor1 - the factor of the first sequence being aligned
factor2 - the factor of the second sequence being aligned
row - the row index of block in the block table
Returns:
the computed block
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

createBlock

protected abstract AlignmentBlock createBlock(Factor factor1,
                                              Factor factor2,
                                              int row,
                                              int col)
                                       throws IncompatibleScoringSchemeException
Computes a block of the block table, which corresponds to an alignment block between one factor of sequence 1 and one factor of sequence 2. See subclasses for actual implementation.

Parameters:
factor1 - the factor of the first sequence being aligned
factor2 - the factor of the second sequence being aligned
row - the row index of block in the block table
col - the column index of block in the block table
Returns:
the computed block
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

buildOptimalAlignment

protected abstract PairwiseAlignment buildOptimalAlignment()
                                                    throws IncompatibleScoringSchemeException
Retrieves an optimal alignment between the loaded sequences. See subclasses for actual implementation.

Returns:
the computed block
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.

locateScore

protected abstract int locateScore()
Locates the score of the highest scoring alignment between the two sequences in the block table after is thas been computed. See subclasses for actual implementation.

Returns:
the score of the highest scoring alignment between the loaded sequences
Throws:
IncompatibleScoringSchemeException - If the scoring scheme is not compatible with the loaded sequences.