org.jscience.ml.tigerxml
Class Corpus

java.lang.Object
  extended by org.jscience.ml.tigerxml.Corpus
All Implemented Interfaces:
java.io.Serializable

public class Corpus
extends java.lang.Object
implements java.io.Serializable

Represents the corpus including all syntax trees in the TIGER annotation. Corpus objects contain all data structures contained in the represented TIGER corpus.

There is one instance of this class for each corpus. It contains all Sentence objects of the corpus as an ArrayList, which you can access using getSentences(). You can also access single Sentences directly or other elements of the Corpus by using the methods implemented in this class. Sentences again have methods for accessing structural elements such as non-terminal nodes and terminal nodes.

To obtain a Corpus object, call the constructor Corpus(String corpusFileName), giving it the name of the TiGerXML file from which to build the Corpus object.

Sample Usage

 import java.util.*;

 import org.jscience.ml.tigerxml.*;

 public class TestTigerAPI {

     public static void main(String[] args) {

         // Create a Corpus object by parsing the given XML file
         Corpus corpus = new Corpus("sample_TIGER.xml");

         // Use the corpus object to print parts of the structure
         System.out.println("Corpus.getId: " + corpus.getId());

         // All-sentences loop
         for (int i = 0; i < corpus.getSentenceCount(); i++) {
             Sentence sent = corpus.getSentence(i);
             System.out.println("Sentence ID: " + sent.getId());
             System.out.println("NonTerminals: ");

             // All-NTs loop
             for (int j = 0; j < sent.getNTCount(); j++) {
                 NT nt = sent.getNT(j);
                 System.out.println("NT ID: " + nt.getId());
                 System.out.println(" CAT: " + nt.getCat());
                 System.out.println(" MOTHER: " + nt.getMother());
                 System.out.println(" Edge2Mother: " + nt.getEdge2Mother());
             } // for j
         } // for i
     } // main
 } // class

See Also:
Sentence, GraphNode, NT, T, Serialized Form

Constructor Summary
Corpus()
          Creates an empty Corpus object.
Corpus(org.w3c.dom.Element rootElement)
          Creates a Corpus object from a root DOM Element.
Corpus(java.lang.String corpusFileName)
          Creates a Corpus object and builds all data structures parsing the given TiGerXML file using a DOM parser.
Corpus(java.lang.String corpusFileName, int verbosity)
          Creates a Corpus object and builds all data structures parsing the given TiGerXML file using a DOM parser.
 
Method Summary
 void addAttribute(java.lang.String name, java.lang.String value)
          Add an attribute to this Corpus instance.
 void addSentence(Sentence sent)
          Appends a given Sentence to this instance's sentence list.
 boolean equals(java.lang.Object obj)
          Returns true if the object is identical to this Corpus object.
 java.util.ArrayList getAllGraphNodes()
          Returns all GraphNode objects contained in this corpus.
 java.util.ArrayList getAllNTs()
          Returns all NT objects contained in this corpus.
 java.util.ArrayList getAllTs()
          Returns all T objects contained in this corpus.
 java.lang.String getAttribute(java.lang.String name)
          Returns the value of this Corpus instance's attribute stored under the given name.
 java.util.ArrayList getAttributeNames()
          Returns all keys in this Corpus instance's attribute hash map as an ArrayList.
 GraphNode getGraphNode(java.lang.String id)
          Returns the GraphNode which has the given ID.
 GraphNode getGraphNodeBySpan(java.lang.String span)
          Finds the GraphNode which has the most similar span to the given span.
 java.lang.String getId()
          Returns the ID of this corpus as parsed from the XML file.
 int getNoOfSentences()
          Deprecated. As of org.jscience.ml.tigerxml 1.1 - use getSentenceCount() instead.
 NT getNT(java.lang.String id)
          Returns the NT which has the given ID.
 Sentence getSentence(int i)
          Returns the Sentence object with index i.
 Sentence getSentence(java.lang.String id)
          Returns the Sentence identified by id.
 int getSentenceCount()
          Returns the number of Sentence objects in this corpus.
 java.util.ArrayList getSentences()
          Returns all Sentences of this Corpus as an ArrayList.
 T getT(java.lang.String id)
          Returns the T which has the given ID.
 T getTerminal(java.lang.String id)
          Returns the T which has the given ID.
 java.lang.String getText()
          Returns the whole corpus text as a String.
 int getVerbosity()
          Gets the currently set level of verbosity of this instance.
 boolean hasAttribute(java.lang.String name)
          Returns true if there is an attribute "name" in this instance's attribute map.
 int hashCode()
          Calculates and returns the hash code of this instance as an integer.
 void print2xml(java.lang.String xmlFileName)
          Prints this corpus to the XML file named xmlFileName.
 void print2Xml(java.lang.String xmlFileName, int from, int to)
          Prints a range of this corpus to the XML file named xmlFileName.
static Corpus readSerializedFromDisk(java.lang.String fileName)
          Reads a previously serialized Corpus instance from disk.
 void serializeToDisk(java.lang.String fileName)
          Serializes and writes this Corpus instance to disk.
 void setHashCode(int code)
          Overrides this instance's hash code by setting it to code.
 void setId(java.lang.String passId)
          Sets the ID of this corpus.
 void setVerbosity(int verbosity)
          Sets the level of verbosity of this instance.
 java.lang.String toString()
          Returns the String representation of this Corpus - the ID.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Corpus

public Corpus()
Creates an empty Corpus object. This constructor is only useful if you parse a corpus file by hand instead of letting org.jscience.ml.tigerxml build the Corpus.

In most cases, use Corpus(String corpusFileName) instead.


Corpus

public Corpus(java.lang.String corpusFileName)
Creates a Corpus object and builds all data structures parsing the given TiGerXML file using a DOM parser.

Parameters:
corpusFileName - The TiGerXML file to be parsed for building the Corpus.

Corpus

public Corpus(java.lang.String corpusFileName,
              int verbosity)
Creates a Corpus object and builds all data structures parsing the given TiGerXML file using a DOM parser. Additionally, the level of verbosity for this Corpus instance is specified.

Parameters:
corpusFileName - The TiGerXML file to be parsed for building the Corpus.
verbosity - The higher this value, the more process and debug information will be written to stderr.

Corpus

public Corpus(org.w3c.dom.Element rootElement)
Creates a Corpus object from a root DOM Element. Use this constructor if the XML file itself is already parsed and the document is available as a DOM object.

Parameters:
rootElement - The root Element of the corpus XML document
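
The following is a minimal sketch of how a root Element might be obtained with the standard JAXP DOM parser before it is handed to this constructor. The class name RootElementSketch, the helper parseRoot, and the trimmed XML string are assumptions made for illustration; a real TiGerXML document contains far more structure, and the call to the Corpus constructor is left as a comment because it needs the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RootElementSketch {

    /** Parses an XML string with the JAXP DOM parser and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed stand-in for a TiGerXML document.
        String xml = "<corpus id=\"demo\"><body/></corpus>";
        Element root = parseRoot(xml);

        // The corpus ID lives in the "id" attribute of the root element.
        System.out.println(root.getTagName() + " id=" + root.getAttribute("id"));

        // With org.jscience.ml.tigerxml available, the element could be passed on:
        // Corpus corpus = new Corpus(root);
    }
}
```

When the document is read from a file rather than a string, DocumentBuilder.parse(File) can be used in the same way.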
Method Detail

getId

public java.lang.String getId()
Returns the ID of this corpus as parsed from the XML file. The ID of the corpus is specified by the attribute "id" in the element "corpus" of the corpus document (TiGerXML).

Returns:
The ID (String) of this Corpus.

setId

public void setId(java.lang.String passId)
Sets the ID of this corpus.

Parameters:
passId - The ID of this corpus.

getVerbosity

public int getVerbosity()
Gets the currently set level of verbosity of this instance. The higher the value the more information is written to stderr.

Returns:
The level of verbosity.

setVerbosity

public void setVerbosity(int verbosity)
Sets the level of verbosity of this instance. The higher the value, the more information is written to stderr:
0: Only error messages
1: + Basic progress information and warnings
2: + More progress information
3: + Time stats
4: + Detailed progress information
5: + Debugging messages

Parameters:
verbosity - The level of verbosity.
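
The levels are cumulative: each level adds output to everything the lower levels already print. A minimal sketch of such threshold-based gating follows. It is an assumption about how a cumulative scheme can be implemented, not the library's actual code, and the names VerbositySketch, shouldLog, and log are hypothetical.

```java
class VerbositySketch {

    // Hypothetical constants mirroring the documented levels 0 to 5.
    static final int ERRORS = 0;
    static final int BASIC_PROGRESS = 1;
    static final int MORE_PROGRESS = 2;
    static final int TIME_STATS = 3;
    static final int DETAILED_PROGRESS = 4;
    static final int DEBUG = 5;

    /** A message is emitted when its level does not exceed the configured verbosity. */
    static boolean shouldLog(int verbosity, int messageLevel) {
        return messageLevel <= verbosity;
    }

    /** Writes the message to stderr if the configured verbosity allows it. */
    static void log(int verbosity, int messageLevel, String message) {
        if (shouldLog(verbosity, messageLevel)) {
            System.err.println(message);
        }
    }

    public static void main(String[] args) {
        int verbosity = BASIC_PROGRESS;
        log(verbosity, ERRORS, "error: always shown");
        log(verbosity, BASIC_PROGRESS, "warning: shown at level 1 and above");
        log(verbosity, TIME_STATS, "time stats: suppressed at level 1");
    }
}
```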

getSentenceCount

public int getSentenceCount()
Returns the number of Sentence objects in this corpus. The returned number is equal to the length of the ArrayList containing the Sentence objects of this Corpus object.

Returns:
An integer denoting the number of sentences in this corpus.

getNoOfSentences

public int getNoOfSentences()
Deprecated. As of org.jscience.ml.tigerxml 1.1 - use getSentenceCount() instead.

Returns the number of Sentence objects in this corpus. The returned number is equal to the length of the ArrayList containing the Sentence objects of this Corpus object.

Returns:
An integer denoting the number of sentences in this corpus.

getTerminal

public T getTerminal(java.lang.String id)
Returns the T which has the given ID. Returns null if the search fails. If the sentence of the T is known, Sentence#getTerminal(String id) can be used to retrieve the wanted T.

Parameters:
id - The ID of the T to be found.
Returns:
The T that is identified by ID or null if the search fails.

getGraphNode

public GraphNode getGraphNode(java.lang.String id)
Returns the GraphNode which has the given ID. Returns null if the search fails. If the sentence of the GraphNode is known, Sentence#getGraphNode(String id) can be used to retrieve the wanted GraphNode.

Parameters:
id - The ID of the GraphNode to be found.
Returns:
The GraphNode that is identified by ID or null if the search fails.

getGraphNodeBySpan

public GraphNode getGraphNodeBySpan(java.lang.String span)
Finds the GraphNode whose span is most similar to the given span. This method is useful for finding the GraphNode that corresponds to a Markable from another annotation of the same text corpus, for example an MMAX annotation.

As a measure for similarity the Minimum Edit Distance is used.

Parameters:
span - The span to be approximated by the returned GraphNode.
Returns:
The GraphNode approximating the given span the closest.
See Also:
org.jscience.ml.tigerxml.tools.GeneralTools#minEditDistance(ArrayList listA, ArrayList listB)
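
To illustrate the similarity measure, here is a sketch of the textbook minimum edit distance (Levenshtein distance) over two lists of elements, with unit cost for insertion, deletion, and substitution. This is not the code of GeneralTools#minEditDistance; the cost scheme the library uses is not stated here and may differ. The class name EditDistanceSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class EditDistanceSketch {

    /** Levenshtein distance between two lists, using unit edit costs. */
    static int minEditDistance(List<?> a, List<?> b) {
        // prev[j] holds the distance between the first i-1 items of a and the first j items of b.
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j; // turning an empty prefix into j items takes j insertions
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i; // turning i items into an empty prefix takes i deletions
            for (int j = 1; j <= b.size(); j++) {
                int substitution = Objects.equals(a.get(i - 1), b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.size()];
    }

    public static void main(String[] args) {
        // Hypothetical spans, written as lists of terminal IDs.
        List<String> given = Arrays.asList("s1_1", "s1_2", "s1_3");
        List<String> candidate = Arrays.asList("s1_1", "s1_3");
        System.out.println(minEditDistance(given, candidate));
    }
}
```

Under this measure, the candidate node with the smallest distance to the given span would be the closest approximation.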

getNT

public NT getNT(java.lang.String id)
Returns the NT which has the given ID. Returns null if the search fails. If the sentence of the NT is known, Sentence#getNT(String id) can be used to retrieve the wanted NT. This might save some runtime.

Parameters:
id - The ID of the NT to be found.
Returns:
The NT that is identified by ID or null if the search fails.

getAllNTs

public java.util.ArrayList getAllNTs()
Returns all NT objects contained in this corpus. The returned NTs are in the order of the XML corpus file. To have the list ordered by linear precedence, use org.jscience.ml.tigerxml.tools.GeneralTools#sortNodes(ArrayList nodes).

Returns:
All NTs contained in this Corpus.

getAllTs

public java.util.ArrayList getAllTs()
Returns all T objects contained in this corpus. The returned Ts are in the order of the XML corpus file. To have the list ordered by linear precedence, use org.jscience.ml.tigerxml.tools.GeneralTools#sortTerminals(ArrayList unsortedTerminals).

Returns:
All Ts contained in this Corpus.

getAllGraphNodes

public java.util.ArrayList getAllGraphNodes()
Returns all GraphNode objects contained in this corpus. The returned GraphNodes are in the order of the XML corpus file. To have the list ordered by linear precedence, use org.jscience.ml.tigerxml.tools.GeneralTools#sortNodes(ArrayList nodes).

The returned list does not contain the VROOT.

Ordering by class:
All NT objects of the corpus are followed by all T objects of the corpus.

Returns:
All GraphNodes contained in this Corpus.

getT

public T getT(java.lang.String id)
Returns the T which has the given ID. Returns null if the search fails. If the sentence of the T is known, Sentence#getTerminal(String id) can be used to retrieve the wanted T.

Parameters:
id - The ID of the T to be found.
Returns:
The T that is identified by ID or null if the search fails.

getSentences

public java.util.ArrayList getSentences()
Returns all Sentences of this Corpus as an ArrayList.

Returns:
All Sentences of this Corpus as an ArrayList.

getSentence

public Sentence getSentence(int i)
Returns the Sentence object with index i. Sentence indices start with index 0 in the Corpus.

Parameters:
i - The index of the Sentence.
Returns:
The Sentence with index i.

getSentence

public Sentence getSentence(java.lang.String id)
Returns the Sentence identified by id. Returns null if look-up fails.

Parameters:
id - The id (String) of the Sentence.
Returns:
The Sentence with ID id. Returns null if look-up fails.

addSentence

public void addSentence(Sentence sent)
Appends a given Sentence to this instance's sentence list.

Parameters:
sent - The Sentence to be appended.

addAttribute

public void addAttribute(java.lang.String name,
                         java.lang.String value)
Adds an attribute to this Corpus instance, for example the ID.

Parameters:
name - The name of the attribute.
value - The value of the attribute.

getAttribute

public java.lang.String getAttribute(java.lang.String name)
Returns the value of this Corpus instance's attribute stored under the given name.

Parameters:
name - The name of the attribute.
Returns:
The value of the requested attribute.

getAttributeNames

public java.util.ArrayList getAttributeNames()
Returns all keys in this Corpus instance's attribute hash map as an ArrayList.

Returns:
All keys of the attribute map.

hasAttribute

public boolean hasAttribute(java.lang.String name)
Returns true if there is an attribute "name" in this instance's attribute map.

Parameters:
name - The name of the attribute to be checked.
Returns:
True if there is an attribute "name".
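
The four attribute methods behave like a thin wrapper around a hash map from attribute names to values. The sketch below shows that pattern with a stand-in class; AttributeStoreSketch is hypothetical and is not the Corpus implementation. In particular, returning null for an unknown name is simply what java.util.HashMap does and is an assumption about Corpus, since the behaviour for missing attributes is not documented here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class AttributeStoreSketch {

    // Attribute names mapped to their values.
    private final Map<String, String> attributes = new HashMap<>();

    /** Stores value under name, replacing any earlier value. */
    public void addAttribute(String name, String value) {
        attributes.put(name, value);
    }

    /** Returns the stored value, or null if name is unknown (assumed behaviour). */
    public String getAttribute(String name) {
        return attributes.get(name);
    }

    /** Returns true if an attribute with this name has been added. */
    public boolean hasAttribute(String name) {
        return attributes.containsKey(name);
    }

    /** Returns a copy of all attribute names; HashMap gives no ordering guarantee. */
    public ArrayList<String> getAttributeNames() {
        return new ArrayList<>(attributes.keySet());
    }

    public static void main(String[] args) {
        AttributeStoreSketch store = new AttributeStoreSketch();
        store.addAttribute("id", "demo");
        if (store.hasAttribute("id")) {
            System.out.println("id = " + store.getAttribute("id"));
        }
        System.out.println("names: " + store.getAttributeNames());
    }
}
```

Checking hasAttribute before getAttribute avoids depending on what is returned for a missing name.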

getText

public java.lang.String getText()
Returns the whole corpus text as a String. Note that punctuation marks are treated as words - there is a space after each punctuation mark.

Returns:
The corpus text as a String.
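
Because punctuation marks are treated as words, the returned text looks tokenized rather than typeset. The sketch below shows what that means for a small list of tokens. It is an illustration only: the class TextJoinSketch is hypothetical, and the exact whitespace produced by getText(), such as a trailing space after the last token, is an assumption that is not confirmed here.

```java
import java.util.Arrays;
import java.util.List;

class TextJoinSketch {

    /** Joins word and punctuation tokens with single spaces. */
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Punctuation marks are separate tokens, just like words.
        List<String> tokens = Arrays.asList("Hello", ",", "world", ".");
        System.out.println(joinTokens(tokens));
    }
}
```

Code that compares the corpus text with untokenized text therefore needs to normalize the spacing around punctuation first.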

toString

public java.lang.String toString()
Returns the String representation of this Corpus - the ID.

Overrides:
toString in class java.lang.Object
Returns:
The ID as a String object.
See Also:
getId()

serializeToDisk

public void serializeToDisk(java.lang.String fileName)
Serializes and writes this Corpus instance to disk. This is useful for large corpora that take very long to build or where building consumes much memory. The written object can be loaded using the static method readSerializedFromDisk(String fileName) like this:
 Corpus corpus = Corpus.readSerializedFromDisk("corp.obj");
 
Note that loading a serialized Corpus instance is not necessarily faster than parsing the corresponding TIGER-XML file, but it consumes only about half of the memory it would take to parse the TIGER-XML file. Besides, it is conceivable that there are operations on parsed Corpus instances which consume more resources than loading a previously built and serialized Corpus instance.

Parameters:
fileName - The name of the file where the serialized Corpus object will be stored.

readSerializedFromDisk

public static Corpus readSerializedFromDisk(java.lang.String fileName)
Reads a previously serialized Corpus instance from disk. This is useful for large corpora that take very long to parse or consume much memory. The serialized object file can be written using serializeToDisk(String fileName).

Parameters:
fileName - The name of the file where the serialized Corpus object is stored.
Returns:
An instance of Corpus as deserialized from disk.
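
Since Corpus implements java.io.Serializable, this pair of methods presumably builds on standard Java object serialization. The sketch below shows a write and read round trip with ObjectOutputStream and ObjectInputStream, using a small stand-in class instead of Corpus. The names SerializationSketch, Doc, and roundTrip are hypothetical, and the library's actual stream handling and error reporting may differ.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {

    /** Stand-in for a serializable corpus; only carries an ID. */
    static class Doc implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        Doc(String id) { this.id = id; }
    }

    /** Writes the object to the named file using Java object serialization. */
    static void serializeToDisk(Serializable obj, String fileName) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(fileName)))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back an object previously written by serializeToDisk. */
    static Object readSerializedFromDisk(String fileName) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(fileName)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of serialized object not found", e);
        }
    }

    /** Serializes a Doc to a temporary file, reads it back, and returns its ID. */
    static String roundTrip(String id) {
        try {
            File tmp = File.createTempFile("corpus", ".obj");
            try {
                serializeToDisk(new Doc(id), tmp.getPath());
                return ((Doc) readSerializedFromDisk(tmp.getPath())).id;
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("demo"));
    }
}
```

A serialized file is tied to the class versions that wrote it, so a file written by one version of a library may not load with another.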

equals

public boolean equals(java.lang.Object obj)
Returns true if the object is identical to this Corpus object. Identity is determined by comparing the corpus IDs.

Overrides:
equals in class java.lang.Object
Parameters:
obj - The Java Object to which this Corpus is to be compared.
Returns:
True if the corpora are identical.
See Also:
getId()
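
ID-based equality can be sketched as follows. The stand-in class IdEqualitySketch is hypothetical and not the Corpus source. It also derives the hash code from the ID, which is an assumption: the documentation does not say how Corpus calculates its hash code, but Java's general contract requires equal objects to return equal hash codes.

```java
import java.util.Objects;

class IdEqualitySketch {

    private final String id;

    IdEqualitySketch(String id) {
        this.id = id;
    }

    /** Two instances are equal when their IDs are equal. */
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof IdEqualitySketch)) {
            return false;
        }
        return Objects.equals(id, ((IdEqualitySketch) obj).id);
    }

    /** Derived from the ID so that equal instances share a hash code. */
    @Override
    public int hashCode() {
        return Objects.hashCode(id);
    }

    public static void main(String[] args) {
        IdEqualitySketch a = new IdEqualitySketch("corpus1");
        IdEqualitySketch b = new IdEqualitySketch("corpus1");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

Note that two corpora with the same ID but different contents would compare as equal under this scheme.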

hashCode

public int hashCode()
Calculates and returns the hash code of this instance as an integer.

Overrides:
hashCode in class java.lang.Object
Returns:
The hash code of this instance as an integer.

setHashCode

public void setHashCode(int code)
Overrides this instance's hash code by setting it to code.

Parameters:
code - The new hash code.

print2xml

public void print2xml(java.lang.String xmlFileName)
Prints this corpus to the XML file named xmlFileName.

Parameters:
xmlFileName - The name of the XML file to be written.

print2Xml

public void print2Xml(java.lang.String xmlFileName,
                      int from,
                      int to)
Prints a range of this corpus to the XML file named xmlFileName.

Parameters:
xmlFileName - The name of the XML file to be written.
from - The index of the first sentence to be written.
to - The index of the last sentence to be written.
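
The parameter descriptions suggest that both bounds are inclusive, that is, the sentence with index to is still written. The sketch below selects such a range from a list under that assumption; it is not the library's code, and the class RangeSketch and the method inclusiveRange are hypothetical. Note the contrast with List.subList, whose upper bound is exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeSketch {

    /** Returns the items with indices from..to, both bounds inclusive. */
    static <T> List<T> inclusiveRange(List<T> items, int from, int to) {
        if (from < 0 || to >= items.size() || from > to) {
            throw new IndexOutOfBoundsException(
                "Invalid range " + from + ".." + to + " for size " + items.size());
        }
        // subList excludes its upper bound, hence to + 1.
        return new ArrayList<>(items.subList(from, to + 1));
    }

    public static void main(String[] args) {
        // Hypothetical sentence IDs standing in for Sentence objects.
        List<String> sentences = Arrays.asList("s0", "s1", "s2", "s3");
        System.out.println(inclusiveRange(sentences, 1, 2));
    }
}
```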