java.lang.Object
  org.jscience.ml.tigerxml.Corpus
public class Corpus
Represents the corpus including all syntax trees in the TIGER annotation.
A Corpus object contains all data structures of the represented TIGER
corpus, most notably the Sentence objects of the corpus as an ArrayList,
which you can access using getSentences(). You can also access
single Sentences or other elements of the Corpus directly by using the
methods implemented in this class. Sentences in turn have methods for
accessing structural elements such as non-terminal nodes and terminal
nodes.
To obtain a Corpus object, call the constructor
Corpus(String corpusFileName), giving it the name of the TiGerXML
file from which to build the Corpus object.
Sample Usage
    import java.util.*;
    import org.jscience.ml.tigerxml.*;

    public class TestTigerAPI {
        public static void main(String[] args) {
            // Create a Corpus object by parsing the given xml file
            Corpus corpus = new Corpus("sample_TIGER.xml");
            // Use the corpus object to print parts of the structure
            System.out.println("Corpus.getId: " + corpus.getId());
            // All-sentences-loop
            for (int i = 0; i < corpus.getSentenceCount(); i++) {
                Sentence sent = corpus.getSentence(i);
                System.out.println("Sentence ID: " + sent.getId());
                System.out.println("NonTerminals: ");
                // All-NTs-loop
                for (int j = 0; j < sent.getNTCount(); j++) {
                    NT nt = sent.getNT(j);
                    System.out.println("NT ID: " + nt.getId());
                    System.out.println(" CAT: " + nt.getCat());
                    System.out.println(" MOTHER: " + nt.getMother());
                    System.out.println(" Edge2Mother: " + nt.getEdge2Mother());
                } // for j
            } // for i
        } // main
    } // class
See Also:
Sentence, GraphNode, NT, T, Serialized Form

| Constructor Summary | |
|---|---|
| Corpus() | Creates an empty Corpus object. |
| Corpus(org.w3c.dom.Element rootElement) | Creates a Corpus object from a root DOM Element. |
| Corpus(java.lang.String corpusFileName) | Creates a Corpus object and builds all data structures by parsing the given TiGerXML file using a DOM parser. |
| Corpus(java.lang.String corpusFileName, int verbosity) | Creates a Corpus object and builds all data structures by parsing the given TiGerXML file using a DOM parser. |

| Method Summary | | |
|---|---|---|
| void | addAttribute(java.lang.String name, java.lang.String value) | Adds an attribute to this Corpus instance. |
| void | addSentence(Sentence sent) | Appends a given Sentence to this instance's sentence list. |
| boolean | equals(java.lang.Object obj) | Returns true if the object is identical to this Corpus object. |
| java.util.ArrayList | getAllGraphNodes() | Returns all GraphNode objects contained in this corpus. |
| java.util.ArrayList | getAllNTs() | Returns all NT objects contained in this corpus. |
| java.util.ArrayList | getAllTs() | Returns all T objects contained in this corpus. |
| java.lang.String | getAttribute(java.lang.String name) | Returns the value of this Corpus instance's attribute stored under the given name. |
| java.util.ArrayList | getAttributeNames() | Returns all keys in this Corpus instance's attribute hash map as an ArrayList. |
| GraphNode | getGraphNode(java.lang.String id) | Returns the GraphNode which has the given ID. |
| GraphNode | getGraphNodeBySpan(java.lang.String span) | Finds the GraphNode which has the most similar span to the given span. |
| java.lang.String | getId() | Returns the ID of this corpus as parsed from the XML file. |
| int | getNoOfSentences() | Deprecated. As of org.jscience.ml.tigerxml 1.1, use getSentenceCount() instead. |
| NT | getNT(java.lang.String id) | Returns the NT which has the given ID. |
| Sentence | getSentence(int i) | Returns the Sentence object with index i. |
| Sentence | getSentence(java.lang.String id) | Returns the Sentence identified by id. |
| int | getSentenceCount() | Returns the number of Sentence objects in this corpus. |
| java.util.ArrayList | getSentences() | Returns all Sentences of this Corpus as an ArrayList. |
| T | getT(java.lang.String id) | Returns the T which has the given ID. |
| T | getTerminal(java.lang.String id) | Returns the T which has the given ID. |
| java.lang.String | getText() | Returns the whole corpus text as a String. |
| int | getVerbosity() | Gets the currently set level of verbosity of this instance. |
| boolean | hasAttribute(java.lang.String name) | Returns true if there is an attribute "name" in this instance's attribute map. |
| int | hashCode() | Calculates and returns the hash code of this instance as an integer. |
| void | print2xml(java.lang.String xmlFileName) | Prints this corpus to the XML file named xmlFileName. |
| void | print2Xml(java.lang.String xmlFileName, int from, int to) | Prints a range of this corpus to the XML file named xmlFileName. |
| static Corpus | readSerializedFromDisk(java.lang.String fileName) | Reads a previously serialized Corpus instance from disk. |
| void | serializeToDisk(java.lang.String fileName) | Serializes and writes this Corpus instance to disk. |
| void | setHashCode(int code) | Overrides this instance's hash code by setting it to code. |
| void | setId(java.lang.String passId) | Sets the ID of this corpus. |
| void | setVerbosity(int verbosity) | Sets the level of verbosity of this instance. |
| java.lang.String | toString() | Returns the String representation of this Corpus, which is its ID. |

| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public Corpus()
Creates an empty Corpus object. This constructor is only
useful for parsing a corpus file by hand instead of letting the
org.jscience.ml.tigerxml package build it.
Use Corpus(String corpusFile) instead.
public Corpus(java.lang.String corpusFileName)
Creates a Corpus object and builds all data structures
by parsing the given TiGerXML file using a DOM parser.
Parameters:
corpusFileName - The TiGerXML file to be parsed for building
the Corpus.
public Corpus(java.lang.String corpusFileName,
int verbosity)
Creates a Corpus object and builds all data structures
by parsing the given TiGerXML file using a DOM parser. Additionally, the
level of verbosity for this Corpus instance is specified.
Parameters:
corpusFileName - The TiGerXML file to be parsed for building
the Corpus.
verbosity - The higher this value, the more process and debug
information will be written to stderr.
public Corpus(org.w3c.dom.Element rootElement)
Creates a Corpus object from a root DOM Element.
Use this constructor if the XML file itself is already parsed and the
document is available as a DOM object.
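As a minimal sketch of how a root DOM Element might be obtained before it is handed to this constructor, the following uses only the JDK's own DOM parser. The class name DomRootSketch, the helper parseRoot, and the one-element sample document are assumptions made for illustration; the call to the Corpus constructor is left as a comment because it needs the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomRootSketch {

    /** Parses an XML string with the JDK's DOM parser and returns its root element. */
    public static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily reduced document: only the root element with its id.
        Element root = parseRoot("<corpus id=\"sample\"/>");
        System.out.println(root.getTagName() + " id=" + root.getAttribute("id"));
        // With org.jscience.ml.tigerxml available, the element would then be passed on:
        // Corpus corpus = new Corpus(root);
    }
}
```

This route is mainly of interest when the document has already been parsed for another purpose, so the file does not need to be read a second time.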
Parameters:
rootElement - The root Element of the corpus XML document.

| Method Detail |
|---|
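public java.lang.String getId()
Returns the ID of this corpus as parsed from the XML file, that is, the
attribute "id" in the element "corpus" of the corpus document (TiGerXML).
Returns:
The ID of this Corpus.
public void setId(java.lang.String passId)
Sets the ID of this corpus.
Parameters:
passId - The ID of this corpus.
public int getVerbosity()
Gets the currently set level of verbosity of this instance.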
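public void setVerbosity(int verbosity)
Sets the level of verbosity of this instance.
Parameters:
verbosity - The level of verbosity.
public int getSentenceCount()
Returns the number of Sentence objects in this corpus.
Returns:
The size of the ArrayList containing the Sentence objects of this
Corpus object.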
public int getNoOfSentences()
Deprecated. As of org.jscience.ml.tigerxml 1.1, use
getSentenceCount() instead.
Returns:
The size of the ArrayList containing the Sentence objects of this
Corpus object.
public T getTerminal(java.lang.String id)
Returns the T which has the given ID. Returns null if the
search fails. If the sentence of the T is known,
Sentence#getTerminal(String id) can be used to retrieve the
wanted T.
Parameters:
id - The ID of the T to be found.
Returns:
The T that is identified by ID or null if the search fails.
public GraphNode getGraphNode(java.lang.String id)
Returns the GraphNode which has the given ID. Returns null if
the search fails. If the sentence of the GraphNode is known,
Sentence#getGraphNode(String id) can be used to retrieve the
wanted GraphNode.
Parameters:
id - The ID of the GraphNode to be found.
Returns:
The GraphNode that is identified by ID or null if the search fails.
public GraphNode getGraphNodeBySpan(java.lang.String span)
Finds the GraphNode which has the most similar span to
the given span. This method is useful for finding a GraphNode
corresponding to a Markable taken from another annotation of the same
text corpus, for example an MMAX annotation.
The Minimum Edit Distance is used as the measure of similarity.
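To make the similarity measure concrete, here is a minimal sketch of a minimum edit distance over two lists. It is an illustration under stated assumptions, not the library's GeneralTools implementation: the class name MinEditDistanceSketch is hypothetical, every insertion, deletion and substitution is assumed to cost 1, and the string IDs standing in for the elements of a span are invented for the demo.

```java
import java.util.Arrays;
import java.util.List;

public class MinEditDistanceSketch {

    /**
     * Levenshtein distance between two lists, assuming a cost of 1 for each
     * insertion, deletion and substitution (the library's costs may differ).
     */
    public static <E> int minEditDistance(List<E> a, List<E> b) {
        int[] prev = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.size(); i++) {
            int[] curr = new int[b.size() + 1];
            curr[0] = i; // turning a[0..i) into an empty list takes i deletions
            for (int j = 1; j <= b.size(); j++) {
                int substitution = prev[j - 1] + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            prev = curr;
        }
        return prev[b.size()];
    }

    public static void main(String[] args) {
        // Hypothetical terminal IDs covered by the wanted span and by two candidate nodes.
        List<String> wanted = Arrays.asList("t1", "t2", "t3");
        List<String> candidateA = Arrays.asList("t1", "t2", "t3", "t4");
        List<String> candidateB = Arrays.asList("t5", "t6");
        System.out.println(minEditDistance(wanted, candidateA)); // prints 1
        System.out.println(minEditDistance(wanted, candidateB)); // prints 3
    }
}
```

Under these assumptions the candidate with the smaller distance (candidateA) would be the closer approximation of the wanted span.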
Parameters:
span - The span to be approximated by the returned GraphNode.
Returns:
The GraphNode that most closely approximates the given span.
See Also:
org.jscience.ml.tigerxml.tools.GeneralTools#minEditDistance(ArrayList listA, ArrayList listB)
public NT getNT(java.lang.String id)
Returns the NT which has the given ID. Returns
null if the search fails. If the sentence of the
NT is known,
Sentence#getNT(String id) can be used to retrieve the
wanted NT. This might save some runtime.
Parameters:
id - The ID of the NT to be found.
Returns:
The NT that is identified by ID or null if the search fails.
public java.util.ArrayList getAllNTs()
Returns all NT objects contained in this corpus. The returned
NTs are in the order of the XML corpus file. To have the list ordered
by linear precedence, use
org.jscience.ml.tigerxml.tools.GeneralTools#sortNodes(ArrayList nodes).
Returns:
All NT objects of this Corpus.
public java.util.ArrayList getAllTs()
Returns all T objects contained in this corpus. The returned
Ts are in the order of the XML corpus file. To have the list ordered
by linear precedence, use
org.jscience.ml.tigerxml.tools.GeneralTools#sortTerminals(ArrayList unsortedTerminals).
Returns:
All T objects of this Corpus.
public java.util.ArrayList getAllGraphNodes()
Returns all GraphNode objects contained in this corpus. The
returned GraphNodes are in the order of the XML corpus file. To have
the list ordered by linear precedence, use
org.jscience.ml.tigerxml.tools.GeneralTools#sortNodes(ArrayList nodes).
The returned list does not contain the VROOT.
Ordering by class:
All NT objects of the corpus are followed by all
T objects of the corpus.
Returns:
All GraphNode objects of this Corpus.
public T getT(java.lang.String id)
Returns the T which has the given ID. Returns null if the
search fails. If the sentence of the T is known,
Sentence#getTerminal(String id) can be used to retrieve the
wanted T.
Parameters:
id - The ID of the T to be found.
Returns:
The T that is identified by ID or null if the search fails.
public java.util.ArrayList getSentences()
Returns all Sentences of this Corpus as an ArrayList.
public Sentence getSentence(int i)
Returns the Sentence object with index i.
Sentence indices start with index 0 in the Corpus.
Parameters:
i - The index of the Sentence.
Returns:
The Sentence with index i.
public Sentence getSentence(java.lang.String id)
Returns the Sentence identified by id, or null if
look-up fails.
Parameters:
id - The id (String) of the Sentence.
public void addSentence(Sentence sent)
Appends a given Sentence to this instance's sentence list.
Parameters:
sent - The Sentence to be appended.
public void addAttribute(java.lang.String name,
java.lang.String value)
Adds an attribute to this Corpus instance, for example
the ID.
Parameters:
name - The name of the attribute.
value - The value of the attribute.
public java.lang.String getAttribute(java.lang.String name)
Returns the value of this Corpus instance's attribute stored
under the given name.
Parameters:
name - The name of the attribute.
public java.util.ArrayList getAttributeNames()
Returns all keys in this Corpus instance's attribute hash map
as an ArrayList.
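The attribute methods of this class read like a thin layer over a hash map keyed by attribute name. The following sketch shows that pattern with a hypothetical stand-in class, AttributeMapSketch; it is not the Corpus implementation, and in particular the null result for an unknown name is the behavior of java.util.HashMap, which this documentation does not confirm for Corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class AttributeMapSketch {

    // Attribute values keyed by attribute name.
    private final Map<String, String> attributes = new HashMap<>();

    /** Stores value under name, replacing any earlier value for that name. */
    public void addAttribute(String name, String value) {
        attributes.put(name, value);
    }

    /** Returns the stored value, or null if name is unknown (HashMap behavior). */
    public String getAttribute(String name) {
        return attributes.get(name);
    }

    /** Returns true if a value is stored under name. */
    public boolean hasAttribute(String name) {
        return attributes.containsKey(name);
    }

    /** Returns a copy of all attribute names; the order is unspecified. */
    public ArrayList<String> getAttributeNames() {
        return new ArrayList<>(attributes.keySet());
    }

    public static void main(String[] args) {
        AttributeMapSketch holder = new AttributeMapSketch();
        holder.addAttribute("id", "sample");
        System.out.println(holder.hasAttribute("id"));      // prints true
        System.out.println(holder.getAttribute("id"));      // prints sample
        System.out.println(holder.hasAttribute("version")); // prints false
    }
}
```

Checking for presence before reading a value avoids having to interpret a possible null result.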
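public boolean hasAttribute(java.lang.String name)
Returns true if there is an attribute "name" in this
instance's attribute map.
Parameters:
name - The name of the attribute to be checked.
Returns:
true if there is an attribute "name".
public java.lang.String getText()
Returns the whole corpus text as a String.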
public java.lang.String toString()
Returns the String representation of this Corpus, which is its ID.
Overrides:
toString in class java.lang.Object
See Also:
getId()
public void serializeToDisk(java.lang.String fileName)
Serializes and writes this Corpus instance to disk.
This is useful for large corpora that take very long to build or
where building consumes much memory. The written object can be
loaded using the static method
readSerializedFromDisk(String fileName) like this:
Corpus corpus = Corpus.readSerializedFromDisk("corp.obj");
Note that loading a serialized Corpus instance is not
necessarily faster than parsing the corresponding TIGER-XML file, but
it consumes only about half of the memory it would take to parse the
TIGER-XML file. Besides, it is conceivable that there are operations
on parsed Corpus instances which consume more resources
than loading a previously built and serialized Corpus instance.
Parameters:
fileName - The name of the file where the serialized
Corpus object will be stored.
public static Corpus readSerializedFromDisk(java.lang.String fileName)
Reads a previously serialized Corpus instance from disk.
This is useful for large corpora that take very long to parse or
consume much memory. The serialized object file can be written using
serializeToDisk(String fileName).
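The write-then-read round trip can be sketched with plain Java object serialization. This is only an illustration of the general mechanism, assuming the library relies on java.io serialization (the class has a Serialized Form); StandInCorpus, roundTrip and the temporary file are hypothetical names introduced for the demo, and the two static helpers merely mirror the method names documented here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

public class SerializationSketch {

    /** Hypothetical stand-in for Corpus with just enough state for the demo. */
    public static class StandInCorpus implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String id;
        public final ArrayList<String> sentenceIds = new ArrayList<>();

        public StandInCorpus(String id) {
            this.id = id;
        }
    }

    /** Writes the object graph rooted at corpus to the named file. */
    public static void serializeToDisk(StandInCorpus corpus, String fileName) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName))) {
            out.writeObject(corpus);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads an object previously written by serializeToDisk. */
    public static StandInCorpus readSerializedFromDisk(String fileName) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName))) {
            return (StandInCorpus) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes corpus to a temporary file and reads it back. */
    public static StandInCorpus roundTrip(StandInCorpus corpus) {
        try {
            File file = File.createTempFile("corp", ".obj");
            file.deleteOnExit();
            serializeToDisk(corpus, file.getPath());
            return readSerializedFromDisk(file.getPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StandInCorpus original = new StandInCorpus("sample");
        original.sentenceIds.add("s1");
        StandInCorpus copy = roundTrip(original);
        System.out.println(copy.id + " " + copy.sentenceIds); // prints sample [s1]
    }
}
```

The object read back is a new instance with the same field values, not the original object.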
Parameters:
fileName - The name of the file where the serialized Corpus object is
stored.
Returns:
The Corpus as deserialized from disk.
public boolean equals(java.lang.Object obj)
Returns true if the object is identical to this Corpus
object. Identity is determined by comparing the corpus IDs.
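ID-based identity can be sketched as follows. StandInCorpus is a hypothetical stand-in, not the library class, and the way the default hash code is derived from the ID here is an assumption; the documentation only states that equals compares the corpus IDs and that setHashCode replaces the hash code.

```java
import java.util.Objects;

public class IdEqualitySketch {

    /** Hypothetical stand-in whose identity is defined by its ID alone. */
    public static class StandInCorpus {
        private final String id;
        private Integer hashOverride; // stays null until setHashCode is called

        public StandInCorpus(String id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof StandInCorpus)) {
                return false;
            }
            return Objects.equals(id, ((StandInCorpus) obj).id);
        }

        @Override
        public int hashCode() {
            // Assumption: derive the default from the ID so equal objects share a hash code.
            return hashOverride != null ? hashOverride : Objects.hashCode(id);
        }

        public void setHashCode(int code) {
            hashOverride = code;
        }
    }

    public static void main(String[] args) {
        StandInCorpus a = new StandInCorpus("sample");
        StandInCorpus b = new StandInCorpus("sample");
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
        a.setHashCode(42);
        System.out.println(a.hashCode());                 // prints 42
    }
}
```

Once one of two equal objects has its hash code overridden, the two may no longer share a hash code, which hash-based collections such as HashMap and HashSet rely on.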
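Overrides:
equals in class java.lang.Object
Parameters:
obj - The Java Object to which this is to be compared.
See Also:
getId()
public int hashCode()
Calculates and returns the hash code of this instance as an integer.
Overrides:
hashCode in class java.lang.Object
public void setHashCode(int code)
Overrides this instance's hash code by setting it to code.
Parameters:
code - The new hash code.
public void print2xml(java.lang.String xmlFileName)
Prints this corpus to the XML file named xmlFileName.
Parameters:
xmlFileName - The name of the XML file to be written.
public void print2Xml(java.lang.String xmlFileName,
int from,
int to)
Prints a range of this corpus to the XML file named xmlFileName.
Parameters:
xmlFileName - The name of the XML file to be written.
from - Starting from the sentence with index from.
to - Ending with the sentence with index to.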