org.jscience.net
Class Spider

java.lang.Object
  extended by org.jscience.net.Spider

public final class Spider
extends java.lang.Object

Spider provides several useful static functions for accessing web content and parsing HTML, most of them based on a simple URL. Note that because this class uses functionality from the javax.swing package (although no GUIs are used in this class), non-terminating javax.swing threads are created when this class is used. That is, an application using this class without any other javax.swing GUI components may end up with unwanted non-terminated threads, possibly forcing a call to e.g. System.exit(0) to terminate an otherwise simple program. Most methods are synchronized, so do not expect longer-running methods (such as retrieving the links from a URL) to run simultaneously on the same Spider object.

See Also:
CrawlerSetting, URLCache

Nested Class Summary
static class Spider.SMonitor
          Deprecated.  
static class Spider.URLWrapper
          wraps a java.net.URL and keeps a reference to its referrer
 
Constructor Summary
Spider()
          convenience constructor that initializes the Spider with a null URL
Spider(java.net.URL url)
          constructs a Spider object based on the given URL
 
Method Summary
 int calculatePageWeight()
          returns the page weight in bytes (the content length of the URL plus the sum of the content lengths of its embedded images)
 java.net.URL[] crawlWeb(CrawlerSetting crawler, int numberOfURLsToFind, Logger logger)
          searches the web, starting from the embedded URL (used as the root), for URLs matching the criteria given in the crawler; the search is performed breadth-first
static java.net.URL[] crawlWeb(java.util.List<Spider.URLWrapper> searchList, java.util.List<java.net.URL> resultList, java.util.List<java.net.URL> closedList, CrawlerSetting crawler, int depth, int numberOfURLsToFind, Logger logger)
          usually called by crawlWeb(CrawlerSetting, int, Logger)
 java.lang.String fullHeaderAsString()
           
 java.net.URL[] getBrokenLinks()
          Assuming the URL points to an HTML page, only links that are not accessible are returned.
 byte[] getBytes()
          retrieves the raw content from the embedded URL.
 java.lang.String getContentAsString()
          retrieves the entire content accessible through the embedded URL as a String.
 int getContentLength()
          retrieves the content length from a URLConnection
 java.lang.String getDomainName()
           
 javax.swing.text.html.HTMLDocument getHTMLDocument()
          returns an HTMLDocument object with the parsed content of the embedded URL for further examination
 javax.swing.text.html.HTMLDocument getHTMLDocument(java.io.Reader reader)
          returns an HTMLDocument object with the parsed content from the given reader for further examination
 java.net.URL[] getImages(boolean allowDuplicates)
          returns an array of images that are contained in the embedded URL
 java.net.URL[] getImages(java.io.Reader reader, boolean allowDuplicates)
          allows the content to be read from a location other than the embedded URL itself
 java.net.URL[] getLinks(boolean allowDuplicates)
          returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned.
 java.net.URL[] getLinks(boolean allowDuplicates, java.lang.String protocol)
          returns links filtered by the given protocol
 java.net.URL[] getLinks(java.io.Reader reader, boolean allowDuplicates)
          allows the content to be read from a location other than the embedded URL itself
 java.io.Reader getReader()
          This function constructs a reader appropriate for reading the content from the embedded URL.
 java.io.Reader getReader(java.lang.String charsetName)
           
 java.lang.String getTagText(javax.swing.text.html.HTML.Tag desiredTag, java.lang.String delimiter)
          returns all text found in the given desiredTag delimited by the given delimiter
 java.lang.String getTagText(java.io.Reader reader, javax.swing.text.html.HTML.Tag desiredTag, java.lang.String delimiter)
          allows the content to be read from a location other than the embedded URL itself
 java.lang.String getTitle()
          returns the title of the document
 java.net.URL getURL()
          returns the embedded URL
 boolean includesPattern(java.lang.String[] searchPattern, boolean includeHTMLCode)
          searches the content of the embedded URL for the presence of any of the given search patterns; returns true if at least one pattern is found
 boolean isAccessible()
          determines whether the embedded URL is reachable; a live connection to the URL is opened while this method executes
 long ping()
          returns the time it takes to establish a live connection to the embedded URL, or -1 if the URL is unreachable.
 void saveURLtoFile(java.io.File file)
          saves the content of the embedded URL to the given file
static java.util.List<java.net.URL> searchWebFor(java.lang.String[] searchPattern, java.util.ArrayList<java.net.URL> searchList, boolean includeHTMLCode, int level, boolean currentSiteOnly, java.util.List<java.net.URL> excludeList, java.util.List<java.net.URL> resultList, java.lang.String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
static java.net.URL[] searchWebFor(java.lang.String[] searchPattern, java.net.URL entryPoint, boolean includeHTMLCode, int level, boolean currentSiteOnly, java.lang.String[] searchURLExclusionPatterns, Monitor monitor)
          Deprecated.  
 void setURL(java.net.URL url)
          sets the embedded URL
 java.lang.String stripText()
          returns the text content of the embedded URL; a line break is inserted after each separate text occurrence
 java.lang.String stripText(java.io.Reader reader, java.lang.String delimiter)
          allows the content to be read from a location other than the embedded URL itself
 java.lang.String stripText(java.lang.String delimiter)
          returns a String containing the text of all HTML tag types from the embedded URL
 java.lang.String whois()
          returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address
static java.lang.String whois(java.lang.String domainName)
          returns the registrant information from the Internic database
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Spider

public Spider()
convenience constructor that initializes the Spider with a null URL


Spider

public Spider(java.net.URL url)
constructs a Spider object based on the given URL

Method Detail

getURL

public java.net.URL getURL()
returns the embedded URL


setURL

public void setURL(java.net.URL url)
sets the embedded URL


getDomainName

public java.lang.String getDomainName()

fullHeaderAsString

public java.lang.String fullHeaderAsString()
                                    throws java.io.IOException
Throws:
java.io.IOException

ping

public long ping()
returns the time it takes to establish a live connection to the embedded URL, or -1 if the URL is unreachable.
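The page does not show how ping() is implemented; a minimal JDK-only sketch of a likely approach (URLConnection plus wall-clock timing) is given below. The class name PingSketch, the millisecond granularity, and the connection strategy are all assumptions, not details taken from the library:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class PingSketch {

    /** Returns the milliseconds needed to open a live connection, or -1 on failure. */
    static long ping(URL url) {
        long start = System.currentTimeMillis();
        try {
            URLConnection conn = url.openConnection();
            conn.connect();
            conn.getInputStream().close(); // force the connection to actually be established
            return System.currentTimeMillis() - start;
        } catch (IOException e) {
            return -1L; // unreachable
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate with a file URL so the example runs without network access.
        File f = File.createTempFile("ping", ".txt");
        try (FileWriter w = new FileWriter(f)) { w.write("hello"); }
        System.out.println(ping(f.toURI().toURL()));               // non-negative for a reachable URL
        System.out.println(ping(new URL("file:///no/such/file"))); // -1
    }
}
```

Closing the input stream rather than just calling connect() ensures a real round trip, which is what a "live connection" measurement needs.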


saveURLtoFile

public void saveURLtoFile(java.io.File file)
                   throws java.io.IOException
saves the content of the embedded URL to the given file

Throws:
java.io.IOException

getLinks

public java.net.URL[] getLinks(boolean allowDuplicates,
                               java.lang.String protocol)
                        throws java.io.IOException
returns links filtered by the given protocol

Throws:
java.io.IOException
See Also:
getLinks(boolean)

getLinks

public java.net.URL[] getLinks(boolean allowDuplicates)
                        throws java.io.IOException
returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned. If no links are present within the given URL, an empty array is returned.

Throws:
java.io.IOException
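Since Spider's HTML handling is built on javax.swing (per the class description), link extraction of this kind can be sketched with the JDK's own HTMLEditorKit parser. The class name LinkSketch and the restriction to <a href> tags are illustrative simplifications; the real method also handles framesets and duplicate filtering:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkSketch {

    /** Collects the HREF attribute of every anchor tag found in the HTML read from the reader. */
    static List<String> extractLinks(Reader reader) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) links.add(href.toString());
                }
            }
        };
        new ParserDelegator().parse(reader, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"http://example.org/a\">a</a>"
                    + "<a href=\"http://example.org/b\">b</a></body></html>";
        System.out.println(extractLinks(new StringReader(html)));
    }
}
```

Relative HREFs would still need to be resolved against the page's base URL (e.g. with new URL(base, href)) before being returned as java.net.URL objects.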

getLinks

public java.net.URL[] getLinks(java.io.Reader reader,
                               boolean allowDuplicates)
                        throws java.io.IOException
allows the content to be read from a location other than the embedded URL itself

Throws:
java.io.IOException
See Also:
getLinks(boolean)

getBrokenLinks

public java.net.URL[] getBrokenLinks()
                              throws java.io.IOException
Assuming the URL points to a HTML page, only links that are not accessible are returned. If all links are valid (or the page didn't contain links), an empty array is returned. Only links with 'http', 'ftp' or 'file' protocol are checked.

Throws:
java.io.IOException

isAccessible

public boolean isAccessible()
determines whether the embedded URL is reachable; a live connection to the URL is opened while this method executes


getImages

public java.net.URL[] getImages(boolean allowDuplicates)
                         throws java.io.IOException
returns an array of images that are contained in the embedded URL

Throws:
java.io.IOException

getImages

public java.net.URL[] getImages(java.io.Reader reader,
                                boolean allowDuplicates)
                         throws java.io.IOException
allows the content to be read from a location other than the embedded URL itself

Throws:
java.io.IOException
See Also:
getImages(boolean)

includesPattern

public boolean includesPattern(java.lang.String[] searchPattern,
                               boolean includeHTMLCode)
                        throws java.io.IOException
searches the content of the embedded URL for the presence of any of the given search patterns; returns true if at least one pattern is found

Parameters:
searchPattern - array of search patterns this function will look for
includeHTMLCode - if true, this function will search through all content of the URL, including HTML code; if false, it will only search through the text found in the page
Throws:
java.io.IOException

getTitle

public java.lang.String getTitle()
                          throws java.io.IOException
returns the title of the document

Throws:
java.io.IOException

getTagText

public java.lang.String getTagText(javax.swing.text.html.HTML.Tag desiredTag,
                                   java.lang.String delimiter)
                            throws java.io.IOException
returns all text found in the given desiredTag delimited by the given delimiter

Throws:
java.io.IOException
See Also:
getTagText(Reader,HTML.Tag,String)

getTagText

public java.lang.String getTagText(java.io.Reader reader,
                                   javax.swing.text.html.HTML.Tag desiredTag,
                                   java.lang.String delimiter)
                            throws java.io.IOException
allows the content to be read from a location other than the embedded URL itself

Throws:
java.io.IOException
See Also:
getTagText(HTML.Tag,String)
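Extracting the text inside a particular tag can likewise be sketched with the JDK's HTMLEditorKit callbacks: a flag is raised on the tag's start, lowered on its end, and text runs seen while the flag is up are joined with the delimiter. The class name TagTextSketch is hypothetical, and the real method's handling of nested or unclosed tags is not shown here:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TagTextSketch {

    /** Collects the text inside every occurrence of desiredTag, joined by delimiter. */
    static String getTagText(Reader reader, HTML.Tag desiredTag, String delimiter)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private boolean inTag = false;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == desiredTag) inTag = true;
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == desiredTag) inTag = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inTag) {
                    if (sb.length() > 0) sb.append(delimiter);
                    sb.append(data);
                }
            }
        };
        new ParserDelegator().parse(reader, cb, true);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<h1>First</h1><p>body text</p><h1>Second</h1>";
        System.out.println(getTagText(new StringReader(html), HTML.Tag.H1, " | "));
    }
}
```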

stripText

public java.lang.String stripText()
                           throws java.io.IOException
returns the text content of the embedded URL; a line break is inserted after each separate text occurrence

Throws:
java.io.IOException

stripText

public java.lang.String stripText(java.lang.String delimiter)
                           throws java.io.IOException
returns a String containing the text of all HTML tag types from the embedded URL

Throws:
java.io.IOException

stripText

public java.lang.String stripText(java.io.Reader reader,
                                  java.lang.String delimiter)
                           throws java.io.IOException
allows the content to be read from a location other than the embedded URL itself

Throws:
java.io.IOException
See Also:
stripText(String)

getHTMLDocument

public javax.swing.text.html.HTMLDocument getHTMLDocument()
                                                   throws java.io.IOException
returns an HTMLDocument object with the parsed content of the embedded URL for further examination

Throws:
java.io.IOException

getHTMLDocument

public javax.swing.text.html.HTMLDocument getHTMLDocument(java.io.Reader reader)
                                                   throws java.io.IOException
returns an HTMLDocument object with the parsed content from the given reader for further examination

Throws:
java.io.IOException

getReader

public java.io.Reader getReader()
                         throws java.io.IOException
This function constructs a Reader appropriate for reading the content from the embedded URL. Currently, only the HTTP, FTP and FILE protocols are supported.

Throws:
java.io.IOException
java.lang.UnsupportedOperationException - if the embedded URL uses a protocol other than those supported

getReader

public java.io.Reader getReader(java.lang.String charsetName)
                         throws java.io.IOException
Throws:
java.io.IOException
See Also:
getReader()

getBytes

public byte[] getBytes()
                throws java.io.IOException
retrieves the raw content from the embedded URL.

Throws:
java.io.IOException

getContentAsString

public java.lang.String getContentAsString()
                                    throws java.io.IOException
retrieves the entire content accessible through the embedded URL as a String. If the URL points to an HTML page, the full HTML code is returned. This method is not suitable for retrieving binary data, as it uses a BufferedReader and places platform-specific line breaks between the lines read with readLine(). If the URL could not be accessed and an IOException was caught, null is returned.

Throws:
java.io.IOException
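The documented behavior (a BufferedReader, platform-specific line breaks between readLine() results, and null when an IOException occurs) can be reproduced with a short JDK-only sketch; the class name ContentSketch is hypothetical:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ContentSketch {

    /** Reads the whole content behind a URL as text, or returns null on an IOException. */
    static String getContentAsString(URL url) {
        StringBuilder sb = new StringBuilder();
        String sep = System.lineSeparator(); // platform-specific, as the docs warn
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append(sep); // original line terminators are lost here
            }
            return sb.toString();
        } catch (IOException e) {
            return null; // mirrors the documented behavior on inaccessible URLs
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        try (FileWriter w = new FileWriter(f)) { w.write("line one\nline two"); }
        System.out.println(getContentAsString(f.toURI().toURL()));
    }
}
```

The line-terminator substitution is exactly why this approach corrupts binary data: every byte sequence that happens to look like a line break is rewritten.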

getContentLength

public int getContentLength()
                     throws java.io.IOException
retrieves the content length from a URLConnection

Throws:
java.io.IOException

calculatePageWeight

public int calculatePageWeight()
                        throws java.io.IOException
returns the page weight in bytes (the content length of the URL plus the sum of the content lengths of its embedded images)

Throws:
java.io.IOException
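The described computation (content length of the page plus the lengths of its embedded images) can be sketched as follows. Here the image URLs are passed in explicitly rather than parsed out of the page, which is a simplification; the class name PageWeightSketch is hypothetical:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class PageWeightSketch {

    /** Sums the content length of a page and its embedded resources. */
    static int pageWeight(URL page, URL[] images) throws IOException {
        int total = contentLength(page);
        for (URL img : images) {
            total += contentLength(img);
        }
        return total;
    }

    static int contentLength(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.connect();
        return conn.getContentLength(); // may be -1 if the server does not report a length
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate with file URLs so the example runs without network access.
        File page = File.createTempFile("page", ".html");
        try (FileWriter w = new FileWriter(page)) { w.write("12345"); }
        File img = File.createTempFile("img", ".gif");
        try (FileWriter w = new FileWriter(img)) { w.write("abc"); }
        System.out.println(pageWeight(page.toURI().toURL(),
                new URL[]{ img.toURI().toURL() })); // prints 8
    }
}
```

A robust implementation would also decide how to treat resources whose length is reported as -1.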

whois

public java.lang.String whois()
                       throws java.io.IOException
returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address

Throws:
java.io.IOException

whois

public static java.lang.String whois(java.lang.String domainName)
                              throws java.io.IOException
returns the registrant information from the Internic database

Throws:
java.io.IOException

crawlWeb

public java.net.URL[] crawlWeb(CrawlerSetting crawler,
                               int numberOfURLsToFind,
                               Logger logger)
searches the web, starting from the embedded URL (used as the root), for URLs matching the criteria given in the crawler; the search is performed breadth-first

Parameters:
crawler - criteria for crawling
numberOfURLsToFind - if > 0, the search stops once the given number of URLs matching the crawler's criteria has been found
logger - used to log IOExceptions occurring while links are processed
Returns:
an array containing URLs found that satisfy the crawler's criteria as defined by the crawler
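The breadth-first strategy described above can be illustrated with a small in-memory sketch: a FIFO queue drives the traversal, a closed set prevents revisits (analogous to the closedList of the static overload), and the search stops early once numberOfURLsToFind matches have been collected. The in-memory link graph and the Predicate standing in for a CrawlerSetting are illustrative assumptions:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class CrawlSketch {

    /** Breadth-first crawl over an in-memory link graph (a stand-in for fetching pages). */
    static List<String> crawl(Map<String, List<String>> links, String root,
                              Predicate<String> matches, int numberOfURLsToFind) {
        List<String> results = new ArrayList<>();
        Set<String> closed = new HashSet<>();     // every URL ever queued; prevents revisits
        Deque<String> queue = new ArrayDeque<>(); // FIFO order gives breadth-first traversal
        queue.add(root);
        closed.add(root);
        while (!queue.isEmpty()
                && (numberOfURLsToFind <= 0 || results.size() < numberOfURLsToFind)) {
            String url = queue.poll();
            if (matches.test(url)) results.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (closed.add(next)) queue.add(next);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("root", Arrays.asList("a", "b"));
        g.put("a", Arrays.asList("a1"));
        g.put("b", Arrays.asList("b1"));
        System.out.println(crawl(g, "root", u -> u.length() == 2, 0)); // prints [a1, b1]
    }
}
```

A real crawler replaces the map lookup with a network fetch plus link extraction, which is where the logged IOExceptions come from.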

crawlWeb

public static java.net.URL[] crawlWeb(java.util.List<Spider.URLWrapper> searchList,
                                      java.util.List<java.net.URL> resultList,
                                      java.util.List<java.net.URL> closedList,
                                      CrawlerSetting crawler,
                                      int depth,
                                      int numberOfURLsToFind,
                                      Logger logger)
usually called by crawlWeb(CrawlerSetting, int, Logger)

Parameters:
searchList - List of Spider.URLWrapper objects containing nodes to be examined
resultList - List of URL objects
closedList - List of URL objects
crawler - criteria for crawling
depth - link distance from the root of the search
numberOfURLsToFind - if > 0, the search stops once the given number of URLs matching the crawler's criteria has been found
logger - used to log IOExceptions occurring while links are processed
Returns:
an array containing URLs found that satisfy the criteria as defined by the crawler
See Also:
crawlWeb(CrawlerSetting,int,Logger)

searchWebFor

@Deprecated
public static java.net.URL[] searchWebFor(java.lang.String[] searchPattern,
                                                     java.net.URL entryPoint,
                                                     boolean includeHTMLCode,
                                                     int level,
                                                     boolean currentSiteOnly,
                                                     java.lang.String[] searchURLExclusionPatterns,
                                                     Monitor monitor)
Deprecated. 

This special web search function returns all URLs found that contain one of the desired search patterns, given the constraints of the other parameters. The search starts at the entryPoint and proceeds recursively through the tree derived from that URL's links, as deep as the level parameter allows; the search is conducted breadth-first. For more flexible web searches, consider using an org.jscience.net.CrawlerSetting.
Use of Monitor: the monitor may be null, in which case no progress feedback is provided while the function executes.

Parameters:
searchPattern - an array containing String patterns to search for; wildcards are not supported
entryPoint - the URL from where to start the search
includeHTMLCode - if true, the search will include not only the text, but also the HTML code of a page
level - limits the depth of the search; only pages reachable through at most the given number of recursive links are included
currentSiteOnly - if true, the search is limited to the host of the entryPoint
searchURLExclusionPatterns - if not null, an array of String patterns used to filter out unwanted URLs; if any of the patterns occurs in a URL's path, that URL is disregarded; wildcards are not supported
monitor - see above for usage; may be null
See Also:
crawlWeb(CrawlerSetting,int,Logger)

searchWebFor

@Deprecated
public static java.util.List<java.net.URL> searchWebFor(java.lang.String[] searchPattern,
                                                                   java.util.ArrayList<java.net.URL> searchList,
                                                                   boolean includeHTMLCode,
                                                                   int level,
                                                                   boolean currentSiteOnly,
                                                                   java.util.List<java.net.URL> excludeList,
                                                                   java.util.List<java.net.URL> resultList,
                                                                   java.lang.String[] searchURLExclusionPatterns,
                                                                   Monitor monitor)
Deprecated. 

usually called by the other searchWebFor() function; all Lists contain URL objects

See Also:
searchWebFor(String[],URL,boolean,int,boolean,String[],Monitor)