java.lang.Object
  org.jscience.net.Spider
public final class Spider
Spider provides several useful static functions for accessing
web content and parsing HTML, mostly based on a simple URL.
Note that because this class uses functionality from the javax.swing
package (although no GUIs are used in this class), there are non-terminating
javax.swing threads that get created when using this class.
That is, an application using this class without using any other
javax.swing GUI components may end up with unwanted non-terminated
threads, possibly forcing a call to e.g. System.exit(0)
to terminate a simple program.
Most methods are synchronized, so don't expect to have longer running methods
(such as getting links from a URL) run simultaneously on the same Spider object.
See Also: CrawlerSetting, URLCache

| Nested Class Summary | |
|---|---|
static class |
Spider.SMonitor
Deprecated. |
static class |
Spider.URLWrapper
wraps a java.net.URL and keeps a reference to its referer |
| Constructor Summary | |
|---|---|
Spider()
convenience constructor that initializes the Spider with a null URL |
|
Spider(java.net.URL url)
constructs a Spider object based on the given URL |
|
| Method Summary | |
|---|---|
int |
calculatePageWeight()
returns the page weight in bytes (= content length of the URL plus the sum of the embedded images) |
java.net.URL[] |
crawlWeb(CrawlerSetting crawler,
int numberOfURLsToFind,
Logger logger)
searches the web from the embedded URL (used as root) for URLs based on the criteria given in the crawler; search is performed breadth-first |
static java.net.URL[] |
crawlWeb(java.util.List<Spider.URLWrapper> searchList,
java.util.List<java.net.URL> resultList,
java.util.List<java.net.URL> closedList,
CrawlerSetting crawler,
int depth,
int numberOfURLsToFind,
Logger logger)
usually called by crawlWeb(CrawlerSetting crawler, int numberOfURLsToFind, Logger logger) |
java.lang.String |
fullHeaderAsString()
|
java.net.URL[] |
getBrokenLinks()
Assuming the URL points to an HTML page, only links that are not accessible are returned. |
byte[] |
getBytes()
retrieves the raw content from the embedded URL. |
java.lang.String |
getContentAsString()
retrieves the entire content accessible through the embedded URL as a String. |
int |
getContentLength()
retrieves the content length from a URLConnection |
java.lang.String |
getDomainName()
|
javax.swing.text.html.HTMLDocument |
getHTMLDocument()
returns an HTMLDocument object with the parsed content of the embedded URL for further examination |
javax.swing.text.html.HTMLDocument |
getHTMLDocument(java.io.Reader reader)
returns an HTMLDocument object with the parsed content from the given reader for further examination |
java.net.URL[] |
getImages(boolean allowDuplicates)
returns an array of images that are contained in the embedded URL |
java.net.URL[] |
getImages(java.io.Reader reader,
boolean allowDuplicates)
allows reading the content from a location other than the URL itself |
java.net.URL[] |
getLinks(boolean allowDuplicates)
returns an array containing URLs that the embedded URL links to; if the page is a frameset, the frame sources are returned. |
java.net.URL[] |
getLinks(boolean allowDuplicates,
java.lang.String protocol)
returns links filtered by the given protocol |
java.net.URL[] |
getLinks(java.io.Reader reader,
boolean allowDuplicates)
allows reading the content from a location other than the URL itself |
java.io.Reader |
getReader()
This function constructs a reader appropriate for reading the content from the embedded URL. |
java.io.Reader |
getReader(java.lang.String charsetName)
|
java.lang.String |
getTagText(javax.swing.text.html.HTML.Tag desiredTag,
java.lang.String delimiter)
returns all text found in the given desiredTag delimited by the given delimiter |
java.lang.String |
getTagText(java.io.Reader reader,
javax.swing.text.html.HTML.Tag desiredTag,
java.lang.String delimiter)
allows reading the content from a location other than the URL itself |
java.lang.String |
getTitle()
returns the title of the document |
java.net.URL |
getURL()
returns the embedded URL |
boolean |
includesPattern(java.lang.String[] searchPattern,
boolean includeHTMLCode)
searches the content of the embedded URL for the presence of one of the searchPatterns given; returns true if one of the patterns was found |
boolean |
isAccessible()
actually connects to the embedded URL while executing |
long |
ping()
returns the time it takes to establish a live connection to the embedded URL, or -1 if the URL is unreachable. |
void |
saveURLtoFile(java.io.File file)
saves the content of the embedded URL to the given file |
static java.util.List<java.net.URL> |
searchWebFor(java.lang.String[] searchPattern,
java.util.ArrayList<java.net.URL> searchList,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
java.util.List<java.net.URL> excludeList,
java.util.List<java.net.URL> resultList,
java.lang.String[] searchURLExclusionPatterns,
Monitor monitor)
Deprecated. |
static java.net.URL[] |
searchWebFor(java.lang.String[] searchPattern,
java.net.URL entryPoint,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
java.lang.String[] searchURLExclusionPatterns,
Monitor monitor)
Deprecated. |
void |
setURL(java.net.URL url)
sets the embedded URL |
java.lang.String |
stripText()
a line break is put after each separate text occurrence |
java.lang.String |
stripText(java.io.Reader reader,
java.lang.String delimiter)
allows reading the content from a location other than the URL itself |
java.lang.String |
stripText(java.lang.String delimiter)
returns a String containing the text of all HTML tag types from the embedded URL |
java.lang.String |
whois()
returns the registrant information from the Internic database; the embedded URL must use the host name and not the IP address |
static java.lang.String |
whois(java.lang.String domainName)
returns the registrant information from the Internic database |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public Spider()
public Spider(java.net.URL url)
| Method Detail |
|---|
public java.net.URL getURL()
public void setURL(java.net.URL url)
public java.lang.String getDomainName()
public java.lang.String fullHeaderAsString()
throws java.io.IOException
java.io.IOException
public long ping()
public void saveURLtoFile(java.io.File file)
throws java.io.IOException
java.io.IOException
public java.net.URL[] getLinks(boolean allowDuplicates,
java.lang.String protocol)
throws java.io.IOException
java.io.IOException
See Also: getLinks(boolean)
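The protocol filtering described for getLinks(boolean, String) can be pictured with the sketch below. It is not the Spider implementation: the class ProtocolFilterSketch and the method filterByProtocol are hypothetical names, and the case-insensitive comparison is an assumption, since this page does not say how the protocol is matched.

```java
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of org.jscience.net.Spider. */
final class ProtocolFilterSketch {

    /** Keeps only the links whose protocol equals the given one (assumed case-insensitive). */
    static URL[] filterByProtocol(URL[] links, String protocol) {
        List<URL> kept = new ArrayList<>();
        for (URL link : links) {
            if (link.getProtocol().equalsIgnoreCase(protocol)) {
                kept.add(link);
            }
        }
        return kept.toArray(new URL[0]);
    }

    public static void main(String[] args) throws Exception {
        URL[] links = {
            URI.create("http://example.org/index.html").toURL(),
            URI.create("ftp://example.org/archive.zip").toURL(),
            URI.create("http://example.org/about.html").toURL()
        };
        // Prints the links that use the http protocol, in their original order.
        for (URL url : filterByProtocol(links, "http")) {
            System.out.println(url);
        }
    }
}
```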
public java.net.URL[] getLinks(boolean allowDuplicates)
throws java.io.IOException
java.io.IOException
public java.net.URL[] getLinks(java.io.Reader reader,
boolean allowDuplicates)
throws java.io.IOException
java.io.IOException
See Also: getLinks(boolean)
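To give an idea of what reading links from a Reader can involve, here is a minimal sketch that uses the JDK's javax.swing.text.html parser, the package this class is said to rely on. It is an illustration under assumptions, not the Spider source: LinkExtractionSketch and extractLinks are hypothetical names, only anchor tags are handled (frames are ignored), and the base URL for resolving relative links is passed explicitly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper for illustration; not part of org.jscience.net.Spider. */
final class LinkExtractionSketch {

    /** Collects the href targets of all anchor tags, resolved against the given base URL. */
    static URL[] extractLinks(Reader reader, URL base, boolean allowDuplicates) throws IOException {
        List<URL> links = new ArrayList<>();
        // Duplicates are detected on the string form; URL.equals may trigger DNS lookups.
        Set<String> seen = new HashSet<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return;
                }
                try {
                    URL url = base.toURI().resolve(href.toString().trim()).toURL();
                    if (allowDuplicates || seen.add(url.toExternalForm())) {
                        links.add(url);
                    }
                } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                    // Unparsable link: skipped in this sketch.
                }
            }
        };
        new ParserDelegator().parse(reader, callback, true);
        return links.toArray(new URL[0]);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"a.html\">A</a> <a href=\"a.html\">A again</a> "
                + "<a href=\"http://other.example/b\">B</a></body></html>";
        URL base = URI.create("http://example.org/dir/index.html").toURL();
        for (URL url : extractLinks(new StringReader(html), base, false)) {
            System.out.println(url);
        }
    }
}
```

The sketch deduplicates on the external form of each URL rather than on URL objects, because java.net.URL compares hosts by resolving them, which can be slow or fail offline.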
public java.net.URL[] getBrokenLinks()
throws java.io.IOException
java.io.IOException
public boolean isAccessible()
public java.net.URL[] getImages(boolean allowDuplicates)
throws java.io.IOException
java.io.IOException
public java.net.URL[] getImages(java.io.Reader reader,
boolean allowDuplicates)
throws java.io.IOException
java.io.IOException
See Also: getImages(boolean)
public boolean includesPattern(java.lang.String[] searchPattern,
boolean includeHTMLCode)
throws java.io.IOException
Parameters:
searchPattern - array of search patterns this function will look for
includeHTMLCode - if true, this function will search through all content of the URL, including HTML code; if false, it will only search through text found
java.io.IOException
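The effect of the includeHTMLCode flag can be illustrated with a small sketch on an in-memory string. This is not the Spider implementation: PatternSearchSketch is a hypothetical class, the regex-based tag stripper is a crude stand-in for real HTML parsing, and plain case-sensitive substring matching is an assumption.

```java
/** Hypothetical helper for illustration; not part of org.jscience.net.Spider. */
final class PatternSearchSketch {

    /** Crude tag removal for this sketch only; a real HTML parser is more robust. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    /** Returns true if at least one pattern occurs in the searched content. */
    static boolean includesPattern(String content, String[] searchPattern, boolean includeHTMLCode) {
        // With includeHTMLCode the markup itself is searched; otherwise only the visible text.
        String haystack = includeHTMLCode ? content : stripTags(content);
        for (String pattern : searchPattern) {
            if (haystack.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<a href=\"spider.html\">Home</a>";
        String[] patterns = {"spider"};
        System.out.println(includesPattern(html, patterns, true));
        System.out.println(includesPattern(html, patterns, false));
    }
}
```

In the usage above, the pattern occurs only inside the href attribute, so it is found when the HTML code is included and missed when only the text is searched.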
public java.lang.String getTitle()
throws java.io.IOException
java.io.IOException
public java.lang.String getTagText(javax.swing.text.html.HTML.Tag desiredTag,
java.lang.String delimiter)
throws java.io.IOException
java.io.IOException
See Also: getTagText(Reader,HTML.Tag,String)
public java.lang.String getTagText(java.io.Reader reader,
javax.swing.text.html.HTML.Tag desiredTag,
java.lang.String delimiter)
throws java.io.IOException
java.io.IOException
See Also: getTagText(HTML.Tag,String)
public java.lang.String stripText()
throws java.io.IOException
java.io.IOException
public java.lang.String stripText(java.lang.String delimiter)
throws java.io.IOException
java.io.IOException
public java.lang.String stripText(java.io.Reader reader,
java.lang.String delimiter)
throws java.io.IOException
java.io.IOException
See Also: stripText(String)
public javax.swing.text.html.HTMLDocument getHTMLDocument()
throws java.io.IOException
java.io.IOException
public javax.swing.text.html.HTMLDocument getHTMLDocument(java.io.Reader reader)
throws java.io.IOException
java.io.IOException
public java.io.Reader getReader()
throws java.io.IOException
java.io.IOException
java.lang.UnsupportedOperationException - if the given URL uses a protocol other than HTTP or FILE
public java.io.Reader getReader(java.lang.String charsetName)
throws java.io.IOException
java.io.IOException
See Also: getReader()
public byte[] getBytes()
throws java.io.IOException
java.io.IOException
public java.lang.String getContentAsString()
throws java.io.IOException
java.io.IOException
public int getContentLength()
throws java.io.IOException
java.io.IOException
public int calculatePageWeight()
throws java.io.IOException
java.io.IOException
public java.lang.String whois()
throws java.io.IOException
java.io.IOException
public static java.lang.String whois(java.lang.String domainName)
throws java.io.IOException
java.io.IOException
public java.net.URL[] crawlWeb(CrawlerSetting crawler,
int numberOfURLsToFind,
Logger logger)
Parameters:
crawler - criteria for crawling
numberOfURLsToFind - if > 0, the search stops once the given number of URLs matching the crawler's criteria has been found
logger - to log IOExceptions occurring while processing links
public static java.net.URL[] crawlWeb(java.util.List<Spider.URLWrapper> searchList,
java.util.List<java.net.URL> resultList,
java.util.List<java.net.URL> closedList,
CrawlerSetting crawler,
int depth,
int numberOfURLsToFind,
Logger logger)
Parameters:
searchList - List of Spider.URLWrapper objects containing nodes to be examined
resultList - List of URL objects
closedList - List of URL objects
crawler - criteria for crawling
depth - link distance from the root of the search
numberOfURLsToFind - if > 0, the search stops once the given number of URLs matching the crawler's criteria has been found
logger - to log IOExceptions occurring while processing links
See Also: crawlWeb(CrawlerSetting,int,Logger)
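The breadth-first search with a search list, a closed list and a result list can be sketched on an in-memory link graph as follows. Everything here is an assumption made for illustration: BreadthFirstCrawlSketch is a hypothetical class, a Predicate stands in for the criteria of a CrawlerSetting (whose API is not shown on this page), page names replace real URLs, and maxDepth is an invented parameter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical illustration of a breadth-first crawl; not the Spider implementation. */
final class BreadthFirstCrawlSketch {

    /** A page paired with its link distance from the root. */
    private static final class Node {
        final String page;
        final int depth;

        Node(String page, int depth) {
            this.page = page;
            this.depth = depth;
        }
    }

    static List<String> crawl(String root, Map<String, List<String>> linkGraph,
                              Predicate<String> matches, int maxDepth, int numberOfURLsToFind) {
        Deque<Node> searchList = new ArrayDeque<>();   // nodes still to be examined (FIFO)
        Set<String> closedList = new HashSet<>();      // pages already queued or examined
        List<String> resultList = new ArrayList<>();   // pages matching the criteria

        searchList.addLast(new Node(root, 0));
        closedList.add(root);

        while (!searchList.isEmpty()) {
            Node current = searchList.removeFirst();
            if (matches.test(current.page)) {
                resultList.add(current.page);
                // A positive limit stops the search as soon as enough matches are found.
                if (numberOfURLsToFind > 0 && resultList.size() >= numberOfURLsToFind) {
                    break;
                }
            }
            if (current.depth >= maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            List<String> links = linkGraph.getOrDefault(current.page, Collections.<String>emptyList());
            for (String link : links) {
                if (closedList.add(link)) { // true only for pages not seen before
                    searchList.addLast(new Node(link, current.depth + 1));
                }
            }
        }
        return resultList;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("root", Arrays.asList("a.html", "b.pdf"));
        graph.put("a.html", Arrays.asList("c.html"));
        graph.put("b.pdf", Arrays.asList("c.html", "d.html"));
        graph.put("c.html", Arrays.asList("root"));
        System.out.println(crawl("root", graph, page -> page.endsWith(".html"), 2, 0));
    }
}
```

Because the queue is processed first-in, first-out, all pages at link distance 1 are examined before any page at distance 2, which is what makes the search breadth-first.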
@Deprecated
public static java.net.URL[] searchWebFor(java.lang.String[] searchPattern,
java.net.URL entryPoint,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
java.lang.String[] searchURLExclusionPatterns,
Monitor monitor)
Parameters:
searchPattern - an array containing String patterns to search for; wildcards are not supported
entryPoint - the URL from where to start the search
includeHTMLCode - if true, the search will include not only the text, but also the HTML code of a page
level - limits the depth of the search; only pages that are reachable through at most the given number of recursive links will be included
currentSiteOnly - if true, the search is limited to the host of the entryPoint
searchURLExclusionPatterns - if not null, an array of String patterns used to filter out unwanted URLs, i.e. if any of the patterns is present in the URL's path, that URL will be disregarded; wildcards are not supported
monitor - see above for usage; may be null
See Also: crawlWeb(CrawlerSetting,int,Logger)
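The URL filtering implied by currentSiteOnly and searchURLExclusionPatterns could look roughly like the sketch below. It is an illustration only: UrlFilterSketch and its methods are hypothetical names, and the host comparison ignoring case is an assumption not stated on this page.

```java
import java.net.URI;
import java.net.URL;

/** Hypothetical helper for illustration; not part of org.jscience.net.Spider. */
final class UrlFilterSketch {

    /** True if any exclusion pattern occurs literally in the URL's path; null means no exclusions. */
    static boolean isExcluded(URL url, String[] exclusionPatterns) {
        if (exclusionPatterns == null) {
            return false;
        }
        for (String pattern : exclusionPatterns) {
            if (url.getPath().contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    /** Decides whether a candidate URL should be followed during the search. */
    static boolean shouldVisit(URL entryPoint, URL candidate,
                               boolean currentSiteOnly, String[] exclusionPatterns) {
        // Assumption: "same site" means the same host name, compared without regard to case.
        if (currentSiteOnly && !entryPoint.getHost().equalsIgnoreCase(candidate.getHost())) {
            return false;
        }
        return !isExcluded(candidate, exclusionPatterns);
    }

    public static void main(String[] args) throws Exception {
        URL entry = URI.create("http://example.org/index.html").toURL();
        URL ad = URI.create("http://example.org/ads/banner.html").toURL();
        URL doc = URI.create("http://example.org/docs/a.html").toURL();
        String[] exclusions = {"/ads/"};
        System.out.println(shouldVisit(entry, ad, true, exclusions));
        System.out.println(shouldVisit(entry, doc, true, exclusions));
    }
}
```

Only the path is inspected, so a pattern that appears solely in the query string or host does not exclude a URL in this sketch.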
@Deprecated
public static java.util.List<java.net.URL> searchWebFor(java.lang.String[] searchPattern,
java.util.ArrayList<java.net.URL> searchList,
boolean includeHTMLCode,
int level,
boolean currentSiteOnly,
java.util.List<java.net.URL> excludeList,
java.util.List<java.net.URL> resultList,
java.lang.String[] searchURLExclusionPatterns,
Monitor monitor)
See Also: searchWebFor(String[],URL,boolean,int,boolean,String[],Monitor)