org.jscience.net
Interface CrawlerSetting

All Known Implementing Classes:
MediaCrawler, SampleCrawlerSetting

public interface CrawlerSetting

CrawlerSetting defines the callback functions that determine how a web search algorithm traverses the net and computes its results. A CrawlerSetting can be used with an org.jscience.net.Spider.

See Also:
Spider, Spider.crawlWeb(CrawlerSetting,int,Logger)
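As a sketch of how the two callbacks fit together, here is a minimal implementation in the spirit of this interface. The interface is re-declared locally (with java.lang.Object standing in for Spider.URLWrapper) so the example is self-contained; the real interface and the Spider class live in org.jscience.net, and the names used here are illustrative, not part of that library.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical local re-declaration of the interface, so this sketch
// compiles on its own; the real type is org.jscience.net.CrawlerSetting,
// and its last parameter is a List of Spider.URLWrapper.
interface CrawlerSettingSketch {
    boolean matchesCriteria(URL url, URL referer, int depth,
                            List<URL> resultURLList, List<URL> closedURLList);
    boolean followLinks(URL url, URL referer, int depth,
                        List<URL> resultURLList, List<URL> closedURLList,
                        List<Object> searchURLWrapperList);
}

public class HttpOnlyCrawlerSetting implements CrawlerSettingSketch {
    // Accept only plain HTTP(S) URLs; everything else (mailto:, ftp:,
    // images, media) is rejected without further inspection.
    @Override
    public boolean matchesCriteria(URL url, URL referer, int depth,
                                   List<URL> resultURLList,
                                   List<URL> closedURLList) {
        String protocol = url.getProtocol();
        return protocol.equals("http") || protocol.equals("https");
    }

    // Follow links only up to a fixed depth, so the crawl terminates.
    @Override
    public boolean followLinks(URL url, URL referer, int depth,
                               List<URL> resultURLList,
                               List<URL> closedURLList,
                               List<Object> searchURLWrapperList) {
        return depth < 3 && matchesCriteria(url, referer, depth,
                                            resultURLList, closedURLList);
    }

    public static void main(String[] args) throws MalformedURLException {
        HttpOnlyCrawlerSetting s = new HttpOnlyCrawlerSetting();
        List<URL> empty = new ArrayList<>();
        System.out.println(s.matchesCriteria(
                new URL("http://example.org/"), null, 0, empty, empty));
        System.out.println(s.matchesCriteria(
                new URL("ftp://example.org/file"), null, 0, empty, empty));
    }
}
```

In the real library, such an object would be handed to a Spider (see Spider.crawlWeb above), which invokes these callbacks on every URL it encounters.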

Method Summary
 boolean followLinks(java.net.URL url, java.net.URL referer, int depth, java.util.List<java.net.URL> resultURLList, java.util.List<java.net.URL> closedURLList, java.util.List<Spider.URLWrapper> searchURLWrapperList)
          followLinks() determines whether the given URL is to be searched for links to be examined at the next level.
 boolean matchesCriteria(java.net.URL url, java.net.URL referer, int depth, java.util.List<java.net.URL> resultURLList, java.util.List<java.net.URL> closedURLList)
          This method decides whether the URL itself or its content qualifies for what this CrawlerSetting searches for; since this function is also called on every URL encountered, it is also the place for any custom parsing this CrawlerSetting wants to do.
 

Method Detail

matchesCriteria

boolean matchesCriteria(java.net.URL url,
                        java.net.URL referer,
                        int depth,
                        java.util.List<java.net.URL> resultURLList,
                        java.util.List<java.net.URL> closedURLList)
This method decides whether the URL itself or its content qualifies for what this CrawlerSetting searches for; since this function is also called on every URL encountered, it is also the place for any custom parsing this CrawlerSetting wants to do. The two List objects allow the CrawlerSetting to act on potential constraints, such as a maximum number of total nodes to be examined (or any other custom check imaginable). Note that it is the responsibility of the calling object to ensure that this function isn't called multiple times on the same URL if that is not desired. The url may be any URL, including non-HTTP schemes (such as mailto: or ftp:) and image or media URLs.

Parameters:
url - the URL in question to satisfy the criteria
referer - url's referer URL
depth - link distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
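The two lists make it possible to enforce global constraints from inside the callback. The following hypothetical criteria check (same signature as matchesCriteria, but a standalone class with illustrative names and limits) accepts HTML-looking http(s) URLs only until a result cap is reached:

```java
import java.net.URL;
import java.util.List;

// Hypothetical criteria check with the same signature as
// CrawlerSetting.matchesCriteria: accepts http(s) URLs that look like
// HTML pages, until a cap on total results is reached.
public class CappedHtmlCriteria {
    private static final int MAX_RESULTS = 100; // illustrative constraint

    public boolean matchesCriteria(URL url, URL referer, int depth,
                                   List<URL> resultURLList,
                                   List<URL> closedURLList) {
        // Use the result list to enforce a maximum number of matches,
        // one of the constraints the interface documentation mentions.
        if (resultURLList.size() >= MAX_RESULTS) {
            return false;
        }
        // The url may use any scheme (mailto:, ftp:, ...); keep only
        // http(s) URLs whose path looks like an HTML document.
        String protocol = url.getProtocol();
        boolean isHttp = protocol.equals("http") || protocol.equals("https");
        String path = url.getPath();
        return isHttp && (path.isEmpty() || path.endsWith("/")
                || path.endsWith(".html") || path.endsWith(".htm"));
    }
}
```

Note that, per the documentation above, deduplication is the caller's job: this method does not itself check whether the URL was seen before.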

followLinks

boolean followLinks(java.net.URL url,
                    java.net.URL referer,
                    int depth,
                    java.util.List<java.net.URL> resultURLList,
                    java.util.List<java.net.URL> closedURLList,
                    java.util.List<Spider.URLWrapper> searchURLWrapperList)
followLinks() determines whether the given URL is to be searched for links to be examined at the next level. The three List objects allow the CrawlerSetting to act on potential constraints, such as a maximum number of total nodes to be examined (or any other custom check imaginable). The url may be any URL, including non-HTTP schemes (such as mailto: or ftp:) and image or media URLs.

Parameters:
url - the URL that is to be examined for its links
referer - url's referer URL
depth - link distance from the original root URL where the search began
resultURLList - List of URLs that have already been found to match this CrawlerSetting's criteria
closedURLList - List of URLs that have already been found not to match the CrawlerSetting's criteria
searchURLWrapperList - List of Spider.URLWrapper objects already identified to be examined in the next level
See Also:
Spider.URLWrapper
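The three lists together measure how much work the crawl has already done or queued, which is one way to bound the total size of a search. A hypothetical followLinks implementation along those lines (standalone class, same signature as the interface method, with Object standing in for Spider.URLWrapper; the limits are illustrative):

```java
import java.net.URL;
import java.util.List;

// Hypothetical followLinks with the same signature as
// CrawlerSetting.followLinks; Object replaces Spider.URLWrapper here
// because the real class is not available in this sketch.
public class BoundedLinkFollower {
    private static final int MAX_DEPTH = 2;    // how far from the root to go
    private static final int MAX_TOTAL = 500;  // cap on nodes examined overall

    public boolean followLinks(URL url, URL referer, int depth,
                               List<URL> resultURLList,
                               List<URL> closedURLList,
                               List<Object> searchURLWrapperList) {
        // Work already done or queued: matches + rejects + next-level frontier.
        int examined = resultURLList.size() + closedURLList.size()
                + searchURLWrapperList.size();
        if (examined >= MAX_TOTAL || depth >= MAX_DEPTH) {
            return false;
        }
        // Only HTML-capable schemes are worth parsing for further links.
        String protocol = url.getProtocol();
        return protocol.equals("http") || protocol.equals("https");
    }
}
```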