org.jscience.net
Class URLCache

java.lang.Object
  extended by org.jscience.net.URLCache
All Implemented Interfaces:
java.io.Serializable

public class URLCache
extends java.lang.Object
implements java.io.Serializable

A wrapper around java.net.URL that caches a copy of the content and adds functionality designed for HTML pages.

The cache of a URLCache object is brought up to date by calling refresh().
The URLCache object also records the time at which it was last updated successfully; the method lastUpdated() returns that time.

Most methods operate on the cached data to avoid reconnecting over the web for each operation. This makes it possible to call several methods on a URLCache object in sequence with reasonable performance.

If a content-accessing method is called before the data has been cached, the method forces a refresh and waits for it to complete. If the content cannot be refreshed because of an IOException, that exception is thrown immediately, and again by every subsequent content-accessing method call.

If refresh() is called again later, either the content is cached successfully (and the initial exception becomes irrelevant) or the stored exception is replaced by the new one.

If, once the content has been cached initially, subsequent attempts to refresh() the cache fail, all content-accessing methods still fall back to the cached data.
To find out why subsequent calls to refresh() failed, register a RefreshListener; its callback method receives the IOException when a refresh attempt is unsuccessful.
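
The fallback rules above can be illustrated with a minimal sketch. It is not the URLCache source: the class CacheStateSketch and its Fetcher interface are hypothetical stand-ins for the network access, and the refresh here is synchronous for simplicity.

```java
import java.io.IOException;

/** Hypothetical illustration of the caching rules; not the URLCache implementation. */
final class CacheStateSketch {
    /** Stand-in for fetching the URL content (an assumption of this example). */
    interface Fetcher {
        byte[] fetch() throws IOException;
    }

    private final Fetcher fetcher;
    private byte[] content;         // last successfully fetched data, or null
    private IOException lastError;  // failure of the latest refresh attempt, or null
    private boolean attempted;      // whether any refresh has been tried

    CacheStateSketch(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** Tries to refresh and returns the failure (or null), as a listener would receive it. */
    IOException refresh() {
        attempted = true;
        try {
            content = fetcher.fetch();
            lastError = null;
        } catch (IOException e) {
            lastError = e; // previously cached content, if any, is kept
        }
        return lastError;
    }

    /** Forces a first refresh if needed, prefers cached data, otherwise rethrows. */
    byte[] getContent() throws IOException {
        if (!attempted) {
            refresh();
        }
        if (content != null) {
            return content;
        }
        throw (lastError != null) ? lastError : new IOException("no content available");
    }

    /** Walks through: initial failure, successful refresh, later failed refresh. */
    static String demo() {
        boolean[] online = {false};
        CacheStateSketch cache = new CacheStateSketch(() -> {
            if (!online[0]) {
                throw new IOException("offline");
            }
            return new byte[] {1, 2, 3};
        });
        StringBuilder log = new StringBuilder();
        try {
            cache.getContent();
        } catch (IOException e) {
            log.append("fail:").append(e.getMessage());
        }
        online[0] = true;
        cache.refresh();                     // succeeds, content is now cached
        online[0] = false;
        IOException later = cache.refresh(); // fails, but cached data stays usable
        try {
            log.append(" len:").append(cache.getContent().length);
        } catch (IOException e) {
            log.append(" unexpected");
        }
        log.append(" later:").append(later.getMessage());
        return log.toString();
    }
}
```

In this sketch the failure of a later refresh is only visible through the return value of refresh(), which plays the role that the RefreshListener callback plays in the description above.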

A special case arises when the content of the embedded URL is too large to fit into memory at the moment; i.e. caching is impossible although the URL is accessible. In that case, methods like getContentAsString() throw an IOException with the message that the content is too large to fit into memory, while other methods (like saveContentToFile() or getInputStream()) work directly off the online content. Although many operations are likely to be difficult with content that cannot be cached because of its size, saveContentToFile() can still be used safely regardless of the size of the content (though it may take quite some time). The method tooLargeForCaching() indicates whether this is the case. Note that a later call to refresh(), after some memory has been freed up, may be able to cache the same object successfully.

In addition, this class maintains a static map serving as an application-wide cache for URLCache objects, which can be accessed using the put() and get() methods.

Currently, this implementation starts a new thread whenever refresh() is called. A future revision may need to reduce the performance overhead associated with this. The implementation targets large content on slow networks, where it makes sense to load the data for each URL concurrently in its own thread.

Note that many methods assume that the underlying content is HTML data; if that is untrue for a specific object, these methods may return empty objects.

Note that the only data that is actually cached is a byte array that represents the content fetched from the URL and a header map; all other information (title, links, images, etc.) will be calculated each time based on the cached byte array.

Since:
4/2/2002
See Also:
Spider, Serialized Form

Nested Class Summary
static interface URLCache.RefreshListener
          RefreshListener objects can register with URLCache objects to be notified when the URLCache object is refreshed
 
Constructor Summary
URLCache(java.lang.String spec)
          constructs the URLCache object based on the spec denoting the absolute path of the URL and without refresh
URLCache(java.net.URL url)
          calls URLCache(url, false)
URLCache(java.net.URL url, boolean refreshNow)
          constructs the URLCache object based on the given URL.
 
Method Summary
 void addRefreshListener(URLCache.RefreshListener listener)
           
 int bytesReadByCurrentRefresh()
          returns the number of bytes currently read by the refresh thread; returns -1 if no refresh in progress
 void clearCache()
          interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh
 boolean containsRefreshListener(URLCache.RefreshListener listener)
           
 boolean equals(java.lang.Object obj)
          tests equality on whether the embedded URL is the same file
 byte[] getContent()
          returns the raw cached content.
 java.lang.String getContentAsString()
           
 java.lang.String getContentAsString(java.lang.String charsetName)
           
 java.lang.String getContentEncoding()
          returns the header value from the cached content
 java.lang.String getContentType()
          returns the header value from the cached content
 java.lang.String getFileExtension()
          returns the file type denoted by the path of the URL.
 java.lang.String getHeaderField(java.lang.String fieldKey)
          retrieves the first field value matching the fieldKey based on case-insensitive key search
 java.util.Map getHeaderFields()
          returns a Map to the cached header fields
 javax.swing.text.html.HTMLDocument getHTMLDocument()
          returns a new HTMLDocument initialized with the cached content of this URLCache object.
 java.net.URL[] getImages()
          returns URLs to all unique images embedded in the cached HTML document
 java.io.InputStream getInputStream()
          returns an input stream from the cached content (suitable for binary data).
 long getLastModified()
          returns the header value from the cached content
 long getLastRefreshTime()
          returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed.
 java.net.URL[] getLinks()
          returns URLs of all links from the cached HTML document.
 java.io.Reader getReader()
          returns a reader from the cached content (suitable for non-binary data).
 java.io.Reader getReader(java.lang.String charsetName)
          returns a reader from the cached content by using the specified charset for decoding.
 int getRealContentLength()
          returns the actual length of the already cached data or -1 if the data is too large to fit into memory.
 URLCache.RefreshListener[] getRefreshListener()
           
 java.lang.String getTagText(javax.swing.text.html.HTML.Tag desiredTag, java.lang.String delimiter)
          returns all text from the HTML cache data that is found in the given tag.
 java.lang.String getTitle()
          returns the HTML title of the cached document
 java.net.URL getURL()
          returns the underlying URL object.
 int hashCode()
          hashes based on the embedded URL
 boolean isCached()
          returns true only if the content has ever been successfully refreshed before
 boolean isRefreshing()
          returns true only if the cache is currently being refreshed
 boolean isUpToDate()
          checks whether the timestamp provided by the online content is no later than your last successful refresh.
 long lastRefreshed()
          returns the time when the last refresh() attempt was performed - whether or not successful.
 long lastUpdated()
          returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet
 int peekContentLength()
          returns the content-length header field directly from the online data; the cache is neither affected nor used.
 void refresh()
          updates the cached content asynchronously with a fresh copy directly from the web.
 void refreshAndWait()
          returns only after the refresh finished
 void removeRefreshListener(URLCache.RefreshListener listener)
           
 void saveContentToFile(java.io.File file)
          calls saveContentToFile(file, false)
 void saveContentToFile(java.io.File file, boolean streamDirectlyFromURL)
          If the file might not fit into memory, streamDirectlyFromURL should be true to stream directly from the URL; otherwise this method simply writes the cache to the file.
 void stopCurrentRefresh()
          interrupts the currently ongoing refresh process (if any) and then returns; previously cached data and subsequent calls are unaffected
 java.lang.String stripText()
          calls the other stripText() method with a line break as delimiter
 java.lang.String stripText(java.lang.String delimiter)
          returns a String containing the text of all HTML tag types, separated by the given delimiter
 boolean tooLargeForCaching()
          return value of true indicates that even though the content to the URL is accessible, the data is too large to be cached given the current memory.
 java.lang.String toString()
           
 boolean verifyContent()
          checks whether the cached content equals the current live online content.
 void waitForRefresh()
          This method only returns after ensuring that a cached result from a previous refresh() is available (either the cached data or the cached IOException).
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

URLCache

public URLCache(java.lang.String spec)
         throws java.net.MalformedURLException
constructs the URLCache object based on the spec denoting the absolute path of the URL and without refresh

Throws:
java.net.MalformedURLException
See Also:
URLCache(URL,boolean)

URLCache

public URLCache(java.net.URL url)
calls URLCache(url, false)

See Also:
URLCache(URL,boolean)

URLCache

public URLCache(java.net.URL url,
                boolean refreshNow)
constructs the URLCache object based on the given URL. If refreshNow is true, an immediate call to refresh() is made in order to obtain a cached version; the refresh is performed asynchronously, i.e. the constructor returns immediately. A delay may then only be experienced when the first content accessing method is called.

See Also:
refresh()
Method Detail

refreshAndWait

public void refreshAndWait()
returns only after the refresh finished

See Also:
refresh()

refresh

public void refresh()
updates the cached content asynchronously with a fresh copy directly from the web.
This method returns immediately; the refresh is performed in a separate thread.

If you want to be notified when a URLCache object has completed a refresh call (whether or not the refresh was successful), you can register a RefreshListener. The call to the RefreshListener also indicates whether the refresh succeeded; if it did not, the IOException that caused this latest refresh() to fail is included as well.

As only one refresh thread per instance is allowed at a time, a call to refresh() may cause listeners to be notified of a failure due to concurrent refresh() calls; the IOException included in the callback will reflect that. In fact, this method does not allow a subsequent refresh less than a second after the last refresh finished (in those cases, the method call is simply ignored). If a refresh attempt is unsuccessful, all previously cached data is kept as of the time that lastUpdated() indicates. The time of the last refresh attempt can be retrieved by calling lastRefreshed(); if the last call was successful, lastUpdated() and lastRefreshed() return the same value.

See Also:
addRefreshListener(URLCache.RefreshListener), lastRefreshed(), lastUpdated(), isRefreshing(), refreshAndWait(), URLCache.RefreshListener
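
A minimal sketch of the asynchronous pattern described for refresh(), written against the JDK only. The names AsyncRefreshSketch, Fetcher and Listener are hypothetical, the listener signature is an assumption (the actual URLCache.RefreshListener callback is not shown on this page), and the one-second throttle is left out.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration of a one-at-a-time background refresh with listeners. */
final class AsyncRefreshSketch {
    interface Fetcher {
        byte[] fetch() throws IOException;
    }

    /** Assumed callback shape: success flag plus the failure cause, if any. */
    interface Listener {
        void refreshed(boolean success, IOException error);
    }

    private final Fetcher fetcher;
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private volatile byte[] content;

    AsyncRefreshSketch(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    boolean isRefreshing() {
        return refreshing.get();
    }

    /** Returns immediately; the fetch runs on a separate thread. */
    void refresh() {
        if (!refreshing.compareAndSet(false, true)) {
            // Only one refresh at a time: report the concurrent call as a failure.
            notifyListeners(false, new IOException("refresh already in progress"));
            return;
        }
        Thread worker = new Thread(() -> {
            IOException error = null;
            try {
                content = fetcher.fetch();
            } catch (IOException e) {
                error = e; // previously cached content is left untouched
            } finally {
                refreshing.set(false);
            }
            notifyListeners(error == null, error);
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void notifyListeners(boolean success, IOException error) {
        for (Listener listener : listeners) {
            listener.refreshed(success, error);
        }
    }

    /** Starts a refresh and blocks until the listener has been called. */
    static String demo() {
        CountDownLatch done = new CountDownLatch(1);
        StringBuilder result = new StringBuilder();
        AsyncRefreshSketch cache = new AsyncRefreshSketch(() -> "hello".getBytes());
        cache.addListener((ok, err) -> {
            result.append(ok ? "ok" : "failed");
            done.countDown();
        });
        cache.refresh();
        try {
            if (!done.await(5, TimeUnit.SECONDS)) {
                return "timeout";
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return result + ":" + cache.content.length;
    }
}
```

Waiting on a latch inside a listener, as demo() does, is roughly what a caller of refreshAndWait() gets without writing the synchronization by hand.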

stopCurrentRefresh

public void stopCurrentRefresh()
interrupts the currently ongoing refresh process (if any) and then returns; previously cached data and subsequent calls are unaffected

See Also:
refresh()

clearCache

public void clearCache()
interrupts any ongoing refresh process and clears the cache; subsequent calls to any content will force a new refresh

See Also:
refresh()

waitForRefresh

public void waitForRefresh()
This method only returns after ensuring that a cached result from a previous refresh() is available (either the cached data or the cached IOException).

See Also:
refresh(), isCached()

getHeaderFields

public java.util.Map getHeaderFields()
                              throws java.io.IOException
returns a Map to the cached header fields

Throws:
java.io.IOException - if the headers could not be cached
See Also:
URLConnection.getHeaderFields()

getHeaderField

public java.lang.String getHeaderField(java.lang.String fieldKey)
                                throws java.io.IOException
retrieves the first field value matching the fieldKey based on case-insensitive key search

Parameters:
fieldKey - if null, it returns the HTTP response if available
Throws:
java.io.IOException
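
The lookup rule for getHeaderField() can be sketched as follows, assuming the header map has the shape returned by java.net.URLConnection.getHeaderFields(), where HTTP connections store the status line under a null key. The class HeaderLookupSketch and its sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a case-insensitive first-value header lookup. */
final class HeaderLookupSketch {
    /** Returns the first value whose key matches fieldKey ignoring case, or null. */
    static String firstValue(Map<String, List<String>> headers, String fieldKey) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey();
            // A null fieldKey selects the entry stored under the null key (the status line).
            boolean match = (fieldKey == null) ? key == null : fieldKey.equalsIgnoreCase(key);
            List<String> values = entry.getValue();
            if (match && values != null && !values.isEmpty()) {
                return values.get(0);
            }
        }
        return null;
    }

    /** Sample header map in the shape URLConnection.getHeaderFields() uses. */
    static Map<String, List<String>> sample() {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("Content-Type", Arrays.asList("text/html"));
        return headers;
    }
}
```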

getContentType

public java.lang.String getContentType()
                                throws java.io.IOException
returns the header value from the cached content

Throws:
java.io.IOException

getContentEncoding

public java.lang.String getContentEncoding()
                                    throws java.io.IOException
returns the header value from the cached content

Throws:
java.io.IOException

getLastModified

public long getLastModified()
                     throws java.io.IOException
returns the header value from the cached content

Throws:
java.io.IOException

getContent

public byte[] getContent()
                  throws java.io.IOException
returns the raw cached content. null is returned only if content cannot be cached (due to memory limitations)

WARNING: Altering the returned array alters the internal cache, and the change remains in effect until the next successful refresh.

Throws:
java.io.IOException

getContentAsString

public java.lang.String getContentAsString()
                                    throws java.io.IOException
Throws:
java.io.IOException

getContentAsString

public java.lang.String getContentAsString(java.lang.String charsetName)
                                    throws java.io.IOException
Throws:
java.io.IOException

getReader

public java.io.Reader getReader()
                         throws java.io.IOException
returns a reader from the cached content (suitable for non-binary data). If the data is too large to be cached, this method returns a reader to the online content.

Throws:
java.io.IOException

getReader

public java.io.Reader getReader(java.lang.String charsetName)
                         throws java.io.IOException
returns a reader from the cached content by using the specified charset for decoding. If the data is too large to be cached, this method returns a reader to the online content.

Throws:
java.io.IOException

getInputStream

public java.io.InputStream getInputStream()
                                   throws java.io.IOException
returns an input stream from the cached content (suitable for binary data). Only if the data is too large to be cached does this method return a stream to the online content.

Throws:
java.io.IOException

getTitle

public java.lang.String getTitle()
                          throws java.io.IOException
returns the HTML title of the cached document

Throws:
java.io.IOException

getLinks

public java.net.URL[] getLinks()
                        throws java.io.IOException
returns URLs of all links from the cached HTML document. Note that duplicate links will only be included once

Throws:
java.io.IOException
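
As an illustration of how unique links can be derived from cached HTML each time they are requested, the sketch below uses the JDK's javax.swing.text.html parser. Whether URLCache uses this parser internally is an assumption; the class LinkExtractionSketch and its method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical illustration of collecting unique link targets from HTML text. */
final class LinkExtractionSketch {
    /** Returns each distinct href of an anchor tag, resolved against baseUrl, in document order. */
    static List<String> links(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> found = new LinkedHashSet<>(); // keeps order, drops duplicates
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        // resolve() throws IllegalArgumentException for malformed hrefs;
                        // a robust version would catch that and skip the link.
                        found.add(base.resolve(href.toString()).toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException("reading from a String should not fail", e);
        }
        return new ArrayList<>(found);
    }
}
```

Because the links are recomputed from the text on every call, repeated calls cost a full parse each time, which matches the note that only the byte array and the header map are cached.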

getImages

public java.net.URL[] getImages()
                         throws java.io.IOException
returns URLs to all unique images embedded in the cached HTML document

Throws:
java.io.IOException

getHTMLDocument

public javax.swing.text.html.HTMLDocument getHTMLDocument()
                                                   throws java.io.IOException
returns a new HTMLDocument initialized with the cached content of this URLCache object. If - for some reason - a BadLocationException is caught, this method returns null.

Throws:
java.io.IOException

getTagText

public java.lang.String getTagText(javax.swing.text.html.HTML.Tag desiredTag,
                                   java.lang.String delimiter)
                            throws java.io.IOException
returns all text from the HTML cache data that is found in the given tag.
The separate text sequences found are delimited by the given delimiter.

Throws:
java.io.IOException

stripText

public java.lang.String stripText()
                           throws java.io.IOException
calls the other stripText() method with a line break as delimiter

Throws:
java.io.IOException

stripText

public java.lang.String stripText(java.lang.String delimiter)
                           throws java.io.IOException
returns a String containing the text of all HTML tag types, separated by the given delimiter

Throws:
java.io.IOException

isRefreshing

public boolean isRefreshing()
returns true only if the cache is currently being refreshed


lastRefreshed

public long lastRefreshed()
returns the time when the last refresh() attempt was performed, whether or not it was successful. If the last attempt was successful, the returned value is identical to the return value of lastUpdated()

See Also:
lastUpdated()

lastUpdated

public long lastUpdated()
returns the time when this object was last refreshed successfully; 0 is returned if no refresh has been performed yet

See Also:
lastRefreshed()

getLastRefreshTime

public long getLastRefreshTime()
returns the time taken by the last successful refresh; -1 is returned if content was never successfully refreshed.

See Also:
refresh()

verifyContent

public boolean verifyContent()
                      throws java.io.IOException
checks whether the cached content equals the current live online content. This method will connect to the actual URL and only return once the cached data has been fully compared to the live content.

It is a bad idea to call this method to see whether a refresh() is needed, as a call to this method is just as expensive as refresh() itself, except that the latter returns immediately. If you just want to check whether the timestamp provided by the online content is no later than your last successful refresh, use isUpToDate() instead.

Returns:
true if the cached content is equal to the live online content and false if the content is different
Throws:
java.io.IOException - if the connection to the live online content failed
See Also:
isUpToDate()

isUpToDate

public boolean isUpToDate()
                   throws java.io.IOException
checks whether the timestamp provided by the online content is no later than your last successful refresh.

Note that the result may not be accurate if the header timestamp of the online content is incorrect or missing.

If you need to verify that the exact online content is in fact identical to the cached content, use verifyContent() instead.

Throws:
java.io.IOException
See Also:
verifyContent()
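
The difference between the two checks can be sketched in a few lines. FreshnessSketch is hypothetical, and it assumes that a missing timestamp arrives as 0, which is what java.net.URLConnection.getLastModified() reports when the header is unknown.

```java
import java.util.Arrays;

/** Hypothetical illustration of the cheap timestamp check versus a full comparison. */
final class FreshnessSketch {
    /** Timestamp check: needs only the remote last-modified header value. */
    static boolean isUpToDate(long remoteLastModified, long lastUpdated) {
        // A missing header (0) compares as "up to date", which may be wrong.
        return remoteLastModified <= lastUpdated;
    }

    /** Full check: needs every byte of the live content to be downloaded first. */
    static boolean sameContent(byte[] cached, byte[] live) {
        return Arrays.equals(cached, live);
    }
}
```

The first method needs only header data, while the second needs the whole body, which is why verifyContent() costs about as much as a refresh.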

getURL

public java.net.URL getURL()
returns the underlying URL object.
Note that any operation on the URL directly cannot take advantage of the caching in URLCache.


addRefreshListener

public void addRefreshListener(URLCache.RefreshListener listener)
See Also:
removeRefreshListener(URLCache.RefreshListener), refresh()

removeRefreshListener

public void removeRefreshListener(URLCache.RefreshListener listener)
See Also:
addRefreshListener(URLCache.RefreshListener)

containsRefreshListener

public boolean containsRefreshListener(URLCache.RefreshListener listener)

getRefreshListener

public URLCache.RefreshListener[] getRefreshListener()

isCached

public boolean isCached()
returns true only if the content has ever been successfully refreshed before


bytesReadByCurrentRefresh

public int bytesReadByCurrentRefresh()
returns the number of bytes currently read by the refresh thread; returns -1 if no refresh in progress


getRealContentLength

public int getRealContentLength()
                         throws java.io.IOException
returns the actual length of the already cached data or -1 if the data is too large to fit into memory.

Note that before this method can return, this object will already have attempted to load the entire content into memory. If you want to avoid that and just peek at the content length reported by the online content, use peekContentLength() instead.

Throws:
java.io.IOException - if the data cannot be accessed
See Also:
peekContentLength()

peekContentLength

public int peekContentLength()
                      throws java.io.IOException
returns the content-length header field directly from the online data; the cache is neither affected nor used.

You can use this method if you want to get information about the content length before you attempt to download or cache the content. If you need a content length that is always accurate, use getRealContentLength(), which returns the exact length of the cached content (though only after the entire content has been loaded).

Throws:
java.io.IOException
See Also:
getRealContentLength(), Spider.getContentLength()

tooLargeForCaching

public boolean tooLargeForCaching()
return value of true indicates that even though the content to the URL is accessible, the data is too large to be cached given the current memory.
This method can only return true after a refresh attempt has failed due to the size of the content.

If this method returns true, you can still obtain an input stream or a reader to the data; these methods will then simply read from the online content. You can also still save the data to a file. Methods like getContentAsString(), however, will then throw an IOException stating that there is no memory available.

This method may also return true if memory is simply exhausted at a particular point in time and the URL content is actually not that large; in that case, you may try refreshing again after freeing up some memory, or you can check the online content length (if provided for the URL in question) with peekContentLength().

See Also:
peekContentLength()
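
One way to picture the "too large" outcome is a bounded read that gives up once a size limit is exceeded. Note that this swaps in an explicit byte limit for illustration; the description above speaks of available memory, not of a fixed limit. BoundedReadSketch and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration: cache a stream only if it fits within a byte limit. */
final class BoundedReadSketch {
    /** Returns all bytes of the stream, or null if it holds more than maxBytes. */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            if (out.size() + n > maxBytes) {
                return null; // too large to cache; callers fall back to streaming
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the cached length, -1 if too large, -2 on an I/O failure. */
    static int demo(int size, int limit) {
        try {
            byte[] data = readUpTo(new ByteArrayInputStream(new byte[size]), limit);
            return data == null ? -1 : data.length;
        } catch (IOException e) {
            return -2;
        }
    }
}
```

The null and -1 results mirror the documented behavior of getContent() and getRealContentLength() for content that cannot be cached.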

getFileExtension

public java.lang.String getFileExtension()
returns the file type denoted by the path of the URL. The extension is the String of those characters that follow the last 'dot' (".") in the file name in lowercase. If no extension is present, null is returned.
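
The extension rule reads naturally as a small string function. The sketch below is hypothetical (FileExtensionSketch is not part of the library) and assumes the input is the path part of the URL; returning null for a name that ends in a dot is also an assumption.

```java
import java.util.Locale;

/** Hypothetical illustration of deriving a lowercase extension from a URL path. */
final class FileExtensionSketch {
    /** Returns the characters after the last dot of the file name in lowercase, or null. */
    static String extensionOf(String urlPath) {
        // Look only at the file name so that dots in directory names are ignored.
        String name = urlPath.substring(urlPath.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null; // no extension present
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```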


saveContentToFile

public void saveContentToFile(java.io.File file)
                       throws java.io.IOException
calls saveContentToFile(file, false)

Throws:
java.io.IOException
See Also:
saveContentToFile(File, boolean)

saveContentToFile

public void saveContentToFile(java.io.File file,
                              boolean streamDirectlyFromURL)
                       throws java.io.IOException
If the file might not fit into memory, streamDirectlyFromURL should be true to stream directly from the URL; otherwise this method simply writes the cache to the file.

Throws:
java.io.IOException
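
The two modes differ in where the bytes come from: a byte array already in memory, or a stream that is copied through a small buffer so the whole content never has to be held at once. The sketch below shows both with java.nio.file.Files; SaveToFileSketch is hypothetical, and an in-memory stream stands in for the URL's input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical illustration of writing cached bytes versus streaming to a file. */
final class SaveToFileSketch {
    /** Cached mode: the whole content is already held in memory. */
    static void saveCached(byte[] cached, Path target) throws IOException {
        Files.write(target, cached);
    }

    /** Streaming mode: memory use stays bounded regardless of the content size. */
    static long saveStreaming(InputStream source, Path target) throws IOException {
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Streams 10,000 bytes into a temporary file and returns the resulting file size. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("urlcache-sketch", ".bin");
            try (InputStream in = new ByteArrayInputStream(new byte[10_000])) {
                saveStreaming(in, tmp);
            }
            long size = Files.size(tmp);
            Files.delete(tmp);
            return size;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```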

equals

public boolean equals(java.lang.Object obj)
tests equality on whether the embedded URL is the same file

Overrides:
equals in class java.lang.Object

hashCode

public int hashCode()
hashes based on the embedded URL

Overrides:
hashCode in class java.lang.Object

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object