public interface Frontier
extends org.springframework.context.Lifecycle, org.archive.util.Reporter
A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to) tracking which URIs have been discovered, which are queued, and which have finished processing.
The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server, or IP address to 1 (or any other fixed amount), enforcing delays between connections, etc.
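The per-host politeness bookkeeping described above can be sketched as a simple gate. The class, method names, and 3-second delay below are invented for illustration; real Heritrix frontiers drive this behavior from crawl settings rather than a hard-coded map:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of Frontier politeness bookkeeping: a host may only
// be contacted again after a minimum delay has elapsed. Illustrative
// only; not the real Heritrix API.
public class PolitenessDemo {
    private final Map<String, Long> nextAllowed = new HashMap<>();
    private final long delayMillis;

    PolitenessDemo(long delayMillis) { this.delayMillis = delayMillis; }

    /** Returns true and books the slot if the host may be contacted now. */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, 0L);
        if (nowMillis < earliest) {
            return false; // still inside the politeness delay
        }
        nextAllowed.put(host, nowMillis + delayMillis);
        return true;
    }

    public static void main(String[] args) {
        PolitenessDemo gate = new PolitenessDemo(3000);
        System.out.println(gate.tryAcquire("example.com", 0));    // true
        System.out.println(gate.tryAcquire("example.com", 1000)); // false: too soon
        System.out.println(gate.tryAcquire("example.org", 1000)); // true: different host
    }
}
```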
A URI Frontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among the modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.
The methods defined in this interface are those required to get URIs for processing, to report the results of processing back (by ToeThreads), and to access various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods make it possible to inspect and manipulate the Frontier at runtime.
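The get/report cycle those methods describe can be sketched with a minimal stand-in. The class below is illustrative only and is not the real Heritrix API; it mimics the next()/finished() contract that a ToeThread-style worker loop follows:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy stand-in for the Frontier contract: a worker repeatedly calls
// next(), processes the URI, then reports back via finished(). Method
// names mirror the interface; the internals are illustrative only.
class ToyFrontier {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final AtomicLong finished = new AtomicLong();

    void schedule(String uri) { pending.add(uri); }

    String next() throws InterruptedException { return pending.take(); }

    void finished(String uri) { finished.incrementAndGet(); }

    long finishedUriCount() { return finished.get(); }
    long queuedUriCount() { return pending.size(); }
}

public class FrontierLoopDemo {
    public static void main(String[] args) throws Exception {
        ToyFrontier frontier = new ToyFrontier();
        frontier.schedule("http://example.com/");
        frontier.schedule("http://example.org/");

        // A ToeThread-style work loop: fetch, "process", report back.
        while (frontier.queuedUriCount() > 0) {
            String uri = frontier.next();
            // ... fetch and extract links here ...
            frontier.finished(uri);
        }
        System.out.println(frontier.finishedUriCount()); // prints 2
    }
}
```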
The statistical data exposed by this interface is:
- Discovered URIs
- Queued URIs
- Finished URIs
- Successfully processed URIs
- Failed to process URIs
- Disregarded URIs
- Total bytes written
In addition the frontier may optionally implement an interface that exposes information about hosts.
Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents on the ApplicationContext to allow statistics modules or other interested observers to collect information about each completed URI's processing.
All URI Frontiers inherit from ModuleType, and therefore creating settings follows the usual pattern of pluggable modules in Heritrix.
See Also:
CrawlController, org.archive.crawler.framework.CrawlController#fireCrawledURIDisregardEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURIFailureEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURINeedRetryEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURISuccessfulEvent(CrawlURI), org.archive.crawler.framework.StatisticsTracker, ToeThread, org.archive.crawler.settings.ModuleType
Modifier and Type | Interface and Description
---|---
static interface | Frontier.FrontierGroup - Generic interface representing the internal groupings of a Frontier's URIs -- usually queues.
static class | Frontier.State - Enumeration of possible target states.
Modifier and Type | Method and Description
---|---
long | averageDepth() - Average depth of the last URI in all eligible queues.
void | beginDisposition(CrawlURI curi) - Inform the frontier that a block of processing that should complete atomically with respect to checkpoints is about to begin.
float | congestionRatio() - Ratio of the number of threads that would theoretically allow maximum crawl progress (if each were as productive as current threads) to the current number of threads.
void | considerIncluded(CrawlURI curi) - Notify the Frontier that it should consider the given URI as if already scheduled.
long | deepestUri() - Ordinal position of the 'deepest' URI eligible for crawling.
void | deleted(CrawlURI curi) - Notify the Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
long | deleteURIs(String queueRegex, String match) - Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
long | discoveredUriCount() - Number of discovered URIs.
long | disregardedUriCount() - Number of URIs that were scheduled at one point but have since been disregarded.
void | endDisposition() - Inform the frontier that the processing signalled by an earlier pending beginDisposition() call has finished.
long | failedFetchCount() - Number of URIs that failed to process.
void | finished(CrawlURI cURI) - Report a URI being processed as having finished processing.
long | finishedUriCount() - Number of URIs that have finished processing.
long | futureUriCount()
String | getClassKey(CrawlURI cauri)
FrontierJournal | getFrontierJournal()
Frontier.FrontierGroup | getGroup(CrawlURI curi) - Get the 'frontier group' (usually a queue) for the given CrawlURI.
DecideRule | getScope()
CompositeData | getURIsList(String marker, int numberOfMatches, String regex, boolean verbose) - Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
long | importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) - Import URIs from the given file (in recover-log-like format, with a 3-character 'type' tag preceding a URI with optional hops/via).
void | importURIs(String params) - Load URIs from a file, for scheduling and/or considered-included status (if from a recovery log).
boolean | isEmpty() - Returns true if the frontier contains no more URIs to crawl.
CrawlURI | next() - Get the next URI that should be processed.
void | pause() - Notify the Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
long | queuedUriCount() - Number of URIs queued up and waiting for processing.
void | requestState(Frontier.State target) - Request that the Frontier reach the given state as soon as possible.
void | run() - Request that the Frontier allow crawling to begin.
void | schedule(CrawlURI caURI) - Schedules a CrawlURI.
long | succeededFetchCount() - Number of successfully processed URIs.
void | terminate() - Notify the Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
void | unpause() - Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
CrawlURI next() throws InterruptedException
Throws:
InterruptedException
boolean isEmpty()
Returns true only if there are no more URIs to crawl: no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts, and no pending URIs in the Frontier. Thus this method may return false even when there is no currently available URI.
void schedule(CrawlURI caURI)
This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled; it only means that the URI will be placed in its respective queue at once. For priority scheduling see CrawlURI.setSchedulingDirective(int).
This method should be synchronized in all implementing classes.
Parameters:
caURI - The URI to schedule.
See Also:
CrawlURI.setSchedulingDirective(int)
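The effect of a scheduling directive can be sketched with a priority queue. In Heritrix the directive is set on the CrawlURI via setSchedulingDirective(int) before schedule() is called; the constants and record type below imitate that idea and are not the real API:

```java
import java.util.PriorityQueue;

// Sketch of how a scheduling directive influences queue order: entries
// with a lower directive value (higher priority) are emitted first.
// HIGH/MEDIUM/NORMAL and Entry are invented for this illustration.
public class DirectiveDemo {
    static final int HIGH = 1, MEDIUM = 2, NORMAL = 3;

    record Entry(String uri, int directive) {}

    public static void main(String[] args) {
        PriorityQueue<Entry> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.directive(), b.directive()));
        queue.add(new Entry("http://example.com/page", NORMAL));
        // A prerequisite such as robots.txt is typically given priority:
        queue.add(new Entry("http://example.com/robots.txt", HIGH));
        System.out.println(queue.poll().uri()); // the robots.txt URI comes out first
    }
}
```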
void finished(CrawlURI cURI)
ToeThreads will invoke this method once they have completed work on their assigned URI.
This method is synchronized.
Parameters:
cURI - The URI that has finished processing.

long discoveredUriCount()
That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier has decided that it should be processed). This includes URIs that have been processed, are being processed, and have finished processing. It does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process, and finished items, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.
long queuedUriCount()
This includes any URIs that failed but will be retried. Essentially, this is any discovered URI that has not been processed and is not currently being processed. The same discovered URI can be queued multiple times.
long futureUriCount()
long deepestUri()
long averageDepth()
float congestionRatio()
long finishedUriCount()
Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
long succeededFetchCount()
Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.
long failedFetchCount()
URIs that could not be processed because of some error or failure in the processing chain. This can include failure to acquire prerequisites, failure to establish a connection with the host, and any number of other problems. Does not count those that will be retried, only those that have permanently failed.
long disregardedUriCount()
Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
void importURIs(String params) throws IOException
The 'params' argument is a JSON object (passed as a String) describing the source file to use and the options in effect regarding its format and handling. Significant keys are:
- "path": full path to the source file. If the path ends in '.gz', it will be considered to be GZIP compressed.
- "format": one of "onePer", "crawlLog", or "recoveryLog"
- "forceRevisit": if non-null, URIs will be force-scheduled even if already considered included
- "scopeSchedules": if non-null, any URI imported will be checked against the frontier's configured scope before scheduling
If the "format" is "recoveryLog", 7 more keys are significant:
- "includeSuccesses": if non-null, success lines ("Fs") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.)
- "includeFailures": if non-null, failure lines ("Ff") in the log will be considered-included. (Sometimes, this is desired.)
- "includeScheduleds": if non-null, scheduled lines ("F+") in the log will be considered-included. (Atypical, but an option for completeness.)
- "scopeIncludes": if non-null, any of the above will be checked against the frontier's configured scope before consideration
- "scheduleSuccesses": if non-null, success lines ("Fs") in the log will be schedule-attempted. (Atypical, as all successes are preceded by "F+" lines.)
- "scheduleFailures": if non-null, failure lines ("Ff") in the log will be schedule-attempted. (Atypical, as all failures are preceded by "F+" lines.)
- "scheduleScheduleds": if non-null, scheduled lines ("F+") in the log will be schedule-attempted. (Usually, this is the aim of a recovery-log import.)
TODO: add parameter for auto-unpause-at-good-time
Parameters:
params - JSON object (as a String) describing the source file and options as above
Throws:
IOException - If problems occur reading the file.
org.json.JSONException
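Assembling the 'params' string for a typical recovery-log import might look like the sketch below. The key names follow the list above; the file path is invented, and the frontier instance on which importURIs() would be called is assumed to be wired up elsewhere:

```java
// Hypothetical sketch of building the JSON 'params' argument for
// importURIs(): mark past successes as already-included and re-attempt
// the still-pending ("F+") lines of a recovery log.
public class ImportParamsDemo {
    static String recoveryParams(String path) {
        return "{"
            + "\"path\":\"" + path + "\","
            + "\"format\":\"recoveryLog\","
            + "\"includeSuccesses\":true,"
            + "\"scheduleScheduleds\":true"
            + "}";
    }

    public static void main(String[] args) {
        String params = recoveryParams("/crawls/job1/logs/recover.gz"); // path is invented
        System.out.println(params);
        // then: frontier.importURIs(params);
    }
}
```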
long importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) throws IOException
Parameters:
source - recovery log File to use (may be .gz compressed)
applyScope - whether to apply the crawl scope to URIs
includeOnly - whether to only add to the included filter, not schedule
forceFetch - whether to force fetching, even if already seen (ignored if includeOnly is set)
acceptTags - String regex; only lines whose first field matches will be included
Throws:
IOException
CompositeData getURIsList(String marker, int numberOfMatches, String regex, boolean verbose)
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, has not terminally failed, has not been disregarded, and is not currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.
The list is a set of strings containing the URI strings. If verbose is true, each string will include some additional information (path to URI and parent).
The URIFrontierMarker will be advanced to the position at which its maximum number of matches was found. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.
While the order returned is consistent, it has no explicit relation to the likely order in which the URIs may be processed.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if true, the strings returned will contain additional information about each URI beyond their names.
Throws:
InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
FrontierMarker, #getInitialMarker(String, boolean)
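The marker-based paging contract can be sketched with a toy pager. Here the "marker" is a plain int index, invented for the illustration; the real FrontierMarker is opaque frontier state, and changes to the frontier can invalidate it:

```java
import java.util.List;

// Sketch of getURIsList()-style paging: the marker advances past each
// returned batch, so reusing it on the next call yields the following
// batch. Illustrative only; not the real Heritrix marker type.
public class MarkerPagingDemo {
    static class Marker { int position = 0; }

    static List<String> nextBatch(List<String> pending, Marker marker, int numberOfMatches) {
        int from = marker.position;
        int to = Math.min(from + numberOfMatches, pending.size());
        marker.position = to; // advance past the returned matches
        return pending.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> pending = List.of("u1", "u2", "u3", "u4", "u5");
        Marker m = new Marker();
        System.out.println(nextBatch(pending, m, 2)); // [u1, u2]
        System.out.println(nextBatch(pending, m, 2)); // [u3, u4]
        System.out.println(nextBatch(pending, m, 2)); // [u5]
    }
}
```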
long deleteURIs(String queueRegex, String match)
Any encountered URI that has not been successfully crawled, has not terminally failed, has not been disregarded, and is not currently being processed is considered to be a pending URI.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
match - A regular expression; any URIs that match it will be deleted.

void deleted(CrawlURI curi)
Parameters:
curi - Deleted CrawlURI.

void considerIncluded(CrawlURI curi)
Parameters:
curi - CrawlURI instance to add to the Already Included set.

void pause()
void unpause()
void terminate()
FrontierJournal getFrontierJournal()
Returns the FrontierJournal that this Frontier is using. May be null if there is no journaling.

String getClassKey(CrawlURI cauri)
Parameters:
cauri - CrawlURI for which we're to calculate and set the class key.
Returns:
the class key for cauri.

DecideRule getScope()
void run()
Frontier.FrontierGroup getGroup(CrawlURI curi)
Parameters:
curi - CrawlURI for which to find the matching group

void requestState(Frontier.State target)
Parameters:
target - Frontier.State to pursue

void beginDisposition(CrawlURI curi)
Parameters:
curi -

void endDisposition()
Copyright © 2003-2014 Internet Archive. All Rights Reserved.