public interface Frontier
extends org.springframework.context.Lifecycle, org.archive.util.Reporter
A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to) tracking which URIs have been discovered, which are queued, and which have finished processing.
The Frontier is also responsible for enforcing any politeness restrictions that may have been applied to the crawl, such as limiting simultaneous connections to the same host, server, or IP address to 1 (or any other fixed amount), enforcing delays between connections, etc.
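The per-host politeness bookkeeping described above can be sketched as a simple gate. The class, method names, and 3-second delay below are invented for illustration; real Heritrix frontiers drive this behavior from crawl settings rather than a hard-coded map:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of Frontier politeness bookkeeping: a host may only
// be contacted again after a minimum delay has elapsed. Illustrative
// only; not the real Heritrix API.
public class PolitenessDemo {
    private final Map<String, Long> nextAllowed = new HashMap<>();
    private final long delayMillis;

    PolitenessDemo(long delayMillis) { this.delayMillis = delayMillis; }

    /** Returns true and books the slot if the host may be contacted now. */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, 0L);
        if (nowMillis < earliest) {
            return false; // still inside the politeness delay
        }
        nextAllowed.put(host, nowMillis + delayMillis);
        return true;
    }

    public static void main(String[] args) {
        PolitenessDemo gate = new PolitenessDemo(3000);
        System.out.println(gate.tryAcquire("example.com", 0));    // true
        System.out.println(gate.tryAcquire("example.com", 1000)); // false: too soon
        System.out.println(gate.tryAcquire("example.org", 1000)); // true: different host
    }
}
```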
A URI Frontier is created by the CrawlController, which is in turn responsible for providing access to it. Most significant among the modules interested in the Frontier are the ToeThreads, which perform the actual work of processing a URI.
The methods defined in this interface are those required to get URIs for processing, to report the results of processing back (by ToeThreads), and to access various statistical data along the way. The statistical data is of interest to Statistics Tracking modules. A couple of additional methods make it possible to inspect and manipulate the Frontier at runtime.
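The get/report cycle those methods describe can be sketched with a minimal stand-in. The class below is illustrative only and is not the real Heritrix API; it mimics the next()/finished() contract that a ToeThread-style worker loop follows:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy stand-in for the Frontier contract: a worker repeatedly calls
// next(), processes the URI, then reports back via finished(). Method
// names mirror the interface; the internals are illustrative only.
class ToyFrontier {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final AtomicLong finished = new AtomicLong();

    void schedule(String uri) { pending.add(uri); }

    String next() throws InterruptedException { return pending.take(); }

    void finished(String uri) { finished.incrementAndGet(); }

    long finishedUriCount() { return finished.get(); }
    long queuedUriCount() { return pending.size(); }
}

public class FrontierLoopDemo {
    public static void main(String[] args) throws Exception {
        ToyFrontier frontier = new ToyFrontier();
        frontier.schedule("http://example.com/");
        frontier.schedule("http://example.org/");

        // A ToeThread-style work loop: fetch, "process", report back.
        while (frontier.queuedUriCount() > 0) {
            String uri = frontier.next();
            // ... fetch and extract links here ...
            frontier.finished(uri);
        }
        System.out.println(frontier.finishedUriCount()); // prints 2
    }
}
```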
The statistical data exposed by this interface is:
- Discovered URIs
- Queued URIs
- Finished URIs
- Successfully processed URIs
- Failed to process URIs
- Disregarded URIs
- Total bytes written
In addition the frontier may optionally implement an interface that exposes information about hosts.
Furthermore, any implementation of the URI Frontier should trigger CrawlURIDispositionEvents on the ApplicationContext to allow statistics modules or other interested observers to collect information about each completed URI's processing.
All URI Frontiers inherit from ModuleType, and therefore creating settings follows the usual pattern of pluggable modules in Heritrix.
See Also:
CrawlController, org.archive.crawler.framework.CrawlController#fireCrawledURIDisregardEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURIFailureEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURINeedRetryEvent(CrawlURI), org.archive.crawler.framework.CrawlController#fireCrawledURISuccessfulEvent(CrawlURI), org.archive.crawler.framework.StatisticsTracker, ToeThread, org.archive.crawler.settings.ModuleType
Modifier and Type | Interface and Description
---|---
static interface | Frontier.FrontierGroup - Generic interface representing the internal groupings of a Frontier's URIs -- usually queues.
static class | Frontier.State - Enumeration of possible target states.
Modifier and Type | Method and Description
---|---
long | averageDepth() - Average depth of the last URI in all eligible queues.
void | beginDisposition(CrawlURI curi) - Inform the frontier that a block of processing that should complete atomically with respect to checkpoints is about to begin.
float | congestionRatio() - Ratio of the number of threads that would theoretically allow maximum crawl progress (if each were as productive as current threads) to the current number of threads.
void | considerIncluded(CrawlURI curi) - Notify the Frontier that it should consider the given URI as if already scheduled.
long | deepestUri() - Ordinal position of the 'deepest' URI eligible for crawling.
void | deleted(CrawlURI curi) - Notify the Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
long | deleteURIs(String queueRegex, String match) - Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
long | discoveredUriCount() - Number of discovered URIs.
long | disregardedUriCount() - Number of URIs that were scheduled at one point but have since been disregarded.
void | endDisposition() - Inform the frontier that the processing signalled by an earlier pending beginDisposition() call has finished.
long | failedFetchCount() - Number of URIs that failed to process.
void | finished(CrawlURI cURI) - Report a URI being processed as having finished processing.
long | finishedUriCount() - Number of URIs that have finished processing.
long | futureUriCount()
String | getClassKey(CrawlURI cauri)
FrontierJournal | getFrontierJournal()
Frontier.FrontierGroup | getGroup(CrawlURI curi) - Get the 'frontier group' (usually a queue) for the given CrawlURI.
DecideRule | getScope()
CompositeData | getURIsList(String marker, int numberOfMatches, String regex, boolean verbose) - Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
long | importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) - Import URIs from the given file (in recover-log-like format, with a 3-character 'type' tag preceding a URI with optional hops/via).
void | importURIs(String params) - Load URIs from a file, for scheduling and/or considered-included status (if from a recovery log).
boolean | isEmpty() - Returns true if the frontier contains no more URIs to crawl.
CrawlURI | next() - Get the next URI that should be processed.
void | pause() - Notify the Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
long | queuedUriCount() - Number of URIs queued up and waiting for processing.
void | requestState(Frontier.State target) - Request that the Frontier reach the given state as soon as possible.
void | run() - Request that the Frontier allow crawling to begin.
void | schedule(CrawlURI caURI) - Schedules a CrawlURI.
long | succeededFetchCount() - Number of successfully processed URIs.
void | terminate() - Notify the Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
void | unpause() - Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
CrawlURI next() throws InterruptedException
Throws:
InterruptedException
boolean isEmpty()
Returns true only if there are no more URIs to crawl: no URIs currently available (ready to be emitted), no URIs belonging to deferred hosts, and no pending URIs in the Frontier. Thus this method may return false even when there is no currently available URI.
void schedule(CrawlURI caURI)
This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled; it only means that the URI will be placed in its respective queue at once. For priority scheduling see CrawlURI.setSchedulingDirective(int).
This method should be synchronized in all implementing classes.
Parameters:
caURI - The URI to schedule.
See Also:
CrawlURI.setSchedulingDirective(int)
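The effect of a scheduling directive can be sketched with a priority queue. In Heritrix the directive is set on the CrawlURI via setSchedulingDirective(int) before schedule() is called; the constants and record type below imitate that idea and are not the real API:

```java
import java.util.PriorityQueue;

// Sketch of how a scheduling directive influences queue order: entries
// with a lower directive value (higher priority) are emitted first.
// HIGH/MEDIUM/NORMAL and Entry are invented for this illustration.
public class DirectiveDemo {
    static final int HIGH = 1, MEDIUM = 2, NORMAL = 3;

    record Entry(String uri, int directive) {}

    public static void main(String[] args) {
        PriorityQueue<Entry> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.directive(), b.directive()));
        queue.add(new Entry("http://example.com/page", NORMAL));
        // A prerequisite such as robots.txt is typically given priority:
        queue.add(new Entry("http://example.com/robots.txt", HIGH));
        System.out.println(queue.poll().uri()); // the robots.txt URI comes out first
    }
}
```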
void finished(CrawlURI cURI)
ToeThreads will invoke this method once they have completed work on their assigned URI.
This method is synchronized.
Parameters:
cURI - The URI that has finished processing.

long discoveredUriCount()
That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier has decided that it should be processed). This includes URIs that have been processed, are being processed, and have finished processing. It does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process, and finished items, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.
long queuedUriCount()
This includes any URIs that failed but will be retried. Essentially, this is any discovered URI that has not been processed and is not currently being processed. The same discovered URI can be queued multiple times.
long futureUriCount()
long deepestUri()
long averageDepth()
float congestionRatio()
long finishedUriCount()
Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
long succeededFetchCount()
Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.
long failedFetchCount()
URIs that could not be processed because of some error or failure in the processing chain. This can include failure to acquire prerequisites, failure to establish a connection with the host, and any number of other problems. Does not count those that will be retried, only those that have permanently failed.
long disregardedUriCount()
Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
void importURIs(String params) throws IOException
The 'params' argument is a JSON object (passed as a String) describing the source file to use and the options in effect regarding its format and handling. Significant keys are:
- "path": full path to the source file. If the path ends in '.gz', it will be considered to be GZIP compressed.
- "format": one of "onePer", "crawlLog", or "recoveryLog"
- "forceRevisit": if non-null, URIs will be force-scheduled even if already considered included
- "scopeSchedules": if non-null, any URI imported will be checked against the frontier's configured scope before scheduling
If the "format" is "recoveryLog", 7 more keys are significant:
- "includeSuccesses": if non-null, success lines ("Fs") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.)
- "includeFailures": if non-null, failure lines ("Ff") in the log will be considered-included. (Sometimes, this is desired.)
- "includeScheduleds": if non-null, scheduled lines ("F+") in the log will be considered-included. (Atypical, but an option for completeness.)
- "scopeIncludes": if non-null, any of the above will be checked against the frontier's configured scope before consideration
- "scheduleSuccesses": if non-null, success lines ("Fs") in the log will be schedule-attempted. (Atypical, as all successes are preceded by "F+" lines.)
- "scheduleFailures": if non-null, failure lines ("Ff") in the log will be schedule-attempted. (Atypical, as all failures are preceded by "F+" lines.)
- "scheduleScheduleds": if non-null, scheduled lines ("F+") in the log will be schedule-attempted. (Usually, this is the aim of a recovery-log import.)
TODO: add parameter for auto-unpause-at-good-time
Parameters:
params - JSON object (as a String) describing the source file and options as above
Throws:
IOException - If problems occur reading the file.
org.json.JSONException
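Assembling the 'params' string for a typical recovery-log import might look like the sketch below. The key names follow the list above; the file path is invented, and the frontier instance on which importURIs() would be called is assumed to be wired up elsewhere:

```java
// Hypothetical sketch of building the JSON 'params' argument for
// importURIs(): mark past successes as already-included and re-attempt
// the still-pending ("F+") lines of a recovery log.
public class ImportParamsDemo {
    static String recoveryParams(String path) {
        return "{"
            + "\"path\":\"" + path + "\","
            + "\"format\":\"recoveryLog\","
            + "\"includeSuccesses\":true,"
            + "\"scheduleScheduleds\":true"
            + "}";
    }

    public static void main(String[] args) {
        String params = recoveryParams("/crawls/job1/logs/recover.gz"); // path is invented
        System.out.println(params);
        // then: frontier.importURIs(params);
    }
}
```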
long importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) throws IOException
Parameters:
source - recovery log File to use (may be .gz compressed)
applyScope - whether to apply the crawl scope to URIs
includeOnly - whether to only add to the included filter, not schedule
forceFetch - whether to force fetching, even if already seen (ignored if includeOnly is set)
acceptTags - String regex; only lines whose first field matches will be included
Throws:
IOException
CompositeData getURIsList(String marker, int numberOfMatches, String regex, boolean verbose)
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, has not terminally failed, has not been disregarded, and is not currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.
The list is a set of strings containing the URI strings. If verbose is true, each string will include some additional information (path to URI and parent).
The URIFrontierMarker will be advanced to the position at which its maximum number of matches was found. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.
While the order returned is consistent, it has no explicit relation to the likely order in which the URIs may be processed.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if true, the strings returned will contain additional information about each URI beyond their names.
Throws:
InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:
FrontierMarker, #getInitialMarker(String, boolean)
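The marker-based paging contract can be sketched with a toy pager. Here the "marker" is a plain int index, invented for the illustration; the real FrontierMarker is opaque frontier state, and changes to the frontier can invalidate it:

```java
import java.util.List;

// Sketch of getURIsList()-style paging: the marker advances past each
// returned batch, so reusing it on the next call yields the following
// batch. Illustrative only; not the real Heritrix marker type.
public class MarkerPagingDemo {
    static class Marker { int position = 0; }

    static List<String> nextBatch(List<String> pending, Marker marker, int numberOfMatches) {
        int from = marker.position;
        int to = Math.min(from + numberOfMatches, pending.size());
        marker.position = to; // advance past the returned matches
        return pending.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> pending = List.of("u1", "u2", "u3", "u4", "u5");
        Marker m = new Marker();
        System.out.println(nextBatch(pending, m, 2)); // [u1, u2]
        System.out.println(nextBatch(pending, m, 2)); // [u3, u4]
        System.out.println(nextBatch(pending, m, 2)); // [u5]
    }
}
```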
long deleteURIs(String queueRegex, String match)
Any encountered URI that has not been successfully crawled, has not terminally failed, has not been disregarded, and is not currently being processed is considered to be a pending URI.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
match - A regular expression; any URIs that match it will be deleted.

void deleted(CrawlURI curi)
Parameters:
curi - Deleted CrawlURI.

void considerIncluded(CrawlURI curi)
Parameters:
curi - CrawlURI instance to add to the Already Included set.

void pause()
void unpause()
void terminate()
FrontierJournal getFrontierJournal()
Returns the FrontierJournal that this Frontier is using. May be null if there is no journaling.

String getClassKey(CrawlURI cauri)
Parameters:
cauri - CrawlURI for which we're to calculate and set the class key.
Returns:
the class key for cauri.

DecideRule getScope()
void run()
Frontier.FrontierGroup getGroup(CrawlURI curi)
Parameters:
curi - CrawlURI for which to find the matching group

void requestState(Frontier.State target)
Parameters:
target - Frontier.State to pursue

void beginDisposition(CrawlURI curi)
Parameters:
curi -

void endDisposition()
Copyright © 2003-2014 Internet Archive. All Rights Reserved.