public abstract class AbstractFrontier extends Object implements Frontier, SeedListener, HasKeyedProperties, ExtractorParameters, UriUniqFilter.CrawlUriReceiver, org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>
Nested classes/interfaces inherited from interface Frontier: Frontier.FrontierGroup, Frontier.State
Modifier and Type | Field and Description |
---|---|
protected CrawlController | controller |
protected ReentrantReadWriteLock | dispositionInProgressLock: lock allowing steps of outside processing that need to complete all-or-nothing to signal their in-progress status |
protected ThreadLocal<CrawlURI> | dispositionPending: remembers a disposition-in-progress, so that extra endDisposition() calls are harmless |
protected AtomicLong | disregardedUriCount: URIs that are disregarded (for example, because of robots.txt rules) |
protected AtomicLong | failedFetchCount |
protected AtomicLong | futureUriCount |
protected KeyedProperties | kp |
protected Frontier.State | lastReachedState: last Frontier.State reached; used to suppress duplicate notifications |
protected CrawlerLoggerModule | loggerModule |
protected Thread | managerThread: distinguished frontier manager thread which handles all juggling of URI queues and queues/maps of queues for proper ordering/delay of URI processing |
protected AtomicLong | nextOrdinal: ordinal numbers to assign to created CrawlURIs |
protected ReentrantReadWriteLock | outboundLock: lock allowing the frontier to hold all worker ToeThreads from taking URIs already on the outbound queue; they acquire read permission before take()ing; the frontier can acquire write permission to hold threads |
protected FrontierPreparer | preparer |
protected AtomicLong | queuedUriCount: total URIs queued to be visited |
protected FrontierJournal | recover: crawl replay logger |
protected DecideRule | scope |
protected SeedModule | seeds |
protected ServerCache | serverCache |
protected SheetOverlaysManager | sheetOverlaysManager |
protected AtomicLong | succeededFetchCount |
protected Frontier.State | targetState: Frontier.State that the manager thread should seek to reach |
protected AtomicLong | totalProcessedBytes: used when bandwidth constraints are applied |
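The outboundLock field uses the standard read-write-lock gating idiom: worker ToeThreads acquire read permission before take()ing a URI, while the frontier acquires write permission to hold all of them (as pause() does). A minimal self-contained sketch of that idiom, with illustrative class and method names that are not Heritrix's:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the outboundLock idiom: many readers (worker
// threads) proceed concurrently; one writer (the pausing frontier)
// waits for in-flight takes to finish, then holds all workers.
public class OutboundGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Worker threads call this before taking a URI from the outbound queue.
    public String take(String uri) {
        lock.readLock().lock();
        try {
            return uri; // stand-in for the actual queue removal
        } finally {
            lock.readLock().unlock();
        }
    }

    // Acquiring the write lock blocks until no takes are in flight,
    // then keeps new takes from starting until unpause().
    public void pause()   { lock.writeLock().lock(); }
    public void unpause() { lock.writeLock().unlock(); }

    public boolean isPaused() { return lock.isWriteLocked(); }
}
```

A fair read-write lock is used here so a pending pause is not starved by a steady stream of readers.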
Constructor and Description |
---|
AbstractFrontier() |
Modifier and Type | Method and Description |
---|---|
void | addedSeed(CrawlURI puri): when notified of a seed via the SeedListener interface, schedule it |
void | beginDisposition(CrawlURI curi): inform the frontier that a block of processing that should complete atomically with respect to checkpoints is about to begin |
void | concludedSeedBatch() |
void | crawlEnded(String sExitMessage) |
protected void | decrementQueuedCount(long numberOfDeletes): note that a number of queued URIs have been deleted |
long | disregardedUriCount(): number of URIs that were scheduled at one point but have since been disregarded |
protected void | doJournalAdded(CrawlURI c) |
protected void | doJournalDisregarded(CrawlURI c) |
protected void | doJournalEmitted(CrawlURI c) |
protected void | doJournalFinishedFailure(CrawlURI c) |
protected void | doJournalFinishedSuccess(CrawlURI c) |
protected void | doJournalReenqueued(CrawlURI c) |
protected void | doJournalRelocated(CrawlURI c) |
void | endDisposition(): inform the frontier that the processing signalled by an earlier pending beginDisposition() call has finished |
long | failedFetchCount() |
protected void | finalTasks(): perform any tasks necessary before entering the FINISH frontier state/FINISHED crawl state |
protected abstract CrawlURI | findEligibleURI(): find a CrawlURI eligible to be put on the outbound queue for processing |
void | finished(CrawlURI curi): note that the previously emitted CrawlURI has completed its processing (for now) |
long | finishedUriCount() |
long | futureUriCount() |
String | getClassKey(CrawlURI curi) |
CrawlController | getCrawlController() |
boolean | getExtract404s(): whether to extract links from responses with a 404 'not found' response code |
boolean | getExtractIndependently(): whether each extractor should make an independent decision as to whether it can extract links from a URI's content (when true), or whether a previous extractor's success (marking the URI as hasBeenLinkExtracted) should cancel later extractors (when false) |
FrontierJournal | getFrontierJournal() |
FrontierPreparer | getFrontierPreparer() |
protected abstract int | getInProcessCount(): the number of CrawlURIs 'in process' (passed to the outbound queue and not yet finished by returning through the inbound queue) |
KeyedProperties | getKeyedProperties() |
CrawlerLoggerModule | getLoggerModule() |
protected abstract long | getMaxInWait(): maximum amount of time to wait for an inbound update event before giving up and rechecking the ability to further fill the outbound queue |
int | getMaxOutlinks(): the maximum number of outlinks to discover from any URI's content |
int | getMaxRetries() |
boolean | getRecoveryLogEnabled() |
int | getRetryDelaySeconds() |
DecideRule | getScope() |
SeedModule | getSeeds() |
ServerCache | getServerCache() |
SheetOverlaysManager | getSheetOverlaysManager() |
long | importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags): import URIs from the given file (in recover-log-like format, with a 3-character 'type' tag preceding a URI with optional hops/via) |
void | importURIs(String jsonParams): load URIs from a file, for scheduling and/or considered-included status (if from a recovery log) |
protected void | importURIsSimple(org.json.JSONObject params): import URIs from either a simple (one URI per line) or crawl.log format |
protected void | incrementDisregardedUriCount(): increment the running count of disregarded URIs |
protected void | incrementFailedFetchCount(): increment the running count of failed URIs |
protected void | incrementQueuedUriCount(): increment the running count of queued URIs |
protected void | incrementQueuedUriCount(long increment): increment the running count of queued URIs by the given amount |
protected void | incrementSucceededFetchCount(): increment the running count of successfully fetched URIs |
protected boolean | isDisregarded(CrawlURI curi) |
boolean | isEmpty(): the frontier is empty only if all queues are empty and no URIs are in-process |
boolean | isRunning() |
protected void | log(CrawlURI curi): log to the main crawl.log |
protected void | logNonfatalErrors(CrawlURI curi): take note of any processor-local errors that have been entered into the CrawlURI |
protected void | managementTasks(): main loop of the frontier's managerThread |
protected boolean | needsReenqueuing(CrawlURI curi): checks whether a recently processed CrawlURI that did not finish successfully needs to be reenqueued (and thus possibly processed again after some time elapses) |
CrawlURI | next(): get the next URI that should be processed |
boolean | nonseedLine(String line): do nothing with non-seed lines |
protected void | noteAboutToEmit(CrawlURI curi, WorkQueue q): perform fixups on a CrawlURI about to be returned via next() |
void | onApplicationEvent(org.springframework.context.ApplicationEvent event) |
protected boolean | overMaxRetries(CrawlURI curi) |
void | pause(): notify the Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise |
protected void | prepForFrontier(CrawlURI curi) |
protected abstract void | processFinish(CrawlURI caUri): handle the given CrawlURI as having finished a worker ToeThread processing attempt |
protected abstract void | processScheduleAlways(CrawlURI caUri): schedule the given CrawlURI regardless of its already-seen status |
protected abstract void | processScheduleIfUnique(CrawlURI caUri): schedule the given CrawlURI if not already seen |
long | queuedUriCount() |
protected void | reachedState(Frontier.State justReached): the given state has been reached; if it is a new state, generate a notification to the CrawlController |
void | receive(CrawlURI curi): accept the given CrawlURI for scheduling, as it has passed the alreadyIncluded filter |
void | requestState(Frontier.State target): request that the Frontier reach the given state as soon as possible |
protected long | retryDelayFor(CrawlURI curi): return a suitable value to wait before retrying the given URI |
void | run(): request that the Frontier allow crawling to begin |
void | schedule(CrawlURI curi): arrange for the given CrawlURI to be visited, if it is not already scheduled/completed |
void | setCrawlController(CrawlController controller) |
void | setExtract404s(boolean extract404s) |
void | setExtractIndependently(boolean extractIndependently) |
void | setFrontierPreparer(FrontierPreparer prep) |
void | setLoggerModule(CrawlerLoggerModule loggerModule) |
void | setMaxOutlinks(int max) |
void | setMaxRetries(int maxRetries) |
void | setRecoveryLogEnabled(boolean enabled) |
void | setRetryDelaySeconds(int delay) |
void | setScope(DecideRule scope) |
void | setSeeds(SeedModule seeds) |
void | setServerCache(ServerCache serverCache) |
void | setSheetOverlaysManager(SheetOverlaysManager sheetOverlaysManager) |
String | shortReportLine() |
void | start() |
protected void | startManagerThread(): start the dedicated thread with an independent view of the frontier's state |
void | stop() |
long | succeededFetchCount() |
protected void | tally(CrawlURI curi, FetchStats.Stage stage): report a CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for the given stage |
void | terminate(): notify the Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException |
void | unpause(): resumes the release of URIs to crawl, allowing worker ToeThreads to proceed |
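The core lifecycle documented above is: worker ToeThreads loop calling next() to obtain a URI, process it, and report back via finished(), while newly discovered URIs enter through schedule(). A toy, self-contained sketch of that contract (the MiniFrontier type and its String URIs are illustrative, not part of Heritrix):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the Frontier contract: schedule() enqueues, next() emits,
// finished() accounts for the disposition. Illustrative only; the real
// class also journals, tallies substats, and enforces politeness delays.
public class MiniFrontier {
    private final Queue<String> queued = new ArrayDeque<>();
    private long queuedUriCount = 0;
    private long succeededFetchCount = 0;

    public void schedule(String uri) {   // cf. schedule(CrawlURI)
        queued.add(uri);
        queuedUriCount++;
    }

    public String next() {               // cf. next(): emit an eligible URI
        String uri = queued.poll();
        if (uri != null) queuedUriCount--;
        return uri;
    }

    public void finished(String uri) {   // cf. finished(CrawlURI)
        succeededFetchCount++;
    }

    public long queuedUriCount()      { return queuedUriCount; }
    public long succeededFetchCount() { return succeededFetchCount; }
    public boolean isEmpty()          { return queued.isEmpty(); }
}
```

A worker's loop is then `String u = frontier.next(); process(u); frontier.finished(u);`, with next() in the real class blocking when nothing is eligible (bounded by getMaxInWait()).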
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Frontier: averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, discoveredUriCount, getGroup, getURIsList
protected KeyedProperties kp
protected CrawlController controller
protected SheetOverlaysManager sheetOverlaysManager
protected CrawlerLoggerModule loggerModule
protected SeedModule seeds
protected ServerCache serverCache
protected AtomicLong nextOrdinal
protected DecideRule scope
protected FrontierPreparer preparer
protected AtomicLong queuedUriCount
protected AtomicLong futureUriCount
protected AtomicLong succeededFetchCount
protected AtomicLong failedFetchCount
protected AtomicLong disregardedUriCount
protected AtomicLong totalProcessedBytes
protected FrontierJournal recover
protected ReentrantReadWriteLock outboundLock
protected Thread managerThread
protected Frontier.State lastReachedState
protected volatile Frontier.State targetState
protected ReentrantReadWriteLock dispositionInProgressLock
protected ThreadLocal<CrawlURI> dispositionPending
public AbstractFrontier()
name - Name of this frontier.
description - Description for this frontier.
public KeyedProperties getKeyedProperties()
getKeyedProperties
in interface HasKeyedProperties
public int getRetryDelaySeconds()
public void setRetryDelaySeconds(int delay)
public int getMaxRetries()
public void setMaxRetries(int maxRetries)
public boolean getRecoveryLogEnabled()
public void setRecoveryLogEnabled(boolean enabled)
public int getMaxOutlinks()
ExtractorParameters
getMaxOutlinks
in interface ExtractorParameters
public void setMaxOutlinks(int max)
public boolean getExtractIndependently()
ExtractorParameters
getExtractIndependently
in interface ExtractorParameters
public void setExtractIndependently(boolean extractIndependently)
public boolean getExtract404s()
ExtractorParameters
getExtract404s
in interface ExtractorParameters
public void setExtract404s(boolean extract404s)
public boolean isRunning()
isRunning
in interface org.springframework.context.Lifecycle
public void stop()
stop
in interface org.springframework.context.Lifecycle
public CrawlController getCrawlController()
public void setCrawlController(CrawlController controller)
public SheetOverlaysManager getSheetOverlaysManager()
public void setSheetOverlaysManager(SheetOverlaysManager sheetOverlaysManager)
public CrawlerLoggerModule getLoggerModule()
public void setLoggerModule(CrawlerLoggerModule loggerModule)
public SeedModule getSeeds()
public void setSeeds(SeedModule seeds)
public ServerCache getServerCache()
public void setServerCache(ServerCache serverCache)
public DecideRule getScope()
public void setScope(DecideRule scope)
public FrontierPreparer getFrontierPreparer()
public void setFrontierPreparer(FrontierPreparer prep)
public String getClassKey(CrawlURI curi)
getClassKey
in interface Frontier
curi - CrawlURI we're to get a key for.
protected void startManagerThread()
public void start()
start
in interface org.springframework.context.Lifecycle
protected void managementTasks()
protected void finalTasks()
protected void reachedState(Frontier.State justReached)
public CrawlURI next() throws InterruptedException
Frontier
next
in interface Frontier
InterruptedException
protected abstract CrawlURI findEligibleURI()
protected abstract void processScheduleAlways(CrawlURI caUri)
caUri - CrawlURI to schedule
protected abstract void processScheduleIfUnique(CrawlURI caUri)
caUri - CrawlURI to schedule
protected abstract void processFinish(CrawlURI caUri)
caUri - CrawlURI to finish
protected abstract int getInProcessCount()
protected abstract long getMaxInWait()
public void schedule(CrawlURI curi)
schedule
in interface Frontier
curi - The URI to schedule.
Frontier.schedule(org.archive.modules.CrawlURI)
public void receive(CrawlURI curi)
receive
in interface UriUniqFilter.CrawlUriReceiver
caUri - CrawlURI.
public void finished(CrawlURI curi)
finished
in interface Frontier
curi - The URI that has finished processing.
Frontier.finished(org.archive.modules.CrawlURI)
public void run()
Frontier
public void requestState(Frontier.State target)
Frontier
requestState
in interface Frontier
target - Frontier.State to pursue
public void pause()
Frontier
public void unpause()
Frontier
public void terminate()
Frontier
protected void tally(CrawlURI curi, FetchStats.Stage stage)
curi -
stage -
protected void doJournalFinishedSuccess(CrawlURI c)
protected void doJournalAdded(CrawlURI c)
protected void doJournalRelocated(CrawlURI c)
protected void doJournalReenqueued(CrawlURI c)
protected void doJournalFinishedFailure(CrawlURI c)
protected void doJournalDisregarded(CrawlURI c)
protected void doJournalEmitted(CrawlURI c)
public boolean isEmpty()
protected void incrementQueuedUriCount()
protected void incrementQueuedUriCount(long increment)
increment - amount to increment the queued count
protected void decrementQueuedCount(long numberOfDeletes)
numberOfDeletes -
public long queuedUriCount()
queuedUriCount
in interface Frontier
Frontier.queuedUriCount()
public long futureUriCount()
futureUriCount
in interface Frontier
public long finishedUriCount()
finishedUriCount
in interface Frontier
Frontier.finishedUriCount()
protected void incrementSucceededFetchCount()
public long succeededFetchCount()
succeededFetchCount
in interface Frontier
Frontier.succeededFetchCount()
protected void incrementFailedFetchCount()
public long failedFetchCount()
failedFetchCount
in interface Frontier
Frontier.failedFetchCount()
protected void incrementDisregardedUriCount()
public long disregardedUriCount()
Frontier
Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
disregardedUriCount
in interface Frontier
public void addedSeed(CrawlURI puri)
addedSeed
in interface SeedListener
SeedListener.addedSeed(org.archive.modules.CrawlURI)
public boolean nonseedLine(String line)
nonseedLine
in interface SeedListener
SeedListener.nonseedLine(java.lang.String)
public void concludedSeedBatch()
concludedSeedBatch
in interface SeedListener
protected void prepForFrontier(CrawlURI curi)
protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)
curi - CrawlURI about to be returned by next()
q - the queue from which the CrawlURI came
protected long retryDelayFor(CrawlURI curi)
curi - CrawlURI to be retried
protected void logNonfatalErrors(CrawlURI curi)
curi -
protected boolean overMaxRetries(CrawlURI curi)
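overMaxRetries(curi) and retryDelayFor(curi) together decide whether and when a failed URI is reenqueued, driven by the maxRetries and retryDelaySeconds settings above. A hedged sketch of that decision; the fetch-attempt counter and the RetryPolicy class are illustrative stand-ins for CrawlURI's internal state, not Heritrix API:

```java
// Sketch of the retry decision driven by maxRetries / retryDelaySeconds.
// fetchAttempts stands in for the attempt count a CrawlURI carries.
public class RetryPolicy {
    private final int maxRetries;
    private final int retryDelaySeconds;

    public RetryPolicy(int maxRetries, int retryDelaySeconds) {
        this.maxRetries = maxRetries;
        this.retryDelaySeconds = retryDelaySeconds;
    }

    // cf. overMaxRetries(CrawlURI): too many attempts means give up
    public boolean overMaxRetries(int fetchAttempts) {
        return fetchAttempts >= maxRetries;
    }

    // cf. retryDelayFor(CrawlURI): seconds to wait before reenqueuing
    public long retryDelayFor(int fetchAttempts) {
        return overMaxRetries(fetchAttempts) ? 0L : retryDelaySeconds;
    }
}
```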
public long importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) throws IOException
importRecoverFormat
in interface Frontier
source - File recovery log file to use (may be .gz compressed)
applyScope - whether to apply crawl scope to URIs
includeOnly - whether to only add to included filter, not schedule
forceFetch - whether to force fetching, even if already seen (ignored if includeOnly is set)
acceptTags - String regex; only lines whose first field match will be included
IOException
public void importURIs(String jsonParams) throws IOException
Frontier
The 'params' Map describes the source file to use and options in effect regarding its format and handling. Significant keys are:
"path": full path to source file. If the path ends '.gz', it will be considered to be GZIP compressed.
"format": one of "onePer", "crawlLog", or "recoveryLog"
"forceRevisit": if non-null, URIs will be force-scheduled even if already considered included
"scopeSchedules": if non-null, any URI imported be checked against the frontier's configured scope before scheduling
If the "format" is "recoveryLog", 7 more keys are significant:
"includeSuccesses": if non-null, success lines ("Fs") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.)
"includeFailures": if non-null, failure lines ("Ff") in the log will be considered-included. (Sometimes, this is desired.)
"includeScheduleds": If non-null, scheduled lines ("F+") in the log will be considered-included. (Atypical, but an option for completeness.)
"scopeIncludes": if non-null, any of the above will be checked against the frontier's configured scope before consideration
"scheduleSuccesses": if non-null, success lines ("Fs") in the log will be schedule-attempted. (Atypical, as all successes are preceded by "F+" lines.)
"scheduleFailures": if non-null, failure lines ("Ff") in the log will be schedule-attempted. (Atypical, as all failures are preceded by "F+" lines.)
"scheduleScheduleds": if non-null, scheduled lines ("F+") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.) TODO: add parameter for auto-unpause-at-good-time
importURIs
in interface Frontier
jsonParams - Map describing source file and options as above
IOException - If problems occur reading file.
protected void importURIsSimple(org.json.JSONObject params)
params - JSONObject of options to control import
org.archive.crawler.framework.Frontier#importURIs(java.util.Map)
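Putting the importURIs() keys documented above together, a typical recovery-log import request might look like the following (the path value is illustrative):

```json
{
  "path": "/crawls/job1/logs/recover.gz",
  "format": "recoveryLog",
  "includeSuccesses": true,
  "scheduleScheduleds": true,
  "scopeSchedules": true
}
```

Per the key descriptions, this marks successes as already-included, reschedules the "F+" lines, and scope-checks each URI before scheduling.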
protected void log(CrawlURI curi)
curi -
protected boolean isDisregarded(CrawlURI curi)
protected boolean needsReenqueuing(CrawlURI curi)
curi - The CrawlURI to check
public FrontierJournal getFrontierJournal()
getFrontierJournal
in interface Frontier
public void crawlEnded(String sExitMessage)
public String shortReportLine()
public void onApplicationEvent(org.springframework.context.ApplicationEvent event)
onApplicationEvent
in interface org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>
public void beginDisposition(CrawlURI curi)
Frontier
beginDisposition
in interface Frontier
public void endDisposition()
Frontier
endDisposition
in interface Frontier
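beginDisposition()/endDisposition() bracket a block of processing that must complete atomically with respect to checkpoints, and dispositionPending being a ThreadLocal is what makes extra endDisposition() calls harmless, as the field description above notes. A self-contained sketch of that idiom; names mirror the fields above, but the class itself and the checkpoint interaction are illustrative:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the disposition bracket: each worker thread records its
// in-progress URI in a ThreadLocal and holds a read lock; a checkpointer
// would take the write lock to wait out all in-flight dispositions.
public class DispositionTracker {
    private final ReentrantReadWriteLock dispositionInProgressLock =
            new ReentrantReadWriteLock(true);
    private final ThreadLocal<String> dispositionPending = new ThreadLocal<>();

    public void beginDisposition(String uri) {
        dispositionPending.set(uri);
        dispositionInProgressLock.readLock().lock();
    }

    // Safe to call more than once: the ThreadLocal records whether a
    // disposition is actually pending for this thread.
    public void endDisposition() {
        if (dispositionPending.get() != null) {
            dispositionPending.remove();
            dispositionInProgressLock.readLock().unlock();
        }
    }

    public boolean hasPending() { return dispositionPending.get() != null; }
}
```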
Copyright © 2003-2014 Internet Archive. All Rights Reserved.