public abstract class AbstractFrontier extends Object implements Frontier, SeedListener, HasKeyedProperties, ExtractorParameters, UriUniqFilter.CrawlUriReceiver, org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>
Nested classes/interfaces inherited from interface Frontier: Frontier.FrontierGroup, Frontier.State
Modifier and Type | Field and Description |
---|---|
protected CrawlController | controller |
protected ReentrantReadWriteLock | dispositionInProgressLock: lock allowing steps of outside processing that need to complete all-or-nothing to signal their in-progress status |
protected ThreadLocal<CrawlURI> | dispositionPending: remembers a disposition-in-progress, so that extra endDisposition() calls are harmless |
protected AtomicLong | disregardedUriCount: URIs that are disregarded (for example, because of robots.txt rules) |
protected AtomicLong | failedFetchCount |
protected AtomicLong | futureUriCount |
protected KeyedProperties | kp |
protected Frontier.State | lastReachedState: last Frontier.State reached; used to suppress duplicate notifications |
protected CrawlerLoggerModule | loggerModule |
protected Thread | managerThread: distinguished frontier manager thread which handles all juggling of URI queues and queues/maps of queues for proper ordering/delay of URI processing |
protected AtomicLong | nextOrdinal: ordinal numbers to assign to created CrawlURIs |
protected ReentrantReadWriteLock | outboundLock: lock allowing the frontier to hold all worker ToeThreads from taking URIs already on the outbound queue; they acquire read permission before take()ing; the frontier can acquire write permission to hold threads |
protected FrontierPreparer | preparer |
protected AtomicLong | queuedUriCount: total URIs queued to be visited |
protected FrontierJournal | recover: crawl replay logger |
protected DecideRule | scope |
protected SeedModule | seeds |
protected ServerCache | serverCache |
protected SheetOverlaysManager | sheetOverlaysManager |
protected AtomicLong | succeededFetchCount |
protected Frontier.State | targetState: Frontier.State that the manager thread should seek to reach |
protected AtomicLong | totalProcessedBytes: used when bandwidth constraints are applied |
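The outboundLock field uses the standard read-write-lock gating idiom: worker ToeThreads acquire read permission before take()ing a URI, while the frontier acquires write permission to hold all of them (as pause() does). A minimal self-contained sketch of that idiom, with illustrative class and method names that are not Heritrix's:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the outboundLock idiom: many readers (worker
// threads) proceed concurrently; one writer (the pausing frontier)
// waits for in-flight takes to finish, then holds all workers.
public class OutboundGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Worker threads call this before taking a URI from the outbound queue.
    public String take(String uri) {
        lock.readLock().lock();
        try {
            return uri; // stand-in for the actual queue removal
        } finally {
            lock.readLock().unlock();
        }
    }

    // Acquiring the write lock blocks until no takes are in flight,
    // then keeps new takes from starting until unpause().
    public void pause()   { lock.writeLock().lock(); }
    public void unpause() { lock.writeLock().unlock(); }

    public boolean isPaused() { return lock.isWriteLocked(); }
}
```

A fair read-write lock is used here so a pending pause is not starved by a steady stream of readers.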
Constructor and Description |
---|
AbstractFrontier() |
Modifier and Type | Method and Description |
---|---|
void | addedSeed(CrawlURI puri): when notified of a seed via the SeedListener interface, schedule it |
void | beginDisposition(CrawlURI curi): inform the frontier that a block of processing that should complete atomically with respect to checkpoints is about to begin |
void | concludedSeedBatch() |
void | crawlEnded(String sExitMessage) |
protected void | decrementQueuedCount(long numberOfDeletes): note that a number of queued URIs have been deleted |
long | disregardedUriCount(): number of URIs that were scheduled at one point but have since been disregarded |
protected void | doJournalAdded(CrawlURI c) |
protected void | doJournalDisregarded(CrawlURI c) |
protected void | doJournalEmitted(CrawlURI c) |
protected void | doJournalFinishedFailure(CrawlURI c) |
protected void | doJournalFinishedSuccess(CrawlURI c) |
protected void | doJournalReenqueued(CrawlURI c) |
protected void | doJournalRelocated(CrawlURI c) |
void | endDisposition(): inform the frontier that the processing signalled by an earlier pending beginDisposition() call has finished |
long | failedFetchCount() |
protected void | finalTasks(): perform any tasks necessary before entering the FINISH frontier state/FINISHED crawl state |
protected abstract CrawlURI | findEligibleURI(): find a CrawlURI eligible to be put on the outbound queue for processing |
void | finished(CrawlURI curi): note that the previously emitted CrawlURI has completed its processing (for now) |
long | finishedUriCount() |
long | futureUriCount() |
String | getClassKey(CrawlURI curi) |
CrawlController | getCrawlController() |
boolean | getExtract404s(): whether to extract links from responses with a 404 'not found' response code |
boolean | getExtractIndependently(): whether each extractor should make an independent decision as to whether it can extract links from a URI's content (when true), or whether a previous extractor's success (marking the URI as hasBeenLinkExtracted) should cancel later extractors (when false) |
FrontierJournal | getFrontierJournal() |
FrontierPreparer | getFrontierPreparer() |
protected abstract int | getInProcessCount(): the number of CrawlURIs 'in process' (passed to the outbound queue and not yet finished by returning through the inbound queue) |
KeyedProperties | getKeyedProperties() |
CrawlerLoggerModule | getLoggerModule() |
protected abstract long | getMaxInWait(): maximum amount of time to wait for an inbound update event before giving up and rechecking the ability to further fill the outbound queue |
int | getMaxOutlinks(): the maximum number of outlinks to discover from any URI's content |
int | getMaxRetries() |
boolean | getRecoveryLogEnabled() |
int | getRetryDelaySeconds() |
DecideRule | getScope() |
SeedModule | getSeeds() |
ServerCache | getServerCache() |
SheetOverlaysManager | getSheetOverlaysManager() |
long | importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags): import URIs from the given file (in recover-log-like format, with a 3-character 'type' tag preceding a URI with optional hops/via) |
void | importURIs(String jsonParams): load URIs from a file, for scheduling and/or considered-included status (if from a recovery log) |
protected void | importURIsSimple(org.json.JSONObject params): import URIs from either a simple (one URI per line) or crawl.log format |
protected void | incrementDisregardedUriCount(): increment the running count of disregarded URIs |
protected void | incrementFailedFetchCount(): increment the running count of failed URIs |
protected void | incrementQueuedUriCount(): increment the running count of queued URIs |
protected void | incrementQueuedUriCount(long increment): increment the running count of queued URIs by the given amount |
protected void | incrementSucceededFetchCount(): increment the running count of successfully fetched URIs |
protected boolean | isDisregarded(CrawlURI curi) |
boolean | isEmpty(): the frontier is empty only if all queues are empty and no URIs are in-process |
boolean | isRunning() |
protected void | log(CrawlURI curi): log to the main crawl.log |
protected void | logNonfatalErrors(CrawlURI curi): take note of any processor-local errors that have been entered into the CrawlURI |
protected void | managementTasks(): main loop of the frontier's managerThread |
protected boolean | needsReenqueuing(CrawlURI curi): checks whether a recently processed CrawlURI that did not finish successfully needs to be reenqueued (and thus possibly processed again after some time elapses) |
CrawlURI | next(): get the next URI that should be processed |
boolean | nonseedLine(String line): do nothing with non-seed lines |
protected void | noteAboutToEmit(CrawlURI curi, WorkQueue q): perform fixups on a CrawlURI about to be returned via next() |
void | onApplicationEvent(org.springframework.context.ApplicationEvent event) |
protected boolean | overMaxRetries(CrawlURI curi) |
void | pause(): notify the Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise |
protected void | prepForFrontier(CrawlURI curi) |
protected abstract void | processFinish(CrawlURI caUri): handle the given CrawlURI as having finished a worker ToeThread processing attempt |
protected abstract void | processScheduleAlways(CrawlURI caUri): schedule the given CrawlURI regardless of its already-seen status |
protected abstract void | processScheduleIfUnique(CrawlURI caUri): schedule the given CrawlURI if not already seen |
long | queuedUriCount() |
protected void | reachedState(Frontier.State justReached): the given state has been reached; if it is a new state, generate a notification to the CrawlController |
void | receive(CrawlURI curi): accept the given CrawlURI for scheduling, as it has passed the alreadyIncluded filter |
void | requestState(Frontier.State target): request that the Frontier reach the given state as soon as possible |
protected long | retryDelayFor(CrawlURI curi): return a suitable value to wait before retrying the given URI |
void | run(): request that the Frontier allow crawling to begin |
void | schedule(CrawlURI curi): arrange for the given CrawlURI to be visited, if it is not already scheduled/completed |
void | setCrawlController(CrawlController controller) |
void | setExtract404s(boolean extract404s) |
void | setExtractIndependently(boolean extractIndependently) |
void | setFrontierPreparer(FrontierPreparer prep) |
void | setLoggerModule(CrawlerLoggerModule loggerModule) |
void | setMaxOutlinks(int max) |
void | setMaxRetries(int maxRetries) |
void | setRecoveryLogEnabled(boolean enabled) |
void | setRetryDelaySeconds(int delay) |
void | setScope(DecideRule scope) |
void | setSeeds(SeedModule seeds) |
void | setServerCache(ServerCache serverCache) |
void | setSheetOverlaysManager(SheetOverlaysManager sheetOverlaysManager) |
String | shortReportLine() |
void | start() |
protected void | startManagerThread(): start the dedicated thread with an independent view of the frontier's state |
void | stop() |
long | succeededFetchCount() |
protected void | tally(CrawlURI curi, FetchStats.Stage stage): report a CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for the given stage |
void | terminate(): notify the Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException |
void | unpause(): resumes the release of URIs to crawl, allowing worker ToeThreads to proceed |
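The core lifecycle documented above is: worker ToeThreads loop calling next() to obtain a URI, process it, and report back via finished(), while newly discovered URIs enter through schedule(). A toy, self-contained sketch of that contract (the MiniFrontier type and its String URIs are illustrative, not part of Heritrix):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the Frontier contract: schedule() enqueues, next() emits,
// finished() accounts for the disposition. Illustrative only; the real
// class also journals, tallies substats, and enforces politeness delays.
public class MiniFrontier {
    private final Queue<String> queued = new ArrayDeque<>();
    private long queuedUriCount = 0;
    private long succeededFetchCount = 0;

    public void schedule(String uri) {   // cf. schedule(CrawlURI)
        queued.add(uri);
        queuedUriCount++;
    }

    public String next() {               // cf. next(): emit an eligible URI
        String uri = queued.poll();
        if (uri != null) queuedUriCount--;
        return uri;
    }

    public void finished(String uri) {   // cf. finished(CrawlURI)
        succeededFetchCount++;
    }

    public long queuedUriCount()      { return queuedUriCount; }
    public long succeededFetchCount() { return succeededFetchCount; }
    public boolean isEmpty()          { return queued.isEmpty(); }
}
```

A worker's loop is then `String u = frontier.next(); process(u); frontier.finished(u);`, with next() in the real class blocking when nothing is eligible (bounded by getMaxInWait()).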
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Frontier: averageDepth, congestionRatio, considerIncluded, deepestUri, deleted, deleteURIs, discoveredUriCount, getGroup, getURIsList
protected KeyedProperties kp
protected CrawlController controller
protected SheetOverlaysManager sheetOverlaysManager
protected CrawlerLoggerModule loggerModule
protected SeedModule seeds
protected ServerCache serverCache
protected AtomicLong nextOrdinal
protected DecideRule scope
protected FrontierPreparer preparer
protected AtomicLong queuedUriCount
protected AtomicLong futureUriCount
protected AtomicLong succeededFetchCount
protected AtomicLong failedFetchCount
protected AtomicLong disregardedUriCount
protected AtomicLong totalProcessedBytes
protected FrontierJournal recover
protected ReentrantReadWriteLock outboundLock
protected Thread managerThread
protected Frontier.State lastReachedState
protected volatile Frontier.State targetState
protected ReentrantReadWriteLock dispositionInProgressLock
protected ThreadLocal<CrawlURI> dispositionPending
public AbstractFrontier()
name - Name of this frontier.
description - Description for this frontier.
public KeyedProperties getKeyedProperties()
getKeyedProperties
in interface HasKeyedProperties
public int getRetryDelaySeconds()
public void setRetryDelaySeconds(int delay)
public int getMaxRetries()
public void setMaxRetries(int maxRetries)
public boolean getRecoveryLogEnabled()
public void setRecoveryLogEnabled(boolean enabled)
public int getMaxOutlinks()
ExtractorParameters
getMaxOutlinks
in interface ExtractorParameters
public void setMaxOutlinks(int max)
public boolean getExtractIndependently()
ExtractorParameters
getExtractIndependently
in interface ExtractorParameters
public void setExtractIndependently(boolean extractIndependently)
public boolean getExtract404s()
ExtractorParameters
getExtract404s
in interface ExtractorParameters
public void setExtract404s(boolean extract404s)
public boolean isRunning()
isRunning
in interface org.springframework.context.Lifecycle
public void stop()
stop
in interface org.springframework.context.Lifecycle
public CrawlController getCrawlController()
public void setCrawlController(CrawlController controller)
public SheetOverlaysManager getSheetOverlaysManager()
public void setSheetOverlaysManager(SheetOverlaysManager sheetOverlaysManager)
public CrawlerLoggerModule getLoggerModule()
public void setLoggerModule(CrawlerLoggerModule loggerModule)
public SeedModule getSeeds()
public void setSeeds(SeedModule seeds)
public ServerCache getServerCache()
public void setServerCache(ServerCache serverCache)
public DecideRule getScope()
public void setScope(DecideRule scope)
public FrontierPreparer getFrontierPreparer()
public void setFrontierPreparer(FrontierPreparer prep)
public String getClassKey(CrawlURI curi)
getClassKey
in interface Frontier
curi - CrawlURI we're to get a key for.
protected void startManagerThread()
public void start()
start
in interface org.springframework.context.Lifecycle
protected void managementTasks()
protected void finalTasks()
protected void reachedState(Frontier.State justReached)
public CrawlURI next() throws InterruptedException
Frontier
next
in interface Frontier
InterruptedException
protected abstract CrawlURI findEligibleURI()
protected abstract void processScheduleAlways(CrawlURI caUri)
caUri - CrawlURI to schedule
protected abstract void processScheduleIfUnique(CrawlURI caUri)
caUri - CrawlURI to schedule
protected abstract void processFinish(CrawlURI caUri)
caUri - CrawlURI to finish
protected abstract int getInProcessCount()
protected abstract long getMaxInWait()
public void schedule(CrawlURI curi)
schedule
in interface Frontier
curi - The URI to schedule.
Frontier.schedule(org.archive.modules.CrawlURI)
public void receive(CrawlURI curi)
receive
in interface UriUniqFilter.CrawlUriReceiver
caUri - CrawlURI.
public void finished(CrawlURI curi)
finished
in interface Frontier
curi - The URI that has finished processing.
Frontier.finished(org.archive.modules.CrawlURI)
public void run()
Frontier
public void requestState(Frontier.State target)
Frontier
requestState
in interface Frontier
target - Frontier.State to pursue
public void pause()
Frontier
public void unpause()
Frontier
public void terminate()
Frontier
protected void tally(CrawlURI curi, FetchStats.Stage stage)
curi -
stage -
protected void doJournalFinishedSuccess(CrawlURI c)
protected void doJournalAdded(CrawlURI c)
protected void doJournalRelocated(CrawlURI c)
protected void doJournalReenqueued(CrawlURI c)
protected void doJournalFinishedFailure(CrawlURI c)
protected void doJournalDisregarded(CrawlURI c)
protected void doJournalEmitted(CrawlURI c)
public boolean isEmpty()
protected void incrementQueuedUriCount()
protected void incrementQueuedUriCount(long increment)
increment - amount to increment the queued count
protected void decrementQueuedCount(long numberOfDeletes)
numberOfDeletes -
public long queuedUriCount()
queuedUriCount
in interface Frontier
Frontier.queuedUriCount()
public long futureUriCount()
futureUriCount
in interface Frontier
public long finishedUriCount()
finishedUriCount
in interface Frontier
Frontier.finishedUriCount()
protected void incrementSucceededFetchCount()
public long succeededFetchCount()
succeededFetchCount
in interface Frontier
Frontier.succeededFetchCount()
protected void incrementFailedFetchCount()
public long failedFetchCount()
failedFetchCount
in interface Frontier
Frontier.failedFetchCount()
protected void incrementDisregardedUriCount()
public long disregardedUriCount()
Frontier
Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
disregardedUriCount
in interface Frontier
public void addedSeed(CrawlURI puri)
addedSeed
in interface SeedListener
SeedListener.addedSeed(org.archive.modules.CrawlURI)
public boolean nonseedLine(String line)
nonseedLine
in interface SeedListener
SeedListener.nonseedLine(java.lang.String)
public void concludedSeedBatch()
concludedSeedBatch
in interface SeedListener
protected void prepForFrontier(CrawlURI curi)
protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)
curi - CrawlURI about to be returned by next()
q - the queue from which the CrawlURI came
protected long retryDelayFor(CrawlURI curi)
curi - CrawlURI to be retried
protected void logNonfatalErrors(CrawlURI curi)
curi -
protected boolean overMaxRetries(CrawlURI curi)
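overMaxRetries(curi) and retryDelayFor(curi) together decide whether and when a failed URI is reenqueued, driven by the maxRetries and retryDelaySeconds settings above. A hedged sketch of that decision; the fetch-attempt counter and the RetryPolicy class are illustrative stand-ins for CrawlURI's internal state, not Heritrix API:

```java
// Sketch of the retry decision driven by maxRetries / retryDelaySeconds.
// fetchAttempts stands in for the attempt count a CrawlURI carries.
public class RetryPolicy {
    private final int maxRetries;
    private final int retryDelaySeconds;

    public RetryPolicy(int maxRetries, int retryDelaySeconds) {
        this.maxRetries = maxRetries;
        this.retryDelaySeconds = retryDelaySeconds;
    }

    // cf. overMaxRetries(CrawlURI): too many attempts means give up
    public boolean overMaxRetries(int fetchAttempts) {
        return fetchAttempts >= maxRetries;
    }

    // cf. retryDelayFor(CrawlURI): seconds to wait before reenqueuing
    public long retryDelayFor(int fetchAttempts) {
        return overMaxRetries(fetchAttempts) ? 0L : retryDelaySeconds;
    }
}
```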
public long importRecoverFormat(File source, boolean applyScope, boolean includeOnly, boolean forceFetch, String acceptTags) throws IOException
importRecoverFormat
in interface Frontier
source - File recovery log file to use (may be .gz compressed)
applyScope - whether to apply crawl scope to URIs
includeOnly - whether to only add to included filter, not schedule
forceFetch - whether to force fetching, even if already seen (ignored if includeOnly is set)
acceptTags - String regex; only lines whose first field match will be included
IOException
public void importURIs(String jsonParams) throws IOException
Frontier
The 'params' Map describes the source file to use and options in effect regarding its format and handling. Significant keys are:
"path": full path to source file. If the path ends '.gz', it will be considered to be GZIP compressed.
"format": one of "onePer", "crawlLog", or "recoveryLog"
"forceRevisit": if non-null, URIs will be force-scheduled even if already considered included
"scopeSchedules": if non-null, any URI imported be checked against the frontier's configured scope before scheduling
If the "format" is "recoveryLog", 7 more keys are significant:
"includeSuccesses": if non-null, success lines ("Fs") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.)
"includeFailures": if non-null, failure lines ("Ff") in the log will be considered-included. (Sometimes, this is desired.)
"includeScheduleds": If non-null, scheduled lines ("F+") in the log will be considered-included. (Atypical, but an option for completeness.)
"scopeIncludes": if non-null, any of the above will be checked against the frontier's configured scope before consideration
"scheduleSuccesses": if non-null, success lines ("Fs") in the log will be schedule-attempted. (Atypical, as all successes are preceded by "F+" lines.)
"scheduleFailures": if non-null, failure lines ("Ff") in the log will be schedule-attempted. (Atypical, as all failures are preceded by "F+" lines.)
"scheduleScheduleds": if non-null, scheduled lines ("F+") in the log will be considered-included. (Usually, this is the aim of a recovery-log import.) TODO: add parameter for auto-unpause-at-good-time
importURIs
in interface Frontier
jsonParams - Map describing source file and options as above
IOException - If problems occur reading file.
protected void importURIsSimple(org.json.JSONObject params)
params - JSONObject of options to control import
org.archive.crawler.framework.Frontier#importURIs(java.util.Map)
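Putting the importURIs() keys documented above together, a typical recovery-log import request might look like the following (the path value is illustrative):

```json
{
  "path": "/crawls/job1/logs/recover.gz",
  "format": "recoveryLog",
  "includeSuccesses": true,
  "scheduleScheduleds": true,
  "scopeSchedules": true
}
```

Per the key descriptions, this marks successes as already-included, reschedules the "F+" lines, and scope-checks each URI before scheduling.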
protected void log(CrawlURI curi)
curi -
protected boolean isDisregarded(CrawlURI curi)
protected boolean needsReenqueuing(CrawlURI curi)
curi - The CrawlURI to check
public FrontierJournal getFrontierJournal()
getFrontierJournal
in interface Frontier
public void crawlEnded(String sExitMessage)
public String shortReportLine()
public void onApplicationEvent(org.springframework.context.ApplicationEvent event)
onApplicationEvent
in interface org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>
public void beginDisposition(CrawlURI curi)
Frontier
beginDisposition
in interface Frontier
public void endDisposition()
Frontier
endDisposition
in interface Frontier
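beginDisposition()/endDisposition() bracket a block of processing that must complete atomically with respect to checkpoints, and dispositionPending being a ThreadLocal is what makes extra endDisposition() calls harmless, as the field description above notes. A self-contained sketch of that idiom; names mirror the fields above, but the class itself and the checkpoint interaction are illustrative:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the disposition bracket: each worker thread records its
// in-progress URI in a ThreadLocal and holds a read lock; a checkpointer
// would take the write lock to wait out all in-flight dispositions.
public class DispositionTracker {
    private final ReentrantReadWriteLock dispositionInProgressLock =
            new ReentrantReadWriteLock(true);
    private final ThreadLocal<String> dispositionPending = new ThreadLocal<>();

    public void beginDisposition(String uri) {
        dispositionPending.set(uri);
        dispositionInProgressLock.readLock().lock();
    }

    // Safe to call more than once: the ThreadLocal records whether a
    // disposition is actually pending for this thread.
    public void endDisposition() {
        if (dispositionPending.get() != null) {
            dispositionPending.remove();
            dispositionInProgressLock.readLock().unlock();
        }
    }

    public boolean hasPending() { return dispositionPending.get() != null; }
}
```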
Copyright © 2003-2014 Internet Archive. All Rights Reserved.