Package | Description
---|---
org.archive.crawler.datamodel |
org.archive.crawler.deciderules | Provides classes for a simple decision rules framework.
org.archive.crawler.event |
org.archive.crawler.framework |
org.archive.crawler.frontier |
org.archive.crawler.frontier.precedence |
org.archive.crawler.postprocessor |
org.archive.crawler.prefetch |
org.archive.crawler.processor |
org.archive.crawler.reporting |
org.archive.crawler.spring |
org.archive.crawler.util |
org.archive.modules | The beginnings of a refactored settings framework.
org.archive.modules.credential | Contains the HTML form login and basic/digest credentials used by Heritrix to log into sites.
org.archive.modules.deciderules |
org.archive.modules.deciderules.recrawl |
org.archive.modules.deciderules.surt |
org.archive.modules.extractor |
org.archive.modules.fetcher |
org.archive.modules.forms |
org.archive.modules.net |
org.archive.modules.recrawl |
org.archive.modules.seeds |
org.archive.modules.writer |
Modifier and Type | Method and Description
---|---
void | UriUniqFilter.add(String key, CrawlURI value) - Add the given URI, if not already present.
void | UriUniqFilter.addForce(String key, CrawlURI value) - Add the given URI, all the way through to the underlying destination, even if already present.
void | UriUniqFilter.addNow(String key, CrawlURI value) - Immediately add the URI.
void | UriUniqFilter.forget(String key, CrawlURI value) - Forget that the item was seen.
void | UriUniqFilter.CrawlUriReceiver.receive(CrawlURI item)
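The add/addForce/forget contract above can be illustrated with a minimal set-backed already-seen filter, analogous in spirit to SetBasedUriUniqFilter. This is a sketch only: it uses plain strings instead of CrawlURI, and the class and method names here are illustrative, not the real Heritrix API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Minimal already-seen filter: passes a URI through to a receiver
// only the first time its canonical key is seen.
public class MemoryUriUniqFilter {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<String> receiver;

    public MemoryUriUniqFilter(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    // Add the given uri, if its key was not already present.
    public void add(String key, String uri) {
        if (seen.add(key)) {
            receiver.accept(uri);
        }
    }

    // Add the given uri all the way through, even if already present.
    public void addForce(String key, String uri) {
        seen.add(key);
        receiver.accept(uri);
    }

    // Forget the item was seen, so a later add() passes it through again.
    public void forget(String key) {
        seen.remove(key);
    }
}
```

A Bloom-filter-backed variant (like BloomUriUniqFilter) trades this exactness for a fixed memory footprint, at the cost of occasional false "already seen" answers.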
Modifier and Type | Method and Description
---|---
protected String | ClassKeyMatchesRegexDecideRule.getString(CrawlURI uri)
Modifier and Type | Field and Description
---|---
protected CrawlURI | CrawlURIDispositionEvent.curi

Modifier and Type | Method and Description
---|---
CrawlURI | CrawlURIDispositionEvent.getCrawlURI()

Constructor and Description
---
CrawlURIDispositionEvent(Object source, CrawlURI curi, CrawlURIDispositionEvent.Disposition disposition)
Modifier and Type | Method and Description
---|---
CrawlURI | Frontier.next() - Get the next URI that should be processed.

Modifier and Type | Method and Description
---|---
void | Frontier.beginDisposition(CrawlURI curi) - Inform the frontier that a block of processing that should complete atomically with respect to checkpoints is about to begin.
void | Frontier.considerIncluded(CrawlURI curi) - Notify the frontier that it should consider the given UURI as if already scheduled.
void | Frontier.deleted(CrawlURI curi) - Notify the frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
void | Frontier.finished(CrawlURI cURI) - Report a URI being processed as having finished processing.
String | Frontier.getClassKey(CrawlURI cauri)
Frontier.FrontierGroup | Frontier.getGroup(CrawlURI curi) - Get the 'frontier group' (usually a queue) for the given CrawlURI.
protected boolean | Scoper.isInScope(CrawlURI caUri) - Test whether the given CrawlURI is within the crawl scope.
protected void | Scoper.outOfScope(CrawlURI caUri) - Called when a CrawlURI is ruled out of scope.
void | Frontier.schedule(CrawlURI caURI) - Schedules a CrawlURI.
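The Frontier contract above follows a schedule()/next()/finished() lifecycle: schedule() enqueues a URI, next() hands one to a worker thread, and finished() reports the outcome. A trivial FIFO analogue of that lifecycle is sketched below; the real interface operates on CrawlURI and adds per-queue politeness and checkpointing, all omitted here, and the ToyFrontier name is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy frontier: a FIFO queue plus an already-included set,
// tracking how many URIs have completed processing.
public class ToyFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> included = new HashSet<>();
    private long finishedCount = 0;

    // Schedule a URI, unless it was already scheduled/completed.
    public void schedule(String uri) {
        if (included.add(uri)) {
            pending.add(uri);
        }
    }

    // Get the next URI that should be processed, or null if none.
    public String next() {
        return pending.poll();
    }

    // Report a URI being processed as having finished processing.
    public void finished(String uri) {
        finishedCount++;
    }

    public long finishedCount() { return finishedCount; }
    public boolean isEmpty() { return pending.isEmpty(); }
}
```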
Modifier and Type | Field and Description
---|---
protected CrawlURI | WorkQueue.peekItem - The next item to be returned.

Modifier and Type | Field and Description
---|---
protected ThreadLocal<CrawlURI> | AbstractFrontier.dispositionPending - Remembers a disposition-in-progress, so that extra endDisposition() calls are harmless.
protected com.sleepycat.collections.StoredSortedMap<Long,CrawlURI> | WorkQueueFrontier.futureUris - URIs scheduled to be re-enqueued at a future date.

Modifier and Type | Method and Description
---|---
protected abstract CrawlURI | AbstractFrontier.findEligibleURI() - Find a CrawlURI eligible to be put on the outbound queue for processing.
protected CrawlURI | WorkQueueFrontier.findEligibleURI() - Return the next CrawlURI eligible to be processed (and presumably visited/fetched) by a worker thread.
CrawlURI | BdbMultipleWorkQueues.get(com.sleepycat.je.DatabaseEntry headKey) - Get the next nearest item after the given key.
CrawlURI | AbstractFrontier.next()
CrawlURI | WorkQueue.peek(WorkQueueFrontier frontier) - Return the topmost queue item, and remember it, such that even later higher-priority inserts don't change it.
protected abstract CrawlURI | WorkQueue.peekItem(WorkQueueFrontier frontier) - Returns the first item from the queue (does not delete it).
protected CrawlURI | BdbWorkQueue.peekItem(WorkQueueFrontier frontier)
Modifier and Type | Method and Description
---|---
void | FrontierJournal.added(CrawlURI curi)
void | AbstractFrontier.addedSeed(CrawlURI puri) - When notified of a seed via the SeedListener interface, schedule it.
void | AbstractFrontier.beginDisposition(CrawlURI curi)
protected static com.sleepycat.je.DatabaseEntry | BdbMultipleWorkQueues.calculateInsertKey(CrawlURI curi) - Calculate the insertKey that places a CrawlURI in the desired spot.
void | WorkQueueFrontier.considerIncluded(CrawlURI curi)
int | ZeroCostAssignmentPolicy.costOf(CrawlURI curi)
abstract int | CostAssignmentPolicy.costOf(CrawlURI curi)
int | AntiCalendarCostAssignmentPolicy.costOf(CrawlURI curi)
int | UnitCostAssignmentPolicy.costOf(CrawlURI curi)
int | WagCostAssignmentPolicy.costOf(CrawlURI curi) - Add constant penalties for certain features of the URI (and its 'via') that make it more delayable/skippable.
void | BdbMultipleWorkQueues.delete(CrawlURI item) - Delete the given CrawlURI from the persistent store.
void | WorkQueueFrontier.deleted(CrawlURI curi) - Force logging, etc.
protected abstract void | WorkQueue.deleteItem(WorkQueueFrontier frontier, CrawlURI item) - Removes the given item from the queue.
protected void | BdbWorkQueue.deleteItem(WorkQueueFrontier frontier, CrawlURI peekItem)
protected void | WorkQueue.dequeue(WorkQueueFrontier frontier, CrawlURI expected) - Remove the peekItem from the queue and adjust the count.
protected void | AbstractFrontier.doJournalAdded(CrawlURI c)
protected void | AbstractFrontier.doJournalDisregarded(CrawlURI c)
protected void | AbstractFrontier.doJournalEmitted(CrawlURI c)
protected void | AbstractFrontier.doJournalFinishedFailure(CrawlURI c)
protected void | AbstractFrontier.doJournalFinishedSuccess(CrawlURI c)
protected void | AbstractFrontier.doJournalReenqueued(CrawlURI c)
protected void | AbstractFrontier.doJournalRelocated(CrawlURI c)
void | FrontierJournal.emitted(CrawlURI curi)
protected long | WorkQueue.enqueue(WorkQueueFrontier frontier, CrawlURI curi) - Add the given CrawlURI, noting its addition in the running count.
void | AbstractFrontier.finished(CrawlURI curi) - Note that the previously emitted CrawlURI has completed its processing (for now).
void | FrontierJournal.finishedDisregard(CrawlURI curi)
void | FrontierJournal.finishedFailure(CrawlURI curi)
void | FrontierJournal.finishedSuccess(CrawlURI curi)
protected void | WorkQueueFrontier.forget(CrawlURI curi) - Forget the given CrawlURI.
String | BucketQueueAssignmentPolicy.getClassKey(CrawlURI curi)
String | AbstractFrontier.getClassKey(CrawlURI curi)
abstract String | QueueAssignmentPolicy.getClassKey(CrawlURI cauri) - Get the String key (name) of the queue to which the CrawlURI should be assigned.
String | AssignmentLevelSurtQueueAssignmentPolicy.getClassKey(CrawlURI cauri)
String | IPQueueAssignmentPolicy.getClassKey(CrawlURI cauri)
String | URIAuthorityBasedQueueAssignmentPolicy.getClassKey(CrawlURI curi)
Frontier.FrontierGroup | BdbFrontier.getGroup(CrawlURI curi)
void | FrontierJournal.included(CrawlURI curi)
protected abstract void | WorkQueue.insertItem(WorkQueueFrontier frontier, CrawlURI curi, boolean overwriteIfPresent) - Insert the given curi, whether it is already present or not.
protected void | BdbWorkQueue.insertItem(WorkQueueFrontier frontier, CrawlURI curi, boolean overwriteIfPresent)
protected boolean | AbstractFrontier.isDisregarded(CrawlURI curi)
protected void | AbstractFrontier.log(CrawlURI curi) - Log to the main crawl.log.
protected void | AbstractFrontier.logNonfatalErrors(CrawlURI curi) - Take note of any processor-local errors that have been entered into the CrawlURI.
protected boolean | AbstractFrontier.needsReenqueuing(CrawlURI curi) - Checks whether a recently processed CrawlURI that did not finish successfully needs to be re-enqueued (and thus possibly processed again after some time elapses).
protected void | AbstractFrontier.noteAboutToEmit(CrawlURI curi, WorkQueue q) - Perform fixups on a CrawlURI about to be returned via next().
protected boolean | AbstractFrontier.overMaxRetries(CrawlURI curi)
protected void | AbstractFrontier.prepForFrontier(CrawlURI curi)
protected abstract void | AbstractFrontier.processFinish(CrawlURI caUri) - Handle the given CrawlURI as having finished a worker ToeThread processing attempt.
protected void | WorkQueueFrontier.processFinish(CrawlURI curi) - Note that the previously emitted CrawlURI has completed its processing (for now).
protected abstract void | AbstractFrontier.processScheduleAlways(CrawlURI caUri) - Schedule the given CrawlURI regardless of its already-seen status.
protected void | WorkQueueFrontier.processScheduleAlways(CrawlURI curi) - Accept the given CrawlURI for scheduling, as it has passed the alreadyIncluded filter.
protected abstract void | AbstractFrontier.processScheduleIfUnique(CrawlURI caUri) - Schedule the given CrawlURI if it has not already been seen.
protected void | WorkQueueFrontier.processScheduleIfUnique(CrawlURI curi) - Arrange for the given CrawlURI to be visited, if it is not already scheduled/completed.
void | BdbMultipleWorkQueues.put(CrawlURI curi, boolean overwriteIfPresent) - Put the given CrawlURI in at the appropriate place.
void | AbstractFrontier.receive(CrawlURI curi) - Accept the given CrawlURI for scheduling, as it has passed the alreadyIncluded filter.
void | FrontierJournal.reenqueued(CrawlURI curi)
protected long | AbstractFrontier.retryDelayFor(CrawlURI curi) - Return a suitable delay to wait before retrying the given URI.
void | AbstractFrontier.schedule(CrawlURI curi) - Arrange for the given CrawlURI to be visited, if it is not already scheduled/completed.
void | WorkQueueFrontier.schedule(CrawlURI curi) - Arrange for the given CrawlURI to be visited, if it is not already enqueued/completed.
protected void | WorkQueueFrontier.sendToQueue(CrawlURI curi) - Send a CrawlURI to the appropriate subqueue.
void | WorkQueue.tally(CrawlURI curi, FetchStats.Stage stage)
protected void | AbstractFrontier.tally(CrawlURI curi, FetchStats.Stage stage) - Report the CrawlURI to each of the three 'substats' accumulators (group/queue, server, host) for the given stage.
void | WorkQueue.unpeek(CrawlURI expected) - Forgive the peek, allowing a subsequent peek to return a different item.
protected void | WorkQueue.update(WorkQueueFrontier frontier, CrawlURI curi) - Update the given CrawlURI, which should already be present.
void | FrontierJournal.writeLongUriLine(String tag, CrawlURI curi)
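The CostAssignmentPolicy family above assigns each URI a per-queue 'cost' (UnitCostAssignmentPolicy charges 1 for everything; WagCostAssignmentPolicy adds penalties for delayable features of the URI). A simplified analogue of that idea follows; the feature tests and penalty values are chosen arbitrarily for illustration and are not Heritrix's actual rules.

```java
// Simplified cost policy: a base cost of 1, plus penalties for URI
// features that make it more delayable/skippable. Illustrative only.
public class ToyCostPolicy {
    public int costOf(String uri) {
        int cost = 1;
        if (uri.contains("?")) {
            cost += 2;   // query strings are often lower-value
        }
        if (uri.contains("calendar")) {
            cost += 4;   // calendar-like pages can expand without bound
        }
        return cost;
    }
}
```

A higher cost means the URI depletes more of its queue's budget when processed, so cheap URIs are effectively prioritized.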
Modifier and Type | Method and Description
---|---
protected int | PreloadedUriPrecedencePolicy.calculatePrecedence(CrawlURI curi)
protected int | BaseUriPrecedencePolicy.calculatePrecedence(CrawlURI curi) - Calculate the precedence value for the given URI.
protected int | HopsUriPrecedencePolicy.calculatePrecedence(CrawlURI curi)
protected void | PreloadedUriPrecedencePolicy.mergePrior(CrawlURI curi) - Merge any data from the Map stored in the URI-history store into the current instance.
void | PrecedenceProvider.tally(CrawlURI curi, FetchStats.Stage stage)
void | HighestUriQueuePrecedencePolicy.HighestUriPrecedenceProvider.tally(CrawlURI curi, FetchStats.Stage stage)
void | CostUriPrecedencePolicy.uriScheduled(CrawlURI curi)
abstract void | UriPrecedencePolicy.uriScheduled(CrawlURI curi) - Add a precedence value to the supplied CrawlURI, which is being scheduled onto a frontier queue for the first time.
void | PreloadedUriPrecedencePolicy.uriScheduled(CrawlURI curi)
void | BaseUriPrecedencePolicy.uriScheduled(CrawlURI curi)
Modifier and Type | Method and Description
---|---
protected boolean | CandidatesProcessor.checkForSeedPromotion(CrawlURI curi) - Check if the URI needs special 'discovered seed' treatment.
protected int | LinksScoper.getSchedulingFor(CrawlURI curi, Link wref, int preferenceDepthHops) - Deprecated. Determine scheduling for the curi.
protected void | LinksScoper.handlePrerequisite(CrawlURI curi) - Deprecated. The CrawlURI has a prerequisite; apply scoping and update the Link to a CrawlURI in a manner analogous to outlink handling.
protected void | CandidatesProcessor.innerProcess(CrawlURI curi) - Run the candidates chain on each of (1) any prerequisite, if present; (2) any outCandidates, if present; (3) all outlinks, if appropriate.
protected void | ReschedulingProcessor.innerProcess(CrawlURI curi)
protected void | LinksScoper.innerProcess(CrawlURI puri) - Deprecated.
protected void | SupplementaryLinksScoper.innerProcess(CrawlURI puri)
protected void | LowDiskPauseProcessor.innerProcess(CrawlURI uri) - Deprecated.
protected void | DispositionProcessor.innerProcess(CrawlURI puri)
protected ProcessResult | LowDiskPauseProcessor.innerProcessResult(CrawlURI curi) - Deprecated. Notes a CrawlURI's content size in its running tally.
protected boolean | SupplementaryLinksScoper.isInScope(CrawlURI caUri)
protected void | LinksScoper.outOfScope(CrawlURI caUri) - Deprecated.
protected void | SupplementaryLinksScoper.outOfScope(CrawlURI caUri) - Called when a CrawlURI is ruled out of scope.
protected long | DispositionProcessor.politenessDelayFor(CrawlURI curi) - Update any scheduling structures with the new information in this CrawlURI.
protected int | CandidatesProcessor.runCandidateChain(CrawlURI candidate, CrawlURI source) - Run the candidates chain on a single candidate CrawlURI; if its reported status is nonnegative, schedule it to the frontier.
protected boolean | CandidatesProcessor.shouldProcess(CrawlURI puri)
protected boolean | ReschedulingProcessor.shouldProcess(CrawlURI curi)
protected boolean | LinksScoper.shouldProcess(CrawlURI puri) - Deprecated.
protected boolean | SupplementaryLinksScoper.shouldProcess(CrawlURI puri)
protected boolean | LowDiskPauseProcessor.shouldProcess(CrawlURI curi) - Deprecated.
protected boolean | DispositionProcessor.shouldProcess(CrawlURI puri)
Modifier and Type | Method and Description
---|---
protected boolean | QuotaEnforcer.applyQuota(CrawlURI curi, String key, long actual) - Apply the quota specified by the given key against the actual value provided.
protected boolean | PreconditionEnforcer.authenticated(Credential credential, CrawlURI curi) - Check whether the passed credential has already been authenticated.
protected String | FrontierPreparer.canonicalize(CrawlURI cauri) - Canonicalize the passed CrawlURI.
protected boolean | QuotaEnforcer.checkQuotas(CrawlURI curi, FetchStats.HasFetchStats hasStats, int CAT) - Check all quotas for the given substats and category (server, host, or group).
protected boolean | PreconditionEnforcer.considerDnsPreconditions(CrawlURI curi)
protected boolean | PreconditionEnforcer.considerRobotsPreconditions(CrawlURI curi) - Consider the robots precondition.
protected boolean | PreconditionEnforcer.credentialPrecondition(CrawlURI curi) - Consider credential preconditions.
String | FrontierPreparer.getClassKey(CrawlURI curi)
protected int | FrontierPreparer.getCost(CrawlURI curi) - Return the 'cost' of a CrawlURI (how much of its associated queue's budget it depletes upon attempted processing).
protected int | FrontierPreparer.getSchedulingDirective(CrawlURI curi) - Calculate the coarse, original 'schedulingDirective' prioritization for the given CrawlURI.
protected void | RuntimeLimitEnforcer.innerProcess(CrawlURI curi)
protected void | Preselector.innerProcess(CrawlURI puri)
protected void | QuotaEnforcer.innerProcess(CrawlURI puri)
protected void | FrontierPreparer.innerProcess(CrawlURI curi)
protected void | PreconditionEnforcer.innerProcess(CrawlURI puri)
protected void | CandidateScoper.innerProcess(CrawlURI uri)
protected ProcessResult | RuntimeLimitEnforcer.innerProcessResult(CrawlURI curi)
protected ProcessResult | Preselector.innerProcessResult(CrawlURI puri)
protected ProcessResult | QuotaEnforcer.innerProcessResult(CrawlURI puri)
protected ProcessResult | PreconditionEnforcer.innerProcessResult(CrawlURI puri)
protected ProcessResult | CandidateScoper.innerProcessResult(CrawlURI curi)
boolean | PreconditionEnforcer.isIpExpired(CrawlURI curi) - Return true if the IP should be looked up.
void | FrontierPreparer.prepare(CrawlURI curi) - Apply all configured policies to the CrawlURI.
protected boolean | RuntimeLimitEnforcer.shouldProcess(CrawlURI puri)
protected boolean | Preselector.shouldProcess(CrawlURI puri)
protected boolean | QuotaEnforcer.shouldProcess(CrawlURI puri)
protected boolean | FrontierPreparer.shouldProcess(CrawlURI uri)
protected boolean | PreconditionEnforcer.shouldProcess(CrawlURI puri)
protected boolean | CandidateScoper.shouldProcess(CrawlURI uri)
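QuotaEnforcer.applyQuota above compares an actual tally against a configured limit for a category key (server, host, or group). Stripped of the CrawlURI and FetchStats plumbing, the check reduces to a pattern like the following; the key names, the zero-means-unlimited convention, and the ToyQuotaEnforcer class are assumptions for illustration, not Heritrix's exact semantics.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified quota check: a quota of <= 0 means "no limit";
// otherwise the actual value must stay below the quota.
public class ToyQuotaEnforcer {
    private final Map<String, Long> quotas = new HashMap<>();

    public void setQuota(String key, long limit) {
        quotas.put(key, limit);
    }

    // Returns true if the quota for this key has been reached or
    // exceeded, i.e. further fetches under this key should be blocked.
    public boolean overQuota(String key, long actual) {
        long quota = quotas.getOrDefault(key, 0L);
        return quota > 0 && actual >= quota;
    }
}
```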
Modifier and Type | Method and Description
---|---
protected boolean | CrawlMapper.decideToMapOutlink(CrawlURI cauri)
protected void | CrawlMapper.divertLog(CrawlURI cauri, String target) - Note the given CrawlURI in the appropriate diversion log.
protected String | HashCrawlMapper.getReduceRegex(CrawlURI cauri)
protected void | CrawlMapper.innerProcess(CrawlURI puri)
protected ProcessResult | CrawlMapper.innerProcessResult(CrawlURI puri)
protected String | LexicalCrawlMapper.map(CrawlURI cauri) - Look up the crawler node name to which the given CrawlURI should be mapped.
protected abstract String | CrawlMapper.map(CrawlURI cauri) - Look up the crawler node name to which the given CrawlURI should be mapped.
protected String | HashCrawlMapper.map(CrawlURI cauri) - Look up the crawler node name to which the given CrawlURI should be mapped.
protected boolean | CrawlMapper.shouldProcess(CrawlURI puri)
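The CrawlMapper subclasses above decide which crawler node in a distributed crawl a URI belongs to; HashCrawlMapper does so by hashing a key derived from the URI. The core idea in isolation is sketched below, with a made-up node-naming scheme; the real class reduces the key with a configurable regex first, which is omitted here.

```java
// Hash-based mapping of a queue key to one of N crawler nodes,
// in the spirit of HashCrawlMapper. Math.floorMod keeps the bucket
// non-negative even when hashCode() is negative.
public class ToyCrawlMapper {
    private final int nodeCount;

    public ToyCrawlMapper(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Return the node name responsible for this key; the same key
    // always maps to the same node.
    public String map(String classKey) {
        int bucket = Math.floorMod(classKey.hashCode(), nodeCount);
        return "node-" + bucket;
    }
}
```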
Modifier and Type | Method and Description
---|---
void | StatisticsTracker.addedSeed(CrawlURI curi) - Create a seed record, even on initial notification (before any real attempt/processing).
void | StatisticsTracker.crawledURIDisregard(CrawlURI curi)
void | StatisticsTracker.crawledURIFailure(CrawlURI curi)
void | StatisticsTracker.crawledURINeedRetry(CrawlURI curi)
void | StatisticsTracker.crawledURISuccessful(CrawlURI curi)
protected void | SeedRecord.fillWith(CrawlURI curi, String disposition) - Fill the instance with the given values; skips makeDirty, so it may be used on initialization.
protected void | StatisticsTracker.handleSeed(CrawlURI curi, String disposition) - If the curi is a seed, update the processedSeeds cache.
void | SeedRecord.updateWith(CrawlURI curi, String disposition) - A later/repeat report of the same seed has arrived; update with the latest.

Constructor and Description
---
SeedRecord(CrawlURI curi, String disposition) - Create a record from the given CrawlURI and disposition string.
Modifier and Type | Method and Description
---|---
void | SheetOverlaysManager.applyOverlaysTo(CrawlURI curi) - Apply the proper overlays (by Sheet beanName) to the given CrawlURI, according to configured associations.
Modifier and Type | Method and Description
---|---
void | CrawledBytesHistotable.accumulate(CrawlURI curi)
void | SetBasedUriUniqFilter.add(String key, CrawlURI value)
void | FPMergeUriUniqFilter.add(String key, CrawlURI value)
void | SetBasedUriUniqFilter.addForce(String key, CrawlURI value)
void | FPMergeUriUniqFilter.addForce(String key, CrawlURI value)
void | SetBasedUriUniqFilter.addNow(String key, CrawlURI value)
void | FPMergeUriUniqFilter.addNow(String key, CrawlURI value)
void | SetBasedUriUniqFilter.forget(String key, CrawlURI value)
void | FPMergeUriUniqFilter.forget(String key, CrawlURI value)
void | BloomUriUniqFilter.forget(String canonical, CrawlURI item)
protected void | FPMergeUriUniqFilter.pend(long fp, CrawlURI value) - Place the given FP/CrawlURI pair into the pending set, awaiting a merge to determine if it's actually accepted.
void | BenchmarkUriUniqFilters.receive(CrawlURI item)

Constructor and Description
---
FPMergeUriUniqFilter.PendingItem(long fp, CrawlURI value)
Modifier and Type | Field and Description
---|---
protected CrawlURI | CrawlURI.fullVia

Modifier and Type | Field and Description
---|---
protected Collection<CrawlURI> | CrawlURI.outCandidates

Modifier and Type | Method and Description
---|---
CrawlURI | CrawlURI.clearPrerequisiteUri() - Clear the prerequisite, if any.
CrawlURI | CrawlURI.createCrawlURI(UURI baseUURI, Link link) - Utility method for creating CandidateURIs found while extracting links from this CrawlURI.
CrawlURI | CrawlURI.createCrawlURI(UURI baseUURI, Link link, int scheduling, boolean seed) - Utility method for creating CandidateURIs found while extracting links from this CrawlURI.
static CrawlURI | CrawlURI.fromHopsViaString(String uriHopsViaContext)
CrawlURI | CrawlURI.getFullVia()
CrawlURI | CrawlURI.getPrerequisiteUri() - Get the prerequisite for this URI.
CrawlURI | CrawlURI.makeConsequentCandidate(String destination, LinkContext lc, Hop hop) - Create a consequent CrawlURI from this one, given the additional parameters.
CrawlURI | CrawlURI.markPrerequisite(String preq) - Do all actions associated with setting a CrawlURI as requiring a prerequisite.

Modifier and Type | Method and Description
---|---
Collection<CrawlURI> | CrawlURI.getOutCandidates() - Returns discovered candidate URIs.
Modifier and Type | Method and Description
---|---
static String | Processor.flattenVia(CrawlURI puri)
static long | Processor.getRecordedSize(CrawlURI puri)
static boolean | Processor.hasHttpAuthenticationCredential(CrawlURI puri)
protected void | CrawlURI.inheritFrom(CrawlURI ancestor) - Inherit (copy) the relevant keys-values from the ancestor.
protected void | ScriptedProcessor.innerProcess(CrawlURI curi)
protected abstract void | Processor.innerProcess(CrawlURI uri) - Actually performs the process.
protected ProcessResult | Processor.innerProcessResult(CrawlURI uri)
protected void | Processor.innerRejectProcess(CrawlURI uri) - Invoked after a URI has been rejected.
static boolean | Processor.isSuccess(CrawlURI puri)
ProcessResult | Processor.process(CrawlURI uri) - Processes the given URI.
void | ProcessorChain.process(CrawlURI curi, ProcessorChain.ChainStatusReceiver thread)
void | CrawlURI.setFullVia(CrawlURI curi)
void | CrawlURI.setPrerequisiteUri(CrawlURI pre) - Set a prerequisite for this URI.
protected boolean | ScriptedProcessor.shouldProcess(CrawlURI curi)
protected abstract boolean | Processor.shouldProcess(CrawlURI uri) - Determines whether the given URI should be processed by this processor.
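Processor above follows a template-method pattern: process() first consults shouldProcess(), and only then delegates to innerProcess(). A stripped-down illustration of that shape, using plain string URIs and made-up return values instead of CrawlURI and ProcessResult:

```java
// Template-method skeleton in the style of Processor: subclasses
// supply shouldProcess() and innerProcess(); process() ties them together.
public abstract class ToyProcessor {
    // Final, so the skip-vs-run policy cannot be bypassed by subclasses.
    public final String process(String uri) {
        if (!shouldProcess(uri)) {
            return "skipped";
        }
        innerProcess(uri);
        return "processed";
    }

    // Determines whether the given uri should be processed at all.
    protected abstract boolean shouldProcess(String uri);

    // Actually performs the process.
    protected abstract void innerProcess(String uri);
}
```

A chain of such processors (fetchers, extractors, writers) is then run in sequence over each URI, each one deciding independently whether it applies.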
Modifier and Type | Method and Description
---|---
void | Credential.attach(CrawlURI curi) - Attach this credential's avatar to the passed curi.
boolean | Credential.detach(CrawlURI curi) - Detach this credential from the passed curi.
boolean | Credential.detachAll(CrawlURI curi) - Detach all credentials of this type from the passed curi.
static HttpAuthenticationCredential | HttpAuthenticationCredential.getByRealm(Set<Credential> rfc2617Credentials, String realm, CrawlURI context) - Convenience method that does a lookup on the passed set using the realm as key.
String | HttpAuthenticationCredential.getPrerequisite(CrawlURI curi)
String | HtmlFormCredential.getPrerequisite(CrawlURI curi)
abstract String | Credential.getPrerequisite(CrawlURI curi) - Return the authentication URI, either absolute or relative, that serves as a prerequisite for the passed curi.
boolean | HttpAuthenticationCredential.hasPrerequisite(CrawlURI curi)
boolean | HtmlFormCredential.hasPrerequisite(CrawlURI curi)
abstract boolean | Credential.hasPrerequisite(CrawlURI curi)
boolean | HttpAuthenticationCredential.isPrerequisite(CrawlURI curi)
boolean | HtmlFormCredential.isPrerequisite(CrawlURI curi)
abstract boolean | Credential.isPrerequisite(CrawlURI curi)
boolean | HttpAuthenticationCredential.populate(CrawlURI curi, org.apache.commons.httpclient.HttpClient http, org.apache.commons.httpclient.HttpMethod method, Map<String,String> httpAuthChallenges)
boolean | HtmlFormCredential.populate(CrawlURI curi, org.apache.commons.httpclient.HttpClient http, org.apache.commons.httpclient.HttpMethod method, Map<String,String> httpAuthChallenges)
abstract boolean | Credential.populate(CrawlURI curi, org.apache.commons.httpclient.HttpClient http, org.apache.commons.httpclient.HttpMethod method, Map<String,String> httpAuthChallenges)
boolean | Credential.rootUriMatch(ServerCache cache, CrawlURI curi) - Test whether the passed curi matches this credential's rootUri.
Set<Credential> | CredentialStore.subset(CrawlURI context, Class<?> type) - Return the set made up of all credentials of the passed type.
Set<Credential> | CredentialStore.subset(CrawlURI context, Class<?> type, String rootUri) - Return the set made up of all credentials of the passed type.
Modifier and Type | Method and Description
---|---
boolean | DecideRule.accepts(CrawlURI uri)
DecideResult | DecideRule.decisionFor(CrawlURI uri)
protected boolean | HasViaDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given URI has a 'via' URI.
protected boolean | NotMatchesListRegexDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's string version does not match the configured regexes (by reversing the superclass's answer).
protected boolean | ResponseContentLengthDecideRule.evaluate(CrawlURI uri)
protected boolean | TransclusionDecideRule.evaluate(CrawlURI curi) - Evaluate whether the given object is within the acceptable thresholds of transitive hops.
protected boolean | NotMatchesStatusCodeDecideRule.evaluate(CrawlURI uri) - Returns "true" if the provided CrawlURI has a fetch status that does not fall within this instance's specified range.
protected boolean | ResourceNoLongerThanDecideRule.evaluate(CrawlURI curi)
protected boolean | AddRedirectFromRootServerToScope.evaluate(CrawlURI uri)
protected boolean | NotMatchesFilePatternDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object's string version does not match the configured regex (by reversing the superclass's answer).
protected boolean | TooManyPathSegmentsDecideRule.evaluate(CrawlURI curi) - Evaluate whether the given object is over the threshold number of path-segments.
protected boolean | NotMatchesRegexDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's string version does not match the configured regex (by reversing the superclass's answer).
protected boolean | SchemeNotInSetDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given URI's scheme is NOT in the configured set.
protected boolean | MatchesListRegexDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object's string version matches the configured regexes.
protected boolean | TooManyHopsDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object is over the threshold number of hops.
protected boolean | IpAddressSetDecideRule.evaluate(CrawlURI curi)
protected boolean | FetchStatusNotMatchesRegexDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's FetchStatus does not match the configured regex (by reversing the superclass's answer).
protected boolean | HopCrossesAssignmentLevelDomainDecideRule.evaluate(CrawlURI uri)
protected abstract boolean | PredicatedDecideRule.evaluate(CrawlURI object)
protected boolean | MatchesStatusCodeDecideRule.evaluate(CrawlURI uri) - Returns "true" if the provided CrawlURI has a fetch status that falls within this instance's specified range.
protected boolean | ContentTypeNotMatchesRegexDecideRule.evaluate(CrawlURI o) - Evaluate whether the given object's string version does not match the configured regex (by reversing the superclass's answer).
protected boolean | MatchesRegexDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object's string version matches the configured regex.
protected boolean | FetchStatusDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object is equal to the configured status.
protected boolean | ExternalGeoLocationDecideRule.evaluate(CrawlURI uri)
protected String | IpAddressSetDecideRule.getHostAddress(CrawlURI curi) - From WriterPoolProcessor.
protected String | ContentTypeMatchesRegexDecideRule.getString(CrawlURI uri)
protected String | HopsPathMatchesRegexDecideRule.getString(CrawlURI uri)
protected String | FetchStatusMatchesRegexDecideRule.getString(CrawlURI uri)
protected String | MatchesRegexDecideRule.getString(CrawlURI uri)
DecideResult | PrerequisiteAcceptDecideRule.innerDecide(CrawlURI uri)
protected DecideResult | RejectDecideRule.innerDecide(CrawlURI uri)
DecideResult | DecideRuleSequence.innerDecide(CrawlURI uri)
protected DecideResult | PathologicalPathDecideRule.innerDecide(CrawlURI uri)
DecideResult | ScriptedDecideRule.innerDecide(CrawlURI uri)
protected abstract DecideResult | DecideRule.innerDecide(CrawlURI uri)
protected DecideResult | SeedAcceptDecideRule.innerDecide(CrawlURI uri)
protected DecideResult | ContentLengthDecideRule.innerDecide(CrawlURI uri)
protected DecideResult | AcceptDecideRule.innerDecide(CrawlURI uri)
protected DecideResult | PredicatedDecideRule.innerDecide(CrawlURI uri)
DecideResult | RejectDecideRule.onlyDecision(CrawlURI uri)
DecideResult | DecideRule.onlyDecision(CrawlURI uri)
DecideResult | AcceptDecideRule.onlyDecision(CrawlURI uri)
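The decide-rules framework above composes small accept/reject predicates; a DecideRuleSequence runs them in order, and the last rule to return a definite decision wins. A stripped-down version of that composition, with string URIs standing in for CrawlURI and the ToyDecideRules name made up for illustration:

```java
import java.util.List;
import java.util.function.Function;

// Minimal decide-rule chain: each rule maps a URI to ACCEPT, REJECT,
// or NONE ("no opinion"); the last non-NONE answer is the decision.
public class ToyDecideRules {
    public enum Decision { ACCEPT, REJECT, NONE }

    private final List<Function<String, Decision>> rules;

    public ToyDecideRules(List<Function<String, Decision>> rules) {
        this.rules = rules;
    }

    public Decision decisionFor(String uri) {
        Decision result = Decision.NONE;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(uri);
            if (d != Decision.NONE) {
                result = d;   // later rules override earlier ones
            }
        }
        return result;
    }
}
```

A typical scope starts with a blanket reject, then accepts URIs inside the desired prefixes, then rejects pathological cases; order matters precisely because later rules override earlier ones.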
Modifier and Type | Method and Description
---|---
protected boolean | IdenticalDigestDecideRule.evaluate(CrawlURI curi) - Evaluate whether the given CrawlURI's content-digest exactly matches that of the preceding fetch.
static boolean | IdenticalDigestDecideRule.hasIdenticalDigest(CrawlURI curi) - Utility method for testing whether a CrawlURI's last two history entries (one being the most recent fetch) have identical content-digest information.
Modifier and Type | Method and Description
---|---
void | SurtPrefixedDecideRule.addedSeed(CrawlURI curi) - If appropriate, convert the seed notification into a prefix-addition.
protected boolean | NotOnDomainsDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's URI is NOT in the set of domains (simply reverses the superclass's determination).
protected boolean | NotOnHostsDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's URI is NOT in the set of hosts (simply reverses the superclass's determination).
protected boolean | SurtPrefixedDecideRule.evaluate(CrawlURI uri) - Evaluate whether the given object's URI is covered by the SURT prefix set.
protected boolean | NotSurtPrefixedDecideRule.evaluate(CrawlURI object) - Evaluate whether the given object's URI is NOT in the SURT prefix set (simply reverses the superclass's determination).
Modifier and Type | Field and Description
---|---
protected CrawlURI | ExtractorSWF.CrawlUriSWFAction.curi
CrawlURI | StringExtractorTestBase.TestData.uri

Modifier and Type | Method and Description
---|---
protected CrawlURI | ContentExtractorTestBase.defaultURI() - Returns a CrawlURI for testing purposes.
Modifier and Type | Method and Description
---|---
static void | Link.add(CrawlURI uri, int max, String newUri, LinkContext context, Hop hop)
protected void | ExtractorHTTP.addHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header loc)
protected void | ExtractorHTTP.addHeaderLink(CrawlURI curi, String headerName, String url)
protected void | ExtractorHTML.addLinkFromString(CrawlURI curi, CharSequence uri, CharSequence context, Hop hop)
protected void | Extractor.addOutlink(CrawlURI curi, String uri, LinkContext context, Hop hop) - Create and add a 'Link' to the CrawlURI with the given URI/context/hop-type.
protected void | ExtractorHTTP.addRefreshHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header refreshHeader)
static void | Link.addRelativeToBase(CrawlURI uri, int max, String newUri, LinkContext context, Hop hop)
static void | Link.addRelativeToVia(CrawlURI uri, int max, String newUri, LinkContext context, Hop hop)
protected static void | ContentExtractorTestBase.assertNoSideEffects(CrawlURI uri) - Asserts that the given URI has no URI errors, no localized errors, and no annotations.
protected void | ExtractorMultipleRegex.buildAndAddOutlink(CrawlURI curi, Map<String,Object> bindings)
protected void | ExtractorHTML.considerIfLikelyUri(CrawlURI curi, CharSequence candidate, CharSequence valueContext, Hop hop) - Consider whether a given string is URI-like.
protected void | ExtractorHTML.considerQueryStringValues(CrawlURI curi, CharSequence queryString, CharSequence valueContext, Hop hop) - Consider a query-string-like collection of key=value[&key=value] pairs for URI-like strings in the values.
protected boolean | ExtractorJS.considerString(Extractor ext, CrawlURI curi, boolean handlingJSFile, String candidate)
protected long | ExtractorJS.considerStrings(CrawlURI curi, CharSequence cs)
long | ExtractorJS.considerStrings(Extractor ext, CrawlURI curi, CharSequence cs)
long | ExtractorJS.considerStrings(Extractor ext, CrawlURI curi, CharSequence cs, boolean handlingJSFile)
void | ExtractorImpliedURI.extract(CrawlURI curi) - Perform the usual extraction on a CrawlURI.
protected abstract void | Extractor.extract(CrawlURI uri) - Extracts links from the given URI.
void | ExtractorURI.extract(CrawlURI curi) - Perform the usual extraction on a CrawlURI.
protected void | ExtractorHTTP.extract(CrawlURI curi)
protected void | ContentExtractor.extract(CrawlURI uri) - Extracts links.
void | ExtractorMultipleRegex.extract(CrawlURI curi)
protected void | JerichoExtractorHTML.extract(CrawlURI curi, CharSequence cs) - Run the extractor.
protected void | ExtractorHTML.extract(CrawlURI curi, CharSequence cs) - Run the extractor.
protected void | ExtractorURI.extractLink(CrawlURI curi, Link wref)
Consider a single Link for internal URIs
|
protected Charset |
ExtractorHTML.getContentDeclaredCharset(CrawlURI curi,
String contentPrefix) |
protected Charset |
ExtractorXML.getContentDeclaredCharset(CrawlURI curi,
String contentPrefix) |
protected boolean |
TrapSuppressExtractor.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorUniversal.innerExtract(CrawlURI curi) |
boolean |
ExtractorCSS.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorPDF.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorJS.innerExtract(CrawlURI curi) |
boolean |
ExtractorHTML.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorDOC.innerExtract(CrawlURI curi)
Processes a Word document and extracts any hyperlinks from it.
|
protected boolean |
ExtractorSWF.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorXML.innerExtract(CrawlURI curi) |
protected abstract boolean |
ContentExtractor.innerExtract(CrawlURI uri)
Actually extracts links.
|
protected void |
Extractor.innerProcess(CrawlURI uri)
Processes the given URI.
|
protected void |
HTTPContentDigest.innerProcess(CrawlURI curi) |
protected boolean |
ExtractorHTML.isHtmlExpectedHere(CrawlURI curi)
Test whether this HTML is so unexpected (e.g. in place of a GIF URI)
that it shouldn't be scanned for links.
|
protected void |
ExtractorHTML.processEmbed(CrawlURI curi,
CharSequence value,
CharSequence context) |
protected void |
ExtractorHTML.processEmbed(CrawlURI curi,
CharSequence value,
CharSequence context,
Hop hop) |
protected void |
JerichoExtractorHTML.processForm(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
ExtractorHTML.processGeneralTag(CrawlURI curi,
CharSequence element,
CharSequence cs) |
protected void |
JerichoExtractorHTML.processGeneralTag(CrawlURI curi,
au.id.jericho.lib.html.Element element,
au.id.jericho.lib.html.Attributes attributes) |
protected void |
ExtractorHTML.processLink(CrawlURI curi,
CharSequence value,
CharSequence context)
Handle generic HREF cases.
|
protected boolean |
ExtractorHTML.processMeta(CrawlURI curi,
CharSequence cs)
Process metadata tags.
|
protected boolean |
JerichoExtractorHTML.processMeta(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
AggressiveExtractorHTML.processScript(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag) |
protected void |
ExtractorHTML.processScript(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag) |
protected void |
JerichoExtractorHTML.processScript(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
ExtractorHTML.processScriptCode(CrawlURI curi,
CharSequence cs)
Extract the (java)script source in the given CharSequence.
|
protected void |
ExtractorHTML.processStyle(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag)
Process style text.
|
protected void |
JerichoExtractorHTML.processStyle(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
static long |
ExtractorCSS.processStyleCode(Extractor ext,
CrawlURI curi,
CharSequence cs) |
static long |
ExtractorXML.processXml(Extractor ext,
CrawlURI curi,
CharSequence cs) |
protected boolean |
TrapSuppressExtractor.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorUniversal.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorCSS.shouldExtract(CrawlURI curi) |
protected boolean |
ExtractorPDF.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorJS.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorHTML.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorDOC.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorSWF.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorXML.shouldExtract(CrawlURI curi) |
protected abstract boolean |
ContentExtractor.shouldExtract(CrawlURI uri)
Determines if otherwise valid URIs should have links extracted or not.
|
protected boolean |
ExtractorImpliedURI.shouldProcess(CrawlURI uri) |
protected boolean |
ExtractorURI.shouldProcess(CrawlURI uri) |
protected boolean |
ExtractorHTTP.shouldProcess(CrawlURI uri) |
protected boolean |
HTTPContentDigest.shouldProcess(CrawlURI uri) |
protected boolean |
ContentExtractor.shouldProcess(CrawlURI uri)
Determines if links should be extracted from the given URI.
|
protected boolean |
ExtractorMultipleRegex.shouldProcess(CrawlURI uri) |
Constructor and Description |
---|
ExtractorSWF.CrawlUriSWFAction(CrawlURI curi,
Extractor ext) |
StringExtractorTestBase.TestData(CrawlURI uri,
Link expectedResult) |
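The shouldProcess / shouldExtract / innerExtract methods listed above form the ContentExtractor template: shouldProcess gates whether the URI is handled at all, shouldExtract decides whether otherwise-valid URIs get link extraction, and innerExtract does the actual work. A self-contained sketch of that flow, with hypothetical stand-in types (strings instead of CrawlURI, not the real Heritrix API):

```java
// Template-method sketch: extraction happens only when both gates pass.
abstract class ExtractorSketch {
    final boolean extract(String uri, String content) {
        return shouldProcess(uri) && shouldExtract(uri) && innerExtract(content);
    }
    abstract boolean shouldProcess(String uri);   // handle this URI at all?
    abstract boolean shouldExtract(String uri);   // extract links from it?
    abstract boolean innerExtract(String content); // do the actual extraction
}

class HtmlExtractorSketch extends ExtractorSketch {
    boolean shouldProcess(String uri) { return uri.startsWith("http"); }
    boolean shouldExtract(String uri) { return !uri.endsWith(".gif"); }
    boolean innerExtract(String content) { return content.contains("href"); }
}
```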
Modifier and Type | Method and Description |
---|---|
protected void |
FetchHTTP.addResponseContent(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
This method populates
curi with response status and
content type. |
protected void |
FetchWhois.addWhoisLink(CrawlURI curi,
String query) |
protected void |
FetchWhois.addWhoisLinks(CrawlURI curi)
Adds outlinks to whois:{domain} and whois:{ipAddress}
|
protected boolean |
FetchHTTP.checkMidfetchAbort(CrawlURI curi,
org.archive.httpclient.HttpRecorderMethod method,
HttpConnection conn) |
protected org.apache.commons.httpclient.HostConfiguration |
FetchHTTP.configureMethod(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method)
Configure the HttpMethod setting options and headers.
|
protected ProcessResult |
FetchWhois.deferOrFinishGeneric(CrawlURI curi,
String domainOrIp) |
protected void |
FetchHTTP.doAbort(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
String annotation) |
protected void |
FetchWhois.fetch(CrawlURI curi,
String whoisServer,
String whoisQuery) |
protected Object |
FetchHTTP.getAttributeEither(CrawlURI curi,
String key)
Get a value either from inside the CrawlURI instance, or from
settings (module attributes).
|
protected org.apache.commons.httpclient.auth.AuthScheme |
FetchHTTP.getAuthScheme(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi) |
protected String |
FetchWhois.getWhoisQuery(CrawlURI curi) |
protected String |
FetchWhois.getWhoisServer(CrawlURI curi) |
protected void |
FetchHTTP.handle401(org.apache.commons.httpclient.HttpMethod method,
CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617).
|
protected void |
FetchDNS.innerProcess(CrawlURI curi) |
protected void |
FetchHTTP.innerProcess(CrawlURI curi) |
protected void |
FetchFTP.innerProcess(CrawlURI curi)
Processes the given URI.
|
protected void |
FetchWhois.innerProcess(CrawlURI uri) |
protected ProcessResult |
FetchWhois.innerProcessResult(CrawlURI curi) |
protected boolean |
FetchDNS.isQuadAddress(CrawlURI curi,
String dnsName,
CrawlHost targetHost) |
ProcessResult |
FetchHTTP.process(CrawlURI uri) |
protected void |
FetchDNS.recordDNS(CrawlURI curi,
org.xbill.DNS.Record[] rrecordSet) |
protected void |
FetchHTTP.setConditionalGetHeader(CrawlURI curi,
org.apache.commons.httpclient.HttpMethod method,
boolean conditional,
String sourceHeader,
String targetHeader)
Set the given conditional-GET header, if the setting is enabled and
a suitable value is available in the URI history.
|
protected void |
FetchHTTP.setSizes(CrawlURI curi,
org.archive.util.Recorder rec)
Updates the CrawlURI's internal sizes based on the current transaction
(and, in the case of 304s, on history).
|
protected void |
FetchDNS.setUnresolvable(CrawlURI curi,
CrawlHost host) |
protected boolean |
FetchDNS.shouldProcess(CrawlURI curi) |
protected boolean |
FetchHTTP.shouldProcess(CrawlURI curi)
Determines whether this processor can fetch the given CrawlURI.
|
protected boolean |
FetchFTP.shouldProcess(CrawlURI curi) |
protected boolean |
FetchWhois.shouldProcess(CrawlURI uri) |
protected void |
FetchDNS.storeDNSRecord(CrawlURI curi,
String dnsName,
CrawlHost targetHost,
org.xbill.DNS.Record[] rrecordSet) |
void |
FetchStats.tally(CrawlURI curi,
FetchStats.Stage stage) |
void |
FetchStats.CollectsFetchStats.tally(CrawlURI curi,
FetchStats.Stage stage) |
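FetchHTTP.setConditionalGetHeader above copies a header value from the URI's fetch history onto the outgoing request when the corresponding setting is enabled. A hypothetical sketch of that logic, with plain maps standing in for the real CrawlURI history and HttpMethod objects:

```java
import java.util.Map;

// Sketch of conditional-GET header selection: if enabled and the fetch
// history holds a value for sourceHeader, send it as targetHeader
// (e.g. a stored Etag becomes If-None-Match on the next request).
class ConditionalGetSketch {
    static void setConditionalGetHeader(Map<String, String> history,
                                        Map<String, String> request,
                                        boolean enabled,
                                        String sourceHeader,
                                        String targetHeader) {
        if (!enabled) {
            return;                        // setting disabled: send nothing
        }
        String prior = history.get(sourceHeader);
        if (prior != null && !prior.isEmpty()) {
            request.put(targetHeader, prior);
        }
    }
}
```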
Modifier and Type | Method and Description |
---|---|
protected void |
ExtractorHTMLForms.analyze(CrawlURI curi,
CharSequence cs)
Runs analysis: finds the form METHOD, ACTION, and all INPUT names/values,
and logs them as configured.
|
protected void |
FormLoginProcessor.createFormSubmissionAttempt(CrawlURI curi,
HTMLForm templateForm,
String formProvince) |
void |
ExtractorHTMLForms.extract(CrawlURI curi) |
protected String |
FormLoginProcessor.getFormProvince(CrawlURI curi)
Gets the 'form province': either the configured (applicableSurtPrefix)
or inferred (full current server) range of URIs that is considered
covered by one form login.
|
protected void |
FormLoginProcessor.innerProcess(CrawlURI curi) |
protected boolean |
FormLoginProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
ExtractorHTMLForms.shouldProcess(CrawlURI uri) |
Modifier and Type | Method and Description |
---|---|
boolean |
IgnoreRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
FirstNamedRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
MostFavoredRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
CustomRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
ObeyRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
abstract boolean |
RobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
String |
RobotsPolicy.getPathQuery(CrawlURI curi) |
void |
CrawlServer.updateRobots(CrawlURI curi)
Updates the robotstxt.
|
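The various RobotsPolicy.allows implementations above decide whether a user-agent may fetch a URI, and getPathQuery supplies the path-plus-query string that robots rules match against. A hypothetical sketch of an allows-style check (the real Robotstxt parsing and user-agent matching are not shown):

```java
import java.util.List;

// Sketch: the URI's path-query is allowed unless it starts with one of
// the Disallow prefixes for the matched user-agent group.
class RobotsSketch {
    static boolean allows(String pathQuery, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && pathQuery.startsWith(prefix)) {
                return false;              // matched a Disallow rule
            }
        }
        return true;                       // empty Disallow lines block nothing
    }
}
```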
Modifier and Type | Method and Description |
---|---|
protected boolean |
AbstractPersistProcessor.hasWriteTag(CrawlURI uri) |
protected void |
PersistLoadProcessor.innerProcess(CrawlURI curi) |
protected void |
ContentDigestHistoryStorer.innerProcess(CrawlURI curi) |
protected void |
PersistLogProcessor.innerProcess(CrawlURI curi) |
protected void |
ContentDigestHistoryLoader.innerProcess(CrawlURI curi) |
protected void |
PersistStoreProcessor.innerProcess(CrawlURI curi) |
protected void |
FetchHistoryProcessor.innerProcess(CrawlURI puri) |
abstract void |
AbstractContentDigestHistory.load(CrawlURI curi)
Looks up the history by the key
persistKeyFor(curi) and loads it into
curi.getContentDigestHistory(). |
void |
BdbContentDigestHistory.load(CrawlURI curi) |
protected String |
AbstractContentDigestHistory.persistKeyFor(CrawlURI curi) |
static String |
PersistProcessor.persistKeyFor(CrawlURI curi)
Return a preferred String key for persisting the given CrawlURI's
AList state.
|
protected boolean |
AbstractPersistProcessor.shouldLoad(CrawlURI curi)
Whether the current CrawlURI's state should be loaded
|
protected boolean |
PersistLoadProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
ContentDigestHistoryStorer.shouldProcess(CrawlURI uri) |
protected boolean |
PersistLogProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
ContentDigestHistoryLoader.shouldProcess(CrawlURI uri) |
protected boolean |
PersistStoreProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
FetchHistoryProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
AbstractPersistProcessor.shouldStore(CrawlURI curi)
Whether the current CrawlURI's state should be persisted (to a log or
directly to a database).
|
abstract void |
AbstractContentDigestHistory.store(CrawlURI curi)
Stores
curi.getContentDigestHistory() for the key
persistKeyFor(curi). |
void |
BdbContentDigestHistory.store(CrawlURI curi) |
Modifier and Type | Method and Description |
---|---|
void |
SeedListener.addedSeed(CrawlURI uuri) |
abstract void |
SeedModule.addSeed(CrawlURI curi) |
void |
TextSeedModule.addSeed(CrawlURI curi)
Add a new seed to scope.
|
protected void |
SeedModule.publishAddedSeed(CrawlURI curi) |
Modifier and Type | Method and Description |
---|---|
protected void |
WriterPoolProcessor.copyForwardWriteTagIfDupe(CrawlURI curi)
If this fetch is identical to the last written (archived) fetch, then
copy forward the writeTag.
|
protected String |
WriterPoolProcessor.getHostAddress(CrawlURI curi)
Return the IP address of the given URI, suitable for recording (as in a
classic ARC 5-field header line).
|
protected OutputStream |
Kw3WriterProcessor.initOutputStream(CrawlURI curi)
Get the OutputStream for the file to write to.
|
protected void |
Kw3WriterProcessor.innerProcess(CrawlURI curi) |
protected void |
MirrorWriterProcessor.innerProcess(CrawlURI curi) |
protected void |
WriterPoolProcessor.innerProcess(CrawlURI puri) |
protected abstract ProcessResult |
WriterPoolProcessor.innerProcessResult(CrawlURI uri) |
protected ProcessResult |
WARCWriterProcessor.innerProcessResult(CrawlURI puri)
Writes a CrawlURI and its associated data to a store file.
|
protected ProcessResult |
ARCWriterProcessor.innerProcessResult(CrawlURI puri)
Writes a CrawlURI and its associated data to a store file.
|
protected void |
WriterPoolProcessor.innerRejectProcess(CrawlURI curi) |
protected boolean |
Kw3WriterProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
MirrorWriterProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
WriterPoolProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
WriterPoolProcessor.shouldWrite(CrawlURI curi)
Whether the given CrawlURI should be written to archive files.
|
protected void |
WARCWriterProcessor.updateMetadataAfterWrite(CrawlURI curi,
org.archive.io.warc.WARCWriter writer,
long startPosition) |
protected ProcessResult |
ARCWriterProcessor.write(CrawlURI curi,
long recordLength,
InputStream in,
String ip) |
protected ProcessResult |
WARCWriterProcessor.write(String lowerCaseScheme,
CrawlURI curi) |
protected void |
Kw3WriterProcessor.writeArchiveInfoPart(String boundary,
CrawlURI curi,
org.archive.io.ReplayInputStream ris,
OutputStream out) |
protected void |
Kw3WriterProcessor.writeContentPart(String boundary,
CrawlURI curi,
org.archive.io.ReplayInputStream ris,
OutputStream out) |
protected void |
WARCWriterProcessor.writeDnsRecords(CrawlURI curi,
org.archive.io.warc.WARCWriter w,
URI baseid,
String timestamp) |
protected URI |
WARCWriterProcessor.writeFtpControlConversation(org.archive.io.warc.WARCWriter w,
String timestamp,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord headers,
String controlConversation) |
protected void |
WARCWriterProcessor.writeFtpRecords(org.archive.io.warc.WARCWriter w,
CrawlURI curi,
URI baseid,
String timestamp) |
protected void |
WARCWriterProcessor.writeHttpRecords(CrawlURI curi,
org.archive.io.warc.WARCWriter w,
URI baseid,
String timestamp) |
protected URI |
WARCWriterProcessor.writeMetadata(org.archive.io.warc.WARCWriter w,
String timestamp,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields) |
protected void |
Kw3WriterProcessor.writeMimeFile(CrawlURI curi)
The actual writing of the Kulturarw3 MIME-file.
|
protected URI |
WARCWriterProcessor.writeRequest(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields) |
protected URI |
WARCWriterProcessor.writeResource(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields) |
protected URI |
WARCWriterProcessor.writeResponse(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord suppliedFields) |
protected URI |
WARCWriterProcessor.writeRevisitDigest(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields) |
protected URI |
WARCWriterProcessor.writeRevisitDigest(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields,
long contentLength) |
protected URI |
WARCWriterProcessor.writeRevisitNotModified(org.archive.io.warc.WARCWriter w,
String timestamp,
URI baseid,
CrawlURI puri,
org.archive.util.anvl.ANVLRecord namedFields) |
protected URI |
WARCWriterProcessor.writeRevisitUriAgnosticDigest(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord headers) |
protected void |
WARCWriterProcessor.writeWhoisRecords(org.archive.io.warc.WARCWriter w,
CrawlURI curi,
URI baseid,
String timestamp) |
Copyright © 2003-2014 Internet Archive. All Rights Reserved.