public class FrontierPreparer extends Scoper
Fields inherited from class Scoper: fileLogger, isRunning, loggerModule, scope
Fields inherited from class Processor: beanName, kp, recoveryCheckpoint, uriCount
Constructor and Description |
---|
FrontierPreparer() |
Modifier and Type | Method and Description |
---|---|
protected String | canonicalize(CrawlURI cauri): Canonicalize passed CrawlURI. |
UriCanonicalizationPolicy | getCanonicalizationPolicy() |
String | getClassKey(CrawlURI curi) |
protected int | getCost(CrawlURI curi): Return the 'cost' of a CrawlURI (how much of its associated queue's budget it depletes upon attempted processing). |
CostAssignmentPolicy | getCostAssignmentPolicy() |
int | getPreferenceDepthHops() |
int | getPreferenceEmbedHops() |
QueueAssignmentPolicy | getQueueAssignmentPolicy() |
protected int | getSchedulingDirective(CrawlURI curi): Calculate the coarse, original 'schedulingDirective' prioritization for the given CrawlURI. |
UriPrecedencePolicy | getUriPrecedencePolicy() |
protected void | innerProcess(CrawlURI curi): Actually performs the process. |
void | prepare(CrawlURI curi): Apply all configured policies to CrawlURI. |
void | setCanonicalizationPolicy(UriCanonicalizationPolicy policy) |
void | setCostAssignmentPolicy(CostAssignmentPolicy policy) |
void | setPreferenceDepthHops(int depth) |
void | setPreferenceEmbedHops(int pref) |
void | setQueueAssignmentPolicy(QueueAssignmentPolicy policy) |
void | setUriPrecedencePolicy(UriPrecedencePolicy policy) |
protected boolean | shouldProcess(CrawlURI uri): Determines whether the given uri should be processed by this processor. |
Methods inherited from class Scoper: getLoggerModule, getLogToFile, getScope, isInScope, isRunning, outOfScope, setLoggerModule, setLogToFile, setScope, start, stop
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint, toCheckpointJson
public int getPreferenceDepthHops()
public void setPreferenceDepthHops(int depth)
public int getPreferenceEmbedHops()
public void setPreferenceEmbedHops(int pref)
public UriCanonicalizationPolicy getCanonicalizationPolicy()
public void setCanonicalizationPolicy(UriCanonicalizationPolicy policy)
public QueueAssignmentPolicy getQueueAssignmentPolicy()
public void setQueueAssignmentPolicy(QueueAssignmentPolicy policy)
public UriPrecedencePolicy getUriPrecedencePolicy()
public void setUriPrecedencePolicy(UriPrecedencePolicy policy)
public CostAssignmentPolicy getCostAssignmentPolicy()
public void setCostAssignmentPolicy(CostAssignmentPolicy policy)
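The accessors above configure the policies that prepare(CrawlURI) applies. As a rough, self-contained sketch of that idea, the code below wires three pluggable policies into one preparation step. Every type here (PreparerSketch, Uri, Canonicalizer, QueueAssigner, CostAssigner) is a hypothetical stand-in, not a Heritrix class, and the default rules and the order in which they run are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical stand-ins for illustration only; none of these are Heritrix types.
class PreparerSketch {
    /** Minimal mutable record holding the values a preparer would fill in. */
    static final class Uri {
        final String raw;
        String canonical;
        String classKey;
        int cost;
        Uri(String raw) { this.raw = raw; }
    }

    interface Canonicalizer { String canonicalize(String uri); }
    interface QueueAssigner { String classKeyFor(String canonicalUri); }
    interface CostAssigner { int costOf(String canonicalUri); }

    // Toy defaults: lowercase the whole URL, queue by host, unit cost.
    private Canonicalizer canonicalizer = u -> u.toLowerCase(Locale.ROOT);
    private QueueAssigner queueAssigner = PreparerSketch::hostOf;
    private CostAssigner costAssigner = u -> 1;

    void setCanonicalizer(Canonicalizer c) { this.canonicalizer = c; }
    void setQueueAssigner(QueueAssigner q) { this.queueAssigner = q; }
    void setCostAssigner(CostAssigner c) { this.costAssigner = c; }

    /** Applies each configured policy in turn and stores the results on the URI. */
    void prepare(Uri uri) {
        uri.canonical = canonicalizer.canonicalize(uri.raw);
        uri.classKey = queueAssigner.classKeyFor(uri.canonical);
        uri.cost = costAssigner.costOf(uri.canonical);
    }

    static String hostOf(String uri) {
        return URI.create(uri).getHost();
    }

    public static void main(String[] args) {
        PreparerSketch preparer = new PreparerSketch();
        // Swap in a different cost rule, much as a setter on the real class swaps a policy.
        preparer.setCostAssigner(u -> u.contains("?") ? 2 : 1);
        Uri uri = new Uri("http://Example.COM/a?b=1");
        preparer.prepare(uri);
        System.out.println(uri.canonical + " " + uri.classKey + " " + uri.cost);
    }
}
```

Keeping each policy behind its own small interface is what lets a setter replace one rule without touching the others.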
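protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Specified by: shouldProcess in class Processor
Parameters: uri - the URI to test

protected void innerProcess(CrawlURI curi)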
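Description copied from class: Processor
Actually performs the process. By the time this method is invoked, the
given URI has passed the #ENABLED, the #DECIDE_RULES and the
#shouldProcess(ProcessorURI) tests.
Specified by: innerProcess in class Processor
Parameters: curi - the URI to process

public void prepare(CrawlURI curi)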
Apply all configured policies to CrawlURI.
Parameters: curi - CrawlURI

protected int getSchedulingDirective(CrawlURI curi)
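Calculate the coarse, original 'schedulingDirective' prioritization for
the given CrawlURI.
Parameters: curi

protected String canonicalize(CrawlURI cauri)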
Canonicalize passed CrawlURI. This method differs from
#canonicalize(UURI) in that it looks at the CrawlURI context and may
override any canonicalization effect that could make us miss content. If
canonicalization produces a URL that was 'alreadyseen', but the entry in
the 'alreadyseen' database did nothing but redirect to the current URL,
we would not fetch the current URL, because we would think we had already
seen it. Examples would be archive.org redirecting to www.archive.org or
the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming
the stripWWW rule is enabled).
Note: under some circumstances this method sets the forceFetch flag.
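To make the redirect case concrete, here is a small self-contained sketch. It assumes a toy stripWww rule and an in-memory 'already seen' set keyed by canonical form. AlreadySeenSketch, stripWww and shouldFetch are hypothetical names, not Heritrix API, and the condition used to force a fetch is only one plausible reading of the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not the Heritrix implementation.
class AlreadySeenSketch {
    /** Toy stand-in for a stripWWW canonicalization rule. */
    static String stripWww(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    private final Set<String> alreadySeen = new HashSet<>();

    /**
     * Decides whether url should be fetched.
     * via is the URL that redirected here, or null if there was no redirect.
     */
    boolean shouldFetch(String url, String via) {
        String canonical = stripWww(url);
        boolean isNew = alreadySeen.add(canonical);
        // If the redirecting URL is a different URL that canonicalizes to the
        // same key, the earlier 'already seen' entry was only that redirect,
        // so the content itself has not been fetched yet: force the fetch.
        boolean forceFetch = via != null
                && !via.equals(url)
                && stripWww(via).equals(canonical);
        return isNew || forceFetch;
    }

    public static void main(String[] args) {
        AlreadySeenSketch seen = new AlreadySeenSketch();
        // First visit: archive.org is new, and responds with a redirect.
        System.out.println(seen.shouldFetch("http://archive.org/", null));
        // Redirect target collides with the seen key but is force-fetched.
        System.out.println(seen.shouldFetch("http://www.archive.org/", "http://archive.org/"));
        // The same URL discovered later without a redirect is a plain duplicate.
        System.out.println(seen.shouldFetch("http://www.archive.org/", null));
    }
}
```

Without the forceFetch check, the second call would be rejected as a duplicate and the page behind the redirect would never be captured.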
Parameters: cauri - CrawlURI to examine.
Returns: the canonicalized cauri.

public String getClassKey(CrawlURI curi)
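Parameters: curi - CrawlURI we're to get a key for.

protected int getCost(CrawlURI curi)
Return the 'cost' of a CrawlURI (how much of its associated queue's
budget it depletes upon attempted processing).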
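As a rough illustration of the 'cost' idea, that is, how much of a queue's budget one attempted URI depletes, the sketch below charges a per-URI cost against a fixed budget. QueueBudgetSketch and its cost rule are hypothetical and say nothing about how Heritrix queues or CostAssignmentPolicy implementations actually work.

```java
// Hypothetical illustration of cost-versus-budget accounting; not Heritrix code.
class QueueBudgetSketch {
    private int remaining;

    QueueBudgetSketch(int budget) {
        this.remaining = budget;
    }

    /** Made-up cost rule: URIs with a query string cost more than plain ones. */
    static int costOf(String uri) {
        return uri.indexOf('?') >= 0 ? 3 : 1;
    }

    /** Charges the URI's cost against the budget on each attempt. */
    void charge(String uri) {
        remaining -= costOf(uri);
    }

    int getRemaining() {
        return remaining;
    }

    /** A queue with no budget left would stop being served in this sketch. */
    boolean isExhausted() {
        return remaining <= 0;
    }

    public static void main(String[] args) {
        QueueBudgetSketch queue = new QueueBudgetSketch(5);
        queue.charge("http://example.com/");
        queue.charge("http://example.com/search?q=a");
        System.out.println(queue.getRemaining() + " " + queue.isExhausted());
        queue.charge("http://example.com/about");
        System.out.println(queue.getRemaining() + " " + queue.isExhausted());
    }
}
```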
Parameters: curi

Copyright © 2003-2014 Internet Archive. All Rights Reserved.