public abstract class CrawlMapper extends Processor implements org.springframework.context.Lifecycle
Applies a map() method, supplied by a concrete subclass, to classKeys to map URIs to crawlers by name.
One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.
When using the JMX importUris operation to import URLs dropped by a CrawlMapper instance, use the recoveryLog style.
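The mapping contract described above can be illustrated with a small standalone sketch. It does not use the Heritrix classes: MapperSketch and HashingMapperSketch are hypothetical stand-ins, plain strings replace CrawlURI class keys, and hashing the class key is only one assumed way a subclass might implement map().

```java
import java.util.List;

// Hypothetical stand-in for a CrawlMapper subclass; not the Heritrix API.
abstract class MapperSketch {
    private final String localName;

    MapperSketch(String localName) {
        this.localName = localName;
    }

    // Subclasses decide which crawler node a class key belongs to.
    abstract String map(String classKey);

    // True when the key maps to another node, so the URI would be diverted.
    boolean shouldDivert(String classKey) {
        return !localName.equals(map(classKey));
    }
}

// Assumed strategy: choose a node by hashing the class key.
class HashingMapperSketch extends MapperSketch {
    private final List<String> nodeNames;

    HashingMapperSketch(String localName, List<String> nodeNames) {
        super(localName);
        this.nodeNames = nodeNames;
    }

    @Override
    String map(String classKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(classKey.hashCode(), nodeNames.size());
        return nodeNames.get(bucket);
    }
}
```

In this sketch a URI whose class key maps to the local name is left alone, and every other URI is flagged for diversion, which mirrors the 'local name' rule stated above.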
Modifier and Type | Field and Description |
---|---|
protected ArrayLongFPCache | cache |
protected boolean | checkOutlinks: Whether to apply the mapping to discovered outlinks, for example after extraction has occurred. |
protected boolean | checkUri: Whether to apply the mapping to a URI being processed itself, for example early in processing (while its status is still 'unattempted'). |
protected ConfigPath | diversionDir: Directory to write diversion logs. |
protected HashMap<String,PrintWriter> | diversionLogs: Mapping of target crawlers to logs (PrintWriters). |
protected String | localName: Name of local crawler node; mappings to this name result in normal processing (no diversion). |
protected String | logGeneration: Truncated timestamp prefix for diversion logs; when current time doesn't match, it's time to close all current logs. |
protected DecideRule | outlinkRule: Decide rules to determine if an outlink is subject to mapping. |
protected int | rotationDigits: Number of timestamp digits to use as prefix of log names (grouping all diversions from that period in a single log). |
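The relationship between rotationDigits and logGeneration can be sketched as follows. This is a minimal illustration, not the class's actual code: GenerationSketch, generationOf and needsRotation are hypothetical names, and the 14-digit yyyyMMddHHmmss timestamp layout is an assumption.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; the timestamp layout is an assumption, not taken from CrawlMapper.
class GenerationSketch {
    // Assumed 14-digit timestamp layout: yyyyMMddHHmmss.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Keep only the leading rotationDigits digits of the timestamp.
    static String generationOf(LocalDateTime now, int rotationDigits) {
        return STAMP.format(now).substring(0, rotationDigits);
    }

    // A changed prefix means the current logs belong to an older period.
    static boolean needsRotation(String logGeneration, String nowGeneration) {
        return !nowGeneration.equals(logGeneration);
    }
}
```

Under these assumptions, 10 digits would group diversions by hour and 8 digits by day.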
Constructor and Description |
---|
CrawlMapper(): Constructor. |
Modifier and Type | Method and Description |
---|---|
protected boolean | decideToMapOutlink(CrawlURI cauri) |
protected void | divertLog(CrawlURI cauri, String target): Note the given CrawlURI in the appropriate diversion log. |
boolean | getCheckOutlinks() |
boolean | getCheckUri() |
ConfigPath | getDiversionDir() |
protected PrintWriter | getDiversionLog(String target): Get the diversion log for a given target crawler node. |
String | getLocalName() |
DecideRule | getOutlinkRule() |
int | getRotationDigits() |
protected void | innerProcess(CrawlURI puri): Actually performs the process. |
protected ProcessResult | innerProcessResult(CrawlURI puri) |
boolean | isRunning() |
protected abstract String | map(CrawlURI cauri): Look up the crawler node name to which the given CrawlURI should be mapped. |
void | setCheckOutlinks(boolean check) |
void | setCheckUri(boolean check) |
void | setDiversionDir(ConfigPath path) |
void | setLocalName(String name) |
void | setOutlinkRule(DecideRule rule) |
void | setRotationDigits(int digits) |
protected boolean | shouldProcess(CrawlURI puri): Determines whether the given uri should be processed by this processor. |
void | start() |
void | stop() |
protected void | updateGeneration(String nowGeneration): Close and mark as finished all existing diversion logs, and arrange for new logs to use the new generation prefix. |
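How getDiversionLog, divertLog and updateGeneration might cooperate can be sketched in isolation. The class below is a hypothetical stand-in rather than the Heritrix implementation: it takes plain strings instead of CrawlURI, writes to in-memory buffers instead of files under diversionDir, and the "generation-target" buffer key is an assumed naming scheme.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; in-memory buffers replace the real log files.
class DiversionLogsSketch {
    // Buffers keyed by an assumed "generation-target" name.
    final Map<String, StringWriter> buffers = new HashMap<>();
    // Currently open logs, one per target crawler node.
    final Map<String, PrintWriter> diversionLogs = new HashMap<>();
    String logGeneration = "";

    // Return the open log for a target node, creating it on first use.
    PrintWriter getDiversionLog(String target) {
        return diversionLogs.computeIfAbsent(target, t -> {
            StringWriter buffer = new StringWriter();
            buffers.put(logGeneration + "-" + t, buffer);
            return new PrintWriter(buffer);
        });
    }

    // Note the URI in the log belonging to the target node.
    void divertLog(String uri, String target) {
        PrintWriter log = getDiversionLog(target);
        log.println(uri);
        log.flush();
    }

    // Close every open log and switch to the new generation prefix.
    void updateGeneration(String nowGeneration) {
        for (PrintWriter log : diversionLogs.values()) {
            log.close();
        }
        diversionLogs.clear();
        logGeneration = nowGeneration;
    }
}
```

Clearing the map after closing means the next diversion to the same target opens a fresh log under the new prefix, which is the behavior the updateGeneration description suggests.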
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint, toCheckpointJson
protected boolean checkUri
protected boolean checkOutlinks
protected DecideRule outlinkRule
protected String localName
protected ConfigPath diversionDir
protected int rotationDigits
protected HashMap<String,PrintWriter> diversionLogs
protected String logGeneration
protected ArrayLongFPCache cache
public CrawlMapper()
Parameters:
name - Name of this processor.
public boolean getCheckUri()
public void setCheckUri(boolean check)
public boolean getCheckOutlinks()
public void setCheckOutlinks(boolean check)
public DecideRule getOutlinkRule()
public void setOutlinkRule(DecideRule rule)
public String getLocalName()
public void setLocalName(String name)
public ConfigPath getDiversionDir()
public void setDiversionDir(ConfigPath path)
public int getRotationDigits()
public void setRotationDigits(int digits)
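protected boolean shouldProcess(CrawlURI puri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Specified by: shouldProcess in class Processor
Parameters:
puri - the URI to test
protected void innerProcess(CrawlURI puri)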
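Description copied from class: Processor
Actually performs the process. Invoked only for URIs that pass the #ENABLED, the #DECIDE_RULES and the #shouldProcess(ProcessorURI) tests.
Specified by: innerProcess in class Processor
Parameters:
puri - the URI to process
protected ProcessResult innerProcessResult(CrawlURI puri)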
Overrides: innerProcessResult in class Processor
protected boolean decideToMapOutlink(CrawlURI cauri)
protected void updateGeneration(String nowGeneration)
Parameters:
nowGeneration - new generation (timestamp prefix) to use
protected abstract String map(CrawlURI cauri)
Parameters:
cauri - CrawlURI to consider
protected void divertLog(CrawlURI cauri, String target)
Parameters:
cauri - CrawlURI to append to a diversion log
target - String node name (log name) to receive URI
protected PrintWriter getDiversionLog(String target)
Parameters:
target - crawler node name of requested log
public void start()
public boolean isRunning()
Copyright © 2003-2014 Internet Archive. All Rights Reserved.