public class LexicalCrawlMapper extends CrawlMapper
Uses lexical comparisons of classKeys to map URIs to crawlers. The 'map' is specified via either a local or HTTP-fetchable file. Each line of this file should contain two space-separated tokens, the first a key and the second a crawler node name (which should be legal as part of a filename). Each URI is mapped to the crawler node name associated with the nearest mapping key that is equal to or lexically after the URI's own classKey. If no mapping key is equal to or after the classKey, the mapping 'wraps around' to the first mapping key.
One crawler name is distinguished as the 'local name'; URIs mapped to this name are not diverted, but continue to be processed normally.
For example, assume a SurtAuthorityQueueAssignmentPolicy and a simple mapping file:
d crawlerA
~ crawlerB
All URIs with "com," classKeys will find the 'd' key as the nearest subsequent mapping key, and thus be mapped to 'crawlerA'. If that is the 'local name', the URIs will be processed normally; otherwise, they will be written to a diversion log intended for 'crawlerA'.
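The lookup rule described above (nearest key equal to or after the classKey, wrapping around to the first key) can be sketched with a sorted map. This is an illustrative sketch only, not the class's actual implementation: the class name `LexicalLookupSketch` and the `lookup` helper are hypothetical, and keys are assumed to compare by Java's natural `String` ordering.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of Heritrix.
class LexicalLookupSketch {

    /**
     * Returns the crawler node name for the given classKey: the value of the
     * smallest mapping key that is equal to or greater than the classKey, or
     * the value of the first mapping key if no such key exists (wrap-around).
     * Returns null only when the map is empty.
     */
    static String lookup(TreeMap<String, String> map, String classKey) {
        Map.Entry<String, String> entry = map.ceilingEntry(classKey);
        if (entry == null) {
            // No key equal to or after the classKey: wrap around.
            entry = map.firstEntry();
        }
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        // The example mapping file from the text above.
        TreeMap<String, String> map = new TreeMap<>();
        map.put("d", "crawlerA");
        map.put("~", "crawlerB");

        System.out.println(lookup(map, "com,example,")); // crawlerA
        System.out.println(lookup(map, "org,example,")); // crawlerB
        System.out.println(lookup(map, "~~"));           // crawlerA (wraps around)
    }
}
```

With this mapping, every classKey sorting at or before "d" lands on 'crawlerA', everything between "d" and "~" lands on 'crawlerB', and only keys sorting after "~" trigger the wrap-around.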
If using the JMX importUris operation to import URLs dropped by a LexicalCrawlMapper instance, use the recoveryLog style.
Modifier and Type | Field and Description |
---|---|
protected Frontier | frontier |
protected TreeMap<String,String> | map: Mapping of classKey ranges (as represented by their start) to crawlers (by abstract name/filename) |
protected ConfigPath | mapPath: Path to map specification file. |
protected String | mapUri: URI to map specification file. |
cache, checkOutlinks, checkUri, diversionDir, diversionLogs, localName, logGeneration, outlinkRule, rotationDigits
Constructor and Description |
---|
LexicalCrawlMapper(): Constructor. |
Modifier and Type | Method and Description |
---|---|
Frontier | getFrontier() |
ConfigPath | getMapPath() |
String | getMapUri() |
protected void | loadMap(): Retrieve and parse the mapping specification from a local path or HTTP URL. |
protected String | map(CrawlURI cauri): Look up the crawler node name to which the given CrawlURI should be mapped. |
void | setFrontier(Frontier frontier) |
void | setMapPath(ConfigPath path) |
void | setMapUri(String uri) |
void | start() |
decideToMapOutlink, divertLog, getCheckOutlinks, getCheckUri, getDiversionDir, getDiversionLog, getLocalName, getOutlinkRule, getRotationDigits, innerProcess, innerProcessResult, isRunning, setCheckOutlinks, setCheckUri, setDiversionDir, setLocalName, setOutlinkRule, setRotationDigits, shouldProcess, stop, updateGeneration
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint, toCheckpointJson
protected ConfigPath mapPath
protected String mapUri
protected Frontier frontier
public ConfigPath getMapPath()
public void setMapPath(ConfigPath path)
public String getMapUri()
public void setMapUri(String uri)
public Frontier getFrontier()
public void setFrontier(Frontier frontier)
protected String map(CrawlURI cauri)
map in class CrawlMapper
cauri - CrawlURI to consider
public void start()
start in interface org.springframework.context.Lifecycle
start in class CrawlMapper
protected void loadMap() throws IOException
IOException
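The class description says each line of the map specification holds two space-separated tokens, a key and a crawler node name. A minimal sketch of parsing that format into a sorted map follows. It is not the body of loadMap(): the class name `MapFileParseSketch` and the `parse` helper are hypothetical, and skipping blank lines and rejecting lines without exactly two tokens are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeMap;

// Hypothetical illustration; not part of Heritrix.
class MapFileParseSketch {

    /** Reads "key nodeName" lines from the reader into a sorted map. */
    static TreeMap<String, String> parse(Reader source) throws IOException {
        TreeMap<String, String> map = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // assumption: blank lines are ignored
                }
                String[] tokens = trimmed.split("\\s+");
                if (tokens.length != 2) {
                    // assumption: malformed lines are treated as errors
                    throw new IOException("Expected two tokens: " + line);
                }
                map.put(tokens[0], tokens[1]);
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // The example mapping file from the class description.
        TreeMap<String, String> map = parse(new StringReader("d crawlerA\n~ crawlerB\n"));
        System.out.println(map); // {d=crawlerA, ~=crawlerB}
    }
}
```

Taking a `Reader` keeps the sketch independent of where the specification comes from, so the same parsing would apply whether the source is a local file or an HTTP response.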
Copyright © 2003-2014 Internet Archive. All Rights Reserved.