public abstract class Extractor extends Processor
Use ContentExtractor instead of this class.

Modifier and Type | Field and Description |
---|---|
static ExtractorParameters | DEFAULT_PARAMETERS |
protected ExtractorParameters | extractorParameters |
protected UriErrorLoggerModule | loggerModule |
protected AtomicLong | numberOfLinksExtracted |
Constructor and Description |
---|
Extractor() |
Modifier and Type | Method and Description |
---|---|
protected void | addOutlink(CrawlURI curi, String uri, LinkContext context, Hop hop): Creates and adds a Link to the CrawlURI with the given URI, context, and hop type. |
protected abstract void | extract(CrawlURI uri): Extracts links from the given URI. |
protected void | fromCheckpointJson(org.json.JSONObject json): Restores internal state from the JSONObject stored at an earlier checkpoint. |
ExtractorParameters | getExtractorParameters() |
UriErrorLoggerModule | getLoggerModule() |
protected void | innerProcess(CrawlURI uri): Processes the given URI. |
void | logUriError(org.apache.commons.httpclient.URIException e, UURI uuri, CharSequence l) |
String | report() |
void | setExtractorParameters(ExtractorParameters helper) |
void | setLoggerModule(UriErrorLoggerModule loggerModule) |
protected org.json.JSONObject | toCheckpointJson(): Returns a JSONObject of current state that can be consulted on recovery to restore necessary values. |
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, shouldProcess, start, startCheckpoint, stop
protected AtomicLong numberOfLinksExtracted
public static final ExtractorParameters DEFAULT_PARAMETERS
protected transient UriErrorLoggerModule loggerModule
protected transient ExtractorParameters extractorParameters
public UriErrorLoggerModule getLoggerModule()
public void setLoggerModule(UriErrorLoggerModule loggerModule)
public ExtractorParameters getExtractorParameters()
public void setExtractorParameters(ExtractorParameters helper)
protected final void innerProcess(CrawlURI uri) throws InterruptedException
Processes the given URI by calling extract(CrawlURI), catching runtime exceptions and errors that are usually non-fatal so that they are highlighted in the relevant log(s). Notably, StackOverflowError is caught here, as it seems to occur often when dealing with document parsing APIs.
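The delegation described above can be sketched with standard-library types only. This is a minimal illustration, not the Heritrix implementation: `SketchExtractor`, `HrefExtractor`, `InnerProcessSketch`, the `errorLog` list, and the use of a plain `String` in place of `CrawlURI` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for Extractor; a plain String replaces CrawlURI.
abstract class SketchExtractor {
    protected final AtomicLong numberOfLinksExtracted = new AtomicLong(0);
    // Hypothetical in-memory log, standing in for the crawler's real logs.
    final List<String> errorLog = new ArrayList<>();

    // Delegates to extract() and records, rather than propagates,
    // failures that are usually non-fatal.
    protected final void innerProcess(String content) {
        try {
            extract(content);
        } catch (RuntimeException | StackOverflowError e) {
            errorLog.add(e.getClass().getSimpleName() + " while extracting");
        }
    }

    protected abstract void extract(String content);
}

// Hypothetical subclass that pulls href values out of HTML-like text.
class HrefExtractor extends SketchExtractor {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
    final List<String> outLinks = new ArrayList<>();

    @Override
    protected void extract(String content) {
        Matcher m = HREF.matcher(content); // throws NullPointerException on null input
        while (m.find()) {
            outLinks.add(m.group(1));
            numberOfLinksExtracted.incrementAndGet();
        }
    }
}

class InnerProcessSketch {
    public static void main(String[] args) {
        HrefExtractor ex = new HrefExtractor();
        ex.innerProcess("<a href=\"/a\">a</a> <a href=\"/b\">b</a>");
        ex.innerProcess(null); // the failure is logged, not rethrown
        System.out.println(ex.outLinks + " " + ex.numberOfLinksExtracted.get());
        System.out.println(ex.errorLog);
    }
}
```

Because `innerProcess` is final in this sketch, as it is in the signature above, subclasses only supply `extract` and cannot bypass the error handling.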
Overrides: innerProcess in class Processor
Parameters:
uri - the URI to extract links from
Throws:
InterruptedException - if the thread is interrupted

protected abstract void extract(CrawlURI uri)
Extracts links from the given URI. Use ExtractorURI#getInputStream() or ExtractorURI#getCharSequence() to process the content of the URI. Any links that are discovered should be added to the ExtractorURI#getOutLinks() set.
Parameters:
uri - the URI to extract links from

protected void addOutlink(CrawlURI curi, String uri, LinkContext context, Hop hop)
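Creates and adds a Link to the CrawlURI with the given URI, context, and hop type.
Parameters:
curi - the CrawlURI to add the link to
uri - the URI of the link
context - the context of the link
hop - the hop type of the link

public void logUriError(org.apache.commons.httpclient.URIException e, UURI uuri, CharSequence l)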
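One plausible way addOutlink and logUriError could fit together is sketched below. The source does not document this relationship, so treat it as an assumption: the sketch uses `java.net.URI` in place of `UURI`, plain strings in place of `LinkContext` and `Hop`, and lists in place of the CrawlURI's outlinks and the error log. The class name `OutlinkSketch` and the context and hop labels are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: java.net.URI replaces UURI, Strings replace
// LinkContext and Hop, and a list replaces the CrawlURI's outlink set.
class OutlinkSketch {
    final URI base;
    final List<String> outLinks = new ArrayList<>();
    final List<String> uriErrors = new ArrayList<>();

    OutlinkSketch(String base) {
        this.base = URI.create(base);
    }

    // Resolves the discovered string against the page URI and records it
    // with its context and hop type; bad input is logged, not thrown.
    void addOutlink(String uri, String context, String hop) {
        try {
            URI resolved = base.resolve(new URI(uri));
            outLinks.add(resolved + " [" + context + ", hop=" + hop + "]");
        } catch (URISyntaxException e) {
            logUriError(e, base, uri);
        }
    }

    // Records which page produced the unparseable link and why it failed.
    void logUriError(URISyntaxException e, URI source, CharSequence l) {
        uriErrors.add(source + " -> " + l + ": " + e.getReason());
    }

    public static void main(String[] args) {
        OutlinkSketch page = new OutlinkSketch("http://example.com/dir/page.html");
        page.addOutlink("../img/logo.png", "img/@src", "embed");
        page.addOutlink("http://bad host/", "a/@href", "navlink");
        System.out.println(page.outLinks);
        System.out.println(page.uriErrors);
    }
}
```

Logging instead of throwing lets one malformed link be skipped without abandoning the rest of the document.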
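protected org.json.JSONObject toCheckpointJson() throws org.json.JSONException
Description copied from class: Processor
Returns a JSONObject of current state that can be consulted on recovery to restore necessary values.
Overrides: toCheckpointJson in class Processor
Throws:
org.json.JSONException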
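protected void fromCheckpointJson(org.json.JSONObject json) throws org.json.JSONException
Description copied from class: Processor
Restores internal state from the JSONObject stored at an earlier checkpoint.
Overrides: fromCheckpointJson in class Processor
Parameters:
json - JSONObject
Throws:
org.json.JSONException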
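The checkpoint pair above can be illustrated as a save-and-restore round trip. This is a sketch under stated assumptions: a `Map` stands in for `org.json.JSONObject` so the example needs only the standard library, and both the idea that the link counter is the state being saved and the key name `numberOfLinksExtracted` are guesses for illustration, not confirmed by this page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in: a Map replaces org.json.JSONObject.
// The key name below is an assumption for illustration.
class CheckpointSketch {
    static final String KEY = "numberOfLinksExtracted";
    final AtomicLong numberOfLinksExtracted = new AtomicLong(0);

    // Captures the state that must survive a restart.
    Map<String, Object> toCheckpointJson() {
        Map<String, Object> json = new HashMap<>();
        json.put(KEY, numberOfLinksExtracted.get());
        return json;
    }

    // Restores the state captured at an earlier checkpoint.
    void fromCheckpointJson(Map<String, Object> json) {
        numberOfLinksExtracted.set(((Number) json.get(KEY)).longValue());
    }

    public static void main(String[] args) {
        CheckpointSketch before = new CheckpointSketch();
        before.numberOfLinksExtracted.addAndGet(42);
        Map<String, Object> saved = before.toCheckpointJson();

        CheckpointSketch after = new CheckpointSketch();
        after.fromCheckpointJson(saved);
        System.out.println(after.numberOfLinksExtracted.get());
    }
}
```

In a real subclass, both methods would presumably call the superclass version first so that state saved by Processor is preserved as well.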
Copyright © 2003-2014 Internet Archive. All Rights Reserved.