public class ExtractorDOC extends ContentExtractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorDOC() |
Modifier and Type | Method and Description |
---|---|
protected boolean | innerExtract(CrawlURI curi): Processes a Word document and extracts any hyperlinks from it. |
protected boolean | shouldExtract(CrawlURI uri): Determines if otherwise valid URIs should have links extracted or not. |
extract, shouldProcess
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
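
The two hooks this class overrides, shouldExtract and innerExtract, can be illustrated with a minimal, dependency-free sketch. It is not the Heritrix source. The class name `DocLinkSketch`, the method `extractLinks`, the `application/msword` content-type check, and the idea of scanning for Word `HYPERLINK "target"` field codes are all assumptions made for illustration, and plain strings stand in for CrawlURI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the shouldExtract / innerExtract split.
 * Plain strings stand in for CrawlURI; nothing here is Heritrix API.
 */
final class DocLinkSketch {

    // Assumption: links appear in the decoded document text as
    // field codes of the form HYPERLINK "target".
    private static final Pattern HYPERLINK_FIELD =
            Pattern.compile("\\bHYPERLINK\\b[^\"]*\"([^\"]*)\"");

    private DocLinkSketch() {
        // Static helpers only.
    }

    /**
     * Stand-in for shouldExtract: decide from the declared content type
     * whether link extraction should run at all.
     */
    public static boolean shouldExtract(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Assumed check: accept only the MIME type for legacy .doc files.
        return contentType.toLowerCase(Locale.ROOT)
                .startsWith("application/msword");
    }

    /**
     * Stand-in for innerExtract: collect the quoted target of every
     * HYPERLINK field found in text already decoded from the document.
     */
    public static List<String> extractLinks(CharSequence documentText) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HYPERLINK_FIELD.matcher(documentText);
        while (matcher.find()) {
            String target = matcher.group(1).trim();
            if (!target.isEmpty()) {
                links.add(target);
            }
        }
        return links;
    }
}
```

Under these assumptions, a caller would first test `DocLinkSketch.shouldExtract("application/msword")` and only then pass the decoded text to `extractLinks`. Keeping the cheap content-type gate separate from the scan mirrors the way the summary above splits the work between the two methods.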
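protected boolean shouldExtract(CrawlURI uri)
Description copied from class ContentExtractor: Determines if otherwise valid URIs should have links extracted or not. For example, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
Specified by: shouldExtract in class ContentExtractor
Parameters: uri - the URI to check

protected boolean innerExtract(CrawlURI curi)
Processes a Word document and extracts any hyperlinks from it.
Specified by: innerExtract in class ContentExtractor
Parameters: curi - CrawlURI to process.

Copyright © 2003-2014 Internet Archive. All Rights Reserved.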