public class ExtractorCSS extends ContentExtractor
Modifier and Type | Field and Description |
---|---|
protected static String | CSS_BACKSLASH_ESCAPE |
protected static String | CSS_URI_EXTRACTOR: CSS URL extractor pattern. |
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorCSS() |
Modifier and Type | Method and Description |
---|---|
boolean | innerExtract(CrawlURI curi): Actually extracts links. |
static long | processStyleCode(Extractor ext, CrawlURI curi, CharSequence cs) |
protected boolean | shouldExtract(CrawlURI curi): Determines if otherwise valid URIs should have links extracted or not. |
extract, shouldProcess
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
protected static final String CSS_BACKSLASH_ESCAPE
protected static final String CSS_URI_EXTRACTOR
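protected boolean shouldExtract(CrawlURI curi)
Description copied from class: ContentExtractor
Determines if otherwise valid URIs should have links extracted or not. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
shouldExtract in class ContentExtractor
Parameters: curi - the URI to check
public boolean innerExtract(CrawlURI curi)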
Description copied from class: ContentExtractor
#shouldProcess(ExtractorURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.
This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
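The return-value contract described above can be pictured with a small, self-contained sketch. Everything in it is hypothetical: `SketchExtractor`, `runChain`, and `fixed` are stand-ins invented for illustration, not Heritrix classes, and the `text/css` content type is only an assumed example. The real pipeline is not described in this page, so the loop below is a simplification of the stated rule: `true` stops downstream extractors, `false` lets them try.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractorChainSketch {
    /** Hypothetical stand-in for a content extractor; not the Heritrix API. */
    interface SketchExtractor {
        boolean shouldExtract(String contentType);
        boolean innerExtract(String content, List<String> outlinks);
    }

    /**
     * Runs the extractors in order. Returns the index of the first extractor
     * whose innerExtract returned true, or -1 if none completed.
     */
    static int runChain(List<SketchExtractor> chain, String contentType,
                        String content, List<String> outlinks) {
        for (int i = 0; i < chain.size(); i++) {
            SketchExtractor e = chain.get(i);
            if (!e.shouldExtract(contentType)) {
                continue; // this extractor does not apply to the content
            }
            if (e.innerExtract(content, outlinks)) {
                return i; // true: downstream extractors are skipped
            }
            // false: extraction did not complete, so the next extractor may try
        }
        return -1;
    }

    /** Builds a toy extractor that accepts one content type and adds one link. */
    static SketchExtractor fixed(String type, boolean succeeds, String link) {
        return new SketchExtractor() {
            public boolean shouldExtract(String contentType) {
                return type.equals(contentType);
            }
            public boolean innerExtract(String content, List<String> outlinks) {
                if (!succeeds) {
                    return false; // e.g. an IO error interrupted extraction
                }
                outlinks.add(link);
                return true;
            }
        };
    }

    public static void main(String[] args) {
        List<String> outlinks = new ArrayList<>();
        List<SketchExtractor> chain = Arrays.asList(
                fixed("text/css", false, "first.css"),
                fixed("text/css", true, "second.css"),
                fixed("text/css", true, "third.css"));
        int winner = runChain(chain, "text/css", "", outlinks);
        System.out.println(winner + " " + outlinks);
    }
}
```

In this toy chain the first extractor fails, the second completes, and the third is never consulted, which is the behavior the paragraph above attributes to returning `false` and `true` respectively.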
innerExtract in class ContentExtractor
Parameters: curi - Crawl URI to process.
public static long processStyleCode(Extractor ext, CrawlURI curi, CharSequence cs)
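This page gives no description for processStyleCode and does not show the values of CSS_URI_EXTRACTOR or CSS_BACKSLASH_ESCAPE. As a rough illustration of what a CSS URL extractor pattern can look like, the sketch below scans a CharSequence of style code for `url(...)` references. The class `CssUrlSketch`, the method `collectUrls`, and both regular expressions are assumptions made for this example; they are not the Heritrix constants, and returning a count of matches is likewise an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlSketch {
    // Assumed pattern (not the value of CSS_URI_EXTRACTOR): matches url(...)
    // with optional quotes and captures the target, allowing backslash escapes.
    static final Pattern URL_FUNCTION = Pattern.compile(
            "(?i)url\\(\\s*[\"']?((?:\\\\.|[^\"')\\s\\\\])+)[\"']?\\s*\\)");

    // Assumed rule (not the value of CSS_BACKSLASH_ESCAPE): a backslash before
    // a comma, quote, parenthesis, or whitespace character is dropped.
    static final Pattern BACKSLASH_ESCAPE = Pattern.compile("\\\\([,'\"()\\s])");

    /** Appends each url(...) target in cs to found and returns how many were added. */
    static long collectUrls(CharSequence cs, List<String> found) {
        long count = 0;
        Matcher m = URL_FUNCTION.matcher(cs);
        while (m.find()) {
            String raw = m.group(1);
            // Remove the escaping backslashes so the stored value is the plain URL.
            found.add(BACKSLASH_ESCAPE.matcher(raw).replaceAll("$1"));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String css = "body { background: url('img/bg.png'); } "
                + "@import url(base.css); "
                + "i { background: url(a\\)b.png); }";
        List<String> found = new ArrayList<>();
        long n = collectUrls(css, found);
        System.out.println(n + " " + found);
    }
}
```

A real extractor would also resolve each captured value against the base URI of the stylesheet before recording it as an outlink; that step is left out here to keep the sketch focused on the pattern matching.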
Copyright © 2003-2014 Internet Archive. All Rights Reserved.