public class ExtractorHTTP extends Extractor
Modifier and Type | Field and Description |
---|---|
protected boolean | inferRootPage: Should all HTTP URIs be used to infer a link to the site's root? |
Fields inherited from class Extractor: DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorHTTP() |
Modifier and Type | Method and Description |
---|---|
protected void | addHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header loc) |
protected void | addHeaderLink(CrawlURI curi, String headerName, String url) |
protected void | addRefreshHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header refreshHeader) |
protected void | extract(CrawlURI curi): Extracts links from the given URI. |
boolean | getInferRootPage() |
void | setInferRootPage(boolean inferRootPage) |
protected boolean | shouldProcess(CrawlURI uri): Determines whether the given uri should be processed by this processor. |
Methods inherited from class Extractor: addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
protected boolean inferRootPage
public boolean getInferRootPage()
public void setInferRootPage(boolean inferRootPage)
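The `inferRootPage` setting refers to deriving a link to a site's root from any HTTP URI. The following is only a minimal sketch of that idea using `java.net.URI`; the class `RootPageSketch` and the method `inferRoot` are hypothetical names for illustration and are not part of this class or its library.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of ExtractorHTTP: shows one way a site's
// root page could be inferred from an HTTP(S) URI.
class RootPageSketch {

    /** Returns the root URI (scheme://authority/) or null if none can be inferred. */
    static String inferRoot(String uri) {
        try {
            URI u = new URI(uri);
            String scheme = u.getScheme();
            // Only hierarchical http/https URIs have a meaningful site root here.
            if (scheme == null || u.getRawAuthority() == null) {
                return null;
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return null;
            }
            // Resolving "/" keeps scheme and authority and drops path and query.
            return u.resolve("/").toString();
        } catch (URISyntaxException e) {
            return null; // malformed input: nothing to infer
        }
    }

    public static void main(String[] args) {
        System.out.println(inferRoot("http://example.com/a/b.html?x=1")); // http://example.com/
        System.out.println(inferRoot("ftp://example.com/file"));          // null
    }
}
```

In this sketch a non-HTTP or malformed URI yields `null` rather than an exception, so a caller can simply skip it.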
protected boolean shouldProcess(CrawlURI uri)
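Description copied from class Processor: Determines whether the given uri should be processed by this processor.
shouldProcess in class Processor
uri - the URI to test
protected void extract(CrawlURI curi)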
Description copied from class Extractor: Extracts links from the given URI. Use ExtractorURI#getInputStream() or ExtractorURI#getCharSequence() to process the content of the URI. Any links that are discovered should be added to the ExtractorURI#getOutLinks() set.
protected void addRefreshHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header refreshHeader)
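A `Refresh` header value typically has the form `5; url=http://example.com/next`, so the target URL has to be separated from the delay before it can be treated as a link. The sketch below shows one possible way to do that with a regular expression; `RefreshHeaderSketch` and `refreshTarget` are hypothetical names, and this is not a description of how `addRefreshHeaderLink` is actually implemented.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ExtractorHTTP: pulls the target URL out
// of a Refresh header value such as "5; url=http://example.com/next".
class RefreshHeaderSketch {

    // delay, then ';' or ',', then url=<target>; quotes around the target are optional
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /** Returns the URL part of the header value, or empty if there is none. */
    static Optional<String> refreshTarget(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = REFRESH.matcher(headerValue);
        return m.matches() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refreshTarget("5; url=http://example.com/next"));
        System.out.println(refreshTarget("0;URL='http://example.com/'"));
        System.out.println(refreshTarget("30")); // delay only, no link to extract
    }
}
```

A value that carries only a delay produces an empty result, which a caller can treat as "no link in this header".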
protected void addHeaderLink(CrawlURI curi, org.apache.commons.httpclient.Header loc)
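Header values that carry URLs may be relative, so they usually need to be resolved against the URI that was fetched before they can serve as links. The sketch below illustrates that step with plain `java.net.URI`; the class `HeaderLinkSketch`, the method `headerLinks`, and the choice of the `Location` and `Content-Location` headers are assumptions made for this example, not details taken from `addHeaderLink`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ExtractorHTTP: resolves URL-bearing
// response headers against the URI of the fetched resource.
class HeaderLinkSketch {

    // Assumed set of headers to inspect; both are standard HTTP headers.
    private static final List<String> LINK_HEADERS =
            Arrays.asList("Location", "Content-Location");

    /** Returns absolute links found in the given headers (exact-case lookup). */
    static List<String> headerLinks(String baseUri, Map<String, String> headers) {
        URI base = URI.create(baseUri);
        List<String> links = new ArrayList<>();
        for (String name : LINK_HEADERS) {
            String value = headers.get(name);
            if (value == null || value.trim().isEmpty()) {
                continue; // header absent or empty
            }
            try {
                links.add(base.resolve(value.trim()).toString());
            } catch (IllegalArgumentException e) {
                // value is not a valid URI reference: skip it
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Location", "/moved");
        headers.put("Content-Location", "alt.html");
        System.out.println(headerLinks("http://example.com/dir/page.html", headers));
    }
}
```

Invalid header values are skipped rather than reported here; a real extractor would more likely log them.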
Copyright © 2003-2014 Internet Archive. All Rights Reserved.