public class FetchWhois extends Processor implements CoreAttributeConstants, FetchStatusCodes, org.springframework.context.Lifecycle
There is no pre-existing, canonical specification for WHOIS URIs. What follows is the format that Heritrix uses, which we propose for general use.
Syntax in ABNF as used in RFC 3986 Uniform Resource Identifier (URI): Generic Syntax:
whoisurl = "whois:" [ "//" host [ ":" port ] "/" ] whoisquery
whoisquery is a url-encoded string. In ABNF,
whoisquery = 1*pchar
where pchar is defined in RFC 3986.
host and port are also as defined in RFC 3986.
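The grammar above can be illustrated with a small parser. This is a minimal sketch, not Heritrix code: the class name WhoisUri and its fields are hypothetical, the host pattern is simplified (bracketed IPv6 literals are not handled), and the default port of 43 is taken from the resolution rule described in this document.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class holding the parts of a WHOIS URI. */
public class WhoisUri {
    // Mirrors: whoisurl = "whois:" [ "//" host [ ":" port ] "/" ] whoisquery
    // Simplification (assumption): host is any run of characters other than
    // '/' and ':'; the query may not contain '/', '?' or '#', which are not pchar.
    private static final Pattern SYNTAX =
            Pattern.compile("^whois:(?://([^/:]+)(?::(\\d+))?/)?([^/?#]+)$");

    public final String host;   // null for a "serverless" URI
    public final int port;      // 43 when the URI does not specify one
    public final String query;  // whoisquery, url-decoded

    private WhoisUri(String host, int port, String query) {
        this.host = host;
        this.port = port;
        this.query = query;
    }

    public static WhoisUri parse(String uri) {
        Matcher m = SYNTAX.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a WHOIS URI: " + uri);
        }
        String host = m.group(1);
        int port = m.group(2) != null ? Integer.parseInt(m.group(2)) : 43;
        // URLDecoder treats '+' as a space (form encoding); protect literal
        // '+' first so that only %XX escapes are decoded.
        String query = URLDecoder.decode(
                m.group(3).replace("+", "%2B"), StandardCharsets.UTF_8);
        return new WhoisUri(host, port, query);
    }
}
```

For example, WhoisUri.parse("whois:example.com") yields a null host, which marks the URI as serverless.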
To resolve a WHOIS URI which specifies host[:port], open a TCP connection to the host at the specified port (default 43), send the query (whoisquery, url-decoded) followed by CRLF, and read the response until the server closes the connection. For more details see RFC 3912.
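The exchange described above is small enough to sketch directly. The following is an illustration under stated assumptions, not the FetchWhois implementation: the class name WhoisClient and its method are hypothetical, and UTF-8 is assumed for both query and response because RFC 3912 does not specify a character set.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical minimal WHOIS client following the steps in the text. */
public class WhoisClient {
    public static final int DEFAULT_PORT = 43;

    /**
     * Sends the already url-decoded query followed by CRLF and returns
     * everything the server sends before it closes the connection.
     */
    public static String query(String host, int port, String query, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound each blocking read

            OutputStream out = socket.getOutputStream();
            out.write((query + "\r\n").getBytes(StandardCharsets.UTF_8));
            out.flush();

            // Read until end of stream, i.e. until the server closes.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                response.write(buf, 0, n);
            }
            return new String(response.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A caller would pass DEFAULT_PORT whenever the URI omits the port. The timeout parameter plays a role similar to the one suggested by the name of getSoTimeoutMs(), although how FetchWhois applies that setting is not described here.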
Resolution of a "serverless" WHOIS URI, which does not specify host[:port], is implementation-dependent.
For each non-WHOIS URI processed which has an authority, FetchWhois adds 1 or 2 serverless WHOIS URIs to the CrawlURI's outlinks. These are "whois:{ipAddress}" and, if the authority includes a hostname, "whois:{topLevelDomain}". See addWhoisLinks(CrawlURI).
Heritrix resolves serverless WHOIS URIs by first querying an initial server, then following referrals to other servers. In pseudocode:
    if query is an IPv4 address
        resolve whois://DEFAULT_IP_WHOIS_SERVER/whoisquery
    else
        resolve whois://ULTRA_SUFFIX_WHOIS_SERVER/domainSuffix

    while last response refers to another server, i.e. matches regex WHOIS_SERVER_REGEX
        if we have a special query formatting rule for this whois server, apply it - see specialQueryTemplates
        resolve whois://referralServer/whoisquery
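The referral loop can be sketched in Java as follows. Everything concrete here is an assumption made for illustration: the class name ServerlessWhois, the two server names, both regular expressions, the cap on referrals, the reading of domainSuffix as the part of the query after the last '.', and the use of a String.format-style "%s" template. The actual values of DEFAULT_IP_WHOIS_SERVER, ULTRA_SUFFIX_WHOIS_SERVER, IP_ADDRESS_REGEX and WHOIS_SERVER_REGEX are not given in this document. The network call is passed in as a function so the control flow can be exercised without a live server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of serverless WHOIS resolution with referrals. */
public class ServerlessWhois {
    // Placeholder values (assumptions), not Heritrix's actual constants.
    static final String DEFAULT_IP_WHOIS_SERVER = "whois.ip.example";
    static final String ULTRA_SUFFIX_WHOIS_SERVER = "whois.root.example";
    static final Pattern IP_ADDRESS =
            Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");
    static final Pattern WHOIS_SERVER =
            Pattern.compile("(?im)^\\s*(?:refer|whois server):\\s*(\\S+)\\s*$");
    static final int MAX_REFERRALS = 5; // guard against referral cycles (assumption)

    /**
     * @param fetch (server, query) -> response text; stands in for the TCP exchange
     * @return every response received, in the order the servers were queried
     */
    public static List<String> resolve(String query,
                                       Map<String, String> specialQueryTemplates,
                                       BiFunction<String, String, String> fetch) {
        List<String> responses = new ArrayList<>();
        String response;
        if (IP_ADDRESS.matcher(query).matches()) {
            response = fetch.apply(DEFAULT_IP_WHOIS_SERVER, query);
        } else {
            // Assumption: domainSuffix is the text after the last '.',
            // or the whole query when it contains no '.'.
            String domainSuffix = query.substring(query.lastIndexOf('.') + 1);
            response = fetch.apply(ULTRA_SUFFIX_WHOIS_SERVER, domainSuffix);
        }
        responses.add(response);

        for (int i = 0; i < MAX_REFERRALS; i++) {
            Matcher m = WHOIS_SERVER.matcher(response);
            if (!m.find()) {
                break; // last response names no further server
            }
            String referralServer = m.group(1);
            // Assumption: a template is a format string with one %s slot.
            String template = specialQueryTemplates.get(referralServer);
            String formatted = template != null ? String.format(template, query) : query;
            response = fetch.apply(referralServer, formatted);
            responses.add(response);
        }
        return responses;
    }
}
```

Returning every intermediate response, rather than only the last one, is a design choice of this sketch; it keeps the whole referral chain available to the caller.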
Modifier and Type | Class and Description |
---|---|
protected static class | FetchWhois.UrlStatus |
Modifier and Type | Field and Description |
---|---|
protected BdbModule | bdb |
protected static String | DEFAULT_IP_WHOIS_SERVER |
static String | IP_ADDRESS_REGEX |
protected ServerCache | serverCache |
protected Map<String,String> | specialQueryTemplates |
protected static String | ULTRA_SUFFIX_WHOIS_SERVER |
protected static String | WHOIS_SERVER_REGEX |
Fields inherited from class Processor:
beanName, kp, recoveryCheckpoint, uriCount
Fields inherited from interface CoreAttributeConstants:
A_ANNOTATIONS, A_CONTENT_TYPE, A_CREDENTIALS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_AUTH_CHALLENGES, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_MINIMUM_DELAY, A_MIRROR_PATH, A_NONFATAL_ERRORS, A_PRECALC_PRECEDENCE, A_PREREQUISITE_URI, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_SUBMIT_DATA, A_WARC_RESPONSE_HEADERS, A_WHOIS_SERVER_IP, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
Fields inherited from interface FetchStatusCodes:
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_NOT_FOUND, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE, S_WHOIS_GENERIC_FINISHED, S_WHOIS_SUCCESS
Constructor and Description |
---|
FetchWhois() |
Modifier and Type | Method and Description |
---|---|
protected void | addWhoisLink(CrawlURI curi, String query) |
protected void | addWhoisLinks(CrawlURI curi): Adds outlinks to whois:{domain} and whois:{ipAddress} |
protected ProcessResult | deferOrFinishGeneric(CrawlURI curi, String domainOrIp) |
protected void | fetch(CrawlURI curi, String whoisServer, String whoisQuery) |
ServerCache | getServerCache() |
int | getSoTimeoutMs() |
protected String | getWhoisQuery(CrawlURI curi) |
protected String | getWhoisServer(CrawlURI curi) |
protected void | innerProcess(CrawlURI uri): Actually performs the process. |
protected ProcessResult | innerProcessResult(CrawlURI curi) |
boolean | isRunning() |
protected String | makeWhoisUrl(String server, String principal) |
void | setBdbModule(BdbModule bdb) |
void | setServerCache(ServerCache serverCache) |
void | setSoTimeoutMs(int timeout) |
void | setSpecialQueryTemplates(Map<String,String> m) |
protected boolean | shouldProcess(CrawlURI uri): Determines whether the given uri should be processed by this processor. |
void | start() |
void | stop() |
Methods inherited from class Processor:
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint, toCheckpointJson
public static final String IP_ADDRESS_REGEX
protected static final String DEFAULT_IP_WHOIS_SERVER
protected static final String ULTRA_SUFFIX_WHOIS_SERVER
protected static String WHOIS_SERVER_REGEX
protected BdbModule bdb
protected ServerCache serverCache
public void setBdbModule(BdbModule bdb)
public int getSoTimeoutMs()
public void setSoTimeoutMs(int timeout)
public void start()
public boolean isRunning()
public void stop()
protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException
Overrides: innerProcessResult in class Processor
Throws: InterruptedException
protected ProcessResult deferOrFinishGeneric(CrawlURI curi, String domainOrIp)
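protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Overrides: shouldProcess in class Processor
Parameters: uri - the URI to test
public ServerCache getServerCache()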
public void setServerCache(ServerCache serverCache)
protected void addWhoisLinks(CrawlURI curi) throws InterruptedException
Adds outlinks to whois:{domain} and whois:{ipAddress}
Throws: InterruptedException
protected void innerProcess(CrawlURI uri) throws InterruptedException
Description copied from class: Processor
Actually performs the process. The given URI has already passed the #ENABLED, the #DECIDE_RULES and the #shouldProcess(ProcessorURI) tests.
Overrides: innerProcess in class Processor
Parameters: uri - the URI to process
Throws: InterruptedException - if the thread is interrupted

Copyright © 2003-2014 Internet Archive. All Rights Reserved.