Package | Description |
---|---|
org.archive.crawler.deciderules |
Provides classes for a simple decision rules framework.
|
org.archive.crawler.framework | |
org.archive.crawler.frontier | |
org.archive.crawler.frontier.precedence | |
org.archive.crawler.postprocessor | |
org.archive.crawler.prefetch | |
org.archive.crawler.processor | |
org.archive.modules |
The beginnings of a refactored settings framework.
|
org.archive.modules.canonicalize | |
org.archive.modules.credential |
Contains HTML form login and HTTP basic and digest credentials
used by Heritrix when logging into sites.
|
org.archive.modules.deciderules | |
org.archive.modules.deciderules.recrawl | |
org.archive.modules.deciderules.surt | |
org.archive.modules.extractor | |
org.archive.modules.fetcher | |
org.archive.modules.forms | |
org.archive.modules.recrawl | |
org.archive.modules.writer | |
Modifier and Type | Class and Description |
---|---|
class |
ClassKeyMatchesRegexDecideRule
Rule applies configured decision to any CrawlURI class key -- i.e.
|
Modifier and Type | Class and Description |
---|---|
class |
Scoper
Base class for Scopers.
|
Modifier and Type | Class and Description |
---|---|
class |
AbstractFrontier
Shared facilities for Frontier implementations.
|
class |
AssignmentLevelSurtQueueAssignmentPolicy
Create a queueKey based on the SURT authority, reduced to the
public-suffix-plus-one domain (topmost assignable domain).
|
class |
BdbFrontier
A Frontier using several BerkeleyDB JE Databases to hold its record of
known hosts (queues), and pending URIs.
|
class |
BucketQueueAssignmentPolicy
Uses the target IPs as basis for queue-assignment,
distributing them over a fixed number of sub-queues.
|
class |
HostnameQueueAssignmentPolicy
QueueAssignmentPolicy based on the hostname:port evident in the given
CrawlURI.
|
class |
IPQueueAssignmentPolicy
Uses target IP as basis for queue-assignment, unless it is unavailable,
in which case it behaves as HostnameQueueAssignmentPolicy.
|
class |
QueueAssignmentPolicy
Establishes a mapping from CrawlURIs to String keys (queue names).
|
class |
SurtAuthorityQueueAssignmentPolicy
QueueAssignmentPolicy based on the SURT form of the hostname.
|
class |
URIAuthorityBasedQueueAssignmentPolicy
SurtAuthorityQueueAssignmentPolicy based on the surt form of hostname.
|
class |
WorkQueueFrontier
A common Frontier base using several queues to hold pending URIs.
|
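The hostname- and IP-based policies listed above can be illustrated with a minimal sketch. It assumes a plain `host:port` key format and a fallback to the scheme's default port; the class and method names (`QueueKeySketch`, `hostPortKey`, `ipOrHostKey`) are hypothetical and are not part of Heritrix, whose real policies operate on CrawlURI objects.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not the Heritrix implementation.
final class QueueKeySketch {

    /** Builds a "host:port" queue key, using the scheme's default port when none is given. */
    static String hostPortKey(String uri) {
        URI u = URI.create(uri);
        if (u.getHost() == null) {
            throw new IllegalArgumentException("no host in " + uri);
        }
        int port = u.getPort();
        if (port == -1) {
            // Assumed defaults: 443 for https, 80 for everything else.
            port = "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
        }
        return u.getHost().toLowerCase(Locale.ROOT) + ":" + port;
    }

    /** Uses the resolved IP as the key when known, otherwise falls back to the hostname key. */
    static String ipOrHostKey(String resolvedIp, String uri) {
        return (resolvedIp != null && !resolvedIp.isEmpty()) ? resolvedIp : hostPortKey(uri);
    }
}
```

Because every URI with the same key lands in the same queue, the choice of key decides which URIs are treated as belonging together for politeness purposes: per host, per IP, or per domain.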
Modifier and Type | Class and Description |
---|---|
class |
BaseQueuePrecedencePolicy
QueuePrecedencePolicy that sets a uri-queue's precedence to a configured
single value.
|
class |
BaseUriPrecedencePolicy
UriPrecedencePolicy which assigns URIs a set value (perhaps overridden
for different URIs).
|
class |
HighestUriQueuePrecedencePolicy
QueuePrecedencePolicy that sets a uri-queue's precedence to that of the
highest URI currently enqueued within itself, added to the configured
base-precedence.
|
class |
HopsUriPrecedencePolicy
UriPrecedencePolicy which assigns URIs a precedence equal to the number
of hops in its hops-path-from-seed (either all hops or just navlink ('L')
hops).
|
class |
PreloadedUriPrecedencePolicy
UriPrecedencePolicy which assigns URIs a precedence from a value that
was preloaded for them into the uri-history database.
|
class |
SuccessCountsQueuePrecedencePolicy
QueuePrecedencePolicy that sets a uri-queue's precedence to a configured
base value, then lowers its precedence with each tier of successful URIs
completed.
|
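As a rough illustration of the hops-based policy described above, the sketch below derives a precedence value from a hops-path string such as "LLXE". The class and method names are hypothetical, and any configured base precedence is deliberately left out.

```java
// Hypothetical illustration only; not the Heritrix implementation.
final class HopsPrecedenceSketch {

    /**
     * Returns the number of hops in the path: every hop when navlinksOnly
     * is false, or only the navlink ('L') hops when it is true.
     */
    static int precedence(String hopsPath, boolean navlinksOnly) {
        if (!navlinksOnly) {
            return hopsPath.length();
        }
        int navlinks = 0;
        for (int i = 0; i < hopsPath.length(); i++) {
            if (hopsPath.charAt(i) == 'L') {
                navlinks++;
            }
        }
        return navlinks;
    }
}
```

With this scheme, URIs discovered further from a seed receive a larger number than URIs close to a seed.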
Modifier and Type | Class and Description |
---|---|
class |
CandidatesProcessor
Processor which sends all candidate outlinks through the
CandidateChain, scheduling those with non-negative status
codes to the frontier.
|
class |
DispositionProcessor
A step, late in the processing of a CrawlURI, for marking-up the
CrawlURI with values to affect frontier disposition, and updating
information that may have been affected by the fetch.
|
class |
LinksScoper
Deprecated.
Use CandidatesProcessor and CandidateChain/CandidateScoper instead
|
class |
LowDiskPauseProcessor
Deprecated.
Is highly system dependent. Use DiskSpaceMonitor instead.
|
class |
ReschedulingProcessor
The most simple forced-rescheduling step possible: use a local
setting (perhaps overlaid to vary based on the URI) to set an exact
future reschedule time, as a delay from now.
|
class |
SupplementaryLinksScoper
Run CrawlURI links carried in the passed CrawlURI through a filter
and 'handle' rejections.
|
Modifier and Type | Class and Description |
---|---|
class |
CandidateScoper
Simple single-URI scoper, considers passed-in URI as candidate; sets
fetchstatus negative and skips to end of processing if out-of-scope.
|
class |
FrontierPreparer
Processor to preload URI with as much precalculated policy-based
info as possible before it reaches frontier critical sections.
|
class |
PreconditionEnforcer
Ensures the preconditions for a fetch -- such as DNS lookup
or acquiring and respecting a robots.txt policy -- are
satisfied before a URI is passed to subsequent stages.
|
class |
Preselector
If set to recheck the crawl's scope, gives a yes/no on whether
a CrawlURI should be processed at all.
|
class |
QuotaEnforcer
A simple quota enforcer.
|
class |
RuntimeLimitEnforcer
A processor to enforce runtime limits on crawls.
|
Modifier and Type | Class and Description |
---|---|
class |
CrawlMapper
A simple crawl splitter/mapper, dividing up CrawlURIs
between crawlers by diverting some range of URIs to local log files
(which can then be imported to other crawlers).
|
class |
HashCrawlMapper
Maps URIs to one of N crawler names by applying a hash to the
URI's (possibly-transformed) classKey.
|
class |
LexicalCrawlMapper
A simple crawl splitter/mapper, dividing up CrawlURIs
between crawlers by diverting some range of URIs to local log files
(which can then be imported to other crawlers).
|
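The hash-based mapping described above can be sketched as follows. This is an assumption-laden illustration: it uses `String.hashCode()` as a stand-in for whatever hash the real mapper applies, and `HashMapperSketch` and `crawlerFor` are hypothetical names.

```java
// Hypothetical illustration only; not the Heritrix implementation.
final class HashMapperSketch {

    /** Picks one of the crawler names for a class key, always the same one for the same key. */
    static String crawlerFor(String classKey, String[] crawlerNames) {
        if (crawlerNames.length == 0) {
            throw new IllegalArgumentException("at least one crawler name is required");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(classKey.hashCode(), crawlerNames.length);
        return crawlerNames[index];
    }
}
```

Because the mapping depends only on the key and the number of crawlers, every crawler in a group can compute it independently and agree on which node owns a given URI.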
Modifier and Type | Class and Description |
---|---|
class |
CandidateChain |
class |
CrawlMetadata
Basic crawl metadata, as consulted by functional modules and
recorded in ARCs/WARCs.
|
class |
DispositionChain |
class |
FetchChain |
class |
Processor
A processor of URIs.
|
class |
ProcessorChain
Collection of Processors to run.
|
class |
ScriptedProcessor
A processor which runs a JSR-223 script on the CrawlURI.
|
Modifier and Type | Class and Description |
---|---|
class |
BaseRule
Base of all rules applied when canonicalizing a URL that are configurable
via the Heritrix settings system.
|
class |
FixupQueryString
Strip any trailing question mark.
|
class |
LowercaseRule
Lowercases the URL.
|
class |
RegexRule
General conversion rule.
|
class |
RulesCanonicalizationPolicy
URI Canonicalization Policy
|
class |
StripExtraSlashes
Strip any extra slashes, '/', found in the path.
|
class |
StripSessionCFIDs
Strip ColdFusion session ids.
|
class |
StripSessionIDs
Strip known session ids.
|
class |
StripUserinfoRule
Strip any 'userinfo' found on http/https URLs.
|
class |
StripWWWNRule
Strip any 'www[0-9]*' found on http/https URLs IF they have some
path/query component (content after third slash).
|
class |
StripWWWRule
Strip any 'www' found on http/https URLs, IF they have some
path/query component (content after third slash).
|
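To make the effect of chaining such rules concrete, here is a minimal sketch that applies four of the transformations described above in a fixed order. The regular expressions, the rule order, and the names (`CanonicalizeSketch`, `canonicalize`) are assumptions made for illustration; the actual rule classes are configured through the Heritrix settings system and may behave differently in edge cases.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Heritrix implementation.
final class CanonicalizeSketch {

    // Assumed pattern: userinfo is anything before '@' in the authority of an http/https URL.
    private static final Pattern USERINFO =
            Pattern.compile("(?i)^(https?://)[^/]*@(.*)$");

    // Assumed pattern: 'www' plus optional digits, only when content follows the third slash.
    private static final Pattern WWWN =
            Pattern.compile("(?i)^(https?://)www[0-9]*\\.([^/]*/.+)$");

    private static final List<UnaryOperator<String>> RULES = List.of(
            url -> url.toLowerCase(Locale.ROOT),                       // lowercase the URL
            url -> USERINFO.matcher(url).replaceFirst("$1$2"),         // strip userinfo
            url -> WWWN.matcher(url).replaceFirst("$1$2"),             // strip www[0-9]*
            url -> url.endsWith("?") ? url.substring(0, url.length() - 1) : url // strip trailing '?'
    );

    /** Applies every rule in order, feeding each rule the previous rule's output. */
    static String canonicalize(String url) {
        String result = url;
        for (UnaryOperator<String> rule : RULES) {
            result = rule.apply(result);
        }
        return result;
    }
}
```

Under these assumptions, `canonicalize("HTTP://User@WWW3.Example.com/Path?")` yields `http://example.com/path`, while `http://www.example.com/` is left unchanged because nothing follows the third slash.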
Modifier and Type | Class and Description |
---|---|
class |
CredentialStore
Front door to the credential store.
|
Modifier and Type | Class and Description |
---|---|
class |
AcceptDecideRule |
class |
AddRedirectFromRootServerToScope |
class |
ContentLengthDecideRule |
class |
ContentTypeMatchesRegexDecideRule
DecideRule whose decision is applied if the URI's content-type
is present and matches the supplied regular expression.
|
class |
ContentTypeNotMatchesRegexDecideRule
DecideRule whose decision is applied if the URI's content-type
is present and does not match the supplied regular expression.
|
class |
DecideRule |
class |
DecideRuleSequence |
class |
ExternalGeoLocationDecideRule
A rule that can be configured to take alternate implementations
of the ExternalGeoLocationInterface.
|
class |
FetchStatusDecideRule
Rule applies the configured decision for any URI which has a
fetch status equal to the 'target-status' setting.
|
class |
FetchStatusMatchesRegexDecideRule |
class |
FetchStatusNotMatchesRegexDecideRule |
class |
HasViaDecideRule
Rule applies the configured decision for any URI which has a 'via'
(essentially, any URI that was a seed or some kinds of mid-crawl adds).
|
class |
HopCrossesAssignmentLevelDomainDecideRule
Applies its decision if the current URI differs in that portion of
its hostname/domain that is assigned/sold by registrars, its
'assignment-level-domain' (ALD) (AKA 'public suffix' or in previous
Heritrix versions, 'topmost assigned SURT')
|
class |
HopsPathMatchesRegexDecideRule
Rule applies configured decision to any CrawlURIs whose 'hops-path'
(string like "LLXE" etc.) matches the supplied regex.
|
class |
IpAddressSetDecideRule
IpAddressSetDecideRule must be used with
Preselector.setRecheckScope(boolean) set to true, because it relies on
Heritrix's DNS lookup to establish the IP address for a URI before it can run.
|
class |
MatchesFilePatternDecideRule
Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches.
|
class |
MatchesListRegexDecideRule
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexes.
|
class |
MatchesRegexDecideRule
Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regex.
|
class |
MatchesStatusCodeDecideRule
Provides a rule that returns "true" for any CrawlURIs which have a fetch
status code that falls within the provided inclusive range.
|
class |
NotMatchesFilePatternDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied (file-pattern) regex.
|
class |
NotMatchesListRegexDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied regex.
|
class |
NotMatchesRegexDecideRule
Rule applies configured decision to any URIs which do *not*
match the supplied regex.
|
class |
NotMatchesStatusCodeDecideRule
Provides a rule that returns "true" for any CrawlURIs which have a fetch
status code that does not fall within the provided inclusive range.
|
class |
PathologicalPathDecideRule
Rule REJECTs any URI which contains an excessive number of identical,
consecutive path-segments (e.g., http://example.com/a/a/a/boo.html has 3 '/a'
segments).
|
class |
PredicatedDecideRule
Rule which applies the configured decision only if a
test evaluates to true.
|
class |
PrerequisiteAcceptDecideRule
Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in
the last hopsPath position).
|
class |
RejectDecideRule |
class |
ResourceLongerThanDecideRule
Applies configured decision for URIs with content length greater than
a given threshold length value.
|
class |
ResourceNoLongerThanDecideRule
Applies configured decision for URIs with content length less than or equal
to a given threshold length value.
|
class |
ResponseContentLengthDecideRule
Decide rule that will ACCEPT or REJECT a uri, depending on the
"decision" property, after it's fetched, if the content body is within a
specified size range, specified in bytes.
|
class |
SchemeNotInSetDecideRule
Rule applies the configured decision (default REJECT) for any URI which
has a URI-scheme NOT contained in the configured Set.
|
class |
ScriptedDecideRule
Rule which runs a JSR-223 script to make its decision.
|
class |
SeedAcceptDecideRule
Rule which ACCEPTs all 'seed' URIs (those for which
isSeed is true).
|
class |
TooManyHopsDecideRule
Rule REJECTs any CrawlURIs whose total number of hops (length of the
hopsPath string, traversed links of any type) is over a threshold.
|
class |
TooManyPathSegmentsDecideRule
Rule REJECTs any CrawlURIs whose total number of path-segments (as
indicated by the count of '/' characters not including the first '//')
is over a given threshold.
|
class |
TransclusionDecideRule
Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see
CandidateURI#getPathFromSeed()) ends with at least one, but not more
than the given number of, non-navlink (non-'L') hops.
|
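Several of the structural rules above reduce to simple counting over the URI string or the hops path. The sketch below shows one plausible reading of four of them; the thresholds' exact semantics (strictly greater than the limit) and all names in `DecideRuleSketch` are assumptions, and the real rules operate on CrawlURI objects rather than plain strings.

```java
import java.net.URI;

// Hypothetical illustration only; not the Heritrix rule classes.
final class DecideRuleSketch {

    /** Longest run of identical, consecutive, non-empty path segments. */
    static int longestSegmentRun(String uri) {
        String path = URI.create(uri).getPath();
        if (path == null) {
            return 0;
        }
        int longest = 0;
        int run = 0;
        String previous = null;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {      // an empty segment breaks any run
                previous = null;
                run = 0;
                continue;
            }
            run = segment.equals(previous) ? run + 1 : 1;
            previous = segment;
            longest = Math.max(longest, run);
        }
        return longest;
    }

    /** Pathological-path test: assumed to trigger when the run exceeds the limit. */
    static boolean isPathological(String uri, int maxRepetitions) {
        return longestSegmentRun(uri) > maxRepetitions;
    }

    /** Counts '/' characters after the first "//" (slashes in a query string count too). */
    static int countPathSegments(String uri) {
        int marker = uri.indexOf("//");
        int count = 0;
        for (int i = marker < 0 ? 0 : marker + 2; i < uri.length(); i++) {
            if (uri.charAt(i) == '/') {
                count++;
            }
        }
        return count;
    }

    /** Too-many-hops test: the hops path is longer than the limit. */
    static boolean tooManyHops(String hopsPath, int maxHops) {
        return hopsPath.length() > maxHops;
    }

    /** Transclusion test: between 1 and maxTransHops trailing hops that are not 'L'. */
    static boolean endsWithTransclusion(String hopsPath, int maxTransHops) {
        int trailing = 0;
        for (int i = hopsPath.length() - 1; i >= 0 && hopsPath.charAt(i) != 'L'; i--) {
            trailing++;
        }
        return trailing >= 1 && trailing <= maxTransHops;
    }
}
```

For the example URI in the table, `longestSegmentRun("http://example.com/a/a/a/boo.html")` is 3, so the URI counts as pathological for any limit below 3.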
Modifier and Type | Class and Description |
---|---|
class |
IdenticalDigestDecideRule
Rule applies configured decision to any CrawlURIs whose prior-history
content-digest matches the latest fetch.
|
Modifier and Type | Class and Description |
---|---|
class |
NotOnDomainsDecideRule
Rule applies configured decision to any URIs that are
*not* in one of the domains in the configured set of
domains, filled from the seed set.
|
class |
NotOnHostsDecideRule
Rule applies configured decision to any URIs that
are *not* on one of the hosts in the configured set of
hosts, filled from the seed set.
|
class |
NotSurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when
expressed in SURT form, do *not* begin with one of the prefixes
in the configured set.
|
class |
OnDomainsDecideRule
Rule applies configured decision to any URIs that
are on one of the domains in the configured set of
domains, filled from the seed set.
|
class |
OnHostsDecideRule
Rule applies configured decision to any URIs that
are on one of the hosts in the configured set of
hosts, filled from the seed set.
|
class |
SurtPrefixedDecideRule
Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set.
|
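The SURT-based rules above compare a URI, rewritten so that its host labels appear in reverse order, against a set of prefixes. The following sketch shows the idea under simplifying assumptions: it ignores ports and userinfo, and the names `SurtSketch`, `toSurt`, and `isSurtPrefixed` are hypothetical rather than Heritrix APIs.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the Heritrix implementation.
final class SurtSketch {

    /** Rewrites e.g. http://www.archive.org/movies as http://(org,archive,www,)/movies. */
    static String toSurt(String uri) {
        URI u = URI.create(uri);
        if (u.getHost() == null || u.getScheme() == null) {
            throw new IllegalArgumentException("absolute URI with host required: " + uri);
        }
        String[] labels = u.getHost().toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder surt = new StringBuilder(u.getScheme().toLowerCase(Locale.ROOT)).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            surt.append(labels[i]).append(',');   // most general label first
        }
        surt.append(')');
        if (u.getRawPath() != null) {
            surt.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            surt.append('?').append(u.getRawQuery());
        }
        return surt.toString();
    }

    /** True when the URI's SURT form begins with any of the configured prefixes. */
    static boolean isSurtPrefixed(String uri, Set<String> prefixes) {
        String surt = toSurt(uri);
        for (String prefix : prefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Reversing the host labels is what makes a plain string-prefix test useful: the prefix `http://(org,archive,` covers archive.org and every subdomain of it, which a prefix test on the ordinary URI form could not express.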
Modifier and Type | Class and Description |
---|---|
class |
AggressiveExtractorHTML
Extended version of ExtractorHTML with more aggressive javascript link
extraction where javascript code is parsed first with general HTML tags
regex, and then by javascript speculative link regex.
|
class |
ContentExtractor
Extracts links from the fetched content of a URI, as opposed to its headers.
|
class |
Extractor
Extracts links from fetched URIs.
|
class |
ExtractorCSS
This extractor parses URIs from CSS files.
|
class |
ExtractorDOC
This class allows the caller to extract href-style links from Word 97-format Word documents.
|
class |
ExtractorHTML
Basic link-extraction, from an HTML content-body,
using regular expressions.
|
class |
ExtractorHTTP
Extracts URIs from HTTP response headers.
|
class |
ExtractorImpliedURI
An extractor for finding 'implied' URIs inside other URIs.
|
class |
ExtractorJS
Processes Javascript files for strings that are likely to be
crawlable URIs.
|
class |
ExtractorMultipleRegex
An extractor that uses regular expressions to find strings in the fetched
content of a URI, and constructs outlink URIs from those strings.
|
class |
ExtractorPDF
Allows the caller to process a CrawlURI representing a PDF
for the purpose of extracting URIs.
|
class |
ExtractorSWF
Extracts URIs from SWF (flash/shockwave) files.
|
class |
ExtractorUniversal
A last-ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link.
|
class |
ExtractorURI
An extractor for finding URIs inside other URIs.
|
class |
ExtractorXML
A simple extractor which finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents).
|
class |
HTTPContentDigest
A processor for calculating custom HTTP content digests in place of the
default (if any) computed by the HTTP fetcher processors.
|
class |
JerichoExtractorHTML
Improved link-extraction from an HTML content-body using jericho-html parser.
|
class |
TrapSuppressExtractor
Pseudo-extractor that suppresses link-extraction of likely trap pages,
by noticing when content's digest is identical to that of its 'via'.
|
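Regex-based link extraction, as mentioned for several of the extractors above, can be illustrated with a deliberately small sketch. It assumes that only quoted `href` and `src` attributes matter and resolves each match against the page's own URI; the pattern and the names `HrefExtractorSketch` and `extract` are hypothetical and far simpler than what a production extractor needs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Heritrix implementation.
final class HrefExtractorSketch {

    // Assumed pattern: a quoted href or src attribute value.
    private static final Pattern LINK =
            Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns absolute outlink URIs found in the HTML, resolved against baseUri. */
    static List<String> extract(String baseUri, String html) {
        URI base = URI.create(baseUri);
        List<String> outlinks = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            try {
                outlinks.add(base.resolve(matcher.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip attribute values that are not syntactically valid URIs.
            }
        }
        return outlinks;
    }
}
```

A regex pass like this tolerates malformed markup that a strict parser would reject, at the cost of occasionally picking up strings that are not real links.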
Modifier and Type | Class and Description |
---|---|
class |
FetchDNS
Processor to resolve 'dns:' URIs.
|
class |
FetchFTP
Fetches documents and directory listings using FTP.
|
class |
FetchHTTP
HTTP fetcher that uses Apache Jakarta Commons
HttpClient library.
|
class |
FetchWhois
WHOIS Fetcher (RFC 3912).
|
Modifier and Type | Class and Description |
---|---|
class |
ExtractorHTMLForms
Extracts extra information about FORMs in HTML, loading this
into the CrawlURI (for potential later use by FormLoginProcessor)
and adding a small annotation to the crawl.log.
|
class |
FormLoginProcessor
A step, post-ExtractorHTMLForms, where a followup CrawlURI to
attempt a form submission may be synthesized.
|
Modifier and Type | Class and Description |
---|---|
class |
AbstractPersistProcessor |
class |
ContentDigestHistoryLoader |
class |
ContentDigestHistoryStorer |
class |
FetchHistoryProcessor
Maintain a history of fetch information inside the CrawlURI's attributes.
|
class |
PersistLoadProcessor
Loads CrawlURI attributes from previous fetch from persistent storage for
consultation by a later recrawl.
|
class |
PersistLogProcessor
Log CrawlURI attributes from latest fetch for consultation by a later
recrawl.
|
class |
PersistOnlineProcessor
Common superclass for persisting Processors which directly store/load
to persistence (as opposed to logging for batch load later).
|
class |
PersistProcessor
Superclass for Processors which utilize BDB-JE for URI state
(including most notably history) persistence.
|
class |
PersistStoreProcessor
Store CrawlURI attributes from latest fetch to persistent storage for
consultation by a later recrawl.
|
Modifier and Type | Class and Description |
---|---|
class |
ARCWriterProcessor
Processor module for writing the results of successful fetches (and
perhaps someday, certain kinds of network failures) to the Internet Archive
ARC file format.
|
class |
Kw3WriterProcessor
Processor module that writes the results of successful fetches to
files on disk.
|
class |
MirrorWriterProcessor
Processor module that writes the results of successful fetches to
files on disk.
|
class |
WARCWriterProcessor
WARCWriterProcessor.
|
class |
WriterPoolProcessor
Abstract implementation of a file pool processor.
|
Copyright © 2003-2014 Internet Archive. All Rights Reserved.