**Packages that use Processor**

Package | Description
---|---
org.archive.crawler.framework |
org.archive.crawler.postprocessor |
org.archive.crawler.prefetch |
org.archive.crawler.processor |
org.archive.modules | The beginnings of a refactored settings framework.
org.archive.modules.extractor |
org.archive.modules.fetcher |
org.archive.modules.forms |
org.archive.modules.recrawl |
org.archive.modules.writer |
**Uses of Processor in org.archive.crawler.framework**

Modifier and Type | Class and Description
---|---
class | Scoper: Base class for Scopers.
Modifier and Type | Method and Description
---|---
void | ToeThread.atProcessor(Processor proc)
**Uses of Processor in org.archive.crawler.postprocessor**

Modifier and Type | Class and Description
---|---
class | CandidatesProcessor: Processor which sends all candidate outlinks through the CandidateChain, scheduling those with non-negative status codes to the frontier.
class | DispositionProcessor: A step, late in the processing of a CrawlURI, for marking up the CrawlURI with values that affect frontier disposition, and for updating information that may have been affected by the fetch.
class | LinksScoper: Deprecated. Use CandidatesProcessor and CandidateChain/CandidateScoper instead.
class | LowDiskPauseProcessor: Deprecated. Highly system-dependent; use DiskSpaceMonitor instead.
class | ReschedulingProcessor: The simplest possible forced-rescheduling step: uses a local setting (perhaps overlaid to vary based on the URI) to set an exact future reschedule time, as a delay from now.
class | SupplementaryLinksScoper: Runs the CrawlURI links carried in the passed CrawlURI through a filter and 'handles' rejections.
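The dispatch logic CandidatesProcessor describes can be sketched in a few lines: each candidate outlink runs through a chain of rules, and only candidates whose status remains non-negative are scheduled. All types below (Candidate, CandidateRule, the -4 status value) are simplified stand-ins for illustration, not the real Heritrix classes.

```java
import java.util.ArrayList;
import java.util.List;

public class CandidatesSketch {
    public static class Candidate {
        public final String uri;
        public int status; // negative => rejected during candidate processing
        public Candidate(String uri) { this.uri = uri; }
    }

    public interface CandidateRule { void apply(Candidate c); }

    /** Run each outlink through the rules; return the URIs still schedulable. */
    public static List<String> schedulable(List<Candidate> outlinks, List<CandidateRule> chain) {
        List<String> accepted = new ArrayList<>();
        for (Candidate c : outlinks) {
            for (CandidateRule r : chain) {
                r.apply(c);
            }
            if (c.status >= 0) {        // non-negative status => schedule to frontier
                accepted.add(c.uri);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // A scope rule standing in for the CandidateChain: reject off-site links.
        CandidateRule scoper = c -> {
            if (!c.uri.startsWith("http://example.com/")) c.status = -4;
        };
        List<Candidate> outlinks = new ArrayList<>();
        outlinks.add(new Candidate("http://example.com/a"));
        outlinks.add(new Candidate("http://other.org/b"));
        System.out.println(schedulable(outlinks, List.of(scoper))); // prints [http://example.com/a]
    }
}
```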
**Uses of Processor in org.archive.crawler.prefetch**

Modifier and Type | Class and Description
---|---
class | CandidateScoper: Simple single-URI scoper; considers the passed-in URI as a candidate, sets its fetch status negative and skips to the end of processing if it is out of scope.
class | FrontierPreparer: Processor to preload a URI with as much precalculated policy-based info as possible before it reaches the frontier's critical sections.
class | PreconditionEnforcer: Ensures the preconditions for a fetch -- such as DNS lookup or acquiring and respecting a robots.txt policy -- are satisfied before a URI is passed to subsequent stages.
class | Preselector: If set to recheck the crawl's scope, gives a yes/no on whether a CrawlURI should be processed at all.
class | QuotaEnforcer: A simple quota enforcer.
class | RuntimeLimitEnforcer: A processor to enforce runtime limits on crawls.
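The core of what "a simple quota enforcer" does can be illustrated as a per-host counter that refuses further fetches once a configured limit is reached. The class and method names below are made up for illustration; the real QuotaEnforcer supports quotas per server, host, and group over several metrics, not just a fetch count.

```java
import java.util.HashMap;
import java.util.Map;

public class QuotaSketch {
    private final long maxFetchesPerHost;
    private final Map<String, Long> fetchCounts = new HashMap<>();

    public QuotaSketch(long maxFetchesPerHost) {
        this.maxFetchesPerHost = maxFetchesPerHost;
    }

    /** Record one fetch attempt; return false once the host is over quota. */
    public boolean allowFetch(String host) {
        long n = fetchCounts.merge(host, 1L, Long::sum); // increment, creating entry on first fetch
        return n <= maxFetchesPerHost;
    }

    public static void main(String[] args) {
        QuotaSketch q = new QuotaSketch(2);
        System.out.println(q.allowFetch("example.com")); // true
        System.out.println(q.allowFetch("example.com")); // true
        System.out.println(q.allowFetch("example.com")); // false: over quota
    }
}
```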
**Uses of Processor in org.archive.crawler.processor**

Modifier and Type | Class and Description
---|---
class | CrawlMapper: A simple crawl splitter/mapper, dividing up CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
class | HashCrawlMapper: Maps URIs to one of N crawler names by applying a hash to the URI's (possibly transformed) classKey.
class | LexicalCrawlMapper: A simple crawl splitter/mapper, dividing up CrawlURIs between crawlers by diverting some range of URIs to local log files (which can then be imported to other crawlers).
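The idea behind HashCrawlMapper can be sketched as hashing the classKey and reducing the hash modulo the number of crawlers. This stand-in uses String.hashCode and invented crawler names for illustration only; the real class applies a configurable regex transform and a stronger fingerprint hash.

```java
public class HashMapperSketch {
    /** Pick one of n crawler names ("crawler0".."crawler{n-1}") for a classKey. */
    public static String crawlerFor(String classKey, int n) {
        // floorMod keeps the bucket non-negative even for negative hash codes
        int bucket = Math.floorMod(classKey.hashCode(), n);
        return "crawler" + bucket;
    }

    public static void main(String[] args) {
        // classKeys in Heritrix are typically reversed-domain-style strings
        System.out.println(crawlerFor("com,example,", 4));
        System.out.println(crawlerFor("org,archive,", 4));
    }
}
```

Because the mapping is a pure function of the classKey, every crawler in the cluster assigns any given URI to the same owner without coordination.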
**Uses of Processor in org.archive.modules**

Modifier and Type | Class and Description
---|---
class | ScriptedProcessor: A processor which runs a JSR-223 script on the CrawlURI.

Modifier and Type | Method and Description
---|---
List<Processor> | ProcessorChain.getProcessors()
Iterator<Processor> | ProcessorChain.iterator()
void | ProcessorChain.ChainStatusReceiver.atProcessor(Processor proc)
void | ProcessorChain.setProcessors(List<Processor> processors)
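The method surface above suggests the shape of a processor chain: an ordered list of processors that can be set, retrieved, and iterated, with a status receiver notified before each processor runs. The sketch below uses simplified stand-in types, not the real org.archive.modules classes, to show how those pieces fit together.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ChainSketch {
    public interface Processor { void process(String uri); }
    public interface ChainStatusReceiver { void atProcessor(Processor proc); }

    private List<Processor> processors = new ArrayList<>();

    public List<Processor> getProcessors() { return processors; }
    public void setProcessors(List<Processor> processors) { this.processors = processors; }
    public Iterator<Processor> iterator() { return processors.iterator(); }

    /** Run every processor in order, announcing each to the receiver first. */
    public void run(String uri, ChainStatusReceiver recv) {
        for (Processor p : processors) {
            recv.atProcessor(p); // e.g. lets a worker thread report where it is
            p.process(uri);
        }
    }

    public static void main(String[] args) {
        ChainSketch chain = new ChainSketch();
        chain.setProcessors(List.of(
            (Processor) uri -> System.out.println("fetch " + uri),
            (Processor) uri -> System.out.println("extract " + uri)));
        chain.run("http://example.com/", p -> {});
    }
}
```

This mirrors why ToeThread.atProcessor appears in the framework table: a callback fired as the chain advances gives each worker thread a place to record its current position.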
**Uses of Processor in org.archive.modules.extractor**

Modifier and Type | Class and Description
---|---
class | AggressiveExtractorHTML: Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML-tag regex, and then with the speculative JavaScript link regex.
class | ContentExtractor: Extracts links from the fetched content of a URI, as opposed to its headers.
class | Extractor: Extracts links from fetched URIs.
class | ExtractorCSS: This extractor parses URIs from CSS-type files.
class | ExtractorDOC: Allows the caller to extract href-style links from Word97-format Word documents.
class | ExtractorHTML: Basic link extraction from an HTML content body, using regular expressions.
class | ExtractorHTTP: Extracts URIs from HTTP response headers.
class | ExtractorImpliedURI: An extractor for finding 'implied' URIs inside other URIs.
class | ExtractorJS: Processes JavaScript files for strings that are likely to be crawlable URIs.
class | ExtractorMultipleRegex: An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
class | ExtractorPDF: Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
class | ExtractorSWF: Extracts URIs from SWF (Flash/Shockwave) files.
class | ExtractorUniversal: A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link.
class | ExtractorURI: An extractor for finding URIs inside other URIs.
class | ExtractorXML: A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
class | HTTPContentDigest: A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
class | JerichoExtractorHTML: Improved link extraction from an HTML content body using the jericho-html parser.
class | TrapSuppressExtractor: Pseudo-extractor that suppresses link extraction on likely trap pages, by noticing when content's digest is identical to that of its 'via'.
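The regex-driven approach that ExtractorXML (and, with more patterns, ExtractorJS and ExtractorUniversal) describes can be sketched as a single scan for http(s) URIs in attribute values and element content. The pattern below is a deliberate simplification for illustration, not the expression Heritrix actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlUriSketch {
    // Matches http/https URIs delimited by quotes, whitespace, or tag brackets.
    private static final Pattern URI = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Collect every candidate HTTP URI appearing in the XML text. */
    public static List<String> extract(CharSequence xml) {
        List<String> found = new ArrayList<>();
        Matcher m = URI.matcher(xml);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String rss = "<item><link>http://example.com/post</link>"
                   + "<enclosure url=\"http://example.com/a.mp3\"/></item>";
        // Finds the URI in element content and the one in the attribute value.
        System.out.println(extract(rss));
    }
}
```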
**Uses of Processor in org.archive.modules.fetcher**

Modifier and Type | Class and Description
---|---
class | FetchDNS: Processor to resolve 'dns:' URIs.
class | FetchFTP: Fetches documents and directory listings using FTP.
class | FetchHTTP: HTTP fetcher that uses the Apache Jakarta Commons HttpClient library.
class | FetchWhois: WHOIS fetcher (RFC 3912).
**Uses of Processor in org.archive.modules.forms**

Modifier and Type | Class and Description
---|---
class | ExtractorHTMLForms: Extracts extra information about FORMs in HTML, loading this into the CrawlURI (for potential later use by FormLoginProcessor) and adding a small annotation to the crawl.log.
class | FormLoginProcessor: A step, post-ExtractorHTMLForms, where a follow-up CrawlURI to attempt a form submission may be synthesized.
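Part of synthesizing such a follow-up submission is encoding the discovered form fields as an application/x-www-form-urlencoded body, which can be sketched as below. The field names are placeholders, not a real site's form, and this is not the Heritrix implementation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodySketch {
    /** Encode form fields as an application/x-www-form-urlencoded body. */
    public static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the field order the form declared.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "crawler operator");
        fields.put("password", "s3cret&safe");
        System.out.println(encode(fields)); // username=crawler+operator&password=s3cret%26safe
    }
}
```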
**Uses of Processor in org.archive.modules.recrawl**

Modifier and Type | Class and Description
---|---
class | AbstractPersistProcessor
class | ContentDigestHistoryLoader
class | ContentDigestHistoryStorer
class | FetchHistoryProcessor: Maintains a history of fetch information inside the CrawlURI's attributes.
class | PersistLoadProcessor: Loads CrawlURI attributes from a previous fetch out of persistent storage, for consultation by a later recrawl.
class | PersistLogProcessor: Logs CrawlURI attributes from the latest fetch for consultation by a later recrawl.
class | PersistOnlineProcessor: Common superclass for persisting processors which store/load directly to persistence (as opposed to logging for batch load later).
class | PersistProcessor: Superclass for processors which use BDB-JE for persistence of URI state (most notably fetch history).
class | PersistStoreProcessor: Stores CrawlURI attributes from the latest fetch in persistent storage, for consultation by a later recrawl.
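The recrawl scenario these processors support can be reduced to one comparison: keep the content digest from a URI's previous fetch, and on the next fetch compare digests to decide whether the content is unchanged. The map-backed store and method names below are illustrative stand-ins, not the Heritrix API (which persists this history via BDB-JE or logs).

```java
import java.util.HashMap;
import java.util.Map;

public class FetchHistorySketch {
    private final Map<String, String> lastDigestByUri = new HashMap<>();

    /** Record this fetch; return true when the digest matches the prior fetch. */
    public boolean recordAndCheckDuplicate(String uri, String contentDigest) {
        String prior = lastDigestByUri.put(uri, contentDigest); // null on first visit
        return contentDigest.equals(prior);
    }

    public static void main(String[] args) {
        FetchHistorySketch history = new FetchHistorySketch();
        System.out.println(history.recordAndCheckDuplicate("http://example.com/", "sha1:AAAA")); // false: first visit
        System.out.println(history.recordAndCheckDuplicate("http://example.com/", "sha1:AAAA")); // true: unchanged
        System.out.println(history.recordAndCheckDuplicate("http://example.com/", "sha1:BBBB")); // false: changed
    }
}
```

A duplicate verdict lets a downstream writer skip storing the body again, which is the payoff of persisting fetch history between crawls.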
**Uses of Processor in org.archive.modules.writer**

Modifier and Type | Class and Description
---|---
class | ARCWriterProcessor: Processor module for writing the results of successful fetches (and perhaps someday, certain kinds of network failures) to the Internet Archive ARC file format.
class | Kw3WriterProcessor: Processor module that writes the results of successful fetches to files on disk, in the format used by the Swedish National Library's Kulturarw3 project.
class | MirrorWriterProcessor: Processor module that writes the results of successful fetches to files on disk, laid out to mirror the URI structure of the crawled sites.
class | WARCWriterProcessor: Processor module for writing the results of successful fetches to the WARC file format.
class | WriterPoolProcessor: Abstract implementation of a file pool processor.
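The WARC record layout such a writer emits (per the WARC specification, ISO 28500) is simple enough to sketch: a version line, named headers including Content-Length, a blank line, the content block, then two CRLFs. This is a minimal sketch assuming only those required pieces; a real writer adds further headers (WARC-Record-ID, WARC-Date, digests, and so on).

```java
import java.nio.charset.StandardCharsets;

public class WarcRecordSketch {
    /** Build one minimal WARC record: headers, blank line, payload, two CRLFs. */
    public static byte[] buildRecord(String type, String targetUri, byte[] payload) {
        String header = "WARC/1.0\r\n"
            + "WARC-Type: " + type + "\r\n"
            + "WARC-Target-URI: " + targetUri + "\r\n"
            + "Content-Length: " + payload.length + "\r\n"
            + "\r\n";
        byte[] h = header.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[h.length + payload.length + 4];
        System.arraycopy(h, 0, out, 0, h.length);
        System.arraycopy(payload, 0, out, h.length, payload.length);
        // Record terminator: CRLF CRLF after the content block.
        out[out.length - 4] = '\r'; out[out.length - 3] = '\n';
        out[out.length - 2] = '\r'; out[out.length - 1] = '\n';
        return out;
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] record = buildRecord("resource", "http://example.com/", payload);
        System.out.println(new String(record, StandardCharsets.UTF_8));
    }
}
```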
Copyright © 2003-2014 Internet Archive. All Rights Reserved.