org.archive.modules.extractor (Heritrix 3 3.2.0 API)

Interface Summary
Interface	Description
ExtractorParameters	Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.
TempDirProvider
UriErrorLoggerModule

Class Summary
Class	Description
AggressiveExtractorHTML	Extended version of ExtractorHTML with more aggressive JavaScript link extraction, in which JavaScript code is parsed first with the general HTML tag regex and then with the speculative JavaScript link regex.
ContentExtractor	Extracts links from the fetched content of a URI, as opposed to its headers.
ContentExtractorTestBase	Abstract base class for unit testing ContentExtractor implementations.
CustomSWFTags	Overwrites action tags that may hold URIs so that they use the CrawlUriSWFAction action.
Extractor	Extracts links from fetched URIs.
ExtractorCSS	Parses URIs from CSS files.
ExtractorDOC	Allows the caller to extract href-style links from Word 97-format Word documents.
ExtractorHTML	Basic link-extraction, from an HTML content-body, using regular expressions.
ExtractorHTTP	Extracts URIs from HTTP response headers.
ExtractorImpliedURI	An extractor for finding 'implied' URIs inside other URIs.
ExtractorJS	Processes JavaScript files for strings that are likely to be crawlable URIs.
ExtractorMultipleRegex	An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
ExtractorPDF	Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
ExtractorSWF	Extracts URIs from SWF (Flash/Shockwave) files.
ExtractorUniversal	A last-ditch extractor that looks at the raw byte code and tries to extract anything that looks like a link.
ExtractorURI	An extractor for finding URIs inside other URIs.
ExtractorXML	A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
HTMLLinkContext	XPath-like context for HTML discovered URIs.
HTTPContentDigest	A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
JerichoExtractorHTML	Improved link extraction from an HTML content-body using the jericho-html parser.
Link	Link represents one discovered "edge" of the web graph: the source URI, the destination URI, and the type of reference (represented by the context in which it was found).
LinkContext	The context of link discovery.
LinkContext.SimpleLinkContext	Class for representing handy default LinkContext values.
PDFParser	Supports PDF parsing operations.
StringExtractorTestBase
StringExtractorTestBase.TestData
TrapSuppressExtractor	Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
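
The summaries above describe regular-expression link extraction (ExtractorHTML) and a Link as an edge made of a source URI, a destination URI, and a discovery context. The following is a minimal, self-contained sketch of that idea using only the JDK. It is not the Heritrix implementation and does not use the Heritrix API: the class names RegexLinkSketch and SimpleLink, the regular expression, and the XPath-like context strings such as "a/@href" are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical regex-based extractor, not Heritrix code. */
public class RegexLinkSketch {

    /** Hypothetical stand-in for one discovered edge of the web graph. */
    public static final class SimpleLink {
        public final URI source;       // page the link was found on
        public final URI destination;  // resolved target of the link
        public final String context;   // XPath-like context, e.g. "a/@href"

        SimpleLink(URI source, URI destination, String context) {
            this.source = source;
            this.destination = destination;
            this.context = context;
        }

        @Override
        public String toString() {
            return source + " -> " + destination + " [" + context + "]";
        }
    }

    // Assumed pattern: group 1 = element name, group 2 = attribute (href or src),
    // group 3 = quoted attribute value. Real-world HTML needs far more care.
    private static final Pattern ATTR = Pattern.compile(
            "<(\\w+)[^>]*?\\s(href|src)\\s*=\\s*[\"']([^\"'>\\s]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Finds href/src values in the content and resolves them against the source URI. */
    public static List<SimpleLink> extract(String sourceUri, CharSequence html) {
        URI base = URI.create(sourceUri);
        List<SimpleLink> links = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            try {
                URI destination = base.resolve(m.group(3));
                String context = m.group(1).toLowerCase() + "/@" + m.group(2).toLowerCase();
                links.add(new SimpleLink(base, destination, context));
            } catch (IllegalArgumentException e) {
                // Skip candidate strings that are not syntactically valid URIs.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"../b.html\">b</a> <img src='/img/c.png'>";
        for (SimpleLink link : extract("http://example.com/a/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

Keeping the context alongside the destination is what lets a later stage tell a navigational link from an embedded resource. In this sketch the context is just a string built from the element and attribute names.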

Enum Summary
Enum	Description
Hop	The kind of "hop" from one URI to another.