Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.
Extracts links from the fetched content of a URI, as opposed to its headers.
Abstract base class for unit testing ContentExtractor implementations.
Overwrite action tags, that may hold URI, to use
Extracts links from fetched URIs.
This extractor parses URIs from CSS files.
This class allows the caller to extract href-style links from Word 97-format Word documents.
Basic link-extraction, from an HTML content-body, using regular expressions.
Extracts URIs from HTTP response headers.
An extractor for finding 'implied' URIs inside other URIs.
An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
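As an illustration of the regex-to-outlink idea described above, the following is a minimal sketch, not the class's actual implementation. The class name `RegexOutlinkSketch`, the method `extract`, and the `${n}` template syntax are all hypothetical; only `java.util.regex` and `java.net.URI` from the standard library are used.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and template syntax are assumptions, not the crawler's API.
class RegexOutlinkSketch {

    /**
     * Finds every match of the pattern in the content, substitutes the match
     * groups into the template (${0} is the whole match, ${1} the first group,
     * and so on), and resolves the resulting string against the base URI.
     */
    static List<URI> extract(URI base, CharSequence content, Pattern pattern, String template) {
        List<URI> outlinks = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            String candidate = template;
            for (int i = 0; i <= m.groupCount(); i++) {
                String group = m.group(i);
                candidate = candidate.replace("${" + i + "}", group == null ? "" : group);
            }
            try {
                outlinks.add(base.resolve(candidate));
            } catch (IllegalArgumentException e) {
                // The constructed string is not a syntactically valid URI; skip it.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<URI> found = extract(
                URI.create("http://example.com/a/index.html"),
                "id=42 id=7",
                Pattern.compile("id=(\\d+)"),
                "item/${1}.html");
        found.forEach(System.out::println);
    }
}
```

Resolving against the base URI keeps relative templates usable; an absolute template would simply replace the base.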
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
Extracts URIs from SWF (flash/shockwave) files.
A last-ditch extractor that will look at the raw bytes and try to extract anything that looks like a link.
An extractor for finding URIs inside other URIs.
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
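The two locations named above (attribute values, and elements whose only content is whitespace + HTTP URI + whitespace) can be approximated with a single regular expression. This is a simplified sketch under that assumption; the class name `XmlUriSketch` and the exact pattern are hypothetical and may differ from what the real extractor uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the pattern below is an illustration, not the extractor's own.
class XmlUriSketch {

    // A quote or '>' opens the candidate, optional whitespace surrounds the
    // HTTP(S) URI, and a quote or '<' closes it. Text such as
    // "<p>see http://... here</p>" is skipped because other words intervene.
    private static final Pattern URI_IN_XML = Pattern.compile(
            "(?i)[\"'>]\\s*(https?://[^\\s\"'<>]+)\\s*[\"'<]");

    static List<String> extract(CharSequence xml) {
        List<String> uris = new ArrayList<>();
        Matcher m = URI_IN_XML.matcher(xml);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }

    public static void main(String[] args) {
        String xml = "<rss><link> http://example.com/feed </link>"
                + "<img src=\"http://example.com/i.png\"/></rss>";
        extract(xml).forEach(System.out::println);
    }
}
```

A regex pass like this trades precision for speed and tolerance of malformed feeds; it does not validate that the input is well-formed XML.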
XPath-like context for HTML discovered URIs.
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
Improved link-extraction from an HTML content-body using jericho-html parser.
Link represents one discovered "edge" of the web graph: the source URI, the destination URI, and the type of reference (represented by the context in which it was found).
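To make the "edge" description concrete, here is a minimal sketch of such a value object. The class name `LinkSketch` and its field names are hypothetical, and the context is reduced to a plain string for brevity; the real Link class may carry more information.

```java
import java.net.URI;

// Hypothetical sketch of a web-graph edge: source, destination, and discovery context.
class LinkSketch {
    final URI source;       // page on which the reference was found
    final URI destination;  // URI the reference points to
    final String context;   // where it was found, e.g. an element/attribute path

    LinkSketch(URI source, URI destination, String context) {
        this.source = source;
        this.destination = destination;
        this.context = context;
    }

    @Override
    public String toString() {
        return source + " -> " + destination + " [" + context + "]";
    }

    public static void main(String[] args) {
        System.out.println(new LinkSketch(
                URI.create("http://example.com/"),
                URI.create("http://example.com/about"),
                "a/@href"));
    }
}
```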
The context of link discovery.
Class for representing handy default LinkContext values.
Supports PDF parsing operations.
Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
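The digest comparison described above can be sketched as follows. `FetchedPage`, its fields, and `suppressIfLikelyTrap` are hypothetical stand-ins for the crawler's per-URI state, introduced only for illustration; the real processor may apply further conditions.

```java
import java.util.Arrays;

// Hypothetical sketch: types and names are assumptions, not the crawler's API.
class TrapSuppressSketch {

    /** Stand-in for per-URI crawl state. */
    static final class FetchedPage {
        final byte[] contentDigest;
        final FetchedPage via;            // page on which this URI was discovered; null for a seed
        boolean linkExtractionFinished;   // when true, later extractors should skip this page

        FetchedPage(byte[] contentDigest, FetchedPage via) {
            this.contentDigest = contentDigest;
            this.via = via;
        }
    }

    /**
     * Marks link extraction as finished when the page's content digest equals
     * that of its 'via', which suggests a trap serving the same body under
     * ever-new URIs. Returns true if extraction was suppressed.
     */
    static boolean suppressIfLikelyTrap(FetchedPage page) {
        if (page.via == null || page.contentDigest == null || page.via.contentDigest == null) {
            return false; // nothing to compare against
        }
        if (Arrays.equals(page.contentDigest, page.via.contentDigest)) {
            page.linkExtractionFinished = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FetchedPage parent = new FetchedPage(new byte[] {1, 2, 3}, null);
        FetchedPage child = new FetchedPage(new byte[] {1, 2, 3}, parent);
        System.out.println(suppressIfLikelyTrap(child));
    }
}
```

Placing such a check ahead of the real extractors means a suspected trap page is still recorded but contributes no new outlinks.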
The kind of "hop" from one URI to another.
Copyright © 2003-2014 Internet Archive. All Rights Reserved.