Interface | Description |
---|---|
ExtractorParameters | Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object. |
TempDirProvider | |
UriErrorLoggerModule | |

Class | Description |
---|---|
AggressiveExtractorHTML | Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regex, and then by the speculative JavaScript link regex. |
ContentExtractor | Extracts links from the fetched content of a URI, as opposed to its headers. |
ContentExtractorTestBase | Abstract base class for unit testing ContentExtractor implementations. |
CustomSWFTags | Overrides the action tags that may hold URIs so that they use CrawlUriSWFAction. |
Extractor | Extracts links from fetched URIs. |
ExtractorCSS | Extracts URIs from CSS files. |
ExtractorDOC | Allows the caller to extract href-style links from Word 97-format Word documents. |
ExtractorHTML | Basic link extraction from an HTML content body, using regular expressions. |
ExtractorHTTP | Extracts URIs from HTTP response headers. |
ExtractorImpliedURI | An extractor for finding 'implied' URIs inside other URIs. |
ExtractorJS | Processes JavaScript files for strings that are likely to be crawlable URIs. |
ExtractorMultipleRegex | An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings. |
ExtractorPDF | Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs. |
ExtractorSWF | Extracts URIs from SWF (Flash/Shockwave) files. |
ExtractorUniversal | A last-ditch extractor that looks at the raw byte code and tries to extract anything that looks like a link. |
ExtractorURI | An extractor for finding URIs inside other URIs. |
ExtractorXML | A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents). |
HTMLLinkContext | XPath-like context for HTML-discovered URIs. |
HTTPContentDigest | A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors. |
JerichoExtractorHTML | Improved link extraction from an HTML content body, using the jericho-html parser. |
Link | Link represents one discovered "edge" of the web graph: the source URI, the destination URI, and the type of reference (represented by the context in which it was found). |
LinkContext | The context of link discovery. |
LinkContext.SimpleLinkContext | Class for representing handy default LinkContext values. |
PDFParser | Supports PDF parsing operations. |
StringExtractorTestBase | |
StringExtractorTestBase.TestData | |
TrapSuppressExtractor | Pseudo-extractor that suppresses link extraction of likely trap pages, by noticing when the content's digest is identical to that of its 'via'. |

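
Several of the classes above are described as finding links with regular expressions. The following is a minimal, self-contained sketch of that general technique, not the code of ExtractorHTML or any other class in this package: the class name `RegexLinkSketch`, the method `extractLinks`, and the pattern itself are all hypothetical and deliberately naive (quoted `href` and `src` attribute values only).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this class does not exist in the package.
class RegexLinkSketch {
    // Assumed pattern: a quoted href or src attribute value, case-insensitive.
    // Group 1 is the opening quote, group 2 is the attribute value.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the raw attribute values in document order, without resolving
    // relative references against a base URI.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            String value = m.group(2).trim();
            if (!value.isEmpty()) {
                links.add(value);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">a</a><img SRC='/img/b.png'>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A regex pass like this is cheap and tolerant of malformed markup, but it can also match text that is not a real attribute, which is one reason a parser-based approach may be preferred where accuracy matters.
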
Enum | Description |
---|---|
Hop | The kind of "hop" from one URI to another. |

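
The Link and Hop descriptions above treat a discovered link as an edge of the web graph: a source URI, a destination URI, the context in which it was found, and the kind of hop. The sketch below shows one way such an edge could be modeled as an immutable value. Every name in it (`LinkEdgeSketch`, `Edge`, `HopKind`, and the enum constants) is an assumption made for illustration and does not reflect the actual definitions of Link, LinkContext, or Hop.

```java
import java.util.Objects;

// Hypothetical illustration only; these types are not the package's own.
class LinkEdgeSketch {
    // Assumed hop kinds; the real Hop enum may define different values.
    enum HopKind { NAVIGATION, EMBED, REDIRECT }

    // Immutable edge: where the link was found, where it points,
    // the context of discovery, and the kind of hop.
    static final class Edge {
        final String source;
        final String destination;
        final String context;
        final HopKind hop;

        Edge(String source, String destination, String context, HopKind hop) {
            this.source = Objects.requireNonNull(source);
            this.destination = Objects.requireNonNull(destination);
            this.context = Objects.requireNonNull(context);
            this.hop = Objects.requireNonNull(hop);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Edge)) return false;
            Edge other = (Edge) o;
            return source.equals(other.source)
                    && destination.equals(other.destination)
                    && context.equals(other.context)
                    && hop == other.hop;
        }

        @Override
        public int hashCode() {
            return Objects.hash(source, destination, context, hop);
        }

        @Override
        public String toString() {
            return source + " -> " + destination + " [" + hop + " via " + context + "]";
        }
    }

    public static void main(String[] args) {
        Edge e = new Edge("http://example.com/", "http://example.com/logo.png",
                "img/@src", HopKind.EMBED);
        System.out.println(e);
    }
}
```

Giving the edge value semantics (`equals` and `hashCode` over all four fields) means duplicate discoveries of the same link in the same context collapse naturally when edges are stored in a set.
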
Copyright © 2003-2014 Internet Archive. All Rights Reserved.