Uses of Class org.archive.modules.extractor.Extractor (Heritrix 3 3.2.0 API)

Packages that use Extractor
Package	Description

org.archive.modules.extractor

org.archive.modules.forms

Uses of Extractor in org.archive.modules.extractor

Subclasses of Extractor in org.archive.modules.extractor
Modifier and Type	Class and Description
`class`	`AggressiveExtractorHTML` Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regex, and then by the JavaScript speculative link regex.
`class`	`ContentExtractor` Extracts links from the fetched content of a URI, as opposed to its headers.
`class`	`ExtractorCSS` Parses URIs from CSS files.
`class`	`ExtractorDOC` Allows the caller to extract href-style links from Word 97-format Word documents.
`class`	`ExtractorHTML` Basic link-extraction, from an HTML content-body, using regular expressions.
`class`	`ExtractorHTTP` Extracts URIs from HTTP response headers.
`class`	`ExtractorImpliedURI` An extractor for finding 'implied' URIs inside other URIs.
`class`	`ExtractorJS` Processes Javascript files for strings that are likely to be crawlable URIs.
`class`	`ExtractorMultipleRegex` An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
`class`	`ExtractorPDF` Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
`class`	`ExtractorSWF` Extracts URIs from SWF (flash/shockwave) files.
`class`	`ExtractorUniversal` A last-ditch extractor that will look at the raw byte code and try to extract anything that looks like a link.
`class`	`ExtractorURI` An extractor for finding URIs inside other URIs.
`class`	`ExtractorXML` A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
`class`	`JerichoExtractorHTML` Improved link-extraction from an HTML content-body using jericho-html parser.
`class`	`TrapSuppressExtractor` Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
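
Several of the subclasses above are described as finding links with regular expressions. The following is a minimal, self-contained sketch of that general idea using only `java.util.regex`. It is not the Heritrix implementation: the class `RegexLinkSketch`, the method `extractLinks`, and the pattern itself are hypothetical names and choices made for illustration, and the pattern only handles quoted `href`/`src` attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this class is not part of Heritrix.
class RegexLinkSketch {
    // Case-insensitive match of href="..." or src='...' (quoted values only).
    // Group 1 is the opening quote, group 2 is the attribute value.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the raw attribute values in document order, without resolving
    // them against a base URI.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">a</a><img SRC='/b.png'>";
        System.out.println(extractLinks(html)); // [http://example.com/a, /b.png]
    }
}
```

A real extractor would also need to resolve relative links and cope with unquoted attributes, which this sketch deliberately leaves out.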

Fields in org.archive.modules.extractor declared as Extractor
Modifier and Type	Field and Description
`protected Extractor`	ContentExtractorTestBase.`extractor` An extractor created during the setUp.

Methods in org.archive.modules.extractor that return Extractor
Modifier and Type	Method and Description
`protected abstract Extractor`	ContentExtractorTestBase.`makeExtractor()` Subclasses should return an Extractor instance to test.
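
The `extractor` field and the abstract `makeExtractor()` method listed above suggest a factory-method arrangement: the base class owns the field, and each concrete test supplies the instance. The sketch below shows that shape with stand-in types, assuming `setUp` simply stores whatever the factory returns. `LinkFinder`, `FinderTestBase`, and `HttpFinderTest` are hypothetical names, not Heritrix classes, and the real `ContentExtractorTestBase` may do more work during setup.

```java
// Hypothetical illustration only; none of these types are part of Heritrix.
class TestBaseSketch {
    // Stand-in for the component under test.
    interface LinkFinder {
        int countLinks(String content);
    }

    // Base class: owns the field and fills it from the factory method.
    abstract static class FinderTestBase {
        protected LinkFinder finder;

        // Subclasses return the instance they want tested.
        protected abstract LinkFinder makeFinder();

        void setUp() {
            finder = makeFinder();
        }
    }

    // Concrete test: only decides which instance to create.
    static class HttpFinderTest extends FinderTestBase {
        @Override
        protected LinkFinder makeFinder() {
            // Counts occurrences of the literal "http://" in the content.
            return content -> content.split("http://", -1).length - 1;
        }
    }

    public static void main(String[] args) {
        HttpFinderTest t = new HttpFinderTest();
        t.setUp();
        System.out.println(t.finder.countLinks("see http://a and http://b")); // 2
    }
}
```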

Methods in org.archive.modules.extractor with parameters of type Extractor
Modifier and Type	Method and Description
`protected boolean`	ExtractorJS.`considerString(Extractor ext, CrawlURI curi, boolean handlingJSFile, String candidate)`
`long`	ExtractorJS.`considerStrings(Extractor ext, CrawlURI curi, CharSequence cs)`
`long`	ExtractorJS.`considerStrings(Extractor ext, CrawlURI curi, CharSequence cs, boolean handlingJSFile)`
`static long`	ExtractorCSS.`processStyleCode(Extractor ext, CrawlURI curi, CharSequence cs)`
`static long`	ExtractorXML.`processXml(Extractor ext, CrawlURI curi, CharSequence cs)`
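
The static helpers above take the content as a `CharSequence` and return a `long`. As a rough sketch of what a helper with that shape could look like, the code below scans CSS text for `url(...)` references, hands each one to a caller-supplied list, and returns how many it found. Treating the return value as a count is an assumption here, as are the names `CssUrlSketch` and `collectCssUrls` and the regular expression; the sketch uses a plain `List<String>` in place of the `Extractor` and `CrawlURI` parameters and does not reproduce `ExtractorCSS.processStyleCode`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this class is not part of Heritrix.
class CssUrlSketch {
    // Matches url(x), url("x"), and url('x'); group 2 is the reference.
    private static final Pattern CSS_URL = Pattern.compile(
            "(?i)url\\(\\s*([\"']?)([^\"')\\s]+)\\1\\s*\\)");

    // Appends every url(...) reference to sink and returns the number found
    // (the count semantics are an assumption for this sketch).
    static long collectCssUrls(CharSequence css, List<String> sink) {
        long found = 0;
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            sink.add(m.group(2));
            found++;
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        String css = "body { background: url(\"img/bg.png\"); } @import url(print.css);";
        long n = collectCssUrls(css, sink);
        System.out.println(n + " " + sink); // 2 [img/bg.png, print.css]
    }
}
```

Keeping the helper static and passing the destination in as a parameter lets other callers reuse the scan without subclassing, which may be why the listed methods accept an `Extractor` argument rather than relying on `this`.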

Constructors in org.archive.modules.extractor with parameters of type Extractor
Constructor and Description
`ExtractorSWF.CrawlUriSWFAction(CrawlURI curi, Extractor ext)`

Uses of Extractor in org.archive.modules.forms

Subclasses of Extractor in org.archive.modules.forms
Modifier and Type	Class and Description
`class`	`ExtractorHTMLForms` Extracts extra information about FORMs in HTML, loading this into the CrawlURI (for potential later use by FormLoginProcessor) and adding a small annotation to the crawl.log.

Copyright © 2003-2014 Internet Archive. All Rights Reserved.