Uses of Class org.archive.modules.extractor.ContentExtractor (Heritrix 3 3.2.0 API)

Packages that use ContentExtractor
Package	Description
org.archive.modules.extractor

Subclasses of ContentExtractor in org.archive.modules.extractor
Modifier and Type	Class and Description
`class`	`AggressiveExtractorHTML` Extended version of ExtractorHTML with more aggressive JavaScript link extraction, in which JavaScript code is parsed first with the general HTML tag regex and then with the speculative JavaScript link regex.
`class`	`ExtractorCSS` Extracts URIs from CSS files.
`class`	`ExtractorDOC` Allows the caller to extract href-style links from Word 97-format Word documents.
`class`	`ExtractorHTML` Basic link-extraction, from an HTML content-body, using regular expressions.
`class`	`ExtractorJS` Processes JavaScript files for strings that are likely to be crawlable URIs.
`class`	`ExtractorPDF` Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
`class`	`ExtractorSWF` Extracts URIs from SWF (Flash/Shockwave) files.
`class`	`ExtractorUniversal` A last-ditch extractor that looks at the raw bytes of the content and tries to extract anything that looks like a link.
`class`	`ExtractorXML` A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
`class`	`JerichoExtractorHTML` Improved link-extraction from an HTML content-body using the jericho-html parser.
`class`	`TrapSuppressExtractor` Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.
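
The `ExtractorXML` description above names two places where URIs are looked for: attribute values, and simple elements whose content is only whitespace + HTTP URI + whitespace. The following is a minimal, self-contained sketch of that rule using plain `java.util.regex`. It is not the Heritrix implementation and does not use any Heritrix class; the class name `SimpleXmlUriSketch`, the method `extract`, and both regular expressions are hypothetical and exist only for illustration. Accepting `https` as well as `http` is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical stand-in, not part of Heritrix.
class SimpleXmlUriSketch {

    // A quoted attribute value that consists solely of an HTTP(S) URI.
    private static final Pattern ATTRIBUTE_URI =
            Pattern.compile("=\\s*([\"'])(https?://[^\"'\\s<>]+)\\1");

    // Element content that is only whitespace + URI + whitespace.
    private static final Pattern ELEMENT_URI =
            Pattern.compile(">\\s*(https?://[^\\s<>]+)\\s*<");

    // Returns attribute-value hits first, then simple-element hits.
    static List<String> extract(String xml) {
        List<String> found = new ArrayList<>();
        Matcher attribute = ATTRIBUTE_URI.matcher(xml);
        while (attribute.find()) {
            found.add(attribute.group(2));
        }
        Matcher element = ELEMENT_URI.matcher(xml);
        while (element.find()) {
            found.add(element.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<rss><link> http://example.com/feed </link>"
                + "<item url=\"http://example.com/a.png\">"
                + "see http://example.com/ignored here</item></rss>";
        // The URI embedded in running text is skipped, because the
        // element content is not a bare URI.
        System.out.println(extract(xml));
    }
}
```

In this sketch a URI surrounded by other text inside an element is deliberately not reported, which mirrors the "simple elements" restriction in the description. A real extractor would also need to handle character encodings, entity escapes, and relative URIs, none of which are attempted here.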
