Package | Description |
---|---|
org.archive.modules.extractor |
Modifier and Type | Class and Description |
---|---|
class | AggressiveExtractorHTML: Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regex, and then with the speculative JavaScript link regex. |
class | ExtractorCSS: Extracts URIs from CSS files. |
class | ExtractorDOC: Allows the caller to extract href-style links from Word 97-format Word documents. |
class | ExtractorHTML: Basic link extraction from an HTML content body, using regular expressions. |
class | ExtractorJS: Processes JavaScript files for strings that are likely to be crawlable URIs. |
class | ExtractorPDF: Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs. |
class | ExtractorSWF: Extracts URIs from SWF (Flash/Shockwave) files. |
class | ExtractorUniversal: A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link. |
class | ExtractorXML: A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents). |
class | JerichoExtractorHTML: Improved link extraction from an HTML content body, using the jericho-html parser. |
class | TrapSuppressExtractor: Pseudo-extractor that suppresses link extraction on likely trap pages, by noticing when the content's digest is identical to that of its 'via'. |
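
The ExtractorXML entry describes a specific matching rule: HTTP URIs that make up a whole attribute value, or that are the only non-whitespace content of an element. The following is a minimal, standalone sketch of that rule and not the Heritrix implementation. The class name `XmlUriSketch`, the method `extract`, and the regular expression are all hypothetical, and accepting `https` as well as `http` is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.archive.modules.extractor.
public class XmlUriSketch {
    // Assumed pattern (not the one Heritrix uses):
    //   group 1: a quoted attribute value that is entirely an http(s) URI
    //   group 2: element content that is only whitespace + URI + whitespace
    private static final Pattern URI_IN_XML = Pattern.compile(
        "=\\s*[\"'](https?://[^\"'\\s<>]+)[\"']"
        + "|>\\s*(https?://[^\\s<>\"']+)\\s*</");

    // Returns candidate URIs in document order.
    public static List<String> extract(CharSequence xml) {
        List<String> found = new ArrayList<>();
        Matcher m = URI_IN_XML.matcher(xml);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            found.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<rss><channel>"
            + "<link> http://example.com/feed </link>"
            + "<enclosure url=\"http://example.com/a.mp3\"/>"
            + "<title>See http://example.com/ignored for more</title>"
            + "</channel></rss>";
        // The URI inside <title> is skipped because the element
        // contains other text besides the URI.
        for (String uri : extract(xml)) {
            System.out.println(uri);
        }
    }
}
```

A regex keeps the sketch short and tolerant of malformed feeds, at the cost of ignoring XML details such as CDATA sections and entity escapes.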
Copyright © 2003-2014 Internet Archive. All Rights Reserved.