Package | Description |
---|---|
org.archive.modules.extractor | |
org.archive.modules.forms | |
Modifier and Type | Class and Description |
---|---|
class |
AggressiveExtractorHTML
Extended version of ExtractorHTML with more aggressive JavaScript link
extraction, in which JavaScript code is parsed first with the general HTML tags
regex, and then with the speculative JavaScript link regex.
|
class |
ContentExtractor
Extracts links from the fetched content of a URI, as opposed to its headers.
|
class |
ExtractorCSS
Parses URIs from CSS files.
|
class |
ExtractorDOC
Allows the caller to extract href-style links from Word 97-format documents.
|
class |
ExtractorHTML
Basic link-extraction, from an HTML content-body,
using regular expressions.
|
class |
ExtractorHTTP
Extracts URIs from HTTP response headers.
|
class |
ExtractorImpliedURI
An extractor for finding 'implied' URIs inside other URIs.
|
class |
ExtractorJS
Processes JavaScript files for strings that are likely to be
crawlable URIs.
|
class |
ExtractorMultipleRegex
An extractor that uses regular expressions to find strings in the fetched
content of a URI, and constructs outlink URIs from those strings.
|
class |
ExtractorPDF
Allows the caller to process a CrawlURI representing a PDF
for the purpose of extracting URIs.
|
class |
ExtractorSWF
Extracts URIs from SWF (Flash/Shockwave) files.
|
class |
ExtractorUniversal
A last-ditch extractor that looks at the raw byte code and tries to extract
anything that looks like a link.
|
class |
ExtractorURI
An extractor for finding URIs inside other URIs.
|
class |
ExtractorXML
A simple extractor which finds HTTP URIs inside XML/RSS files,
inside attribute values and simple elements (those with only
whitespace + HTTP URI + whitespace as contents).
|
class |
JerichoExtractorHTML
Improved link-extraction from an HTML content-body, using the jericho-html parser.
|
class |
TrapSuppressExtractor
Pseudo-extractor that suppresses link-extraction of likely trap pages,
by noticing when the content's digest is identical to that of its 'via'.
|
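The digest comparison described for TrapSuppressExtractor can be illustrated with a minimal, self-contained sketch. It does not use Heritrix: the class `TrapSuppressSketch`, its method names, and the choice of SHA-1 over a UTF-8 body are all assumptions made for illustration, not the library's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for illustration; not the Heritrix TrapSuppressExtractor.
class TrapSuppressSketch {

    /** Digest of a page body. SHA-1 is an assumption made for this sketch. */
    static byte[] digest(String body) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(body.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True when the page's digest equals the digest of the page it was reached from ('via'). */
    static boolean shouldSuppress(byte[] pageDigest, byte[] viaDigest) {
        return viaDigest != null && MessageDigest.isEqual(pageDigest, viaDigest);
    }
}
```

A caller would skip link extraction whenever `shouldSuppress` returns true, on the theory that a page identical to its referrer is likely part of a crawler trap.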
Modifier and Type | Field and Description |
---|---|
protected Extractor |
ContentExtractorTestBase.extractor
An extractor created during setUp.
|
Modifier and Type | Method and Description |
---|---|
protected abstract Extractor |
ContentExtractorTestBase.makeExtractor()
Subclasses should return an Extractor instance to test.
|
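The `extractor` field and the abstract `makeExtractor()` method together suggest a template-method arrangement: the base class calls the factory during setup and stores the result. The sketch below shows that pattern with hypothetical stand-in types (`ExtractorSketch`, `ExtractorTestBaseSketch`, `CssExtractorTestSketch`); it assumes nothing about the real ContentExtractorTestBase beyond the two members listed above.

```java
// Hypothetical stand-ins for illustration; these are not the Heritrix classes.
abstract class ExtractorSketch {
    abstract String name();
}

abstract class ExtractorTestBaseSketch {
    // Populated during setUp, mirroring the field description above.
    protected ExtractorSketch extractor;

    // Subclasses return the instance to test.
    protected abstract ExtractorSketch makeExtractor();

    void setUp() {
        extractor = makeExtractor();
    }
}

class CssExtractorTestSketch extends ExtractorTestBaseSketch {
    @Override
    protected ExtractorSketch makeExtractor() {
        // A trivial anonymous extractor, standing in for a real one.
        return new ExtractorSketch() {
            @Override
            String name() {
                return "css";
            }
        };
    }
}
```

Each concrete test class then only has to say which extractor it exercises; shared setup stays in the base class.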
Modifier and Type | Method and Description |
---|---|
protected boolean |
ExtractorJS.considerString(Extractor ext, CrawlURI curi, boolean handlingJSFile, String candidate) |
long |
ExtractorJS.considerStrings(Extractor ext, CrawlURI curi, CharSequence cs) |
long |
ExtractorJS.considerStrings(Extractor ext, CrawlURI curi, CharSequence cs, boolean handlingJSFile) |
static long |
ExtractorCSS.processStyleCode(Extractor ext, CrawlURI curi, CharSequence cs) |
static long |
ExtractorXML.processXml(Extractor ext, CrawlURI curi, CharSequence cs) |
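Several of the methods above take a `CharSequence` and return a `long`. The listing does not say what that value means; the sketch below assumes it is a count of links found. It is a simplified, Heritrix-free illustration of a static helper in that shape: the class `StyleCodeSketch`, its two-argument signature, and the `url(...)` regex are all hypothetical and far less thorough than a production extractor would need.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not ExtractorCSS.
class StyleCodeSketch {

    // Simplified pattern for url(...) tokens; an assumption, not Heritrix's regex.
    private static final Pattern CSS_URL = Pattern.compile(
            "url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    /** Adds each url(...) target found in cs to out and returns how many were found. */
    static long processStyleCode(List<String> out, CharSequence cs) {
        long count = 0;
        Matcher m = CSS_URL.matcher(cs);
        while (m.find()) {
            out.add(m.group(1));
            count++;
        }
        return count;
    }
}
```

Making the helper static lets other code (for example, an HTML extractor that meets an inline style block) reuse the same scanning logic without constructing a separate extractor; that is one plausible reason for the static signatures listed above.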
Constructor and Description |
---|
ExtractorSWF.CrawlUriSWFAction(CrawlURI curi, Extractor ext) |
Modifier and Type | Class and Description |
---|---|
class |
ExtractorHTMLForms
Extracts extra information about FORMs in HTML, loading this
into the CrawlURI (for potential later use by FormLoginProcessor)
and adding a small annotation to the crawl.log.
|
Copyright © 2003-2014 Internet Archive. All Rights Reserved.