ExtractorJS (Heritrix 3 3.2.0 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorJS

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorJS
extends ContentExtractor
```
Processes JavaScript files for strings that are likely to be crawlable URIs.

NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder without closing it, so that later processors in the sequence can reuse it. In the usual (Heritrix) case, a call to the Recorder's endReplays() method after all processing ensures timely close of any reused ReplayCharSequences. Reuse of this processor elsewhere should ensure that a similar cleanup call to Recorder.endReplays() occurs.

TODO: Replace with a system for actually executing JavaScript in a browser-workalike DOM, such as via HtmlUnit or remote-controlled browser engines.
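The cleanup requirement in the NOTE can be illustrated with a small, self-contained sketch. It does not use the Heritrix classes: `ReplaySource` is a hypothetical stand-in for a Recorder, and `getReplay()` and `isOpen()` are invented for illustration. Only the name `endReplays()` comes from the text above. The point is the shape of the pattern: processors share one open replay, and a single cleanup call in a `finally` block closes it after all processing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a recorder that hands out a reusable replay sequence. */
class ReplaySource {
    private final String content;
    private final List<String> log = new ArrayList<>();
    private boolean open = false;

    ReplaySource(String content) {
        this.content = content;
    }

    /** Opens the replay on first use and reuses it afterwards; callers do not close it. */
    CharSequence getReplay() {
        if (!open) {
            open = true;
            log.add("open");
        }
        return content;
    }

    /** Closes any replay left open by processors; safe to call more than once. */
    void endReplays() {
        if (open) {
            open = false;
            log.add("close");
        }
    }

    boolean isOpen() {
        return open;
    }

    List<String> log() {
        return log;
    }
}

class ReplayCleanupDemo {
    /** Runs two "processors" that share one replay, then always cleans up. */
    static int processAll(ReplaySource source) {
        try {
            int first = source.getReplay().length();  // first processor opens the replay
            int second = source.getReplay().length(); // second processor reuses it
            return first + second;
        } finally {
            source.endReplays(); // single cleanup call after all processing
        }
    }

    public static void main(String[] args) {
        ReplaySource source = new ReplaySource("var u = 'http://example.com/';");
        System.out.println(processAll(source));
        System.out.println(source.log());
    }
}
```

Putting the cleanup in `finally` means the replay is closed even if one of the processors throws, which is the behavior the NOTE asks callers outside Heritrix to reproduce.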

Contributor:

gojomo, nlevitt

Field Summary

Fields
Modifier and Type	Field and Description
`protected static String`	`JAVASCRIPT_STRING_EXTRACTOR`
`protected long`	`numberOfCURIsHandled`
- Fields inherited from class org.archive.modules.extractor.Extractor
  DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
- Fields inherited from class org.archive.modules.Processor
  beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description

ExtractorJS()

Method Summary

Methods
Modifier and Type	Method and Description
`protected boolean`	`considerString(Extractor ext, CrawlURI curi, boolean handlingJSFile, String candidate)`
`protected long`	`considerStrings(CrawlURI curi, CharSequence cs)`
`long`	`considerStrings(Extractor ext, CrawlURI curi, CharSequence cs)`
`long`	`considerStrings(Extractor ext, CrawlURI curi, CharSequence cs, boolean handlingJSFile)`
`protected boolean`	`innerExtract(CrawlURI curi)` Actually extracts links.
`protected boolean`	`shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - JAVASCRIPT_STRING_EXTRACTOR
```
protected static final String JAVASCRIPT_STRING_EXTRACTOR
```
    See Also:
    Constant Field Values
  - numberOfCURIsHandled
```
protected long numberOfCURIsHandled
```
- Constructor Detail
  - ExtractorJS
```
public ExtractorJS()
```
- Method Detail
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI uri)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    uri - the URI to check
    
    Returns:
    true if links should be extracted from that URI, false otherwise
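As a rough illustration of the kind of additional check a subclass might perform here, the sketch below decides whether a response looks like JavaScript. It is an assumption-laden stand-in, not the ExtractorJS implementation: the class `JsContentCheck`, the method `looksLikeJavaScript`, and the specific rules (a Content-Type mentioning `javascript` or `ecmascript`, or a URI path ending in `.js`) are all hypothetical, and it takes plain strings rather than a CrawlURI.

```java
import java.util.Locale;

class JsContentCheck {
    /**
     * Illustrative check only: accepts a response when its Content-Type
     * mentions a script type, or when its URI path ends in ".js".
     */
    static boolean looksLikeJavaScript(String contentType, String uri) {
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        if (type.contains("javascript") || type.contains("ecmascript")) {
            return true;
        }
        if (uri == null) {
            return false;
        }
        // Strip query string and fragment before testing the extension.
        String path = uri;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".js");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeJavaScript("application/javascript", "http://example.com/a"));
        System.out.println(looksLikeJavaScript("text/html", "http://example.com/index.html"));
    }
}
```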
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Actually extracts links. The given URI will have passed the three checks described in ContentExtractor's shouldProcess method. Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.
    
    This method should return true only if extraction completed successfully. If it did not (for instance, if an IO error occurred), this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    curi - the URI whose links to extract
    
    Returns:
    true if link extraction finished; false if downstream extractors should attempt to extract links
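The return-value contract described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not Heritrix code: `LinkFinder`, `runChain`, and the string-based signatures are hypothetical, and the real pipeline works on CrawlURI objects. It shows only the documented behavior: an extractor reports true when it finished, false when it failed (here, on an `IOException`), and downstream extractors run only until one reports success.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractChainDemo {
    /** Hypothetical, simplified extractor: adds links to out and may fail with an IOException. */
    interface LinkFinder {
        void find(String content, List<String> out) throws IOException;
    }

    /** Mirrors the documented contract: true only when extraction completed. */
    static boolean innerExtract(LinkFinder finder, String content, List<String> out) {
        try {
            finder.find(content, out);
            return true;
        } catch (IOException e) {
            return false; // signal that a downstream extractor should try
        }
    }

    /**
     * Runs finders in order and stops after the first one that reports success.
     * Returns the index of that finder, or -1 if none finished.
     */
    static int runChain(List<LinkFinder> finders, String content, List<String> out) {
        for (int i = 0; i < finders.size(); i++) {
            if (innerExtract(finders.get(i), content, out)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        LinkFinder failing = (content, links) -> {
            throw new IOException("simulated read failure");
        };
        LinkFinder working = (content, links) -> links.add(content.trim());
        int finishedBy = runChain(Arrays.asList(failing, working), " http://example.com/ ", out);
        System.out.println(finishedBy + " " + out);
    }
}
```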
  - considerStrings
```
protected long considerStrings(CrawlURI curi,
                   CharSequence cs)
```
  - considerStrings
```
public long considerStrings(Extractor ext,
                   CrawlURI curi,
                   CharSequence cs)
```
  - considerStrings
```
public long considerStrings(Extractor ext,
                   CrawlURI curi,
                   CharSequence cs,
                   boolean handlingJSFile)
```
  - considerString
```
protected boolean considerString(Extractor ext,
                     CrawlURI curi,
                     boolean handlingJSFile,
                     String candidate)
```
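The considerStrings and considerString signatures above carry no descriptions. Going only by the class description ("strings that are likely to be crawlable URIs"), one plausible shape for such a scan is sketched below. Everything in it is an assumption: the class `JsStringScan`, the methods `scanStrings` and `acceptCandidate`, the `QUOTED` pattern (which is not the value of JAVASCRIPT_STRING_EXTRACTOR), and the URI heuristic are illustrative stand-ins that work on plain strings rather than on Extractor and CrawlURI objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsStringScan {
    // Illustrative pattern only: a single- or double-quoted literal containing
    // no quote, backslash, or line break, closed by the same quote character.
    private static final Pattern QUOTED =
            Pattern.compile("([\"'])([^\"'\\\\\\r\\n]+)\\1");

    /** Crude heuristic: no whitespace, and an absolute http(s) URL or a path-like prefix. */
    static boolean looksLikeUri(String s) {
        if (s.isEmpty() || s.chars().anyMatch(Character::isWhitespace)) {
            return false;
        }
        return s.startsWith("http://") || s.startsWith("https://")
                || s.startsWith("/") || s.startsWith("./") || s.startsWith("../");
    }

    /** Records one candidate if it passes the heuristic; returns whether it was kept. */
    static boolean acceptCandidate(String candidate, List<String> found) {
        if (!looksLikeUri(candidate)) {
            return false;
        }
        found.add(candidate);
        return true;
    }

    /** Scans all quoted literals in cs and returns how many were kept as likely URIs. */
    static long scanStrings(CharSequence cs, List<String> found) {
        long kept = 0;
        Matcher m = QUOTED.matcher(cs);
        while (m.find()) {
            if (acceptCandidate(m.group(2), found)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        String js = "var a = \"http://example.com/x.js\"; load('/img/logo.png'); var c = 'hello world';";
        System.out.println(scanStrings(js, found) + " " + found);
    }
}
```

Returning a count from the scan and a boolean from the per-candidate check matches the `long` and `boolean` return types in the signatures above, but the delegation between the real overloads is not documented here and is not implied by this sketch.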
