ExtractorSWF (Heritrix 3 3.2.0 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorSWF

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorSWF
extends ContentExtractor
```
Extracts URIs from SWF (Flash/Shockwave) files. For testing, this SWF file has links embedded in it: http://www.hitspring.com/index.swf.

Author:

Igor Ranitovic

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`class`	`ExtractorSWF.CrawlUriSWFAction` SWF action that handles discovered URIs.
`protected class`	`ExtractorSWF.ExtractorTagParser` TagParser customized to ignore SWFTags that will never contain extractable URIs.

Field Summary

Fields
Modifier and Type	Field and Description
`protected ExtractorJS`	`extractorJS` JavaScript extractor to use to process inline JavaScript.
`protected static String`	`JSSTRING`

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorSWF()`

Method Summary

Methods
Modifier and Type	Method and Description
`ExtractorJS`	`getExtractorJS()`
`protected boolean`	`innerExtract(CrawlURI curi)` Actually extracts links.
`void`	`setExtractorJS(ExtractorJS extractorJS)`
`protected boolean`	`shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - JSSTRING
```
protected static final String JSSTRING
```
    See Also:
    Constant Field Values
  - extractorJS
```
protected transient ExtractorJS extractorJS
```
    JavaScript extractor used to process inline JavaScript. Autowired if available. If null, links are not extracted from inline JavaScript.
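
The null behavior described above can be sketched with an optional collaborator. This is only an illustration under stated assumptions: `OptionalJsExtractorSketch`, `linksFrom`, and the `Function`-typed field are hypothetical stand-ins rather than Heritrix types, and the `"javascript:"` prefix is an assumed value, since this page does not state what `JSSTRING` contains.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for illustration; these are not Heritrix classes. */
class OptionalJsExtractorSketch {
    // Assumed value: this page does not state what JSSTRING contains.
    static final String JS_PREFIX = "javascript:";

    // Plays the role of extractorJS: it may legitimately be null.
    private Function<String, List<String>> jsExtractor;

    void setJsExtractor(Function<String, List<String>> jsExtractor) {
        this.jsExtractor = jsExtractor;
    }

    /** Returns the links found in one URL-like value taken from a file. */
    List<String> linksFrom(String value) {
        if (!value.startsWith(JS_PREFIX)) {
            return Collections.singletonList(value); // ordinary link, kept as-is
        }
        if (jsExtractor == null) {
            return Collections.emptyList(); // no collaborator: inline script is skipped
        }
        // Hand the script body (prefix removed) to the collaborator.
        return new ArrayList<>(jsExtractor.apply(value.substring(JS_PREFIX.length())));
    }

    public static void main(String[] args) {
        OptionalJsExtractorSketch sketch = new OptionalJsExtractorSketch();
        System.out.println(sketch.linksFrom("javascript:open('/a.html')")); // prints []
        // Toy collaborator that pretends every script yields one fixed link.
        sketch.setJsExtractor(script -> Collections.singletonList("/a.html"));
        System.out.println(sketch.linksFrom("javascript:open('/a.html')")); // prints [/a.html]
    }
}
```

Treating a missing collaborator as "skip" rather than as an error keeps the extractor usable in configurations where no JavaScript extractor is defined.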
- Constructor Detail
  - ExtractorSWF
```
public ExtractorSWF()
```
- Method Detail
  - getExtractorJS
```
public ExtractorJS getExtractorJS()
```
  - setExtractorJS
```
public void setExtractorJS(ExtractorJS extractorJS)
```
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI uri)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    uri - the URI to check
    
    Returns:
    true if links should be extracted from that URI, false otherwise
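
As an illustration of the kind of additional check a subclass might perform, the sketch below accepts a resource when its content type or its URI suggests an SWF file. The exact rule used by `ExtractorSWF` is not given on this page, so the `x-shockwave-flash` substring test and the `.swf` suffix test are assumptions, and `SwfShouldExtractSketch` is a hypothetical, self-contained stand-in that takes plain strings instead of a `CrawlURI`.

```java
import java.util.Locale;

/** Hypothetical stand-in for illustration; not the Heritrix implementation. */
class SwfShouldExtractSketch {
    /**
     * Assumed rule: extract when the content type mentions Shockwave Flash
     * or the URI ends in ".swf" (both compared case-insensitively).
     */
    static boolean shouldExtract(String contentType, String uri) {
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        String path = uri.toLowerCase(Locale.ROOT);
        return type.contains("x-shockwave-flash") || path.endsWith(".swf");
    }

    public static void main(String[] args) {
        // Matches on content type.
        System.out.println(shouldExtract("application/x-shockwave-flash", "http://example.com/movie")); // prints true
        // Matches on the URI suffix even with a generic content type.
        System.out.println(shouldExtract("application/octet-stream", "http://example.com/INDEX.SWF")); // prints true
        // Neither signal present.
        System.out.println(shouldExtract("text/html", "http://example.com/index.html")); // prints false
    }
}
```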
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Actually extracts links. The given URI will have passed the three checks described in shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.
    This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    curi - the URI whose links to extract
    
    Returns:
    true if link extraction finished; false if downstream extractors should attempt to extract links
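
The return-value contract above can be modeled with a toy extractor chain. This is a simplified sketch, not the Heritrix pipeline: `ExtractorChainSketch`, `StubExtractor`, and `run` are hypothetical names, and the `finished` flag is an assumed way of representing "downstream extractors should be skipped".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the contract; these are not Heritrix types. */
class ExtractorChainSketch {
    /** Stub extractor that either finds one fixed link or fails. */
    static class StubExtractor {
        final String link;      // link this stub "discovers" on success
        final boolean succeeds; // value returned from innerExtract

        StubExtractor(String link, boolean succeeds) {
            this.link = link;
            this.succeeds = succeeds;
        }

        /** Returns true only if extraction completed successfully. */
        boolean innerExtract(List<String> outlinks) {
            if (succeeds) {
                outlinks.add(link);
            }
            return succeeds;
        }
    }

    /** Runs extractors in order until one reports that extraction finished. */
    static List<String> run(List<StubExtractor> chain) {
        List<String> outlinks = new ArrayList<>();
        boolean finished = false;
        for (StubExtractor extractor : chain) {
            if (finished) {
                continue; // an earlier extractor returned true: skip downstream ones
            }
            finished = extractor.innerExtract(outlinks);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<StubExtractor> chain = Arrays.asList(
                new StubExtractor("http://example.com/a", false), // fails, so the next one tries
                new StubExtractor("http://example.com/b", true),  // succeeds and ends extraction
                new StubExtractor("http://example.com/c", true)); // never runs
        System.out.println(run(chain)); // prints [http://example.com/b]
    }
}
```

In this model a `false` return leaves the flag unset, so the next extractor still gets its turn, which mirrors the fallback behavior described for an IO error.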
