ExtractorXML (Heritrix 3 3.2.0 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorXML

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorXML
extends ContentExtractor
```
A simple extractor that finds HTTP URIs in XML/RSS files, both in attribute values and in simple elements (those whose content is only whitespace + HTTP URI + whitespace).

NOTE: This processor may open a ReplayCharSequence from the CrawlURI's Recorder without closing it, so that later processors in the sequence can reuse it. In the usual (Heritrix) case, a call to the Recorder's endReplays() method after all processing ensures timely closing of any reused ReplayCharSequences. Code that reuses this processor elsewhere should make a similar cleanup call to Recorder.endReplays().
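
As a rough illustration of the two cases described above (URIs in attribute values, and URIs that are the sole content of an element), the following standalone sketch uses a single regular expression. The class name XmlUriSketch, the method findUris, and the pattern itself are assumptions made for this example. They are not taken from the Heritrix source, and the real extractor may match differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the Heritrix implementation. */
final class XmlUriSketch {
    // Group 1: a quoted attribute value that is an HTTP(S) URI.
    // Group 2: element content consisting only of whitespace + URI + whitespace.
    private static final Pattern XML_URIS = Pattern.compile(
        "[\"'](https?://[^\\s\"'<>]+)[\"']"
        + "|>\\s*(https?://[^\\s<>]+)\\s*<",
        Pattern.CASE_INSENSITIVE);

    /** Returns the URIs found in the given XML text, in document order. */
    static List<String> findUris(CharSequence xml) {
        List<String> found = new ArrayList<>();
        Matcher m = XML_URIS.matcher(xml);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            found.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return found;
    }
}
```

In this sketch, an element such as `<title>See http://example.com/c now</title>` yields nothing, because its content is not a bare URI surrounded only by whitespace.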

Contributor:

gojomo

Field Summary
- Fields inherited from class org.archive.modules.extractor.Extractor
  DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
- Fields inherited from class org.archive.modules.Processor
  beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorXML()`

Method Summary

Methods
Modifier and Type	Method and Description
`protected Charset`	`getContentDeclaredCharset(CrawlURI curi, String contentPrefix)`
`protected boolean`	`innerExtract(CrawlURI curi)` Actually extracts links.
`static long`	`processXml(Extractor ext, CrawlURI curi, CharSequence cs)`
`protected boolean`	`shouldExtract(CrawlURI curi)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ExtractorXML
```
public ExtractorXML()
```
- Method Detail
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    curi - the URI to check
    
    Returns:
    true if links should be extracted from that URI, false otherwise
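
    The check that ExtractorXML itself performs is not stated here. By analogy with the ExtractorHTML content-type test mentioned above, a subclass check might look roughly like the sketch below. The class ShouldExtractSketch, the method looksLikeXml, and the choice to also inspect a content prefix are all assumptions for illustration. The sketch works on plain strings rather than a CrawlURI so that it stays self-contained.

```java
import java.util.Locale;

/** Hypothetical illustration only; not the Heritrix implementation. */
final class ShouldExtractSketch {
    /**
     * Returns true if the declared MIME type mentions XML, or if the
     * start of the content looks like an XML declaration.
     */
    static boolean looksLikeXml(String contentType, String contentPrefix) {
        String mime = contentType == null
            ? ""
            : contentType.toLowerCase(Locale.ROOT);
        if (mime.contains("xml")) {
            return true; // e.g. text/xml, application/rss+xml
        }
        // Fall back to sniffing the first characters of the content.
        return contentPrefix != null
            && contentPrefix.trim().startsWith("<?xml");
    }
}
```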
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Description copied from class: ContentExtractor
    
    Actually extracts links. The given URI will have passed the three checks described in ContentExtractor's shouldProcess method. Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and so on.
    This method should return true only if extraction completed successfully. If it did not (for instance, if an IO error occurred), this method should return false. Returning false tells the pipeline that downstream extractors should attempt to extract links themselves. Returning true tells it that downstream extractors should be skipped.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    curi - Crawl URI to process.
    
    Returns:
    true if link extraction finished; false if downstream extractors should attempt to extract links
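
    The return-value contract above can be modeled with a small standalone loop. This is only a sketch of the contract as described, not of how Heritrix actually wires extractors together. The class ExtractorChainSketch, the method run, and the use of Predicate to stand in for innerExtract are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical model of the true/false contract; not Heritrix code. */
final class ExtractorChainSketch {
    /**
     * Runs the extractors in iteration order until one returns true
     * ("finished"), and returns the names of those that were attempted.
     */
    static List<String> run(Map<String, Predicate<String>> extractors,
                            String content) {
        List<String> attempted = new ArrayList<>();
        boolean finished = false;
        for (Map.Entry<String, Predicate<String>> e : extractors.entrySet()) {
            if (finished) {
                break; // an earlier extractor returned true: skip the rest
            }
            attempted.add(e.getKey());
            // false means "could not finish", so the next one gets a turn
            finished = e.getValue().test(content);
        }
        return attempted;
    }
}
```

    With an insertion-ordered map holding three extractors that return false, true, and true in that order, only the first two are attempted.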
  - getContentDeclaredCharset
```
protected Charset getContentDeclaredCharset(CrawlURI curi,
                                String contentPrefix)
```
  - processXml
```
public static long processXml(Extractor ext,
              CrawlURI curi,
              CharSequence cs)
```
