ExtractorDOC (Heritrix 3 3.2.0 API)


java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ContentExtractor
      - org.archive.modules.extractor.ExtractorDOC

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorDOC
extends ContentExtractor
```
Extracts href-style links from Microsoft Word documents in the Word 97 format.

Author:

Parker Thompson

Field Summary
- Fields inherited from class org.archive.modules.extractor.Extractor
  DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
- Fields inherited from class org.archive.modules.Processor
  beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description

ExtractorDOC()

Method Summary

Methods
Modifier and Type	Method and Description
`protected boolean`	`innerExtract(CrawlURI curi)` Processes a Word document and extracts any hyperlinks from it.
`protected boolean`	`shouldExtract(CrawlURI uri)` Determines if otherwise valid URIs should have links extracted or not.

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ExtractorDOC
```
public ExtractorDOC()
```
    Creates a new ExtractorDOC. The constructor takes no parameters.
- Method Detail
  - shouldExtract
```
protected boolean shouldExtract(CrawlURI uri)
```
    Description copied from class: ContentExtractor
    
    Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
    
    Specified by:
    
    shouldExtract in class ContentExtractor
    
    Parameters:
    uri - the URI to check
    
    Returns:
    true if links should be extracted from that URI, false otherwise
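
The kind of additional check a subclass might perform here can be sketched as a content-type test. This is a minimal, self-contained illustration and not the actual ExtractorDOC source: the class `WordMimeCheckSketch`, the method `looksLikeWordDocument`, and the `application/msword` prefix are all assumptions made for the example, and a plain `String` stands in for the content type that a real implementation would read from the `CrawlURI`.

```java
import java.util.Locale;

// Hypothetical stand-in for a shouldExtract-style check; not part of the Heritrix API.
final class WordMimeCheckSketch {

    // Assumed MIME type prefix for Word 97 documents; the real check may differ.
    private static final String WORD_MIME_PREFIX = "application/msword";

    // Returns true when the content type looks like a Word document.
    // A missing content type is treated as "do not extract".
    static boolean looksLikeWordDocument(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Normalize case and surrounding whitespace so "Application/MSWord" also matches.
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        return normalized.startsWith(WORD_MIME_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeWordDocument("application/msword"));
        System.out.println(looksLikeWordDocument("Application/MSWord; charset=binary"));
        System.out.println(looksLikeWordDocument("text/html"));
        System.out.println(looksLikeWordDocument(null));
    }
}
```

Matching on a prefix rather than on the full string tolerates parameters appended to the header value, such as `; charset=...`.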
  - innerExtract
```
protected boolean innerExtract(CrawlURI curi)
```
    Processes a Word document and extracts any hyperlinks from it. Only href-style links are extracted; the document text itself is not scanned for valid URIs.
    
    Specified by:
    
    innerExtract in class ContentExtractor
    
    Parameters:
    curi - CrawlURI to process.
    
    Returns:
    true if link extraction finished; false if downstream extractors should attempt to extract links
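
To illustrate the difference between href-style links and URIs that merely appear in the text, the sketch below pulls link targets out of field codes of the form `HYPERLINK "target"`, which is how Word documents commonly store hyperlinks. It is an assumption-laden example rather than the ExtractorDOC implementation: the class `HyperlinkFieldSketch`, the method `extractTargets`, and the regular expression are hypothetical, and the input is taken to be document text that has already been read out of the binary file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of href-style extraction; not part of the Heritrix API.
final class HyperlinkFieldSketch {

    // Assumption: hyperlink fields appear in the text as HYPERLINK "target".
    // Group 1 captures the quoted target; both quantifiers are reluctant so
    // each match stops at the nearest closing quote.
    private static final Pattern HYPERLINK_FIELD =
            Pattern.compile("HYPERLINK.*?\"(.*?)\"");

    // Collects the targets of all hyperlink fields, in document order.
    // Bare URIs in the running text are deliberately ignored.
    static List<String> extractTargets(CharSequence documentText) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = HYPERLINK_FIELD.matcher(documentText);
        while (matcher.find()) {
            String target = matcher.group(1).trim();
            if (!target.isEmpty()) {
                targets.add(target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        String text = "Intro HYPERLINK \"http://example.com/a\" link text. "
                + "Visit http://example.com/plain too. "
                + "HYPERLINK \"http://example.org/b\"";
        for (String target : extractTargets(text)) {
            System.out.println(target);
        }
    }
}
```

In this example only the two quoted field targets are returned; the bare `http://example.com/plain` in the running text is skipped, which mirrors the limitation described above.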


Copyright © 2003-2014 Internet Archive. All Rights Reserved.