ExtractorMultipleRegex (Heritrix 3 3.2.0 API)

java.lang.Object
- org.archive.modules.Processor
  - org.archive.modules.extractor.Extractor
    - org.archive.modules.extractor.ExtractorMultipleRegex

All Implemented Interfaces:

Checkpointable, HasKeyedProperties, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
```
public class ExtractorMultipleRegex
extends Extractor
```
An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
The crawl operator configures these parameters:
- uriRegex: a regular expression to match against the URI
- contentRegexes: a map of named regular expressions { name => regex } to run against the content
- template: the template for constructing the outlinks
The URI is checked against uriRegex. The match is done using Matcher.matches(), so the full URI string must match, not just a substring. If it does match, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of the URI is finished and no outlinks are extracted.
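The difference between a full match and a substring match can be sketched with plain java.util.regex. This is only an illustration of the Matcher.matches() behavior described above, not the extractor's own code; the class name UriRegexMatchSketch, the helper groupIfFullMatch, and the sample URIs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: shows why the whole URI must match.
final class UriRegexMatchSketch {

    /** Returns group n of a full-string match, or null if the URI does not match entirely. */
    static String groupIfFullMatch(String uriRegex, String uri, int n) {
        Matcher m = Pattern.compile(uriRegex).matcher(uri);
        // matches() succeeds only if the regex consumes the entire input
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String regex = "http://example\\.org/items/(\\d+)";
        // Whole URI matches: group 1 is what ${uriRegex[1]} would refer to
        System.out.println(groupIfFullMatch(regex, "http://example.org/items/42", 1)); // prints 42
        // Trailing query string is not covered by the regex, so there is no match
        System.out.println(groupIfFullMatch(regex, "http://example.org/items/42?page=2", 1)); // prints null
    }
}
```

To accept URIs with arbitrary trailing text, the regex itself would need to allow for it, for example by ending in `.*`.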
Then the extractor looks for matches for each of the contentRegexes in the fetched content. If any of the regular expressions produces no matches, processing of the URI is finished and no outlinks are extracted. If at least one match is found for each regular expression, then an outlink is constructed, using the URI-building template, for every combination of matches. The matching groups are available to the template as ${name[n]}.
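The "every combination of matches" rule amounts to a Cartesian product over the per-regex match lists. The following is a minimal sketch of that idea using only the standard library; it is an assumption about how the behavior could be modeled, not the class's actual implementation, and the names MatchCombinationSketch, findAll, and combinations are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Heritrix.
final class MatchCombinationSketch {

    /** Finds every match of each named regex; a match is stored as its groups, group 0 first. */
    static Map<String, List<List<String>>> findAll(Map<String, String> regexes, CharSequence content) {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : regexes.entrySet()) {
            List<List<String>> matches = new ArrayList<>();
            Matcher m = Pattern.compile(e.getValue()).matcher(content);
            while (m.find()) {
                List<String> groups = new ArrayList<>();
                for (int g = 0; g <= m.groupCount(); g++) {
                    groups.add(m.group(g));
                }
                matches.add(groups);
            }
            result.put(e.getKey(), matches);
        }
        return result;
    }

    /** Returns one name-to-groups map per combination; empty if any regex had no match. */
    static List<Map<String, List<String>>> combinations(Map<String, List<List<String>>> matchLists) {
        List<Map<String, List<String>>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>()); // start with a single empty combination
        for (Map.Entry<String, List<List<String>>> e : matchLists.entrySet()) {
            List<Map<String, List<String>>> next = new ArrayList<>();
            for (Map<String, List<String>> partial : combos) {
                for (List<String> match : e.getValue()) {
                    Map<String, List<String>> extended = new LinkedHashMap<>(partial);
                    extended.put(e.getKey(), match);
                    next.add(extended);
                }
            }
            combos = next; // becomes empty as soon as one regex has zero matches
        }
        return combos;
    }
}
```

With two regexes that match twice each, this sketch yields four combinations, so four outlinks would be built; adding a third regex that never matches reduces the result to zero.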
Outlinks are constructed using the URI-building template. Variable interpolation using the familiar ${...} syntax is supported. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like: http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}.
The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.
See Also:
http://groovy.codehaus.org/Groovy+Templates
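To make the ${regexName[n]} notation concrete, here is a simplified stand-in for the interpolation step written with java.util.regex. It deliberately does not use the Groovy template engine and supports only the `${name[n]}` form, so it is a sketch of the substitution idea rather than the extractor's behavior; the class TemplateInterpolationSketch and its interpolate method are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Heritrix and not a Groovy template engine.
final class TemplateInterpolationSketch {

    // Matches placeholders of the form ${name[n]}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\[(\\d+)\\]\\}");

    /** Replaces each ${name[n]} with group n of the match bound to name. */
    static String interpolate(String template, Map<String, List<String>> bindings) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            List<String> groups = bindings.get(m.group(1));
            if (groups == null) {
                throw new IllegalArgumentException("no binding named " + m.group(1));
            }
            String value = groups.get(Integer.parseInt(m.group(2)));
            // quoteReplacement keeps '$' and '\' in the value from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Given bindings where uriRegex holds the groups ["http://example.org/items/42", "42"] and myContentRegex holds ["abc"], interpolating the example template above produces http://example.org/42/foo?bar=abc.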
Contributor:

nlevitt, travis

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected class`	`ExtractorMultipleRegex.GroupList`
`protected class`	`ExtractorMultipleRegex.MatchList`

Field Summary

Fields
Modifier and Type	Field and Description
`protected ConcurrentHashMap<String,groovy.text.Template>`	`groovyTemplates`

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors
Constructor and Description
`ExtractorMultipleRegex()`

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`buildAndAddOutlink(CrawlURI curi, Map<String,Object> bindings)`
`void`	`extract(CrawlURI curi)` Extracts links from the given URI.
`Map<String,String>`	`getContentRegexes()`
`String`	`getTemplate()`
`String`	`getUriRegex()`
`protected groovy.text.Template`	`groovyTemplate()`
`protected Map<String,Object>`	`makeBindings(Map<String,ExtractorMultipleRegex.MatchList> matchLists, String[] regexNames, int outlinkIndex)`
`void`	`setContentRegexes(Map<String,String> contentRegexes)` A map of { name => regex }.
`void`	`setTemplate(String template)` URI-building template.
`void`	`setUriRegex(String uriRegex)` Regular expression against which to match the URI.
`protected boolean`	`shouldProcess(CrawlURI uri)` Determines whether the given uri should be processed by this processor.

Methods inherited from class org.archive.modules.extractor.Extractor
addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - groovyTemplates
```
protected ConcurrentHashMap<String,groovy.text.Template> groovyTemplates
```
- Constructor Detail
  - ExtractorMultipleRegex
```
public ExtractorMultipleRegex()
```
- Method Detail
  - setUriRegex
```
public void setUriRegex(String uriRegex)
```
    Regular expression against which to match the URI. If the URI matches, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of this URI is finished and no outlinks are extracted.
  - getUriRegex
```
public String getUriRegex()
```
  - setContentRegexes
```
public void setContentRegexes(Map<String,String> contentRegexes)
```
    A map of { name => regex }. The extractor looks for matches for each regular expression in the content of the URI being processed. If any of the regular expressions produces no matches, processing of the URI is finished and no outlinks are extracted. If at least one match is found for each regular expression, then an outlink is constructed for every combination of matches. The matching groups are available to the URI-building template as ${name[n]}.
  - getContentRegexes
```
public Map<String,String> getContentRegexes()
```
  - setTemplate
```
public void setTemplate(String template)
```
    URI-building template. Provides variable interpolation using the familiar ${...} syntax. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like: http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}.
    The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.
    
    See Also:
    http://groovy.codehaus.org/Groovy+Templates
  - getTemplate
```
public String getTemplate()
```
  - groovyTemplate
```
protected groovy.text.Template groovyTemplate()
```
  - shouldProcess
```
protected boolean shouldProcess(CrawlURI uri)
```
    Description copied from class: Processor
    
    Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
    
    Specified by:
    
    shouldProcess in class Processor
    
    Parameters:
    uri - the URI to test
    
    Returns:
    true if this processor should process that uri; false if not
  - extract
```
public void extract(CrawlURI curi)
```
    Description copied from class: Extractor
    
    Extracts links from the given URI. Subclasses should use ExtractorURI#getInputStream() or ExtractorURI#getCharSequence() to process the content of the URI. Any links that are discovered should be added to the ExtractorURI#getOutLinks() set.
    
    Specified by:
    
    extract in class Extractor
    
    Parameters:
    curi - the uri to extract links from
  - makeBindings
```
protected Map<String,Object> makeBindings(Map<String,ExtractorMultipleRegex.MatchList> matchLists,
                              String[] regexNames,
                              int outlinkIndex)
```
  - buildAndAddOutlink
```
protected void buildAndAddOutlink(CrawlURI curi,
                      Map<String,Object> bindings)
```
