public class ExtractorMultipleRegex extends Extractor
The crawl operator configures these parameters:
- uriRegex: a regular expression to match against the URL
- contentRegexes: a map of named regular expressions { name => regex } to run against the content
- template: the template for constructing the outlinks
The URI is checked against uriRegex. The match is done using Matcher.matches(), so the full URI string must match, not just a substring. If it does match, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of the URI is finished and no outlinks are extracted.
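
The difference between a full match and a substring match can be sketched with plain java.util.regex. This is an illustration only, not the extractor's own code: the class name UriRegexFullMatchSketch, the helper matchWholeUri, and the example.org URIs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Heritrix.
final class UriRegexFullMatchSketch {

    /** Returns groups 0..n when the whole URI matches, or empty when it does not. */
    static Optional<List<String>> matchWholeUri(String uriRegex, String uri) {
        Matcher m = Pattern.compile(uriRegex).matcher(uri);
        if (!m.matches()) {          // matches() requires the entire input to match
            return Optional.empty();
        }
        List<String> groups = new ArrayList<>();
        for (int i = 0; i <= m.groupCount(); i++) {
            groups.add(m.group(i));  // index 0 is the whole match, 1..n are capture groups
        }
        return Optional.of(groups);
    }

    public static void main(String[] args) {
        String regex = "https?://example\\.org/user/(\\d+)";
        // Whole URI matches, so group 1 is available.
        System.out.println(matchWholeUri(regex, "http://example.org/user/42").isPresent());        // true
        // Only a prefix matches, so matches() rejects it ...
        System.out.println(matchWholeUri(regex, "http://example.org/user/42/photos").isPresent()); // false
        // ... even though find() would accept the same input.
        System.out.println(Pattern.compile(regex)
                .matcher("http://example.org/user/42/photos").find());                             // true
    }
}
```

A uriRegex meant to accept longer URIs therefore needs an explicit trailing pattern such as .* at the end.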
Then the extractor looks for matches for each of the contentRegexes in the fetched content. If any of the regular expressions produces no matches, processing of the URI is finished and no outlinks are extracted. If at least one match is found for each regular expression, then an outlink is constructed, using the URI-building template, for every combination of matches. The matching groups are available to the template as ${name[n]}.
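
The "every combination of matches" rule amounts to a Cartesian product over the per-regex match lists. The sketch below shows that idea with standard-library classes only. It is not the extractor's implementation (which works through MatchList and makeBindings); the class MatchCombinationSketch, its methods, and the sample content are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Heritrix.
final class MatchCombinationSketch {

    /** Collects every match of regex in content; each match is the list of its groups 0..n. */
    static List<List<String>> findAll(String regex, CharSequence content) {
        List<List<String>> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            List<String> groups = new ArrayList<>();
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
            matches.add(groups);
        }
        return matches;
    }

    /** Returns one { name => groups } map per combination, or an empty list if any regex has no match. */
    static List<Map<String, List<String>>> combinations(Map<String, String> contentRegexes,
                                                        CharSequence content) {
        List<Map<String, List<String>>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>());
        for (Map.Entry<String, String> e : contentRegexes.entrySet()) {
            List<List<String>> matches = findAll(e.getValue(), content);
            if (matches.isEmpty()) {
                return new ArrayList<>();   // one regex without matches means no outlinks at all
            }
            List<Map<String, List<String>>> next = new ArrayList<>();
            for (Map<String, List<String>> partial : result) {
                for (List<String> match : matches) {
                    Map<String, List<String>> extended = new LinkedHashMap<>(partial);
                    extended.put(e.getKey(), match);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> regexes = new LinkedHashMap<>();
        regexes.put("id", "id=(\\d+)");
        regexes.put("tag", "tag=(\\w+)");
        // Two id matches times two tag matches gives four combinations.
        System.out.println(combinations(regexes, "id=7 id=9 tag=a tag=b").size()); // 4
    }
}
```

Because the number of outlinks is the product of the match counts, a regex that matches very often can multiply the output considerably.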
Outlinks are constructed using the URI-building template. Variable interpolation using the familiar ${...} syntax is supported. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like:
http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}
The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.
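
To make the ${regexName[n]} substitution concrete, here is a simplified stand-in written with java.util.regex. It is deliberately not the Groovy template engine and supports only the ${name[n]} form; the class TemplateInterpolationSketch, the render method, and the binding values are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the template step; the real class uses groovy.text.Template.
final class TemplateInterpolationSketch {

    // Matches placeholders of the form ${name[n]}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\[(\\d+)\\]\\}");

    /** Replaces each ${name[n]} with group n of the match bound to name. */
    static String render(String template, Map<String, List<String>> bindings) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Throws if the name is unbound or the index is out of range; acceptable for a sketch.
            String group = bindings.get(m.group(1)).get(Integer.parseInt(m.group(2)));
            m.appendReplacement(out, Matcher.quoteReplacement(group));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> bindings = new HashMap<>();
        bindings.put("uriRegex", Arrays.asList("http://example.org/user/42", "42"));
        bindings.put("myContentRegex", Arrays.asList("baz"));
        String template = "http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}";
        System.out.println(render(template, bindings)); // http://example.org/42/foo?bar=baz
    }
}
```

Index 0 refers to the whole match and indexes 1..n to the capture groups, mirroring Matcher.group(int).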
Modifier and Type | Class and Description |
---|---|
protected class | ExtractorMultipleRegex.GroupList |
protected class | ExtractorMultipleRegex.MatchList |
Modifier and Type | Field and Description |
---|---|
protected ConcurrentHashMap<String,groovy.text.Template> | groovyTemplates |
Fields inherited from class Extractor: DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorMultipleRegex() |
Modifier and Type | Method and Description |
---|---|
protected void | buildAndAddOutlink(CrawlURI curi, Map<String,Object> bindings) |
void | extract(CrawlURI curi): Extracts links from the given URI. |
Map<String,String> | getContentRegexes() |
String | getTemplate() |
String | getUriRegex() |
protected groovy.text.Template | groovyTemplate() |
protected Map<String,Object> | makeBindings(Map<String,ExtractorMultipleRegex.MatchList> matchLists, String[] regexNames, int outlinkIndex) |
void | setContentRegexes(Map<String,String> contentRegexes): A map of { name => regex }. |
void | setTemplate(String template): URI-building template. |
void | setUriRegex(String uriRegex): Regular expression against which to match the URI. |
protected boolean | shouldProcess(CrawlURI uri): Determines whether the given uri should be processed by this processor. |
Methods inherited from class Extractor: addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
protected ConcurrentHashMap<String,groovy.text.Template> groovyTemplates
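public void setUriRegex(String uriRegex)
Regular expression against which to match the URI. If it does match, then the matching groups are available to the URI-building template as ${uriRegex[n]}. If it does not match, processing of this URI is finished and no outlinks are extracted.
public String getUriRegex()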
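public void setContentRegexes(Map<String,String> contentRegexes)
A map of { name => regex }. The matching groups are available to the template as ${name[n]}.
public void setTemplate(String template)
URI-building template. Variable interpolation using the familiar ${...} syntax is supported. The template is evaluated for each combination of regular expression matches found, and the matching groups are available to the template as ${regexName[n]}. An example template might look like:
http://example.org/${uriRegex[1]}/foo?bar=${myContentRegex[0]}
The template is evaluated as a Groovy Template, so further capabilities beyond simple variable interpolation are available.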
public String getTemplate()
protected groovy.text.Template groovyTemplate()
protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Overrides: shouldProcess in class Processor
Parameters: uri - the URI to test
public void extract(CrawlURI curi)
Description copied from class: Extractor
Extracts links from the given URI. Subclasses should use ExtractorURI#getInputStream() or ExtractorURI#getCharSequence() to process the content of the URI. Any links that are discovered should be added to the ExtractorURI#getOutLinks() set.
protected Map<String,Object> makeBindings(Map<String,ExtractorMultipleRegex.MatchList> matchLists, String[] regexNames, int outlinkIndex)
Copyright © 2003-2014 Internet Archive. All Rights Reserved.