public class ExtractorHTMLForms extends Extractor
<bean id="extractorForms" class="org.archive.modules.forms.ExtractorHTMLForms">
  <!-- <property name="extractAllForms" value="false" /> -->
</bean>
<bean id="formFiller" class="org.archive.modules.forms.FormLoginProcessor">
  <!-- generally these are overlaid with sheets rather than set directly -->
  <!-- <property name="applicableSurtPrefix" value="" /> -->
  <!-- <property name="loginUsername" value="" /> -->
  <!-- <property name="loginPassword" value="" /> -->
</bean>
Then, inside the fetch chain, after all other extractors:
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      ...ALL USUAL PREPROCESSORS/FETCHERS/EXTRACTORS HERE, THEN...
      <ref bean="extractorForms"/>
      <ref bean="formFiller"/>
    </list>
  </property>
</bean>
NOTE: This processor may open a ReplayCharSequence from the
CrawlURI's Recorder, without closing that ReplayCharSequence, to allow
reuse by later processors in sequence. In the usual (Heritrix) case, a
call after all processing to the Recorder's endReplays() method ensures
timely close of any reused ReplayCharSequences. Reuse of this processor
elsewhere should ensure a similar cleanup call to Recorder.endReplays()
occurs.

Modifier and Type | Field and Description |
---|---|
static String | A_HTML_FORM_OBJECTS |
Fields inherited from class Extractor: DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor and Description |
---|
ExtractorHTMLForms() |
Modifier and Type | Method and Description |
---|---|
protected void | analyze(CrawlURI curi, CharSequence cs) - Run analysis: find form METHOD, ACTION, and all INPUT names/values. Log as configured. |
void | extract(CrawlURI curi) - Extracts links from the given URI. |
protected String | findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs) |
protected List<CharSequence> | findGroups(String pattern, int groupNumber, CharSequence cs) |
boolean | getExtractAllForms() |
void | setExtractAllForms(boolean extractAllForms) |
protected boolean | shouldProcess(CrawlURI uri) - Determines whether the given uri should be processed by this processor. |
Methods inherited from class Extractor: addOutlink, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
public static final String A_HTML_FORM_OBJECTS
public boolean getExtractAllForms()
public void setExtractAllForms(boolean extractAllForms)
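protected boolean shouldProcess(CrawlURI uri)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
Specified by: shouldProcess in class Processor
Parameters:
uri - the URI to test

public void extract(CrawlURI curi)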
Description copied from class: Extractor
Extracts links from the given URI. Subclasses should use ExtractorURI#getInputStream() or ExtractorURI#getCharSequence() to process the content of the URI. Any links that are discovered should be added to the ExtractorURI#getOutLinks() set.

protected void analyze(CrawlURI curi, CharSequence cs)
Parameters:
curi - CrawlURI we're processing.
cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.

protected List<CharSequence> findGroups(String pattern, int groupNumber, CharSequence cs)
protected String findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs)
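
The signatures of findGroups and findAttributeValueGroup suggest regex helpers that pull a numbered capture group out of a CharSequence, which analyze could then use to find each form's METHOD, ACTION, and INPUT names. The following is a minimal, self-contained sketch of that idea using only java.util.regex. It is not the Heritrix implementation: the class FormRegexSketch, the patterns, attributePattern, and describeForms are all hypothetical names invented for illustration, and the patterns only handle double-quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the Heritrix ExtractorHTMLForms code.
final class FormRegexSketch {

    // Hypothetical patterns for this sketch; real-world HTML needs more care.
    static final String FORM_PATTERN = "(?is)<form\\b[^>]*>.*?</form>";
    static final String FORM_OPEN_TAG_PATTERN = "(?is)<form\\b[^>]*>";
    static final String INPUT_PATTERN = "(?is)<input\\b[^>]*>";

    /** Collects the given capture group from every match of pattern in cs. */
    static List<CharSequence> findGroups(String pattern, int groupNumber, CharSequence cs) {
        List<CharSequence> groups = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(cs);
        while (m.find()) {
            groups.add(m.group(groupNumber));
        }
        return groups;
    }

    /** Returns the given capture group of the first match, or null if there is no match. */
    static String findAttributeValueGroup(String pattern, int groupNumber, CharSequence cs) {
        Matcher m = Pattern.compile(pattern).matcher(cs);
        return m.find() ? m.group(groupNumber) : null;
    }

    /** Builds a pattern that captures a double-quoted attribute value in group 1. */
    static String attributePattern(String name) {
        return "(?is)\\b" + name + "\\s*=\\s*\"([^\"]*)\"";
    }

    /** Summarizes each form as "METHOD ACTION [input names]". */
    static List<String> describeForms(CharSequence html) {
        List<String> out = new ArrayList<>();
        for (CharSequence form : findGroups(FORM_PATTERN, 0, html)) {
            // Read METHOD and ACTION from the opening tag only, not from nested inputs.
            String openTag = findAttributeValueGroup(FORM_OPEN_TAG_PATTERN, 0, form);
            String method = findAttributeValueGroup(attributePattern("method"), 1, openTag);
            String action = findAttributeValueGroup(attributePattern("action"), 1, openTag);
            List<String> names = new ArrayList<>();
            for (CharSequence input : findGroups(INPUT_PATTERN, 0, form)) {
                names.add(findAttributeValueGroup(attributePattern("name"), 1, input));
            }
            out.add(method + " " + action + " " + names);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><form method=\"post\" action=\"/login\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\"></form></html>";
        System.out.println(describeForms(html));
    }
}
```

Because the helpers return plain String values, the results stay valid after the underlying sequence is released, which matches the advice above to copy anything that must outlive the transient cs data.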
Copyright © 2003-2014 Internet Archive. All Rights Reserved.