public class FormLoginProcessor extends Processor implements Checkpointable
A submission CrawlURI is composed when all of the following hold: an HTMLForm was previously discovered (by ExtractorHTMLForms); that form appears to be a login form; at the very least the loginUsername setting is non-empty; and the current (NOT 'action') URI fits under a configured SURT prefix.
This submission CrawlURI is added to the current URI's outCandidates, prefilled with settings for a POST and with input values merged from: (a) the original values discovered in the page; (b) the 'loginUsername', placed into the first plausible text/email-type input field; (c) the 'loginPassword', placed into the first password-type input field.
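The merge of (a), (b) and (c) can be sketched roughly as follows. This is an illustrative sketch only, not the Heritrix implementation: the classes LoginFormMergeSketch and FormInput are hypothetical stand-ins, and "plausible" is simplified here to mean any input whose type is text or email.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the value merge; not part of Heritrix. */
final class LoginFormMergeSketch {

    /** Hypothetical stand-in for one input discovered in the page. */
    static final class FormInput {
        final String type;  // e.g. "text", "email", "password", "hidden"
        final String name;
        final String value; // value found in the page, possibly empty

        FormInput(String type, String name, String value) {
            this.type = type;
            this.name = name;
            this.value = value;
        }
    }

    /** Returns {name, value} pairs for the POST body, in page order. */
    static List<String[]> merge(List<FormInput> inputs,
                                String loginUsername,
                                String loginPassword) {
        List<String[]> merged = new ArrayList<>();
        boolean usernamePlaced = false;
        boolean passwordPlaced = false;
        for (FormInput input : inputs) {
            String value = input.value; // (a) keep the in-page value by default
            if (!usernamePlaced
                    && ("text".equals(input.type) || "email".equals(input.type))) {
                value = loginUsername;  // (b) first text/email input
                usernamePlaced = true;
            } else if (!passwordPlaced && "password".equals(input.type)) {
                value = loginPassword;  // (c) first password input
                passwordPlaced = true;
            }
            merged.add(new String[] { input.name, value });
        }
        return merged;
    }
}
```

Under these assumptions, hidden inputs (for example an anti-forgery token) and any later text inputs keep the values found in the page; only the first matching field of each kind is overwritten.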
Typically the settings 'applicableSurtPrefix', 'loginUsername', and 'loginPassword' would be set in an overlay sheet and applied only to one or more sites (by SURT prefix), rather than set globally. A minimal example set of beans to add to the CXML could look like:
<bean id='formLoginFields' class='org.archive.spring.Sheet'>
  <property name='map'>
    <map>
      <entry key='formFiller.loginUsername' value='EXAMPLE_USERNAME'/>
      <entry key='formFiller.loginPassword' value='EXAMPLE_PASSWORD'/>
    </map>
  </property>
</bean>
<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
  <property name='surtPrefixes'>
    <list>
      <value>http://(net,example,www,)/</value>
      <value>http://(com,example,</value>
    </list>
  </property>
  <property name='targetSheetNames'>
    <list>
      <value>formLoginFields</value>
    </list>
  </property>
</bean>
(Remember: https URIs are always collapsed to http form before overlay-surt-prefix comparisons, so surtPrefixes in the above association should always be in http form, even if the actual target URIs are https.)
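To illustrate why the prefixes must be written in http form, here is a simplified sketch of the comparison. It is an assumption-laden approximation, not the Heritrix SURT code: SurtPrefixSketch, toSurt and matchesPrefix are hypothetical names, and ports, user info and query strings are ignored.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical, simplified SURT-prefix matching; not the Heritrix implementation. */
final class SurtPrefixSketch {

    /** Builds scheme://(reversed,host,labels,)path, folding https into http. */
    static String toSurt(String uri) {
        URI parsed = URI.create(uri);
        String scheme = parsed.getScheme().toLowerCase(Locale.ROOT);
        if ("https".equals(scheme)) {
            scheme = "http"; // collapse https before any prefix comparison
        }
        String[] labels = parsed.getHost().toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder surt = new StringBuilder(scheme).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            surt.append(labels[i]).append(','); // most significant label first
        }
        surt.append(')');
        String path = parsed.getRawPath();
        surt.append(path == null || path.isEmpty() ? "/" : path);
        return surt.toString();
    }

    /** True if the URI, in collapsed SURT form, starts with the given prefix. */
    static boolean matchesPrefix(String uri, String surtPrefix) {
        return toSurt(uri).startsWith(surtPrefix);
    }
}
```

In this sketch, https://www.example.net/login becomes http://(net,example,www,)/login, so it matches the prefix http://(net,example,www,)/ but could never match a prefix written with https.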
Finally, while there is not yet support for testing whether the submitted CrawlURI succeeded, this processor keeps a count of the forms seen that are eligible for attempts, and of the attempts made (for now, just one), per 'formTrackingDomain' (which is either the applicableSurtPrefix or the form-origin URI trimmed to its pathless root in SURT form). These counts are also added to the submission URI so that they are logged in the resulting WARC 'response' record.
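The bookkeeping described above might look roughly like the following. This is a minimal sketch under stated assumptions: FormProvinceTallySketch and its methods are hypothetical, and the comma-separated status format is invented for illustration rather than taken from the processor.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-province tally of eligible forms and attempts; not Heritrix code. */
final class FormProvinceTallySketch {

    // province -> {eligible forms seen, submission attempts made}
    private final Map<String, int[]> tallies = new HashMap<>();

    /**
     * Records one eligible form for the province.
     * Returns true only the first time, since at most one attempt is made.
     */
    synchronized boolean noteEligibleForm(String formProvince) {
        int[] tally = tallies.computeIfAbsent(formProvince, key -> new int[2]);
        tally[0]++;               // one more eligible form seen
        if (tally[1] == 0) {
            tally[1]++;           // first and only attempt for this province
            return true;
        }
        return false;
    }

    /** Assumed status format: "eligibleSeen,attempts,province". */
    synchronized String statusFor(String formProvince) {
        int[] tally = tallies.getOrDefault(formProvince, new int[2]);
        return tally[0] + "," + tally[1] + "," + formProvince;
    }
}
```

A caller would create a submission only when noteEligibleForm returns true, and could attach the string from statusFor to the submission URI for logging.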
Modifier and Type | Field and Description |
---|---|
protected UriErrorLoggerModule | loggerModule |
Constructor and Description |
---|
FormLoginProcessor() |
Modifier and Type | Method and Description |
---|---|
protected void | createFormSubmissionAttempt(CrawlURI curi, HTMLForm templateForm, String formProvince) |
protected void | fromCheckpointJson(org.json.JSONObject json): Restore internal state from the JSONObject stored at an earlier checkpoint time. |
String | getApplicableSurtPrefix() |
protected String | getFormProvince(CrawlURI curi): Get the 'form province', either the configured (applicableSurtPrefix) or inferred (full current server) range of URIs that is considered covered by one form login. |
UriErrorLoggerModule | getLoggerModule() |
String | getLoginPassword() |
String | getLoginUsername() |
protected void | innerProcess(CrawlURI curi): Actually performs the process. |
void | setApplicableSurtPrefix(String applicableSurtPrefix) |
void | setLoggerModule(UriErrorLoggerModule loggerModule) |
void | setLoginPassword(String loginPassword) |
void | setLoginUsername(String loginUsername) |
protected boolean | shouldProcess(CrawlURI curi): Determines whether the given uri should be processed by this processor. |
protected String | submitStatusFor(String formProvince) |
protected org.json.JSONObject | toCheckpointJson(): Return a JSONObject of current state that can be consulted on recovery to restore necessary values. |
protected String | warcHeaderFor(String formProvince) |
Methods inherited from class Processor: doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Checkpointable: doCheckpoint, finishCheckpoint, setRecoveryCheckpoint, startCheckpoint
protected transient UriErrorLoggerModule loggerModule
public String getApplicableSurtPrefix()
public void setApplicableSurtPrefix(String applicableSurtPrefix)
public String getLoginUsername()
public void setLoginUsername(String loginUsername)
public String getLoginPassword()
public void setLoginPassword(String loginPassword)
public UriErrorLoggerModule getLoggerModule()
public void setLoggerModule(UriErrorLoggerModule loggerModule)
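protected boolean shouldProcess(CrawlURI curi)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor.
shouldProcess in class Processor
Parameters: curi - the URI to test
protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Actually performs the process. This method is invoked only for a URI that passes the #ENABLED, the #DECIDE_RULES and the #shouldProcess(ProcessorURI) tests.
innerProcess in class Processor
Parameters: curi - the URI to process
protected String getFormProvince(CrawlURI curi)
Get the 'form province', either the configured (applicableSurtPrefix) or inferred (full current server) range of URIs that is considered covered by one form login.
Parameters: curi
protected void createFormSubmissionAttempt(CrawlURI curi, HTMLForm templateForm, String formProvince)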
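protected org.json.JSONObject toCheckpointJson() throws org.json.JSONException
Description copied from class: Processor
Return a JSONObject of current state that can be consulted on recovery to restore necessary values.
toCheckpointJson in class Processor
Throws: org.json.JSONException
protected void fromCheckpointJson(org.json.JSONObject json) throws org.json.JSONException
Description copied from class: Processor
Restore internal state from the JSONObject stored at an earlier checkpoint time.
fromCheckpointJson in class Processor
Parameters: json - JSONObject
Throws: org.json.JSONException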
Copyright © 2003-2014 Internet Archive. All Rights Reserved.