public class CrawlURI extends Object implements org.archive.util.Reporter, Serializable, OverlayContext
Core state is held in instance variables, but a flexible
attribute list is also available. Use this 'bucket' to carry
custom processing extracted data and state across CrawlURI
processing. See #putString(String, String), #getString(String), etc.
Modifier and Type | Class and Description |
---|---|
static class |
CrawlURI.FetchType |
Modifier and Type | Field and Description |
---|---|
protected String |
canonicalString |
protected Map<String,Object> |
data
Flexible dynamic attributes list.
|
protected org.json.JSONObject |
extraInfo |
protected CrawlURI |
fullVia |
protected Object |
holder |
protected int |
holderCost
Spot for an integer cost to be placed by an external facility (frontier).
|
protected Object |
holderKey |
protected long |
ordinal
Monotonically increasing number within a crawl;
useful for tending towards breadth-first ordering.
|
protected Collection<CrawlURI> |
outCandidates |
protected Collection<Link> |
outLinks
All discovered outbound Links (navlinks, embeds, etc.)
Can either contain Link instances or CrawlURI instances, or both.
|
protected OverlayMapsSource |
overlayMapsSource |
protected ArrayList<String> |
overlayNames |
protected long |
politenessDelay |
protected long |
rescheduleTime
A future time at which this CrawlURI should be reenqueued.
|
static int |
UNCALCULATED |
Constructor and Description |
---|
CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI. |
CrawlURI(UURI u,
String pathFromSeed,
UURI via,
LinkContext viaContext) |
Modifier and Type | Method and Description |
---|---|
void |
aboutToLog()
Notify CrawlURI it is about to be logged; opportunity
for self-annotation
|
static void |
addDataPersistentMember(String key)
Add the key of data map items you want to persist across
processings.
|
void |
addExtraInfo(String key,
Object value) |
void |
addPersistentDataMapKey(String s) |
static void |
autoregisterTo(AutoKryo kryo) |
CrawlURI |
clearPrerequisiteUri()
Clear prerequisite, if any.
|
boolean |
containsContentTypeCharsetDeclaration() |
boolean |
containsDataKey(String key) |
CrawlURI |
createCrawlURI(UURI baseUURI,
Link link)
Utility method for creating the CandidateURIs found while extracting
links from this CrawlURI.
|
CrawlURI |
createCrawlURI(UURI baseUURI,
Link link,
int scheduling,
boolean seed)
Utility method for creating the CandidateURIs found while extracting
links from this CrawlURI.
|
static String |
extendHopsPath(String pathFromSeed,
char hopChar)
Extend a 'hopsPath' (pathFromSeed string of single-character hop-type symbols),
keeping the number of displayed hop-types under MAX_HOPS_DISPLAYED.
|
static String |
fetchStatusCodesToString(int code)
Takes a status code and converts it into a human readable string.
|
String |
flattenVia()
Method returns string version of this URI's referral URI.
|
boolean |
forceFetch()
If this method returns true, this URI should be fetched even though
it already has been crawled.
|
static CrawlURI |
fromHopsViaString(String uriHopsViaContext) |
Collection<String> |
getAnnotations()
Get the annotations set for this uri.
|
UURI |
getBaseURI()
Get the (HTML) Base URI used for derelativizing internal URIs.
|
String |
getCanonicalString() |
String |
getClassKey()
Get the token (usually the hostname + port) which indicates
what "class" this CrawlURI should be grouped with,
for the purposes of ensuring only one item of the
class is processed at once, all items of the class
are held for a politeness period, etc.
|
byte[] |
getContentDigest()
Return the retained content-digest value, if any.
|
HashMap<String,Object> |
getContentDigestHistory() |
String |
getContentDigestSchemeString() |
String |
getContentDigestString() |
long |
getContentLength()
For completed HTTP transactions, the length of the content-body.
|
long |
getContentSize()
Get the size in bytes of this URI's recorded content, inclusive
of things like protocol headers.
|
String |
getContentType()
Get the content type of this URI.
|
Set<Credential> |
getCredentials() |
Map<String,Object> |
getData() |
List<Object> |
getDataList(String key)
Convenience method: return (creating if necessary) list at
given data key
|
int |
getDeferrals()
Get the deferral count.
|
String |
getDNSServerIPLabel() |
int |
getEmbedHopCount()
Get the embed hop count.
|
org.json.JSONObject |
getExtraInfo() |
int |
getFetchAttempts()
Get the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
|
long |
getFetchBeginTime() |
long |
getFetchCompletedTime() |
long |
getFetchDuration() |
int |
getFetchStatus()
Return the overall/fetch status of this CrawlURI for its
current trip through the processing loop.
|
CrawlURI.FetchType |
getFetchType() |
CrawlURI |
getFullVia() |
Object |
getHolder()
Return the 'holder' for the convenience of
an external facility.
|
int |
getHolderCost()
Return the 'holderCost' for convenience of external facility (frontier)
|
Object |
getHolderKey()
Return the 'holderKey' for convenience of
an external facility (Frontier).
|
int |
getHopCount()
Get total hops from seed.
|
Map<String,String> |
getHttpAuthChallenges() |
org.apache.commons.httpclient.HttpMethod |
getHttpMethod() |
String |
getLastHop()
Convenience access to the last hop character, as a string.
|
int |
getLinkHopCount()
Get the link hop count.
|
Collection<Throwable> |
getNonFatalFailures() |
long |
getOrdinal()
Get the ordinal (serial number) assigned at creation.
|
Collection<CrawlURI> |
getOutCandidates()
Returns discovered candidate URIs.
|
Collection<Link> |
getOutLinks()
Returns discovered links.
|
Map<String,Object> |
getOverlayMap(String name)
Get the map corresponding to the overlay name.
|
ArrayList<String> |
getOverlayNames()
Return a list of the names of overlay maps to consider.
|
String |
getPathFromSeed() |
static Collection<String> |
getPersistentDataKeys()
Get the keys of the data map items that persist across
processings.
|
Map<String,Object> |
getPersistentDataMap() |
UURI |
getPolicyBasisUURI()
Get the UURI that should be used as the basis of policy/overlay
decisions.
|
long |
getPolitenessDelay() |
int |
getPrecedence() |
CrawlURI |
getPrerequisiteUri()
Get the prerequisite for this URI.
|
long |
getRecordedSize()
Get size of data recorded (transferred)
|
org.archive.util.Recorder |
getRecorder()
Get the http recorder associated with this uri.
|
long |
getRescheduleTime() |
int |
getSchedulingDirective() |
String |
getSourceTag() |
int |
getThreadNumber()
Get the number of the ToeThread responsible for processing this uri.
|
int |
getTransHops()
Tally up the number of transitive (non-simple-link) hops at
the end of this CrawlURI's pathFromSeed.
|
String |
getURI() |
String |
getUserAgent()
Get the user agent to use for crawling this URI.
|
UURI |
getUURI() |
UURI |
getVia() |
LinkContext |
getViaContext() |
boolean |
hasBeenLinkExtracted()
If true then a link extractor has already claimed this CrawlURI and
performed link extraction on the document content.
|
boolean |
hasContentDigestHistory() |
boolean |
hasCredentials() |
boolean |
hasPrerequisiteUri() |
boolean |
hasRfc2617Credential() |
boolean |
haveOverlayNamesBeenSet()
Test if this context has actually been configured with overlays
(even if in fact no overlays were added).
|
boolean |
includesRetireDirective() |
void |
incrementDeferrals()
Increment the deferral count.
|
void |
incrementDiscardedOutLinks() |
void |
incrementFetchAttempts()
Increment the count of attempts (trips through the processing
loop) at getting the document referenced by this URI.
|
protected void |
inheritFrom(CrawlURI ancestor)
Inherit (copy) the relevant keys-values from the ancestor.
|
boolean |
is2XXSuccess() |
boolean |
isHttpTransaction()
Return true if this is a http transaction.
|
boolean |
isLocation() |
boolean |
isPrerequisite()
Returns true if this CrawlURI is a prerequisite.
|
boolean |
isSeed() |
boolean |
isSuccess()
Ask this URI if it was a success or not.
|
void |
linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI.
|
CrawlURI |
makeConsequentCandidate(String destination,
LinkContext lc,
Hop hop)
Create a consequent CrawlURI from this one, given the
additional parameters
|
void |
makeHeritable(String key)
Make the given key 'heritable', meaning its value will be
added to descendant CrawlURIs.
|
void |
makeNonHeritable(String key)
Make the given key non-'heritable', meaning its value will
not be added to descendant CrawlURIs.
|
CrawlURI |
markPrerequisite(String preq)
Do all actions associated with setting a CrawlURI as requiring a prerequisite. |
void |
processingCleanup()
Clean up after a run through the processing chain.
|
protected UURI |
readUuri(String u)
Read a UURI from a String, handling a null or URIException
|
static boolean |
removeDataPersistentMember(String key)
Remove the key from those data map members persisted.
|
void |
reportTo(PrintWriter writer) |
void |
resetDeferrals()
Reset deferrals counter.
|
void |
resetFetchAttempts()
Reset fetchAttempts counter.
|
void |
resetForRescheduling()
Reset state that should not persist when a URI is
rescheduled for a specific future time.
|
void |
setBaseURI(String baseHref)
Set the (HTML) Base URI used for derelativizing internal URIs.
|
void |
setBaseURI(UURI base) |
void |
setCanonicalString(String canonical) |
void |
setClassKey(String key) |
void |
setContentDigest(byte[] digestValue)
Deprecated.
|
void |
setContentDigest(String scheme,
byte[] digestValue) |
void |
setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive of
all recorded material (such as protocol headers) or even material
'virtually' considered (as in material from a previous fetch
confirmed unchanged with a server).
|
void |
setContentType(String ct)
Set a fetched uri's content type.
|
void |
setDNSServerIPLabel(String label) |
void |
setError(String msg) |
void |
setFetchBeginTime(long time) |
void |
setFetchCompletedTime(long time) |
void |
setFetchStatus(int newstatus)
Set the overall/fetch status of this CrawlURI for
its current trip through the processing loop.
|
void |
setFetchType(CrawlURI.FetchType type) |
void |
setForceFetch(boolean b)
Method to signal that this URI should be fetched even though
it already has been crawled.
|
void |
setForceRetire(boolean b) |
void |
setFullVia(CrawlURI curi) |
void |
setHolder(Object obj)
Remember a 'holder' to which some enclosing/queueing
facility has assigned this CrawlURI.
|
void |
setHolderCost(int cost)
Remember a 'holderCost' which some enclosing/queueing
facility has assigned this CrawlURI
|
void |
setHolderKey(Object obj)
Remember a 'holderKey' which some enclosing/queueing
facility has assigned this CrawlURI.
|
void |
setHttpAuthChallenges(Map<String,String> httpAuthChallenges) |
void |
setHttpMethod(org.apache.commons.httpclient.HttpMethod method) |
void |
setOrdinal(long o) |
void |
setOverlayMapsSource(OverlayMapsSource overrideMapsSource) |
void |
setPolitenessDelay(long polite) |
void |
setPrecedence(int precedence) |
void |
setPrerequisite(boolean prerequisite)
Set if this CrawlURI is itself a prerequisite URI.
|
void |
setPrerequisiteUri(CrawlURI pre)
Set a prerequisite for this URI.
|
void |
setRecorder(org.archive.util.Recorder httpRecorder)
Set the http recorder to be associated with this uri.
|
void |
setRescheduleTime(long time) |
void |
setSchedulingDirective(int priority) |
void |
setSeed(boolean b)
Set the isSeed attribute of this URI.
|
void |
setSourceTag(String sourceTag) |
void |
setThreadNumber(int i)
Set the number of the ToeThread responsible for processing this uri.
|
void |
setUserAgent(String string)
Set the user agent to use when crawling this URI.
|
void |
setVia(UURI via) |
String |
shortReportLegend() |
String |
shortReportLine() |
void |
shortReportLineTo(PrintWriter w) |
Map<String,Object> |
shortReportMap() |
void |
stripToMinimal()
Remove all attributes set on this uri.
|
String |
toString() |
public static final int UNCALCULATED
protected Map<String,Object> data
The attribute list is a flexible map of key/value pairs for storing
the status of this URI for use by other processors. By convention the
attribute list is keyed by constants found in the CoreAttributeConstants
interface. Use this list to carry data or state produced by custom
processors rather than change this class, CrawlURI.
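The following is a minimal sketch of how such an attribute 'bucket' can behave, including the create-if-necessary behavior described for getDataList(String). AttributeBagSketch is a hypothetical stand-in written for illustration, not the Heritrix class, and the key name used in the usage note is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the attribute 'bucket'; not the Heritrix class. */
final class AttributeBagSketch {
    // Flexible key/value store, mirroring the documented Map<String,Object> data field.
    private final Map<String, Object> data = new HashMap<>();

    Map<String, Object> getData() {
        return data;
    }

    boolean containsDataKey(String key) {
        return data.containsKey(key);
    }

    /**
     * Return the list stored at the given key, creating and storing an
     * empty list first if nothing is there yet. A non-list value already
     * stored under the key makes the cast fail with ClassCastException.
     */
    @SuppressWarnings("unchecked")
    List<Object> getDataList(String key) {
        return (List<Object>) data.computeIfAbsent(key, k -> new ArrayList<Object>());
    }
}
```

A custom processor could then call getDataList("my-processor.notes").add(...) on each pass without first checking whether the list exists.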
protected long ordinal
protected transient Object holder
protected transient Object holderKey
protected int holderCost
protected transient Collection<Link> outLinks
protected transient Collection<CrawlURI> outCandidates
protected transient OverlayMapsSource overlayMapsSource
protected String canonicalString
protected long politenessDelay
protected transient CrawlURI fullVia
protected long rescheduleTime
protected org.json.JSONObject extraInfo
public CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI.
uuri - the UURI to base this CrawlURI on.
public CrawlURI(UURI u, String pathFromSeed, UURI via, LinkContext viaContext)
u - UURI instance this CrawlURI wraps.
pathFromSeed -
via -
viaContext -
public static CrawlURI fromHopsViaString(String uriHopsViaContext) throws org.apache.commons.httpclient.URIException
org.apache.commons.httpclient.URIException
public int getSchedulingDirective()
public void setSchedulingDirective(int priority)
priority - The schedulingDirective to set.
public boolean containsDataKey(String key)
public static String fetchStatusCodesToString(int code)
code - the status code
public int getFetchStatus()
public void setFetchStatus(int newstatus)
newstatus - a value from FetchStatusCodes
public int getFetchAttempts()
public void incrementFetchAttempts()
public void resetFetchAttempts()
public void resetDeferrals()
public void setPrerequisiteUri(CrawlURI pre)
A prerequisite is a URI that must be crawled before this URI can be crawled.
pre - CrawlURI to set as the prerequisite.
public CrawlURI getPrerequisiteUri()
A prerequisite is a URI that must be crawled before this URI can be crawled.
public CrawlURI clearPrerequisiteUri()
public boolean hasPrerequisiteUri()
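As a rough illustration of the prerequisite accessors (setPrerequisiteUri, getPrerequisiteUri, clearPrerequisiteUri, hasPrerequisiteUri) alongside the isPrerequisite flag, here is a small sketch. PrerequisiteSketch is a hypothetical class, and the assumption that clearing returns the prerequisite that was removed is mine; the documentation only gives the return type.

```java
/** Hypothetical miniature of the prerequisite bookkeeping; not the Heritrix class. */
final class PrerequisiteSketch {
    private final String uri;
    private PrerequisiteSketch prerequisiteUri; // must be crawled before this URI
    private boolean prerequisite;               // true if this URI is itself a prerequisite

    PrerequisiteSketch(String uri) {
        this.uri = uri;
    }

    String getURI() {
        return uri;
    }

    void setPrerequisiteUri(PrerequisiteSketch pre) {
        this.prerequisiteUri = pre;
    }

    PrerequisiteSketch getPrerequisiteUri() {
        return prerequisiteUri;
    }

    boolean hasPrerequisiteUri() {
        return prerequisiteUri != null;
    }

    /** Clear the prerequisite, if any. Assumption: returns the one removed, or null. */
    PrerequisiteSketch clearPrerequisiteUri() {
        PrerequisiteSketch removed = prerequisiteUri;
        prerequisiteUri = null;
        return removed;
    }

    void setPrerequisite(boolean prerequisite) {
        this.prerequisite = prerequisite;
    }

    boolean isPrerequisite() {
        return prerequisite;
    }
}
```

In this sketch a page URI might hold a DNS lookup URI as its prerequisite, while the DNS URI itself is flagged with setPrerequisite(true).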
public boolean isPrerequisite()
public void setPrerequisite(boolean prerequisite)
prerequisite - True if this CrawlURI is itself a prerequisite URI.
public String getContentType()
public void setContentType(String ct)
ct - Content type.
public void setThreadNumber(int i)
i - the ToeThread number.
public int getThreadNumber()
public void incrementDeferrals()
public int getDeferrals()
public void stripToMinimal()
This method removes the attribute list.
public long getContentSize()
#setContentSize()
public Collection<String> getAnnotations()
public int getHopCount()
public int getEmbedHopCount()
public int getLinkHopCount()
public String getUserAgent()
public void setUserAgent(String string)
string - user agent to use
public long getContentLength()
public long getRecordedSize()
public void setContentSize(long l)
public boolean hasBeenLinkExtracted()
There is an onus on link extractors to set this flag if they have run.
linkExtractorFinished()
public void linkExtractorFinished()
hasBeenLinkExtracted()
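The contract between hasBeenLinkExtracted() and linkExtractorFinished() can be pictured with the sketch below: an extractor checks the flag, does its work, and then sets the flag so later extractors skip the content. Page and extract are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkExtractionFlagSketch {
    /** Hypothetical stand-in carrying only the flag and a list of found links. */
    static final class Page {
        private boolean linkExtracted;
        final List<String> outLinks = new ArrayList<>();

        boolean hasBeenLinkExtracted() {
            return linkExtracted;
        }

        void linkExtractorFinished() {
            linkExtracted = true;
        }
    }

    /**
     * Hypothetical extractor step: skip pages another extractor has claimed,
     * otherwise record the links and set the flag. Returns true if it ran.
     */
    static boolean extract(Page page, List<String> found) {
        if (page.hasBeenLinkExtracted()) {
            return false; // an earlier extractor already handled this content
        }
        page.outLinks.addAll(found);
        page.linkExtractorFinished(); // the onus is on the extractor to record that it ran
        return true;
    }
}
```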
public void aboutToLog()
public org.archive.util.Recorder getRecorder()
public void setRecorder(org.archive.util.Recorder httpRecorder)
httpRecorder - The httpRecorder to set.
public boolean isHttpTransaction()
Return true if this is an HTTP transaction.
TODO: Compound this and the #isPost() method so that there is one place to go to find out if get http, post http, ftp, dns.
public void processingCleanup()
public Set<Credential> getCredentials()
public boolean hasCredentials()
public boolean isSuccess()
Ask this URI if it was a success or not. Use is2XXSuccess() if looking for a status code in the 200 range.
401s caveat: If any RFC 2617 credential data is present and we got a 401, assume it was loaded in FetchHTTP on the expectation that we will go around the processing chain again. Report this condition as a failure so that we get another crack at the processing chain, this time making use of the loaded credential data.
is2XXSuccess()
public boolean is2XXSuccess()
isSuccess()
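The difference between the two checks can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the rule that any positive status counts as success is an assumption not stated in the text here, and FetchSuccessSketch with its static helpers is a hypothetical name. Only the 401 caveat and the 200-range check come from the descriptions of isSuccess() and is2XXSuccess().

```java
final class FetchSuccessSketch {
    static final int SC_UNAUTHORIZED = 401; // HTTP 401 Unauthorized

    /** True only for HTTP status codes 200 through 299. */
    static boolean is2XXSuccess(int fetchStatus) {
        return fetchStatus >= 200 && fetchStatus < 300;
    }

    /**
     * Assumption: any positive status counts as success, except a 401
     * received while RFC 2617 credential data is present. That case is
     * reported as a failure so the URI gets another pass with credentials.
     */
    static boolean isSuccess(int fetchStatus, boolean hasRfc2617Credential) {
        if (fetchStatus == SC_UNAUTHORIZED && hasRfc2617Credential) {
            return false;
        }
        return fetchStatus > 0;
    }
}
```

Under these assumptions a 404 is a "success" in the broad sense (the fetch completed) but not a 2XX success, which is why callers interested in the 200 range should use the narrower check.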
public boolean hasRfc2617Credential()
public void setContentDigest(byte[] digestValue)
Deprecated. Use setContentDigest(String scheme, byte[]) instead.
digestValue -
public void setContentDigest(String scheme, byte[] digestValue)
public String getContentDigestSchemeString()
public byte[] getContentDigest()
public String getContentDigestString()
public void setHolder(Object obj)
obj -
public Object getHolder()
public void setHolderKey(Object obj)
obj -
public Object getHolderKey()
public long getOrdinal()
public void setOrdinal(long o)
public int getHolderCost()
public void setHolderCost(int cost)
cost - value to remember
public Collection<Link> getOutLinks()
public Collection<CrawlURI> getOutCandidates()
public void setBaseURI(String baseHref) throws org.apache.commons.httpclient.URIException
baseHref - String base href to use
org.apache.commons.httpclient.URIException - if supplied string cannot be interpreted as URI
public UURI getBaseURI()
public static Collection<String> getPersistentDataKeys()
public void addPersistentDataMapKey(String s)
public static void addDataPersistentMember(String key)
key - Key to add.
public static boolean removeDataPersistentMember(String key)
key - Key to remove.
protected UURI readUuri(String u)
u - String or null from which to create UURI
public String getDNSServerIPLabel()
public long getFetchBeginTime()
public long getFetchCompletedTime()
public long getFetchDuration()
public CrawlURI.FetchType getFetchType()
public Collection<Throwable> getNonFatalFailures()
public void setDNSServerIPLabel(String label)
public void setError(String msg)
public void setFetchBeginTime(long time)
public void setFetchCompletedTime(long time)
public void setFetchType(CrawlURI.FetchType type)
public void setHttpMethod(org.apache.commons.httpclient.HttpMethod method)
public void setForceRetire(boolean b)
public org.apache.commons.httpclient.HttpMethod getHttpMethod()
public void setBaseURI(UURI base)
public List<Object> getDataList(String key)
key -
public void setSeed(boolean b)
b - Is this URI a seed, true or false.
public boolean isSeed()
public UURI getUURI()
public String getURI()
public String getPathFromSeed()
public String getLastHop()
public UURI getVia()
public void setVia(UURI via)
public LinkContext getViaContext()
public boolean isLocation()
public String shortReportLine()
public Map<String,Object> shortReportMap()
shortReportMap in interface org.archive.util.Reporter
public void shortReportLineTo(PrintWriter w)
shortReportLineTo in interface org.archive.util.Reporter
public String shortReportLegend()
shortReportLegend in interface org.archive.util.Reporter
public void reportTo(PrintWriter writer) throws IOException
reportTo in interface org.archive.util.Reporter
IOException
public String flattenVia()
public String getSourceTag()
public void setSourceTag(String sourceTag)
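The heritable-key mechanism exposed by makeHeritable(String), makeNonHeritable(String), and inheritFrom(CrawlURI) can be pictured with the sketch below. HeritableDataSketch is a hypothetical class; how heritable keys are actually stored is not documented here, so the separate key set and the choice to pass the heritable marking on to the descendant are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of heritable data keys; not the Heritrix class. */
final class HeritableDataSketch {
    private final Map<String, Object> data = new HashMap<>();
    // Assumption: heritable keys are tracked in a set beside the data map.
    private final Set<String> heritableKeys = new HashSet<>();

    Map<String, Object> getData() {
        return data;
    }

    void makeHeritable(String key) {
        heritableKeys.add(key);
    }

    void makeNonHeritable(String key) {
        heritableKeys.remove(key);
    }

    /** Copy the values of the ancestor's heritable keys into this instance. */
    void inheritFrom(HeritableDataSketch ancestor) {
        for (String key : ancestor.heritableKeys) {
            if (ancestor.data.containsKey(key)) {
                data.put(key, ancestor.data.get(key));
                heritableKeys.add(key); // assumption: the marking is passed on too
            }
        }
    }
}
```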
public void makeHeritable(String key)
key - to make heritable
public void makeNonHeritable(String key)
key - to make non-heritable
public String getClassKey()
public void setClassKey(String key)
public boolean forceFetch()
public void setForceFetch(boolean b)
b - set to true to enforce the crawling of this URI
public int getTransHops()
TODO: consider moving link-count in here as well, caching calculation, and refactoring CrawlScope.exceedsMaxHops() to use this.
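A minimal sketch of the tally that getTransHops() describes, counting the non-simple-link hops at the end of a pathFromSeed string of single-character hop symbols. HopsPathSketch is a hypothetical helper, and the use of 'L' as the symbol for a simple link hop is an assumption made for illustration.

```java
final class HopsPathSketch {
    // Assumed symbol for a simple (navigational) link hop.
    static final char NAVLINK = 'L';

    /**
     * Count the hops that follow the last simple-link hop in the path,
     * walking backwards from the end until a simple link is found.
     */
    static int transHops(String pathFromSeed) {
        int count = 0;
        for (int i = pathFromSeed.length() - 1; i >= 0; i--) {
            if (pathFromSeed.charAt(i) == NAVLINK) {
                break;
            }
            count++;
        }
        return count;
    }
}
```

With these assumptions, a path ending in one non-link symbol after two link hops has one transitive hop, and a path made only of link hops has none.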
protected void inheritFrom(CrawlURI ancestor)
ancestor -
public CrawlURI createCrawlURI(UURI baseUURI, Link link) throws org.apache.commons.httpclient.URIException
baseUURI - BaseUURI for link.
link - Link to wrap CandidateURI in.
Returns a new CrawlURI wrapping link.
org.apache.commons.httpclient.URIException
public static String extendHopsPath(String pathFromSeed, char hopChar)
pathFromSeed -
hopChar -
public CrawlURI createCrawlURI(UURI baseUURI, Link link, int scheduling, boolean seed) throws org.apache.commons.httpclient.URIException
baseUURI - BaseUURI for link.
link - Link to wrap CandidateURI in.
scheduling - How new CandidateURI should be scheduled.
seed - True if this CandidateURI is a seed.
Returns a new CrawlURI wrapping link.
org.apache.commons.httpclient.URIException
public String toString()
public void incrementDiscardedOutLinks()
public int getPrecedence()
public void setPrecedence(int precedence)
precedence - the precedence to set
public UURI getPolicyBasisUURI()
public boolean haveOverlayNamesBeenSet()
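Description copied from interface OverlayContext: test if this context has actually been configured with overlays (even if in fact no overlays were added).
haveOverlayNamesBeenSet in interface OverlayContext
public ArrayList<String> getOverlayNames()
Description copied from interface OverlayContext: return a list of the names of overlay maps to consider.
getOverlayNames in interface OverlayContext
public Map<String,Object> getOverlayMap(String name)
Description copied from interface OverlayContext: get the map corresponding to the overlay name.
getOverlayMap in interface OverlayContext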
public void setOverlayMapsSource(OverlayMapsSource overrideMapsSource)
public void setCanonicalString(String canonical)
public String getCanonicalString()
public void setPolitenessDelay(long polite)
public long getPolitenessDelay()
public void setFullVia(CrawlURI curi)
public CrawlURI getFullVia()
public void setRescheduleTime(long time)
public long getRescheduleTime()
public void resetForRescheduling()
public boolean includesRetireDirective()
public org.json.JSONObject getExtraInfo()
public static void autoregisterTo(AutoKryo kryo)
public CrawlURI markPrerequisite(String preq) throws org.apache.commons.httpclient.URIException
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
lastProcessorChain - Last processor chain reference. This chain is where this CrawlURI goes next.
preq - Object to set a prerequisite.
org.apache.commons.httpclient.URIException
public CrawlURI makeConsequentCandidate(String destination, LinkContext lc, Hop hop) throws org.apache.commons.httpclient.URIException
destination - URI string
lc - LinkContext
hop - Hop
org.apache.commons.httpclient.URIException
public boolean containsContentTypeCharsetDeclaration()
public boolean hasContentDigestHistory()
Copyright © 2003-2014 Internet Archive. All Rights Reserved.