public class StatisticsTracker extends Object implements org.springframework.context.ApplicationContextAware, org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>, SeedListener, org.springframework.context.Lifecycle, Runnable, Checkpointable, org.springframework.beans.factory.BeanNameAware
At the end of each snapshot a line is written to the 'progress-statistics.log' file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.
discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.
KB/s(avg) is the bandwidth usage. The total number of bytes downloaded is used to calculate the average bandwidth usage (KB/sec). Because that total is also noted each time a snapshot is made, the average bandwidth usage during the last snapshot period can be calculated to give a "current" rate. The first number is the current rate and the average is in parentheses.
doc/s(avg) works the same way as KB/s(avg) except that it shows the number of documents (URIs) rather than the KB downloaded.
busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.
Finally, mem-use-KB is extracted from the runtime environment (Runtime.getRuntime().totalMemory()).
In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.
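The relationship between the "current" and "average" figures in the KB/s(avg) column can be sketched as follows. This is only an illustration under stated assumptions: the `Sample` class, the method names, and the choice of 1024 bytes per KB are hypothetical and are not taken from the StatisticsTracker source; only `Runtime.getRuntime().totalMemory()` is named on this page.

```java
public class ProgressRateSketch {

    /** Hypothetical stand-in for one snapshot: when it was taken and the running byte total. */
    static final class Sample {
        final long timestampMs;
        final long totalBytes;

        Sample(long timestampMs, long totalBytes) {
            this.timestampMs = timestampMs;
            this.totalBytes = totalBytes;
        }
    }

    /** Amount per second over a period given in milliseconds; 0 for an empty period. */
    static double perSecond(long amount, long millis) {
        return millis <= 0 ? 0.0 : amount * 1000.0 / millis;
    }

    /** "Current" rate: bytes fetched between the two most recent snapshots. */
    static long currentKbPerSec(Sample previous, Sample latest) {
        long bytes = latest.totalBytes - previous.totalBytes;
        long millis = latest.timestampMs - previous.timestampMs;
        return (long) (perSecond(bytes, millis) / 1024); // assumes 1 KB = 1024 bytes
    }

    /** Average rate: all bytes fetched since the crawl started. */
    static long averageKbPerSec(long crawlStartMs, Sample latest) {
        return (long) (perSecond(latest.totalBytes, latest.timestampMs - crawlStartMs) / 1024);
    }

    /** Renders the column as current(average), the order described above. */
    static String kbColumn(long crawlStartMs, Sample previous, Sample latest) {
        return currentKbPerSec(previous, latest) + "(" + averageKbPerSec(crawlStartMs, latest) + ")";
    }

    /** mem-use-KB, read from the runtime as described above. */
    static long memUseKb() {
        return Runtime.getRuntime().totalMemory() / 1024;
    }

    public static void main(String[] args) {
        Sample previous = new Sample(20_000, 2_048_000);
        Sample latest = new Sample(40_000, 6_144_000);
        System.out.println(kbColumn(0, previous, latest));
        System.out.println(memUseKb());
    }
}
```

The doc/s(avg) column would follow the same pattern with a document count in place of the byte total and without the division by 1024.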
Modifier and Type | Field and Description |
---|---|
protected org.springframework.context.ApplicationContext |
appCtx |
protected BdbModule |
bdb |
protected String |
beanName |
protected CrawlController |
controller |
protected CrawledBytesHistotable |
crawledBytes
tally sizes novel, verified (same hash), vouched (not-modified)
|
protected long |
crawlEndTime
wall-clock time the crawl ended
|
protected long |
crawlPauseStarted
wall-clock time of last pause, while pause in progress
|
protected long |
crawlStartTime
wall-clock time the crawl started
|
protected long |
crawlTotalPausedTime
duration tally of all time spent in paused state
|
protected ScheduledExecutorService |
executor |
protected TopNSet |
hostsBytesTop |
protected TopNSet |
hostsDistributionTop |
protected TopNSet |
hostsLastFinishedTop |
protected int |
intervalSeconds
The interval between writing progress information to log.
|
protected boolean |
isRunning |
protected int |
keepSnapshotsCount
Number of crawl-stat sample snapshots to keep for calculation
purposes.
|
protected int |
liveHostReportSize |
protected ConcurrentMap<String,AtomicLong> |
mimeTypeBytes |
protected ConcurrentMap<String,AtomicLong> |
mimeTypeDistribution
Keep track of the file types we see (mime type -> count)
|
protected ObjectIdentityCache<SeedRecord> |
processedSeedsRecords
Record of seeds and latest results
|
protected Checkpoint |
recoveryCheckpoint |
protected List<Report> |
reports |
protected ConfigPath |
reportsDir |
protected SeedModule |
seeds |
protected long |
seedsCrawled |
protected long |
seedsTotal |
protected ServerCache |
serverCache |
protected LinkedList<CrawlStatSnapshot> |
snapshots
snapshots of crawl tallies and rates
|
protected ConcurrentHashMap<String,ConcurrentMap<String,AtomicLong>> |
sourceHostDistribution
Keep track of URL counts per host per seed
|
protected ConcurrentMap<String,AtomicLong> |
statusCodeDistribution
Keep track of fetch status codes
|
protected boolean |
trackSeeds
Whether to maintain seed disposition records (expensive in
crawls with millions of seeds)
|
protected boolean |
trackSources
Whether to maintain hosts-per-source-tag records; very expensive in
crawls with large numbers of source-tags (seeds) or large crawls
over many hosts
|
Constructor and Description |
---|
StatisticsTracker() |
Modifier and Type | Method and Description |
---|---|
void |
addedSeed(CrawlURI curi)
Create a seed record, even on initial notification (before
any real attempt/processing).
|
protected void |
addToManifest(String absolutePath,
char manifest_report_file,
boolean b) |
DisposableStoredSortedMap<Long,String> |
calcReverseSortedHostsDistribution()
Return a copy of the hosts distribution in reverse-sorted
(largest first) order.
|
DisposableStoredSortedMap<Integer,SeedRecord> |
calcSeedRecordsSortedByStatusCode() |
void |
concludedSeedBatch() |
void |
crawlCheckpoint(Object def,
File cpDir) |
String |
crawledBytesSummary() |
void |
crawledURIDisregard(CrawlURI curi) |
void |
crawledURIFailure(CrawlURI curi) |
void |
crawledURINeedRetry(CrawlURI curi) |
void |
crawledURISuccessful(CrawlURI curi) |
void |
crawlEmpty(String statusMessage) |
void |
crawlEnded(String sExitMessage) |
void |
crawlEnding(String sExitMessage) |
void |
crawlPaused(String statusMessage) |
void |
crawlPausing(String statusMessage) |
void |
crawlResuming(String statusMessage) |
void |
doCheckpoint(Checkpoint checkpointInProgress)
Do the actual checkpoint.
|
void |
dumpReports()
Run the reports.
|
void |
finishCheckpoint(Checkpoint checkpointInProgress)
Cleanup/unlock; need not complete for a checkpoint to be valid.
|
long |
getBytesPerFileType(String filetype)
Returns the accumulated number of bytes from files of a given file type.
|
long |
getBytesPerHost(String host)
Returns the accumulated number of bytes downloaded from a given host.
|
CrawlController |
getCrawlController() |
long |
getCrawlDuration()
Returns how long the current crawl has been running *including*
time paused (contrast with getCrawlElapsedTime()).
|
CrawledBytesHistotable |
getCrawledBytes() |
long |
getCrawlElapsedTime() |
Map<String,AtomicLong> |
getFileDistribution()
Returns a Map that contains information about distributions of
encountered mime types.
|
long |
getHostLastFinished(String host)
Returns the time (in millisec) when a URI belonging to a given host was
last finished processing.
|
int |
getIntervalSeconds() |
int |
getKeepSnapshotsCount() |
CrawlStatSnapshot |
getLastSnapshot() |
int |
getLiveHostReportSize() |
String |
getProgressStamp() |
List<Report> |
getReports() |
ConfigPath |
getReportsDir() |
DisposableStoredSortedMap<Long,String> |
getReverseSortedCopy(Map<String,AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given Map in descending order by their
values, which must be longs wrapped with
AtomicLong . |
DisposableStoredSortedMap<Long,String> |
getReverseSortedHostCounts(Map<String,AtomicLong> hostCounts)
Return a copy of the hosts distribution in reverse-sorted (largest first)
order.
|
SeedModule |
getSeeds() |
Iterator<String> |
getSeedsIterator()
Get a seed iterator for the job being monitored.
|
ServerCache |
getServerCache() |
CrawlStatSnapshot |
getSnapshot() |
Map<String,AtomicLong> |
getStatusCodeDistribution()
Return a map representing the distribution of status codes for
successfully fetched curis, where key ->
val represents (string)code -> (integer)count.
|
boolean |
getTrackSeeds() |
boolean |
getTrackSources() |
protected void |
handleSeed(CrawlURI curi,
String disposition)
If the curi is a seed, we update the processedSeeds cache.
|
protected static void |
incrementMapCount(ConcurrentMap<String,AtomicLong> map,
String key)
Increment a counter for a key in a given map.
|
protected static void |
incrementMapCount(ConcurrentMap<String,AtomicLong> map,
String key,
long increment)
Increment a counter for a key in a given map by an arbitrary amount.
|
boolean |
isRunning() |
LinkedList<CrawlStatSnapshot> |
listSnapshots() |
protected void |
logNote(String note) |
boolean |
nonseedLine(String line)
Do nothing with nonseed lines.
|
void |
noteStart()
Notify tracker that crawl has begun.
|
void |
onApplicationEvent(org.springframework.context.ApplicationEvent event) |
protected void |
progressStatisticsEvent()
A method for logging current crawler state.
|
String |
progressStatisticsLegend() |
void |
run()
Do activity.
|
protected void |
saveHostStats(String hostname,
long size)
Update some running-stats based on a URI success
|
protected void |
saveSourceStats(String source,
String hostname) |
void |
setApplicationContext(org.springframework.context.ApplicationContext appCtx) |
void |
setBdbModule(BdbModule bdb) |
void |
setBeanName(String name) |
void |
setCrawlController(CrawlController controller) |
void |
setIntervalSeconds(int interval) |
void |
setKeepSnapshotsCount(int count) |
void |
setLiveHostReportSize(int liveHostReportSize) |
void |
setRecoveryCheckpoint(Checkpoint recoveryCheckpoint)
Used to inform a bean that it should restore its state from
the given Checkpoint when launched (Lifecycle start()).
|
void |
setReports(List<Report> reports) |
void |
setReportsDir(ConfigPath reportsDir) |
void |
setSeeds(SeedModule seeds) |
void |
setServerCache(ServerCache serverCache) |
void |
setTrackSeeds(boolean trackSeeds) |
void |
setTrackSources(boolean trackSources) |
void |
start() |
void |
startCheckpoint(Checkpoint checkpointInProgress)
Note a checkpoint is about to begin.
|
void |
stop() |
protected void |
tallyCurrentPause()
For a current pause (if any), add paused time to total and reset
|
void |
tallySeeds() |
int |
threadCount()
Get the total number of ToeThreads (sleeping and active)
|
protected File |
writeReportFile(Report report,
boolean force) |
File |
writeReportFile(String reportName) |
protected SeedModule seeds
protected BdbModule bdb
protected ConfigPath reportsDir
protected ServerCache serverCache
protected int liveHostReportSize
protected org.springframework.context.ApplicationContext appCtx
protected boolean trackSeeds
protected boolean trackSources
protected int intervalSeconds
protected int keepSnapshotsCount
protected CrawlController controller
protected long crawlStartTime
protected long crawlEndTime
protected long crawlPauseStarted
protected long crawlTotalPausedTime
protected LinkedList<CrawlStatSnapshot> snapshots
protected ScheduledExecutorService executor
protected CrawledBytesHistotable crawledBytes
protected ConcurrentMap<String,AtomicLong> mimeTypeDistribution
protected ConcurrentMap<String,AtomicLong> mimeTypeBytes
protected ConcurrentMap<String,AtomicLong> statusCodeDistribution
protected ConcurrentHashMap<String,ConcurrentMap<String,AtomicLong>> sourceHostDistribution
protected TopNSet hostsDistributionTop
protected TopNSet hostsBytesTop
protected TopNSet hostsLastFinishedTop
protected ObjectIdentityCache<SeedRecord> processedSeedsRecords
protected long seedsTotal
protected long seedsCrawled
protected boolean isRunning
protected String beanName
protected Checkpoint recoveryCheckpoint
public SeedModule getSeeds()
public void setSeeds(SeedModule seeds)
public void setBdbModule(BdbModule bdb)
public ConfigPath getReportsDir()
public void setReportsDir(ConfigPath reportsDir)
public ServerCache getServerCache()
public void setServerCache(ServerCache serverCache)
public int getLiveHostReportSize()
public void setLiveHostReportSize(int liveHostReportSize)
public void setApplicationContext(org.springframework.context.ApplicationContext appCtx) throws org.springframework.beans.BeansException
setApplicationContext
in interface org.springframework.context.ApplicationContextAware
org.springframework.beans.BeansException
public boolean getTrackSeeds()
public void setTrackSeeds(boolean trackSeeds)
public boolean getTrackSources()
public void setTrackSources(boolean trackSources)
public int getIntervalSeconds()
public void setIntervalSeconds(int interval)
public int getKeepSnapshotsCount()
public void setKeepSnapshotsCount(int count)
public CrawlController getCrawlController()
public void setCrawlController(CrawlController controller)
public CrawledBytesHistotable getCrawledBytes()
public boolean isRunning()
isRunning
in interface org.springframework.context.Lifecycle
public void stop()
stop
in interface org.springframework.context.Lifecycle
public void start()
start
in interface org.springframework.context.Lifecycle
public void run()
public String progressStatisticsLegend()
public String getProgressStamp()
public void noteStart()
protected void progressStatisticsEvent()
CrawlController.logProgressStatistics(java.lang.String) so CrawlController can act on progress statistics event.
It is recommended that implementations of this method carefully consider whether it should be synchronized in whole or in part.
e - Progress statistics event.
public CrawlStatSnapshot getSnapshot()
public LinkedList<CrawlStatSnapshot> listSnapshots()
public CrawlStatSnapshot getLastSnapshot()
public long getCrawlElapsedTime()
public void crawlPausing(String statusMessage)
protected void logNote(String note)
public void crawlPaused(String statusMessage)
public void crawlResuming(String statusMessage)
public void crawlEmpty(String statusMessage)
protected void tallyCurrentPause()
public void crawlEnding(String sExitMessage)
public void crawlEnded(String sExitMessage)
public long getCrawlDuration()
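The method summary describes getCrawlDuration() as including time paused and contrasts it with getCrawlElapsedTime(). One way the four timing fields (crawlStartTime, crawlEndTime, crawlPauseStarted, crawlTotalPausedTime) could fit together is sketched below. This is an assumption-laden illustration, not the tracker's actual code: the class, the explicit `now` parameters, and the reading that elapsed time excludes pauses are all hypothetical.

```java
public class PauseTallySketch {
    // Field names mirror the field summary; all values are wall-clock milliseconds.
    long crawlStartTime;
    long crawlEndTime;
    long crawlPauseStarted;   // 0 when no pause is in progress
    long crawlTotalPausedTime;

    void noteStart(long now) {
        crawlStartTime = now;
    }

    void crawlPaused(long now) {
        crawlPauseStarted = now;
    }

    /** For a current pause (if any), add its length to the total and clear the marker. */
    void tallyCurrentPause(long now) {
        if (crawlPauseStarted > 0) {
            crawlTotalPausedTime += now - crawlPauseStarted;
        }
        crawlPauseStarted = 0;
    }

    void crawlEnded(long now) {
        tallyCurrentPause(now);
        crawlEndTime = now;
    }

    /** Start to end (or to now, if still running), including paused time. */
    long duration(long now) {
        long end = crawlEndTime > 0 ? crawlEndTime : now;
        return end - crawlStartTime;
    }

    /** Assumed meaning of "elapsed": duration minus finished pauses and any pause in progress. */
    long elapsedExcludingPauses(long now) {
        long openPause = crawlPauseStarted > 0 ? now - crawlPauseStarted : 0;
        return duration(now) - crawlTotalPausedTime - openPause;
    }

    public static void main(String[] args) {
        PauseTallySketch t = new PauseTallySketch();
        t.noteStart(1_000);
        t.crawlPaused(5_000);
        t.tallyCurrentPause(8_000); // resume after a 3 second pause
        t.crawlEnded(11_000);
        System.out.println(t.duration(11_000));
        System.out.println(t.elapsedExcludingPauses(11_000));
    }
}
```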
public Map<String,AtomicLong> getFileDistribution()
Note: All the values are wrapped with an AtomicLong
protected static void incrementMapCount(ConcurrentMap<String,AtomicLong> map, String key)
map - The Map or ConcurrentMap
key - The key for the counter to be incremented; if it does not exist it will be added (set to 1). If null it will increment the counter "unknown".
protected static void incrementMapCount(ConcurrentMap<String,AtomicLong> map, String key, long increment)
map - The ConcurrentMap
key - The key for the counter to be incremented; if it does not exist it will be added (set equal to increment). If null it will increment the counter "unknown".
increment - The amount to increment the counter related to the key
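The counter behaviour documented for the two incrementMapCount overloads (create the counter on first use, fall back to "unknown" for a null key) can be sketched with standard java.util.concurrent types. This is a minimal illustration and not the tracker's own implementation; the class name and the use of `computeIfAbsent` are choices made for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class CounterMapSketch {

    /** Increment the counter for key by one. */
    static void incrementMapCount(ConcurrentMap<String, AtomicLong> map, String key) {
        incrementMapCount(map, key, 1);
    }

    /** Increment the counter for key by an arbitrary amount. */
    static void incrementMapCount(ConcurrentMap<String, AtomicLong> map, String key, long increment) {
        // A null key is tallied under "unknown", as the parameter notes describe.
        String effectiveKey = (key == null) ? "unknown" : key;
        // A missing counter starts at 0, so after the add it equals the increment.
        map.computeIfAbsent(effectiveKey, unused -> new AtomicLong()).addAndGet(increment);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, AtomicLong> mimeTypeDistribution = new ConcurrentHashMap<>();
        incrementMapCount(mimeTypeDistribution, "text/html");
        incrementMapCount(mimeTypeDistribution, "text/html");
        incrementMapCount(mimeTypeDistribution, null, 512);
        System.out.println(mimeTypeDistribution);
    }
}
```

Pairing a ConcurrentMap with AtomicLong values lets several threads update the same counter without a lock around the whole map.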
public DisposableStoredSortedMap<Long,String> getReverseSortedCopy(Map<String,AtomicLong> mapOfAtomicLongValues)
Sort the entries of the given Map in descending order by their values, which must be longs wrapped with AtomicLong.
Elements are sorted by value from largest to smallest. Equal values are sorted by their keys. The returned map is a StoredSortedMap, and thus may include duplicate keys. If the passed-in map requires access to be synchronized, the caller should ensure this synchronization.
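The ordering described here (largest value first, ties broken by the original key) can be illustrated with plain collections. The sketch below is hypothetical: it returns a List of (count, name) entries rather than a DisposableStoredSortedMap, and it assumes ties are ordered by ascending key, which this page does not specify.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class ReverseSortSketch {

    /** Returns (count, name) pairs, largest count first; equal counts ordered by name. */
    static List<Map.Entry<Long, String>> reverseSorted(Map<String, AtomicLong> counts) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
            // Flip each entry so the count becomes the sort key; duplicate counts are allowed.
            out.add(new AbstractMap.SimpleImmutableEntry<>(e.getValue().get(), e.getKey()));
        }
        out.sort((a, b) -> {
            int byCount = Long.compare(b.getKey(), a.getKey()); // descending by count
            return byCount != 0 ? byCount : a.getValue().compareTo(b.getValue());
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> hostCounts = new HashMap<>();
        hostCounts.put("a.example", new AtomicLong(5));
        hostCounts.put("b.example", new AtomicLong(9));
        hostCounts.put("c.example", new AtomicLong(5));
        for (Map.Entry<Long, String> e : reverseSorted(hostCounts)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```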
mapOfAtomicLongValues - Assumes values are wrapped with AtomicLong.
public Map<String,AtomicLong> getStatusCodeDistribution()
AtomicLong
public long getHostLastFinished(String host)
host - The host to look up time of last completed URI.
public long getBytesPerHost(String host)
host - name of the host
public long getBytesPerFileType(String filetype)
filetype - Filetype to check.
public int threadCount()
public String crawledBytesSummary()
protected void handleSeed(CrawlURI curi, String disposition)
curi - The CrawlURI that may be a seed.
disposition - The disposition of the CrawlURI.
public void crawledURISuccessful(CrawlURI curi)
protected void saveHostStats(String hostname, long size)
hostname -
size -
public void crawledURINeedRetry(CrawlURI curi)
public void crawledURIDisregard(CrawlURI curi)
public void crawledURIFailure(CrawlURI curi)
public Iterator<String> getSeedsIterator()
public DisposableStoredSortedMap<Integer,SeedRecord> calcSeedRecordsSortedByStatusCode()
public DisposableStoredSortedMap<Long,String> getReverseSortedHostCounts(Map<String,AtomicLong> hostCounts)
public DisposableStoredSortedMap<Long,String> calcReverseSortedHostsDistribution()
protected void addToManifest(String absolutePath, char manifest_report_file, boolean b)
public void dumpReports()
public void crawlCheckpoint(Object def, File cpDir) throws Exception
Exception
public void onApplicationEvent(org.springframework.context.ApplicationEvent event)
onApplicationEvent
in interface org.springframework.context.ApplicationListener<org.springframework.context.ApplicationEvent>
public void tallySeeds()
public void addedSeed(CrawlURI curi)
addedSeed
in interface SeedListener
SeedListener.addedSeed(org.archive.modules.CrawlURI)
public boolean nonseedLine(String line)
nonseedLine
in interface SeedListener
SeedListener.nonseedLine(java.lang.String)
public void concludedSeedBatch()
concludedSeedBatch
in interface SeedListener
public void setBeanName(String name)
setBeanName
in interface org.springframework.beans.factory.BeanNameAware
public void startCheckpoint(Checkpoint checkpointInProgress)
Checkpointable
startCheckpoint
in interface Checkpointable
checkpointInProgress - Checkpoint
public void doCheckpoint(Checkpoint checkpointInProgress) throws IOException
Checkpointable
doCheckpoint
in interface Checkpointable
checkpointInProgress - Checkpoint
IOException
public void finishCheckpoint(Checkpoint checkpointInProgress)
Checkpointable
finishCheckpoint
in interface Checkpointable
checkpointInProgress - Checkpoint
public void setRecoveryCheckpoint(Checkpoint recoveryCheckpoint)
Checkpointable
setRecoveryCheckpoint
in interface Checkpointable
recoveryCheckpoint - Checkpoint
Copyright © 2003-2014 Internet Archive. All Rights Reserved.