public class BdbUriUniqFilter extends SetBasedUriUniqFilter implements org.springframework.context.Lifecycle, Checkpointable, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.DisposableBean
Makes keys so that URIs from the same server sort close to each other. Mercator and section 2.3.5, 'Eliminating Already-Visited URLs', in 'Mining the Web' by Soumen Chakrabarti describe a two-level key whose first 24 bits are a hash of the host plus port and whose last 40 bits are a hash of the path. Testing showed that adopting such a scheme halved lookup times. (This implementation actually hashes scheme + host into the first 24 bits and path + query into the trailing 40 bits.)
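The two-level layout described above can be illustrated with a minimal sketch. It is not the code used by this class: the FNV-1a hash, the class name `TwoLevelKeySketch`, and the helper names are assumptions chosen only to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;

public class TwoLevelKeySketch {
    // Mask selecting the top 24 bits of a 64-bit key.
    static final long PREFIX_MASK = 0xFFFFFF0000000000L;

    // FNV-1a 64-bit hash: a stand-in chosen for brevity (assumption),
    // not the fingerprint function used by BdbUriUniqFilter.
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Index just past "scheme://authority"; the rest is path + query.
    static int authorityEnd(String uri) {
        int schemeEnd = uri.indexOf("://");
        int from = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = uri.indexOf('/', from);
        return slash < 0 ? uri.length() : slash;
    }

    // Top 24 bits come from scheme + authority, low 40 bits from the remainder.
    static long key(String uri) {
        int cut = authorityEnd(uri);
        long prefix = fnv1a64(uri.substring(0, cut)) & PREFIX_MASK;
        long suffix = fnv1a64(uri.substring(cut)) & ~PREFIX_MASK;
        return prefix | suffix;
    }

    public static void main(String[] args) {
        long a = key("http://example.com/a");
        long b = key("http://example.com/b?x=1");
        // URIs from the same server share their 24-bit prefix.
        System.out.println((a >>> 40) == (b >>> 40)); // prints "true"
    }
}
```

Because keys for one server share a prefix, they are stored next to each other in a sorted key space, which is the locality property the description credits for faster lookups.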
UriUniqFilter.CrawlUriReceiver
Modifier and Type | Field and Description |
---|---|
protected com.sleepycat.je.Database | alreadySeen |
protected BdbModule | bdb |
protected String | beanName |
protected AtomicLong | count |
protected boolean | createdEnvironment |
protected boolean | isRunning |
protected long | lastCacheMiss |
protected long | lastCacheMissDiff |
protected Checkpoint | recoveryCheckpoint |
protected com.sleepycat.je.DatabaseEntry | value |
protected static com.sleepycat.je.DatabaseEntry | ZERO_LENGTH_ENTRY |
duplicateCount, duplicatesAtLastSample, profileLog, receiver
Constructor and Description |
---|
BdbUriUniqFilter() |
BdbUriUniqFilter(File bdbEnv) Constructor. |
BdbUriUniqFilter(File bdbEnv, int cacheSizePercentage) Constructor. |
Modifier and Type | Method and Description |
---|---|
protected static long | calcSchemeAuthorityKeyBytes(String url) |
void | close() Close down any allocated resources. |
static long | createKey(CharSequence uri) Create fingerprint. |
void | destroy() |
void | doCheckpoint(Checkpoint checkpointInProgress) Do the actual checkpoint. |
void | finishCheckpoint(Checkpoint checkpointInProgress) Cleanup/unlock; need not complete for a checkpoint to be valid. |
long | flush() |
void | forgetAllSchemeAuthorityMatching(String url) Forget all entries that match the scheme+host+port of the given url, so that they can be crawled again if discovered again. |
long | getCacheMisses() |
protected BdbModule.BdbConfig | getDatabaseConfig() |
long | getLastCacheMissDiff() |
protected void | initialize(com.sleepycat.je.Database db) Method shared by constructors. |
boolean | isRunning() |
protected void | open(com.sleepycat.je.Database db) |
void | reopen(com.sleepycat.je.Database db) Call after deserializing an instance of this class. |
protected boolean | setAdd(CharSequence uri) |
void | setBdbModule(BdbModule bdb) |
void | setBeanName(String name) |
protected long | setCount() |
void | setRecoveryCheckpoint(Checkpoint recoveryCheckpoint) Used to inform a bean that it should restore its state from the given Checkpoint when launched (Lifecycle start()). |
protected boolean | setRemove(CharSequence uri) |
void | start() |
void | startCheckpoint(Checkpoint checkpointInProgress) Note a checkpoint is about to begin. |
void | stop() |
add, addForce, addNow, count, forget, note, pending, profileLog, requestFlush, setDestination, setProfileLog
protected boolean createdEnvironment
protected long lastCacheMiss
protected long lastCacheMissDiff
protected transient com.sleepycat.je.Database alreadySeen
protected transient com.sleepycat.je.DatabaseEntry value
protected static com.sleepycat.je.DatabaseEntry ZERO_LENGTH_ENTRY
protected AtomicLong count
protected BdbModule bdb
protected String beanName
protected boolean isRunning
protected Checkpoint recoveryCheckpoint
public BdbUriUniqFilter()
public BdbUriUniqFilter(File bdbEnv) throws IOException
bdbEnv
- The directory that holds the bdb environment. Will
make a database under here if one doesn't already exist. Otherwise
reopens any existing dbs.
IOException
public BdbUriUniqFilter(File bdbEnv, int cacheSizePercentage) throws IOException
bdbEnv
- The directory that holds the bdb environment. Will
make a database under here if one doesn't already exist. Otherwise
reopens any existing dbs.
cacheSizePercentage
- Percentage of the JVM heap that bdb allocates as
its cache. Pass -1 to get the default cache size.
IOException
public void setBdbModule(BdbModule bdb)
public void setBeanName(String name)
setBeanName
in interface org.springframework.beans.factory.BeanNameAware
public void start()
start
in interface org.springframework.context.Lifecycle
public boolean isRunning()
isRunning
in interface org.springframework.context.Lifecycle
public void stop()
stop
in interface org.springframework.context.Lifecycle
public void destroy()
destroy
in interface org.springframework.beans.factory.DisposableBean
protected void initialize(com.sleepycat.je.Database db) throws com.sleepycat.je.DatabaseException
env
- Environment to use.
com.sleepycat.je.DatabaseException
protected BdbModule.BdbConfig getDatabaseConfig()
public void reopen(com.sleepycat.je.Database db) throws com.sleepycat.je.DatabaseException
env
- DB Environment to use.
com.sleepycat.je.DatabaseException
protected void open(com.sleepycat.je.Database db) throws com.sleepycat.je.DatabaseException
com.sleepycat.je.DatabaseException
public void close()
UriUniqFilter
close
in interface UriUniqFilter
close
in class SetBasedUriUniqFilter
public long getCacheMisses()
public long getLastCacheMissDiff()
public static long createKey(CharSequence uri)
uri
- URI to fingerprint.
protected static long calcSchemeAuthorityKeyBytes(String url)
protected boolean setAdd(CharSequence uri)
setAdd
in class SetBasedUriUniqFilter
protected long setCount()
setCount
in class SetBasedUriUniqFilter
protected boolean setRemove(CharSequence uri)
setRemove
in class SetBasedUriUniqFilter
public long flush()
public void startCheckpoint(Checkpoint checkpointInProgress)
Checkpointable
startCheckpoint
in interface Checkpointable
checkpointInProgress
- Checkpoint
public void doCheckpoint(Checkpoint checkpointInProgress) throws IOException
Checkpointable
doCheckpoint
in interface Checkpointable
checkpointInProgress
- Checkpoint
IOException
public void finishCheckpoint(Checkpoint checkpointInProgress)
Checkpointable
finishCheckpoint
in interface Checkpointable
checkpointInProgress
- Checkpoint
public void setRecoveryCheckpoint(Checkpoint recoveryCheckpoint)
Checkpointable
setRecoveryCheckpoint
in interface Checkpointable
recoveryCheckpoint
- Checkpoint
public void forgetAllSchemeAuthorityMatching(String url)
Because of the way keys are calculated, scheme+host+port is the only
grouping of urls that is feasible to forget in bulk. See
createKey(CharSequence)
WARNING: Value collisions in this 24-bit schemeAuthority part are going to be fairly common. By the 'birthday problem', a collision is roughly 39% likely with 2^12 unique schemeAuthority strings and more than 50% likely with about 4,800. So forgetting one host may also forget entries that belong to other hosts.
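The two points above, prefix-based bulk forgetting and the collision risk in a 24-bit prefix, can be sketched as follows. This is an illustration under stated assumptions, not this class's implementation: a `TreeSet<Long>` stands in for the sorted BDB key space, the class and method names are hypothetical, and the probability uses the standard birthday approximation 1 - exp(-n(n-1)/(2N)).

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixForgetSketch {
    // Approximate chance that at least two of n values collide
    // when each is drawn uniformly from a space of 2^bits values.
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2, bits);
        return 1.0 - Math.exp(-(double) n * (n - 1) / (2.0 * space));
    }

    // Removes every key whose top 24 bits equal prefix24 and returns how many
    // were removed. Keys sharing a prefix also share a sign bit, so the range
    // [lo, hi] is contiguous under the TreeSet's natural (signed) ordering.
    static int forgetPrefix(TreeSet<Long> keys, long prefix24) {
        long lo = prefix24 << 40;
        long hi = lo | 0xFFFFFFFFFFL; // all 40 low bits set
        NavigableSet<Long> range = keys.subSet(lo, true, hi, true);
        int removed = range.size();
        range.clear(); // clearing the view removes from the backing set
        return removed;
    }

    public static void main(String[] args) {
        TreeSet<Long> keys = new TreeSet<>();
        keys.add((0xABCDEFL << 40) | 1L);
        keys.add((0xABCDEFL << 40) | 2L);
        keys.add((0x000001L << 40) | 5L);
        System.out.println(forgetPrefix(keys, 0xABCDEFL)); // prints "2"
        System.out.println(keys.size());                   // prints "1"
        System.out.println(collisionProbability(4096, 24));
    }
}
```

Any two hosts whose schemeAuthority strings hash to the same 24-bit prefix fall inside the same range, which is why a bulk forget cannot distinguish between them.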
url
- whose scheme+host+port should be forgotten (remainder of url
is ignored)
Copyright © 2003-2014 Internet Archive. All Rights Reserved.