org.archive.modules.extractor

Class HTTPContentDigest

    • Constructor Detail

      • HTTPContentDigest

        public HTTPContentDigest()
        Constructor.
    • Method Detail

      • getStripRegex

        public String getStripRegex()
      • setStripRegex

        public void setStripRegex(String regex)
      • getMaxSizeToDigest

        public long getMaxSizeToDigest()
      • setMaxSizeToDigest

        public void setMaxSizeToDigest(long threshold)
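
        This page does not describe the two properties above. The sketch below shows one plausible reading of their names, using only the standard library: text matching a strip regex is blanked out before hashing, and content larger than a size threshold is skipped. The class and method names are hypothetical, and treating a negative threshold as "no limit", replacing matches with a single space, and using SHA-1 are all assumptions rather than documented behavior.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.archive.modules.extractor.
final class StripAndDigestSketch {

    /**
     * Returns the hex SHA-1 of the content after blanking every match of
     * stripRegex, or null when the content exceeds maxSizeToDigest bytes.
     * Assumption: a negative maxSizeToDigest means "no limit".
     */
    static String digest(String content, String stripRegex, long maxSizeToDigest) {
        byte[] raw = content.getBytes(StandardCharsets.UTF_8);
        if (maxSizeToDigest >= 0 && raw.length > maxSizeToDigest) {
            return null; // too large: leave any existing digest untouched
        }

        // Assumption: an empty or null regex means "strip nothing".
        String stripped = (stripRegex == null || stripRegex.isEmpty())
                ? content
                : Pattern.compile(stripRegex).matcher(content).replaceAll(" ");

        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] hash = sha1.digest(stripped.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

        With a regex such as `\d{2}:\d{2}:\d{2}`, two pages that differ only in an embedded clock time hash to the same value, which is the kind of volatile content a strip regex would typically target.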
      • shouldProcess

        protected boolean shouldProcess(CrawlURI uri)
        Description copied from class: Processor
        Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
        Specified by:
        shouldProcess in class Processor
        Parameters:
        uri - the URI to test
        Returns:
        true if this processor should process that uri; false if not
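
        The inherited description gives an HTML-only processor as its example. A minimal sketch of such a check follows. FetchedUri is a hypothetical stand-in for the few properties the test reads; it is not the CrawlURI API, and this is not the check HTTPContentDigest itself performs.

```java
// Hypothetical illustration of a shouldProcess-style predicate.
final class ShouldProcessSketch {

    /** Hypothetical stand-in for the URI properties the check needs. */
    static final class FetchedUri {
        final String contentType;
        final long contentLength;

        FetchedUri(String contentType, long contentLength) {
            this.contentType = contentType;
            this.contentLength = contentLength;
        }
    }

    /** Accepts only non-empty HTML responses, as in the example above. */
    static boolean shouldProcess(FetchedUri uri) {
        // startsWith tolerates parameters such as "; charset=UTF-8".
        if (uri.contentType == null || !uri.contentType.startsWith("text/html")) {
            return false;
        }
        return uri.contentLength > 0;
    }
}
```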
      • innerProcess

        protected void innerProcess(CrawlURI curi)
                             throws InterruptedException
        Description copied from class: Processor
        Performs the actual processing. By the time this method is invoked, the given URI is known to have passed the ENABLED check, the DECIDE_RULES check, and the shouldProcess(CrawlURI) test.
        Specified by:
        innerProcess in class Processor
        Parameters:
        curi - the URI to process
        Throws:
        InterruptedException - if the thread is interrupted
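
        The inherited description implies a template-method arrangement: a base class runs the gating checks, then hands off to innerProcess. The sketch below illustrates that ordering with hypothetical classes. The names GatedProcessorSketch, RecordingProcessor, process, and passesDecideRules are assumptions for illustration, URIs are plain strings, and none of this is the real Processor implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class showing the order of checks before innerProcess.
abstract class GatedProcessorSketch {
    private boolean enabled = true;

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Hypothetical stand-in for a decide-rules check; accepts everything. */
    protected boolean passesDecideRules(String uri) {
        return true;
    }

    protected abstract boolean shouldProcess(String uri);

    protected abstract void innerProcess(String uri) throws InterruptedException;

    /** Runs the three gates in order; returns true only if innerProcess ran. */
    final boolean process(String uri) throws InterruptedException {
        if (!enabled) {
            return false;
        }
        if (!passesDecideRules(uri)) {
            return false;
        }
        if (!shouldProcess(uri)) {
            return false;
        }
        innerProcess(uri);
        return true;
    }
}

// Hypothetical subclass that records every URI reaching innerProcess.
final class RecordingProcessor extends GatedProcessorSketch {
    final List<String> processed = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) {
        return uri.endsWith(".html");
    }

    @Override
    protected void innerProcess(String uri) throws InterruptedException {
        // Honor interruption, matching the declared exception.
        if (Thread.interrupted()) {
            throw new InterruptedException();
        }
        processed.add(uri);
    }
}
```

        Because the gates live in the base class, a subclass's innerProcess can assume they have already passed and does not need to repeat them.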

Copyright © 2003-2014 Internet Archive. All Rights Reserved.