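The abstract class Scoper discussed in the previous post holds another member variable, DecideRule scope. I have to pause the analysis of the processor classes (I will come back to it later) and take a detour into the DecideRule scope object. As mentioned before, the DecideRule scope member controls which CrawlURI caUri objects fall within the crawl scope.
As usual, let's first look at the class diagram of the DecideRule-related classes.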

DecideRule is an abstract class that decides whether a CrawlURI caUri object is accepted or rejected.

public DecideResult decisionFor(CrawlURI uri) {
    if (!getEnabled()) {
        return DecideResult.NONE;
    }
    DecideResult result = innerDecide(uri);
    if (result == DecideResult.NONE) {
        return result;
    }
    return result;
}

protected abstract DecideResult innerDecide(CrawlURI uri);

public DecideResult onlyDecision(CrawlURI uri) {
    return null;
}

public boolean accepts(CrawlURI uri) {
    return DecideResult.ACCEPT == decisionFor(uri);
}

The abstract method above, DecideResult innerDecide(CrawlURI uri), is implemented by subclasses.
DecideResult is an enum with three values: ACCEPT, NONE, and REJECT.

/**
 * The decision of a DecideRule.
 *
 * @author pjack
 */
public enum DecideResult {

    /** Indicates the URI was accepted. */
    ACCEPT,

    /** Indicates the URI was neither accepted nor rejected. */
    NONE,

    /** Indicates the URI was rejected. */
    REJECT;

    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
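The contract between decisionFor, innerDecide, and accepts can be tried out in isolation. The following is only a minimal sketch, not Heritrix code: the class names DecideRuleSketch, Rule, and HttpOnlyRule are hypothetical, a plain String stands in for CrawlURI, and a simple boolean field replaces getEnabled().

```java
public class DecideRuleSketch {

    // Local copy of the three-valued decision, mirroring the enum quoted above.
    enum DecideResult { ACCEPT, NONE, REJECT }

    // Simplified stand-in for DecideRule (assumption: String replaces CrawlURI).
    static abstract class Rule {
        boolean enabled = true;

        // Template method: a disabled rule abstains, otherwise the subclass decides.
        final DecideResult decisionFor(String uri) {
            if (!enabled) {
                return DecideResult.NONE;
            }
            return innerDecide(uri);
        }

        protected abstract DecideResult innerDecide(String uri);

        // Only an explicit ACCEPT counts as accepted; NONE and REJECT do not.
        boolean accepts(String uri) {
            return decisionFor(uri) == DecideResult.ACCEPT;
        }
    }

    // Hypothetical rule used only for illustration: accept http(s), reject the rest.
    static class HttpOnlyRule extends Rule {
        @Override
        protected DecideResult innerDecide(String uri) {
            boolean web = uri.startsWith("http://") || uri.startsWith("https://");
            return web ? DecideResult.ACCEPT : DecideResult.REJECT;
        }
    }

    public static void main(String[] args) {
        Rule rule = new HttpOnlyRule();
        System.out.println(rule.decisionFor("http://example.com/"));
        System.out.println(rule.decisionFor("ftp://example.com/"));
        rule.enabled = false; // a disabled rule always answers NONE
        System.out.println(rule.decisionFor("http://example.com/"));
        System.out.println(rule.accepts("http://example.com/"));
    }
}
```

Note that accepts treats NONE the same as REJECT, so a rule that abstains never lets a URI through on its own.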
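Next, let's look at its important subclass DecideRuleSequence. This class holds a collection of DecideRule objects, and its DecideResult innerDecide(CrawlURI uri) method iterates over that collection, calling each element's DecideResult decisionFor(CrawlURI uri) method (a combination of the Composite and Iterator patterns).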
@SuppressWarnings("unchecked")
public List<DecideRule> getRules() {
    return (List<DecideRule>) kp.get("rules");
}

public void setRules(List<DecideRule> rules) {
    kp.put("rules", rules);
}

public DecideResult innerDecide(CrawlURI uri) {
    DecideRule decisiveRule = null;
    int decisiveRuleNumber = -1;
    DecideResult result = DecideResult.NONE;
    List<DecideRule> rules = getRules();
    int max = rules.size();
    for (int i = 0; i < max; i++) {
        DecideRule rule = rules.get(i);
        if (rule.onlyDecision(uri) != result) {
            DecideResult r = rule.decisionFor(uri);
            if (LOGGER.isLoggable(Level.FINEST)) {
                LOGGER.finest("DecideRule #" + i + " " +
                        rule.getClass().getName() + " returned " + r + " for url: " + uri);
            }
            if (r != DecideResult.NONE) {
                result = r;
                decisiveRule = rule;
                decisiveRuleNumber = i;
            }
        }
    }
    if (fileLogger != null) {
        fileLogger.info(decisiveRuleNumber + " " + decisiveRule.getClass().getSimpleName() + " " + result + " " + uri);
    }
    return result;
}
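The key property of the loop above is that every rule gets a chance to vote and the last decision other than NONE wins. A minimal sketch of that behavior follows, under clearly stated assumptions: SequenceSketch and its three example rules are hypothetical, String stands in for CrawlURI, rules are modeled as plain functions, and the onlyDecision shortcut and logging are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class SequenceSketch {

    enum DecideResult { ACCEPT, NONE, REJECT }

    // Walk all rules in order; a later non-NONE answer overrides an earlier one.
    static DecideResult decide(List<Function<String, DecideResult>> rules, String uri) {
        DecideResult result = DecideResult.NONE;
        for (Function<String, DecideResult> rule : rules) {
            DecideResult r = rule.apply(uri);
            if (r != DecideResult.NONE) {
                result = r;
            }
        }
        return result;
    }

    // Hypothetical rule list: reject everything, then accept one host,
    // then reject .exe downloads again.
    static List<Function<String, DecideResult>> exampleRules() {
        List<Function<String, DecideResult>> rules = new ArrayList<>();
        rules.add(uri -> DecideResult.REJECT);
        rules.add(uri -> uri.contains("example.com") ? DecideResult.ACCEPT : DecideResult.NONE);
        rules.add(uri -> uri.endsWith(".exe") ? DecideResult.REJECT : DecideResult.NONE);
        return rules;
    }

    public static void main(String[] args) {
        List<Function<String, DecideResult>> rules = exampleRules();
        System.out.println(decide(rules, "http://example.com/index.html"));
        System.out.println(decide(rules, "http://example.com/setup.exe"));
        System.out.println(decide(rules, "http://other.org/"));
    }
}
```

Because later rules override earlier ones, the order of the list matters: moving the reject-all rule to the end would reject every URI.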
At runtime, the elements of this collection can be seen in the crawler-beans.cxml configuration file:

<!-- SCOPE: rules for which discovered URIs to crawl; order is very
     important because last decision returned other than 'NONE' wins. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <property name="rules">
    <list>
      <!-- Begin by REJECTing all... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule">
      </bean>
      <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
        <!-- <property name="alsoCheckVia" value="false" /> -->
        <!-- <property name="surtsSourceFile" value="" /> -->
        <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
        <!-- <property name="surtsSource">
              <bean class="org.archive.spring.ConfigString">
               <property name="value">
                <value>
                 # example.com
                 # http://www.example.edu/path1/
                 # +http://(org,example,
                </value>
               </property>
              </bean>
             </property> -->
      </bean>
      <!-- ...but REJECT those more than a configured link-hop-count from start... -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <!-- <property name="maxHops" value="20" /> -->
      </bean>
      <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
      <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
        <!-- <property name="maxTransHops" value="2" /> -->
        <!-- <property name="maxSpeculativeHops" value="1" /> -->
      </bean>
      <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="decision" value="REJECT"/>
        <property name="seedsAsSurtPrefixes" value="false"/>
        <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" />
        <!-- <property name="surtsSource">
              <bean class="org.archive.spring.ConfigFile">
               <property name="path" value="negative-surts.txt" />
              </bean>
             </property> -->
      </bean>
      <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT"/>
        <!-- <property name="listLogicalOr" value="true" /> -->
        <!-- <property name="regexList">
              <list>
              </list>
             </property> -->
      </bean>
      <!-- ...and REJECT those with suspicious repeating path-segments... -->
      <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
        <!-- <property name="maxRepetitions" value="2" /> -->
      </bean>
      <!-- ...and REJECT those with more than threshold number of path-segments... -->
      <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
        <!-- <property name="maxPathDepth" value="20" /> -->
      </bean>
      <!-- ...but always ACCEPT those marked as prerequisitee for another URI... -->
      <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
      </bean>
      <!-- ...but always REJECT those with unsupported URI schemes -->
      <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
      </bean>
    </list>
  </property>
</bean>
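The abstract class PredicatedDecideRule extends DecideRule. It returns its configured decision when a predicate holds for the URI, and NONE otherwise.

@Override
protected DecideResult innerDecide(CrawlURI uri) {
    if (evaluate(uri)) {
        return getDecision();
    }
    return DecideResult.NONE;
}

protected abstract boolean evaluate(CrawlURI object);

The boolean evaluate(CrawlURI object) method is implemented by subclasses.
I will not go through the remaining implementation classes one by one.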
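To make the split between predicate and decision concrete, here is a minimal sketch of a predicated rule. It is an illustration only: PredicatedSketch, PredicatedRule, and TooManySlashesRule are hypothetical names, String stands in for CrawlURI, the decision is a plain field instead of getDecision(), and the slash-counting logic is an invented stand-in rather than the logic of any real Heritrix rule.

```java
public class PredicatedSketch {

    enum DecideResult { ACCEPT, NONE, REJECT }

    // Simplified stand-in for PredicatedDecideRule (assumption: String replaces CrawlURI).
    static abstract class PredicatedRule {
        // The decision to return when the predicate matches; set per rule instance.
        DecideResult decision = DecideResult.ACCEPT;

        DecideResult innerDecide(String uri) {
            // Matching URIs get the configured decision; everything else abstains.
            return evaluate(uri) ? decision : DecideResult.NONE;
        }

        protected abstract boolean evaluate(String uri);
    }

    // Hypothetical predicate: true when the part after "://" has more than maxDepth slashes.
    static class TooManySlashesRule extends PredicatedRule {
        final int maxDepth;

        TooManySlashesRule(int maxDepth) {
            this.maxDepth = maxDepth;
            this.decision = DecideResult.REJECT;
        }

        @Override
        protected boolean evaluate(String uri) {
            int idx = uri.indexOf("://");
            String rest = idx >= 0 ? uri.substring(idx + 3) : uri;
            int count = 0;
            for (char c : rest.toCharArray()) {
                if (c == '/') {
                    count++;
                }
            }
            return count > maxDepth;
        }
    }

    public static void main(String[] args) {
        PredicatedRule rule = new TooManySlashesRule(2);
        System.out.println(rule.innerDecide("http://example.com/a/b/c"));
        System.out.println(rule.innerDecide("http://example.com/a"));
    }
}
```

The subclass only answers a yes/no question, so the same predicate can serve as an ACCEPT rule or a REJECT rule just by changing the decision, as the two SurtPrefixedDecideRule beans in the configuration above do with their decision property.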
---------------------------------------------------------------------------
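This Heritrix 3.1.0 source code analysis series is my original work.
When reposting, please credit the source: cnblogs (博客园), 刺猬的温驯
Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037547.html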