Carrot2 in action: First Steps, Plugging In Your Own Chinese Word Segmenter

http://jiajiam.spaces.live.com/blog/cns!E9F2928B37455D08!281.entry

First steps: plugging in your own Chinese word segmenter

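Now I'm ready to write a real clustering search. At first I worried that carrot2, being a foreign project, would "discriminate" against Chinese, but it turns out carrot2 takes Chinese fairly seriously: there is a class, org.carrot2.filter.lingo.local.ChineseLingoLocalFilterComponent, dedicated to Chinese word segmentation. Digging further, the underlying segmentation is implemented in org.carrot2.util.tokenizer.parser.jflex.JeZHWordSplit, which uses the Lucene-based MMAnalyzer. I have never used this segmenter and don't know how well it disambiguates or how fast it segments, so I wanted to compare it against the segmenter I normally use. That means building my own Chinese filter component. The segmenters I usually use are the improved Java port of the Chinese Academy of Sciences segmenter (still slow) and the C++ version of mmseg. Since my home machine runs Windows, I had to go with the Java port.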

1. First, add a new parser to org.carrot2.util.tokenizer.parser, called KellyWordSplit:

package org.carrot2.util.tokenizer.parser;

import org.apache.lucene.analysis.ictcals.FMNM;
import org.carrot2.util.tokenizer.parser.jflex.PreprocessedJFlexWordBasedParserBase;

/** Word parser that runs the ICTCLAS segmenter as a preprocessing step. */
public class KellyWordSplit extends PreprocessedJFlexWordBasedParserBase {

    // public Segment seg = null;

    public KellyWordSplit() {
        // try {
        //     seg = new Segment(1, new File(".").getCanonicalPath()
        //             + File.separator + "dic" + File.separator);
        // } catch (IOException e) {
        //     // TODO Auto-generated catch block
        //     e.printStackTrace();
        // }
    }

    @Override
    public String preprocess(String input) {
        // Debug trace of the text handed to the segmenter.
        System.out.println("cut:" + input);
        return FMNM.ICTCLASCut(input);
    }
}
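To make the role of preprocess more concrete, here is a minimal sketch. It assumes the contract is "raw text in, the same text with a space between words out"; the post does not spell this out. It is not ICTCLAS and not MMAnalyzer: it is plain forward maximum matching against a tiny in-memory dictionary, and the class name, the dictionary and the space separator are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter; a stand-in for a real segmenter, not ICTCLAS. */
final class ForwardMaxMatchSketch {

    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatchSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /** Returns the input with a single space between segmented words (assumed contract). */
    String preprocess(String input) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int end = Math.min(input.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(input.substring(pos, end))) {
                end--;
            }
            words.add(input.substring(pos, end));
            pos = end;
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "分词", "分词器", "聚类"));
        System.out.println(new ForwardMaxMatchSketch(dict).preprocess("中文分词器聚类"));
    }
}
```

Because the longest candidate is tried first, the sketch picks 分词器 over the shorter 分词. A real segmenter adds disambiguation on top of this, which is exactly the part worth comparing between segmenters.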

Then create a parser factory in the same package, ICTCALWordBasedParserFactory:

package org.carrot2.util.tokenizer.parser;

import org.apache.commons.pool.BasePoolableObjectFactory;
import org.apache.commons.pool.ObjectPool;
import org.apache.commons.pool.impl.SoftReferenceObjectPool;

public class ICTCALWordBasedParserFactory {

    /** Chinese tokenizer factory */
    public static final ICTCALWordBasedParserFactory ChineseSimplified = new KellyICTCALWordBasedParserFactory();

    /** Parser pool */
    protected ObjectPool parserPool;

    /** No public constructor */
    private ICTCALWordBasedParserFactory() {
        // No public constructor
    }

    /** Borrows a parser from the pool. */
    public WordBasedParserBase borrowParser() {
        try {
            return (WordBasedParserBase) parserPool.borrowObject();
        } catch (Exception e) {
            throw new RuntimeException("Cannot borrow a parser", e);
        }
    }

    /**
     * @param parser the parser to hand back to the pool
     */
    public void returnParser(WordBasedParserBase parser) {
        try {
            parserPool.returnObject(parser);
        } catch (Exception e) {
            throw new RuntimeException("Cannot return a parser", e);
        }
    }

    /**
     * @author Stanislaw Osinski
     * @version $Revision: 2122 $
     */
    private static class KellyICTCALWordBasedParserFactory extends
            ICTCALWordBasedParserFactory {
        public KellyICTCALWordBasedParserFactory() {
            parserPool = new SoftReferenceObjectPool(
                    new BasePoolableObjectFactory() {
                        public Object makeObject() throws Exception {
                            return new KellyWordSplit();
                        }
                    });
        }
    }
}
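The factory is built around a borrow/return object pool. As a rough, dependency-free illustration of that pattern (not the commons-pool implementation, which among other things holds idle objects through soft references), a sketch might look like this. SimplePoolSketch and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Minimal borrow/return pool; a sketch of the pattern only, using strong references. */
final class SimplePoolSketch<T> {

    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int created = 0;

    SimplePoolSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Hands out an idle instance if one exists, otherwise creates a new one. */
    synchronized T borrow() {
        T instance = idle.pollFirst();
        if (instance == null) {
            instance = factory.get();
            created++;
        }
        return instance;
    }

    /** Puts an instance back so that a later borrow() can reuse it. */
    synchronized void giveBack(T instance) {
        idle.addFirst(instance);
    }

    synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        SimplePoolSketch<StringBuilder> pool =
                new SimplePoolSketch<StringBuilder>(() -> new StringBuilder());
        StringBuilder parser = pool.borrow();
        try {
            parser.append("tokenize here");
        } finally {
            parser.setLength(0);   // reset state before handing the object back
            pool.giveBack(parser); // always return it, even if the work failed
        }
        pool.borrow();             // reuses the instance returned above
        System.out.println("instances created: " + pool.createdCount());
    }
}
```

The point of the try/finally in main is that every borrowed object should be returned; otherwise the pool keeps creating new instances and the expensive segmenter set-up is paid again each time.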

November 23

Carrot2 in action: First Steps, Plugging In Your Own Chinese Word Segmenter (2)

2. The second step is to create your own language class, ICTCALChineseSimplified, in org.carrot2.util.tokenizer.languages.chinese:

// (package declaration and imports were omitted in the original post)
public class ICTCALChineseSimplified extends StemmedLanguageBase
{
    /**
     * A set of stopwords for this language.
     */
    private final static Set stopwords;

    /*
     * Load stopwords from an associated resource.
     */
    static
    {
        try
        {
            stopwords = WordLoadingUtils.loadWordSet("stopwords.zh-cn");
        }
        catch (IOException e)
        {
            throw new RuntimeException("Could not initialize class: " + e.getMessage());
        }
    }

    /**
     * Public constructor.
     */
    public ICTCALChineseSimplified()
    {
        super.setStopwords(stopwords);
    }

    /**
     * Creates a new instance of a {@link LanguageTokenizer} for this language.
     *
     * @see org.carrot2.util.tokenizer.languages.StemmedLanguageBase#createTokenizerInstanceInternal()
     */
    protected LanguageTokenizer createTokenizerInstanceInternal()
    {
        return ICTCALWordBasedParserFactory.ChineseSimplified.borrowParser();
    }

    /**
     * @return Language code: <code>zh-cn</code>
     * @see org.carrot2.core.linguistic.Language#getIsoCode()
     */
    public String getIsoCode()
    {
        return "zh-cn";
    }

    /** Chinese needs no stemming, so the empty stemmer is used. */
    protected Stemmer createStemmerInstance()
    {
        return EmptyStemmer.INSTANCE;
    }
}
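The language class mostly does two things: it loads a stopword set and hands out a tokenizer. How carrot2 uses the stopword set internally is not shown in the post, so the following is only a sketch of the simplest possible use, under stated assumptions: the word list has one word per line with '#' comments (an assumed format), and the tokens arrive as a space-separated string. StopwordSketch and both method bodies are hypothetical and are not the WordLoadingUtils implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopwordSketch {

    /** Reads one stopword per line, skipping blank lines and '#' comments (assumed format). */
    static Set<String> loadWordSet(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not load stopwords", e);
        }
        return words;
    }

    /** Keeps only the tokens of a space-separated string that are not stopwords. */
    static List<String> removeStopwords(String segmented, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = loadWordSet(new StringReader("# sample list\n的\n了\n"));
        System.out.println(removeStopwords("开源 的 聚类 引擎", stopwords));
    }
}
```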

3. The third step is to build your own Chinese filter component:

package org.carrot2.filter.lingo.local;

import java.util.HashMap;
import java.util.Map;

import org.carrot2.core.linguistic.Language;
import org.carrot2.util.tokenizer.languages.chinese.ICTCALChineseSimplified;
import org.carrot2.util.tokenizer.languages.english.English;

/** Lingo filter that uses the ICTCLAS-backed Chinese language as its default language. */
public class ICTCALLingoLocalFilterComponent extends LingoLocalFilterComponent {

    public ICTCALLingoLocalFilterComponent() {
        super(new Language[] { new English(), new ICTCALChineseSimplified() },
                new ICTCALChineseSimplified(), new HashMap());
    }

    public ICTCALLingoLocalFilterComponent(Map parameters) {
        super(new Language[] { new English(), new ICTCALChineseSimplified() },
                new ICTCALChineseSimplified(), parameters);
    }
}

Ha, easy, isn't it? So how do you use it?

Like this:

final LocalComponentFactory lingo = new LocalComponentFactory() {
    public LocalComponent getInstance() {
        HashMap defaults = new HashMap();
        // These are adjustments settings for the clustering algorithm...
        // You can play with them, but the values below are our 'best guess'
        // settings that we acquired experimentally.
        defaults.put("lsi.threshold.clusterAssignment", "0.150");
        defaults.put("lsi.threshold.candidateCluster", "0.775");
        // we will use the defaults here, see {@link Example}
        // for more verbose configuration.
        //return new ChineseLingoLocalFilterComponent();
        return new ICTCALLingoLocalFilterComponent(defaults);
    }
};

// add the clustering component as "lingo-classic"
controller.addLocalComponentFactory("lingo-classic", lingo);

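Next time I'll talk about how to integrate carrot2 into my own search framework, along with some of my thoughts on the architecture of search clustering/classification.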

November 24

Carrot2 in action (3): Integrating into the System

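Continuing from the above: judging by the clustering results and efficiency of the two approaches, the MMAnalyzer that ships with carrot2 actually performs quite well. Unless you have special requirements, there is no need to plug in your own segmentation component.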

Integrating into the system

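Carrot2 provides a dedicated input component for Lucene search sources, LuceneLocalInputComponent. After looking at how it is built, I felt it did not fit the search architecture of my system. In other words, LuceneLocalInputComponent is too "foolproof" and is not suited to applications that need high performance. So I decided to use carrot2's direct input and output components, ArrayInputComponent and ArrayOutputComponent. As the saying goes, "the most basic is also the most flexible", and it really is true! For the filter I chose the Lingo algorithm's filter component. OK, everything is ready, so let's start assembling. The main code fragments follow: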

/**
 * @param documentList the raw documents to cluster
 * @return ArrayOutputComponent.Result, or null if clustering failed
 */
public ArrayOutputComponent.Result cluster(
        List<RawDocumentSnippet> documentList) {
    final HashMap params = new HashMap();
    params.put(ArrayInputComponent.PARAM_SOURCE_RAW_DOCUMENTS, documentList);
    // params
    // .put(ArrayInputComponent.,
    // documentList);
    ProcessingResult pResult;
    try {
        // "query" is not declared in this fragment; presumably it is a field
        // holding the user's query string.
        pResult = controller.query("direct-feed-lingo", query, params);
        return (ArrayOutputComponent.Result) pResult.getQueryResult();
    } catch (MissingProcessException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return null;
}

November 24

Carrot2 in action (4): Integrating into the System

private LocalController initLocalController() throws DuplicatedKeyException {
    final LocalController controller = new LocalControllerBase();

    // Create direct document feed input component factory. The documents
    // that this component will feed will be provided at clustering
    // request time.
    final LocalComponentFactory input = new LocalComponentFactory() {
        public LocalComponent getInstance() {
            return new ArrayInputComponent();
        }
    };
    // add direct document feed input as 'input-direct'
    controller.addLocalComponentFactory("input-direct", input);

    // Now it's time to create filters. We will use Lingo clustering
    // component.
    final LocalComponentFactory lingo = new LocalComponentFactory() {
        public LocalComponent getInstance() {
            HashMap defaults = new HashMap();
            // These are adjustments settings for the clustering algorithm...
            // You can play with them, but the values below are our 'best guess'
            // settings that we acquired experimentally.
            defaults.put("lsi.threshold.clusterAssignment", "0.150");
            defaults.put("lsi.threshold.candidateCluster", "0.775");
            // we will use the defaults here, see {@link Example}
            // for more verbose configuration.
            //return new ChineseLingoLocalFilterComponent();
            return new ICTCALLingoLocalFilterComponent(defaults);
        }
    };
    // add the clustering component as "lingo-classic"
    controller.addLocalComponentFactory("lingo-classic", lingo);

    // Finally, create a result-catcher component
    final LocalComponentFactory output = new LocalComponentFactory() {
        public LocalComponent getInstance() {
            return new ArrayOutputComponent();
        }
    };
    // add the output component as "buffer"
    controller.addLocalComponentFactory("buffer", output);

    // In the final step, assemble a process from the above.
    try {
        controller.addProcess("direct-feed-lingo", new LocalProcessBase(
                "input-direct", "buffer",
                new String[] { "lingo-classic" }));
    } catch (InitializationException e) {
        // This exception is thrown during verification of the added
        // component chain, when a component cannot properly initialize for
        // some reason. We don't expect it here, so rethrow it as runtime
        // exception.
        throw new RuntimeException(e);
    } catch (MissingComponentException e) {
        // If you give an identifier of a component for which factory has
        // not been added to the controller, you'll get this exception.
        // Impossible in our example.
        throw new RuntimeException(e);
    }
    return controller;
}
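The wiring above follows a simple pattern: register named factories, then declare a process as an ordered chain of component ids (input, filters, output), which the controller verifies when the process is added. The sketch below illustrates that pattern only. It is not the carrot2 API: ChainControllerSketch, Stage and all method names are hypothetical, and a trivial upper-casing stage stands in for the clustering filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy controller: named stage factories are chained into named processes. */
final class ChainControllerSketch {

    /** One processing stage: takes a list of documents, returns a list of documents. */
    interface Stage extends Function<List<String>, List<String>> { }

    private final Map<String, Supplier<Stage>> factories = new HashMap<>();
    private final Map<String, List<String>> processes = new HashMap<>();

    void addFactory(String id, Supplier<Stage> factory) {
        if (factories.containsKey(id)) {
            throw new IllegalArgumentException("Duplicated key: " + id);
        }
        factories.put(id, factory);
    }

    /** Registers a process after checking that every stage id has a factory. */
    void addProcess(String id, String... stageIds) {
        for (String stageId : stageIds) {
            if (!factories.containsKey(stageId)) {
                throw new IllegalStateException("Missing component: " + stageId);
            }
        }
        processes.put(id, Arrays.asList(stageIds));
    }

    /** Runs the documents through a fresh instance of every stage, in order. */
    List<String> query(String processId, List<String> documents) {
        List<String> stageIds = processes.get(processId);
        if (stageIds == null) {
            throw new IllegalStateException("Missing process: " + processId);
        }
        List<String> current = documents;
        for (String stageId : stageIds) {
            current = factories.get(stageId).get().apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        ChainControllerSketch controller = new ChainControllerSketch();
        controller.addFactory("input-direct", () -> docs -> new ArrayList<>(docs));
        controller.addFactory("upper-case", () -> docs -> {
            List<String> out = new ArrayList<>();
            for (String doc : docs) {
                out.add(doc.toUpperCase());
            }
            return out;
        });
        controller.addFactory("buffer", () -> docs -> docs);
        controller.addProcess("direct-feed", "input-direct", "upper-case", "buffer");
        System.out.println(controller.query("direct-feed", Arrays.asList("a", "b")));
    }
}
```

Checking the ids when the process is added, rather than at query time, is what lets a misconfigured chain fail at start-up instead of on the first search.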

November 24

Carrot2 in action (5): Integrating into the System

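At this point the main steps are nearly done; what remains is assembling the clustering results and returning them. I chose to return them as XML. The main fragments follow: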

/**
 * Assembles the clustering result into XML and returns it.
 *
 * @param result the clustering result
 * @param co carries the search keywords and highlight settings
 * @return the XML as a String, or null if there is no result
 */
public String wrapperResult(ArrayOutputComponent.Result result,
        ClusterObject co) {
    if (result == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder();
    final List clusters = result.clusters;
    int size = clusters.size();
    if (size > 0) {
        sb.append("<CLUSTERS_SIZE>");
        sb.append(size);
        sb.append("</CLUSTERS_SIZE>");
        int num = 1;
        for (Iterator i = clusters.iterator(); i.hasNext(); num++) {
            wrapperCluster(sb, 0, (RawCluster) i.next(), co);
        }
    }
    return sb.toString();
}

Carrot2 in action (6): Integrating into the System

/**
 * Wraps the content of a single cluster, descending recursively to
 * subclusters.
 *
 * @param sb
 *            buffer the XML is appended to.
 * @param level
 *            current nesting level.
 * @param cluster
 *            cluster to wrap.
 * @param co
 *            carries the search keywords and highlight settings.
 */
private void wrapperCluster(StringBuilder sb, final int level,
        RawCluster cluster, ClusterObject co) {
    // Detect and skip "junk" clusters -- clusters that have no meaning.
    // Also note that clusters have properties. Algorithms may pass
    // additional information about clusters this way.
    if (cluster.getProperty(RawCluster.PROPERTY_JUNK_CLUSTER) != null) {
        return;
    }
    sb.append("<CLUSTER>");

    // Get the label of the current cluster. The description of a cluster
    // is a list of strings, ordered according to the accuracy of their
    // relationship with the cluster's content. Typically you'll just
    // show the first few phrases. We'll limit ourselves to just one.
    final List phrases = cluster.getClusterDescription();
    final String label = (String) phrases.get(0);
    sb.append("<LABEL><![CDATA[");
    sb.append(label);
    sb.append("]]></LABEL>");

    sb.append("<SIZE>");
    int size = cluster.getDocuments().size();
    sb.append(size);
    sb.append("</SIZE>");

    // If this cluster has documents, write all of them. The fields of each
    // document are written without element tags, one per line.
    if (size > 0) {
        int count = 1;
        sb.append("<DOCUMENTS>");
        for (Iterator d = cluster.getDocuments().iterator(); d.hasNext(); count++) {
            final RawDocument document = (RawDocument) d.next();
            sb.append("<DOC>");
            // <NUM>
            sb.append(count);
            sb.append(System.getProperty("line.separator"));
            // <Score>
            sb.append(document.getScore());
            sb.append(System.getProperty("line.separator"));
            // <ID>
            sb.append(document.getTitle());
            sb.append(System.getProperty("line.separator"));
            // <Value>
            sb.append(document.getProperty("Value"));
            sb.append(System.getProperty("line.separator"));
            // <Key>
            String Key = document.getSnippet();
            Key = highlightUtil.highlight(StringUtil.filterKeyWords(Key),
                    StringUtil.filterKeyWords(co.keyWord), false, analyzer,
                    co.hiliPrefix, co.hiliPostfix);
            sb.append(Key);
            sb.append(System.getProperty("line.separator"));
            sb.append("</DOC>");
        }
        sb.append("</DOCUMENTS>");
    }

    // Finally, if this cluster has subclusters, descend into recursion.
    // (The original counted getClusterDescription() here, which is the
    // number of label phrases, not the number of subclusters.)
    int scnum = cluster.getSubclusters().size();
    if (scnum > 0) {
        int num = 1;
        sb.append("<SUBCLUSTER>");
        for (Iterator c = cluster.getSubclusters().iterator(); c.hasNext(); num++) {
            wrapperCluster(sb, level + 1, (RawCluster) c.next(), co);
        }
        sb.append("</SUBCLUSTER>");
    }
    sb.append("</CLUSTER>");
}
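One detail worth illustrating: the label is written inside a CDATA section, and a label that happens to contain the sequence ]]> would close that section early. The sketch below shows the recursive structure with a CDATA-safe helper. It is an illustration under assumptions, not the code above: Node is a hypothetical stand-in for a cluster, and the per-document fields are reduced to a title.

```java
import java.util.ArrayList;
import java.util.List;

final class ClusterXmlSketch {

    /** Hypothetical stand-in for a cluster: a label, document titles and nested subclusters. */
    static final class Node {
        final String label;
        final List<String> documents = new ArrayList<>();
        final List<Node> subclusters = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Wraps text in CDATA, splitting any "]]>" so it cannot close the section early. */
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Appends one cluster and, recursively, its subclusters. */
    static void write(StringBuilder sb, Node node) {
        sb.append("<CLUSTER>");
        sb.append("<LABEL>").append(cdata(node.label)).append("</LABEL>");
        sb.append("<SIZE>").append(node.documents.size()).append("</SIZE>");
        if (!node.documents.isEmpty()) {
            sb.append("<DOCUMENTS>");
            for (String title : node.documents) {
                sb.append("<DOC>").append(cdata(title)).append("</DOC>");
            }
            sb.append("</DOCUMENTS>");
        }
        // Recurse only when there really are subclusters.
        if (!node.subclusters.isEmpty()) {
            sb.append("<SUBCLUSTER>");
            for (Node child : node.subclusters) {
                write(sb, child);
            }
            sb.append("</SUBCLUSTER>");
        }
        sb.append("</CLUSTER>");
    }

    public static void main(String[] args) {
        Node root = new Node("search engines");
        root.documents.add("Carrot2");
        root.subclusters.add(new Node("a ]]> b"));
        StringBuilder sb = new StringBuilder();
        write(sb, root);
        System.out.println(sb);
    }
}
```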

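That basically completes the main steps. Of course there are many performance-related details, such as cache configuration and concurrent handling of clustering and search, that have to be dealt with according to each system's needs. I won't go into them here.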
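As one example of the caching detail mentioned above, clustering output could be kept in a small least-recently-used cache keyed by query string. The post does not describe its own cache, so everything here is an assumption: the class name, the choice of key and the capacity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Small LRU cache keyed by query string; a sketch only. */
final class ClusterCacheSketch<V> {

    private final Map<String, V> entries;

    ClusterCacheSketch(final int capacity) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Called after each insertion; evict once the capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    synchronized V get(String query) {
        return entries.get(query);
    }

    synchronized void put(String query, V clusterXml) {
        entries.put(query, clusterXml);
    }

    synchronized int entryCount() {
        return entries.size();
    }

    public static void main(String[] args) {
        ClusterCacheSketch<String> cache = new ClusterCacheSketch<>(2);
        cache.put("lucene", "<CLUSTER>...</CLUSTER>");
        cache.put("carrot2", "<CLUSTER>...</CLUSTER>");
        cache.get("lucene");                      // marks "lucene" as recently used
        cache.put("weka", "<CLUSTER>...</CLUSTER>"); // evicts "carrot2"
        System.out.println(cache.get("carrot2") == null);
    }
}
```

The methods are synchronized because even a read reorders the underlying map when access order is enabled, so concurrent searches must not touch it unguarded.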

November 24

Carrot2 in action (7): A Few More Words

A few more words

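carrot2 is really an open-source project for real-time clustering. Its clustering input is an array, meaning that all the data to be clustered is fed in at once, which is clearly unsuitable for clustering large volumes of data. So carrot2 is a good fit for real-time clustering projects such as news publishing systems. I skimmed the source and found that carrot2's main clustering work happens in MultilingualClusteringContext and MultilingualFeatureExtractionStrategy. Features are represented with the VSM (vector space model), and extraction is done mainly by the following two methods: private Feature[] extractSingleTerms() and private Feature[] extractPhraseTerms(int[] indexMapping).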
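For readers unfamiliar with the vector space model, the basic idea can be sketched in a few lines: each document becomes a vector of term counts, and similarity between documents is the cosine of the angle between their vectors. This is only the textbook idea, not what MultilingualFeatureExtractionStrategy does internally; VectorSpaceSketch and its methods are hypothetical, and the input is assumed to be already segmented into space-separated terms.

```java
import java.util.HashMap;
import java.util.Map;

final class VectorSpaceSketch {

    /** Builds a term-frequency vector from a space-separated (already segmented) document. */
    static Map<String, Integer> termFrequencies(String segmented) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : segmented.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two sparse term vectors; 0 when either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denominator = length(a) * length(b);
        return denominator == 0 ? 0 : dot / denominator;
    }

    /** Euclidean length of a term vector. */
    private static double length(Map<String, Integer> v) {
        double sum = 0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("开源 聚类 引擎");
        Map<String, Integer> d2 = termFrequencies("开源 搜索 引擎");
        System.out.println(cosine(d1, d2));
    }
}
```

This also shows why segmentation quality matters for clustering: the terms produced by the segmenter are the dimensions of the vectors.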

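Real-time clustering means clustering while searching, which inevitably hurts search performance. To improve the efficiency of both clustering and search, I plan to adjust the framework as follows: run clustering/classification in a dedicated process.

Step 1: once the search process has handed the clustering data to the clustering/classification process, it is free to get on with its own work, such as assembling the XML (the result carries a key that identifies the clustering result corresponding to this search).

Step 2: when the clustering/classification process receives the data, it performs the clustering/classification and assembles the clustering result.

Step 3: this step can be implemented in two ways. In the first, once the front-end display layer has received the search result, it uses the clustering key in the result XML to fetch the clustering result from the clustering/classification process. The advantage is that the front end can show search results as early as possible. And if the clustering/classification process and the search process run on different servers, this also reduces the concurrency load on the search process, because each request returns quickly and spends less time on the search server. The drawback is more communication for the front end and a less polished display: each search triggers two requests, and the search results and the clusters on the left appear one after the other, that is, asynchronously. In the second, the back-end search process fetches the result from the clustering/classification process before returning, and sends everything back together. This reduces the number of front-end round trips, and both results (clusters and search hits) appear at the same time, which feels more natural to the user. Its weakness is that a single search takes longer, so the user waits longer for results.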
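The three steps can be sketched within a single JVM, with a worker thread standing in for the separate clustering/classification process. This is a minimal illustration under assumptions, not a design taken from the post: AsyncClusteringSketch, submit and fetch are hypothetical names, the key is a random UUID, and a real deployment would put a network call where the in-memory map is.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Sketch of a clustering worker addressed by a key carried in the search result. */
final class AsyncClusteringSketch {

    // Daemon thread so a forgotten shutdown() cannot keep the JVM alive.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "clustering-worker");
        t.setDaemon(true);
        return t;
    });
    private final Map<String, Future<String>> pending = new ConcurrentHashMap<>();
    private final Function<String, String> clusterer;

    AsyncClusteringSketch(Function<String, String> clusterer) {
        this.clusterer = clusterer;
    }

    /** Step 1: the search side hands over the data and gets a key back immediately. */
    String submit(final String searchHits) {
        String key = UUID.randomUUID().toString();
        // Step 2: the clustering itself runs on the worker thread.
        Future<String> future = worker.submit(() -> clusterer.apply(searchHits));
        pending.put(key, future);
        return key;
    }

    /** Step 3: whoever holds the key (front end or search process) collects the result. */
    String fetch(String key, long timeoutMillis) {
        Future<String> result = pending.remove(key);
        if (result == null) {
            throw new IllegalArgumentException("Unknown key: " + key);
        }
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for clusters", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Clustering failed or timed out", e);
        }
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        AsyncClusteringSketch service =
                new AsyncClusteringSketch(hits -> "<CLUSTERS>" + hits + "</CLUSTERS>");
        String key = service.submit("doc1,doc2");
        // ... the search process would assemble and return its own XML here ...
        System.out.println(service.fetch(key, 1000));
        service.shutdown();
    }
}
```

The two variants of step 3 differ only in who calls fetch: the front end, after it has already rendered the search hits, or the search process, just before it returns the combined response.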

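Ah, the Java world really is boundless, and open source really is a great thing for advancing technology. It turns out there is an open-source project called weka, long well known in the data mining community, which contains implementations of many data mining algorithms and is also well suited to large data volumes. So next I'm going to launch an "attack" on weka o(∩_∩)o…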

posted on 2010-05-07 23:47 cy163
