dinghao

Recording the small steps of growing

 

Understanding Lucene (2): Understanding the Core Indexing Classes

Notes on section 1.5 of Lucene in Action: most of the original text, each passage followed by my own rendering and understanding of it.
IndexWriter

IndexWriter is the central component of the indexing process. This class creates a new index and adds documents to an existing index. You can think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. Despite its name, IndexWriter isn’t the only class that’s used to modify an index; section 2.2 describes how to use the Lucene API to modify an index.

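My reading: IndexWriter is the core component of the indexing process. It can create a new index and add documents to an existing one. Treat it as an object that can only write to the index, not read or search it. IndexWriter is not the only class that can modify an index; section 2.2 describes how to modify an index through the Lucene API.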

Directory

The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses (two of which are included in Lucene) to store the index as they see fit.

In your applications, you will most likely be storing a Lucene index on a disk. To do so, use FSDirectory, a Directory subclass that maintains a list of real files in the file system, as we did in Indexer.

The other implementation of Directory is a class called RAMDirectory. Because all data is held in the fast-access memory and not on a slower hard disk, RAMDirectory is suitable for situations where you need very quick access to the index, whether during indexing or searching.

Of course, the performance difference between RAMDirectory and FSDirectory is less visible when Lucene is used on operating systems that cache files in memory.

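My reading: Directory represents the location of a Lucene index. It is an abstract class, so its subclasses (Lucene ships with two) can store the index wherever suits them.

The FSDirectory subclass stores the index on disk; the RAMDirectory subclass stores it in RAM.

RAMDirectory suits situations that need very fast access to the index. Lucene's own test cases use RAMDirectory.

Because operating systems cache files in memory, the performance gap between the two subclasses is less noticeable in practice.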
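To make the "abstract class, two storage strategies" idea concrete for myself, here is a minimal sketch in plain Java. It is only a toy under my own assumptions: `ToyDirectory`, `ToyRamDirectory`, `ToyFsDirectory`, and the `write`/`read` methods are hypothetical names I made up, and Lucene's real Directory API looks different. The point is only that calling code depends on the abstract type, so the storage behind it can be swapped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: hypothetical names, not Lucene's Directory API.
class ToyDirectoryDemo {

    // The abstraction: "somewhere that named index files live".
    abstract static class ToyDirectory {
        abstract void write(String name, byte[] data);
        abstract byte[] read(String name);
    }

    // Holds every file in a map in memory, in the spirit of RAMDirectory.
    static class ToyRamDirectory extends ToyDirectory {
        private final Map<String, byte[]> files = new HashMap<>();

        @Override
        void write(String name, byte[] data) {
            files.put(name, data.clone()); // copy so later changes by the caller do not leak in
        }

        @Override
        byte[] read(String name) {
            byte[] data = files.get(name);
            if (data == null) {
                throw new IllegalArgumentException("no such file: " + name);
            }
            return data.clone();
        }
    }

    // Holds every file as a real file under a root folder, in the spirit of FSDirectory.
    static class ToyFsDirectory extends ToyDirectory {
        private final Path root;

        ToyFsDirectory(Path root) {
            this.root = root;
        }

        @Override
        void write(String name, byte[] data) {
            try {
                Files.write(root.resolve(name), data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        byte[] read(String name) {
            try {
                return Files.readAllBytes(root.resolve(name));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Calling code sees only the abstract type, so either storage works unchanged.
    static boolean roundTrips(ToyDirectory dir) {
        byte[] payload = {1, 2, 3};
        dir.write("toy-segment.bin", payload);
        return Arrays.equals(payload, dir.read("toy-segment.bin"));
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(roundTrips(new ToyRamDirectory()));
        System.out.println(roundTrips(new ToyFsDirectory(tmp)));
    }
}
```

The file-backed variant writes a small file named `toy-segment.bin` into the system temp folder, so it needs that folder to be writable.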

Analyzer

Before text is indexed, it’s passed through an Analyzer. The Analyzer, specified in the IndexWriter constructor, is in charge of extracting tokens out of text to be indexed and eliminating the rest. If the content to be indexed isn’t plain text, it should first be converted to it, as depicted in figure 2.1. Chapter 7 shows how to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case-sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You’ll learn much more about them in chapter 4.

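My reading: before text is indexed, it is passed to an Analyzer. The Analyzer, specified through the IndexWriter constructor, extracts tokens from the text to be indexed and discards everything else. If the content to be indexed is not plain text, it must first be converted to plain text; chapter 7 covers extracting text from common rich-media formats.

Analyzer is an abstract class, and Lucene provides several subclasses. Some skip stop words (words that do not help distinguish one document from another); some convert tokens to lowercase so that searches are case-insensitive.

Analyzers are an important part of Lucene and can be used for much more than simple input filtering.

For developers integrating Lucene into their own applications, the choice of Analyzer is a critical design decision.

Chapter 4 covers Analyzers in more depth.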
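The two behaviours mentioned above, skipping stop words and lowercasing tokens, can be sketched in a few lines of plain Java. This is a toy of my own, not Lucene's Analyzer API: `ToyAnalyzer` and `analyze` are hypothetical names, the tokenizing rule (split on anything that is not a letter or digit) is my assumption, and the stop-word list is just the five examples from the quoted passage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: this is NOT Lucene's Analyzer API.
class ToyAnalyzer {
    // The example stop words listed in the quoted passage.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Splits on runs of non-letter, non-digit characters, lowercases each
    // token, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty piece, e.g. for empty input
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, fox, sat, log]
        System.out.println(analyze("The Quick fox sat on a Log"));
    }
}
```

Because the same function would be applied to both the indexed text and the query text, a search for "LOG" and a search for "log" end up looking for the same token, which is what makes searching case-insensitive.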

Document

A Document represents a collection of fields. You can think of it as a virtual document (a chunk of data, such as a web page, an email message, or a text file) that you want to make retrievable at a later time. Fields of a document represent the document or meta-data associated with that document. The original source (such as a database record, a Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. The meta-data such as author, title, subject, date modified, and so on, are indexed and stored separately as fields of a document.

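My reading: a Document represents a collection of fields. Think of it as a virtual document: a chunk of data, such as a web page, an email message, or a text file, that you want to be able to retrieve later.

A document's Fields represent the document itself or the metadata associated with it. The original source of the data (a database record, a Word document, a chapter of a book, and so on) is irrelevant to Lucene. Metadata such as author, title, subject, and modification date is stored and indexed separately as Fields of the document.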

Note:

When we refer to a document in this book, we mean a Microsoft Word, RTF, PDF, or other type of a document; we aren’t talking about Lucene’s Document class. Note the distinction in the case and font.

In our Indexer, we’re concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with Fields (described next), and add that Document to the index, effectively indexing the file.

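My reading: Indexer is concerned with indexing text files. For each text file it finds, it creates a Document instance, populates it with Fields, and adds that Document to the index, which in effect indexes the file.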

Field

Each Document in an index contains one or more named fields, embodied in a class called Field. Each field corresponds to a piece of data that is either queried against or retrieved from the index during search.

Lucene offers four different types of fields from which you can choose:

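My reading: each Document in an index contains one or more fields, and each field is represented by the Field class.

During search, each field corresponds to a piece of data in the index that is either queried against or retrieved.

The four Field types are: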

Keyword: Isn’t analyzed, but is indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, file system paths, dates, personal names, Social Security numbers, telephone numbers, and so on. For example, we used the file system path in Indexer (listing 1.1) as a Keyword field.

 

UnIndexed: Is neither analyzed nor indexed, but its value is stored in the index as is. This type is suitable for fields that you need to display with search results (such as a URL or database primary key), but whose values you’ll never search directly. Since the original value of a field of this type is stored in the index, this type isn’t suitable for storing fields with very large values, if index size is an issue.

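My reading: neither analyzed nor indexed, but the value is stored in the index. It suits Fields that should be displayed with search results but are never searched directly.

Because the original value is stored in the index, avoid this type for very large values when index size matters.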

UnStored: The opposite of UnIndexed. This field type is analyzed and indexed but isn’t stored in the index. It’s suitable for indexing a large amount of text that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.

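My reading: the opposite of UnIndexed. It is analyzed and indexed but not stored, so the original text cannot be read back from the index. It suits large blocks of text that do not need to be retrieved in their original form, such as the body of a document.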

Text: Is analyzed, and is indexed. This implies that fields of this type can be searched against, but be cautious about the field size. If the data indexed is a String, it’s also stored; but if the data (as in our Indexer example) is from a Reader, it isn’t stored. This is often a source of confusion, so take note of this difference when using Field.Text.

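My reading: analyzed and indexed, which means fields of this type can be searched, but watch the field size. A String value is also stored; a value read from a Reader is not.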

All fields consist of a name and value pair.

My reading: every Field is made up of a name and a value.

To tell the four types apart:

1. Understand analyzed, indexed, and stored, and how each relates to search.

2. Understand the difference between a field, a field name, and a field value.
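To keep the three properties straight, I find it helps to write the four types down as a table of flags. The sketch below is plain Java and uses no Lucene classes: `ToyFieldKinds`, `Kind`, `searchable`, and `retrievable` are hypothetical names of my own, and the flag values simply restate the quoted descriptions. Text appears twice because, per the passage, a String value is stored and a Reader value is not.

```java
// Toy model only: the flags restate the quoted descriptions; no Lucene classes are used.
class ToyFieldKinds {

    enum Kind {
        //               analyzed, indexed, stored
        KEYWORD(          false,    true,    true),
        UNINDEXED(        false,    false,   true),
        UNSTORED(         true,     true,    false),
        TEXT_FROM_STRING( true,     true,    true),
        TEXT_FROM_READER( true,     true,    false);

        final boolean analyzed; // run through the Analyzer before indexing
        final boolean indexed;  // can be queried against
        final boolean stored;   // original value kept in the index

        Kind(boolean analyzed, boolean indexed, boolean stored) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }

        // Only indexed fields can be matched by a query.
        boolean searchable() {
            return indexed;
        }

        // Only stored fields can be shown verbatim alongside a search result.
        boolean retrievable() {
            return stored;
        }
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.printf("%-17s analyzed=%-5b indexed=%-5b stored=%-5b%n",
                    kind, kind.analyzed, kind.indexed, kind.stored);
        }
    }
}
```

Read this way, "indexed" answers whether you can search on the field, "stored" answers whether you can get the original value back, and "analyzed" only says whether the value was broken into tokens first.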

posted on 2006-07-31 14:48 思无邪
