Knowledge about index

  • The indexing process

Content Source (Start Addresses) -> Protocol Handler (e.g. HTTP) -> IFilters (Office docs, HTML files, PDFs, etc.) -> Word Breakers -> Stemmers -> Noise Words -> Index Catalog

The indexing process starts with the content source and its start address(es). Once the protocol handler and IFilter are loaded, the content is collected as a stream of text. The text is then passed through the word breakers, the stemmers, and the noise-word filter. Finally, the resulting terms are added to the index.
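The stages above can be pictured with a toy pipeline. This is only a minimal sketch, not the actual search service code: the noise-word list, the suffix-stripping stemmer, and the names break_words, stem, and index_document are all hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical noise-word list; the real product ships per-language noise files.
NOISE_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def break_words(text):
    """Word breaker: split a text stream into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_document(catalog, doc_id, text):
    """Word breaking -> stemming -> noise-word removal -> add to the catalog."""
    for position, token in enumerate(break_words(text)):
        term = stem(token)
        if term in NOISE_WORDS:
            continue  # noise words never reach the index
        catalog[term].add((doc_id, position))

# The catalog maps each term to the (document, position) pairs where it occurs.
catalog = defaultdict(set)
index_document(catalog, "doc1", "The crawler is indexing documents")
index_document(catalog, "doc2", "Indexed documents and crawled sites")
```

After both calls, "indexing" and "Indexed" share the single term "index", which is why a query for one form can match documents containing the other.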

Protocol handlers make the initial connection to a content source and tell the gatherer how to connect to the content. There are protocol handlers for the following types of content:

1)      SharePoint Servers

2)      Web Sites

3)      File Shares

4)      Exchange Public Folders

5)      Business Data

6)      Lotus Notes

Start addresses use schemes such as sts2://, sts3://, http://, spsimport://, anchorqh://, etc. There are specific protocol handlers for HTTP, WSS, SMB, and so on. The people search protocol is sps3.
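As a rough illustration of how a start address leads to a handler, the sketch below extracts the URL scheme and looks it up in a table. The HANDLERS table and pick_handler are hypothetical. Only the sps3 pairing comes from the text above; the http and sts3 pairings are assumptions.

```python
from urllib.parse import urlsplit

# Scheme -> content type. "sps3" -> people search is stated in the text;
# the other two pairings are assumptions made for this example.
HANDLERS = {
    "http": "Web Sites",
    "sts3": "SharePoint Servers",
    "sps3": "People search",
}

def pick_handler(start_address):
    """Return the content type handled for the scheme of a start address."""
    scheme = urlsplit(start_address).scheme.lower()
    if scheme not in HANDLERS:
        raise LookupError(f"no protocol handler registered for {scheme!r}")
    return HANDLERS[scheme]
```

For example, pick_handler("sps3://mysite") returns "People search", while an address with an unregistered scheme raises LookupError.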

 

IFilters are used to interpret different file formats. Office SharePoint Server 2007 includes IFilters for the following document types:

1)      Microsoft Office documents

2)      docx, xlsx, pptx

3)      HTML files

4)      TIFF files

5)      Text files

6)      Exchange Public Folders

7)      PDF (third party)

 

  • Index Propagation

Propagation is the process of copying the index from the indexing server to all query servers.

Propagation only occurs when the index and query components are on separate servers (e.g. a large farm with 2 WFEs, 2 query servers, an index server, and a SQL Server). Otherwise, the shadow index is merged with the current index and used for search queries.
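The rule above can be written as a small decision helper. This is a sketch under the assumption that servers are identified by case-insensitive host names. The function name propagation_targets is hypothetical and not part of any SharePoint API.

```python
def propagation_targets(index_server, query_servers):
    """Return the query servers that need a copy of the index.

    An empty list means the index and query roles share one server, so the
    shadow index is merged locally instead of being propagated.
    """
    return [s for s in query_servers if s.lower() != index_server.lower()]
```

With a dedicated index server, propagation_targets("IDX1", ["QRY1", "QRY2"]) returns both query servers. On a single-server setup such as propagation_targets("APP1", ["app1"]) it returns an empty list.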

                       

Propagation table names (2007 name, 2010 name, description):

1)      2007: MSSPropagationSearchServerReady; 2010: MSSPropagationTaskCompletions. Lists completed propagation tasks. One entry per Query Component that reports complete.

2)      2007: MSSPropagationPropagationTask; 2010: MSSPropagationTasks. Lists propagation tasks that need to be completed. One entry per task.

3)      2007: MSSPropagationSearchServerTable; 2010: MSSQueryComponents. Lists Query Components and their status. Columns include ComponentID, ServerName, Index Path and many others.

  

  • Location:

1)      The default location of index files is C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<Application GUID>\Projects.

2)      The index file location is set under Central Administration > Application Management > Manage this Farm's Shared Services > New Shared Services Provider.

3)      By default the catalog location will be \Program Files\Microsoft Office Servers\12.0\Data\Applications.

4)      Content sources are stored in the registry:

Run regedit and browse to HKEY_LOCAL_MACHINE (HKLM)\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<GUID>\Gather\Portal_Content\ContentSources\0

5)      Crawl rules are stored in the registry:

Crawl rules specify what content is crawled in a particular path. They are stored under [HKLM\SOFTWARE\Microsoft\Office server\12.0\search\applications\<application GUID>\gather\portal_content\sites\*\path\<numbered key>]
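For scripted inspection, the content source key from item 4 can be assembled and read with Python's standard winreg module. This is a minimal sketch: content_source_key and read_values are hypothetical helper names, the application GUID has to be supplied by the caller, and reading only works on a Windows server where the key exists.

```python
try:
    import winreg  # available on Windows only
except ImportError:
    winreg = None

SEARCH_APPS = r"SOFTWARE\Microsoft\Office Server\12.0\Search\Applications"

def content_source_key(app_guid, source_id=0):
    """Build the HKLM subkey path for one content source of a search application."""
    return "\\".join(
        [SEARCH_APPS, app_guid, "Gather", "Portal_Content", "ContentSources", str(source_id)]
    )

def read_values(subkey):
    """Return {name: data} for every value stored under the given HKLM subkey."""
    if winreg is None:
        raise RuntimeError("registry access requires Windows")
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        value_count = winreg.QueryInfoKey(key)[1]  # (subkeys, values, last modified)
        for i in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, i)
            values[name] = data
    return values
```

On the index server, read_values(content_source_key(guid)) would list whatever values are stored for content source 0. The same read_values helper can be pointed at a crawl rule key once the site and numbered key are known.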

posted @ 2012-05-20 20:58  l'oiseau