Knowledge about index
- The indexing process
Content Source (Start Addresses) -> Protocol Handler (e.g. HTTP) -> IFilters (Office docs, HTML files, PDFs, etc.) -> Word Breakers -> Stemmers -> Noise Words -> Index Catalog.
The indexing process starts with the content source and its start address(es). Once the protocol handler and IFilter are loaded, the content is collected as a stream of text. The text is then passed through the word breakers, stemmers and noise-word filtering. Finally, the data is added to the index.
Protocol handlers make the initial connection to a content source and tell the gatherer how to connect to the content. There are protocol handlers for the following types of content:
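The last stages of the pipeline above (word breaker, stemmer, noise words, index) can be sketched in a few lines of Python. This is only a toy illustration, not how the search service is implemented: the function names, the noise-word list and the suffix-stripping stemmer are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical noise-word list, for illustration only.
NOISE_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def break_words(text):
    """Word breaker: split a stream of text into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Naive stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Build an inverted index: term -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(break_words(text)):
            term = stem(word)
            if term in NOISE_WORDS:
                continue  # noise words never reach the index
            index[term].append((doc_id, position))
    return dict(index)


docs = {
    "doc1": "The crawler is indexing the documents",
    "doc2": "Indexed documents",
}
index = build_index(docs)
print(index["index"])  # both documents match the stemmed term "index"
```

A query for "indexing" or "indexed" would be stemmed the same way and land on the same entry, which is the point of running the stemmer before the terms are stored.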
1) SharePoint Servers
2) Web Sites
3) File Shares
4) Exchange Public Folders
5) Business Data
6) Lotus Notes
Examples of protocol prefixes are sts2://, sts3://, http://, spsimport://, anchorqh://, etc. There are specific protocol handlers for HTTP, WSS, SMB, etc. The people search protocol is sps3://.
IFilters are used to interpret different file formats. Office 2007 Server includes IFilters for the following document types:
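To make the idea of picking a handler by protocol prefix concrete, here is a small Python sketch. The lookup table and the function name `pick_handler` are hypothetical, and the descriptive labels (in particular the one for sts3) are assumptions rather than names taken from the product.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: protocol prefix -> handler description.
HANDLERS = {
    "http": "HTTP protocol handler (web sites)",
    "sps3": "People search protocol handler",
    "sts3": "SharePoint site protocol handler (assumed label)",
}


def pick_handler(start_address):
    """Return the handler description for a start address."""
    scheme = urlsplit(start_address).scheme  # urlsplit lowercases the scheme
    return HANDLERS.get(scheme, "no handler registered")


print(pick_handler("sps3://mysite"))
print(pick_handler("http://intranet/sites/team"))
```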
1) Microsoft Office documents
2) docx, xlsx, pptx
3) HTML files
4) TIFF files
5) Text files
6) Exchange Public Folders
7) PDF (third party)
- Index Propagation

Propagation is the process of copying the index from the indexing server to all query servers.
Propagation occurs only when the index and query components are on separate servers (e.g. a large farm with 2 WFEs, 2 query servers, an index server, and a SQL Server). Otherwise, the shadow index is merged with the current index on the same server and is used for search queries.
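The rule above can be written down as a small decision function. This is a simplified sketch under the assumption that a farm is described only by its index server name and a list of query server names; `plan_index_update` and the action names are made up for the example.

```python
def plan_index_update(index_server, query_servers):
    """Decide what happens to a freshly built shadow index.

    Returns a list of action tuples (hypothetical action names).
    """
    # Query servers other than the index server need a copy of the index.
    targets = [s for s in query_servers if s != index_server]
    if not targets:
        # Index and query roles share one server: no propagation, just merge.
        return [("merge_shadow_index", index_server)]
    return [("propagate", index_server, target) for target in targets]


print(plan_index_update("IDX01", ["IDX01"]))
print(plan_index_update("IDX01", ["QRY01", "QRY02"]))
```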
Propagation Table Names:
| 2007 | 2010 | Description |
| --- | --- | --- |
| MSSPropagationSearchServerReady | MSSPropagationTaskCompletions | Lists completed propagation tasks. One entry per Query Component that reports complete. |
| MSSPropagationPropagationTask | MSSPropagationTasks | Lists Propagation Tasks that need to be completed. One entry per task. |
| MSSPropagationSearchServerTable | MSSQueryComponents | Lists Query Components and their status. Columns include ComponentID, ServerName, Index Path and many others. |
- Location:
1) The default location of index files is C:\Program Files\Microsoft Office Servers\12.0\Data\Office Server\Applications\<Application GUID>\Projects.
2) The index file location is set in Central Administration > Application Management > Manage this Farm's Shared Services > New Shared Services Provider.
3) By default, the catalog location is \Program Files\Microsoft Office Servers\12.0\Data\Applications.
4) Content sources are stored in the registry:
Run regedit and open HKEY_LOCAL_MACHINE (HKLM)\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<GUID>\Gather\Portal_Content\ContentSources\0
5) Crawl rules are stored in the registry:
Crawl rules specify what content is crawled under a particular path. [HKLM\SOFTWARE\Microsoft\Office server\12.0\search\applications\<application GUID>\gather\portal_content\sites\*\path\<numbered key>]
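When looking these keys up repeatedly, it can help to build the paths from the application GUID instead of typing them. The helper below is a minimal sketch that only assembles the strings listed above; the function names are hypothetical, the GUID in the usage line is a placeholder, and treating the trailing `0` as a content source number is an assumption. Registry key names are case-insensitive, so the casing used here is only cosmetic.

```python
# Base key taken from the paths listed above.
APPLICATIONS_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications"
)


def content_source_key(app_guid, source_number=0):
    """Registry path of a content source (source_number is an assumed meaning)."""
    parts = [APPLICATIONS_KEY, app_guid, "Gather", "Portal_Content",
             "ContentSources", str(source_number)]
    return "\\".join(parts)


def crawl_rule_key(app_guid, numbered_key):
    """Registry path of one crawl rule, following the pattern shown above."""
    parts = [APPLICATIONS_KEY, app_guid, "Gather", "Portal_Content",
             "Sites", "*", "Path", str(numbered_key)]
    return "\\".join(parts)


# "<Application GUID>" is a placeholder, not a real identifier.
print(content_source_key("<Application GUID>"))
print(crawl_rule_key("<Application GUID>", 1))
```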
