ElasticSearch

Notes from Exploring ElasticSearch

 

Elasticsearch is a server for indexing and searching text, and its installation is very simple.

Elasticsearch is a standalone Java app and can easily be started from the command line. A copy can be obtained from the Elasticsearch download page.

 

Microsoft Windows:

Download the .zip version and unpack it to a folder. Navigate to the bin folder, then double-click elasticsearch.bat to run it.

 

If the server starts successfully, you'll see output in the terminal like this:

[2015-02-04 20:43:12,747][INFO ][node              ] [Joe Fixit] started

 

P.S.: Here's a problem you may run into. If the terminal prints messages like this:

[2014-12-17 09:31:03,820][WARN ][cluster.routing.allocation.decider]
[logstash test] high disk watermark [10%] exceeded on
[7drCr113QgSM8wcjNss_Mg][Blur] free: 632.3mb[8.4%], shards will be
relocated away from this node

[2014-12-17 09:31:03,820][INFO ][cluster.routing.allocation.decider]
[logstash test] high disk watermark exceeded on one or more nodes,
rerouting shards

It just means there isn't enough free space on your current disk. You only need to delete some files to free up space.

 

After you've started your server, you can confirm it's running properly by pointing your browser at http://localhost:9200. You should see a page like this:

{
  "status" : 200,
  "name" : "Joe Fixit",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.2",
    "build_hash" : "927caff6f05403e936c20bf4529f144f0c89fd8c",
    "build_timestamp" : "2014-12-16T14:11:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}

 

Since you're free to use any HTTP tool you wish to query Elasticsearch, we can install curl (via Cygwin on Windows) and query it from the command line.
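
For instance, a plain curl call against the root endpoint should return the same status JSON shown above (the ?pretty=true parameter just formats the output):

// query the root endpoint with curl
curl -XGET 'http://localhost:9200/?pretty=true'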

 

But if you're reading the book Exploring ElasticSearch, it's worth installing the tool made by the author: elastic-hammer. You can find detailed information on GitHub: https://github.com/andrewvc/elastic-hammer. It's very easy to install as a plugin; a sketch of the steps follows.
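
As best I recall from the plugin's README (check the GitHub page for the exact command for your Elasticsearch version), the ES 1.x plugin tool can install it straight from GitHub, after which the UI is served by the node itself:

// install the plugin (run from the elasticsearch folder)
bin/plugin --install andrewvc/elastic-hammer

// then open the plugin's UI in your browser
http://localhost:9200/_plugin/elastic-hammer/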

 

 

Modeling Data

field: the smallest individual unit of data.

document: a collection of fields; documents are the base unit of storage in Elasticsearch.

 

The primary data format Elasticsearch uses is JSON. A sample document:

{
	"_id" : 1,
	"handle" : "ron",
	"hobbies" : ["hacking", "the great outdoors"],
	"computer" : {"cpu" : "pentium pro", "mhz" : 200}
}

 

A user-defined type is analogous to a database schema. Types are defined with the Mapping API:

{
	"user" : {
		"properties" : {
			"handle" : {"type" : "string"},
			"age" : {"type" : "integer"},
			"hobbies" : {"type" : "string"},
			"computer" : {
				"properties" : {
					"cpu" : {"type" : string},
					"speed" : {"type" : "integer"}
				}
			}
		}
	}
}

 

 

Basic CRUD

The full CRUD lifecycle in Elasticsearch is Create, Read, Update, Delete. We'll create an index, then a type, and finally a document within that index using that type. The URL scheme is consistent for these operations: most URLs have the form /index/type/docid, and special operations on a given namespace are prefixed with an underscore.

// create an index named 'planet'
PUT /planet


// create a type called 'hacker'
PUT /planet/hacker/_mapping
{
	"hacker" : {
		"properties" : {
			"handle" : {"type" : "string"},
			"age" : {"type" : "long"}
		}
	}
}


// create a document
PUT /planet/hacker/1
{"handle" : "jean-michea", "age" : 18}


// retrieve the document
GET /planet/hacker/1
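
Retrieving the document should return a response like the following (the standard ES 1.x GET format; the _version value depends on how many times the document has been written):

// hypothetical response for GET /planet/hacker/1
{
	"_index" : "planet",
	"_type" : "hacker",
	"_id" : "1",
	"_version" : 1,
	"found" : true,
	"_source" : {"handle" : "jean-michea", "age" : 18}
}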


// update the document's age field
POST /planet/hacker/1/_update
{"doc" : {"age" : 19}}


// delete the document
DELETE /planet/hacker/1

 

 

Search Data

First, create our schema:

// Delete the document
DELETE /planet/hacker/1


// Delete any existing indexes named planet
DELETE /planet

// Create our index
PUT /planet/
{
	"mappings" : {
		"hacker" : {
			"properties" : {
				"handle" : {"type" : "string"},
				"hobbies" : {"type" : "string", "analyzer" : "snowball"}
			}
		}
	}
}

Then, seed some data from the hacker_planet.eloader dataset.

The data repository can be found at http://github.com/andrewvc/ee-datasets. After cloning the repository, you can load examples into your server by executing the included elastic-loader.jar program, providing the address of your Elasticsearch server and the path to the data file. For example, to load the hacker_planet dataset, open a command prompt in the ee-datasets folder and run:

java -jar elastic-loader.jar http://localhost:9200 datasets/hacker_planet.eloader

 

 

Finally, we can perform our search:

// Do the search
POST /planet/hacker/_search
{
	"query" : {
		"match" : {
			"hobbies" : "rollerblading"
		}
	}
}

The above query searches, among the three users we've created in the database, for those who like rollerblading.
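
The response follows the standard search-response shape: timing and shard info first, then the matching documents under hits.hits, each with a relevance _score. Something like this (all values here are illustrative, not taken from the real dataset):

// hypothetical response (values illustrative)
{
	"took" : 2,
	"timed_out" : false,
	"_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
	"hits" : {
		"total" : 1,
		"max_score" : 0.15,
		"hits" : [{
			"_index" : "planet",
			"_type" : "hacker",
			"_id" : "2",
			"_score" : 0.15,
			"_source" : {"handle" : "jaime", "hobbies" : ["rollerblading"]}
		}]
	}
}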

 

Searches in Elasticsearch are handled by the aptly named search API, which is exposed through the _search endpoint.

  • index search:            /myidx/_search
  • document type search:    /myidx/mytype/_search

For example:

// index search
POST /planet/_search
...

// document type search
POST /planet/hacker/_search
...

 

 

A complex search's skeleton

// Load Dataset: hacker_planet.eloader
POST /planet/_search
{
	"from" : 0,
	"size" : 15,
	"query" : {"match_all" : {}},
	"sort" : {"handle" : "desc"},
	"filter" : {"term" : {"_all" : "coding"}},
	"facet" : {
		"hobbies" : {
			"term" : {
				"field" : "hobbies"
			}
		}
	}
}

   

All Elasticsearch queries boil down to three tasks (a small example contrasting the first two follows this list):

  1. restricting the result set
  2. scoring (the default scoring algorithm is implemented in Lucene's TFIDFSimilarity class)
  3. sorting
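
Restricting without scoring is what filters are for; in the ES 1.x query DSL, the filtered query combines a scored query with an unscored filter. A minimal sketch against the hacker_planet data (the specific handle and hobby values here are just assumed examples):

// the filter restricts results to handle "ron"; only the hobbies match contributes to scoring
POST /planet/_search
{
	"query" : {
		"filtered" : {
			"query" : {"match" : {"hobbies" : "coding"}},
			"filter" : {"term" : {"handle" : "ron"}}
		}
	}
}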

 

 

Text Analysis

Elasticsearch has a toolbox with which we can slice and dice words so that they can be searched efficiently. Using these tools we can narrow our search space and find common ground between linguistically similar terms.

The Snowball analyzer is great at figuring out what the stems of English words are. The stem of a word is its root.

 

The process by which documents are analyzed is as follows:

  1. A document update or create is received via a PUT or POST.
  2. The field values in the document are each run through an analyzer which converts each value to zero, one, or more indexable tokens.
  3. The tokenized values are stored in an index, pointing back to the full version of the document.

The easiest way to see analysis in action is with the Analyze API:

GET /_analyze?analyzer=snowball&text=candles%20candle&pretty=true
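
Both words should come back reduced to the same stem, which is why a search for one matches the other. The response looks like this (offsets and positions as Elasticsearch 1.x reports them):

{
	"tokens" : [ {
		"token" : "candl",
		"start_offset" : 0,
		"end_offset" : 7,
		"type" : "<ALPHANUM>",
		"position" : 1
	}, {
		"token" : "candl",
		"start_offset" : 8,
		"end_offset" : 14,
		"type" : "<ALPHANUM>",
		"position" : 2
	} ]
}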

 

An analyzer is really a three-stage pipeline comprised of the following execution steps:

  1. Character filtering: turns the input string into a different string
  2. Tokenization: turns the char-filtered string into an array of tokens
  3. Token filtering: post-processes the filtered tokens into a mutated token array

 

Let's dive in by building a custom analyzer for tokenizing CSV data. Custom analyzers can be stored at the index level, either during or after index creation. Let's:

  1. create a "recipes" index
  2. close it
  3. update the analysis settings
  4. reopen it (in order to experiment with the custom analyzer)

// Create the index
PUT /recipes

// Close the index for settings update
POST /recipes/_close

// Create the analyzer
PUT /recipes/_settings
{
	"index" : {
		"analysis" : {
			"tokenizer" : {
				"comma" : {"type" : "pattern", "pattern" : ","}
			},
			"analyzer" : {
				"recipe_csv" : {
					"type" : "custom",
					"tokenizer" : "comma",
					"filter" : ["trim", "lowercase"]
				}
			}
		}
	}
}

// Reopen the index
POST /recipes/_open
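
With the index reopened, we can check the analyzer's behavior using the same Analyze API as before, scoped to the index. The sample text here is just an assumed example; each comma-separated chunk should come back as a single trimmed, lowercased token:

// expected tokens: "eggs", "flour", "baking soda"
GET /recipes/_analyze?analyzer=recipe_csv&text=Eggs,%20Flour,%20Baking%20Soda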

 

Faceting

Facets are always attached to a query, letting you return aggregate statistics alongside regular query results. We'll create a database of movies and return facets based on the movies' genres alongside standard query results. As usual, we need to load the movie_db.eloader dataset into the Elasticsearch server.

Simple movie mapping:

// Load Dataset: movie_db.eloader
GET /movie_db/movie/_mapping?pretty=true
{
	"movie" : {
		"properties" : {
			"actors" : {"type" : "string", "analyzer" : "standard", "position_offset_gap" : 100},
			"genre" : {"type" : "string", "index" : "not_analyzed"},
			"release_year" : {"type" : "integer", "index" : "not_analyzed"},
			"title" : {"type" : "string", "analyzer" : "snowball"},
			"description" : {"type" : "string", "analyzer" : "snowball"} 
		}
	}
}

 

Simple terms faceting:

// Load Dataset: movie_db.eloader
POST /movie_db/_search
{
	"query" : {"match" : {"description" : "hacking"}},
	"facets" : {
		"genre" : {
			"terms" : {"field" : "genre"},
			"size" : 10
		}
	}
}

This query searches for movies with a description containing "hacking". Alongside the regular hits, it returns a genre facet showing which genres those matching movies fall into, and how many matching movies are in each genre.
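
The facet results come back in the response next to the usual hits, in the shape below (the genre names and counts here are illustrative, not from the real dataset):

// hypothetical "facets" portion of the search response
"facets" : {
	"genre" : {
		"_type" : "terms",
		"missing" : 0,
		"total" : 2,
		"other" : 0,
		"terms" : [
			{"term" : "Sci-Fi", "count" : 1},
			{"term" : "Thriller", "count" : 1}
		]
	}
}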

 
