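Secondary Indexes in Cassandra 0.7

The official Cassandra blog just published a post about secondary indexes in Cassandra. It is well written, so I translated it to share with everyone :)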

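Overview

In Cassandra, an index on column values is called a "secondary index", to distinguish it from the index on row keys that every column family (ColumnFamily) has. Secondary indexes let us query by column value, and they do not block read or write operations.

The best way to understand secondary indexes is through a concrete example. Here we use the command-line tool (CLI) that ships with Cassandra and a column family named users: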

$ bin/cassandra-cli --host localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
[default@unknown] create keyspace demo;
[default@unknown] use demo;
[default@demo] create column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

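In the example above we created a keyspace named demo, then a column family named users with two declared columns, full_name and birth_date. Only birth_date gets a secondary index (index_type: KEYS); full_name is given a validation class but no index.

Cassandra 0.7.0 supports only KEYS indexes, which work much like a hash index. Bitmap indexes are planned for a future release.

Next, let's add some test data to the users column family: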
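
To make the hash-index comparison concrete, here is a minimal Python sketch of an index that maps each column value to the set of row keys holding that value. The class `KeysIndexSketch` and its method names are hypothetical and exist only for illustration; this is not Cassandra's actual implementation. The data mirrors the sample rows used in this article.

```python
from collections import defaultdict


class KeysIndexSketch:
    """Toy model of a hash-style index: column value -> set of row keys."""

    def __init__(self):
        self._rows_by_value = defaultdict(set)

    def put(self, row_key, value):
        # Record that row_key currently holds this value.
        self._rows_by_value[value].add(row_key)

    def lookup_eq(self, value):
        # Equality is a single hash lookup; sorted only for stable output.
        return sorted(self._rows_by_value.get(value, set()))


birth_date_index = KeysIndexSketch()
birth_date_index.put("bsanderson", 1975)
birth_date_index.put("prothfuss", 1973)
birth_date_index.put("htayler", 1968)

print(birth_date_index.lookup_eq(1973))  # ['prothfuss']
```

A structure like this answers "which rows have value X" directly, but it keeps no ordering between values, so it cannot answer "which rows have a value greater than X" without scanning every entry.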
[default@demo] set users[bsanderson][full_name] = 'Brandon Sanderson';
[default@demo] set users[bsanderson][birth_date] = 1975;
[default@demo] set users[prothfuss][full_name] = 'Patrick Rothfuss';
[default@demo] set users[prothfuss][birth_date] = 1973;
[default@demo] set users[htayler][full_name] = 'Howard Tayler';
[default@demo] set users[htayler][birth_date] = 1968;

Now we can query the test data we just wrote to Cassandra:
[default@demo] get users where birth_date = 1973;
-------------------
RowKey: prothfuss
=> (column=birth_date, value=1973, timestamp=1291333944389000)
=> (column=full_name, value=Patrick Rothfuss, timestamp=1291333940538000)

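Adding an index

Now suppose a new requirement comes up: we want to query on the state column. To handle it, we only need to add one more secondary index to the users column family.

First, let's add some more test data: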
[default@demo] set users[bsanderson][state] = 'UT';
[default@demo] set users[prothfuss][state] = 'WI';
[default@demo] set users[htayler][state] = 'UT';

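The state column is not indexed yet, but it can still appear in a query, as long as the query also contains an equality predicate on an indexed column: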
[default@demo] get users where state = 'UT';
No indexed columns present in index clause with operator EQ

[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ

[default@demo] get users where birth_date = 1968 and state = 'UT';
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=5554, timestamp=1291334890708000)
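
A side note on the output: the state value is shown as 5554 rather than UT. The article does not explain this; a plausible reading (an assumption on my part) is that state has no validation_class at this point, so the CLI shows the raw bytes in hexadecimal. The short Python check below only confirms that 5554 is the hex encoding of the string UT:

```python
# Decode the hex string the CLI printed for the state column.
raw = bytes.fromhex("5554")
decoded = raw.decode("utf-8")
print(decoded)  # UT
```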

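One thing to note here, and it is the reason we say a KEYS index is more like a hash index than a B-tree index: even though birth_date has a secondary index, Cassandra cannot answer a range query such as "> 1970" on its own. As the error message says, the query needs at least one equality (EQ) predicate on an indexed column.

Finally, we add a secondary index on the state column so that we can query on state by itself: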
[default@demo] update column family users with comparator=UTF8Type
... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

With the index in place, we can query on the state column alone:
[default@demo] get users where state = 'UT';
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)
-------------------
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=1291334749160000)
=> (column=state, value=UT, timestamp=1291334890708000)

[default@demo] get users where state = 'UT' and birth_date > 1970;
-------------------
RowKey: bsanderson
=> (column=birth_date, value=1975, timestamp=1291333936242000)
=> (column=full_name, value=Brandon Sanderson, timestamp=1291333931790000)
=> (column=state, value=UT, timestamp=1291334909266000)

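Because the query now contains an equality predicate on an indexed column (state), Cassandra can apply the range predicate on birth_date as well.

Programmatic access

With the Python client pycassa, the query looks like this: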
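
The behavior seen in the CLI session can be modeled roughly as follows: one equality predicate on an indexed column selects candidate rows, and every predicate (including ranges) is then checked against those candidates. This is a simplified sketch under that assumption, not Cassandra's real query path; `build_index` and `get_indexed_rows` are hypothetical helper names.

```python
import operator

# Sample rows from the CLI session in this article.
ROWS = {
    "bsanderson": {"full_name": "Brandon Sanderson", "birth_date": 1975, "state": "UT"},
    "prothfuss": {"full_name": "Patrick Rothfuss", "birth_date": 1973, "state": "WI"},
    "htayler": {"full_name": "Howard Tayler", "birth_date": 1968, "state": "UT"},
}

OPS = {"EQ": operator.eq, "GT": operator.gt, "LT": operator.lt}


def build_index(rows, column):
    """Hash-style index: column value -> list of row keys."""
    index = {}
    for key, columns in rows.items():
        if column in columns:
            index.setdefault(columns[column], []).append(key)
    return index


def get_indexed_rows(rows, indexes, expressions):
    """expressions is a list of (column, op_name, value) tuples."""
    # An EQ predicate on an indexed column is needed to find candidates.
    driver = next(
        ((col, val) for col, op, val in expressions if op == "EQ" and col in indexes),
        None,
    )
    if driver is None:
        raise ValueError("No indexed columns present in index clause with operator EQ")
    col, val = driver
    candidates = indexes[col].get(val, [])
    # Every expression, ranges included, is then checked row by row.
    return sorted(
        key for key in candidates
        if all(c in rows[key] and OPS[op](rows[key][c], v) for c, op, v in expressions)
    )


indexes = {
    "birth_date": build_index(ROWS, "birth_date"),
    "state": build_index(ROWS, "state"),
}
query = [("state", "EQ", "UT"), ("birth_date", "GT", 1970)]
print(get_indexed_rows(ROWS, indexes, query))  # ['bsanderson']
```

Running the same query with only the birth_date index raises the error, which matches what the CLI reported before state was indexed.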

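import pycassa
from pycassa.cassandra.ttypes import IndexOperator

# `users` is assumed to be an existing pycassa.ColumnFamily for the users column family
state_expr = pycassa.create_index_expression('state', 'UT')
birth_expr = pycassa.create_index_expression('birth_date', 1970, op=IndexOperator.GT)
clause = pycassa.create_index_clause([state_expr, birth_expr])
result = users.get_indexed_slices(clause)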

With the Java client Hector, the query looks like this:

StringSerializer ss = StringSerializer.get();
IndexedSlicesQuery<String, String, String> indexedSlicesQuery = HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss);
indexedSlicesQuery.setColumnNames("full_name", "birth_date", "state");
indexedSlicesQuery.addGtExpression("birth_date", 1970L);
indexedSlicesQuery.addEqualsExpression("state", "UT");
indexedSlicesQuery.setColumnFamily("users");
indexedSlicesQuery.setStartKey("");
QueryResult<OrderedRows<String, String, String>> result = indexedSlicesQuery.execute();

See the pycassa documentation and the Hector documentation for more details.

More articles about Cassandra: http://www.cnblogs.com/gpcuster/tag/Cassandra/

posted on 2010-12-06 10:54 by 逖靖寒
