Example: A common column family Socorro uses is "ids:", and a common column in that family is "ids:ooid", where "ooid" is the column qualifier. Another column is "ids:hang".
The table schema enumerates the column families that are part of it. Each column family carries metadata about compression, the number of value versions retained, and caching.
A column family can store tens of thousands of values with different column qualifier names.
Retrieving data from multiple column families requires at least one block access (disk or memory) per column family. Accessing multiple columns in the same family requires only one block access.
If you specify just the column family name when retrieving data, the values for all columns in that column family will be returned.
If a record does not contain a value for a particular column in a set of columns you query for, no "null" is returned; the returned row simply has no entry for that column.
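The retrieval rules above can be sketched with a toy in-memory row. This is not the HBase client API; the helper get_columns, the "meta_data:json" column, and all cell values are hypothetical and exist only to show how a family name selects every column in that family and how an absent column produces no entry at all.

```python
# Toy model of a single row: a mapping of "family:qualifier" to value.
# The column names follow the "ids:" example; the values are made up.
row = {
    "ids:ooid": "abc123",
    "ids:hang": "hang-001",
    "meta_data:json": "{}",  # hypothetical column in a second family
}


def get_columns(row, requested):
    """Return the cells matching the requested names.

    A name ending in ':' is treated as a whole column family.
    Any other name is an exact 'family:qualifier' column.
    Columns the row does not have are omitted, not returned as None.
    """
    result = {}
    for name in requested:
        if name.endswith(":"):
            # Family name only: return every column in that family.
            for column, value in row.items():
                if column.startswith(name):
                    result[column] = value
        elif name in row:
            result[name] = row[name]
    return result


# Asking for the family returns all of its columns.
family_cells = get_columns(row, ["ids:"])

# Asking for a column the row lacks yields no entry for it.
sparse_cells = get_columns(row, ["ids:ooid", "ids:missing"])
```

Callers therefore test for the presence of a key in the returned row rather than comparing a value against a null marker.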
Manipulating a row
All manipulations are performed using a rowkey.
Setting a column to a value creates the row if it doesn't exist, or updates the column if it already exists.
Deleting a non-existent row or column is a no-op.
Counter column increments are atomic and very fast. StumbleUpon has some counters that they increment hundreds of times per second.
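The row manipulation semantics above can be illustrated with a small in-memory stand-in. ToyTable and its method names are hypothetical and are not the HBase client API; the lock is only a stand-in for the atomicity HBase gives counter increments, and the rowkeys and column names in the usage lines are made up.

```python
import threading


class ToyTable:
    """In-memory sketch of put / delete / increment semantics."""

    def __init__(self):
        self.rows = {}  # rowkey -> {"family:qualifier": value}
        self._lock = threading.Lock()

    def put(self, rowkey, column, value):
        # Creates the row if it is missing, otherwise updates the column.
        self.rows.setdefault(rowkey, {})[column] = value

    def delete(self, rowkey, column=None):
        # Deleting something that does not exist is a no-op.
        if column is None:
            self.rows.pop(rowkey, None)
            return
        row = self.rows.get(rowkey)
        if row is not None:
            row.pop(column, None)
            if not row:
                # A row with no cells left no longer exists.
                del self.rows[rowkey]

    def increment(self, rowkey, column, amount=1):
        # Read-modify-write under a lock, so concurrent callers
        # never lose an update. Only this method models atomicity.
        with self._lock:
            row = self.rows.setdefault(rowkey, {})
            row[column] = row.get(column, 0) + amount
            return row[column]


table = ToyTable()
table.put("crash-1", "ids:ooid", "first")    # creates the row
table.put("crash-1", "ids:ooid", "second")   # updates the column
table.delete("no-such-row")                  # no-op
table.delete("crash-1", "ids:absent")        # no-op
counts = [table.increment("stats", "counters:hits") for _ in range(3)]
```

Every operation takes a rowkey as its first argument, which mirrors the point that all manipulations go through the rowkey.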
Tables are always ordered by their rowkeys
Scanning a range of a table based on a rowkey prefix or a start and end range is fast.
Retrieving a row by its key is fast.
Searching for a row requires a rowkey structure that you can easily do a range scan on, or a reverse index table.
A full scan on a table that contains billions of items is slow (although, unlike in an RDBMS, it isn't likely to cause performance problems).
If you are continually inserting rows that have similar rowkey prefixes, you are beating up on a single RegionServer, because adjacent rowkeys live in the same region. At high write rates, that server becomes a bottleneck.
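A minimal sketch of why rowkey ordering matters, using a sorted Python list in place of a table. The rowkeys, the split points, and the helper names prefix_scan, range_scan, and region_for are all hypothetical; real region boundaries are chosen by HBase, not by the client. The start key is treated as inclusive and the stop key as exclusive.

```python
import bisect
from collections import Counter

# Rowkeys kept in sorted order, as a table would keep them. Made-up keys.
rowkeys = sorted([
    "100601aaa", "100601bbb", "100602ccc", "100603ddd", "100701eee",
])


def prefix_scan(sorted_keys, prefix):
    """Binary-search to the first candidate, then read until the prefix ends."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


def range_scan(sorted_keys, start_key, stop_key):
    """Return keys in [start_key, stop_key) using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Hypothetical split points: region 0 holds keys below "1006",
# region 1 holds ["1006", "1007"), region 2 holds ["1007", "1008"), etc.
split_points = ["1006", "1007", "1008"]


def region_for(rowkey):
    """Index of the region whose key range contains rowkey."""
    return bisect.bisect_right(split_points, rowkey)


by_prefix = prefix_scan(rowkeys, "100601")
by_range = range_scan(rowkeys, "100602", "100701")

# Count how many keys land in each region.
load = Counter(region_for(key) for key in rowkeys)
```

In this toy layout, four of the five keys share the "1006" prefix and all four fall into region 1, which is the single-server write pattern described above. The same ordering is what makes the prefix and range scans cheap: each one touches only a contiguous slice of the keys.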