HBase 10-Minute Quick Start

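A column-family database such as HBase can be operated with SQL once Phoenix is layered on top of it, which makes it as simple to use as MySQL.

Of course, this post does not cover Phoenix; it only covers basic HBase usage.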

 

Introduction

HBase offers an alternative to Hive, which is based on HDFS and takes a write-once, read-many approach.

HBase is a column-oriented NoSQL system modeled on Google's Bigtable and built on top of HDFS.

Ideal for:

  • Large, sparsely populated tables
  • Real-time processing
  • Read/write random access

HBase is characterized as a sparse, distributed, multi-dimensional, sorted map.

HBase organizes data in a column-oriented way, using column families. All cells of a row are kept together under the same row key, but data that would form a single tuple in a relational table may be stored separately in different columns.

HBase offers:

  • Random (row-level) read/write access
  • Strong consistency
  • "Schema-less" or flexible data modeling

Differences from relational tables:

  • Cell values are uninterpreted byte arrays (no notion of different data types)
  • Individual cells of a table can be versioned, storing the history of values for a cell
  • Each row has a row key, with the table sorted by the row key
  • Atomicity is only guaranteed at the row level (no atomicity for multi-row updates)
  • Major difference: columns in HBase are not the same as columns in relational tables!
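
Because row keys are uninterpreted byte arrays and the table is sorted by row key, rows are ordered byte by byte rather than numerically. The Python sketch below only mimics that ordering with the built-in sorted function; it is not HBase client code, and the keys and the zero-padding convention are assumptions made for illustration.

```python
# Toy illustration: row keys are compared as raw bytes, not as numbers.
row_keys = [b"2", b"10", b"1"]

# Byte-wise (lexicographic) order: b"10" sorts before b"2".
byte_order = sorted(row_keys)

# Zero-padding the keys to a fixed width (an assumed convention, not
# something HBase does for you) makes byte order match numeric order.
padded = sorted(key.zfill(3) for key in row_keys)

print(byte_order)
print(padded)
```

Fixed-width keys such as '001' and '002' sidestep the surprise of '10' sorting before '2'.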

HBase and RDBMS comparison

HBase | RDBMS
HBase is schema-less: it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables and is horizontally scalable. | It is thin and built for small tables; it is hard to scale.
HBase has no multi-row transactions (atomicity is guaranteed only at the row level). | An RDBMS is transactional.
It has de-normalized data tables. | It has normalized data.
It is good for semi-structured as well as structured data. | It is good for structured data.

Features of HBase

  • HBase is linearly scalable.
  • It supports automatic failover.
  • It provides consistent reads and writes.
  • It integrates with HDFS, both as a source and a destination.
  • It has an easy-to-use Java API for clients.
  • It provides data replication across clusters.
  • It provides fast random access to available data.

 

 

Columns in HBase:

  • Columns are known as column qualifiers
  • Column qualifiers are organized into column families
  • Column families conceptually organize column qualifiers into groups that have the same access patterns
  • Column families must be defined when a table is created
  • The number of column qualifiers in a column family is dynamic and can be defined as needed, giving HBase flexibility for dealing with unstructured data
  • The number of column families should be limited to no more than two or three families for storage efficiency
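
The split between fixed column families and dynamic column qualifiers can be sketched with a toy in-memory class. This is not the HBase client API: ToyTable is a hypothetical name, and the nickname qualifier and contact family are made up for the example.

```python
class ToyTable:
    """In-memory stand-in for a table; not the HBase client API."""

    def __init__(self, *families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.cells = {}  # (row_key, family, qualifier) -> value

    def put(self, row_key, column, value):
        # A column is addressed as 'family:qualifier'.
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers need no prior declaration.
        self.cells[(row_key, family, qualifier)] = value


table = ToyTable("personinfo", "sales")
table.put("001", "personinfo:name", "J. Smith")
table.put("001", "personinfo:nickname", "JS")  # new qualifier, no schema change

try:
    # 'contact' was never declared as a column family, so this is rejected.
    table.put("001", "contact:email", "js@example.com")
    rejected = False
except KeyError:
    rejected = True

print(rejected)
```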

 

 

Data with missing values can be stored more efficiently as sparse columns, as in this social media example:
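
As a rough stand-in for such an example, the Python sketch below uses made-up profile data (the user names and fields are assumptions, not taken from the original example) to compare how many slots a dense layout reserves with how many cells a sparse layout actually stores.

```python
# Hypothetical profiles: each user fills in only some of the fields.
profiles = {
    "user1": {"name": "Ann", "city": "Oslo"},
    "user2": {"name": "Bob", "website": "example.com"},
    "user3": {"name": "Cy"},
}

all_columns = sorted({col for row in profiles.values() for col in row})

# A dense (relational-style) layout reserves a slot per row and column.
dense_slots = len(profiles) * len(all_columns)

# A sparse layout stores only the cells that actually hold a value.
sparse_cells = sum(len(row) for row in profiles.values())

print(all_columns, dense_slots, sparse_cells)
```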

 

Table cells are also stored as versions, each with a UTC timestamp:

 

 

 

Two column families: PersonInfo and Sales

PersonInfo column family has two column qualifiers: Name and Address

Sales column family has two column qualifiers: Territory and SalesYTD

All values have a timestamp

Territory for row key 002 has multiple values that have changed over time (same for Address of row key 004)

The above can be represented as this JSON-like map:

{"001" : {"PersonInfo" :
		{"Name" :
			{TS1: "J. Smith"},
		 "Address" :
			{TS1: "Oak St"}},
	  "Sales" :
		{"Territory" :
			{TS1: "East"},
		 "SalesYTD" :
			{TS1: "100,000"}}},
 "002" : {"PersonInfo" :
		{"Name" :
			{TS2: "B. Parker"},
		 "Address" :
			{TS2: "Elm St."}},
	  "Sales" :
		{"Territory" :
			{TS4: "West",
			 TS3: "East",
			 TS2: "North"},
		 "SalesYTD" :
			{TS2: "90,000"}}},
 …}
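
The same nested map can be written as a Python dict to show how a read resolves to the newest version of a cell. This is only a sketch: TS1 to TS4 are placeholders in the original, so the integer values assigned here are assumptions made purely so the timestamps can be ordered, and latest is a hypothetical helper name.

```python
# Placeholder timestamps; the integer values are assumed for ordering only.
TS1, TS2, TS3, TS4 = 1, 2, 3, 4

table = {
    "001": {
        "PersonInfo": {"Name": {TS1: "J. Smith"}, "Address": {TS1: "Oak St"}},
        "Sales": {"Territory": {TS1: "East"}, "SalesYTD": {TS1: "100,000"}},
    },
    "002": {
        "PersonInfo": {"Name": {TS2: "B. Parker"}, "Address": {TS2: "Elm St."}},
        "Sales": {
            "Territory": {TS4: "West", TS3: "East", TS2: "North"},
            "SalesYTD": {TS2: "90,000"},
        },
    },
}


def latest(table, row_key, family, qualifier):
    """Return the value with the highest timestamp for one cell."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]


print(latest(table, "002", "Sales", "Territory"))
```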

HBase table design issues

HBase table design is more complex than relational table design because it is driven by data access patterns.

Tables should be structured to return data in an efficient manner.

Important considerations:

  • What values should compose the key?
  • What column families are needed (specified at table creation)?
  • What column qualifiers are associated with each column family?

 

HBase Shell Environment

The HBase shell allows a user to interactively:

  • Create tables
  • Insert data
  • Retrieve data

Main commands

  • create
  • put
  • get
  • scan

Example (create tablename, family-name, ...):

hbase> create 'salesperson', 'personinfo', 'sales'

Alternate syntax with version specification

hbase> create 'salesperson', {NAME => 'personinfo', VERSIONS => 3}, {NAME => 'sales', VERSIONS => 3}

Data manipulation

The put command is used to write a single cell

hbase> put 'salesperson','001','personinfo:name','J. Smith'
hbase> put 'salesperson','001','personinfo:address','Oak St.'
hbase> put 'salesperson','001','sales:territory','East'
hbase> put 'salesperson','001','sales:salesytd','100,000'

To change a cell value, use the put command to write to the same cell. HBase automatically assigns a timestamp to each cell version.

hbase> put 'salesperson','002','sales:territory','North'
hbase> put 'salesperson','002','sales:territory','East'
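
How repeated puts to one cell build up versions can be sketched with a toy in-memory model. It is not the HBase client API: ToyCellStore is a hypothetical name, a simple counter stands in for the timestamp that HBase would assign, max_versions mimics the VERSIONS => 3 setting shown earlier, and the 'West' and 'South' values are added only to show the oldest version being dropped.

```python
import itertools


class ToyCellStore:
    """Toy model of versioned cells; not the HBase client API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions   # mimics VERSIONS => 3
        self.cells = {}                    # (row, column) -> [(ts, value), ...]
        self._clock = itertools.count(1)   # stand-in for a real timestamp

    def put(self, row, column, value):
        versions = self.cells.setdefault((row, column), [])
        versions.insert(0, (next(self._clock), value))  # newest first
        del versions[self.max_versions:]                # drop the oldest extras

    def get(self, row, column, versions=1):
        return [value for _, value in self.cells[(row, column)][:versions]]


store = ToyCellStore(max_versions=3)
for territory in ["North", "East", "West", "South"]:
    store.put("002", "sales:territory", territory)

newest = store.get("002", "sales:territory")
history = store.get("002", "sales:territory", versions=3)
print(newest, history)
```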

Retrieve all contents of a row

hbase> get 'salesperson','001'

Restrict data of a row to a timestamp range

hbase> get 'salesperson','001',{TIMERANGE => [ts1, ts5]}
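
A time range in HBase is half-open: the start timestamp is included and the end timestamp is excluded. The Python sketch below shows only that boundary rule; in_time_range is a hypothetical helper, and the small integer timestamps are placeholders (real ones are typically milliseconds since the epoch).

```python
def in_time_range(timestamp, min_ts, max_ts):
    """True when min_ts <= timestamp < max_ts (max_ts is excluded)."""
    return min_ts <= timestamp < max_ts


# Placeholder versions of one cell, keyed by timestamp.
versions = {1: "North", 3: "East", 5: "West"}

# Equivalent in spirit to a time range of [1, 5): timestamp 5 is left out.
selected = {ts: v for ts, v in versions.items() if in_time_range(ts, 1, 5)}
print(selected)
```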

Retrieve a specific column value of a row

hbase> get 'salesperson','001',{COLUMN => 'personinfo:name'}

Retrieve several column values of a row

hbase> get 'salesperson','001',{COLUMN => ['personinfo:name','personinfo:address']}

Indicate number of versions to retrieve

hbase> get 'salesperson','001',{COLUMN => ['personinfo:name','personinfo:address'],VERSIONS => 3}

Filters can be specified on columns

hbase> get 'salesperson','001',{FILTER => "ValueFilter(=,'binary:Oak St.')"}

A ValueFilter is used to filter cell values using a comparison operator and a comparator (in single quotes).

The comparator gives a comparison type and a value, separated by a colon (for example, binary:Oak St.).

Comparison types:

  • binary (byte-to-byte comparison of values)
  • binaryprefix (the cell value should begin with the given value)
  • regexstring (specifies a pattern for the value)
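
The three comparison types can be approximated in Python for the = operator. This is a sketch of the semantics only, not HBase's filter implementation: matches is a hypothetical helper, and it assumes that a regexstring pattern may match anywhere in the value.

```python
import re


def matches(comparator, cell_value):
    """Evaluate an '=' comparison for a 'type:value' comparator string."""
    kind, _, expected = comparator.partition(":")
    if kind == "binary":
        # Whole value must be identical, byte for byte.
        return cell_value.encode() == expected.encode()
    if kind == "binaryprefix":
        # Only the leading bytes are compared.
        return cell_value.encode().startswith(expected.encode())
    if kind == "regexstring":
        # Assumption: the pattern may match anywhere in the value.
        return re.search(expected, cell_value) is not None
    raise ValueError(f"unsupported comparator type: {kind}")


print(matches("binary:Oak St.", "Oak St."))
print(matches("binaryprefix:Oak", "Oak St."))
print(matches("regexstring:St\\.$", "Elm St."))
```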

The following statement will read the entire table and print the data in key-value pair format:

hbase> scan 'salesperson'

Specify start and stop row keys for the scan

hbase> scan 'salesperson',{STARTROW =>'001',STOPROW => '003'}
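
Because rows are kept sorted by row key, a range scan amounts to taking a slice of that order, with the start row included and the stop row excluded. The Python sketch below illustrates this with the standard bisect module; scan_keys is a hypothetical helper and the key list is assumed for the example.

```python
from bisect import bisect_left

sorted_keys = ["001", "002", "003", "004"]  # row keys in sorted order


def scan_keys(keys, start_row, stop_row):
    """Return keys with start_row <= key < stop_row (stop row excluded)."""
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    return keys[lo:hi]


result = scan_keys(sorted_keys, "001", "003")
print(result)
```

With these keys, the scan from '001' to '003' covers rows '001' and '002' only.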

Specify a timestamp for the data to retrieve

hbase> scan 'salesperson', {TIMESTAMP => TS1}

Scan and return key-value pairs of the indicated columns

hbase> scan 'salesperson',{COLUMN => ['personinfo:name','personinfo:address']}

Scan and return key-value pairs of all columns in a column family

hbase> scan 'salesperson',{COLUMN => 'personinfo'}

 


Source: https://jcsites.juniata.edu/faculty/rhodes/smui/hbase.htm

 

posted @ 2022-09-30 10:34  zjsxwc