作为后台开发，平时接触到的数据库产品大概分为两类：一是以mysql、ORACLE为代表的关系型数据库，二是以mongodb、redis、HBase为代表的NoSQL数据库。前者特点是原生支持事务、具有较高的稳定性和易用性、但是难以横向扩容，这类数据库产品常称为OLTP产品；后者属于原生支持横行扩展的新型非关系型数据库，具有高效读写等特点，这类产品常称为OLAP产品。

而腾讯的分布式HTAP数据库 TBase（TencentDB for TBase）是自主研发的分布式数据库系统，集高扩展性、高SQL兼容度、完整的分布式事务支持、多级容灾能力以及多维度资源隔离等能力于一身。

为了体验TBase的灵活横向扩展性，本文重点体验分布式数据自动 shard 分片和冷热分离存储两大特征

一、TBase主要特点

TBase 具备事务和分析混合处理技术。提供了高效的 OLTP 能力和海量的 OLAP 同时处理能力，可降低业务架构复杂度和成本。
TBase 引入全局事务管理节点来管理分布式事务，通过拥有自主专利的分布式事务一致性技术来保证在全分布式环境下的事务一致性。
支持实时在线自动扩容，满足横行扩展的大数据需求，且对业务影响时间可以控制在秒级。
内核支持三权分立的体系，提供数据透明加密，数据脱敏访问，强制访问控制等多个层级的数据安全保障能力。

二、TBase架构简述

TBase架构图

GTM

GlobalTransactionManager（简称 GTM），是全局事务管理器，负责全局事务管理。GTM 上不存储业务数据。

Coordinato（类似Hadoop HDFS的NameNode）

Coordinator（简称 CN）是协调节点，是数据库服务的对外入口，负责数据的分发和查询规划，多个节点位置对等。业务请求发送给 CN 后，无需关心数据计算和存储的细节，由 CN 统一返回执行结果。 CN 上只存储系统的元数据，并不存储实际的业务数据，可以配合支持业务接入增长动态增加。

Datanode（类似Hadoop HDFS的DateNode）

Datanode（简称 DN）是数据节点，执行协调节点分发的执行请求，实际存储业务数据。各个 DN 可以部署在不同的物理机上，也支持同物理机部署多个 DN 节点，DN互为主备节点不能部署在同一台物理机上。 DN 节点存储空间彼此之间独立、隔离，是标准的 share nothing 存储拓扑结构。另外TBase-V2与V1最大的不同地方是DN与DN之间可以通信，互相交换数据。

三、基于TBase开源代码编译和搭建体验系统

开源版本安装流程开源直接参考HBase Github上的wiki教程，本文不做赘述。

安装完成会提示

Success.
Done.
Starting all the datanode masters.
Starting datanode master dn001.
Starting datanode master dn002.
Done.

Initialize all the datanode slaves.
Initialize datanode slave dn001
Initialize datanode slave dn002

Starting all the datanode slaves.
Starting datanode slave dn001.
Starting datanode slave dn002.
Done.

按教程安装完成发现无法建表的坑，解决办法：

test=> create table foo(id bigint, str text) distribute by shard(id);
ERROR:  default group not defined

# 解决办法，建表需要创建组 default_group，这个在官方教程上没有说明
postgres=# create default node group default_group with(dn001, dn002); 
CREATE NODE GROUP
postgres=# create sharding group to group default_group;
CREATE SHARDING GROUP
postgres=# clean sharding;
CLEAN SHARDING

集群安装配置拓补结构

node name IP data directory

GTM master 10.128.0.20 /data/TBase/data/gtm

GTM slave 10.128.0.21 /data/TBase/data/gtm

CN1 10.128.0.20 /data/TBase/data/coord

CN2 10.128.0.21 /data/TBase/data/coord

DN1 master 10.128.0.20 /data/TBase/data/dn001

DN1 slave 10.128.0.21 /data/TBase/data/dn001

DN2 master 10.128.0.21 /data/TBase/data/dn002

DN2 slave 10.128.0.20 /data/TBase/data/dn002

四、数据自动shard分片

高数据量和吞吐量的数据库应用会对单机的性能造成较大压力，大的查询量会将单机的CPU耗尽，大的数据量对单机的存储压力较大，最终会耗尽系统的内存而将压力转移到磁盘IO上。为了解决这些问题，HBase使用的是水平扩展策略：将数据集分布在多个服务器上，即分片（sharding）。

查看数据节点分布情况

postgres=# select node_name, node_type, node_host, node_port from pgxc_node where node_type = 'D';
 node_name | node_type |  node_host  | node_port 
-----------+-----------+-------------+-----------
 dn001     | D         | 10.128.0.20 |     40004
 dn002     | D         | 10.128.0.21 |     40004
(2 rows)

体验数据自动shard

1. 创建 shard 表，插入测试数据

postgres=# create table test_shard(id bigint, str text) distribute by shard(id);
CREATE TABLE
postgres=# insert into test_shard select generate_series(1, 100), 'charleyyang';
postgres=# select * from test_shard;
 id  |     str     
-----+-------------
   1 | charleyyang
   2 | charleyyang
...skipping 87 lines
  99 | charleyyang
 100 | charleyyang
(100 rows)

2. 查看id=1的数据分布

postgres=# explain select * from test_shard where id = 1;
                            QUERY PLAN                            
------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001
   ->  Seq Scan on test_shard  (cost=0.00..21.00 rows=4 width=40)
         Filter: (id = 1)
(4 rows)

通过explain输出观察到在dn001上查询id=1的记录

3. 查看id=3的数据分布

postgres=# explain select * from test_shard where id = 3;
                            QUERY PLAN                            
------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn002
   ->  Seq Scan on test_shard  (cost=0.00..21.00 rows=4 width=40)
         Filter: (id = 3)
(4 rows)

通过explain输出观察到在dn002上查询id=3的记录

4. 查看id在1-100之间的数据分布

postgres=# explain select * from test_shard where id > 1 and id < 100;
                            QUERY PLAN                            
------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002
   ->  Seq Scan on test_shard  (cost=0.00..23.20 rows=4 width=40)
         Filter: ((id > 1) AND (id < 100))
(4 rows)

通过explain输出观察到同时在dn001和dn002上查询所有记录，体验TBase分布式存储的特性

5. 典型复杂场景分析（join 为例）

postgres=# create table dim(c1 bigint, c2 bigint, c3 text);
CREATE TABLE
postgres=# insert into dim select generate_series(1, 100), generate_series(1, 100), 'tbase';
INSERT 0 100

# join key 与 shard key 相同 可以下推到节点计算的场景 
postgres=# explain select t1.id, t2.c3 from test_shard t1 join dim t2 on t1.id = t2.c1 where t1.id = 100;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn002
   ->  Nested Loop  (cost=0.00..41.33 rows=16 width=40)
         ->  Seq Scan on test_shard t1  (cost=0.00..21.00 rows=4 width=8)
               Filter: (id = 100)
         ->  Materialize  (cost=0.00..20.14 rows=4 width=40)
               ->  Seq Scan on dim t2  (cost=0.00..20.12 rows=4 width=40)
                     Filter: (c1 = 100)
(8 rows)

# join key 与 shard key 不同 TBase可以执行，该场景能体现分布式的优势，分库分表的插件并不能自动的做到
这种
postgres=# explain select t1.id, t2.c3 from test_shard t1 join dim t2 on t1.id = t2.c2 where t1.id = 100;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..62.29 rows=4 width=14)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..141.27 rows=1 width=14)
         ->  Seq Scan on dim t2  (cost=0.00..41.25 rows=1 width=14)
               Filter: (c2 = 100)
   ->  Materialize  (cost=100.00..121.06 rows=4 width=8)
         ->  Remote Subquery Scan on all (dn002)  (cost=100.00..121.05 rows=4 width=8)
               ->  Seq Scan on test_shard t1  (cost=0.00..21.00 rows=4 width=8)
                     Filter: (id = 100)
(8 rows)

五、冷热分离存储

随着业务的持续发展，业务数据库存储量会持续增长。对于大量存储瓶颈类业务，存储成本依然是系统设计中需要关注的重中之重，冷热数据分离是一个很好的解决方案，可以将冷数据存储到性价比高的节点。对于历史订单、原始IOT数据随时间推移访问次数递减的海量数据，冷热分离是最佳的场景。

集群参数配置

所有 cn,dn 主务节点的 postgresql.conf 中配置这三个参数

cold_hot_sepration_mode = 'year'
enable_cold_seperation = on
manual_hot_date = '2019-01-01'

此处手工指定冷数据的界限为'2019-01-01'，早于该日期的数据存储到冷节点dn002，晚于该日期的数据存储到热节点dn001 ，配置后重启集群。

体验冷热数据分离存储

1. 连接 cn

psql -h 10.128.0.20-p 30004-d postgres -U charley_yangs

2. 创建冷热group default_group和cold_group

******* 注意：需要删除之前创建的default_group **********

drop sharding in group default_group
drop node group default_group

再从新创建冷热group

postgres=# create default node group default_group with(dn001);
postgres=# create sharding group to group default_group;
postgres=# clean sharding;

postgres=# create node group cold_group with(dn002);
postgres=# create extension sharding group to group cold_group;clean sharding;

3.设置 dn002 为冷节点，主备都要设置

# 连接dn002主节点

psql -h 10.128.0.20 -p 50004 -d postgres -U charley_yangs

postgres=# select pg_set_node_cold_access();
pg_set_node_cold_access
-------------------------
success
(1 row)

# 连接dn002备节点

psql -h 10.128.0.20-p 54004-d postgres -U charley_yangs
postgres=# select pg_set_node_cold_access();
pg_set_node_cold_access
-------------------------
success
(1 row)

4. 连接 cn，创建冷热分区表

psql -h 10.128.0.20  -p 30004 -d postgres -U charley_yangs

postgres=# create table public.test_cold_hot
(
f1 int not null,
f2 timestamp not null,
f3 varchar(20),
primary key(f1)
)
partition by range (f2)
begin (timestamp without time zone '2015-01-01 0:0:0')
step (interval '1 year') partitions (12)
distribute by shard(f1,f2)
to group default_group cold_group;

5. 插入数据

insert into public.test_cold_hot values(1,'2015-02-18','hbase');
insert into public.test_cold_hot values(2,'2016-02-11','kafka');
insert into public.test_cold_hot values(3,'2019-10-30','mysql');
insert into public.test_cold_hot values(4,'2020-03-23','hive');

6. 同时查询冷数据和热数据

postgres=# select * from public.test_cold_hot;
 f1 |         f2          |  f3   
----+---------------------+-------
  3 | 2019-10-30 00:00:00 | mysql
  4 | 2020-03-23 00:00:00 | hive
  1 | 2015-02-18 00:00:00 | hbase
  2 | 2016-02-11 00:00:00 | kafka
(4 rows)

# 分析sql执行逻辑
postgres=# explain select * from public.test_cold_hot;

QUERY PLAN                                             
----------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002

通过explain输出观察到同时在热区dn001和冷区dn002进行查询

7. 单独查询热数据

postgres=# select * from public.test_cold_hot where f2>='2019-01-01' and f2<'2020-01-01';
 f1 |         f2          |  f3   
----+---------------------+-------
  3 | 2019-10-30 00:00:00 | mysql
(1 row)

QUERY PLAN                                                       
--------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001
   
   ->  Seq Scan on test_cold_hot_part_4  (cost=0.00..19.90 rows=3 width=70)
         Filter: ((f2 >= '2019-01-01 00:00:00'::timestamp without time zone) AND (f2 < '2020-01-01 00:00:00'::times
tamp without time zone))
(4 rows)

通过explain输出观察到2019年后的数据只在热节点dn001进行查询

8. 单独查询冷数据

postgres=# select * from public.test_cold_hot where f2>='2015-01-01' and f2<'2019-01-01';
 f1 |         f2          |  f3   
----+---------------------+-------
  1 | 2015-02-18 00:00:00 | hbase
  2 | 2016-02-11 00:00:00 | kafka
(2 rows)

QUERY PLAN                                                                                                 
---------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn002

   Filter: ((f2 >= '2015-01-01 00:00:00'::timestamp without time zone) AND (f2 < '2019-01-01 00:00:00':
:timestamp without time zone))
(11 rows)

通过explain输出观察到2019年前的数据只在冷节点dn002进行查询

总结

通过实际部署和体验TBase，不仅体会到部署流程的顺畅和工具的易用性，同时对sharing自动分片和冷热数据数据分离两大特性进行了深度的体验，感受到了国产数据库的强大。

【TBase开源版测评】深度测评TBase的shard分片和冷热分离存储特性