部门的项目原先基于Oracle进行存储、DM层OLAP计算、后端接口访问，而Oracle只是单机+从库的部署方式，经过一段时间的使用，存储与性能都有捉襟见肘之处，另外也想节省成本，决定响应去O的大潮。

选型时主要考虑公司内部的HTAP类型数据库，因为需要其类似Oracle既能进行离线任务计算，也可以承担频繁的查询（其实是DAU只有100+的对内服务…），并且后续扩容还比较方便，范围内的选项有TiDB和TBase，考虑到TBase语法对Oracle兼容比较多，故对TBase做进一步的评估。这里会先进行基本的测试，之后会有使用过程中的问题与注意点。

一、测评体验项

1.1 分布式数据自动 shard 分片

1.1.1 查看集群的数据节点情况

=> select node_name, node_type, node_host, node_port from pgxc_node where node_type = 'D';
 node_name | node_type |  node_host   | node_port
-----------+-----------+--------------+-----------
 dn001     | D         | 9.56.27.183  |     11006
 dn002     | D         | 9.56.27.183  |     11000
 dn003     | D         | 9.24.147.143 |     11000
 dn004     | D         | 9.24.147.143 |     11002
 dn005     | D         | 9.56.21.88   |     11004
 dn006     | D         | 9.56.21.88   |     11006
 dn007     | D         | 9.37.26.52   |     11002
 dn008     | D         | 9.37.26.52   |     11000
(8 rows)

时间原因这部分较简略。

二、迁移采坑记

2.1 Oracle元数据与数据迁移

这部分简述一下数据与任务迁移的预备工作步骤。

2.1.1 DDL迁移

这里我们使用ora2pg这一命令行工具，官方文档见附录。

批量获取表schema(执行完在当前目录生成output.sql文件)ora2pg -d -t TABLE VIEW TABLESPACE TYPE PARTITION -a t_name_a,t_name_b生成的语句不能自动适应分布式环境，需要手动加上partition与shardCREATE TABLE t_name_a ( ds bigint, id bigint, str text) -- 需要添加： PARTITION BY LIST(ds) DISTRIBUTE BY SHARD(id);另外TBase分区字段不允许为空，故建表后还需要 ALTER TABLE t_name_a alter log_time set not null;2.1.2 表分区自动化维护PG分区表的父表不能创建索引，需要为每个分区子表维护索引，旧分区中的索引也不能自动继承。TBase对内提供了一系列存储过程，简化了操作，详细说明与样例见附录。 1. 注册分区表(对表进行分区类型、最后创建分区等信息的维护)select tdw_meta.create_parent('public.t_name_a', 'ds', 'time-static', 'daily', 3); NOTICE: SELECT tdw_meta.create_time_partition('public.t_name_a','ds','1 day','YYYYMMDD','{"2020-08-08 11:19:19.306806","2020-08-07 11:19:19.306806","2020-08-09 11:19:19.306806","2020-08-06 11:19:19.306806","2020-08-10 11:19:19.306806","2020-08-05 11:19:19.306806","2020-08-11 11:19:19.306806"}')

NOTICE: public.t_name_a_p_20200811

create_parent

=>d+ t_name_a;

                       Partitioned table "public.t_name_a"

--------+--------+-----------+----------+---------+----------+--------------+-------------

Partition key: LIST (ds)

Partitions: t_name_a_p_20200805 FOR VALUES IN ('20200805'),

        t_name_a_p_20200806 FOR VALUES IN ('20200806'),

        t_name_a_p_20200807 FOR VALUES IN ('20200807'),

        t_name_a_p_20200808 FOR VALUES IN ('20200808'),

        t_name_a_p_20200809 FOR VALUES IN ('20200809'),

        t_name_a_p_20200810 FOR VALUES IN ('20200810'),

        t_name_a_p_20200811 FOR VALUES IN ('20200811')

最后一个参数表明需要回补的历史分区，这里执行的日期是0807，但是回补的数量有点摸不着规律，而且重建表重试参数值为1的时候自动创建的是07-09的分区，值得商榷。

2.  索引创建
4个参数，分别为：1.主表名字；2.分区列（注意先后顺序，要带上分区列）；3.需创建的历史分区数量，默认为全部历史分区（实际来看似乎是按4+n来计算的）；4.类型，默认为btree

=> select tdw_meta.create_partition_index('public.t_name_a', 'ds, id', 1, 'btree', false );

NOTICE: v_sql : create index on public.t_name_a_p_20200811 USING btree(ds, id)

NOTICE: v_sql : create index on public.t_name_a_p_20200810 USING btree(ds, id)

NOTICE: v_sql : create index on public.t_name_a_p_20200809 USING btree(ds, id)

NOTICE: v_sql : create index on public.t_name_a_p_20200808 USING btree(ds, id)

NOTICE: v_sql : create index on public.t_name_a_p_20200807 USING btree(ds, id)

create_partition_index

3. 滚动分区
- 将表的分区与索引置为此状态

=>d+ t_name_a;

                       Partitioned table "public.t_name_a"

--------+--------+-----------+----------+---------+----------+--------------+-------------

Partition key: LIST (ds)

Partitions: t_name_a_p_20200806 FOR VALUES IN ('20200806')

- 查看已有索引

=> select * from tdw_meta.show_partition_index('public.t_name_a');

schemaname | tablename | columnlist | indexdef

------------+-----------+------------+----------------------------------------------------------------------------------------

public | t_name_a | ds, id | CREATE INDEX t_name_a_p_20200806_ds_id_idx ON t_name_a_p_20200806 USING btree (ds, id)

(1 row)

- 验证结果

=> select tdw_meta.run_maintenance_table('public.t_name_a');

NOTICE: public.t_name_a

NOTICE: sql: public.t_name_a_p_20200807 USING btree (ds, id)

NOTICE: sql: public.t_name_a_p_20200808 USING btree (ds, id)

NOTICE: sql: public.t_name_a_p_20200809 USING btree (ds, id)

run_maintenance_table

可以看到不管分区还是索引都创建了新的

=> select * from pg_indexes where tablename='t_name_a_p_20200807';

schemaname | tablename | indexname | tablespace | indexdef

------------+---------------------+-------------------------------+------------+----------------------------------------------------------------------------------------

public | t_name_a_p_20200807 | t_name_a_p_20200807_ds_id_idx | | CREATE INDEX t_name_a_p_20200807_ds_id_idx ON t_name_a_p_20200807 USING btree (ds, id)

(1 row)

### 2.1.3 数据迁移
依然使用ora2pg进行迁移，
```ora2pg -P 10 -t COPY -a 'TABLE[t_name_a]'```

可以使用-e选项排除已迁移的分区数据。支持多连接/多进程读Oracle。如果只指定-P选项标明同时进行多PG表的迁移，速度并不会有改善，只有1w行/秒。

[==================>      ]  789726754/1044650409 rows (75.6%) on total estimated data (73612 sec., avg: 10728 tuples/sec)

2.2 任务迁移与优化

2.2.1 常见语法切换

虽然兼容了很多Oracle语法，但TBase毕竟还是对标Postgresql的，部分语法需要手动处理。

Oracle	TBase	说明
nvl	coleasce	空值处理函数
number(20,6)	numeric(20,6)	ora2pg会将number(10,6)转为float8造成精度缺失，可改为numeric(20,6)
where rownum <= n;	limit n;	后端接口进行了批量替换
MERGE INTO	on conflict() do	pg只能在分区子表中做此操作

还有其他区别网上资料很多。另外postgres比较严格，需要手动转换一些类型，在涉及到分区字段、shard字段、索引字段要注意类型准确性，否则查询时用cast或者::type转换都有损耗。

2.2.2 任务迁移优化一例

目标语句是一个5个子表逐次left join，其中第t1表的目标分区大概30w行，t4表行数最多有3000w+行。原语句较长，详情见附录4。

禁用Nested Loop 其中t1和t2部分的join就耗时3m56s。对这一部分进行explain。左右表实际都是按时间分区并且有distribute key的，并且左表对应分区的distribute key也有索引。问题在于左表有37w条数据，右表也有2w+，后续的其他右表还有几十万行的数据，不适合Nested Loop的join方式。"Remote Subquery Scan on all (dn001,dn002,dn003,dn004,dn005,dn006,dn007,dn008) (cost=10.26..21.32 rows=1 width=4156)" -> Nested Loop Left Join (cost=10.26..21.32 rows=1 width=4156) Join Filter: ((app_id)::text = (t_md_light_tj_search_tag.app_id)::text) -> Remote Subquery Scan on all (dn006) (cost=110.26..113.95 rows=1 width=3640) Distribute results by S: app_id -> Group (cost=10.26..10.29 rows=1 width=3640) " Group Key: t_md_light_tj_search_index.statis_day, t_md_light_tj_search_index.tj_key, t_md_light_tj_search_index.tj_name, t_md_light_tj_search_index.app_type, t_md_light_tj_search_index.device_type, t_md_light_tj_search_index.app_id, t_md_light_tj_search_index.app_name, t_md_light_tj_search_index.app_industry, t_md_light_tj_search_index.app_tag, t_md_light_tj_search_index.logo, t_md_light_tj_search_index.rank_comb, t_md_light_tj_search_index.search_name" -> Sort (cost=10.26..10.27 rows=1 width=3640) " Sort Key: t_md_light_tj_search_index.tj_key, t_md_light_tj_search_index.tj_name, t_md_light_tj_search_index.app_type, t_md_light_tj_search_index.device_type, t_md_light_tj_search_index.app_id, t_md_light_tj_search_index.app_name, t_md_light_tj_search_index.app_industry, t_md_light_tj_search_index.app_tag, t_md_light_tj_search_index.logo, t_md_light_tj_search_index.rank_comb, t_md_light_tj_search_index.search_name" -> Seq Scan on t_md_light_tj_search_index (cost=0.00..10.25 rows=1 width=3640) Filter: (statis_day = 20200713) -> Materialize (cost=100.00..111.94 rows=1 width=934) -> Remote Subquery Scan on all (dn006) (cost=100.00..111.94 rows=1 width=934) Distribute results by S: app_id -> Seq Scan on t_md_light_tj_search_tag (cost=0.00..11.00 rows=1 width=934) Filter: (statis_day = 20200713)注意到explain中标的rows=1，说明其实对于表的行数信息是缺失的，优化器无法利用这部分的信息。我们可以使用analyze更新这个信息。analyze两个表之后explain，可以看到执行计划已改为Hash join。之后这两表join只需要10s即可得到结果。使用set enable_nestloop=off;也可以达到类似效果。"Remote Subquery Scan on all (dn001,dn002,dn003,dn004,dn005,dn006,dn007,dn008) (cost=111498.98..123940.01 rows=396268 width=224)" -> Hash Left Join (cost=111498.98..123940.01 rows=396268 width=224) Hash Cond: ((app_id)::text = (app_id)::text) " -> Remote Subquery Scan on all (dn001,dn002,dn003,dn004,dn005,dn006,dn007,dn008) (cost=110504.38..154939.66 rows=173915 width=208)" Distribute results by S: app_id -> Group (cost=110404.38..116056.62 rows=173915 width=208) " Group Key: statis_day, tj_key, tj_name, app_type, device_type, app_id, app_name, app_industry, app_tag, logo, rank_comb, search_name" -> Sort (cost=110404.38..110839.17 rows=173915 width=0) " Sort Key: tj_key, tj_name, app_type, device_type, app_id, app_name, app_industry, app_tag, logo, rank_comb, search_name" " -> Remote Subquery Scan on all (dn001,dn002,dn003,dn004,dn005,dn006,dn007,dn008) (cost=82101.17..95266.80 rows=173915 width=0)" Distribute results by S: rank_comb -> Group (cost=82001.17..94297.22 rows=173915 width=208) " Group Key: t_md_light_tj_search_index_p_20200713.statis_day, t_md_light_tj_search_index_p_20200713.tj_key, t_md_light_tj_search_index_p_20200713.tj_name, t_md_light_tj_search_index_p_20200713.app_type, t_md_light_tj_search_index_p_20200713.device_type, t_md_light_tj_search_index_p_20200713.app_id, t_md_light_tj_search_index_p_20200713.app_name, t_md_light_tj_search_index_p_20200713.app_industry, t_md_light_tj_search_index_p_20200713.app_tag, t_md_light_tj_search_index_p_20200713.logo, t_md_light_tj_search_index_p_20200713.rank_comb, t_md_light_tj_search_index_p_20200713.search_name" -> Sort (cost=82001.17..82947.02 rows=378340 width=208) " Sort Key: t_md_light_tj_search_index_p_20200713.tj_key, t_md_light_tj_search_index_p_20200713.tj_name, t_md_light_tj_search_index_p_20200713.app_type, t_md_light_tj_search_index_p_20200713.device_type, t_md_light_tj_search_index_p_20200713.app_id, t_md_light_tj_search_index_p_20200713.app_name, t_md_light_tj_search_index_p_20200713.app_industry, t_md_light_tj_search_index_p_20200713.app_tag, t_md_light_tj_search_index_p_20200713.logo, t_md_light_tj_search_index_p_20200713.rank_comb, t_md_light_tj_search_index_p_20200713.search_name" -> Append (cost=0.00..17389.25 rows=378340 width=208) -> Seq Scan on t_md_light_tj_search_index_p_20200713 (cost=0.00..17389.25 rows=378340 width=208) Filter: (statis_day = 20200713) -> Hash (cost=2066.41..2066.41 rows=28584 width=38) -> Remote Subquery Scan on all (dn006) (cost=100.00..2066.41 rows=28584 width=38) Distribute results by S: app_id -> Append (cost=0.00..737.30 rows=28584 width=38) -> Seq Scan on t_md_light_tj_search_tag_p_20200713 (cost=0.00..737.30 rows=28584 width=38) Filter: (statis_day = 20200713)where statis_month = (select max(statis_month) from table_4)
选取最新分区优化 t4表使用这个方式找到最新的分区，子查询造成了全表扫描。改为直接从分区子表查询后时间从5m降为10s以内。
分布 key join 语句中的个别表没有指定shard id，重新建表后所有join都会在分布键上执行，分布键 join 查询性能会更好。

另外t1为了去重使用了所有字段的group by，但实际数据是可以保证不重复的，故去除了group by。数据量最大的t4表也前置到第一次left join。最终这个任务从20min+甚至经常打满资源无法终止优化到20s左右。

在优化这个任务的过程中遇到了非常多次语句无法终止执行的问题，pg_terminate_backend对应pid没有响应，DBA告知是节点CPU等资源打满无法响应SIGTERM，而且有时候并不是这个insert select语句导致的，在我explain analyze这个语句后也出现了类似情况，未来TBase可能需要在OLAP以及资源隔离限制方面加以进步。

2.3 数据丢失相关的乌龙事件

unlogged table是PostgreSQL的特性，这类表在数据写入时不会写WAL(write-ahead log)，也不会拷贝到standby server，这让其有更快的写入速度，然而它不crash-safe，会在崩溃后truncate。

由于对此特性不够熟悉，很多非分区的表都使用了这个类型，上述的未优化任务导致数据库资源打满崩溃，重启后发现很多表数据清空或者残缺。开始觉得这个是一个严重bug，也从事务可见性等方面排查，最终发现是使用不当造成的。后续处理：

列出所有unlogged table ： select relname from pg_class where relpersistence='u' and relkind='r'
修改为logged table语句(数据量大时可以在插入前改回unlogged，插入完成后进行此操作)： alter table <table_name> set logged;

总结

TBase在试运行阶段良好地承接了Oracle的OLAP+OLTP应用，预计下线大批Oracle任务，也让提心吊胆的SRE团队松了一口气。

附：

内部链接无法打开可私信联系

TBase锁表+无法kill进程问题与处理 https://iwiki.oa.tencent.com/pages/viewpage.action?pageId=247326931
Ora2pg官方文档 http://ora2pg.darold.net/documentation.html
TBase分区维护相关存储过程使用样例 http://wiki.tdw.oa.com/wiki/index.php/TPG%E5%88%86%E5%8C%BA%E8%A1%A8%E7%9A%84%E4%BB%8B%E7%BB%8D%E4%BB%A5%E5%8F%8A%E4%BD%BF%E7%94%A8 http://km.oa.com/group/pgxz/articles/show/389601
多join任务优化 https://iwiki.oa.tencent.com/pages/viewpage.action?pageId=254741342

【TBase开源版测评】轻松愉快去O选项：TBase