PostgreSQL利用递归优化求稀疏列唯一值的方法_PostgreSQL

在数据库中经常会碰到一些表的列是稀疏列，只有很少的值，例如性别字段，一般就只有2种不同的值。
但是当我们求这些稀疏列的唯一值时，如果表的数据量很大，速度还是会很慢。

例如：
创建测试表

									bill=# create table t_sex (sex char(1), otherinfo text);

									CREATE TABLE

									bill=# insert into t_sex select 'm', generate_series(1,10000000)||'this is test';

									INSERT 0 10000000

									bill=# insert into t_sex select 'w', generate_series(1,10000000)||'this is test';

									INSERT 0 10000000

查询：
可以看到下面的查询速度很慢。

									bill=# select count(distinct sex) from t_sex;

									 count

									-------

									   2

									(1 row)

									Time: 8803.505 ms (00:08.804)

									bill=# select sex from t_sex t group by sex;

									 sex

									-----

									 m

									 w

									(2 rows)

									Time: 1026.464 ms (00:01.026)

那么我们对该字段加上索引又是什么情况呢？

速度依然没有明显

									bill=# create index idx_sex_1 on t_sex(sex);

									CREATE INDEX

									bill=# select count(distinct sex) from t_sex;

									 count

									-------

									   2

									(1 row)

									Time: 8502.460 ms (00:08.502)

									bill=# select sex from t_sex t group by sex;

									 sex

									-----

									 m

									 w

									(2 rows)

									Time: 572.353 ms

的变化，可以看到执行计划已经使用Index Only Scan了。

									bill=# explain select count(distinct sex) from t_sex;

									                     QUERY PLAN

									----------------------------------------------------------------------------------------------

									 Aggregate (cost=371996.44..371996.45 rows=1 width=8)

									  -> Index Only Scan using idx_sex_1 on t_sex (cost=0.44..321996.44 rows=20000000 width=2)

									(2 rows)

同样的SQL我们看看在Oracle中性能如何？

创建测试表：

									SQL> create table t_sex (sex char(1), otherinfo varchar2(100));

									Table created.

									SQL> insert into t_sex select 'm', rownum||'this is test' from dual connect by level <=10000000;

									10000000 rows created.

									SQL> commit;

									Commit complete.

									SQL> insert into t_sex select 'w', rownum||'this is test' from dual connect by level <=10000000;

									10000000 rows created.

									SQL> commit;

									Commit complete.

性能测试：

									SQL> set lines 1000 pages 2000

									SQL> set autotrace on

									SQL> set timing on

									SQL> select count(distinct sex) from t_sex;

									COUNT(DISTINCTSEX)

									------------------

									         2

									Elapsed: 00:00:01.58

									Execution Plan

									----------------------------------------------------------

									Plan hash value: 3915432945

									----------------------------------------------------------------------------

									| Id | Operation     | Name | Rows | Bytes | Cost (%CPU)| Time   |

									----------------------------------------------------------------------------

									|  0 | SELECT STATEMENT  |    |   1 |   3 | 20132  (1)| 00:00:01 |

									|  1 | SORT GROUP BY   |    |   1 |   3 |      |     |

									|  2 |  TABLE ACCESS FULL| T_SEX |  14M|  42M| 20132  (1)| 00:00:01 |

									----------------------------------------------------------------------------

									Note

									-----

									  - dynamic statistics used: dynamic sampling (level=2)

									Statistics

									----------------------------------------------------------

									     0 recursive calls

									     0 db block gets

									   74074 consistent gets

									     0 physical reads

									     0 redo size

									    552 bytes sent via SQL*Net to client

									    608 bytes received via SQL*Net from client

									     2 SQL*Net roundtrips to/from client

									     1 sorts (memory)

									     0 sorts (disk)

									     1 rows processed

									SQL> select sex from t_sex t group by sex;

									SE

									--

									m

									w

									Elapsed: 00:00:01.08

									Execution Plan

									----------------------------------------------------------

									Plan hash value: 3915432945

									----------------------------------------------------------------------------

									| Id | Operation     | Name | Rows | Bytes | Cost (%CPU)| Time   |

									----------------------------------------------------------------------------

									|  0 | SELECT STATEMENT  |    |  14M|  42M| 20558  (3)| 00:00:01 |

									|  1 | SORT GROUP BY   |    |  14M|  42M| 20558  (3)| 00:00:01 |

									|  2 |  TABLE ACCESS FULL| T_SEX |  14M|  42M| 20132  (1)| 00:00:01 |

									----------------------------------------------------------------------------

									Note

									-----

									  - dynamic statistics used: dynamic sampling (level=2)

									Statistics

									----------------------------------------------------------

									     0 recursive calls

									     0 db block gets

									   74074 consistent gets

									     0 physical reads

									     0 redo size

									    589 bytes sent via SQL*Net to client

									    608 bytes received via SQL*Net from client

									     2 SQL*Net roundtrips to/from client

									     1 sorts (memory)

									     0 sorts (disk)

									     2 rows processed

可以看到Oracle的性能即使不加索引也明显比PostgreSQL中要好。
那么我们在PostgreSQL中是不是没办法继续优化了呢？这种情况我们利用pg中的递归语句结合索引可以大幅提升性能。

SQL改写：

									bill=# with recursive tmp as (

									bill(#  (

									bill(#   select min(t.sex) as sex from t_sex t where t.sex is not null

									bill(#  )

									bill(#  union all

									bill(#  (

									bill(#   select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)

									bill(#    from tmp s where s.sex is not null

									bill(#  )

									bill(# )

									bill-# select count(distinct sex) from tmp;

									 count

									-------

									   2

									(1 row)

									Time: 2.711 ms

查看执行计划：

									bill=# explain with recursive tmp as (

									bill(#  (

									bill(#   select min(t.sex) as sex from t_sex t where t.sex is not null

									bill(#  )

									bill(#  union all

									bill(#  (

									bill(#   select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)

									bill(#    from tmp s where s.sex is not null

									bill(#  )

									bill(# )

									bill-# select count(distinct sex) from tmp;

									                           QUERY PLAN

									----------------------------------------------------------------------------------------------------------------------

									 Aggregate (cost=53.62..53.63 rows=1 width=8)

									  CTE tmp

									   -> Recursive Union (cost=0.46..51.35 rows=101 width=32)

									      -> Result (cost=0.46..0.47 rows=1 width=32)

									         InitPlan 3 (returns $1)

									          -> Limit (cost=0.44..0.46 rows=1 width=2)

									             -> Index Only Scan using idx_sex_1 on t_sex t (cost=0.44..371996.44 rows=20000000 width=2)

									                Index Cond: (sex IS NOT NULL)

									      -> WorkTable Scan on tmp s (cost=0.00..4.89 rows=10 width=32)

									         Filter: (sex IS NOT NULL)

									  -> CTE Scan on tmp (cost=0.00..2.02 rows=101 width=32)

									(11 rows)

									Time: 1.371 ms