记一次 PostgreSQL LIKE 索引优化，联合字段 LIKE 查询优化。

由于最近我司业务量上涨，数据量剧增，数据库查询速度明显变慢，单次查询居然达到1800ms以上，急需优化。待查阅一番后，我知道了LIKE查询正确的索引使用姿势，特别是一些符合字段索引。

模糊查询的使用场景实在太多了，正确使用索引实在是太重要了。或许一个关键的优化就能为你节省更多的资源，带来更好的用户体验。

少啰嗦，我们看实例。

由于未经我司同意，数据不能用真实了，下面数据是我盗来的😢

首先，我们这里有一张存了一百万条信息的users表。

                                    Table "public.users"
Column      |            Type             |                     Modifiers
------------+-----------------------------+----------------------------------------------------
 id         | integer                     | not null default nextval('users_id_seq'::regclass)
 username   | character varying           |
 first_name | character varying           |
 last_name  | character varying           |

普通查询

在不添加任何索引的情况下，我们查询一次。

> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21927.24..21927.25 rows=1 width=0) (actual time=737.523..737.523 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..21927.00 rows=96 width=0) (actual time=737.520..737.520 rows=0 loops=1)
         Filter: ((username)::text ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.373 ms
 Execution time: 737.593 ms
(6 rows)

Time: 738.404 ms

结果，如大家所想一样，顺序扫描，很慢。

btree 索引

很多童鞋会想到，那就加个索引呗。我们知道，pg里面一般默认都是btree索引，我们加一个试试。

> CREATE INDEX idx_users_username ON users (username);
Time: 15987.751 ms

查询一下：

> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21927.24..21927.25 rows=1 width=0) (actual time=752.271..752.271 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..21927.00 rows=96 width=0) (actual time=752.268..752.268 rows=0 loops=1)
         Filter: ((username)::text ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.599 ms
 Execution time: 752.318 ms
(6 rows)

Time: 753.251 ms

emm, 好家伙，居然还是顺序扫描。这是由于一般的索引只能优化 LIKE foo%这类的向后的模糊查询。所以btree索引这里也达不到优化的效果。

gin 索引 & pg_trgm 模块

pg_trgm模块提供函数和操作符测定字母，数字，文本基于三元模型匹配的相似性，还有支持快速搜索相似字符串的索引操作符类。

这里提到了一个三元模型，可能有童鞋不理解，其实很简单。打个比方foo的三元模型的集合为{" f"," fo","foo","oo "}, foo|bar的三元模型的集合为{" f"," fo","foo","oo "," b"," ba","bar","ar "}。也就是说将字符串拆解成三个字符一组，每个字符串被认为有两个空格前缀和一个空格后缀。

Postgres使用trigram将字符串分解成更小的单元便于有效地索引它们。pg_trgm模块支持GIST或GIN索引，从9.1开始，这些索引支持LIKE/ILIKE查询。

要使用pg_trgm模块，首先要启用该扩展，然后使用gin_trgm_ops创建索引。

> CREATE EXTENSION pg_trgm;
Time: 42.206 ms

> CREATE INDEX trgm_idx_users_username ON users USING gin (username gin_trgm_ops);
Time: 7082.474 ms

我们再次分析查询

> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=369.12..369.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
   ->  Bitmap Heap Scan on users  (cost=12.75..368.88 rows=96 width=0) (actual time=0.026..0.026 rows=0 loops=1)
         Recheck Cond: ((username)::text ~~* '%foo%'::text)
         ->  Bitmap Index Scan on trgm_idx_users_username  (cost=0.00..12.72 rows=96 width=0) (actual time=0.024..0.024 rows=0 loops=1)
               Index Cond: ((username)::text ~~* '%foo%'::text)
 Planning time: 0.636 ms
 Execution time: 0.095 ms
(7 rows)

Time: 2.333 ms

哇，2.33ms!!!

从查询过程上看，索引已经起作用了。这将我们的查询直接提高了两个数量级，作用很大。我在我司使用该索引后，查询从1800ms提升到了10ms左右,也算是效果惊人了。

这里出现了个"Bitmap Heap Scan",这是什么呢？ pg里面常见到的对表扫描的计划有这四种：
Seq Scan
Index Scan
Bitmap Heap Scan
Index Only Scan
前面两个你也许一看就懂。第一个无非是按照表的记录顺序从头到尾依次检索扫描，全表扫描，代价大；第二个先扫描索引，从索引找到数据位置，再准备获取数据，索引扫描，快，I/O少；第三个一次性将满足条件的索引项全部取出，并在内存中进行排序, 然后根据取出的索引项访问表数据。一般需要合并索引访问的结果子集时会用到这种方式。第四个就是查询的字段直接就在索引中了，直接扫描索引即可。

联合字段LIKE查询

有时候，我们可能需要联合多个字段来查询数据。如，我们需要搜索一位可能叫John Do的人,这时候就需要我们将first_name与last_name联合作为条件查询了。

一般我们首先想到的是，在first_name与last_name都加上索引。那么可行么？我们先试一下。

> CREATE INDEX trgm_idx_users_first ON users USING gin (first_name gin_trgm_ops);
Time: 4577.637 ms
> CREATE INDEX trgm_idx_users_last ON users USING gin (last_name gin_trgm_ops);
Time: 4770.507 ms

查询

> SELECT COUNT(*) FROM users WHERE first_name || ' ' || last_name ILIKE '%foo%'

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=27027.00..27027.01 rows=1 width=0) (actual time=1025.543..1025.544 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..26927.00 rows=40000 width=0) (actual time=1025.539..1025.539 rows=0 loops=1)
         Filter: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.273 ms
 Execution time: 1025.591 ms
(6 rows)

Time: 1027.547 ms

看来还是没有用到索引，由于查询的时候使用的是联合的字段，单独的索引并不起作用。那么我们可以考虑将字段联合起来，添加一个索引。

CREATE INDEX index_users_full_name ON users using gin ((first_name || ' ' || last_name) gin_trgm_ops);

我们再次尝试查询。


> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE first_name || ' ' || last_name ILIKE '%foo%';

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10605.00..10605.01 rows=1 width=0) (actual time=0.020..0.020 rows=1 loops=1)
   ->  Bitmap Heap Scan on users  (cost=378.01..10505.00 rows=40000 width=0) (actual time=0.018..0.018 rows=0 loops=1)
         Recheck Cond: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
         ->  Bitmap Index Scan on index_users_full_name  (cost=0.00..368.01 rows=40000 width=0) (actual time=0.016..0.016 rows=0 loops=1)
               Index Cond: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
 Planning time: 0.338 ms
 Execution time: 0.080 ms
(7 rows)

Time: 0.975 ms

喔😯，相当的快。使用了 Bitmap Index扫描的，看来索引很有效。

最后

勇敢上文，我们一起了解了一下pg里利用pg_trgm模块来建立索引优化模糊查询。其实pg里买还有更多好用实用的功能，期待与大家一起探索～。

本文参考

普通查询#

btree 索引#

gin 索引 & pg_trgm 模块#

联合字段LIKE查询#

最后#

普通查询

btree 索引

gin 索引 & pg_trgm 模块

联合字段LIKE查询

最后