PostgreSQL 实践 - 内容社区(如论坛)图式搜索应用

发布于 2017-10-11 03:58:38 | 196 次阅读 | 评论: 0 | 来源: 网友投递

这里有新鲜出炉的PostgreSQL数据库教程，程序狗速度看过来！

PostgreSQL关系型数据库管理系统

PostgreSQL是以加州大学伯克利分校计算机系开发的 POSTGRES，现在已经更名为PostgreSQL，版本 4.2为基础的对象关系型数据库管理系统（ORDBMS）。

作者 digoal

原文出处：https://github.com/digoal/blog/blob/master/201710/20171009_01.md

背景

通常一个内容社区网站可能需要记录这么一些数据: 文章，用户，标签。

还有三者之间的关系，包括，文章的标签，用户阅读了文章，用户收藏了文章，用户关注了某用户，用户是某篇文章的作者。

最终要实现毫无人道的查询，例如：

阅读了此篇文章的人还在阅读什么其他文章，和我爱好相近的人都有哪些等等等等。

其中文章数量几千万，用户数量接近一千万。

如何实现这样的需求呢？

实际上PostgreSQL里面的数组、smlar实现这个需求非常的方便。下面开始设计和压测。

数组用于存储正向和反向关系，标签等。

smlar用于查询相似的数组（找出爱好相似的人）。

设计

元数据

1、用户表

create table users(      uid int primary key,             -- 用户ID      info text,           -- 附加信息      crt_time timestamp   -- 时间    );

2、标签表

create table tags(      tagid int primary key,              -- 标签ID      info text,              -- 附加信息       crt_time timestamp      -- 时间    );

3、文章表

create table arts(      artid int primary key,      -- 文章ID      info text,      -- 附加信息、内容      uids int[],     -- 用户IDs（作者可能是多人，所以使用数组）      tags int[]      -- 标签    );

关系数据

1、正向关系

1.1、文章被谁看过

create table art_uids_view (      artid int primary key,      uids int[]    );

1.2、文章被谁收藏过

create table art_uids_like (      artid int primary key,      uids int[]    );

2、反向关系

2.1、用户看过哪些文章，包含哪些标签

create table uid_arts_view (      uid int primary key,      arts int[],      tags int[]    );

2.2、用户收藏了哪些文章，包含哪些标签

create table uid_arts_like (      uid int primary key,      arts int[],      tags int[]    );

查询

1、阅读了此篇文章的其他人还在阅读什么其他文章，(过滤当前文章、以及我阅读过的文章)。

逻辑如下，写成UDF即可：

create extension intarray ;  -- 创建intarry插件，简化数组的加减        select (uids - $current_uid) into v1 from art_uids_view where artid = $current_artid ;  -- 取出阅读了当前文章的所有用户(减去当前用户)        select (unnest(arts) as arts, count(*) as cnt) into v2 from uid_arts_view where uid = any (v1) group by 1 ;  -- 获取阅读了同样文章的人，还阅读了哪些文章        select arts into v3 from uid_arts_view where uid= $current_uid ;          -- 当前用户阅读了哪些文章        result = v2.arts - v3 ;   -- 其他人阅读的所有文章 减去 当前用户阅读的文章，得到其他人阅读的文章。  按重叠数从大到小排序，推荐给用户

UDF如下，都能使用索引，都是聚合后的点查，性能很赞：

create or replace function rec_arts_view(      i1 int,  -- 文章ID      i2 int,  -- 当前用户ID      i3 int   -- limit    ) returns setof int as $$    declare      res int[];  -- 结果      v1 int[];   -- 文章被哪些用户阅读了    begin            -- 文章被哪些用户阅读了      select (uids - i2) into v1 from art_uids_view where artid = i1 ;            -- 阅读了这篇文章的其他用户，阅读了哪些文章，排除当前用户已阅读的，按重复率排序，返回N条。      -- 如果阅读了该文章的其他人，还阅读了很多其他文章，排序可能会略耗时。    return query      select t1.arts from       (        select unnest(arts) arts, count(*) cnt from uid_arts_view where uid = any (v1) group by 1      ) t1      left join      (        select unnest(arts) arts, 1 cnt from uid_arts_view where uid= i2       ) t2      on (t1.arts=t2.arts)       where t2.* is null      order by t1.cnt desc      limit i3;        end;    $$ language plpgsql strict;

2、与我（阅读文章）爱好相近的人有哪些，走GIN索引，性能很赞。

create extension smlar;        set smlar.type='overlap';          set smlar.threshold=?;             -- 设置重叠阈值          select arts into v1 from uid_arts_view where uid = ?;       -- 我阅读了哪些文章          select            *,            smlar( arts, v1, 'N.i' )       -- 其他人与我阅读的文章的重叠数是多少      from            uid_arts_view          where            arts % v1                      -- where cosine similarity >= smlar.threshold        ;

3、与我（阅读文章标签）爱好相近的人有哪些。

与2类似，略。

4、与我（收藏文章）爱好相近的人有哪些。

与2类似，略。

5、与我（收藏文章标签）爱好相近的人有哪些。

与2类似，略。

生成正反向关系的UDF

使用UDF，减少交互次数，完成以下几类业务逻辑的操作。UDF可以使用plpgsql编写，很简单，本文略：

https://www.postgresql.org/docs/10/static/plpgsql.html

1、新建文章的行为，自动产生标签，并更新或追加标签表。

insert into tags values ();        insert into arts values ();

2、阅读行为，修改正向反向关系。

文章的tags信息从arts里获取        insert into art_uids_view values ();        insert into uid_arts_view values ();

3、收藏行为，修改正向反向关系。

文章的tags信息从arts里获取        insert into art_uids_like values ();        insert into uid_arts_like values ();

索引

-- smlar 相似查询    create index idx_gin_1 on art_uids_view using gin ( uids _int4_sml_ops );      create index idx_gin_2 on art_uids_like using gin ( uids _int4_sml_ops );          create index idx_gin_3 on uid_arts_view using gin ( arts _int4_sml_ops );      create index idx_gin_4 on uid_arts_view using gin ( tags _int4_sml_ops );          create index idx_gin_5 on uid_arts_like using gin ( arts _int4_sml_ops );      create index idx_gin_6 on uid_arts_like using gin ( tags _int4_sml_ops );          create index idx_gin_7 on art_uids_view using gin ( uids _int4_sml_ops );      create index idx_gin_8 on art_uids_like using gin ( uids _int4_sml_ops );

可选索引

-- 数组相交、包含查询    create index idx_gin_01 on art_uids_view using gin ( uids gin__int_ops );      create index idx_gin_02 on art_uids_like using gin ( uids gin__int_ops );          create index idx_gin_03 on uid_arts_view using gin ( arts gin__int_ops );      create index idx_gin_04 on uid_arts_view using gin ( tags gin__int_ops );          create index idx_gin_05 on uid_arts_like using gin ( arts gin__int_ops );      create index idx_gin_06 on uid_arts_like using gin ( tags gin__int_ops );          create index idx_gin_07 on art_uids_view using gin ( uids gin__int_ops );      create index idx_gin_08 on art_uids_like using gin ( uids gin__int_ops );

填充测试数据

1、生成1000万用户

insert into users select id, md5(id::text), now() from generate_series(1,10000000) t(id);

2、生成10万标签

insert into tags select id, md5(id::text), now() from generate_series(1,100000) t(id);

3、生成5000万文章

create or replace function gen_arr(int,int) returns int[] as $$      select array(select ceil(random()*$1) from generate_series(1,$2))::int[];    $$ language sql strict;

insert into arts select id, md5(id::text),       gen_arr(10000000 ,3),        gen_arr(100000 ,10)     from generate_series(1,50000000) t(id);

4、生成正向关系，平均每篇文章被500人阅读，被50人收藏。

insert into art_uids_view select id, gen_arr(10000000, 500) from generate_series(1,50000000) t(id);        insert into art_uids_like select id, gen_arr(10000000, 50) from generate_series(1,50000000) t(id);

5、生成反向关系（按理说，反向关系和正向关系应该一一对应，为了测试方便，我这里就不对应了，测试效果是一样的）

平均每人阅读1000篇文章，涉及500个标签。收藏100篇文章，涉及50个标签。

insert into uid_arts_view select id, gen_arr(50000000, 1000), gen_arr(100000, 500) from generate_series(1,10000000) t(id);        insert into uid_arts_like select id, gen_arr(50000000, 100), gen_arr(100000, 50) from generate_series(1,10000000) t(id);

性能测试

1、阅读了此篇文章的其他人还在阅读什么其他文章，(过滤当前文章、以及我阅读过的文章)。

select rec_arts_view(1,2,10);    -- 文章ID=1， 当前用户ID=2，  返回10条推荐文章给当前用户。

其他人一共阅读了约50万其他文章，获取加排序耗时：200毫秒。

postgres=# select count(*) from rec_arts_view(1,4,1000000);   count    --------   497524  (1 row)    Time: 565.524 ms    postgres=# select count(*) from rec_arts_view(1,4,10);   count   -------      10  (1 row)    Time: 198.368 ms

2、与我（阅读文章）爱好相近的人有哪些。

set smlar.type='overlap';          set smlar.threshold=10;             -- 设置重叠阈值          select arts into v1 from uid_arts_view where uid = 1;       -- 我阅读了哪些文章        select            *,            smlar( arts, v1, 'N.i' )       -- 其他人与我阅读的文章的重叠数是多少      from            uid_arts_view          where            arts % v1                      -- where cosine similarity >= smlar.threshold        ;

耗时：2.4毫秒。

 Bitmap Heap Scan on public.uid_arts_view  (cost=933.50..29296.93 rows=10000 width=72) (actual time=1.955..2.351 rows=2 loops=1)     Output: uid, arts, tags, smlar(arts, '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[], 'N.i'::text)     Recheck Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])     Heap Blocks: exact=2     Buffers: shared hit=107     ->  Bitmap Index Scan on idx_gin_3  (cost=0.00..931.00 rows=10000 width=0) (actual time=1.506..1.506 rows=2 loops=1)           Index Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])           Buffers: shared hit=85   Planning time: 0.110 ms   Execution time: 2.378 ms  (10 rows)

预计算与性能优化

前面的推荐文章、找相似的人。指的是实时查询的性能，而实际这些操作都可以预计算的(因为文章增量不会太大、而且文章的阅读人群变化不会太大)，例如一天刷新一次，那么像用户推荐相似用户，推荐相似文章时，有预计算则直接查询结果，那性能会提升到0.0N毫秒级响应。没有预计算的新文章，则实时查询（并更新到预计算的表中），也能够毫秒级响应。

预计算还可以做成另一种模式，当有人查询这篇文章时，根据上次预计算的时间，决定是否需要重新查询，并更新它。（也就是说，实时计算 + 缓存 + 缓存超时的模式。）

逻辑如下

select xxx from pre_view_tbl where xxx=xxx;  -- 查询缓存，return    -- 写入或更新缓存  if not found then    -- 同步写入    insert into pre_view_tbl select xxxx returning *; -- 实时计算, 并返回  else if mod_time < (now() - 超时阈值) then    -- 异步    delete from pre_view_tbl where xxx=xxx;    insert into pre_view_tbl select xxxx; -- 实时计算  end if;

小结

3分开发，7分运营。内容网站与社交软件类似，运营是重头戏。运营中关键的一环是圈子，圈子可以聚人气，形成圈子往往靠的是推荐，推荐的源头又是行为，推荐什么样的内容、人给目标，靠的是行为。所谓物以类聚，人以群居，就是这个理。

PostgreSQL 的数组、smlar实现高效的归类查询、推荐需求非常的方便。

1、数组用于存储正向和反向关系，标签等。

2、smlar用于查询相似的数组（找出爱好相似的人）。

在社交运营、内容运营场景中，非常方便、高效。

热点人、热点文章也不在话下，在其他案例中已经测试过，可以参考本文末尾。

参考

https://www.postgresql.org/docs/10/static/plpgsql.html

《电商内容去重内容筛选应用(实时识别转载盗图侵权?) - 文本、图片集、商品集、数组相似判定的优化和索引技术》

https://www.postgresql.org/docs/10/static/intarray.html

计数、实时需求也口可以使用流计算，案例参考：

《三体高可用PCC大赛 - facebook微博 like场景 - 数据库设计与性能压测》

https://github.com/bitnine-oss/agensgraph

PostgreSQL 实践 - 内容社区(如论坛)图式搜索应用

PostgreSQL关系型数据库管理系统

作者 digoal

背景

如何实现这样的需求呢？

设计

元数据

关系数据

查询

生成正反向关系的UDF

索引

填充测试数据

性能测试

预计算与性能优化

小结

参考

后端技术

前端技术

数据库

热门框架

常用IDE

其他