此篇作为ClickHouse的SQL优化的衍生篇,主要对SQL中同层存在多个JOIN时的SQL使用优化,这是一个很通用的场景,但在目前ClickHouse的过滤条件下推的实现还不完善的前提下,不做好人工下推的工作,会严重影响SQL的执行。
总结如下
一般情况下ClickHouse都会用一个数据表,使用LEFT JOIN关联多个维度表,完成维度信息扩充,从而实现数据统计。在这种情况下WHERE语句的过滤条件会成功下推到语法解析后的每一层子查询,从而不会对性能带来太大的影响。
select A.id, B.name, C.val
from A
left join B on A.id = B.id
left join C on A.id = C.id
where A.dat = '2021-09-04'
但当一些特殊场景下,需要掺杂一个非指向数据表的JOIN(RIGHT JOIN,FULL JOIN)时,发现WHERE语句的过滤条件不会再下推导语法解析后的每一层子查询中,也就是A表会被全表数据拿来做JOIN,从而是性能急剧下降。
select A.id, B.name, C.val
from A
left join B on A.id = B.id
right join C on A.id = C.id
where A.dat = '2021-09-04'
以下一个具体的SQL,结合explain syntax进行语法分析,以及各种变种分析
select ao.order_name as order_name
, cm.name as campaign_name
, sum(cnt_total_imp) as imp
from ad_report.dsp_real_time_consume_report r
left join (
select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
where r.dat = '2021-09-15'
group by ao.order_name, cm.name
利用explain syntax分析,可以发现ClickHouse最后实际执行语句中,WHERE语句中的dat='2021-09-15'没有下推,从造成数据表ad_report.dsp_real_time_consume_report被全表用来做join了
SELECT
order_name AS order_name,
name AS campaign_name,
sum(cnt_total_imp) AS imp
FROM
(
SELECT
campaign_id,
dat,
cnt_total_imp,
order_id AS `--r.order_id`,
id,
name
FROM ad_report.dsp_real_time_consume_report AS r
ALL LEFT JOIN
(
SELECT
id,
name
FROM ad_fancy.campaign
) AS cm ON campaign_id = id
) AS `--.s`
ALL FULL OUTER JOIN
(
SELECT
order_id,
order_name
FROM amo.order
) AS ao ON toUInt32(`--r.order_id`) = order_id
WHERE dat = '2021-09-15'
GROUP BY
order_name,
name
此时在不改变需求的前提下,手动把多JOIN转换成单JOIN的模式,并且把过滤条件提前下推到数据表层,此时只需0.6S完成查询
推荐优化模式
select ao.order_name as order_name
, r.campaign_name as campaign_name
, sum(cnt_total_imp) as imp
from (
select r.order_id, r.cnt_total_imp, cm.name as campaign_name
from ad_report.dsp_real_time_consume_report r
left join (
select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
where dat = '2021-09-15'
) r
full join (
select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name
为何不使用只用子查询封装数据表,然后保留left join与full join同层的模式呢?原因在于此方案的缺陷过大,假如第二层维度表使用了local表,就会造成数据不对,而且对于不熟悉CH的用户,容易忽略这一点
不推荐优化模式
select ao.order_name as order_name
, cm.name as campaign_name
, sum(cnt_total_imp) as imp
from (
select order_id, campaign_id, cnt_total_imp
from ad_report.dsp_real_time_consume_report
where dat = '2021-09-15'
) r
left join (
select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name
where dat = '2021-09-15'
SELECT
order_name AS order_name,
name AS campaign_name,
sum(cnt_total_imp) AS imp
FROM
(
SELECT
campaign_id,
dat,
cnt_total_imp,
order_id AS `--r.order_id`,
id,
name
FROM ad_report.dsp_real_time_consume_report AS r
ALL LEFT JOIN
(
SELECT
id,
name
FROM ad_fancy.campaign
) AS cm ON campaign_id = id
WHERE dat = '2021-09-15'
) AS `--.s`
ALL LEFT JOIN
(
SELECT
order_id,
order_name
FROM amo.order
) AS ao ON toUInt32(`--r.order_id`) = order_id
WHERE dat = '2021-09-15'
GROUP BY
order_name,
name
SELECT
order_name AS order_name,
name AS campaign_name,
sum(cnt_total_imp) AS imp
FROM
(
SELECT
campaign_id,
dat,
cnt_total_imp,
order_id AS `--r.order_id`,
id,
name
FROM ad_report.dsp_real_time_consume_report AS r
ALL LEFT JOIN
(
SELECT
id,
name
FROM ad_fancy.campaign
) AS cm ON campaign_id = id
) AS `--.s`
ALL RIGHT JOIN
(
SELECT
order_id,
order_name
FROM amo.order
) AS ao ON toUInt32(`--r.order_id`) = order_id
WHERE dat = '2021-09-15'
GROUP BY
order_name,
name
SELECT
order_name AS order_name,
name AS campaign_name,
sum(cnt_total_imp) AS imp
FROM
(
SELECT
campaign_id,
dat,
cnt_total_imp,
order_id AS `--r.order_id`,
id,
name
FROM ad_report.dsp_real_time_consume_report AS r
ALL RIGHT JOIN
(
SELECT
id,
name
FROM ad_fancy.campaign
) AS cm ON campaign_id = id
) AS `--.s`
ALL RIGHT JOIN
(
SELECT
order_id,
order_name
FROM amo.order
) AS ao ON toUInt32(`--r.order_id`) = order_id
WHERE dat = '2021-09-15'
GROUP BY
order_name,
name