ClickHouse多个同层JOIN的使用优化

Published on with 0 views and 0 comments

此篇作为ClickHouse的SQL优化的衍生篇,主要对SQL中同层存在多个JOIN时的SQL使用优化,这是一个很通用的场景,但在目前ClickHouse的过滤条件下推的实现还不完善的前提下,不做好人工下推的工作,会严重影响SQL的执行。

总结如下

  • 同一层级不允许使用两个及两个以上的(非LEFT/INNER)JOIN,除非所有都是LEFT JOIN/INNER JOIN。
    • ClickHouse会把同一层多个JOIN的语句重构成多层,每一次只会有一个JOIN。常规情况下数据表都会在第一个,在此前提下只要出现至少一个 RIGHT JOIN、FULL JOIN,那么WHERE条件就不会下推导第一层JOIN,从而会出现数据表被全表做JOIN的情况,从而使性能急剧下降。
    • 不推荐通过对数据表单独封装子查询的方案原因为,在维度表用的是local表时,此方案会造成数据不准,具体详见CH查询逻辑

多个同层JOIN的优化

一般情况下ClickHouse都会用一个数据表,使用LEFT JOIN关联多个维度表,完成维度信息扩充,从而实现数据统计。在这种情况下WHERE语句的过滤条件会成功下推到语法解析后的每一层子查询,从而不会对性能带来太大的影响。

select A.id, B.name, C.val
from A 
left join B on A.id = B.id
left join C on A.id = C.id
where A.dat = '2021-09-04'

但当一些特殊场景下,需要掺杂一个非指向数据表的JOIN(RIGHT JOIN,FULL JOIN)时,发现WHERE语句的过滤条件不会再下推导语法解析后的每一层子查询中,也就是A表会被全表数据拿来做JOIN,从而是性能急剧下降。

select A.id, B.name, C.val
from A 
left join B on A.id = B.id
right join C on A.id = C.id
where A.dat = '2021-09-04'

以下一个具体的SQL,结合explain syntax进行语法分析,以及各种变种分析

left join + full join,性能爆炸,以下SQL花了100S

select ao.order_name     as order_name
    , cm.name            as campaign_name
    , sum(cnt_total_imp) as imp
from ad_report.dsp_real_time_consume_report r
left join (
    select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
where r.dat = '2021-09-15'
group by ao.order_name, cm.name

利用explain syntax分析,可以发现ClickHouse最后实际执行语句中,WHERE语句中的dat='2021-09-15'没有下推,从造成数据表ad_report.dsp_real_time_consume_report被全表用来做join了

SELECT                                              
    order_name AS order_name,                       
    name AS campaign_name,                          
    sum(cnt_total_imp) AS imp                       
FROM                                                
(                                                   
    SELECT                                          
        campaign_id,                                
        dat,                                        
        cnt_total_imp,                              
        order_id AS `--r.order_id`,                 
        id,                                         
        name                                        
    FROM ad_report.dsp_real_time_consume_report AS r
    ALL LEFT JOIN                                   
    (                                               
        SELECT                                      
            id,                                     
            name                                    
        FROM ad_fancy.campaign                      
    ) AS cm ON campaign_id = id                     
) AS `--.s`                                         
ALL FULL OUTER JOIN                                 
(                                                   
    SELECT                                          
        order_id,                                   
        order_name                                  
    FROM amo.order                                  
) AS ao ON toUInt32(`--r.order_id`) = order_id      
WHERE dat = '2021-09-15'                            
GROUP BY                                            
    order_name,                                     
    name

此时在不改变需求的前提下,手动把多JOIN转换成单JOIN的模式,并且把过滤条件提前下推到数据表层,此时只需0.6S完成查询

推荐优化模式

select ao.order_name     as order_name
    , r.campaign_name    as campaign_name
    , sum(cnt_total_imp) as imp
from (
    select r.order_id, r.cnt_total_imp, cm.name as campaign_name
    from ad_report.dsp_real_time_consume_report r
    left join (
        select id, name from ad_fancy.campaign
    ) cm on r.campaign_id = cm.id
    where dat = '2021-09-15'
) r
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name

为何不使用只用子查询封装数据表,然后保留left join与full join同层的模式呢?原因在于此方案的缺陷过大,假如第二层维度表使用了local表,就会造成数据不对,而且对于不熟悉CH的用户,容易忽略这一点

不推荐优化模式

select ao.order_name     as order_name
    , cm.name            as campaign_name
    , sum(cnt_total_imp) as imp
from (
    select order_id, campaign_id, cnt_total_imp
    from ad_report.dsp_real_time_consume_report
    where dat = '2021-09-15'
) r
left join (
    select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name

以下为分别把第二个JOIN改为LEFT JOIN/RIGHT JOIN的语法分析及运行时间,探测具体哪些情况会有问题

LEFT JOIN + LEFT JOIN 0.6S,从语法分析上看成功下推,注意被封装的第一层中有where dat = '2021-09-15'

SELECT                                               
    order_name AS order_name,                        
    name AS campaign_name,                           
    sum(cnt_total_imp) AS imp                        
FROM                                                 
(                                                    
    SELECT                                           
        campaign_id,                                 
        dat,                                         
        cnt_total_imp,                               
        order_id AS `--r.order_id`,                  
        id,                                          
        name                                         
    FROM ad_report.dsp_real_time_consume_report AS r 
    ALL LEFT JOIN                                    
    (                                                
        SELECT                                       
            id,                                      
            name                                     
        FROM ad_fancy.campaign                       
    ) AS cm ON campaign_id = id                      
    WHERE dat = '2021-09-15'                         
) AS `--.s`                                          
ALL LEFT JOIN                                        
(                                                    
    SELECT                                           
        order_id,                                    
        order_name                                   
    FROM amo.order                                   
) AS ao ON toUInt32(`--r.order_id`) = order_id       
WHERE dat = '2021-09-15'                             
GROUP BY                                             
    order_name,                                      
    name
LEFT JOIN + RIGHT JOIN 100S,从语法分析上看,没有实现下推,注意被封装的第一层中没有where语句
SELECT                                              
    order_name AS order_name,                       
    name AS campaign_name,                          
    sum(cnt_total_imp) AS imp                       
FROM                                                
(                                                   
    SELECT                                          
        campaign_id,                                
        dat,                                        
        cnt_total_imp,                              
        order_id AS `--r.order_id`,                 
        id,                                         
        name                                        
    FROM ad_report.dsp_real_time_consume_report AS r
    ALL LEFT JOIN                                   
    (                                               
        SELECT                                      
            id,                                     
            name                                    
        FROM ad_fancy.campaign                      
    ) AS cm ON campaign_id = id                     
) AS `--.s`                                         
ALL RIGHT JOIN                                      
(                                                   
    SELECT                                          
        order_id,                                   
        order_name                                  
    FROM amo.order                                  
) AS ao ON toUInt32(`--r.order_id`) = order_id      
WHERE dat = '2021-09-15'                            
GROUP BY                                            
    order_name,                                     
    name 
RIGHT JOIN + RIGHT JOIN 100S,从语法分析上看,没有实现下推,注意被封装的第一层中没有where语句
SELECT                                                
    order_name AS order_name,                        
    name AS campaign_name,                           
    sum(cnt_total_imp) AS imp                        
FROM                                                 
(                                                    
    SELECT                                           
        campaign_id,                                 
        dat,                                         
        cnt_total_imp,                               
        order_id AS `--r.order_id`,                  
        id,                                          
        name                                         
    FROM ad_report.dsp_real_time_consume_report AS r 
    ALL RIGHT JOIN                                   
    (                                                
        SELECT                                       
            id,                                      
            name                                     
        FROM ad_fancy.campaign                       
    ) AS cm ON campaign_id = id                      
) AS `--.s`                                          
ALL RIGHT JOIN                                       
(                                                    
    SELECT                                           
        order_id,                                    
        order_name                                   
    FROM amo.order                                   
) AS ao ON toUInt32(`--r.order_id`) = order_id       
WHERE dat = '2021-09-15'                             
GROUP BY                                             
    order_name,                                      
    name

标题:ClickHouse多个同层JOIN的使用优化
作者:wuqingbo
地址:http://solo.fancydigital.com.cn/articles/2021/11/22/1637584776687.html