此篇作为ClickHouse的SQL优化的衍生篇,主要对SQL中同层存在多个JOIN时的SQL使用优化,这是一个很通用的场景,但在目前ClickHouse的过滤条件下推的实现还不完善的前提下,不做好人工下推的工作,会严重影响SQL的执行。
总结如下
一般情况下ClickHouse都会用一个数据表,使用LEFT JOIN关联多个维度表,完成维度信息扩充,从而实现数据统计。在这种情况下WHERE语句的过滤条件会成功下推到语法解析后的每一层子查询,从而不会对性能带来太大的影响。
select A.id, B.name, C.val
from A 
left join B on A.id = B.id
left join C on A.id = C.id
where A.dat = '2021-09-04'
但当一些特殊场景下,需要掺杂一个非指向数据表的JOIN(RIGHT JOIN,FULL JOIN)时,发现WHERE语句的过滤条件不会再下推导语法解析后的每一层子查询中,也就是A表会被全表数据拿来做JOIN,从而是性能急剧下降。
select A.id, B.name, C.val
from A 
left join B on A.id = B.id
right join C on A.id = C.id
where A.dat = '2021-09-04'
以下一个具体的SQL,结合explain syntax进行语法分析,以及各种变种分析
select ao.order_name     as order_name
    , cm.name            as campaign_name
    , sum(cnt_total_imp) as imp
from ad_report.dsp_real_time_consume_report r
left join (
    select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
where r.dat = '2021-09-15'
group by ao.order_name, cm.name
利用explain syntax分析,可以发现ClickHouse最后实际执行语句中,WHERE语句中的dat='2021-09-15'没有下推,从造成数据表ad_report.dsp_real_time_consume_report被全表用来做join了
SELECT                                              
    order_name AS order_name,                       
    name AS campaign_name,                          
    sum(cnt_total_imp) AS imp                       
FROM                                                
(                                                   
    SELECT                                          
        campaign_id,                                
        dat,                                        
        cnt_total_imp,                              
        order_id AS `--r.order_id`,                 
        id,                                         
        name                                        
    FROM ad_report.dsp_real_time_consume_report AS r
    ALL LEFT JOIN                                   
    (                                               
        SELECT                                      
            id,                                     
            name                                    
        FROM ad_fancy.campaign                      
    ) AS cm ON campaign_id = id                     
) AS `--.s`                                         
ALL FULL OUTER JOIN                                 
(                                                   
    SELECT                                          
        order_id,                                   
        order_name                                  
    FROM amo.order                                  
) AS ao ON toUInt32(`--r.order_id`) = order_id      
WHERE dat = '2021-09-15'                            
GROUP BY                                            
    order_name,                                     
    name
此时在不改变需求的前提下,手动把多JOIN转换成单JOIN的模式,并且把过滤条件提前下推到数据表层,此时只需0.6S完成查询
推荐优化模式
select ao.order_name     as order_name
    , r.campaign_name    as campaign_name
    , sum(cnt_total_imp) as imp
from (
    select r.order_id, r.cnt_total_imp, cm.name as campaign_name
    from ad_report.dsp_real_time_consume_report r
    left join (
        select id, name from ad_fancy.campaign
    ) cm on r.campaign_id = cm.id
    where dat = '2021-09-15'
) r
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name
为何不使用只用子查询封装数据表,然后保留left join与full join同层的模式呢?原因在于此方案的缺陷过大,假如第二层维度表使用了local表,就会造成数据不对,而且对于不熟悉CH的用户,容易忽略这一点
不推荐优化模式
select ao.order_name     as order_name
    , cm.name            as campaign_name
    , sum(cnt_total_imp) as imp
from (
    select order_id, campaign_id, cnt_total_imp
    from ad_report.dsp_real_time_consume_report
    where dat = '2021-09-15'
) r
left join (
    select id, name from ad_fancy.campaign
) cm on r.campaign_id = cm.id
full join (
    select order_id, order_name from amo.`order`
) ao on toUInt32(r.order_id) = ao.order_id
group by ao.order_name, cm.name
where dat = '2021-09-15'SELECT                                               
    order_name AS order_name,                        
    name AS campaign_name,                           
    sum(cnt_total_imp) AS imp                        
FROM                                                 
(                                                    
    SELECT                                           
        campaign_id,                                 
        dat,                                         
        cnt_total_imp,                               
        order_id AS `--r.order_id`,                  
        id,                                          
        name                                         
    FROM ad_report.dsp_real_time_consume_report AS r 
    ALL LEFT JOIN                                    
    (                                                
        SELECT                                       
            id,                                      
            name                                     
        FROM ad_fancy.campaign                       
    ) AS cm ON campaign_id = id                      
    WHERE dat = '2021-09-15'                         
) AS `--.s`                                          
ALL LEFT JOIN                                        
(                                                    
    SELECT                                           
        order_id,                                    
        order_name                                   
    FROM amo.order                                   
) AS ao ON toUInt32(`--r.order_id`) = order_id       
WHERE dat = '2021-09-15'                             
GROUP BY                                             
    order_name,                                      
    name
SELECT                                              
    order_name AS order_name,                       
    name AS campaign_name,                          
    sum(cnt_total_imp) AS imp                       
FROM                                                
(                                                   
    SELECT                                          
        campaign_id,                                
        dat,                                        
        cnt_total_imp,                              
        order_id AS `--r.order_id`,                 
        id,                                         
        name                                        
    FROM ad_report.dsp_real_time_consume_report AS r
    ALL LEFT JOIN                                   
    (                                               
        SELECT                                      
            id,                                     
            name                                    
        FROM ad_fancy.campaign                      
    ) AS cm ON campaign_id = id                     
) AS `--.s`                                         
ALL RIGHT JOIN                                      
(                                                   
    SELECT                                          
        order_id,                                   
        order_name                                  
    FROM amo.order                                  
) AS ao ON toUInt32(`--r.order_id`) = order_id      
WHERE dat = '2021-09-15'                            
GROUP BY                                            
    order_name,                                     
    name 
SELECT                                                
    order_name AS order_name,                        
    name AS campaign_name,                           
    sum(cnt_total_imp) AS imp                        
FROM                                                 
(                                                    
    SELECT                                           
        campaign_id,                                 
        dat,                                         
        cnt_total_imp,                               
        order_id AS `--r.order_id`,                  
        id,                                          
        name                                         
    FROM ad_report.dsp_real_time_consume_report AS r 
    ALL RIGHT JOIN                                   
    (                                                
        SELECT                                       
            id,                                      
            name                                     
        FROM ad_fancy.campaign                       
    ) AS cm ON campaign_id = id                      
) AS `--.s`                                          
ALL RIGHT JOIN                                       
(                                                    
    SELECT                                           
        order_id,                                    
        order_name                                   
    FROM amo.order                                   
) AS ao ON toUInt32(`--r.order_id`) = order_id       
WHERE dat = '2021-09-15'                             
GROUP BY                                             
    order_name,                                      
    name