ClickHouse的ASOF JOIN功能的使用探索

Updated on with 0 views and 0 comments

ClickHouse的JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN都是不支持在ON中都只支持等式判定,无法支持不等式判定。因此提供了一个叫ASOF JOIN的功能,其支持在ON写不等式,但依旧有以下限定

  1. 只支持一个不等式,即 N 个等式 + 1个不等式的写法。
  2. 不等式的触发逻辑是最近匹配(closest match condition),具体含义见如下栗子
    • 从结果可以看出虽然表1中event_1_2的时间大于表2中的所有记录,但最后只取了与其最接近的event_2_3。这就是closest match的真谛。
table1eventev_timeuser_idtable2eventev_timeuser_id
|event_1_12021-09-09 12:00:0042|event_2_12021-09-09 11:59:0042
|event_1_22021-09-09 13:00:0042|event_2_22021-09-09 12:30:0042
| |event_2_32021-09-09 13:00:0042

结果

user_ideventev_timeevent2ev_time2
42event_1_22021-09-09 05:00:00event_2_32021-09-09 05:00:00
42event_1_12021-09-09 04:00:00event_2_12021-09-09 03:59:00
select table1.user_id, table1.event, table1.ev_time, table2.event as event2, table2.ev_time as ev_time2
from
(
    select 'event_1_1' as event, toDateTime('2021-09-09 12:00:00') as ev_time, 42 as user_id
    union all
    select 'event_1_2' as event, toDateTime('2021-09-09 13:00:00') as ev_time, 42 as user_id
) table1
asof left join
(
    select 'event_2_1' as event, toDateTime('2021-09-09 11:59:00') as ev_time, 42 as user_id
    union all
    select 'event_2_2' as event, toDateTime('2021-09-09 12:30:00') as ev_time, 42 as user_id
    union all
    select 'event_2_3' as event, toDateTime('2021-09-09 13:00:00') as ev_time, 42 as user_id
) table2
on table1.user_id = table2.user_id and table1.ev_time >= table2.ev_time 

场景1:AMO订单与FancyDSP系统映射关系的使用

FancyDSP允许一个投放计划同时映射多个AMO订单,同时有个前提,即这些AMO订单的执行周期是无交集的。

举个栗子

  • campaign1 只绑定一个amo_order1这个是允许,也是最常见的情况。
  • campaign1 不绑定任何一个AMO订单,这是不允许的,目前是系统直接禁止的,未绑定AMO订单的投放计划无法建设出来。
  • campaign1 同时绑定 amo_order1, amo_order2
    • 假如amo_order1的执行周期是2021-09-01到2021-09-04,amo_order2的执行周期是2021-09-05到2021-09-10,那么这种情况就是允许的
    • 假如amo_order1的执行周期是2021-09-01到2021-09-04,amo_order2的执行周期是2021-09-04到2021-09-10,由于存在交集20212-09-04,那么这种情况就是不允许的

基于以上的场景,汇总底表数据中只有分天,分投放计划等执行系统信息,外加一个投放计划于AMO订单的映射表,此时如何准确用汇总数据及映射表信息关联出AMO订单信息呢?

分析可得,此问题最大的难点

  1. 在于报表查询时允许多天查询,于是这查询的日期范围内就有可能同时命中一个投放计划关联的多个AMO订单。
  2. 可能存在漏绑等现象,那么底表数据此时的不应该被过滤掉,至少可以以投放计划的名义显示出来

举个栗子来说明就是

  • campaign1 同时绑定了 amo_order1与amo_order2
  • amo_order1的执行周期为2021-09-03到2021-09-04
  • amo_order2的执行周期为2021-09-05到2021-09-05
  • campaign1在2021-09-06号还有遗留数据
  • 查询日期为2021-09-03到2021-09-06

在上述栗子中,最终实现的数据需要满足以下几点

  1. 3号与4号的数据归到amo_order1
  2. 5号的数据归到amo_order2
  3. 6号的数据不能丢,但无法归到任何一个amo_order中

我们先从几个错误的尝试开始

尝试1: 先用left join把维度信息关联上,然后在where语句做过滤,此方法无法满足第3点。

select dat, data.c_id
    , caom.order_id as caom_order_id
    , ao.order_id as ao_order_id
from
(
    select toDate('2021-09-03') as dat, 1 as c_id
    union all
    select toDate('2021-09-04') as dat, 1 as c_id
    union all
    select toDate('2021-09-05') as dat, 1 as c_id
    union all
    select toDate('2021-09-06') as dat, 1 as c_id
) data
left join (
    select 1 as c_id, 1 as order_id
    union all
    select 1 as c_id, 2 as order_id
) caom on data.c_id = caom.c_id
left join (
    select 1 as order_id, toDate('2021-09-03') as start_date, toDate('2021-09-04') as end_date
    union all
    select 2 as order_id, toDate('2021-09-05') as start_date, toDate('2021-09-05') as end_date
) ao on caom.order_id = ao.order_id 
where data.dat >= ao.start_date and data.dat <= ao.end_date
order by dat, ao_order_id
datc_idcaom_order_idao_order_id
2021-09-03111
2021-09-04111
2021-09-05122

尝试2: 把对AMO的订单信息在子查询中直接过滤订单的日期,实际上就类似上面的demo语句去掉where语句,实际上是无法满足1,2两点的。

datc_idcaom_order_idao_order_id
2021-09-03111
2021-09-03122
2021-09-04111
2021-09-04122
2021-09-05111
2021-09-05122
2021-09-06111
2021-09-06122

尝试3: 使用两次ASOF JOIN来实现两次不等式判定,此版本未充分考虑left join的特性,造成数据被放大了。

在第一个left join时,data数据就已经被放大,然后后面两个asof left join只会在此基础上确保数据不放大,已经无力回天。

select dat, data.c_id
    , caom.order_id as caom_order_id
    , ao1.order_id  as ao1_order_id
    , ao.order_id as ao_order_id
from
(
    select toDate('2021-09-03') as dat, 1 as c_id
    union all
    select toDate('2021-09-04') as dat, 1 as c_id
    union all
    select toDate('2021-09-05') as dat, 1 as c_id
    union all
    select toDate('2021-09-06') as dat, 1 as c_id
) data
left join (
    select 1 as c_id, 1 as order_id
    union all
    select 1 as c_id, 2 as order_id
) caom on data.c_id = caom.c_id
asof left join (
    select 1 as order_id, toDate('2021-09-03') as start_date
    union all
    select 2 as order_id, toDate('2021-09-05') as start_date
) ao1 on caom.order_id = ao1.order_id and data.dat >= ao1.start_date
asof left join (
    select 1 as order_id, toDate('2021-09-04') as end_date
    union all
    select 2 as order_id, toDate('2021-09-05') as end_date
) ao on  ao1.order_id = ao.order_id and data.dat <= ao.end_date
order by dat, caom_order_id
datc_idcaom_order_idao1_order_idao_order_id
2021-09-031111
2021-09-031200
2021-09-041111
2021-09-041200
2021-09-051110
2021-09-051222
2021-09-061110
2021-09-061220

尝试4(成功): 依旧使用两次ASOF JOIN来实现,把第一个的LEFT JOIN 与 ASOF LEFT JOIN通过子查询合并成一个,从而避免数据被放大。

select dat, data.c_id
    , ao1.order_id  as ao1_order_id
    , ao1.start_date as start_date
    , ao.order_id as ao_order_id
    , ao.end_date as end_date
from
(
    select toDate('2021-09-03') as dat, 1 as c_id
    union all
    select toDate('2021-09-04') as dat, 1 as c_id
    union all
    select toDate('2021-09-05') as dat, 1 as c_id
    union all
    select toDate('2021-09-06') as dat, 1 as c_id
) data
asof left join ( 
    select c_id, caom.order_id as order_id, ao.start_date as start_date
    from (
        select 1 as c_id, 1 as order_id
        union all
        select 1 as c_id, 2 as order_id
    ) caom 
    join (
        select 1 as order_id, toDate('2021-09-03') as start_date
        union all
        select 2 as order_id, toDate('2021-09-05') as start_date
    ) ao on caom.order_id = ao.order_id
) ao1 on data.c_id = ao1.c_id and data.dat >= ao1.start_date
asof left join (
    select ao.order_id as order_id, ao.end_date as end_date
    from (
        select 1 as order_id, toDate('2021-09-04') as end_date
        union all
        select 2 as order_id, toDate('2021-09-05') as end_date
    ) ao 
) ao on  ao1.order_id = ao.order_id and data.dat <= ao.end_date
order by dat, ao_order_id
datc_idao1_order_idstart_dateao_order_idend_date
2021-09-03112021-09-0312021-09-04
2021-09-04112021-09-0312021-09-04
2021-09-05122021-09-0522021-09-05
2021-09-06122021-09-0501970-01-01

标题:ClickHouse的ASOF JOIN功能的使用探索
作者:wuqingbo
地址:http://solo.fancydigital.com.cn/articles/2021/11/23/1637636915860.html