博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
hive left outer join的问题
阅读量:4689 次
发布时间:2019-06-09

本文共 10402 字,大约阅读时间需要 34 分钟。

最近BA用户反馈有两句看似很像的语句返回的结果数不一样,比较奇怪,怀疑是不是Hive的Bug

Query 1 返回结果数6071

 

select count(distinct reviewid) as dis_reviewcntfrom(select a.reviewidfrom bi.dpods_dp_reviewreport  aleft outer join bi.dpods_dp_reviewlog  bon a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'  where to_date(a.feedadddate) >= '2013-07-01'   and a.hp_statdate='2013-07-24') a

Query 2 返回结果数6443

 

 

select count(distinct reviewid) as dis_reviewcntfrom(select a.reviewidfrom bi.dpods_dp_reviewreport  aleft outer join bi.dpods_dp_reviewlog  bon a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'   and a.hp_statdate='2013-07-24'where to_date(a.feedadddate) >= '2013-07-01'  ) a

第二条query比第一条多了372条数据,而且在子查询的左表中并不存在

 

两条语句唯一的区别是dpods_dp_reviewreport的分区过滤条件(hp_statdate是partition column)一个在where后面,另一个在on后面

粗看感觉出来的数据应该是一样的,但是玄机其实就在where和on的区别。

where 后面跟的是过滤条件,query 1 中的a.hp_statdate='2013-07-24', 在table scan之前就会Partition Pruner 过滤分区,所以只有'2013-07-24'下的数据会和dpods_dp_reviewlog进行join。

而query 2中会读入所有partition下的数据,再和dpods_dp_reviewlog join,并且根据join的关联条件只有a.hp_statdate='2013-07-24'的时候才会真正执行join,其余情况下又由于是left outer join, join不上右面会留NULL,query 2中其实是取出了所有的reviewid,所以会和query 1 结果不一样

可以做一个实验,query2去掉on后面的a.hp_statdate='2013-07-24',其余不动,执行语句,出来的distinct reviewcnt 也是 6443

 

select count(distinct reviewid) as dis_reviewcntfrom(select a.reviewidfrom bi.dpods_dp_reviewreport  aleft outer join bi.dpods_dp_reviewlog  bon a.reviewid=b.reviewid and  b.hp_statdate='2013-07-24'where to_date(a.feedadddate) >= '2013-07-01'  ) a

 

query 1的query plan

 

ABSTRACT SYNTAX TREE:  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))STAGE DEPENDENCIES:  Stage-5 is a root stage , consists of Stage-1  Stage-1  Stage-2 depends on stages: Stage-1  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-5    Conditional Operator  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        a:a           TableScan            alias: a            Filter Operator              predicate:                  expr: (to_date(feedadddate) >= '2013-07-01')                  type: boolean              Reduce Output Operator                key expressions:                      expr: reviewid                      type: int                sort order: +                Map-reduce partition columns:                      expr: reviewid                      type: int                tag: 0                value expressions:                      expr: feedadddate                      type: string                      expr: reviewid                      type: int                      expr: hp_statdate                      type: string        a:b           TableScan            alias: b            Reduce Output Operator              key expressions:                    expr: reviewid                    type: int              sort order: +              Map-reduce partition columns:                    expr: reviewid                    type: int              tag: 1      Reduce Operator Tree:        Join Operator          condition map:               Left Outer Join0 to 1          condition expressions:            0 {VALUE._col5} {VALUE._col8} {VALUE._col17}            1           handleSkewJoin: false          outputColumnNames: _col5, _col8, _col17          Select Operator            expressions:                  expr: _col8                  type: int            outputColumnNames: _col0            Select Operator              expressions:                    expr: _col0                    type: int              outputColumnNames: _col0              Group By Operator                aggregations:                      expr: count(DISTINCT _col0)                bucketGroup: false                keys:                      expr: _col0                      type: int                mode: hash                outputColumnNames: _col0, _col1                File Output Operator                  compressed: true                  GlobalTableId: 0                  table:                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat  Stage: Stage-2    Map Reduce      Alias -> Map Operator Tree:        hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002             Reduce Output Operator              key expressions:                    expr: _col0                    type: int              sort order: +              tag: -1              value expressions:                    expr: _col1                    type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col0:0._col0)          bucketGroup: false          mode: mergepartial          outputColumnNames: _col0          Select Operator            expressions:                  expr: _col0                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

Query 2的query plan

 

 

ABSTRACT SYNTAX TREE:  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')) (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))STAGE DEPENDENCIES:  Stage-5 is a root stage , consists of Stage-1  Stage-1  Stage-2 depends on stages: Stage-1  Stage-0 is a root stageSTAGE PLANS:  Stage: Stage-5    Conditional Operator  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        a:a           TableScan            alias: a            Filter Operator              predicate:                  expr: (to_date(feedadddate) >= '2013-07-01')                  type: boolean              Reduce Output Operator                key expressions:                      expr: reviewid                      type: int                sort order: +                Map-reduce partition columns:                      expr: reviewid                      type: int                tag: 0                value expressions:                      expr: feedadddate                      type: string                      expr: reviewid                      type: int                      expr: hp_statdate                      type: string        a:b           TableScan            alias: b            Reduce Output Operator              key expressions:                    expr: reviewid                    type: int              sort order: +              Map-reduce partition columns:                    expr: reviewid                    type: int              tag: 1      Reduce Operator Tree:        Join Operator          condition map:               Left Outer Join0 to 1          condition expressions:            0 {VALUE._col5} {VALUE._col8}            1           filter predicates:            0 {(VALUE._col17 = '2013-07-24')}            1           handleSkewJoin: false          outputColumnNames: _col5, _col8          Select Operator            expressions:                  expr: _col8                  type: int            outputColumnNames: _col0            Select Operator              expressions:                    expr: _col0                    type: int              outputColumnNames: _col0              Group By Operator                aggregations:                      expr: count(DISTINCT _col0)                bucketGroup: false                keys:                      expr: _col0                      type: int                mode: hash                outputColumnNames: _col0, _col1                File Output Operator                  compressed: true                  GlobalTableId: 0                  table:                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat  Stage: Stage-2    Map Reduce      Alias -> Map Operator Tree:        hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-13-32_879_3623450294049807419/-mr-10002             Reduce Output Operator              key expressions:                    expr: _col0                    type: int              sort order: +              tag: -1              value expressions:                    expr: _col1                    type: bigint      Reduce Operator Tree:        Group By Operator          aggregations:                expr: count(DISTINCT KEY._col0:0._col0)          bucketGroup: false          mode: mergepartial          outputColumnNames: _col0          Select Operator            expressions:                  expr: _col0                  type: bigint            outputColumnNames: _col0            File Output Operator              compressed: false              GlobalTableId: 0              table:                  input format: org.apache.hadoop.mapred.TextInputFormat                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  Stage: Stage-0    Fetch Operator      limit: -1

 

参考:

 

 

转载于:https://www.cnblogs.com/jiangu66/p/3221690.html

你可能感兴趣的文章
Navicat远程连接阿里云服务器的mysql
查看>>
https soap链接示例
查看>>
Git Day02,工作区,暂存区,回退,删除文件
查看>>
学前班
查看>>
关于自关联1
查看>>
hdu-1814(2-sat)
查看>>
谷歌浏览器,添加默认搜索引擎的搜索地址
查看>>
数据结构化与保存
查看>>
如何避免在简单业务逻辑上面的细节上面出错
查看>>
Linux shell 命令判断执行语法 ; , && , ||
查看>>
vim代码格式化插件clang-format
查看>>
RTP Payload Format for Transport of MPEG-4 Elementary Streams over http
查看>>
Java环境变量设置
查看>>
【JBPM4】判断节点decision 方法3 handler
查看>>
filter 过滤器(监听)
查看>>
node启动时, listen EADDRINUSE 报错;
查看>>
杭电3466————DP之01背包(对状态转移方程的更新理解)
查看>>
kafka中的消费组
查看>>
python--注释
查看>>
SQL case when else
查看>>