三种中位数求法效率比拼

解法1：这是日本人MICK在其著作《SQL进阶教程》里提出的方法：

select avg(distinct salary)
from 
(
    select t1.salary
    from tb_employee t1,tb_employee t2
    group by t1.salary
    having sum(case when t1.salary>=t2.salary then 1 else 0 end)>=count(*)/2
        and  sum(case when t1.salary<=t2.salary then 1 else 0 end)>=count(*)/2
)

解法2：这是我在https://www.cnblogs.com/xiandedanteng/p/12637767.html 里提出的方法：

select avg(b.salary) from
(select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_employee order by salary) a ) b
where b.diff=(select min(c.diff) from
(select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,rank() over (order by salary asc) as seq,rank() over (order by salary desc) as revseq from tb_employee order by salary) a ) c)

上面把rank函数换成row_number函数也是一样的结果。

解法3：这是我在 https://www.cnblogs.com/xiandedanteng/p/12677633.html 里提出的方法：

select avg(a.salary) from
(select 
id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil 
from tb_employee) a
where a.seq=a.ceil or a.revseq=a.ceil

上面的row_numbe函数不能换成rank函数，否则查不出结果。

为了比拼效率，我特地制作了一张有两百万随机数据的大表：

表结构：

create table tb_employee(
    id number(7,0) primary key,
    name nvarchar2(20) not null,
    salary integer not null)

充值方法：

insert into tb_employee
select rownum,dbms_random.string('*',dbms_random.value(2,20)),dbms_random.value(1000,10000)
from dual
connect by level<2000001
order by dbms_random.random;

然后发现，执行方法一时搞得sql plus都僵死了，估计是做非等值连接时出现两百万乘以两百万的超大结果集导致的，方法二三倒是可以出来，直觉上感觉方法三更快。

两百万数据时方法二的解释计划：

SQL> select avg(b.salary) from
  2  (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) b
  3  where b.diff=(select min(c.diff) from
  4  (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) c);
已用时间:  00: 00: 00.00

执行计划
----------------------------------------------------------
Plan hash value: 3874635296

-----------------------------------------------------------------------------------------------
| Id  | Operation               | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |             |     1 |    39 |       | 34592   (2)| 00:06:56 |
|   1 |  SORT AGGREGATE         |             |     1 |    39 |       |            |          |
|*  2 |   VIEW                  |             |  1916K|    71M|       | 17296   (2)| 00:03:28 |
|   3 |    WINDOW SORT          |             |  1916K|    47M|    66M| 17296   (2)| 00:03:28 |
|   4 |     WINDOW SORT         |             |  1916K|    47M|    66M| 17296   (2)| 00:03:28 |
|   5 |      TABLE ACCESS FULL  | TB_EMPLOYEE |  1916K|    47M|       |  3044   (2)| 00:00:37 |
|   6 |    SORT AGGREGATE       |             |     1 |    26 |       |            |          |
|   7 |     VIEW                |             |  1916K|    47M|       | 17296   (2)| 00:03:28 |
|   8 |      WINDOW SORT        |             |  1916K|    47M|    66M| 17296   (2)| 00:03:28 |
|   9 |       WINDOW SORT       |             |  1916K|    47M|    66M| 17296   (2)| 00:03:28 |
|  10 |        TABLE ACCESS FULL| TB_EMPLOYEE |  1916K|    47M|       |  3044   (2)| 00:00:37 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ABS("A"."SEQ"-"A"."REVSEQ")= (SELECT MIN(ABS("A"."SEQ"-"A"."REVSEQ"))
              FROM  (SELECT "ID" "ID","SALARY" "SALARY",ROW_NUMBER() OVER ( ORDER BY "SALARY")
              "SEQ",ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("SALARY") DESC ) "REVSEQ" FROM
              "TB_EMPLOYEE" "TB_EMPLOYEE" ORDER BY "SALARY") "A"))

Note
-----
   - dynamic sampling used for this statement (level=2)

两百万数据时方法三的解释计划：

SQL> select avg(a.salary) from
  2  (select
  3  id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil
  4  from tb_employee) a
  5  where a.seq=a.ceil or a.revseq=a.ceil;
已用时间:  00: 00: 00.00

执行计划
----------------------------------------------------------
Plan hash value: 3541675306

-----------------------------------------------------------------------------------------------
| Id  | Operation              | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |              |     1 |    52 |       | 21406   (2)| 00:04:17 |
|   1 |  SORT AGGREGATE        |              |     1 |    52 |       |            |          |
|   2 |   SORT AGGREGATE       |              |     1 |       |       |            |          |
|   3 |    INDEX FAST FULL SCAN| SYS_C0012264 |  1916K|       |       |  1504   (1)| 00:00:19 |
|*  4 |   VIEW                 |              |  1916K|    95M|       | 21406   (2)| 00:04:17 |
|   5 |    WINDOW SORT         |              |  1916K|    23M|    36M| 21406   (2)| 00:04:17 |
|   6 |     WINDOW SORT        |              |  1916K|    23M|    36M| 21406   (2)| 00:04:17 |
|   7 |      TABLE ACCESS FULL | TB_EMPLOYEE  |  1916K|    23M|       |  3044   (2)| 00:00:37 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("A"."SEQ"="A"."CEIL" OR "A"."REVSEQ"="A"."CEIL")

Note
-----
   - dynamic sampling used for this statement (level=2)

方法三cost比方法二少三分之一多，这与执行时间上的感觉是可以互相印证的。

而跑方法一的效果就有点恐怖了：

SQL> select avg(distinct salary)
  2  from
  3  (
  4      select t1.salary
  5      from tb_employee t1,tb_employee t2
  6      group by t1.salary
  7      having sum(case when t1.salary>=t2.salary then 1 else 0 end)>=count(*)/2
  8          and  sum(case when t1.salary<=t2.salary then 1 else 0 end)>=count(*)/2
  9  );
已用时间:  00: 00: 00.00

执行计划
----------------------------------------------------------
Plan hash value: 93677323

------------------------------------------------------------------------------------------
| Id  | Operation                  | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |             |     1 |    13 |  6315M  (9)|999:59:59 |
|   1 |  SORT AGGREGATE            |             |     1 |    13 |            |          |
|   2 |   VIEW                     | VM_NWVW_1   |  3671G|    43T|  6315M  (9)|999:59:59 |
|   3 |    HASH GROUP BY           |             |  3671G|    43T|  6315M  (9)|999:59:59 |
|   4 |     VIEW                   |             |  3671G|    43T|  6315M  (9)|999:59:59 |
|*  5 |      FILTER                |             |       |       |            |          |
|   6 |       HASH GROUP BY        |             |  3671G|    86T|  6315M  (9)|999:59:59 |
|   7 |        MERGE JOIN CARTESIAN|             |  3671G|    86T|  5829M  (2)|999:59:59 |
|   8 |         TABLE ACCESS FULL  | TB_EMPLOYEE |  1916K|    23M|  3044   (2)| 00:00:37 |
|   9 |         BUFFER SORT        |             |  1916K|    23M|  6315M  (9)|999:59:59 |
|  10 |          TABLE ACCESS FULL | TB_EMPLOYEE |  1916K|    23M|  3042   (2)| 00:00:37 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter(SUM(CASE  WHEN "T1"."SALARY">="T2"."SALARY" THEN 1 ELSE 0 END
              )>=COUNT(*)/2 AND SUM(CASE  WHEN "T1"."SALARY"<="T2"."SALARY" THEN 1 ELSE 0 END
              )>=COUNT(*)/2)

Note
-----
   - dynamic sampling used for this statement (level=2)

Cost是6315M！，令人惊讶的数字，难怪会让sql plus跑死。

方案二的运行时间：

SQL> set autotrace off;
SQL> select avg(b.salary) from
  2  (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) b
  3  where b.diff=(select min(c.diff) from
  4  (select a.*,abs(a.seq-a.revseq) as diff from (select id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq from tb_employee order by salary) a ) c);

AVG(B.SALARY)
-------------
         5498

已用时间:  00: 00: 10.76

方案三的运行时间：

SQL> select avg(a.salary) from
  2  (select
  3  id,salary,row_number() over (order by salary asc) as seq,row_number() over (order by salary desc) as revseq,(select ceil(count(*)/2) from tb_employee) as ceil
  4  from tb_employee) a
  5  where a.seq=a.ceil or a.revseq=a.ceil;

AVG(A.SALARY)
-------------
         5498

已用时间:  00: 00: 01.89

方案一因运行导致僵死而不能参赛。

最终的大比拼表格：

#	方案一	方案二	方案三
执行结果	没出来	5498	5498
Cost	6315M	34592	21406
时间	僵死	10.76秒	1.89秒

从这里看，代码上简洁，未必是最好的方案，经过实际检验执行时间最短才是。

如果要衡量一个程序的好坏，第一标准是出现正确的结果，第二是出现正确结果的最快方案，第三才是寻求最快方案里的最简洁方案。

从以上标准衡量，方法三最优，方法二其次，方法一则经不住大量数据的考验，只适用于小数据量的场合。对于这个结论，不知《SQL进阶教程》的著者和译者看了有什么想法。

当然，方法三和方法二出现的结果是有细微差别的，认真执行过的同学能知道，具体原因还需要日后探讨。

--2020年4月11日--

posted @ 2020-04-11 08:58 逆火狂飙阅读(521) 评论(4) 收藏举报

刷新页面返回顶部