oracle：中位数查询整理

中位数的数学定义
oracle中位数查询
- 1.求解思路
- 2.应用案例
  - (一)使用median函数的情况
  - (二)不使用median函数的情况

中位数的数学定义

中位数的定义如下图所示:

在oracle数据查询中，当N为偶数时，一般不取平均值，因为这有可能创建出新的样例使得查询语句为空，所以通常为偶数时，一般返回最中间的一组数据。

oracle中位数查询

1.求解思路

目前常见的求解主要包括以下两种思路：

1)从数值角度出发；

先按照目标列进行排序，然后按照总行数的奇偶性，利用类似上图的数学方法进行方法进行筛选。

2）从数值索引(所在位置)出发

这种方法从中位数可能出现的位置，进行巧妙构造，以选取目标位置。
类似于python中根据目标索引对list进行切片。其更符合中位数的定义，即有一半的数大于中位数，有一半数小于中位数。

实际应用中出于不同的考虑,对于内置函数median选择性使用

2.应用案例

为了方便说明，采用leetcode上的案例：编写SQL查询来查找每个公司的薪水中位数。

Id	Company	Salary
1	A	2341
2	A	341
3	A	15
4	A	15314
5	A	451
6	A	513
7	B	15
8	B	13
9	B	1154
10	B	1345
11	B	1221
12	B	234
13	C	2345
14	C	2645
15	C	2645
16	C	2652
17	C	65

表创建语句：

create table employee(
id number(3) primary key,
    company varchar2(2),
    salary number(8)
);

insert into employee values(1,'A',2341);
insert into employee values(2,'A',341);
insert into employee values(3,'A',15);
insert into employee values(4,'A',15314);
insert into employee values(5,'A',451);
insert into employee values(6,'A',513);
insert into employee values(7,'B',15);
insert into employee values(8,'B',13);
insert into employee values(9,'B',1154);
insert into employee values(10,'B',1354);
insert into employee values(11,'B',1221);
insert into employee values(12,'B',234);
insert into employee values(13,'C',2345);
insert into employee values(14,'C',2645);
insert into employee values(15,'C',2645);
insert into employee values(16,'C',2652);
insert into employee values(17,'C',65);

(一)使用`median`函数的情况

select id,company,salary
from (
	select tmp.* ,median(r_num) over(partition by company) med
	from (
			select e.*, row_number() over(partition by company order by salary) r_num
				from employee e) tmp)
where abs(r_num-med)<=0.5;

程序解释:利用median求出的中位数是数学上的定义,不是我们所需要的.所以利用median进行改造.思路:获取每一份组的行号row_number(对应的新列记为r_num)和median(对应的新列记为med),当abs(r_num-med)<=0.5时,说明取到了中位数;

统计结果如下:

(二)不使用`median`函数的情况

（1）----->>>对应的第一种方法
这种方法需要区分不同分组数据量的奇偶性，在使用where的时候需要考虑单值和多值混合查询。
具体查询语句：

select id,company,salary
from (
	select e.*, count(*) over(partition by company) x,row_number() over(partition by company order by salary) r_num
		from employee e) tmp
where r_num in (ceil(x/2),x/2+1);

最终的查询结果：

值得注意的是where r_num in (ceil(x/2),x/2+1)的设定。当x是奇数时，(ceil(x/2),x/2+1)中只有一个有效（即ceil(x/2)=（x+1）/2）;当x是偶数时，(ceil(x/2),x/2+1)=（x/2,x/2+1）就是最中间的一组数据。

这类方法的其它查询案例,大多需要通过group by+连接构造,相对于上面的比较复杂,本文暂且不考虑其它方法.

（2）----->>>对应的第一种方法

select e1.id,e1.company,e1.salary
from employee e1,employee e2
where e1.company=e2.company(+) 
group by e1.company,e1.salary,e1.id
having sum(decode(e1.salary-e2.salary,0,1,0))
        >=abs(sum(sign(e1.salary-e2.salary)))
order by e1.id;

程序分解：以A公司为例，共有6条数据。

上述程序中核心思路是采用自连接+having条件。having条件构造的特别巧妙。所以将对其进一步分解：如下图所示。

中间的命令行截图为A公司数据的自连接结果(按照salary)排序。其中“+”表示大于当前对象的样例个数(后面的具体数字)；“0”表示相等；“-”表示小于。sum(sign())表示having中的部分结果。x=sign(a)为符号函数，a>0,x=1;a=0,x=0,a<0,x=-1.

其中having的统计结果如下图所示，每一组中相等的情况中只有一种图表中的A列(A公司正好没有重复数据)，abs()的统计结果为(图表中的B列)，具体计算方法见上图（关联结果）

（2）----->>>对应的第二种方法
思路仍然是从中位数的定义出发，只不过引用的核心函数是row_number,和count（*）。其中row_number的使用方法，请查看oracle学习笔记(六)：oracle中排序函数及其应用_数据库_qq_40584718的博客-CSDN博客。

具体程序(来源于leetcode)如下：


/* Write your PL/SQL query statement below */
select
	id, company, salary
from
	(select
		id, company, salary,
		row_number() over (partition by company order by salary) as rn, -- 各薪水记录在其公司内的顺序编号
		count(1) over (partition by company) as cnt -- 各公司的薪水记录数
	from employee
	)
where abs(rn - (cnt+1)/2) < 1 -- 顺序编号在公司薪水记录数中间的，即为中位数

核心程序解读：
(1) row_number() over() 按照公司分组，并按照薪水排序，将该结果保存为新列rn;
(2) count(1) over()应该和count(*) over()效果相同，是用来统计不同公司的样例条数；
(3) where调价，这个程序的灵魂。rn可以理解为一个列表，where的运行过程可以理解为下图：

where abs(rn - (cnt+1)/2) < 1 -- 顺序编号在公司薪水记录数中间的，即为中位数

这里的<1说明rn中有（cnt+1）/2特别靠近的行号(或者理解为索引或位置)存在。举例说明：

假如rn=[1,2,...8],则（cnt+1）/2=4.5，中对数对应的索引为4和5；
若rn=[1,2,...,7],则（cnt+1）/2=4，那么这个中位数对应的索引只能是4.

参考:
569. 员工薪水中位数 - 力扣（LeetCode）:评论+题解

posted @ 2020-04-03 17:26 LgRun 阅读(3577) 评论(0) 收藏举报

刷新页面返回顶部

LgRun