Hive (3)

  • Common Hive functions
    • Relational operators
      • Equality comparison: = == <=> (<=> is the null-safe equals: it returns true when both sides are NULL)
      • Inequality comparison: != <>
      • Range comparison: select * from default.students where id between 1500100001 and 1500100010;
      • Null / not-null checks: is null, is not null, nvl(), isnull()
      • Pattern matching: like, rlike, regexp
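A quick sketch of the null-check and pattern-matching operators above, run against the default.students table used throughout these notes. Note that like only understands the % and _ wildcards, while rlike takes a full Java regular expression and regexp is a synonym for rlike:

```sql
select nvl(null, 0);   -- 0: nvl(a,b) returns b when a is null
select isnull(null);   -- true
-- like: % matches any character sequence, _ matches exactly one character
select * from default.students where name like '张%';
-- rlike / regexp: Java regular expressions
select * from default.students where name rlike '^张.*';
select '123456' regexp '^\\d+$';  -- true
```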

 

    • Numeric functions
      • Rounding (half up): round
      • Round up: ceil
      • Round down: floor
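A few sanity checks for the three rounding functions (round also accepts an optional number of decimal places):

```sql
select round(3.456);    -- 3 (rounds half up)
select round(3.456,2);  -- 3.46
select ceil(3.1);       -- 4
select floor(3.9);      -- 3
select floor(-3.1);     -- -4 (floor moves toward negative infinity)
```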
  • Conditional functions
    • if: if(condition, value_if_true, value_if_false)
      • select if(1>0,1,0); 
      • select if(1>0,if(-1>0,-1,1),0);
    • COALESCE
      • select COALESCE(null,'1','2'); // 1: scans left to right and returns the first non-null value
      • select COALESCE('1',null,'2'); // 1
    • case when
      • select  score
                ,case when score>120 then '优秀'
                      when score>100 then '良好'
                      when score>90 then '及格'
                else '不及格'
                end as pingfen
        from default.score limit 20;

        select  name
                ,case name when "施笑槐" then "槐ge"
                          when "吕金鹏" then "鹏ge"
                          when "单乐蕊" then "蕊jie"
                else "算了不叫了"
                end as nickname
        from default.students limit 10;
      • Note: branches are checked top to bottom and the first match wins, so order the conditions from most to least restrictive.
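To see why the order of the conditions matters, here is the same query with the branches reversed (a sketch; the loosest condition now shadows the stricter ones):

```sql
-- wrong order: a score of 130 matches score>90 first and is labelled '及格'
select case when score>90  then '及格'
            when score>100 then '良好'  -- never reached
            when score>120 then '优秀'  -- never reached
       else '不及格' end as pingfen
from default.score limit 20;
```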

 

  • Date functions
    • select from_unixtime(1610611142,'yyyy/MM/dd HH:mm:ss'); // use lowercase yyyy: uppercase YYYY is the week-based year and gives wrong results around New Year
    • select from_unixtime(unix_timestamp(),'yyyy/MM/dd HH:mm:ss');
    • // '2021年01月14日' -> '2021-01-14'
    • select from_unixtime(unix_timestamp('2021年01月14日','yyyy年MM月dd日'),'yyyy-MM-dd');
    • // "04牛2021数加16逼" -> "2021/04/16"
    • select from_unixtime(unix_timestamp("04牛2021数加16逼","MM牛yyyy数加dd逼"),"yyyy/MM/dd");

 

  • String functions
    • concat('123','456'); // 123456
      concat('123','456',null); // NULL

      select concat_ws('#','a','b','c'); // a#b#c
      select concat_ws('#','a','b','c',NULL); // a#b#c (takes an explicit separator and automatically skips NULLs)
      select concat_ws("|",cast(id as string),name,cast(age as string),gender,clazz) from students limit 10;

      select substring("abcdefg",1); // abcdefg (positions in HQL are 1-based)
      // '2021/01/14' -> '2021-01-14'
      select concat_ws("-",substring('2021/01/14',1,4),substring('2021/01/14',6,2),substring('2021/01/14',9,2));

      select split("abcde,fgh",","); // ["abcde","fgh"]
      select split("a,b,c,d,e,f",",")[2]; // c (unlike substring, array indexing is 0-based)

      select explode(split("abcde,fgh",",")); // abcde
                                              // fgh

      // parse JSON-formatted data
      select get_json_object('{"name":"zhangsan","age":18,"score":[{"course_name":"math","score":100},{"course_name":"english","score":60}]}',"$.score[0].score"); // 100
      
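Where several top-level keys are needed at once, Hive's json_tuple UDTF avoids calling get_json_object repeatedly. A sketch (verify the call against your Hive version):

```sql
select json_tuple('{"name":"zhangsan","age":18}','name','age');
-- zhangsan  18
```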

       

  • WordCount in Hive
    • create table words(
          words string
      )row format delimited fields terminated by '|';

      // data
      hello,java,hello,java,scala,python
      hbase,hadoop,hadoop,hdfs,hive,hive
      hbase,hadoop,hadoop,hdfs,hive,hive

      select word,count(*) from (select explode(split(words,',')) word from words) a group by a.word;
      // result: hadoop 4, hbase 2, hdfs 2, hello 2, hive 4, java 2, python 1, scala 1
  •  Hive window functions
    • Like opening a window onto each row of the data, hence the name "window" (开窗) functions.
    • SQL has a class of aggregate functions such as sum(), avg() and max() that collapse many rows into one according to some rule, so the output generally has fewer rows than the input. Sometimes, though, we want to display both the pre-aggregation rows and the aggregated values at the same time, and that is what window functions are for.
    • Test data
      • 111,69,class1,department1
        112,80,class1,department1
        113,74,class1,department1
        114,94,class1,department1
        115,93,class1,department1
        121,74,class2,department1
        122,86,class2,department1
        123,78,class2,department1
        124,70,class2,department1
        211,93,class1,department2
        212,83,class1,department2
        213,94,class1,department2
        214,94,class1,department2
        215,82,class1,department2
        216,74,class1,department2
        221,99,class2,department2
        222,78,class2,department2
        223,74,class2,department2
        224,80,class2,department2
        225,85,class2,department2

         

    • Table DDL
      • create table new_score(
            id  int
            ,score int
            ,clazz string
            ,department string
        ) row format delimited fields terminated by ",";

         

  • row_number: ranking with no ties
    • Usage: select xxxx, row_number() over(partition by <partition column> order by <sort column> desc) as rn from tb
    • dense_rank: tied rows share a rank and the ranks increase consecutively, with no gaps
    • rank: tied rows share a rank, leaving gaps after the ties
    • percent_rank: (rank - 1) / (rows in partition - 1)
    • cume_dist: the cumulative distribution of a value within its window or partition.
      • Assuming ascending order, the cumulative distribution is: (number of rows <= the current value x) / (total rows in the window or partition), where x is the current row's value of the order by column.
    • NTILE(n): splits the rows of a partition into n groups and tags each row with its group number
    • max, min, avg, count, sum: computed over the rows of each partition
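As a quick worked example of the two distribution functions: in a partition of 5 rows, the row at rank 2 gets percent_rank = (2 - 1) / (5 - 1) = 0.25, and if 3 of the 5 rows sort at or before the current row, cume_dist = 3 / 5 = 0.6.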

 

  • Window frame: selects a specified set of records from the partition for the window function to process
    • Hive provides two ways to define a frame: `ROWS` and `RANGE`. Both require an upper and a lower bound. For example, `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` selects every record from the start of the partition through the current row, while `SUM(close) RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING` selects by *value difference* instead: if the current row's `close` is `200`, the frame covers the partition's records whose `close` falls between `100` and `400`. All possible frame definitions are listed below. If no frame is defined, the default is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`.
    • Frames can only be used with max, min, avg, count, sum, FIRST_VALUE and LAST_VALUE
    • (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    • (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    • (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
    • range between 3 PRECEDING and 11 FOLLOWING
    • SELECT id
             ,score
             ,clazz
             ,SUM(score) OVER w as sum_w
             ,round(avg(score) OVER w,3) as avg_w
             ,count(score) OVER w as cnt_w
      FROM new_score
      WINDOW w AS (PARTITION BY clazz ORDER BY score rows between 2 PRECEDING and 2 FOLLOWING);

       

    • select  id
              ,score
              ,clazz
              ,department
              ,row_number() over (partition by clazz order by score desc) as rn_rk
              ,dense_rank() over (partition by clazz order by score desc) as dense_rk
              ,rank() over (partition by clazz order by score desc) as rk
              ,percent_rank() over (partition by clazz order by score desc) as percent_rk
              ,round(cume_dist() over (partition by clazz order by score desc),3) as cume_rk
              ,NTILE(3) over (partition by clazz order by score desc) as ntile_num
              ,max(score) over (partition by clazz order by score desc range between 3 PRECEDING and 11 FOLLOWING) as max_p
      from new_score;

       

  • LAG(col,n): the value of col from the row n rows before the current one
  • LEAD(col,n): the value of col from the row n rows after the current one
  • FIRST_VALUE: after sorting within the partition, the first value up to the current row
  • LAST_VALUE: after sorting within the partition, the last value up to the current row; for tied rows it takes the last one
    • select  id
              ,score
              ,clazz
              ,department
              ,lag(id,2) over (partition by clazz order by score desc) as lag_num
              ,LEAD(id,2) over (partition by clazz order by score desc) as lead_num
              ,FIRST_VALUE(id) over (partition by clazz order by score desc) as first_v_num
              ,LAST_VALUE(id) over (partition by clazz order by score desc) as last_v_num
              ,NTILE(3) over (partition by clazz order by score desc) as ntile_num
      from new_score;

       

  • Hive row-to-column conversion
    • lateral view explode
    • create table testArray2(
          name string,
          weight array<string>
      )row format delimited 
      fields terminated by '\t'
      COLLECTION ITEMS terminated by ',';
    • select name,col1  from testarray2 lateral view explode(weight) t1 as col1;

      select key from (select explode(map('key1',1,'key2',2,'key3',3)) as (key,value)) t;

      select name,col1,col2  from testarray2 lateral view explode(map('key1',1,'key2',2,'key3',3)) t1 as col1,col2;

      select name,pos,col1  from testarray2 lateral view posexplode(weight) t1 as pos,col1;
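With the sample rows implied by the collect_list results further down (e.g. 志凯 with weight ["150","170","180"]), the explode and posexplode queries above expand each array element onto its own row, roughly:

```sql
select name,col1 from testarray2 lateral view explode(weight) t1 as col1;
-- 志凯  150
-- 志凯  170
-- 志凯  180
select name,pos,col1 from testarray2 lateral view posexplode(weight) t1 as pos,col1;
-- posexplode additionally returns each element's 0-based position:
-- 志凯  0  150
-- 志凯  1  170
-- 志凯  2  180
```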

  • Hive column-to-row conversion
    • // testLieToLine
    • name col1
      • create table testLieToLine(
            name string,
            col1 int
        )row format delimited 
        fields terminated by '\t';

        select name,collect_list(col1) from testLieToLine group by name;

        // result
        上单    ["150","180","190"]
        志凯    ["150","170","180"]

        select  t1.name
                ,collect_list(t1.col1) 
        from (
            select  name
                    ,col1 
            from testarray2 
            lateral view explode(weight) t1 as col1
        ) t1 group by t1.name;
        

         

posted @ 2021-09-29 20:32 钟心意