Hive Custom Functions

Hive custom functions (UDF, UDTF, UDAF)

UDF: one in, one out

1. First, create a Maven project and add the hive-exec dependency:

<dependency>
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-exec</artifactId>
       <version>1.2.1</version>
</dependency>

2. When packaging the jar later, you may run into this error:

Could not transfer artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde
Fix: modify the hive-exec dependency in the pom file:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
    <exclusions>
        <!-- Exclude the pentaho-aggdesigner-algorithm dependency so it is not pulled in -->
        <exclusion>
            <groupId>org.pentaho</groupId>
            <artifactId>pentaho-aggdesigner-algorithm</artifactId>
        </exclusion>
    </exclusions>
</dependency>

The deprecated UDF approach (included here for completeness):

package customer.UDF;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDFDemo1 extends UDF {
    /*
    Implement an evaluate method.
    When Hive invokes the custom function, it actually invokes this class's evaluate method.
    The parameters of evaluate are the column values passed in from the SQL statement,
    and the return value of evaluate is the function's return value in the SQL statement.

    This demo only tests custom functions: it prepends the prefix "日期" (date) to the input.
     */
    public String evaluate(String str) {
        return "日期" + str;
    }
}

1. Package the project into a jar and upload it to the server.

2. In Hive, add the jar as a resource to the current Hive session (the path below is my own):

add jar /usr/soft/shellsdir/hive.jar;


3. Register a temporary function backed by the jar. The name after function is the function name (pick your own); the string after as is the fully qualified class name copied from IDEA:

create temporary function sp as 'customer.UDF.MyUDFDemo1';


4. Check that the temporary function now exists:

show functions;


5. Use the function to process data:

select date_time,sp(date_time) from test1;


Example: converting a column to lowercase (for uppercase, just change one method call). The sample data, the table DDL, and a sketch of the UDF class are shown below.

empno,ename,job,mgr,hiredate,sal,comm,deptno
7369,SMITH,CLERK,7902,1980-12-17,800,null,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,null,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850,null,30
7782,CLARK,MANAGER,7839,1981-06-09,2450,null,10
7788,SCOTT,ANALYST,7566,1987-07-13,3000,null,20
7839,KING,PRESIDENT,null,1981-11-17,5000,null,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7876,ADAMS,CLERK,7788,1987-07-13,1100,null,20
7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-03,3000,null,20
7934,MILLER,CLERK,7782,1982-01-23,1300,null,10

create table if not exists emp(empno bigint,ename string,job string,mgr bigint,hiredate date,sal bigint,comm string,deptno bigint) row format delimited fields terminated by ',';
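The lowercase UDF class itself is not shown in the original post; a minimal sketch of what customer.UDF.MyUDFDemo2 might look like (an assumption on my part, written in the same deprecated UDF style as above) is:

package customer.UDF;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDFDemo2 extends UDF {
    // Convert the incoming column value to lowercase;
    // for the uppercase version, change toLowerCase() to toUpperCase().
    public String evaluate(String str) {
        if (str == null) {
            return null;
        }
        return str.toLowerCase();
    }
}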
After writing the new class, rebuilding the jar, and uploading it to the server, stop the running hiveserver2 first and then restart it:
ctrl+c
nohup hiveserver2 &


The remaining steps are the same as above:
add jar /usr/soft/shellsdir/hive.jar;

create temporary function sp2 as 'customer.UDF.MyUDFDemo2';

show functions;

select ename,sp2(ename) from emp;


Creating a permanent function:

1. Upload the jar to HDFS (create a directory to hold the jars first, as noted below):
hadoop fs -put hive1-1.0-SNAPSHOT-jar-with-dependencies.jar /jar/
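If the /jar directory does not exist on HDFS yet, it can be created first (this command is an assumed step; the original post only shows the put):

hadoop fs -mkdir -p /jar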


2. Create the permanent function from the Hive command line:
create function myUp as 'customer.UDF.MyUDFDemo2' using jar 'hdfs:/jar/hive1-1.0-SNAPSHOT-jar-with-dependencies.jar';


3. Exit Hive, reconnect, and run a test:
use test1;
select myUp('ASGUIAD');


4. Drop the permanent function and verify (if calling it afterwards raises an error, the drop succeeded):
drop function myUp;
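For example, re-running the earlier test query should now fail because the function no longer exists:

select myUp('ASGUIAD');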


The current (non-deprecated) way to write a UDF, using GenericUDF:

package customer.UDF;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyGenericUDF extends GenericUDF {
    String output;

    @Override
    public ObjectInspector initialize(ObjectInspector[] d2) throws UDFArgumentException {
        // Use the PrimitiveObjectInspectorFactory factory class to get a String ObjectInspector,
        // which declares the return type of this function to Hive.
        output = "";
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    /*
    evaluate is the core method of a custom UDF and holds its logic.
    It is called after initialize.
    DeferredObject[] contains the function's arguments; their number and types
    are determined by initialize.
     */
    @Override
    public Object evaluate(DeferredObject[] d1) throws HiveException {
        // If no value (or a null value) is passed to the function, o will be null,
        // so guard against a NullPointerException.
        String canshu = null;
        Object o = d1[0].get();
        if (o != null) {
            canshu = o.toString();
        }
        return "ABC" + canshu;
    }

    @Override
    public String getDisplayString(String[] strings) {
        // Text shown for this function in EXPLAIN output.
        return "111ABCDEFG";
    }
}

The remaining steps are the same as above: add the jar, register a temporary function, and test it; a hedged example follows.
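The function name sp3 below is my own choice (the original post does not name it); the jar path and the emp table follow the earlier examples:

add jar /usr/soft/shellsdir/hive1-1.0-SNAPSHOT-jar-with-dependencies.jar;

create temporary function sp3 as 'customer.UDF.MyGenericUDF';

select ename,sp3(ename) from emp;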


UDTF: one in, many out

Let's walk through an example.

We have a single row of data that we want to reformat. Within each #-separated group, the value prefixed with M is the id, the value prefixed with S is the card, and the value with no prefix is the name. The desired output has three columns per row: id, name, card.
M1001#xiaohu#S324231212,sp#M1002#S2543412432,S21312312412#M1003#dyj
package customer.UDTF;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
/*
    To write a UDTF, your class extends GenericUDTF and
    specifies the names and types of the output columns in initialize.
 */

public class MyUDTF extends GenericUDTF {
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // A list holding the output column names
        ArrayList<String> colNames = new ArrayList<>();
        // A list holding the corresponding column types
        ArrayList<ObjectInspector> colTypes = new ArrayList<>();
        // Add the column names and their types
        colNames.add("id");
        colTypes.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        colNames.add("name");
        colTypes.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        colNames.add("card");
        colTypes.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // Return an object that wraps the column names and column types
        return ObjectInspectorFactory.getStandardStructObjectInspector(colNames, colTypes);
    }

    /*
    process is called by Hive for each incoming row to handle the column data, e.g.:

    M1001#xiaohu#S324231212,sp#M1002#S2543412432,S21312312412#M1003#dyj
     */
    @Override
    public void process(Object[] args) throws HiveException {
        // One output row has three columns, so use an array of length 3 to hold them
        String[] rows = new String[3];

        // args[0] is the value of the first input column.
        // String col = (String) args[0];  // this cast can fail with a type-mismatch error, see the note below
        String col = args[0].toString();
        String[] infos = col.split(",");
        // Each comma-separated chunk becomes one output row
        for (String info : infos) {
            String[] strings = info.split("#");
            for (String i : strings) {
                if (i.startsWith("M")) {
                    rows[0] = i.substring(1);
                } else if (i.startsWith("S")) {
                    rows[2] = i.substring(1);
                } else {
                    rows[1] = i;
                }
            }
            // forward(Object o) hands one output row (here packed into an array) back to Hive for further processing
            forward(rows);
        }
    }

    @Override
    public void close() throws HiveException {
        // Release any resources opened in initialize.
        // Nothing to do here; if you had opened e.g. a MySQL connection, this is where to close it.
    }
}
1. The steps are the same as before:
add jar /usr/soft/shellsdir/hive1-1.0-SNAPSHOT-jar-with-dependencies.jar;

create temporary function sp1 as 'customer.UDTF.MyUDTF';

show functions;
2. Call the function directly on the data:
select sp1('M1001#xiaohu#S324231212,sp#M1002#S2543412432,S21312312412#M1003#dyj');
3. Hive reports an error here because of a type mismatch. Go back to the Java code and replace the explicit cast with .toString(), then rebuild the jar and repeat the steps above.


4. After the change, the query runs successfully.


UDAF: many in, one out

Here is an example dataset:

id	datestr	  amount
1,2019-02-08,6214.23 
1,2019-02-08,6247.32 
1,2019-02-09,85.63 
1,2019-02-09,967.36 
1,2019-02-10,85.69 
1,2019-02-12,769.85 
1,2019-02-13,943.86 
1,2019-02-14,538.42
1,2019-02-15,369.76
1,2019-02-16,369.76
1,2019-02-18,795.15
1,2019-02-19,715.65
1,2019-02-21,537.71
2,2019-02-08,6214.23 
2,2019-02-08,6247.32 
2,2019-02-09,85.63 
2,2019-02-09,967.36 
2,2019-02-10,85.69 
2,2019-02-12,769.85 
2,2019-02-13,943.86 
2,2019-02-14,943.18
2,2019-02-15,369.76
2,2019-02-18,795.15
2,2019-02-19,715.65
2,2019-02-21,537.71
3,2019-02-08,6214.23 
3,2019-02-08,6247.32 
3,2019-02-09,85.63 
3,2019-02-09,967.36 
3,2019-02-10,85.69 
3,2019-02-12,769.85 
3,2019-02-13,943.86 
3,2019-02-14,276.81
3,2019-02-15,369.76
3,2019-02-16,369.76
3,2019-02-18,795.15
3,2019-02-19,715.65
3,2019-02-21,537.71
Table DDL:
create table deal_tb(
    id string
    ,datestr string
    ,amount string
)row format delimited fields terminated by ',';
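The original post does not show how the data is loaded; assuming the rows above are saved in a local file (the path below is hypothetical), they could be loaded with:

load data local inpath '/usr/soft/data/deal_tb.txt' into table deal_tb;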
Requirement: for each user, find the length of every streak of consecutive login days and how many days passed since the previous login. One possible approach is sketched below.
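The original post ends before giving a solution. As a hedged sketch of one common approach (it uses Hive's built-in window functions rather than a hand-written UDAF, and is shown only to illustrate the expected result), consecutive days can be grouped with the date-minus-row_number trick:

with dedup as (
    -- one row per user per login day
    select id, datestr
    from deal_tb
    group by id, datestr
),
flagged as (
    -- days that belong to the same consecutive run share the same grp value
    select id, datestr,
           date_sub(datestr, row_number() over (partition by id order by datestr)) as grp
    from dedup
),
streaks as (
    select id, grp,
           min(datestr) as start_date,
           max(datestr) as end_date,
           count(*)     as continuous_days
    from flagged
    group by id, grp
)
select id, start_date, end_date, continuous_days,
       -- days between this streak's first login and the previous streak's last login
       datediff(start_date, coalesce(lag(end_date, 1) over (partition by id order by start_date), start_date)) as days_since_last_login
from streaks;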