Hive UDF: handling special-character encoding issues such as [\x22 and urlencoded data
If your function reads and returns only primitive types (the basic Hadoop & Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), then the simple API (org.apache.hadoop.hive.ql.exec.UDF) is sufficient.
But if you want to write a UDF that operates on nested data structures such as Map, List, and Set, you need to get familiar with the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API.
Simple API: org.apache.hadoop.hive.ql.exec.UDF
Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF
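The demo in this post exercises only the simple API. For the complex API, here is a minimal sketch of what the same decoder could look like as a GenericUDF; the class name GenericDecodeX and the inspector choices are my own illustrative assumptions, not code from the original post:
package udf;

import java.net.URLDecoder;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Illustrative GenericUDF version of the decoder demonstrated later in this post.
public class GenericDecodeX extends GenericUDF {
    private StringObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Unlike the simple API, argument types are validated once, up front.
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("decodeX takes exactly one argument");
        }
        if (!(arguments[0] instanceof StringObjectInspector)) {
            throw new UDFArgumentException("decodeX expects a string argument");
        }
        inputOI = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object arg = arguments[0].get();
        if (arg == null) {
            return null; // GenericUDF must handle nulls itself as well
        }
        String input = inputOI.getPrimitiveJavaObject(arg);
        try {
            // Same trick as the simple-API demo: rewrite \xNN escapes as %NN, then URL-decode.
            return URLDecoder.decode(input.replaceAll("\\\\x", "%"), "utf-8");
        } catch (Exception e) {
            return null; // a malformed escape sequence yields null instead of failing the query
        }
    }

    @Override
    public String getDisplayString(String[] children) {
        return "decodeX(" + children[0] + ")";
    }
}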
Next, I'll walk through building a UDF with an example, providing both the code and a test for it.
Note: the plain UDF API has a pitfall: it does not check for null arguments, and nulls are very common in large datasets, so be rigorous. Accordingly, a null check is added in the code below.
Reference pom.xml:
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.1.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.1</version>
    </dependency>
    <!-- <dependency>-->
    <!--     <groupId>com.aliyun.odps</groupId>-->
    <!--     <artifactId>odps-sdk-udf</artifactId>-->
    <!--     <version>0.29.10-public</version>-->
    <!-- </dependency>-->
</dependencies>
<build>
    <plugins>
        <!-- Declared directly under plugins (not only pluginManagement),
             otherwise the shade goal never actually runs at package time. -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.0.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                        </artifactSet>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <!-- Hive 2.x needs at least Java 7; 1.8 is the safer choice here -->
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>
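With this pom in place, running mvn clean package triggers the shade goal during the package phase and produces a self-contained jar under target/; the exact jar file name depends on your project's artifactId, version, and plugin configuration (the ADD JAR line near the end of this post uses the author's own project name).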
DEMO:
package udf;

// java.net.URLDecoder replaces the original jodd.util.URLDecoder so the pom needs
// no extra dependency; note that java.net.URLDecoder also decodes '+' as a space.
import java.net.URLDecoder;
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.hive.ql.exec.UDF;

public class TestDecodeX extends UDF {

    // Static helper for quick manual checks; prints the decoded string only.
    public static void decodeX(String s) throws UnsupportedEncodingException {
        // Rewrite each "\x" escape as "%" so the string becomes standard URL encoding.
        String s1 = s.replaceAll("\\\\x", "%");
        String decode = URLDecoder.decode(s1, "utf-8");
        System.out.println(decode);
    }

    public String evaluate(String input) throws Exception {
        // The plain UDF API does not check for null arguments, and nulls are very
        // common in large datasets, so guard against them explicitly.
        if (input == null) {
            return null;
        }
        String decode = null;
        try {
            // "\x22" becomes "%22", which URL-decodes to a double quote, and so on.
            String s1 = input.replaceAll("\\\\x", "%");
            decode = URLDecoder.decode(s1, "utf-8");
        } catch (Exception e) {
            // A malformed escape sequence yields null rather than failing the query.
        }
        System.out.println(decode); // debug output for local runs; remove in production
        return decode;
    }

    public static void main(String[] args) throws Exception {
        String s1 = "G977N%7C7.1.2%7Cwifi%7C%7Cgamepubgoogle%7CGetHashed%7Ccom.gamepub.ft2.g%7Candroid%7C%7C%7C1.0.2%7Csamsung%7C1547548%7C1%7CAsia%2FSeoul%7CARM%7C%7C19d1b5cdf01341e99c670f254765148d%22%5D";
        // A raw log line containing \xNN escapes such as \x22 (a double quote).
        String s = "172.31.35.210|21/04/2021:10:59:01|[\\x22TakeSample|0bb9f14b1041a8d9|32550283-4DF6-4CC5-9922-E4F9CFAFD7FD|iPhone13,1|14.2.1|wifi||gamepubappstore|GetHashed|com.gamepub.fr2|ios|BAB3A467-A4D0-4900-80F7-BCB9D53757B1||0.26.87|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|3.63|0|Asia/Seoul|ARM64||\\x22]\n";
        TestDecodeX t = new TestDecodeX();
        t.evaluate(s1);
    }
}
Sample result (each %7C URL-decodes back to |, %22 to ", and so on):
G977N|7.1.2|wifi||gamepubgoogle|GetHashed|com.gamepub.ft2.g|android|||1.0.2|samsung|1547548|1|Asia/Seoul|ARM||19d1b5cdf01341e99c670f254765148d"]
Process finished with exit code 0
In the Hive client:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION decodeX as 'udf.TestDecodeX';
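Once registered, the function can be called like any built-in; for example (the table raw_logs and column line here are hypothetical placeholders, not from the original post):
hive> SELECT decodeX(line) FROM raw_logs LIMIT 10;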
Reference:
Hive UDF开发指南 (Hive UDF Development Guide)