dairui130

导航

UDF函数 解码url

背景

URL 的编码 是ASCII十六进制格式。数仓接受到前端上报的URL,要对URL字段解码。

如要将

https%3A%2F%2Fmywebsite%2Fdocs%2Fenglish%2Fsite%2"Fmybook.do%3Frequest_type%3D%26type%3Dprivate

解码为:

https://mywebsite/docs/english/site/mybook.do?request_type=&type=private

 

方式

hive sql中,通过反射调用java.net.URLDecoder方法

select reflect('java.net.URLDecoder', 'decode',url, 'UTF-8') from table_name;

问题

实践中发现,部分https开头的url解析不完整。例如解码下面的URL,

https%253A%252F%252Fwww.cnblogs.com/drjava

通过上面的方法,结果是:

https%3A%2F%2Fwww.cnblogs.com/drjava

 

解决方案

通过google找到了可行的方法,解码两次,由于sql写起来比较难以阅读,所以封装了udf。代码如下:

 1 import org.apache.hadoop.hive.ql.exec.UDF;
 2 import java.io.UnsupportedEncodingException;
 3 import java.net.URLDecoder;
 4 
 5 public class UrlDecode extends UDF {
 6     /**
 7      * 
 8      *
 9      * @param component 编码的url
10      * @return 解码url
11      */
12     public  String evaluate(String component) {
13         if(component == null || component.length() <= 0){
14             return "";
15         }
16 
17         String result = "";
18         component = component.replaceAll("%(?![0-9a-fA-F]{2})", "%25");
19         try {
20             result = URLDecoder.decode(component, "UTF-8");
21             result = URLDecoder.decode(result, "UTF-8");
22         } catch (UnsupportedEncodingException e) {
23             result = component;
24         }
25         return result;
26     }
27 }

做一下单元测试

 1 import org.junit.Assert;
 2 import org.junit.Test;
 3 import udf.UrlDecode;
 4 
 5 public class UrlDecodeTest {
 6     @Test
 7     public void UrlDecodeMethod(){
 8         UrlDecode ud = new UrlDecode();
 9         Assert.assertEquals(ud.evaluate("https%3A%2F%2Fmywebsite%2Fdocs%2Fenglish%2Fsite%2" +
10                 "Fmybook.do%3Frequest_type%3D%26type%3Dprivate"),
11                 "https://mywebsite/docs/english/site/mybook.do?request_type=&type=private");
12 
13         Assert.assertEquals(ud.evaluate("你是谁,为了谁"),"你是谁,为了谁");
14 
15         Assert.assertEquals(ud.evaluate(null),"");
16     }
17 }

 

 

posted on 2019-04-18 16:34  dairui130  阅读(1167)  评论(0编辑  收藏  举报