Spark SQL consecutive-login example
Sample data (data/week3/test02.txt); each record is: date, IP address, user name, user ID, log level.

2021/05/18 10.3.9.18 tianqi 1006 error
2021/05/18 10.7.5.56 lisi 1002 warning
2021/05/18 194.23.67.23 wangwu 1003 normal
2021/05/18 111.202.103.60 zhaoliu 1004 normal
2021/05/18 10.3.9.18 zhangsan 1005 warning
2021/05/18 10.3.9.18 lisi 1002 normal
2021/05/18 220.181.90.52 John 1001 normal
2021/05/18 10.7.5.56 lisi 1002 normal
2021/05/19 49.51.8.209 wangwu 1003 error
2021/05/19 66.102.251.33 zhaoliu 1004 normal
2021/05/19 10.3.9.18 zhangsan 1005 normal
2021/05/19 111.202.103.60 tianqi 1006 error
2021/05/19 140.205.94.189 lisi 1002 normal
2021/05/19 69.167.138.34 lisi 1002 normal
2021/05/19 194.23.67.23 tianqi 1006 normal
2021/05/20 111.202.103.60 zhaoliu 1004 normal
2021/05/20 10.3.9.18 tianqi 1006 warning
2021/05/20 66.102.251.33 tianqi 1006 normal
2021/05/20 69.167.138.34 John 1001 normal
2021/05/21 220.181.90.52 lisi 1002 warning
2021/05/21 111.202.103.60 wangwu 1003 normal
2021/05/21 10.7.5.56 zhaoliu 1004 normal
2021/05/21 49.51.8.209 zhangsan 1005 error
2021/05/21 140.205.94.189 lisi 1002 normal
2021/05/21 104.160.171.85 John 1001 normal
2021/05/22 10.3.9.18 lisi 1002 warning
2021/05/22 111.202.103.60 wangwu 1003 normal
2021/05/22 220.181.90.52 zhaoliu 1004 error
2021/05/22 104.160.171.85 zhangsan 1005 normal
2021/05/22 194.23.67.23 tianqi 1006 normal
Tasks:
1) Create a SparkSession.
2) Get the SparkContext.
3) Read the file above into an RDD.
4) Format the data into tuples.
5) Assign suitable column names.
6) Register a temporary view.
7) Query on which days lisi logged in.
8) Find users who logged in on three consecutive days.
9) Use a UDF to convert the time to yyyyMMdd format.
package week3

import org.apache.spark.sql.SparkSession

object SparkTest02 extends App {
  // 1) Create the SparkSession.
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("test02")
    .getOrCreate()

  // 2) Get the SparkContext.
  val sc = spark.sparkContext

  // 3) Read the file above into an RDD (a local file).
  val lineRDD = sc.textFile("data/week3/test02.txt")

  // 4) Format each line into a tuple.
  val mapRDD = lineRDD.map(s => {
    val ss = s.split("\\s+")
    // return the tuple
    (ss(0), ss(1), ss(2), ss(3), ss(4))
  })

  // 5) Assign suitable column names.
  val logsDF = spark.createDataFrame(mapRDD).toDF("time", "ip", "name", "id", "log")

  // 6) Register a temporary view.
  logsDF.createTempView("logs")

  // 7) On which days did lisi log in?
  spark.sql("select distinct l.time from logs l where l.name='lisi'").show()

  // 8) Find users who logged in on three consecutive days.
  // Get the date and user name, deduplicated, to drop repeated records of the same user on the same day.
  // spark.sql("select replace(l.time,'/','-'),l.name from logs l ").show()
  /*
  spark.sql("select distinct replace(l.time,'/','-') day, l.name from logs l").createTempView("t1")

  // Rank the dates within each user.
  spark.sql("select t1.*, rank() over(partition by t1.name order by t1.day) num from t1").createTempView("t2")

  // Subtract the rank from the date.
  spark.sql("select t.*, date_sub(t.day, t.num) cha from t2 t").createTempView("t3")

  // Group by name and the difference.
  spark.sql("select t.name, t.cha, count(*) cnt from t3 t group by t.name, t.cha having cnt >= 3").show()
  */

  // The same logic as a single SQL statement.
  val sql =
    """
      |select
      |  t3.name,
      |  t3.cha,
      |  count(*) cnt
      |from
      |  (
      |    select
      |      t2.*,
      |      date_sub(t2.day, t2.num) cha
      |    from
      |      (
      |        select
      |          t1.*,
      |          rank() over (partition by t1.name order by t1.day) num
      |        from
      |          (
      |            select
      |              distinct replace(l.time,'/','-') day,
      |              l.name
      |            from logs l
      |          ) t1
      |      ) t2
      |  ) t3
      |group by
      |  t3.name,
      |  t3.cha
      |having
      |  cnt >= 3
      |""".stripMargin

  spark.sql(sql).show()

  // 9) UDF that converts the time to yyyyMMdd format.
  spark.udf.register("MyTime", (str: String) => {
    // simply replace "/" with the empty string
    str.replaceAll("/", "")
  })

  spark.sql("select l.*, MyTime(l.time) from logs l").show()

  spark.stop()
}
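To see why the date-minus-rank trick in step 8 finds consecutive days, take tianqi's distinct login days from the sample data: days that are consecutive keep the same value after subtracting their rank, so they collapse into one group.

day          num   date_sub(day, num) = cha
2021-05-18    1    2021-05-17
2021-05-19    2    2021-05-17
2021-05-20    3    2021-05-17
2021-05-22    4    2021-05-18

The group 2021-05-17 has a count of 3, so tianqi is reported. By contrast, lisi's days (05-18, 05-19, 05-21, 05-22) split into two groups of 2 and are filtered out by having cnt >= 3.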
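For comparison (this is not from the original solution), below is a minimal sketch of the same consecutive-days check written with the DataFrame/window API instead of SQL strings. It assumes the spark session and the logsDF DataFrame defined in the code above are in scope.

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // one row per (user, day), with the date normalized to yyyy-MM-dd
  val days = logsDF
    .select(regexp_replace(col("time"), "/", "-").as("day"), col("name"))
    .distinct()

  // number each user's days in order; since (name, day) is deduplicated,
  // row_number() behaves the same as the rank() used in the SQL above
  val w = Window.partitionBy("name").orderBy("day")

  val consecutive = days
    .withColumn("num", row_number().over(w))
    .withColumn("cha", expr("date_sub(day, num)")) // same date-minus-rank trick
    .groupBy("name", "cha")
    .count()
    .filter(col("count") >= 3)

  consecutive.show()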
