Spark 2.x Administration and Development - Spark SQL - [Spark SQL Examples] (II) Multi-Table Queries

1. Joining three tables in one query

Scala code:

package sqlExamples

import org.apache.spark.sql.SparkSession
import org.apache.log4j.Logger
import org.apache.log4j.Level

//Note: when reading raw data in Spark and Hive, prefer String for every column; this avoids many unnecessary parsing problems
//If a column is needed for arithmetic later, cast it to the required type at that point
/**
 * sno: student number
 * sname: student name
 * ssex: sex
 * sbirthday: date of birth
 * sclass: class
 */
case class Students(sno: String, sname: String, ssex: String, sbirthday: String, sclass: String)
/**
 * cno: course number
 * cname: course name
 * tno: teacher number
 */
case class Course(cno: String, cname: String, tno: String)
/**
 * sno: student number
 * cno: course number
 * degree: score
 */
case class Score(sno: String, cno: String, degree: String)
/**
 * tno: teacher number
 * tname: teacher name
 * tsex: sex
 * tbirthday: date of birth
 * tprof: title
 * tdepart: department
 */
case class Teacher(tno: String, tname: String, tsex: String, tbirthday: String, tprof: String, tdepart: String)
object StudentsAndTeachers {
  def main(args: Array[String]): Unit = {
    //The next two lines set the log level, which cuts down the console output
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    //Spark environment
    val spark = SparkSession.builder().master("local").appName("SparkSQLExample").getOrCreate()

    import spark.implicits._
    //Read the data
    spark.sparkContext.textFile("D:\\tmp_files\\spark_sql_test_data\\Student.csv")
      .map(_.split(",")) //a CSV file is split on commas
      .map(x => Students(x(0), x(1), x(2), x(3), x(4))) //map onto the case class so the DataFrame gets a schema
      .toDF //convert the RDD into a DataFrame
      .createOrReplaceTempView("student") //register a temporary view
    spark.sparkContext.textFile("D:\\tmp_files\\spark_sql_test_data\\Course.csv")
      .map(_.split(","))
      .map(x => Course(x(0), x(1), x(2)))
      .toDF
      .createOrReplaceTempView("course")
    spark.sparkContext.textFile("D:\\tmp_files\\spark_sql_test_data\\Score.csv")
      .map(_.split(","))
      .map(x => Score(x(0), x(1), x(2)))
      .toDF
      .createOrReplaceTempView("Score")
    spark.sparkContext.textFile("D:\\tmp_files\\spark_sql_test_data\\Teacher.csv")
      .map(_.split(","))
      .map(x => Teacher(x(0), x(1), x(2), x(3), x(4), x(5)))
      .toDF
      .createOrReplaceTempView("Teacher")
//*********************Multi-table queries***********************
    //Tips: 1) In a multi-table query, write the select list last, as in I. below
    //      2) When a join is needed, do the join first, then layer the other conditions on top, as in III. below
    //I. Joining three tables - join ... on
    //Query sname, cname and degree for every student
    spark.sql("select s.sname,c.cname,t.degree from score t " +
      "Join student s on t.sno=s.sno " +
      "join course c on c.cno=t.cno").show(false)
    //Stop Spark
    spark.close()
  }
}

Result: (output omitted)
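As an aside, Spark 2.x also ships a built-in CSV data source, so the same views can be registered without going through an RDD. A minimal sketch for the Student file, assuming (as above) the files have no header row; on Spark 2.3+ the schema can be given as a DDL string:

```scala
// Sketch: registering the student view via the built-in CSV reader.
// Assumes no header row; column names mirror the Students case class above.
spark.read
  .schema("sno STRING, sname STRING, ssex STRING, sbirthday STRING, sclass STRING")
  .csv("D:\\tmp_files\\spark_sql_test_data\\Student.csv")
  .createOrReplaceTempView("student")
```

The other three files follow the same pattern; keeping every column as STRING matches the advice at the top of the post.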

2. Multiple subqueries

Scala code:

//II. Multiple subqueries
    //1. Among students who take more than one course, find the Score records whose degree is not the overall top score -- max()
    spark.sql("select * from score where sno in " +
      "(select sno from score t group by t.sno having count(1)>1 ) " +
      "and degree!=(select max(degree) from score)").show()
    //2. Query tname and tdepart of every teacher who has never taught a course -- not in(), in()
    spark.sql("select tname,tdepart from Teacher t where t.tno not in (select tno from Course c where c.cno in (select cno from Score))").show()
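One caveat for query 1: because degree is stored as String (per the note at the top of the post), max(degree) and != compare lexicographically, so "9" outranks "85". Plain Scala shows the same effect:

```scala
// String max is lexicographic; numeric max needs a conversion first.
val degrees = List("9", "85", "100")
println(degrees.max)               // "9"  -- String ordering
println(degrees.map(_.toInt).max)  // 100  -- numeric ordering
```

In the SQL, writing max(cast(degree as int)) restores the numeric comparison.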

Result: (output omitted)
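The nested not in / in chain in query 2 can also be phrased as an anti-join, which Spark SQL supports directly. A sketch of the same logic against the temp views above, offered as an equivalent formulation rather than a replacement:

```scala
// Teachers with no match among the (course, score) pairs are kept by LEFT ANTI JOIN.
spark.sql(
  """select t.tname, t.tdepart
    |from Teacher t
    |left anti join (
    |  select distinct c.tno
    |  from Course c join Score s on c.cno = s.cno
    |) taught on t.tno = taught.tno""".stripMargin).show()
```

An anti-join also behaves more predictably than not in when the subquery can return NULLs.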

3. Joining a derived intermediate table

Scala code:

//III. Joining a derived intermediate table
    //Query the names of teachers whose course is taken by more than 5 students
    spark.sql("select tname from Teacher t " +
      "join Course c on c.tno=t.tno " +
      "join (select cno from score group by cno having count(1)>5) m on c.cno=m.cno").show()
    //Note: count(1) could equally be written count(cno)

Result: (output omitted)
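The same derived-table join can also be written with the DataFrame API, which makes the intermediate table an explicit value. A sketch against the temp views above (spark.table and the aggregate calls are standard Spark 2.x API; untested here):

```scala
import org.apache.spark.sql.functions.count

// Courses with more than 5 scored students...
val popular = spark.table("Score")
  .groupBy("cno")
  .agg(count("cno").as("cnt"))
  .where("cnt > 5")

// ...joined back through Course to reach the teacher names.
spark.table("Course")
  .join(popular, "cno")            // keep only the popular courses
  .join(spark.table("Teacher"), "tno")
  .select("tname")
  .show()
```

Naming the aggregated DataFrame (popular) documents what the subquery in the SQL version computes.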
