Scala and Spark Configuration

1. Configuring Scala on Linux

Upload scala-2.11.5.tgz to the /opt directory and extract it to /usr/local: tar -zxvf /opt/scala-2.11.5.tgz -C /usr/local/

1.cd /usr/local/scala-2.11.5
2.vi /etc/profile
3.Append the following at the end of the file:
export SCALA_HOME=/usr/local/scala-2.11.5
export PATH=$PATH:$SCALA_HOME/bin
4.Save the file, then run source /etc/profile
5.Type scala to enter the Scala REPL prompt; type :q to quit. A quick check is shown below.
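As a quick check that the installation works (a minimal sketch; the exact output depends on your Scala build), evaluate a couple of expressions at the prompt:
println(util.Properties.versionString)  // prints something like "version 2.11.5"
1 + 1                                   // the REPL echoes the result as resN: Int = 2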

2. Spark Configuration

2.1 Configuring Spark

Configuring Spark on a Hadoop cluster
1.Upload spark-2.4.0-bin-hadoop2.6.tgz to the /opt directory and extract it to /usr/local:
tar -zxf /opt/spark-2.4.0-bin-hadoop2.6.tgz -C /usr/local/

2.Go to /usr/local/spark-2.4.0-bin-hadoop2.6/conf
Copy slaves.template: cp slaves.template slaves
Edit slaves: delete the existing localhost entry, then add:
slave1
slave2
slave3

3.Edit spark-defaults.conf
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf
Add:
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://master:8020/spark-logs
spark.history.fs.logDirectory     hdfs://master:8020/spark-logs

4.Edit spark-env.sh
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add:
JAVA_HOME=/usr/java/jdk1.8.0_151
HADOOP_CONF_DIR=/usr/local/hadoop-2.6.5/etc/hadoop
SPARK_MASTER_IP=master
SPARK_MASTER_PORT=7077
SPARK_WORKER_MEMORY=512m
SPARK_WORKER_CORES=1
SPARK_EXECUTOR_MEMORY=512m
SPARK_EXECUTOR_CORES=1
SPARK_WORKER_INSTANCES=1

5.Start the Hadoop cluster, then create the log directory in HDFS:
hdfs dfs -mkdir /spark-logs

6.Distribute the Spark installation to the other nodes:
 scp -r /usr/local/spark-2.4.0-bin-hadoop2.6/ slave1:/usr/local/
 scp -r /usr/local/spark-2.4.0-bin-hadoop2.6/ slave2:/usr/local/
 scp -r /usr/local/spark-2.4.0-bin-hadoop2.6/ slave3:/usr/local/

7.Configure the Spark environment variables on all nodes
vi /etc/profile
Append at the end of the file:
export SPARK_HOME=/usr/local/spark-2.4.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin

Run source /etc/profile to make the changes take effect.

8.Start Spark
Go to /usr/local/spark-2.4.0-bin-hadoop2.6/sbin and run:
./start-all.sh
9.Check the Spark master web UI:
http://master:8080

10.Test a Spark program
Run spark-shell to enter the Spark command line.
Upload a data file to HDFS and run a word count on it:
sc.textFile("/test/bb.txt").flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey(_+_) // count each word
res0.collect()  // show the result
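To make the output easier to read, the counts can also be sorted before collecting; a minimal sketch that reuses the same file path as above:
val counts = sc.textFile("/test/bb.txt").flatMap(x => x.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
counts.sortBy(x => x._2, ascending = false).take(10).foreach(println)  // print the 10 most frequent words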

2.2 Configuring the Spark development environment (Java/IDEA)

Create a new Java project -> create a package -> create a singleton (object) class.
(1) Code for cluster-submit mode

Code that connects to the cluster:
package spark
import org.apache.spark.sql.SparkSession
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark=SparkSession.builder().appName("wordcount").getOrCreate() // no master is set here; the run mode is decided at spark-submit time
    val sc=spark.sparkContext
    val file=args(0)     // input path
    val splitter=args(1) // word delimiter
    val output=args(2)   // output path
    val count=sc.textFile(file)
      .flatMap(x=>x.split(splitter).map(x=>(x,1))).reduceByKey((x,y)=>x+y) // word count
    count.coalesce(1,false).saveAsTextFile(output) // merge into a single output file and save
  }
}
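For quick testing inside the IDE, a local master can be set explicitly when building the session. This is a sketch for local debugging only: remove the .master(...) call before packaging for cluster submission, because a hard-coded local master combined with --deploy-mode cluster typically fails with the exit code 13 problem described in the troubleshooting section at the end.

val spark = SparkSession.builder()
  .appName("wordcount")
  .master("local[*]")  // local debugging only; drop this line when submitting to the cluster
  .getOrCreate()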

(2) Configure the Spark development dependencies

  • Create a Scala project. From the menu bar choose "File" -> "Project Structure", select "Libraries", and click the "+" button.

  • Choose the "Java" option, then in the dialog select the "jars" folder under the (pre-extracted) Spark installation directory (note: delete commons-compiler-3.0.9.jar from that folder first), and click "OK".

  • Package the program into a JAR in IDEA. Choose "File" -> "Project Structure", select "Artifacts" in the dialog, then choose "+" -> "JAR" -> "Empty".

  • In the dialog, change "Name" to a custom JAR name such as "word". In the right-hand pane, double-click "'workspace' compile output" under the project (select only the compile output folder, not the Scala or jars libraries); it moves to the left-hand pane. Here "workspace" stands for the project name.

  • From the menu bar choose "Build" -> "Build Artifacts", and in the pop-up select "word" -> "Build".

  • After the artifact is built, an /out directory appears in the project, containing the generated JAR. Right-click the JAR and choose "Show in Explorer" to go straight to its location.

  • Upload the JAR to the /opt/ directory and make sure the data file is ready in HDFS.

  • Cluster submit command: spark-submit --master yarn --deploy-mode cluster --class spark.WordCount /opt/word.jar /sparkdata/words.txt " " /sparkdata/wd1 (the output can then be checked as shown below).
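Once the job has finished, the result can be checked from spark-shell by reading back the output path used in the submit command above:
sc.textFile("/sparkdata/wd1").take(10).foreach(println)  // print the first few (word, count) records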

3. Spark SQL

3.1 Configuration

Configuring Spark SQL
1.Go to the bin directory of the Hive installation and edit the hive script:
vi hive
Change sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`
to:
sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`

2.Copy hive-site.xml to /usr/local/spark-2.4.0-bin-hadoop2.6/conf on all nodes:
cp /usr/local/apache-hive-1.2.1-bin/conf/hive-site.xml /usr/local/spark-2.4.0-bin-hadoop2.6/conf/
scp /usr/local/apache-hive-1.2.1-bin/conf/hive-site.xml slave1:/usr/local/spark-2.4.0-bin-hadoop2.6/conf/
scp /usr/local/apache-hive-1.2.1-bin/conf/hive-site.xml slave2:/usr/local/spark-2.4.0-bin-hadoop2.6/conf/
scp /usr/local/apache-hive-1.2.1-bin/conf/hive-site.xml slave3:/usr/local/spark-2.4.0-bin-hadoop2.6/conf/

3.Copy the MySQL JDBC driver to /usr/local/spark-2.4.0-bin-hadoop2.6/jars on all nodes:
cp /usr/local/apache-hive-1.2.1-bin/lib/mysql-connector-java-5.1.32-bin.jar /usr/local/spark-2.4.0-bin-hadoop2.6/jars/
scp /usr/local/spark-2.4.0-bin-hadoop2.6/jars/mysql-connector-java-5.1.32-bin.jar slave1:/usr/local/spark-2.4.0-bin-hadoop2.6/jars/
scp /usr/local/spark-2.4.0-bin-hadoop2.6/jars/mysql-connector-java-5.1.32-bin.jar slave2:/usr/local/spark-2.4.0-bin-hadoop2.6/jars/
scp /usr/local/spark-2.4.0-bin-hadoop2.6/jars/mysql-connector-java-5.1.32-bin.jar slave3:/usr/local/spark-2.4.0-bin-hadoop2.6/jars/

4.On all nodes, configure the MySQL driver in /usr/local/spark-2.4.0-bin-hadoop2.6/conf/spark-env.sh:
SPARK_CLASSPATH=/usr/local/spark-2.4.0-bin-hadoop2.6/jars/mysql-connector-java-5.1.32-bin.jar


5.Start the MySQL service:
service mysqld start

6.Start the Hive metastore service:
hive --service metastore &

7.Adjust the log level on each node:
cp /usr/local/spark-2.4.0-bin-hadoop2.6/conf/log4j.properties.template /usr/local/spark-2.4.0-bin-hadoop2.6/conf/log4j.properties
In log4j.properties, set:
log4j.rootCategory=WARN, console

8.Start the Spark cluster.
9.Run spark-sql to enter the Spark SQL command line; you can then enter SQL (HiveQL) statements. A quick check from spark-shell is shown below.
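The Hive integration can also be verified from spark-shell; a minimal sketch that lists the databases and tables registered in the Hive metastore:
spark.sql("show databases").show()
spark.sql("show tables").show()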

3.2 Basic syntax

When starting Spark on the virtual machines, be sure to start Hadoop, MySQL, and Hive first, then Spark, and only then run spark-shell to enter the Spark shell. Setting this up on a cloud server would still be worthwhile; otherwise all of these services have to be started every time the machine boots, which is genuinely tedious.

  • Creating a DataFrame
Load data from a Hive table into a DataFrame:
val df_hive=spark.read.table("taitan.people")   // the Hive table must already exist; this returns a DataFrame
val df_hive2=spark.sql("select * from taitan.people") // query with Spark SQL, which is faster than running the same SQL in Hive because Hive executes on MapReduce
df_hive2.show()

// Load a DataFrame from an external file
Read a CSV file:
val df_csv=spark.read.option("header","true").option("sep",";").csv("/user/root/sparksqldata/people.csv")
// "header" treats the first row as column names; "sep" sets the field delimiter

  • Importing and exporting DataFrame data
1.Export DataFrame data to an external file
df_data.repartition(1).write.mode("overwrite").option("header","true").csv("/sparkdata/df_person")
// repartition(1) pulls the data into a single partition so it is written as one output file

2.Export DataFrame data to a Hive table
df_data.write.mode("overwrite").saveAsTable("taitan.score") 

  • Creating a Dataset
1.Define a case class: case class Goods(ID:String,Goods:String)

2.Create the Dataset
val df_goods=spark.read.table("taitan.goods")
val df_order=spark.read.option("header","true").option("sep",",").csv("/sparkdata/GoodsOrder.csv")
val ds_goods=df_order.as[Goods] // convert the DataFrame to a Dataset
val ds_people=sc.textFile("/sparkdata/person.txt").map(x=>x.split(",")).map(x=>People(x(0).toInt,x(1),x(2).toInt)).toDS()
// create a Dataset from an external file; People is a case class defined beforehand (with an Int, a String, and an Int field)
3.Save the Dataset
val shop_count=ds_goods.groupBy("Goods").agg(count(lit(1)) as "shopcount") // count rows per product; the aggregation result is a DataFrame
// save the DataFrame to a Hive table
shop_count.write.mode("overwrite").saveAsTable("taitan.shopcount")

4. Spark MLlib example

An ALS movie-recommendation model

4.1 Preprocessing the movie-recommendation data

import org.apache.spark.sql.SparkSession

object DataProcess {  // a singleton object class
  case class Rating(uid:Int,mid:Int,rating:Double) // the case class must be defined outside the main method
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("als model data").enableHiveSupport().getOrCreate()

    val sc = spark.sparkContext
    // read the externally supplied arguments
    val input = args(0)       // input path
    val splitter = args(1)    // text delimiter
    val trainTable = args(2)  // name of the Hive table for the training data
    val testTable = args(3)   // name of the Hive table for the test data

    import spark.implicits._
    val data = sc.textFile(input).map(x=>x.split(splitter)).map(x=>Rating(x(0).trim.toInt,x(1).trim.toInt,x(2).trim.toDouble)).toDF()

    val Array(train,test) = data.randomSplit(Array(0.8,0.2))
    train.write.mode("overwrite").saveAsTable(trainTable)
    test.write.mode("overwrite").saveAsTable(testTable)
  }
}

4.2 Building the ALS model

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

object ALSModelCreate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val sc = spark.sparkContext
    // read the arguments
    val inputTable = args(0)   // training data table, stored in Hive
    val rank = args(1).toInt   // rank of the latent factor matrices
    val reg = args(2).toDouble // regularization parameter
    val uidCol = args(3)       // user column
    val midCol = args(4)       // movie column
    val ratingCol = args(5)    // rating column
    val modelPath = args(6)    // path where the model will be saved

    val data = spark.read.table(inputTable)
    // define the ALS model
    val als = new ALS()
      .setMaxIter(5)
      .setRegParam(reg)
      .setRank(rank)
      .setUserCol(uidCol)
      .setItemCol(midCol)
      .setRatingCol(ratingCol)
    // train the model
    val model = als.fit(data)
    // save the model
    model.save(modelPath)
  }
}

4.3 ALS prediction and evaluation

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.ml.evaluation.RegressionEvaluator

object ALSModelEvaluator {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val sc = spark.sparkContext

    val modelPath = args(0)  // model path
    val testTable = args(1)  // test data table
    val testResult = args(2) // path where the evaluation result is saved

    val model = ALSModel.load(modelPath)        // load the model
    val testdata = spark.read.table(testTable)  // read the test data

    val predictions = model.transform(testdata) // run the predictions
    val evaluator = new RegressionEvaluator()   // evaluate the predictions with RMSE
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
    val rmse = evaluator.evaluate(predictions)

    sc.parallelize(List(rmse)).repartition(1).saveAsTextFile(testResult) // save the result to HDFS
  }
}

4.4 Generating recommendations with the ALS model

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALSModel

object ALSModelPredict {
  case class users(uid:Int)
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    val sc = spark.sparkContext
    val modelPath = args(0)                  // model path
    val recommendTable = args(1)             // Hive table for the recommendation results
    val recommendNum = args(2).toInt         // number of recommendations per user
    val userList = args(3).split(",").toList // users to generate recommendations for

    val model = ALSModel.load(modelPath)     // load the model

    import spark.implicits._
    val user_df = sc.parallelize(userList).map(x=>users(x.toInt)).toDF()
    val userSubsetRecs = model.recommendForUserSubset(user_df, recommendNum)
    userSubsetRecs.write.mode("overwrite").saveAsTable(recommendTable)
  }
}

4.5 Submitting the model jobs to the cluster

First package the classes into alsModel.jar, upload it to the /opt/ directory, and then run the cluster submit command:
spark-submit --..... --class <the corresponding class name> <the path arguments required by each class>
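After the ALSModelPredict job has run, the saved recommendations can be inspected from spark-shell. A minimal sketch, assuming the job was submitted with a hypothetical recommendTable value of "taitan.user_recs":
val recs = spark.read.table("taitan.user_recs")  // "taitan.user_recs" is an assumed table name
recs.show(false)  // one row per uid, with an array of (mid, predicted rating) structs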

Problems encountered

(1) Exit code 13 when submitting to the Spark cluster
Online posts attribute this to configuring local mode in the program, but the code val spark=SparkSession.builder().appName("wordcount").getOrCreate() was already correct. Submitting to the Spark cluster again then reported a class-not-found error, so the class was rebuilt: recreate the project -> create the package -> create the object class, then reconfigure the build artifact (the JAR) and submit to the cluster once more, which succeeded.
