Spark series - Post category - bioamin

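Spark: specifying a parameter configuration file

Summary: Normally a client targets a single cluster, but sometimes one client has to target several clusters, and in that case the configuration files must be switched dynamically. //The configuration file can be read from a passed-in argument or from a database package com.cslc import org.apache.hadoop.conf.Configuration import org.ap…

posted @ 2021-01-11 20:39 bioamin Views(1234) Comments(0) Recommendations(0)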

Spark accumulators

Summary: Accumulator schematic diagram: Creating an accumulator: sc.longAccumulator("") sc.longAccumulator sc.collectionAccumulator() sc.collectionAccumulator sc.doubleAccumulator() sc.doubleAccumulat…

posted @ 2021-01-11 18:58 bioamin Views(353) Comments(0) Recommendations(0)

Spark mapPartitionsWithIndex && repartition && coalesce

Summary: mapPartitionsWithIndex is a transformation operator; each call receives the data of one partition together with that partition's index. spark.sparkContext.setLogLevel("error") val kzc=spark.sparkContext.parallelize(List(("hi…

posted @ 2021-01-11 15:11 bioamin Views(254) Comments(0) Recommendations(0)

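Spark action operators

Summary: An action operator triggers Spark to run the computation and marks job boundaries: each action operator corresponds to one job. Operators that involve a shuffle (data from one partition going to several partitions), such as reduceByKey, mark stage boundaries. The action operators are as follows: 1. count() returns the number of elements in the dataset. Once the result has been computed it is collected back to the Drive…

posted @ 2021-01-04 19:14 bioamin Views(449) Comments(0) Recommendations(0)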

scala Array

Summary: Immutable and mutable arrays in Scala package com.cslc.day2 object ArrayApp extends App { /* * Immutable array * */ //Initialize with new and then assign val a:Array[String]=new Array[String](5) println(a…

posted @ 2021-01-03 16:11 bioamin Views(73) Comments(0) Recommendations(0)

spark map

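Summary: map is a transformation operator. As the IDEA hint shows, map takes a function as its argument; the function's input depends on the data (here it is a string), and its output can be of many data types. map, string to list: data.map(fun1).foreach(println) def fun1(x:String):List[Stri…

posted @ 2020-12-31 14:25 bioamin Views(265) Comments(0) Recommendations(0)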

spark filter

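Summary: filter is a transformation operator: it filters records by a condition, keeping a record when the function returns true and dropping it when the function returns false. As the IDEA hint shows, the input depends on the data; here it is a tuple (String, Int), and the output is a Boolean. Requirement: output only the tuples whose first element contains "Caused". Approach 1: v…

posted @ 2020-12-31 13:50 bioamin Views(563) Comments(0) Recommendations(0)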
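The filter requirement described in the entry above can be sketched roughly as follows. This is only an illustration, not the post's own code (the excerpt is cut off before it): the object name FilterSketch, the helper containsCaused, the local[*] master and the sample records are all assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object FilterSketch {
  // Predicate with the shape the summary describes: (String, Int) => Boolean.
  // true keeps the record, false drops it.
  def containsCaused(record: (String, Int)): Boolean =
    record._1.contains("Caused")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .appName("filter-sketch")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Hypothetical sample data standing in for the post's input.
    val records = spark.sparkContext.parallelize(Seq(
      ("Caused by: java.io.IOException", 1),
      ("INFO job finished", 2),
      ("Caused by: java.lang.NullPointerException", 3)
    ))

    // filter is a transformation, so nothing runs until the collect() action.
    val kept = records.filter(containsCaused)
    kept.collect().foreach(println)

    spark.stop()
  }
}
```

Keeping the predicate as a named method makes it easy to check on its own, without starting Spark.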

spark foreach

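Summary: foreach is an action operator and does not trigger a shuffle. After the data is read, the IDEA hint shows that foreach takes a function whose input depends on the data (here a String) and which returns nothing. Requirement: read the data and pass foreach a function that prepends the string head to each element when printing it.

posted @ 2020-12-31 11:43 bioamin Views(1509) Comments(0) Recommendations(0)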
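A minimal sketch of the foreach requirement from the entry above, under stated assumptions: the post reads its data from a source that the excerpt does not show, so the example substitutes a small in-memory RDD. The names ForeachSketch and withHead are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ForeachSketch {
  // Prepends the string "head" to one element (hypothetical helper).
  def withHead(line: String): String = "head" + line

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .appName("foreach-sketch")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Stand-in for the data the post reads; the real source is not shown.
    val lines = spark.sparkContext.parallelize(Seq("spark", "scala", "kafka"))

    // foreach is an action: the function takes a String and returns Unit.
    // The printing happens on the executors, so output order is not guaranteed.
    lines.foreach(line => println(withHead(line)))

    spark.stop()
  }
}
```

In local mode the lines appear on the console; on a cluster they would end up in the executor logs rather than on the driver.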

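Spark study notes: converting an RDD to a DataFrame

Summary: 1. The plain way: for example rdd.map(para => (para(0).trim(), para(1).trim().toInt)).toDF("name","age") #the implicit conversions must be imported import spark.implicits._ // implicit conversions val df1=data.map(x=>x.split…

posted @ 2020-08-06 16:09 bioamin Views(426) Comments(0) Recommendations(0)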

Spark study notes: show()

Summary: def show(numRows: Int): Unit = show(numRows, truncate = true) /** * Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 charac…

posted @ 2020-08-04 14:12 bioamin Views(3217) Comments(0) Recommendations(0)

Spark study notes: the sample operator

Summary: def sample( withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T] = { require(fraction >= 0, s"Fraction must be nonn…

posted @ 2020-08-04 13:28 bioamin Views(1200) Comments(0) Recommendations(0)

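Spark study notes: registering a DataFrame as a table

Summary: A DataFrame can be registered as a table. If the table is created with createTempView, it is valid only within the current session; if it is created with createGlobalTempView, it is valid across sessions, but SQL statements that access it must add the global_temp prefix. Converting a DataFrame to a temporary…

posted @ 2020-08-04 11:30 bioamin Views(1267) Comments(0) Recommendations(0)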
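The difference between the two kinds of view described in the entry above might look like this in practice. It is a sketch under assumptions, not the post's code: the object name TempViewSketch, the helper queryGlobalFromNewSession, the view names people and people_global, and the sample rows are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Registers both kinds of view, then reads the global one from a second session.
  def queryGlobalFromNewSession(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // needed for toDF and for as[String]

    // Hypothetical sample data.
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // Session-scoped view: visible only through this SparkSession.
    df.createOrReplaceTempView("people")
    // Application-scoped view: visible from other sessions via global_temp.
    df.createOrReplaceGlobalTempView("people_global")

    // Same session: the plain name resolves.
    spark.sql("SELECT name FROM people").show()

    // New session: only the global view is reachable, and it needs the prefix.
    // Querying plain "people" here would be expected to fail analysis.
    val other = spark.newSession()
    other.sql("SELECT name FROM global_temp.people_global ORDER BY name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .appName("temp-view-sketch")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    queryGlobalFromNewSession(spark).foreach(println)
    spark.stop()
  }
}
```

The sketch uses the createOrReplace variants so it can be rerun in the same application; createTempView and createGlobalTempView, which the summary names, behave the same way except that they throw if the view already exists.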

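Analysis of abnormal startup in ZooKeeper-based Spark high availability

Summary: 1. Background: Spark runs in standalone mode, with the master deployed for high availability on top of ZooKeeper. ZooKeeper keeps track of the current active master and of the workers that belong to it. The workers communicate with the active master, and Spark's startup script SPARK_HOME/sbin…

posted @ 2020-03-10 11:45 bioamin Views(557) Comments(0) Recommendations(0)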

Spark 2.1 consuming Kafka 0.8 data: Receiver && Direct

Summary: Official example: http://spark.apache.org/docs/2.1.1/streaming-kafka-0-8-integration.html pom.xml dependency <dependency> <groupId>org.apache.spark</groupId> <artifactId>s…

posted @ 2019-12-13 15:31 bioamin Views(366) Comments(0) Recommendations(0)

Spark 2.3 consuming Kafka 0.10 data

Summary: Official documentation: http://spark.apache.org/docs/2.3.0/streaming-kafka-0-10-integration.html#creating-a-direct-stream Example pom.xml dependency <dependency> <groupId>org.apache.sp…

posted @ 2019-12-13 13:57 bioamin Views(741) Comments(0) Recommendations(0)

Spark study day 02: reading a file in Scala, word count

Summary: 1. Install the JDK and Scala environments locally 2. Read a local file: scala> import scala.io.Source import scala.io.Source scala> val lines=Source.fromFile("F:/ziyuan_badou/file.txt").getLi…

posted @ 2019-06-08 23:30 bioamin Views(1396) Comments(0) Recommendations(0)

Spark study day 01: word count demo

Summary: Dependencies: <properties> <scala.version>2.11.12</scala.version> <spark.version>2.3.0</spark.version> </properties> <dependencies> <dependency> <groupId>org.sc…

posted @ 2019-05-31 17:18 bioamin Views(436) Comments(0) Recommendations(0)

bioamin

Pursuing the dream of starting a business
