Post category - Spark

Summary: In addition to the Resilient Distributed Dataset (RDD) interface, the second kind of low level API in Spark is two types of “distributed shared variab Read more
posted @ 2019-03-04 10:36 DataNerd Views (343) Comments (0) Recommendations (0)
Summary: This chapter covers the advanced RDD operations and focuses on key–value RDDs, a powerful abstraction for manipulating data. We also touch on some mor Read more
posted @ 2019-03-04 10:03 DataNerd Views (355) Comments (0) Recommendations (0)
Summary: What Are the Low Level APIs? There are two sets of low level APIs: there is one for manipulating distributed data (RDDs), and another for distributing Read more
posted @ 2019-02-28 11:24 DataNerd Views (160) Comments (0) Recommendations (0)
Summary: Datasets are a strictly Java Virtual Machine (JVM) language feature that work only with Scala and Java. Using Datasets, you can define the object that Read more
posted @ 2019-02-23 14:51 DataNerd Views (382) Comments (0) Recommendations (0)
Summary: What Is SQL? Big Data and SQL: Apache Hive Big Data and SQL: Spark SQL The power of Spark SQL derives from several key facts: SQL analysts can now tak Read more
posted @ 2019-02-23 11:05 DataNerd Views (357) Comments (0) Recommendations (0)
Summary: Spark core data sources: CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain text files. The Structure of the Data Sources API Read API Structure The core s Read more
posted @ 2019-02-23 09:58 DataNerd Views (480) Comments (0) Recommendations (0)
Summary: Types of grouping: The simplest grouping is to just summarize a complete DataFrame by performing an aggregation in a select statement. A “group by” allows you to Read more
posted @ 2019-02-19 11:06 DataNerd Views (355) Comments (0) Recommendations (0)
Summary: Where to Look for APIs A DataFrame is essentially a Dataset of type Row, so check https://spark.apache.org/docs/latest/api/scala/index.html org.apache.spark.sql.Dataset regularly to spot API updates Read more
posted @ 2019-02-16 12:40 DataNerd Views (397) Comments (0) Recommendations (0)
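Summary: A DataFrame consists of a sequence of records, each of type Row. Columns represent computation expressions that can be run on each individual record. The schema defines the name and data type of each column. Partitioning defines how a DataFrame or Dataset is physically distributed across the cluster. Schemas You can let the data source define the schema (also called Read more
posted @ 2019-02-14 16:58 DataNerd Views (411) Comments (0) Recommendations (0)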
Summary: users = ParallelCollectionRDD[62] at parallelize at :49 ParallelCollectionRDD[62] at parallelize at :49 relationships = ParallelCollectionRDD[63] at p Read more
posted @ 2018-12-20 11:40 DataNerd Views (514) Comments (0) Recommendations (0)
Summary: Name: Compile Error Message: :30: error: class $iw needs to be abstract, since value userGraph is not defined class $iw extends Serializable { ^ Stack Read more
posted @ 2018-12-20 11:07 DataNerd Views (325) Comments (0) Recommendations (0)