Summary: The most common Spark problem to diagnose is, almost invariably, OOM. Let's start with Spark's memory model: inside an Executor, memory is split into three regions: execution memory, storage memory, and other memory. Execution memory is where work runs; per the documentation, joins and aggregates execute in this region, and sh… Read more
posted @ 2021-10-28 17:34 muyue123
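The region split above maps onto Spark's unified memory configs. A minimal sketch of how those knobs are set (the property names are standard Spark configs; the values are illustrative, not recommendations):

    # Sketch: tuning the three memory regions described above.
    # spark.memory.fraction governs execution + storage; the remainder
    # of the heap is the "other" (user) region.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("memory-tuning-demo") \
        .config("spark.executor.memory", "4g") \
        .config("spark.memory.fraction", "0.6") \
        .config("spark.memory.storageFraction", "0.5") \
        .getOrCreate()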
Summary: Computing the four quartiles: df1 = spark.createDataFrame([(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10),(2,1),(2,10),(2,100)],['id','cnt']) cnt_med_1 = F.… Read more
posted @ 2021-09-24 13:44 muyue123
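The excerpt cuts off at cnt_med_1 = F.; a plausible completion (an assumption, since the post is truncated here) computes all four quartiles per id with the built-in percentile_approx:

    # Sketch: per-group quartiles; the aggregation is assumed from the truncated excerpt.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame(
        [(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10),
         (2,1),(2,10),(2,100)], ['id','cnt'])

    # percentile_approx is a built-in Spark SQL function; array(...) returns
    # all four quartiles in a single pass.
    quartiles = df1.groupBy('id').agg(
        F.expr('percentile_approx(cnt, array(0.25, 0.5, 0.75, 1.0))').alias('quartiles'))
    quartiles.show(truncate=False)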
Summary: select id, cnt, sum(cnt) over w as sum_cnt from ( select 'a' as id, 1 as cnt union all select 'a' as id, 9 as cnt union all select 'a' as id, 4 as cnt uni… Read more
posted @ 2021-09-02 15:07 muyue123
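The over w syntax implies a named window. A runnable reconstruction of the pattern (the WINDOW definition is an assumption, since the query is cut off before it):

    # Sketch: named WINDOW clause in Spark SQL, reconstructed from the truncated query.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("""
        SELECT id, cnt, SUM(cnt) OVER w AS sum_cnt
        FROM (
            SELECT 'a' AS id, 1 AS cnt UNION ALL
            SELECT 'a' AS id, 9 AS cnt UNION ALL
            SELECT 'a' AS id, 4 AS cnt
        ) t
        WINDOW w AS (PARTITION BY id ORDER BY cnt)
    """).show()

Note that with ORDER BY in the window definition, SUM becomes a running total; drop the ORDER BY to get one per-partition grand total instead.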
Summary: # Example 1 import matplotlib.pyplot as plt data = [[1,2,3,4],[6,5,4,3],[1,3,5,1]] table = plt.table(cellText=data, colLabels=['A', 'B', 'C', 'D'], loc='cen… Read more
posted @ 2021-08-23 15:17 muyue123
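The call is cut off at loc='cen; a self-contained completion, assuming the truncated argument is loc='center' and that the table should be drawn on its own with the axes hidden:

    # Sketch: render a table of the data above; loc='center' is assumed from the truncation.
    import matplotlib.pyplot as plt

    data = [[1, 2, 3, 4], [6, 5, 4, 3], [1, 3, 5, 1]]

    fig, ax = plt.subplots()
    ax.axis('off')  # hide the axes so only the table is visible
    ax.table(cellText=data, colLabels=['A', 'B', 'C', 'D'], loc='center')
    plt.show()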
Summary: spark = SparkSession.builder. \ appName(app_name). \ enableHiveSupport(). \ config("spark.debug.maxToStringFields", "100"). \ config("spark.executor.m… Read more
posted @ 2021-08-12 15:22 muyue123
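The chain is truncated at spark.executor.m; a complete sketch, assuming the cut-off key is spark.executor.memory (app_name and the values are placeholders):

    # Sketch: full builder chain; "spark.executor.m..." is assumed to be spark.executor.memory.
    from pyspark.sql import SparkSession

    app_name = "my-app"  # placeholder
    spark = SparkSession.builder \
        .appName(app_name) \
        .enableHiveSupport() \
        .config("spark.debug.maxToStringFields", "100") \
        .config("spark.executor.memory", "4g") \
        .getOrCreate()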
Summary: Method 1: ALTER TABLE kuming.tableName DELETE WHERE toDate(insert_at_timestamp)='2020-07-21'; Method 2: ALTER TABLE kuming.tableName DELETE WHERE insert_at_time… Read more
posted @ 2021-08-10 15:36 muyue123
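Driving the same mutation from Python, assuming the clickhouse-driver package (the host is a placeholder; Method 2 is omitted because its predicate is cut off). ALTER ... DELETE in ClickHouse is an asynchronous mutation, so system.mutations is where to check progress:

    # Sketch, assuming clickhouse-driver; the host is a placeholder.
    from clickhouse_driver import Client

    client = Client(host='localhost')

    # Method 1: delete all rows whose timestamp falls on the given date.
    client.execute(
        "ALTER TABLE kuming.tableName DELETE "
        "WHERE toDate(insert_at_timestamp) = '2020-07-21'")

    # The delete runs asynchronously as a mutation; poll its status here.
    for mutation_id, is_done in client.execute(
            "SELECT mutation_id, is_done FROM system.mutations"):
        print(mutation_id, is_done)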
Summary: (Well written and very practical; reposted responsibly with attribution.) Source: http://www.crazyant.net/1197.html Hive's INSERT statement can take data from a query and load it into a target table in the same step. Suppose there is an existing, populated table staged_employees (a full employee snapshot) with country column cnty and state column st… Read more
posted @ 2021-07-22 11:32 muyue123
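A sketch of the pattern the post describes, issued through Spark's Hive support (staged_employees, cnty, and st come from the summary; the target table employees and the selected columns are assumptions):

    # Sketch: load query results into one partition of a Hive table.
    # The target table "employees" and the columns name/salary are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        INSERT OVERWRITE TABLE employees
        PARTITION (cnty = 'US', st = 'OR')
        SELECT name, salary
        FROM staged_employees se
        WHERE se.cnty = 'US' AND se.st = 'OR'
    """)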
Summary: The approach stores each updated dimension table in the latest partition. # coding=utf-8 from pyspark.sql.types import IntegerType, StructType from pyspark.sql import SparkSession import datetime from… Read more
posted @ 2021-07-15 17:44 muyue123
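A sketch of that "latest partition holds the current snapshot" idea (dim_source and dim_snapshot are hypothetical names; dynamic partition overwrite keeps the older snapshots intact):

    # Sketch: write today's dimension snapshot into its own dt partition.
    # dim_source / dim_snapshot are hypothetical; the target table must
    # already exist, partitioned by dt, for insertInto to work.
    import datetime
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # Overwrite only the partition being written, not the whole table.
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

    dt = datetime.date.today().strftime('%Y-%m-%d')
    (spark.table('dim_source')
        .withColumn('dt', F.lit(dt))
        .write
        .mode('overwrite')
        .insertInto('dim_snapshot'))  # readers then pick the max(dt) partition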
Summary: aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "0*" --include "1*" --include "2*" --in… Read more
posted @ 2021-07-05 10:34 muyue123
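The same prefix filter expressed with boto3, for environments without the CLI (bucket names reused from the command; only the visible prefixes 0 to 2 are shown since the command is truncated, and unlike sync this copies unconditionally instead of skipping up-to-date objects):

    # Sketch: copy keys starting with 0, 1, or 2 between buckets, mirroring
    # the --exclude "*" --include "N*" filters above. Copies unconditionally.
    import boto3

    s3 = boto3.client('s3')
    src = 'source-AWSDOC-EXAMPLE-BUCKET'
    dst = 'destination-AWSDOC-EXAMPLE-BUCKET'

    paginator = s3.get_paginator('list_objects_v2')
    for prefix in ('0', '1', '2'):
        for page in paginator.paginate(Bucket=src, Prefix=prefix):
            for obj in page.get('Contents', []):
                s3.copy_object(Bucket=dst, Key=obj['Key'],
                               CopySource={'Bucket': src, 'Key': obj['Key']})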
Summary: # CSV mySchema = StructType().add("id", IntegerType(), True).add("name", StringType(), True) df = spark.readStream.option("sep",",").option("header","fal… Read more
posted @ 2021-06-24 16:08 muyue123
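A self-contained completion of the snippet (the truncated "fal" is assumed to be "false"; the input directory and console sink are illustrative):

    # Sketch: streaming CSV source with an explicit schema, completed from the excerpt.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()

    mySchema = StructType().add("id", IntegerType(), True).add("name", StringType(), True)
    df = (spark.readStream
          .option("sep", ",")
          .option("header", "false")     # assumed completion of the truncated "fal"
          .schema(mySchema)              # streaming file sources require an explicit schema
          .csv("/tmp/stream_input"))     # hypothetical directory watched for new CSV files

    query = df.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()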