PySpark-SQL

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable eager evaluation so DataFrames render as tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 1000)
spark
SparkSession - in-memory
SparkContext (Spark UI)
  Version: v3.1.1
  Master:  local[*]
  AppName: pyspark-shell

Reading data

df = spark.read.csv("demo.csv", header=True)
df.show()
+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot|  6| 60|
| blue|banana|  2| 20|
|  red|banana|  7| 70|
|  red|carrot|  3| 30|
|  red|banana|  1| 10|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|  red| grape|  8| 80|
+-----+------+---+---+

Inspecting the schema

df.printSchema()
root
 |-- color: string (nullable = true)
 |-- fruit: string (nullable = true)
 |-- v1: string (nullable = true)
 |-- v2: string (nullable = true)
df.select(df['color'], df['v1'] + 1)
+-----+--------+
|color|(v1 + 1)|
+-----+--------+
|black|     7.0|
| blue|     3.0|
|  red|     8.0|
|  red|     4.0|
|  red|     2.0|
| blue|     5.0|
|  red|     6.0|
|  red|     9.0|
+-----+--------+

Filtering

df.filter(df['v1'] > 3)
+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot|  6| 60|
|  red|banana|  7| 70|
| blue| grape|  4| 40|
|  red|carrot|  5| 50|
|  red| grape|  8| 80|
+-----+------+---+---+

Aggregation

df.groupBy(df['color']).count()
+-----+-----+
|color|count|
+-----+-----+
|  red|    5|
|black|    1|
| blue|    2|
+-----+-----+

Registering the DataFrame as a temporary SQL view

df.createOrReplaceTempView("dual")
spark.sql("select * from dual limit 3")
+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot|  6| 60|
| blue|banana|  2| 20|
|  red|banana|  7| 70|
+-----+------+---+---+

Registering a global temporary view

df.createOrReplaceGlobalTempView("dual2")
spark.sql("select count(1) as cnt from global_temp.dual2")

Note that the view must be qualified with the `global_temp.` prefix when queried.

+---+
|cnt|
+---+
|  8|
+---+
posted @ 2021-06-18 11:20  人人从众