PySpark-SQL
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StringType, StructType, StructField
spark = SparkSession.builder.getOrCreate()
# Configuration: render DataFrames eagerly as tables in the REPL/notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 1000)
spark
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.1
Master: local[*]
AppName: pyspark-shell
Reading data
df = spark.read.csv("demo.csv", header=True)
df.show()
+-----+------+---+---+
|color| fruit| v1| v2|
+-----+------+---+---+
|black|carrot| 6| 60|
| blue|banana| 2| 20|
| red|banana| 7| 70|
| red|carrot| 3| 30|
| red|banana| 1| 10|
| blue| grape| 4| 40|
| red|carrot| 5| 50|
| red| grape| 8| 80|
+-----+------+---+---+
Inspecting the schema
df.printSchema()
root
|-- color: string (nullable = true)
|-- fruit: string (nullable = true)
|-- v1: string (nullable = true)
|-- v2: string (nullable = true)
df.select(df['color'], df['v1'] + 1)
| color | (v1 + 1) |
|---|---|
| black | 7.0 |
| blue | 3.0 |
| red | 8.0 |
| red | 4.0 |
| red | 2.0 |
| blue | 5.0 |
| red | 6.0 |
| red | 9.0 |
Filtering
df.filter(df['v1'] > 3)
| color | fruit | v1 | v2 |
|---|---|---|---|
| black | carrot | 6 | 60 |
| red | banana | 7 | 70 |
| blue | grape | 4 | 40 |
| red | carrot | 5 | 50 |
| red | grape | 8 | 80 |
Aggregation
df.groupBy(df['color']).count()
| color | count |
|---|---|
| red | 5 |
| black | 1 |
| blue | 2 |
Registering the DataFrame as a temporary SQL view
df.createOrReplaceTempView("dual")
spark.sql("select * from dual limit 3")
| color | fruit | v1 | v2 |
|---|---|---|---|
| black | carrot | 6 | 60 |
| blue | banana | 2 | 20 |
| red | banana | 7 | 70 |
Registering a global temporary view
df.createOrReplaceGlobalTempView("dual2")
spark.sql("select count(1) as cnt from global_temp.dual2")
Note that the view must be referenced with the global_temp. prefix:
| cnt |
|---|
| 8 |