(3)pyspark----dataframe和sql
1、读取:
- sparkDF = spark.read.csv(path)
- sparkDF = spark.read.text(path)
2、打印:
sparkDF.show()【这是pandas中没有的】:打印内容
sparkDF.head():打印前面的内容
sparkDF.describe():统计信息
sparkDF.printSchema():打印schema,列的属性信息打印出来【这是pandas中没有的】
sparkDF.columns:将列名打印出来
3、选择列
【select函数,原pandas中没有】
sparkDF.select('列名1','列名2‘).show():选择dataframe的两列数据显示出来
sparkDF.select ( sparkDF['列名1']+1 , '列名2' ).show():直接对列1进行操作(值+1)打印出来
4、筛选列:
filter【类似pandas中dataframe的采用列名来筛选功能】
sparkDF.filter ( sparkDF['value'] == 100 ).show():将value这一列值为100的行筛选出来
5、计算不重复值以及统计dataframe的行数
distinct()函数:将重复值去除
sparkDF.count():统计dataframe中有多少行
将评分为100的电影数量统计出来:
6. 类似sql查询
selectExpr
#表查询selectExpr,可以使用UDF函数,指定别名等 import datetime spark.udf.register("getBirthYear",lambda age:datetime.datetime.now().year-age) dftest = df.selectExpr("name", "getBirthYear(age) as birth_year" , "UPPER(gender) as gender" ) dftest.show()
where
#表查询where, 指定SQL中的where字句表达式 dftest = df.where("gender='male' and age>15") dftest.show()
filter
#表查询filter dftest = df.filter("gender ='male'") dftest.show()
groupBy + collect_list
#表分组聚合,groupBy,agg dfagg = df.groupBy("gender").agg(F.expr("avg(age)"),F.expr("collect_list(name)")) dfagg.show()
+------+--------+------------------+ |gender|avg(age)|collect_list(name)| +------+--------+------------------+ | null| 16.0| [RuHua]| |female| 16.0| [HanMeiMei]| | male| 16.0| [LiLei, DaChui]| +------+--------+------------------+
窗口函数
#窗口函数 df = spark.createDataFrame([("LiLei",78,"class1"),("HanMeiMei",87,"class1"), ("DaChui",65,"class2"),("RuHua",55,"class2")]) \ .toDF("name","score","class") df.show() dforder = df.selectExpr("name","score","class", "row_number() over (partition by class order by score desc) as order") dforder.show()
+---------+-----+------+ | name|score| class| +---------+-----+------+ | LiLei| 78|class1| |HanMeiMei| 87|class1| | DaChui| 65|class2| | RuHua| 55|class2| +---------+-----+------+ +---------+-----+------+-----+ | name|score| class|order| +---------+-----+------+-----+ | DaChui| 65|class2| 1| | RuHua| 55|class2| 2| |HanMeiMei| 87|class1| 1| | LiLei| 78|class1| 2| +---------+-----+------+-----+