(3)pyspark----dataframe和sql

1、读取:

  • sparkDF = spark.read.csv(path)
  • sparkDF = spark.read.text(path)

2、打印:

sparkDF.show()【这是pandas中没有的】:打印内容

sparkDF.head():打印前面的内容

sparkDF.describe():统计信息

sparkDF.printSchema():打印schema,列的属性信息打印出来【这是pandas中没有的】

sparkDF.columns:将列名打印出来

3、选择列

【select函数,原pandas中没有】

sparkDF.select('列名1','列名2‘).show():选择dataframe的两列数据显示出来

sparkDF.select ( sparkDF['列名1']+1 , '列名2' ).show():直接对列1进行操作(值+1)打印出来

 

4、筛选列:

filter【类似pandas中dataframe的采用列名来筛选功能】

sparkDF.filter ( sparkDF['value'] == 100 ).show():将value这一列值为100的行筛选出来

 

 5、计算不重复值以及统计dataframe的行数

 distinct()函数:将重复值去除

sparkDF.count():统计dataframe中有多少行

将评分为100的电影数量统计出来:

 

6. 类似sql查询

selectExpr
#表查询selectExpr,可以使用UDF函数,指定别名等
import datetime
spark.udf.register("getBirthYear",lambda age:datetime.datetime.now().year-age)
dftest = df.selectExpr("name", "getBirthYear(age) as birth_year" , "UPPER(gender) as gender" )
dftest.show()

where

#表查询where, 指定SQL中的where字句表达式
dftest = df.where("gender='male' and age>15")
dftest.show()

filter

#表查询filter
dftest = df.filter("gender ='male'")
dftest.show()
groupBy + collect_list
#表分组聚合,groupBy,agg
dfagg = df.groupBy("gender").agg(F.expr("avg(age)"),F.expr("collect_list(name)"))
dfagg.show()
+------+--------+------------------+
|gender|avg(age)|collect_list(name)|
+------+--------+------------------+
|  null|    16.0|           [RuHua]|
|female|    16.0|       [HanMeiMei]|
|  male|    16.0|   [LiLei, DaChui]|
+------+--------+------------------+
 

窗口函数

#窗口函数

df = spark.createDataFrame([("LiLei",78,"class1"),("HanMeiMei",87,"class1"),
                           ("DaChui",65,"class2"),("RuHua",55,"class2")]) \
    .toDF("name","score","class")

df.show()
dforder = df.selectExpr("name","score","class",
         "row_number() over (partition by class order by score desc) as order")

dforder.show()
+---------+-----+------+
|     name|score| class|
+---------+-----+------+
|    LiLei|   78|class1|
|HanMeiMei|   87|class1|
|   DaChui|   65|class2|
|    RuHua|   55|class2|
+---------+-----+------+

+---------+-----+------+-----+
|     name|score| class|order|
+---------+-----+------+-----+
|   DaChui|   65|class2|    1|
|    RuHua|   55|class2|    2|
|HanMeiMei|   87|class1|    1|
|    LiLei|   78|class1|    2|
+---------+-----+------+-----+

 

posted on 2018-10-11 12:48  吱吱了了  阅读(1727)  评论(0)    收藏  举报

导航