What will spark.sql(hsql).collect() return?

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark with python') \
    .master('local[*]') \
    .enableHiveSupport() \
    .getOrCreate()

hsql = 'select distinct id from aaa'
result = spark.sql(hsql).collect()

 

Step 1. spark.sql(hsql)

  • Runs the SQL query select distinct id from aaa against the table aaa.

  • This returns a Spark DataFrame with a single column (id) containing the distinct values from the table (a runnable setup is sketched after the example below).

Example DataFrame (before .collect()):

+---+
| id|
+---+
|  1|
|  2|
|  5|
+---+
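
The query above assumes a table named aaa already exists. For a self-contained run, one option is to register a small temporary view under the same name and run the same query against it (the sample values here are made up):

# Hypothetical sample data standing in for the table `aaa`
df = spark.createDataFrame([(1,), (2,), (2,), (5,)], ['id'])
df.createOrReplaceTempView('aaa')

spark.sql('select distinct id from aaa').show()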

 


Step 2. .collect()

  • Collects the entire DataFrame from Spark executors back to the driver program.

  • The result is a Python list of Row objects (pyspark.sql.Row); a sketch after the example below shows how to turn them into plain Python values.

Example return value:

[Row(id=1), Row(id=2), Row(id=5)]
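
If you need plain Python values rather than Row objects, you can pull the field out of each Row afterwards. A small sketch based on the result above:

ids = [row.id for row in result]           # [1, 2, 5]
dicts = [row.asDict() for row in result]   # [{'id': 1}, {'id': 2}, {'id': 5}]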

 


Important notes

  • If the table aaa is large, collect() can be very expensive because it loads all rows into the driver’s memory (a streaming alternative is sketched below).

  • If you just want to see the results without pulling them all to the driver, it is better to use:

    spark.sql(hsql).show()

     

    or limit results:

    spark.sql(hsql).limit(10).collect()
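
  • If you really do need every row on the driver but want to avoid holding the whole result in memory at once, DataFrame.toLocalIterator() streams the rows back partition by partition. A minimal sketch:

    for row in spark.sql(hsql).toLocalIterator():
        print(row.id)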

     


So in short:
spark.sql(hsql).collect() returns a list of Row objects, where each Row contains one distinct id from table aaa.

 

Is the pyspark.sql.Row object a list or tuple?
 

The pyspark.sql.Row object is not a Python list, and it is not a plain tuple either.

It is a special class provided by PySpark, implemented as a subclass of tuple, that behaves like a named tuple.


Key properties of Row

  1. Looks like a named tuple

    from pyspark.sql import Row
    
    r = Row(id=1, name="Alice")
    
    print(r)          # Row(id=1, name='Alice')
    print(r.id)       # 1   (attribute access)
    print(r["name"])  # Alice (dict-style access)

     

  2. Indexable like a tuple

    print(r[0])  # 1
    print(r[1])  # Alice

     

  3. Iterable like a tuple

    for v in r:
        print(v)
    # 1
    # Alice

     

  4. A tuple subclass, but not a plain tuple

    isinstance(r, tuple)  # True  (Row is implemented as a subclass of tuple)
    isinstance(r, list)   # False
    type(r)               # <class 'pyspark.sql.types.Row'>

     

So, it’s its own type (Row), built on top of tuple and designed to combine the benefits of:

  • tuple (indexable, iterable)

  • dict / namedtuple (field names, attribute access)


In short:
A Row is not a list; it is a tuple subclass with field names attached, so it behaves like a namedtuple.
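
Because a Row carries both tuple behavior and field names, you can unpack it like a tuple or convert it to a plain dict with its asDict() method. A quick sketch using the r from above:

id_, name = r          # tuple-style unpacking -> 1, 'Alice'
print(r.asDict())      # {'id': 1, 'name': 'Alice'}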

 

posted on 2025-09-05 09:40 by ZhangZhihuiAAA