What will spark.sql(hsql).collect() return?
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark with python') \
    .master('local[*]') \
    .enableHiveSupport() \
    .getOrCreate()

hsql = 'select distinct id from aaa'
result = spark.sql(hsql).collect()
Step 1. spark.sql(hsql)
- Parses and plans the SQL query select distinct id from aaa against the table aaa. Spark evaluates lazily, so no rows are read until an action such as .collect() is called.
- This returns a Spark DataFrame with a single column (id) that will contain the distinct values from the table.
Example DataFrame (before .collect()):
Step 2. .collect()
- Triggers execution of the query and collects the entire DataFrame from the Spark executors back to the driver program.
- The result is a Python list of Row objects (pyspark.sql.Row).
Example return value:
Important notes
- If the table aaa is large, collect() can be very expensive because it loads all rows into the driver’s memory.
- If you just want to look at the results without pulling them all to the driver, use .show(), or limit the results with .limit(n) or .take(n).
✅ So in short:
spark.sql(hsql).collect() returns a list of Row objects, where each Row contains one distinct id from table aaa.