ZhangZhihui's Blog  

 

from pyspark.sql.functions import expr

df_filtered = df_filtered.withColumn('filtered_array', expr('filter(filtered_doc, x -> len(x) >= 4)'))

Please help me understand the expr() and filter() functions.

 

✅ What expr() Does

expr() lets you write Spark SQL expressions inside PySpark code.

Example:

from pyspark.sql.functions import expr
df = df.withColumn("new_col", expr("colA + colB"))

So instead of writing:

df.colA + df.colB

you can write SQL-style logic as a string.

👉 It is especially useful for higher-order functions, which are easier to express in SQL syntax than as Python UDFs.


🧩 What filter() Does (Spark SQL Higher-Order Function)

filter(array, function) is a higher-order function in Spark SQL that:

  • Takes an array column

  • Applies a lambda function to every element

  • Keeps only the elements for which the function returns true

Syntax inside expr():

filter(array_column, x -> predicate(x))
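
To see this end-to-end, here is a minimal, self-contained sketch (it assumes a local SparkSession; the column name filtered_doc mirrors the one in the question, and length() is used as the standard Spark SQL name for string length):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A one-row DataFrame with an array<string> column, mirroring filtered_doc.
df = spark.createDataFrame([(["this", "is", "a", "sample"],)], ["filtered_doc"])

# filter() keeps only the array elements whose string length is at least 4.
df.select(
    expr("filter(filtered_doc, x -> length(x) >= 4)").alias("filtered_array")
).show(truncate=False)
# +--------------+
# |filtered_array|
# +--------------+
# |[this, sample]|
# +--------------+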

 


🧠 Understanding Your Line

filter(filtered_doc, x -> len(x) >= 4)

Breakdown:

filtered_doc

An array column (probably an array of words such as ["this", "is", "text"]).

filter(...)

Keeps only array elements that satisfy the condition.

x -> len(x) >= 4

This is a lambda function:

  • x is each element of the array

  • len(x) = length of the string

  • Condition: keep words with 4 or more characters

✔ The result

If filtered_doc is:

["this", "is", "a", "sample"]

Then filtered_array will be:

["this", "sample"]

 


📝 Equivalent Without expr()

You could use PySpark’s built-in functions:

from pyspark.sql.functions import filter, col, length

df_filtered = df_filtered.withColumn(
    "filtered_array",
    filter(col("filtered_doc"), lambda x: length(x) >= 4)
)

(pyspark.sql.functions.filter and Python lambdas for array higher-order functions are available from Spark 3.1 onward.)
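
If you would rather not shadow Python's built-in filter() by importing the name directly, a common alternative (just a sketch; F is a conventional alias, nothing PySpark requires) is to import the module under an alias:

import pyspark.sql.functions as F

# Same logic as above, but filter, col and length stay namespaced under F,
# so Python's built-in filter() is left untouched.
df_filtered = df_filtered.withColumn(
    "filtered_array",
    F.filter(F.col("filtered_doc"), lambda x: F.length(x) >= 4),
)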


🎯 Summary

expr()

  • Runs SQL expressions in PySpark

  • Useful for complex expressions and higher-order functions

filter()

  • Spark SQL higher-order function

  • Filters an array based on a lambda predicate

  • Used here to keep only words of length ≥ 4

 

 

In the above code, I didn't import the filter function, but it can still be used. I'd like to know: is it a built-in Python function?

 

Ah, I see exactly what you’re asking — this is a subtle but important point.

No, the filter in your line:

df_filtered = df_filtered.withColumn(
    'filtered_array', 
    expr('filter(filtered_doc, x -> len(x) >= 4)')
)

is not the Python built-in filter() function. Let me clarify.


1️⃣ Where this filter comes from

  • You are using expr() here, which lets you write Spark SQL expressions inside PySpark.

  • Inside expr(), the string 'filter(filtered_doc, x -> len(x) >= 4)' is interpreted as a Spark SQL higher-order function.

  • In Spark SQL, filter(array<T>, function<T, Boolean>) is built into Spark itself and operates on array columns (see the sketch after this list).

  • It has nothing to do with Python’s filter(), which works on Python iterables.
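
One way to convince yourself that this filter lives in Spark SQL rather than in Python is to run the same expression as a plain SQL query. A sketch, assuming an existing SparkSession named spark; the view name docs is illustrative, and length() is the standard Spark SQL spelling:

# Register the DataFrame as a temporary view so it can be queried with SQL.
df_filtered.createOrReplaceTempView("docs")

spark.sql("""
    SELECT filter(filtered_doc, x -> length(x) >= 4) AS filtered_array
    FROM docs
""").show(truncate=False)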


🔹 Python built-in filter() vs Spark SQL filter()

Feature  | Python filter()                                   | Spark SQL filter() in expr()
Location | Python standard library                           | Spark SQL (through expr)
Input    | Any iterable (list, tuple, etc.)                  | Column of type array
Output   | Iterable (needs list() to materialize)            | Column (array)
Syntax   | filter(lambda x: condition, iterable)             | filter(array_col, x -> condition)
Example  | list(filter(lambda x: len(x) >= 4, ["a", "abcd"])) returns ['abcd'] | expr('filter(array_col, x -> len(x) >= 4)')
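
For contrast, here is the Python built-in filter() on an ordinary in-memory list; it never touches Spark:

words = ["this", "is", "a", "sample"]

# Python's filter() returns a lazy iterator, so list() is needed to materialize it.
long_words = list(filter(lambda w: len(w) >= 4, words))
print(long_words)  # ['this', 'sample']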

2️⃣ Why you didn’t need to import anything

  • Spark SQL functions used inside expr() strings are recognized by Spark itself.

  • They are not Python functions, so Python has no filter() object to import.

  • Spark parses the string and converts it into the proper logical plan internally (as the sketch below shows).
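
You can watch that happen by asking Spark for the query plan of the expr() column. A sketch reusing the df_filtered DataFrame from above:

from pyspark.sql.functions import expr

# explain(True) prints the parsed, analyzed, optimized and physical plans;
# the predicate appears there as a Spark SQL lambda function, not a Python callable.
df_filtered.select(expr("filter(filtered_doc, x -> length(x) >= 4)")).explain(True)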


3️⃣ Alternative: Using PySpark functions in Python directly

If you wanted to avoid expr(), you could use PySpark’s filter function:

from pyspark.sql.functions import filter, col, length

df_filtered = df_filtered.withColumn(
    "filtered_array",
    filter(col("filtered_doc"), lambda x: length(x) >= 4)
)

  • Here, filter is imported from pyspark.sql.functions, not from Python’s built-ins.

  • This does the same thing as your expr() version.


Summary

  • The filter in your expr() string is a Spark SQL array function, not Python’s built-in filter().

  • It works without imports because Spark parses the string inside expr().

  • Python’s built-in filter() cannot operate on Spark DataFrame columns directly.

 

posted on 2025-11-26 11:03  ZhangZhihuiAAA