```python
from pyspark.sql.functions import expr

df_filtered = df_filtered.withColumn('filtered_array', expr('filter(filtered_doc, x -> len(x) >= 4)'))
```
Please help me understand the expr() and filter() functions.
✅ What expr() Does
expr() lets you write Spark SQL expressions inside PySpark code.
Example:
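A minimal sketch (the DataFrame `df` and its `price` column are hypothetical, not from your snippet):

```python
from pyspark.sql.functions import expr

# The SQL string is parsed by Spark into a Column expression.
df = df.withColumn('price_with_tax', expr('price * 1.1'))

# Equivalent to building the Column with the Python API:
# from pyspark.sql.functions import col
# df = df.withColumn('price_with_tax', col('price') * 1.1)
```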
👉 It is useful for higher-order functions, which are easier to express in SQL syntax than Python UDFs.
🧩 What filter() Does (Spark SQL Higher-Order Function)
filter(array, function) is a higher-order function in Spark SQL that:
- Takes an array column
- Applies a lambda function to every element
- Keeps only the elements for which the function returns true
Syntax inside expr():
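The general form (the names here are placeholders):

```python
expr("filter(array_column, element -> boolean_condition_on_element)")
```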
🧠 Understanding Your Line
✔ filtered_doc
An array column (probably array of words like ["this", "is", "text"]).
✔ filter(...)
Keeps only array elements that satisfy the condition.
✔ x -> len(x) >= 4
This is a lambda function:
- `x` is each element of the array
- `len(x)` = the length of the string
- Condition: keep words with 4 or more characters

(Note: depending on your Spark version, the SQL length function may need to be written as `length(x)` instead of `len(x)`.)
✔ The result
If `filtered_doc` is `["this", "is", "text"]`, the new `filtered_array` column keeps only the words with at least 4 characters: `["this", "text"]`.
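A self-contained sketch you can run to check this (it assumes an active SparkSession named `spark` and uses `length`, the portable Spark SQL spelling of the length function):

```python
from pyspark.sql.functions import expr

df = spark.createDataFrame([(["this", "is", "text"],)], ["filtered_doc"])
result = df.withColumn("filtered_array", expr("filter(filtered_doc, x -> length(x) >= 4)"))

result.show(truncate=False)
# filtered_doc = [this, is, text]  ->  filtered_array = [this, text]
# ("is" is dropped because its length is 2, which is < 4)
```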
📝 Equivalent Without expr()
You could use PySpark’s built-in functions:
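For example, with column functions instead of a SQL string (a sketch; `pyspark.sql.functions.filter` is available from Spark 3.1 onward):

```python
import pyspark.sql.functions as F

# Same logic as the expr() version: keep array elements whose length is >= 4.
df_filtered = df_filtered.withColumn(
    "filtered_array",
    F.filter("filtered_doc", lambda x: F.length(x) >= 4),
)
```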
🎯 Summary
`expr()`
- Runs SQL expressions in PySpark
- Useful for complex expressions and higher-order functions

`filter()`
- Spark SQL higher-order function
- Filters an array based on a lambda predicate
- Used here to keep only words of length ≥ 4
In the above code, I didn't import the filter function, but it can still be used. I'd like to know: is it a built-in Python function?
Ah, I see exactly what you’re asking — this is a subtle but important point.
No. The `filter` in your line, `expr('filter(filtered_doc, x -> len(x) >= 4)')`, is not Python's built-in `filter()`; it is a Spark SQL function.
1️⃣ Where this filter comes from
- You are using `expr()` here, which lets you write Spark SQL expressions inside PySpark.
- Inside `expr()`, the string `'filter(filtered_doc, x -> len(x) >= 4)'` is interpreted as a Spark SQL higher-order function.
- In Spark SQL, `filter(array<T>, function<T, Boolean>)` is built into Spark, and it operates on array columns.
- It has nothing to do with Python's `filter()`, which works on Python iterables.
🔹 Python built-in filter() vs Spark SQL filter()
| Feature | Python `filter()` | Spark SQL `filter()` in `expr()` |
|---|---|---|
| Location | Python standard library | Spark SQL (through `expr`) |
| Input | Any iterable (list, tuple, etc.) | Column of type array |
| Output | Iterator (needs `list()` to materialize) | Column (array) |
| Syntax | `filter(lambda x: condition, iterable)` | `filter(array_col, x -> condition)` |
| Example | `list(filter(lambda x: len(x) >= 4, ["a", "abcd"]))` → `['abcd']` | `expr('filter(array_col, x -> len(x) >= 4)')` |
2️⃣ Why you didn’t need to import anything
- Spark SQL functions used inside `expr()` strings are recognized by Spark itself.
- They are not Python functions, so Python has no `filter()` object to import.
- Spark parses the string and converts it into the proper logical plan internally (see the sketch below).
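One way to see this (a rough sketch; the exact plan text varies by Spark version) is to print the query plan, where the lambda shows up as a Spark expression rather than a Python call:

```python
# Continuing from the expr()-based filter above:
df_filtered.explain(extended=True)
# The analyzed/optimized plans contain the filter(...) higher-order-function
# expression Spark built from the SQL string; no Python filter() is involved.
```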
3️⃣ Alternative: Using PySpark functions in Python directly
If you wanted to avoid expr(), you could use PySpark’s filter function:
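A sketch of that approach (assuming Spark 3.1+; note the import shadows Python's built-in `filter` in this module):

```python
from pyspark.sql.functions import filter, length

# The lambda receives a Column for each element and returns a boolean Column.
df_filtered = df_filtered.withColumn(
    'filtered_array',
    filter('filtered_doc', lambda x: length(x) >= 4),
)
```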
- Here, `filter` is imported from `pyspark.sql.functions`, not from Python's builtins.
- This does the same thing as your `expr()` version.
✅ Summary
- The `filter` in your `expr()` string is a Spark SQL array function, not Python's built-in `filter()`.
- It works without imports because Spark parses the string inside `expr()`.
- Python's built-in `filter()` cannot operate on Spark DataFrame columns directly.
