ZhangZhihui's Blog  

 

from pyspark.sql.functions import expr

df_filtered = df_filtered.withColumn('filtered_array', expr('filter(filtered_doc, x -> len(x) >= 4)'))

Please help me understand the expr() and filter() functions.

 

✅ What expr() Does

expr() lets you write Spark SQL expressions inside PySpark code.

Example:

from pyspark.sql.functions import expr
df = df.withColumn("new_col", expr("colA + colB"))

So instead of writing:

df.colA + df.colB

you can write SQL-style logic as a string.

👉 It is especially useful for higher-order functions such as filter(), whose SQL lambda syntax is more concise than an equivalent Python UDF and runs inside Spark's engine rather than in Python.


🧩 What filter() Does (Spark SQL Higher-Order Function)

filter(array, function) is a higher-order function in Spark SQL that:

  • Takes an array column

  • Applies a lambda function to every element

  • Keeps only the elements for which the function returns true

Syntax inside expr():

filter(array_column, x -> predicate(x))
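As a mental model only (this is plain Python, not Spark code), what filter() does to the array in each row resembles a list comprehension. The helper name filter_array and the sample predicate are made up for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def filter_array(array: List[T], predicate: Callable[[T], bool]) -> List[T]:
    """Keep the elements for which predicate returns True, preserving order."""
    return [x for x in array if predicate(x)]


words = ["this", "is", "text"]

# Python's built-in len() plays the role of the SQL length function here.
kept = filter_array(words, lambda x: len(x) >= 4)
print(kept)
```

Spark applies the same idea to the array stored in every row of the column, so each row gets its own filtered array.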

 


🧠 Understanding Your Line

filter(filtered_doc, x -> len(x) >= 4)

Breakdown:

filtered_doc

An array column (probably an array of words, such as ["this", "is", "text"]).

filter(...)

Keeps only array elements that satisfy the condition.

x -> len(x) >= 4

This is a lambda function:

  • x is each element of the array

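  • len(x) returns the number of characters in the string x (to my knowledge, len is an alias of length that Spark SQL added in 3.4; on older versions, use length(x))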

  • Condition: keep words with 4 or more characters

✔ The result

If filtered_doc is:

["this", "is", "a", "sample"]

Then filtered_array will be:

["this", "sample"]

 


📝 Equivalent Without expr()

You could use PySpark’s built-in functions:

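from pyspark.sql import functions as F  # avoids shadowing Python's built-in filter()

df_filtered = df_filtered.withColumn(
    "filtered_array",
    # F.length is the Column API counterpart of length() in Spark SQL
    F.filter(F.col("filtered_doc"), lambda x: F.length(x) >= 4)
)

(pyspark.sql.functions.filter, which accepts a Python lambda directly, is available from Spark 3.1 onward.)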


🎯 Summary

expr()

  • Runs SQL expressions in PySpark

  • Useful for complex expressions and higher-order functions

filter()

  • Spark SQL higher-order function

  • Filters an array based on a lambda predicate

  • Used here to keep only words of length ≥ 4

 

posted on 2025-11-26 11:03  ZhangZhihuiAAA