
 

from pyspark.ml.feature import OneHotEncoder

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"]
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

 

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+

 

 

OneHotEncoder output format

Each encoded column is a SparseVector, shown as:

(size, [indices], [values])

Where:

  • size = the length of the one-hot vector: with the default dropLast=True this is
    the number of distinct categories minus 1 (the last category is dropped to
    avoid multicollinearity)

  • indices = the position that is "hot", i.e. which category this row belongs to

  • values = [1.0] for that position; the dropped category shows up as empty
    indices/values, i.e. an all-zeros vector
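
As a quick illustration (this vector is built by hand, not taken from the encoder), you can construct the same sparse representation yourself and expand it:

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(2, [0], [1.0])   # (size, [indices], [values])
print(v)            # (2,[0],[1.0])
print(v.toArray())  # [1. 0.]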


📌 Vocabulary sizes in your case

From the output:

For output1

(2,[0],[1.0]) → size = 2  

Meaning input1 has 3 categories:

0.0 → index 0
1.0 → index 1
2.0 → index dropped (all zeros)

For output2

(1,[0],[1.0]) → size = 1

Meaning input2 has 2 categories:

0.0 → index 0
1.0 → dropped (all zeros)

 


🔍 Now decode your rows


Row 1

input1 = 0.0
input2 = 1.0

output1 = (2,[0],[1.0])
output2 = (1,[],[])
  • input1=0 → category index 0 → one-hot at index 0
    (2,[0],[1.0])

  • input2=1 is the last category, so the encoder drops it
    → all zeros (1,[],[])


Row 2

input1 = 1.0
input2 = 0.0

output1 = (2,[1],[1.0])
output2 = (1,[0],[1.0])
  • input1=1 → index 1 → (2,[1],[1.0])

  • input2=0 → index 0 → (1,[0],[1.0])


Row 3

input1 = 2.0
input2 = 1.0

output1 = (2,[],[])
output2 = (1,[],[])
  • input1=2 → last category, dropped → (2,[],[])

  • input2=1 → last category, dropped → (1,[],[])
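
A quick way to double-check these decodings programmatically (collect() is fine here because the DataFrame has only three rows):

for row in encoded_df.collect():
    # each output column comes back as a SparseVector; toArray() expands it
    print(row.input1, row.output1.toArray(), row.input2, row.output2.toArray())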


🎯 Summary

output1

  • 3 categories in input1 → output vector size = 2

  • Encoded as:

    input1=0.0 → (2,[0],[1.0])
    input1=1.0 → (2,[1],[1.0])
    input1=2.0 → (2,[],[])     # dropped category

output2

  • 2 categories in input2 → output vector size = 1

  • Encoded as:

    input2=0.0 → (1,[0],[1.0])
    input2=1.0 → (1,[],[])     # dropped category

If you want, I can also show you how to:

✔ keep all categories (no drop)
✔ extract the category → vector index mapping
✔ convert SparseVector to dense vector for readability

 
You said:
Yes, I'd like that.
 
ChatGPT said:

Great — here are the three things you asked for, clearly explained with ready-to-use PySpark code:

(A) Keep all categories (no drop)
(B) Get the category → vector index mapping
(C) Convert SparseVector to DenseVector for readability


🅰 A. Keep all categories (no drop)

By default, PySpark OneHotEncoder drops the last category to avoid multicollinearity (dummy variable trap).

To keep all categories:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
    dropLast=False   # ← keep every category
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

If a column has k categories, the output vector will have k elements (instead of k-1).
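
With dropLast=False the vector sizes grow from 2 to 3 and from 1 to 2, and every category gets its own slot, so the output should look roughly like:

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(3,[0],[1.0])|(2,[1],[1.0])|
|1.0   |0.0   |(3,[1],[1.0])|(2,[0],[1.0])|
|2.0   |1.0   |(3,[2],[1.0])|(2,[1],[1.0])|
+------+------+-------------+-------------+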


🅱 B. Get category → vector index mapping

After fitting:

encoder_model.categorySizes

This returns the number of categories for each input column.

Example output:

[3, 2]

Meaning:

  • input1 has 3 distinct categories → output1 vector length = 2 (or 3 if dropLast=False)

  • input2 has 2 distinct categories → output2 vector length = 1 (or 2 if dropLast=False)

You can also get the full mapping:

If the input columns were originally strings that went through StringIndexer before encoding,
you can read the labels off the fitted StringIndexerModel.

But since your inputs are already numeric category indices, OneHotEncoder uses the value itself as the vector index.

So the mapping is simply:

Category value → vector index

Note that OneHotEncoder does not remap values that are not dense (e.g., 10, 20, 50): the vector size is taken from the largest index seen during fitting, so sparse values produce a long, mostly empty vector. For data like that, run StringIndexer first to get dense 0-based indices (a sketch follows the mapping code below).

To see the category ranges explicitly:

for i, size in enumerate(encoder_model.categorySizes):
    print(f"Column {i} categories: 0..{size - 1}")

Example output:

Column 0 categories: 0..2
Column 1 categories: 0..1

If you want a fully explicit mapping:

drop_last = encoder.getDropLast()  # the dropped last category has no vector index
for col, size in zip(["input1", "input2"], encoder_model.categorySizes):
    print(f"Mapping for {col}:")
    for idx in range(size):
        if drop_last and idx == size - 1:
            print(f"  category {idx} → dropped (all-zeros vector)")
        else:
            print(f"  category {idx} → vector index {idx}")
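
And if the raw categories are strings or non-contiguous numbers, here is a minimal sketch of the StringIndexer route (the color column is made up for illustration):

from pyspark.ml.feature import StringIndexer

raw_df = spark.createDataFrame([("red",), ("green",), ("blue",), ("red",)], ["color"])

indexer_model = StringIndexer(inputCol="color", outputCol="color_idx").fit(raw_df)

# labels are ordered by frequency (most frequent first); list position = category index
for idx, label in enumerate(indexer_model.labels):
    print(f"  '{label}' → category index {idx}")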

 


🅲 C. Convert SparseVector to DenseVector

Sometimes you want to see the full vector (e.g., [0,1,0]) instead of:

(3,[1],[1.0])

Use a UDF:

import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector

to_dense = F.udf(lambda v: DenseVector(v.toArray()), DenseVector())

df_dense = encoded_df.withColumn("output1_dense", to_dense("output1")) \
                     .withColumn("output2_dense", to_dense("output2"))

df_dense.show(truncate=False)

Example output

+------+------+-------------+-------------+-------------+-------------+
|input1|input2|output1      |output2      |output1_dense|output2_dense|
+------+------+-------------+-------------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |[1.0,0.0]    |[0.0]        |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|[0.0,1.0]    |[1.0]        |
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |[0.0,0.0]    |[0.0]        |
+------+------+-------------+-------------+-------------+-------------+

This makes the encoding much easier to inspect.


🎉 Summary

✔ Keep all categories

dropLast=False

✔ See number of categories per column

encoder_model.categorySizes

✔ Convert SparseVector → DenseVector

DenseVector(v.toArray())

 

When I ran the code snippet in C, I got error:

TypeError: DenseVector.__init__() missing 1 required positional argument: 'ar'

 

You're right: the error happens because DenseVector() is being passed as the UDF's return type.

The second argument to F.udf must be a Spark SQL data type (here VectorUDT()), not the DenseVector class, and calling DenseVector() with no arguments fails anyway because its constructor requires an array (the missing 'ar' argument in the error).

Here is the correct, working version.


Correct way to convert SparseVector → DenseVector in PySpark

Use Vectors.dense() and VectorUDT:

import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

to_dense = F.udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

df_dense = encoded_df.withColumn("output1_dense", to_dense("output1")) \
                     .withColumn("output2_dense", to_dense("output2"))

df_dense.show(truncate=False)

 


✅ Why your version failed

You used:

F.udf(lambda v: DenseVector(v.toArray()), DenseVector())

The lambda body (DenseVector(v.toArray())) is fine; the problem is the second argument: DenseVector is a vector class, not a schema type, so it cannot describe the UDF's return type.

Instead, Spark requires:

  • VectorUDT() for the output type of the UDF

  • Vectors.dense([...]) to construct the actual vector

So the corrected version uses:

Vectors.dense(v.toArray())

and

VectorUDT()

 


🎉 Result

You will now see readable dense vectors, e.g.:

[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
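
As a side note (an alternative, not something the rest of this thread depends on): on Spark 3.0+ you can avoid the Python UDF entirely with the built-in vector_to_array function, which turns a vector column into a plain array column:

from pyspark.ml.functions import vector_to_array

df_dense = encoded_df.withColumn("output1_dense", vector_to_array("output1")) \
                     .withColumn("output2_dense", vector_to_array("output2"))

df_dense.show(truncate=False)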

 

I ran the new code you gave (the Vectors.dense / VectorUDT version) and got the error below:
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 1231, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 1067, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 529, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 90, in read_command
    command = serializer._read_with_length(file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
    return self.loads(obj)
           ^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 472, in loads
    return cloudpickle.loads(obj, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import (
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/base.py", line 40, in <module>
    from pyspark.ml.param import P
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 32, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

But I'm sure numpy is installed. I can run 'import numpy as np' successfully.

 

This error is not caused by your code — it is caused by a missing NumPy inside the PySpark worker environment, even though NumPy exists in your local Python environment.

This is a very common PySpark issue.


❗ Why this error happens

You are running PySpark against a Spark standalone master and worker, and the UDF is executed in separate Python worker processes on the executor side. Those workers use the Python interpreter that the Spark daemons were started with (unless PYSPARK_PYTHON / spark.pyspark.python says otherwise), not the interpreter of your driver shell or notebook, so packages installed only in your virtual environment, such as NumPy, are invisible to them.
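
Besides restarting the daemons from the right environment (which is what ends up working below), you can also point Spark at the virtualenv's interpreter explicitly before the session is created. A sketch, with an illustrative path; depending on how the cluster is launched this may also need to be set before the master/worker daemons start:

import os

# illustrative path: replace with the interpreter inside your virtualenv (e.g. zvenv)
venv_python = "/path/to/zvenv/bin/python"
os.environ["PYSPARK_PYTHON"] = venv_python         # Python used by the worker processes
os.environ["PYSPARK_DRIVER_PYTHON"] = venv_python  # Python used by the driver

# ...then create the SparkSession and run the encoder code as before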

 
I restarted the Spark master and worker from the (zvenv) prompt, and the code ran successfully.
 
posted on 2025-11-23 17:37 by ZhangZhihuiAAA