
 

from pyspark.ml.feature import OneHotEncoder

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"]
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

 

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+

 

 

OneHotEncoder output format

Each encoded column is a SparseVector, shown as:

(size, [indices], [values])

Where:

  • size = the length of the one-hot vector: with the default dropLast=True this is
    the number of distinct categories minus 1 (the last category is dropped to
    avoid multicollinearity)

  • indices = the position that is "hot", i.e. which category this row belongs to

  • values = [1.0] for that position; the dropped category shows up as empty
    indices/values, i.e. an all-zeros vector
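
As a quick illustration (this vector is built by hand, not taken from the encoder), you can construct the same sparse representation yourself and expand it:

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(2, [0], [1.0])   # (size, [indices], [values])
print(v)            # (2,[0],[1.0])
print(v.toArray())  # [1. 0.]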


📌 Vocabulary sizes in your case

From the output:

For output1

(2,[0],[1.0]) → size = 2  

Meaning input1 has 3 categories:

0.0 → index 0
1.0 → index 1
2.0 → index dropped (all zeros)

For output2

(1,[0],[1.0]) → size = 1

Meaning input2 has 2 categories:

0.0 → index 0
1.0 → dropped (all zeros)

 


🔍 Now decode your rows


Row 1

input1 = 0.0
input2 = 1.0

output1 = (2,[0],[1.0])
output2 = (1,[],[])
  • input1=0 → category index 0 → one-hot at index 0
    (2,[0],[1.0])

  • input2=1 is the last category, so the encoder drops it
    → all zeros (1,[],[])


Row 2

input1 = 1.0
input2 = 0.0

output1 = (2,[1],[1.0])
output2 = (1,[0],[1.0])
  • input1=1 → index 1 → (2,[1],[1.0])

  • input2=0 → index 0 → (1,[0],[1.0])


Row 3

input1 = 2.0
input2 = 1.0

output1 = (2,[],[])
output2 = (1,[],[])
  • input1=2 → last category, dropped → (2,[],[])

  • input2=1 → last category, dropped → (1,[],[])
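
A quick way to double-check these decodings programmatically (collect() is fine here because the DataFrame has only three rows):

for row in encoded_df.collect():
    # each output column comes back as a SparseVector; toArray() expands it
    print(row.input1, row.output1.toArray(), row.input2, row.output2.toArray())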


🎯 Summary

output1

  • 3 categories in input1 → output vector size = 2

  • Encoded as:

    input1=0.0 → (2,[0],[1.0])
    input1=1.0 → (2,[1],[1.0])
    input1=2.0 → (2,[],[])     # dropped category

output2

  • 2 categories in input2 → output vector size = 1

  • Encoded as:

    input2=0.0 → (1,[0],[1.0])
    input2=1.0 → (1,[],[])     # dropped category

If you want, I can also show you how to:

✔ keep all categories (no drop)
✔ extract the category → vector index mapping
✔ convert SparseVector to dense vector for readability

 
You said:
Yes, I'd like that.
 
ChatGPT said:

Great — here are the three things you asked for, clearly explained with ready-to-use PySpark code:

(A) Keep all categories (no drop)
(B) Get the category → vector index mapping
(C) Convert SparseVector to DenseVector for readability


🅰 A. Keep all categories (no drop)

By default, PySpark OneHotEncoder drops the last category to avoid multicollinearity (dummy variable trap).

To keep all categories:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
    dropLast=False   # ← keep every category
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

If a column has k categories, the output vector will have k elements (instead of k-1).
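
With dropLast=False the vector sizes grow from 2 to 3 and from 1 to 2, and every category gets its own slot, so the output should look roughly like:

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(3,[0],[1.0])|(2,[1],[1.0])|
|1.0   |0.0   |(3,[1],[1.0])|(2,[0],[1.0])|
|2.0   |1.0   |(3,[2],[1.0])|(2,[1],[1.0])|
+------+------+-------------+-------------+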


🅱 B. Get category → vector index mapping

After fitting:

encoder_model.categorySizes

This returns the number of categories for each input column.

Example output:

[3, 2]

Meaning:

  • input1 has 3 distinct categories → output1 vector length = 2 (or 3 if dropLast=False)

  • input2 has 2 distinct categories → output2 vector length = 1 (or 2 if dropLast=False)

You can also get the full mapping:

If the input columns were originally strings that went through StringIndexer before encoding,
you can read the labels off the fitted StringIndexerModel.

But since your inputs are already numeric category indices, OneHotEncoder uses the value itself as the vector index.

So the mapping is simply:

Category value → vector index

Note that OneHotEncoder does not remap values that are not dense (e.g., 10, 20, 50): the vector size is taken from the largest index seen during fitting, so sparse values produce a long, mostly empty vector. For data like that, run StringIndexer first to get dense 0-based indices (a sketch follows the mapping code below).

To see the category ranges explicitly:

for i, size in enumerate(encoder_model.categorySizes):
    print(f"Column {i} categories: 0..{size - 1}")

Example output:

Column 0 categories: 0..2
Column 1 categories: 0..1

If you want a fully explicit mapping:

drop_last = encoder.getDropLast()  # the dropped last category has no vector index
for col, size in zip(["input1", "input2"], encoder_model.categorySizes):
    print(f"Mapping for {col}:")
    for idx in range(size):
        if drop_last and idx == size - 1:
            print(f"  category {idx} → dropped (all-zeros vector)")
        else:
            print(f"  category {idx} → vector index {idx}")
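
And if the raw categories are strings or non-contiguous numbers, here is a minimal sketch of the StringIndexer route (the color column is made up for illustration):

from pyspark.ml.feature import StringIndexer

raw_df = spark.createDataFrame([("red",), ("green",), ("blue",), ("red",)], ["color"])

indexer_model = StringIndexer(inputCol="color", outputCol="color_idx").fit(raw_df)

# labels are ordered by frequency (most frequent first); list position = category index
for idx, label in enumerate(indexer_model.labels):
    print(f"  '{label}' → category index {idx}")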

 


🅲 C. Convert SparseVector to DenseVector

Sometimes you want to see the full vector (e.g., [0,1,0]) instead of:

(3,[1],[1.0])

Use a UDF:

import pyspark.sql.functions as F
from pyspark.ml.linalg import DenseVector

to_dense = F.udf(lambda v: DenseVector(v.toArray()), DenseVector())

df_dense = encoded_df.withColumn("output1_dense", to_dense("output1")) \
                     .withColumn("output2_dense", to_dense("output2"))

df_dense.show(truncate=False)

Example output

+------+------+-------------+-------------+-------------+-------------+
|input1|input2|output1      |output2      |output1_dense|output2_dense|
+------+------+-------------+-------------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |[1.0,0.0]    |[0.0]        |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|[0.0,1.0]    |[1.0]        |
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |[0.0,0.0]    |[0.0]        |
+------+------+-------------+-------------+-------------+-------------+

This makes the encoding much easier to inspect.


🎉 Summary

✔ Keep all categories

dropLast=False

✔ See number of categories per column

encoder_model.categorySizes

✔ Convert SparseVector → DenseVector

DenseVector(v.toArray())

 

When I ran the code snippet in C, I got error:

TypeError: DenseVector.__init__() missing 1 required positional argument: 'ar'

 

You're right: the error happens because DenseVector() is being passed as the UDF's return type.

The second argument to F.udf must be a Spark SQL data type (here VectorUDT()), not the DenseVector class, and calling DenseVector() with no arguments fails anyway because its constructor requires an array (the missing 'ar' argument in the error).

Here is the correct, working version.


Correct way to convert SparseVector → DenseVector in PySpark

Use Vectors.dense() and VectorUDT:

import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

to_dense = F.udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

df_dense = encoded_df.withColumn("output1_dense", to_dense("output1")) \
                     .withColumn("output2_dense", to_dense("output2"))

df_dense.show(truncate=False)

 


✅ Why your version failed

You used:

F.udf(lambda v: DenseVector(v.toArray()), DenseVector())

The lambda body (DenseVector(v.toArray())) is fine; the problem is the second argument: DenseVector is a vector class, not a schema type, so it cannot describe the UDF's return type.

Instead, Spark requires:

  • VectorUDT() for the output type of the UDF

  • Vectors.dense([...]) to construct the actual vector

So the corrected version uses:

Vectors.dense(v.toArray())

and

VectorUDT()

 


🎉 Result

You will now see readable dense vectors, e.g.:

[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
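
As a side note (an alternative, not something the rest of this thread depends on): on Spark 3.0+ you can avoid the Python UDF entirely with the built-in vector_to_array function, which turns a vector column into a plain array column:

from pyspark.ml.functions import vector_to_array

df_dense = encoded_df.withColumn("output1_dense", vector_to_array("output1")) \
                     .withColumn("output2_dense", vector_to_array("output2"))

df_dense.show(truncate=False)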

 

I ran the new code you gave (the Vectors.dense / VectorUDT version) and got the error below:
PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 1231, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 1067, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 529, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 90, in read_command
    command = serializer._read_with_length(file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
    return self.loads(obj)
           ^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 472, in loads
    return cloudpickle.loads(obj, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import (
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/base.py", line 40, in <module>
    from pyspark.ml.param import P
  File "/opt/spark-3.5.7-bin-hadoop3/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 32, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

But I'm sure numpy is installed. I can run 'import numpy as np' successfully.

 

This error is not caused by your code — it is caused by a missing NumPy inside the PySpark worker environment, even though NumPy exists in your local Python environment.

This is a very common PySpark issue.


❗ Why this error happens

You are running PySpark against a Spark standalone master and worker, and the UDF is executed in separate Python worker processes on the executor side. Those workers use the Python interpreter that the Spark daemons were started with (unless PYSPARK_PYTHON / spark.pyspark.python says otherwise), not the interpreter of your driver shell or notebook, so packages installed only in your virtual environment, such as NumPy, are invisible to them.
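
Besides restarting the daemons from the right environment (which is what ends up working below), you can also point Spark at the virtualenv's interpreter explicitly before the session is created. A sketch, with an illustrative path; depending on how the cluster is launched this may also need to be set before the master/worker daemons start:

import os

# illustrative path: replace with the interpreter inside your virtualenv (e.g. zvenv)
venv_python = "/path/to/zvenv/bin/python"
os.environ["PYSPARK_PYTHON"] = venv_python         # Python used by the worker processes
os.environ["PYSPARK_DRIVER_PYTHON"] = venv_python  # Python used by the driver

# ...then create the SparkSession and run the encoder code as before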

 
I restarted the Spark master and worker from the (zvenv) prompt, and the code ran successfully.
 
posted on 2025-11-23 17:37 by ZhangZhihuiAAA