from pyspark.ml.feature import OneHotEncoder

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"]
)
encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)
+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+
OneHotEncoder output format
Each encoded column is a SparseVector, shown as (size, [indices], [values]):
- size = number of categories minus 1 (because OneHotEncoder drops the last category by default to avoid multicollinearity)
- indices = which category this row belongs to (empty for the dropped last category)
- values = always [1.0] when an index is present
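To make the size / indices / values notation concrete, here is a minimal plain-Python sketch that expands such a triple into a full vector. The helper name `sparse_to_dense` is hypothetical and not part of PySpark; it only mimics how the printed form maps to a dense list.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size              # start with all zeros
    for i, v in zip(indices, values):
        dense[i] = float(v)           # set only the stored positions
    return dense


print(sparse_to_dense(2, [0], [1.0]))  # [1.0, 0.0]
print(sparse_to_dense(2, [], []))      # [0.0, 0.0]
```

An empty index list therefore means "all zeros", which is how the dropped last category appears in the output.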
📌 Vocabulary sizes in your case
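From the output:
- For output1: vector size 2, so input1 has 3 categories (0, 1, 2)
- For output2: vector size 1, so input2 has 2 categories (0, 1)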
🔍 Now decode your rows
▶ Row 1
- input1=0 → category index 0 → one-hot at index 0 → (2,[0],[1.0])
- input2=1 is the last category, so the encoder drops it → all zeros: (1,[],[])
▶ Row 2
- input1=1 → index 1 → (2,[1],[1.0])
- input2=0 → index 0 → (1,[0],[1.0])
▶ Row 3
- input1=2 → last category, dropped → (2,[],[])
- input2=1 → last category, dropped → (1,[],[])
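The row-by-row decoding above follows one rule, which can be sketched in plain Python. This is an illustration of the drop-last behaviour, not PySpark's implementation; the function name `one_hot_drop_last` is hypothetical, and it assumes the input value is already the category index.

```python
def one_hot_drop_last(category_index, num_categories):
    """Return the (size, indices, values) triple for one category index,
    with the last category encoded as all zeros."""
    if not 0 <= category_index < num_categories:
        raise ValueError("category index out of range")
    size = num_categories - 1
    if category_index < size:
        return (size, [category_index], [1.0])
    return (size, [], [])             # last category: no stored entries


# The three rows from the example: input1 has 3 categories, input2 has 2
for input1, input2 in [(0, 1), (1, 0), (2, 1)]:
    print(one_hot_drop_last(input1, 3), one_hot_drop_last(input2, 2))
```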
🎯 Summary
output1
- 3 categories in input1 → output vector size = 2
- Encoded as: 0 → (2,[0],[1.0]), 1 → (2,[1],[1.0]), 2 → (2,[],[])
output2
- 2 categories in input2 → output vector size = 1
- Encoded as: 0 → (1,[0],[1.0]), 1 → (1,[],[])
You said:
When I ran the code snippet in C, I got an error:
TypeError: DenseVector.__init__() missing 1 required positional argument: 'ar'
ChatGPT said:
You're right. The error happens because DenseVector() cannot be called directly as a type in a PySpark UDF.
PySpark expects a data type (VectorUDT) rather than the DenseVector constructor.
Here is the correct, working version.
✅ Correct way to convert SparseVector → DenseVector in PySpark
Use Vectors.dense() and VectorUDT:
✅ Why your version failed
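You used DenseVector() as the type of the UDF, which calls the constructor with no arguments and raises the TypeError above.
Instead, Spark requires:
- VectorUDT() for the output type of the UDF
- Vectors.dense([...]) to construct the actual vector
So the corrected version uses Vectors.dense(...) inside the UDF body and VectorUDT() as the declared return type.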
🎉 Result
You will now see readable dense vectors, e.g. [1.0,0.0] in place of (2,[0],[1.0]).