
 

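The snippet below assumes a DataFrame df with a vector column named features already exists. A minimal setup, with the values reconstructed from the show() output further down:

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three rows of 3-dimensional dense vectors (values taken from the output below)
df = spark.createDataFrame(
    [(0, Vectors.dense(1.0, 0.1, -1.0)),
     (1, Vectors.dense(2.0, 1.1, 1.0)),
     (2, Vectors.dense(3.0, 10.1, 3.0))],
    ['id', 'features'])
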
from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df)
scaled_df = scaler_model.transform(df)
scaled_df.show(truncate=False)

 

+---+--------------+---------------+
|id |features      |scaled_features|
+---+--------------+---------------+
|0  |[1.0,0.1,-1.0]|(3,[],[])      |
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]  |
|2  |[3.0,10.1,3.0]|[1.0,1.0,1.0]  |
+---+--------------+---------------+

Please help me understand the values of the scaled_features column.

 

Here is what’s happening in your scaled_features column and why you see values like (3,[],[]) for the first row.


✅ How MinMaxScaler Works

MinMaxScaler rescales each feature column-wise to the range [0, 1]:

scaled_x = (x − min) / (max − min)

Applied per column, not per row.
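
The same per-column rescaling is easy to sketch in plain Python; this is a minimal illustration of the formula, not Spark's implementation:

# Rescale one feature column to [0, 1] using its own min and max
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_scale([1.0, 2.0, 3.0]))   # [0.0, 0.5, 1.0]
print(min_max_scale([0.1, 1.1, 10.1]))  # [0.0, 0.1, 1.0]
print(min_max_scale([-1.0, 1.0, 3.0]))  # [0.0, 0.5, 1.0]

(Unlike this sketch, Spark also handles a constant column, where max − min is zero: per the Spark docs it rescales such entries to the midpoint of the output range instead of dividing by zero.)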


📌 Your Input

Row 0: [1.0, 0.1, -1.0]
Row 1: [2.0, 1.1,  1.0]
Row 2: [3.0,10.1,  3.0]

Let’s compute min and max for each column:

Feature Index | Values         | Min  | Max  | Range (Max − Min)
--------------+----------------+------+------+------------------
0             | 1.0, 2.0, 3.0  | 1.0  | 3.0  | 2.0
1             | 0.1, 1.1, 10.1 | 0.1  | 10.1 | 10.0
2             | -1.0, 1.0, 3.0 | -1.0 | 3.0  | 4.0
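
You don't have to compute these statistics by hand; the fitted MinMaxScalerModel exposes what it learned during fit():

# Per-column minima and maxima learned from df (continuing the session above)
print(scaler_model.originalMin)  # [1.0,0.1,-1.0]
print(scaler_model.originalMax)  # [3.0,10.1,3.0]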

📌 Compute Scaled Values

Row 1 (id = 1) → [2.0, 1.1, 1.0]

Index 0: (2.0 − 1.0) / (3.0 − 1.0) = 1/2 = 0.5
Index 1: (1.1 − 0.1) / (10.1 − 0.1) = 1/10 = 0.1
Index 2: (1.0 − (−1.0)) / (3.0 − (−1.0)) = 2/4 = 0.5

✔ Output: [0.5, 0.1, 0.5]
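
Each division checks out in a plain Python shell:

>>> (2.0 - 1.0) / (3.0 - 1.0)
0.5
>>> (1.1 - 0.1) / (10.1 - 0.1)
0.1
>>> (1.0 - (-1.0)) / (3.0 - (-1.0))
0.5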

Row 2 (id = 2) → [3.0, 10.1, 3.0]

All are max values → scale to 1:

[1.0, 1.0, 1.0]


❗ Why Row 0 → (3,[],[]) (all zeros)

For row 0:

[1.0, 0.1, -1.0]

Index 0: (1.0 − 1.0) / 2.0 = 0
Index 1: (0.1 − 0.1) / 10.0 = 0
Index 2: (−1.0 − (−1.0)) / 4.0 = 0

👉 All scaled values are 0:

[0.0, 0.0, 0.0]

But Spark stores a vector in whichever representation is more compact, and for an all-zero vector that is the SparseVector format:

(3, [], [])

This means:

  • size = 3
  • non-zero indices: [] (none)
  • non-zero values: [] (none)

This is equivalent to:

[0.0, 0.0, 0.0]
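
You can reproduce this representation directly:

from pyspark.ml.linalg import SparseVector

sv = SparseVector(3, [], [])  # size 3, no non-zero entries
print(sv)            # (3,[],[])
print(sv.toArray())  # [0. 0. 0.]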

✅ Summary

id | features       | scaled_features | Meaning
---+----------------+-----------------+-----------------------
0  | [1.0,0.1,-1.0] | (3,[],[])       | all zeros → min values
1  | [2.0,1.1,1.0]  | [0.5,0.1,0.5]   | mid-range values
2  | [3.0,10.1,3.0] | [1.0,1.0,1.0]   | max values
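
If the mixed dense/sparse display is confusing, Spark 3.0+ provides vector_to_array to show every row as a plain array:

from pyspark.ml.functions import vector_to_array

scaled_df.select('id', vector_to_array('scaled_features').alias('scaled')) \
         .show(truncate=False)
# Row id=0 now displays as [0.0, 0.0, 0.0] instead of (3,[],[])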

 
