ZhangZhihui's Blog  

 

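The DataFrame df used in the snippet below can be created like this (a minimal setup sketch; the id values and feature vectors are copied from the output shown after the snippet):

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three rows, each with an id and a 3-dimensional feature vector
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 0.1, -1.0])),
     (1, Vectors.dense([2.0, 1.1, 1.0])),
     (2, Vectors.dense([3.0, 10.1, 3.0]))],
    ["id", "features"],
)
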
from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol='features', outputCol='pca_features')
pca_model = pca.fit(df)
pca_df = pca_model.transform(df)
pca_df.show(truncate=False)

 

+---+--------------+-----------------------------------------+
|id |features      |pca_features                             |
+---+--------------+-----------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.06466700238304013,-0.4536718845187467]|
|1  |[2.0,1.1,1.0] |[-1.6616789696362084,1.2840650302335732] |
|2  |[3.0,10.1,3.0]|[-10.87075006221038,0.19181523649833354] |
+---+--------------+-----------------------------------------+

 

In the PCA transformer:

pca = PCA(k=2, inputCol="features", outputCol="pca_features")

the parameter k means:

k = number of principal components you want to keep


📌 What PCA does in PySpark

PCA takes your original feature vector (maybe 10, 100, or 1000 dimensions) and compresses it into k orthogonal components that capture the most variance in the data.

Example

If your input vector is 5-dimensional:

features = [x1, x2, x3, x4, x5]

and you set:

k = 2

then PCA will output a 2-dimensional compressed vector:

pca_features = [pc1, pc2]

These 2 values are linear combinations of the original 5 values.
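Written out, with placeholder weights w_ij (illustrative only, not computed from any data):

pc1 = w11*x1 + w12*x2 + w13*x3 + w14*x4 + w15*x5
pc2 = w21*x1 + w22*x2 + w23*x3 + w24*x4 + w25*x5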


📌 Meaning of choosing a value for k

k < number of original features

→ dimensionality reduction
→ you keep the most informative components
→ remove noise and redundancy

k = number of original features

→ no reduction; PCA just rotates the axes

k > number of original features

❌ not allowed; Spark will throw an error
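
As a quick check (a sketch, assuming the 3-dimensional df from the first example), asking for more components than features fails when fitting:

# With only 3 input features, k=4 should be rejected at fit time
try:
    PCA(k=4, inputCol="features", outputCol="pca_features").fit(df)
except Exception as e:
    print("fit failed as expected:", type(e).__name__)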


📌 How to choose k?

Common strategies:

1. Based on explained variance

You can examine the explained variance (explainedVariance has one entry per kept component, so to see the full spectrum, fit with k equal to the number of input features):

model = pca.fit(df)
print(model.explainedVariance)

This returns, for example:

[0.92, 0.06, 0.01, 0.01]

→ PC1 explains 92% of the variance
→ PC2 explains 6%
→ PC3–PC4 explain almost nothing

Then choose k = 1 or 2.


📌 Summary

Parameter / property            Meaning
k                               Number of PCA components to return
Output dimension                Equal to k
Constraint                      k must be ≤ the number of input features

If you want, I can also explain:

  • how PCA results are computed

  • how to interpret pca_features

  • how to view the PCA loadings (eigenvectors) in PySpark

Just tell me!

 
You said:
Yes, I'd like that.
 
ChatGPT said:

Great — here is a clear, practical explanation of how to interpret PCA results in PySpark, including:

  1. How PCA computes the result vectors

  2. How to interpret pca_features

  3. How to view PCA loadings (eigenvectors)

  4. A complete example


✅ 1. How PCA computes the results

Given an input vector:

features = [x1, x2, x3, ..., xn]

PCA finds a set of orthogonal directions (principal components):

PC1 = [w11, w12, ..., w1n]
PC2 = [w21, w22, ..., w2n]
...

These are eigenvectors of the covariance matrix.

The PCA output is:

pca_features = [
   PC1 ⋅ features,
   PC2 ⋅ features,
   ...
   PCk ⋅ features
]

This is a projection of your original vector onto the new PCA axes.
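
You can verify this on the first example (a sketch, assuming the pca_model and df fitted at the top of the post):

pc = pca_model.pc.toArray()               # shape (num_features, k); columns are PC1..PCk
for row in df.select("features").collect():
    x = row["features"].toArray()
    print(x @ pc)                         # reproduces the pca_features column row by row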


✅ 2. How to interpret pca_features

Example

If you set:

pca = PCA(k=2, inputCol="features", outputCol="pca_features")

Then each row gets:

3-dimensional input → 2-dimensional PCA output

Example:

input:  [1.0, 2.0, 3.0]
output: [4.53, -0.22]

Interpretation:

  • PC1 value (4.53)
    → projection onto the direction that explains the MOST variance

  • PC2 value (-0.22)
    → projection onto the second-most important direction

So PCA transforms your data into a coordinate system where:

  • PC1 contains the most information

  • PC2 contains the second most

  • PCk contains less important info


✅ 3. How to view PCA loadings (eigenvectors)

PySpark stores eigenvectors in the model:

model = pca.fit(df)
model.pc

This returns a matrix of size:

num_features × k

To view them in a readable form:

pc = model.pc.toArray()   # NumPy array of shape (num_features, k); columns are PC1..PCk
print(pc)

Example output (for k=2 and 3 input features):

[[ 0.57,  0.82],
 [ 0.57, -0.41],
 [ 0.57, -0.41]]

Interpretation:

PC1 loading vector (the first column) = [0.57, 0.57, 0.57]

  • All features contribute equally

  • PC1 measures overall magnitude

PC2 loading vector (the second column) = [0.82, -0.41, -0.41]

  • PC2 separates feature1 from feature2 & feature3

  • PC2 measures differences in the features
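
To print each loading vector on its own line (using the pc array from the snippet above):

# Each column of pc is one principal component's loading vector
for i, loading in enumerate(pc.T):
    print(f"PC{i+1} loadings:", loading)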


✅ 4. How to print explained variance

model.explainedVariance

Example:

[0.92, 0.07, 0.01]

Meaning:

  • PC1 explains 92% of the variance → very important

  • PC2 explains 7% → somewhat important

  • PC3 explains 1% → noise

Often you choose k so that:

sum(variance[:k]) ≥ 0.95
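
A minimal sketch of that rule (assuming the 3-feature df from the examples above; the tmp output column name is just a placeholder):

import numpy as np
from pyspark.ml.feature import PCA

# Fit with k = number of input features so the full variance spectrum is available
full_model = PCA(k=3, inputCol="features", outputCol="tmp").fit(df)
var = full_model.explainedVariance.toArray()

# Smallest k whose cumulative explained variance reaches 95%
k = int(min(np.searchsorted(np.cumsum(var), 0.95) + 1, len(var)))
print("explained variance:", var, "-> choose k =", k)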

 


📌 Complete Practical Example

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (Vectors.dense([1.0, 2.0, 3.0]),),
    (Vectors.dense([4.0, 6.0, 8.0]),),
    (Vectors.dense([9.0, 13.0, 17.0]),)
]

df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)

result = model.transform(df)
result.show(truncate=False)

print("Explained variance:", model.explainedVariance)
print("Principal components:\n", model.pc.toArray())

 


🎉 Summary

Parameter / attribute       Meaning
k                           Number of principal components to keep
model.pc                    Eigenvectors (loadings)
model.explainedVariance     Variance explained by each PC
pca_features                Projection of the original vector onto the PC axes

You now fully understand how PCA in PySpark works.

 

posted on 2025-11-23 21:47  ZhangZhihuiAAA