ZhangZhihui's Blog  

November 27, 2025

Abstract: The Silhouette Score is a metric used to evaluate the quality of a clustering result. It measures how well each data point fits within its assigned cluster… Read more
posted @ 2025-11-27 09:29 ZhangZhihuiAAA Views(4) Comments(0) Likes(0)
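The truncated summary above describes the Silhouette Score; as a minimal pure-Python sketch of the metric (the toy points and labels here are made up for illustration, not taken from the post):

```python
import math

def silhouette_score(points, labels):
    # Group point indices by cluster label.
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        same = [j for j in clusters[lab] if j != i]
        if not same:                 # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in same) / len(same)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters should score close to 1.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
score = silhouette_score(pts, [0, 0, 1, 1])
```

Per point, a is the mean distance to its own cluster and b the mean distance to the nearest other cluster; scores near 1 indicate tight, well-separated clusters, scores near 0 overlapping clusters, and negative scores likely misassignment.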

November 26, 2025

Abstract: newsgroups = newsgroups.drop('tf_features') # dropping a non-existent column doesn't cause an error… Read more
posted @ 2025-11-26 15:15 ZhangZhihuiAAA Views(5) Comments(0) Likes(0)
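For contrast with the snippet above: PySpark's DataFrame.drop silently ignores columns that don't exist. A loose pure-Python analogue of that drop-if-present behavior, using a hypothetical dict standing in for a DataFrame row:

```python
# Hypothetical row as a plain dict standing in for a Spark DataFrame row.
row = {"id": 1, "text": "spark"}

# Remove a key that may not exist, without raising. This mirrors the
# forgiving semantics of PySpark's DataFrame.drop for missing columns
# (pandas, by contrast, raises KeyError unless errors="ignore").
row.pop("tf_features", None)
```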
 
Abstract: from pyspark.sql.functions import expr df_filtered = df_filtered.withColumn('filtered_array', expr('filter(filtered_doc, x -> len(x) >= 4)')) Please h… Read more
posted @ 2025-11-26 11:03 ZhangZhihuiAAA Views(2) Comments(0) Likes(0)
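The snippet above filters a Spark array column with a SQL higher-order function. A pure-Python sketch of the same predicate (the token list is hypothetical); note that Spark SQL's string-length function is normally `length`, so whether `len(x)` resolves inside `expr` may depend on the Spark version:

```python
# Hypothetical token array standing in for the filtered_doc column.
filtered_doc = ["spark", "ml", "sql", "feature", "expr"]

# Keep only tokens at least 4 characters long, the same predicate as
# the SQL lambda x -> length(x) >= 4 in the Spark snippet.
filtered_array = [tok for tok in filtered_doc if len(tok) >= 4]
```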
 
Abstract: In PostgreSQL, you can convert a timestamp to a date (i.e., drop hours/minutes/seconds) in several common ways: ✅ 1. Cast to date, the fastest and simplest… Read more
posted @ 2025-11-26 08:29 ZhangZhihuiAAA Views(14) Comments(0) Likes(0)
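As a quick Python analogue of the PostgreSQL ts::date cast described above (the timestamp value here is made up):

```python
from datetime import datetime

# Hypothetical timestamp; ts.date() is the Python analogue of
# PostgreSQL's ts::date (CAST(ts AS date)): keep the calendar date,
# discard hours/minutes/seconds.
ts = datetime(2025, 11, 26, 8, 29, 15)
d = ts.date()
```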

November 23, 2025

Abstract: from pyspark.ml.feature import PolynomialExpansion data = [ (0, Vectors.dense([1.0, 2.0])), (1, Vectors.dense([2.0, 3.0])), (2, Vectors.dense([3.0, 4.… Read more
posted @ 2025-11-23 22:13 ZhangZhihuiAAA Views(6) Comments(0) Likes(0)
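A pure-Python sketch of what a degree-2 polynomial expansion produces for a 2-feature vector; the output ordering shown (no bias term) mirrors my understanding of Spark's PolynomialExpansion, so treat it as illustrative rather than authoritative:

```python
def poly_expand_degree2(x, y):
    # Degree-2 polynomial expansion of a 2-feature vector [x, y].
    # Assumed ordering, no bias term: [x, x^2, y, x*y, y^2].
    return [x, x * x, y, x * y, y * y]

# Expanding the first row of the post's toy data, [1.0, 2.0].
expanded = poly_expand_degree2(1.0, 2.0)
```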
 
Abstract: from pyspark.ml.feature import PCA pca = PCA(k=2, inputCol='features', outputCol='pca_features') pca_model = pca.fit(df) pca_df = pca_model.transform(… Read more
posted @ 2025-11-23 21:47 ZhangZhihuiAAA Views(5) Comments(0) Likes(0)
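PCA projects centered data onto the directions of maximal variance. A self-contained pure-Python sketch for 2-D points (toy data; the 2x2 eigenproblem is solved in closed form, which is why no Spark or NumPy is needed):

```python
import math

# Toy 2-D dataset, not taken from the post.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 6.0)]

# Center the data on its mean.
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Sample covariance matrix entries [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Largest eigenvalue of the 2x2 symmetric matrix, in closed form.
tr, det = sxx + syy, sxx * syy - sxy * sxy
lam = tr / 2 + math.sqrt(tr * tr / 4 - det)

# Corresponding (unit) eigenvector = first principal component.
vx, vy = lam - syy, sxy
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each centered point onto the first principal component.
pc1 = [x * vx + y * vy for x, y in centered]
```

Spark's PCA estimator does the same conceptually: fit learns the principal axes from the feature column, transform projects each row onto the top k of them.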
 
Abstract: from pyspark.ml.feature import Normalizer normalizer = Normalizer(inputCol='features', outputCol='normalized_features', p=2.0) normalized_df = normali… Read more
posted @ 2025-11-23 21:19 ZhangZhihuiAAA Views(4) Comments(0) Likes(0)
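Normalizer with p=2.0 rescales each row vector to unit L2 norm. A one-function pure-Python sketch of that per-row operation:

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit L2 norm, what Normalizer(p=2.0) does per row.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# The classic 3-4-5 example: result has length 1.
normalized = l2_normalize([3.0, 4.0])
```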
 
Abstract: from pyspark.ml.feature import MinMaxScaler scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features') scaler_model = scaler.fit(df) scal… Read more
posted @ 2025-11-23 21:05 ZhangZhihuiAAA Views(4) Comments(0) Likes(0)
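MinMaxScaler learns per-column min/max in fit and rescales to [0, 1] in transform. A pure-Python sketch of that two-step pattern for a single column (toy values):

```python
def fit_min_max(column):
    # "fit": learn the column's min and max.
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    # "transform": rescale each value to [0, 1] via (x - min) / (max - min).
    return [(x - lo) / (hi - lo) for x in column]

col = [1.0, 2.0, 3.0]
lo, hi = fit_min_max(col)
scaled = transform_min_max(col, lo, hi)
```

Splitting fit from transform matters in practice: the min/max learned on training data should be reused on new data rather than recomputed.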
 
Abstract: from pyspark.ml.feature import OneHotEncoder data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)] columns = ["input1", "input2"] df = spark.createDataFrame(dat… Read more
posted @ 2025-11-23 17:37 ZhangZhihuiAAA Views(7) Comments(0) Likes(0)
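Spark's OneHotEncoder maps a category index to a vector with a single 1, and by default (dropLast=True) encodes the last category as all zeros. A dense pure-Python sketch of that behavior:

```python
def one_hot(index, num_categories, drop_last=True):
    # Dense version of the encoder's output: a single 1.0 at `index`.
    # With drop_last (Spark's dropLast=True default) the last category
    # is encoded as the all-zeros vector.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Encode the three categories 0, 1, 2 of a column like input1.
codes = [one_hot(i, 3) for i in (0, 1, 2)]
```

Dropping the last category avoids perfectly collinear features in linear models; Spark's actual output is a sparse vector, but the values are the same.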
 
Abstract: In Spark, I ran the code snippet below: # The vocabSize parameter was set to 7 (0 to 6, a total of 7 words), # meaning that the vocabulary size (unique… Read more
posted @ 2025-11-23 10:47 ZhangZhihuiAAA Views(5) Comments(0) Likes(0)
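The vocabSize parameter caps CountVectorizer's vocabulary at the most frequent terms. A pure-Python sketch of that fit/transform behavior (toy documents; term order for ties is illustrative only):

```python
from collections import Counter

def fit_vocab(docs, vocab_size):
    # Keep only the vocab_size most frequent terms, mirroring the
    # vocabSize cap discussed in the post.
    counts = Counter(tok for doc in docs for tok in doc)
    return [term for term, _ in counts.most_common(vocab_size)]

def to_counts(doc, vocab):
    # Term-count vector for one document over the learned vocabulary;
    # terms outside the vocabulary are simply dropped.
    c = Counter(doc)
    return [float(c[term]) for term in vocab]

docs = [["a", "b", "a"], ["b", "c"]]
vocab = fit_vocab(docs, vocab_size=2)   # "c" falls outside the cap
vec = to_counts(["a", "a", "c"], vocab)
```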