ZhangZhihui's Blog  

 

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
encoder = OneHotEncoder(inputCol='filtered', outputCol='ohe')

lr = LogisticRegression(featuresCol='ohe', labelCol='label')
full_pipeline = Pipeline(stages=[tokenizer, remover, encoder, lr])
model_pipeline = full_pipeline.fit(data)

In the Spark ML code above, hasn't the Tokenizer already converted the text to numbers? Why is the OneHotEncoder still needed?

 

Great question — this is a very common misunderstanding in Spark NLP pipelines.

Short answer

Tokenizer does not convert text to numbers.
It only splits text into strings (tokens).
OneHotEncoder is needed because LogisticRegression requires numeric feature vectors, not strings.


Step-by-step: what each stage really outputs

1. Tokenizer

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')

Input

"I love Spark ML"

Output (tokens)

["I", "love", "Spark", "ML"]

➡️ Still strings, not numbers.
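For intuition, Tokenizer behaves roughly like this plain-Python sketch (this mimics the behavior and does not use Spark itself):

```python
def tokenize(text: str) -> list[str]:
    # Mimics pyspark.ml.feature.Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

print(tokenize("I love Spark ML"))  # ['i', 'love', 'spark', 'ml']
```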


2. StopWordsRemover

remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')

Output (filtered)

["love", "Spark", "ML"]

➡️ Still strings.
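Again as a plain-Python sketch (Spark ships a much larger default English stop-word list; this tiny set is only for illustration):

```python
# Tiny illustrative stop-word list; not Spark's actual default list.
STOP_WORDS = {"i", "a", "an", "the", "and", "of", "to", "is"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Mimics pyspark.ml.feature.StopWordsRemover: drop tokens in the list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["i", "love", "spark", "ml"]))  # ['love', 'spark', 'ml']
```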


3. Why OneHotEncoder cannot work here (important correction)

encoder = OneHotEncoder(inputCol='filtered', outputCol='ohe')

🚨 This is actually incorrect Spark ML usage

OneHotEncoder does NOT accept arrays of string tokens.

It expects:

inputCol: a numeric column of category indices (usually the output of StringIndexer)

So this pipeline would fail at runtime.


The missing step: CountVectorizer or HashingTF

In Spark ML, text → numeric features is done with:

Option A: CountVectorizer (most common)

vectorizer = CountVectorizer(
    inputCol='filtered',
    outputCol='features'
)

This converts tokens into a sparse numeric vector of per-token counts over the learned vocabulary:

["love", "spark", "ml"] → [0, 1, 0, 1, 1, 0, ...]

Option B: HashingTF

hashingTF = HashingTF(
    inputCol='filtered',
    outputCol='features'
)
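HashingTF skips the vocabulary entirely and buckets tokens by hash. A rough sketch of the hashing trick (Spark uses MurmurHash3 with 2**18 features by default; CRC32 is used here only to keep the sketch deterministic):

```python
from zlib import crc32

def hashing_tf(tokens: list[str], num_features: int = 16) -> list[int]:
    # Sketch of the hashing trick: bucket = hash(token) mod num_features,
    # then count occurrences per bucket.
    vec = [0] * num_features
    for t in tokens:
        vec[crc32(t.encode()) % num_features] += 1
    return vec

vec = hashing_tf(["love", "spark", "ml"])
print(sum(vec))  # three tokens land in (at most) three buckets
```

The upside is no fit step and bounded memory; the downside is possible hash collisions between different tokens.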

When OneHotEncoder is used in NLP pipelines

OneHotEncoder is used after StringIndexer, for categorical (non-text) columns:

indexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
encoder = OneHotEncoder(inputCol='categoryIndex', outputCol='categoryVec')

Typical uses

  • Country

  • Gender

  • Product type

  • Document class (not text)
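The StringIndexer → OneHotEncoder pair can be sketched in plain Python like this (assuming Spark's defaults: StringIndexer orders categories by descending frequency, and OneHotEncoder's dropLast=True encodes the last category as the all-zeros vector):

```python
from collections import Counter

def string_index(values: list[str]) -> tuple[list[int], dict[str, int]]:
    # Mimics StringIndexer: index 0 goes to the most frequent category.
    order = [v for v, _ in Counter(values).most_common()]
    mapping = {v: i for i, v in enumerate(order)}
    return [mapping[v] for v in values], mapping

def one_hot(index: int, num_categories: int) -> list[int]:
    # Mimics OneHotEncoder with dropLast=True: vector has one fewer slot
    # than there are categories; the last category is all zeros.
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec

indexed, mapping = string_index(["US", "DE", "US", "FR"])
print([one_hot(i, len(mapping)) for i in indexed])
# [[1, 0], [0, 1], [1, 0], [0, 0]]
```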


Correct NLP pipeline for Logistic Regression

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
vectorizer = CountVectorizer(inputCol='filtered', outputCol='features')

lr = LogisticRegression(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lr])
model = pipeline.fit(data)

Key takeaway

Stage                         Output type
Tokenizer                     array<string>
StopWordsRemover              array<string>
CountVectorizer / HashingTF   vector (numeric)
OneHotEncoder                 categorical numeric → vector

Final answer

👉 Tokenizer does not convert text to numbers.
👉 OneHotEncoder is not meant for tokenized text.
👉 CountVectorizer or HashingTF is the correct component for NLP features.

 

posted on 2025-12-18 09:21  ZhangZhihuiAAA