```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StopWordsRemover, Tokenizer

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
encoder = OneHotEncoder(inputCol='filtered', outputCol='ohe')
lr = LogisticRegression(featuresCol='ohe', labelCol='label')

full_pipeline = Pipeline(stages=[tokenizer, remover, encoder, lr])
model_pipeline = full_pipeline.fit(data)
```
For the above Spark ML code: hasn't the Tokenizer already converted the text to numbers? Why is the OneHotEncoder still needed?
Great question: this is a very common misunderstanding in Spark ML text pipelines.
Short answer
Tokenizer does not convert text to numbers.
It only splits text into strings (tokens).
OneHotEncoder is needed because LogisticRegression requires numeric feature vectors, not strings.
Step-by-step: what each stage really outputs
1. Tokenizer
Input (`text`): `"The cat sat on the mat"` (an illustrative sentence)
Output (`tokens`): `["the", "cat", "sat", "on", "the", "mat"]`
➡️ Still strings, not numbers. Tokenizer only lowercases the text and splits it on whitespace.
2. StopWordsRemover
Output (`filtered`): `["cat", "sat", "mat"]` (default English stop words such as "the" and "on" are dropped)
➡️ Still strings.
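A minimal runnable sketch of these first two stages (the column names match the pipeline above; the sample sentence is only illustrative):

```python
from pyspark.ml.feature import StopWordsRemover, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("The cat sat on the mat",)], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
filtered = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)

filtered.show(truncate=False)
# Both 'tokens' and 'filtered' are array<string> columns -- still no numbers.
```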
3. Why OneHotEncoder cannot work here (important correction)
🚨 This is actually incorrect Spark ML usage.
OneHotEncoder does NOT accept string tokens.
It expects a column of numeric category indices (typically produced by StringIndexer), not an `array<string>` column.
So this pipeline would fail at runtime, as soon as Spark validates the encoder's input column type.
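You can confirm this directly (a sketch; the exact exception text varies by Spark version):

```python
# With the stages defined as in the question's code:
try:
    full_pipeline.fit(data)
except Exception as e:
    # OneHotEncoder rejects the array<string> 'filtered' column
    # during schema validation, before any training happens.
    print(type(e).__name__)
```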
The missing step: CountVectorizer or HashingTF
In Spark ML, text → numeric features is done with:
Option A: CountVectorizer (most common)
This converts tokens into a sparse numeric vector:
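For example (a sketch; `filtered` is the token DataFrame from the stages above, and `features` is an illustrative output column name):

```python
from pyspark.ml.feature import CountVectorizer

# Learns a vocabulary from the 'filtered' tokens, then maps each
# document to a sparse vector of token counts.
cv = CountVectorizer(inputCol="filtered", outputCol="features")
cv_model = cv.fit(filtered)
cv_model.transform(filtered).select("features").show(truncate=False)
```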
Option B: HashingTF
HashingTF maps tokens into a fixed-size feature vector by hashing. There is no vocabulary to fit, which makes it faster, at the cost of possible hash collisions:
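A sketch using the same `filtered` DataFrame:

```python
from pyspark.ml.feature import HashingTF

# Hashes each token into one of numFeatures buckets; no fit step needed.
htf = HashingTF(inputCol="filtered", outputCol="features", numFeatures=1 << 18)
htf.transform(filtered).select("features").show(truncate=False)
```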
When OneHotEncoder is used in NLP pipelines
OneHotEncoder is used after StringIndexer, for categorical (non-text) columns (see the sketch after this list):
Typical uses
- Country
- Gender
- Product type
- Document class (not text)
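A hedged sketch of that pattern (the `country` column and its values are made up for illustration):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

cat_df = spark.createDataFrame([("US",), ("FR",), ("US",)], ["country"])

# StringIndexer maps each category to a numeric index;
# OneHotEncoder then expands that index into a sparse vector.
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_ohe"])

pipe = Pipeline(stages=[indexer, encoder]).fit(cat_df)
pipe.transform(cat_df).show(truncate=False)
```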
Correct NLP pipeline for Logistic Regression
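A corrected version of the original pipeline, with CountVectorizer in place of OneHotEncoder (a sketch; the column names follow the question's code, and `data` is assumed to have `text` and `label` columns):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, Tokenizer

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
vectorizer = CountVectorizer(inputCol='filtered', outputCol='features')
lr = LogisticRegression(featuresCol='features', labelCol='label')

full_pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lr])
model_pipeline = full_pipeline.fit(data)
```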
Key takeaway
| Stage | Output type |
|---|---|
| Tokenizer | array<string> |
| StopWordsRemover | array<string> |
| CountVectorizer / HashingTF | vector (numeric) |
| OneHotEncoder | numeric category index → vector |
Final answer
👉 Tokenizer does not convert text to numbers.
👉 OneHotEncoder is not meant for tokenized text.
👉 CountVectorizer or HashingTF is the correct component for NLP features.
