ZhangZhihui's Blog  

 

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
encoder = OneHotEncoder(inputCol='filtered', outputCol='ohe')

lr = LogisticRegression(featuresCol='ohe', labelCol='label')
full_pipeline = Pipeline(stages=[tokenizer, remover, encoder, lr])
model_pipeline = full_pipeline.fit(data)

In the Spark ML code above, hasn't the Tokenizer already converted the text to numbers? Why is the OneHotEncoder still needed?

 

Great question — this is a very common misunderstanding in Spark NLP pipelines.

Short answer

Tokenizer does not convert text to numbers.
It only splits text into strings (tokens).
OneHotEncoder is needed because LogisticRegression requires numeric feature vectors, not strings.


Step-by-step: what each stage really outputs

1. Tokenizer

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')

Input

"I love Spark ML"

Output (tokens)

["I", "love", "Spark", "ML"]

➡️ Still strings, not numbers.
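For intuition, Tokenizer behaves roughly like this plain-Python sketch (this mimics the behavior and does not use Spark itself):

```python
def tokenize(text: str) -> list[str]:
    # Mimics pyspark.ml.feature.Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

print(tokenize("I love Spark ML"))  # ['i', 'love', 'spark', 'ml']
```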


2. StopWordsRemover

remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')

Output (filtered)

["love", "Spark", "ML"]

➡️ Still strings.
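Again as a plain-Python sketch (Spark ships a much larger default English stop-word list; this tiny set is only for illustration):

```python
# Tiny illustrative stop-word list; not Spark's actual default list.
STOP_WORDS = {"i", "a", "an", "the", "and", "of", "to", "is"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Mimics pyspark.ml.feature.StopWordsRemover: drop tokens in the list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["i", "love", "spark", "ml"]))  # ['love', 'spark', 'ml']
```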


3. Why OneHotEncoder cannot work here (important correction)

encoder = OneHotEncoder(inputCol='filtered', outputCol='ohe')

🚨 This is actually incorrect Spark ML usage

OneHotEncoder does NOT accept arrays of string tokens.

It expects:

inputCol: a numeric column of category indices (usually the output of StringIndexer)

So this pipeline would fail at runtime.


The missing step: CountVectorizer or HashingTF

In Spark ML, text → numeric features is done with:

Option A: CountVectorizer (most common)

vectorizer = CountVectorizer(
    inputCol='filtered',
    outputCol='features'
)

This converts tokens into a sparse numeric vector of per-token counts over the learned vocabulary:

["love", "spark", "ml"] → [0, 1, 0, 1, 1, 0, ...]

Option B: HashingTF

hashingTF = HashingTF(
    inputCol='filtered',
    outputCol='features'
)
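HashingTF skips the vocabulary entirely and buckets tokens by hash. A rough sketch of the hashing trick (Spark uses MurmurHash3 with 2**18 features by default; CRC32 is used here only to keep the sketch deterministic):

```python
from zlib import crc32

def hashing_tf(tokens: list[str], num_features: int = 16) -> list[int]:
    # Sketch of the hashing trick: bucket = hash(token) mod num_features,
    # then count occurrences per bucket.
    vec = [0] * num_features
    for t in tokens:
        vec[crc32(t.encode()) % num_features] += 1
    return vec

vec = hashing_tf(["love", "spark", "ml"])
print(sum(vec))  # three tokens land in (at most) three buckets
```

The upside is no fit step and bounded memory; the downside is possible hash collisions between different tokens.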

When OneHotEncoder is used in NLP pipelines

OneHotEncoder is used after StringIndexer, for categorical (non-text) columns:

indexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
encoder = OneHotEncoder(inputCol='categoryIndex', outputCol='categoryVec')

Typical uses

  • Country

  • Gender

  • Product type

  • Document class (not text)
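The StringIndexer → OneHotEncoder pair can be sketched in plain Python like this (assuming Spark's defaults: StringIndexer orders categories by descending frequency, and OneHotEncoder's dropLast=True encodes the last category as the all-zeros vector):

```python
from collections import Counter

def string_index(values: list[str]) -> tuple[list[int], dict[str, int]]:
    # Mimics StringIndexer: index 0 goes to the most frequent category.
    order = [v for v, _ in Counter(values).most_common()]
    mapping = {v: i for i, v in enumerate(order)}
    return [mapping[v] for v in values], mapping

def one_hot(index: int, num_categories: int) -> list[int]:
    # Mimics OneHotEncoder with dropLast=True: vector has one fewer slot
    # than there are categories; the last category is all zeros.
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec

indexed, mapping = string_index(["US", "DE", "US", "FR"])
print([one_hot(i, len(mapping)) for i in indexed])
# [[1, 0], [0, 1], [1, 0], [0, 0]]
```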


Correct NLP pipeline for Logistic Regression

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')
vectorizer = CountVectorizer(inputCol='filtered', outputCol='features')

lr = LogisticRegression(featuresCol='features', labelCol='label')

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lr])
model = pipeline.fit(data)

Key takeaway

Stage                         Output type
Tokenizer                     array<string>
StopWordsRemover              array<string>
CountVectorizer / HashingTF   vector (numeric)
OneHotEncoder                 categorical numeric → vector

Final answer

👉 Tokenizer does not convert text to numbers.
👉 OneHotEncoder is not meant for tokenized text.
👉 CountVectorizer or HashingTF is the correct component for NLP features.

 

posted on 2025-12-18 09:21  ZhangZhihuiAAA