ZhangZhihui's Blog  

In Spark, I ran the code snippet below:

from pyspark.ml.feature import CountVectorizer

# vocabSize=7 caps the vocabulary at the 7 most frequent words (indices 0 to 6).
# The fitted model counts how many times each vocabulary word appears in each document.
# df is assumed to be a DataFrame with an array-of-strings column named 'words'.
cv = CountVectorizer(inputCol='words', outputCol='features', vocabSize=7)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)

The output is as follows:

+---+------------------------------+-------------------------------+
|id |words                         |features                       |
+---+------------------------------+-------------------------------+
|0  |[apple, banana, orange, grape]|(7,[0,1,3,4],[1.0,1.0,1.0,1.0])|
|1  |[apple, banana, cherry, pear] |(7,[1,4,5,6],[1.0,1.0,1.0,1.0])|
|2  |[banana, cherry, grape, kiwi] |(7,[1,2,3,5],[1.0,1.0,1.0,1.0])|
|3  |[apple, pear, kiwi, orange]   |(7,[0,2,4,6],[1.0,1.0,1.0,1.0])|
|4  |[cherry, grape, kiwi, orange] |(7,[0,2,3,5],[1.0,1.0,1.0,1.0])|
+---+------------------------------+-------------------------------+

Please help me understand the values in the features column.

 

Understanding the features Column

Each features value is a SparseVector, represented as:

(size, [indices], [values])

Where:

  • size = length of the vector, which equals the size of the fitted vocabulary (here 7)

  • indices = positions in the vocabulary of the words that appear in the row

  • values = raw count of each of those words in the row (CountVectorizer does not apply TF-IDF weighting)
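To make the triple concrete, here is a minimal pure-Python sketch that expands it into a dense list. The helper name to_dense is made up for illustration and is not part of Spark; in PySpark the vector object's own toArray() method does the equivalent job.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    dense = [0.0] * size              # every position starts at zero
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the stored positions
    return dense

# Row 0 from the output above: (7,[0,1,3,4],[1.0,1.0,1.0,1.0])
row0 = to_dense(7, [0, 1, 3, 4], [1.0, 1.0, 1.0, 1.0])
print(row0)  # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

Positions 2, 5 and 6 stay at 0.0 because the words at those vocabulary indices do not occur in row 0.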


1. Vocabulary Built by CountVectorizer

Spark sorts words by descending frequency, keeping the top vocabSize=7.

Let’s manually count frequencies:

Across all rows:

word count
apple 3
banana 3
orange 3
grape 3
cherry 3
pear 2
kiwi 3

All except pear (2) appear 3 times.
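The manual count can be double-checked without Spark. The sketch below is plain Python using collections.Counter on the five word lists copied from the output table; it only reproduces the tally and says nothing about how Spark orders the words internally.

```python
from collections import Counter

# The 'words' column from the output table, one list per row
rows = [
    ["apple", "banana", "orange", "grape"],
    ["apple", "banana", "cherry", "pear"],
    ["banana", "cherry", "grape", "kiwi"],
    ["apple", "pear", "kiwi", "orange"],
    ["cherry", "grape", "kiwi", "orange"],
]

# Count every occurrence of every word across all rows
counts = Counter(word for row in rows for word in row)
for word, n in counts.most_common():
    print(word, n)
```

Six words come out with a count of 3 and pear with a count of 2, in agreement with the table.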

So Spark will create a vocabulary of length 7.
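When multiple words have the same frequency, Spark does not document how the tie is broken, so the order among the six words that appear 3 times cannot be predicted from the counts alone. Only pear, the least frequent word, is certain to come last.

Assuming alphabetical order among the tied words, purely as a guess, the vocabulary would be:

Index → Word
0 → apple
1 → banana
2 → cherry
3 → grape
4 → kiwi
5 → orange
6 → pear

(Print model.vocabulary to see the actual order, which may differ from this guess.)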

 

model.vocabulary 

['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']

 

ChatGPT said:

Great, now that we have the actual vocabulary, we can decode every features vector precisely.

Your vocabulary is:

Index → Word
0 → orange
1 → banana
2 → kiwi
3 → grape
4 → apple
5 → cherry
6 → pear

 


✅ Now decode each row

Row 0

words = [apple, banana, orange, grape]
features = (7,[0,1,3,4],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 1 → banana ✔

  • 3 → grape ✔

  • 4 → apple ✔

Exactly matches the input list.


Row 1

words = [apple, banana, cherry, pear]
features = (7,[1,4,5,6],[1.0,1.0,1.0,1.0])

Indices → words:

  • 1 → banana ✔

  • 4 → apple ✔

  • 5 → cherry ✔

  • 6 → pear ✔

Matches the input list.


Row 2

words = [banana, cherry, grape, kiwi]
features = (7,[1,2,3,5],[1.0,1.0,1.0,1.0])

Indices → words:

  • 1 → banana ✔

  • 2 → kiwi ✔

  • 3 → grape ✔

  • 5 → cherry ✔

Matches the input list.


Row 3

words = [apple, pear, kiwi, orange]
features = (7,[0,2,4,6],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 2 → kiwi ✔

  • 4 → apple ✔

  • 6 → pear ✔

Matches the input list.


Row 4

words = [cherry, grape, kiwi, orange]
features = (7,[0,2,3,5],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 2 → kiwi ✔

  • 3 → grape ✔

  • 5 → cherry ✔

Matches the input list.
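The five row-by-row checks above can also be run in one loop. This is a plain-Python sketch with the vocabulary and the indices typed in by hand from the output; with a live Spark session you would read them from model.vocabulary and from each vector's indices attribute instead.

```python
vocabulary = ["orange", "banana", "kiwi", "grape", "apple", "cherry", "pear"]

# (input words, indices from the features column) for each row
rows = [
    (["apple", "banana", "orange", "grape"], [0, 1, 3, 4]),
    (["apple", "banana", "cherry", "pear"], [1, 4, 5, 6]),
    (["banana", "cherry", "grape", "kiwi"], [1, 2, 3, 5]),
    (["apple", "pear", "kiwi", "orange"], [0, 2, 4, 6]),
    (["cherry", "grape", "kiwi", "orange"], [0, 2, 3, 5]),
]

all_match = True
for row_id, (words, indices) in enumerate(rows):
    decoded = [vocabulary[i] for i in indices]   # index -> word lookup
    match = sorted(decoded) == sorted(words)     # word order in the row is irrelevant
    all_match = all_match and match
    print(row_id, decoded, "OK" if match else "MISMATCH")
```

The comparison sorts both lists because a sparse vector stores its indices in ascending order, not in the order the words appeared in the row.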


🎯 Summary: How to interpret the features column

Each features entry:

(size, [indices], [values])

means:

  • indices correspond to positions in model.vocabulary

  • values are the counts of those words in that row

  • Example: index 3 means word "grape" because model.vocabulary[3] = "grape"

Your sparse vectors are exactly correct given the vocabulary.
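Going the other way, from a list of words to a sparse triple, can be sketched as follows. The function name encode is hypothetical, and the sketch is deliberately simplified: it skips words that are not in the vocabulary, as the fitted model does, and it ignores CountVectorizer options such as minTF and binary.

```python
from collections import Counter

vocabulary = ["orange", "banana", "kiwi", "grape", "apple", "cherry", "pear"]

def encode(words, vocabulary):
    """Return (size, indices, values) holding raw counts of in-vocabulary words."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    # Count occurrences per vocabulary index, dropping unknown words
    counts = Counter(index_of[w] for w in words if w in index_of)
    indices = sorted(counts)                       # ascending, as in the output
    values = [float(counts[i]) for i in indices]
    return len(vocabulary), indices, values

print(encode(["apple", "banana", "orange", "grape"], vocabulary))
# (7, [0, 1, 3, 4], [1.0, 1.0, 1.0, 1.0])
print(encode(["kiwi", "kiwi", "mango"], vocabulary))
# (7, [2], [2.0])
```

The second call shows the two cases that never occur in this dataset: a repeated word produces a value above 1.0, and a word outside the vocabulary ("mango") leaves no trace in the vector.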

 

posted on 2025-11-23 10:47 by ZhangZhihuiAAA