Combine numpy array with TfidfVectorizer as a joint feature matrix in SKLearn


I have a dataset input, which is a list of ~40,000 letters, each represented as a string.

With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix, representation1:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sklearn.pipeline

vectorizer = TfidfVectorizer(lowercase=False)
representation1 = vectorizer.fit_transform(input)  # TFIDF representation

Now I want to manually add one extra feature, representation2, for every letter. This feature should be the ratio of distinct words to total words in a given letter/string:

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(input).toarray()      # word counts per letter
sum_words = np.sum(counts, axis=-1)                           # total number of words per letter
sum_different_words = np.count_nonzero(counts, axis=-1)       # number of distinct words per letter
representation2 = np.divide(sum_different_words, sum_words)   # ratio of distinct to total words

representation2 is now an array of shape (39077,), as expected. I now want to combine representation1 and representation2 into one joint feature matrix representation.
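
For reference, one direct way I considered combining them (outside of FeatureUnion) is stacking the sparse TF-IDF matrix with the extra column via scipy.sparse.hstack — just a rough sketch, I'm not sure it is the idiomatic SKLearn way:

import scipy.sparse

# reshape representation2 into a (n_samples, 1) column and append it to the
# sparse TF-IDF matrix as one additional feature column
representation = scipy.sparse.hstack(
    [representation1, representation2.reshape(-1, 1)]
).tocsr()
print(representation.shape)  # (39077, n_tfidf_features + 1)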

I read about using FeatureUnion to combine two kinds of features in SKLearn, but I am not sure how to correctly use the NumPy array representation2 as a feature here. I tried:

union = sklearn.pipeline.make_union([representation1, representation2])

But now I can't use e.g. union.get_feature_names_out(), since it throws: AttributeError: Transformer list (type list) does not provide get_feature_names_out.
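
From the FeatureUnion documentation I suspect make_union expects transformer objects (things with fit/transform) rather than precomputed arrays, so perhaps something like the following is closer to what is intended — only a sketch, where distinct_word_ratio is my own hypothetical helper wrapped in a FunctionTransformer, and I'm not sure the feature_names_out part works on every SKLearn version:

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def distinct_word_ratio(docs):
    # hypothetical helper: ratio of distinct words to total words per document
    counts = CountVectorizer().fit_transform(docs)
    totals = counts.sum(axis=1).A1            # total words per document
    distinct = (counts > 0).sum(axis=1).A1    # distinct words per document
    return (distinct / totals).reshape(-1, 1)

union = FeatureUnion([
    ("tfidf", TfidfVectorizer(lowercase=False)),
    ("distinct_ratio", FunctionTransformer(
        distinct_word_ratio,
        # my guess at what get_feature_names_out() needs (newer SKLearn versions)
        feature_names_out=lambda transformer, names: ["distinct_word_ratio"],
    )),
])
representation = union.fit_transform(input)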

What did I understand incorrectly here?
