gridsearchcv with tfidf and count vectorizer

Question

Asked by Abtc on 08 October 2020 at 08:26 (6k views)

I want to use GridSearchCV for parameter tuning. Can GridSearchCV also check whether CountVectorizer or TfidfVectorizer works better? My current setup:

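from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = {
    # The vectorizer step is named 'vect', so all of its parameters use the 'vect__' prefix
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # A dict key can appear only once; the earlier 'clf__max_iter': (20,) was overridden by this entry
    'clf__max_iter': (10, 50, 80),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)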

My reasoning: CountVectorizer is the same as TfidfVectorizer with use_idf=False and norm=None. If GridSearchCV returns those parameters as the best result, then CountVectorizer is the best option. Is that correct?
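
One way to sanity-check this equivalence is to compare the output of both vectorizers on a small corpus. The sketch below uses a made-up three-document corpus (an assumption for illustration only) and leaves every other vectorizer setting at its default, including sublinear_tf=False:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

count_matrix = CountVectorizer().fit_transform(docs)
# With IDF weighting and normalisation both disabled, only raw term counts remain.
tfidf_matrix = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)

# The values match; only the dtype differs (integer counts vs. floats).
same_values = np.array_equal(count_matrix.toarray(), tfidf_matrix.toarray())
print(same_values)  # True
```

If the two matrices match, a grid search that picks use_idf=False and norm=None is effectively choosing plain term counts.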

Thank you in advance :)

There is 1 answer

**yatu** · Accepted Answer · 2020-10-08T08:33:58+00:00

Once a step has a name in the Pipeline, you can refer to it from the parameter grid, either to tune its parameters or to replace the step itself (here, the vectorizer). GridSearchCV also accepts a list of grids for a single pipeline, so each vectorizer can have its own set of parameters:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = [{
    # Grid 1: keep the pipeline's TfidfVectorizer and tune it via the 'vect__' prefix
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}, {
    # Grid 2: replace the whole 'vect' step with a CountVectorizer
    'vect': (CountVectorizer(),),
    # count_vect_params...
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}]

grid_search = GridSearchCV(pipeline, parameters)
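
To see which vectorizer won, one option is to fit the search and read the step back from best_params_. The following is a minimal sketch under stated assumptions: the texts and labels are invented toy data, the grids are deliberately tiny, and "vect" is listed in both grids (a small variation on the grids above) so that the chosen vectorizer always shows up in best_params_.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, repeated so every CV fold contains both classes.
texts = ["good movie", "great film", "loved it",
         "bad movie", "awful film", "hated it"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# One grid per vectorizer; each grid only lists parameters that vectorizer accepts.
param_grid = [
    {"vect": [TfidfVectorizer()],
     "vect__use_idf": [True, False],
     "vect__norm": ["l2", None]},
    {"vect": [CountVectorizer()],
     "vect__ngram_range": [(1, 1), (1, 2)]},
]

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(texts, labels)

# best_params_["vect"] is the vectorizer instance from the winning grid.
best_vect = grid_search.best_params_["vect"]
print(type(best_vect).__name__)
print(grid_search.best_params_)
```

Because TfidfVectorizer subclasses CountVectorizer, checking the class name is safer than isinstance(best_vect, CountVectorizer), which is true for both.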
