I want to use GridSearchCV for parameter tuning. Is it also possible to check with GridSearchCV whether CountVectorizer or TfidfVectorizer works best? My idea:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    # TfidfVectorizer is the 'vect' step, so its options use the vect__ prefix
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
My idea: CountVectorizer is the same as TfidfVectorizer with use_idf=False and norm=None. If GridSearchCV returns those parameters as the best result, then CountVectorizer is the best option. Is that correct?
Thank you in advance :)
Once you've included a step under a given name in the Pipeline, you can refer to that name in the parameter grid, either to set the step's parameters or, as in this case, to swap in a different vectorizer. You can also pass a list of grids for a single pipeline:
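A minimal sketch of that idea, not a definitive setup: the toy corpus, the labels, the `shared` dict, and the reduced value lists are made up here so the example runs quickly on its own. Each grid in the list replaces the `vect` step with a different vectorizer, and the TF-IDF-only options appear only in the grid where they apply.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data (assumption); substitute your own documents and labels.
docs = [
    "the cat sat on the mat", "dogs chase cats", "my cat purrs loudly",
    "a kitten is a young cat", "cats and kittens sleep a lot", "the cat naps",
    "stocks fell sharply today", "the market rallied after the report",
    "investors sold their shares", "bond yields rose again",
    "the bank raised interest rates", "shares and bonds traded lower",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),  # placeholder; each grid below replaces this step
    ("clf", SGDClassifier(random_state=0)),
])

# Parameters that make sense for both vectorizers (shortened for speed).
shared = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-6],
}

# One grid per vectorizer; use_idf and norm exist only on TfidfVectorizer.
param_grid = [
    {"vect": [CountVectorizer()], **shared},
    {
        "vect": [TfidfVectorizer()],
        "vect__use_idf": [True, False],
        "vect__norm": ["l1", "l2", None],
        **shared,
    },
]

grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
grid_search.fit(docs, labels)

# The winning vectorizer is stored under the step name in best_params_.
print(type(grid_search.best_params_["vect"]).__name__)
print(grid_search.best_params_)

# Sanity check of the premise in the question: with default settings otherwise,
# TfidfVectorizer(use_idf=False, norm=None) yields the same matrix as CountVectorizer.
counts = CountVectorizer().fit_transform(docs).toarray()
plain_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()
print(np.allclose(counts, plain_tf))
```

With this layout, `best_params_["vect"]` tells you directly which vectorizer won, so there is no need to infer it from `use_idf` and `norm`. If the TF-IDF grid wins with `use_idf=False` and `norm=None`, that candidate is equivalent to plain counts, as the final check suggests.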