I am using TfidfVectorizer in scikit-learn to create a matrix from text data, and I need to save this object so I can reuse it later. I tried to use pickle, but it gave the following error:
loc=open('vectorizer.obj','w')
pickle.dump(self.vectorizer,loc)
*** TypeError: can't pickle instancemethod objects
I also tried joblib from sklearn.externals, which gave a similar error. Is there any way to save this object so that I can reuse it later?
Here is my full object:
import pickle
import nltk
import pandas as pd
# StemTokenizer is a custom tokenizer class defined elsewhere (not shown here)

class changeToMatrix(object):
    def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range,analyzer='word',lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',
                                          tokenizer=tokenizer)
    def load_ref_text(self,text_file):
        textfile = open(text_file,'r')
        lines = textfile.readlines()
        textfile.close()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [item.strip().strip('.') for item in sent_tokenizer.tokenize(' '.join(lines).strip())]
        #vectorizer is transformed in this step
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences).toarray())
        return sentences, [chk2]
    def get_processed_data(self,data_loc):
        ref_sentences,ref_dataframes=self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj","w")
        pickle.dump(self.vectorizer,loc) #getting error here
        loc.close()
        return ref_sentences, ref_dataframes
Firstly, it's better to put the import at the top of your code instead of inside your class:
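For example, hoisting the import out of __init__ and leaving the rest of your constructor as it is:

from sklearn.feature_extraction.text import TfidfVectorizer

class changeToMatrix(object):
    def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word',
                                          lowercase=True, token_pattern='[a-zA-Z0-9]+',
                                          strip_accents='unicode', tokenizer=tokenizer)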
Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it takes a string and returns a list of strings.

Now to answer your actual question: you most likely need to open the file in binary mode ('wb') before dumping a pickle, i.e.:
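Something along these lines should work; this sketch uses plain word_tokenize as the tokenizer just so it is self-contained, and the filename vectorizer.pk is arbitrary:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
import pickle

vectorizer = TfidfVectorizer(ngram_range=(1, 1), analyzer='word', lowercase=True,
                             token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
                             tokenizer=word_tokenize)

# 'wb' (binary mode), not 'w' -- pickle writes bytes
with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)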
Note: Using the with idiom for file I/O automatically closes the file once you leave the with block.

Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the actual stemming function is SnowballStemmer('english').stem.

IMPORTANT: TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings, so you will need to do something like this:
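Here is one way to wire that up; the stem_tokenize name is just an example, and a plain module-level function is used as the tokenizer so that the vectorizer stays picklable:

from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

stemmer = SnowballStemmer('english').stem  # the stemming function, not the stemmer object

def stem_tokenize(text):
    # takes a string, returns a list of stemmed tokens
    return [stemmer(token) for token in word_tokenize(text)]

vectorizer = TfidfVectorizer(ngram_range=(1, 1), analyzer='word', lowercase=True,
                             token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
                             tokenizer=stem_tokenize)

with open('vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

Loading it back later is just the reverse; note that stem_tokenize must still be importable in the process that unpickles the vectorizer:

with open('vectorizer.pk', 'rb') as fin:
    vectorizer = pickle.load(fin)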