How to transform a customized vectorizer for predicting classification?
I haven’t found any similar questions while googling, though I may have searched with the wrong keywords.
I want to try two variations of feature extraction:
- Vectorize as plain bag of words
- Vectorize bag of words, combined with additional features
For the first method, I fit and transform the dataset with this code (it is part of my function; df is a DataFrame, vect is a TfidfVectorizer or CountVectorizer):
self.X = self.vect.fit_transform(df.Tweet)
self.X_columns=self.vect.get_feature_names()
After building the classification model, I can transform any text I want to predict with this code (vect is the fitted TfidfVectorizer/CountVectorizer, new_df is a DataFrame, clf is a classifier trained with any algorithm):
text_features = vect.transform(new_df.Tweet)
predictions = clf.predict(text_features)
It’s done, and it works.
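For reference, the first method can be reproduced end to end as a minimal sketch. The toy tweets, the label column, and the choice of LogisticRegression are assumptions made up for illustration; they are not from my real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Tweet": ["good movie", "bad movie", "great film", "awful film"],
    "label": [1, 0, 1, 0],
})
new_df = pd.DataFrame({"Tweet": ["good film", "awful movie"]})

vect = TfidfVectorizer()
X = vect.fit_transform(df.Tweet)               # learn the vocabulary on training text
clf = LogisticRegression().fit(X, df.label)    # any classifier could be used here

text_features = vect.transform(new_df.Tweet)   # reuse the fitted vocabulary
predictions = clf.predict(text_features)
```

The point is that transform reuses the vocabulary learned by fit_transform, so the new matrix has the same number of columns as the training matrix.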
For the second case, I did the same with a workaround. I looked for useful code on Stack Overflow and ended up with this (sp is the scipy library, df is a DataFrame):
self.X = sp.sparse.hstack((vect.fit_transform(df.Tweet), df[['feature_1','feature_2','score','sentiment']].values), format="csr")
self.X_columns=vect.get_feature_names() + df[['feature_1','feature_2','score','sentiment']].columns.tolist()
It works: the additional features are appended to the CSR matrix.
But the question is: how do I transform new_df into the same kind of matrix?
I don’t know where to begin.
My guess is:
# compute/process each additional feature ['feature_1','feature_2','score','sentiment']
...
# then use a similar method, but with transform instead of fit_transform
text_features = sp.sparse.hstack((vect.transform(new_df.Tweet), new_df[['feature_1','feature_2','score','sentiment']].values), format="csr")
predictions = clf.predict(text_features)
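To check the guess above, here is a small self-contained sketch. The toy data, the helper name build_features, and LogisticRegression are hypothetical and only for illustration. It assumes the four extra columns are already numeric and are computed the same way for new_df as for df; wrapping them in csr_matrix before stacking is my own choice to keep every block sparse.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

extra_cols = ["feature_1", "feature_2", "score", "sentiment"]

# Toy data, invented for illustration; the extra columns are assumed numeric.
df = pd.DataFrame({
    "Tweet": ["good movie", "bad movie", "great film", "awful film"],
    "feature_1": [1, 0, 1, 0],
    "feature_2": [0, 1, 0, 1],
    "score": [0.9, 0.1, 0.8, 0.2],
    "sentiment": [1, -1, 1, -1],
    "label": [1, 0, 1, 0],
})
new_df = pd.DataFrame({
    "Tweet": ["good film", "awful movie"],
    "feature_1": [1, 0],
    "feature_2": [0, 1],
    "score": [0.7, 0.3],
    "sentiment": [1, -1],
})

def build_features(vect, frame, fit=False):
    """Stack text features and the extra numeric columns side by side."""
    text = vect.fit_transform(frame.Tweet) if fit else vect.transform(frame.Tweet)
    extra = sparse.csr_matrix(frame[extra_cols].to_numpy(dtype=float))
    return sparse.hstack((text, extra), format="csr")

vect = TfidfVectorizer()
X = build_features(vect, df, fit=True)       # fit_transform on training data
clf = LogisticRegression().fit(X, df.label)

text_features = build_features(vect, new_df)  # transform only on new data
predictions = clf.predict(text_features)
```

Because the vectorizer is only fitted once and the extra columns are stacked in the same order, the training and prediction matrices end up with the same column layout, which is what clf.predict needs.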
I’ll update this if it turns out to be the correct answer. Please share if you find a better approach or solution.
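One alternative that might be tidier is scikit-learn's ColumnTransformer inside a Pipeline, which handles the fit/transform distinction and the column stacking itself. This is only a sketch under assumptions: the toy data, the step names, and LogisticRegression are made up for illustration, and the extra columns are assumed to be numeric already.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

extra_cols = ["feature_1", "feature_2", "score", "sentiment"]
input_cols = ["Tweet"] + extra_cols

# Toy data, invented for illustration; the extra columns are assumed numeric.
df = pd.DataFrame({
    "Tweet": ["good movie", "bad movie", "great film", "awful film"],
    "feature_1": [1, 0, 1, 0],
    "feature_2": [0, 1, 0, 1],
    "score": [0.9, 0.1, 0.8, 0.2],
    "sentiment": [1, -1, 1, -1],
    "label": [1, 0, 1, 0],
})
new_df = pd.DataFrame({
    "Tweet": ["good film", "awful movie"],
    "feature_1": [1, 0],
    "feature_2": [0, 1],
    "score": [0.7, 0.3],
    "sentiment": [1, -1],
})

features = ColumnTransformer([
    # A single column name (not a list) so the vectorizer receives 1-D text.
    ("text", TfidfVectorizer(), "Tweet"),
    # Pass the extra numeric columns through unchanged.
    ("extra", "passthrough", extra_cols),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])

model.fit(df[input_cols], df["label"])
predictions = model.predict(new_df[input_cols])
```

With this layout there is no separate matrix to build for new_df: calling model.predict applies the already fitted vectorizer and appends the extra columns in one step.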