Vectorized Text as Input into RNN


I have the following function, which adds a new column to my dataframe.
I want to use the vectorized text as input to my RNN; however, I am not able to reshape the column to use it as input. How can I resolve this? Thanks

# vectorization
max_length = 500
def vectorization(text):
  seq = text.split()
  if seq:
    vectorizer = TfidfVectorizer()
    vectorizer.fit(seq)
    vector = vectorizer.transform(seq)
    return sequence.pad_sequences(vector.toarray(), maxlen=max_length)
  else:
    print(seq)
    return seq

df['text_vector']=df['text_cleaned'].apply(vectorization)

X_train, X_test, Y_train, Y_test = train_test_split(df['text_vector'], df['sentiment'], train_size=0.80, shuffle=True)

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
Y_train = Y_train.to_numpy()
Y_test = Y_test.to_numpy()

X_train = X_train.reshape((X_train.shape[0], 500, 1))

The reshape fails with:

ValueError: cannot reshape array of size 3876 into shape (3876,500,1)
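The error happens because `apply` stores one array per row, so `X_train.to_numpy()` gives a 1-D object array of length 3876 (one element per row), not a numeric `(3876, 500)` matrix. A minimal sketch of the difference, using dummy zero arrays in place of the tf-idf rows:

```python
import numpy as np

# Each row holds its own (1, 500) array, as produced by the per-row apply()
rows = [np.zeros((1, 500)) for _ in range(3)]

# A Series of arrays converts to a 1-D object array of length n...
obj = np.empty(len(rows), dtype=object)
obj[:] = rows
print(obj.shape)  # (3,) -- size 3, so it cannot become (3, 500, 1)

try:
    obj.reshape((3, 500, 1))
except ValueError as e:
    print(e)  # same "cannot reshape array of size ..." error as above

# ...whereas concatenating the rows first gives a real (n, 500) matrix
stacked = np.concatenate(rows)
print(stacked.reshape(3, 500, 1).shape)  # (3, 500, 1)
```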

A few points:

  • Ideally you should fit TfidfVectorizer once on the full training text, not per row as you are doing.
  • After pad_sequences, each row is a NumPy array of shape (1, 500). So you will have to concatenate all the rows to create a single array of shape (n, 500), where n is len(df).
Fixed code (commented inline):

from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing import sequence
import numpy as np
import pandas as pd


max_length = 500
def vectorization(vectorizer, text):
    vector = vectorizer.transform(text)
    # dtype='float32' keeps the tf-idf weights; the default 'int32'
    # would truncate them all to 0
    return sequence.pad_sequences(vector.toarray(), maxlen=max_length,
                                  dtype='float32')

df = pd.DataFrame({'text_cleaned': [
                                    'a cat on a table',
                                    'a dog under a table',
                                    'apple is red',
                                    'sky is blue']})
v = TfidfVectorizer()
# Fit on the full training text, not per row
v.fit(df['text_cleaned'])

df['text_vector'] = df['text_cleaned'].apply(lambda text: vectorization(v, [text]))
# Concatenate the per-row (1, 500) arrays into one (n, 500) array
x_train = np.concatenate(df['text_vector'])
# Reshape (or use expand_dims) to add the last dimension so that it can be passed to the RNN
x_train = x_train.reshape(-1, 500, 1)
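As noted in the comment, np.expand_dims is an equivalent way to add that trailing feature dimension. A quick sketch with a dummy (n, 500) matrix standing in for the concatenated tf-idf rows:

```python
import numpy as np

x = np.random.rand(4, 500)  # dummy stand-in for the concatenated (n, 500) matrix

# Both produce the (n, timesteps, features) layout an RNN layer expects
a = x.reshape(-1, 500, 1)
b = np.expand_dims(x, axis=-1)

assert a.shape == b.shape == (4, 500, 1)
assert np.array_equal(a, b)
```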