Solution for NLP - How can we feed the output of FastText into a Neural Network?
is given below:
I am working on text classification where I want to classify movie genres. The input is a movie summary/plot and the output should be the movie's genres. I have used FastText via the Gensim library to obtain vector representations for words, and I want to feed the output of FastText into a neural network for training, so that I can give a movie summary/plot as input and get genres such as Drama, Horror, etc. as output. I have read many blogs, and all of them feed TF-IDF into the neural network; none feed the output of FastText. Can someone please explain whether this is possible, or whether you think otherwise?
Thank you for your cooperation and understanding in this regard.
import time
from gensim.models import FastText

start = time.time()
model_ted = FastText(sentences=movies_new['genre_new'], size=100, window=5,
                     min_count=5, workers=4, sg=1)
print(model_ted)
end = time.time()
print('Time to train fasttext from generator: %0.2fs' % (end - start))

model_ted.wv.most_similar("The Lemon Drop Kid , a New York City swindler, is illegally touting horses at a Florida racetrack. After several successful hustles, the Kid comes across a beautiful, but gullible, woman intending to bet a lot of money. The Kid convinces her to switch her bet, employing a prefabricated con. Unfortunately for the Kid, the woman belongs to notorious gangster Moose Moran , as does the money.")
[('Foreign legion', 0.9828806519508362), ('Space opera', 0.9763268828392029), ('Cyberpunk', 0.9738191366195679), ('Reboot', 0.9718296527862549), ('Kafkaesque', 0.9635183215141296), ('Libraries and librarians', 0.9622164368629456), ('Parkour in popular culture', 0.961660623550415), ('Movies About Gladiators', 0.9592210650444031), ('Women in prison films', 0.9587017297744751), ('Outlaw', 0.9548137784004211)]
FastText typically gives you one vector per word. Your movie summaries are probably multi-word texts.
How were you thinking of turning your multi-word texts into a single fixed-length vector, as a classifier tends to expect?
You might consider averaging all the word-vectors of the individual words together. That’s a simple approach often helpful as a starting baseline, and may do OK on some broad topical task. But, it also may dilute some useful info, compared to a ‘bag of words’ approach which can still feed a classifier the exact presence-or-absence of specific words. (Some individual words might be highly indicative of genre!)
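A minimal sketch of that averaging approach, using a plain dict of toy 3-dimensional vectors standing in for a trained model's `.wv` lookup (all names and vectors here are illustrative):

```python
import numpy as np

def average_doc_vector(wv, tokens, size):
    # wv: any mapping from token -> vector (e.g. a trained FastText model's .wv)
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        # no known tokens: fall back to a zero vector
        return np.zeros(size)
    return np.mean(vecs, axis=0)

# toy "embeddings" standing in for real 100-dimensional FastText vectors
toy_wv = {
    "horror": np.array([1.0, 0.0, 0.0]),
    "ghost":  np.array([0.0, 1.0, 0.0]),
}
vec = average_doc_vector(toy_wv, ["horror", "ghost", "unseenword"], size=3)
# vec -> array([0.5, 0.5, 0.0])
```

One such fixed-length vector per summary can then be fed to any ordinary classifier as its input features. (Note that a real gensim FastText `.wv` can also synthesize vectors for out-of-vocabulary words from character n-grams, which this toy dict does not.)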
You could also consider using the -supervised mode in Facebook's original FastText implementation, which directs word-vector training by the words' ability, when averaged together, to predict specific labels. Then the FastText model is a classifier, too. (But that mode isn't implemented in Gensim's FastText class.)
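For reference, that supervised mode trains from a plain-text file in which each line carries one or more `__label__` prefixes followed by the text. A sketch of preparing such a file (the rows and filename here are hypothetical):

```python
# Build training lines in the __label__ format Facebook's fasttext expects.
rows = [
    ("drama",  "a quiet family falls apart over one summer"),
    ("horror", "a ghost haunts the old house"),
]
lines = ["__label__%s %s" % (genre, plot) for genre, plot in rows]
# lines[0] -> "__label__drama a quiet family falls apart over one summer"

with open("train.txt", "w") as f:
    f.write("\n".join(lines))

# Training would then be done outside Gensim, e.g. with the fasttext CLI:
#   fasttext supervised -input train.txt -output genre_model
# or with the official Python wrapper:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt")
```

Note that a multi-label problem like genres (one movie, several genres) can be expressed by putting several `__label__` tokens on the same line.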
About your shown code:
- It’s unclear what movies_new['genre_new'] is providing as your training corpus. The FastText class expects a sequence where each item is a list of string tokens, and those tokens are usually individual natural words. From your later most_similar() results, it looks like you may instead be providing multi-word category names as the tokens. There are cases where that might make sense, but it may not be what you want. In particular, if the model is only trained on category names, that’s all it will know; later trying to use it to vectorize free-form movie summaries, written in a lingo different from the formal categories, may not work well.
- The .most_similar() method expects some mix of word tokens to look up, or raw vectors. It’s not expecting a long string of free-form text, like the one you’ve provided in this code. I believe it will interpret that whole string as one long word and synthesize an OOV word-vector for it from its many character n-grams. That’s probably nonsense in your case, meaning the results will be nonsense, too.
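To address both points above, the summaries would normally be broken into lists of word tokens before training, and most_similar() would then be queried with individual tokens. A minimal sketch (the tokenizer and example plots are illustrative, not your actual data):

```python
import re

def tokenize(text):
    # naive lowercase word tokenizer; real pipelines may prefer nltk or spaCy
    return re.findall(r"[a-z']+", text.lower())

plots = [
    "The Lemon Drop Kid, a New York City swindler, is illegally touting horses.",
    "A ghost haunts an old house on the edge of town.",
]
corpus = [tokenize(p) for p in plots]
# corpus[0][:4] -> ['the', 'lemon', 'drop', 'kid']

# This list-of-token-lists is the shape FastText(sentences=corpus, ...) expects,
# and similarity lookups would then use single tokens, e.g.:
#   model.wv.most_similar("swindler")   # (assumes a trained model)
```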