Text encodings for a large text dataset

I am having trouble encoding large amounts of textual data to numbers.

Problem: memory issues during encoding

Things I tried:

# Using plain Python
TITLE_list = " ".join(df["TITLE"].dropna()).split(" ")

# Bag of words
bog = pd.Series(
    [y for x in df.iloc[:, :4].dropna().values.flatten() for y in x.split()]
).value_counts()

I also tried other encoders, but they all result in the same memory error.
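For reference, the bag-of-words count above can be made much leaner on memory by streaming tokens into a `collections.Counter` instead of materializing one big list first. A minimal sketch, with a toy DataFrame standing in for the real data (column names taken from the example below):

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real DataFrame
df = pd.DataFrame({
    "TITLE": ["Pete The Cat", "Dog Days"],
    "DESCRIPTION": ["Pete the Cat is the coolest", None],
})

counts = Counter()
for col in ["TITLE", "DESCRIPTION"]:
    for text in df[col].dropna():
        # Counter.update with a token list never builds one combined list
        counts.update(text.split())
```

`counts` ends up with the same token frequencies as the `value_counts()` approach, but only one row's tokens are ever held at a time.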

This is the memory size of my server (output of `!free --mega`, values in megabytes):

              total        used        free      shared  buff/cache   available
Mem:          13302         498       12097           1         707       12545
Swap:             0           0           0

Data example:

    TITLE         DESCRIPTION                  BULLET_POINTS    BRAND    
0   Pete The Cat, Pete the Cat is the coolest, [Pete the Cat ]  MeMakers    

Help needed: a way to encode large amounts of text data to numbers so that I can feed them to a neural network architecture.

You should try another approach.
Instead of creating a list, you can work with an "iterator object".
By doing this, Python keeps only the current iteration in memory,
saving a lot of memory.

Just an example:

l1 = [1, 2, 3, 4, 5]
l2 = iter(l1)
print(l2)  # Output: <list_iterator object at 0x0164E658>

# Iterating through the iterable (l1) using a for loop.
for i in l1:
    print(i, end=" ")  # Output: 1 2 3 4 5

# Iterating through the iterator (l2) using a for loop.
for i in l2:
    print(i, end=" ")  # Output: 1 2 3 4 5
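Applied to the pandas case in the question, the same idea becomes a generator expression. A sketch, assuming a DataFrame `df` with a `TITLE` column like the example data:

```python
import pandas as pd

# Toy stand-in for the real DataFrame
df = pd.DataFrame({"TITLE": ["Pete The Cat", "Dog Days", None]})

# Generator expression: yields one token at a time instead of one big list
tokens = (word for title in df["TITLE"].dropna() for word in title.split())

seen = []
for word in tokens:  # only the current word is held in memory
    seen.append(word)
```

Anything downstream that accepts an iterable (a `Counter`, a vectorizer's `fit` over an iterable, a training loop) can consume `tokens` directly without the full list ever existing.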

I had to use batching in the end to make this work.
Here is a snippet of the working code:

n = 16  # batch size
for i in tqdm.tqdm(range(0, df.shape[0], n)):
    # .loc avoids chained-indexing assignment, which can silently fail
    df.loc[df.index[i:i + n], "TITLE"] = df["TITLE"].iloc[i:i + n].apply(
        lambda x: [item for item in x.split() if item not in stop] if isinstance(x, str) else x
    )
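If the DataFrame itself barely fits in memory, the batching idea can be pushed one step earlier by reading the file in chunks with `pd.read_csv(..., chunksize=...)`, so only one chunk is resident at a time. A sketch with an in-memory CSV standing in for the real file, and a placeholder stop-word set:

```python
import io

import pandas as pd

# Stand-in for the real CSV file on disk
csv_data = io.StringIO("TITLE\nPete The Cat\nDog Days\n")
stop = {"The"}  # placeholder stop-word set

titles = []
# read_csv with chunksize yields DataFrames of at most that many rows
for chunk in pd.read_csv(csv_data, chunksize=1):
    chunk["TITLE"] = chunk["TITLE"].apply(
        lambda x: [w for w in x.split() if w not in stop] if isinstance(x, str) else x
    )
    # ...encode / write out each processed chunk here...
    titles.extend(chunk["TITLE"])
```

In practice you would pass the file path instead of `csv_data` and a much larger `chunksize` (e.g. 10_000 rows).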
