How to check that duplicated rows (based on one column) are identical after dealing with missing values in pandas python?

I hope you’re doing well.
My problem is that I have some rows that are duplicated on one column (column A). I need to check that these duplicates agree with each other, and then fill in the missing values where they exist, as in the following examples:

– 1st example :

    A   B   C
0   foo 2   3
1   foo nan nan
2   foo 1   4
3   bar nan nan
4   foo nan nan

This example is invalid because rows 0 and 2 share the same value in A but have different values in B and C (a nan next to a value is okay, but two different non-nan values are not).

– 2nd example :

    A   B   C
0   foo 2   3
1   foo nan nan
2   foo nan 3
3   bar 1   nan
4   foo 2   nan

This case is valid (the duplicated rows either have the same value in B and C or have nan), so the missing values should be filled in as follows:

    A   B   C
0   foo 2   3
1   foo 2   3
2   foo 2   3
3   bar 1   nan
4   foo 2   3

To check whether the data frame is valid, count the distinct non-nan values of B and C within each group of A. This returns True if the frame is not valid:

df.groupby('A')[['B', 'C']].nunique().gt(1).any().any()
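To see which key and column break the rule from the question, the per-group count of distinct non-missing values can be kept as a table instead of being collapsed to a single boolean. A minimal sketch, assuming the column names from the two examples; `conflicting_groups` is a hypothetical helper name, not a pandas function:

```python
import numpy as np
import pandas as pd


def conflicting_groups(df, key="A"):
    """Boolean table: True where a group holds more than one distinct non-NaN value."""
    value_cols = list(df.columns.drop(key))
    # nunique() ignores NaN by default, so NaN never counts as a conflict.
    return df.groupby(key)[value_cols].nunique() > 1


first = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "foo"],
                      "B": [2, np.nan, 1, np.nan, np.nan],
                      "C": [3, np.nan, 4, np.nan, np.nan]})
second = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "foo"],
                       "B": [2, np.nan, np.nan, 1, 2],
                       "C": [3, np.nan, 3, np.nan, np.nan]})

first_invalid = bool(conflicting_groups(first).any().any())
second_invalid = bool(conflicting_groups(second).any().any())
print(conflicting_groups(first))
print(first_invalid, second_invalid)
```

With the data from the question, the first example is flagged as invalid for key foo and the second is not flagged.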

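Fill the nan values if it is valid (forward fill, then backward fill, so that a group whose first row is missing is also covered):

df[['B', 'C']] = df.groupby('A')[['B', 'C']].transform(lambda x: x.ffill().bfill())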
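A forward fill only copies values downward, so a group whose first row is missing can stay partly empty. As a sketch of an order-independent alternative, the function below refuses inconsistent input and otherwise broadcasts each group's first non-missing value to every row of that group. The name `fill_from_group` and the `demo` frame are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def fill_from_group(df, key="A"):
    """Fill NaN with the group's single known value; raise if a group disagrees."""
    value_cols = list(df.columns.drop(key))
    grouped = df.groupby(key)[value_cols]
    if (grouped.nunique() > 1).any().any():
        raise ValueError("rows sharing a key hold conflicting values")
    out = df.copy()
    # transform("first") broadcasts each group's first non-NaN value to all its rows.
    out[value_cols] = grouped.transform("first")
    return out


# Hypothetical input whose first row is the incomplete one.
demo = pd.DataFrame({"A": ["foo", "foo", "bar", "foo"],
                     "B": [np.nan, 2, 1, np.nan],
                     "C": [np.nan, np.nan, np.nan, 3]})
filled = fill_from_group(demo)
print(filled)
```

Because the check runs first, overwriting every row with the group's first known value is equivalent to filling only the gaps. A column that is nan for the whole group, such as C for bar, stays nan.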

Suppose the table you are trying to recreate, with one row per value of A, is:

A,B,C
foo, 2, 3
bar, 1, nan

Assuming that the code you are developing is:

import numpy as np
import pandas as pd

first_example = {'A' : ['foo', 'foo', 'foo', 'bar', 'foo'],
                'B' : [2, np.nan, 1, np.nan, np.nan],
                'C' : [3, np.nan, 4, np.nan, np.nan]}



second_example = {'A' : ['foo', 'foo', 'foo', 'bar', 'foo'],
                 'B' : [2, np.nan, np.nan, 1, 2],
                 'C' : [3, np.nan, 3, np.nan, np.nan]}


df1 = pd.DataFrame(first_example)
df2 = pd.DataFrame(second_example)

You can then set the values for each key by hand:

df1[['B_','C_']] = ['','']
df1.set_index('A', inplace=True)
df1.at['foo','B_'] = 2
df1.at['foo','C_'] = 3
df1.at['bar','B_'] = 1
df1.at['bar','C_'] = np.nan
df1 = df1.drop(columns=['B','C'])
df1.reset_index(inplace=True)
df1.columns = ['A','B','C']
