Check similarity of 2 pandas dataframes - pandas

I am trying to compare two pandas DataFrames in terms of column names and datatypes. With assert_frame_equal I get an error since the shapes are different, and I could not find a way to ignore that in the documentation.
With df1_dict == df2_dict I only get whether they are similar or not; what I want is to print any differences in feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datasets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.

It seems to me that if the two dataframe descriptions are outer joined, you would have all the information you want.
Example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
    df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
    if pd.isnull(x.df1):
        return 'df1-missing'
    if pd.isnull(x.df2):
        return 'df2-missing'
    if x.df1 != x.df2:
        return 'type-mismatch'
    return 'ok'
diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
index df1 df2 diff_status
0 a int64 float64 type-mismatch
1 b object object ok
2 c NaN int64 df1-missing
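If you only care about the problem columns, a short follow-up on the same diff frame is to filter out the rows marked 'ok':
print(diff[diff.diff_status != 'ok'])
# prints only rows 0 (type-mismatch) and 2 (df1-missing) from the table above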

Related

Get max value of 3 columns from pandas DataFrame?

I have a pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is NaN instead of the expected output: US.
How can I get the max value for these 3 columns? (And what if one of them contains NaN?)
Use:
c={'a': [['US', 'BE'],['US']],'b': [['US'],['US']], 'c': [['US','BE'],['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
from collections import Counter
df = df[['a','b','c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print (df)
0 [US, BE]
1 [US]
dtype: object
If, as @Erfan stated, you want the most common value in a row, then use .agg() with mode:
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
While your data are lists, you can't use pandas' mode(), because list objects are unhashable, so mode() won't work.
A solution is to convert the elements of your DataFrame's row to strings and then use mode().
Check this:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df2.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
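If you need the winner back as a list rather than its string representation, a small sketch (assuming the list elements are plain strings, so the repr round-trips) is to parse the mode with ast.literal_eval:
import ast
winner = x.mode()[0]             # "['US', 'BE']" as a string
print(ast.literal_eval(winner))  # ['US', 'BE']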
As I can see, you have some elements of list type, so I think the code below will work fine.
First, append all values into one array.
Then, find the most frequently occurring element in that array.
arr = []
for i in df:
    for j in range(len(df[i])):
        for k in range(len(df[i][j])):
            arr.append(df[i][j][k])
from collections import Counter
b = Counter(arr)
print(b.most_common())
This will give you the answer you want.
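The triple loop can also be written as one flattening pass; a rough equivalent (assuming every cell holds a list) using itertools.chain:
from collections import Counter
from itertools import chain

arr = list(chain.from_iterable(chain.from_iterable(df[col] for col in df)))
print(Counter(arr).most_common())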

How to find pearson correlation between rows in two dataframes

I have a dataframe that I split into two dataframes with the same number of columns and rows (df1 and df2). I want to write a function that will go through each row and feed their values into the scipy.stats.pearsonr() function. How would I do this?
Something like:
for index, row in df1.iterrows():
    print(scipy.stats.pearsonr(df1.loc[index], df2.loc[index]))
If you just want the function, try this:
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(
    {
        'A': [0,2,3,4,5],
        'B': [2,3,4,5,6],
        'C': [5,6,7,8,9],
    }
)
df2 = pd.DataFrame(
    {
        'A': [2,1,3,4,5],
        'B': [3,2,4,5,6],
        'C': [7,7,7,3,3],
    }
)
def pandas_pearsonr(df1, df2):
    assert len(df1) == len(df2)
    coefs = []
    for i in range(0, len(df1)):
        coefs.append(pearsonr(df1.iloc[i].values, df2.iloc[i].values))
    print(coefs)
    return pd.DataFrame(index=df1.index, data=coefs, columns=['coef', 'p-value'])
pandas_pearsonr(df1, df2)
Output looks like this:
coef p-value
0 0.976221 0.139109
1 0.996271 0.054996
2 1.000000 0.000000
3 -0.720577 0.487754
4 -0.838628 0.366717
But I think it can be more pythonic, and maybe you can use pandas.DataFrame.corrwith.
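For example, a more vectorized sketch (no explicit loop, though it only returns the coefficients, not the p-values) uses DataFrame.corrwith with axis=1:
row_corr = df1.corrwith(df2, axis=1)  # row-wise Pearson correlation as a Series
print(row_corr)                       # should match the coef column above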

How to apply a list comprehension in a pandas DataFrame?

From a list of values, I am trying to identify every sequential pair of values whose sum is 10 or more.
a = [1,9,3,4,5]
...so I wrote a for loop...
values = []
for i in range(len(a)-2):
    if sum(a[i:i+2]) >= 10:
        values += [a[i:i+2]]
...which I rewrote as a list comprehension...
values = [a[i:i+2] for i in range(len(a)-2) if sum(a[i:i+2]) >= 10]
Both produce same output:
values = [[1,9], [9,3]]
My question is: how best may I apply the above list comprehension to a DataFrame?
Here is a sample 5-row DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1,1,1,1,0],
                   'B': [9,8,3,2,2],
                   'C': [3,3,3,10,3],
                   'E': [4,4,4,4,4],
                   'F': [5,5,5,5,5]})
df['X'] = df.values.tolist()
where a corresponds to df['X'], which is the list of values from columns A-F:
df['X'] = [[1,9,3,4,5],[1,8,3,4,5],[1,3,3,4,5],[1,2,10,4,5],[0,2,3,4,5]]
and the result of the list comprehension is to be stored in a new column df['X1'].
Desired output is:
df['X1'] = [[[1,9], [9,3]],[[8,3]],[[NaN]],[[2,10],[10,4]],[[NaN]]]
Thank you.
You could use the pandas apply function and put your list comprehension in it.
df = pd.DataFrame({'A': [1,1,1,1,0],
                   'B': [9,8,3,2,2],
                   'C': [3,3,3,10,3],
                   'E': [4,4,4,4,4],
                   'F': [5,5,5,5,5]})
df['x'] = df.apply(lambda a: [a[i:i+2] for i in range(len(a)-2) if sum(a[i:i+2]) >= 10], axis=1)
# Note: the axis parameter tells whether the function is applied by rows or by columns; axis=1 applies it to each row.
This gives essentially the output stated for df['X1'], except that rows with no qualifying pair get an empty list.
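Note that a[i:i+2] inside apply is a small Series slice; if you would rather have plain lists and NaN where no pair qualifies (as in the desired df['X1']), a variant of the same idea (restricting apply to the original columns) is:
import numpy as np
df['X1'] = df[['A','B','C','E','F']].apply(
    lambda a: [list(a[i:i+2]) for i in range(len(a)-2) if sum(a[i:i+2]) >= 10] or np.nan,
    axis=1)
print(df['X1'])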

Getting standard deviation on a specific number of dates

In this dataframe...
import pandas as pd
import numpy as np
import datetime
tf = 365
dt = datetime.datetime.now()-datetime.timedelta(days=365)
df = pd.DataFrame({
    'Cat': np.repeat(['a', 'b', 'c'], tf),
    'Date': np.tile(pd.date_range(dt, periods=tf), 3),
    'Val': np.random.rand(3*tf)
})
How can I get a dictionary of the standard deviation of 'Val' for each 'Cat', over a specific number of days back from the last day, for a large dataset?
This code gives the standard deviation for 10 days...
today = datetime.datetime.now()
{s: np.std(df[(df.Cat == s) &
              (df.Date > today - datetime.timedelta(days=10))].Val)
 for s in df.Cat.unique()}
...looks clunky.
Is there a better way?
First filter by boolean indexing and then aggregate std; note that pandas uses ddof=1 by default, so set ddof=0 to match np.std:
d1 = df[(df.Date>dt-datetime.timedelta(days=10))].groupby('Cat')['Val'].std(ddof=0).to_dict()
print (d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
Another solution is use custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
The difference between the solutions is that if no values in a group match the condition, the first solution drops that group, while the second assigns NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
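If 'back from the last day' means the last date present in the data rather than today, a small tweak of the same groupby pattern takes the cutoff from the frame itself:
cutoff = df.Date.max() - pd.Timedelta(days=10)
d3 = df[df.Date > cutoff].groupby('Cat')['Val'].std(ddof=0).to_dict()
print(d3)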

loop through data frame

I want to reverse the column order of my data frames using a for loop, but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
for df in [df1, df2]:
    df = df.loc[:, df.columns.tolist()[::-1]]
Then the order of columns of df1 and df2 is not changed.
You can make use of multiple assignment (unpacking) with a list comprehension, i.e.:
df1,df2 = [i.loc[:,i.columns[::-1]] for i in [df1,df2]]
print(df1)
b a
1 2 1
print(df2)
c
1 4
Note: In my answer I am trying to build up to show that using a dictionary to store the dataframes is the best way for the general case. If you are looking to mutate the original dataframe variables, @Bharath's answer is the way to go.
Answer:
The code doesn't work because rebinding df inside the loop doesn't change the original objects; you need to assign back into the list. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
    l[i] = df.loc[:, df.columns.tolist()[::-1]]
So the difference is that I iterate with enumerate to get the dataframe and its index in the list, then I assign the changed dataframe back to the original position in the list.
Execution details:
Before apply the change:
In [28]: for i in l:
...: print(i.head())
...:
a b
1 1 2
c
1 4
In [29]: for i, df in enumerate(l):
...: l[i] = df.loc[:, df.columns.tolist()[::-1]]
...:
After applying the change:
In [30]: for i in l:
...: print(i.head())
...:
b a
1 2 1
c
1 4
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d = {}
d['df1'] = pd.DataFrame({'a':1, 'b':2}, index=[1])
d['df2'] = pd.DataFrame({'c':3, 'c':4}, index=[1])
for i, df in d.items():
    d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes from the dictionary, for instance d['df1'].
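A quick usage check, printing one of the reversed frames from the dictionary:
print(d['df1'])
#    b  a
# 1  2  1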
You can reverse columns and values:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c': 4}, index=[1])
print('before')
print(df1)
for df in [df1, df2]:
    df.values[:,:] = df.values[:, ::-1]
    df.columns = df.columns[::-1]
print('after')
print(df1)
Output:
before
a b
1 1 2
after
b a
1 2 1