pandas: .equals should evaluate to True?

I have 2 pandas DataFrames that appear to be exactly the same. However, when I test using the .equals method I get False. Any idea what the potential inconsistency may be? Is there something I'm not checking?
print(df1.values.tolist()==df2.values.tolist())
print(df1.columns.tolist()==df2.columns.tolist())
print(df1.index.tolist()==df2.index.tolist())
print(df1.equals(df2))
# True
# True
# True
# False

One possibility is different datatypes that evaluate as equal in python-space, e.g.
df1 = pd.DataFrame({'a': [1, 2.0, 3]})
df2 = pd.DataFrame({'a': [1,2,3]})
df1.values.tolist() == df2.values.tolist()
Out[45]: True
df1.equals(df2)
Out[46]: False
To chase this down, you can use the assert_frame_equal function.
from pandas.testing import assert_frame_equal
assert_frame_equal(df1, df2)
AssertionError: Attributes are different
Attribute "dtype" are different
[left]: float64
[right]: int64
In versions of pandas before 0.20.1, the import is from pandas.util.testing import assert_frame_equal instead.
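If the assertion points at a dtype difference, comparing the two dtypes Series directly shows which columns are affected. The following is a minimal sketch, assuming both frames have the same column labels; casting with astype is only appropriate when the conversion loses no information.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2.0, 3]})   # column 'a' is float64
df2 = pd.DataFrame({'a': [1, 2, 3]})     # column 'a' is int64

# Columns whose dtypes differ (assumes identical column labels in both frames)
mismatched = df1.dtypes[df1.dtypes != df2.dtypes]
print(mismatched)

# Cast df1 to df2's dtypes, then compare again
df1_cast = df1.astype(df2.dtypes.to_dict())
print(df1_cast.equals(df2))
```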

Related

Is there a generalized dtype for all floating point types (float32, float64, etc.)?

Other than using a set of or statements
isinstance(x, np.float64) or isinstance(x, np.float32) or isinstance(x, np.float16)
Is there a cleaner way to check if a variable is a floating-point type?
You can use np.floating:
In [11]: isinstance(np.float16(1), np.floating)
Out[11]: True
In [12]: isinstance(np.float32(1), np.floating)
Out[12]: True
In [13]: isinstance(np.float64(1), np.floating)
Out[13]: True
Note: non-numpy types return False:
In [14]: isinstance(1, np.floating)
Out[14]: False
In [15]: isinstance(1.0, np.floating)
Out[15]: False
To include more types, e.g. Python floats, you can pass a tuple to isinstance:
In [16]: isinstance(1.0, (np.floating, float))
Out[16]: True
To check the numbers in a NumPy array, use the dtype's kind attribute, a character code for the general kind of data:
x = np.array([3.6, 0.3])
if x.dtype.kind == 'f':
    print('x is floating point')
The other kind codes are listed in the NumPy manual.
Edit: be careful when using isinstance and the is operator to determine the type of numbers.
import numpy as np
a = np.array([1.2, 1.3], dtype=np.float32)
print(isinstance(a.dtype, np.float32)) # False
print(isinstance(type(a[0]), np.float32)) # False
print(a.dtype is np.float32) # False
print(type(a[0]) is np.dtype(np.float32)) # False
print(isinstance(a[0], np.float32)) # True
print(type(a[0]) is np.float32) # True
print(a.dtype == np.float32) # True
A reliable way to check whether a NumPy array has a floating-point dtype, whether 16, 32 or 64 bits, is as follows:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(issubclass(a.dtype.type,np.floating))
a = np.random.rand(3).astype(np.float64)
print(issubclass(a.dtype.type,np.floating))
a = np.random.rand(3).astype(np.float16)
print(issubclass(a.dtype.type,np.floating))
In this case all will be True.
The common solution, however, gives the wrong result when applied to the array itself rather than to one of its elements, as shown below:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(isinstance(a,np.floating))
a = np.random.rand(3).astype(np.float64)
print(isinstance(a,np.floating))
a = np.random.rand(3).astype(np.float16)
print(isinstance(a,np.floating))
In this case all will be False.
The workaround is to test an element of the array instead:
import numpy as np
a = np.random.rand(3).astype(np.float32)
print(isinstance(a[0],np.floating))
a = np.random.rand(3).astype(np.float64)
print(isinstance(a[0],np.floating))
a = np.random.rand(3).astype(np.float16)
print(isinstance(a[0],np.floating))
Now all will be True
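Another option, not mentioned above, is np.issubdtype, which works on the array's dtype directly and so avoids indexing an element. A minimal sketch; the helper name is_float_array is hypothetical:

```python
import numpy as np

def is_float_array(arr):
    """Return True if arr has a floating-point dtype (hypothetical helper)."""
    return np.issubdtype(arr.dtype, np.floating)

# Three float dtypes and one integer dtype for contrast
for dt in (np.float16, np.float32, np.float64, np.int64):
    a = np.ones(3, dtype=dt)
    print(dt.__name__, is_float_array(a))
```

This also behaves sensibly for an empty array, where a[0] would raise an IndexError.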

Pandas: Check if Series of strings is in Series with list of strings

I'm looking for a way to check, row by row, whether the string in one pandas Series is contained in the list of strings in another Series.
Preferably a one-liner - I'm aware that I can solve this by looping over the rows and building up a new series.
Example:
import pandas as pd
df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},
])
Desired output would be
pd.Series([True, False])
because 'foo' is in ['foo', 'bar'], but 'bar' is not in ['foo']
What I've tried:
df['value'].isin(df['accepted_values']), but that gives me [False, False]
Thanks!
You can use apply with in:
df.apply(lambda r: r.value in r.accepted_values, axis=1)
0 True
1 False
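As an alternative to apply (a swapped-in technique, not part of the answer above), the two columns can be paired with zip in a list comprehension. This is only a sketch; it assumes every entry in accepted_values is a list or another container that supports in.

```python
import pandas as pd

df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},
])

# Pair each value with the list from the same row and test membership
result = pd.Series(
    [v in accepted for v, accepted in zip(df['value'], df['accepted_values'])],
    index=df.index,
)
print(result.tolist())  # [True, False]
```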

Parallel sampling and groupby in pandas

I have a large df (>=100k rows and 40 columns) that I am looking to repeatedly sample and group by. The code below works, but I was wondering if there is a way to speed it up by parallelising any part of the process.
The df can live in shared memory, and nothing gets changed in the df, just need to return 1 or more aggregates for each column.
import pandas as pd
import numpy as np
from tqdm import tqdm
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
data['variant'] = np.repeat(['A', 'B'],50)
samples_list = []
for i in tqdm(range(0,1000)):
    df = data.sample(
        frac=1, # take the same number of samples as there are rows
        replace=True, # allow the same row to be drawn multiple times
        random_state=i # set state to be i for reproducibility
    ).groupby(['variant']).agg(
        {
            'A': 'count',
            'B': [np.nanmean, np.sum, np.median, 'count'],
            'C': [np.nanmean, np.sum],
            'D': [np.sum]
        }
    )
    df['experiment'] = i
    samples_list.append(df)
# Convert to a df
samples = pd.concat(samples_list)
samples.head()
If you have enough memory, you can replicate the data first, then groupby and agg:
n_exp = 10
samples = (
    data.sample(n=len(data) * n_exp, replace=True, random_state=43)
    .assign(experiment=np.repeat(np.arange(n_exp), len(data)))
    .groupby(['variant', 'experiment'])
    .agg({
        'A': 'count',
        'B': [np.nanmean, np.sum, np.median, 'count'],
        'C': [np.nanmean, np.sum],
        'D': [np.sum],
    })
)
This is 4x faster with your sample data on my system.

Check similarity of 2 pandas dataframes

I am trying to compare 2 pandas dataframes in terms of column names and datatypes. With assert_frame_equal, I get an error because the shapes are different. Is there a way to ignore this? I could not find one in the documentation.
With df1_dict == df2_dict, it only says whether they are the same or not. I am trying to print any differences in feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datasets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.
It seems to me that if the two dataframe descriptions are outer joined, you would have all the information you want.
example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
    df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
    if pd.isnull(x.df1):
        return 'df1-missing'
    if pd.isnull(x.df2):
        return 'df2-missing'
    if x.df1 != x.df2:
        return 'type-mismatch'
    return 'ok'
diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
  index     df1      df2    diff_status
0     a   int64  float64  type-mismatch
1     b  object   object             ok
2     c     NaN    int64    df1-missing
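If a DataFrame is not needed for the report, the same three cases can be collected with plain dictionaries. This is a sketch only: the helper name dtype_differences and the keys of the returned dict are hypothetical, and the column labels are assumed to be sortable (e.g. all strings).

```python
import pandas as pd

def dtype_differences(left, right):
    """Hypothetical helper: report column and dtype differences between two frames."""
    left_types, right_types = dict(left.dtypes), dict(right.dtypes)
    return {
        # Columns present in only one of the frames
        'only_in_left': sorted(set(left_types) - set(right_types)),
        'only_in_right': sorted(set(right_types) - set(left_types)),
        # Shared columns whose dtypes differ, as (left dtype, right dtype)
        'type_mismatch': {
            col: (str(left_types[col]), str(right_types[col]))
            for col in left_types
            if col in right_types and left_types[col] != right_types[col]
        },
    }

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': list('abc'), 'c': [10, 20, 30]})
report = dtype_differences(df1, df2)
print(report)
```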

Finding entries containing a substring in a numpy array?

I tried to find entries in an Array containing a substring with np.where and an in condition:
import numpy as np
foo = "aa"
bar = np.array(["aaa", "aab", "aca"])
np.where(foo in bar)
this only returns an empty Array.
Why is that so?
And is there a good alternative solution?
We can use np.core.defchararray.find to find the position of foo string in each element of bar, which would return -1 if not found. Thus, it could be used to detect whether foo is present in each element or not by checking for -1 on the output from find. Finally, we would use np.flatnonzero to get the indices of matches. So, we would have an implementation, like so -
np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Sample run -
In [91]: bar
Out[91]:
array(['aaa', 'aab', 'aca'],
dtype='|S3')
In [92]: foo
Out[92]: 'aa'
In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[93]: array([0, 1])
In [94]: bar[2] = 'jaa'
In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
Out[95]: array([0, 1, 2])
Look at some examples of using in:
In [19]: bar = np.array(["aaa", "aab", "aca"])
In [20]: 'aa' in bar
Out[20]: False
In [21]: 'aaa' in bar
Out[21]: True
In [22]: 'aab' in bar
Out[22]: True
In [23]: 'aab' in list(bar)
It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but its behavior is probably simple.
In any case, note that in applied to a list does not check for substrings. The string's __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
As Divakar shows there is a collection of numpy functions that applies string methods to individual elements of an array.
In [42]: np.char.find(bar, 'aa')
Out[42]: array([ 0, 0, -1])
Docstring:
This module contains a set of functions for vectorized string
operations and methods.
The preferred alias for defchararray is numpy.char.
For operations like this I think the np.char speeds are about the same as with:
In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
Out[49]: array([0, 0, -1], dtype=object)
In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
Out[50]: array([True, True, False], dtype=object)
Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
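Putting the pieces together with the np.char alias, here is a minimal self-contained sketch that returns both the indices of the matching entries and the entries themselves. The variable names positions, mask, indices and matches are illustrative.

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# find returns the first index of foo in each element, or -1 if absent
positions = np.char.find(bar, foo)
mask = positions != -1

indices = np.flatnonzero(mask)   # positions of the matching entries
matches = bar[mask]              # the matching entries themselves
print(indices, matches)
```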
If using pandas is acceptable, the str.contains method can be used.
import numpy as np
entries = np.array(["aaa", "aab", "aca"])
import pandas as pd
pd.Series(entries).str.contains('aa') # <----
Results in:
0 True
1 True
2 False
dtype: bool
The method also accepts regular expressions for more complex patterns:
pd.Series(entries).str.contains(r'a.a')
Results in:
0 True
1 False
2 True
dtype: bool
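Because the pattern is treated as a regular expression by default, a search string containing characters such as "." or "+" may match more than intended. A short sketch of the difference, using the regex=False parameter of str.contains to force a literal substring match (the sample entries here are made up for illustration):

```python
import pandas as pd

entries = pd.Series(["a.a", "aba", "a+a"])

# Default: the pattern is a regular expression, so "." matches any character
as_regex = entries.str.contains("a.a")

# regex=False: the pattern is matched as a literal substring
as_literal = entries.str.contains("a.a", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```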
The way you are trying to use np.where is incorrect. The first argument of np.where should be a boolean array, and you are simply passing it a boolean.
foo in bar
>>> False
np.where(False)
>>> (array([], dtype=int32),)
np.where(np.array([True, True, False]))
>>> (array([0, 1], dtype=int32),)
The problem is that numpy does not define the in operator as an element-wise boolean operation.
One way you could accomplish what you want is with a list comprehension.
foo = 'aa'
bar = np.array(['aaa', 'aab', 'aca'])
out = [i for i, v in enumerate(bar) if foo in v]
# out = [0, 1]
bar = ['aca', 'bba', 'baa', 'aaf', 'ccc']
out = [i for i, v in enumerate(bar) if foo in v]
# out = [2, 3]
You can also build a boolean mask and index the array with it:
bar = np.array(bar)
mask = np.array([foo in x for x in bar])
filtered = bar[mask]