Let me generate a DataFrame:
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 1]])
then I compare
df[0].equals(df[1].sort_values())
this gives False.
However, both df[0] and df[1].sort_values() give the same output:
0 1
1 2
Name: 0, dtype: int64
Why does equals give False? What is wrong?
The index values are in a different order, so if you make them the same, e.g. here with Series.reset_index and drop=True, it works like you expected:
a = df[0].equals(df[1].sort_values().reset_index(drop=True))
print (a)
True
Details:
print (df[0])
0 1
1 2
Name: 0, dtype: int64
print (df[1].sort_values())
1 1
0 2
Name: 1, dtype: int64
print (df[1].sort_values().reset_index(drop=True))
0 1
1 2
Name: 1, dtype: int64
You can also directly access the Series values (assuming import numpy as np):
np.equal(df[0].values, df[1].sort_values().values)
array([ True, True])
np.equal(df[0].values, df[1].sort_values().values).all()
True
np.array_equal(df[0], df[1].sort_values())
True
As far as performance is concerned, the second and third approaches are equivalent, while df[0].equals(df[1].sort_values().reset_index(drop=True)) is about 1.5x slower.
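If you want to check the timings yourself, here is a minimal sketch; the 10,000-row DataFrame is an arbitrary assumption and absolute numbers will vary by machine:
import timeit

import numpy as np
import pandas as pd

# two random integer columns; the size is an arbitrary choice
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 2)))

# time each of the three approaches from above
print (timeit.timeit(lambda: df[0].equals(df[1].sort_values().reset_index(drop=True)), number=100))
print (timeit.timeit(lambda: np.equal(df[0].values, df[1].sort_values().values).all(), number=100))
print (timeit.timeit(lambda: np.array_equal(df[0], df[1].sort_values()), number=100))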
In pandas we can drop columns/rows with .dropna(how=..., axis=...), but is there a way to get an array-like of True/False indicators for each column/row, which would indicate whether a column/row contains NA according to the how and axis arguments?
I.e. is there a way to turn .dropna(how=..., axis=...) into a method which, instead of actually removing anything, would just tell us which columns/rows would be removed if we called .dropna(...) with the specific how and axis.
Thank you for your time!
You can use isna() to replicate the behaviour of dropna() without actually removing data. To mimic the how and axis parameters, you can add any() or all() and set the axis accordingly.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
0 1 2
0 True True False
1 True True True
Eq. to dropna(how='any', axis=1), i.e. it marks the columns that would be dropped:
df.isna().any(axis=0)
Output:
0 True
1 True
2 True
dtype: bool
Eq. to dropna(how='any', axis=0), i.e. it marks the rows that would be dropped:
df.isna().any(axis=1)
Output:
0 True
1 True
dtype: bool
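For how='all', swap any() for all(); e.g. this marks the rows that dropna(how='all', axis=0) would remove:
df.isna().all(axis=1)
Output:
0    False
1     True
dtype: bool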
df = pd.DataFrame({"A" : ["1", "7.0", "xyz"]})
type(df.A[0])
the result is "str".
df.A = df.A.astype(int, errors = "ignore")
type(df.A[0])
the result is also "str". I want to convert "1" and "7.0" to 1 and 7.
Where did I go wrong?
Why doesn't astype change the type of the values?
Because errors="ignore" works differently than you think.
If the conversion fails, it returns the same values, so nothing changes.
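You can see the underlying failure by dropping errors="ignore"; the cast raises as soon as it hits a string that int() cannot parse (the exact message may vary by version):
df.A.astype(int)
#ValueError: invalid literal for int() with base 10: '7.0'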
If you want numeric values, with NaN where the conversion fails:
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
print (df)
A
0 1
1 7
2 <NA>
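The intermediate pd.to_numeric step returns floats, because NaN forces a float dtype; that is why the nullable integer astype('Int64') (available since pandas 0.24) is chained on top. On a fresh copy of the original string column:
print (pd.to_numeric(pd.Series(["1", "7.0", "xyz"]), errors='coerce'))
0    1.0
1    7.0
2    NaN
dtype: float64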
For mixed values (numbers together with strings):
def num(x):
    try:
        # go through float first, so strings like "7.0" also convert
        return int(float(x))
    except (ValueError, TypeError):
        return x
df['A'] = df['A'].apply(num)
print (df)
     A
0    1
1    7
2  xyz
Suppose such a Series
In [8]: arr = pd.Series(['testing', 'the', 'masking'])

In [9]: arr
Out[9]:
0 testing
1 the
2 masking
dtype: object
Masking is handy
In [10]: arr == 'testing'
Out[10]:
0 True
1 False
2 False
dtype: bool
If we check whether 't' is in the individual strings, explicit iteration is needed:
In [11]: [u for u in arr if 't' in u]
Out[11]: ['testing', 'the']
Is it possible to get it done with something like
arr contains 't'
It is possible with str.contains:
arr[arr.str.contains('t')]
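str.contains returns the boolean mask itself, so you can inspect it before using it for indexing:
In [12]: arr.str.contains('t')
Out[12]:
0     True
1     True
2    False
dtype: bool

In [13]: arr[arr.str.contains('t')]
Out[13]:
0    testing
1        the
dtype: object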
Assume running pandas' dataframe['prod_code'].value_counts() and storing the result as df. The operation outputs:
125011 90300
762 72816
None 55512
7156 14892
75162 8825
How would I extract the count for None? I'd expect the result to be 55512.
I've tried
>>> df.loc[df.index.isin(['None'])]
Series([], Name: prod_code, dtype: int64)
and also
>>> df.loc['None']
KeyError: 'the label [None] is not in the [index]'
It seems you need None, not the string 'None':
df.loc[df.index.isin([None])]
df.loc[None]
EDIT:
If you need to check where NaN is in the index:
print (s1.loc[np.nan])
#or
print (s1[pd.isnull(s1.index)])
Sample:
import numpy as np
import pandas as pd

s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
s1 = s.value_counts(dropna=False)
print (s1)
8825 3
90300 2
NaN 2
dtype: int64
print (s1[pd.isnull(s1.index)])
NaN 2
dtype: int64
print (s1.loc[np.nan])
2
print (s1.loc[None])
2
EDIT1:
For stripping whitespace:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
print (s)
0 90300
1 90300
2 8825
3 8825
4 8825
5 None
6 NaN
dtype: object
s1 = s.value_counts()
print (s1)
8825 3
90300 2
None 1
dtype: int64
s1.index = s1.index.str.strip()
print (s1.loc['None'])
1
A couple of things:
pd.Series([None] * 2 + [1] * 3).value_counts() automatically drops the None.
pd.Series([None] * 2 + [1] * 3).value_counts(dropna=False) converts the None to np.NaN
That tells me that your None is a string. But since df.loc['None'] didn't work, I suspect your string has white space around it.
Try:
df.filter(regex='None', axis=0)
Or:
df.index = df.index.to_series().str.strip().combine_first(df.index.to_series())
df.loc['None']
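A quick sketch to reproduce the whitespace hypothesis (the trailing space in 'None ' is an assumption about your data):
df = pd.Series(['None ', 'None ', '762']).value_counts()
df.filter(regex='None', axis=0)  # matches the padded 'None ' label -> count 2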
All that said, I was curious how to reference np.NaN in the index
s = pd.Series([1, 2], [0, np.nan])
s.iloc[s.index.get_loc(np.nan)]
2
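An alternative sketch for the same lookup, using the boolean mask from Index.isna:
s[s.index.isna()]
NaN    2
dtype: int64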
Suppose I have a Series with NaNs:
x = pd.Series([0, 1, None, 1])
I want to transform this to be equal to:
pd.Series([False, True, None, True])
You'd think x == 1 would suffice, but instead, this returns:
pd.Series([False, True, False, True])
where the null value has become False. This is because np.nan == 1 returns False, rather than None or np.nan as in R.
Is there a nice, vectorized way to get what I want?
Maybe map can do it:
import pandas as pd
x = pd.Series([0, 1, None, 1])
print(x.map({1: True, 0: False}))
0 False
1 True
2 NaN
3 True
dtype: object
You can use where:
In [11]: (x == 1).where(x.notnull(), np.nan)
Out[11]:
0    0.0
1    1.0
2    NaN
3    1.0
dtype: float64
Note: the True and False have been cast to float as 0.0 and 1.0.
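On newer pandas (1.0+) you can keep a true missing value instead of the float cast by opting into the nullable extension dtypes; note that dtype="Int64" is an explicit choice here, not what pd.Series infers by default:
x = pd.Series([0, 1, None, 1], dtype="Int64")
x == 1
0    False
1     True
2     <NA>
3     True
dtype: boolean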