Return count for specific value in pandas .value_counts()? - pandas

Assume I run pandas' dataframe['prod_code'].value_counts() and store the result as 'df'. The operation outputs:
125011 90300
762 72816
None 55512
7156 14892
75162 8825
How would I extract the count for None? I'd expect the result to be 55512.
I've tried
>>> df.loc[df.index.isin(['None'])]
>>> Series([], Name: prod_code, dtype: int64)
and also
>>> df.loc['None']
>>> KeyError: 'the label [None] is not in the [index]'

It seems you need None, not the string 'None':
df.loc[df.index.isin([None])]
df.loc[None]
EDIT:
If you need to check for NaN in the index:
print (s1.loc[np.nan])
#or
print (df[pd.isnull(df.index)])
Sample:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
s1 = s.value_counts(dropna=False)
print (s1)
8825 3
90300 2
NaN 2
dtype: int64
print (s1[pd.isnull(s1.index)])
NaN 2
dtype: int64
print (s1.loc[np.nan])
2
print (s1.loc[None])
2
EDIT1:
For stripping whitespace:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
print (s)
0 90300
1 90300
2 8825
3 8825
4 8825
5 None
6 NaN
dtype: object
s1 = s.value_counts()
print (s1)
8825 3
90300 2
None 1
dtype: int64
s1.index = s1.index.str.strip()
print (s1.loc['None'])
1
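If both issues appear at once (padded string labels and real NaN), a small sketch combining the two ideas, keeping NaN in the counts and stripping only the string labels, could look like this:
import numpy as np
import pandas as pd

s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])

# Keep NaN in the counts, then strip whitespace from string labels only.
s1 = s.value_counts(dropna=False)
s1.index = s1.index.map(lambda x: x.strip() if isinstance(x, str) else x)

print (s1.loc['None'])
# 1
print (s1.loc[np.nan])
# 1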

A couple of things:
pd.Series([None] * 2 + [1] * 3).value_counts() automatically drops the None.
pd.Series([None] * 2 + [1] * 3).value_counts(dropna=False) converts the None to np.NaN
That tells me that your None is a string. But since df.loc['None'] didn't work, I suspect your string has white space around it.
Try:
df.filter(regex='None', axis=0)
Or:
df.index = df.index.to_series().str.strip().combine_first(df.index.to_series())
df.loc['None']
All that said, I was curious how to reference np.NaN in the index
s = pd.Series([1, 2], [0, np.nan])
s.iloc[s.index.get_loc(np.nan)]
2
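If it is unclear whether the odd label is a real NaN or a padded string such as 'None ', one quick diagnostic (just a sketch, with a made-up Series mimicking the df from the question) is to print the repr of every index label:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the value_counts() result in the question.
df = pd.Series([90300, 55512, 8825], index=['125011', 'None ', np.nan], name='prod_code')

# repr makes padded strings and real NaN easy to tell apart.
print ([repr(x) for x in df.index])
# ["'125011'", "'None '", 'nan']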

Related

Update categories in two Series / Columns for comparison

If I try to compare two Series with different categories I get an error:
a = pd.Categorical([1, 2, 3])
b = pd.Categorical([4, 5, 3])
df = pd.DataFrame([a, b], columns=['a', 'b'])
a b
0 1 4
1 2 5
2 3 3
df.a == df.b
# TypeError: Categoricals can only be compared if 'categories' are the same.
What is the best way to update categories in both Series? Thank you!
My solution:
df['b'] = df.b.cat.add_categories(df.a.cat.categories.difference(df.b.cat.categories))
df['a'] = df.a.cat.add_categories(df.b.cat.categories.difference(df.a.cat.categories))
df.a == df.b
Output:
0 False
1 False
2 True
dtype: bool
One idea with union_categoricals:
from pandas.api.types import union_categoricals
union = union_categoricals([df.a, df.b]).categories
df['a'] = df.a.cat.set_categories(union)
df['b'] = df.b.cat.set_categories(union)
print (df.a == df.b)
0 False
1 False
2 True
dtype: bool
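A variation on the same approach, assuming the union of categories computed above, is to cast both columns to one shared CategoricalDtype (just a sketch):
from pandas.api.types import CategoricalDtype, union_categoricals

union = union_categoricals([df.a, df.b]).categories
shared = CategoricalDtype(categories=union)

# With an identical dtype on both columns, elementwise comparison works.
df['a'] = df['a'].astype(shared)
df['b'] = df['b'].astype(shared)
print (df.a == df.b)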

Why does astype not change the type of values?

df = pd.DataFrame({"A" : ["1", "7.0", "xyz"]})
type(df.A[0])
the result is "str".
df.A = df.A.astype(int, errors = "ignore")
type(df.A[0])
the result is also "str". I want to convert "1" and "7.0" to 1 and 7.
Where did I go wrong?
Why does astype not change the type of values?
Because errors="ignore" works differently than you think: if the conversion fails, it returns the original values unchanged, so nothing changes.
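For example, a small demo of that fallback using the frame from the question:
import pandas as pd

df = pd.DataFrame({"A": ["1", "7.0", "xyz"]})

# "xyz" cannot be converted, so astype gives up and returns the
# original object column untouched.
out = df.A.astype(int, errors="ignore")
print (out.dtype)
# object
print (out.tolist())
# ['1', '7.0', 'xyz']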
If you want numeric values, with NaN where conversion fails:
df['A'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
print (df)
A
0 1
1 7
2 <NA>
For mixed values - numbers with strings:
def num(x):
    try:
        return int(float(x))
    except (TypeError, ValueError):
        return x

df['A'] = df['A'].apply(num)
print (df)
     A
0    1
1    7
2  xyz

How do you strip out only the integers of a column in pandas?

I am trying to strip out only the numeric values, which are the first one or two digits. Some values in the column contain pure strings and others contain special characters.
I have tried multiple methods:
breaks['_Size'] = breaks['Size'].fillna(0)
breaks[breaks['_Size'].astype(str).str.isdigit()]
breaks['_Size'] = breaks['_Size'].replace('\*','',regex=True).astype(float)
breaks['_Size'] = breaks['_Size'].str.extract('(\d+)').astype(int)
breaks['_Size'].map(lambda x: x.rstrip('aAbBcC'))
None of them are working. The dtype is object. To be clear, I am attempting to make a new column with only the digits (as an int/float), and if I could convert the fraction to a decimal, that would be a bonus.
This works for dividing fractions and also allows extra numbers to be present in the string (it returns just the first sequence of numbers):
In [60]: import pandas as pd
In [61]: import re
In [62]: df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
In [63]: df
Out[63]:
_Size
0 0
1 6''
2 7"
3 8in
4 text
5 3/4"
6 1a3
In [64]: def cleaning_function(row):
...: row = str(row)
...: fractions = re.findall(r'(\d+)/(\d+)', row)
...: if fractions:
...: return float(int(fractions[0][0])/int(fractions[0][1]))
...: numbers = re.findall(r'[0-9]+', str(row))
...: if numbers:
...: return numbers[0]
...: return 0
...:
In [65]: df._Size.apply(cleaning_function)
Out[65]:
0 0
1 6
2 7
3 8
4 0
5 0.75
6 1
Name: _Size, dtype: object
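If you prefer to stay vectorized, a rough equivalent of the function above (a sketch assuming the same sample frame) extracts an optional fraction with str.extract and divides where a denominator exists:
import pandas as pd

df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])

# Group 0 is the first number, group 1 the optional denominator.
parts = df['_Size'].astype(str).str.extract(r'(\d+)(?:/(\d+))?')
num = pd.to_numeric(parts[0], errors='coerce')
den = pd.to_numeric(parts[1], errors='coerce')

# Divide when a denominator exists, otherwise keep the plain number;
# rows with no digits at all fall back to 0, like the apply version.
size = num.where(den.isna(), num / den).fillna(0)
print (size)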

Copy values from one column in a pandas dataframe only when target field is blank in target dataframe

I have 2 dataframes of equal length. The source has one column, ML_PREDICTION, that I want to copy to the target dataframe, which already has some values that I don't want to overwrite.
#Select only blank values in target dataframe
mask = br_df['RECOMMENDED_ACTION'] == ''
# Attempt 1 - Results: KeyError: "['Retain' 'Retain' '' ... '' '' 'Retain'] not in index"
br_df.loc[br_df['RECOMMENDED_ACTION'][mask]] = ML_df['ML_PREDICTION'][mask]
br_df.loc['REASON_CODE'][mask] = 'ML01'
br_df.loc['COMMENT'][mask] = 'Automated Prediction'
# Attempt 2 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'].where(mask, other=ML_df['ML_PREDICTION'], inplace=True)
br_df['REASON_CODE'].where(mask, other='ML01', inplace=True)
br_df['COMMENT'].where(mask, other='Automated Prediction', inplace=True)
# Attempt 3 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'] = [x for x in ML_df['ML_PREDICTION'] if [mask] ]
br_df['REASON_CODE'] = ['ML01' for x in ML_df['ML_PREDICTION'] if [mask]]
br_df['COMMENT'] = ['Automated Prediction' for x in ML_df['ML_PREDICTION'] if [mask]]
Attempt 4 - Results: Values in target (br_df) were unchanged
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'REASON_CODE'] = 'ML01'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'COMMENT'] = 'Automated Prediction'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'RECOMMENDED_ACTION'] = ML_df['ML_PREDICTION']
Attempt 5
#Dipanjan
# Before - br_df['REASON_CODE'].value_counts()
BR03 10
BR01 8
Name: REASON_CODE, dtype: int64
#Attempt 5
br_df.loc['REASON_CODE'] = br_df['REASON_CODE'].fillna('ML01')
br_df.loc['COMMENT'] = br_df['COMMENT'].fillna('Automated Prediction')
br_df.loc['RECOMMENDED_ACTION'] = br_df['RECOMMENDED_ACTION'].fillna(ML_df['ML_PREDICTION'])
# After -- print(br_df['REASON_CODE'].value_counts())
BR03 10
BR01 8
ML01 2
Automated Prediction 1
Name: REASON_CODE, dtype: int64
#WTF? -- br_df[br_df['REASON_CODE'] == 'Automated Prediction']
PERSON_STATUS ... RECOMMENDED_ACTION REASON_CODE COMMENT
COMMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Automated Prediction Automated Prediction Automated Prediction
What am I missing here?
Use one of the options below:
df.loc[df['A'].isnull(), 'A'] = df['B']
or
df['A'] = df['A'].fillna(df['B'])
import numpy as np
df_a = pd.DataFrame([0,1,np.nan])
df_b = pd.DataFrame([0,np.nan,2])
df_a
0
0 0.0
1 1.0
2 NaN
df_b
0
0 0.0
1 NaN
2 2.0
df_a[0] = df_a[0].fillna(df_b[0])
Final output:
df_a
0
0 0.0
1 1.0
2 2.0
Ultimately, this is the syntax that appears to solve my problem:
mask = mask[:len(br_df)] # create the boolean index
br_df = br_df[:len(mask)] # make sure they are the same length
br_df['RECOMMENDED_ACTION'].loc[mask] = ML_df['ML_PREDICTION'].loc[mask]
br_df['REASON_CODE'].loc[mask] = 'ML01'
br_df['COMMENT'].loc[mask] = 'Automated Prediction'
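For reference, the same fill can be written with .loc on the frame itself, which avoids the chained indexing above (a sketch using the column names from the question, and assuming br_df and ML_df share the same index):
mask = br_df['RECOMMENDED_ACTION'] == ''

br_df.loc[mask, 'RECOMMENDED_ACTION'] = ML_df.loc[mask, 'ML_PREDICTION']
br_df.loc[mask, 'REASON_CODE'] = 'ML01'
br_df.loc[mask, 'COMMENT'] = 'Automated Prediction'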

After rename column get keyerror

I have df:
df = pd.DataFrame({'a':[7,8,9],
'b':[1,3,5],
'c':[5,3,6]})
print (df)
a b c
0 7 1 5
1 8 3 3
2 9 5 6
Then I rename the first column name like this:
df.columns.values[0] = 'f'
All seems very nice:
print (df)
f b c
0 7 1 5
1 8 3 3
2 9 5 6
print (df.columns)
Index(['f', 'b', 'c'], dtype='object')
print (df.columns.values)
['f' 'b' 'c']
If I select b, it works fine:
print (df['b'])
0 1
1 3
2 5
Name: b, dtype: int64
But if I select a, it returns column f:
print (df['a'])
0 7
1 8
2 9
Name: f, dtype: int64
And if I select f, I get a KeyError.
print (df['f'])
#KeyError: 'f'
print (df.info())
#KeyError: 'f'
What is the problem? Can somebody explain it? Or is it a bug?
You aren't expected to alter the values attribute.
Try df.columns.values = ['a', 'b', 'c'] and you get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-e7e440adc404> in <module>()
----> 1 df.columns.values = ['a', 'b', 'c']
AttributeError: can't set attribute
That's because pandas detects that you are trying to set the attribute and stops you.
However, it can't stop you from changing the underlying values object itself.
When you use rename, pandas follows up with a bunch of clean up stuff. I've pasted the source below.
Ultimately, what you've done is alter the values without initiating the clean-up. You can initiate it yourself with a follow-up call to _data.rename_axis (an example can be seen in the source below). This forces the clean-up to run, and then you can access ['f']:
df._data = df._data.rename_axis(lambda x: x, 0, True)
df['f']
0 7
1 8
2 9
Name: f, dtype: int64
Moral of the story: probably not a great idea to rename a column this way.
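The supported route is DataFrame.rename (or replacing df.columns wholesale), which runs the follow-up bookkeeping for you; a quick sketch:
df = pd.DataFrame({'a': [7, 8, 9],
                   'b': [1, 3, 5],
                   'c': [5, 3, 6]})

# Either of these keeps the internal state consistent.
df = df.rename(columns={'a': 'f'})
# df.columns = ['f', 'b', 'c']

print (df['f'])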
but this story gets weirder
This is fine
df = pd.DataFrame({'a':[7,8,9],
'b':[1,3,5],
'c':[5,3,6]})
df.columns.values[0] = 'f'
df['f']
0 7
1 8
2 9
Name: f, dtype: int64
This is not fine
df = pd.DataFrame({'a':[7,8,9],
'b':[1,3,5],
'c':[5,3,6]})
print(df)
df.columns.values[0] = 'f'
df['f']
KeyError:
Turns out, we can modify the values attribute prior to displaying df and it will apparently run all the initialization upon the first display. If you display it prior to changing the values attribute, it will error out.
weirder still
df = pd.DataFrame({'a':[7,8,9],
'b':[1,3,5],
'c':[5,3,6]})
print(df)
df.columns.values[0] = 'f'
df['f'] = 1
df['f']
f f
0 7 1
1 8 1
2 9 1
As if we didn't already know that this was a bad idea...
source for rename
def rename(self, *args, **kwargs):
    axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
    copy = kwargs.pop('copy', True)
    inplace = kwargs.pop('inplace', False)

    if kwargs:
        raise TypeError('rename() got an unexpected keyword '
                        'argument "{0}"'.format(list(kwargs.keys())[0]))

    if com._count_not_none(*axes.values()) == 0:
        raise TypeError('must pass an index to rename')

    # renamer function if passed a dict
    def _get_rename_function(mapper):
        if isinstance(mapper, (dict, ABCSeries)):
            def f(x):
                if x in mapper:
                    return mapper[x]
                else:
                    return x
        else:
            f = mapper
        return f

    self._consolidate_inplace()
    result = self if inplace else self.copy(deep=copy)

    # start in the axis order to eliminate too many copies
    for axis in lrange(self._AXIS_LEN):
        v = axes.get(self._AXIS_NAMES[axis])
        if v is None:
            continue
        f = _get_rename_function(v)

        baxis = self._get_block_manager_axis(axis)
        result._data = result._data.rename_axis(f, axis=baxis, copy=copy)
        result._clear_item_cache()

    if inplace:
        self._update_inplace(result._data)
    else:
        return result.__finalize__(self)