Using isnull() and groupby() on a pandas dataframe

Suppose I have a dataframe df with columns 'A', 'B', 'C'.
I would like to count the number of null values in column 'B' as grouped by 'A' and make a dictionary out of it:
Tried the following but it failed:
df.groupby('A')['B'].isnull().sum().to_dict()
Any help will be appreciated.

Setup
df = pd.DataFrame(dict(A=[1, 2] * 3, B=[1, 2, None, 4, None, None]))
df
A B
0 1 1.0
1 2 2.0
2 1 NaN
3 2 4.0
4 1 NaN
5 2 NaN
Option 1
df['B'].isnull().groupby(df['A']).sum().to_dict()
{1: 2.0, 2: 1.0}
Option 2
df.groupby('A')['B'].apply(lambda x: x.isnull().sum()).to_dict()
{1: 2, 2: 1}
Option 3
Getting creative
df.A[df.B.isnull()].value_counts().to_dict()
{1: 2, 2: 1}
Option 4
from collections import Counter
dict(Counter(df.A[df.B.isnull()]))
{1: 2, 2: 1}
Option 5
from collections import defaultdict
d = defaultdict(int)
for t in df.itertuples():
d[t.A] += pd.isnull(t.B)
dict(d)
{1: 2, 2: 1}
Option 6
Unnecessarily complicated
(lambda t: dict(zip(t[1], np.bincount(t[0]))))(df.A[df.B.isnull()].factorize())
{1: 2, 2: 1}
Option 7
df.groupby([df.B.isnull(), 'A']).size().loc[True].to_dict()
{1: 2, 2: 1}

Or use the difference between count and size (see the link):
(df.groupby('A')['B'].size()-df.groupby('A')['B'].count()).to_dict()
Out[119]: {1: 2, 2: 1}
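For reference, size counts all rows in each group (NaN included) while count counts only the non-null entries, so their difference is exactly the number of nulls per group. A quick sketch with the setup above:
g = df.groupby('A')['B']
g.size()   # rows per group: {1: 3, 2: 3}
g.count()  # non-null rows per group: {1: 1, 2: 2}
(g.size() - g.count()).to_dict()
# {1: 2, 2: 1}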


In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was deprecated and has been removed in favor of pd.Categorical.
There's also the pd.factorize function:
# use the df data from #herrfz
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
a b c
0 1 yes 0
1 2 no 1
2 3 yes 0
3 4 no 1
4 5 absent 2
Factor and Categorical are the same, as far as I know. I think it was initially called Factor, and then changed to Categorical. To convert to Categorical maybe you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
a b
0 1 yes
1 2 no
2 3 yes
3 4 no
4 5 absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
a b c
0 1 yes 2
1 2 no 1
2 3 yes 2
3 4 no 1
4 5 absent 0
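Note for current pandas: Categorical.from_array and the .labels attribute used above have since been removed. Assuming a recent pandas version, the equivalent of that snippet (a sketch of the same idea) is pd.Categorical(...).codes:
df['c'] = pd.Categorical(df['b']).codes  # .codes replaces the old .labels
This reproduces the sorted-category codes shown above (absent=0, no=1, yes=2), whereas pd.factorize(df.b)[0] numbers categories in order of appearance instead.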

Apply custom functions to groupby pandas

I know there are questions/answers about how to use custom functions for groupby in pandas, but my case is slightly different.
My data is
group_col val_col
0 a [1, 2, 34]
1 a [2, 4]
2 b [2, 3, 4, 5]
data = {'group_col': {0: 'a', 1: 'a', 2: 'b'}, 'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)
What I am trying to do is to group by group_col, then sum up the lengths of the lists in val_col for each group. My desired output is
a 5
b 4
I wonder if I can do this in pandas?
You can try either of the following:
df['val_col'].str.len().groupby(df['group_col']).sum()
# or, equivalently, concatenate the lists within each group first, then take the length
df.groupby('group_col')['val_col'].sum().str.len()
Output:
group_col
a 5
b 4
Name: val_col, dtype: int64
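An equivalent route (a sketch of the same idea) is to compute each list's length first and then aggregate, which avoids concatenating the lists:
df.assign(n=df['val_col'].str.len()).groupby('group_col')['n'].sum()
.str.len() works element-wise on a column of lists, so each row contributes its list's length to the group total.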

how to generate random numbers that sum to a specific value?

I have 2 dataframes as follows:
import pandas as pd
import numpy as np
# Create data set.
dataSet1 = {'id': ['A', 'B', 'C'],
'value' : [9,20,20]}
dataSet2 = {'id' : ['A', 'A','A','B','B','B','C'],
'id_2': [1, 2, 3, 2,3,4,1]}
# Create dataframe with data set and named columns.
df_map1 = pd.DataFrame(dataSet1, columns= ['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns= ['id','id_2'])
df_map1
id value
0 A 9
1 B 20
2 C 20
df_map2
id id_2
0 A 1
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 1
where several id_2 rows can share the same id (i.e., each id maps to a set of id_2 values).
#doing a quick merge, based on id.
df = df_map1.merge(df_map2 ,on=['id'])
id value id_2
0 A 9 1
1 A 9 2
2 A 9 3
3 B 20 2
4 B 20 3
5 B 20 4
6 C 20 1
I can represent the relationship between id and id_2 as follows:
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate lists of random integers, say 0 to 3, with 5 elements each (for example), put them into the pandas df, and explode.
import numpy as np
import random
df['random_value'] = df.apply(lambda _: np.random.randint(0,3, 5), axis=1)
id value id_2 random_value
0 A 9 1 [0, 0, 0, 0, 1]
1 A 9 2 [0, 2, 1, 2, 1]
2 A 9 3 [0, 1, 2, 2, 1]
3 B 20 2 [2, 1, 1, 2, 2]
4 B 20 3 [0, 0, 0, 0, 0]
5 B 20 4 [1, 0, 0, 1, 0]
6 C 20 1 [1, 2, 2, 2, 1]
The condition for generating these random_value lists is that, per id, the grand total across all of its lists has to equal that id's value.
That means for id A: if we sum all the elements across its lists above, we get a total of 13, but what we want is 9,
and the same concept applies for ids B, C, and so on.
Is there any way to achieve this?
# I was looking into multinomial from np.random... it seems like this should be the solution, but I'm not sure how to apply it with pandas.
np.random.multinomial(9, np.ones(5)/5, size = 1)[0]
=> array([2,3,3,0,1])
2+3+3+0+1 = 9
ATTEMPT/IDEA ...
Given the list of id_2 values: id A has 3 distinct elements, [1, 2, 3],
so id A maps to 3 different rows, and we can draw
3 * 5 = 15 numbers (which will be our long list), where
3: number of rows for the id
5: elements per list
hence
list_A = np.random.multinomial(9, np.ones(3*5)/(3*5), size=1)[0]
and then we evenly split the long list into chunks of n,
using this list comprehension:
[list_A[i:i + n] for i in range(0, len(list_A), n)]
but I am still unsure how to do this dynamically.
The core idea is as you said (about drawing 3*5 = 15 numbers), plus reshaping the draw into a 2D array with the same number of rows as that id has in the dataframe. The following function does that:
def generate_random_numbers(df):
    value = df['value'].iloc[0]       # target sum for this id
    list_len = 5                      # elements per list
    num_rows = len(df)                # rows occupied by this id
    num_rand = list_len * num_rows    # total numbers to draw
    # one multinomial draw summing to `value`, reshaped into one list per row
    return pd.Series(
        map(list, np.random.multinomial(value, np.ones(num_rand) / num_rand).reshape(num_rows, -1)),
        df.index
    )
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)
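As a sanity check, each id's lists should add up to that id's value; a small verification sketch:
df.groupby('id')['random_value'].apply(lambda s: sum(map(sum, s))).to_dict()
# expected: {'A': 9, 'B': 20, 'C': 20}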

pandas - groupby elements by column repeat pattern

I would like to group a dataframe by the column's appearance pattern (elements within a group may come in a different order, but no element repeats inside a group).
For example, below, rows (0, 1, 2) of column x form one group and rows (3, 4, 5) another. The elements of each group may appear in a different order, but no element is repeated within a group.
import pandas as pd
df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'c', 'b', 'a'],
    'y': [1, 2, 3, 4, 3, 1]})
print(df)
   x  y
0  a  1
1  b  2
2  c  3
3  c  4
4  b  3
5  a  1
Try cumcount; its output gives you the group number:
df.groupby('x').cumcount()
Out[81]:
0 0
1 0
2 0
3 1
4 1
5 1
dtype: int64
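If you then want to operate on those groups rather than just label them, the cumcount result can be passed straight back to groupby as the key (a sketch):
pattern = df.groupby('x').cumcount()
df.groupby(pattern)['y'].sum()
# 0    6
# 1    8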

fastest way to get value frequency stored in dictionary format in groupby pandas

To calculate the frequency of each value by id, we can use value_counts and groupby.
>>> df = pd.DataFrame({"id":[1,1,1,2,2,2], "col":['a','a','b','a','b','b']})
>>> df
id col
0 1 a
1 1 a
2 1 b
3 2 a
4 2 b
5 2 b
>>> df.groupby('id')['col'].value_counts()
id col
1 a 2
b 1
2 b 2
a 1
But I would like to get the result stored in dictionary format, not a Series. How can I achieve that, while keeping it fast for a large dataset?
The ideal format is:
id
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
You can unstack the groupby result to get a dict-of-dicts:
df.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
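If a given id never sees some value, unstack leaves a NaN there and the counts become floats; passing fill_value=0 keeps them integral (a sketch), at the cost of explicit zero entries in the resulting dicts:
df.groupby('id')['col'].value_counts().unstack(fill_value=0).to_dict(orient='index')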
If you want a Series of dicts, use agg instead of to_dict:
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict)
col
a {1: 2, 2: 1}
b {1: 1, 2: 2}
dtype: object
I don't recommend storing data in this format; object columns are generally more troublesome to work with.
If unstacking generates NaNs, try an alternative with GroupBy.agg:
df.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
id
1 {'a': 2, 'b': 1}
2 {'b': 2, 'a': 1}
Name: col, dtype: object
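Since speed is a concern, a variant that avoids calling a Python lambda once per group is to compute all the counts in a single pass and split afterwards (a sketch; worth benchmarking on your own data):
counts = df.groupby(['id', 'col']).size()
pd.Series({i: g.droplevel(0).to_dict() for i, g in counts.groupby(level=0)})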
We can use pd.crosstab:
pd.Series(pd.crosstab(df.id, df.col).to_dict('index'))
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
dtype: object