Multi-level groupby sub-population percentages - pandas

Let's consider the following dataframe:
import pandas as pd

d = {'Location': ['A','A','B','B','C','C','A','C','A'],
     'Gender': ['M','M','F','M','M','F','M','M','M'],
     'Edu': ['N','N','Y','Y','Y','N','Y','Y','Y'],
     'Access1': [1,0,1,0,1,0,1,1,1], 'Access2': [1,1,1,0,0,1,0,0,1]}
df = pd.DataFrame(data=d)
Output from dataframe:
Access1 Access2 Edu Gender Location
0 1 1 N M A
1 0 1 N M A
2 1 1 Y F B
3 0 0 Y M B
4 1 0 Y M C
5 0 1 N F C
6 1 0 Y M A
7 1 0 Y M C
8 1 1 Y M A
Then I am using groupby to analyse the frequencies in df
D0 = df.groupby(['Location','Gender','Edu']).sum()
((D0 / D0.groupby(level=[0]).transform('sum')) * 100).round(3).astype(str) + '%'
Output:
Access1 Access2
Location Gender Edu
A M N 33.333% 66.667%
Y 66.667% 33.333%
B F Y 100.0% 100.0%
M Y 0.0% 0.0%
C F N 0.0% 100.0%
M Y 100.0% 0.0%
From this output, I infer that the 33.333% for uneducated men in location A under Access1 comes from considering the 3 people in location A who have access to service 1, of whom 1 is an uneducated man (= 1/3).
Yet I wish to get a different output. I would like to consider the total of 4 men in location A as my 100%. 50% of this group of men are uneducated, and out of that 50%, those with access to service 1 represent 25% of the location's total. So the percentage I would like to see in the table is 25% (uneducated men in area A accessing service 1, out of everyone in area A). Is groupby the right way to get there, and what would be the best way to measure the % of access to service 1 while disaggregating from the total population of reference per location?

I believe you need to divide D0 by the per-location counts, i.e. the first level of the MultiIndex mapped by a Series:
D0=df.groupby(['Location','Gender','Edu']).sum()
a = df['Location'].value_counts()
#alternative
#a = df.groupby(['Location']).size()
print (a)
A 4
C 3
B 2
Name: Location, dtype: int64
df1 = D0.div(D0.index.get_level_values(0).map(a.get), axis=0)
print (df1)
Access1 Access2
Location Gender Edu
A M N 0.250000 0.500000
Y 0.500000 0.250000
B F Y 0.500000 0.500000
M Y 0.000000 0.000000
C F N 0.000000 0.333333
M Y 0.666667 0.000000
Detail:
print (D0.index.get_level_values(0).map(a.get))
Int64Index([4, 4, 2, 2, 3, 3], dtype='int64', name='Location')
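If you want the shares formatted as percentage strings, matching the style of the question's output, a minimal sketch reusing df1 from above:
print((df1 * 100).round(3).astype(str) + '%')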

Related

How to *multiply* (for lack of a better term) two dataframes [duplicate]

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames;
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', axis=1))
cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some well-known NumPy implementations of the 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
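For instance, on two tiny arrays (a small illustration added here, not part of the original post):
cartesian_product(np.array([1, 2]), np.array([10, 20]))
# array([[ 1, 10],
#        [ 1, 20],
#        [ 2, 10],
#        [ 2, 20]])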
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to select rows from each DataFrame positionally, and stack the results column-wise:
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))
cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))
cartesian_product_multi(*[left, right, left]).head()
   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  Y  30  A  1
4  A  1  Y  30  B  2
Further Simplification
A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
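To make the broadcasting step concrete, here is a tiny aside (not from the original answer) showing what np.ogrid and np.broadcast_arrays produce for la=2, lb=3:
ia, ib = np.ogrid[:2, :3]               # shapes (2, 1) and (1, 3)
ia2, ib2 = np.broadcast_arrays(ia, ib)
print(ia2.ravel())                       # [0 0 0 1 1 1] -> each left row repeated
print(ib2.ravel())                       # [0 1 2 0 1 2] -> the right rows tiled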
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices shows the NumPy-based implementations pulling ahead of the merge-based baseline as the inputs grow (the log-log plot is produced by the timing script below).
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *
* you are here
Since pandas 1.2.0, merge has the option how='cross':
left.merge(right, how='cross')
Using itertools.product and rebuilding the values into a DataFrame:
import itertools

l = list(itertools.product(left.values.tolist(), right.values.tolist()))
# each element of l is a pair of row-lists; sum(x, []) concatenates them into one row
pd.DataFrame(list(map(lambda x: sum(x, []), l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Here's an approach with a triple concat:
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
One option is with expand_grid from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'left':left, 'right':right}
jn.expand_grid(others = others)
left right
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:
left['dummy'] = 'a'
right['dummy'] = 'a'
cartesian = left.merge(right, how='inner', on='dummy')
del cartesian['dummy']
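A variant of the same idea that avoids mutating left and right (a sketch using assign for the temporary column):
cartesian = (left.assign(dummy='a')
                 .merge(right.assign(dummy='a'), how='inner', on='dummy')
                 .drop(columns='dummy'))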

Conditional frequency of elements within lists in pandas data frame

I have a data frame in pandas like this:
STATUS FEATURES
A [x,y,z]
A [t, y]
B [x,p,t]
B [x,p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS FEATURES FREQUENCY
A x 1
A y 2
A z 1
A t 1
B x 2
B t 1
B p 2
Let us do explode, then groupby + size:
s = df.explode('FEATURES').groupby(['STATUS', 'FEATURES']).size().reset_index()
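If you also want the count column to carry a name instead of the default 0, a small tweak (same df as in the question):
s = (df.explode('FEATURES')
       .groupby(['STATUS', 'FEATURES'])
       .size()
       .reset_index(name='FREQUENCY'))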
Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
            .groupby('STATUS')['FEATURES']
            .value_counts()
            .reset_index(name='FREQUENCY'))
print(new_df)
Output
STATUS FEATURES FREQUENCY
0 A y 2
1 A t 1
2 A x 1
3 A z 1
4 B p 2
5 B x 2
6 B t 1

Pandas groupby sort each group values and order dataframe groups based on max of each group

I have a dataset containing 3 columns. I'm trying to group the rows and print each group in sorted fashion (based on the highest value in each group). The records within each group also have to be sorted.
Dataset looks like below.
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform because it works on a per-group basis, and then sorting the values.
Above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But, the same code fails for below dataset.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add the column key1 to sort_values, because in the second DataFrame there are multiple groups whose maximum value is 10, so sorting by the max alone cannot distinguish the groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max','key1', 'val'], ascending=False).drop('max', axis=1)
print (dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
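A self-contained way to reproduce the fix end-to-end (a sketch; the inline CSV via io.StringIO just stands in for the /tmp/hello.csv file from the question):
import io
import pandas as pd

csv = '''key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2'''

df = pd.read_csv(io.StringIO(csv))
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff = df.sort_values(['max', 'key1', 'val'], ascending=False).drop('max', axis=1)
print(dff)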

Convert column names into the first row

I would like to convert the following dataframe into JSON.
df:
A sector B sector C sector
TTM Ratio                 --   35.99   12.70  20.63  14.75      23.06
RRM Sales            --  114.57    1.51   5.02   1.00    4594.13
MQR book         1.48    2.64    1.02   2.46   2.73       2.74
TTR cash        --   14.33    7.41  15.35   8.59  513854.86
In order to do so with the function df.to_json() I would need to have unique column names and indices.
Therefore, what I am looking for is to convert the column names into a row and have default numbered columns. In short, I would like the following output:
df:
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio                 --   35.99   12.70  20.63  14.75      23.06
RRM Sales            --  114.57    1.51   5.02   1.00    4594.13
MQR book         1.48    2.64    1.02   2.46   2.73       2.74
TTR cash        --   14.33    7.41  15.35   8.59  513854.86
Turning the column names into the first row would let me make the conversion correctly.
You could also use vstack in numpy:
>>> df
x y z
0 8 7 6
1 6 5 4
>>> pd.DataFrame(np.vstack([df.columns, df]))
0 1 2
0 x y z
1 8 7 6
2 6 5 4
The columns become the actual first row in this case.
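Another way to push the header into the first row, without NumPy, is a double transpose plus reset_index (a sketch, assuming the same small df):
out = df.T.reset_index().T.reset_index(drop=True)
print(out)
#    0  1  2
# 0  x  y  z
# 1  8  7  6
# 2  6  5  4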
Assign a list built from a range and the original column names:
print (range(len(df.columns)))
range(0, 6)
# in Python 2 the list() call can be omitted
df.columns = [list(range(len(df.columns))), df.columns]
Or MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])
Also is possible use RangeIndex:
print (pd.RangeIndex(len(df.columns)))
RangeIndex(start=0, stop=6, step=1)
df.columns = pd.MultiIndex.from_arrays([pd.RangeIndex(len(df.columns)), df.columns])
print (df)
0 1 2 3 4 5
A sector B sector C sector
TTM Ratio -- 35.99 12.70 20.63 14.75 23.06
RRM Sales -- 114.57 1.51 5.02 1.00 4594.13
MQR book 1.48 2.64 1.02 2.46 2.73 2.74
TTR cash -- 14.33 7.41 15.35 8.59 513854.86

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, then pandas will automatically align the indices with any Series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
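For the second variant the question asks about, leaving klmn1 unchanged and producing a new dataframe klmn11, a sketch using Series.map on the L column (klmn11 is just the illustrative name from the question):
klmn11 = klmn1.copy()
# look up each row's group mean by its L label and subtract it from M
klmn11['M'] = klmn11['M'] - klmn11['L'].map(m0)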
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
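To see what fill_value changes, a small standalone illustration (toy frames, not from the question):
import pandas as pd

df1 = pd.DataFrame({'M': [1.0, 2.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'M': [0.5]}, index=['a'])

# fill_value=0: positions missing on one side are treated as 0, so 'b' keeps its value
print(df1.subtract(df2, fill_value=0))
# fill_value=None (the default): non-overlapping positions become NaN
print(df1.subtract(df2, fill_value=None))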