I have a data table df1 that looks like this (result of a df.groupby('id').agg(lambda x: x.tolist())):
df1:
id people
51 [125, 126, 127, 128, 129]
52 [302, 303, 128]
53 [312]
In another dataframe df2, I have mapped names and gender, according to a unique pid. The list entries in df1.people are in fact those pid items:
df2:
pid name gender
100 Jack Lumber m
125 Holly Polly f
126 Jeremy Owens m
127 Ron Bronco m
128 Natalia Berg f
129 Robyn Hill f
300 Crusty Clown m
302 Danny McKenny m
303 Tara Hill f
312 Glenn Dalough m
400 Fryda Beans f
Now I would like to replace or map each pid with the gender field from df2 and thereby create the following desired output, including list counts:
Outcome:
id gender count_m count_f
51 [f, m, m, f, f] 2 3
52 [m, f, f] 1 2
53 [m] 1 0
What's the best approach to create this table?
Solution:
from collections import Counter
import pandas as pd

# build a pid -> gender mapping from df2
d = dict(df2.drop(columns='name').values)
# map each pid list to a gender list, then drop the original column
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
# count genders per row and turn the Counters into count_* columns
n = pd.DataFrame([Counter(x) for x in m.gender], index=m.index).fillna(0).add_prefix('count_')
final = m.join(n)
You can use dict.get() to map each pid to its gender, then create a long dataframe by exploding the lists, apply crosstab, and finally merge:
d = dict(df2.drop(columns='name').values)
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
# repeat each id once per list element, then flatten the gender lists
n = pd.DataFrame({'id': m.loc[m.index.repeat(m.gender.str.len()), 'id'],
                  'gender': np.concatenate(m.gender)})
# for pandas >= 0.25.0 you can simply use: n = m.explode('gender')
final = m.merge(pd.crosstab(n.id, n.gender).add_prefix('count_'), left_on='id', right_index=True)
id gender count_f count_m
0 51 [f, m, m, f, f] 3 2
1 52 [m, f, f] 2 1
2 53 [m] 0 1
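For recent pandas versions, the same result can be reached with a Series lookup, explode and crosstab directly. A minimal, self-contained sketch, assuming pandas >= 1.0; the toy frames below mirror the question's data (names omitted):
import pandas as pd

df1 = pd.DataFrame({'id': [51, 52, 53],
                    'people': [[125, 126, 127, 128, 129], [302, 303, 128], [312]]})
df2 = pd.DataFrame({'pid': [125, 126, 127, 128, 129, 302, 303, 312],
                    'gender': ['f', 'm', 'm', 'f', 'f', 'm', 'f', 'm']})

d = df2.set_index('pid')['gender']           # pid -> gender lookup
m = df1.assign(gender=df1['people'].apply(lambda pids: [d[p] for p in pids]))
n = m[['id', 'gender']].explode('gender')    # one row per (id, gender) pair
counts = pd.crosstab(n['id'], n['gender']).add_prefix('count_')
final = m.drop(columns='people').merge(counts, left_on='id', right_index=True)
print(final)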
Related
I would like to convert some of the columns to a list in a dataframe.
The dataframe, df:
Name salary department days other
0 ben 1000 A 90 abc
1 alex 3000 B 80 gf
2 linn 600 C 55 jgj
3 luke 5000 D 88 gg
The desired output, df1:
Name list other
0 ben [1000,A,90] abc
1 alex [3000,B,80] gf
2 linn [600,C,55] jgj
3 luke [5000,D,88] gg
You can slice and convert the columns to a list of lists, then to a Series:
cols = ['salary', 'department', 'days']
out = (df.drop(columns=cols)
.join(pd.Series(df[cols].to_numpy().tolist(), name='list', index=df.index))
)
Output:
Name other list
0 ben abc [1000, A, 90]
1 alex gf [3000, B, 80]
2 linn jgj [600, C, 55]
3 luke gg [5000, D, 88]
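If you prefer to stay in pandas without the NumPy round-trip, an equivalent variant builds the list column with a row-wise apply; with axis=1 and the default result_type, apply returns one Python list per row. A sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'Name': ['ben', 'alex', 'linn', 'luke'],
                   'salary': [1000, 3000, 600, 5000],
                   'department': list('ABCD'),
                   'days': [90, 80, 55, 88],
                   'other': ['abc', 'gf', 'jgj', 'gg']})

cols = ['salary', 'department', 'days']
# list(row) collects each row's values into a single Python list
out = df.drop(columns=cols).assign(list=df[cols].apply(list, axis=1))
print(out)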
If you want to preserve the column order, we can break it down into 3 parts, as @mozway mentioned in his answer:
Define the columns we want to group
Find the first element's index (you could take it a step further and find the smallest one, as the list won't necessarily be sorted like the DataFrame)
Insert the Series into the dataframe at the position we generated
cols = ['salary', 'department', 'days']
first_location = df.columns.get_loc(cols[0])
list_values = pd.Series(df[cols].values.tolist())  # converting each row's values to one list
df.insert(loc=first_location, column='list', value=list_values)  # inserting the Series at the desired location
df = df.drop(columns=cols)  # dropping the columns we grouped together
print(df)
Which results in:
Name list other
0 ben [1000, A, 90] abc
1 alex [3000, B, 80] gf
...
When trying to solve my own question here I came up with an interesting problem. Consider I have this dataframe
import pandas as pd
import numpy as np
np.random.seed(0)
df= pd.DataFrame(dict(group = np.random.choice(["a", "b", "c", "d"],
size = 100),
values = np.random.randint(0, 100,
size = 100)
)
)
I want to select the top values per each group, but I want to select the values according to some range. Let's say, top x to y values per each group. If any group has fewer than x values in it, give top(min((y-x), x)) values for that group.
In general, I am looking for a custom-made alternative function which could be used with groupby objects to select not the top n values, but a top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem where x = 1 and y = n
Any further help or guidance will be appreciated.
Adding an example with this df and top(3, 6): for every group, output the values from the top 3rd until the top 6th value:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other means of doing this and, depending on how large your dataframe is, you may want to search for "groupby slice" or something similar. You may also need to check that my conditions are correct (<, <=, etc.):
x = 3
y = 6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count') < x]
# this df takes the groups meeting the minimum and shifts 'value' up by x-1 within
# each group (assuming values are already sorted descending, as in the example),
# then does some cleanup and chooses nlargest
df2 = df[df.groupby('group')['value'].transform('count') >= x].copy()
df2['shifted'] = df2.groupby('group')['value'].shift(periods=-(x-1))
df2.drop('value', axis=1, inplace=True)
df2 = df2.groupby('group')['shifted'].nlargest(y-x).reset_index().rename(columns={'shifted': 'value'}).drop('level_1', axis=1)
# putting it all together
df_final = pd.concat([df1, df2])
df_final
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
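A more direct alternative is a rank-based slice inside a groupby apply. A sketch: following the answer above, it returns y-x values starting at rank x, with a top(x) fallback for small groups, since the question's example is slightly ambiguous about the exact boundaries:
import pandas as pd

def top_range(s, x, y):
    # sort descending and keep y-x values starting at rank x (1-indexed)
    s = s.sort_values(ascending=False)
    if len(s) < x:
        return s.head(x)  # fallback for groups smaller than x
    return s.iloc[x-1:y-1]

df = pd.DataFrame({'group': list('abaababbcbbacaa'),
                   'value': [190, 166, 163, 106, 86, 77, 70, 69,
                             67, 54, 52, 50, 24, 20, 11]})

out = (df.groupby('group')['value']
         .apply(top_range, x=3, y=6)
         .reset_index(level=0))
print(out)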
I have a large data frame. Sample below
| year | sentences | company |
|------|-------------------|---------|
| 2020 | [list of strings] | A |
| 2019 | [list of strings] | A |
| 2018 | [list of strings] | A |
| ... | .... | ... |
| 2020 | [list of strings] | Z |
| 2019 | [list of strings] | Z |
| 2018 | [list of strings] | Z |
I want to compare the sentences column by company, year by year, so as to get a year-on-year change.
Example: for company A, I would like to apply an operator such as sentence similarity or some distance metric for the [list of strings]2020 and [list of strings]2019, then [list of strings]2019 and [list of strings]2018.
Similarly for company B, C, ... Z.
How can this be achieved?
EDIT
The length of [list of strings] is variable, so some simple quantifying operators could be:
Difference in number of elements --> length([list of strings]2020) - length([list of strings]2019)
Count of common elements --> length(set([list of strings]2020) ∩ set([list of strings]2019))
The comparisons should be:
| years | Y-o-Y change (Some function) | company |
|-----------|------------------------------|---------|
| 2020-2019 | 15 | A |
| 2019-2018 | 3 | A |
| 2018-2017 | 55 | A |
| ... | .... | ... |
| 2020-2019 | 33 | Z |
| 2019-2018 | 32 | Z |
| 2018-2017 | 27 | Z |
TL;DR: see the full code at the bottom.
You have to break your task down into simpler subtasks. Basically, you want to apply one or several calculations on successive rows of your dataframe, grouped by company. This means you will have to use groupby and apply.
Let's start with generating an example dataframe. Here I used lowercase letters as words for the "sentences" column.
import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})
df
output:
date sentences company
0 2020 [z] A
1 2019 [s, f, g, a, d, a, h, o, c] A
2 2018 [b] A
…
26 2014 [q] C
27 2013 [i, w] C
28 2012 [o, p, i, d, f, w, k, d] C
29 2011 [l, f, h, p] C
Concatenate the "sentences" column of the next row (previous year):
pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
output:
date sentences company date_pre sentences_pre company_pre
0 2020 [z] A 2019.0 [s, f, g, a, d, a, h, o, c] A
1 2019 [s, f, g, a, d, a, h, o, c] A 2018.0 [b] A
2 2018 [b] A 2017.0 [x, n, r, a, s, d] A
3 2017 [x, n, r, a, s, d] A 2016.0 [u, n, g, u, k, s, v, s, o] A
4 2016 [u, n, g, u, k, s, v, s, o] A 2015.0 [v, g, d, i, b, z, y, k] A
5 2015 [v, g, d, i, b, z, y, k] A 2014.0 [q, o, p] A
6 2014 [q, o, p] A 2013.0 [j, s, s] A
7 2013 [j, s, s] A 2012.0 [g, u, l, g, n] A
8 2012 [g, u, l, g, n] A 2011.0 [v, p, y, a, s] A
9 2011 [v, p, y, a, s] A 2020.0 [a, h, c, w] B
…
Define a function to compute a number of distance metrics (here the two defined in the question). TypeError is caught to handle the case where there is no row to compare with (one occurrence per group).
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
This works on a sub-dataframe that was filtered to match only one company:
df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1)
output:
years yoy_diff_len yoy_nb_common company
0 2020–2019 -4 0 A
1 2019–2018 6 1 A
2 2018–2017 1 0 A
3 2017–2016 1 0 A
4 2016–2015 -7 0 A
5 2015–2014 4 0 A
6 2014–2013 1 0 A
7 2013–2012 -1 0 A
8 2012–2011 -5 1 A
Now you can make a function to construct each dataframe per group and apply the computation:
def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)
and apply this function to each group:
df.groupby('company').apply(group_compare)
Full code:
import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
def group_compare(df):
df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
return df2.apply(compare_lists, axis=1)
## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)
output:
years yoy_diff_len yoy_nb_common company
company
A 0 2020–2019 -8 0 A
1 2019–2018 8 0 A
2 2018–2017 -5 0 A
3 2017–2016 -3 2 A
4 2016–2015 1 3 A
5 2015–2014 5 0 A
6 2014–2013 0 0 A
7 2013–2012 -2 0 A
8 2012–2011 0 0 A
B 10 2020–2019 3 0 B
11 2019–2018 -6 1 B
12 2018–2017 3 0 B
13 2017–2016 -5 1 B
14 2016–2015 2 2 B
15 2015–2014 4 1 B
16 2014–2013 3 0 B
17 2013–2012 -8 0 B
18 2012–2011 1 1 B
C 20 2020–2019 8 1 C
21 2019–2018 -7 0 C
22 2018–2017 0 1 C
23 2017–2016 7 0 C
24 2016–2015 -3 0 C
25 2015–2014 3 0 C
26 2014–2013 -1 0 C
27 2013–2012 -6 2 C
28 2012–2011 4 2 C
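A variant worth noting, as a sketch assuming the same columns and that each company's rows are ordered by descending year, as in the example: shifting within each group via groupby avoids pairing the last year of one company with the first year of the next, so no dropna or TypeError handling is needed:
def yoy_metrics(df):
    # pair each row with the previous year's row of the *same* company
    prev_sent = df.groupby('company')['sentences'].shift(-1)
    prev_date = df.groupby('company')['date'].shift(-1)
    mask = prev_sent.notna()  # the oldest year per company has nothing to compare with
    return pd.DataFrame({
        'years': df.loc[mask, 'date'].astype(str) + '–' + prev_date[mask].astype(int).astype(str),
        'yoy_diff_len': df.loc[mask, 'sentences'].str.len() - prev_sent[mask].str.len(),
        'yoy_nb_common': [len(set(a).intersection(b))
                          for a, b in zip(df.loc[mask, 'sentences'], prev_sent[mask])],
        'company': df.loc[mask, 'company'],
    })

yoy_metrics(df)  # same metrics as above, without the group_compare plumbing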
Let's consider the following dataframe:
d = {'Location': ['A','A','B','B','C','C','A','C','A'],
     'Gender': ['M','M','F','M','M','F','M','M','M'],
     'Edu': ['N','N','Y','Y','Y','N','Y','Y','Y'],
     'Access1': [1,0,1,0,1,0,1,1,1], 'Access2': [1,1,1,0,0,1,0,0,1]}
df = pd.DataFrame(data=d, dtype=np.int8)
Output from dataframe:
Access1 Access2 Edu Gender Location
0 1 1 N M A
1 0 1 N M A
2 1 1 Y F B
3 0 0 Y M B
4 1 0 Y M C
5 0 1 N F C
6 1 0 Y M A
7 1 0 Y M C
8 1 1 Y M A
Then I am using groupby to analyse the frequencies in df
D0 = df.groupby(['Location','Gender','Edu']).sum()
((D0 / D0.groupby(level=0).transform('sum')) * 100).round(3).astype(str) + '%'
Output:
Access1 Access2
Location Gender Edu
A M N 33.333% 66.667%
Y 66.667% 33.333%
B F Y 100.0% 100.0%
M Y 0.0% 0.0%
C F N 0.0% 100.0%
M Y 100.0% 0.0%
From this output, I infer that 33.3% of uneducated men in location A with Access to service 1 (=Access1) is the result of considering 3 people in location A having access to service 1, of which 1 uneducated man has access to it (=1/3).
Yet, I wish to get a different output. I would like to consider the total of 4 men in location A as my 100%. 50% of this group of men are uneducated. Out of that 50% of uneducated men, 25% have access to service 1. So the percentage I would like to see in the table is 25% (uneducated men in area A accessing service 1, out of the total for area A). Is groupby the right way to get there, and what would be the best way to measure the % of access to service 1 while considering a disaggregation from the total population of reference per location?
I believe you need to divide D0 by the per-Location counts, i.e. the first level of the MultiIndex mapped by a Series:
D0=df.groupby(['Location','Gender','Edu']).sum()
a = df['Location'].value_counts()
#alternative
#a = df.groupby(['Location']).size()
print (a)
A 4
C 3
B 2
Name: Location, dtype: int64
df1 = D0.div(D0.index.get_level_values(0).map(a.get), axis=0)
print (df1)
Access1 Access2
Location Gender Edu
A M N 0.250000 0.500000
Y 0.500000 0.250000
B F Y 0.500000 0.500000
M Y 0.000000 0.000000
C F N 0.000000 0.333333
M Y 0.666667 0.000000
Detail:
print (D0.index.get_level_values(0).map(a.get))
Int64Index([4, 4, 2, 2, 3, 3], dtype='int64', name='Location')
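An equivalent formulation avoids the explicit map: pandas arithmetic methods accept a level argument to broadcast a Series against one level of a MultiIndex. A sketch, assuming the same df and D0:
a = df.groupby('Location').size()          # people per Location
df1 = D0.div(a, level='Location', axis=0)  # broadcast over the 'Location' level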
First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you reset the index of your klmn1 dataframe to be that of the column L, then your dataframe will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
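For the second approach, producing a new dataframe klmn11 with an updated M column while leaving klmn1 unchanged, here is a minimal sketch using Series.map (assuming klmn1 and m0 as defined above):
# copy first so klmn1 itself is untouched
klmn11 = klmn1.copy()
# map each row's L label to its group mean and subtract it from M
klmn11['M'] = klmn11['M'] - klmn11['L'].map(m0)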
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)
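A quick illustration of the difference between the two options (a sketch with hypothetical toy frames a and b):
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [10]}, index=[1])

# fill_value=0 treats labels missing on one side as 0, so no NaN appears
print(a.subtract(b, fill_value=0))     # x: 1.0 at index 0, -8.0 at index 1
# fill_value=None (the default) leaves non-overlapping labels as NaN
print(a.subtract(b, fill_value=None))  # x: NaN at index 0, -8.0 at index 1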