pandas: row-wise operation to get change over time

I have a large data frame. Sample below
| year | sentences | company |
|------|-------------------|---------|
| 2020 | [list of strings] | A |
| 2019 | [list of strings] | A |
| 2018 | [list of strings] | A |
| ... | .... | ... |
| 2020 | [list of strings] | Z |
| 2019 | [list of strings] | Z |
| 2018 | [list of strings] | Z |
I want to compare the sentences column within each company, year over year, to get a year-on-year change.
Example: for company A, I would like to apply an operator such as sentence similarity or some distance metric to the [list of strings]2020 and [list of strings]2019, then to [list of strings]2019 and [list of strings]2018.
Similarly for company B, C, ... Z.
How can this be achieved?
EDIT
The length of [list of strings] is variable, so some simple quantifying operators could be:
Difference in number of elements --> length([list of strings]2020) - length([list of strings]2019)
Count of common elements --> length(intersection([list of strings]2020, [list of strings]2019))
The comparisons should be:
| years | Y-o-Y change (Some function) | company |
|-----------|------------------------------|---------|
| 2020-2019 | 15 | A |
| 2019-2018 | 3 | A |
| 2018-2017 | 55 | A |
| ... | .... | ... |
| 2020-2019 | 33 | Z |
| 2019-2018 | 32 | Z |
| 2018-2017 | 27 | Z |
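For concreteness, here is a toy sketch of the two operators above applied to plain Python lists (assuming "count of common elements" means the size of the set intersection):
sentences_2020 = ['revenue grew', 'new product line', 'costs were cut']
sentences_2019 = ['revenue grew', 'hiring freeze']
diff_len = len(sentences_2020) - len(sentences_2019)       # 1
n_common = len(set(sentences_2020) & set(sentences_2019))  # 1 ('revenue grew')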

TL;DR: see full code at the bottom.
You have to break your task down into simpler subtasks. Basically, you want to apply one or several calculations to successive rows of your dataframe, grouped by company. This means you will have to use groupby and apply.
Let's start by generating an example dataframe. Here I used lowercase letters as words for the "sentences" column.
import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})
df
output:
date sentences company
0 2020 [z] A
1 2019 [s, f, g, a, d, a, h, o, c] A
2 2018 [b] A
…
26 2014 [q] C
27 2013 [i, w] C
28 2012 [o, p, i, d, f, w, k, d] C
29 2011 [l, f, h, p] C
Concatenate the "sentences" column of the next row (previous year):
pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
output:
date sentences company date_pre sentences_pre company_pre
0 2020 [z] A 2019.0 [s, f, g, a, d, a, h, o, c] A
1 2019 [s, f, g, a, d, a, h, o, c] A 2018.0 [b] A
2 2018 [b] A 2017.0 [x, n, r, a, s, d] A
3 2017 [x, n, r, a, s, d] A 2016.0 [u, n, g, u, k, s, v, s, o] A
4 2016 [u, n, g, u, k, s, v, s, o] A 2015.0 [v, g, d, i, b, z, y, k] A
5 2015 [v, g, d, i, b, z, y, k] A 2014.0 [q, o, p] A
6 2014 [q, o, p] A 2013.0 [j, s, s] A
7 2013 [j, s, s] A 2012.0 [g, u, l, g, n] A
8 2012 [g, u, l, g, n] A 2011.0 [v, p, y, a, s] A
9 2011 [v, p, y, a, s] A 2020.0 [a, h, c, w] B
…
Define a function to compute a number of distance metrics (here the two defined in the question). A TypeError is caught to handle the case where there is no next row to compare with (this occurs once per group, on the earliest year).
def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2) - len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                          })
    except TypeError:
        return
This works on a sub-dataframe that was filtered to match only one company:
df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1)
output:
years yoy_diff_len yoy_nb_common company
0 2020–2019 -4 0 A
1 2019–2018 6 1 A
2 2018–2017 1 0 A
3 2017–2016 1 0 A
4 2016–2015 -7 0 A
5 2015–2014 4 0 A
6 2014–2013 1 0 A
7 2013–2012 -1 0 A
8 2012–2011 -5 1 A
Now you can make a function to construct each dataframe per group and apply the computation:
def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)
and apply this function to each group:
df.groupby('company').apply(group_compare)
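As a side note, a possible variant (a sketch, equivalent in spirit to the approach above) is to shift 'date' and 'sentences' within each company up front, so no per-group concatenation is needed:
shifted = df.groupby('company')[['date', 'sentences']].shift(-1).add_suffix('_pre')
out = pd.concat([df, shifted], axis=1).dropna().apply(compare_lists, axis=1)
This yields the same columns but keeps a flat index instead of the company-level index produced by groupby().apply().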
Full code:
import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company': np.repeat(list('ABC'), 10)})

def compare_lists(s):
    l1 = s['sentences_pre']
    l2 = s['sentences']
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2) - len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                          })
    except TypeError:
        return

def group_compare(df):
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)
## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)
output:
years yoy_diff_len yoy_nb_common company
company
A 0 2020–2019 -8 0 A
1 2019–2018 8 0 A
2 2018–2017 -5 0 A
3 2017–2016 -3 2 A
4 2016–2015 1 3 A
5 2015–2014 5 0 A
6 2014–2013 0 0 A
7 2013–2012 -2 0 A
8 2012–2011 0 0 A
B 10 2020–2019 3 0 B
11 2019–2018 -6 1 B
12 2018–2017 3 0 B
13 2017–2016 -5 1 B
14 2016–2015 2 2 B
15 2015–2014 4 1 B
16 2014–2013 3 0 B
17 2013–2012 -8 0 B
18 2012–2011 1 1 B
C 20 2020–2019 8 1 C
21 2019–2018 -7 0 C
22 2018–2017 0 1 C
23 2017–2016 7 0 C
24 2016–2015 -3 0 C
25 2015–2014 3 0 C
26 2014–2013 -1 0 C
27 2013–2012 -6 2 C
28 2012–2011 4 2 C

Related

Grouped count of combinations in Pandas column

I have a dataset with two values per person, like the one below, and want to generate all combinations of those values and count each combination. I have a working solution, but it's hardcoded and not scalable; I am looking for ideas on how to improve it.
Example:
import pandas as pd

d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6], 'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)
df
person type
0 1 a
1 1 b
2 2 a
3 2 c
4 3 c
5 3 b
6 4 d
7 4 a
8 5 b
9 5 c
10 6 b
11 6 d
My Inefficient Solution:
df = pd.get_dummies(df)
typecols = [col for col in df.columns if 'type' in col]
df = df.groupby(['person'], as_index=False)[typecols].apply(lambda x: x.astype(int).sum())
df["a_b"] = df["type_a"] + df["type_b"]
df["a_c"] = df["type_a"] + df["type_c"]
df["a_d"] = df["type_a"] + df["type_d"]
df["b_c"] = df["type_b"] + df["type_c"]
df["b_d"] = df["type_b"] + df["type_d"]
df["c_d"] = df["type_c"] + df["type_d"]
df["a_b"] = df.apply(lambda x: 1 if x["a_b"] == 2 else 0, axis=1)
df["a_c"] = df.apply(lambda x: 1 if x["a_c"] == 2 else 0, axis=1)
df["a_d"] = df.apply(lambda x: 1 if x["a_d"] == 2 else 0, axis=1)
df["b_c"] = df.apply(lambda x: 1 if x["b_c"] == 2 else 0, axis=1)
df["b_d"] = df.apply(lambda x: 1 if x["b_d"] == 2 else 0, axis=1)
df["c_d"] = df.apply(lambda x: 1 if x["c_d"] == 2 else 0, axis=1)
df_sums = df[['a_b','a_c','a_d','b_c','b_d','c_d']].sum()
print(df_sums.to_markdown(tablefmt="grid"))
+-----+-----+
| | 0 |
+=====+=====+
| a_b | 1 |
+-----+-----+
| a_c | 1 |
+-----+-----+
| a_d | 1 |
+-----+-----+
| b_c | 2 |
+-----+-----+
| b_d | 1 |
+-----+-----+
| c_d | 0 |
+-----+-----+
This solution works because every person has exactly two distinct values from a list of six distinct values, but it would quickly become unmanageable if there were NULLs or more than six distinct values.
We can do:
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()
s.index = [f'{x}_{y}' for x, y in s.index]
s = s.sort_index()
print(s)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
Name: type, dtype: int64
Getting all the combinations is also simple:
from itertools import combinations
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()\
.reindex(list(combinations(df['type'].unique(), 2)), fill_value=0)
(a, b) 1
(a, c) 1
(a, d) 1
(b, c) 2
(b, d) 1
(c, d) 0
Name: type, dtype: int64
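If you also want the a_b style labels on this reindexed result, the same index rename as in the first snippet applies (a small sketch):
s.index = [f'{x}_{y}' for x, y in s.index]
print(s)
a_b    1
a_c    1
a_d    1
b_c    2
b_d    1
c_d    0
Name: type, dtype: int64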
You can do a self-merge within person with a query to de-duplicate the matches (this is why we create the N column). Then we sort the types so we only get one of 'a_b' (and not also 'b_a'), create the labels, and take the value_counts. Using combinations we can get the list of all possibilities to reindex with.
import numpy as np
from itertools import combinations
ids = ['_'.join(x) for x in combinations(df['type'].unique(), 2)]
#['a_b', 'a_c', 'a_d', 'b_c', 'b_d', 'c_d']
df['N'] = range(len(df))
df1 = df.merge(df, on='person').query('N_x > N_y')
df1[['type_x', 'type_y']] = np.sort(df1[['type_x', 'type_y']].to_numpy(), 1)
df1['label'] = df1['type_x'].str.cat(df1['type_y'], sep='_')
df1['label'].value_counts().reindex(ids, fill_value=0)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
c_d 0
Name: label, dtype: int64

Replacing values in a df with values from another df

I have a data table df1 that looks like this (result of a df.groupby('id').agg(lambda x: x.tolist())):
df1:
id people
51 [125, 126, 127, 128, 129]
52 [302, 303, 128]
53 [312]
In another dataframe df2, I have mapped names and gender, according to a unique pid. The list entries in df1.people are in fact those pid items:
df2:
pid name gender
100 Jack Lumber m
125 Holly Polly f
126 Jeremy Owens m
127 Ron Bronco m
128 Natalia Berg f
129 Robyn Hill f
300 Crusty Clown m
302 Danny McKenny m
303 Tara Hill f
312 Glenn Dalough m
400 Fryda Beans f
Now I would like to replace or map the respective pid with the gender field from df2 and thereby create the following desired output, including a count per list:
Outcome:
id gender count_m count_f
51 [f, m, m, f, f] 2 3
52 [m, f, f] 1 2
53 [m] 1 0
What's the best approach to create this table?
Solution:
from collections import Counter

d = dict(df2.drop(columns='name').values)  # {pid: gender} mapping
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
n = pd.DataFrame([Counter(x) for x in m.gender], index=m.index).fillna(0).add_prefix('count_')
final = m.join(n)
You can use dict.get() to map the pid values to gender, then build a long-format dataframe (effectively exploding the lists), apply crosstab, and merge:
d = dict(df2.drop(columns='name').values)
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
n = pd.DataFrame({'id': m.loc[m.index.repeat(m.gender.str.len()), 'id'],
                  'gender': np.concatenate(m.gender)})
# for pandas >= 0.25.0 use: n = m.explode('gender')
final = m.merge(pd.crosstab(n.id, n.gender).add_prefix('count_'), left_on='id', right_index=True)
id gender count_f count_m
0 51 [f, m, m, f, f] 3 2
1 52 [m, f, f] 2 1
2 53 [m] 0 1

Multi-level groupby sub-population percentages

Let's consider the following dataframe:
import numpy as np
import pandas as pd

d = {'Location': ['A','A','B','B','C','C','A','C','A'],
     'Gender': ['M','M','F','M','M','F','M','M','M'],
     'Edu': ['N','N','Y','Y','Y','N','Y','Y','Y'],
     'Access1': [1,0,1,0,1,0,1,1,1],
     'Access2': [1,1,1,0,0,1,0,0,1]}
df = pd.DataFrame(data=d, dtype=np.int8)
Output from dataframe:
Access1 Access2 Edu Gender Location
0 1 1 N M A
1 0 1 N M A
2 1 1 Y F B
3 0 0 Y M B
4 1 0 Y M C
5 0 1 N F C
6 1 0 Y M A
7 1 0 Y M C
8 1 1 Y M A
Then I am using groupby to analyse the frequencies in df
D0=df.groupby(['Location','Gender','Edu']).sum()
((D0 / D0.groupby(level=[0]).transform(sum)) * 100).round(3).astype(str) + '%'
Output:
Access1 Access2
Location Gender Edu
A M N 33.333% 66.667%
Y 66.667% 33.333%
B F Y 100.0% 100.0%
M Y 0.0% 0.0%
C F N 0.0% 100.0%
M Y 100.0% 0.0%
From this output, I infer that the 33.3% for uneducated men in location A with access to service 1 (=Access1) comes from considering the 3 people in location A who have access to service 1, of whom 1 is an uneducated man (=1/3).
Yet, I wish to get a different output. I would like to consider the total of 4 men in location A as my 100%. 50% of that group are uneducated, and the uneducated men with access to service 1 make up 25% of the whole group. So the percentage I would like to see in the table is 25% (uneducated men in area A accessing service 1, as a share of everyone in area A). Is groupby the right way to get there, and what would be the best way to measure the % of access to service 1 while disaggregating from the total population of reference per location?
I believe you need to divide D0 by the per-location counts, mapped onto the first level of the MultiIndex via a Series:
D0=df.groupby(['Location','Gender','Edu']).sum()
a = df['Location'].value_counts()
#alternative
#a = df.groupby(['Location']).size()
print (a)
A 4
C 3
B 2
Name: Location, dtype: int64
df1 = D0.div(D0.index.get_level_values(0).map(a.get), axis=0)
print (df1)
Access1 Access2
Location Gender Edu
A M N 0.250000 0.500000
Y 0.500000 0.250000
B F Y 0.500000 0.500000
M Y 0.000000 0.000000
C F N 0.000000 0.333333
M Y 0.666667 0.000000
Detail:
print (D0.index.get_level_values(0).map(a.get))
Int64Index([4, 4, 2, 2, 3, 3], dtype='int64', name='Location')
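As a possible shortcut (a sketch, not from the original answer): the DataFrame arithmetic methods accept a level argument, so the per-location counts can be broadcast over the first index level directly:
a = df.groupby('Location').size()           # people per location: A=4, B=2, C=3
df1 = D0.div(a, axis=0, level='Location')   # divide each row by its location's total
This should give the same result as mapping the counts through get_level_values.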

pandas dataframe groupby sum index

I have a dataframe, and I want to go
FROM:
dow yield
0 F 2
1 F 3
2 M 4
3 M 6
4 TH 7
TO:
dow ysum
0 F 5
1 M 10
2 TH 7
but I got this:
     yield
dow
F        5
M       10
TH       7
This is how I did it:
import pandas as pd

d1 = ['F','F','M','M','TH']
d2 = [2,3,4,6,7]
d = {'dow': d1, 'yield': d2}
df = pd.DataFrame(data=d, index=None)
df1 = df.groupby('dow').sum()
How can I get the result with dow as a column instead of the index?
The first column is the index, so you can add the parameter as_index=False:
df1 = df.groupby('dow', as_index=False).sum()
print (df1)
dow yield
0 F 5
1 M 10
2 TH 7
Or reset_index:
df1 = df.groupby('dow').sum().reset_index()
print (df1)
dow yield
0 F 5
1 M 10
2 TH 7
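If you also want the summed column renamed to ysum, as in the desired output, named aggregation (available since pandas 0.25) is one way; a sketch:
df1 = df.groupby('dow', as_index=False).agg(ysum=('yield', 'sum'))
print (df1)
  dow  ysum
0   F     5
1   M    10
2  TH     7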

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, then pandas will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
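If instead you want to keep klmn1's original index and produce a new frame klmn11 with the adjusted M column (the second approach asked about), one possible sketch, not from the original answer, uses map to align on L:
klmn11 = klmn1.copy()
klmn11['M'] = klmn11['M'] - klmn11['L'].map(m0)   # subtract the group mean matching each row's L
Assigning the result back to klmn1['M'] directly would give the in-place variant.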
Option #1 (treat values missing from one frame as 0 before subtracting):
df1.subtract(df2, fill_value=0)
Option #2 (leave unmatched values as NaN, the default):
df1.subtract(df2, fill_value=None)