Name-Specific Variability Calculations in Pandas

I'm trying to calculate variability statistics from two df's - one with current data and one df with average data for the month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new DF, call it "DF3", that looks like this, with the calculation done per "Name" and per "month":
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but I get errors about a duplicate axis or key errors:
DF3['variability'] = (DF1.output / DF2.set_index('month')['output_average'].reindex(DF1['name']).values) - 1
Thank you for your help in learning Python row calculations, coming from MATLAB!

Since you are matching on two columns, you are better off using merge instead of set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780
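If you want DF3 to contain only the columns shown in the desired output, you could drop the helper columns afterwards. A small follow-up sketch (the rename just matches the "Variability" capitalization used in the question):
DF3 = (df3.drop(columns=['output', 'output_average'])
          .rename(columns={'variability': 'Variability'}))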

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like this below and I need to find the cumulative deviation as shown in a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, the calculation would look like this with data in Excel columns Z3:Z14, AA3:AA14 for the first row: =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1 and for the next row: =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1 and for the next as follows with the last row looking like this in the Excel example: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
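To attach the result to the DataFrame as a new column, as the question asks, you can assign it directly (using the "Cum Dev" name from the question):
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1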

Pandas - Move data in one column to the same row in a different column

I have a df which looks like the below. There are 2 quantity columns and I want to move the quantities in the "QTY 2" column into the "QTY" column.
Note: there are no instances where there are values in the same row for both columns (so for each row, either QTY or QTY 2 is populated, not both).
DF
Index  Product   QTY  QTY 2
0      Shoes     5
1      Jumpers        10
2      T Shirts       15
3      Shorts    13
Desired Output
Index  Product   QTY
0      Shoes     5
1      Jumpers   10
2      T Shirts  15
3      Shorts    13
Thanks
Try this:
import numpy as np
df['QTY'] = np.where(df['QTY'].isnull(), df['QTY 2'], df['QTY'])
df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
filling the gaps of QTY with QTY 2:
In [254]: df
Out[254]:
Index Product QTY QTY 2
0 0 Shoes 5.0 NaN
1 1 Jumpers NaN 10.0
2 2 T Shirts NaN 15.0
3 3 Shorts 13.0 NaN
In [255]: df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
In [256]: df
Out[256]:
Index Product QTY QTY 2
0 0 Shoes 5 NaN
1 1 Jumpers 10 10.0
2 2 T Shirts 15 15.0
3 3 Shorts 13 NaN
downcast="infer" means "these look like integers once the NaNs are gone, so cast the column to an integer dtype".
You can drop QTY 2 after this with df = df.drop(columns="QTY 2"). If you want a one-liner, that is, as usual, possible:
df = (df.assign(QTY=df["QTY"].fillna(df["QTY 2"], downcast="infer"))
        .drop(columns="QTY 2"))
You can do this (I am assuming your empty values are empty strings):
df = (df.assign(QTY=df[['QTY', 'QTY 2']]
                   .replace('', 0)
                   .sum(axis=1))
        .drop('QTY 2', axis=1))
print(df):
Product QTY
0 Shoes 5
1 Jumpers 10
2 T Shirts 15
3 Shorts 13
If the empty values are actually NaNs, then:
df['QTY'] = df['QTY'].fillna(df['QTY 2'])  # or
df['QTY'] = df[['QTY', 'QTY 2']].sum(axis=1)
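A further option, not from the answers above but a common pandas idiom: combine_first fills the missing values of one Series from another, so the same move could be sketched as:
# Fill missing QTY values from QTY 2, then drop the helper column
df['QTY'] = df['QTY'].combine_first(df['QTY 2'])
df = df.drop(columns='QTY 2')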

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing an 'ID' and 'DAY' columns, which shows when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened 30 days apart, tops. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB: a potential drawback of this approach is that if the customer makes a new complaint within the 30 days, that complaint restarts the 30-day threshold for the next one, even though it is itself dropped.
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column within each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert back to the original date format, you can use dt.strftime:
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
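For reference, a self-contained sketch that rebuilds the sample data from the question and applies the diff-based mask from the first answer (the only assumption is the dd.mm.yyyy date format shown above):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3],
    'DAY': ['22.03.2020', '18.04.2020', '10.05.2020', '13.01.2020',
            '30.03.2020', '31.03.2020', '24.02.2021'],
})
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

# Flag rows whose gap to the previous complaint of the same ID is under 30 days
mask = (df.sort_values(['ID', 'DAY'])
          .groupby('ID')['DAY']
          .diff()
          .lt(pd.Timedelta(days=30))
          .sort_index())
print(df[~mask])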

How to transpose multiple columns values into one column based off of two indexes in Pandas dataframe

I am working with some family data which holds records on caregivers and the number of children each caregiver has. Currently, demographic information for the caregiver and all of that caregiver's children is stored in the caregiver record. I want to take the children's demographic information and put it into the children's respective records/rows. Here is an example of the data I am working with:
Vis  POS  FAMID  G1ID    G2ID    G1B   G2B1  G2B2  G2B3  G1R    G2R1   G2R2   G2R3
1    0    1      100011          1979  2010              White  White
1    1    1              200011
1    0    2      100021          1969  2011  2009        AA     AA     White
1    1    2              200021
1    2    2              200022
1    0    3      100031          1966  2008  2010  2011  White  White  AA     AA
1    1    3              200031
1    2    3              200032
1    3    3              200033
G1 = caregiver data
G2 = child data
GxBx = birthyear
GxRx = race
OUTPUT
Visit  POS  FAMID  G1      G2      G1Birth  G2Birth  G1Race  G2Race
1      0    1      100011          1979              White
1      1    1              200011           2010             White
1      0    2      100021          1969              AA
1      1    2              200021           2011             AA
1      2    2              200022           2009             White
1      0    3      100031          1966              White
1      1    3              200031           2008             White
1      2    3              200032           2010             AA
1      3    3              200033           2011             AA
From these two tables you can see I want all G2Bx columns to fall into a new G2Birth column, and same principle for G2Rx columns. (I actually have several more instances like race and birthyear in my actual data)
I have been looking into pivot and stacking functions in the pandas DataFrame but I haven't quite gotten what I wanted. The closest I have gotten was using the melt function, but the issue I had with melt was that I couldn't get it to map to indexes without taking all values from that column, i.e. it wants to create a row for child2 and child3 for people who only have child1. I might just be using the melt function incorrectly.
What I want is all values from g2Birthdate1 to map onto POS when POS=1, all g2Birthdate2 values to map onto the POS=2 index, etc. Is there a function which can help accomplish this? Or does this require some additional coding?
You can do this with a row and a column MultiIndex and a left join:
# df is your initial dataframe
# Make a baseline dataframe to hold the IDs
id_df = df.drop(columns=[c for c in df.columns if c not in ["G1ID", "G2ID","Vis","FAMID","POS"]])
# Make a rows MultiIndex to join on at the end
id_df = id_df.set_index(["Vis","FAMID","POS"])
# Make a dataframe holding just the demographic data (drop the ID and POS columns)
data_df = df.drop(columns=[c for c in df.columns if c in ["G1ID", "G2ID", "POS"]])
# Make the first two parts of the MultiIndex required for the join at the end
data_df = data_df.set_index(["Vis","FAMID"])
# Make the columns also have a MultiIndex
data_df.columns = pd.MultiIndex.from_tuples([("G1Birth",0),("G2Birth",1),("G2Birth",2),("G2Birth",3),
                                             ("G1Race",0),("G2Race",1),("G2Race",2),("G2Race",3)])
# Name the columnar index levels
data_df.columns.names = (None, "POS")
# Stack the newly formed lower-level into the rows MultiIndex to complete it in prep for joining
data_df = data_df.stack("POS")
# Join to the id dataframe on the full MultiIndex
final = id_df.join(data_df)
final = final.reset_index()
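Since the real data apparently has several more measures than just birth year and race, you might prefer to build those column tuples programmatically rather than hard-coding them. A sketch, assuming every demographic column follows the naming pattern in the sample (a G, a generation digit, a measure letter, an optional position digit); to_tuple is a hypothetical helper, and any extra measures would need their own entry in the letter-to-name mapping:
import re

# Hypothetical helper (not part of the original answer): map a column name like
# "G2B2" or "G1R" to a (measure, POS) tuple, e.g. "G2B2" -> ("G2Birth", 2)
def to_tuple(col):
    gen, kind, pos = re.fullmatch(r"G(\d)([BR])(\d*)", col).groups()
    measure = "G" + gen + {"B": "Birth", "R": "Race"}[kind]
    return (measure, int(pos) if pos else 0)

# This would replace the hardcoded pd.MultiIndex.from_tuples([...]) line above
data_df.columns = pd.MultiIndex.from_tuples([to_tuple(c) for c in data_df.columns])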

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
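If "top 2" should mean the two largest values per id rather than the first two rows in the current order, one way to combine the sorting note above with head is (just an illustration, not part of the original answer):
df.sort_values('value', ascending=False).groupby('id').head(2).sort_index()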
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead of time is very time consuming.
We can group by first and take the top k rows of each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value passed to nlargest: the number of rows to keep per group.
reset_index is optional.
This works for duplicated values
If there are duplicated values among the top-n values and you want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])