Pandas - return the Month containing the Max Value for each Year

I have a dataframe like:
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 8
2019 5 87
2019 6 1
I'd like the dataframe to return the Month and Value for each year where the Value is the maximum:
year month value
2017 1 100
2018 3 88
2019 5 87
I've attempted something like df = df.groupby(["Year", "Month"])['Value'].max(); however, it returns the full data set because each Year/Month pair is unique (I believe).

You can get the index where the top Value occurs with .groupby(...).idxmax() and use that to index into the original dataframe:
In [28]: df.loc[df.groupby("Year")["Value"].idxmax()]
Out[28]:
Year Month Value
0 2017 1 100
3 2018 3 88
5 2019 5 87
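Note that idxmax keeps only the first row per group when the maximum value is tied; the answer below keeps all tied rows instead.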

Here is a solution that also handles the possibility of duplicate maxima:
m = df.groupby('Year')['Value'].transform('max') == df['Value']
dfmax = df.loc[m]
Full example:
import io
import pandas as pd
data = '''\
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 88
2019 5 87
2019 6 1'''
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+')
m = df.groupby('Year')['Value'].transform('max') == df['Value']
print(df[m])
Year Month Value
0 2017 1 100
3 2018 3 88
4 2018 4 88
5 2019 5 87

Related

Pandas groupby and filtering out groups with fewer than N rows

I have a pandas df of the following format
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
BBB 2022 1 1 5
BBB 2022 1 2 10
BBB 2022 2 1 10
BBB 2022 2 2 15
What I am looking to do is filter this df by grouping by STOCK, YR, and MONTH and keeping only the groups with 3 or more entries.
So the resulting df looks like
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
Note that BBB is eliminated because it had only 2 rows in each group when grouped by STOCK, YR, and MONTH.
I have tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x.STOCK.nunique() > 5) but this resulted in an empty frame.
Also tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x['STOCK','YR','MONTH'].nunique() > 5) but this resulted in a KeyError: ('STOCK', 'YR', 'MONTH')
Thanks!
Use GroupBy.transform('count'):
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('count').ge(3)]
or 'size':
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').ge(3)]
output:
STOCK YR MONTH DAY PRICE
0 AAA 2022 1 1 10
1 AAA 2022 1 2 11
2 AAA 2022 1 3 10
3 AAA 2022 1 4 15
4 AAA 2022 1 5 10
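The difference between the two: 'count' skips NaN in the counted column, while 'size' counts every row. A minimal sketch with a hypothetical toy frame containing a missing price:
import pandas as pd

toy = pd.DataFrame({'STOCK': ['a', 'a', 'a', 'b'],
                    'PRICE': [1.0, None, 3.0, 4.0]})
g = toy.groupby('STOCK')['PRICE']
print(g.transform('count'))  # 2, 2, 2, 1 -- NaN excluded
print(g.transform('size'))   # 3, 3, 3, 1 -- every row counted
Here the answers transform the grouping column STOCK itself, which never holds NaN, so both give the same mask.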
Use GroupBy.transform:
If you need the group sizes (not excluding possible NaNs):
# test the number of rows per group
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').ge(3)]
Or:
# can be slow on a large df
df.groupby(['STOCK', 'YR', 'MONTH']).filter(lambda x: len(x) >= 3)

Pandas - creating new column based on data from other records

I have a pandas dataframe which has the following columns:
Day, Month, Year, City, Temperature.
I would like to have a new column that has the average (mean) temperature on the same date (day/month) across all previous years.
Can someone please assist?
Thanks :-)
Try:
import numpy as np
import pandas as pd

dti = pd.date_range('2000-1-1', '2021-12-1', freq='D')
temp = np.random.randint(10, 20, len(dti))
df = pd.DataFrame({'Day': dti.day, 'Month': dti.month, 'Year': dti.year,
'City': 'Nice', 'Temperature': temp})
out = df.set_index('Year').groupby(['City', 'Month', 'Day']) \
        .expanding()['Temperature'].mean().reset_index()
Output:
>>> out
Day Month Year City Temperature
0 1 1 2000 Nice 12.000000
1 1 1 2001 Nice 12.000000
2 1 1 2002 Nice 11.333333
3 1 1 2003 Nice 12.250000
4 1 1 2004 Nice 11.800000
... ... ... ... ... ...
8001 31 12 2016 Nice 15.647059
8002 31 12 2017 Nice 15.555556
8003 31 12 2018 Nice 15.631579
8004 31 12 2019 Nice 15.750000
8005 31 12 2020 Nice 15.666667
[8006 rows x 5 columns]
Focus on 1st January of the dataset:
>>> df[df['Day'].eq(1) & df['Month'].eq(1)]
Day Month Year City Temperature # Mean
0 1 1 2000 Nice 12 # 12
366 1 1 2001 Nice 12 # 12
731 1 1 2002 Nice 10 # 11.33
1096 1 1 2003 Nice 15 # 12.25
1461 1 1 2004 Nice 10 # 11.80
1827 1 1 2005 Nice 12 # and so on
2192 1 1 2006 Nice 17
2557 1 1 2007 Nice 16
2922 1 1 2008 Nice 19
3288 1 1 2009 Nice 12
3653 1 1 2010 Nice 10
4018 1 1 2011 Nice 16
4383 1 1 2012 Nice 13
4749 1 1 2013 Nice 15
5114 1 1 2014 Nice 14
5479 1 1 2015 Nice 13
5844 1 1 2016 Nice 15
6210 1 1 2017 Nice 13
6575 1 1 2018 Nice 15
6940 1 1 2019 Nice 18
7305 1 1 2020 Nice 11
7671 1 1 2021 Nice 14
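If the mean should cover only strictly previous years, as the question asks (the expanding mean above includes the current year's value), one sketch is to shift the expanding mean within each group; PrevMean is a hypothetical column name and df is the frame built above:
df['PrevMean'] = (df.sort_values('Year')
                    .groupby(['City', 'Month', 'Day'])['Temperature']
                    .transform(lambda s: s.expanding().mean().shift()))
The first year in each group then comes out as NaN, since there are no previous years to average.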

Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes

I am stuck on a multidimensional key value error. I have a dataframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I have selected the rows where 'Corr_to_CY' >= 0.70 and returned the values of the 'cyear' column into a new df called 'cyears'. I need to use this as an index to find the year and RMSE value where the 'year' column is in the cyears df. This is my best attempt, and I get the value error: cannot index with multidimensional key. Do I need to change the index df "cyears" to something else (series, list, etc.) for this to work? Thank you, and here is my code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
You can use the isin method:
import pandas as pd
# Sample creation
import io
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as pandas DataFrame instead of Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']
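The original error arises because comp['year'] == cyears compares a Series against a DataFrame, which produces a two-dimensional key. As a sketch of an alternative (assuming the same comp frame as above), an inner merge also keeps only the matching years; pandas suffixes the overlapping cyear column automatically:
picked = comp.loc[comp['Corr_to_CY'] >= 0.7, ['cyear']]
result = comp.merge(picked, left_on='year', right_on='cyear')['RMSE']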

R - get a vector that tells me if a value of another vector is its first appearance or not

I have a data frame of sales with three columns: the code of the customer, the month the customer bought that item, and the year.
A customer can buy something in September and then make another purchase in December, so they appear twice. But I'm interested in counting the genuinely new customers by month and year.
So I have thought of iterating with some checks, using the %in% function to build a boolean vector that tells me whether a customer is new or not, and then counting by month and year with SQL using this new vector.
But I'm wondering if there's a specific function or a better way to do that.
This is an example of the data I would like to have:
date cust month new_customer
1 14975 25 1 TRUE
2 14976 30 1 TRUE
3 14977 22 1 TRUE
4 14978 4 1 TRUE
5 14979 25 1 FALSE
6 14980 11 1 TRUE
7 14981 17 1 TRUE
8 14982 17 1 FALSE
9 14983 18 1 TRUE
10 14984 7 1 TRUE
11 14985 24 1 TRUE
12 14986 22 1 FALSE
To put it more simply: the data frame is sorted by date, and I'm interested in a vector (new_customer) that tells me whether the customer purchased something for the first time or not. For example, customer 25 bought something on the first day and then bought something again four days later, so it is not a new customer. The same can be seen with customers 17 and 22.
I created dummy data myself with id, month (in numeric format), and year:
dat <-data.frame(
id = c(1,2,3,4,5,6,7,8,1,3,4,5,1,2,2),
month = c(1,6,7,8,2,3,4,8,11,1,10,9,1,12,2),
year = c(2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2021,2021,2021,2021,2021)
)
id month year
1 1 1 2019
2 2 6 2019
3 3 7 2019
4 4 8 2019
5 5 2 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
9 1 11 2020
10 3 1 2020
11 4 10 2021
12 5 9 2021
13 1 1 2021
14 2 12 2021
15 2 2 2021
Then group by id and arrange by year and month (the order is meaningful), and use filter with row_number():
library(dplyr)

dat %>%
group_by(id) %>%
arrange(year, month) %>%
filter(row_number() == 1)
id month year
<dbl> <dbl> <dbl>
1 1 1 2019
2 5 2 2019
3 2 6 2019
4 3 7 2019
5 4 8 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
Sample SQL code
You can adapt your code according to this logic.
Create the table:
CREATE TABLE PURCHASE(Posting_Date DATE, Customer_Id VARCHAR(10), Customer_Name VARCHAR(15));
Insert data into the table:
Posting_Date Customer_Id Customer_Name
2018-01-01 C_01 Jack
2018-02-01 C_01 Jack
2018-03-01 C_01 Jack
2018-04-01 C_02 James
2019-04-01 C_01 Jack
2019-05-01 C_01 Jack
2019-05-01 C_03 Gill
2020-01-01 C_02 James
2020-01-01 C_04 Jones
Code
WITH Date_CTE (PostingDate,CustomerID,FirstYear)
AS
(
SELECT MIN(Posting_Date) as [Date],
Customer_Id,
YEAR(MIN(Posting_Date)) as [F_Purchase_Year]
FROM PURCHASE
GROUP BY Customer_Id
)
SELECT T.[ActualYear],(CASE WHEN T.[Customer Status] = 'new' THEN COUNT(T.[Customer Status]) END) AS [New Customer]
FROM (
SELECT DISTINCT YEAR(T2.Posting_Date) AS [ActualYear],
T2.Customer_Id,
(CASE WHEN T1.FirstYear = YEAR(T2.Posting_Date) THEN 'new' ELSE 'old' END) AS [Customer Status]
FROM Date_CTE AS T1
left outer join PURCHASE AS T2 ON T1.CustomerID = T2.Customer_Id
) AS T
GROUP BY T.[ActualYear],T.[Customer Status]
Final Result
ActualYear New Customer
2018 2
2019 1
2020 1
2019 NULL
2020 NULL
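For comparison with the pandas questions above, the same first-appearance flag is a one-liner in pandas; a minimal sketch assuming a frame already sorted by date with a cust column:
import pandas as pd

df = pd.DataFrame({'cust': [25, 30, 22, 4, 25, 11, 17, 17, 18, 7, 24, 22]})
# True on a customer's first purchase, False on every later one.
df['new_customer'] = ~df['cust'].duplicated()
This reproduces the new_customer column in the example above.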

How to avoid null while doing diff using periods in pandas?

I have the below dataframe, and I am calculating the difference with the previous value using diff with periods, but that makes the first value in each group Null. Is there any way to fill that value?
example:
df['cal_val'] = df.groupby('year')['val'].diff(periods=1)
current output:
date year val cal_val
1/3/10 2010 12 NaN
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 NaN
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
expected output:
date year val cal_val
1/3/10 2010 12 12
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 10
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
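One sketch that produces exactly this output, assuming the frame above: diff leaves NaN on each group's first row, and fillna with the original column recovers the row's own value there.
df['cal_val'] = df.groupby('year')['val'].diff(periods=1).fillna(df['val'])
fillna aligns on the index, so only the NaN rows are replaced.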