Pandas groupby aggregation for non-numeric data

my sample df looks like this:
sid score cat_type
101 70 na
102 56 PNP
101 65 BAW
103 88 SAO
103 50 na
102 42 VVG
105 79 SAE
....
df_groupby = df.groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', '?')
)
With the groupby, I want to get the first cat_type value that is not na within each group and assign it to cat_type_first_row.
my final df should look like this:
sid score_max cat_type_first_row
101 70 BAW
102 56 PNP
103 88 SAO
105 79 SAE
....
Could you please assist me in solving this problem?

Try replacing the string 'na' with NaN and then using 'first', which skips NaN values within each group:
import numpy as np

df_groupby = df.replace('na', np.nan).groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', 'first')
)
df_groupby
score_max cat_type_first_row
sid
101 70 BAW
102 56 PNP
103 88 SAO
105 79 SAE

If you do not want to drop any row, you could use:
(df.merge((df.dropna(subset=['cat_type'])
             .groupby('sid')['cat_type']
             .first()
             .rename('cat_type_first_row')
           ), on='sid')
)
Output:
sid score cat_type cat_type_first_row
0 101 70 NaN BAW
1 101 65 BAW BAW
2 102 56 PNP PNP
3 102 42 VVG PNP
4 103 88 SAO SAO
5 103 50 NaN SAO
6 105 79 SAE SAE
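
If you also want the per-group score_max attached to every row, a transform-based sketch (not part of the original answer; it assumes the 'na' strings were already replaced with np.nan so that 'first' picks the first non-null value) would be:

# Sketch: keep every row and attach the group-wise aggregates via transform.
out = df.assign(
    score_max=df.groupby('sid')['score'].transform('max'),
    cat_type_first_row=df.groupby('sid')['cat_type'].transform('first'),
)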

You can define a function that takes the grouped pandas Series as input. I've tested this code and got your desired output (I added rows for the case where all cat_type values are np.nan for a group):
import numpy as np
import pandas as pd

df = {
    'sid': [101, 102, 101, 103, 103, 102, 105, 106, 106],
    'score': [70, 56, 65, 88, 50, 42, 79, 0, 0],
    'cat_type': [np.nan, 'PNP', 'BAW', 'SAO', np.nan, 'VVG', 'SAE', np.nan, np.nan]
}
df = pd.DataFrame(df)
display(df)

def get_cat_type_first_row(series):
    # return the first non-null value in the group, or NaN if there is none
    series_nona = series.dropna()
    if len(series_nona) == 0:
        return np.nan
    else:
        return series_nona.iloc[0]

df_groupby = df.groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', get_cat_type_first_row)
)
df_groupby
Output:
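Running the snippet should print roughly the following; the sid 106 group, where every cat_type is NaN, comes out as NaN:
     score_max cat_type_first_row
sid
101         70                BAW
102         56                PNP
103         88                SAO
105         79                SAE
106          0                NaN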

Related

count (SQL, DB2)

I have this table:
CODE  IDNR  NAME  LIMIT
123   80    XXX   2019-05
123   81    XXX   2019-10
124   80    YYY   2019-01
125   80    ZZZ   2019-05
125   81    ZZZ   2019-06
125   80    ZZZ   2019-07
126   80    III   2019-05
126   80    III   2019-09
126   80    III   2019-07
I want a new column (Count-LIMIT) containing how many LIMIT values there are per CODE, and another containing YES if the LIMIT values are continuous and NO if not.
The result I want looks like this:
CODE  IDNR  NAME  LIMIT    Count-Limit  CON
123   80    XXX   2019-05  2            NO
123   81    XXX   2019-10  2            NO
124   80    YYY   2019-01  1            NO
125   80    ZZZ   2019-05  3            YES
125   81    ZZZ   2019-06  3            YES
125   80    ZZZ   2019-07  3            YES
126   80    III   2019-05  3            NO
126   80    III   2019-09  3            NO
126   80    III   2019-07  3            NO
THANKS!
Try this:
WITH T (CODE, IDNR, NAME, LIMIT) AS
(
  VALUES
    (123, 80, 'XXX', '2019-05')
  , (123, 81, 'XXX', '2019-10')
  , (124, 80, 'YYY', '2019-01')
  , (125, 80, 'ZZZ', '2019-05')
  , (125, 81, 'ZZZ', '2019-06')
  , (125, 80, 'ZZZ', '2019-07')
  , (126, 80, 'III', '2019-05')
  , (126, 80, 'III', '2019-09')
  , (126, 80, 'III', '2019-07')
  , (128, 80, 'AAA', '2021-01')
  , (128, 80, 'AAA', '2021-03')
  , (128, 80, 'AAA', '2021-05')
  , (128, 80, 'AAA', '2021-07')
  , (128, 80, 'AAA', '2021-08')
  , (128, 80, 'AAA', '2021-09')
)
SELECT
  T.*
, COUNT (1) OVER (PARTITION BY CODE) AS COUNT_LIMIT
, CASE
    WHEN TO_DATE (LIMIT || '-01', 'YYYY-MM-DD') IN
    (
      LAG  (TO_DATE (LIMIT || '-01', 'YYYY-MM-DD')) OVER (PARTITION BY CODE ORDER BY LIMIT) + 1 MONTH
    , LEAD (TO_DATE (LIMIT || '-01', 'YYYY-MM-DD')) OVER (PARTITION BY CODE ORDER BY LIMIT) - 1 MONTH
    )
    THEN 'YES'
    ELSE 'NO'
  END AS CON
FROM T
ORDER BY CODE, LIMIT
The result is:
CODE  IDNR  NAME  LIMIT    COUNT_LIMIT  CON
123   80    XXX   2019-05  2            NO
123   81    XXX   2019-10  2            NO
124   80    YYY   2019-01  1            NO
125   80    ZZZ   2019-05  3            YES
125   81    ZZZ   2019-06  3            YES
125   80    ZZZ   2019-07  3            YES
126   80    III   2019-05  3            NO
126   80    III   2019-07  3            NO
126   80    III   2019-09  3            NO
128   80    AAA   2021-01  6            NO
128   80    AAA   2021-03  6            NO
128   80    AAA   2021-05  6            NO
128   80    AAA   2021-07  6            YES
128   80    AAA   2021-08  6            YES
128   80    AAA   2021-09  6            YES

Groupby each group, then divide two ratios (current year / latest year for each column)

I'd like to create a new column by dividing each year's value by the latest year's value, for Col_1 and Col_2 respectively, within each group, and then divide the two ratios.
Methodology: calculate (EachYrCol_1 / Yr2000Col_1) / (EachYrCol_2 / Yr2000Col_2) for each group.
See example below:
Year  Group  Col_1  Col_2  New Column
1995  A      100    11     (100/600)/(11/66)
1996  A      200    22     (200/600)/(22/66)
1997  A      300    33     (300/600)/(33/66)
1998  A      400    44     .............
1999  A      500    55     .............
2000  A      600    66     .............
1995  B      700    77     (700/1200)/(77/399)
1996  B      800    88     (800/1200)/(88/399)
1997  B      900    99     (900/1200)/(99/399)
1998  B      1000   199    .............
1999  B      1100   299    .............
2000  B      1200   399    .............
Sample dataset:
import pandas as pd

df = pd.DataFrame({'Year': [1995, 1996, 1997, 1998, 1999, 2000, 1995, 1996, 1997, 1998, 1999, 2000],
                   'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200],
                   'Col_2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 199, 299, 399]})
Use GroupBy.transform with 'last' to build a helper DataFrame, so each column can be divided element-wise:
df1 = df.groupby('Group').transform('last')
df['New'] = df['Col_1'].div(df1['Col_1']).div(df['Col_2'].div(df1['Col_2']))
print(df)
Year Group Col_1 Col_2 New
0 1995 A 100 11 1.000000
1 1996 A 200 22 1.000000
2 1997 A 300 33 1.000000
3 1998 A 400 44 1.000000
4 1999 A 500 55 1.000000
5 2000 A 600 66 1.000000
6 1995 B 700 77 3.022727
7 1996 B 800 88 3.022727
8 1997 B 900 99 3.022727
9 1998 B 1000 199 1.670854
10 1999 B 1100 299 1.223244
11 2000 B 1200 399 1.000000
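
One caveat (my reading, not stated above): transform('last') uses whatever row happens to be last in each group, so it assumes the frame is sorted by Year with 2000 last. If that might not hold, a sketch that picks the Year == 2000 baseline explicitly could look like this (variable names are illustrative):

# Baseline values per group taken from the Year == 2000 rows
# (assumes exactly one 2000 row per group).
base = df[df['Year'] == 2000].set_index('Group')
df['New'] = (df['Col_1'] / df['Group'].map(base['Col_1'])).div(
             df['Col_2'] / df['Group'].map(base['Col_2']))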

Changing the value of a column in a pandas chain

I have a dataset like this:
year artist track time date.entered wk1 wk2
2000 Pac Baby 4:22 2000-02-26 87 82
2000 Geher The 3:15 2000-09-02 91 87
2000 three_DoorsDown Kryptonite 3:53 2000-04-08 81 70
2000 ATeens Dancing_Queen 3:44 2000-07-08 97 97
2000 Aaliyah I_Dont_Wanna 4:15 2000-01-29 84 62
2000 Aaliyah Try_Again 4:03 2000-03-18 59 53
2000 Yolanda Open_My_Heart 5:30 2000-08-26 76 76
My desired output is like this:
year artist track time date week rank
0 2000 Pac Baby 4:22 2000-02-26 1 87
1 2000 Pac Baby 4:22 2000-03-04 2 82
6 2000 ATeens Dancing_Queen 3:44 2000-07-08 1 97
7 2000 ATeens Dancing_Queen 3:44 2000-07-15 2 97
8 2000 Aaliyah I_Dont_Wanna 4:15 2000-01-29 1 84
Basically, I am tidying up the given billboard data.
Without chaining, I can do this easily:
import pandas as pd

df = pd.read_clipboard()
df1 = (pd.wide_to_long(df, 'wk', i=df.columns.values[:5], j='week')
         .reset_index()
         .rename(columns={'date.entered': 'date', 'wk': 'rank'}))
df1['date'] = pd.to_datetime(df1['date']) + pd.to_timedelta((df1['week'] - 1) * 7, 'd')
df1 = df1.sort_values(by=['track', 'date'])
print(df1.head())
Question
Is there a way I can chain the df1['date'] = pd.to_datetime(...) part, so that the whole operation fits into a single chain?
Use assign:
df1 = (pd.wide_to_long(df, 'wk', i=df.columns.values[:5], j='week')
         .reset_index()
         .rename(columns={'date.entered': 'date', 'wk': 'rank'})
         .assign(date=lambda x: pd.to_datetime(x['date']) +
                                pd.to_timedelta((x['week'] - 1) * 7, 'd'))
         .sort_values(by=['track', 'date'])
      )
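
If a step does not express naturally as an assign, .pipe can also keep it inside the chain. A minimal sketch under the same setup, using a hypothetical helper named add_date:

def add_date(frame):
    # hypothetical helper: shift 'date' forward by (week - 1) weeks
    frame = frame.copy()
    frame['date'] = pd.to_datetime(frame['date']) + pd.to_timedelta((frame['week'] - 1) * 7, 'd')
    return frame

df1 = (pd.wide_to_long(df, 'wk', i=df.columns.values[:5], j='week')
         .reset_index()
         .rename(columns={'date.entered': 'date', 'wk': 'rank'})
         .pipe(add_date)
         .sort_values(by=['track', 'date'])
      )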

Switching every 2nd row value in pandas

My dataframe is a list of football games with varying stats, around 300 entries.
game_id team opp_team avg_marks
0 2919 STK BL 122
1 2919 BL STK 114
2 2920 RICH SYD 135
3 2920 SYD RICH 108
I would like to add the opposition stats as a new column for each entry. The resultant dataframe would appear like this:
game_id team opp_team avg_marks opp_avg_marks
0 2919 STK BL 122 114
1 2919 BL STK 114 122
2 2920 RICH SYD 135 108
3 2920 SYD RICH 108 135
Any suggestions would be most welcome; I'm new to this forum. I have tried mapping, but the lookup is conditional on two columns, game_id and opp_team.
Ideally I would add it in the original spreadsheet, but I created a cumulative average for the season in pandas, so I was hoping there would be a way to incorporate this as well.
You could group on game_id and reverse the avg_marks values within each group:
In [725]: df.groupby('game_id')['avg_marks'].transform(lambda x: x[::-1])
Out[725]:
0 114
1 122
2 108
3 135
Name: avg_marks, dtype: int64
In [726]: df['opp_avg_marks'] = (df.groupby('game_id')['avg_marks']
.transform(lambda x: x[::-1]))
In [727]: df
Out[727]:
game_id team opp_team avg_marks opp_avg_marks
0 2919 STK BL 122 114
1 2919 BL STK 114 122
2 2920 RICH SYD 135 108
3 2920 SYD RICH 108 135
Or, build a dict mapping team to avg_marks, then use map on opp_team:
In [729]: df['opp_team'].map(df.set_index('team')['avg_marks'].to_dict())
Out[729]:
0 114
1 122
2 108
3 135
Name: opp_team, dtype: int64
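
One caveat with the team-based dict: it only works because each team appears in a single game in this sample. If the same team can show up in several games, a sketch (not from the original answer) that joins on both game_id and opp_team would be safer:

# Build a lookup of each team's avg_marks per game, then merge it back
# against the opposing team of the same game.
lookup = (df[['game_id', 'team', 'avg_marks']]
            .rename(columns={'team': 'opp_team', 'avg_marks': 'opp_avg_marks'}))
df = df.merge(lookup, on=['game_id', 'opp_team'], how='left')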

Group by column not having a specific value

I am trying to obtain a list of Case_Id's where the case does not contain a specific RoleId using Microsoft Sql Server 2012.
For example, I would like to obtain a collection of Case_Id's that do not contain a RoleId of 4.
So from the data set below the query would exclude Case_Id's 49, 50, and 53.
Id RoleId Person_Id Case_Id
--------------------------------------
108 4 108 49
109 1 109 49
110 4 110 50
111 1 111 50
112 1 112 51
113 2 113 52
114 1 114 52
115 7 115 53
116 4 116 53
117 3 117 53
So far I have tried the following:
SELECT Case_Id
FROM [dbo].[caseRole] cr
WHERE cr.RoleId!=4
GROUP BY Case_Id ORDER BY Case_Id
The NOT EXISTS operator seems to fit your need exactly:
SELECT DISTINCT Case_Id
FROM [dbo].[caseRole] cr
WHERE NOT EXISTS (SELECT *
                  FROM [dbo].[caseRole] cr_inner
                  WHERE cr_inner.Case_Id = cr.Case_Id
                    AND cr_inner.RoleId = 4);
Just add a having clause instead of where:
SELECT Case_Id
FROM [dbo].[caseRole] cr
GROUP BY Case_Id
HAVING SUM(case when cr.RoleId = 4 then 1 else 0 end) = 0
ORDER BY Case_Id;