Pandas pivot table selecting rows with maximum values - pandas

I have a pandas DataFrame:
df
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.01 27.71
82 A1 case1.06 29.54
1558 A3 case1.06 29.54
82 A1 case1.11 12.09
1558 A3 case1.11 32.09
82 A1 case1.16 33.35
1558 A3 case1.16 33.35
For each Id, Name pair I need to select the CaseId with maximum value.
i.e. I am seeking the following output:
Id Name CaseId Value
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
I tried the following:
import numpy as np
import pandas as pd
pd.pivot_table(df, index=['Id', 'Name'], columns=['CaseId'], values=['Value'], aggfunc=[np.max])['amax']
But this only gives the maximum value with each CaseId as a separate column, not the output I am seeking above.

sort_values + drop_duplicates
df.sort_values('Value').drop_duplicates(['Id'],keep='last')
Out[93]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71
Since we posted at the same time, here is another method:
df.sort_values('Value').groupby('Id').tail(1)
Out[98]:
Id Name CaseId Value
7 1558 A3 case1.16 33.35
0 82 A1 case1.01 37.71

This should work:
df = df.sort_values('Value', ascending=False).drop_duplicates('Id').sort_index()
Output:
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35

With nlargest and groupby
pd.concat(d.nlargest(1, ['Value']) for _, d in df.groupby('Name'))
Id Name CaseId Value
0 82 A1 case1.01 37.71
7 1558 A3 case1.16 33.35

Another idea is to create a joint column, take its max, then split it back to two columns:
df['ValueCase'] = list(zip(df['Value'], df['CaseId']))
p = pd.pivot_table(df, index=['Id', 'Name'], values=['ValueCase'], aggfunc='max')
p['Value'], p['CaseId'] = list(zip(*p['ValueCase']))
del p['ValueCase']
Results in:
CaseId Value
Id Name
82 A1 case1.01 37.71
1558 A3 case1.16 33.35
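
As a further option not taken from the answers above, the row holding each group's maximum can be looked up by index label with groupby and idxmax. This is a minimal sketch that rebuilds the sample frame from the question; the names best_rows and result are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [82, 1558, 82, 1558, 82, 1558, 82, 1558],
    "Name": ["A1", "A3"] * 4,
    "CaseId": ["case1.01", "case1.01", "case1.06", "case1.06",
               "case1.11", "case1.11", "case1.16", "case1.16"],
    "Value": [37.71, 27.71, 29.54, 29.54, 12.09, 32.09, 33.35, 33.35],
})

# idxmax returns, for every (Id, Name) group, the index label of the
# row with the largest Value (the first one if there is a tie).
best_rows = df.groupby(["Id", "Name"])["Value"].idxmax()

# Select those rows and restore the original row order.
result = df.loc[best_rows].sort_index()
print(result)
```

Unlike the sort-based answers, this leaves df itself unsorted and groups on the full Id, Name pair that the question asks about.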

Related

Splunk: Use output of search A row by row as input for search B, then produce common result table

In Splunk, I have a search producing a result table like this:
_time                A   B   C
2022-10-19 09:00:00  A1  B1  C1
2022-10-19 09:00:00  A2  B2  C2
2022-10-19 09:10:20  A3  B3  C3
Now, for each row, I want to run a second search, using the _time value as input parameter.
For above row 1 and 2 (same _time value), the result of the second search would be:
_time                D   E
2022-10-19 09:00:00  D1  E1
For above row 3, the result of the second search would be:
_time                D   E
2022-10-19 09:10:20  D3  E3
And now I want to output the results in a common table, like this:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  D1  E1
2022-10-19 09:10:20  A3  B3  C3  D3  E3
I experimented with join, append, map, appendcols and subsearch, but I am struggling both with the row-by-row character of the second search and with pulling the data together into one common table.
For example, appendcols simply tacks one result table onto another, even if they are completely unrelated and differently shaped. Like so:
_time                A   B   C   D   E
2022-10-19 09:00:00  A1  B1  C1  D1  E1
2022-10-19 09:00:00  A2  B2  C2  -   -
2022-10-19 09:10:20  A3  B3  C3  -   -
Can anybody please point me in the right direction?

Compare values from one column in table A and another column in table B

I need to create a NeedDate column in the expected output. I will compare the QtyShort from Table B with QtyReceive from table A.
In the expected output, if QtyShort = 0, NeedDate = MatlDueDate.
For the first row of table A, if 0 < QtyShort (in Table B) <= QtyReceive (=6), NeedDate = 10/08/2021 (DueDate from Table A).
If 6 < QtyShort <= 10 (QtyReceive), move to the second row, NeedDate = 10/22/2021 (DueDate from Table A).
If 10 < QtyShort <= 20 (QtyReceive), move to the third row, NeedDate = 02/01/2022 (DueDate from Table A).
If QtyShort > QtyReceive (=20), NeedDate = 09/09/9999.
This should continue in a loop until the last row of Table B has been compared.
How could we do this? Any help will be appreciated. Thank you in advance!
Table A
Item DueDate QtyReceive
A1 10/08/2021 6
A1 10/22/2021 10
A1 02/01/2022 20
Table B
Item MatlDueDate QtyShort
A1 06/01/2022 0
A1 06/02/2022 0
A1 06/03/2022 1
A1 06/04/2022 2
A1 06/05/2022 5
A1 06/06/2022 7
A1 06/07/2022 10
A1 06/08/2022 15
A1 06/09/2022 25
Expected Output:
Item MatlDueDate QtyShort NeedDate
A1 06/01/2022 0 06/01/2022
A1 06/02/2022 0 06/02/2022
A1 06/03/2022 1 10/08/2021
A1 06/04/2022 2 10/08/2021
A1 06/05/2022 5 10/08/2021
A1 06/06/2022 7 10/22/2021
A1 06/07/2022 10 10/22/2021
A1 06/08/2022 15 02/01/2022
A1 06/09/2022 25 09/09/9999
Use the OUTER APPLY operator to find the minimum DueDate in TableA that is able to fulfill the QtyShort:
select b.Item, b.MatlDueDate, b.QtyShort,
NeedDate = case when b.QtyShort = 0
then b.MatlDueDate
else isnull(a.DueDate, '9999-09-09')
end
from TableB b
outer apply
(
select DueDate = min(a.DueDate)
from TableA a
where a.Item = b.Item
and a.QtyReceive >= b.QtyShort
) a
Result:
Item  MatlDueDate  QtyShort  NeedDate
A1    2022-06-01   0         2022-06-01
A1    2022-06-02   0         2022-06-02
A1    2022-06-03   1         2021-10-08
A1    2022-06-04   2         2021-10-08
A1    2022-06-05   5         2021-10-08
A1    2022-06-06   7         2021-10-22
A1    2022-06-07   10        2021-10-22
A1    2022-06-08   15        2022-02-01
A1    2022-06-09   25        9999-09-09
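
For readers working in pandas rather than SQL, the same lookup can be sketched roughly as follows. This is only an illustration under assumptions: dates are kept as ISO strings (so that 9999-09-09 fits and string comparison matches date order), and the names table_a, table_b, pairs and earliest are made up here.

```python
import pandas as pd

# Sample data from the question, with dates as ISO strings.
table_a = pd.DataFrame({
    "Item": ["A1", "A1", "A1"],
    "DueDate": ["2021-10-08", "2021-10-22", "2022-02-01"],
    "QtyReceive": [6, 10, 20],
})
table_b = pd.DataFrame({
    "Item": ["A1"] * 9,
    "MatlDueDate": [f"2022-06-0{d}" for d in range(1, 10)],
    "QtyShort": [0, 0, 1, 2, 5, 7, 10, 15, 25],
})

# Pair every Table B row with every Table A row of the same Item,
# remembering which Table B row each pair came from.
b = table_b.reset_index().rename(columns={"index": "row"})
pairs = b.merge(table_a, on="Item", how="left")

# Keep only receipts large enough to cover the shortage, then take
# the earliest DueDate per Table B row (the role of OUTER APPLY + MIN).
pairs = pairs[pairs["QtyReceive"] >= pairs["QtyShort"]]
earliest = pairs.groupby("row")["DueDate"].min()

# Rows with no sufficient receipt get the sentinel date;
# rows with QtyShort = 0 fall back to their own MatlDueDate.
need = b["row"].map(earliest).fillna("9999-09-09")
table_b["NeedDate"] = need.where(table_b["QtyShort"] != 0, table_b["MatlDueDate"])
print(table_b)
```

The pairing step grows with the product of the two tables per Item, so this sketch suits small inputs like the sample rather than large ones.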

create a new table from 2 other tables

I want to build table a by merging two other tables, b and c.
Table a contains the columns (Parent, Style, Ending_Date, WeekNum, Net_Requirment)
and should show how much of each component is required to make product A on a given date.
The table should look like a BOM (Bill of Materials).
Can this be done with pandas?
Table b represents the demand for product A per date:
Style Date WeekNum Quantity
A 24/11/2019 0 600
A 01/12/2019 1 500
Table c represents the components and the quantity of each used to make product A:
Parent Child Q
A A1 2
A1 A11 3
A1 A12 2
So table a should be filled like this:
Parent Child Date WeekNum Net_Quantity
A A1 24/11/2019 0 1200
A1 A11 24/11/2019 0 3600
A1 A12 24/11/2019 0 2400
A A1 01/12/2019 1 1000
A1 A11 01/12/2019 1 3000
A1 A12 01/12/2019 1 2000
Welcome. To merge these tables properly, you need a common key to merge on. You could add that key to each table like this:
import pandas as pd
data2 = {'Parent':['A','A1','A1'], 'Child':['A1','A11','A12'],
'Q':[2,3,2], 'Style':['A','A','A']}
df2 = pd.DataFrame(data2)
After this you can do a left join from the first table, which gives multiple rows for the same date. So essentially this:
(Note that in a left join, each row of the left table is repeated as many times as needed to match the key in the right table.)
data = {'Style':['A','A'], 'Date':['24/11/2019', '01/12/2019'],
'WeekNum':[0,1], 'Quantity':[600,500]}
df = pd.DataFrame(data)
mergeDf = df.merge(df2,how='left', left_on='Style', right_on='Style')
mergeDf
Then to calculate:
mergeDf['Net_Quantity'] = mergeDf.Quantity*mergeDf.Q
mergeDf.drop(['Q'], axis = 1,inplace=True)
result:
Style Date WeekNum Quantity Parent Child Net_Quantity
0 A 24/11/2019 0 600 A A1 1200
1 A 24/11/2019 0 600 A1 A11 1800
2 A 24/11/2019 0 600 A1 A12 1200
3 A 01/12/2019 1 500 A A1 1000
4 A 01/12/2019 1 500 A1 A11 1500
5 A 01/12/2019 1 500 A1 A12 1000
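
Note that this result multiplies every row by the top-level demand, so A1 -> A11 comes out as 1800 where the question expects 3600 (600 x 2 x 3). One possible way to carry quantities down through the levels is to merge one BOM level at a time, feeding each level's Net_Quantity into the next. The sketch below is an illustration, not part of the answer above: it assumes the BOM has no cycles, and the names demand, bom, level and parts are made up.

```python
import pandas as pd

# Tables b and c from the question.
demand = pd.DataFrame({"Style": ["A", "A"],
                       "Date": ["24/11/2019", "01/12/2019"],
                       "WeekNum": [0, 1],
                       "Quantity": [600, 500]})
bom = pd.DataFrame({"Parent": ["A", "A1", "A1"],
                    "Child": ["A1", "A11", "A12"],
                    "Q": [2, 3, 2]})

# Start with the demand for the finished product on each date.
level = demand.rename(columns={"Style": "Item", "Quantity": "Required"})
parts = []
while True:
    # Find the children of every item required at the current level.
    step = level.merge(bom, left_on="Item", right_on="Parent")
    if step.empty:
        break  # no item at this level has children; assumes an acyclic BOM
    step["Net_Quantity"] = step["Required"] * step["Q"]
    parts.append(step[["Parent", "Child", "Date", "WeekNum", "Net_Quantity"]])
    # The children become the items required at the next level down.
    level = step[["Child", "Date", "WeekNum", "Net_Quantity"]].rename(
        columns={"Child": "Item", "Net_Quantity": "Required"})

# Stack all levels and group the rows by week, as in the question.
result = (pd.concat(parts, ignore_index=True)
            .sort_values("WeekNum", kind="stable")
            .reset_index(drop=True))
print(result)
```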

DISTINCT columns from 2 different tables

I have 2 tables with similar information. Let's call them DAILYROWDATA and SUMMARYDATA.
Table DAILYROWDATA
NIP NAME DEPARTMENT
A1 ARIA BB
A2 CHLOE BB
A3 RYAN BB
A4 STEVE BB
Table SUMMARYDATA
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
I need to combine these two tables, show every row of DAILYROWDATA, and write 'NA' where STATUSIN and STATUSOUT are NULL. This is the output I mean:
NIP NAME DEPARTMENT STATUSIN STATUSOUT
A1 ARIA BB 1/21/2020 8:06:23 AM 1/21/2020 8:07:53 AM
A2 CHLOE BB 1/21/2020 8:16:07 AM 1/21/2020 9:51:21 AM
A3 RYAN BB NA NA
A4 STEVE BB NA NA
A1 ARIA BB 1/22/2020 9:06:23 AM 1/22/2020 10:07:53 AM
A2 CHLOE BB 1/22/2020 9:16:07 AM 1/22/2020 10:51:21 AM
A3 RYAN BB 1/22/2020 8:15:03 AM 1/22/2020 9:12:03 AM
A4 STEVE BB NA NA
I also need a condition: STATUSIN and STATUSOUT should be 'NA' only when a given NIP, NAME, DEPARTMENT has no SUMMARYDATA row on a particular date, so the same person can appear once per date.
You want a left join to bring the two tables together. The trickier part is that you need strings in order to represent the 'NA':
select drd.*,
coalesce(cast(statusin as varchar(255)), 'NA') as statusin,
coalesce(cast(statusout as varchar(255)), 'NA') as statusout
from DAILYROWDATA drd left join
SUMMARYDATA sd
on drd.nip = sd.nip;
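
The expected output in the question lists every employee once per date, which a join on NIP alone does not produce. As a rough illustration in pandas (not the SQL answer's method), one could first build every date and employee combination and then left-join the summary onto it. The DAY column and the names daily, summary, grid and out are assumptions made for this sketch.

```python
import pandas as pd

# Sample data from the question.
daily = pd.DataFrame({"NIP": ["A1", "A2", "A3", "A4"],
                      "NAME": ["ARIA", "CHLOE", "RYAN", "STEVE"],
                      "DEPARTMENT": ["BB"] * 4})
summary = pd.DataFrame({
    "NIP": ["A1", "A2", "A1", "A2", "A3"],
    "STATUSIN": pd.to_datetime(["2020-01-21 08:06:23", "2020-01-21 08:16:07",
                                "2020-01-22 09:06:23", "2020-01-22 09:16:07",
                                "2020-01-22 08:15:03"]),
    "STATUSOUT": pd.to_datetime(["2020-01-21 08:07:53", "2020-01-21 09:51:21",
                                 "2020-01-22 10:07:53", "2020-01-22 10:51:21",
                                 "2020-01-22 09:12:03"]),
})

# Derive the calendar day of each summary row (time set to midnight).
summary["DAY"] = summary["STATUSIN"].dt.normalize()

# Every distinct day crossed with every employee in DAILYROWDATA.
days = summary[["DAY"]].drop_duplicates().sort_values("DAY")
grid = days.merge(daily, how="cross")

# Left join keeps all (day, employee) pairs; missing ones get NaT.
out = grid.merge(summary, on=["DAY", "NIP"], how="left")

# Render timestamps as text so that missing values can be shown as 'NA'.
for col in ["STATUSIN", "STATUSOUT"]:
    out[col] = out[col].dt.strftime("%m/%d/%Y %I:%M:%S %p").fillna("NA")
print(out.drop(columns="DAY"))
```

The same idea carries over to SQL by cross joining the distinct dates with DAILYROWDATA before the left join.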

Secondary Sorting (individually)

How would I do secondary sorting on a bar chart, for each individual date?
For example, I have data as follows:
Date Type Value
1/1/2020 A1 4
1/1/2020 A2 2
1/1/2020 A3 9
1/1/2020 A4 5
1/1/2020 A5 7
2/1/2020 A1 7
2/1/2020 A2 5
2/1/2020 A3 0
2/1/2020 A4 3
2/1/2020 A5 1
3/1/2020 A1 3
3/1/2020 A2 5
3/1/2020 A3 7
3/1/2020 A4 9
3/1/2020 A5 8
Now I need to plot a daily bar chart showing only the top three values for each individual date, i.e. the chart would show:
Date Type Value
1/1/2020 A3 9
1/1/2020 A4 5
1/1/2020 A5 7
2/1/2020 A1 7
2/1/2020 A2 5
2/1/2020 A4 3
3/1/2020 A3 7
3/1/2020 A4 9
3/1/2020 A5 8
That is, the top three for each individual date, not the result of first summing up A1, A2, A3, A4, A5 for each date and then sorting by the cumulative sum.
You should be able to achieve the sorting you need through having Date as the dimension and Type as the breakdown dimension.
You should be able to then sort by Date and then secondary sort by type.
Restricting to 3 per date however is something you'd need to do in your data source as Data Studio can't currently do that.
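
If the data passes through Python before it reaches the chart, the top-three-per-date restriction could be applied there. This is a minimal sketch assuming the data is available as a pandas DataFrame; the names df and top3 are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Date": ["1/1/2020"] * 5 + ["2/1/2020"] * 5 + ["3/1/2020"] * 5,
    "Type": ["A1", "A2", "A3", "A4", "A5"] * 3,
    "Value": [4, 2, 9, 5, 7, 7, 5, 0, 3, 1, 3, 5, 7, 9, 8],
})

# Sort by Value descending, keep the first three rows of each Date,
# then restore the original row order.
top3 = (df.sort_values("Value", ascending=False)
          .groupby("Date")
          .head(3)
          .sort_index())
print(top3)
```

The filtered frame can then be exported and used as the chart's data source in place of the full table.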