How to extract info based on the latest row - sql

I have two tables:-
TABLE A :-
ORNO DEL PONO QTY
801 123 1 80
801 123 2 60
801 123 3 70
801 151 1 95
801 151 3 75
802 130 1 50
802 130 2 40
802 130 3 30
802 181 2 55
TABLE B:-
ORNO PONO STATUS ITEM
801 1 12 APPLE
801 2 12 ORANGE
801 3 12 MANGO
802 1 22 PEAR
802 2 22 KIWI
802 3 22 MELON
For each ORNO and PONO in Table B, I wish to extract the QTY from the Table A row with the latest (highest) DEL, using SQL. The final output should look like this:-
OUTPUT:-
ORNO PONO STATUS ITEM QTY
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30
Thanks.

select b.*, y.QTY
from
(
    select a.ORNO, a.PONO, MAX(a.DEL) [max]
    from #tA a
    group by a.ORNO, a.PONO
) x
join #tA y on y.ORNO = x.ORNO and y.PONO = x.PONO and y.DEL = x.max
join #tB b on b.ORNO = y.ORNO and b.PONO = y.PONO
Output:
ORNO PONO STATUS ITEM QTY
----------- ----------- ----------- ---------- -----------
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30
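For anyone who wants to try the query above without a SQL Server instance, here is a minimal sketch that runs the same join-on-MAX idea against an in-memory SQLite database from Python. The table names tA and tB (standing in for the temp tables #tA and #tB) and the alias max_del are assumptions made for this sketch; the data is copied from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tA (ORNO INT, DEL INT, PONO INT, QTY INT);
CREATE TABLE tB (ORNO INT, PONO INT, STATUS INT, ITEM TEXT);
INSERT INTO tA VALUES
  (801,123,1,80),(801,123,2,60),(801,123,3,70),
  (801,151,1,95),(801,151,3,75),
  (802,130,1,50),(802,130,2,40),(802,130,3,30),
  (802,181,2,55);
INSERT INTO tB VALUES
  (801,1,12,'APPLE'),(801,2,12,'ORANGE'),(801,3,12,'MANGO'),
  (802,1,22,'PEAR'),(802,2,22,'KIWI'),(802,3,22,'MELON');
""")

# x finds the latest DEL per (ORNO, PONO); y fetches the QTY of that row;
# b adds STATUS and ITEM from the second table.
latest_rows = conn.execute("""
SELECT b.ORNO, b.PONO, b.STATUS, b.ITEM, y.QTY
FROM (SELECT ORNO, PONO, MAX(DEL) AS max_del
      FROM tA
      GROUP BY ORNO, PONO) x
JOIN tA y ON y.ORNO = x.ORNO AND y.PONO = x.PONO AND y.DEL = x.max_del
JOIN tB b ON b.ORNO = y.ORNO AND b.PONO = y.PONO
ORDER BY b.ORNO, b.PONO
""").fetchall()

for row in latest_rows:
    print(row)
```

The derived table is joined back to tA because GROUP BY alone cannot return the QTY that belongs to the MAX(DEL) row.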

standard sql: Get customer count and first purchase date per customer and store_id

I use standard SQL and need a query that returns the total number of purchases per customer for each store_id, as well as the first purchase date per customer for each store_id.
I have a table with this structure:
customer_id store_id product_no customer_no purchase_date price
1 10 100 200 2022-01-01 50
1 10 110 200 2022-01-02 70
1 20 120 200 2022-01-02 60
1 20 130 200 2022-01-02 40
1 30 140 200 2022-01-02 60
Current query:
select
    customer_id,
    store_id,
    product_no,
    customer_no,
    purchase_date,
    price,
    first_value(purchase_date) over (partition by customer_no order by purchase_date) as first_purchase_date,
    count(customer_no) over (partition by customer_id, store_id, customer_no) as customer_purchase_count
from my_table
This gives me this type of output:
customer_id store_id product_no customer_no purchase_date price first_purchase_date customer_purchase_count
1 10 100 200 2022-01-01 50 2022-01-01 2
1 10 110 200 2022-01-02 70 2022-01-01 2
1 20 120 210 2022-01-02 60 2022-01-02 2
1 20 130 210 2022-01-02 40 2022-01-02 2
1 30 140 220 2022-01-10 60 2022-01-10 3
1 10 140 220 2022-01-10 60 2022-01-10 3
1 10 140 220 2022-01-10 60 2022-01-10 3
1 10 150 220 2022-01-10 60 2022-01-10 1
However, I want the final result to look like the table below. How can I achieve that? If possible, I would also like to add 4 columns called "only_in_store_10", "only_in_store_20", "only_in_store_30" and "only_in_store_40" that flag every customer_no that shopped only at that store. Each row of a customer_no that satisfies the condition should be marked with a ○.
customer_id store_id product_no customer_no purchase_date price first_purchase_date customer_purchase_count first_purchase_date_per_store first_purchase_date_per_store store_row_nr
1 10 100 200 2022-01-01 50 2022-01-01 2 2022-01-01 1 1
1 10 110 200 2022-01-02 70 2022-01-01 2 2022-01-02 1 1
1 20 120 210 2022-01-02 60 2022-01-02 2 2022-01-02 2 1
1 20 130 210 2022-01-03 40 2022-01-02 2 2022-01-02 2 1
1 30 140 220 2022-01-10 60 2022-01-10 3 2022-01-10 1 1
1 10 140 220 2022-01-11 50 2022-01-11 3 2022-01-11 2 1
1 10 140 220 2022-01-12 40 2022-01-11 3 2022-01-11 2 2
1 10 150 220 2022-01-13 60 2022-01-13 1 2022-01-13 1 1
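One possible reading of the request above, sketched with window functions. This is only a sketch under assumptions: it uses a cut-down three-column version of my_table, runs on SQLite 3.25+ through Python's sqlite3 so it can be tried locally, and the column names purchases_per_store and only_in_store are made up here. The numbers will not match the hand-written tables in the question exactly, since those are not fully consistent with each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (customer_no INT, store_id INT, purchase_date TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (200, 10, "2022-01-01"), (200, 10, "2022-01-02"),
        (210, 20, "2022-01-02"), (210, 20, "2022-01-03"),
        (220, 30, "2022-01-10"), (220, 10, "2022-01-11"),
        (220, 10, "2022-01-12"), (220, 10, "2022-01-13"),
    ],
)

# Partitioning by customer_no gives per-customer values; adding store_id
# to the partition gives per-customer-per-store values.
rows = conn.execute("""
SELECT customer_no, store_id, purchase_date,
       MIN(purchase_date) OVER (PARTITION BY customer_no) AS first_purchase_date,
       MIN(purchase_date) OVER (PARTITION BY customer_no, store_id)
           AS first_purchase_date_per_store,
       COUNT(*) OVER (PARTITION BY customer_no, store_id) AS purchases_per_store,
       ROW_NUMBER() OVER (PARTITION BY customer_no, store_id
                          ORDER BY purchase_date) AS store_row_nr,
       -- a customer shopped at exactly one store when min and max store_id agree
       CASE WHEN MIN(store_id) OVER (PARTITION BY customer_no)
               = MAX(store_id) OVER (PARTITION BY customer_no)
            THEN store_id END AS only_in_store
FROM my_table
ORDER BY customer_no, purchase_date
""").fetchall()

for row in rows:
    print(row)
```

The single only_in_store column could be split into the four requested flag columns with one CASE expression per store, for example by also testing store_id = 10 and returning '○'.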

Getting an element and the next from a table

I have a table with ids, cities and some sequence number, say:
ID CITY SEQ_NO
1 Milan 123
2 Paris 124
1 Rome 125
1 Naples 126
1 Strasbourg 130
3 London 129
3 Manchester 132
2 Strasbourg 128
3 Rome 131
2 Rome 127
4 Moscow 135
5 New York 136
4 Helsinki 137
I want to get the city that comes after Rome for the same id. I can order the rows within each id with something like:
SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SEQ_NO) as rownum,
id,
city,
seq_no
FROM mytable
I get:
rownum ID CITY SEQ_NO
1 1 Milan 123
2 1 Rome 125
3 1 Naples 126
4 1 Strasbourg 130
1 2 Paris 124
2 2 Rome 127
3 2 Strasbourg 128
1 3 London 129
2 3 Rome 131
3 3 Manchester 132
1 4 Moscow 135
2 4 Helsinki 137
1 5 New York 136
and I want to get:
ID CITY SEQ_NO
1 Rome 125
1 Naples 126
2 Rome 127
2 Strasbourg 128
3 Rome 131
3 Manchester 132
How do I proceed?
Hmmm . . . I might suggest window functions:
select t.*
from (select t.*,
             lag(city) over (partition by id order by seq_no) as prev_city
      from mytable t
     ) t
where 'Rome' in (city, prev_city)
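To check the lag() approach against the sample data, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). The subquery alias with_prev is an assumption of this sketch; the table name and rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INT, city TEXT, seq_no INT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "Milan", 123), (2, "Paris", 124), (1, "Rome", 125),
        (1, "Naples", 126), (1, "Strasbourg", 130), (3, "London", 129),
        (3, "Manchester", 132), (2, "Strasbourg", 128), (3, "Rome", 131),
        (2, "Rome", 127), (4, "Moscow", 135), (5, "New York", 136),
        (4, "Helsinki", 137),
    ],
)

# A row is kept if it is Rome itself or if the previous row
# (same id, ordered by seq_no) was Rome.
rome_and_next = conn.execute("""
SELECT id, city, seq_no
FROM (SELECT t.*,
             LAG(city) OVER (PARTITION BY id ORDER BY seq_no) AS prev_city
      FROM mytable t) with_prev
WHERE 'Rome' IN (city, prev_city)
ORDER BY id, seq_no
""").fetchall()

for row in rome_and_next:
    print(row)
```

Ids 4 and 5 never visit Rome, so they drop out. The first row of each id has a NULL prev_city, which makes the IN test evaluate to NULL rather than true unless the city itself is Rome.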

how to add incremental number to specific column in pandas

I have following dataframe in pandas
code tank length dia diff
123 3 625 210 -0.38
123 5 635 210 1.2
I want to expand each row into six rows, changing only length: add 1 five times if diff is positive, and subtract 1 five times if diff is negative. My desired dataframe looks like
code tank length diameter
123 3 625 210
123 3 624 210
123 3 623 210
123 3 622 210
123 3 621 210
123 3 620 210
123 5 635 210
123 5 636 210
123 5 637 210
123 5 638 210
123 5 639 210
123 5 640 210
I am doing following in pandas.
df.add(1)
But it's adding 1 to all the columns.
Repeat each index value 6 times with Index.repeat, then add a per-row counter with GroupBy.cumcount, and finally create a default RangeIndex with DataFrame.reset_index:
df1 = df.loc[df.index.repeat(6)].copy()
df1['length'] += df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
Or:
df1 = (df.loc[df.index.repeat(6)]
.assign(length = lambda x: x.groupby(level=0).cumcount() + x['length'])
.reset_index(drop=True))
print (df1)
code tank length dia
0 123 3 625 210
1 123 3 626 210
2 123 3 627 210
3 123 3 628 210
4 123 3 629 210
5 123 3 630 210
6 123 5 635 210
7 123 5 636 210
8 123 5 637 210
9 123 5 638 210
10 123 5 639 210
11 123 5 640 210
EDIT:
import numpy as np

df1 = df.loc[df.index.repeat(6)].copy()
add = df1.groupby(level=0).cumcount()
mask = df1['diff'] < 0
df1['length'] = np.where(mask, df1['length'] - add, df1['length'] + add)
df1 = df1.reset_index(drop=True)
print (df1)
code tank length dia diff
0 123 3 625 210 -0.38
1 123 3 624 210 -0.38
2 123 3 623 210 -0.38
3 123 3 622 210 -0.38
4 123 3 621 210 -0.38
5 123 3 620 210 -0.38
6 123 5 635 210 1.20
7 123 5 636 210 1.20
8 123 5 637 210 1.20
9 123 5 638 210 1.20
10 123 5 639 210 1.20
11 123 5 640 210 1.20
We can use pd.concat, np.cumsum and groupby + .add.
If you want to subtract, simply multiply the addition by -1, for example: (np.cumsum(np.ones(n))-1) * -1
import numpy as np
import pandas as pd

n = 6
new = pd.concat([df]*n).sort_values(['code', 'length']).reset_index(drop=True)
addition = np.cumsum(np.ones(n))-1
# transform keeps the original index, so the result can be assigned straight back
new['length'] = new.groupby(['code', 'tank'])['length'].transform(lambda x: x.add(addition))
Output
code tank length dia
0 123 3 625.0 210
1 123 3 626.0 210
2 123 3 627.0 210
3 123 3 628.0 210
4 123 3 629.0 210
5 123 3 630.0 210
6 123 5 635.0 210
7 123 5 636.0 210
8 123 5 637.0 210
9 123 5 638.0 210
10 123 5 639.0 210
11 123 5 640.0 210

Groupby filter based on count, calculate duration, penultimate status

I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
28 5 M 2018-10-10 200
29 5 F 2019-06-10 500
30 6 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
where
F = Failure
M = Maintenance
P = Planned
Step 1 - Select the data of the IDs that have at least two statuses (F, M or P) before the last Failure.
Step 2 - Drop the rows after the last F, so that the last row per ID is F. The expected output after these steps is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
Now the last status for each ID is Failure.
From the above df, I would like to prepare the data frame below:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
7 2 1 1 M 518 151
SLS = Second Last Status
LS = Last Status
I tried the following code to calculate the duration.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
We can create a mask with groupby + bfill that allows us to perform both selections.
m = df.Status.eq('F').replace(False, np.nan).groupby(df.ID).bfill()
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straightforward way:
s = df.Date.diff().dt.days
res = pd.concat([df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
                 df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
                 (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
                   .to_frame('NoDays_to_SLS')),
                 s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
                axis=1)
Output:
No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487.0 151.0
2 3 3 1 M 487.0 61.0
3 3 2 2 P 640.0 90.0
4 3 2 1 M 518.0 151.0
7 2 2 1 M 518.0 151.0
Here's my attempt (note: I am using pandas 0.25):
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
df_1 = df.groupby('ID',group_keys=False)\
.apply(lambda x: x[(x['Status']=='F')[::-1].cumsum().astype(bool)])
df_2 = df_1[df_1.groupby('ID')['Status'].transform('count') > 2]
g = df_2.groupby('ID')
df_Counts = g['Status'].value_counts().unstack().add_prefix('No_of_')
df_SLS = g['Status'].agg(lambda x: x.iloc[-2]).rename('SLS')
df_dates = g['Date'].agg(NoDays_to_SLS = lambda x: x.iloc[-2]-x.iloc[0],
NoDays_to_SLS_LS = lambda x: x.iloc[-1]-x.iloc[-2])
pd.concat([df_Counts, df_SLS, df_dates], axis=1).reset_index()
Output:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_to_SLS_LS
0 1 3 2 2 P 487 days 151 days
1 2 3 3 1 M 487 days 61 days
2 3 3 2 2 P 640 days 90 days
3 4 3 2 1 M 518 days 151 days
4 7 2 2 1 M 518 days 151 days
This code relies on named aggregation (keyword arguments to agg), which was added in pandas 0.25.

Update rank field based on most popular product

Trying to run an update on the following result set:
Row# ProductRankID ProductID ProductCategoryID ProductTypeID Rank Score
1 3 11266 9 80 0 765
2 14 25880 9 80 0 656
3 12 25864 9 80 0 547
4 7 11252 9 80 0 457
5 8 25719 9 80 0 456
6 4 13425 9 80 0 456
7 11 25677 9 80 0 456
8 9 25716 9 80 0 432
9 15 25714 9 80 0 324
10 13 13589 9 80 0 234
11 20 25803 9 80 0 234
12 17 25715 9 80 0 213
13 5 21269 9 80 0 154
14 10 25867 9 80 0 123
15 16 25676 9 80 0 123
16 22 17861 9 80 0 67
17 19 13534 9 80 0 55
18 23 13659 9 80 0 54
19 29 13658 9 80 0 34
20 21 13591 9 80 0 32
21 6 11249 9 80 0 23
22 18 11253 9 80 0 12
23 28 11253 9 87 0 65
24 27 13664 9 87 0 45
25 25 13658 9 87 0 14
26 26 13657 9 87 0 13
27 24 13659 9 87 0 13
28 30 11252 9 87 0 12
29 2 12345 11 80 0 324
I want the "Rank" column to be numbered 1, 2, 3, 4, etc. in row order, restarting at 1 whenever the ProductCategoryID + ProductTypeID combination changes.
So the results should look something like:
Row# ProductRankID ProductID ProductCategoryID ProductTypeID Rank Score
1 3 11266 9 80 1 765
2 14 25880 9 80 2 656
3 12 25864 9 80 3 547
4 7 11252 9 80 4 457
5 8 25719 9 80 5 456
6 4 13425 9 80 6 456
7 11 25677 9 80 7 456
8 9 25716 9 80 8 432
9 15 25714 9 80 9 324
10 13 13589 9 80 10 234
11 20 25803 9 80 11 234
12 17 25715 9 80 12 213
13 5 21269 9 80 13 154
23 28 11253 9 87 1 65
24 27 13664 9 87 2 45
25 25 13658 9 87 3 14
26 26 13657 9 87 4 13
27 24 13659 9 87 5 13
28 30 11252 9 87 6 12
29 2 12345 11 80 1 324
Hope that makes some sense?
Thanks,
Richie
If you want a select:
select t.*,
       row_number() over (partition by ProductCategoryID, ProductTypeID
                          order by score desc, productid
                         ) as new_rank
from t;
If you want an update, use a CTE:
with toupdate as (
      select t.*,
             row_number() over (partition by ProductCategoryID, ProductTypeID
                                order by score desc, productid
                               ) as new_rank
      from t
     )
update toupdate
    set rank = new_rank;
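Updating through a CTE like this is SQL Server syntax, as far as I know. To try the same ranking logic elsewhere, here is a sketch for SQLite (3.25+) run from Python, which swaps the updatable CTE for a correlated subquery keyed on ProductRankID. Treating ProductRankID as a unique key is an assumption, and only a handful of the rows from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE t (ProductRankID INT PRIMARY KEY, ProductID INT,
                ProductCategoryID INT, ProductTypeID INT,
                "Rank" INT, Score INT)
""")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)",
    [
        (3, 11266, 9, 80, 0, 765), (14, 25880, 9, 80, 0, 656),
        (8, 25719, 9, 80, 0, 456), (4, 13425, 9, 80, 0, 456),
        (28, 11253, 9, 87, 0, 65), (27, 13664, 9, 87, 0, 45),
        (2, 12345, 11, 80, 0, 324),
    ],
)

# Number the rows inside each (category, type) group by descending score,
# breaking ties on ProductID, then copy that number into "Rank".
conn.execute("""
UPDATE t
SET "Rank" = (
    SELECT r.new_rank
    FROM (SELECT src.ProductRankID,
                 ROW_NUMBER() OVER (PARTITION BY src.ProductCategoryID,
                                                 src.ProductTypeID
                                    ORDER BY src.Score DESC, src.ProductID
                                   ) AS new_rank
          FROM t AS src) r
    WHERE r.ProductRankID = t.ProductRankID
)
""")

ranked = conn.execute("""
SELECT ProductRankID, "Rank"
FROM t
ORDER BY ProductCategoryID, ProductTypeID, "Rank"
""").fetchall()
print(ranked)
```

Ties on Score are broken by ProductID here, as in the answer's ORDER BY, so tied products may come out in a different order than in the question's hand-written result.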