How to add an incremental number to a specific column in pandas

I have the following dataframe in pandas:
code  tank  length  dia  diff
123   3     625     210  -0.38
123   5     635     210   1.2
I want to add 1 to length (and only length) 5 times when diff is positive, and subtract 1 when diff is negative. My desired dataframe looks like:
code  tank  length  dia
123   3     625     210
123   3     624     210
123   3     623     210
123   3     622     210
123   3     621     210
123   3     620     210
123   5     635     210
123   5     636     210
123   5     637     210
123   5     638     210
123   5     639     210
123   5     640     210
I am doing the following in pandas:
df.add(1)
But it's adding 1 to all the columns.
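For reference, the sample dataframe used in the answers below can be rebuilt like this (a minimal sketch of the data shown above):
import pandas as pd

df = pd.DataFrame({'code':   [123, 123],
                   'tank':   [3, 5],
                   'length': [625, 635],
                   'dia':    [210, 210],
                   'diff':   [-0.38, 1.2]})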

Use Index.repeat 6 times, then add counter values with GroupBy.cumcount, and finally create a default RangeIndex with DataFrame.reset_index:
df1 = df.loc[df.index.repeat(6)].copy()           # repeat each row 6 times
df1['length'] += df1.groupby(level=0).cumcount()  # add counter 0..5 per original row
df1 = df1.reset_index(drop=True)
Or:
df1 = (df.loc[df.index.repeat(6)]
         .assign(length=lambda x: x.groupby(level=0).cumcount() + x['length'])
         .reset_index(drop=True))
print (df1)
code tank length dia
0 123 3 625 210
1 123 3 626 210
2 123 3 627 210
3 123 3 628 210
4 123 3 629 210
5 123 3 630 210
6 123 5 635 210
7 123 5 636 210
8 123 5 637 210
9 123 5 638 210
10 123 5 639 210
11 123 5 640 210
EDIT: to subtract when diff is negative, pick the direction per row with numpy.where:
import numpy as np

df1 = df.loc[df.index.repeat(6)].copy()
add = df1.groupby(level=0).cumcount()   # counter 0..5 per original row
mask = df1['diff'] < 0
df1['length'] = np.where(mask, df1['length'] - add, df1['length'] + add)
df1 = df1.reset_index(drop=True)
print (df1)
code tank length dia diff
0 123 3 625 210 -0.38
1 123 3 624 210 -0.38
2 123 3 623 210 -0.38
3 123 3 622 210 -0.38
4 123 3 621 210 -0.38
5 123 3 620 210 -0.38
6 123 5 635 210 1.20
7 123 5 636 210 1.20
8 123 5 637 210 1.20
9 123 5 638 210 1.20
10 123 5 639 210 1.20
11 123 5 640 210 1.20
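The same two-way branch can also be written with numpy.sign, since diff only determines the direction (a sketch, assuming diff is never exactly zero):
import numpy as np

df1 = df.loc[df.index.repeat(6)].copy()
add = df1.groupby(level=0).cumcount()
# sign(diff) is +1 or -1, so a single expression covers both branches
df1['length'] += (np.sign(df1['diff']) * add).astype(int)
df1 = df1.reset_index(drop=True)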

We can use pd.concat, np.cumsum and groupby + .add.
If you want to subtract, simply multiply the addition by -1, so for example: (np.cumsum(np.ones(n)) - 1) * -1
import numpy as np
import pandas as pd

n = 6
new = pd.concat([df] * n).sort_values(['code', 'length']).reset_index(drop=True)
addition = np.cumsum(np.ones(n)) - 1   # array([0., 1., 2., 3., 4., 5.])
new['length'] = new.groupby(['code', 'tank'])['length'].apply(lambda x: x.add(addition))
Output
code tank length dia
0 123 3 625.0 210
1 123 3 626.0 210
2 123 3 627.0 210
3 123 3 628.0 210
4 123 3 629.0 210
5 123 3 630.0 210
6 123 5 635.0 210
7 123 5 636.0 210
8 123 5 637.0 210
9 123 5 638.0 210
10 123 5 639.0 210
11 123 5 640.0 210
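Note that length comes back as float here because the addition array is float; if the integer dtype matters, a final cast restores it (assuming no NaNs were introduced):
new['length'] = new['length'].astype(int)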

Related

Calculate Moving Average on Previous Calculated Moving Average (Snowflake)

I have a dataset that looks something like this. I wish to calculate a modified moving average (column Mod_MA) for the sales column based on the following logic:
if there is no event, take ST; else take the average of the last 4 dates.
Date        Item  Event  ST   Mod_MA
2022-10-01  ABC          100  100
2022-10-02  ABC          110  110
2022-10-03  ABC          120  120
2022-10-04  ABC          130  130
2022-10-05  ABC   EV1    140  115
2022-10-06  ABC   EV1    150  119
2022-10-07  ABC          160  160
2022-10-08  ABC          170  170
2022-10-09  ABC          180  180
2022-10-10  ABC   EV2    190  157
2022-10-11  ABC   EV2    200  167
2022-10-12  ABC   EV2    210  168
2022-10-01  XYZ          100  100
2022-10-02  XYZ          110  110
2022-10-03  XYZ          120  120
2022-10-04  XYZ          130  130
2022-10-05  XYZ   EV3    140  115
2022-10-06  XYZ   EV3    150  119
2022-10-07  XYZ   EV3    160  121
2022-10-08  XYZ          170  170
2022-10-09  XYZ          180  180
2022-10-10  XYZ   EV4    190  147
2022-10-11  XYZ   EV4    200  155
2022-10-12  XYZ          210  210
Hopefully the table clarifies what I am going for.
I have tried LAG and AVG OVER (ORDER BY ...), but since I don't have an exact number of iterations I need to run, these don't work.
Would appreciate any help.
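The rule is recursive: each event row feeds on previously computed Mod_MA values, which is why single-pass LAG/AVG window functions fall short; in Snowflake this typically calls for a recursive CTE or a UDF. To pin down the logic itself, here is a procedural pandas sketch (assuming the table is a DataFrame df with NaN in Event where it is blank; the results agree with the table above up to rounding of ties):
import pandas as pd

def mod_ma(group):
    # No event -> take ST; event -> round the mean of the last 4 computed values.
    vals = []
    for _, row in group.iterrows():
        if pd.isna(row['Event']):
            vals.append(row['ST'])
        else:
            last4 = vals[-4:]   # assumes at least one prior row exists
            vals.append(round(sum(last4) / len(last4)))
    return pd.Series(vals, index=group.index)

df = df.sort_values(['Item', 'Date'])
df['Mod_MA'] = df.groupby('Item', group_keys=False).apply(mod_ma)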

Pandas multiply 2 series of different dimensions to give dataframe

I have a series:
Red 33
Blue 44
Green 22
And also this series:
0 100
1 100
2 100
3 200
4 200
5 200
I want to multiply these to give the following dataframe:
Red Blue Green
0 330 440 220
1 330 440 220
2 330 440 220
3 660 880 440
4 660 880 440
5 660 880 440
Can anyone see a simple/tidy way this could be done?
IIUC, assuming s is the name of the first series and s1 is the name of the second series, try:
m = s.to_frame().T
pd.DataFrame(m.values * s1.values[:, None], columns=m.columns)
Red Blue Green
0 3300 4400 2200
1 3300 4400 2200
2 3300 4400 2200
3 6600 8800 4400
4 6600 8800 4400
5 6600 8800 4400
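An equivalent construction uses np.outer, which builds the full product grid in one call (a sketch using the series shown above):
import numpy as np
import pandas as pd

s = pd.Series({'Red': 33, 'Blue': 44, 'Green': 22})
s1 = pd.Series([100, 100, 100, 200, 200, 200])

# outer product: row i, column j holds s1[i] * s[j]
out = pd.DataFrame(np.outer(s1, s), index=s1.index, columns=s.index)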

Groupby filter based on count, calculate duration, penultimate status

I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
28 5 M 2018-10-10 200
29 5 F 2019-06-10 500
30 6 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
where
F = Failure
M = Maintenance
P = Planned
Step 1 - Select the data of IDs which have at least two statuses (F, M, or P) before the last failure.
Step 2 - Ignore the rows if the last row per ID is not F. The expected output after this is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
Now, for each ID the last status is failure.
Then from the above df I would like to prepare the dataframe below:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
7 2 1 1 M 518 151
SLS = Second Last Status
LS = Last Status
I tried the following code to calculate the duration.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
We can create a mask with groupby + bfill that allows us to perform both selections.
import numpy as np

m = df.Status.eq('F').replace(False, np.nan).groupby(df.ID).bfill()
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straightforward way:
s = df.Date.diff().dt.days
res = pd.concat([df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
                 df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
                 (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
                   .to_frame('NoDays_to_SLS')),
                 s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
                axis=1)
Output:
No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487.0 151.0
2 3 3 1 M 487.0 61.0
3 3 2 2 P 640.0 90.0
4 3 2 1 M 518.0 151.0
7 2 2 1 M 518.0 151.0
Here's my attempt (note: I am using pandas 0.25):
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
df_1 = df.groupby('ID', group_keys=False)\
         .apply(lambda x: x[(x['Status'] == 'F')[::-1].cumsum().astype(bool)])
df_2 = df_1[df_1.groupby('ID')['Status'].transform('count') > 2]
g = df_2.groupby('ID')
df_Counts = g['Status'].value_counts().unstack().add_prefix('No_of_')
df_SLS = g['Status'].agg(lambda x: x.iloc[-2]).rename('SLS')
df_dates = g['Date'].agg(NoDays_to_SLS=lambda x: x.iloc[-2] - x.iloc[0],
                         NoDays_to_SLS_LS=lambda x: x.iloc[-1] - x.iloc[-2])
pd.concat([df_Counts, df_SLS, df_dates], axis=1).reset_index()
Output:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_to_SLS_LS
0 1 3 2 2 P 487 days 151 days
1 2 3 3 1 M 487 days 61 days
2 3 3 2 2 P 640 days 90 days
3 4 3 2 1 M 518 days 151 days
4 7 2 2 1 M 518 days 151 days
This code uses some enhancements introduced in pandas 0.25, such as named aggregation in agg.
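The date aggregations above come out as Timedeltas ('487 days'); if plain integer day counts are wanted, as in the target frame, one extra conversion does it (a follow-up sketch on df_dates from the code above):
df_dates = df_dates.apply(lambda col: col.dt.days)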

How to set a flag in groupby with a condition in pandas

I have the following dataframe:
code date time product tank stock out_value
123 2019-06-20 07:00 MS 1 370 350
123 2019-06-20 07:30 HS 3 340 350
123 2019-06-20 07:00 MS 2 340 350
123 2019-06-20 07:30 HS 4 340 350
123 2019-06-20 08:00 MS 1 470 350
123 2019-06-20 08:30 HS 3 450 350
123 2019-06-20 08:00 MS 2 470 350
123 2019-06-20 08:30 HS 4 490 350
123 2019-06-20 09:30 HS 4 0 350
234 2019-06-20 09:30 HS 1 200 350
I want to find out which stock values are less than out_value in the above dataframe, excluding 0 values.
e.g. at 07:30 for code 123 on date 2019-06-20 for product HS there are two tanks, 3 and 4; if the stocks of both tanks are below out_value, then the flag is set to 1.
My desired dataframe would be
code date time product tank stock out_value flag
123 2019-06-20 07:00 MS 1 370 350 0
123 2019-06-20 07:30 HS 3 340 350 1
123 2019-06-20 07:00 MS 2 340 350 0
123 2019-06-20 07:30 HS 4 340 350 1
123 2019-06-20 08:00 MS 1 470 350 0
123 2019-06-20 08:30 HS 3 450 350 0
123 2019-06-20 08:00 MS 2 470 350 0
123 2019-06-20 08:30 HS 4 490 350 0
123 2019-06-20 09:30 HS 4 0 350 0
234 2019-06-20 09:30 HS 1 200 350 1
How can I do it in pandas?
If you need to check the comparison only for non-zero values and then test whether all values per group are True, use GroupBy.transform with 'all':
df['flag'] = ((df['stock']<df['out_value']) & (df['stock'] !=0))
df['flag'] = df.groupby(['code','date','time','product'])['flag'].transform('all').astype(int)
print (df)
code date time product tank stock out_value flag
0 123 2019-06-20 07:00 MS 1 370 350 0
1 123 2019-06-20 07:30 HS 3 340 350 1
2 123 2019-06-20 07:00 MS 2 340 350 0
3 123 2019-06-20 07:30 HS 4 340 350 1
4 123 2019-06-20 08:00 MS 1 470 350 0
5 123 2019-06-20 08:30 HS 3 450 350 0
6 123 2019-06-20 08:00 MS 2 470 350 0
7 123 2019-06-20 08:30 HS 4 490 350 0
8 123 2019-06-20 09:30 HS 4 0 350 0
9 234 2019-06-20 09:30 HS 1 200 350 1
Or, if you need to test only the comparison, test it per group and then chain with a mask for non-zero values:
df['flag'] = df['stock']<df['out_value']
mask = df.groupby(['code','date','time','product'])['flag'].transform('all')
df['flag'] = (mask & (df['stock'] !=0)).astype(int)
This should do it:
df['flag'] = (df.assign(flag=(df.stock < df.out_value) & (df.stock > 0))
                .groupby(['code', 'date', 'time', 'product'], as_index=False)['flag']
                .transform(all)
                .astype(int))
df
code date time product tank stock out_value flag
0 123 2019-06-20 07:00 MS 1 370 350 0
1 123 2019-06-20 07:30 HS 3 340 350 1
2 123 2019-06-20 07:00 MS 2 340 350 0
3 123 2019-06-20 07:30 HS 4 340 350 1
4 123 2019-06-20 08:00 MS 1 470 350 0
5 123 2019-06-20 08:30 HS 3 450 350 0
6 123 2019-06-20 08:00 MS 2 470 350 0
7 123 2019-06-20 08:30 HS 4 490 350 0
8 123 2019-06-20 09:30 HS 4 0 350 0
9 234 2019-06-20 09:30 HS 1 200 350 1
You could do the following; it gives (I guess) the right result for the dataframe you provided, but I'm not sure if that's what you want.
df['flag'] = ((df['stock']<df['out_value']) & (df['stock'] !=0)).astype(int)
To me, it is quite unclear what you're asking. If you want to flag as 1 all the rows whose stock is below out_value, except those that are 0 (ignoring the per-group condition across all tanks), you can do:
df['flag'] = 0
df.loc[(df['stock'] < df['out_value']) & (df['stock'] != 0), 'flag'] = 1

How to extract info based on the latest row

I have two tables:-
TABLE A :-
ORNO DEL PONO QTY
801 123 1 80
801 123 2 60
801 123 3 70
801 151 1 95
801 151 3 75
802 130 1 50
802 130 2 40
802 130 3 30
802 181 2 55
TABLE B:-
ORNO PONO STATUS ITEM
801 1 12 APPLE
801 2 12 ORANGE
801 3 12 MANGO
802 1 22 PEAR
802 2 22 KIWI
802 3 22 MELON
I wish to extract the info based on the latest DEL (in Table A) using SQL. The final output should look like this:-
OUTPUT:-
ORNO PONO STATUS ITEM QTY
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30
Thanks.
select b.*, y.QTY
from
(
    select a.ORNO, a.PONO, MAX(a.DEL) as [max]
    from #tA a
    group by a.ORNO, a.PONO
) x
join #tA y on y.ORNO = x.ORNO and y.PONO = x.PONO and y.DEL = x.[max]
join #tB b on b.ORNO = y.ORNO and b.PONO = y.PONO
Output:
ORNO PONO STATUS ITEM QTY
----------- ----------- ----------- ---------- -----------
801 1 12 APPLE 95
801 2 12 ORANGE 60
801 3 12 MANGO 75
802 1 22 PEAR 50
802 2 22 KIWI 55
802 3 22 MELON 30