How to iterate over rows and get max values of any previous rows - pandas

I have this dataframe:
pd.DataFrame({'ids': ['a', 'b', 'c', 'd', 'e', 'f'],
              'id_order': [1, 2, 3, 4, 5, 6],
              'value': [1000, 500, 3000, 2000, 1000, 5000]})
What I want is to iterate over the rows and get the maximum value of all previous rows.
For example, when I iterate to id_order==2 I would get 1000 (from id_order 1).
When I move forward to id_order==5 I would get 3000 (from id_order 3).
The desired outcome should be as follows:
pd.DataFrame({'ids': ['a', 'b', 'c', 'd', 'e', 'f'],
              'id_order': [1, 2, 3, 4, 5, 6],
              'value': [1000, 500, 3000, 2000, 1000, 5000],
              'outcome': [0, 1000, 1000, 3000, 3000, 3000]})
This will be done on a big dataset so efficiency is also a factor.
I would greatly appreciate your help in this.
Thanks

You can shift the value column and take the cumulative maximum:
df["outcome"] = df.value.shift(fill_value=0).cummax()
Since shifting leaves the first entry empty, we fill it with 0.
>>> df
  ids  id_order  value  outcome
0   a         1   1000        0
1   b         2    500     1000
2   c         3   3000     1000
3   d         4   2000     3000
4   e         5   1000     3000
5   f         6   5000     3000
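Since efficiency is a factor, the same running maximum can also be computed directly in numpy; this is a sketch of an equivalent (not guaranteed to be faster on every workload):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids': list('abcdef'),
                   'id_order': [1, 2, 3, 4, 5, 6],
                   'value': [1000, 500, 3000, 2000, 1000, 5000]})

# prepend 0 and drop the last value (the array equivalent of shift(fill_value=0)),
# then take the running maximum over the prior values
shifted = np.concatenate(([0], df['value'].to_numpy()[:-1]))
df['outcome'] = np.maximum.accumulate(shifted)
```

This gives the same outcome column, [0, 1000, 1000, 3000, 3000, 3000].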

Related

Is it possible to set a dynamic window frame bound in SQL OVER(ROW BETWEEN ...)-Clause?

Consider the following table, describing a patient's medication plan. For example, the first row describes that the patient with patient_id = 1 is treated from timestamp 0 to 4. At time = 0, the patient has not yet received any medication (kum_amount_start = 0). At time = 4, the patient has received a cumulative amount of 100 units of a certain drug. It can be assumed that the drug is given at a constant rate. Regarding the first row, this means that the drug is given at a rate of 25 units/h.
patient_id  starttime [h]  endtime [h]  kum_amount_start  kum_amount_end
1           0              4            0                 100
1           4              5            100               300
1           5              15           300               550
1           15             18           550               700
2           0              3            0                 150
2           3              6            150               350
2           6              10           350               700
2           10             15           700               1100
2           15             19           1100              1500
I want to add the two columns "kum_amount_start_last_6hr" and "kum_amount_end_last_6hr" that describe the amount that has been given within the last 6 hours of the treatment (for the respective timestamps start, end).
I've been stuck on this problem for a while now.
I tried to tackle it with something like this:
SUM(kum_amount) OVER (PARTITION BY patient_id ROWS BETWEEN "dynamic window size" AND CURRENT ROW)
but I'm not sure whether this is the right approach.
I would be very happy if you could help me out here, thanks!
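In SQL dialects that support it, a value-based frame (RANGE rather than ROWS, ordered by time) is the closest construct to a "dynamic window size". But since the constant-rate assumption makes the cumulative curve piecewise linear, the last-6-hour amounts can also be computed by interpolation. A pandas sketch of that idea, using the column names from the question (the helper function is hypothetical, not from the post):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'patient_id':       [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'starttime':        [0, 4, 5, 15, 0, 3, 6, 10, 15],
    'endtime':          [4, 5, 15, 18, 3, 6, 10, 15, 19],
    'kum_amount_start': [0, 100, 300, 550, 0, 150, 350, 700, 1100],
    'kum_amount_end':   [100, 300, 550, 700, 150, 350, 700, 1100, 1500],
})

def add_last_6h(group):
    # piecewise-linear cumulative curve kum(t): constant rate within each row
    group = group.copy()
    times = np.r_[group['starttime'].iloc[0], group['endtime'].to_numpy()]
    kum = np.r_[group['kum_amount_start'].iloc[0], group['kum_amount_end'].to_numpy()]
    # np.interp clamps t before treatment start to kum[0], i.e. nothing given yet
    kum_at = lambda t: np.interp(t, times, kum)
    group['kum_amount_start_last_6hr'] = kum_at(group['starttime']) - kum_at(group['starttime'] - 6)
    group['kum_amount_end_last_6hr'] = kum_at(group['endtime']) - kum_at(group['endtime'] - 6)
    return group

out = df.groupby('patient_id', group_keys=False).apply(add_last_6h)
```

For example, for patient 1 at endtime = 15 this gives 550 - kum(9) = 550 - 400 = 150 units given in the preceding 6 hours.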

iterrows() of 2 columns and save results in one column

In my data frame I want to iterrows() over two columns but save the result in one column. For example, df is
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
70 1
32 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence by the number of columns, convert the values to a numpy array with DataFrame.to_numpy plus numpy.ravel, then build the sequence with numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print(df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
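The same ravel-plus-tile idea generalizes to any number of columns, if that ever matters; a small self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 30, 70], 'y': [10, 445, 32]})

# ravel flattens row by row; tile repeats 1..n_columns once per row
out = pd.DataFrame({'points': df.to_numpy().ravel(),
                    'sequence': np.tile(np.arange(1, df.shape[1] + 1), len(df))})
```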
Or do it this way (though this rebuilds the frame from iterrows row by row, so it will be slow):
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2

How to implement multiple aggregations using pandas groupby, referencing a specific column

I have data in a pandas data frame, and need to aggregate it. I need to do different aggregations across different columns similar to the below.
group  min(rank)  min(rank)  min   sum
title  t_no       t_descr    rank  stores
A      1          a          1     1000
B      1          a          1     1000
B      2          b          2     800
C      2          b          2     800
D      1          a          1     1000
D      2          b          2     800
F      4          d          4     500
E      3          c          3     700
to:
title  t_no  t_descr  rank  stores
A      1     a        1     1000
B      1     a        1     1800
C      2     b        2     800
D      1     a        1     1800
E      3     c        3     700
F      4     d        4     500
You'll notice that titles B and D have been aggregated, keeping the t_no & t_descr that correspond to the minimum rank for the respective title group, while stores are summed. t_no & t_descr are just arbitrary text. I need the top rank by title, the sum of stores, and the corresponding t_no & t_descr.
How can I do this within a single pandas groupby? This is dummy data; the real problem that I'm working on has many more aggregations, and I'd prefer not to have to do each aggregation individually, which I know how to do.
I started with the below, but realized that I really need the mins & maxs for t_no & t_descr to be based on rank col of the subgroup, not the columns themselves.
aggs = {
    'rank': 'min',
    't_no': 'min',    # need t_no for the row that is min(rank) by title.
    't_descr': 'min'  # need t_descr for the row that is min(rank) by title.
}
df2.groupby('title').agg(aggs).reset_index()
Perhaps there's a way to do this with a lambda? I'm sure there's a straightforward way to do this. And if groupby isn't the right method I'm obviously open to suggestions.
Thanks!
It's a two-step process...
aggregate for the sum of stores and the idxmin of rank...
then use the idxmin result to slice the original dataframe and join it with the aggregated stores.
agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))
df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']].join(agged.stores, on='title')
  title  t_no t_descr  rank  stores
0     A     1       a     1    1000
1     B     1       a     1    1800
3     C     2       b     2     800
4     D     1       a     1    1800
7     E     3       c     3     700
6     F     4       d     4     500
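For reference, here are the two steps run end-to-end, with the sample data reconstructed so the sketch is runnable on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'title':   ['A', 'B', 'B', 'C', 'D', 'D', 'F', 'E'],
    't_no':    [1, 1, 2, 2, 1, 2, 4, 3],
    't_descr': ['a', 'a', 'b', 'b', 'a', 'b', 'd', 'c'],
    'rank':    [1, 1, 2, 2, 1, 2, 4, 3],
    'stores':  [1000, 1000, 800, 800, 1000, 800, 500, 700],
})

# step 1: per title, the row label of the minimum rank, and the summed stores
agged = df.groupby('title').agg({'rank': 'idxmin', 'stores': 'sum'})

# step 2: slice those rows from the original frame, then attach the summed stores
out = (df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']]
         .join(agged['stores'], on='title'))
```

Title B keeps t_no = 1 and t_descr = 'a' (the row with the minimum rank) while its stores sum to 1800.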
This is a slightly different approach from @piRSquared's, but gets you to the same spot:
Code:
# Set min and sum aggregations per column and generate a new dataframe
f = {'rank': min, 'stores': sum}
grouped = df.groupby('title').agg(f).reset_index()
# Then merge with original dataframe (keeping only the merged and new columns)
pd.merge(grouped, df[['title','rank','t_no','t_descr']], on=['title','rank'])
Output:
   title  stores  rank  t_no t_descr
0      A    1000     1     1       a
1      B    1800     1     1       a
2      C     800     2     2       b
3      D    1800     1     1       a
4      E     700     3     3       c
5      F     500     4     4       d
Of course you can organize the columns as you see fit.

Possible to group by counts?

I am trying to change something like this:
Index  Record  Time
1      10      100
1      10      200
1      10      300
1      10      400
1      3       500
1      10      600
1      10      700
2      10      800
2      10      900
2      10      1000
3      5       1100
3      5       1200
3      5       1300
into this:
Index  CountSeq  Record  LastTime
1      4         10      400
1      1         3       500
1      2         10      700
2      3         10      1000
3      3         5       1300
I am trying to apply this logic per unique index -- I just included three indexes to show the outcome.
So for a given index I want to combine rows by streaks of the same Record. Notice that the first four entries for Index 1 have Record 10, but it is more succinct to say that there were 4 entries with Record 10, ending at time 400. Then I repeat the process going forward, in sequence.
In short, I am trying to perform a count-grouping over sequential chunks of the same Record, within each index. In other words, I am NOT looking for this:
select index, count(*) as countseq, record, max(time) as lasttime
from Table1
group by index,record
Which combines everything by the same record whereas I want them to be separated by sequence breaks.
Is there a way to do this in SQL?
It's hard to solve your problem without a single primary key, so I'll assume you have a primary-key column that increases with each row (primkey). This query would return the same table with a 'diff' column that has value 1 if the previous primkey row has the same index and record as the current one, and 0 otherwise:
SELECT *,
IF((SELECT index, record FROM yourTable p2 WHERE p1.primkey = p2.primkey)
= (SELECT index, record FROM yourTable p2 WHERE p1.primkey-1 = p2.primkey), 1, 0) as diff
FROM yourTable p1
If you use a temporary variable that increases each time the IF expression is false, you would get a result like this :
primkey  Index  Record  Time  diff
1        1      10      100   1
2        1      10      200   1
3        1      10      300   1
4        1      10      400   1
5        1      3       500   2
6        1      10      600   3
7        1      10      700   3
8        2      10      800   4
9        2      10      900   4
10       2      10      1000  4
11       3      5       1100  5
12       3      5       1200  5
13       3      5       1300  5
That would solve your problem: you would just add 'diff' to the GROUP BY clause.
Unfortunately I can't test it on sqlite, but you should be able to use variables like this.
It's probably a dirty workaround, but I couldn't find a better way; hope it helps.
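This is the classic "gaps and islands" problem, and since the rest of this page is pandas-flavoured, here is the same streak-grouping idea sketched in pandas (in window-function SQL you would typically compare ROW_NUMBER() values instead of using variables):

```python
import pandas as pd

df = pd.DataFrame({
    'Index':  [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Record': [10, 10, 10, 10, 3, 10, 10, 10, 10, 10, 5, 5, 5],
    'Time':   [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300],
})

# a new streak begins whenever Index or Record differs from the previous row
streak = (df[['Index', 'Record']] != df[['Index', 'Record']].shift()).any(axis=1).cumsum()

out = (df.groupby(['Index', streak.rename('streak')])
         .agg(CountSeq=('Record', 'size'), Record=('Record', 'first'), LastTime=('Time', 'max'))
         .reset_index()
         .drop(columns='streak'))
```

The cumulative sum of "change points" plays the role of the diff column above, giving one group id per streak.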

ARRAYFORMULA ON SPECIFIC CELLS

So I have this formula; it basically multiplies every value in a row, for every row, and sums up the products, and while that's awesome and all:
=sum(ArrayFormula(iferror(A1:A*B1:B*C1:C)))
I would like a way to choose which rows it multiplies and sums up. If I could put a specific letter, or tag those cells in any way and "filter" them out, so it only sums up, say, rows 1, 2, 4 and so on, to infinity: however many rows I'd like to add and whichever rows I want to include!
EXAMPLE:
1: 100 4 10
2: 120 2 12
3: 125 5 10
4: 105 3 15
Not sure I fully understand the question, but I believe you could solve the problem by introducing a fourth column, using "1" to indicate rows to add to the sum and "0" for rows to ignore. By extending your formula to include the new column D, each row is multiplied by 1 (keeping the value, since N*1=N) or 0 (ignoring the value, since N*0=0):
=sum(ArrayFormula(iferror(A1:A*B1:B*C1:C*D1:D)))
The below example data would sum row 1, 2 and 4:
1: 100 4 10 1
2: 120 2 12 1
3: 125 5 10 0
4: 105 3 15 1
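The same mask trick can be demonstrated outside Sheets with numpy, using the example data above (rows 1, 2 and 4 are kept):

```python
import numpy as np

a = np.array([100, 120, 125, 105])
b = np.array([4, 2, 5, 3])
c = np.array([10, 12, 10, 15])
include = np.array([1, 1, 0, 1])  # the D column: 1 = include the row, 0 = skip it

# multiplying by the mask zeroes out the excluded rows before summing
total = (a * b * c * include).sum()
```

This gives 4000 + 2880 + 0 + 4725 = 11605.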