Finding maximum value in a moving period in a data frame [duplicate] - pandas

I have a table df with columns "timestamp" and "Y". I want to add another column "MaxY" which contains the largest Y value at most 24 hours in the future. That is
df.MaxY.iloc[i] = df[(df.timestamp > df.timestamp.iloc[i]) &
                     (df.timestamp < df.timestamp.iloc[i] + timedelta(hours=24))].Y.max()
Obviously, computing it like that is very slow. Is there a better way?
In a similar case, computing "SumY", I can do it using a trick with cumsum(); however, similar tricks don't seem to work here.
As requested, an example table (MaxY is the output. Input is the first two columns only).
---------------------------------
| timestamp        | Y | MaxY |
---------------------------------
| 2016-03-29 12:00 | 1 | 3    |  rows 2 and 3 fall within 24 hours, so MaxY = max(2, 3)
| 2016-03-29 13:00 | 2 | 4    |  rows 3 and 4 fall in the time interval, so MaxY = max(3, 4)
| 2016-03-30 11:00 | 3 | 4    |  rows 4, 5 and 6 all fall in the interval, so MaxY = max(4, 3, 2)
| 2016-03-30 12:30 | 4 | 3    |  MaxY = max(3, 2)
| 2016-03-30 13:30 | 3 | 2    |  row 6 is the only row in the interval
| 2016-03-30 14:00 | 2 | nan? |  there are no rows in the time interval; any value will do
---------------------------------
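For anyone who wants to try the answers below, here is a minimal sketch that builds this sample frame (column names as in the question):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-03-29 12:00', '2016-03-29 13:00',
                                 '2016-03-30 11:00', '2016-03-30 12:30',
                                 '2016-03-30 13:30', '2016-03-30 14:00']),
    'Y': [1, 2, 3, 4, 3, 2],
})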

Here's a way with resample/rolling. I got a weird warning with pandas version 0.18.0 and Python 3.5 when calling sort_index directly on the resampler (I don't think it's a concern, but I'm not sure why it is generated); materializing the resampled frame with asfreq() first, as below, seems to avoid it.
This assumes the index is 'timestamp'; if not, precede the following with df = df.set_index('timestamp'):
>>> df2 = df.resample('30min').asfreq().sort_index(ascending=False)
>>> df2 = df2.rolling(48, min_periods=1).max()
>>> df.join(df2, rsuffix='2')
Y Y2
timestamp
2016-03-29 12:00:00 1 3.0
2016-03-29 13:00:00 2 4.0
2016-03-30 11:00:00 3 4.0
2016-03-30 12:30:00 4 4.0
2016-03-30 13:30:00 3 3.0
2016-03-30 14:00:00 2 2.0
On this tiny dataframe it seems to be about twice as fast, but you'd have to test it on a larger dataframe to get a reasonable idea of relative speed.
Hopefully this is somewhat self-explanatory. The descending sort is necessary because rolling only allows a backward-looking or centered window, as far as I can tell.
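If you want to keep the rolling result as a named column on the original frame, one possible follow-up to the join above is the line below (note that, unlike the question's exact definition, this window includes the current row):

df = df.join(df2, rsuffix='2').rename(columns={'Y2': 'MaxY'})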

Consider an apply() solution, which may run faster. The function returns the max of a time-conditional series for each row.
import pandas as pd
from datetime import timedelta

def daymax(row):
    ser = df.Y[(df.timestamp > row) &
               (df.timestamp <= row + timedelta(hours=24))]
    return ser.max()

df['MaxY'] = df.timestamp.apply(daymax)
print(df)
# timestamp Y MaxY
#0 2016-03-29 12:00:00 1 3.0
#1 2016-03-29 13:00:00 2 4.0
#2 2016-03-30 11:00:00 3 4.0
#3 2016-03-30 12:30:00 4 3.0
#4 2016-03-30 13:30:00 3 2.0
#5 2016-03-30 14:00:00 2 NaN

What's wrong with
df['MaxY'] = df[::-1].Y.shift(-1).rolling('24H').max()
df[::-1] reverses the df (you want it "backwards") and shift(-1) takes care of the "in the future"; this assumes timestamp is the (datetime) index.

Related

Pandas: Drop duplicates that appear within a time interval pandas

We have a dataframe containing an 'ID' and 'DAY' columns, which shows when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened 30 days apart, tops. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
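A minimal sketch to reproduce this example frame (dates as day-first strings, as above):

import pandas as pd

df = pd.DataFrame({
    'ID':  [1, 1, 2, 2, 3, 3, 3],
    'DAY': ['22.03.2020', '18.04.2020', '10.05.2020', '13.01.2020',
            '30.03.2020', '31.03.2020', '24.02.2021'],
})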
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB. The potential drawback of this approach is that if the customer makes a new complaint within the 30 days, that (dropped) complaint still restarts the 30-day threshold for the next complaint (see the small illustration after the output).
output:
ID DAY
0 1 2020-03-22
2 2 2020-05-10
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
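To illustrate the drawback mentioned above: with three complaints 0, 20 and 40 days apart, the consecutive diff drops the third one even though it is more than 30 days after the first (kept) complaint. A small sketch with made-up dates:

s = pd.to_datetime(pd.Series(['2020-01-01', '2020-01-21', '2020-02-10']))
s.diff().lt(pd.Timedelta('30d'))
# 0    False   <- kept
# 1     True   <- dropped (20 days after the first complaint)
# 2     True   <- also dropped, although it is 40 days after the kept complaint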
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-05-10
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column within each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = (df.groupby('ID')
         .apply(lambda group: group[~group['DAY'].diff().abs().le(m)])
         .reset_index(drop=True))
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert back to the original date format, you can use dt.strftime:
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021

How could I remove duplicates if duplicates mean less than 30 days?

(using SQL or pandas)
I want to delete records if the date difference between two records is less than 30 days, but the first record of each ID must be kept.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROW 2,3 are within 30 days from ROW 1
ROW 5 is within 30 days from ROW 4
ROW 9 is within 30 days from ROW 8
This task cannot be handled by vectorized methods alone. The reason is that once a row is recognized as a duplicate, it "does not count" when you check further rows. E.g. after the rows 2020-01-03 and 2020-01-31 have been deleted (as "too close" to the previous row), the 2020-02-05 row should be left, because now the distance to the previous kept row (2020-01-01) is big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        # first date in the group: keep it and remember it
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30            # duplicate if within 30 days of the last kept date
    if not rv:
        isDupl.prev = elem      # kept: this date becomes the new reference
    return rv
This function should be invoked for each DATE in the current group (with the same ID), but before that isDupl.prev must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
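If you prefer to avoid a function with external state, the same "anchor" logic can also be written as a plain loop over the sorted frame (a sketch, assuming DATE has already been parsed with pd.to_datetime):

df = df.sort_values(['ID', 'DATE'])
keep = []
prev_id, anchor = None, None
for cur_id, cur_date in zip(df['ID'], df['DATE']):
    if cur_id != prev_id or (cur_date - anchor).days > 30:
        keep.append(True)               # first row of this ID, or far enough from the last kept row
        prev_id, anchor = cur_id, cur_date
    else:
        keep.append(False)              # within 30 days of the last kept row -> duplicate
df = df[pd.Series(keep, index=df.index)]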
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
You can try this:
Convert date to datetime64
Get the first date from each group df.groupby('ID')['DATE'].transform('first')
Add a filter to keep only dates greater than 30 days
Append the first date of each group to the dataframe
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
df1 = df1.append(df.groupby('ID', as_index=False).agg('first')).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

Drop Duplicates based on Nearest Datetime condition

import pandas as pd

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print duplicatesDf
print df['START_TIME'].dt.date
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, based on the condition that the retained record should have its END_TIME nearest to its START_TIME (at minute-level precision).
In this case,
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearest to its START_TIME 2020/06/01 20:47:26.000, compared to the record with ID 2 whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit. Many thanks.
IIUC, we can use .loc and idxmin() after creating a conditional column that measures the elapsed time between the start and the end; we then apply idxmin() as a groupby operation on your CLASS_ID column.
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
      .groupby("CLASS_ID")["mins"]
      .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps.
The time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
The index of the minimum time delta within each CLASS_ID group:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
        .groupby("CLASS_ID")["mins"]
        .idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
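Note that this assumes START_TIME and END_TIME have already been parsed as datetimes; if they were read as strings (as in the question's read_csv with dtype=str), a minimal conversion sketch:

df['START_TIME'] = pd.to_datetime(df['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
df['END_TIME'] = pd.to_datetime(df['END_TIME'], format='%Y/%m/%d %H:%M:%S.%f')

For what it's worth, an equivalent alternative to idxmin() is to sort by the elapsed time and keep the first row per CLASS_ID:

(df.assign(mins=df["END_TIME"] - df["START_TIME"])
   .sort_values("mins")
   .drop_duplicates("CLASS_ID")
   .drop(columns="mins")
   .sort_index())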

How to merge two pandas dataframes/ series

I'm trying to merge multiple pandas data frames into one. I have 1 main frame with the locations of measurements. The other data frames contain multiple measurements for one location. Like below:
df 1:  Location ID | X | Y | Z
       1           | 1 | 2 | 3
       2           | 3 | 2 | 1
       n           | ...

df 2:  Location ID | Date             | Measurement
       1           | January 1 12:30  | 1
       1           | January 16 12:30 | 4
       1           | ...

df 3:  Location ID | Date             | Measurement
       2           | January 1 12:30  | 3
       2           | January 16 12:30 | 9
       2           | ...

df n:  Location ID | Date             | Measurement
       n           | January 1 12:30  | 4
       n           | January 16 12:30 | 6
       n           | January 20 11:30 | 7
       ...

I'm trying to create a data frame like this:

df_final:  Location ID | X | Y | Z | January 1 12:30 | January 16 12:30 | January 20 11:30 | etc.
           1           | 1 | 2 | 3 | 1               | 4                | NaN
           2           | 3 | 2 | 1 | 3               | 9                | NaN
           n           | 2 | 5 | 7 | 4               | 6                | 7
The dates are already datetime objects and the Location ID is the index of both dataframes.
I tried to use the append, merge and concat functions, both using two frames and converting the frame to a list by List = frame['measurements'] before adding it.
The problem is that either rows are added under the first data frame, while the measured values should be added as new columns on the existing row (for that Location ID), or the dates end up as new rows while new columns with location IDs are created.
I'm sorry my question lay-out is not so nice, but I'm new to this forum.
Found it myself.
I used DataFrame.pivot to reshape df2...dfn and then used concat to add the result to the locations df.
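A sketch of that pivot-then-concat idea (the names df1, df2, df3, dfn and the columns 'Location ID', 'Date' and 'Measurement' are taken from the question; adjust to the real ones):

import pandas as pd

# df1: one row per location (indexed by Location ID) with columns X, Y, Z
# df2 ... dfn: one frame per location (indexed by Location ID) with columns Date, Measurement
measurements = pd.concat([df2, df3, dfn])           # stack all measurement frames
wide = (measurements
        .reset_index()
        .pivot(index='Location ID', columns='Date', values='Measurement'))
df_final = pd.concat([df1, wide], axis=1)           # align on the Location ID index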

How may I copy the same rows for each value in one column?

How may I copy rows for each value in one column?
Let's say that we have a table like:
---------------------
record time id
---------------------
1 12:00 [1,2,3]
2 12:01 [4,5,6,7]
3 12:07 [8,9]
And I would like to get result like:
---------------------
record time id
---------------------
1 12:00 1
2 12:00 2
3 12:00 3
4 12:01 4
5 12:01 5
...
9 12:07 9
I need to do this in PostgreSQL or R.
One option is separate_rows, if 'id' is a string:
library(tidyverse)
df1 %>%
  separate_rows(id) %>%
  filter(id != "") %>%
  mutate(record = row_number())
# record time id
#1 1 12:00 1
#2 2 12:00 2
#3 3 12:00 3
#4 4 12:01 4
#5 5 12:01 5
#6 6 12:01 6
#7 7 12:01 7
#8 8 12:07 8
#9 9 12:07 9
If the 'id' is a list
df1 %>%
  unnest
data
df1 <- structure(list(record = 1:3, time = c("12:00", "12:01", "12:07"),
                      id = c("[1,2,3]", "[4,5,6,7]", "[8,9]")),
                 class = "data.frame", row.names = c(NA, -3L))
You could do it in PostgreSQL like so:
SELECT DISTINCT record, time, unnest(translate(id, '[]', '{}')::int[]) AS ids
FROM tbl
ORDER BY record, time, ids;
Basically, you create an array out of your text field, and then use unnest to get your desired result.
record | time | ids
--------+----------+-----
1 | 12:00:00 | 1
1 | 12:00:00 | 2
1 | 12:00:00 | 3
2 | 12:01:00 | 4
2 | 12:01:00 | 5
2 | 12:01:00 | 6
2 | 12:01:00 | 7
3 | 12:07:00 | 8
3 | 12:07:00 | 9
In Postgres, you would just use unnest():
select t.record, t.time, unnest(t.id) as id
from t;
This assumes that your column is actually stored as an array in Postgres. If it is a string, you can do something similar, but it requires more string manipulation.