Trying to mask a dataset based on multiple conditions - pandas

I am trying to mask a dataset I have based on two conditions:

1. Mask any station that has a value repeated more than once per hour.
   a) I want the count to reset once the clock hits a new hour.
2. Mask the data whenever the previous data point is lower than the current data point, if it is within the hour and the station names are the same.
The mask I applied is this:
mask = (
    (df['station'] == df['station'].shift(1))
    & (df['precip'] >= df['precip'].shift(1))
    & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df.loc[mask, 'to_remove'] = True
However, it is not working properly and gives me a df that looks like this:
I want a dataframe that looks like this:

Basically you want to mask two things, the first being a duplicate value per station and hour. This can be found by grouping by station and the hour of valid, plus the precip column. On this groupby you can count the number of occurrences and check whether it is more than one:
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour), df.precip]  # group by these columns
).transform('count') > 1  # count the values and check if more than 1
For the second one it is not clear to me whether you also want the check to reset once the clock hits a new hour (as mentioned in the first part of the mask). If that is the case, you need to group by station and hour, and compare values using shift (as you tried):
df.groupby(
    [df.station, df['valid'].apply(lambda x: x.hour)]  # group by station and hour
).precip.apply(lambda x: x < x.shift(1))  # value lower than previous
If this is not the case, as your expected output suggests, you only need to group by station:
df.groupby(df.station).precip.apply(
    lambda x: (x < x.shift(1)) &  # value is less than previous
              (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))  # within the hour
)
Combining these two masks will let you drop the right rows:
df['sameValueWithinHour'] = (
    df.groupby([df.station, df['valid'].apply(lambda x: x.hour), df.precip]).transform('count') > 1
)
df['previousValuelowerWithinHour'] = df.groupby(df.station).precip.transform(
    lambda x: (x < x.shift(1)) & (abs(df['valid'] - df['valid'].shift(1)) < pd.Timedelta('1 hour'))
)
df['to_remove'] = df.sameValueWithinHour | df.previousValuelowerWithinHour
df.loc[~df.to_remove, ['station', 'valid', 'precip']]
station valid precip
2 btv 2022-02-23 00:55:00 0.5
4 btv 2022-02-23 01:12:00 0.3
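Side note (my addition, not part of the original answer): grouping on df['valid'].apply(lambda x: x.hour) buckets by clock hour only, so the same hour on different days would fall into one group. If valid can span multiple days, a sketch of the same duplicate check keyed on the calendar hour instead:

# assumes df has 'station', 'valid' (datetime64) and 'precip' columns as in the question
hour_bucket = df['valid'].dt.floor('h').rename('hour')  # calendar hour, keeps different days apart

same_value_within_hour = (
    df.groupby(['station', hour_bucket, 'precip'])['valid'].transform('count') > 1
)

The rest of the answer's mask can stay the same; only the hour key changes.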

Related

Pandas Cumulative sum over 1 index but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE, and MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field, smallest to largest. The idea is that the total would reset whenever it reaches a new combination of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying:
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of in the level argument. It seems to work every way except the one I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH the cumulative sum should continue, and a change in any other variable should cause it to reset.
DIVISION  QTR  MODEL  MONTHS   X  CUMSUM
A           1      1       1  10      10
A           1      1       2  20      30
A           1      2       1   5       5
I'm sorry for all the trouble; I believe the answer was way simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a MultiIndex, and this appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
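A minimal sketch of that fix on made-up data shaped like the example table (the numbers below are illustrative, not the asker's real data):

import pandas as pd

df = pd.DataFrame({
    'DIVISION': ['A', 'A', 'A'],
    'QTR':      [1, 1, 1],
    'MODEL':    [1, 1, 2],
    'MONTHS':   [1, 2, 1],
    'X':        [10, 20, 5],
})

df = df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'])['X'].sum()
df = df.reset_index()  # back to plain columns instead of a MultiIndex
df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
print(df)
# cumsum runs 10 -> 30 within (A, 1, 1) and resets to 5 when MODEL changes to 2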

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses up to a given match, taking the previous matches into account, for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it is a computation-time problem.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else that bothers me: I have at least 2 rows with the exact same date/time, because there are at least two identical dates for one match. Because of this, I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"],
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
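If what is actually needed is a per-row running count (like NW_tot[i] / NL_tot[i] in the question) rather than per-club totals, a cumulative variant of the same idea could look like the sketch below. This is my own addition, not part of the answer above; it assumes data has clubid, startdate and win_bool columns, and it counts ties on startdate in row order:

# sort so that cumulative counts follow match order within each club
data = data.sort_values(['clubid', 'startdate'])

# wins up to and including the current match, per club
data['NW_tot'] = data.groupby('clubid')['win_bool'].cumsum()
# matches so far minus wins gives losses up to and including the current match
data['NL_tot'] = data.groupby('clubid').cumcount() + 1 - data['NW_tot']

Unlike the double loop, this fills both columns on every row (the loop only updated NW_tot on wins and NL_tot on losses).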

Comparing timedelta fields

I am looking at file delivery times and can't work out how to compare two timedelta fields using an if statement inside a for loop.
time_diff is the difference between cob_date and last_update_time
average_diff is based on the average for a particular file
I want to find the delay for each row.
I have been able to produce a column delay using average_diff - time_diff
However, when average_diff - time_diff < 0, I just want to return delay = 0, as this is not a delay.
I have made a for loop but this isn't working and I don't know why. I'm sure the answer is very simple but I can't get there.
test_pv_import_v2['delay2'] = pd.to_timedelta('0')
for index, row in test_pv_import_v2.iterrows():
    if test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']:
        test_pv_import_v2['delay2'] = test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff']
Use Series.where to set a 0 Timedelta by condition:
mask = test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']
s = (test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff'])
test_pv_import_v2['delay2'] = s.where(mask, pd.to_timedelta('0'))
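A small self-contained example of the same idea on invented data (column names follow the question, the values are made up):

import pandas as pd

test_pv_import_v2 = pd.DataFrame({
    'time_diff':    pd.to_timedelta(['2 hours', '30 minutes', '1 hour']),
    'average_diff': pd.to_timedelta(['1 hour', '45 minutes', '1 hour']),
})

mask = test_pv_import_v2['time_diff'] > test_pv_import_v2['average_diff']
s = test_pv_import_v2['time_diff'] - test_pv_import_v2['average_diff']
test_pv_import_v2['delay2'] = s.where(mask, pd.to_timedelta('0'))
print(test_pv_import_v2)
# delay2 is 1 hour for the first row and 0 for the other two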

Make a plot by occurrence of a col by hour of a second col

I have this df:
I would like to make a graph, by half hour, of how many rows I have, without including the day.
Just a graph of the number of occurrences per half hour, not including the day.
3272 8711600410367 2019-03-11T20:23:45.415Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:45.415Z DISPLAY None
3273 8711600410367 2019-03-11T20:23:51.072Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:51.072Z DISPLAY None
Here is my try:
df["Created"] = pd.to_datetime(df["Created"])
df.groupby(df.Created.dt.hour).size().plot()
But it's not by half hour.
I would like to show all half hours on my graph.
One way you could do this is to code hours and half-hours separately, and then bring them together. To illustrate, I extended your data example a bit:
import pandas as pd
df = pd.DataFrame({'Created':['2019-03-11T20:23:45.415Z', '2019-03-11T20:23:51.072Z', '2019-03-11T20:33:03.072Z', '2019-03-11T21:10:10.072Z']})
df["Created"] = pd.to_datetime(df["Created"])
First create a 'Hours column':
df['Hours'] = df.Created.dt.hour
Then create a column that codes half hours: if the minutes are greater than 30, count it as a half hour.
df['HalfHours'] = [0.5 if x>30 else 0 for x in df.Created.dt.minute]
Then bring them together again:
df['Hours_and_HalfHours'] = df['Hours']+df['HalfHours']
Finally, count the number of rows by groupby, and plot:
df.groupby(df['Hours_and_HalfHours']).size().plot()
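As an alternative (my own variant, not from the answer above), the same bucketing can be done in one step by flooring each timestamp to 30 minutes and keeping only the time of day:

half_hour = df['Created'].dt.floor('30min').dt.strftime('%H:%M')
counts = df.groupby(half_hour).size()

# optionally reindex so all 48 half-hour slots appear, even empty ones
all_slots = pd.date_range('00:00', '23:30', freq='30min').strftime('%H:%M')
counts.reindex(all_slots, fill_value=0).plot()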

Find overlapping timestamps using pandas

I have a dataframe containing start times, end times and transaction_ids like so:
tid starttime endtime
0 0.0 1537204247.00 1537204309.00
1 1.0 1537204248.00 1537204309.00
2 21.0 1537207170.00 1537207196.00
I need to find overlapping transactions. So far, the most optimized code I've been able to produce is the following:
p['overlap'] = False  # This is my original dataframe

def compute_overlaps(df):
    for i, row_curr in df.iterrows():
        if p.loc[row_curr['ix']]['overlap'] != True:
            overlap_indexes = df[(row_curr['ix'] != df['ix']) & (row_curr['starttime'] < df['endtime']) & (df['starttime'] < row_curr['endtime'])].index
            p['overlap'].loc[row_curr['ix']] = True
            p['overlap'].loc[overlap_indexes] = True

<p_grouped_by_something>.apply(compute_overlaps)
Output:
tid starttime endtime overlap
0 0.0 1537204247.00 1537204309.00 True
1 1.0 1537204248.00 1537204309.00 True
2 21.0 1537207170.00 1537207196.00 False
Note that for each transaction, I merely need to determine if it overlaps with at most one other transaction. If one is found, I don't need to check all other transactions; I can stop there and mark it as overlapping.
Initially, I had a nested for loop using iterrows that was abominably slow. I was then able to vectorize the inner loop, but the outer loop remains. Is there any way to vectorize the overall computation to make it run faster?
You can use numpy broadcasting:
s1=df.starttime.values
s2=df.endtime.values
sum(np.minimum(s2[:,None],s2)-np.maximum(s1[:,None],s1)>0)>1
Out[36]: array([ True, True, False])
Explanation:
1st: overlap of two ranges. For ranges (x1, y1) and (x2, y2), if min(y1, y2) - max(x1, x2) > 0 then the two ranges overlap.
2nd: why the count needs to be greater than 1. Since I am using numpy broadcasting, the diagonal always represents a row's comparison with itself, so a row that overlaps another one ends up with a count of at least two.
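A sketch of how this could be plugged into the per-group apply from the question (the function name is mine; column names follow the dataframe above):

import numpy as np
import pandas as pd

def mark_overlaps(g):
    s1 = g['starttime'].values
    s2 = g['endtime'].values
    # pairwise overlap test; the diagonal is each row compared with itself,
    # so a row overlaps some other row when its count exceeds 1
    counts = (np.minimum(s2[:, None], s2) - np.maximum(s1[:, None], s1) > 0).sum(axis=0)
    return pd.Series(counts > 1, index=g.index)

p['overlap'] = mark_overlaps(p)  # or apply it per group, as in the question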
Update:
Assuming you have df and have split it into df1 ... dfn (look at np.split):
s1 = df.starttime.values
s2 = df.endtime.values
l = [df1, df2, df3, df4, df5, ...]
n = []
for x in l:
    n.append(sum(np.minimum(s2[:, None], x.endtime.values) - np.maximum(s1[:, None], x.starttime.values) > 0) > 1)
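For completeness, a rough end-to-end version of the chunked idea (the chunk count and np.array_split are my own choices, not from the answer):

import numpy as np

s1 = df.starttime.values
s2 = df.endtime.values

results = []
for chunk in np.array_split(df, 10):  # keep each broadcast matrix small
    e = chunk.endtime.values
    s = chunk.starttime.values
    counts = (np.minimum(s2[:, None], e) - np.maximum(s1[:, None], s) > 0).sum(axis=0)
    results.append(counts > 1)

df['overlap'] = np.concatenate(results)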