replace 0 that is in between exactly two numbers in a column - pandas

I only want to replace 0 which lies between exactly two numbers with its average value.
My dataset looks like below:
time value
9:45:00 0
10:00:00 0
10:15:00 0
10:30:00 10
10:45:00 0
11:00:00 10
11:15:00 10
11:30:00 0
11:45:00 10
12:00:00 0
12:15:00 0
12:30:00 0
12:45:00 10
13:00:00 0
13:15:00 0
I want it to look like this:
time value
9:45:00 0
10:00:00 0
10:15:00 0
10:30:00 10
10:45:00 10
11:00:00 10
11:15:00 10
11:30:00 10
11:45:00 10
12:00:00 0
12:15:00 0
12:30:00 0
12:45:00 10
13:00:00 0
13:15:00 0
in this, since the 0 between 11:45 to 12:45 is not exactly between two numbers (ie multiple zeros), we are not filling in these values

How about this?
from io import StringIO as sio
data = sio("""
time value
9:45:00 0
10:00:00 0
10:15:00 0
10:30:00 10
10:45:00 0
11:00:00 10
11:15:00 10
11:30:00 0
11:45:00 10
12:00:00 0
12:15:00 0
12:30:00 0
12:45:00 10
13:00:00 0
13:15:00 0
""")
import pandas as pd
df = pd.read_csv(data, sep='\s+')
df['flag_to_fill'] = (df['value']==0) & (df['value'].shift(1)!=0) & (df['value'].shift(-1)!=0)
df.loc[df['flag_to_fill'], 'value'] = 0.5*(df['value'].shift(1) + df['value'].shift(-1))
df

Related

Calculate the rolling average every two weeks for the same day and hour in a DataFrame

I have a Dataframe like the following:
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2023-1-2', end='2023-1-29', freq='15min')
df['week'] = df['datetime'].apply(lambda x: int(x.isocalendar()[1]))
df['day_of_week'] = df['datetime'].dt.weekday
df['hour'] = df['datetime'].dt.hour
df['minutes'] = pd.DatetimeIndex(df['datetime']).minute
df['value'] = range(len(df))
df.set_index('datetime',inplace=True)
df = week day_of_week hour minutes value
datetime
2023-01-02 00:00:00 1 0 0 0 0
2023-01-02 00:15:00 1 0 0 15 1
2023-01-02 00:30:00 1 0 0 30 2
2023-01-02 00:45:00 1 0 0 45 3
2023-01-02 01:00:00 1 0 1 0 4
... ... ... ... ... ...
2023-01-08 23:00:00 1 6 23 0 668
2023-01-08 23:15:00 1 6 23 15 669
2023-01-08 23:30:00 1 6 23 30 670
2023-01-08 23:45:00 1 6 23 45 671
2023-01-09 00:00:00 2 0 0 0 672
And I want to calculate the average of the column "value" for the same hour/minute/day, every two consecutive weeks.
What I would like to get is the following:
df=
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336
2023-01-23 00:00:00 1008
15 2023-01-02 00:15:00 NaN
2023-01-09 00:15:00 NaN
2023-01-16 00:15:00 337
2023-01-23 00:15:00 1009
So the first two weeks should have NaN values and week-3 should be the average of week-1 and week-2 and then week-4 the average of week-2 and week-3 and so on.
I tried the following code but it does not seem to do what I expect:
df = pd.DataFrame(df.groupby(['day_of_week','hour','minutes'])['value'].rolling(window='14D', min_periods=1).mean())
As what I am getting is:
value
day_of_week hour minutes. datetime
0 0 0 2023-01-02 00:00:00 0
2023-01-09 00:00:00 336
2023-01-16 00:00:00 1008
2023-01-23 00:00:00 1680
15 2023-01-02 00:15:00 1
2023-01-09 00:15:00 337
2023-01-16 00:15:00 1009
2023-01-23 00:15:00 1681
I think you want to shift within each group. Then you need another groupby:
(df.groupby(['day_of_week','hour','minutes'])['value']
.rolling(window='14D', min_periods=2).mean() # `min_periods` is different
.groupby(['day_of_week','hour','minutes']).shift() # shift within each group
.to_frame()
)
Output:
value
day_of_week hour minutes datetime
0 0 0 2023-01-02 00:00:00 NaN
2023-01-09 00:00:00 NaN
2023-01-16 00:00:00 336.0
2023-01-23 00:00:00 1008.0
15 2023-01-02 00:15:00 NaN
... ...
6 23 30 2023-01-15 23:30:00 NaN
2023-01-22 23:30:00 1006.0
45 2023-01-08 23:45:00 NaN
2023-01-15 23:45:00 NaN
2023-01-22 23:45:00 1007.0

resample dataset with one irregular datetime

I have a dataframe like the following. I wanted to check the values for each 15minutes. But I see that there is a time at 09:05:51. How can I resample the dataframe for 15minutes?
hour_min value
06:30:00 0.0
06:45:00 0.0
07:00:00 0.0
07:15:00 0.0
07:30:00 102.754717
07:45:00 130.599057
08:00:00 154.117925
08:15:00 189.061321
08:30:00 214.924528
08:45:00 221.382075
09:00:00 190.839623
09:05:51 428.0
09:15:00 170.973995
09:30:00 0.0
09:45:00 0.0
10:00:00 174.448113
10:15:00 174.900943
10:30:00 182.976415
10:45:00 195.783019
11:00:00 200.337292
11:14:00 80.0
11:15:00 206.280952
11:30:00 218.87886
11:45:00 238.251781
12:00:00 115.5
12:15:00 85.5
12:30:00 130.0
12:45:00 141.0
13:00:00 267.353774
13:15:00 257.061321
13:21:00 8.0
13:27:19 80.0
13:30:00 258.761905
13:45:00 254.703088
13:53:52 278.0
14:00:00 254.790476
14:15:00 247.165094
14:30:00 250.061321
14:45:00 264.014151
15:00:00 132.0
15:15:00 108.0
15:30:00 158.5
15:45:00 457.0
16:00:00 273.745283
16:15:00 273.962264
16:30:00 279.089623
16:45:00 280.264151
17:00:00 296.061321
17:15:00 296.481132
17:30:00 282.957547
17:45:00 279.816038
I have tried this line, but i get a typeError.
res = s.resample('15T').sum()
I tried to make the index to date, but it does not work too.

Why do I get different values when I extract data from netCDF files using CDO and ArcGIS for a same grid point?

details of the raw data (Mnth.nc)
netcdf Mnth {
dimensions:
time = UNLIMITED ; // (480 currently)
bnds = 2 ;
longitude = 25 ;
latitude = 33 ;
variables:
double time(time) ;
time:standard_name = "time" ;
time:long_name = "verification time generated by wgrib2 function verftime()" ;
time:bounds = "time_bnds" ;
time:units = "seconds since 1970-01-01 00:00:00.0 0:00" ;
time:calendar = "standard" ;
time:axis = "T" ;
double time_bnds(time, bnds) ;
double longitude(longitude) ;
longitude:standard_name = "longitude" ;
longitude:long_name = "longitude" ;
longitude:units = "degrees_east" ;
longitude:axis = "X" ;
double latitude(latitude) ;
latitude:standard_name = "latitude" ;
latitude:long_name = "latitude" ;
latitude:units = "degrees_north" ;
latitude:axis = "Y" ;
float APCP_sfc(time, latitude, longitude) ;
APCP_sfc:long_name = "Total Precipitation" ;
APCP_sfc:units = "kg/m^2" ;
APCP_sfc:_FillValue = 9.999e+20f ;
APCP_sfc:missing_value = 9.999e+20f ;
APCP_sfc:cell_methods = "time: sum" ;
APCP_sfc:short_name = "APCP_surface" ;
APCP_sfc:level = "surface" ;
}
Detail information of the raw data (Mnth.nc)
File format : NetCDF4 classic
-1 : Institut Source T Steptype Levels Num Points Num Dtype : Parameter ID
1 : unknown unknown v instant 1 1 825 1 F32 : -1
Grid coordinates :
1 : lonlat : points=825 (25x33)
longitude : 87 to 89.88 by 0.12 degrees_east
latitude : 25.08 to 28.92 by 0.12 degrees_north
Vertical coordinates :
1 : surface : levels=1
Time coordinate : 480 steps
RefTime = 1970-01-01 00:00:00 Units = seconds Calendar = standard Bounds = true
YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss
1980-01-16 12:30:00 1980-02-15 12:30:00 1980-03-16 12:30:00 1980-04-16 00:30:00
1980-05-16 12:30:00 1980-06-16 00:30:00 1980-07-16 12:30:00 1980-08-16 12:30:00
1980-09-16 00:30:00 1980-10-16 12:30:00 1980-11-16 00:30:00 1980-12-16 12:30:00
1981-01-16 12:30:00 1981-02-15 00:30:00 1981-03-16 12:30:00 1981-04-16 00:30:00
1981-05-16 12:30:00 1981-06-16 00:30:00 1981-07-16 12:30:00 1981-08-16 12:30:00
1981-09-16 00:30:00 1981-10-16 12:30:00 1981-11-16 00:30:00 1981-12-16 12:30:00
1982-01-16 12:30:00 1982-02-15 00:30:00 1982-03-16 12:30:00 1982-04-16 00:30:00
1982-05-16 12:30:00 1982-06-16 00:30:00 1982-07-16 12:30:00 1982-08-16 12:30:00
1982-09-16 00:30:00 1982-10-16 12:30:00 1982-11-16 00:30:00 1982-12-16 12:30:00
1983-01-16 12:30:00 1983-02-15 00:30:00 1983-03-16 12:30:00 1983-04-16 00:30:00
1983-05-16 12:30:00 1983-06-16 00:30:00 1983-07-16 12:30:00 1983-08-16 12:30:00
1983-09-16 00:30:00 1983-10-16 12:30:00 1983-11-16 00:30:00 1983-12-16 12:30:00
1984-01-16 12:30:00 1984-02-15 12:30:00 1984-03-16 12:30:00 1984-04-16 00:30:00
1984-05-16 12:30:00 1984-06-16 00:30:00 1984-07-16 12:30:00 1984-08-16 12:30:00
1984-09-16 00:30:00 1984-10-16 12:30:00 1984-11-16 00:30:00 1984-12-16 12:30:00
................................................................................
............................
2016-01-16 12:30:00 2016-02-15 12:30:00 2016-03-16 12:30:00 2016-04-16 00:30:00
2016-05-16 12:30:00 2016-06-16 00:30:00 2016-07-16 12:30:00 2016-08-16 12:30:00
2016-09-16 00:30:00 2016-10-16 12:30:00 2016-11-16 00:30:00 2016-12-16 12:30:00
2017-01-16 12:30:00 2017-02-15 00:30:00 2017-03-16 12:30:00 2017-04-16 00:30:00
2017-05-16 12:30:00 2017-06-16 00:30:00 2017-07-16 12:30:00 2017-08-16 12:30:00
2017-09-16 00:30:00 2017-10-16 12:30:00 2017-11-16 00:30:00 2017-12-16 12:30:00
2018-01-16 12:30:00 2018-02-15 00:30:00 2018-03-16 12:30:00 2018-04-16 00:30:00
2018-05-16 12:30:00 2018-06-16 00:30:00 2018-07-16 12:30:00 2018-08-16 12:30:00
2018-09-16 00:30:00 2018-10-16 12:30:00 2018-11-16 00:30:00 2018-12-16 12:30:00
2019-01-16 12:30:00 2019-02-15 00:30:00 2019-03-16 12:30:00 2019-04-16 00:30:00
2019-05-16 12:30:00 2019-06-16 00:30:00 2019-07-16 12:30:00 2019-08-16 12:30:00
2019-09-16 00:30:00 2019-10-16 12:30:00 2019-11-16 00:30:00 2019-12-16 12:30:00
2020-01-16 12:30:00 2020-02-15 12:30:00 2020-03-16 12:30:00 2020-04-16 00:30:00
2020-05-16 12:30:00 2020-06-16 00:30:00 2020-07-16 12:30:00 2020-08-16 12:30:00
2020-09-16 00:30:00 2020-10-16 12:30:00 2020-11-16 00:30:00 2020-12-16 12:30:00
cdo sinfo: Processed 1 variable over 480 timesteps [0.50s 30MB].
I extracted monthly rainfall values from the Mnth.nc file for a location (lon: 88.44; lat: 27.12)using the following command
cdo remapnn,lon=88.44-lat=27.12 Mnth.nc Mnth1.nc
cdo outputtab,year, month, value Mnth1.nc > Mnth.csv
The output is as follows ()
Year month Value
1980 1 31.74219
1980 2 54.60938
1980 3 66.94531
1980 4 149.4062
1980 5 580.7227
1980 6 690.1328
1980 7 1146.305
1980 8 535.8164
1980 9 486.4688
1980 10 119.5391
1980 11 82.10547
1980 12 13.95703
Then I extracted the rainfall values from the same data (Mnth.nc) for the same location (lon: 88.44; lat: 27.12) using the features of the multidimensional toolbox provided in ArcGIS. The result is as follows-
year month Value
1980 1 38.8125
1980 2 58.6542969
1980 3 71.7382813
1980 4 148.6367188
1980 5 564.7070313
1980 6 653.0390625
1980 7 1026.832031
1980 8 501.3164063
1980 9 458.5429688
1980 10 113.078125
1980 11 74.0976563
1980 12 24.2265625
Why I'm getting different results in two different software for the same location and for the same variable? Any help will highly be appreciated.
Thanks in advance.
The question is perhaps misleading, in that you are not "extracting" the data in both cases. Instead you are interpolating it. The method used by CDO is nearest neighbour. arcGIS is probably simply using a different method, so you should get different results. They should give slightly different results.
The results look very similar, so both are almost certainly working as advertised.
I think I ended up in the same issues. I used CDO to extract a point and also used ArcGIS for cross checking. I found out that the values were different.
Just to be sure, I recorded the location extent of one particular cell and tried extracting values for different locations within the cell boundary extent. CDO seemed to have been giving the same results as expected because it uses nearest neighbour resampling method.
Then I tried the same with ArcGIS. Interestingly, in my case, I found out that ArcGIS also gave me same results sometimes within the same cell boundary extent and sometimes different. I checked the values by also using 'Panoply' and I realised that CDO gave accurate results, while ArcGIS was sometimes giving offset results,i.e., it was giving the values of the nearby cells. This was confirmed by cross-checking with Panoply. As #Robert Wilson mentioned that ArcGIS must be using different resampling method, I figured out in the results section after using the tool 'Netcdf to table view' that it also uses Nearest neighbour method. This is not an answer to your question, but just something I found.

Overlap in seconds between datetime range and a time range

I have a dataframe like this:
df11 = pd.DataFrame(
{
"Start_date": ["2018-01-31 12:00:00", "2018-02-28 16:00:00", "2018-02-27 22:00:00"],
"End_date": ["2019-01-31 21:45:00", "2019-03-24 22:00:00", "2018-02-28 01:00:00"],
}
)
Start_date End_date
0 2018-01-31 12:00:00 2019-01-31 21:45:00
1 2018-02-28 16:00:00 2019-03-24 22:00:00
2 2018-02-27 22:00:00 2018-02-28 01:00:00
I need to check the overlap time duration in specific periods in seconds. My expected results are like this:
Start_date End_date 12h-16h 16h-22h 22h-00h 00h-02h30
0 2018-01-31 12:00:00 2019-01-31 21:45:00 14400 20700 0 0
1 2018-02-28 16:00:00 2019-03-24 22:00:00 0 21600 0 0
2 2018-02-27 22:00:00 2018-02-28 01:00:00 0 0 7200 3600
I know it`s completely wrong and I´ve tried other solutions. This is one of my attempts:
df11['12h-16h']=np.where(df11['Start_date']<timedelta(hours=16, minutes=0, seconds=0) & df11['End_date']>timedelta(hours=12, minutes=0, seconds=0),(np.minimum(df11['End_date'],timedelta(hours=16, minutes=0, seconds=0)))-(np.maximum(df11['Start_date'],timedelta(hours=12, minutes=0, seconds=0)))

Time difference between two columns in Pandas

How can I subtract the time between two columns and convert it to minutes
Date Time Ordered Time Delivered
0 1/11/19 9:25:00 am 10:58:00 am
1 1/11/19 10:16:00 am 11:13:00 am
2 1/11/19 10:25:00 am 10:45:00 am
3 1/11/19 10:45:00 am 11:12:00 am
4 1/11/19 11:11:00 am 11:47:00 am
I want to subtract the Time_delivered - Time_ordered to get the minutes the delivery took.
df.time_ordered = pd.to_datetime(df.time_ordered)
This doesn't output the correct time instead it adds today's date the time
Convert both time columns to datetimes, get difference, convert to seconds by Series.dt.total_seconds and then to minutes by division by 60:
df['diff'] = (pd.to_datetime(df.time_ordered, format='%I:%M:%S %p')
.sub(pd.to_datetime(df.time_delivered, format='%I:%M:%S %p'))
.dt.total_seconds()
.div(60))
Try to_datetime()
df = pd.DataFrame([['9:25:00 AM','10:58:00 AM']],
columns=['time1', 'time2'])
print(pd.to_datetime(df.time2)-pd.to_datetime(df.time1))
Output:
01:33:00
another way is using np.timedelta64
print(df)
Date Time Ordered Time Delivered
0 1/11/19 9:25:00 am 10:58:00 am
1 1/11/19 10:16:00 am 11:13:00 am
2 1/11/19 10:25:00 am 10:45:00 am
3 1/11/19 10:45:00 am 11:12:00 am
4 1/11/19 11:11:00 am 11:47:00 am
df['mins'] = (
pd.to_datetime(df["Date"] + " " + df["Time Delivered"])
- pd.to_datetime(df["Date"] + " " + df["Time Ordered"])
) / np.timedelta64(1, "m")
output:
print(df)
Date Time Ordered Time Delivered mins
0 1/11/19 9:25:00 am 10:58:00 am 93.0
1 1/11/19 10:16:00 am 11:13:00 am 57.0
2 1/11/19 10:25:00 am 10:45:00 am 20.0
3 1/11/19 10:45:00 am 11:12:00 am 27.0
4 1/11/19 11:11:00 am 11:47:00 am 36.0