Transform the dataframe containing time series - pandas

I have a dataframe like this:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 03:00:00 A01 50
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
Given a start and end date (start = 2018-01-01 00:00:00 and end = 2018-01-01 05:00:00),
I would like to transform the dataframe as below (every missing entry should be zero or NULL).
Output like:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 02:00:00 A01 00
2018-01-01 03:00:00 A01 50
2018-01-01 04:00:00 A01 00
2018-01-01 05:00:00 A01 00
2018-01-01 00:00:00 A02 00
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
2018-01-01 04:00:00 A02 00
2018-01-01 05:00:00 A02 00
I am directionless and have no approach as of now.

Let's pivot the table and reindex:
import pandas as pd

# assumes df['datestamp'] is already datetime64
new_dates = pd.date_range('2018-01-01 00:00:00',
                          '2018-01-01 05:00:00',
                          freq='H', name='datestamp')
(df.pivot_table(index='datestamp', columns='Name',
                values='Reading', fill_value=0)
   .reindex(new_dates, fill_value=0)
   .stack().sort_index(level=[1, 0])
   .reset_index(name='Reading')
)
Output:
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 0
3 2018-01-01 03:00:00 A01 50
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 00:00:00 A02 0
7 2018-01-01 01:00:00 A02 50
8 2018-01-01 02:00:00 A02 40
9 2018-01-01 03:00:00 A02 30
10 2018-01-01 04:00:00 A02 0
11 2018-01-01 05:00:00 A02 0
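An equivalent route that skips the pivot entirely, as a minimal sketch (again assuming df['datestamp'] is already parsed as datetime64), is to build the full (datestamp, Name) index up front with pd.MultiIndex.from_product and reindex against it:
import pandas as pd

new_dates = pd.date_range('2018-01-01 00:00:00', '2018-01-01 05:00:00',
                          freq='H', name='datestamp')
full_idx = pd.MultiIndex.from_product([new_dates, df['Name'].unique()],
                                      names=['datestamp', 'Name'])
out = (df.set_index(['datestamp', 'Name'])
         .reindex(full_idx, fill_value=0)   # missing (hour, sensor) pairs become 0
         .sort_index(level=['Name', 'datestamp'])
         .reset_index())
This keeps the frame in long form throughout, which matters if there are many distinct Name values.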

Use pd.date_range(), df.loc[] and a nested for loop:
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A02 50
4 2018-01-01 04:00:00 A02 40
5 2018-01-01 05:00:00 A02 30
import datetime as dt
import pandas as pd

start_date = dt.datetime.strptime('2018-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end_date = dt.datetime.strptime('2018-01-01 05:00:00', '%Y-%m-%d %H:%M:%S')
date_range = pd.date_range(start=start_date, end=end_date, freq='H')
# collect missing (datestamp, Name) pairs, then add them in one concat;
# per-row DataFrame.append was removed in pandas 2.0
missing = []
for date in date_range:
    for sensor in df.Name.unique():
        if len(df.loc[(df['datestamp'] == date) & (df['Name'] == sensor)]) == 0:
            missing.append({'datestamp': date, 'Name': sensor, 'Reading': 0})
df = pd.concat([df, pd.DataFrame(missing)], ignore_index=True)
df = df.sort_values(by=['Name', 'datestamp']).reset_index(drop=True)
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A01 0
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 00:00:00 A02 0
7 2018-01-01 01:00:00 A02 0
8 2018-01-01 02:00:00 A02 0
9 2018-01-01 03:00:00 A02 50
10 2018-01-01 04:00:00 A02 40
11 2018-01-01 05:00:00 A02 30
Pardon the column headings.

You can use the pandas fillna method (see the documentation):
df['column_name'] = df['column_name'].fillna(0)
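For instance, a minimal sketch of how this fits the thread: a reindex without fill_value leaves NaN for the missing hours, which fillna() then turns into zeros.
import pandas as pd

s = pd.Series([40, 50], index=pd.to_datetime(['2018-01-01 00:00', '2018-01-01 01:00']))
hourly = s.reindex(pd.date_range('2018-01-01 00:00', '2018-01-01 03:00', freq='H'))
hourly = hourly.fillna(0)  # the two missing hours become 0.0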

Related

Analyze a Time Series

I am inserting data into a table with a date/time column.
I want to find the speed of inserts during particular durations, as follows:
Duration # of Records
1:00 PM - 2:00 PM 1000
2:00 PM - 3:00 PM 1400
.......................
11:00 PM - 12:00 AM 1100
Though I can find the above by repeatedly executing statements like:
select count(*) from table_A where insert_date between 1:00pm and 2:00pm
Is there an Oracle-supplied package/function which can produce the above report, without having to execute separate statements?
Here are a couple of examples. To get "sparse" results, i.e., just the data that exists within the table, you simply use TRUNC:
SQL> create table data ( d date );
Table created.
SQL>
SQL> insert into data
2 select date '2022-02-10' + dbms_random.normal/10
3 from dual
4 connect by level <= 10000;
10000 rows created.
SQL>
SQL> select trunc(d,'HH24'), count(*)
2 from data
3 group by trunc(d,'HH24')
4 order by 1;
TRUNC(D,'HH24') COUNT(*)
------------------- ----------
09/02/2022 13:00:00 1
09/02/2022 15:00:00 4
09/02/2022 16:00:00 10
09/02/2022 17:00:00 40
09/02/2022 18:00:00 126
09/02/2022 19:00:00 282
09/02/2022 20:00:00 595
09/02/2022 21:00:00 948
09/02/2022 22:00:00 1389
09/02/2022 23:00:00 1577
10/02/2022 00:00:00 1609
10/02/2022 01:00:00 1362
10/02/2022 02:00:00 956
10/02/2022 03:00:00 624
10/02/2022 04:00:00 281
10/02/2022 05:00:00 134
10/02/2022 06:00:00 43
10/02/2022 07:00:00 16
10/02/2022 08:00:00 2
10/02/2022 10:00:00 1
20 rows selected.
If you need to get ALL hours, even if there was no data for a given hour, you can OUTER JOIN the raw data to a synthetic list of rows covering all hours in the desired range, e.g.:
SQL> with full_range as
2 ( select date '2022-02-09' + rownum/24 hr
3 from dual
4 connect by level <= 48
5 ),
6 raw_data as
7 ( select trunc(d,'HH24') dhr, count(*) cnt
8 from data
9 group by trunc(d,'HH24')
10 )
11 select full_range.hr, raw_data.cnt
12 from raw_data, full_range
13 where full_range.hr = raw_data.dhr(+)
14 order by 1;
HR CNT
------------------- ----------
09/02/2022 01:00:00
09/02/2022 02:00:00
09/02/2022 03:00:00
09/02/2022 04:00:00
09/02/2022 05:00:00
09/02/2022 06:00:00
09/02/2022 07:00:00
09/02/2022 08:00:00
09/02/2022 09:00:00
09/02/2022 10:00:00
09/02/2022 11:00:00
09/02/2022 12:00:00
09/02/2022 13:00:00 1
09/02/2022 14:00:00
09/02/2022 15:00:00 4
09/02/2022 16:00:00 10
09/02/2022 17:00:00 40
09/02/2022 18:00:00 126
09/02/2022 19:00:00 282
09/02/2022 20:00:00 595
09/02/2022 21:00:00 948
09/02/2022 22:00:00 1389
09/02/2022 23:00:00 1577
10/02/2022 00:00:00 1609
10/02/2022 01:00:00 1362
10/02/2022 02:00:00 956
10/02/2022 03:00:00 624
10/02/2022 04:00:00 281
10/02/2022 05:00:00 134
10/02/2022 06:00:00 43
10/02/2022 07:00:00 16
10/02/2022 08:00:00 2
10/02/2022 09:00:00
10/02/2022 10:00:00 1
10/02/2022 11:00:00
10/02/2022 12:00:00
10/02/2022 13:00:00
10/02/2022 14:00:00
10/02/2022 15:00:00
10/02/2022 16:00:00
10/02/2022 17:00:00
10/02/2022 18:00:00
10/02/2022 19:00:00
10/02/2022 20:00:00
10/02/2022 21:00:00
10/02/2022 22:00:00
10/02/2022 23:00:00
11/02/2022 00:00:00
48 rows selected.

How to correctly resample time series data?

I have time series data in a PostgreSQL database. Currently I import it into pandas to do some analysis, and my first step is often resampling the 5-minute data to 1-hour averages. To do this, I first pivot the data into wide form, then resample it to 1 hour, and after that melt it back into long form.
Now I want to do the resampling on the database so that I can import the 1-hour averages right away.
This is what the data looks like in the database. I have two different types, each with three different names.
datetime value type name
0 2018-01-01 13:35:00+01:00 0.22 HLN NO2
1 2018-01-01 13:35:00+01:00 0.31 HLN CO
2 2018-01-01 13:35:00+01:00 1.15 HLN NO
3 2018-01-01 13:40:00+01:00 1.80 AIS NO2
4 2018-01-01 13:40:00+01:00 2.60 AIS CO
5 2018-01-01 13:40:00+01:00 2.30 AIS NO
6 2018-01-01 13:45:00+01:00 2.25 HLN NO2
7 2018-01-01 13:45:00+01:00 2.14 HLN CO
8 2018-01-01 13:45:00+01:00 2.96 HLN NO
9 2018-01-01 14:35:00+01:00 0.76 HLN NO2
10 2018-01-01 14:35:00+01:00 0.80 HLN CO
11 2018-01-01 14:35:00+01:00 1.19 HLN NO
12 2018-01-01 14:40:00+01:00 1.10 AIS NO2
13 2018-01-01 14:40:00+01:00 2.87 AIS CO
14 2018-01-01 14:40:00+01:00 2.80 AIS NO
15 2018-01-01 14:45:00+01:00 3.06 HLN NO2
16 2018-01-01 13:45:00+01:00 2.86 HLN CO
17 2018-01-01 13:45:00+01:00 2.22 HLN NO
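The pandas pipeline described above looks roughly like this, as a sketch using the column names from the table (resampling assumes datetime is parsed as datetime64 and used as the index):
import pandas as pd

# pivot to wide form: one column per (type, name) pair
wide = df.pivot_table(index='datetime', columns=['type', 'name'], values='value')
# resample the 5-minute data to hourly averages
hourly = wide.resample('1H').mean()
# melt back to long form
long_form = hourly.stack(['type', 'name']).rename('value').reset_index()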
Now comes the part I have problems with. After resampling and plotting in pandas and Plotly, I get the expected result: one value for every hour.
But after running the following SQL query to resample to one hour:
SELECT date_trunc('hour', datetime) AS hour, type, AVG(value) AS measure, name
FROM data_table
GROUP BY datetime, type, name
ORDER BY datetime
I get this after plotting: the line is not smooth, and there are multiple values within each hour; I guess these are all the raw values within the hour.
My question: how can I correctly resample a time series in SQL?
Edit: Expected result in table form:
datetime value type name
2018-01-01 13:00:00 1.235 HLN NO2
2018-01-01 13:00:00 2.65 HLN CO
2018-01-01 13:00:00 2.96 HLN NO
2018-01-01 13:00:00 2.48 AIS NO2
2018-01-01 13:00:00 2.65 AIS CO
2018-01-01 13:00:00 2.26 AIS NO
2018-01-01 14:00:00 2.78 HLN NO2
2018-01-01 14:00:00 3.65 HLN CO
2018-01-01 14:00:00 1.95 HLN NO
2018-01-01 14:00:00 1.45 AIS NO2
2018-01-01 14:00:00 1.64 AIS CO
2018-01-01 14:00:00 3.23 AIS NO
Note that your query groups by the raw datetime rather than by the truncated hour, which is why you get one row per original timestamp; grouping by date_trunc('hour', datetime), type, name fixes that directly. An alternative is to create the time intervals using generate_series(), or, as below, to truncate the hours per type and name in a subquery / CTE, then in the outer query join it back to the records and aggregate the values with avg() by hour, type, and name, e.g.:
WITH j AS (
  SELECT DISTINCT date_trunc('hour', datetime) AS hour, type, name
  FROM data_table
)
SELECT j.*, avg(d.value)
FROM data_table d
JOIN j ON date_trunc('hour', d.datetime) = j.hour AND
          j.type = d.type AND
          d.name = j.name
GROUP BY j.hour, j.name, j.type
ORDER BY j.hour ASC, j.type DESC;
hour | type | name | avg
---------------------+------+------+------------------------
2018-01-01 13:00:00 | HLN | CO | 1.7700000000000000
2018-01-01 13:00:00 | HLN | NO | 2.1100000000000000
2018-01-01 13:00:00 | HLN | NO2 | 1.23500000000000000000
2018-01-01 13:00:00 | AIS | CO | 2.6000000000000000
2018-01-01 13:00:00 | AIS | NO | 2.3000000000000000
2018-01-01 13:00:00 | AIS | NO2 | 1.80000000000000000000
2018-01-01 14:00:00 | HLN | CO | 0.80000000000000000000
2018-01-01 14:00:00 | HLN | NO | 1.19000000000000000000
2018-01-01 14:00:00 | HLN | NO2 | 1.9100000000000000
2018-01-01 14:00:00 | AIS | CO | 2.8700000000000000
2018-01-01 14:00:00 | AIS | NO | 2.8000000000000000
2018-01-01 14:00:00 | AIS | NO2 | 1.10000000000000000000
Demo: db<>fiddle

Add 10 to 40 minutes randomly to a datetime column in pandas

I have a data frame as shown below
start
2010-01-06 09:00:00
2018-01-07 08:00:00
2012-01-08 11:00:00
2016-01-07 08:00:00
2010-02-06 14:00:00
2018-01-07 16:00:00
To the above df, I would like to add a column called 'finish' by randomly adding between 10 and 40 minutes to the start column (sampling with replacement).
Expected Output:
start finish
2010-01-06 09:00:00 2010-01-06 09:20:00
2018-01-07 08:00:00 2018-01-07 08:12:00
2012-01-08 11:00:00 2012-01-08 11:38:00
2016-01-07 08:00:00 2016-01-07 08:15:00
2010-02-06 14:00:00 2010-02-06 14:24:00
2018-01-07 16:00:00 2018-01-07 16:36:00
Create timedeltas with to_timedelta and numpy.random.randint for integers between 10 and 40:
import numpy as np
import pandas as pd

arr = np.random.randint(10, 40, size=len(df))
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='min')
print (df)
start finish
0 2010-01-06 09:00:00 2010-01-06 09:25:00
1 2018-01-07 08:00:00 2018-01-07 08:30:00
2 2012-01-08 11:00:00 2012-01-08 11:29:00
3 2016-01-07 08:00:00 2016-01-07 08:12:00
4 2010-02-06 14:00:00 2010-02-06 14:31:00
5 2018-01-07 16:00:00 2018-01-07 16:39:00
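Note that np.random.randint's upper bound is exclusive, so randint(10, 40) draws 10-39 minutes; pass 41 if 40 should be possible. A sketch with NumPy's newer Generator API, seeded for reproducible draws:
rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch
minutes = rng.integers(10, 41, size=len(df))  # 10..40 inclusive
df['finish'] = df['start'] + pd.to_timedelta(minutes, unit='min')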
You can also achieve it with pandas.Series.apply() in combination with pandas.to_timedelta() and random.randint(), which (unlike NumPy's randint) includes both endpoints:
from random import randint
df['finish'] = df.start.apply(lambda dt: dt + pd.to_timedelta(randint(10, 40), unit='m'))

Splitting value dataframe over multiple timeslots

I would like to spread the values of the 15-minute intervals evenly over 5-minute intervals, but cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessary to add, but note that the value 6 at 00:00:00 in the data is spread over the intervals 00:00:00-00:10:00.
Slightly different approach:
import pandas as pd

# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row so the last value also spans three 5-minute slots
df.loc[df.index.max() + pd.to_timedelta('10min')] = 0
# upsample to a 5-minute frequency, filling the new slots with 0
s = df.asfreq('5min', fill_value=0)
# replace each run (a value followed by its 0's) with the run's mean
(s.groupby(s['a'].ne(0).cumsum())
  .transform('mean')
  .reset_index()
)
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
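An alternative sketch that reaches the same numbers without the groupby trick (and that still works if a 15-minute reading happens to be 0, where the ne(0) grouping above would merge runs): extend the index by two extra 5-minute slots, forward-fill each 15-minute value, and divide by 3. It assumes Datetime is already datetime64, and the values come out as floats.
import pandas as pd

idx = pd.date_range(df['Datetime'].min(),
                    df['Datetime'].max() + pd.Timedelta('10min'),
                    freq='5min')
out = (df.set_index('Datetime')
         .reindex(idx, method='ffill')  # each 15-min value covers three 5-min slots
         .div(3)                        # spread it evenly
         .rename_axis('Datetime')
         .reset_index())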

Splitting interval overlapping more days in PostgreSQL

I have a PostgreSQL table containing a start timestamp and a duration.
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00
2018-01-04 09:00:00 | 2 days 16 hours
What I would like is to have the interval split across days, like this:
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 01:00:00
2018-01-03 00:00:00 | 03:00:00
2018-01-04 09:00:00 | 15:00:00
2018-01-05 00:00:00 | 24:00:00
2018-01-06 00:00:00 | 24:00:00
2018-01-07 00:00:00 | 01:00:00
I am playing with generate_series(), width_bucket(), and range functions, but I still can't find a plausible solution. Is there an existing or working solution?
I'm not sure about all edge cases, but this seems to work:
t=# with c as (select *,min(t) over (), max(t+i) over (), tsrange(date_trunc('day',t),t+i) tr from t)
, mid as (
select distinct t,i,g,tr
, case when g < t then t else g end tt
from c
right outer join (select generate_series(date_trunc('day',min),date_trunc('day',max),'1 day') g from c) e on g <# tr order by 3,1
)
select
tt
, i
, case when tt+'1 day' > upper(tr) and t < g then upper(tr)::time::interval when upper(tr) - lower(tr) < '1 day' then i else g+'1 day' - tt end
from mid
order by tt;
tt | i | case
---------------------+-----------------+----------
2018-01-01 15:00:00 | 06:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00 | 01:00:00
2018-01-03 00:00:00 | 04:00:00 | 03:00:00
2018-01-04 09:00:00 | 2 days 16:00:00 | 15:00:00
2018-01-05 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-06 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-07 00:00:00 | 2 days 16:00:00 | 01:00:00
(7 rows)
Also, please mind that timestamp without time zone can fail you when comparing timestamps...
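For what it's worth, if the data ends up in pandas anyway, the same day-splitting is simple on that side. A minimal sketch, not the PostgreSQL solution asked for, assuming df['timestamp'] holds Timestamps and df['interval'] holds Timedeltas (e.g. via pd.to_timedelta):
import pandas as pd

def split_by_day(start, duration):
    # yield (timestamp, interval) pieces, one per calendar day covered
    end = start + duration
    cur = start
    while cur < end:
        next_midnight = cur.normalize() + pd.Timedelta(days=1)
        piece_end = min(end, next_midnight)
        yield cur, piece_end - cur
        cur = piece_end

rows = [(ts, d) for start, dur in zip(df['timestamp'], df['interval'])
                for ts, d in split_by_day(start, dur)]
out = pd.DataFrame(rows, columns=['timestamp', 'interval'])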