Select telemetry data based on relational data in PostgreSQL/TimescaleDB - sql

I am storing some telemetry data from some sensors in an SQL table (PostgreSQL) and I want to know how I can write a query that will group the telemetry data using relational information from two other tables.
I have one table which stores the telemetry data from the sensors. This table contains three fields: one for the timestamp, one for the sensor ID, and one for the value of the sensor at that time. The value column is an incrementing count (it only increases).
**Telemetry** table

| timestamp | sensor_id | value |
| --- | --- | --- |
| 2022-01-01 00:00:00 | 5 | 3 |
| 2022-01-01 00:00:01 | 5 | 5 |
| 2022-01-01 00:00:02 | 5 | 6 |
| ... | ... | ... |
| 2022-01-01 01:00:00 | 5 | 675 |
I have another table which stores the state of the sensor (whether it was stationary or in motion) and the start/end dates of that particular state for that sensor:
**Status** table

| start_date | end_date | status | sensor_id |
| --- | --- | --- | --- |
| 2022-01-01 00:00:00 | 2022-01-01 00:20:00 | in_motion | 5 |
| 2022-01-01 00:20:00 | 2022-01-01 00:40:00 | stationary | 5 |
| 2022-01-01 00:40:00 | 2022-01-01 01:00:00 | in_motion | 5 |
| ... | ... | ... | ... |
Each sensor is installed at a particular location; the Sensor table stores this metadata:
**Sensor** table

| sensor_id | location_id |
| --- | --- |
| 5 | 16 |
In the final table, I have the shifts that occur in each location.
**Shift** table

| shift | location_id | occurrence_id | start_date | end_date |
| --- | --- | --- | --- | --- |
| A Shift | 16 | 123 | 2022-01-01 00:00:00 | 2022-01-01 00:30:00 |
| B Shift | 16 | 124 | 2022-01-01 00:30:00 | 2022-01-01 01:00:00 |
| ... | ... | ... | ... | ... |
I want to write a query that retrieves the telemetry data grouped both by the shifts at the sensor's location and by the status of the sensor:
| sensor_id | start_date | end_date | status | shift | value_start | value_end |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 2022-01-01 00:00:00 | 2022-01-01 00:20:00 | in_motion | A Shift | 3 | 250 |
| 5 | 2022-01-01 00:20:00 | 2022-01-01 00:30:00 | stationary | A Shift | 25 | 325 |
| 5 | 2022-01-01 00:30:00 | 2022-01-01 00:40:00 | stationary | B Shift | 325 | 490 |
| 5 | 2022-01-01 00:40:00 | 2022-01-01 01:00:00 | in_motion | B Shift | 490 | 675 |
As you can see, the telemetry data is grouped by the information in both the Shift table and the Status table. In particular, the sensor was in a stationary status between 2022-01-01 00:20:00 and 2022-01-01 00:40:00, yet rows 2 and 3 above split that period in two because the shift changed at 2022-01-01 00:30:00.
Any idea about how to write a query that can do this? That would be really appreciated, thanks!
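One possible approach (a sketch only, assuming the tables are named telemetry, status, sensor, and shift to match the excerpts above): first intersect each status interval with every overlapping shift interval at the sensor's location, which yields exactly the sub-intervals of the desired output, then pull the first and last counter readings inside each sub-interval:

```sql
-- Sketch only; table and column names are assumed from the question.
WITH sub_intervals AS (
    -- intersect each status interval with every overlapping shift
    -- interval at the sensor's location
    SELECT
        st.sensor_id,
        GREATEST(st.start_date, sh.start_date) AS start_date,
        LEAST(st.end_date, sh.end_date)        AS end_date,
        st.status,
        sh.shift
    FROM status st
    JOIN sensor se ON se.sensor_id = st.sensor_id
    JOIN shift  sh ON sh.location_id = se.location_id
                  AND sh.start_date < st.end_date   -- intervals overlap
                  AND sh.end_date   > st.start_date
)
SELECT
    i.sensor_id,
    i.start_date,
    i.end_date,
    i.status,
    i.shift,
    -- first and last counter readings within the sub-interval
    (SELECT t.value
       FROM telemetry t
      WHERE t.sensor_id = i.sensor_id
        AND t."timestamp" >= i.start_date
        AND t."timestamp" <  i.end_date
      ORDER BY t."timestamp"
      LIMIT 1) AS value_start,
    (SELECT t.value
       FROM telemetry t
      WHERE t.sensor_id = i.sensor_id
        AND t."timestamp" >= i.start_date
        AND t."timestamp" <  i.end_date
      ORDER BY t."timestamp" DESC
      LIMIT 1) AS value_end
FROM sub_intervals i
ORDER BY i.sensor_id, i.start_date;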

Related

Execute an SQL query depending on the parameters of a pandas dataframe

I have a pandas data frame called final_data that looks like this:

| cust_id | start_date | end_date |
| --- | --- | --- |
| 10001 | 2022-01-01 | 2022-01-30 |
| 10002 | 2022-02-01 | 2022-02-30 |
| 10003 | 2022-01-01 | 2022-01-30 |
| 10004 | 2022-03-01 | 2022-03-30 |
| 10005 | 2022-02-01 | 2022-02-30 |
I have another table in my SQL database called penalties that looks like this:

| cust_id | level1_pen | level_2_pen | date |
| --- | --- | --- | --- |
| 10001 | 1 | 4 | 2022-01-01 |
| 10001 | 1 | 1 | 2022-01-02 |
| 10001 | 0 | 1 | 2022-01-03 |
| 10002 | 1 | 1 | 2022-01-01 |
| 10002 | 5 | 0 | 2022-02-01 |
| 10002 | 4 | 0 | 2022-02-04 |
| 10003 | 1 | 6 | 2022-01-02 |
I want the final_data frame to look like this, where it aggregates the data from the penalties table in the SQL database based on cust_id, start_date, and end_date:

| cust_id | start_date | end_date | total_penalties |
| --- | --- | --- | --- |
| 10001 | 2022-01-01 | 2022-01-30 | 8 |
| 10002 | 2022-02-01 | 2022-02-30 | 9 |
| 10003 | 2022-01-01 | 2022-01-30 | 7 |
How do I apply a lambda function to each row, so that it aggregates the data from the SQL query based on the cust_id, start_date, and end_date values of that row of the pandas dataframe?
Suppose df is the final_data table and df2 is the penalties table. You can get the final_data frame that you want using this query:

```sql
SELECT
    df.cust_id,
    df.start_date,
    df.end_date,
    SUM(df2.level1_pen + df2.level_2_pen) AS total_penalties
FROM df
LEFT JOIN df2
       ON df.cust_id = df2.cust_id
      AND df2.date BETWEEN df.start_date AND df.end_date
GROUP BY
    df.cust_id,
    df.start_date,
    df.end_date;
```

Rolling Sum Calculation Based on 2 Date Fields

Giving up after a few hours of failed attempts.
My data is in the following format - create_date can never be later than event_date.
I'd need to calculate, on a rolling n-day basis (let's say 3), the sum of units where both the create_date and event_date fall within the same 3-day window. The data is illustrative, but each event_date can have 500+ different create_dates associated with it, and the number isn't constant. Some event_dates may be missing.
So for 2022-02-03, I only want to sum units where both the event_date and create_date values were between 2022-02-01 and 2022-02-03.
| event_date | create_date | rowid | units |
| --- | --- | --- | --- |
| 2022-02-01 | 2022-01-20 | 1 | 100 |
| 2022-02-01 | 2022-02-01 | 2 | 100 |
| 2022-02-02 | 2022-01-21 | 3 | 100 |
| 2022-02-02 | 2022-01-23 | 4 | 100 |
| 2022-02-02 | 2022-01-31 | 5 | 100 |
| 2022-02-02 | 2022-02-02 | 6 | 100 |
| 2022-02-03 | 2022-01-30 | 7 | 100 |
| 2022-02-03 | 2022-02-01 | 8 | 100 |
| 2022-02-03 | 2022-02-03 | 9 | 100 |
| 2022-02-05 | 2022-02-01 | 10 | 100 |
| 2022-02-05 | 2022-02-03 | 11 | 100 |
The output I'd need to get to is below (in brackets I've added the rows that go into each date's calculation, but the result itself only needs the numerical sum). I tried calculating using either date on its own, but neither returned the results I needed.
| date | units |
| --- | --- |
| 2022-02-01 | 100 (Row 2) |
| 2022-02-02 | 300 (Rows 2, 5, 6) |
| 2022-02-03 | 400 (Rows 2, 6, 8, 9) |
| 2022-02-04 | 200 (Rows 6, 9) |
| 2022-02-05 | 200 (Rows 9, 11) |
In Python I solved the above with a function that looped through the dataframe, filtering it for each date, but I am struggling to do the same in SQL.
Thank you!
Consider the approach below (BigQuery syntax):
```sql
with events_dates as (
  select date from (
    select min(event_date) min_date, max(event_date) max_date
    from your_table
  ), unnest(generate_date_array(min_date, max_date)) date
)
select date, sum(units) as units,
  -- cast is needed: string_agg expects string arguments
  string_agg(cast(rowid as string)) as rows_included
from events_dates
left join your_table
  on create_date between date - 2 and date
  and event_date between date - 2 and date
group by date
```
When applied to the sample data in your question, the output matches the expected result, with rows_included listing the rows that went into each day's sum.

Transpose a table with multiple ID rows and different assessment dates

I would like to transpose my table to see trends in the data. The data is formatted as such:
UserId can occur multiple times because of different assessment periods. Let's say a user with ID 1 incurred some charges in January, February, and March; there are currently three rows containing the data from those periods.
I would like to see everything as one row per user ID, independently of the number of periods (up to 12 months).
This would enable me to see and compare changes between assessment periods and attributes.
Current format:

| UserId | AssessmentDate | Attribute1 | Attribute2 | Attribute3 |
| --- | --- | --- | --- | --- |
| 1 | 2020-01-01 00:00:00.000 -01:00 | 20.13 | 123.11 | 405.00 |
| 1 | 2021-02-01 00:00:00.000 -01:00 | 1.03 | 78.93 | 11.34 |
| 1 | 2021-03-01 00:00:00.000 -01:00 | 15.03 | 310.10 | 23.15 |
| 2 | 2021-02-01 00:00:00.000 -01:00 | 14.31 | 41.30 | 63.20 |
| 2 | 2021-03-01 00:03:45.000 -01:00 | 0.05 | 3.50 | 1.30 |
Desired format:

| UserId | LastAssessmentDate | Attribute1_M-2 | Attribute2_M-1 | ... | Attribute3_M0 |
| --- | --- | --- | --- | --- | --- |
| 1 | 2021-03-01 00:00:00.000 -01:00 | 20.13 | 123.11 | ... | 23.15 |
| 2 | 2021-03-01 00:03:45.000 -01:00 | NULL | 41.30 | ... | 1.30 |
Either SQL or Pandas - both work for me. Thanks for the help!
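For the SQL route, a minimal sketch uses ROW_NUMBER() plus conditional aggregation; it assumes a table named assessments (hypothetical) and that M0 means the latest assessment per user. Only three of the attribute/period output columns are spelled out; extend the CASE branches up to rn = 12 for a full year of periods:

```sql
-- Sketch only: table name "assessments" is assumed, and quoted
-- identifiers are used because the desired column names contain "-".
WITH ranked AS (
    SELECT
        UserId,
        AssessmentDate,
        Attribute1, Attribute2, Attribute3,
        -- rn = 1 is the latest assessment (M0), rn = 2 is M-1, ...
        ROW_NUMBER() OVER (PARTITION BY UserId
                           ORDER BY AssessmentDate DESC) AS rn
    FROM assessments
)
SELECT
    UserId,
    MAX(CASE WHEN rn = 1 THEN AssessmentDate END) AS LastAssessmentDate,
    MAX(CASE WHEN rn = 3 THEN Attribute1 END)     AS "Attribute1_M-2",
    MAX(CASE WHEN rn = 2 THEN Attribute2 END)     AS "Attribute2_M-1",
    MAX(CASE WHEN rn = 1 THEN Attribute3 END)     AS "Attribute3_M0"
FROM ranked
GROUP BY UserId;
```

Users with fewer periods than the number of CASE branches simply get NULL in the missing columns, as in the desired output for UserId 2.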

Need to join two tables on date range in hive for disc rate on transactions for prod catg at acc_no level monthly

I have a mapping table mapping_table holding a discount rate (disc_rate) that applies to tot_bill, defined per prod_catg (three different product categories) and per card_grp (Consumer or Business), with a validity date range.
I also have a transaction table txn holding monthly transactions for multiple account numbers, one per month on the first day of the month (e.g. txn_date: 2019-01-01). txn has the columns txn_date, card_no, prod_catg, card_grp, and tot_bill (in $), shown below with dummy values.
I need a Hive query that joins the two tables on the date range (txn_date between start_date and end_date) and on prod_catg and card_grp, picks up the disc_rate value, and calculates the total discount amount tot_disc_bill (tot_bill * disc_rate) at the card_no level, monthly - for example for prod_serv_name x over 2020-01-01 to 2020-12-31, and for prod_serv_name y (Platinum category) over 2018-01-01 to 2018-12-31.
txn table:

| txn_date | card_no | prod_catg | card_grp | tot_bill |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 201 | Platinum | Consumer | 900 |
| 2019-02-01 | 201 | platinum | Consumer | 500 |
| 2019-03-01 | 201 | Platinum | Consumer | 300 |
| 2020-02-01 | 201 | Platinum | Consumer | 400 |
| 2020-03-01 | 201 | Platinum | Consumer | 800 |
| 2020-03-01 | 202 | Gold | Business | 700 |
| 2020-01-01 | 203 | Gold | Business | 900 |
| 2018-10-01 | 204 | Gold | Business | 900 |
| 2018-09-01 | 205 | Platinum | Business | 100 |
| 2018-03-01 | 206 | Bronze | Business | 200 |
mapping_table:

| prod_serv_name | prod_catg | card_grp | disc_rate | start_date | end_date |
| --- | --- | --- | --- | --- | --- |
| x | Platinum | Consumer | 2.5 | 2020-01-01 | 2020-12-31 |
| x | Gold | Consumer | 2.5 | 2020-01-01 | 2020-12-31 |
| x | Bronze | Consumer | 2.5 | 2020-01-01 | 2020-12-31 |
| x | Platinum | Consumer | 2.5 | 2019-01-01 | 2019-12-31 |
| x | Gold | Consumer | 3 | 2019-01-01 | 2019-12-31 |
| x | Bronze | Consumer | 3 | 2019-01-01 | 2019-12-31 |
| x | Gold | Business | 3 | 2020-01-01 | 2020-12-31 |
| y | Gold | Business | 3 | 2018-01-01 | 2018-12-31 |
| y | Platinum | Business | 3 | 2018-01-01 | 2018-12-31 |
| y | Bronze | Business | 3 | 2018-01-01 | 2018-12-31 |
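A minimal sketch of the range join in Hive, assuming the tables are named txn and mapping_table as above. Older Hive versions only accept equi-join conditions in the ON clause, so the date-range test goes in WHERE; treating disc_rate as a percentage is an assumption to adjust:

```sql
-- Sketch only; table/column names assumed from the question.
-- lower() guards against the mixed 'Platinum'/'platinum' casing
-- in the sample data. Drop the /100 if disc_rate is a fraction.
SELECT
    t.txn_date,
    t.card_no,
    t.prod_catg,
    t.card_grp,
    t.tot_bill,
    m.prod_serv_name,
    m.disc_rate,
    t.tot_bill * m.disc_rate / 100 AS tot_disc_bill
FROM txn t
JOIN mapping_table m
  ON lower(t.prod_catg) = lower(m.prod_catg)
 AND t.card_grp = m.card_grp
WHERE t.txn_date BETWEEN m.start_date AND m.end_date;
```

Monthly totals per card can then be layered on top, e.g. SUM(t.tot_bill * m.disc_rate / 100) with GROUP BY t.card_no, trunc(t.txn_date, 'MM'), plus a WHERE filter on m.prod_serv_name and the year range you need.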

Pandas take daily mean within resampled date

I have a dataframe with trip counts every 20 minutes during a whole month, let's say:
```
                 Date  Trip count
0 2019-08-01 00:00:00           3
1 2019-08-01 00:20:00           2
2 2019-08-01 00:40:00           4
3 2019-08-02 00:00:00           6
4 2019-08-02 00:20:00           4
5 2019-08-02 00:40:00           2
```
I want to take the daily mean of the trip counts for each 20-minute slot. The desired output (for the above values) looks like:

```
        Date  mean
0   00:00:00   4.5
1   00:20:00     3
2   00:40:00     3
..
72  23:40:00    ..
```
You can aggregate by the times created by Series.dt.time, because the minutes are always 00, 20, or 40 and there are no seconds:
```python
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

df1 = df.groupby(df['Date'].dt.time).mean()
# alternative:
# df1 = df.groupby(df['Date'].dt.strftime('%H:%M:%S')).mean()
print(df1)
```

```
          Trip count
Date
00:00:00         4.5
00:20:00         3.0
00:40:00         3.0
```