Execute an SQL query depending on the parameters of a pandas dataframe

I have a pandas data frame called final_data that looks like this:

| cust_id | start_date | end_date   |
|---------|------------|------------|
| 10001   | 2022-01-01 | 2022-01-30 |
| 10002   | 2022-02-01 | 2022-02-30 |
| 10003   | 2022-01-01 | 2022-01-30 |
| 10004   | 2022-03-01 | 2022-03-30 |
| 10005   | 2022-02-01 | 2022-02-30 |
I have another table in my SQL database called penalties that looks like this:

| cust_id | level1_pen | level_2_pen | date       |
|---------|------------|-------------|------------|
| 10001   | 1          | 4           | 2022-01-01 |
| 10001   | 1          | 1           | 2022-01-02 |
| 10001   | 0          | 1           | 2022-01-03 |
| 10002   | 1          | 1           | 2022-01-01 |
| 10002   | 5          | 0           | 2022-02-01 |
| 10002   | 4          | 0           | 2022-02-04 |
| 10003   | 1          | 6           | 2022-01-02 |
I want the final_data frame to look like this, where total_penalties aggregates the data from the penalties table in the SQL database based on the cust_id, start_date, and end_date:

| cust_id | start_date | end_date   | total_penalties |
|---------|------------|------------|-----------------|
| 10001   | 2022-01-01 | 2022-01-30 | 8               |
| 10002   | 2022-02-01 | 2022-02-30 | 9               |
| 10003   | 2022-01-01 | 2022-01-30 | 7               |
How do I apply a lambda function to each row so that it aggregates the data from the SQL query based on the cust_id, start_date, and end_date values from that row of the pandas dataframe?

Suppose:

- `df` = final_data table
- `df2` = penalties table

You can get the final_data frame that you want using this query:

```sql
SELECT
    df.cust_id,
    df.start_date,
    df.end_date,
    SUM(df2.level1_pen + df2.level_2_pen) AS total_penalties
FROM df
LEFT JOIN df2
    ON df.cust_id = df2.cust_id
    AND df2.date BETWEEN df.start_date AND df.end_date
GROUP BY
    df.cust_id,
    df.start_date,
    df.end_date;
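If you do want the per-row lambda the question asks about, a minimal sketch using an in-memory SQLite database standing in for the real penalties table (the connection, table setup, and sample rows here are assumptions; swap in your own connection):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the real penalties table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE penalties (cust_id INT, level1_pen INT, level_2_pen INT, date TEXT)"
)
conn.executemany(
    "INSERT INTO penalties VALUES (?, ?, ?, ?)",
    [
        (10001, 1, 4, "2022-01-01"), (10001, 1, 1, "2022-01-02"),
        (10001, 0, 1, "2022-01-03"), (10002, 1, 1, "2022-01-01"),
        (10002, 5, 0, "2022-02-01"), (10002, 4, 0, "2022-02-04"),
        (10003, 1, 6, "2022-01-02"),
    ],
)

final_data = pd.DataFrame({
    "cust_id": [10001, 10002, 10003],
    "start_date": ["2022-01-01", "2022-02-01", "2022-01-01"],
    "end_date": ["2022-01-30", "2022-02-30", "2022-01-30"],
})

# Dates are stored as ISO text here, so BETWEEN compares lexicographically,
# which is correct for YYYY-MM-DD strings.
QUERY = """
    SELECT COALESCE(SUM(level1_pen + level_2_pen), 0)
    FROM penalties
    WHERE cust_id = ? AND date BETWEEN ? AND ?
"""

# One parameterized query per row, applied with a lambda as asked.
final_data["total_penalties"] = final_data.apply(
    lambda row: conn.execute(
        QUERY, (int(row["cust_id"]), row["start_date"], row["end_date"])
    ).fetchone()[0],
    axis=1,
)
print(final_data["total_penalties"].tolist())
```

Note that this issues one query per dataframe row; the single set-based query above is usually the better choice when the dataframe can be uploaded or joined server-side.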

Related

Select telemetry data based on relational data in PostgreSQL/TimescaleDB

I am storing some telemetry data from some sensors in an SQL table (PostgreSQL) and I want to know how I can write a query that groups the telemetry data using relational information from two other tables.
I have one table which stores the telemetry data from the sensors. This table contains three fields: one for the timestamp, one for the sensor ID, and one for the value of the sensor at that time. The value column is an incrementing count (it only increases).
**Telemetry** table

| timestamp           | sensor_id | value |
|---------------------|-----------|-------|
| 2022-01-01 00:00:00 | 5         | 3     |
| 2022-01-01 00:00:01 | 5         | 5     |
| 2022-01-01 00:00:02 | 5         | 6     |
| ...                 | ...       | ...   |
| 2022-01-01 01:00:00 | 5         | 675   |
I have another table which stores the state of the sensor, whether it was stationary or in motion and the start/end dates of that particular state for that sensor:
**Status** table

| start_date          | end_date            | status     | sensor_id |
|---------------------|---------------------|------------|-----------|
| 2022-01-01 00:00:00 | 2022-01-01 00:20:00 | in_motion  | 5         |
| 2022-01-01 00:20:00 | 2022-01-01 00:40:00 | stationary | 5         |
| 2022-01-01 00:40:00 | 2022-01-01 01:00:00 | in_motion  | 5         |
| ...                 | ...                 | ...        | ...       |
The sensor is located at a particular location. The Sensor table stores this metadata:
**Sensor** table

| sensor_id | location_id |
|-----------|-------------|
| 5         | 16          |
In the final table, I have the shifts that occur in each location.
**Shift** table

| shift   | location_id | occurrence_id | start_date          | end_date            |
|---------|-------------|---------------|---------------------|---------------------|
| A Shift | 16          | 123           | 2022-01-01 00:00:00 | 2022-01-01 00:30:00 |
| B Shift | 16          | 124           | 2022-01-01 00:30:00 | 2022-01-01 01:00:00 |
| ...     | ...         | ...           | ...                 | ...                 |
I want to write a query so that I can retrieve telemetry data that is grouped both by the shifts at the location of the sensor as well as the status of the sensor:
| sensor_id | start_date          | end_date            | status     | shift   | value_start | value_end |
|-----------|---------------------|---------------------|------------|---------|-------------|-----------|
| 5         | 2022-01-01 00:00:00 | 2022-01-01 00:20:00 | in_motion  | A Shift | 3           | 250       |
| 5         | 2022-01-01 00:20:00 | 2022-01-01 00:30:00 | stationary | A Shift | 25          | 325       |
| 5         | 2022-01-01 00:30:00 | 2022-01-01 00:40:00 | stationary | B Shift | 325         | 490       |
| 5         | 2022-01-01 00:40:00 | 2022-01-01 01:00:00 | in_motion  | B Shift | 490         | 675       |
As you can see, the telemetry data is grouped by the information in both the Shift table and the Status table. In particular, the sensor was in a stationary status between 2022-01-01 00:20:00 and 2022-01-01 00:40:00, but as the 2nd and 3rd rows above show, that interval is split into two rows because the shift changed at 2022-01-01 00:30:00.
Any idea about how to write a query that can do this? That would be really appreciated, thanks!
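No answer is recorded here, but the core of the problem, splitting status intervals at shift boundaries, is an interval intersection. A minimal sketch in Python over the sample data (the table contents are hardcoded assumptions, and the value_start/value_end lookup against the telemetry table is left out):

```python
# Status intervals per sensor: (start, end, status, sensor_id).
status = [
    ("2022-01-01 00:00:00", "2022-01-01 00:20:00", "in_motion", 5),
    ("2022-01-01 00:20:00", "2022-01-01 00:40:00", "stationary", 5),
    ("2022-01-01 00:40:00", "2022-01-01 01:00:00", "in_motion", 5),
]
# Shift intervals per location: (shift, location_id, start, end).
shifts = [
    ("A Shift", 16, "2022-01-01 00:00:00", "2022-01-01 00:30:00"),
    ("B Shift", 16, "2022-01-01 00:30:00", "2022-01-01 01:00:00"),
]
sensor_location = {5: 16}  # Sensor table

rows = []
for s_start, s_end, state, sensor_id in status:
    for shift, location_id, h_start, h_end in shifts:
        if location_id != sensor_location[sensor_id]:
            continue
        # Intersection of the two intervals; ISO strings compare correctly.
        lo, hi = max(s_start, h_start), min(s_end, h_end)
        if lo < hi:  # non-empty overlap -> one output row
            rows.append((sensor_id, lo, hi, state, shift))

for r in rows:
    print(r)
```

In SQL the same overlap test is `status.start_date < shift.end_date AND shift.start_date < status.end_date`, with GREATEST/LEAST giving the clipped boundaries; value_start/value_end could then be fetched as the first and last telemetry value inside each resulting interval.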

SQL - Calculate the average of a value in a table B from date range in table A

I am constructing a table in SQL like this
TABLE A

| obj_id | start_date | end_date   |
|--------|------------|------------|
| 1      | 2021-03-01 | 2022-08-02 |
| 1      | 2020-06-01 | 2021-07-02 |
| 2      | 2021-05-03 | 2022-08-04 |
| 3      | 2021-04-21 | 2022-06-05 |
And I have another table
TABLE B

| obj_id | date       | value |
|--------|------------|-------|
| 1      | 2021-04-12 | 21.45 |
| 3      | 2022-06-15 | 19.02 |
| 1      | 2020-11-02 | 3.11  |
| 2      | 2022-05-23 | 45.20 |
| 1      | 2022-07-31 | 32.45 |
| 3      | 2021-09-01 | 22.56 |
| 2      | 2021-10-10 | 34.04 |
I want to add to TABLE A a column with the average value from TABLE B for the corresponding obj_id, taking only the rows where the TABLE B date falls within the TABLE A date range.
Expected result:

TABLE A

| obj_id | start_date | end_date   | average value |
|--------|------------|------------|---------------|
| 1      | 2021-03-01 | 2022-08-02 | 26.95         |
| 1      | 2020-06-01 | 2021-07-02 | etc.          |
| 2      | 2021-05-03 | 2022-08-04 | etc.          |
| 3      | 2021-04-21 | 2022-06-05 | etc.          |

26.95 is the average of 21.45 and 32.45; 3.11 is excluded because its date in TABLE B falls outside the date range in TABLE A.
Sample query:

```sql
SELECT
    a.obj_id,
    a.start_date,
    a.end_date,
    AVG(b.value) AS average
FROM table_a a
INNER JOIN table_b b
    ON a.obj_id = b.obj_id
    AND b.date >= a.start_date
    AND b.date <= a.end_date
GROUP BY
    a.obj_id,
    a.start_date,
    a.end_date
ORDER BY
    a.obj_id
```
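The sample query can be checked end-to-end against the sample data. A sketch using an in-memory SQLite database (the table setup is an assumption; the query is the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (obj_id INT, start_date TEXT, end_date TEXT)")
conn.execute("CREATE TABLE table_b (obj_id INT, date TEXT, value REAL)")
conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", [
    (1, "2021-03-01", "2022-08-02"), (1, "2020-06-01", "2021-07-02"),
    (2, "2021-05-03", "2022-08-04"), (3, "2021-04-21", "2022-06-05"),
])
conn.executemany("INSERT INTO table_b VALUES (?, ?, ?)", [
    (1, "2021-04-12", 21.45), (3, "2022-06-15", 19.02), (1, "2020-11-02", 3.11),
    (2, "2022-05-23", 45.20), (1, "2022-07-31", 32.45), (3, "2021-09-01", 22.56),
    (2, "2021-10-10", 34.04),
])

# Same shape as the sample query; ISO date strings compare correctly as text.
rows = conn.execute("""
    SELECT a.obj_id, a.start_date, a.end_date, AVG(b.value) AS average
    FROM table_a a
    INNER JOIN table_b b
      ON a.obj_id = b.obj_id
     AND b.date BETWEEN a.start_date AND a.end_date
    GROUP BY a.obj_id, a.start_date, a.end_date
    ORDER BY a.obj_id, a.start_date
""").fetchall()
for r in rows:
    print(r)
```

This confirms the 26.95 figure for obj_id 1's 2021-03-01 range, and also that the INNER JOIN drops nothing here because every range has at least one matching row.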

Pandas groupby issue after melt bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1000] * 4,
    'date': ['2022-01-01'] * 4,
    # '5min' = 5-minute intervals ('5M' would mean month-end)
    'ts': pd.date_range('2022-01-01', freq='5min', periods=4),
    'A': np.random.randint(1, 6, size=4),
    'B': np.random.rand(4)
})
```
that looks like this:
|   | id   | date       | ts                  | A | B         |
|---|------|------------|---------------------|---|-----------|
| 0 | 1000 | 2022-01-01 | 2022-01-01 00:00:00 | 4 | 0.98019   |
| 1 | 1000 | 2022-01-01 | 2022-01-01 00:05:00 | 3 | 0.82021   |
| 2 | 1000 | 2022-01-01 | 2022-01-01 00:10:00 | 4 | 0.549684  |
| 3 | 1000 | 2022-01-01 | 2022-01-01 00:15:00 | 5 | 0.0818311 |
I transposed the columns A and B with pandas melt:
```python
melted = df.melt(
    id_vars=['id', 'date', 'ts'],
    value_vars=['A', 'B'],
    var_name='label',
    value_name='value',
    ignore_index=True
)
```
that looks like this:
|   | id   | date       | ts                  | label | value     |
|---|------|------------|---------------------|-------|-----------|
| 0 | 1000 | 2022-01-01 | 2022-01-01 00:00:00 | A     | 4         |
| 1 | 1000 | 2022-01-01 | 2022-01-01 00:05:00 | A     | 3         |
| 2 | 1000 | 2022-01-01 | 2022-01-01 00:10:00 | A     | 4         |
| 3 | 1000 | 2022-01-01 | 2022-01-01 00:15:00 | A     | 5         |
| 4 | 1000 | 2022-01-01 | 2022-01-01 00:00:00 | B     | 0.98019   |
| 5 | 1000 | 2022-01-01 | 2022-01-01 00:05:00 | B     | 0.82021   |
| 6 | 1000 | 2022-01-01 | 2022-01-01 00:10:00 | B     | 0.549684  |
| 7 | 1000 | 2022-01-01 | 2022-01-01 00:15:00 | B     | 0.0818311 |
Then I groupby and select the first group:
```python
melted.groupby(['id', 'date']).first()
```
that gives me this:
```
                        ts label value
id   date
1000 2022-01-01 2022-01-01     A   4.0
```
but I would expect this output instead:
```
                                  ts  A         B
id   date
1000 2022-01-01 2022-01-01 00:00:00  4  0.980190
     2022-01-01 2022-01-01 00:05:00  3  0.820210
     2022-01-01 2022-01-01 00:10:00  4  0.549684
     2022-01-01 2022-01-01 00:15:00  5  0.081831
```
What am I not getting? Is this a bug? Also, why is the ts column converted to a date?
My bad!!! I thought first would get the first group, but instead it gets the first element of each group, as stated in the documentation for the pandas aggregation functions. Sorry folks, was doing this late at night and could not think straight :/
To select the first group, I needed to use the get_group function.
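For reference, the difference between the two: `first()` aggregates within every group, while `get_group` returns one group as an unmodified frame. A quick sketch (random columns replaced by fixed values so the output is reproducible):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1000] * 4,
    "date": ["2022-01-01"] * 4,
    "ts": pd.date_range("2022-01-01", freq="5min", periods=4),
    "A": [4, 3, 4, 5],
    "B": [0.98, 0.82, 0.55, 0.08],
})
melted = df.melt(
    id_vars=["id", "date", "ts"], value_vars=["A", "B"],
    var_name="label", value_name="value",
)

grouped = melted.groupby(["id", "date"])

# first() -> one aggregated row per group (first element of each column)
print(grouped.first())

# get_group() -> all rows of a single group, untouched
g = grouped.get_group((1000, "2022-01-01"))
print(g.shape)  # 8 rows: 4 timestamps x 2 labels
```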

How can I join two tables on an ID and a DATE RANGE in SQL

I have 2 query result tables containing records for different assessments. There are RAssessments and NAssessments, which make up a complete review.
The aim is to eventually determine which reviews were completed. I would like to join the two tables on the ID and on the date; HOWEVER, the date each assessment is completed on may not be identical and may be several days apart, and some IDs may have more RAssessments than NAssessments.
Therefore, I would like to join T1 on to T2 on ID and on T1Date (+ or - 7 days). There is no other way to match the two tables and align the records other than using the date range, as this is a poorly designed database. I hope for some help with this as I am stumped.
Here is some sample data:
Table #1:

| ID | RAssessmentDate |
|----|-----------------|
| 1  | 2020-01-03      |
| 1  | 2020-03-03      |
| 1  | 2020-05-03      |
| 2  | 2020-01-09      |
| 2  | 2020-04-09      |
| 3  | 2022-07-21      |
| 4  | 2020-06-30      |
| 4  | 2020-12-30      |
| 4  | 2021-06-30      |
| 4  | 2021-12-30      |
Table #2:

| ID | NAssessmentDate |
|----|-----------------|
| 1  | 2020-01-07      |
| 1  | 2020-03-02      |
| 1  | 2020-05-03      |
| 2  | 2020-01-09      |
| 2  | 2020-07-06      |
| 2  | 2020-04-10      |
| 3  | 2022-07-21      |
| 4  | 2021-01-03      |
| 4  | 2021-06-28      |
| 4  | 2022-01-02      |
| 4  | 2022-06-26      |
I would like my end result table to look like this:

| ID | RAssessmentDate | NAssessmentDate |
|----|-----------------|-----------------|
| 1  | 2020-01-03      | 2020-01-07      |
| 1  | 2020-03-03      | 2020-03-02      |
| 1  | 2020-05-03      | 2020-05-03      |
| 2  | 2020-01-09      | 2020-01-09      |
| 2  | 2020-04-09      | 2020-04-10      |
| 2  | NULL            | 2020-07-06      |
| 3  | 2022-07-21      | 2022-07-21      |
| 4  | 2020-06-30      | NULL            |
| 4  | 2020-12-30      | 2021-01-03      |
| 4  | 2021-06-30      | 2021-06-28      |
| 4  | 2021-12-30      | 2022-01-02      |
| 4  | NULL            | 2022-01-02      |
Try this:
```sql
SELECT
    COALESCE(a.ID, b.ID) AS ID,
    a.RAssessmentDate,
    b.NAssessmentDate
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY RAssessmentDate) AS RowId, *
    FROM table1
) a
FULL OUTER JOIN (
    SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NAssessmentDate) AS RowId, *
    FROM table2
) b ON a.ID = b.ID AND a.RowId = b.RowId
WHERE (a.RAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
   OR (b.NAssessmentDate BETWEEN '2020-01-01' AND '2022-01-02')
```
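Note that ROW_NUMBER pairing matches the Nth R row to the Nth N row per ID, which can mispair when one side has extra rows; it does not enforce the ±7-day tolerance from the question. A sketch of greedy nearest-date matching within ±7 days, written in Python over the sample data purely for illustration (not a drop-in SQL replacement):

```python
from datetime import date

# (ID, RAssessmentDate) and (ID, NAssessmentDate) from the sample tables.
r_rows = [
    (1, date(2020, 1, 3)), (1, date(2020, 3, 3)), (1, date(2020, 5, 3)),
    (2, date(2020, 1, 9)), (2, date(2020, 4, 9)), (3, date(2022, 7, 21)),
    (4, date(2020, 6, 30)), (4, date(2020, 12, 30)),
    (4, date(2021, 6, 30)), (4, date(2021, 12, 30)),
]
n_rows = [
    (1, date(2020, 1, 7)), (1, date(2020, 3, 2)), (1, date(2020, 5, 3)),
    (2, date(2020, 1, 9)), (2, date(2020, 7, 6)), (2, date(2020, 4, 10)),
    (3, date(2022, 7, 21)), (4, date(2021, 1, 3)), (4, date(2021, 6, 28)),
    (4, date(2022, 1, 2)), (4, date(2022, 6, 26)),
]

result, used = [], set()
for rid, rdate in r_rows:
    # Nearest still-unmatched N date for the same ID within +/- 7 days.
    best = None
    for j, (nid, ndate) in enumerate(n_rows):
        if j in used or nid != rid or abs((ndate - rdate).days) > 7:
            continue
        if best is None or abs((ndate - rdate).days) < abs((n_rows[best][1] - rdate).days):
            best = j
    if best is None:
        result.append((rid, rdate, None))            # R with no matching N
    else:
        used.add(best)
        result.append((rid, rdate, n_rows[best][1]))
for j, (nid, ndate) in enumerate(n_rows):
    if j not in used:
        result.append((nid, None, ndate))            # N with no matching R
```

In SQL the same idea would be a FULL OUTER JOIN with an `ABS(DATEDIFF(day, RAssessmentDate, NAssessmentDate)) <= 7` condition plus a ROW_NUMBER to keep only the closest pair per assessment.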

Get all rows from one table stream and the row before in time from an other table

Suppose I have one table (table_1) and one table stream (stream_1) that captures changes made to table_1, in my case only inserts of new rows. Once I have acted on these changes, the rows are removed from stream_1 but remain in table_1.
From that I would like to calculate delta values for var1 (var1 - lag(var1) as delta_var1) partitioned on a customer, and just leave var2 as it is. So the data in table_1 could look something like this:
| timemessage         | customerid | var1 | var2 |
|---------------------|------------|------|------|
| 2021-04-01 06:00:00 | 1          | 10   | 5    |
| 2021-04-01 07:00:00 | 2          | 100  | 7    |
| 2021-04-01 08:00:00 | 1          | 20   | 10   |
| 2021-04-01 09:00:00 | 1          | 40   | 3    |
| 2021-04-01 15:00:00 | 2          | 150  | 5    |
| 2021-04-01 23:00:00 | 1          | 50   | 6    |
| 2021-04-02 06:00:00 | 2          | 180  | 2    |
| 2021-04-02 07:00:00 | 1          | 55   | 9    |
| 2021-04-02 08:00:00 | 2          | 200  | 4    |
And the data in stream_1 that I want to act on could look like this:

| timemessage         | customerid | var1 | var2 |
|---------------------|------------|------|------|
| 2021-04-01 23:00:00 | 1          | 50   | 6    |
| 2021-04-02 06:00:00 | 2          | 180  | 2    |
| 2021-04-02 07:00:00 | 1          | 55   | 9    |
| 2021-04-02 08:00:00 | 2          | 200  | 4    |
But to be able to calculate delta_var1 for all customers I would need the previous row in time for each customer before the ones in stream_1.
For example: To be able to calculate how much var1 has increased for customerid = 1 between 2021-04-01 09:00:00 and 2021-04-01 23:00:00 I want to include the 2021-04-01 09:00:00 row for customerid = 1 in my output.
So I would like to create a select containing all rows in stream_1 plus the previous row in time for each customerid from table_1. The wanted output, given the table_1 and stream_1 above, is the following:

| timemessage         | customerid | var1 | var2 |
|---------------------|------------|------|------|
| 2021-04-01 09:00:00 | 1          | 40   | 3    |
| 2021-04-01 15:00:00 | 2          | 150  | 5    |
| 2021-04-01 23:00:00 | 1          | 50   | 6    |
| 2021-04-02 06:00:00 | 2          | 180  | 2    |
| 2021-04-02 07:00:00 | 1          | 55   | 9    |
| 2021-04-02 08:00:00 | 2          | 200  | 4    |
So given you have the "last value per day" in your wanted output, you want a QUALIFY to keep only the wanted rows, using ROW_NUMBER partitioned by customerid and timemessage. Assuming the accumulator is positive only, you can order by accumulatedvalue thus:
```sql
WITH data(timemessage, customerid, accumulatedvalue) AS (
    SELECT * FROM VALUES
     ('2021-04-01', 1, 10)
    ,('2021-04-01', 2, 100)
    ,('2021-04-02', 1, 20)
    ,('2021-04-03', 1, 40)
    ,('2021-04-03', 2, 150)
    ,('2021-04-04', 1, 50)
    ,('2021-04-04', 2, 180)
    ,('2021-04-05', 1, 55)
    ,('2021-04-05', 2, 200)
)
SELECT * FROM data
QUALIFY ROW_NUMBER() OVER (PARTITION BY customerid, timemessage ORDER BY accumulatedvalue DESC) = 1
ORDER BY 1, 2;
```
gives:

```
TIMEMESSAGE  CUSTOMERID  ACCUMULATEDVALUE
2021-04-01   1           10
2021-04-01   2           100
2021-04-02   1           20
2021-04-03   1           40
2021-04-03   2           150
2021-04-04   1           50
2021-04-04   2           180
2021-04-05   1           55
2021-04-05   2           200
```
If you can trust your data and the data in table2 starts right after the data in table1, then you can just get the last record for each customer from table1 and union with table2:

```sql
select * from table1
qualify row_number() over (partition by customerid order by timemessage desc) = 1
union all
select * from table2
```
If not:

```sql
select a.* from table1 a
join table2 b
  on a.customerid = b.customerid
  and a.timemessage < b.timemessage
qualify row_number() over (partition by a.customerid order by a.timemessage desc) = 1
union all
select * from table2
```
You can also add a condition to not look at data older than 1 day (or 1 hour, or whatever interval is safe) for better performance.
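The idea in the second query, for each customer take the latest table_1 row before that customer's stream rows and union it with the stream, can be sketched in plain Python over the sample data (table contents hardcoded as assumptions), including the delta_var1 computation the question is ultimately after:

```python
# (timemessage, customerid, var1, var2) rows from table_1.
table_1 = [
    ("2021-04-01 06:00:00", 1, 10, 5), ("2021-04-01 07:00:00", 2, 100, 7),
    ("2021-04-01 08:00:00", 1, 20, 10), ("2021-04-01 09:00:00", 1, 40, 3),
    ("2021-04-01 15:00:00", 2, 150, 5), ("2021-04-01 23:00:00", 1, 50, 6),
    ("2021-04-02 06:00:00", 2, 180, 2), ("2021-04-02 07:00:00", 1, 55, 9),
    ("2021-04-02 08:00:00", 2, 200, 4),
]
stream_1 = table_1[-4:]  # the four unprocessed inserts

# Earliest stream timestamp per customer ...
first_in_stream = {}
for t, cust, *_ in stream_1:
    if cust not in first_in_stream or t < first_in_stream[cust]:
        first_in_stream[cust] = t

# ... and the latest table_1 row strictly before it, per customer.
prev_rows = {}
for row in table_1:
    t, cust = row[0], row[1]
    if t < first_in_stream.get(cust, "") and (cust not in prev_rows or t > prev_rows[cust][0]):
        prev_rows[cust] = row

wanted = sorted(list(prev_rows.values()) + stream_1)

# With the previous row included, delta_var1 is just a running difference.
prev, deltas = {}, []
for t, cust, var1, var2 in wanted:
    if cust in prev:
        deltas.append((t, cust, var1 - prev[cust], var2))
    prev[cust] = var1
```

The `wanted` list reproduces the six-row output from the question, and `deltas` holds (timemessage, customerid, delta_var1, var2) for exactly the four stream rows.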