SQL Server - Aggregate data by minute over multiple days

Context
I'm using Microsoft SQL Server 2016.
There is a database table "Raw_data" that contains the status of a machine together with its starting time. There are several machines, and each one writes its status to the database multiple times per minute.
To reduce the data volume I'm trying to aggregate the data into 1-minute chunks for further analysis. Due to a capacity constraint, I want to execute this aggregation logic every few minutes (e.g. as a scheduled SQL Server Agent job), delete the raw data and keep only the aggregated data.
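Something along these lines could serve as the body of that job step (a rough sketch; the Minute_data table, the 5-minute cutoff and the placeholder aggregate are assumptions, and the real per-minute duration logic is the subject of this question):

-- Sketch of a periodic aggregate-and-delete job step.
-- "Minute_data" and the 5-minute safety cutoff are assumptions.
BEGIN TRANSACTION;

DECLARE @cutoff datetime2 = DATEADD(minute, -5, SYSDATETIME());

INSERT INTO Minute_data (fk_machine, status, minute_start, row_count)
SELECT fk_machine,
       status,
       DATEADD(minute, DATEDIFF(minute, 0, created_at), 0) AS minute_start,
       COUNT(*) AS row_count -- placeholder aggregate; the per-minute durations are what this question is about
FROM Raw_data
WHERE created_at < @cutoff
GROUP BY fk_machine, status, DATEADD(minute, DATEDIFF(minute, 0, created_at), 0);

DELETE FROM Raw_data
WHERE created_at < @cutoff;

COMMIT TRANSACTION;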
To simplify the example, let's assume "Raw_data" looks something like this:
╔════╦════════════╦════════╦═════════════════════╗
║ id ║ fk_machine ║ status ║ created_at          ║
╠════╬════════════╬════════╬═════════════════════╣
║  1 ║       2222 ║      0 ║ 2020-08-19 22:15:00 ║
║  2 ║       2222 ║      3 ║ 2020-08-19 22:15:30 ║
║  3 ║       2222 ║      5 ║ 2020-08-19 23:07:00 ║
║  4 ║       2222 ║      1 ║ 2020-08-20 00:20:00 ║
║  5 ║       2222 ║      0 ║ 2020-08-20 00:45:00 ║
║  6 ║       2222 ║      5 ║ 2020-08-20 02:20:00 ║
╚════╩════════════╩════════╩═════════════════════╝
Also there are database tables "Dim_date" and "Dim_time" that look something like this:
╔══════════╦══════════════╗
║ datekey  ║ date_iso8601 ║
╠══════════╬══════════════╣
║ 20200101 ║ 2020-01-01   ║
║ 20200102 ║ 2020-01-02   ║
║ ...      ║ ...          ║
║ 20351231 ║ 2035-12-31   ║
╚══════════╩══════════════╝
╔═════════╦══════════╦═════════════════╗
║ timekey ║ time_iso ║ min_lower_bound ║
╠═════════╬══════════╬═════════════════╣
║       1 ║ 00:00:01 ║ 00:00:00        ║
║       2 ║ 00:00:02 ║ 00:00:00        ║
║     ... ║ ...      ║ ...             ║
║   80345 ║ 08:03:45 ║ 08:03:00        ║
║     ... ║ ...      ║ ...             ║
║  134504 ║ 13:45:04 ║ 13:45:00        ║
║  134505 ║ 13:45:05 ║ 13:45:00        ║
║     ... ║ ...      ║ ...             ║
║  235959 ║ 23:59:59 ║ 23:59:00        ║
╚═════════╩══════════╩═════════════════╝
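For reference, min_lower_bound is just time_iso truncated to the minute; in T-SQL that truncation can also be computed directly, which the accepted solution further down relies on. A one-line sketch:

-- Minute-floor of a datetime (the same trick the solution below uses)
DECLARE @t datetime = '2020-08-19 22:15:30';
SELECT DATEADD(minute, DATEDIFF(minute, 0, @t), 0) AS min_lower_bound; -- 2020-08-19 22:15:00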
The result should look like this:
╔══════════════╦═════════════════╦════════════╦════════╦═══════════════╗
║ date_iso8601 ║ min_lower_bound ║ fk_machine ║ status ║ total_seconds ║
╠══════════════╬═════════════════╬════════════╬════════╬═══════════════╣
║ 2020-08-19   ║ 22:15:00        ║       2222 ║      0 ║            30 ║
║ 2020-08-19   ║ 22:15:00        ║       2222 ║      3 ║            30 ║
║ 2020-08-19   ║ 22:16:00        ║       2222 ║      3 ║            60 ║
║ 2020-08-19   ║ 22:17:00        ║       2222 ║      3 ║            60 ║
║ ...          ║ ...             ║        ... ║    ... ║           ... ║
║ 2020-08-19   ║ 23:06:00        ║       2222 ║      3 ║            60 ║
║ 2020-08-19   ║ 23:07:00        ║       2222 ║      5 ║            60 ║
║ 2020-08-19   ║ 23:08:00        ║       2222 ║      5 ║            60 ║
║ ...          ║ ...             ║        ... ║    ... ║           ... ║
║ 2020-08-20   ║ 00:19:00        ║       2222 ║      5 ║            60 ║
║ 2020-08-20   ║ 00:20:00        ║       2222 ║      1 ║            60 ║
║ 2020-08-20   ║ 00:21:00        ║       2222 ║      1 ║            60 ║
║ ...          ║ ...             ║        ... ║    ... ║           ... ║
║ 2020-08-20   ║ 00:44:00        ║       2222 ║      1 ║            60 ║
║ 2020-08-20   ║ 00:45:00        ║       2222 ║      0 ║            60 ║
╚══════════════╩═════════════════╩════════════╩════════╩═══════════════╝
Attempt
To calculate the duration of each status per minute I used a CTE with LEAD to fetch the starting date and time of the next status from the table, then joined with the dimension tables and aggregated the result.
WITH CTE_MACHINE_STATES(START_DATEKEY,
                        START_TIMEKEY,
                        FK_MACHINE,
                        STATUS,
                        END_DATEKEY,
                        END_TIMEKEY)
AS (SELECT CAST(CONVERT(CHAR(8), CREATED_AT, 112) AS INT), -- ISO: yyyymmdd
           CONVERT(INT, REPLACE(CONVERT(CHAR(8), CREATED_AT, 108), ':', '')),
           FK_MACHINE,
           STATUS,
           CAST(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
                                                          ORDER BY CREATED_AT), 112) AS INT),
           CONVERT(INT, REPLACE(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
                                                                          ORDER BY CREATED_AT), 108), ':', ''))
    FROM RAW_DATA)
SELECT DATE_ISO8601,
       MIN_LOWER_BOUND,
       FK_MACHINE,
       STATUS,
       SUM(1) AS TOTAL_SECONDS -- Duration
FROM CTE_MACHINE_STATES
CROSS JOIN DIM_DATE
CROSS JOIN DIM_TIME
WHERE TIMEKEY >= START_TIMEKEY AND
      TIMEKEY < END_TIMEKEY AND
      END_TIMEKEY IS NOT NULL AND -- last entry per machine and status
      DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY
GROUP BY FK_MACHINE,
         STATUS,
         DATE_ISO8601,
         MIN_LOWER_BOUND
ORDER BY DATE_ISO8601,
         MIN_LOWER_BOUND;
The Problem
If a status lasts past midnight, it won't be aggregated correctly. For example, the status at id = 3 in "Raw_data" starts at 23:07 and ends at 00:20 the next day. Here, timekey is greater than end_timekey, so the status gets excluded from the resulting table by the filter TIMEKEY < END_TIMEKEY. I haven't come up with a way to change the join condition so that it includes such long-lasting states but still produces the expected result.
PS: As mentioned above, status updates normally arrive every few seconds, so the problem only occurs in edge cases, e.g. when a machine gets turned off.
Solution
Unfortunately I did not receive an answer on how to get the expected result using the date- and time dimension tables. But dnoeth's approach using a recursive CTE is good, so I went with it:
WITH cte_outer AS (
    SELECT fk_machine,
           status,
           created_at,
           DATEADD(minute, DATEDIFF(minute, '2000', created_at), '2000') AS min_lower_bound, -- truncates seconds from start time
           LEAD(created_at) OVER(PARTITION BY fk_machine ORDER BY created_at) AS end_time
    FROM raw_data
),
cte_recursive AS (
    SELECT fk_machine,
           status,
           min_lower_bound,
           end_time,
           CASE
               WHEN end_time > DATEADD(minute, 1, min_lower_bound)
                   THEN DATEDIFF(s, created_at, DATEADD(minute, 1, min_lower_bound))
               ELSE DATEDIFF(s, created_at, end_time)
           END AS total_seconds
    FROM cte_outer
    UNION ALL
    SELECT fk_machine,
           status,
           DATEADD(minute, 1, min_lower_bound), -- next time segment (minute)
           end_time,
           CASE
               WHEN end_time >= DATEADD(minute, 2, min_lower_bound)
                   THEN 60
               ELSE DATEDIFF(s, DATEADD(minute, 1, min_lower_bound), end_time)
           END
    FROM cte_recursive
    WHERE end_time > DATEADD(minute, 1, min_lower_bound)
)
SELECT min_lower_bound,
       fk_machine,
       status,
       total_seconds
FROM cte_recursive
ORDER BY fk_machine,
         min_lower_bound;
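Note: SQL Server caps recursive CTEs at 100 recursion levels by default, so a state lasting longer than 100 minutes (e.g. a machine switched off overnight) would make this query fail with the maximum-recursion error. If that can happen, the final SELECT needs a MAXRECURSION hint, for example (replacing the final SELECT above):

SELECT min_lower_bound,
       fk_machine,
       status,
       total_seconds
FROM cte_recursive
ORDER BY fk_machine,
         min_lower_bound
OPTION (MAXRECURSION 0); -- 0 = unlimited; a bound such as 1440 (one day of minutes) also works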

This is a use-case for a recursive CTE, increasing created_at by one minute per recursion:
with cte as
 (
   select fk_machine
         ,status
         ,start_minute
         ,end_time
         ,case
            when end_time > dateadd(minute, 1, start_minute)
              then datediff(s, created_at, dateadd(minute, 1, start_minute))
            else datediff(s, created_at, end_time)
          end as seconds
   from
    (
      select fk_machine
            ,status
            ,created_at
            ,dateadd(minute, datediff(minute, 0, created_at), 0) as start_minute
            ,lead(created_at)
             over (partition by fk_machine
                   order by created_at) as end_time
      from tab
    ) as dt
   union all
   select fk_machine
         ,status
         ,dateadd(minute, 1, start_minute)
         ,end_time
         ,case
            when end_time >= dateadd(minute, 2, start_minute)
              then 60
            else datediff(s, dateadd(minute, 1, start_minute), end_time)
          end
   from cte
   where end_time > dateadd(minute, 1, start_minute)
 )
select * from cte
order by 1,3,4;
See fiddle

For something like this, concatenating the date and time keys into a single datetime isn't as costly as it might seem. You can then call DATEDIFF() to check for positive, negative, or absolute values in the comparison. I've run something similar, translating instantaneous data into minute aggregates across multiple decades, and DATEDIFF() really makes the difference. That said, this would work much better if you simply pulled the raw data and performed the calculations in a language with a good datetime library. SQL is always the answer until it isn't.
What’s likely causing one of the problems here is the following statement:
WHERE TIMEKEY >= START_TIMEKEY AND
      TIMEKEY < END_TIMEKEY AND
      END_TIMEKEY IS NOT NULL AND
      DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY
If the date and time aren’t separated, you can say:
WHERE DateTimeKey >= Start_DateTimeKey AND
      DateTimeKey < End_DateTimeKey AND
      End_DateTimeKey IS NOT NULL
If you are trying to aggregate by a time value, it would help to eliminate the timekey table entirely, as it may be another source of problems. It may be a good idea to replace it with a recursion and a period duration. You will also need to account for these conditions:
The end time of the event must always be after the start time of the aggregate period:
DateDiff(second, Period_Start_Time, Event_End) > 0
The start time of the event must always be before the end of the aggregate period:
DateDiff(second, Period_Start_Time, Event_Start) <= @Period_Duration
There are several ways to distribute the event data across the periods, but DATEDIFF() helps with linear distribution as well.
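A minimal sketch of those two conditions combined, assuming hypothetical events and periods tables (event_start/event_end would come from a LEAD() over the raw data, period_start from a minutes table) and a 60-second period; none of these names come from the original schema:

-- Hedged sketch: seconds of overlap between an event [event_start, event_end)
-- and an aggregate period [period_start, period_start + @period_duration).
DECLARE @period_duration int = 60; -- seconds per aggregate period

SELECT p.period_start,
       e.fk_machine,
       e.status,
       DATEDIFF(second,
                CASE WHEN e.event_start > p.period_start THEN e.event_start ELSE p.period_start END,
                CASE WHEN e.event_end < DATEADD(second, @period_duration, p.period_start)
                     THEN e.event_end
                     ELSE DATEADD(second, @period_duration, p.period_start) END) AS total_seconds
FROM events e
JOIN periods p
  ON DATEDIFF(second, p.period_start, e.event_end) > 0                       -- event ends after the period starts
 AND DATEDIFF(second, p.period_start, e.event_start) <= @period_duration;    -- event starts before the period ends

Because both sides are full datetimes, an event that runs past midnight simply joins to periods on both days; there is no separate date/time comparison to break.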

Related

How to Pivot this table on Oracle?

Can you help me figure out how to pivot this table:
╔═══════════╦═════════════╦══════╦════════╦════════╗
║ Big Group ║ Small Group ║ Kids ║ Adults ║ Elders ║
╠═══════════╬═════════════╬══════╬════════╬════════╣
║         1 ║           1 ║   10 ║     20 ║      5 ║
║         1 ║           2 ║   15 ║     10 ║     10 ║
║         2 ║           1 ║   20 ║      0 ║     15 ║
╚═══════════╩═════════════╩══════╩════════╩════════╝
Into something like this?
╔═══════════╦═════════════╦══════╦════════╦════════╦═════════════╦══════╦════════╦════════╗
║ Big Group ║ Small Group ║ Kids ║ Adults ║ Elders ║ Small Group ║ Kids ║ Adults ║ Elders ║
╠═══════════╬═════════════╬══════╬════════╬════════╬═════════════╬══════╬════════╬════════╣
║         1 ║           1 ║   10 ║     20 ║      5 ║           2 ║   15 ║     10 ║     10 ║
║         2 ║           1 ║   20 ║      0 ║     15 ║             ║      ║        ║        ║
╚═══════════╩═════════════╩══════╩════════╩════════╩═════════════╩══════╩════════╩════════╝
The number of small groups per big group is variable, and that's the part I'm finding difficult.
Can anyone help me?
Thanks in advance
There is a way, but the overhead of using PIVOT is that you have to provide the list of all values to be pivoted.
Since each small group also needs to be pivoted, we create a virtual column combining big group and small group to use in the pivot clause, as you can see below:
with table1 as
 (select 1 bg, 1 sg, 10 kids, 20 adult from dual
  union all
  select 1, 2, 15, 25 from dual
  union all
  select 2, 1, 20, 0 from dual
 )
select *
from
 (
  select t1.*, t1.bg || '_' || t1.sg piv
  from table1 t1
 )
pivot
 (
  max(sg) sg, max(kids) kids, max(adult) adult
  for piv in ('1_1' as bg1_sg1,
              '1_2' as bg1_sg2,
              '2_1' as bg2_sg1)
 )
order by bg
Demo
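The asker's real pain point is the variable number of small groups; since the IN list must be static, one common workaround is to build it dynamically. A rough PL/SQL sketch of the idea, assuming table1 is an actual table (above it was only a WITH clause demo); treat it as an outline, not production code:

DECLARE
  v_cols VARCHAR2(4000);
  v_sql  VARCHAR2(4000);
  v_rc   SYS_REFCURSOR;
BEGIN
  -- Build the pivot IN-list, e.g. '1_1' AS bg1_sg1, '1_2' AS bg1_sg2, ...
  SELECT LISTAGG('''' || bg || '_' || sg || ''' AS bg' || bg || '_sg' || sg, ', ')
           WITHIN GROUP (ORDER BY bg, sg)
    INTO v_cols
    FROM (SELECT DISTINCT bg, sg FROM table1);

  v_sql := 'SELECT * FROM (SELECT t1.*, t1.bg || ''_'' || t1.sg piv FROM table1 t1) '
        || 'PIVOT (MAX(sg) sg, MAX(kids) kids, MAX(adult) adult FOR piv IN (' || v_cols || ')) '
        || 'ORDER BY bg';

  OPEN v_rc FOR v_sql; -- hand the cursor back to the caller
END;
/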

How to group by an expression in TSQL and capture the result?

How can I include the results of an expression in a GROUP BY clause and also select the output of the expression?
Say I have this table:
╔════════════════════════╦═══════════╦═══════╗
║ Forest                 ║ Animal    ║ Count ║
╠════════════════════════╬═══════════╬═══════╣
║ Tongass                ║ Hyena     ║   600 ║
║ Tongass                ║ Bear      ║  1200 ║
║ Mount Baker-Snoqualmie ║ Wolf      ║    30 ║
║ Mount Baker-Snoqualmie ║ Bunny     ║     2 ║
║ Ozark-St. Francis      ║ Pigeon    ║   100 ║
║ Ozark-St. Francis      ║ Ostrich   ║     1 ║
║ Bitterroot             ║ Tarantula ║  9001 ║
╚════════════════════════╩═══════════╩═══════╝
I need a row with the count of carnivores in each forest and a row for the count of non-carnivores (if there are any). This is the output I'm looking for in this example:
╔════════════════════════╦═══════════════╦═══════════════╗
║ Forest                 ║ AnimalsOfType ║ AreCarnivores ║
╠════════════════════════╬═══════════════╬═══════════════╣
║ Tongass                ║          1800 ║             1 ║
║ Mount Baker-Snoqualmie ║             2 ║             0 ║
║ Mount Baker-Snoqualmie ║            30 ║             1 ║
║ Ozark-St. Francis      ║           101 ║             0 ║
║ Bitterroot             ║          9001 ║             1 ║
╚════════════════════════╩═══════════════╩═══════════════╝
The information for whether or not an animal is carnivorous is encoded in the expression.
What I'd like to do is include the expression in the group-by and reference its result in the select clause:
SELECT TOP (1000)
       [Forest],
       SUM([COUNT]) AS AnimalsOfType,
       AreCarnivores
FROM [Tinker].[dbo].[ForestAnimals]
GROUP BY Forest,
         CASE WHEN ForestAnimals.Animal IN ('Pigeon', 'Ostrich', 'Bunny') THEN 0 ELSE 1 END AS AreCarnivores
However, this is not valid TSQL syntax.
If I include the Animal column in the GROUP BY clause to allow me to rerun the function in the SELECT, I'll get one row per animal type, which is not the desired behavior.
Doing separate selects into temp tables and unioning the results is undesirable because the real version of this query features a large number of expressions which need this behavior in the same result set, which would make for an extremely awkward stored procedure.
Use a CTE:
WITH X AS (
SELECT Forest, Animal, Count,
CASE WHEN ForestAnimals.Animal IN ('Pigeon', 'Ostrich', 'Bunny')
THEN 0
ELSE 1 END AS AreCarnivores
FROM [Tinker].[dbo].[ForestAnimals]
)
SELECT Forest, SUM(Count) AS AnimalsOfType, AreCarnivores
FROM X
Group by Forest, AreCarnivores;
Or be more verbose about it and repeat yourself:
SELECT Forest, SUM([Count]) AS AnimalsOfType,
       CASE WHEN ForestAnimals.Animal IN ('Pigeon', 'Ostrich', 'Bunny')
            THEN 0
            ELSE 1 END AS AreCarnivores
FROM [Tinker].[dbo].[ForestAnimals]
GROUP BY Forest, CASE WHEN ForestAnimals.Animal IN ('Pigeon', 'Ostrich', 'Bunny')
                      THEN 0
                      ELSE 1 END;
They're equivalent queries to the optimizer.

Have a SUM(Field_1) not exceed SUM(Field_2)

There might be a better way to accomplish this, but here is what I have:
Environment - Plex ERP - SQL Query Editor
Back-end - SQL Server 2012
Summary
Parts have a "unit" worth based on manufacturing complexity
Some days we ship parts. Other days we don't
The part units are summed for each day they are scheduled to ship ('Rel_Units_Calc')
The plant gets credit for 5 units a day (when open) ('Unit_multiplier')
This daily credit is summed for each day ('Unit_Capacity')
In order to prevent overloading capacity after a slow month, I need to prevent the plant from getting the 5-unit credit once Unit_Capacity would exceed Rel_Units_Calc.
A report will be created that uses a case statement: if Rel_Units_Calc > Unit_Capacity, show red, else green.
Detailed Scope
I'm trying to create a sales report that will prevent the sales group from overloading (exceeding the capacity of) the plant. To simplify, let's say we have 3 parts (Part A, B, & C). Part A is simple and worth 1 "Unit". Part B is a little more complex and worth 2 "Units". Part C is the most complex and worth 5 "Units". The plant can process 5 units per day that it is open.
The report will show red when a day has been overloaded and green when it has not. Any day in red will need its sales orders moved out.
My approach was to take the units * order quantity to give me the 'Release_Units'. Then I am doing a sum(Release_Units) to show a tally for each day in a field called 'Release_Units_Calc'.
I have another field called 'Unit_Multiplier' that gives the 5 unit per day credit on eligible days (excludes weekends and holidays). Then I am doing a sum(Unit_Multiplier) to show a tally for each day in a field called 'Unit_Capacity'.
The colors red and green were going to be determined by a case statement comparing the two columns Release_Units_Calc and Unit_Capacity: when Unit_Capacity >= Release_Units_Calc then green, else red.
This works OK until you look at December, when we have a slowdown for these parts and start banking Unit_Capacity. The Unit_Capacity field continues to accrue the 5 units per day even after it has surpassed Release_Units_Calc. These parts are not produced in December, so 20 business days * 5 units per day gives us 100 banked units on Jan 1, which is not good. Essentially, this would cause the sales group to overwhelm the plant in January, as they would have 100 banked units to draw from.
I would like Unit_Capacity (which, again, is a running SUM(Unit_multiplier)) to not exceed Release_Units_Calc (which comes from SUM(Release_Units)).
SQL Below:
This temp table marks the days that should be included for the capacity
SELECT DISTINCT
       FDPO.FULL_DATE,
       -- Case statement below creates an include flag. It excludes weekends unless we have a shipment going out.
       (CASE WHEN DATENAME(dw, DATEADD(d, 0, FDPO.Full_Date)) NOT IN ('Saturday', 'Sunday')
             THEN 1
             WHEN DATENAME(dw, DATEADD(d, 0, FDPO.Full_Date)) IN ('Saturday', 'Sunday')
                  AND FDPO.DUE_DATE IS NOT NULL
             THEN 1
             ELSE 0
        END) AS 'Include'
INTO #Capacity_Temp1
FROM #FDPO AS FDPO
This temp table uses the include flag to remove the dates that should not accrue capacity and adds a capacity column.
SELECT CT1.FULL_DATE,
       @Unit_Multiplier AS 'Unit_multiplier'
INTO #Capacity_Temp2
FROM #Capacity_Temp1 AS CT1
WHERE CT1.INCLUDE = 1
The temp table below adds the unit multiplier up for each day
SELECT DISTINCT
       CT2.FULL_DATE,
       CT2.Unit_multiplier,
       SUM(CT2.Unit_multiplier) OVER (ORDER BY CT2.FULL_DATE) AS 'Unit_Capacity'
INTO #Unit_Capacity
FROM #Capacity_Temp2 AS CT2
The final display query
SELECT RUC.FULL_DATE,
       RUC.Release_Units,
       RUC.Release_Units_Calc, -- running tally of the release units
       ISNULL(UC.Unit_multiplier, 0) AS 'Unit_multiplier', -- credit units given per day except when closed
       UC.Unit_Capacity -- running tally of the unit multiplier
FROM #RUC AS RUC
LEFT JOIN #Unit_Capacity AS UC
       ON UC.FULL_DATE = RUC.FULL_DATE
The output at present:
╔══════╦═══════════════╦════════════════╦═════════════════╦═══════════════╗
║ DATE ║ Release_Units ║ Rel_Units_Calc ║ Unit_multiplier ║ Unit_Capacity ║
╠══════╬═══════════════╬════════════════╬═════════════════╬═══════════════╣
║ 8/3  ║            15 ║             15 ║               5 ║             5 ║
║ 8/4  ║          NULL ║             15 ║               5 ║            10 ║
║ 8/5  ║            20 ║             50 ║               5 ║            15 ║
║ 8/5  ║            15 ║             50 ║               5 ║            15 ║
║ 8/6  ║          NULL ║             50 ║               0 ║          NULL ║
║ 8/7  ║          NULL ║             50 ║               5 ║            20 ║
║ 8/8  ║          NULL ║             50 ║               5 ║            25 ║
║ 8/9  ║          NULL ║             50 ║               5 ║            30 ║
║ 8/10 ║          NULL ║             50 ║               5 ║            35 ║
║ 8/11 ║          NULL ║             50 ║               5 ║            40 ║
║ 8/12 ║            15 ║             65 ║               5 ║            45 ║
║ 8/13 ║          NULL ║             65 ║               0 ║          NULL ║
║ 8/14 ║          NULL ║             65 ║               5 ║            50 ║
║ 8/15 ║          NULL ║             65 ║               5 ║            55 ║
║ 8/16 ║            10 ║             75 ║               5 ║            60 ║
║ 8/17 ║          NULL ║             75 ║               5 ║            65 ║
║ 8/18 ║          NULL ║             75 ║               5 ║            70 ║
║ 8/19 ║          NULL ║             75 ║               0 ║          NULL ║
║ 8/20 ║          NULL ║             75 ║               0 ║          NULL ║
║ 8/21 ║          NULL ║             75 ║               5 ║            75 ║
║ 8/22 ║          NULL ║             75 ║               5 ║            80 ║
║ 8/23 ║          NULL ║             75 ║               5 ║            85 ║
║ 8/24 ║          NULL ║             75 ║               5 ║            90 ║
║ 8/25 ║          NULL ║             75 ║               5 ║            95 ║
║ 8/26 ║            10 ║             95 ║               5 ║           100 ║
║ 8/27 ║            10 ║             95 ║               5 ║           105 ║
╚══════╩═══════════════╩════════════════╩═════════════════╩═══════════════╝
The problem occurs on 8/22 where we start to exceed the Rel_Units_Calc field. This allows an order to be placed on 8/27 that will not trigger the Red because the Unit_Capacity will be greater than the Rel_Units_Calc.
Sorry for the long post. I'm open to any suggestions if there is a better way to accomplish this.
Thanks in Advance,
Mike
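One possible direction, sketched against the column and temp-table names from the output query above: cap the running total with a CASE so it can never exceed the running release units. This only caps the displayed value; truly pausing the accrual in December and resuming it later would need a recursive CTE or an iterative pass.

-- Sketch: cap the running capacity at the running release units.
SELECT RUC.FULL_DATE,
       RUC.Release_Units,
       RUC.Release_Units_Calc,
       ISNULL(UC.Unit_multiplier, 0) AS Unit_multiplier,
       CASE WHEN UC.Unit_Capacity > RUC.Release_Units_Calc
            THEN RUC.Release_Units_Calc -- stop crediting once capacity catches up
            ELSE UC.Unit_Capacity
       END AS Unit_Capacity_Capped
FROM #RUC AS RUC
LEFT JOIN #Unit_Capacity AS UC
       ON UC.FULL_DATE = RUC.FULL_DATE;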

Complex GROUP BY with Django's ORM

I have a Django application that tracks electricity consumption and I'm having a hard time trying to come up with a way to use Django's ORM to fetch some information.
My specific use case is this: I have a set of electricity consumption readings, each with a datetime field, consumption and cost (and a few others but these are the relevant ones). I need to sum the consumption and cost values grouped by month, year, electricity meter and electricity price. In other words, I need to be able to get the total energy consumption value and corresponding cost for each month, of each year, for each price (easier to understand if you look at the table further down the post).
This is my ElectricityReading model and its parent Reading model (separated because we also have consumption readings for water and gas, which also derive from Reading):
from model_utils.models import TimeStampedModel
# Other imports here...


class Reading(TimeStampedModel):
    meter = models.ForeignKey(Meter)
    datetime = models.DateTimeField()  # Terrible property name, I know :)

    class Meta:
        abstract = True


class ElectricityReading(Reading):
    price = models.ForeignKey(ElectricityPrice)
    consumption = models.DecimalField(max_digits=18, decimal_places=3,
                                      null=True, blank=True, default=None)
    cost = models.DecimalField(max_digits=18, decimal_places=3, null=True,
                               blank=True, default=None)
Right now I'm doing this with this raw SQL, which I build depending on a few parameters:
SELECT
(EXTRACT(YEAR FROM datetime)) AS reading_date_year,
(EXTRACT(MONTH FROM datetime)) AS reading_date_month,
SUM(consumption) as total_consumption,
SUM(cost) as total_cost,
COUNT(id) as num_readings,
price_id
FROM electricity_reading
WHERE meter_id IN (10)
AND datetime >= '2015-10-01 00:00'
AND datetime <= '2015-12-31 23:59'
GROUP BY reading_date_year, reading_date_month, price_id, meter_id
ORDER BY meter_id, reading_date_year, reading_date_month, price_id
This SQL query results in something like the following data (made up values and simplified column names for better formatting):
╔══════╦═══════╦═════════════╦══════╦══════════════╦═══════╗
║ year ║ month ║ consumption ║ cost ║ num_readings ║ price ║
╠══════╬═══════╬═════════════╬══════╬══════════════╬═══════╣
║ 2015 ║    10 ║         600 ║  804 ║          456 ║     1 ║
║ 2015 ║    10 ║         728 ║  471 ║         1998 ║     2 ║
║ 2015 ║    10 ║         848 ║  792 ║         1266 ║     3 ║
║ 2015 ║    10 ║         256 ║  705 ║          744 ║     5 ║
║ 2015 ║    11 ║         528 ║  377 ║          630 ║     1 ║
║ 2015 ║    11 ║         016 ║  687 ║         1680 ║     2 ║
║ 2015 ║    11 ║         240 ║  826 ║         1289 ║     3 ║
║ 2015 ║    11 ║         736 ║  522 ║          720 ║     5 ║
║ 2015 ║    12 ║         584 ║  627 ║          608 ║     1 ║
║ 2015 ║    12 ║         776 ║  078 ║         1627 ║     2 ║
║ 2015 ║    12 ║         600 ║  401 ║         1410 ║     3 ║
║ 2015 ║    12 ║         864 ║  842 ║          744 ║     5 ║
╚══════╩═══════╩═════════════╩══════╩══════════════╩═══════╝
Using Django's ORM, I think the code I need is something along the lines of the following:
objs = ElectricityReading.objects\
    .filter(
        meter=10,
        datetime__gte='2015-05-01 00:00',
        datetime__lte='2015-08-31 23:59'
    ).only('price_id')\
    .annotate(reading_date_year=YearTransform('datetime'))\
    .annotate(reading_date_month=MonthTransform('datetime'))\
    .annotate(total_consumption=Sum('consumption'))\
    .annotate(total_cost=Sum('cost'))\
    .annotate(num_readings=Count('id'))\
    .order_by('meter_id', 'reading_date_year', 'reading_date_month', 'price_id')
But the SQL it generates is not what I need:
SELECT
id,
price_id,
EXTRACT('year' FROM datetime AT TIME ZONE 'Europe/Lisbon') AS reading_date_year,
EXTRACT('month' FROM datetime AT TIME ZONE 'Europe/Lisbon') AS reading_date_month,
SUM(consumption) AS total_consumption, SUM(cost) AS total_cost,
COUNT(id) AS num_readings
FROM geratriz_electricityreading
WHERE (
datetime >= '2015-05-01 00:00:00+01:00'
AND datetime <= '2015-08-31 23:59:00+01:00'
AND meter_id = 10)
GROUP BY
id,
EXTRACT('year' FROM datetime AT TIME ZONE 'Europe/Lisbon'),
EXTRACT('month' FROM datetime AT TIME ZONE 'Europe/Lisbon')
ORDER BY meter_id ASC, reading_date_year ASC, reading_date_month ASC, price_id ASC
This results in a lot more rows being returned from the database due to not being grouped as I need them to be.
The part of the SQL query I can't seem to replicate with Django's ORM is the GROUP BY clause at the end. Django insists on grouping by ID and I can't seem to find a way to make it group by meter_id and price_id.
Given how much time I spent on this already, I'm inclined to say that what I am trying to accomplish simply isn't possible with Django's ORM but I would love that someone would tell me I am missing something.
Try using values()
objs = ElectricityReading.objects\
    .filter(
        meter=10,
        datetime__gte='2015-05-01 00:00',
        datetime__lte='2015-08-31 23:59'
    ).values('price_id')\
    .annotate(reading_date_year=YearTransform('datetime'))\
    .annotate(reading_date_month=MonthTransform('datetime'))\
    .annotate(total_consumption=Sum('consumption'))\
    .annotate(total_cost=Sum('cost'))\
    .annotate(num_readings=Count('id'))\
    .order_by('meter_id', 'reading_date_year', 'reading_date_month', 'price_id')
This should group the results on price_id. If you were displaying several meters at once instead of meter=10, then you could do values('price_id', 'meter') and it would group on both fields.
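With values() in place, the ORM should emit a GROUP BY on price_id plus the transform expressions instead of id, roughly the shape of the target query from the question (a sketch of the expected output, not verified against a real database):

SELECT
    price_id,
    EXTRACT('year' FROM datetime) AS reading_date_year,
    EXTRACT('month' FROM datetime) AS reading_date_month,
    SUM(consumption) AS total_consumption,
    SUM(cost) AS total_cost,
    COUNT(id) AS num_readings
FROM geratriz_electricityreading
WHERE meter_id = 10
  AND datetime >= '2015-05-01 00:00'
  AND datetime <= '2015-08-31 23:59'
GROUP BY price_id,
         EXTRACT('year' FROM datetime),
         EXTRACT('month' FROM datetime)
ORDER BY reading_date_year, reading_date_month, price_id;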

How to scale with numbers table - exploding the memory and the hard disk

I am trying to populate a half-filled column that has values for some dates and NULL for the rest.
The task is a basic fill-in-the-gaps with the value of the previous row.
It needs n iterations to fill the entire table.
I am using a NUMBERS table to do the iterations, and it works for a small sample table like the following.
When it is run against 18 million rows, the query cannot finish: it exhausts the computer's resources and the runtime is endless. How can I scale this?
Or is there a better way to do it? This solution seemed good to me at first.
The 'as is' and 'to be' [statusTest] columns are as follows:
╔════════════╦═══════════╦════════════╦═════════════════╦═════════════════╗
║ SOZLESMENO ║ tDuration ║ YRMONTH    ║ statusTest_AsIs ║ statusTest_ToBE ║
╠════════════╬═══════════╬════════════╬═════════════════╬═════════════════╣
║   40000001 ║         0 ║ 2010-01-01 ║               1 ║               1 ║
║   40000001 ║         1 ║ 2010-02-01 ║            NULL ║               1 ║
║   40000001 ║         2 ║ 2010-03-01 ║            NULL ║               1 ║
║   40000001 ║         3 ║ 2010-04-01 ║            NULL ║               1 ║
║   40000001 ║         4 ║ 2010-05-01 ║               2 ║               2 ║
║   40000001 ║         5 ║ 2010-06-01 ║            NULL ║               2 ║
║   40000001 ║         6 ║ 2010-07-01 ║            NULL ║               2 ║
║   40000001 ║         7 ║ 2010-08-01 ║            NULL ║               2 ║
║   40000001 ║         8 ║ 2010-09-01 ║               3 ║               3 ║
║   40000001 ║         9 ║ 2010-10-01 ║            NULL ║               3 ║
║   40000001 ║        10 ║ 2010-11-01 ║            NULL ║               3 ║
╚════════════╩═══════════╩════════════╩═════════════════╩═════════════════╝
I use the following code with a predefined Numbers table of 10,000 rows:
--Numbers table defined
SELECT TOP 10000 H = IDENTITY(INT, 0, 1)
INTO dbo.Numbers
FROM master.dbo.syscolumns a
CROSS JOIN master.dbo.syscolumns b;
--Iterating the table H times to get the statusTest_ToBE column shown above
DECLARE @iteration_limit INT = 60

UPDATE X
SET X.statusTest = (
    CASE
        WHEN X.statusTest IS NOT NULL THEN X.statusTest
        ELSE Y.statusTest
    END
)
FROM [Mainfiles].dbo.x2Skeleton X
CROSS JOIN [Mainfiles].dbo.Numbers3 N
LEFT JOIN [Mainfiles].dbo.x2Skeleton Y
    ON (X.SOZLESMENO = Y.SOZLESMENO)
   AND (DATEADD(MONTH, -N.H, X.YRMONTH) = Y.YRMONTH)
   AND N.H BETWEEN 1 AND @iteration_limit
You can express what you want using window functions. If StatusTest_AsIs is always increasing, you can just use max():
with toupdate as (
      select X.*,
             max(StatusTest_AsIs) over (partition by SOZLESMENO order by YRMONTH) as new_statusTest_ToBE
      from [Mainfiles].dbo.x2Skeleton X
     )
update toupdate
    set statusTest_ToBE = new_statusTest_ToBE
    where statusTest_ToBE <> new_statusTest_ToBE;
If the values are not increasing, you can still do this. Getting the previous non-NULL value is a bit tricky, but APPLY is a good way to do it:
with toupdate as (
      select x.*, x2.StatusTest_AsIs as new_statusTest_ToBE
      from [Mainfiles].dbo.x2Skeleton x cross apply
           (select top 1 x2.StatusTest_AsIs
            from [Mainfiles].dbo.x2Skeleton x2
            where x2.SOZLESMENO = x.SOZLESMENO and
                  x2.YRMONTH <= x.YRMONTH and
                  x2.StatusTest_AsIs is not null
            order by x2.YRMONTH desc
           ) x2
     )
update toupdate
    set statusTest_ToBE = new_statusTest_ToBE
    where statusTest_ToBE <> new_statusTest_ToBE;
For both these queries, but this one in particular, you want an index on [Mainfiles].dbo.x2Skeleton(SOZLESMENO, YRMONTH, StatusTest_AsIs).
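A minimal sketch of that index (the index name is made up):

CREATE INDEX IX_x2Skeleton_Sozlesmeno_Yrmonth -- hypothetical name
    ON [Mainfiles].dbo.x2Skeleton (SOZLESMENO, YRMONTH, StatusTest_AsIs);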