Modify Postgres query to use generate_series for overall summation over each of several consecutive range intervals

I'm still quite new with SQL, coming from an ORM-centric environment, so please be patient with me.
Provided with a table in the form of:
CREATE TABLE event (id int, order_dates tsrange, flow int);
INSERT INTO event VALUES
(1,'[2021-09-01 10:55:01,2021-09-04 15:16:01)',50),
(2,'[2021-08-15 20:14:27,2021-08-18 22:19:27)',36),
(3,'[2021-08-03 12:51:47,2021-08-05 11:28:47)',41),
(4,'[2021-08-17 09:14:30,2021-08-20 13:57:30)',29),
(5,'[2021-08-02 20:29:07,2021-08-04 19:19:07)',27),
(6,'[2021-08-26 02:01:13,2021-08-26 08:01:13)',39),
(7,'[2021-08-25 23:03:25,2021-08-27 03:22:25)',10),
(8,'[2021-08-12 23:40:24,2021-08-15 08:32:24)',26),
(9,'[2021-08-24 17:19:59,2021-08-29 00:48:59)',5),
(10,'[2021-09-01 02:01:17,2021-09-02 12:31:17)',48); -- etc
the query below does the following:
(here, 'the range' is from 2021-08-03T00:00:00 to 2021-08-04T00:00:00)
For each event that overlaps the range:
- trim the lower and upper timestamp values of order_dates to the bounds of the range,
- multiply the remaining duration of each applicable event by the event.flow value,
- sum all of the multiplied values into a single output value.
Basically, I get all of the events that overlap the range, but only calculate the total value based on the portion of each event that is within the range.
SELECT SUM("total_value")
FROM
(SELECT (EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER * "event"."flow") AS "total_value"
FROM "event"
WHERE "event"."order_dates" && tsrange('2021-08-03T00:00:00'::timestamp, '2021-08-04T00:00:00'::timestamp, '[)')
GROUP BY "event"."id",
GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp),
LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp),
EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER, (EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER * "event"."flow")) subquery
The DB<>Fiddle demonstrating this: https://www.db-fiddle.com/f/jMBtKKRS33Qf2FEoY5EdPA/1
This query started out as a complex set of django annotations and aggregation, and I have simplified it to remove the parts not necessary for this question.
So with the above I get a single total value over the input range (in this case a 1-day range).
But I want to be able to use generate_series to perform this same overall summation over each of several consecutive range intervals,
e.g.: query for the total during each of the following ranges:
['2021-08-01T00:00:00', '2021-08-02T00:00:00')
['2021-08-02T00:00:00', '2021-08-03T00:00:00')
['2021-08-03T00:00:00', '2021-08-04T00:00:00')
['2021-08-04T00:00:00', '2021-08-05T00:00:00')
This is somewhat related to my previous question here, but since the timestamps for the queried range are used in so many places within the query, I'm pretty lost as to how to do this.
Any help/direction will be appreciated.

This should get you started: https://www.db-fiddle.com/f/qm4F7qqWZMrtXtMejimVJr/1.
Basically what I did was to prepare the ranges with a CTE up-front, then select from that table expression with a CROSS JOIN LATERAL of your original query. Next, I replaced all occurrences of 2021-08-03 with lower(target_range) and 2021-08-04 with upper(target_range), then added a GROUP BY on target_range. Note that only those ranges that overlap at least one row in the input will appear in the output; change the cross join to a LEFT JOIN LATERAL to always see your input ranges in the output, even if the value is null. (If so, ON TRUE is fine for the join condition, since the filtering already happens in the WHERE of the inner subquery.)
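Here is a minimal sketch of that shape (the fiddle above has the full version; the series bounds and the 1-day step below are assumptions for illustration, and the inner query is yours with the ORM-generated GROUP BY noise dropped):

WITH ranges AS (
  -- one [) range per day; adjust the bounds and the step to taste
  SELECT tsrange(ts, ts + interval '1 day', '[)') AS target_range
  FROM generate_series('2021-08-01T00:00:00'::timestamp,
                       '2021-08-04T00:00:00'::timestamp,
                       interval '1 day') AS ts
)
SELECT target_range, SUM(total_value) AS total_value
FROM ranges
CROSS JOIN LATERAL
  -- your original query, with the timestamp literals replaced by the range bounds
  (SELECT (EXTRACT(epoch FROM (LEAST(UPPER(e.order_dates), UPPER(target_range))
                               - GREATEST(LOWER(e.order_dates), LOWER(target_range))))::integer
           * e.flow) AS total_value
   FROM event e
   WHERE e.order_dates && target_range) sub
GROUP BY target_range
ORDER BY target_range;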

Related

Finding the last 4, 3, 2, 1 months consecutive order drops among clients based on drop variance

Here I have a query that finds the drop percentage of a bunch of clients based on the orders they have received (i.e. it finds the percentage difference in orders by comparing the current month with the previous month). What I want to achieve is a field where I can see which clients had a 4-month continuous drop, a 3-month drop, a 2-month drop, and a 1-month drop.
I know it can only be achieved by comparing the last 4 months using the LAG function or subqueries. Can you please help me out with this one? I would appreciate it very much.
select
    fd.customers2, fd.Month1, fd.year1, fd.variance,
    case when (fd.variance < -0.00001 and fd.year1 = '2022.0' and fd.Month1 = '1')
         then '1month drop' else fd.customers2 end as 1_most_host_drop
from
    (SELECT
         c.*,
         sa.customers as customers2,
         sum(sa.order) as orders,
         date_part(mon, sa.date) as Month1,
         date_part(year, sa.date) as year1,
         (cast(orders - LAG(orders) OVER (partition by customers2 ORDER BY year1, Month1) as NUMERIC(10,2))
            / NULLIF(LAG(orders) OVER (partition by customers2 ORDER BY year1, Month1) * 1, 0)) AS variance
     FROM stats sa
     join (select distinct d.id, d.customers from configer d) c
       on sa.customers = c.customers
     WHERE sa.date >= '2021-04-1'
     GROUP BY Month1, sa.customers, c.id, year1, c.customers) fd
In a spirit of friendliness: I think you are a little premature in posting this here as there are several issues with the syntax before even reaching the point where you can solve the problem:
You have at least two places with a comma immediately preceding the word FROM:
...AS variance, FROM stats_archive sa ...
...d.customers, FROM config d...
Recommend you don't use VARIANCE as an alias (it is a system function in PostgreSQL and so is likely also a system function name in Redshift)
Not super important, but there's no need for c.* - just select the columns you will use
DATE_PART requires a string as the first parameter DATE_PART('mon',current_date)
I might be wrong about this, but I suspect you cannot use column aliases in the partition by or order by of a window function. Put the originating expressions there instead:
... OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
LAG takes three parameters: (1) the column you want to retrieve the value from, (2) the row offset, where a positive integer indicates how many rows prior to the current row to retrieve the value from according to the partition and order context, and (3) the default value the function should return (in the case of the first row in the partition). As such, you don't need NULLIF. So, to get the row immediately prior to the current row, or return 0 in case the current row is the first row in the partition:
LAG(orders,1,0) OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
If you use 0 as the default in the calculation of what is currently aliased variance, you will almost certainly run into a div/0 error, either now or, worse, when you least expect it in the future. You should protect against that with some CASE logic; better, provide a more appropriate default value; or better still, calculate the LAG with the default 0, then filter out the 0 rows before doing the calculation.
You can't use column aliases in the GROUP BY. You must reference each field that is not participating in an aggregate in the group by, whether through direct mention (sa.date) or indirectly in an expression (DATE_PART('mon',sa.date))
Your date should be '2021-04-01'
All in all, without sample data and expected results, and without first removing the syntax errors, it is a tall order to offer advice on the problem that is any more specific than this:
Build the source of the calculation as a completely separate query first, and calculate the LAG in that source query. Only when you've run that source query and verified that the LAG produces the correct result should you wrap it as a sub-query or CTE (not sure whether Redshift supports these, but presumably), at which point you can filter out the rows with a zero as the denominator (the first month of orders for each customer).
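As a rough sketch of that source query, folding in the fixes above (untested against Redshift; table and column names come from the post, and "order" is quoted on the assumption that it is a reserved word):

WITH monthly AS (
    SELECT sa.customers,
           DATE_PART('year', sa.date) AS order_year,
           DATE_PART('mon', sa.date)  AS order_month,
           SUM(sa."order")            AS orders  -- "order" is reserved, hence the quotes
    FROM stats sa
    WHERE sa.date >= '2021-04-01'
    GROUP BY sa.customers, DATE_PART('year', sa.date), DATE_PART('mon', sa.date)
)
SELECT customers, order_year, order_month, orders,
       LAG(orders, 1, 0) OVER (PARTITION BY customers
                               ORDER BY order_year, order_month) AS prev_orders
FROM monthly;

Verify prev_orders first; only then wrap this and compute (orders - prev_orders) / prev_orders, filtering out the prev_orders = 0 rows beforehand.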
Good luck!

Add a number value to column in SQL query using SELECT method

I am working on a query that calculates tuition costs. It should do this using the Tuition table, which only includes the FullTimeCost (a static number for the student fees) and the PerUnitCost (the cost per credit hour).
I am trying to use a SELECT to return 3 more columns: a constant value of 12 called Units, and 2 more that calculate the rest with simple math.
The problem I am having is that I cannot seem to make the column Units have a default value of 12.
This is my code, and the issue is that with this approach, the later formulas do not recognize the columns created on the previous lines.
All I need is for the 3rd line to recognize Units so it can multiply by 12 as intended. Also, this is for school, so a comment saying just change it to 12 is not useful.
SELECT
FullTimeCost, PerUnitCost,
12 AS Units,
PerUnitCost * Units AS TotalPerUnitCost,
FullTimeCost + TotalPerUnitCost AS TotalTuition
FROM
Tuition
You cannot re-use a column alias in the select. However, SQL Server gives you a convenient way to define the alias in the from clause, so you can use it:
SELECT t.FullTimeCost, t.PerUnitCost, v.Units,
v2.TotalPerUnitCost,
t.FullTimeCost + v2.TotalPerUnitCost AS TotalTuition
FROM Tuition t CROSS APPLY
(VALUES (12)) v(units) CROSS APPLY
(VALUES (t.PerUnitCost * v.Units)) v2(TotalPerUnitCost);
Use a CTE to "add" your constant as a column and then apply the calculation. Without context, a variable would also be just as simple and useful.
with cte as (
    select FullTimeCost, PerUnitCost, 12 as Units
    from dbo.Tuition
)
SELECT
    FullTimeCost, PerUnitCost, Units,
    PerUnitCost * Units AS TotalPerUnitCost,
    FullTimeCost + PerUnitCost * Units AS TotalTuition -- the TotalPerUnitCost alias can't be re-used in the same SELECT, so the expression is repeated
FROM cte
order by ...;
There are, of course, other ways to accomplish this. Not certain what your coursework has covered but I assume that recent topics should have provided techniques to do this.
Using APPLY as shown in Gordon's answer is the most elegant solution; another way, also noted in the comments, is to use a derived table.
As you have no doubt gathered, the problem is that during query compilation, the optimizer does not "see" the calculated column aliases, as it can only (generally) access columns available from tables in the from clause, or, as shown by Gordon, via an apply().
What you can also do is use a derived table, by first selecting the columns you need from your table and also adding your additional columns.
You then wrap this in parentheses - it's now a derived table, i.e. the result of the parenthesized content is itself a table available to an outer select.
You then use this as the source for an outer select, which has visibility of any additional columns you have added.
A complication with your query is that you want to add a constant value Units and then reference it, and also reference a second calculated column that makes use of Units.
I would simply use a single derived table to calculate the TotalPerUnitCost; you don't need Units since it's used only once.
select
    FullTimeCost, PerUnitCost, TotalPerUnitCost,
    FullTimeCost + TotalPerUnitCost as TotalTuition
from (
    select FullTimeCost, PerUnitCost, PerUnitCost * 12 as TotalPerUnitCost
    from Tuition
) t

SQL to powerBI expression?

How to write this expression in PowerBI
select distinct([date]),Temperature from Device47A8F where Temperature>25
Totally new to PowerBI. Is there any tool that can change the query from sql to PowerBI expression?
I have tried many different types of expressions but keep getting errors. Most of the time I get this:
The expression refers to multiple columns. Multiple columns cannot be converted to a scalar value.
Need help, Thanks.
After I posted my answer, I wondered whether your expected result is to get only one row per date, in other words without repeated dates in your result set.
A side note: select distinct([date]),Temperature from Device47A8F where Temperature>25 returns repeated dates, since the DISTINCT keyword evaluates the distinct combinations of all the column values specified in the SELECT statement; it doesn't return distinct values of a single column even if you surround it with parentheses.
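For example, these two queries are equivalent; the parentheses change nothing:

select distinct([date]), Temperature from Device47A8F where Temperature > 25
select distinct [date], Temperature from Device47A8F where Temperature > 25
-- both deduplicate ([date], Temperature) pairs; neither returns distinct dates alone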
Now to what brings us here. What I can see in your error is that you are trying to use a table-valued expression (one that produces a table with multiple columns) in a measure, which only accepts scalar values (a single calculated value).
Supposing you have a table like this:
Running your SQL query, you will get the rows highlighted in yellow:
You can see the date 01/09/2016 is repeated. If you want to create a measure, you have to define what calculation you want to show for temperature, i.e. average, max, min etc.
The expression below calculates the maximum temperature greater than 25:
MaxTempGreaterThan25 =
CALCULATE ( MAX ( Device47A8F[Temperature] ), Device47A8F[Temperature] > 25 )
In this case the measure MaxTempGreaterThan25 is calculated per date.
If you don't want to produce a measure but a table: in the Power BI toolbar, select the Modeling tab and click the New Table icon.
Use this expression:
MyTemperatureTable =
FILTER ( Device47A8F, Device47A8F[Temperature] > 25 )
It should produce a new table named MyTemperatureTable like this:
I recommend you learn some basics about DAX, it is pretty different from SQL / T-SQL and there are things you can't do depending on your model and data.
Let me know if this helps.
You probably don't need to write any code if your objective is to show the result in a Power BI visual e.g. a table. Power BI naturally aggregates data if the datatype is numeric (e.g. Temperature).
I would just add a Table visual on a Report page and add the Date and Temperature columns to it. Then in Visualizations / Fields / Values I would click the little down-arrow on the Temperature field and set the Aggregation e.g. Maximum. Then in Visualizations / Fields / Filters I would click the little down-arrow on the Temperature field and set the Filter e.g. is greater than: 25
Hard-coded solutions are unlikely to survive the next question from your users e.g. "but what if I want to see Temperature > 24? Or 20? Or 30?"

Update table with combined field from other tables

I have a table with actions that are being due in the future. I have a second table that holds all the cases, including the due date of the case. And I have a third table that holds numbers.
The problem is as follows. Our system automatically populates our table with future actions. For some clients, however, we need to change these dates. I wanted to create an update query for this and have it run through our scheduler. However, I am kind of stuck at the moment.
What I have on code so far is this:
UPDATE proxima_gestion p
SET fecha = (SELECT To_char(d.f_ult_vencim + c.hrem01, 'yyyyMMdd')
FROM deuda d,
c4u_activity_dates c,
proxima_gestion p
WHERE d.codigo_cliente = c.codigo_cliente
AND p.n_expediente = d.n_expediente
AND d.saldo > 1000
AND p.tipo_gestion_id = 914
AND p.codigo_oficina = 33
AND d.f_ult_vencim > sysdate)
WHERE EXISTS (SELECT *
FROM proxima_gestion p,
deuda d
WHERE p.n_expediente = d.n_expediente
AND d.saldo > 1000
AND p.tipo_gestion_id = 914
AND p.codigo_oficina = 33
AND d.f_ult_vencim > sysdate)
The field fecha is the current action date. Unfortunately, this is saved as a char instead of date. That is why I need to convert the date back to a char. F_ult_vencim is the due date, and hrem01 is the number of days the actions should be placed away from the due date. (for example, this could be 10, making the new date 10 days after the due date)
Apart from that, there are a few more criteria when we need to change the date (certain creditors, certain departments, only for future cases and starting from a certain amount, only for a certain action type.)
However, when I try and run this query, I get the error message
ORA-01427: single-row subquery returns more than one row
If I run both subqueries separately, I get 2 results from both. What I am trying to accomplish is for the query to connect these 2 subqueries and update the field to the new value. This value will be different for every case, as every due date will be different.
Is this even possible? And if so, how?
You're getting the error because the first SELECT is returning more than one row for each row in the table being updated.
The first thing I see is that the alias for the table in UPDATE is the same as the alias in both SELECTs (p). So all of the references to p in the subqueries are referencing proxima_gestion in the subquery rather than the outer query. That is, the subquery is not dependent on the outer query, which is required for an UPDATE.
Try removing "proxima_gestion p" from FROM in both subqueries. The references to p, then, will be to the outer UPDATE query.

Select case mess in JFreeChart

I have a column (cliente_x_hora, a numeric field) that I bucket into intervals, counting the rows in each interval. I have 3 text fields (number of intervals, value between intervals, and initial value). When I select only the first two (with 5 intervals and a value of 1000), the query runs flawlessly and generates the expected bar chart.
Query (with the first two text fields selected):
SELECT INTERVAL, COUNT(*) TOTAL FROM (
  SELECT CASE WHEN CLIENTE_X_HORA>0       AND CLIENTE_X_HORA<=1000.00 THEN '0<CLIENTE_X_HORA<=1000.00'
              WHEN CLIENTE_X_HORA>1000.00 AND CLIENTE_X_HORA<=2000.00 THEN '1000.00<CLIENTE_X_HORA<=2000.00'
              WHEN CLIENTE_X_HORA>2000.00 AND CLIENTE_X_HORA<=3000.00 THEN '2000.00<CLIENTE_X_HORA<=3000.00'
              WHEN CLIENTE_X_HORA>3000.00 AND CLIENTE_X_HORA<=4000.00 THEN '3000.00<CLIENTE_X_HORA<=4000.00'
              ELSE '4000.00<CLIENTE_X_HORA' END INTERVAL,
         CLIENTE_X_HORA
  FROM SGD_CAUSA)
GROUP BY INTERVAL ORDER BY TOTAL
The bar chart comes out as expected.
The problem is when I select the last field (initial value, for example 2000): my bar chart goes crazy (I believe it is adding up the discarded values below 2000).
That ELSE (>6000) bar should be much smaller than it is showing. How can I solve that?
Best Regards,
DDias
CLARIFICATION from OP:
The query is the same as above but begins at 2000 (SELECT CASE WHEN CLIENTE_X_HORA>2000 AND CLIENTE_X_HORA<=3000.00 ...) and ends at 6000 (... ELSE '6000.00<CLIENTE_X_HORA' END INTERVAL, CLIENTE_X_HORA FROM SGD_CAUSA) GROUP BY INTERVAL ORDER BY TOTAL).
Putting the result in table form is impractical (we are talking about over 87 thousand rows). This happens whenever I give an initial value different from zero.
Your ELSE is just that. It includes everything that is not matched by specific WHENs.
So if you do not start from zero, that last bucket will include everything below the lowest limit in addition to everything above the highest limit.
So if you do not want this behavior, do not use ELSE at all. Use WHEN CLIENTE_X_HORA > 6000.00 (or whatever your highest limit is) as the last condition.
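With the 2000-to-6000 example, the CASE would then end like this instead of the ELSE (a sketch in the question's label format; rows below 2000 now land in a NULL interval, which the EDIT below removes up-front):

WHEN CLIENTE_X_HORA>5000.00 AND CLIENTE_X_HORA<=6000.00 THEN '5000.00<CLIENTE_X_HORA<=6000.00'
WHEN CLIENTE_X_HORA>6000.00 THEN '6000.00<CLIENTE_X_HORA'
END INTERVAL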
EDIT:
In your internal query, filter out (with WHERE) the values that are below the lowest limit.
Since we no longer have the unneeded low range, you no longer need the HAVING clause we added, and you can even go back to using ELSE.
If your lowest limit is zero, then you will be filtering out everything below 0, which I assume is nothing.
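A sketch of that filtered version for the 2000-to-6000 case (labels follow the question's format; the WHERE line is the only real change):

SELECT INTERVAL, COUNT(*) TOTAL FROM (
  SELECT CASE WHEN CLIENTE_X_HORA>2000.00 AND CLIENTE_X_HORA<=3000.00 THEN '2000.00<CLIENTE_X_HORA<=3000.00'
              WHEN CLIENTE_X_HORA>3000.00 AND CLIENTE_X_HORA<=4000.00 THEN '3000.00<CLIENTE_X_HORA<=4000.00'
              WHEN CLIENTE_X_HORA>4000.00 AND CLIENTE_X_HORA<=5000.00 THEN '4000.00<CLIENTE_X_HORA<=5000.00'
              WHEN CLIENTE_X_HORA>5000.00 AND CLIENTE_X_HORA<=6000.00 THEN '5000.00<CLIENTE_X_HORA<=6000.00'
              ELSE '6000.00<CLIENTE_X_HORA' END INTERVAL,
         CLIENTE_X_HORA
  FROM SGD_CAUSA
  WHERE CLIENTE_X_HORA > 2000.00)  -- discard rows below the initial value up-front
GROUP BY INTERVAL ORDER BY TOTAL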