MS SQL - Multiple Running Totals, each based on a different GROUP BY - sql

I need to generate two running-total columns, each based on a different GROUP BY. I would prefer that the solution use the OUTER APPLY method like the one below, modified to run multiple running totals/sums over different group-bys/columns. See the image for an example of the desired result.
SELECT t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount,
       RunningTotal = SUM(t2.TicketAmount)
FROM dbo.SpeedingTickets AS t1
OUTER APPLY
(
    SELECT TicketAmount
    FROM dbo.SpeedingTickets
    WHERE LicenseNumber = t1.LicenseNumber
      AND IncidentDate <= t1.IncidentDate
) AS t2
GROUP BY t1.LicenseNumber, t1.IncidentDate, t1.TicketAmount
ORDER BY t1.LicenseNumber, t1.IncidentDate;
Example + desired result:
i.stack.imgur.com/PvJQe.png

Use outer apply twice:
Here is how you get one running total:
SELECT st.*, r1.RunningTotal
FROM dbo.SpeedingTickets st OUTER APPLY
     (SELECT SUM(st2.TicketAmount) as RunningTotal
      FROM dbo.SpeedingTickets st2
      WHERE st2.LicenseNumber = st.LicenseNumber AND
            st2.IncidentDate <= st.IncidentDate
     ) r1
ORDER BY st.LicenseNumber, st.IncidentDate;
For two, you just add another OUTER APPLY; a sketch follows the notes below. Your question doesn't specify what the second aggregation is, and the linked picture has no relevance to the description in the question.
Notes:
The aggregation goes in the subquery, not in the outer query.
Use table abbreviations for table aliases. Such consistency makes it easier to follow the query.
When using correlated subqueries, always use qualified column names for all columns.
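Since the question doesn't say what the second aggregation should be, here is a minimal sketch of the double OUTER APPLY pattern, assuming purely for illustration that the second running total is taken per IncidentDate across all licenses:
SELECT st.*, r1.RunningTotalPerLicense, r2.RunningTotalPerDate
FROM dbo.SpeedingTickets st OUTER APPLY
     -- first running total: within each license, ordered by date
     (SELECT SUM(st2.TicketAmount) AS RunningTotalPerLicense
      FROM dbo.SpeedingTickets st2
      WHERE st2.LicenseNumber = st.LicenseNumber AND
            st2.IncidentDate <= st.IncidentDate
     ) r1 OUTER APPLY
     -- hypothetical second running total: across all licenses, ordered by date
     (SELECT SUM(st3.TicketAmount) AS RunningTotalPerDate
      FROM dbo.SpeedingTickets st3
      WHERE st3.IncidentDate <= st.IncidentDate
     ) r2
ORDER BY st.LicenseNumber, st.IncidentDate;
Each OUTER APPLY correlates on its own columns, so each running total can use a different grouping.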

Related

Access SQL: Update list with Max() and Min() values not possible

I have a list of dates: list_of_dates.
I want to find the max and min values for each number with this code (#1).
It works as it should, and so I get the table MinMax.
Now I want to update another list (list_of_things) with these newly acquired values (#2).
However, it is not possible.
I assume it's due to DISTINCT and the fact that I always get two rows per number with the min and max values; therefore an update is not possible.
Unfortunately I don't know any other way.
#1
SELECT a.number, b.MaxDateTime, c.MinDateTime
FROM (list_of_dates AS a
INNER JOIN (
SELECT a.number, MAX(a.dat) AS MaxDateTime
FROM list_of_dates AS a
GROUP BY a.number) AS b
ON a.number = b.number)
INNER JOIN (SELECT a.number, MIN(a.dat) AS MinDateTime
FROM list_of_dates AS a
GROUP BY a.number) AS c
ON a.number = c.number;
#2
UPDATE list_of_things AS a
LEFT JOIN MinMax AS b
ON a.number = b.number
SET a.latest = b.MaxDateTime, a.earliest = b.MinDateTime
No part of an MS Access update query can contain aggregation, else the resulting recordset becomes 'not updateable'.
In your case, the use of the min & max aggregate functions within the MinMax subquery causes the final update query to become not updateable.
Whilst it is not always advisable to store aggregated data (rather than generating output from the transactional data using queries), if you really need to do this, here are two possible methods:
1. Using a Temporary Table to store the Aggregated Result
Run a select into query such as the following:
select
    t.number,
    max(t.dat) as maxdatetime,
    min(t.dat) as mindatetime
into
    temptable
from
    list_of_dates t
group by
    t.number
This generates a temporary table called temptable; then run the following update query, which sources data from the temporary table:
update
    list_of_things t1 inner join temptable t2
    on t1.number = t2.number
set
    t1.latest = t2.maxdatetime,
    t1.earliest = t2.mindatetime
2. Use Domain Aggregate Functions
Since domain aggregate functions (dcount, dsum, dmin, dmax etc.) are evaluated separately from the evaluation of the query, they do not break the updateable nature of a query.
As such, you might consider using a query such as:
update
    list_of_things t1
set
    t1.latest = dmax("dat", "list_of_dates", "number = " & t1.number),
    t1.earliest = dmin("dat", "list_of_dates", "number = " & t1.number)
It's a shot in the dark, but try adding DISTINCTROW, as per SQL Update woes in MS Access - Operation must use an updateable query.
Also try using an inner join. If you need to, you can first run an update that sets the fields to Null for all the records in the query, to simulate the effect of the outer join.

SQL joined subquery problem / performance

Is there a way to write a query to obtain the same result set as the following "imaginary" query?
CREATE OR REPLACE VIEW v_report AS
SELECT
meta.refnum AS refnum,
codes.svc_codes AS svc_codes
FROM
t_bill AS meta
JOIN (SELECT
string_agg(p.service_code, ':') AS svc_codes
FROM
t_bill_service_services AS p
WHERE
p.refnum = meta.refnum
) AS codes
ON meta.refnum = codes.refnum
This query is imaginary because it won't run: it fails with an error message saying that meta.refnum in the subquery's WHERE clause cannot be referenced from that part of the query.
Note 1: many columns from a variety of other tables which are also JOINed are omitted in the interest of brevity. This may preclude some simpler solutions which eliminate the subquery.
Note 2: it is possible to make this work (for some definitions of "work") by adding the p.refnum column to the subquery and doing a GROUP BY p.refnum and removing the WHERE altogether, but this of course means the entire t_bill_service_services table gets scanned and sorted -- very very slow for my situation, as the table is reasonably large.
(The SQL flavor is Postgres, but should be irrelevant, as only the string_agg() call should be non-std SQL.)
Rather than JOINing to a derived table, you can place the subquery in the SELECT clause of the query. There, a subquery can access values from the parent table, so it aggregates only the relevant entries in the other table. For example:
select meta.refnum,
(SELECT string_agg(p.service_code, ':')
FROM t_bill_service_services AS p
WHERE p.refnum = meta.refnum
) AS svc_codes
from t_bill meta
Demo on dbfiddle
What you are describing is a lateral join -- and Postgres supports these. You can write the query as:
SELECT meta.refnum AS refnum,
codes.svc_codes AS svc_codes
FROM t_bill meta CROSS JOIN LATERAL
(SELECT string_agg(p.service_code, ':') AS svc_codes
FROM t_bill_service_services p
WHERE p.refnum = meta.refnum
) codes;
In this case, a lateral join is a lot like a correlated subquery (Nick's answer). However, a lateral join is much more powerful, because it allows the subquery to return multiple columns and multiple rows.
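As a hedged illustration of that extra power (the svc_count column is an invention for this sketch; the rest follows the question's schema), one lateral subquery can hand back several values at once, which a scalar subquery in the SELECT list cannot:
SELECT meta.refnum,
       codes.svc_codes,
       codes.svc_count
FROM t_bill meta CROSS JOIN LATERAL
     -- one correlated subquery, two output columns
     (SELECT string_agg(p.service_code, ':') AS svc_codes,
             count(*) AS svc_count
      FROM t_bill_service_services AS p
      WHERE p.refnum = meta.refnum
     ) AS codes;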

Slow T-SQL query with datediff function

I have a query which runs fast when the date clause "and datediff(day, con2.DT_DateIncluded, '2017-01-01') <= 0" in the code below isn't used, but runs slowly when it is included. It runs fast, though, when I run just the "select top 2 ID_Contact..." part on its own, even including the date clause. I use this query in a classic ASP application, and it can't be converted into a stored procedure (project scope reasons). Can you help me find a way to improve the performance of the full query just by changing the query code?
select distinct top 10
ID_Contact, NO_CodCompany
from
tblContacts con1
where
ID_Contact in (select top 2 ID_Contact
from tblContacts con2
inner join tblCompanies cp on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany
and datediff(day, con2.DT_DateIncluded, '2017-01-01') <= 0)
Instead of DATEDIFF(...) <= 0, try using:
and con2.DT_DateIncluded <= '2017-01-01'
Also, ensure that there is an index on the DT_DateIncluded column.
The reason DATEDIFF() runs slow is that the calculation takes a bit of time on every row, the query optimizer is (probably) ending up running it for the entire table, and there is (probably) no index to help it select the required rows.
When you remove that clause the query runs faster, but that is probably helped along by the fact that you're only selecting the first two rows in the inner query and ten rows in the outer query, allowing a table scan to be performant enough.
This is your query:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where ID_Contact in (select top 2 ID_Contact
from tblContacts con2 inner join
tblCompanies cp
on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany and
datediff(day, con2.DT_DateIncluded, '2017-01-01') <= 0
);
My first suggestion is to change the datediff() to a simple date comparison:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where ID_Contact in (select top 2 ID_Contact
from tblContacts con2 inner join
tblCompanies cp
on con2.NO_CodCompany = cp.ID_Company
where con2.NO_CodCompany = con1.NO_CodCompany and
con2.DT_DateIncluded < '2017-01-02'
);
Then, I would remove the JOIN to tblCompanies in the subquery. I'm not 100% sure this is exactly equivalent, because that might depend on nuances in the data:
select distinct top 10 ID_Contact, NO_CodCompany
from tblContacts con1
where con1.ID_Contact in (select top 2 con2.ID_Contact
from tblContacts con2
where con2.NO_CodCompany = con1.NO_CodCompany and
con2.DT_DateIncluded < '2017-01-02'
);
Then, if you can remove the select distinct in the outermost query, you should do that.
Try this instead:
con2.DT_DateIncluded < '20170102'
It's better because it still allows the server to make use of any indexes on the DT_DateIncluded column. Currently, this is not possible. Even worse, the query is probably having to run that DATEDIFF() function on every record in the table.
Note that this is equivalent to what you posted, even if it might not match what you intended. I suspect con2.DT_DateIncluded < '20170101' is closer to what you really meant.
I also suspect you could do this either without the 2nd instance of tblContacts or with a windowing function to get much better results, or at least by using JOIN instead of IN to filter the results.
Finally, for historical reasons, when entering a date-only value, you should use the unseparated date format as described here:
The ultimate guide to the datetime datatypes
For date/time values, you can still use the separated yyyy-mm-dd hh:mm:ss you're used to, but if you only have the date part, yyyymmdd is better.
Based on this comment:
My goal with this query is to obtain contacts from companies but limited to "n" contacts per company
You should look into the APPLY operator. Unfortunately, it's still not clear to me how everything fits together, but I will at least provide a demonstration using the APPLY operator to show two contacts per company that you can use as a starting point:
SELECT TOP 10 ct.ID_Contact, ct.NO_CodCompany
FROM tblCompanies cp
CROSS APPLY (
SELECT TOP 2 ID_Contact, NO_CodCompany
FROM tblContacts
WHERE NO_CodCompany = cp.ID_Company
AND DT_DateIncluded < '20170102'
ORDER BY DT_DateIncluded DESC
) ct
APPLY works kind of like a JOIN on a nested SELECT query, where there is no ON clause; the join condition is instead included as part of the WHERE clause in the nested SELECT statement.
Note the use of CROSS. This will exclude companies that have no contacts at all. If you want to include those companies, change it to OUTER.
You should also look at what indexes you have defined. A single index on the tblContacts table that covers NO_CodCompany and DT_DateIncluded (in that order!) might work wonders for this query, especially if it also has ID_Contact in the INCLUDE clause. Then the tblContacts portion of the query could be completed entirely from the index.
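A minimal sketch of such an index, assuming the question's schema (the index name is made up):
-- Key columns ordered so the seek on NO_CodCompany can be followed by a
-- range scan on DT_DateIncluded; ID_Contact rides along in the leaf pages
-- so the subquery never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_tblContacts_Company_Date
ON tblContacts (NO_CodCompany, DT_DateIncluded)
INCLUDE (ID_Contact);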

SELECT fields from one table with aggregates from related table

Here is a simplified description of 2 tables:
CREATE TABLE jobs(id PRIMARY KEY, description);
CREATE TABLE dates(id PRIMARY KEY, job REFERENCES jobs(id), date);
There may be one or more dates per job.
I would like to create a query which generates the following (in pidgin):
jobs.id, jobs.description, min(dates.date) as start, max(dates.date) as finish
I have tried something like this:
SELECT id, description,
(SELECT min(date) as start FROM dates d WHERE d.job=j.id),
(SELECT max(date) as finish FROM dates d WHERE d.job=j.id)
FROM jobs j;
which works, but looks very inefficient.
I have tried an INNER JOIN, but can’t see how to join jobs with a suitable aggregate query on dates.
Can anybody suggest a clean efficient way to do this?
While retrieving all rows: aggregate first, join later:
SELECT id, j.description, d.start, d.finish
FROM jobs j
LEFT JOIN (
SELECT job AS id, min(date) AS start, max(date) AS finish
FROM dates
GROUP BY job
) d USING (id);
Related:
SQL: How to save order in sql query?
About JOIN .. USING
It's not a "different type of join". USING (col) is a standard SQL (!) syntax shortcut for ON a.col = b.col. More precisely, quoting the manual:
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.
It's particularly convenient that you can write SELECT * FROM ... and joining columns are only listed once.
In addition to Erwin's solution, you can also use a window clause:
SELECT j.id, j.description,
first_value(d.date) OVER w AS start,
last_value(d.date) OVER w AS finish
FROM jobs j
JOIN dates d ON d.job = j.id
WINDOW w AS (PARTITION BY j.id ORDER BY d.date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
Window functions effectively group by one or more columns (the PARTITION BY clause) and/or ORDER BY some other columns and then you can apply some window function to it, or even a regular aggregate function, without affecting grouping or ordering of any other columns (description in your case). It requires a somewhat different way of constructing queries, but once you get the idea it is pretty brilliant.
In your case you need to get the first value of a partition, which is easy because it is accessible by default. You also need to look beyond the window frame (which by default ends at the current row) to reach the last value in the partition, and for that you need the ROWS clause. Since you produce two columns using the same window definition, the named WINDOW clause is used here; if it applied to just a single column, you could simply write the window definition inline after the OVER clause, without naming it in a WINDOW clause.
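For instance, a minimal sketch of that inline form, computing only the start column with an anonymous window definition:
SELECT j.id, j.description,
       first_value(d.date) OVER (PARTITION BY j.id ORDER BY d.date) AS start
FROM jobs j
JOIN dates d ON d.job = j.id;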

Why would CROSS APPLY not be equivalent to INNER JOIN

This runs in 2 minutes:
SELECT
G.GKey,
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
INNER JOIN #g G ON
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
GROUP BY G.GKey;
This runs in 8 mins:
SELECT
G.GKey,
C.Amount
FROM
#g G
CROSS APPLY
(
SELECT
Amount = SUM(fct.AmountEUR)
FROM
WH.dbo.vw_Fact fct
WHERE
fct.DateKey >= G.Livedate AND
fct.GKey = G.GKey
) C;
These are both quite simple scripts and they look logically the same to me.
Table #G has 50 rows with a clustered index ON #G(Livedate,GKey)
Table WH.dbo.vw_Fact has a billion rows.
I actually felt initially that applying the bigger table to the small table was going to be more efficient.
My experience using CROSS APPLY is limited - is there an obvious reason (without exploring execution plans) for the slow time?
Is there a 'third way' that is likely to be quicker?
Here's the logical difference between the two joins:
CROSS APPLY: yields the aggregation for each given value of LiveDate and GKey; the subquery gets re-executed for every row of #g.
INNER JOIN: yields every matching row of vw_Fact for each value of LiveDate and GKey, then sums across common values of GKey; this creates the joined set first, then applies the aggregate.
As some of the other answers mentioned, CROSS APPLY is convenient when you join to a table-valued function that is parameterized by row-level data from another table.
Is there a third way that is faster? I would generally suggest not using open-ended operators (such as >=) in joins. Maybe try to pre-aggregate the large table on GKey and some date bucket, as in the sketch below. Also, set up a non-clustered index on LiveDate that INCLUDEs AmountEUR.
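A rough sketch of that pre-aggregation idea, assuming DateKey is already a day-level bucket (#fct_daily is a made-up temp-table name):
-- Collapse the billion-row fact table to one row per (GKey, DateKey) first,
-- then perform the range join against the much smaller intermediate table.
SELECT fct.GKey, fct.DateKey, AmountEUR = SUM(fct.AmountEUR)
INTO #fct_daily
FROM WH.dbo.vw_Fact fct
GROUP BY fct.GKey, fct.DateKey;

SELECT G.GKey, Amount = SUM(f.AmountEUR)
FROM #fct_daily f
INNER JOIN #g G
        ON f.GKey = G.GKey
       AND f.DateKey >= G.Livedate
GROUP BY G.GKey;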
I think you are trying to get a rolling sum. Use the OVER() clause. Try this:
SELECT G.GKey,
       Amount = SUM(fct.AmountEUR)
                OVER (PARTITION BY G.GKey
                      ORDER BY fct.DateKey ROWS UNBOUNDED PRECEDING)
FROM WH.dbo.vw_Fact fct
INNER JOIN #g G
        ON fct.GKey = G.GKey
APPLY works on a row-by-row basis and is useful for more complex joins such as joining on the first X rows of a table based upon a value in the first table or for joining a function with parameters.
See here for examples.
The obvious reason for the cross apply being slower is that it works on a row-by-row basis!
So for each row of #g you are running the aggregate query in the cross apply.