SQL Server: getting sum of values in "calendar" table without joining

Is it possible to get the sum of value from the calendar_table into the main_table without a join like the one below?
select date, sum(value)
from main_table
inner join calendar_table
    on start_date <= date and end_date >= date
group by date
I am trying to avoid a join like this because main_table is very large and its rows span very wide start-to-end date ranges, which is absolutely killing my performance. And I've already indexed both tables.
Sample desired results:
+-----------+-------+
| date      | total |
+-----------+-------+
| 7-24-2010 |    11 |
+-----------+-------+
Sample tables
calendar_table:
+-----------+-------+
| date      | value |
+-----------+-------+
| 7-24-2010 |     5 |
| 7-25-2010 |     6 |
| ...       |   ... |
| 7-23-2020 |     2 |
| 7-24-2020 |    10 |
+-----------+-------+
main_table:
+------------+-----------+
| start_date | end_date  |
+------------+-----------+
| 7-24-2010  | 7-25-2010 |
| 8-1-2011   | 8-5-2011  |
+------------+-----------+

You want the sum in the calendar table, so I would recommend an "incremental" approach. This starts by unpivoting the data, putting the value into the results as an increment at each row's start date and as a decrement just after its end date:
select c.date, c.value as inc
from main_table m join
     calendar_table c
     on m.start_date = c.date
union all
select dateadd(day, 1, c.date), - c.value as inc
from main_table m join
     calendar_table c
     on m.end_date = c.date;
The final step is to aggregate and do a cumulative sum:
select date, sum(inc) as value_on_date,
       sum(sum(inc)) over (order by date) as net_value
from ((select c.date, c.value as inc
       from main_table m join
            calendar_table c
            on m.start_date = c.date
      ) union all
      (select dateadd(day, 1, c.date), - c.value as inc
       from main_table m join
            calendar_table c
            on m.end_date = c.date
      )
     ) c
group by date
order by date;
This processes two rows of data for each row in the main table. Assuming your time spans typically run longer than two days per main-table row, the resulting data processed should be much smaller, and smaller data implies a faster query.

Here's a CROSS APPLY example to possibly work from. The applied subquery computes one sum per main_table row, so no outer GROUP BY is needed:
select main_table.start_date
     , main_table.end_date
     , CalendarTable.ValueSum
from main_table
cross apply (
    select sum(value) as ValueSum
    from calendar_table
    where main_table.start_date <= date and main_table.end_date >= date
) as CalendarTable;

You could try something like this ... but be aware, it is still technically 'joined' to the main table. If you look at an execution plan, you will see that there is a join operation of some kind going on.
select
    date,
    (select sum(value) from calendar_table t where m.start_date <= t.date and m.end_date >= t.date)
from
    main_table m
The thing about that query is that the 'main_table' is not grouped as part of the results. You could possibly do that outside the select, but I don't know what you are trying to achieve. If you are grouping just to get the SUM, then perhaps keeping 'main_table' in the group is superfluous.

As already mentioned, you must perform a join of some sort in order to get data from more than one table in a query.
You did not provide details of the indexes, which are important for performance. I suggest the following indexes to optimize query performance.
For calendar_table, make sure you have a unique clustered index (or primary key) on date. Alternatively, a unique nonclustered index on date with the value column included.
A composite index on the main_table start_date and end_date columns may also be beneficial.
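For illustration, those suggestions as rough DDL (the index names here are made up):
CREATE UNIQUE CLUSTERED INDEX cdx_calendar_date ON dbo.calendar_table(date);
-- alternatively, the nonclustered variant with value included:
-- CREATE UNIQUE NONCLUSTERED INDEX ix_calendar_date ON dbo.calendar_table(date) INCLUDE (value);
CREATE NONCLUSTERED INDEX ix_main_start_end ON dbo.main_table(start_date, end_date);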
Even with optimal indexes, the query will still take some time against a 500M row table (e.g. a couple of minutes) with no additional filter criteria. If you need results in milliseconds, create an indexed view to materialize the join and aggregation results. Be aware the indexed view will add overhead for inserts/deletes on both tables as well as for updates to the value column in order to keep the index consistent with the underlying data.
Below is an indexed view DDL example.
CREATE VIEW dbo.vw_example
WITH SCHEMABINDING
AS
SELECT
    date, SUM(value) AS value, COUNT_BIG(*) AS countbig
FROM
    dbo.main_table
INNER JOIN
    dbo.calendar_table ON start_date <= date AND end_date >= date
GROUP BY
    date;
GO
CREATE UNIQUE CLUSTERED INDEX cdx ON dbo.vw_example(date);
GO
Depending on your SQL Server edition, the optimizer may be able to use the indexed view automatically so your original query can use the view index without changes. Otherwise, query the view directly and specify a NOEXPAND hint:
SELECT date, value AS total
FROM dbo.vw_example WITH (NOEXPAND);
EDIT:
With the query improvement @GordonLinoff suggested, a non-clustered index on the main_table end_date column will help optimize that query.
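For example (the index name is only illustrative):
CREATE NONCLUSTERED INDEX ix_main_table_end_date ON dbo.main_table(end_date);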

Related

Using INNER JOIN resulting table in another INNER JOIN

I'm not really sure the title actually corresponds to my problem; maybe my approach is wrong.
I have the following database structure:
TABLE producers
    id
TABLE data
    id
    date
    value
    producer_id (OneToMany)
First things first: for each producer, I want to get the latest date of registered data. The code below does exactly this:
SELECT producers.id AS producer_id, max.date AS max_date
FROM producers
INNER JOIN data ON producers.id = data.producer_id
INNER JOIN (
    SELECT producer_id, MAX(date) AS date
    FROM data
    GROUP BY producer_id
) AS max USING (producer_id, date)
And the resulting table is:
-----------------------------------------------
| producer_id | max_date                      |
-----------------------------------------------
| 5           | 2022-01-01 01:45:00.000 +0000 |
| 7           | 2022-01-01 01:45:00.000 +0000 |
| 14          | 2022-01-01 01:45:00.000 +0000 |
| 15          | 2022-01-01 01:45:00.000 +0000 |
| 17          | 2022-01-01 01:45:00.000 +0000 |
-----------------------------------------------
The next thing I need is to SUM all the data records per producer with a date bigger than the max_date we got for each producer from the previous query. The SUM() will be performed on the value column.
Hopefully that was clear; if not, let me know. I've tried doing another INNER JOIN and using the max table in the WHERE clause, but I got an error telling me that the table was there but couldn't be used in that part of the query.
Maybe another INNER JOIN isn't the solution. Here I'm limited by my knowledge of SQL, and I don't really know which keywords to read up on to understand the best approach. So, any info to point me down the right path would be really helpful.
Thanks in advance.
EDIT: Forgot to specify which column the SUM() will be executed on.
EDIT 2: Just realized that the result of what I'm asking here will always be an empty table, because there won't ever be a record with a bigger date. When I wrote the simplified version of my database, I forgot to add a table/join; that's why. But in the end the approach/solution should still be the same, just applied to a different table. Sorry for that again.
The first query in the question can be greatly simplified using distinct on and order by:
select distinct on (p.id)
p.id, d.date
from producers p
join data d on p.id = d.producer_id
order by p.id, d.date desc;
As for "SUM all the data records per producer WITH date bigger than the max_date" - well, none exists with date bigger than the latest one. Here is a query to do so (even the result will be empty)
select producer_id, sum(value)
from data d inner join -- the query above follows
(
    select distinct on (p.id)
           p.id producer_id, d.date
    from producers p
    join data d on p.id = d.producer_id
    order by p.id, d.date desc
) t using (producer_id)
where d.date > t.date
group by producer_id;

Join future dates to table which only has dates until current day

I have these two tables:
table1: name (string), actual (double), yyyy_mm_dd (date)
table2: name (string), expected(double), yyyy_mm_dd (string)
table1 contains data from 2018-01-01 up until the current day, table2 contains predicted data for the year of 2020. My problem is that table1 doesn’t have any date values past the present date, so I get duplicate data when joining like below:
SELECT
    kpi.yyyy_mm_dd,
    kpi.name,
    kpi.actual as actual,
    pre.predicted as predicted
FROM
    schema1.table1 kpi
LEFT JOIN
    schema1.table2 pre
    ON pre.name = kpi.name --AND pre.yyyy_mm_dd = kpi.yyyy_mm_dd
WHERE
    kpi.yyyy_mm_dd >= '2019-12-09'
Output:
+----------+---------+--------+-----------+
|yyyy_mm_dd| name    | actual | predicted |
+----------+---------+--------+-----------+
|2019-12-10| Company | 100000 | 925,180   |
|2019-12-10| Company | 100000 | 1,145,723 |
|2019-12-10| Company | 100000 | 456,359   |
+----------+---------+--------+-----------+
If I uncomment the AND condition in my join clause, I won’t get the predicted values as my first table has no 2020 data. How can I join these tables together without duplicating actual values? actual should be null for days which haven't happened yet.
I think you want UNION ALL and not a JOIN:
SELECT
    yyyy_mm_dd,
    name,
    actual as actual,
    NULL as predicted
FROM schema1.table1
WHERE yyyy_mm_dd >= '2019-12-09'
UNION ALL
SELECT
    yyyy_mm_dd,
    name,
    NULL as actual,
    predicted as predicted
FROM schema1.table2
Hive supports full join:
SELECT COALESCE(kpi.yyyy_mm_dd, pre.yyyy_mm_dd) as yyyy_mm_dd,
       COALESCE(kpi.name, pre.name) as name,
       kpi.actual as actual,
       pre.predicted as predicted
FROM (SELECT kpi.*
      FROM schema1.table1 kpi
      WHERE kpi.yyyy_mm_dd >= '2019-12-09'
     ) kpi FULL JOIN
     schema1.table2 pre
     ON kpi.name = pre.name AND
        kpi.yyyy_mm_dd = pre.yyyy_mm_dd
Try using a group by clause in your query; the below might solve your problem (note that every selected column must appear in the group by):
SELECT
    kpi.yyyy_mm_dd,
    kpi.name,
    kpi.actual as actual,
    pre.predicted as predicted
FROM
    schema1.table1 kpi
LEFT JOIN
    schema1.table2 pre
    ON pre.name = kpi.name
group by kpi.yyyy_mm_dd, kpi.name, kpi.actual, pre.predicted

Getting the latest entry per day / SQL Optimizing

Given the following database table, which records events (status) for different objects (id) with its timestamp:
ID | Date       | Time | Status
-------------------------------
7  | 2016-10-10 | 8:23 | Passed
7  | 2016-10-10 | 8:29 | Failed
7  | 2016-10-13 | 5:23 | Passed
8  | 2016-10-09 | 5:43 | Passed
I want to get a result table using plain SQL (MS SQL) like this:
ID | Date       | Status
------------------------
7  | 2016-10-10 | Failed
7  | 2016-10-13 | Passed
8  | 2016-10-09 | Passed
where the "status" is the latest entry on a day, given that at least one event for this object has been recorded.
My current solution is using "Outer Apply" and "TOP(1)" like this:
SELECT DISTINCT rn.id,
tmp.date,
tmp.status
FROM run rn OUTER apply
(SELECT rn2.date, tmp2.status AS 'status'
FROM run rn2 OUTER apply
(SELECT top(1) rn3.id, rn3.date, rn3.time, rn3.status
FROM run rn3
WHERE rn3.id = rn.id
AND rn3.date = rn2.date
ORDER BY rn3.id ASC, rn3.date + rn3.time DESC) tmp2
WHERE tmp2.status <> '' ) tmp
As far as I understand this outer apply command works like:
For every id
    For every recorded day for this id
        Select the newest status for this day and this id
But I'm facing performance issues, so I think this solution is not adequate. Any suggestions on how to solve this problem or how to optimize the SQL?
Your code seems too complicated. Why not just do this?
SELECT DISTINCT r.id, r.date, r2.status
FROM run r OUTER APPLY
     (SELECT TOP 1 r2.*
      FROM run r2
      WHERE r2.id = r.id AND r2.date = r.date AND r2.status <> ''
      ORDER BY r2.time DESC
     ) r2;
For performance, I would suggest an index on run(id, date, status, time).
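That suggestion as DDL (the index name is made up):
CREATE INDEX ix_run_id_date_status_time ON run(id, date, status, time);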
Using a CTE will probably be the fastest:
with cte as
(
    select ID, Date, Status,
           row_number() over (partition by ID, Date order by Time desc) rn
    from run
)
select ID, Date, Status
from cte
where rn = 1
Do not SELECT from a log table; instead, write a trigger that updates a latest_run table, like:
CREATE TRIGGER tr_run_insert ON run FOR INSERT AS
BEGIN
    UPDATE lr
    SET Status = i.Status
    FROM latest_run lr
    INNER JOIN INSERTED i ON lr.ID = i.ID AND lr.Date = i.Date
    IF @@ROWCOUNT = 0
        INSERT INTO latest_run (ID, Date, Status)
        SELECT ID, Date, Status FROM INSERTED
END
Then perform reads from the much shorter latest_run table.
This will add a performance penalty on writes because you'll need two writes instead of one, but it will give you much more stable response times on reads. And if you do not need to SELECT from the "run" table, you can avoid indexing it, so the penalty of the second write is partly compensated by less index maintenance.
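The latest_run table itself isn't defined in the answer; a minimal sketch, with assumed column types, could be:
CREATE TABLE latest_run (
    ID int NOT NULL,
    Date date NOT NULL,
    Status varchar(10) NOT NULL,
    PRIMARY KEY (ID, Date)  -- one row per object per day
);
Reads then become cheap keyed lookups, e.g. SELECT Status FROM latest_run WHERE ID = 7 AND Date = '2016-10-13'.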

How to query partitions that you get from using window functions?

I have a table that has the following structure
------------------------------------------
| Company_ID | Company_Name | Join_Key   |
------------------------------------------
| 1          | ACompany     | AC         |
| 2          | BCompany     | BC         |
While this table doesn't have many columns, there are somewhere around 4 million rows.
I want to run some string distance calculations on these company names. I have the following query:
select a.Company_Name as Name1,
       b.Company_Name as Name2,
       Fuzzy_Match(a.Company_Name, b.Company_Name, 'JaccardDistance') as Jaccard -- this is a custom function
from [Companies] a, [Companies] b
While something like this would work on a smaller database, mine is so large that there is no way to get through all of the combinations in a reasonable amount of time. So I thought about partitioning the data with a window function.
select Company_Name,
       ROW_NUMBER() over(partition by Join_Key order by Join_Key asc) as row_num,
       Join_Key
from [Companies]
This gives me a list of the companies, numbered and partitioned by their join_key, but what I'm not sure of is how to combine the two.
How can I perform a cross join and calculate the string similarity measures for each partition so that I'm only comparing companies that both have 'AC' as their join key?
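A minimal sketch of that idea, reusing the asker's custom Fuzzy_Match function: a self-join on Join_Key restricts comparisons to companies in the same partition, and an inequality on Company_ID avoids comparing each pair twice:
select a.Company_Name as Name1,
       b.Company_Name as Name2,
       Fuzzy_Match(a.Company_Name, b.Company_Name, 'JaccardDistance') as Jaccard
from [Companies] a
join [Companies] b
    on b.Join_Key = a.Join_Key       -- compare only within the same join-key partition
   and b.Company_ID > a.Company_ID;  -- skip self-pairs and mirrored duplicates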

Adding in missing dates from results in SQL

I have a database that currently looks like this
Date     | valid_entry | profile
1/6/2015 | 1           | 1
3/6/2015 | 2           | 1
3/6/2015 | 2           | 2
5/6/2015 | 4           | 4
I am trying to grab the dates, but I need the query to also display dates that do not exist in the list, such as 2/6/2015.
This is a sample of what i need it to be:
Date     | valid_entry
1/6/2015 | 1
2/6/2015 | 0
3/6/2015 | 2
3/6/2015 | 2
4/6/2015 | 0
5/6/2015 | 4
My query:
select date, count(valid_entry)
from database
where profile = 1
group by 1;
This query will only display the dates that exist in the table. Is there a way to populate the results with the dates that do not exist in it?
You can generate a list of all dates that are between the start and end date from your source table using generate_series(). These dates can then be used in an outer join to sum the values for all dates.
with all_dates (date) as (
select dt::date
from generate_series( (select min(date) from some_table), (select max(date) from some_table), interval '1' day) as x(dt)
)
select ad.date, sum(coalesce(st.valid_entry,0))
from all_dates ad
left join some_table st on ad.date = st.date
group by ad.date, st.profile
order by ad.date;
some_table is your table with the sample data you have provided.
Based on your sample output, you also seem to want group by date and profile, otherwise there can't be two rows with 2015-06-03. You also don't seem to want where profile = 1 because that as well wouldn't generate two rows with 2015-06-03 as shown in your sample output.
SQLFiddle example: http://sqlfiddle.com/#!15/b0b2a/2
Unrelated, but: I hope that the column names are only made up. date is a horrible name for a column. For one because it is also a keyword, but more importantly it does not document what this date is for. A start date? An end date? A due date? A modification date?
You have to use a calendar table for this purpose. In this case you can create an in-line table with the dates required, then LEFT JOIN your table to it:
select "date", count(valid_entry)
from (
SELECT '2015-06-01' AS d UNION ALL '2015-06-02' UNION ALL '2015-06-03' UNION ALL
'2015-06-04' UNION ALL '2015-06-05' UNION ALL '2015-06-06') AS t
left join database AS db on t.d = db."date" and db.profile = 1
group by t.d;
Note: Predicate profile = 1 should be applied in the ON clause of the LEFT JOIN operation. If it is placed in the WHERE clause instead then LEFT JOIN essentially becomes an INNER JOIN.
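For contrast, here is the same query with the predicate moved into the WHERE clause (shortened to two dates): the NULL-extended rows produced for missing dates fail the db.profile = 1 test, so the gap dates silently disappear from the result:
select t.d, count(db.valid_entry)
from (
    SELECT '2015-06-01' AS d UNION ALL SELECT '2015-06-02'
) AS t
left join database AS db on t.d = db."date"
where db.profile = 1  -- filters out the NULL-extended row, so 2015-06-02 vanishes
group by t.d;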