Join future dates to a table which only has dates until the current day - SQL

I have these two tables:
table1: name (string), actual (double), yyyy_mm_dd (date)
table2: name (string), predicted (double), yyyy_mm_dd (string)
table1 contains data from 2018-01-01 up to the current day, while table2 contains predicted data for the year 2020. My problem is that table1 doesn't have any date values past the present date, so I get duplicated rows when joining like below:
SELECT
    kpi.yyyy_mm_dd,
    kpi.name,
    kpi.actual AS actual,
    pre.predicted AS predicted
FROM
    schema1.table1 kpi
LEFT JOIN
    schema1.table2 pre
    ON pre.name = kpi.name --AND pre.yyyy_mm_dd = kpi.yyyy_mm_dd
WHERE
    kpi.yyyy_mm_dd >= '2019-12-09'
Output:
+----------+---------+--------+-----------+
|yyyy_mm_dd| name    | actual | predicted |
+----------+---------+--------+-----------+
|2019-12-10| Company | 100000 |   925,180 |
|2019-12-10| Company | 100000 | 1,145,723 |
|2019-12-10| Company | 100000 |   456,359 |
+----------+---------+--------+-----------+
If I uncomment the AND condition in my join clause, I won't get the predicted values at all, as my first table has no 2020 data. How can I join these tables without duplicating the actual values? actual should be NULL for days that haven't happened yet.

I think you want UNION ALL and not a JOIN:
SELECT
    yyyy_mm_dd,
    name,
    actual AS actual,
    NULL AS predicted
FROM schema1.table1
WHERE yyyy_mm_dd >= '2019-12-09'
UNION ALL
SELECT
    yyyy_mm_dd,
    name,
    NULL AS actual,
    predicted AS predicted
FROM schema1.table2
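If you then want actual and predicted side by side in one row per date and name, one option is to aggregate over that union. A minimal sketch of the idea (assuming the yyyy_mm_dd values in the two tables compare consistently, since table2 stores them as strings):
SELECT
    yyyy_mm_dd,
    name,
    MAX(actual) AS actual,        -- stays NULL for dates with no actuals yet
    MAX(predicted) AS predicted   -- stays NULL for dates with no prediction
FROM (
    SELECT yyyy_mm_dd, name, actual, CAST(NULL AS double) AS predicted
    FROM schema1.table1
    WHERE yyyy_mm_dd >= '2019-12-09'
    UNION ALL
    SELECT yyyy_mm_dd, name, CAST(NULL AS double) AS actual, predicted
    FROM schema1.table2
) u
GROUP BY yyyy_mm_dd, name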

Hive supports FULL JOIN:
SELECT COALESCE(kpi.yyyy_mm_dd, pre.yyyy_mm_dd) AS yyyy_mm_dd,
       COALESCE(kpi.name, pre.name) AS name,
       kpi.actual AS actual,
       pre.predicted AS predicted
FROM (SELECT kpi.*
      FROM schema1.table1 kpi
      WHERE kpi.yyyy_mm_dd >= '2019-12-09'
     ) kpi FULL JOIN
     schema1.table2 pre
     ON kpi.name = pre.name AND
        kpi.yyyy_mm_dd = pre.yyyy_mm_dd

Try using a GROUP BY clause in your query; something like the below might solve your problem:
SELECT
    kpi.yyyy_mm_dd,
    kpi.name,
    kpi.actual AS actual,
    MAX(pre.predicted) AS predicted
FROM
    schema1.table1 kpi
LEFT JOIN
    schema1.table2 pre
    ON pre.name = kpi.name
GROUP BY kpi.yyyy_mm_dd, kpi.name, kpi.actual

Related

SQL Server: getting sum of values in "calendar" table without joining

Is it possible to get the sum of values from the calendar_table into the main_table without a join like the one below?
select
date, sum(value)
from
main_table
inner join
calendar_table on start_date <= date and end_date >= date
group by
date
I am trying to avoid a join like this because main_table is a very large table whose rows span very wide start/end date ranges, and it is absolutely killing my performance. I have already indexed both tables.
Sample desired results:
+-----------+-------+
| date      | total |
+-----------+-------+
| 7-24-2010 | 11    |
+-----------+-------+
Sample tables
calendar_table:
+-----------+-------+
| date      | value |
+-----------+-------+
| 7-24-2010 | 5     |
| 7-25-2010 | 6     |
| ...       | ...   |
| 7-23-2020 | 2     |
| 7-24-2020 | 10    |
+-----------+-------+
main_table:
+------------+-----------+
| start_date | end_date  |
+------------+-----------+
| 7-24-2010  | 7-25-2010 |
| 8-1-2011   | 8-5-2011  |
+------------+-----------+
You want the sum in the calendar table. So, I would recommend an "incremental" approach. This starts by unpivoting the data and putting the value as an increment and decrement in the results:
select c.date, c.value as inc
from main_table m join
     calendar_table c
     on m.start_date = c.date
union all
select dateadd(day, 1, c.date), - c.value as inc
from main_table m join
     calendar_table c
     on m.end_date = c.date;
The final step is to aggregate and do a cumulative sum:
select date, sum(inc) as value_on_date,
       sum(sum(inc)) over (order by date) as net_value
from ((select c.date, c.value as inc
       from main_table m join
            calendar_table c
            on m.start_date = c.date
      ) union all
      (select dateadd(day, 1, c.date), - c.value as inc
       from main_table m join
            calendar_table c
            on m.end_date = c.date
      )
     ) c
group by date
order by date;
This processes two rows of data for each row in the master table. Assuming that your time spans are typically longer than two days for each master row, the resulting data processed should be much smaller, and smaller data implies a faster query.
Here's a cross-apply example to possibly work from.
select main_table.start_date
     , main_table.end_date
     , CalendarTable.ValueSum
from main_table
CROSS APPLY (
    SELECT SUM(value) as ValueSum
    FROM calendar_table
    WHERE main_table.start_date <= calendar_table.date
      and main_table.end_date >= calendar_table.date
) as CalendarTable
group by main_table.start_date, main_table.end_date, CalendarTable.ValueSum
You could try something like this ... but be aware, it is still technically 'joined' to the main table. If you look at an execution plan, you will see that there is a join operation of some kind going on.
select
    m.start_date,
    m.end_date,
    (select sum(value) from calendar_table t where m.start_date <= t.date and m.end_date >= t.date) as total
from
    main_table m
The thing about that query is that main_table is not grouped as part of the results. You could possibly do that outside the select, but I don't know exactly what you are trying to achieve. If you are grouping just to get the SUM, then keeping main_table in the grouping may be superfluous.
As already mentioned, you must perform a join of some sort in order to get data from more than one table in a query.
You did not provide details of the existing indexes, which are important for performance. I suggest the following indexes to optimize query performance.
For calendar_table, make sure you have a unique clustered index (or primary key) on date. Alternatively, a unique nonclustered index on date with the value column included.
A composite index on the main_table start_date and end_date columns may also be beneficial.
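For illustration, those suggestions could be written as the following DDL (a sketch; the index names are made up):
-- Unique clustered index (or primary key) on the calendar date
CREATE UNIQUE CLUSTERED INDEX cdx_calendar_table_date
    ON dbo.calendar_table(date);

-- Alternative: a unique nonclustered index on date with value included
-- CREATE UNIQUE NONCLUSTERED INDEX ix_calendar_table_date
--     ON dbo.calendar_table(date) INCLUDE (value);

-- Composite index on the main_table date range columns
CREATE NONCLUSTERED INDEX ix_main_table_start_end
    ON dbo.main_table(start_date, end_date);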
Even with optimal indexes, the query will still take some time against a 500M row table (e.g. a couple of minutes) with no additional filter criteria. If you need results in milliseconds, create an indexed view to materialize the join and aggregation results. Be aware the indexed view will add overhead for inserts/deletes on both tables as well as for updates to the value column in order to keep the index consistent with the underlying data.
Below is an indexed view DDL example.
CREATE VIEW dbo.vw_example
WITH SCHEMABINDING
AS
SELECT
date, sum(value) AS value, COUNT_BIG(*) AS countbig
from
dbo.main_table
inner join
dbo.calendar_table on start_date <= date and end_date >= date
group by
date;
GO
CREATE UNIQUE CLUSTERED INDEX cdx ON dbo.vw_example(date);
GO
Depending on your SQL Server edition, the optimizer may be able to use the indexed view automatically so your original query can use the view index without changes. Otherwise, query the view directly and specify a NOEXPAND hint:
SELECT date, value AS total
FROM dbo.vw_example WITH (NOEXPAND);
EDIT:
With the query improvement @GordonLinoff suggested, a non-clustered index on the main_table end_date column will help optimize that query.
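That index might look like the following (the index name here is just an example):
CREATE NONCLUSTERED INDEX ix_main_table_end_date
    ON dbo.main_table(end_date);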

SQL join to and from dates - return most recent if no match found

I have two tables that I need to join. I have:
LEFT JOIN AutoBAF on (GETDATE() BETWEEN AutoBAF.FromDate and AutoBAF.ToDate)
and I get the expected result. Now if no matching record is found between the two dates (AutoBAF.FromDate and AutoBAF.ToDate) I would like to join the most recent matching record instead.
Can anyone point me in the right direction?
I am using a MS SQL database hosted in Azure.
A small example of what I am trying to achieve:
Table Product:
Product | Description
A       | Product A

Table Price:
Product | FromDate | ToDate   | Price
A       | 01-01-20 | 31-01-20 | 100
A       | 01-02-20 | 28-02-20 | 110
I need a query that will return the price according to the date returned by GETDATE().
If I run the query on 15-01-20 I should get:
Product | Description | Price
A       | Product A   | 100
If I run the query on 15-02-20 I should get:
Product | Description | Price
A       | Product A   | 110
And finally, if I run the query on 15-03-20 there is no matching price in the Price table. Instead of returning NULL, I would like to "fall back" to the most recent known price, which in this example is 110.
This is not the fastest query because it joins products with all records that have future dates, but if your tables are small, it works.
SELECT product.product, product.description, isnull(pr_curr.price, pr_fut.price) as price
FROM product
left join PRICE pr_curr on product.product=pr_curr.product
and GETDATE() BETWEEN pr_curr.FromDate and pr_curr.ToDate
left join PRICE pr_fut on product.product=pr_fut.product
and GETDATE() < pr_fut.FromDate
where pr_fut.FromDate = (
select min(FromDate) from PRICE dates
where dates.product=pr_fut.product and dates.FromDate>GETDATE()
) or pr_fut.FromDate is null
This looks like SQL Server code, which supports lateral joins via the apply keyword. Assuming you want only one match:
from product p outer apply
(select top (1) ab.*
from autobaf ab
where ab.product = p.product and
getdate() <= ab.todate
order by ab.todate desc
) ab
Note that this correlates on the product, which is not part of your question.
If that is not necessary, then you can use:
from t left join
(select top (1) ab.*
from autobaf ab
where getdate() <= ab.todate
order by ab.todate desc
) ab
on 1 = 1
If you know that there is some record in the past, then you can use cross join instead of left join and dispense with the on clause.
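A minimal sketch of that cross join variant, reusing the fragment above (no on clause is needed):
from t cross join
     (select top (1) ab.*
      from autobaf ab
      where getdate() <= ab.todate
      order by ab.todate desc
     ) ab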
A variant that falls back to the most recent past price instead of the next future one:
SELECT product.product, product.description, isnull(pr_curr.price, pr_fut.price) as price
FROM product
left join PRICE pr_curr on product.product=pr_curr.product
and GETDATE() BETWEEN pr_curr.FromDate and pr_curr.ToDate
left join PRICE pr_fut on product.product=pr_fut.product
and GETDATE() > pr_fut.FromDate
where pr_fut.FromDate = (
select max(FromDate) from PRICE dates
where dates.product=pr_fut.product and dates.FromDate<GETDATE()
) or pr_fut.FromDate is null

nest a join on a field not displayed into this query with a group by

Current Query:
Select Name, Charge_code, Charge, Max(Mod) as Mod, Max(date) as Date
From Table1
Where Name is not Null and Name <> '' and Charge is not Null and Charge_code is not null
Group by Name, Charge_Code, Charge
In this same Table1 I also have a name identifier number titled "IDNum". A separate table (Table2) has that same identifier number "IDNum" along with a LocationID column. In Table3 that LocationID is attached to an actual location name, "Location_Name". This is what I would like to join onto my data set.
Ultimately I'm wanting to return the following Results:
Name | Location | Charge Code | Charge | Mod | Date
How can I nest a join into my existing query to pull back the location name based on that identifier number, but not have the identifier number displayed in the results? Sorry if this is an easy question; I am new.
Thanks
tried this to no avail:
Select Name, Charge_code, Charge, Max(Mod)Mod, Max(date)Date, Location_Name
From Table1
Join Location table1.IDNum = Table2.IDNum = Table3.Location_Name
Where Name is not Null and Name <>'' and Charge is not Null and charge_code is not null
Group by Name, Charge_Code, Charge
If you want to join the two extra tables, then you need to join twice: Table2 is connected to Table1, and Table3 is connected to Table2. This way, you can get the data from Table3:
Select Name, Table3.Location_Name as Location,
    Charge_code as 'Charge Code', Charge,
    Max(Mod) as Mod, Max(date) as Date
From Table1
Left Join Table2 on Table1.IDNum = Table2.IDNum
Left Join Table3 on Table2.LocationID = Table3.LocationID
Where Name is not Null and Name <> '' and Charge is not Null and Charge_code is not null
Group by Name, Table3.Location_Name, Charge_Code, Charge
Here you can learn more about JOIN, and then decide which type of JOIN suits your case best.

Adding in missing dates from results in SQL

I have a database that currently looks like this
Date     | valid_entry | profile
1/6/2015 | 1           | 1
3/6/2015 | 2           | 1
3/6/2015 | 2           | 2
5/6/2015 | 4           | 4
I am trying to grab the dates, but I need the query to also display dates that do not exist in the list, such as 2/6/2015.
This is a sample of what I need it to be:
Date     | valid_entry
1/6/2015 | 1
2/6/2015 | 0
3/6/2015 | 2
3/6/2015 | 2
4/6/2015 | 0
5/6/2015 | 4
My query:
select date, count(valid_entry)
from database
where profile = 1
group by 1;
This query will only display the dates that exist in the table. Is there a way in the query that I can populate the results with dates that do not exist in the table?
You can generate a list of all dates that are between the start and end date from your source table using generate_series(). These dates can then be used in an outer join to sum the values for all dates.
with all_dates (date) as (
    select dt::date
    from generate_series((select min(date) from some_table),
                         (select max(date) from some_table),
                         interval '1' day) as x(dt)
)
select ad.date, sum(coalesce(st.valid_entry, 0))
from all_dates ad
    left join some_table st on ad.date = st.date
group by ad.date, st.profile
order by ad.date;
some_table is your table with the sample data you have provided.
Based on your sample output, you also seem to want group by date and profile, otherwise there can't be two rows with 2015-06-03. You also don't seem to want where profile = 1 because that as well wouldn't generate two rows with 2015-06-03 as shown in your sample output.
SQLFiddle example: http://sqlfiddle.com/#!15/b0b2a/2
Unrelated, but: I hope that the column names are only made up. date is a horrible name for a column. For one because it is also a keyword, but more importantly it does not document what this date is for. A start date? An end date? A due date? A modification date?
You have to use a calendar table for this purpose. In this case you can create an in-line table with the dates required, then LEFT JOIN your table to it:
select t.d as "date", count(valid_entry)
from (
    SELECT '2015-06-01' AS d UNION ALL SELECT '2015-06-02' UNION ALL SELECT '2015-06-03' UNION ALL
    SELECT '2015-06-04' UNION ALL SELECT '2015-06-05' UNION ALL SELECT '2015-06-06') AS t
left join database AS db on t.d = db."date" and db.profile = 1
group by t.d;
Note: Predicate profile = 1 should be applied in the ON clause of the LEFT JOIN operation. If it is placed in the WHERE clause instead then LEFT JOIN essentially becomes an INNER JOIN.
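For illustration, this is the version to avoid (a sketch reusing the in-line calendar from above): with the predicate in the WHERE clause, the NULL-extended rows for dates with no match are filtered out again, so the query behaves like an INNER JOIN.
select t.d as "date", count(valid_entry)
from (
    SELECT '2015-06-01' AS d UNION ALL SELECT '2015-06-02' UNION ALL
    SELECT '2015-06-03' UNION ALL SELECT '2015-06-04') AS t
left join database AS db on t.d = db."date"
where db.profile = 1   -- removes rows where db.* is NULL, so the missing dates disappear
group by t.d;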

SQL to get minimum of two different fields

I have two different tables to track location of equipment. The "equipment" table tracks the current location and when it was installed there. If the equipment was previously at a different location, that information is kept in the "locationHistory" table. There is one row per equip_id in the equipment table. There can be 0 or more entries for each equip_id in the locationHistory table.
equipment
equip_id
current_location
install_date_at_location
locationHistory
equip_id
location
install_date
pickup_date
I want an SQL query that gets the date of the FIRST install_date for each piece of equipment...
Example:
equipment
=========
equip_id | current_location | install_date_at_location
123      | location1        | 1/23/2011

locationHistory
===============
equip_id | location  | install_date | pickup_date
123      | location2 | 1/1/2011     | 1/5/2011
123      | location3 | 1/7/2011     | 1/20/2011

Should return: 123, 1/1/2011
Thoughts?
You will want to union the queries that each look at one field, then use a MIN against it.
Or you can use CASE and MIN for the same effect:
select e.equip_id, MIN(CASE WHEN h.install_date < e.install_date_at_location
THEN h.install_date
ELSE e.install_date_at_location
END) as first_install_date
from equipment e
left join locationHistory h on h.equip_id = e.equip_id
group by e.equip_id
Well, the critical piece of information is whether the install_date_at_location in equipment can ever be less than what I assume is the historical information in locationHistory. If that's not possible, you can do:
SELECT *
FROM locationHistory L INNER JOIN
     (SELECT equip_id, MIN(install_date) AS firstDate
      FROM locationHistory
      GROUP BY equip_id
     ) AS F
     ON L.equip_id = F.equip_id AND L.install_date = F.firstDate
But if you have to worry about both tables, you need to create a view that normalizes the tables for you, and then apply the query against the view:
CREATE VIEW normalLocations (equip_id, location, install_date) AS
    SELECT equip_id, current_location, install_date_at_location FROM equipment
    UNION ALL
    SELECT equip_id, location, install_date FROM locationHistory;

SELECT *
FROM normalLocations L INNER JOIN
     (SELECT equip_id, MIN(install_date) AS firstDate
      FROM normalLocations
      GROUP BY equip_id
     ) AS F
     ON L.equip_id = F.equip_id AND L.install_date = F.firstDate
A simple way to do it is:
SELECT U.Equip_ID, MIN(U.Install_Date)
FROM (SELECT E.Equip_ID, E.Install_Date_At_Location AS Install_Date
FROM Equipment AS E
UNION
SELECT L.Equip_ID, L.Install_Date
FROM LocationHistory AS L
) AS U
GROUP BY U.Equip_ID
This could generate a lot of rows from the LocationHistory table, but it isn't clear that it is worth 'optimizing' it by applying a GROUP BY and MIN to the second half of the UNION (because you'd immediately redo the grouping when combining that result with the information from the equipment table).
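For completeness, that pre-aggregated variant would look roughly like this (a rough sketch; as noted above, it is probably not worth it):
SELECT U.Equip_ID, MIN(U.Install_Date)
FROM (SELECT E.Equip_ID, E.Install_Date_At_Location AS Install_Date
      FROM Equipment AS E
      UNION ALL
      -- pre-aggregate the history before the UNION ...
      SELECT L.Equip_ID, MIN(L.Install_Date) AS Install_Date
      FROM LocationHistory AS L
      GROUP BY L.Equip_ID
     ) AS U
-- ... but the outer GROUP BY still has to run over the combined rows
GROUP BY U.Equip_ID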