Find the list of dates in a table closest to a specific date from a different table - SQL

I have a list of unique ID's in one table that has a date column. Example:
TABLE1
ID Date
0 2018-01-01
1 2018-01-05
2 2018-01-15
3 2018-01-06
4 2018-01-09
5 2018-01-12
6 2018-01-15
7 2018-01-02
8 2018-01-04
9 2018-02-25
Then in another table I have a list of different values that appear multiple times for each ID with various dates.
TABLE 2
ID Value Date
0 18 2017-11-28
0 24 2017-12-29
0 28 2018-01-06
1 455 2018-01-03
1 468 2018-01-16
2 55 2018-01-03
3 100 2017-12-27
3 110 2018-01-04
3 119 2018-01-10
3 128 2018-01-30
4 223 2018-01-01
4 250 2018-01-09
4 258 2018-01-11
etc
I want to find the value in table 2 that is closest to the unique date in table 1.
Sometimes table 2 does contain a value that matches the date exactly and I have had no problem in pulling through those values. But I can't work out the code to pull through the value closest to the date requested from table 1.
My desired result based on the examples above would be
ID Value Date
0 24 2017-12-29
1 455 2018-01-03
2 55 2018-01-03
3 110 2018-01-04
4 250 2018-01-09
Since I can easily find the IDs with an exact match, one thing I have tried is taking the IDs that don't have an exact date match and placing them, with their corresponding values, into a temporary table, then trying to find the closest possible match for those. But it's here that I'm not sure where to begin with the coding.
Apologies if I'm missing a basic function or clause for this, I'm still learning!

The below would be one method:
WITH Table1 AS(
    SELECT ID, CONVERT(date, DateColumn) DateColumn
    FROM (VALUES (0,'20180101'),
                 (1,'20180105'),
                 (2,'20180115'),
                 (3,'20180106'),
                 (4,'20180109'),
                 (5,'20180112'),
                 (6,'20180115'),
                 (7,'20180102'),
                 (8,'20180104'),
                 (9,'20180225')) V(ID, DateColumn)),
Table2 AS(
    SELECT ID, [Value], CONVERT(date, DateColumn) DateColumn
    FROM (VALUES (0,18 ,'2017-11-28'),
                 (0,24 ,'2017-12-29'),
                 (0,28 ,'2018-01-06'),
                 (1,455,'2018-01-03'),
                 (1,468,'2018-01-16'),
                 (2,55 ,'2018-01-03'),
                 (3,100,'2017-12-27'),
                 (3,110,'2018-01-04'),
                 (3,119,'2018-01-10'),
                 (3,128,'2018-01-30'),
                 (4,223,'2018-01-01'),
                 (4,250,'2018-01-09'),
                 (4,258,'2018-01-11')) V(ID, [Value], DateColumn))
SELECT T1.ID,
       T2.[Value],
       T2.DateColumn
FROM Table1 T1
     CROSS APPLY (SELECT TOP 1 *
                  FROM Table2 ca
                  WHERE T1.ID = ca.ID
                  ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn))) T2;
Note that if the difference in days is the same, the row returned will be arbitrary (and could differ each time the query is run). For example, if Table1 had the date 20180804 and Table2 had the dates 20180803 and 20180805, they would both have the value 1 for ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)). You therefore might need to include additional logic in your ORDER BY to ensure consistent results.
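For example, a simple way to make ties deterministic (just a sketch; which row you prefer on a tie is your call) is to add further columns to the ORDER BY inside the CROSS APPLY:

CROSS APPLY (SELECT TOP 1 *
             FROM Table2 ca
             WHERE T1.ID = ca.ID
             ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)),
                      ca.DateColumn,  -- on a tie, prefer the earlier date
                      ca.[Value]      -- final tiebreaker if the dates are also identical
            ) T2;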

I'll say a couple of things here for you to consider, since SQL Server is not my comfort zone, while SQL itself is.
First of all, I'd join TABLE1 with TABLE2 per ID. That way, I can specify on my SELECT clause the following tuple:
SELECT ID, Value, DateDiff(d, T1.Date, T2.Date) qt_diff_days
Obviously, depending on the precision of the dates kept there, and whether they have time components or not, you can change the datepart argument of the DateDiff function.
Going forward, I'd also make this date difference an absolute number (to resolve positive / negative differences and consider only the elapsed time).
After that, and this is where it gets tricky because I don't know which SQL Server version you're using, I'd basically use the ROW_NUMBER window function to rank all my lines by that difference. Something like the following:
SELECT
ID, Value, Abs(DateDiff(d, T1.Date, T2.Date)) qt_diff_days,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Abs(DateDiff(d, T1.Date, T2.Date)) ASC) nu_row
ROW_NUMBER (Transact-SQL)
Numbers the output of a result set. More specifically, returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.
If you run ROW_NUMBER properly, you should notice the query ranks its data per ID, starting at 1 and increasing the ranking with the difference between the two dates, resetting the rank to 1 when the ID changes.
After that, all you need to do is select only those lines where nu_row equals 1. I'd use a CTE for that.
WITH common_table_expression (Transact-SQL)
Specifies a temporary named result set, known as a common table expression (CTE).
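Putting those pieces together, a minimal sketch of the whole query (using the table and column names from the question, and assuming a SQL Server version that supports ROW_NUMBER) might look like this:

WITH ranked AS (
    SELECT T1.ID,
           T2.[Value],
           T2.[Date],
           ROW_NUMBER() OVER(PARTITION BY T1.ID
                             ORDER BY ABS(DATEDIFF(d, T1.[Date], T2.[Date])) ASC) AS nu_row
    FROM TABLE1 T1
    INNER JOIN TABLE2 T2 ON T1.ID = T2.ID
)
SELECT ID, [Value], [Date]
FROM ranked
WHERE nu_row = 1;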

Related

Dynamically generating date range starts in SQL

Imagine you have a set of dates. You want any date which is within X days of the lowest date to be "merged" into that date. Then you want to repeat until you have merged all date points.
For example:
ID  DatePoints
1   2023-01-01
2   2023-01-02
3   2023-01-12
4   2023-01-21
5   2023-02-01
6   2023-02-02
7   2023-03-01
If you applied this rule to this data using 10 days as your X, you would end up with this output:
DateRangeStarts
2023-01-01
2023-01-12
2023-02-01
2023-03-01
IDs 1 and 2 into range 1, IDs 3 and 4 into range 2, IDs 5 and 6 into range 3, and ID 7 into range 4.
Is there any way to do this without a loop? Answer can work in SQL Server or BigQuery. Thanks
You could consider something like the following. It's not pretty and I'm not at all confident it is the best solution, but I do think it works. Maybe it's a good starting point for you to work from.
WITH cte AS
(
    SELECT min(datepoint) datepoint
    FROM test
    UNION ALL
    SELECT min(t.datepoint) OVER() datepoint
    FROM test t CROSS APPLY (SELECT max(cte.datepoint) OVER() md FROM cte) c
    WHERE t.datepoint > DATEADD(DAY, 10, c.md)
)
SELECT DISTINCT datepoint
FROM cte
ORDER BY datepoint
(You might want to change the > to a >=, depending on what counts as within X days.)
The basic idea is to get the minimum date from your table into the cte, then recursively get the minimum date from your table that is bigger than the current maximum date in the cte + X days.
It gets messy because of the limitations SQL Server places on recursive CTEs. They can't be used in subqueries, with normal OUTER JOINs, or with aggregate functions. Therefore, I use CROSS APPLY and the window versions of min/max. This gets the correct result, but multiple times, so I'm forced to use DISTINCT to clean it up afterward.
Depending on your data, it might be better to do a loop anyway, but I think this is an option to consider.
Here's a Fiddle of it working.
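If you do decide a plain loop is acceptable after all, a minimal T-SQL sketch of that approach (assuming the same test table and datepoint column as above) could look like this:

DECLARE @ranges TABLE (DateRangeStart date);
DECLARE @current date = (SELECT MIN(datepoint) FROM test);

WHILE @current IS NOT NULL
BEGIN
    INSERT INTO @ranges (DateRangeStart) VALUES (@current);
    -- the next range starts at the first datepoint more than X (here 10) days after the current start
    SET @current = (SELECT MIN(datepoint)
                    FROM test
                    WHERE datepoint > DATEADD(DAY, 10, @current));
END

SELECT DateRangeStart FROM @ranges ORDER BY DateRangeStart;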

How to distribute data approximately in a table so that the counts remain close to a given value

I have a table TEST with 14.5 million records (columns id and created_date). The end goal is to split this table into approximately N splits; let's assume 15 in this case, so close to a million records each. I'm using created_date to split this data.
I've come up with the below query.
with cte as (
    select created_date,
           ntile(15) over (order by created_date) as created_date_range
    from TEST
)
select created_date_range, min(created_date), max(created_date), count(*)
from cte
group by created_date_range
order by created_date_range;
I get the desired result, with the table being split into 15 equal parts. Here's an example of the data I get:
created_date_range  min(created_date)    max(created_date)    count(*)
1                   2022-04-14 00:00:02  2022-05-02 22:56:40  946455
2                   2022-05-02 22:56:40  2022-05-21 17:10:20  946455
3                   2022-05-21 17:10:21  2022-06-15 20:16:47  946455
.
.
.
14                  2022-10-24 18:55:22  2022-11-04 17:12:26  946454
15                  2022-11-04 17:12:26  2022-11-18 06:01:08  946454
How can I avoid data with the same date being distributed into two different ranges?
Am I doing this correctly? Is there another way of achieving the result?
I tried to use the CEIL function but I had issues with the GROUP BY statement.
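One possible workaround (not tested against your data, just a sketch) is to apply NTILE to the distinct dates instead of to the individual rows, so every row sharing a created_date lands in the same range; the trade-off is that the row counts per range will no longer be exactly equal:

with dates as (
    select created_date, count(*) as cnt
    from TEST
    group by created_date
), cte as (
    select created_date, cnt,
           ntile(15) over (order by created_date) as created_date_range
    from dates
)
select created_date_range, min(created_date), max(created_date), sum(cnt)
from cte
group by created_date_range
order by created_date_range;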

Filter by top N changes from last month to this month

I have a dataset of parts, price per part, and month. I am accessing this data via a live connection to a SQL Server database. This database gets updated monthly with new prices for each part. What I would like to do is graph one year of price data for the ten parts whose prices changed the most over the last month (either as a percentage of last month's price or as a total change in dollars.)
Since my database connection is live, ideally Tableau would grab the new price data each month, updating the top ten parts whose prices changed for the new period. I don't want to manually have to change the months or use a stored procedure if possible.
part price date
110 167.66 2018-12-01 00:00:00.000
113 157.82 2018-12-01 00:00:00.000
121 99.16 2018-12-01 00:00:00.000
133 109.82 2018-12-01 00:00:00.000
137 178.66 2018-12-01 00:00:00.000
138 154.99 2018-12-01 00:00:00.000
143 67.32 2018-12-01 00:00:00.000
149 103.82 2018-12-01 00:00:00.000
113 167.34 2018-11-01 00:00:00.000
121 88.37 2018-11-01 00:00:00.000
133 264.02 2018-11-01 00:00:00.000
Create a calculated field called Recent_Price as
if DateDiff(‘month’, [date], Today()) <= 1 then [price] end. This returns the price for recent records and null for older records. You might need to tweak the condition based on details, or use an LOD calc to always get the last 2 values regardless of today’s date.
Create a calculated field called Price_Change as Max([Recent_Price]) - Min([Recent_Price]) Note you can’t tell from this whether the change was positive or negative, just its magnitude.
Make sure part is a discrete dimension. Drag it to the Filter Shelf. Set the filter to show the Top N parts by Price_Change.
It's not hard to extend this to include the sign of the price change, or to convert it to a percentage. Hint: you'll probably need a pair of calcs like the one in step 1 to select prices for specific months.
You haven't provided any sample data, but you could follow something like this,
;WITH top_parts AS (
    -- select the top 10 parts based on some criteria
    SELECT TOP 10 parts.[id], parts.[name]
    FROM parts
    ORDER BY <most changed>
)
SELECT price.[date], p.[name], price.[price]
FROM top_parts p
INNER JOIN part_price price ON p.[id] = price.[part_id]
ORDER BY price.[date]
Use a CTE to get your top parts.
Select from the CTE, join to the price table to get the prices for each part.
Order the prices or bucketize them into months.
Feed it to your graph.
It will be something like this for just one month. If you need the whole year you have to specify clearly what exactly you want to see:
;WITH cte AS (
    SELECT TOP 10 m0.Part
         , Diff = ABS(m0.Price - m1.Price)
         , DiffPrc = ABS(m0.Price - m1.Price) / m1.Price
    FROM Parts AS m0
    INNER JOIN (SELECT MaxDate = MAX([Date]) FROM Parts) AS md
        ON md.MaxDate = m0.[Date]
    INNER JOIN Parts AS m1
        ON m0.Part = m1.Part AND DATEADD(MONTH, -1, md.MaxDate) = m1.[Date]
    ORDER BY ABS(m0.Price - m1.Price) DESC
    -- Top 10 by percentage:
    -- ORDER BY ABS(m0.Price - m1.Price) / m1.Price DESC
)
SELECT *
FROM Parts AS p
INNER JOIN cte ON cte.Part = p.Part
-- Input from the user; you decide in which format the last-month date will be passed.
-- In other words, @InputLastMonth is a parameter of the proc.
-- Suppose it is passed in yyyy-MM-dd format.
DECLARE @InputLastMonth date = '2018-12-31'

-- To get the last one year of data,
-- declare a local variable which is not passed in.
DECLARE @From date = DATEADD(DAY, 1, DATEADD(MONTH, -12, @InputLastMonth))
DECLARE @TopN int = 10 -- requirement

--select @InputLastMonth, @From
SELECT TOP (@TopN) parts, ChangePrice
FROM (
    SELECT parts, ABS(MAX(price) - MIN(price)) AS ChangePrice
    FROM dbo.Table1
    WHERE dates >= @From AND dates <= @InputLastMonth
    GROUP BY parts
) t4
ORDER BY ChangePrice DESC
By "changed the most", I understand the following: suppose there is one part, 'Part1', whose price was 100 in the first month and changed to 1000 in the last month.
On the other hand, Part2 changed several times during the same period, but its overall change was only 12.
In other words, Part1 changed only twice but the difference was huge, while Part2 changed several times but each difference was small.
So Part1 will be preferred.
The second thing is that the change can be negative as well as positive.
Correct me if I have not understood your requirement.

Possible to calculate iterated count of timestamps relative to one another?

This question is a bit complicated but to make it as simple as possible:
I have a list of timestamps (it is in the millions but let's say for simplicity sake it is much smaller):
order_times
-----------
2014-10-11 15:00:00
2014-10-11 15:02:00
2014-10-11 15:03:31
2014-10-11 15:07:00
2014-10-11 16:00:00
2014-10-11 16:04:00
I am trying to build a query (in PostgreSQL) that would allow me to determine the number of times an order_time occurs within 10 minutes of 2 order_times prior to it (and no more).
In the sample data above:
the first timestamp is considered 0, as there were no orders before it
the second timestamp is considered 0, as it was within 10 minutes of the one prior, but there was only 1 order before it
the third timestamp is considered 1, because there were at least 2 orders within 10 minutes of it
(and so on)...
I hope this is clear!
You don't need to look at the immediately previous order, just the one 2 prior to each. If that one is within 10 minutes, then the one after it will be too.
Best way is to get the data that is important to you into a single row, so you can do set operations on it. For that, use the windowing function ROW_NUMBER() and a self join. This is the MS SQL way of doing what you want.
WITH T1 AS (
    SELECT ID, Order_Time, ROW_NUMBER() OVER(ORDER BY Order_Time) AS RowNumber
    FROM myTest)
SELECT T1.ID, T1.Order_Time, T2.ID AS CompareID, T2.Order_Time AS CompareTime
FROM T1
LEFT OUTER JOIN T1 AS T2 ON T1.RowNumber - 2 = T2.RowNumber
WHERE DATEDIFF(n, T2.Order_Time, T1.Order_Time) <= 10
First we create a query that has the row numbers, then use it as an inline table to do a self join to build a row that contains each order, and the one that happened 2 orders prior to it. Then just do a simple date comparison to select out the rows you want.
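Since the question itself targets PostgreSQL, a rough equivalent there (just a sketch, assuming the timestamps sit in a hypothetical orders table with an order_times column) can use LAG instead of the self join:

SELECT order_times,
       CASE WHEN LAG(order_times, 2) OVER (ORDER BY order_times)
                 >= order_times - INTERVAL '10 minutes'
            THEN 1 ELSE 0
       END AS counted  -- 1 when at least 2 earlier orders fall within the previous 10 minutes
FROM orders
ORDER BY order_times;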

Exponential decay in SQL for different dates page views

I have different dates with the number of products viewed on a webpage over a 30-day time frame. I am trying to create an exponential decay model in SQL. I am using exponential decay because I want to weight the latest events more heavily than older ones. I'm not sure how to write this in SQL without getting an error. I have never built this type of model before, so I want to make sure I am doing it correctly too.
=================================
Data looks like this
product views date
a 1 2014-05-15
a 2 2014-05-01
b 2 2014-05-10
c 4 2014-05-02
c 1 2014-05-12
d 3 2014-05-11
================================
Code:
create table decay model as
select product,views,date
case when......
from table abc
group by product;
not sure what to write to do the model
I want to penalize products that were viewed longer ago versus products that were viewed more recently.
Thank you for your help
You can do it like this:
Choose the partition in which you want to apply exponential decay, then order descending by date within such a group.
use the function ROW_NUMBER() with ascending ordering to get the row numbering within each subgroup.
calculate pow(your_variable_in_[0,1], rownum) and apply it to your result.
Code might look like this (might work in Oracle SQL or db2):
SELECT <your_partitioning>, date, <whatever>*power(<your_variable>,rownum-1)
FROM (SELECT a.*
, ROW_NUMBER() OVER (PARTITION BY <your_partitioning> ORDER BY a.date DESC) AS rownum
FROM YOUR_TABLE a)
ORDER BY <your_partitioning>, date DESC
EDIT: I read again over your problem and think I understood now what you asked for, so here is a solution which might work (decay factor is 0.9 here):
SELECT product, sum(adjusted_views)                                                  -- (i)
FROM (SELECT product, views*power(0.9, rownum-1) AS adjusted_views, date, rownum     -- (ii)
      FROM (SELECT product, views, date                                              -- (iii)
                 , ROW_NUMBER() OVER (PARTITION BY product ORDER BY a.date DESC) AS rownum
            FROM YOUR_TABLE a)
      ORDER BY product, date DESC)
GROUP BY product
The inner select statement (iii) creates a temporary table that might look like this
product views date rownum
--------------------------------------------------
a 1 2014-05-15 1
a 2 2014-05-14 2
a 2 2014-05-13 3
b 2 2014-05-10 1
b 3 2014-05-09 2
b 2 2014-05-08 3
b 1 2014-05-07 4
The next query (ii) then uses the rownumber to construct an exponentially decaying factor 0.9^(rownum-1) and applies it to views. The result is
product adjusted_views date rownum
--------------------------------------------------
a 1 * 0.9^0 2014-05-15 1
a 2 * 0.9^1 2014-05-14 2
a 2 * 0.9^2 2014-05-13 3
b 2 * 0.9^0 2014-05-10 1
b 3 * 0.9^1 2014-05-09 2
b 2 * 0.9^2 2014-05-08 3
b 1 * 0.9^3 2014-05-07 4
In a last step (the outer query) the adjusted views are summed up, as this seems to be the quantity you are interested in.
Note, however, that in order to be consistent there should be regular distances between the dates, e.g., always one day (not one day here and a month there, because those would be weighted in a similar fashion although they shouldn't be).
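If the dates are not evenly spaced, one possible variation (not part of the answer above, just a sketch in SQL Server syntax) is to base the exponent on the actual number of days between each row's date and the product's most recent date, instead of on the row number:

SELECT product,
       SUM(views * POWER(0.9, day_gap)) AS adjusted_views
FROM (SELECT product, views,
             -- days between this row's date and the latest date for the product
             DATEDIFF(DAY, [date], MAX([date]) OVER (PARTITION BY product)) AS day_gap
      FROM YOUR_TABLE) t
GROUP BY product;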