Exponential decay in SQL for page views on different dates

I have different dates with the number of product views on a webpage over a 30-day time frame. I am trying to create an exponential decay model in SQL. I am using exponential decay because I want to weight the latest events more heavily than older ones. I'm not sure how to write this in SQL without getting an error, and since I have never built this type of model before, I also want to make sure I am doing it correctly.
=================================
Data looks like this
product  views  date
a        1      2014-05-15
a        2      2014-05-01
b        2      2014-05-10
c        4      2014-05-02
c        1      2014-05-12
d        3      2014-05-11
================================
Code:
create table decay_model as
select product, views, date,
       case when ......
from abc
group by product;
I'm not sure what to write to build the model. I want to penalize products whose views are older relative to products that were viewed more recently.
Thank you for your help.

You can do it like this:
Choose the partition in which you want to apply exponential decay, then order descending by date within each group.
Use the ROW_NUMBER() function over that ordering to number the rows within each subgroup, starting at 1 for the most recent date.
Calculate power(your_factor_in_[0,1], rownum - 1) and apply it to your values.
Code might look like this (might work in Oracle SQL or db2):
SELECT <your_partitioning>, date, <whatever> * power(<your_variable>, rownum - 1)
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY <your_partitioning> ORDER BY a.date DESC) AS rownum
      FROM YOUR_TABLE a)
ORDER BY <your_partitioning>, date DESC
EDIT: I read over your problem again and think I now understand what you were asking for, so here is a solution that might work (the decay factor is 0.9 here):
SELECT product, sum(adjusted_views)                                                     -- (i)
FROM (SELECT product, views * power(0.9, rownum - 1) AS adjusted_views, date, rownum   -- (ii)
      FROM (SELECT product, views, date,                                               -- (iii)
                   ROW_NUMBER() OVER (PARTITION BY product ORDER BY a.date DESC) AS rownum
            FROM YOUR_TABLE a)
      ORDER BY product, date DESC)
GROUP BY product
The inner select statement (iii) creates a temporary table that might look like this
product views date rownum
--------------------------------------------------
a 1 2014-05-15 1
a 2 2014-05-14 2
a 2 2014-05-13 3
b 2 2014-05-10 1
b 3 2014-05-09 2
b 2 2014-05-08 3
b 1 2014-05-07 4
The next query (ii) then uses the row number to construct an exponentially decaying factor 0.9^(rownum-1) and applies it to views. The result is
product adjusted_views date rownum
--------------------------------------------------
a 1 * 0.9^0 2014-05-15 1
a 2 * 0.9^1 2014-05-14 2
a 2 * 0.9^2 2014-05-13 3
b 2 * 0.9^0 2014-05-10 1
b 3 * 0.9^1 2014-05-09 2
b 2 * 0.9^2 2014-05-08 3
b 1 * 0.9^3 2014-05-07 4
In the last step (the outer query) the adjusted views are summed up, as this seems to be the quantity you are interested in.
Note, however, that in order to be consistent there should be regular distances between the dates, e.g., always one day (not one day here and a month there, because those gaps would be weighted in a similar fashion although they shouldn't be).
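If the dates are not evenly spaced, an alternative sketch is to base the exponent on the actual age in days rather than on the row number. This assumes Oracle-style date arithmetic, a decay factor of 0.9 per day, and a column named date_val (date itself is a reserved word in some databases), so adjust to your schema:

-- Weight each row by 0.9^(age in days), where age is measured from the most
-- recent date seen for that product; irregular gaps are then penalized correctly.
SELECT product,
       SUM(views * POWER(0.9, age_days)) AS decayed_views
FROM (SELECT product,
             views,
             MAX(date_val) OVER (PARTITION BY product) - date_val AS age_days
      FROM your_table)
GROUP BY product;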

Related

How to get correct min, max date for each customer's changing label in wide format in BigQuery?

I have a table that records customer purchases, for example:
customer_id  label  date        purchase_id  price
2            A      2022-01-01  asd          10
3            A      2022-01-01  asdf         5
4            B      2022-02-04  asdfg        200
2            A      2022-01-03  asdjg        4
3            B      2022-02-01  dfs          20
2            G      2022-04-05  fdg          40
2            G      2022-04-10  fdg          40
2            A      2022-06-06  fgd          20
I want to see how many days and how much money each customer has spent in each label. So far what I'm doing is:
SELECT
  customer_id,
  label,
  COUNT(DISTINCT purchase_id) AS orders_count,
  SUM(price) AS total_spent,
  MIN(date) AS first_date,
  MAX(date) AS last_date,
  DATE_DIFF(MAX(date), MIN(date), DAY) AS days
FROM
  TABLE
WHERE
  date > '2022-01-01'
GROUP BY
  customer_id,
  label
which gives me a long table, like this:
customer_id  label  orders_count  total_spent  first_date  last_date   days
2            A      3             34           2022-01-01  2022-06-06  180
2            G      1             40           2022-04-05  2022-04-10  5
etc
Just for simplicity I show a few columns, but customers have orders all the time. The issue with the above is that, for customer 2 for example, he starts with label A, then changes to G, then goes back to A, and this is not visible in the results table (min(date) is correct, but max(date) is taken from his second stretch of A). I would also prefer to have the result in wide format; ideally, columns called next_label_{i} holding the values for each label change would be best for me.
Could you advise me on a) a way of dealing with this kind of label change (where a later label is the same as an earlier label), and b) a way to produce the result in wide format?
Thanks
edit:
example output (correct date, wide format) [columns would go as wide as the max number of unique labels for any customer]
customer_id  first_label  first_first_date  first_last_date  first_total_spent  first_days  next_label  next_first_date  next_last_date  next_days  next_label_2  next_first_date_2  next_last_date_2  next_days_2
2            A            2022-01-01        2022-01-03       2                  14          G           2022-04-05       2022-04-05      0          A             2022-06-06         2022-06-06        0
etc
Sorry, this is not exactly accurate (it is missing orders_count and total_spent), but it is a pain to format here; hopefully you get the idea. In principle, it is as if you used Python's pivot_table on the previous dataset.
Alternatively, I would be glad for just a solution in the long format that distinguishes between a customer's label and the same customer's repeated label (as in customer 2, who starts with A and, after changing to G, returns to A).
Could you advise me of ... b) a way to produce it into a wide format?
First, I want to say that I hope you have a really good reason to need that output, as it is usually not considered best practice and is rather left for the presentation layer to handle.
With that in mind, consider the approach below:
select * from (
  select customer_id, offset, purchase.*
  from (
    select customer_id,
      array_agg(struct(label, date, purchase_id, price) order by date) purchases
    from your_table
    group by customer_id
  ), unnest(purchases) purchase with offset
  order by customer_id, offset
)
pivot (
  any_value(label) label,
  any_value(date) date,
  any_value(purchase_id) purchase_id,
  any_value(price) price
  for offset in (0, 1, 2, 3, 4, 5)
)
If applied to the sample data in your question, this gives you the wide output you are after.
Note: The above makes the silly assumption that you know the maximum number of steps (in this case I used 6, from 0 to 5). There are plenty of posts here on SO that show how to use the same technique to make it dynamic. I do not want to duplicate them as that is against SO policies, so just do your extra homework on this :o)

Find list of dates in a table closest to specific date from different table.

I have a list of unique ID's in one table that has a date column. Example:
TABLE1
ID Date
0 2018-01-01
1 2018-01-05
2 2018-01-15
3 2018-01-06
4 2018-01-09
5 2018-01-12
6 2018-01-15
7 2018-01-02
8 2018-01-04
9 2018-02-25
Then in another table I have a list of different values that appear multiple times for each ID with various dates.
TABLE 2
ID Value Date
0 18 2017-11-28
0 24 2017-12-29
0 28 2018-01-06
1 455 2018-01-03
1 468 2018-01-16
2 55 2018-01-03
3 100 2017-12-27
3 110 2018-01-04
3 119 2018-01-10
3 128 2018-01-30
4 223 2018-01-01
4 250 2018-01-09
4 258 2018-01-11
etc
I want to find the value in table 2 that is closest to the unique date in table 1.
Sometimes table 2 does contain a value that matches the date exactly and I have had no problem in pulling through those values. But I can't work out the code to pull through the value closest to the date requested from table 1.
My desired result based on the examples above would be
ID Value Date
0 24 2017-12-29
1 455 2018-01-03
2 55 2018-01-03
3 110 2018-01-04
4 250 2018-01-09
Since I can easily find the IDs with an exact match, one thing I have tried is taking the IDs that don't have an exact date match and placing them with their corresponding values into a temporary table, then trying to find the closest possible match for each one, but that is where I'm not sure how to begin with the coding.
Apologies if I'm missing a basic function or clause for this, I'm still learning!
The below would be one method:
WITH Table1 AS(
SELECT ID, CONVERT(date, datecolumn) DateColumn
FROM (VALUES (0,'20180101'),
(1,'20180105'),
(2,'20180115'),
(3,'20180106'),
(4,'20180109'),
(5,'20180112'),
(6,'20180115'),
(7,'20180102'),
(8,'20180104'),
(9,'20180225')) V(ID, DateColumn)),
Table2 AS(
SELECT ID, [value], CONVERT(date, datecolumn) DateColumn
FROM (VALUES (0,18 ,'2017-11-28'),
(0,24 ,'2017-12-29'),
(0,28 ,'2018-01-06'),
(1,455,'2018-01-03'),
(1,468,'2018-01-16'),
(2,55 ,'2018-01-03'),
(3,100,'2017-12-27'),
(3,110,'2018-01-04'),
(3,119,'2018-01-10'),
(3,128,'2018-01-30'),
(4,223,'2018-01-01'),
(4,250,'2018-01-09'),
(4,258,'2018-01-11')) V(ID, [Value],DateColumn))
SELECT T1.ID,
T2.[Value],
T2.DateColumn
FROM Table1 T1
CROSS APPLY (SELECT TOP 1 *
FROM Table2 ca
WHERE T1.ID = ca.ID
ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn))) T2;
Note that if the difference in days is the same, the row returned will be random (and could differ each time the query is run). For example, if Table1 had the date 20180804 and Table2 had the dates 20180803 and 20180805, they would both have the value 1 for ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)). You therefore might need to include additional logic in your ORDER BY to ensure consistent results.
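For example, one possible tie-breaker (an assumption about the desired behaviour; here the earlier of two equally distant dates wins) is to extend the ORDER BY inside the CROSS APPLY:

SELECT T1.ID,
       T2.[Value],
       T2.DateColumn
FROM Table1 T1
CROSS APPLY (SELECT TOP 1 *
             FROM Table2 ca
             WHERE T1.ID = ca.ID
             ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)),
                      ca.DateColumn) T2;   -- tie-break: the earlier date wins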
Dude, I'll say a couple of things here for you to consider, since SQL Server is not my comfort zone, while SQL itself is.
First of all, I'd join TABLE1 with TABLE2 per ID. That way, I can specify on my SELECT clause the following tuple:
SELECT ID, Value, DateDiff(d, T1.Date, T2.Date) qt_diff_days
Obviously, depending on the precision of the dates kept there, whether they carry a time component or not, you can change the date part used in the DateDiff function.
Going forward, I'd also make this date difference an absolute number (to resolve positive / negative differences and consider only the elapsed time).
After that (and this is where it gets tricky, because I don't know which SQL Server version you're using), I'd basically use the ROW_NUMBER window function to rank all my rows by that difference. Something like the following:
SELECT
ID, Value, Abs(DateDiff(d, T1.Date, T2.Date)) qt_diff_days,
ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Abs(DateDiff(d, T1.Date, T2.Date)) ASC) nu_row
ROW_NUMBER (Transact-SQL)
Numbers the output of a result set. More specifically, returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.
If you run ROW_NUMBER properly, you should notice that the query ranks its data per ID, starting at 1 and increasing the rank as the difference between the two dates grows, resetting the rank to 1 when the ID changes.
After that, all you need to do is select only those lines where nu_row equals 1. I'd use a CTE for that, as sketched below.
WITH common_table_expression (Transact-SQL)
Specifies a temporary named result set, known as a common table expression (CTE).
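Putting those pieces together, a sketch of the full query might look like this (an illustration of the steps above, assuming SQL Server 2005+ and the TABLE1/TABLE2 names from the question, not the answerer's exact code):

WITH ranked AS (
    SELECT T2.ID,
           T2.[Value],
           T2.[Date],
           ABS(DATEDIFF(DAY, T2.[Date], T1.[Date])) AS qt_diff_days,
           ROW_NUMBER() OVER (PARTITION BY T2.ID
                              ORDER BY ABS(DATEDIFF(DAY, T2.[Date], T1.[Date]))) AS nu_row
    FROM TABLE1 T1
    INNER JOIN TABLE2 T2 ON T2.ID = T1.ID
)
SELECT ID, [Value], [Date]
FROM ranked
WHERE nu_row = 1;   -- keep only the closest date per ID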

How can I return rows which match on some columns and fulfil a DateTime comparison between two other columns using SQL?

I have a table which contains rows for jobs, example below, where 01/01/1980 is used rather than null in the ClosedDate column for jobs which are not finished:
JobNumber JobCategory CustomerID CreatedDate ClosedDate
1 Small 1 01/01/2016 03/01/2016
2 Small 2 03/01/2016 07/01/2016
3 Large 2 06/01/2016 07/01/2016
4 Medium 1 08/01/2016 10/01/2016
5 Small 3 10/01/2016 01/01/1980
6 Medium 3 15/01/2016 01/01/1980
7 Large 2 16/01/2016 17/01/2016
8 Large 2 19/01/2016 20/01/2016
9 Small 1 19/01/2016 01/01/1980
10 Medium 2 19/01/2016 01/01/1980
I need to return a list of any jobs where the same customer has had a job of the same category created within 3 days of the previous job being closed.
So, I would want to return:
7 Large 2 16/01/2016 17/01/2016
8 Large 2 19/01/2016 20/01/2016
because Customer 2 had a Large job closed on 17/01/2016 and another Large job opened on 19/01/2016, which is within 3 days.
In order to do this, I assume I need to compare each record in the table with each subsequent record, looking for a match on JobCategory and comparing CreatedDate with ClosedDate between rows.
Can anyone advise my best option for this using SQL? I'm using SQL Server 2012.
The first thing that you should do is get rid of "magic dates" in your system. If the job hasn't been closed yet then the ClosedDate is not known. SQL has a value for exactly that - NULL. That prevents anyone in the future from having to know the magic date of 1/1/1980 or from that having to be hard-coded throughout your system.
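For example, cleaning up the existing data might look something like this (a sketch; it assumes the table is called Jobs, the column is of type DATE, and 01/01/1980 is the only magic value in use):

-- Make the column nullable (skip if it already is)
ALTER TABLE Jobs ALTER COLUMN ClosedDate DATE NULL;

-- Replace the magic date with NULL so "not closed yet" is explicit
UPDATE Jobs
SET ClosedDate = NULL
WHERE ClosedDate = '19800101';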
Next, you don't have to compare each row with each one after it. Define what you're looking for and find matches that meet those qualifications. You didn't specify which type of SQL Server you're using (you should tag your question with Oracle or MySQL or SQL Server), so the below query is written for SQL Server. Your version might have different date functions.
SELECT
J1.JobNumber,
J1.JobCategory,
J1.CustomerID,
J1.CreatedDate,
J1.ClosedDate,
J2.JobNumber,
J2.CreatedDate,
J2.ClosedDate
FROM
Jobs J1
INNER JOIN Jobs J2 ON
J2.CustomerID = J1.CustomerID AND
J2.JobCategory = J1.JobCategory AND
DATEDIFF(DAY, J1.ClosedDate, J2.CreatedDate) BETWEEN 0 AND 3 AND
J2.JobNumber <> J1.JobNumber
This will return the jobs in a single row instead of two rows. If that's a problem then the query could be altered slightly to do so. This can also be done a little more easily with windowed functions, but again, since you didn't specify your SQL vendor I didn't want to use those.
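For instance, one such alteration (a sketch, not the original answer's code) keeps a job whenever a companion job for the same customer and category sits within the 0-3 day window on either side, so each matching job comes back as its own row:

SELECT J.*
FROM Jobs J
WHERE EXISTS (SELECT 1
              FROM Jobs J2
              WHERE J2.CustomerID  = J.CustomerID
                AND J2.JobCategory = J.JobCategory
                AND J2.JobNumber  <> J.JobNumber
                AND (DATEDIFF(DAY, J.ClosedDate, J2.CreatedDate) BETWEEN 0 AND 3    -- J closed, J2 opened soon after
                  OR DATEDIFF(DAY, J2.ClosedDate, J.CreatedDate) BETWEEN 0 AND 3)); -- J2 closed, J opened soon after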
Since you're using SQL Server, you should be able to use windowed functions like so:
;WITH CTE_JobsWithDates AS -- Probably a poor name for the CTE
(
SELECT
JobNumber,
JobCategory,
CustomerID,
CreatedDate,
ClosedDate,
LEAD(CreatedDate, 1) OVER (PARTITION BY JobCategory, CustomerID ORDER BY CreatedDate) AS NextCreatedDate,
LAG(ClosedDate, 1) OVER (PARTITION BY JobCategory, CustomerID ORDER BY CreatedDate) AS PreviousClosedDate
FROM
Jobs
)
SELECT
JobNumber,
JobCategory,
CustomerID,
CreatedDate,
ClosedDate
FROM
CTE_JobsWithDates
WHERE
DATEDIFF(DAY, ClosedDate, NextCreatedDate) BETWEEN 0 AND 3 OR
DATEDIFF(DAY, PreviousClosedDate, CreatedDate) BETWEEN 0 AND 3
That was off the cuff, so please test and let me know if anything isn't quite right.
Try:
SELECT a.*
FROM
Job AS a
JOIN
Job AS b ON
a.CustomerID = b.CustomerID AND a.JobCategory = b.JobCategory
WHERE
a.JobNumber != b.JobNumber
AND (
b.CreatedDate - a.ClosedDate BETWEEN 0 AND 3
OR
a.CreatedDate - b.ClosedDate BETWEEN 0 AND 3)

Retrieve last N records from a table

I have searched but not found an answer for my question.
I have a table orders that consists of
id (primary key autonumber)
client_id : identifies each client (unique)
date: order dates for each client
I want to retrieve the last N order dates for each client in a single view
Of course I could use SELECT TOP N date FROM orders WHERE client_id = 'xx' ORDER BY date DESC and then use UNION for the different client values. The problem is that with changes in the client base the statement would require revision, and the UNION approach is impractical with a large client base.
As an additional requirement this needs to work in Access SQL.
Step 1: Create a query that yields a rank order by date per client for every row. Since Access SQL does not have ROW_NUMBER() OVER (...) like SQL Server, you can simulate this by using the technique described in the following question:
Access query producing results like ROW_NUMBER() in T-SQL
If you have done step 1 correctly, your result should be as follows:
client_id  date        rank
---------------------------
1          2014-12-01  7
1          2014-12-02  6
1          2014-12-05  5
1          2014-12-07  4
1          2014-12-11  3
1          2014-12-14  2
1          2014-12-15  1
2          2014-12-01  2
2          2014-12-02  1
...
Step 2: Use the result from step 1 as a subquery and filter the result such that only records with rank <= N are returned.
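A sketch of what steps 1 and 2 might look like combined in Access SQL, with the rank simulated by a correlated COUNT (here N = 3; the table and column names from the question are assumed, and ties on the date would all be returned):

SELECT o.client_id, o.[date]
FROM orders AS o
WHERE (SELECT COUNT(*)
       FROM orders AS o2
       WHERE o2.client_id = o.client_id
         AND o2.[date] >= o.[date]) <= 3
ORDER BY o.client_id, o.[date] DESC;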
I think the following will work in MS Access:
select t.*
from table as t
where t.date in (select top N t2.date
from table as t2
where t2.client_id = t.client_id
order by t2.date desc
);
One problem with MS Access is that top N will retrieve more than N records if there are ties. If you want exactly "N", then you can use order by date, id in the subquery.
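One way to read that suggestion is the following sketch (it also switches the outer IN to the id column so that the tie-break actually limits the result to N rows; table and column names as in the question, with N = 3):

select t.*
from orders as t
where t.id in (select top 3 t2.id
               from orders as t2
               where t2.client_id = t.client_id
               order by t2.[date] desc, t2.id desc
              );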

Oracle Database Temporal Query Implementation - Collapse Date Ranges

This is the result of one of my queries:
SURGERY_D
---------
01-APR-05
02-APR-05
03-APR-05
04-APR-05
05-APR-05
06-APR-05
07-APR-05
11-APR-05
12-APR-05
13-APR-05
14-APR-05
15-APR-05
16-APR-05
19-APR-05
20-APR-05
21-APR-05
22-APR-05
23-APR-05
24-APR-05
26-APR-05
27-APR-05
28-APR-05
29-APR-05
30-APR-05
I want to collapse the dates that are consecutive into intervals. For example,
[01-APR-05, 07-APR-05], [11-APR-05, 16-APR-05] and so on.
In terms of temporal databases, I want to 'collapse' the dates. Any idea how to do that on Oracle? I am using version 11. I searched for it and read a book but couldn't find/understand how to do it. It might be simple, but everyone has their own flaws and Oracle is mine. Also, I am new to SO so my apologies if I have violated any rules. Thank You!
You can take advantage of the ROW_NUMBER analytical function to generate a unique, sequential number for each of the records (we'll assign that number to the dates in ascending order).
Then, you group the dates by difference between the date and the generated number - the consecutive dates will have the same difference:
Date Number Difference
01-APR-05 1 1 -- MIN(date_val) in group with diff. = 1
02-APR-05 2 1
03-APR-05 3 1
04-APR-05 4 1
05-APR-05 5 1
06-APR-05 6 1
07-APR-05 7 1 -- MAX(date_val) in group with diff. = 1
11-APR-05 8 3 -- MIN(date_val) in group with diff. = 3
12-APR-05 9 3
13-APR-05 10 3
14-APR-05 11 3
15-APR-05 12 3
16-APR-05 13 3 -- MAX(date_val) in group with diff. = 3
Finally, you select the minimal and maximal date in each of the groups to get the beginning and ending of each range.
Here's the query:
SELECT
MIN(date_val) start_date,
MAX(date_val) end_date
FROM (
SELECT
date_val,
row_number() OVER (ORDER BY date_val) AS rn
FROM date_tab
)
GROUP BY date_val - rn
ORDER BY 1
;
Output:
START_DATE END_DATE
------------ ----------
01-04-2005 07-04-2005
11-04-2005 16-04-2005
19-04-2005 24-04-2005
26-04-2005 30-04-2005
You can check how that works on SQLFiddle: Dates ranges example