Dealing with required duplicates in table records - SQL

Here's the situation. My team forecasts sales and revenue numbers at a monthly resolution, but would like all reporting to be at a daily resolution. So what I am doing is ingesting these numbers, dividing the monthly targets by the number of days in the month, and saving the result in a table.
So I start off with something like this:
| date    | forecasted_units | forecasted_revenue |
|---------|------------------|--------------------|
| 2020-01 | 372              | 9300               |
| 2020-02 | 435              | 9280               |
| ...     | ...              | ...                |
My target table now looks like this:
| date       | forecasted_units | forecasted_revenue |
|------------|------------------|--------------------|
| 2020-01-01 | 12               | 300                |
| 2020-01-02 | 12               | 300                |
| 2020-01-03 | 12               | 300                |
| ...        | ...              | ...                |
| 2020-02-01 | 15               | 320                |
| 2020-02-02 | 15               | 320                |
| 2020-02-03 | 15               | 320                |
| ...        | ...              | ...                |
My actual table is quite a lot wider than the one above, and every value is duplicated across all the days of its month. As you can see, there's a lot of data redundancy. My question is: is there a more efficient way to store the same resolution of data in one table?
My immediate thought is to reshape the table to include a start date and end date to look like this:
| start_date | end_date | forecasted_units | forecasted_revenue |
|------------|------------|------------------|--------------------|
| 2020-01-01 | 2020-01-31 | 12 | 300 |
| 2020-02-01 | 2020-02-29 | 15 | 320 |
But that would offload all the computation to the instance generating the reports, because it would have to expand the data for each day between the start and end dates.
Is there a better way to do this?

Unfortunately, Redshift does not support the handy Postgres function generate_series(), which would have largely simplified the task here.
Typical alternative solutions involve a calendar table - basically, a table that enumerates all possible dates. If you have a table with a sufficient number of rows, you can generate such a dataset on the fly with row_number() and dateadd():
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
from my_large_table;
You can store the results in another table (using the create table ... as select ... syntax), or use the query result directly. In both cases, you would then join it with your actual table. To count the number of days in the month, we use a window count:
select
    d.dt,
    t.forecasted_units / count(*) over(partition by t.date) as forecasted_units,
    t.forecasted_revenue / count(*) over(partition by t.date) as forecasted_revenue
from (
    select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') as dt
    from my_large_table
) d
inner join mytable t on t.date = date_trunc('month', d.dt)
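
If reports will hit this often, the generated dates can be materialized once with the create table ... as select syntax mentioned above. A minimal sketch, assuming my_large_table has at least as many rows as the number of days you need:

create table calendar as
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') as dt
from my_large_table;

Reporting queries can then join calendar directly instead of rebuilding the date list on every run.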


30 day rolling count of distinct IDs

So, after looking at what seems to be a commonly asked question and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query the number of unique uids over the last 30 days:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range:
SELECT
    DATE_TRUNC('day', session_start_time) AS "date"
    ,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE_TRUNC('day', session_start_time)
I then tried several ways to do a rolling 30 day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work, but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique count rather than the unique count over 30 days.
I am writing this query from Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME            |
+-----+-------------------------------+
| 10  | 2020-01-13T01:46:07.000-05:00 |
| 5   | 2020-01-13T01:46:07.000-05:00 |
| 3   | 2020-01-18T02:49:23.000-05:00 |
| 9   | 2020-03-06T18:18:28.000-05:00 |
| 2   | 2020-03-06T18:18:28.000-05:00 |
| 8   | 2020-03-31T23:13:33.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 2   | 2020-08-28T18:23:15.000-04:00 |
| 9   | 2020-08-28T18:23:15.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 8   | 2020-09-15T16:40:29.000-04:00 |
| 3   | 2020-09-21T20:49:09.000-04:00 |
| 1   | 2020-11-05T21:31:48.000-05:00 |
| 6   | 2020-11-05T21:31:48.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 5   | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what the desired result looks like:
+------------+---------------------+
| DATE       | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3                   |
| 2020-01-18 | 1                   |
| 2020-03-06 | 3                   |
| 2020-03-31 | 1                   |
| 2020-08-28 | 4                   |
| 2020-09-15 | 2                   |
| 2020-09-21 | 1                   |
| 2020-11-05 | 2                   |
| 2020-12-12 | 2                   |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and a "leaves count" for each session.
Accumulating the count, so that on each day, for each user, you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in between.
Now you can simply do a cumulative sum on each date to determine the 30-day count.
In SQL, this looks like:
with t as (
    select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
    from users_sessions
    union all
    select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1 as inc
    from users_sessions
),
tt as (  -- increment the ins and outs to determine whether a uid is in or out on a given day
    select uid, s_day, sum(inc) as day_inc,
           sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
    from t
    group by uid, s_day
),
ttt as (  -- find the beginning and end of the islands
    select tt.uid, tt.s_day,
           (case when running_inc > 0 then 1 else -1 end) as in_island
    from (select tt.*,
                 lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
                 lead(running_inc) over (partition by uid order by s_day) as next_running_inc
          from tt
         ) tt
    where (running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null)) or
          (running_inc = 0 and (next_running_inc > 0 or next_running_inc is null))
)
select s_day,
       sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day, and a list of all the distinct dates in the data. Then it joins the user list to the date list one-to-many and counts the distinct users; the key here is the expanded join criterion that matches a range of dates to a single date via a pair of inequalities.
with users as (
    select distinct
        uid,
        date_trunc('day', session_start_time) as dt
    from <table>
    where session_start_time >= '2021-05-01'
),
dates as (
    select distinct
        date_trunc('day', session_start_time) as dt
    from <table>
    where session_start_time >= '2021-05-01'
)
select
    count(distinct uid),
    dates.dt
from users
join dates
    on users.dt >= dates.dt - 29
    and users.dt <= dates.dt
group by dates.dt
order by dt desc;
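
Note that the dates CTE above only yields days on which at least one session occurred, which matches the desired output. If you ever need a row for every calendar day, that CTE could be swapped for a generated calendar using the row_number()/dateadd() trick from the first question. A rough sketch (the start date and the ::date casts are assumptions; switch the inner join to a left join if you want zero-count days to appear):

dates as (
    select dt
    from (
        select dateadd(day, row_number() over (order by 1) - 1, '2020-01-01'::date)::date as dt
        from users_sessions
    ) x
    where dt <= current_date
)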

SQLite: generating customer counts for a date range (months) using a normalized table

I have a sales funnel dataset in SQLite and each row represents a movement through the funnel. As there are quite a few ways a potential customer can move through the funnel (and possibly even go backwards), I wasn't planning on flattening/denormalizing the table. How could I calculate "the number of customers per month up to today"?
customer | opp_value | status_old | status_new | current_status | status_change_date | current_lead_status | lead_created_date
cust_8 | 22 | confirmed | paying | paying | 2020-01-01 | Customer | 2020-01-01
cust_9 | 23 | confirmed | paying | churned | 2020-01-03 | Customer | 2020-01-02
cust_9 | 23 | paying | churned | churned | 2020-03-24 | Customer | 2020-02-25
cust_13 | 30 | negotiation | lost | paying | 2020-04-03 | Lost | 2020-03-20
cust_14 | 45 | qualified | confirmed | paying | 2020-03-03 | Customer | 2020-02-28
cust_14 | 45 | confirmed | paying | paying | 2020-04-03 | Customer | 2020-02-28
... | ... | ... | ... | ... | ... | ... | ...
We're assuming we use end-of-month as definition for whether a customer is still with us.
The result, with the above data should be:
month | customers
Jan-2020 | 2 (cust_8, cust_9)
Feb-2020 | 2 (cust_8, cust_9)
Mar-2020 | 1 (cust_8) # cust_9 churned
Apr-2020 | 2 (cust_8, cust_14)
May-2020 | 2 (cust_8, cust_14)
The part I'd really like to understand is how to create the month column, as I can't rely on the dates in status_change_date since there might be missing records. Would one have to generate that column manually? I know I can generate dates manually using:
WITH RECURSIVE cnt(x) AS (
    SELECT 0
    UNION ALL
    SELECT x + 1
    FROM cnt
    LIMIT (SELECT ROUND(((julianday('2020-05-01') - julianday('2020-01-01')) / 30) + 1))
)
SELECT date(julianday('2020-01-01'), '+' || x || ' month') AS month
FROM cnt;
but wondering if there is a better way? Would it possibly be easier to create a snapshot table and generate the current state of each customer for each date?
If you have the dates, you can use a brute-force method. This determines the most recent status for each customer for each date:
select dc.date,
       sum(dc.as_of_status = 'paying') as customers
from (select distinct d.date, t.customer,
             first_value(t.status_new) over (partition by d.date, t.customer order by t.status_change_date desc) as as_of_status
      from dates d join
           t
           on t.status_change_date <= d.date
     ) dc
group by dc.date
order by dc.date;
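
The dates table here is assumed to contain one row per reporting date. Given the end-of-month definition in the question, a minimal SQLite sketch that generates the last day of each month from January 2020 up to the most recent complete month (the start date is an assumption):

WITH RECURSIVE dates(date) AS (
    -- last day of January 2020
    SELECT date('2020-01-01', 'start of month', '+1 month', '-1 day')
    UNION ALL
    -- step to the last day of the following month
    SELECT date(date, 'start of month', '+2 months', '-1 day')
    FROM dates
    WHERE date < date('now', 'start of month', '-1 day')
)
SELECT date FROM dates;

The result can feed the brute-force query above as its dates source, either as a CTE or saved into a temporary table.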

SQL interpolating missing values for a specific date range - with some conditions

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 1 | 10 | 12/17/2017 | 1/17/2018 |
| 1 | 10 | 1/18/2018 | 2/18/2018 |
| 1 | 10 | 2/19/2018 | 3/19/2018 |
| 1 | 10 | 3/20/2018 | 4/20/2018 |
| 1 | 10 | 4/21/2018 | 5/21/2018 |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
| 2 | 10 | 3/19/2019 | 4/19/2019 |
+----+--------+--------------+--------------+
What I need is a statement that will:
Identify where this year-end period is missing (but not find missing
months that aren't at the beginning/end of the year).
Create this interval by using the length of an existing interval for
that ID (maybe using the mean interval length for the ID to do it?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at, say, 1/16/2015, I need the amount for 12/15/2014-1/15/2015).
Interpolate an 'amount' for this interval using the mean daily
'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 12/16/2018 | 1/16/2019 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d
You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) computes the average interval length for each ID; and the third (MISSING) attempts to find the start (from t2) and end (from t3) dates of adjoining intervals for any missing (indicated by t1.ID IS NULL) end-of-year interval. The output of this CTE is then used in an INSERT ... SELECT query to add the missing interval records to the table, generating missing dates by adding/subtracting the interval length to the end/start date of the adjacent interval as necessary.
First, though, we add the interp column to indicate whether a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp to 1 for the new rows:
WITH EOYS AS (
SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
FROM Table1
),
INTERVALS AS (
SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
FROM Table1
GROUP BY ID
),
MISSING AS (
SELECT e.eoy,
ids.ID,
i.interval_len,
COALESCE(t2.amount, t3.amount) AS amount,
DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
DATEADD(DAY, -1, t3.interval_beg) AS interval_end
FROM EOYS e
CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
JOIN INTERVALS i ON i.ID = ids.ID
LEFT JOIN Table1 t1 ON ids.ID = t1.ID
AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
LEFT JOIN Table1 t2 ON ids.ID = t2.ID
AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
LEFT JOIN Table1 t3 ON ids.ID = t3.ID
AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
amount,
COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
1 AS interp
FROM MISSING
This adds the following rows to the table:
ID amount interval_beg interval_end interp
2 10 2017-12-05 2018-01-04 1
2 10 2018-12-16 2019-01-16 1
2 10 2019-12-28 2020-01-27 1
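
Since the interp flag marks every generated row, it also covers the "nice to have" from the question; for example, to review just the interpolated intervals:

SELECT ID, amount, interval_beg, interval_end
FROM Table1
WHERE interp = 1
ORDER BY ID, interval_beg;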
Demo on SQLFiddle

SQL: Get an aggregate (SUM) of a calculation of two fields (DATEDIFF) that has conditional logic (CASE WHEN)

I have a dataset that includes a bunch of stay data (at a hotel). Each row contains a start date and an end date, but no duration field. I need to get a sum of the durations.
Sample Data:
| Stay ID | Client ID | Start Date | End Date |
| 1 | 38 | 01/01/2018 | 01/31/2019 |
| 2 | 16 | 01/03/2019 | 01/07/2019 |
| 3 | 27 | 01/10/2019 | 01/12/2019 |
| 4 | 27 | 05/15/2019 | NULL |
| 5 | 38 | 05/17/2019 | NULL |
There are some added complications:
I am using Crystal Reports and this is a SQL Expression, which obeys slightly different rules. Basically, it returns a single scalar value. Here is some more info: http://www.cogniza.com/wordpress/2005/11/07/crystal-reports-using-sql-expression-fields/
Sometimes, the end date field is blank (they haven't booked out yet). If blank, I would like to replace it with the current timestamp.
I only want to count nights that have occurred in the past year. If the start date of a given stay is more than a year ago, I need to adjust it.
I need to get a sum by Client ID
I'm not actually any good at SQL so all I have is guesswork.
The proper syntax for a Crystal Reports SQL Expression is something like this:
(
SELECT (CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
And that's giving me the correct value for a single row, if I wanted to do this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 210 | // only days since June 4 2018 are counted
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 2 |
| 4 | 27 | 05/15/2019 | NULL | 21 |
| 5 | 38 | 05/17/2019 | NULL | 19 |
But I want to get the SUM of Duration per client, so I want this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 229 | // 210+19
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 23 | // 2+21
| 4 | 27 | 05/15/2019 | NULL | 23 |
| 5 | 38 | 05/17/2019 | NULL | 229 |
I've tried to just wrap a SUM() around my CASE but that doesn't work:
(
SELECT SUM(CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
It gives me an error that the StayDateEnd is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. But I don't even know what that means, so I'm not sure how to troubleshoot, or where to go from here. And then the next step is to get the SUM by Client ID.
Any help would be greatly appreciated!
Although the explanation and data set are hard to reconcile, I think this is an approximation of what you want.
declare #your_data table (StayId int, ClientId int, StartDate date, EndDate date)
insert into #your_data values
(1,38,'2018-01-01','2019-01-31'),
(2,16,'2019-01-03','2019-01-07'),
(3,27,'2019-01-10','2019-01-12'),
(4,27,'2019-05-15',NULL),
(5,38,'2019-05-17',NULL)
;with data as (
select *,
datediff(day,
case
when datediff(day,StartDate,getdate())>365 then dateadd(year,-1,getdate())
else StartDate
end,
isnull(EndDate,getdate())
) days
from #your_data
)
select *,
    sum(days) over (partition by ClientId) as total_days
from data
https://rextester.com/HCKOR53440
You need a subquery for the sum, based on a GROUP BY on client_id, and a join between your table and the subquery, e.g.:
select Stay_id, client_id, Start_date, End_date, t.sum_duration
from your_table
inner join (
    select Client_id,
           SUM(CASE
               WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
               ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
           END) sum_duration
    from your_table
    group by Client_id
) t on t.Client_id = your_table.client_id
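
Note that a Crystal Reports SQL Expression has to come back as a single scalar value per record, so the grouped join above can't be pasted in as-is. The usual shape is a correlated scalar subquery instead; a rough sketch, where the table name Stays, the column names, and the "Stays"."ClientID" correlation syntax are all assumptions to adapt to your actual schema:

(
SELECT SUM(CASE
        WHEN s.StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(s.StayDateEnd,CURRENT_TIMESTAMP))
        ELSE DATEDIFF(day,s.StayDateStart,ISNULL(s.StayDateEnd,CURRENT_TIMESTAMP))
    END)
FROM Stays s
WHERE s.ClientID = "Stays"."ClientID"
)

This returns, for each report record, the summed duration for that record's client, which is the shape the desired output shows.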

SQL query to select today and previous day's price

I have historic stock price data that looks like the below. I want to generate a new table that has one row for each ticker with the most recent day's price and its previous day's price. What would be the best way to do this? My database is Postgres.
+--------+-------+------------+
| ticker | price | date       |
+--------+-------+------------+
| AAPL   | 6     | 10-23-2015 |
| AAPL   | 5     | 10-22-2015 |
| AAPL   | 4     | 10-21-2015 |
| AXP    | 5     | 10-23-2015 |
| AXP    | 3     | 10-22-2015 |
| AXP    | 5     | 10-21-2015 |
+--------+-------+------------+
You can do something like this:
with ranking as (
select ticker, price, dt,
rank() over (partition by ticker order by dt desc) as rank
from stocks
)
select * from ranking where rank in (1,2);
Example: http://sqlfiddle.com/#!15/e45ea/3
Results for your example will look like this:
| ticker | price | dt | rank |
|--------|-------|---------------------------|------|
| AAPL | 6 | October, 23 2015 00:00:00 | 1 |
| AAPL | 5 | October, 22 2015 00:00:00 | 2 |
| AXP | 5 | October, 23 2015 00:00:00 | 1 |
| AXP | 3 | October, 22 2015 00:00:00 | 2 |
If your table is large and you have performance issues, use a WHERE clause to restrict the data to the last 30 days or so.
Your best bet is to use a window function with an aggregated CASE statement to pivot the data.
You can see more on window functions here: http://www.postgresql.org/docs/current/static/tutorial-window.html
Below is a pseudo-code version of where you may need to head to answer your question (sorry, I couldn't validate it due to not having a Postgres database set up).
SELECT
    ticker,
    SUM(CASE WHEN rank = 1 THEN price ELSE 0 END) AS today,
    SUM(CASE WHEN rank = 2 THEN price ELSE 0 END) AS yesterday
FROM (
    SELECT
        ticker,
        price,
        date,
        rank() OVER (PARTITION BY ticker ORDER BY date DESC) AS rank
    FROM your_table) p
WHERE rank IN (1,2)
GROUP BY ticker;
Edit - Updated the case statement with an 'else'
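
For reference, since the question is on Postgres, the same result can also be had in a single pass by combining lag() with DISTINCT ON. A sketch assuming the stocks table from the first answer's fiddle (columns ticker, price, dt):

SELECT DISTINCT ON (ticker)
    ticker,
    price AS today,
    lag(price) OVER (PARTITION BY ticker ORDER BY dt) AS yesterday
FROM stocks
ORDER BY ticker, dt DESC;

The window function fills each row with the prior day's price, and DISTINCT ON then keeps only the most recent row per ticker.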