Compare data from different timestamps - sql

What I have is a table with restaurants, a restaurant is either active (1) or inactive (0). This can change weekly. I want to determine how many restaurants have been deactivated since last week. E.g. if a restaurant was active = 1 in Week 50, but active = 0 in Week 51 then it should be counted. Hence, I want to compare active on a weekly basis.
My table looks like this:
restaurant | week_nm | active | date
-----------------------------------------
rest1 | 50 | 1 | xxx-xx-xx
rest1 | 51 | 0 | xxx-xx-xx
rest2 | 50 | 1 | xxx-xx-xx
rest2 | 51 | 1 | xxx-xx-xx
rest3 | 50 | 1 | xxx-xx-xx
rest3 | 51 | 0 | xxx-xx-xx
What I want to have is this:
week_nm | restaurants_deactivated
---------------------------------
51 | 2
A count of restaurants that went from active = 1 to active = 0.

As jarlh commented correctly, taking the year change into account will be a bit complicated. So if the date column (btw: a horrible name for a column) correctly reflects the week in which the the restaurant was active or not, then you can use the ISO year/week combination (derived from that date) to properly deal with the year change:
select to_char(r1.date, 'iyyy-iw'), count(*)
from rest r1
where to_char(r1.date, 'iyyy-iw') = '2015-51'
and not active
and exists (select 1
from rest r2
where to_char(to_date(to_char(r1.date, 'iyyy-iw'), 'iyyy-iw') - 7, 'iyyy-iw') = to_char(r2.date, 'iyyy-iw')
and r2.restaurant = r1.restaurant
and r2.active)
group by to_char(date, 'iyyy-iw')
SQLFiddle: http://sqlfiddle.com/#!15/8da24/2
to_char(r1.date, 'iyyy-iw') calculates the year and week number based on the ISO definition of the week in the year. This returns e.g. 2015-51 for 2015-12-21.
The part:
where to_char(r1.date, 'iyyy-iw') = '2015-51'
and not active
retrieves all rows from week 51 where the restaurant was not active (this assumes that active is a boolean column).
The tricky part is to calculate the "previous" week. This is done using the expression:
to_char(to_date(to_char(r1.date, 'iyyy-iw'), 'iyyy-iw') - 7, 'iyyy-iw')
The "date" 2015-51 is converted back to a date, which result in the first day of that week. Then 7 days are subtracted and the result of that date, is then converted back into the year/week display. This is then used in the co-related subquery. The effect of that is, that it returns all rows that have been active in the "previous" week (where exists (...))
This should work from December to January as well (just keep in mind that e.g. the ISO week #1 in 2016 starts on January 4th).

try this
select top 1 (a.[week_nm]) as [week_nm], (select count(b.active) from [as] b where b.[week_nm] = a.[week_nm] and b.active=0) as restaurants_deactivated from [as] a order by a.[week_nm] desc

Related

Running sum of unique users in redshift

I have a table with as follows with user visits by day -
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I am looking to get a running total of unique user_id for the last 30 days . Here is the expected output -
| date | distinct_users|
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried -
SELECT date
, SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
FROM mytable
GROUP BY date
ORDER BY date DESC
The problem I am running into is that this query not counting the unique user_id - for instance the result I am getting for 01/31/23 is 9 instead of 5 as it is counting user_id 'a' every time it occurs.
Thank you, appreciate your help!
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
SELECT
date,
(SELECT COUNT(DISTINCT t2.user_id)
FROM mytable t2
WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
FROM mytable t1
ORDER BY date;
There are a few things going on here. First window functions run after group by and aggregation. So COUNT(DISTINCT user_id) gives the count of user_ids for each date then the window function runs. Also, window function set up like this work over the past 30 rows, not 30 days so you will need to fill in missing dates to use them.
As to how to do this - I can only think of the "expand to the data so each date and id has a row" method. This will require a CTE to generate the last 2 years of dates plus 30 days so that the look-back window works for the first dates. Then window over the past 30 days for each user_id and date to see which rows have an example of this user_id within the past 30 days, setting the value to NULL if no uses of the user_id are present within the window. Then Count the user_ids counts (non NULL) grouping by just date to get the number of unique user_ids for that date.
This means expanding the data significantly but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need but will look something like:
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with a distinct set of all the user_ids in user for the past 2 years,
CTE to join the date/user_id data set with the table of real data for past 2 years and 30 days and window back counting non-NULL user_ids, partition by date and user_id, order by date, and setting any zero counts to NULL with a DECODE() or CASE statement,
SELECT, grouping by just date count the user_ids by date;

SQL counting distinct users over a growing timeframe

I don't think I properly titled this, but in essence I'm wanting to be able to count distinct users but have those previous distinct users be considered as time goes on. As an example, say we have a dataset of user purchases over time:
Date | User
-----------------
2/3/22 | A
2/4/22 | B
2/22/22 | C
3/2/22 | A
3/4/22 | D
3/15/22 | A
4/30/22 | B
Generally, if I were to count distincts grouped by months as would be normal we would get:
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 2
4/1/22 | 1
But what I'm really wanting to see would be how the total number of distinct users increases over the time period.
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 4
4/1/22 | 4
As such it would be 3 distinct users for the first month. Then 4 for the second month considering the total number of distinct users grew by one with the addition of "D" while "A" isn't counted because it was already recognized as a distinct user in the previous month. The third month would then still be 4 because no new distinct user performed an action that month.
Any help would be greatly appreciated (even if it is just a better title so that it reaches more people more appropriately haha)
here's a solution based on running sum in Postgres that should translate well to Vertica.
select date_trunc('month', "Date") as "Date"
,sum(count(case rn when 1 then 1 end)) over (order by date_trunc('month', "Date")) as "Count"
from (
select "Date"
,"User"
,row_number() over(partition by "User" order by "Date") as rn
from t
) t
group by date_trunc('month', "Date")
order by "Date"
Date
Count
2022-02-01 00:00:00
3
2022-03-01 00:00:00
4
2022-04-01 00:00:00
4
Fiddle

SQL - BigQuery - How do I fill in dates from a calendar table?

My goal is to join a sales program table to a calendar table so that there would be a joined table with the full trailing 52 weeks by day, and then the sales data would be joined to it. The idea would be that there are nulls I could COALESCE after the fact. However, my problem is that I only get results without nulls from my sales data table.
The questions I've consulted so far are:
Join to Calendar Table - 5 Business Days
Joining missing dates from calendar table Which points to
MySQL how to fill missing dates in range?
My Calendar table is all 364 days previous to today (today being day 0). And the sales data has a program field, a store field, and then a start date and an end date for the program.
Here's what I have coded:
SELECT
CAL.DATE,
CAL.DAY,
SALES.ITEM,
SALES.PROGRAM,
SALES.SALE_DT,
SALES.EFF_BGN_DT,
SALES.EFF_END_DT
FROM
CALENDAR_TABLE AS CAL
LEFT JOIN
SALES_TABLE AS SALES
ON CAL.DATE = SALES.SALE_DT
WHERE 1=1
and SALES.ITEM = 1 or SALES.ITEM is null
ORDER BY DATE ASC
What I expected was 365 records with dates where there were nulls and dates where there were filled in records. My query resulted in a few dates with null values but otherwise just the dates where a program exists.
DATE | ITEM | PROGRAM | SALE_DT | PRGM_BGN | PRGM_END |
----------|--------|---------|----------|-----------|-----------|
8/27/2020 | | | | | |
8/26/2020 | | | | | |
8/25/2020 | | | | | |
8/24/2020 | | | | | |
6/7/2020 | 1 | 5 | 6/7/2020 | 2/13/2016 | 6/7/2020 |
6/6/2020 | 1 | 5 | 6/6/2020 | 2/13/2016 | 6/7/2020 |
6/5/2020 | 1 | 5 | 6/5/2020 | 2/13/2016 | 6/7/2020 |
6/4/2020 | 1 | 5 | 6/4/2020 | 2/13/2016 | 6/7/2020 |
Date = Calendar day.
Item = Item number being sold.
Program = Unique numeric ID of program.
Sale_Dt = Field populated if at least one item was sold under this program.
Prgm_bgn = First day when item was eligible to be sold under this program.
Prgm_end = Last day when item was eligible to be sold under this program.
What I would have expected would have been records between June 7 and August 24 which just had the DATE column populated for each day and null values as what happens in the most recent four records.
I'm trying to understand why a calendar table and what I've written are not providing the in-between dates.
EDIT: I've removed the request for feedback to shorten the question as well as an example I don't think added value. But please continue to give feedback as you see necessary.
I'd be more than happy to delete this whole question or have someone else give a better answer, but after staring at the logic in some of the answers in this thread (MySQL how to fill missing dates in range?) long enough, I came up with this:
SELECT
CAL.DATE,
t.* EXCEPT (DATE)
FROM
CALENDER_TABLE AS CAL
LEFT JOIN
(SELECT
CAL.DATE,
CAL.DAY,
SALES.ITEM,
SALES.PROGRAM,
SALES.SALE_DT,
SALES.EFF_BGN_DT,
SALES.EFF_END_DT
FROM
CALENDAR_TABLE AS CAL
LEFT JOIN
SALES_TABLE AS SALES
ON CAL.DATE = SALES.SALE_DT
WHERE 1=1
and SALES.ITEM = 1 or SALES.ITEM is null
ORDER BY DATE ASC) **t**
ON CAL.DATE = t.DATE
From what I can tell, it seems to be what I needed. It allows for the subquery to connect a date to all those records, then just joins on the calendar table again solely on date to allow for those nulls to be created.

Best Practice in Scenario

I am currently trying to accomplish the following:
get the Last Weekstamp for the last 6 Months, the following ilustrates how the end result might look like:
Month | Weekstamp |
2013-12| 2013-52 |
2014-01| 2014-05 |
.... and so on
I have a auxiliary Table, which has all Weeks in it and allows me to connect to a Calender Table, which in turn has all months, meaning i am able to get all weekstamps per Month,
but how do i get all of the Last Week Numbers for the Last 6 Months ?
my idea was a Temporary table of some sor (never used one, am a beginner when it Comes to SQL)
which calculates all of the Weekstamps needing to be filtered out per month, and than gives out only values which i could than use to filter a query which contains all the data i Need.
Anybody have a better idea?
As i said I am just a beginner so i can't really say what the best way would be
Thanks a lot in Advance!
My guess is that your challenge is determining what the last six months are. To do this you can use a tally table (spt_values) and DateDiff to determine when the last six months are.
You can also depending on which DB and version easily do this without a calander or weeks table.
This
WITH rnge
AS (SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number > 0
AND number < 7),
dates
AS (SELECT EOMONTH(Dateadd(m, number * -1, Getdate())) dt
FROM rnge)
SELECT Year(dt) year,
Month(dt) month,
Datepart(wk, dt) week
FROM dates
Produces this output
| YEAR | MONTH | WEEK |
|------|-------|------|
| 2014 | 1 | 5 |
| 2013 | 12 | 53 |
| 2013 | 11 | 48 |
| 2013 | 10 | 44 |
| 2013 | 9 | 40 |
| 2013 | 8 | 35 |
Demo
I'll leave it to you to format the values
This assumes SQL Server 2012 since it uses EOMONTH see Get the last day of the month in SQL for previous versions of SQL Server

Find last (first) instance in table but exclude most recent (oldest) date

I have a table that reflects a monthly census of a certain population. Each month on an unpredictable day early in that month, the population is polled. Any member who existed at that point is included in that month's poll, any member who didn't is not.
My task is to look through an arbitrary date range and determine which members were added or lost during that time period. Consider the sample table:
ID | Date
2 | 1/3/2010
3 | 1/3/2010
1 | 2/5/2010
2 | 2/5/2010
3 | 2/5/2010
1 | 3/3/2010
3 | 3/3/2010
In this case, member with ID "1" was added between Jan and Feb, and member with ID 2 was lost between Feb and Mar.
The problem I am having is that if I just poll to try and find the most recent entry, I will capture all the members that were dropped, but also all the members that exist on the last date. For example, I could run this query:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
This would return:
ID | Date
1 | 3/3/2010
2 | 2/5/2010
3 | 3/3/2010
What I actually want, however, is just:
ID | Date
2 | 2/5/2010
Of course I can manually filter out the last date, but since the start and end date are parameters I want to generalize that. One way would be to run sequential queries. In the first query I'd find the last date, and then use that to filter in the second query. It would really help, however, if I could wrap this logic into a single query.
I'm also having a related problem when I try to find when a member was first added to the population. In that case I'm using a different type of query:
SELECT
ID,
Date
FROM
tableName i
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
AND
NOT EXISTS(
SELECT
ID,
Date
FROM
tableName ii
WHERE
ii.ID=i.ID
AND
ii.Date < i.Date
AND
Date BETWEEN '1/1/2010' AND '3/27/2010'
)
This returns:
ID | Date
1 | 2/5/2010
2 | 1/1/2010
3 | 1/1/2010
But what I want is:
ID | Date
1 | 2/5/2010
I would like to know:
1. Which approach (the MAX() or the subquery with NOT EXISTS) is more efficient and
2. How to fix the queries so that they only return the rows I want, excluding the first (last) date.
Thanks!
You could do something like this:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
having max(date) < '3/1/2010'
This filters out anyone polled in March.