How do you identify date ranges/intervals for specified values? - sql

I have a set of data that tells me the owner of something for each date; sample data is below. There are some breaks in the date column.
| owner | date |
|-------------+-------------+
| Samantha | 2010-01-02 |
| Max | 2010-01-03 |
| Max | 2010-01-04 |
| Max | 2010-01-06 |
| Max | 2010-01-07 |
| Conor | 2010-01-08 |
| Conor | 2010-01-09 |
| Conor | 2010-01-10 |
| Conor | 2010-01-11 |
| Abigail | 2010-01-12 |
| Abigail | 2010-01-13 |
| Abigail | 2010-01-14 |
| Abigail | 2010-01-15 |
| Max | 2010-01-17 |
| Max | 2010-01-18 |
| Abigail | 2010-01-20 |
| Conor | 2010-01-21 |
I am trying to write a query that captures the date range of each owner's interval, such as:
| owner | start | end |
|-------------+------------+------------+
| Samantha | 2010-01-02 | 2010-01-02 |
| Max | 2010-01-03 | 2010-01-04 |
| Max | 2010-01-06 | 2010-01-07 |
| Conor | 2010-01-08 | 2010-01-11 |
| Abigail | 2010-01-12 | 2010-01-15 |
| Max | 2010-01-17 | 2010-01-18 |
| Abigail | 2010-01-20 | 2010-01-20 |
| Conor | 2010-01-21 | 2010-01-21 |
I tried to think of this using min() and max() but I am stuck. I feel like I need to use lead() and lag(), but I am not sure how to use them to get the output I want. Any ideas? Thanks in advance!

This is a typical gaps-and-island problem. Here is one way to solve it using row_number():
select owner, min(date) as start_date, max(date) as end_date
from (
    select
        owner,
        date,
        row_number() over(order by date) rn1,
        row_number() over(partition by owner order by date) rn2
    from mytable
) t
group by owner, rn1 - rn2
order by start_date
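This works by ranking records by date over two different partitions (within the whole table and within groups having the same owner). The difference between the two ranks stays constant for as long as consecutive rows belong to the same owner, so it identifies the group each record belongs to. You can run the inner query and look at the results to understand the logic. Note that this only looks at row order: consecutive rows for the same owner stay in one group even when a date is missing between them, so the break at 2010-01-05 in the sample does not start a new range.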
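To look at the two row numbers side by side, the inner query can be run on its own. The following is a minimal sketch in Python with the standard sqlite3 module, assuming a SQLite build of 3.25 or later (needed for window functions) and a hypothetical in-memory mytable holding only part of the sample data:

```python
import sqlite3

# Illustrative subset of the sample data (not the full table from the question).
rows = [
    ("Samantha", "2010-01-02"),
    ("Max", "2010-01-03"), ("Max", "2010-01-04"),
    ("Max", "2010-01-06"), ("Max", "2010-01-07"),
    ("Conor", "2010-01-08"), ("Conor", "2010-01-09"),
    ("Max", "2010-01-17"), ("Max", "2010-01-18"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (owner TEXT, date TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

# Same inner query as in the answer, with the date column included.
inner = conn.execute("""
    SELECT owner, date,
           ROW_NUMBER() OVER (ORDER BY date) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY owner ORDER BY date) AS rn2
    FROM mytable
    ORDER BY date
""").fetchall()

# rn1 - rn2 is the group key used by the outer GROUP BY.
diffs = [(owner, date, rn1 - rn2) for owner, date, rn1, rn2 in inner]
for owner, date, diff in diffs:
    print(owner, date, diff)
```

Rows that share both an owner and a difference end up in the same group of the outer query.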

This is a gaps-and-islands problem. You want to solve it by subtracting a sequential value from the date and aggregating:
select owner, min(date), max(date)
from (select t.*,
row_number() over (partition by owner order by date) as seqnum
from t
) t
group by owner, (date - seqnum * interval '1 day')
order by min(date);
The magic is that the date minus the sequence number stays constant for as long as the dates increase by exactly one day at a time, and changes as soon as a date is skipped.
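The interval syntax above is not available in every database, so here is a minimal sketch of the same idea for trying it out locally. It assumes SQLite 3.25+ through Python's sqlite3 module and swaps the `date - seqnum * interval '1 day'` expression for an integer Julian day number minus the row number; the table name t follows the answer, the rest is illustrative:

```python
import sqlite3

# Sample data from the question.
rows = [
    ("Samantha", "2010-01-02"),
    ("Max", "2010-01-03"), ("Max", "2010-01-04"),
    ("Max", "2010-01-06"), ("Max", "2010-01-07"),
    ("Conor", "2010-01-08"), ("Conor", "2010-01-09"),
    ("Conor", "2010-01-10"), ("Conor", "2010-01-11"),
    ("Abigail", "2010-01-12"), ("Abigail", "2010-01-13"),
    ("Abigail", "2010-01-14"), ("Abigail", "2010-01-15"),
    ("Max", "2010-01-17"), ("Max", "2010-01-18"),
    ("Abigail", "2010-01-20"),
    ("Conor", "2010-01-21"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (owner TEXT, date TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Day number minus per-owner row number is constant within a run of
# consecutive days, so it can serve as the island key.
islands = conn.execute("""
    SELECT owner, MIN(date) AS start_date, MAX(date) AS end_date
    FROM (
        SELECT owner, date,
               CAST(julianday(date) AS INTEGER)
                 - ROW_NUMBER() OVER (PARTITION BY owner ORDER BY date) AS grp
        FROM t
    )
    GROUP BY owner, grp
    ORDER BY start_date
""").fetchall()

for island in islands:
    print(island)
```

Because the key is derived from the date itself, a skipped day starts a new island even when the owner does not change.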

Related

How can I create a column with the number of months since a user joined (first date for each id), and group all ids that joined on that month in SQL?

Here's some more detail, since it's a bit hard to clearly ask this question in a sentence:
Basically, I have a table with some of the following fields:
| ID | date | start_date | amount_paid | last_amount_paid | field |
| -------- | ---------------------| ----------------------| ----------- | ---------------- | ---------- |
| ID_00001 | 2020-08-01 00:00:00 | 2019-11-06 20:23:36 | 0 | 0 | cosmetics |
| ID_00002 | 2020-08-02 00:00:00 | 2018-10-06 10:34:21 | 10 | 0 | finance |
| ... | ... | ... | ... | ... | ... |
| ID_99999 | 2021-11-06 00:00:00 | 2020-08-01 11:54:47 | 15 | 10 | software |
What I want is to add a "months" column that counts the number of months between the start date and date for each ID, for example:
| ID | date | start_date | ... | months |
| -------- | ---------------------| ----------------------| ---- | ---------- |
| ID_00001 | 2020-08-01 00:00:00 | 2019-11-06 20:23:36 | ... | 9 |
| ID_00002 | 2020-08-02 00:00:00 | 2018-10-06 10:34:21 | ... | 22 |
| ... | ... | ... | ... | ... |
| ID_99999 | 2021-11-06 00:00:00 | 2020-08-01 11:54:47 | ... | 15 |
I then want to group all IDs that have started (first start date) at the same time together (i.e. I want to group users by number of months).
I'm having a difficult time wrapping my mind around doing this using SnowflakeSQL.
The goal here is basically to track revenue by cohorts based on when they joined. Please let me know if my approach is wrong and how you would go about implementing that.
Much appreciated!
Using computed/generated column and DATEDIFF:
CREATE OR REPLACE TABLE t(id TEXT,
date DATE,
start_date DATE,
months INT AS (DATEDIFF(MONTH, start_date, date))
);
Sample data:
INSERT INTO t(id, date, start_date)
SELECT 'ID_00001', '2020-08-01 00:00:00', '2019-11-06 20:23:36'
UNION SELECT 'ID_00002', '2020-08-02 00:00:00', '2018-10-06 10:34:21';
SELECT * FROM t;
Alternatively, wrap the table with a view:
CREATE VIEW t_vw
AS
SELECT t.id, t.start_date, t.date, DATEDIFF(MONTH, start_date, date) AS months
FROM t;
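For checking the expected values by hand, the month count can be reproduced outside the database. This is a minimal sketch in Python, assuming that DATEDIFF(MONTH, ...) counts the calendar-month boundaries crossed and ignores the day and time parts (which is how I understand Snowflake's behaviour); months_between is a hypothetical helper name, not a Snowflake function:

```python
from datetime import datetime


def months_between(start: datetime, end: datetime) -> int:
    """Count month boundaries crossed; day and time of day are ignored."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# The three rows shown in the question: (start_date, date).
samples = [
    (datetime(2019, 11, 6, 20, 23, 36), datetime(2020, 8, 1)),
    (datetime(2018, 10, 6, 10, 34, 21), datetime(2020, 8, 2)),
    (datetime(2020, 8, 1, 11, 54, 47), datetime(2021, 11, 6)),
]

months = [months_between(start, end) for start, end in samples]
print(months)
```

Under that assumption the results agree with the months column the question asks for (9, 22 and 15).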

How to query table to get current total count including previous dates?

I want to get the current total count of registered users by the day in an SQL Database from this data:
| userID | date_registered |
| -------- | --------------- |
| 10012 | 2021-03-01 |
| 10043 | 2021-03-01 |
| 10065 | 2021-03-04 |
| 10087 | 2021-03-05 |
| 10091 | 2021-03-05 |
| 10123 | 2021-03-05 |
| 10231 | 2021-03-06 |
| 10421 | 2021-03-09 |
So for 2021-03-01, there are currently 2 registered users.
For 2021-03-04, there are currently 3 registered users (including registrations from previous dates).
For 2021-03-05, there are currently 6 registered users (including registrations from previous dates),
and so on...
So the expected result should be
| total_user | date |
| ---------- | --------------- |
| 2 | 2021-03-01 |
| 3 | 2021-03-04 |
| 6 | 2021-03-05 |
| 7 | 2021-03-06 |
| 8 | 2021-03-09 |
Is there an SQL query possible to accomplish this result in BigQuery?
Any help is much appreciated.
In BigQuery or any reasonable database, we can aggregate by date and then use SUM as an analytic function:
SELECT
SUM(COUNT(*)) OVER (ORDER BY date_registered) AS total_user,
date_registered AS date
FROM yourTable
GROUP BY
date_registered
ORDER BY
date_registered;
Note that if the same user might be reported more than once on a given date, then use COUNT(DISTINCT userID) instead of COUNT(*).
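To check the running total against the expected result, here is a minimal sketch using Python's sqlite3 module (assuming SQLite 3.25+ for window functions). It splits the query into two steps, a per-day count in a subquery and a windowed SUM over it, which should be equivalent to SUM(COUNT(*)) OVER (...); the table name users is hypothetical:

```python
import sqlite3

# Sample data from the question.
rows = [
    (10012, "2021-03-01"), (10043, "2021-03-01"),
    (10065, "2021-03-04"),
    (10087, "2021-03-05"), (10091, "2021-03-05"), (10123, "2021-03-05"),
    (10231, "2021-03-06"),
    (10421, "2021-03-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userID INTEGER, date_registered TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

# Step 1 (subquery): registrations per day.
# Step 2 (outer query): cumulative sum of those counts, ordered by day.
running = conn.execute("""
    SELECT SUM(n) OVER (ORDER BY date_registered) AS total_user,
           date_registered AS date
    FROM (
        SELECT date_registered, COUNT(*) AS n
        FROM users
        GROUP BY date_registered
    )
    ORDER BY date_registered
""").fetchall()

for total_user, day in running:
    print(total_user, day)
```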
You can also use a correlated subquery, although MySQL 8+ offers more practical ways (window functions):
SELECT e1.date_registered,
       (SELECT COUNT(e2.userID)
        FROM example e2
        WHERE e2.date_registered <= e1.date_registered) AS count_
FROM example e1
GROUP BY e1.date_registered

SQL (Redshift) get start and end values for consecutive data in a given column

I have a table that has the subscription state of users on any given day. The data looks like this
+------------+------------+--------------+
| account_id | date | current_plan |
+------------+------------+--------------+
| 1 | 2019-08-01 | free |
| 1 | 2019-08-02 | free |
| 1 | 2019-08-03 | yearly |
| 1 | 2019-08-04 | yearly |
| 1 | 2019-08-05 | yearly |
| ... | | |
| 1 | 2020-08-02 | yearly |
| 1 | 2020-08-03 | free |
| 2 | 2019-08-01 | monthly |
| 2 | 2019-08-02 | monthly |
| ... | | |
| 2 | 2019-08-31 | monthly |
| 2 | 2019-09-01 | free |
| ... | | |
| 2 | 2019-11-26 | free |
| 2 | 2019-11-27 | monthly |
| ... | | |
| 2 | 2019-12-27 | monthly |
| 2 | 2019-12-28 | free |
+------------+------------+--------------+
I would like to have a table that gives the start and end dates of a subscription. It would look something like this:
+------------+------------+------------+-------------------+
| account_id | start_date | end_date | subscription_type |
+------------+------------+------------+-------------------+
| 1 | 2019-08-03 | 2020-08-02 | yearly |
| 2 | 2019-08-01 | 2019-08-31 | monthly |
| 2 | 2019-11-27 | 2019-12-27 | monthly |
+------------+------------+------------+-------------------+
I started by using a LAG window function with a bunch of WHERE conditions to grab the "state changes", but this makes it difficult to see when customers float in and out of subscriptions, and I'm not sure this is the best method.
with lag as (
    select *,
           LAG(current_plan) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan,
           LAG(date) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan_date
    from data
)
SELECT *
FROM lag
where (current_plan = 'free' and previous_plan in ('monthly', 'yearly'))
This is a gaps-and-islands problem. I think a difference of row numbers works:
select account_id, current_plan as subscription_type, min(date) as start_date, max(date) as end_date
from (select d.*,
             row_number() over (partition by account_id order by date) as seqnum,
             row_number() over (partition by account_id, current_plan order by date) as seqnum_2
      from data d
     ) d
where current_plan <> 'free'
group by account_id, current_plan, (seqnum - seqnum_2);
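A small way to try the difference-of-row-numbers query without a Redshift cluster is sketched below. It assumes SQLite 3.25+ via Python's sqlite3 module and uses a made-up, much shorter data table than the one in the question, with the 'free' literal quoted:

```python
import sqlite3

# Hypothetical miniature of the subscription table.
rows = [
    (1, "2019-08-01", "free"), (1, "2019-08-02", "free"),
    (1, "2019-08-03", "yearly"), (1, "2019-08-04", "yearly"),
    (1, "2019-08-05", "free"),
    (2, "2019-08-01", "monthly"), (2, "2019-08-02", "monthly"),
    (2, "2019-08-03", "free"),
    (2, "2019-08-04", "monthly"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (account_id INTEGER, date TEXT, current_plan TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# seqnum counts rows per account, seqnum_2 counts rows per account and plan;
# their difference is constant within one uninterrupted stretch of a plan.
subscriptions = conn.execute("""
    SELECT account_id, MIN(date) AS start_date, MAX(date) AS end_date,
           current_plan AS subscription_type
    FROM (
        SELECT d.*,
               ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY date) AS seqnum,
               ROW_NUMBER() OVER (PARTITION BY account_id, current_plan
                                  ORDER BY date) AS seqnum_2
        FROM data d
    ) d
    WHERE current_plan <> 'free'
    GROUP BY account_id, current_plan, (seqnum - seqnum_2)
    ORDER BY account_id, start_date
""").fetchall()

for row in subscriptions:
    print(row)
```

Account 2 comes back as two separate monthly subscriptions because the free day in between changes the difference between the two row numbers.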

Returning rows where the WHERE clause was not an exact match

I apologise in advance for the question title. I really struggled to write something succinct!
I have a table similar to the following:
| Item | Date | Value |
| A | 2018-12-01 | 1 |
| B | 2018-12-01 | 2 |
| C | 2018-12-01 | 2 |
| A | 2018-12-02 | 3 |
| B | 2018-12-02 | 3 |
I would like to write a query so that when I give it a particular date, it returns one row for each unique Item in the table, with its Value on the given date or, if it was not observed on the given date, its Value the last time it was observed.
So with the above table, if I supply 2018-12-01 it will return:
| Item | Date | Value |
| A | 2018-12-01 | 1 |
| B | 2018-12-01 | 2 |
| C | 2018-12-01 | 2 |
but if I supply 2018-12-02 it will return:
| Item | Date | Value |
| A | 2018-12-02 | 3 |
| B | 2018-12-02 | 3 |
| C | 2018-12-01 | 2 |
You can use distinct on:
select distinct on (item) t.*
from t
where date <= $your_date
order by item, date desc;
One approach could be to query all the rows at or before that date and then use the rank window function to take the first one per item:
SELECT item, date, value
FROM (SELECT item, date, value, RANK() OVER (PARTITION BY item ORDER BY date DESC) AS rk
FROM mytable
WHERE date <= :param_date) t
WHERE rk = 1
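The window-function approach can be tried on the sample table as follows. This is a minimal sketch assuming SQLite 3.25+ through Python's sqlite3 module; it uses ROW_NUMBER() rather than RANK(), which gives the same result here because each item has at most one row per date, and latest_as_of is a hypothetical helper name:

```python
import sqlite3

# Sample data from the question.
rows = [
    ("A", "2018-12-01", 1),
    ("B", "2018-12-01", 2),
    ("C", "2018-12-01", 2),
    ("A", "2018-12-02", 3),
    ("B", "2018-12-02", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (item TEXT, date TEXT, value INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)


def latest_as_of(connection, as_of):
    """Return each item's most recent row dated on or before as_of."""
    return connection.execute("""
        SELECT item, date, value
        FROM (
            SELECT item, date, value,
                   ROW_NUMBER() OVER (PARTITION BY item ORDER BY date DESC) AS rn
            FROM mytable
            WHERE date <= ?
        )
        WHERE rn = 1
        ORDER BY item
    """, (as_of,)).fetchall()


first_day = latest_as_of(conn, "2018-12-01")
second_day = latest_as_of(conn, "2018-12-02")
print(first_day)
print(second_day)
```

Items with no row on or before the supplied date are simply absent from the result.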

SQLite: Get all dates between dates

I need some help with a SQL (in particular: SQLite) related problem. I have a table 'vacation'
CREATE TABLE vacation(
    name TEXT,
    from_date TEXT,
    to_date TEXT
);
where I store the dates (YYYY-MM-DD) when somebody leaves for vacation and comes back again. Now, I would like to get a distinct list of all dates on which somebody is on vacation. Let's assume my table looks like:
+------------+-------------+------------+
| name | from_date | to_date |
+------------+-------------+------------+
| Peter | 2013-07-01 | 2013-07-10 |
| Paul | 2013-06-30 | 2013-07-05 |
| Simon | 2013-05-10 | 2013-05-15 |
+------------+-------------+------------+
The result from the query should look like:
+------------------------------+
| dates_people_are_on_vacation |
+------------------------------+
| 2013-05-10 |
| 2013-05-11 |
| 2013-05-12 |
| 2013-05-13 |
| 2013-05-14 |
| 2013-05-15 |
| 2013-06-30 |
| 2013-07-01 |
| 2013-07-02 |
| 2013-07-03 |
| 2013-07-04 |
| 2013-07-05 |
| 2013-07-06 |
| 2013-07-07 |
| 2013-07-08 |
| 2013-07-09 |
| 2013-07-10 |
+------------------------------+
I thought about using a date table 'all_dates'
CREATE TABLE all_dates(
date_entry TEXT
);
which covers a 20 year time span (2010-01-01 to 2030-01-01) and the following query:
SELECT date_entry FROM all_dates WHERE date_entry BETWEEN (SELECT from_date FROM vacation) AND (SELECT to_date FROM vacation);
However, if I apply this query to the above dataset, I only get a fraction of my desired result:
+------------------------------+
| dates_people_are_on_vacation |
+------------------------------+
| 2013-07-01 |
| 2013-07-02 |
| 2013-07-03 |
| 2013-07-04 |
| 2013-07-05 |
| 2013-07-06 |
| 2013-07-07 |
| 2013-07-08 |
| 2013-07-09 |
| 2013-07-10 |
+------------------------------+
Can it be done with SQLite? Or is it better if I just return the 'from_date' and 'to_date' columns and fill the gaps between these dates in my Python application?
Any help is appreciated!
You can try this, which joins every vacation row to the calendar dates it covers and then removes duplicates:
SELECT date_entry
FROM vacation
JOIN all_dates ON date_entry BETWEEN from_date AND to_date
GROUP BY date_entry
ORDER BY date_entry
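Since the question mentions a Python application, here is a minimal sketch that runs the join through the standard sqlite3 module. It assumes from_date is the first vacation day and to_date the last, and it fills a hypothetical all_dates table with the days of 2013 only, rather than the 20-year span mentioned in the question:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vacation (name TEXT, from_date TEXT, to_date TEXT)")
conn.executemany(
    "INSERT INTO vacation VALUES (?, ?, ?)",
    [
        ("Peter", "2013-07-01", "2013-07-10"),
        ("Paul", "2013-06-30", "2013-07-05"),
        ("Simon", "2013-05-10", "2013-05-15"),
    ],
)

# Calendar table: one row per day of 2013 (a shortened, illustrative range).
conn.execute("CREATE TABLE all_dates (date_entry TEXT)")
first_day = date(2013, 1, 1)
conn.executemany(
    "INSERT INTO all_dates VALUES (?)",
    [((first_day + timedelta(days=i)).isoformat(),) for i in range(365)],
)

# ISO-formatted date strings sort chronologically, so BETWEEN works on TEXT.
rows = conn.execute("""
    SELECT date_entry
    FROM vacation
    JOIN all_dates ON date_entry BETWEEN from_date AND to_date
    GROUP BY date_entry
    ORDER BY date_entry
""").fetchall()

vacation_dates = [row[0] for row in rows]
print(len(vacation_dates), vacation_dates[0], vacation_dates[-1])
```

The overlapping days of Peter and Paul appear only once because of the GROUP BY.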