SQL grouping user count by Mondays - sql

Given a Users table like so:
Users: id, created_at
How can I get the number of users created, grouped by day? My goal is to compare the number of users created this Monday with previous Mondays.

If created_at is of type timestamp, the simplest and fastest way is a plain cast to date:
SELECT created_at::date AS day, count(*) AS ct
FROM users
GROUP BY 1;
Since I am assuming that id cannot be NULL, count(*) is a tiny bit shorter and faster than count(id), while doing the same.
If you just want to see days since "last Monday" (the Monday of the previous week):
SELECT created_at::date, count(*) AS ct
FROM users
WHERE created_at >= (now()::date - (EXTRACT(ISODOW FROM now())::int + 6))
GROUP BY 1
ORDER BY 1;
This is carefully drafted to use a sargable condition, so it can use a simple index on created_at if present.
Consider the manual for EXTRACT.
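To see why subtracting ISODOW + 6 days lands on the Monday of the previous week, here is a small Python sketch of the same arithmetic. It is only an illustration, not part of the query: the function name monday_of_previous_week is made up, and it relies on date.isoweekday() using the same numbering as ISODOW (Monday = 1 through Sunday = 7).

```python
from datetime import date, timedelta

def monday_of_previous_week(today):
    # Hypothetical helper mirroring: now()::date - (EXTRACT(ISODOW FROM now())::int + 6)
    # date.isoweekday() returns 1 for Monday through 7 for Sunday, like ISODOW.
    return today - timedelta(days=today.isoweekday() + 6)

# A Wednesday: ISODOW is 3, so the cutoff is 9 days earlier.
cutoff = monday_of_previous_week(date(2024, 1, 3))
print(cutoff, cutoff.strftime("%A"))
```

Run on a Monday, the cutoff is exactly one week earlier, so the result covers both this Monday and the previous one.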

SELECT COUNT(id) AS cnt, EXTRACT(DOW FROM created_at) AS dow
FROM Users
GROUP BY EXTRACT(DOW FROM created_at)

If you want to see the days, use to_char(<date>, 'Day').
So, one way to do what you want:
select date_trunc('day', created_at), count(*)
from users u
where to_char(created_at, 'Dy') = 'Mon'
group by date_trunc('day', created_at)
order by 1 desc;
Perhaps a more general way to look at it would be to summarize the results by day of the week, for this week and last week. Something like:
select to_char(created_at, 'Day'),
sum(case when created_at >= current_date - 6 then 1 else 0 end) as ThisWeek,
sum(case when created_at::date between current_date - 13 and current_date - 7 then 1 else 0 end) as LastWeek
from users u
group by to_char(created_at, 'Day')

I am from a T-SQL background and I would do something like this
CREATE TABLE #users
(id int,
created_at datetime
)
INSERT INTO #users
(id, created_at)
VALUES
(
1, getdate()
)
INSERT INTO #users
(id, created_at)
VALUES
(
1, getdate()
)
INSERT INTO #users
(id, created_at)
VALUES
(
1, dateadd(DAY, 1,getdate())
)
SELECT id, created_at, count(id) FROM #users
GROUP BY id, created_at
DROP TABLE #users
You will get better results if you only group by day part and not the entire datetime value.
As for the second part, comparing only Mondays, you can use something like:
select datename(dw,getdate())
The above gives you the name of the weekday, which you can compare against the string literal 'Monday'.

Related

How to get number of IDs in the current month that also appears in the previous three months in Snowflake - SQL

I have a table in Snowflake covering a time range of, for example, 2019.01 to 2020.01. An ID can appear multiple times, on any of the dates.
For example:
my_table has two columns, dddate and id:
dddate      id
2019-02-03  607
2019-01-07  356
2019-08-06  491
2019-01-01  607
2019-12-17  529
2019-04-15  356
......
Is there a way to find the number of IDs that appeared at least once in the current month and also at least once in the previous three months, grouped by month, showing each month's count from 2019-04 (the first month that has three previous months of data in the table) through 2020-01?
I am thinking of some code like this:
WITH PREV_THREE AS (
SELECT
DATE_TRUNC('MONTH', dddate) AS MONTH,
ID AS CURR_ID
FROM my_table mt
INNER JOIN
(
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -1, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
UNION ALL
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -2, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
UNION ALL
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -3, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
) AS PREV_3_MON
ON mt.CURR_ID = PREV_3_MON.PREV_3_MON_ID
)
SELECT MONTH, COUNT(DISTINCT ID) AS COUNTER
FROM PREV_THREE
GROUP BY 1
ORDER BY 1
However, it returns an error and doesn't seem to work. Could anyone please help me with this? Thank you in advance!
You can use lag():
select distinct id
from (select t.*,
lag(dddate) over (partition by id order by dddate) as prev_dddate
from my_table t
) t
where dddate >= date_trunc('MONTH', current_date) and
prev_dddate < date_trunc('MONTH', current_date) and
prev_dddate >= date_trunc('MONTH', current_date) - interval '3 month';
You can do this for multiple months as:
select date_trunc('MONTH', dddate), count(distinct id)
from (select t.*,
lag(dddate) over (partition by id order by dddate) as prev_dddate
from my_table t
) t
where prev_dddate < date_trunc('MONTH', date_trunc('MONTH', dddate)) and
prev_dddate >= date_trunc('MONTH', date_trunc('MONTH', dddate)) - interval '3 month'
group by date_trunc('MONTH', dddate);
Even if an id appears multiple times in one month, one of those will be first and the lag() will identify the most recent previous month.
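If it helps to check the lag() logic outside the database, here is a rough Python sketch of the same idea, run on the six sample rows from the question. The names month_index and returning_ids_per_month are made up for illustration, and the sketch assumes "previous three months" means the three calendar months before the month of the current row, as in the query above.

```python
from collections import defaultdict
from datetime import date

def month_index(d):
    # Months since year 0, so that subtracting two indexes gives a gap in calendar months.
    return d.year * 12 + (d.month - 1)

def returning_ids_per_month(rows):
    """rows: iterable of (dddate, id). Returns {(year, month): distinct id count}."""
    dates_by_id = defaultdict(list)
    for d, ident in rows:
        dates_by_id[ident].append(d)
    ids_by_month = defaultdict(set)
    for ident, dates in dates_by_id.items():
        dates.sort()
        # zip(dates, dates[1:]) pairs every row with its predecessor, like lag().
        for prev, cur in zip(dates, dates[1:]):
            gap = month_index(cur) - month_index(prev)
            if 1 <= gap <= 3:  # previous row falls in one of the three prior months
                ids_by_month[(cur.year, cur.month)].add(ident)
    return {m: len(ids) for m, ids in sorted(ids_by_month.items())}

sample = [
    (date(2019, 2, 3), 607), (date(2019, 1, 7), 356), (date(2019, 8, 6), 491),
    (date(2019, 1, 1), 607), (date(2019, 12, 17), 529), (date(2019, 4, 15), 356),
]
print(returning_ids_per_month(sample))
```

On this sample only 607 (January then February) and 356 (January then April) qualify. Months before 2019-04 would still need to be filtered out if you only want months with a full three months of history.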

Data recurring in previous 90 days

I hope you can support me with a piece of code I'm writing. I'm working with the following query:
SELECT case_id, case_date, people_id FROM table_1;
and I have to find how many times the same people_id is repeated in the DB (with a different case_id) within the 90 days before the case_date. Any advice on how to address that?
Data sample
Additional info: as a result I expect a list of people_id values with the number of cases each received in the 90 days before the last case_date.
expected result sample:
The way I understood the question, it would be something like this:
select people_id,
case_id,
count(*)
from table_1
where case_date >= trunc(sysdate) - 90
group by people_id,
case_id
You want to filter WHERE the case_date is greater than or equal to 90 days before the start of today and then GROUP BY the people_id and COUNT the number of DISTINCT (different) case_id:
SELECT people_id,
COUNT( DISTINCT case_id ) AS number_of_cases
FROM table_1
WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
GROUP BY
people_id;
If you only want to count repeated case_id values per people_id, then:
SELECT people_id,
COUNT(*) AS number_of_repeated_cases
FROM (
SELECT case_id,
people_id
FROM table_1
WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
GROUP BY
people_id,
case_id
HAVING COUNT(*) >= 2
)
GROUP BY
people_id;
I think you want window functions:
select t.*,
count(*) over (partition by people_id order by case_date
range between interval '90' day preceding and current row
) as person_count_90_day
from t;
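To make the window frame concrete, here is a small Python sketch of what "range between interval '90' day preceding and current row" counts for each row. The function name rolling_case_counts and the sample rows are invented for illustration, since the question's data sample is not available. Note that a RANGE frame also includes rows with the same case_date as the current row.

```python
from datetime import date, timedelta

def rolling_case_counts(cases, days=90):
    """cases: list of (case_id, case_date, people_id).
    Returns (case_id, count of that person's cases in the 90 days up to case_date)."""
    window = timedelta(days=days)
    result = []
    for case_id, case_date, person in cases:
        # Same person, case_date within [current - 90 days, current]
        n = sum(
            1
            for _, other_date, other_person in cases
            if other_person == person
            and case_date - window <= other_date <= case_date
        )
        result.append((case_id, n))
    return result

# Hypothetical rows, not taken from the question
cases = [
    (1, date(2021, 1, 1), "A"),
    (2, date(2021, 2, 15), "A"),
    (3, date(2021, 6, 1), "A"),
    (4, date(2021, 2, 1), "B"),
]
print(rolling_case_counts(cases))
```

This brute-force loop is quadratic and only meant to show which rows fall inside the frame; the database evaluates the window far more efficiently.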

Month over Month percent change in user registrations

I am trying to write a query to find month over month percent change in user registration.
Users table has the logs for user registrations
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registration in the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at))
from users as u
group by created_year, created_month
order by created_year, created_month
I used count(*) instead of count(user_id), because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). Casting to a timestamp is also probably superfluous.
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
I can get the registrations from each year as two tables and join them. But it is not that effective
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The query below assumes this correction. If you do not correct it, cast the string to timestamp in the CTE that generates the date range. But keep in mind that if you leave the columns as text, you will at some point get a conversion exception.
To calculate month-over-month use the formula "100*(Nt - Nl)/Nl" where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (would incur divide by 0 exception)
The following handles both by first generating every month between the earliest and the latest date, then outer joining the monthly counts to the generated months. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened. But the longer form is intentional, to show how the final values are derived.
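For anyone who wants to sanity-check the numbers, here is a minimal Python sketch of the same steps: fill in the missing months with 0, then apply 100*(Nt - Nl)/Nl and return None where the query returns NULL. The function names and the sample counts are made up for illustration.

```python
def percent_change(n_this, n_last):
    # None when there is no previous month or the previous month had 0 users,
    # mirroring the NULL returned by the query.
    if not n_last:
        return None
    return round(100.0 * (n_this - n_last) / n_last, 2)

def month_over_month(counts):
    """counts: {(year, month): registrations}. Months missing from counts are treated as 0."""
    (year, month), last = min(counts), max(counts)
    rows, previous = [], None
    while (year, month) <= last:
        current = counts.get((year, month), 0)
        rows.append(((year, month), current, percent_change(current, previous)))
        previous = current
        # Step to the next calendar month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return rows

# Hypothetical counts with a gap in January 2014
for row in month_over_month({(2013, 11): 100, (2013, 12): 150, (2014, 2): 30}):
    print(row)
```

The gap month shows up with 0 users and a change of -100.0, and the month after it gets None because its previous month is 0.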

How do I calculate how many X requests were open as of each date?

Using
https://data.seattle.gov/Public-Safety/PDRs-After-using-City-of-Seattle-Public-Records-Re/wj44-r6br/data I want to know, for each date, the number of public disclosure requests that were open. This means that for each date I want the number of requests created on or before that date that either have no close date or were closed after that date.
I copied it to https://data.world/timacbackup/seattle-police-public-disclosure-requests where I can use SQL.
The closest I've gotten is
SELECT CAST(seattle_police_records_requests.request_create_date AS DATE) AS the_date,
count(*)
FROM seattle_police_records_requests
GROUP BY CAST(seattle_police_records_requests.request_create_date AS DATE)
ORDER BY the_date DESC;
I tried
SELECT CAST(request_create_date AS DATE) AS the_date,
count((
SELECT request_create_date
FROM seattle_police_records_requests AS t
WHERE CAST(t.request_create_date AS DATE) < d.request_create_date
))
FROM seattle_police_records_requests AS d
GROUP BY CAST(request_create_date AS DATE)
ORDER BY the_date DESC;
but get unknown table 'd' for the count subquery.
The last query I tried is
WITH dates
AS (
SELECT CAST(request_create_date AS DATE) AS create_date,
CAST(request_closed_date AS DATE) AS closed_date
FROM seattle_police_records_requests
),
create_dates
AS (
SELECT DISTINCT CAST(request_create_date AS DATE) AS create_date
FROM seattle_police_records_requests
)
SELECT create_dates.create_date,
COUNT(*)
FROM dates
INNER JOIN create_dates ON dates.create_date = create_dates.create_date
GROUP BY create_dates.create_date
HAVING dates.create_date <= create_dates.create_date
ORDER BY create_dates.create_date DESC
and basically it's just counting the number of requests opened on a given day, not all that were open as of that day.
After importing the "created" and "closed" values into SQL Server as datetime columns I was able to generate the counts per day like so:
WITH
given_dates AS
(
SELECT DISTINCT CAST(created AS DATE) AS given_date
FROM seattle_police_records_requests
)
SELECT
given_date,
(
SELECT COUNT(*)
FROM seattle_police_records_requests
WHERE created < DATEADD(DAY, 1, given_date) AND (closed > given_date OR closed IS NULL)
) AS num_open
FROM given_dates
ORDER BY given_date;
The DATEADD was necessary to include requests opened during that day, since the comparison of a date and a datetime implies that the date value is midnight (i.e., the very beginning of that day).
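Here is a rough Python sketch of the same counting rule, in case it helps to see the boundaries spelled out. The function name open_counts and the three sample requests are invented. It assumes a request counts as open on a day if it was created before the end of that day and is either still open or was closed after midnight at the start of that day. The upper bound is strict, so a request created at exactly midnight of the following day is not counted.

```python
from datetime import datetime, time, timedelta

def open_counts(requests):
    """requests: list of (created, closed) datetimes; closed is None while still open.
    Returns {date: number of requests open as of that date}."""
    given_dates = sorted({created.date() for created, _ in requests})
    counts = {}
    for day in given_dates:
        day_start = datetime.combine(day, time.min)   # midnight, as in the date/datetime comparison
        day_end = day_start + timedelta(days=1)       # same role as DATEADD(DAY, 1, given_date)
        counts[day] = sum(
            1
            for created, closed in requests
            if created < day_end and (closed is None or closed > day_start)
        )
    return counts

# Hypothetical requests, not taken from the Seattle data set
requests = [
    (datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 3, 10, 0)),
    (datetime(2020, 1, 2, 15, 0), None),
    (datetime(2020, 1, 3, 8, 0), datetime(2020, 1, 3, 9, 0)),
]
for day, n in open_counts(requests).items():
    print(day, n)
```

With this rule a request closed during a day still counts as open on that day, which is also what closed > given_date does in the query.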

How to narrow down count query by a finite time frame?

I have a query where I am identifying more than 1 submission by user for a particular form:
select userid, form_id, count(*)
from table_A
group by userid, form_id
having count(userid) > 1
However, I am trying to see which users are submitting more than 1 form within a 5 second timeframe (We have a field for the submission timestamp in this table). How would I narrow this query down by that criteria?
@nikotromus
You've not provided a lot of details about your schema and other columns available, nor about what / how and where this information will be used.
However, if you want to do it "live", comparing the submission timestamps against the current timestamp, it would look something like:
SELECT userid, form_id, count(*)
FROM table_A
WHERE DATEDIFF(SECOND,YourColumnWithSubmissionTimestamp, getdate()) <= 5
GROUP BY userid, form_id
HAVING count(userid) > 1
One way is to add DATEDIFF(Second, '2017-01-01', SubmittionTimeStamp) / 5 to the group by.
This will group records based on the userid, form_id and a five-second interval:
select userid, form_id, count(*)
from table_A
group by userid, form_id, datediff(Second, '2017-01-01', SubmittionTimeStamp) / 5
having count(userid) > 1
Read this SO post for a more detailed explanation.
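A quick Python sketch of the bucketing idea, with made-up names and sample rows, shows both how it works and one caveat: the buckets are fixed five-second slots counted from the reference date, so two submissions only a couple of seconds apart can still land in different buckets.

```python
from collections import Counter
from datetime import datetime

REFERENCE = datetime(2017, 1, 1)  # same arbitrary reference date as in the query

def bucket(ts, width=5):
    # Whole seconds since the reference date, integer-divided into fixed-width slots
    return int((ts - REFERENCE).total_seconds()) // width

def repeated_submissions(rows, width=5):
    """rows: iterable of (userid, form_id, timestamp).
    Returns {(userid, form_id, bucket): count} for buckets with more than one row."""
    counts = Counter((user, form, bucket(ts, width)) for user, form, ts in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical submissions
rows = [
    ("u1", "f1", datetime(2017, 1, 1, 0, 0, 1)),
    ("u1", "f1", datetime(2017, 1, 1, 0, 0, 3)),  # same slot as the row above
    ("u2", "f1", datetime(2017, 1, 1, 0, 0, 4)),
    ("u2", "f1", datetime(2017, 1, 1, 0, 0, 6)),  # 2 seconds later, but in the next slot
]
print(repeated_submissions(rows))
```

Only u1 is reported here, even though u2 also submitted twice within two seconds. If that matters, comparing each row with its neighbour is the safer approach.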
You can use lag to form groups of rows that are within 5 seconds of each other and then do aggregation on them:
select distinct userid,
form_id
from (
select t.*,
sum(val) over (
partition by t.userid, t.form_id
order by t.submission_timestamp
) as grp
from (
select t.*,
case
when datediff(ms, lag(t.submission_timestamp, 1, t.submission_timestamp) over (
partition by t.userid, t.form_id
order by t.submission_timestamp
), t.submission_timestamp) > 5000
then 1
else 0
end val
from your_table t
) t
) t
group by userid,
form_id,
grp
having count(*) > 1;
See this answer for more explanation:
Group records by consecutive dates when dates are not exactly consecutive
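The same gaps-and-islands idea can be sketched in Python, which may make the flag and running sum easier to follow. The names islands and burst_submitters are made up, and the sketch assumes the comparison is done per userid and form_id, so that another user's submissions never bridge a gap.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def islands(stamps, max_gap):
    """Give each sorted timestamp an island number: the running sum of 'gap too large' flags."""
    numbers, current, previous = [], 0, None
    for ts in stamps:
        if previous is not None and ts - previous > max_gap:
            current += 1  # same role as val = 1 in the query
        numbers.append(current)
        previous = ts
    return numbers

def burst_submitters(rows, max_gap=timedelta(seconds=5)):
    """rows: iterable of (userid, form_id, timestamp).
    Returns the (userid, form_id) pairs that have an island with more than one row."""
    stamps_by_key = defaultdict(list)
    for user, form, ts in rows:
        stamps_by_key[(user, form)].append(ts)
    found = set()
    for key, stamps in stamps_by_key.items():
        stamps.sort()
        sizes = Counter(islands(stamps, max_gap))
        if any(size > 1 for size in sizes.values()):
            found.add(key)
    return found

# Hypothetical submissions
t0 = datetime(2017, 1, 1)
rows = [
    ("a", "f1", t0), ("a", "f1", t0 + timedelta(seconds=6)),  # 6 s apart: two islands
    ("b", "f1", t0 + timedelta(seconds=3)),                   # a single submission
    ("c", "f1", t0), ("c", "f1", t0 + timedelta(seconds=4)),  # 4 s apart: one island
]
print(burst_submitters(rows))
```

Unlike fixed buckets, an island keeps growing as long as each submission follows the previous one within five seconds, so a long chain of quick submissions counts as one group.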
I would just use exists to get the users:
select userid, form_id
from table_A a
where exists (select 1
from table_A a2
where a2.userid = a.userid and a2.timestamp > a.timestamp and a2.timestamp < dateadd(second, 5, a.timestamp)
);
If you want a count, you can just add group by and count(*).