How to count certain the ages of people who have a log record from another table in sql? - sql

I want to get a count of how many people who are 18 are recorded in the logs table only once. Now if I have the same person who entered 2 times, I can see that there are 2 people with age 18. I can't make it appear only once. How do I do this???
My logs table and people table are connected by card_id.
My logs table has the login date and card_id.
While my members' table has the birthdate and card_id columns.
HERE is the query I made
select
card_id, sum("18") as "18"
from
( select logs.login, members.card_id,
count(distinct (case when 0 <= age and age <= 18 then age end)) as "18",
count( (case when 19 <= age and age <= 30 then age end)) as "30",
count ( (case when 31 <= age and age <= 50 then age end)) as "50"
from
(select login, date_part('year', age(birthdate)) as age, members.card_id as card_id,
logs.login
from members
left join logs on logs.card_id=members.card_id
) as members
left join logs on logs.card_id=members.card_id
group by logs.login, members.card_id
) as members
where login <= '20221029' group by card_id;
I want to create a table like this:
18 | 30 | 50 |
---------------
2 | 0 | 0

Count the distinct card_id-s.
select count(distinct card_id)
from members join logs using (card_id)
where extract('year' from age(birthdate)) = 18
and login <= '20221029';
Unrelated but it seems that you are storing login as text. This is not a good idea. Use type date instead.
Addition afer the question update
select count(*) filter (where user_age = 18) as age_18,
count(*) filter (where user_age between 19 and 30) as age_30,
count(*) filter (where user_age between 31 and 50) as age_50
from
(
select distinct on (card_id)
extract('year' from age(birthdate)) user_age
from members inner join logs using (card_id)
where login <= '20221029'
order by card_id, login desc -- pick the latest login
) AS t;

Related

Can I left join twice to do multiple calculations?

I am trying to calculate if a member shops in January, what proportion shop again in February and what proportion shop again within 3 months. Ultimately to create a table similar to the image attached.
I have tried the below code. The first left join works, but when I add the second one to calculate within_3months the error: "FROM keyword not found where expected" is shown (for the separate line). Can I left join twice or must I do separate scripts for columns?
, count(distinct B.members)/count(distinct A.members) *100 as 1month_retention_rate
select
year_month_january21
, count(distinct A.members) as num_of_mems_shopped_january21
, count(distinct B.members)as retained_february21
, count(distinct B.members)/count(distinct A.members) *100 as 1month_retention_rate
, count(distinct C.members)/count(distinct A.members) *100 as within_3months
from
(select
members
, year_month as year_month_january21
from table.members t
join table.date tm on t.dt_key = tm.date_key
and year_month = 202101
group by
members
, year_month) A
left join
(select
members
, year_month as year_month_february21
from table.members t
join table.date tm on t.dt_key = tm.date_key
and year_month = 202102
group by
members
, year_month) B on A.members = B.members
left join
(select
members
, year_month as year_month_3months
from table.members t
join table.date tm on t.dt_key = tm.date_key
and year_month between 202102 and 202104
group by
members
, year_month) C on A.members = C.members
group by
year_month_january21;
I have tried left creating a separate time table and joining to this. It does not work. Doing calculations separately works but I must do this for multiple time frames so will take a long time.
The error isn't coming from the added left join, it's from the as 1month_retention_rate part, because it's an illegal name.
You can see that more simply with:
select dummy as 1month_retention_rate
from dual;
ORA-00923: FROM keyword not found where expected
You could change the column alias so it follows the naming rules (specifically here, does not start with a digit), or if that specific name is actually required then you could make it a quoted identifier - generally not a good option, but sometimes OK in the final output of a query.
fiddle
So in your code you would just change your new line
, count(distinct B.members)/count(distinct A.members) *100 as 1month_retention_rate
to something like
, count(distinct B.members)/count(distinct A.members) *100 as one_month_retention_rate
or with a quoted identifier
, count(distinct B.members)/count(distinct A.members) *100 as "1month_retention_rate"
fiddle - which still errors but now with ORA-00942 as I don't have your tables, and that is after changing your obfuscated schema/table names to something legal too.
There may be more efficient ways to perform the calculation, but that's a separate issue...
I could understand that you want to get :
count of all members who visited in Jan.
count of all members who visited in Jan and visited again in Feb.
count of all members who visited in Jan and visited again in Feb, Mars and April.
If my understanding is true then you could simplify your inner query using IF instead of LEFT JOIN .
Take a look on the following query. Assuming that table members have an ID field :
SELECT
mem_jan AS num_of_mems_shopped_january21,
mem_feb AS retained_february21,
mem_feb / mem_jan * 100 as 1month_retention_rate
mem_3m / mem_jan * 100 as within_3months
FROM(
SELECT
SUM(IF(mm_jan>0,1,0) AS mem_jan,
SUM(IF(mm_jan>0 AND mm_feb>0,1,0) AS mem_feb,
SUM(IF(mm_jan>0 AND mm_count_3m>0,1,0) AS mem_3m
FROM
(
SELECT
t.Id,
SUM(IF(year_month = 202101, 1,0)) AS mm_jan, /*visit for a member in Jan*/
SUM(IF(year_month = 202102, 1,0)) AS mm_feb, /*visit for a member in Feb*/
SUM(IF(year_month between 202102 and 202104,1,0)) AS mem_3m/*visit for a member in 3 months*/
FROM
table.members t
join table.date tm on t.dt_key = tm.date_key
WHERE
year_month between 202101 and 202104
GROUP BY
t.Id
) AS t1
) AS t2
This is not a final running query but it can explain my idea. According to your engine you may use CASE or IF THEN ELSE
Don't use multiple joins, count the shops per member per month and then use conditional aggregation.
In Oracle, that would be:
SELECT 202101 AS year_month,
COUNT(CASE WHEN cnt_202101 > 0 THEN 1 END)
AS members_shopped_202101,
COUNT(CASE WHEN cnt_202101 > 0 AND cnt_202102 > 0 THEN 1 END)
AS members_retained_202102,
COUNT(CASE WHEN cnt_202101 > 0 AND cnt_202102 > 0 THEN 1 END)
/ COUNT(CASE WHEN cnt_202101 > 0 THEN 1 END) * 100
AS one_month_retention_rate,
COUNT(CASE WHEN cnt_202101 > 0 AND (cnt_202102 > 0 OR cnt_202103 > 0 OR cnt_202104 > 0) THEN 1 END)
/ COUNT(CASE WHEN cnt_202101 > 0 THEN 1 END) * 100
AS within_3months
FROM (
SELECT members,
year_month
FROM members m
INNER JOIN date d
ON m.dt_key = d.date_key
)
PIVOT (
COUNT(*)
FOR year_month IN (
202101 AS cnt_202101,
202102 AS cnt_202102,
202103 AS cnt_202103,
202104 AS cnt_202104
)
);

Cohort Analysis using SQL (Snowflake)

I am doing a cohort analysis using the table TRANSACTIONS. Below is the table schema,
USER_ID NUMBER,
PAYMENT_DATE_UTC DATE,
IS_PAYMENT_ADDED BOOLEAN
Below is a quick query to see how USER_ID 12345 (an example) goes through the different cohorts based on the date filter provided,
WITH RESULT(
SELECT
USER_ID,
TO_DATE(PAYMENT_DATE_UTC) AS PAYMENT_DATE,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY 1,2
HAVING PAYMENT_ADDED_COUNT>=1
ORDER BY 2
)
SELECT
COUNT(DISTINCT r.USER_ID),
SUM(r.PAYMENT_ADDED_COUNT)
FROM RESULT r
WHERE r.USER_ID=12345
AND (r.PAYMENT_DATE>='2021-02-01' AND r.PAYMENT_DATE<'2021-02-15')
The result for this query with the time frame (two weeks) would be
| 1 | 55 |
and this USER_ID would be classified as a Regular User Cohort (one who has made more than 10 payments) for the provided date filter
If the same query is run with the time frame as just one day say '2021-02-07', the result would be
| 1 | 10 |
and this USER_ID would be classified as as Occasional User Cohort (one who has made between 1 and 10 payments) for the provided date filter
I have this below query to bucket the USER_ID's into the two different cohorts based on the sum of the payments added,
WITH
ALL_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
),
OCASSIONAL_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING (PAYMENT_ADDED_COUNT>=1 AND PAYMENT_ADDED_COUNT<=10)
),
REGULAR_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING PAYMENT_ADDED_COUNT>10
)
SELECT
COUNT(DISTINCT ou.USER_ID) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.USER_ID) AS "REGULAR USERS"
FROM ALL_USER_COHORT au
LEFT JOIN OCASSIONAL_USER_COHORT ou ON au.USER_ID=ou.USER_ID
LEFT JOIN REGULAR_USER_COHORT ru ON au.USER_ID=ru.USER_ID
LEFT JOIN TRANSACTIONS t ON au.USER_ID=t.USER_ID
WHERE au.USER_ID=12345
AND TO_DATE(t.PAYMENT_DATE_UTC)>='2021-02-07'
Ideally the USER_ID 12345 should be bucketed as "OCCASIONAL USERS" as per the provided date filter but the query buckets it as "REGULAR USERS" instead.
For starters you CTE could have the redundancy removed like so:
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
GROUP BY user_id
), ocassional_user_cohort AS (
SELECT * FROM all_user_cohort
WHERE PAYMENT_ADDED_COUNT between 1 AND 10
), regular_user_cohort AS (
SELECT * FROM all_user_cohort
WHERE PAYMENT_ADDED_COUNT > 10
)
SELECT
COUNT(DISTINCT ou.user_id) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.user_id) AS "REGULAR USERS"
FROM all_user_cohort AS au
LEFT JOIN ocassional_user_cohort ou ON au.user_id=ou.user_id
LEFT JOIN regular_user_cohort ru ON au.user_id=ru.user_id
LEFT JOIN transactions t ON au.user_id=t.user_id
WHERE au.user_id=12345
AND TO_DATE(t.payment_date_utc)>='2021-03-01'
But the reason you are getting this problem is you are doing the which do the belong in across all time.
What you are wanting is to move the date filter into all_user_cohort, and not making tables when you can just sum the number of rows meeting the need.
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
WHERE TO_DATE(payment_date_utc)>='2021-03-01'
GROUP BY user_id
)
SELECT
SUM(IFF(payment_added_count between 1 AND 10, 1,0)) AS "OCCASIONAL USERS"
SUM(IFF(payment_added_count > 10, 1,0)) AS "REGULAR USERS"
FROM transactions
WHERE au.user_id=12345
Which can also be done differently, if that is more what your looking for, for other reasons.
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
WHERE TO_DATE(payment_date_utc)>='2021-03-01'
GROUP BY user_id
), classify_users AS (
SELECT user_id
,CASE
WHEN payment_added_count between 1 AND 10 THEN 'OCCASIONAL USERS'
WHEN payment_added_count > 10 THEN 'REGULAR USERS'
ELSE 'users with zero payments'
END AS classified
FROM all_user_cohort
)
SELECT classified
,count(*)
FROM classify_users
WHERE user_id=12345
GROUP BY 1

Adding another column with a different condition in SQL

I'm trying to find different information about my data all in the same query. I want to create new columns based on separate conditions. Here is what my table looks like right now
SELECT COUNT (distinct ext_project_id) as "Total Projects"
FROM dbo.v_report_project
INNER JOIN dbo.v_report_program
ON dbo.v_report_program.program_id = dbo.v_report_project.program_id
WHERE project_status = 'Active'
AND datediff (day, creation_date, getdate()) < 15 ;
The outcome is:
Total Projects
163
Here is what I want my table to look like:
Total Projects | Projects Under 15 Days | Projects Between 15 and 60 Days | Projects Over 60 Days
163 ?? ?? ??
How can I find these different counts, all at the same time?
You can use a case to limit a particular count to a certain date range:
select count(distinct project_id) as total
, count(distinct case
when datediff(day, creation_date, getdate()) < 6
then project_id end) as total_6day
, count(distinct case
when datediff(day, creation_date, getdate()) < 7
then project_id end) as total_7day
, count(distinct case
when datediff(day, creation_date, getdate()) < 42
then project_id end) as total_42day
from dbo.v_report_project proj
join dbo.v_report_program prog
on proj.program_id = prog.program_id
where project_status = 'Active'
If no when matches and no else is specified, a case returns null. Since count ignores null you get the proper subtotals.

SQL Query Help (Advanced - for me!)

I have a question about a SQL query I am trying to write.
I need to query data from a database.
The database has, amongst others, these 3 fields:
Account_ID #, Date_Created, Time_Created
I need to write a query that tells me how many accounts were opened per hour.
I have written said query, but there are times that there were 0 accounts created, so these "hours" are not populated in the results.
For example:
Volume Date__Hour
435 12-Aug-12 03
213 12-Aug-12 04
125 12-Aug-12 06
As seen in the example above, hour 5 did not have any accounts opened.
Is there a way that the result can populate the hour but and display 0 accounts opened for this hour?
Example of how I want my results to look like:
Volume Date_Hour
435 12-Aug-12 03
213 12-Aug-12 04
0 12-Aug-12 05
125 12-Aug-12 06
Thanks!
Update: This is what I have so far
SELECT count(*) as num_apps, to_date(created_ts,'DD-Mon-RR') as app_date, to_char(created_ts,'HH24') as app_hour
FROM accounts
WHERE To_Date(created_ts,'DD-Mon-RR') >= To_Date('16-Aug-12','DD-Mon-RR')
GROUP BY To_Date(created_ts,'DD-Mon-RR'), To_Char(created_ts,'HH24')
ORDER BY app_date, app_hour
To get the results you want, you will need to create a table (or use a query to generate a "temp" table) and then use a left join to your calculation query to get rows for every hour - even those with 0 volume.
For example, assume I have a table with app_date and app_hour fields. Also assume that this table has a row for every day/hour you wish to report on.
The query would be:
SELECT NVL(c.num_apps,0) as num_apps, t.app_date, t.app_hour
FROM time_table t
LEFT OUTER JOIN
(
SELECT count(*) as num_apps, to_date(created_ts,'DD-Mon-RR') as app_date, to_char(created_ts,'HH24') as app_hour
FROM accounts
WHERE To_Date(created_ts,'DD-Mon-RR') >= To_Date('16-Aug-12','DD-Mon-RR')
GROUP BY To_Date(created_ts,'DD-Mon-RR'), To_Char(created_ts,'HH24')
ORDER BY app_date, app_hour
) c ON (t.app_date = c.app_date AND t.app_hour = c.app_hour)
I believe the best solution is not to create some fancy temporary table but just use this construct:
select level
FROM Dual
CONNECT BY level <= 10
ORDER BY level;
This will give you (in ten rows):
1
2
3
4
5
6
7
8
9
10
For hours interval just little modification:
select 0 as num_apps, (To_Date('16-09-12','DD-MM-RR') + level / 24) as created_ts
FROM dual
CONNECT BY level <= (sysdate - To_Date('16-09-12','DD-MM-RR')) * 24 ;
And just for the fun of it adding solution for you(I didn't try syntax, so I'm sorry for any mistake, but the idea is clear):
SELECT SUM(num_apps) as num_apps, to_date(created_ts,'DD-Mon-RR') as app_date, to_char(created_ts,'HH24') as app_hour
FROM(
SELECT count(*) as num_apps, created_ts
FROM accounts
WHERE To_Date(created_ts,'DD-Mon-RR') >= To_Date('16-09-12','DD-MM-RR')
UNION ALL
select 0 as num_apps, (To_Date('16-09-12','DD-MM-RR') + level / 24) as created_ts
FROM dual
CONNECT BY level <= (sysdate - To_Date('16-09-12','DD-MM-RR')) * 24 ;
)
GROUP BY To_Date(created_ts,'DD-Mon-RR'), To_Char(created_ts,'HH24')
ORDER BY app_date, app_hour
;
You can also use a CASE statement in the SELECT to force the value you want.
It can be useful to have a "sequence table" kicking around, for all sorts of reasons, something that looks like this:
create table dbo.sequence
(
id int not null primary key clustered ,
)
Load it up with million or so rows, covering positive and negative values.
Then, given a table that looks like this
create table dbo.SomeTable
(
account_id int not null primary key clustered ,
date_created date not null ,
time_created time not null ,
)
Your query is then as simple as (in SQL Server):
select year_created = years.id ,
month_created = months.id ,
day_created = days.id ,
hour_created = hours.id ,
volume = t.volume
from ( select * ,
is_leap_year = case
when id % 400 = 0 then 1
when id % 100 = 0 then 0
when id % 4 = 0 then 1
else 0
end
from dbo.sequence
where id between 1980 and year(current_timestamp)
) years
cross join ( select *
from dbo.sequence
where id between 1 and 12
) months
left join ( select *
from dbo.sequence
where id between 1 and 31
) days on days.id <= case months.id
when 2 then 28 + years.is_leap_year
when 4 then 30
when 6 then 30
when 9 then 30
when 11 then 30
else 31
end
cross join ( select *
from dbo.sequence
where id between 0 and 23
) hours
left join ( select date_created ,
hour_created = datepart(hour,time_created ) ,
volume = count(*)
from dbo.SomeTable
group by date_created ,
datepart(hour,time_created)
) t on datepart( year , t.date_created ) = years.id
and datepart( month , t.date_created ) = months.id
and datepart( day , t.date_created ) = days.id
and t.hour_created = hours.id
order by 1,2,3,4
It's not clear to me if created_ts is a datetime or a varchar. If it's a datetime, you shouldn't use to_date; if it's a varchar, you shouldn't use to_char.
Assuming it's a datetime, and borrowing #jakub.petr's FROM Dual CONNECT BY level trick, I suggest:
SELECT count(*) as num_apps, to_char(created_ts,'DD-Mon-RR') as app_date, to_char(created_ts,'HH24') as app_hour
FROM (select level-1 as hour FROM Dual CONNECT BY level <= 24) h
LEFT JOIN accounts a on h.hour = to_number(to_char(a.created_ts,'HH24'))
WHERE created_ts >= To_Date('16-Aug-12','DD-Mon-RR')
GROUP BY trunc(created_ts), h.hour
ORDER BY app_date, app_hour

Oracle: How can I determine if there are gaps records based on a date field?

I have an application that manages employee time sheets.
My tables look like:
TIMESHEET
TIMESHEET_ID
EMPLOYEE_ID
etc
TIMESHEET_DAY:
TIMESHEETDAY_ID
TIMESHEET_ID
DATE_WORKED
HOURS_WORKED
Each time sheet covers a 14 day period, so there are 14 TIMESHEET_DAY records for each TIMESHEET record. And if someone goes on vacation, they do not need to enter a timesheet if there are no hours worked during that 14 day period.
Now, I need to determine whether or not employees have a 7 day gap in the prior 6 months. This means I have to look for either 7 consecutive TIMESHEET_DAY records with 0 hours, OR a 7 day period with a combination of no records submitted and records submitted with 0 hours worked. I need to know the DATE_WORKED of the last TIMESHEET_DAY record with hours in that case.
The application is asp.net, so I could retrieve all of the TIMESHEET_DAY records and iterate through them, but I think there must be a more efficient way to do this with SQL.
SELECT t1.EMPLOYEE_ID, t1.TIMESHEETDAY_ID, t1.DATE_WORKED
FROM (SELECT ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID, TIMESHEETDAY_ID ORDER BY DATE_WORKED) AS RowNumber,
EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
FROM (SELECT EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
FROM TIMESHEET_DAY d
INNER JOIN TIMESHEET t ON t.TIMESHEET_ID = d.TIMESHEET_ID
GROUP BY EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
HAVING SUM(HOURS_WORKED) > 0) t ) t1
INNER JOIN
(SELECT ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID, TIMESHEETDAY_ID ORDER BY DATE_WORKED) AS RowNumber,
EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
FROM (SELECT EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
FROM TIMESHEET_DAY d
INNER JOIN TIMESHEET t ON t.TIMESHEET_ID = d.TIMESHEET_ID
GROUP BY EMPLOYEE_ID, TIMESHEETDAY_ID, DATE_WORKED
HAVING SUM(HOURS_WORKED) > 0) t ) t2 ON t1.RowNumber = t2.RowNumber + 1
WHERE t2.DATE_WORKED - t1.DATE_WORKED >= 7
I would do it at least partly with SQL by counting the days per timesheet and then do the rest of the logic in the program. This identifies time sheets with missing days:
SELECT * FROM
(SELECT s.timesheet_id, SUM(CASE WHEN d.hours_worked > 0 THEN 1 ELSE 0 END) AS days_worked
FROM
TimeSheet s
LEFT JOIN TimeSheet_Day d
ON s.timesheet_id = d.timesheet_id
GROUP BY
s.timesheet_id
HAVING SUM(CASE WHEN d.hours_worked > 0 THEN 1 ELSE 0 END) < 14) X
INNER JOIN TimeSheet
ON TimeSheet.timesheet_id = X.timesheet_id
LEFT JOIN TimeSheet_Day
ON TimeSheet.timesheet_id = TimeSheet_Day.timesheet_id
ORDER BY
TimeSheet.employee_id, TimeSheet.timesheet_id, TimeSheet_Day.date_worked