Is there a way to find active users in SQL? - sql

I'm trying to find the total count of active users in a database. "Active" users here are defined as those who have registered an event on the selected day or on any later day. So if a user registered an event on days 1, 2 and 5, they are counted as "active" throughout days 1, 2, 3, 4 and 5.
My original dataset looks like this (note that this is a sample: the real dataset will run up to 365 days and has around 1000 users).
Day ID
0 1
0 2
0 3
0 4
0 5
1 1
1 2
2 1
3 1
4 1
4 2
As you can see, all 5 IDs are active on Day 0, and 2 IDs (1 and 2) are active until Day 4, so I'd like the finished table to look like this:
Day Count
0 5
1 2
2 2
3 2
4 2
I've tried using the following query:
select Day as days, sum(case when Day <= days then 1 else 0 end)
from df
But it gives incorrect output (it only counts users who registered an event on each specific day).
I'm at a loss as to what I could try next. Does anyone have any ideas? Many thanks in advance!

I think I would just use generate_series():
select gs.d, count(*)
from (select id, min(day) as min_day, max(day) as max_day
from t
group by id
) t cross join lateral
generate_series(t.min_day, t.max_day, 1) gs(d)
group by gs.d
order by gs.d;
If you want to count everyone as active from day 1 -- but not all have a value on day 1 -- then use 1 instead of min_day.
Here is a db<>fiddle.
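For engines without generate_series(), the same idea can be sketched with a recursive CTE. The snippet below is only an illustration, assuming SQLite through Python's sqlite3 module; the table name t and the helper active_counts are made up for the example. It computes each user's first and last event day, builds every day in the overall range (gaps included), and counts the users whose span covers each day.

```python
import sqlite3

# Sample rows from the question: (day, id).
EVENTS = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
          (1, 1), (1, 2), (2, 1), (3, 1), (4, 1), (4, 2)]

QUERY = """
WITH RECURSIVE
  spans AS (            -- first and last event day per user
    SELECT id, MIN(day) AS min_day, MAX(day) AS max_day
    FROM t
    GROUP BY id
  ),
  days(d, hi) AS (      -- every day from the earliest to the latest event
    SELECT MIN(min_day), MAX(max_day) FROM spans
    UNION ALL
    SELECT d + 1, hi FROM days WHERE d < hi
  )
SELECT days.d, COUNT(spans.id)
FROM days
LEFT JOIN spans ON days.d BETWEEN spans.min_day AND spans.max_day
GROUP BY days.d
ORDER BY days.d
"""


def active_counts(events):
    """Return (day, active_user_count) rows for a list of (day, id) events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (day INTEGER, id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", events)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


print(active_counts(EVENTS))
```

Because the day list is generated rather than read from the data, days on which nobody registered an event still appear in the result.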

A bit verbose, but this should do:
with dt as (
select 0 d, 1 id
union all
select 0 d, 2 id
union all
select 0 d, 3 id
union all
select 0 d, 4 id
union all
select 0 d, 5 id
union all
select 1 d, 1 id
union all
select 1 d, 2 id
union all
select 2 d, 1 id
union all
select 3 d, 1 id
union all
select 4 d, 1 id
union all
select 4 d, 2 id
)
, active_periods as (
select id
, min(d) min_d
, max(d) max_d
from dt
group by id
)
, days as (
select distinct d
from dt
)
select d.d
, count(ap.id)
from days d
join active_periods ap on d.d between ap.min_d and ap.max_d
group by 1
order by 1 asc

You need to count by day.
select
day,
count(distinct id)
from df
GROUP BY
day

Related

Find records on group level which are connected to all other record within the group

I have a scenario where I have to find IDs within each group which are connected to all other IDs in the same group. So basically we have to treat each group separately.
In the table below, the group A has 3 IDs 1, 2 and 3. 1 is connected to both 2 and 3, 2 is connected to both 1 and 3, but 3 is not connected to 1 and 2. So 1 and 2 should be output from group A. Similarly in group B only 5 is connected to all other IDs namely 4 and 6 within group B, so 5 should be output. Similarly from group C, that should be 8, and from group D no records should be output.
So the output of the select statement should be 1, 2, 5, 8.
GRP ID CONNECTED_TO
A 1 2
A 1 3
A 2 3
A 2 1
A 3 5
B 4 5
B 5 4
B 5 6
B 6 4
C 7 21
C 7 25
C 8 7
D 9 31
D 10 35
D 11 37
I was able to do this when grouping was not required, using the SQL below:
SELECT ID FROM <table>
where CONNECTED_TO in (select ID from <table>)
group by ID
having count(*) = <number of records - 1>
But I am not able to find the correct SQL for my scenario. Any help is appreciated.
You may use the count and count(distinct) functions as follows:
select id
from tbl T
where connected_to in
(
select id from tbl T2
where T2.grp = T.grp
)
group by grp, id
having count(connected_to) =
(
select count(distinct D.id) - 1
from tbl D
where T.grp = D.grp
)
When count(connected_to), grouped by grp and id, equals count(distinct id) - 1 for the same grp, the ID is connected to all other IDs in its group.
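To check this logic against the sample data, here is a small sketch, assuming SQLite through Python's sqlite3 module. The table name tbl follows the answer; the helper fully_connected_ids, the derived table g, and the self-connection filter are additions made for this illustration. The per-group ID count is computed in a joined subquery rather than in the HAVING clause, which is only a variation on the same comparison.

```python
import sqlite3

# Sample rows from the question: (grp, id, connected_to).
ROWS = [
    ("A", 1, 2), ("A", 1, 3), ("A", 2, 3), ("A", 2, 1), ("A", 3, 5),
    ("B", 4, 5), ("B", 5, 4), ("B", 5, 6), ("B", 6, 4),
    ("C", 7, 21), ("C", 7, 25), ("C", 8, 7),
    ("D", 9, 31), ("D", 10, 35), ("D", 11, 37),
]

QUERY = """
SELECT t.id
FROM tbl t
JOIN (SELECT grp, COUNT(DISTINCT id) AS n_ids
      FROM tbl
      GROUP BY grp) g ON g.grp = t.grp
WHERE t.connected_to IN (SELECT id FROM tbl t2 WHERE t2.grp = t.grp)
  AND t.connected_to <> t.id      -- ignore self-connections (added assumption)
GROUP BY t.grp, t.id, g.n_ids
HAVING COUNT(DISTINCT t.connected_to) = g.n_ids - 1
ORDER BY t.id
"""


def fully_connected_ids(rows):
    """Return the IDs connected to every other ID in their own group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (grp TEXT, id INTEGER, connected_to INTEGER)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    ids = [r[0] for r in conn.execute(QUERY)]
    conn.close()
    return ids


print(fully_connected_ids(ROWS))
```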

How to group user sessions by converted row

I'm doing simple multichannel attribution exploration and got stuck with grouping user sessions.
For example, I have simple sessions table:
client channel time converted
1 social 1 0
1 cpc 2 0
1 email 3 1
1 email 4 0
1 cpc 5 1
2 organic 1 0
2 cpc 2 1
3 email 1 0
Each row is one user session, and the converted column shows whether the user converted in that session.
I need to group the sessions that lead to a conversion, per user and per conversion, so the perfect result would be:
client channels time converted
1 [social,cpc,email] 3 1
1 [email,cpc] 5 1
2 [organic,cpc] 2 1
3 [email] 1 0
Notice user 3: he has not converted, but I still need to have his sessions.
You need to assign a group. For this purpose, a reverse cumulative sum of converted (ordered by time descending) looks like the right thing:
select client, array_agg(channel order by time) as channels,
max(time) as time, max(converted) as converted
from (select t.*,
sum(t.converted) over (partition by t.client order by t.time desc) as grp
from t
) t
group by client, grp;
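To see why the reverse cumulative sum works as a group key, here is a plain-Python sketch of the same computation (the function name group_sessions and the tuple layout are made up for this illustration). Walking each client's sessions from latest to earliest, the running count of conversions only changes when a converting session is reached, so every session leading up to a conversion shares that conversion's count.

```python
from collections import defaultdict

# Sample rows from the question: (client, channel, time, converted).
SESSIONS = [
    (1, "social", 1, 0), (1, "cpc", 2, 0), (1, "email", 3, 1),
    (1, "email", 4, 0), (1, "cpc", 5, 1),
    (2, "organic", 1, 0), (2, "cpc", 2, 1),
    (3, "email", 1, 0),
]


def group_sessions(sessions):
    """Return (client, channels, time, converted) per conversion path."""
    by_client = defaultdict(list)
    for client, channel, time, converted in sessions:
        by_client[client].append((time, channel, converted))

    result = []
    for client, rows in by_client.items():
        groups = defaultdict(list)
        running = 0
        # Latest session first: running = conversions at this time or later.
        for time, channel, converted in sorted(rows, reverse=True):
            running += converted
            groups[running].append((time, channel, converted))
        for members in groups.values():
            members.sort()  # back to chronological order
            result.append((
                client,
                [channel for _, channel, _ in members],
                members[-1][0],                   # max(time)
                max(c for _, _, c in members),    # max(converted)
            ))
    result.sort(key=lambda r: (r[0], r[2]))
    return result


for row in group_sessions(SESSIONS):
    print(row)
```

Sessions that come after a client's last conversion get the key 0, so they end up in their own non-converted group, as for client 3.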
Below is for BigQuery Standard SQL
#standardSQL
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
-- ORDER BY client, time
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 client, 'social' channel, 1 time, 0 converted UNION ALL
SELECT 1, 'cpc', 2, 0 UNION ALL
SELECT 1, 'email', 3, 1 UNION ALL
SELECT 1, 'email', 4, 0 UNION ALL
SELECT 1, 'cpc', 5, 1 UNION ALL
SELECT 2, 'organic', 1, 0 UNION ALL
SELECT 2, 'cpc', 2, 1 UNION ALL
SELECT 3, 'email', 1, 0
)
SELECT
client,
STRING_AGG(channel ORDER BY time) channels,
MAX(time) time,
MAX(converted) converted
FROM (
SELECT *, COUNTIF(converted = 1) OVER(PARTITION BY client ORDER BY time DESC) session
FROM `project.dataset.table`
)
GROUP BY client, session
ORDER BY client, time
with result
Row client channels time converted
1 1 social,cpc,email 3 1
2 1 email,cpc 5 1
3 2 organic,cpc 2 1
4 3 email 1 0

Get MAX from row with column name (SQL)

Sorry if my question is simple, but I spent a day googling and still can't figure out how to solve this:
I have table like:
userId A B C D E
1 5 0 2 3 2
2 3 2 0 7 3
And for each row I need the name of the column that holds the MAX value:
userId MAX
1 A
2 D
All ideas will be much appreciated! Thanks!
I use Google BigQuery, so as I understand it my options are different from MySQL, but I will try to translate if you have ideas in the MySQL way.
You can use GREATEST:
SELECT userid, CASE GREATEST(A,B,C,D,E)
WHEN A THEN 'A'
WHEN B THEN 'B'
WHEN C THEN 'C'
WHEN D THEN 'D'
WHEN E THEN 'E'
END AS MAX
FROM TableName
Result:
userId MAX
1 A
2 D
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
( SELECT key
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'".*?":[^,}]*')) kv,
UNNEST([STRUCT(TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') AS key, SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) AS value)])
WHERE key != 'userId'
ORDER BY value DESC
LIMIT 1
) max_column
FROM `project.dataset.table` t
Applied to the sample data from the question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 userId, 5 A, 0 B, 2 C, 3 D, 2 E UNION ALL
SELECT 2, 3, 2, 0, 7, 3 UNION ALL
SELECT 3, 1, 2, NULL, 4, 5
)
SELECT *,
( SELECT key
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'".*?":[^,}]*')) kv,
UNNEST([STRUCT(TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') AS key, SAFE_CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64) AS value)])
WHERE key != 'userId'
ORDER BY value DESC
LIMIT 1
) max_column
FROM `project.dataset.table` t
output is
Row userId A B C D E max_column
1 1 5 0 2 3 2 A
2 2 3 2 0 7 3 D
3 3 1 2 null 4 5 E
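The logic of that query (turn each row into key/value pairs, drop the id column, take the key with the largest value) can be sketched in plain Python. This is only an illustration: max_column is a made-up helper, NULLs are modelled as None and skipped, and on ties the first column in row order wins, which is a choice of this sketch rather than something the answers specify.

```python
# Sample rows from the answer above, with NULL modelled as None.
ROWS = [
    {"userId": 1, "A": 5, "B": 0, "C": 2, "D": 3, "E": 2},
    {"userId": 2, "A": 3, "B": 2, "C": 0, "D": 7, "E": 3},
    {"userId": 3, "A": 1, "B": 2, "C": None, "D": 4, "E": 5},
]


def max_column(row, key_field="userId"):
    """Return the name of the column with the largest non-null value."""
    candidates = [
        (name, value)
        for name, value in row.items()
        if name != key_field and value is not None
    ]
    if not candidates:
        return None
    # max() keeps the first maximal item, so ties go to the earlier column.
    return max(candidates, key=lambda kv: kv[1])[0]


for row in ROWS:
    print(row["userId"], max_column(row))
```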

SQL query - sum of values by status for date interval

One query is driving me crazy. I have a table like the following, and I want to get the sum of Values by Status for every date in an interval.
Table
Id Name Value Date Status
1 pro1 2 01.04.14 0
2 pro1 8 02.04.14 1
3 pro2 6 02.04.14 1
4 pro3 0 03.04.14 0
5 pro4 7 03.04.14 0
6 pro4 2 03.04.14 0
7 pro4 4 03.04.14 1
8 pro4 6 04.04.14 1
9 pro4 1 04.04.14 1
For example,
Input: Name = pro4, minDate = 01.02.14, maxDate = 04.09.14
Output:
Date Values sum for 0 Status Values sum for 1 Status
01.04.14 0 0
02.04.14 0 0
03.04.14 9 (=7+2) 4 (only 4 exist)
04.04.14 0 7 (6+1)
On 01.04.14 and 02.04.14, pro4 has no values for either status, but I still want to show those rows, because I need every date in the interval. Can anyone help me create this query?
Edit:
I cannot change the structure; I already have that table with data. Every day appears in the table many times (at least once).
Thanks in advance.
Assuming you have a row for each date in the table, use conditional aggregation:
select date,
sum(Case when name = 'pro4' and status = 0 then Value else 0 end) as values_0,
sum(case when name = 'pro4' and status = 1 then Value else 0 end) as values_1
from Table t
where date >= '2014-04-01' and date <= '2014-04-09'
group by date
order by date;
If you don't have this list of dates, you can take this approach instead:
with dates as (
select cast('2014-04-01' as date) as thedate
union all
select dateadd(day, 1, thedate)
from dates
where thedate < '2014-04-09'
)
select dates.thedate,
sum(Case when status = 0 then Value else 0 end) as values_0,
sum(case when status = 1 then Value else 0 end) as values_1
from dates left outer join
table t
on t.date = dates.thedate and t.name = 'pro4'
group by dates.thedate;
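The generated-dates approach can be tried end to end with the sample data. The sketch below assumes SQLite through Python's sqlite3 module, so date(x, '+1 day') stands in for dateadd, dates are stored as ISO strings, and the table name vals and helper daily_sums are made up for the example.

```python
import sqlite3

# Sample rows from the question: (id, name, value, date, status).
ROWS = [
    (1, "pro1", 2, "2014-04-01", 0),
    (2, "pro1", 8, "2014-04-02", 1),
    (3, "pro2", 6, "2014-04-02", 1),
    (4, "pro3", 0, "2014-04-03", 0),
    (5, "pro4", 7, "2014-04-03", 0),
    (6, "pro4", 2, "2014-04-03", 0),
    (7, "pro4", 4, "2014-04-03", 1),
    (8, "pro4", 6, "2014-04-04", 1),
    (9, "pro4", 1, "2014-04-04", 1),
]

QUERY = """
WITH RECURSIVE dates(thedate) AS (
    SELECT :start
    UNION ALL
    SELECT date(thedate, '+1 day') FROM dates WHERE thedate < :end
)
SELECT d.thedate,
       SUM(CASE WHEN t.status = 0 THEN t.value ELSE 0 END) AS values_0,
       SUM(CASE WHEN t.status = 1 THEN t.value ELSE 0 END) AS values_1
FROM dates d
LEFT JOIN vals t ON t.date = d.thedate AND t.name = :name
GROUP BY d.thedate
ORDER BY d.thedate
"""


def daily_sums(rows, name, start, end):
    """Return (date, sum for status 0, sum for status 1) for every date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vals (id INTEGER, name TEXT, value INTEGER, "
        "date TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO vals VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        QUERY, {"start": start, "end": end, "name": name}
    ).fetchall()
    conn.close()
    return result


for row in daily_sums(ROWS, "pro4", "2014-04-01", "2014-04-04"):
    print(row)
```

The name filter sits in the ON clause, so dates with no matching rows survive the left join and sum to 0.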
Just a guess at the query:
select date,
sum(case when status = 0 then value else 0 end) as Status0,
sum(case when status = 1 then value else 0 end) as Status1
from table
group by date
To expand my comment the complete query is
WITH [counter](N) AS
(SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1)
, days(N) AS (
SELECT row_number() over (ORDER BY (SELECT NULL)) FROM [counter])
, months (N) AS (
SELECT N - 1 FROM days WHERE N < 13)
, calendar ([date]) AS (
SELECT DISTINCT cast(dateadd(DAY, days.n
, dateadd(MONTH, months.n, '20131231')) AS date)
FROM months
CROSS JOIN days
)
SELECT a.Name
, c.Date
, [Sum of 0] = SUM(CASE Status WHEN 0 THEN Value ELSE 0 END)
, [Sum of 1] = SUM(CASE Status WHEN 1 THEN Value ELSE 0 END)
FROM Calendar c
LEFT JOIN myTable a ON c.Date = a.Date AND a.name = 'pro4'
WHERE c.date BETWEEN '20140201' AND '20140904'
GROUP BY c.Date, a.Name
ORDER BY c.Date
Note that the condition on the name needs to be in the JOIN; otherwise you'll get only the dates that have rows for that name.
If you need multiple years just add another CTE for the count and a dateadd(YEAR,...) in the CTE calendar
This is not really the exact query, but I think you can get that by having a query that looks like:
select date, status, sum(value) from table
where (date between mindate and maxdate) and name = product_name
group by date, status;
this page gives more info.
EDIT
So the above query only gives part of the answer required by the OP. A LEFT OUTER JOIN of the original table and the result of the above query on the date and status fields will give the missing info.
e.g.
select distinct y.date, y.status, coalesce(x.sum_of_values, 0) as sum_of_values
from table as y
left outer join
(select date, status, sum(value) as sum_of_values
from table
where (date between mindate and maxdate) and name = product_name
group by date, status) as x
on y.date = x.date and y.status = x.status
where y.date between mindate and maxdate
order by y.date;

Checking for missing data in SQL

I am having a hard time with this and don't know whether there is a solution for it.
I am trying to detect missing hourly data. Sample:
Table HRLY_DATA:
NAME HOUR
Me 0
Me 1
Me 2
Me 3
Me 6
Me 7
You 0
You 1
You 2
You 3
You 4
You 5
You 6
You 7
As you can see, [HOUR] data of Me is missing 4 and 5. I want a query that will output:
NAME HOUR
Me 4
Me 5
For now, here's what I've got:
SELECT d.NAME, HR FROM HRs c
LEFT OUTER JOIN
(
SELECT distinct a.NAME
FROM HRLY_DATA a
INNER JOIN
(
SELECT NAME FROM
(
SELECT NAME, count(*) as CNT
FROM
(
SELECT DISTINCT NAME, HOUR
FROM HRLY_DATA
) as i
GROUP BY NAME
) as ii
WHERE CNT < 8
) as b
ON a.NAME=b.NAME
) as d
ON c.HR=d.HOUR
WHERE d.HOUR IS NULL
Table HRs:
HR
0
1
2
3
4
5
6
7
I am getting this output:
NAME HR
NULL 4
NULL 5
Data for HOUR will only range from 0 to 7.
BTW, I am using SQL Server (MSSQL) for this.
Sorry if I can't explain my problem clearly. :(
Please try:
select
distinct x.NAME, number HOUR
From
master.dbo.spt_values cross join HRLY_DATA x
where number between 0 and 7
except
select NAME, HOUR FROM HRLY_DATA
Since table HRs contains the values 0-7, try:
select
distinct NAME, HR
From
HRs cross join HRLY_DATA
except
select NAME, HOUR from HRLY_DATA
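The cross join plus EXCEPT pattern can be checked against the sample data. The sketch below assumes SQLite through Python's sqlite3 module rather than SQL Server, and the helper missing_hours is made up for the example. It builds every expected (NAME, HOUR) pair and subtracts the pairs that actually exist.

```python
import sqlite3

# Sample rows from the question: (name, hour).
HRLY_DATA = (
    [("Me", h) for h in (0, 1, 2, 3, 6, 7)]
    + [("You", h) for h in range(8)]
)

QUERY = """
SELECT n.name, h.hr
FROM (SELECT DISTINCT name FROM hrly_data) n
CROSS JOIN hrs h
EXCEPT
SELECT name, hour FROM hrly_data
ORDER BY 1, 2
"""


def missing_hours(rows, hours=range(8)):
    """Return the (name, hour) pairs that are absent from rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hrly_data (name TEXT, hour INTEGER)")
    conn.execute("CREATE TABLE hrs (hr INTEGER)")
    conn.executemany("INSERT INTO hrly_data VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO hrs VALUES (?)", [(h,) for h in hours])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(missing_hours(HRLY_DATA))
```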