Column into multiple columns with distinct count - sql

I have a table which looks like:
event_date | event_name | user_id
---------- | ---------- | -------
20220407   | n1         | a
20220407   | n2         | b
20220407   | n3         | a
20220408   | n1         | a
20220408   | n1         | a
20220408   | n2         | c
Each row represents a single event with its parameters (it is actually a BigQuery table with data from Firebase).
I want to select only the events I need and put the count of distinct users per event, grouped by day, into another table, like this:
date     | n1 distinct users count | n2 distinct users count
-------- | ----------------------- | -----------------------
20220407 | 1                       | 1
20220408 | 2                       | 0
I've tried something like:
SELECT COUNT (DISTINCT user_pseudo_id) as users
,event_date
event_name,
case app_info.id when 'com.kaspersky.standalone-vpn' then 'KSeC-iOS'
when 'com.kaspersky.secure.connection' then 'KSeC-Android'
when 'com.kaspersky.securityadvisor' then 'KSC-iOS'
when 'com.kaspersky.security.cloud' then 'KSC-Android'
else app_info.id end as product
, SUBSTRING(device.language, 1, 2) as language
, geo.country
, app_info.version as app_version
FROM `ksec-android.analytics_156657667.events_*`
WHERE (event_name = 'first_open' OR event_name = 'user_engagement' OR 'event_name' = 'app_remove')
and _table_suffix >= FORMAT_DATE("%Y%m%d",(date_sub(CURRENT_DATE(), interval 1 day)))
group by event_date
,product
,language
,country
,app_version
,event_name
) src
pivot
(
count(users)
for event_name in ([first_open], [user_engagement], [app_remove])
) piv
group by event_date
,product
,language
,country
,app_version
I really don't get it and would be very thankful for help.

Consider the approach below:
select * from your_table
pivot (count(distinct user_id) as count for event_name in ('n1', 'n2'))
If applied to the sample data in your question, the output has one row per event_date, with one distinct-user count column for each of n1 and n2.
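For readers without access to BigQuery, the same shape of result can be approximated with conditional aggregation. The sketch below is not the PIVOT operator itself: it assumes Python's built-in sqlite3 module, reuses the sample rows from the question, and the column names count_n1 and count_n2 are chosen here purely for illustration.

```python
import sqlite3

# Sample rows copied from the question: (event_date, event_name, user_id).
rows = [
    ("20220407", "n1", "a"),
    ("20220407", "n2", "b"),
    ("20220407", "n3", "a"),
    ("20220408", "n1", "a"),
    ("20220408", "n1", "a"),
    ("20220408", "n2", "c"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (event_date TEXT, event_name TEXT, user_id TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)

# Conditional aggregation: CASE yields NULL for every other event, and
# COUNT(DISTINCT ...) ignores NULLs, so each column counts one event only.
query = """
SELECT event_date,
       COUNT(DISTINCT CASE WHEN event_name = 'n1' THEN user_id END) AS count_n1,
       COUNT(DISTINCT CASE WHEN event_name = 'n2' THEN user_id END) AS count_n2
FROM your_table
GROUP BY event_date
ORDER BY event_date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On the sample rows each day ends up with one distinct user per event, because user a's two n1 events on 20220408 are counted once.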

Related

User Life Cycle SQL Query Logic in Snowflake

I am working on building a query to track the life cycle of a user through the platform via events. The table EVENTS has 3 columns: USER_ID, DATE_TIME and EVENT_NAME. Below is a snapshot of the table:
My query should return the result below (the first timestamp of the registered event, followed by the next timestamp of the following log_in event, and finally the next timestamp of the following landing_page event):
Below is my query:
WITH FIRST_STEP AS
(SELECT
USER_ID,
MIN(CASE WHEN EVENT_NAME = 'registered' THEN DATE_TIME ELSE NULL END) AS REGISTERED_TIMESTAMP
FROM EVENTS
GROUP BY 1
),
SECOND_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'log_in'
ORDER BY DATE_TIME
),
THIRD_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'landing_page'
ORDER BY DATE_TIME
)
SELECT
a.USER_ID,
a.REGISTERED_TIMESTAMP,
(SELECT
CASE WHEN b.DATE_TIME >= a.REGISTRATIONS_TIMESTAMP THEN b.DATE_TIME END AS LOG_IN_TIMESTAMP
FROM SECOND_STEP
LIMIT 1
),
(SELECT
CASE WHEN c.DATE_TIME >= LOG_IN_TIMESTAMP THEN c.DATE_TIME END AS LANDING_PAGE_TIMESTAMP
FROM THIRD_STEP
LIMIT 1
)
FROM FIRST_STEP AS a
LEFT JOIN SECOND_STEP AS b ON a.USER_ID = b.USER_ID
LEFT JOIN THIRD_STEP AS c ON b.USER_ID = c.USER_ID;
Unfortunately, I am getting the "SQL compilation error: Unsupported subquery type cannot be evaluated" error when I try to run the query.
This is a perfect use case for MATCH_RECOGNIZE.
The pattern you are looking for is register anything* login anything* landing and the measures are the min(iff(event_name='x', date_time, null)) for each.
Check:
https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1
https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html
Set the output to one row per match.
Untested sample query:
select *
from events
match_recognize(
partition by user_id
order by date_time
-- first timestamp of each event type within the match
measures min(iff(event_name='registered', date_time, null)) as t1
, min(iff(event_name='log_in', date_time, null)) as t2
, min(iff(event_name='landing_page', date_time, null)) as t3
one row per match
-- "anything" is not listed under define, so it matches any row
pattern(register anything* login anything* landing)
define
register as event_name = 'registered'
, login as event_name = 'log_in'
, landing as event_name = 'landing_page'
);
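To make the intended sequencing concrete, here is a plain-Python sketch of the funnel the question describes: the first registered event, then the next log_in after it, then the next landing_page after that. It is not how Snowflake evaluates MATCH_RECOGNIZE; the function name funnel and the sample events are hypothetical, since the question's sample table was not preserved. Unlike a pattern match, which only returns rows for complete matches, this sketch keeps users who did not finish the funnel and pads them with None.

```python
from datetime import datetime

# Hypothetical events (user_id, date_time, event_name); made up for illustration.
events = [
    ("u1", datetime(2022, 1, 1, 9, 0), "log_in"),         # before registration: skipped
    ("u1", datetime(2022, 1, 1, 10, 0), "registered"),
    ("u1", datetime(2022, 1, 1, 10, 5), "landing_page"),  # before log_in: skipped
    ("u1", datetime(2022, 1, 1, 11, 0), "log_in"),
    ("u1", datetime(2022, 1, 1, 11, 30), "landing_page"),
    ("u2", datetime(2022, 1, 2, 8, 0), "registered"),
]


def funnel(events, steps=("registered", "log_in", "landing_page")):
    """Return {user_id: [timestamp or None per step]}, matching steps in order."""
    result = {}
    # Sorting by (user, time) mirrors PARTITION BY user_id ORDER BY date_time.
    for user_id, ts, name in sorted(events, key=lambda e: (e[0], e[1])):
        found = result.setdefault(user_id, [])
        # Only the next unmatched step can be satisfied by this row.
        if len(found) < len(steps) and name == steps[len(found)]:
            found.append(ts)
    # Pad users who did not finish the funnel with None.
    return {u: f + [None] * (len(steps) - len(f)) for u, f in result.items()}


timestamps = funnel(events)
for user_id, stamps in timestamps.items():
    print(user_id, stamps)
```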

HPE Vertica live aggregate projection example for user retention

create table events(
id char(36) PRIMARY KEY,
game_id varchar(24) not null,
user_device_id char(36) not null,
event_name varchar(100) not null,
generated_at timestamp with time zone not null
);
SELECT
events.generated_at::DATE AS time_stamp,
COUNT(DISTINCT (
CASE WHEN
events.event_name = 'new_user' THEN events.user_device_id
END
)
) as new_users,
COUNT(DISTINCT (
CASE WHEN
future_events.event_name <> 'new_user' THEN future_events.user_device_id
END
)
) as returned_users,
COUNT(DISTINCT (
CASE WHEN
future_events.event_name <> 'new_user' THEN future_events.user_device_id
END
)) / COUNT(DISTINCT (
CASE WHEN
events.event_name = 'new_user' THEN events.user_device_id
END
))::float as retention
FROM events
LEFT JOIN events AS future_events ON
events.user_device_id = future_events.user_device_id AND
events.generated_at = future_events.generated_at - interval '1 day' AND
events.game_id = future_events.game_id
GROUP BY
time_stamp
ORDER BY
time_stamp;
I am trying to get the Day N (where N is any number from 1 to 7) user retention via the above SQL query. Because I am new to HPE Vertica, I have not been able to come up with the optimal statement for creating a live aggregate projection, since a projection can significantly improve the performance of the query.
A live aggregate projection won't help with a join query.
You can create a regular projection, segmented and sorted by the join columns, to achieve performance improvement:
CREATE PROJECTION events_p1 (
id,
game_id ENCODING RLE,
user_device_id ENCODING RLE,
event_name,
generated_at ENCODING RLE
) AS
SELECT id,
game_id,
user_device_id,
event_name,
generated_at
FROM events
ORDER BY generated_at,
game_id,
user_device_id
SEGMENTED BY hash(generated_at,game_id,user_device_id) ALL NODES KSAFE 1;

SQL count DISTINCT ONCE user_id multiple attributes

I can't manage to get the right result for the following case.
I have a table like this:
UserID | Label
-------- ------
1 | Private
1 | Public
2 | Private
3 | Hidden
4 | Public
5 | Hidden
I want to classify each user by the labels assigned to them:
Private and Hidden are treated the same: let's say Business
Public: BtoC
Public and Private and/or Hidden: both
So in the end I would have a count(DISTINCT UserID) of
Business 3
BtoC 1
both 1
I have tried to use CASE WHEN but it doesn't work. My current full query looks like this:
SELECT gen_month,
count(DISTINCT cu.id) as leads,
a.label
FROM generate_series(DATE_TRUNC('month', CURRENT_DATE::date - 96*INTERVAL '1 month'), CURRENT_DATE::date, '1 month') m(gen_month)
LEFT OUTER JOIN company_user AS cu
ON (date_trunc('month', cu.creation_date) = date_trunc('month', gen_month))
LEFT JOIN user u
ON u.user_id = cu.id
LEFT join user_account_status as uas
on cu.id = uas.user_id
LEFT JOIN account as a
on uas.account_id = a.id
where gen_month >= DATE_TRUNC('month',NOW() - INTERVAL '5 months')
group by m.gen_month, a.label
order by gen_month
So my main problem now is that a user is counted once under every label they have.
How can I make a user_id count only once, so that a user who has Public and (Private or Hidden) is counted only under both?
Addition: the databases are MySQL/MariaDB and PostgreSQL, but for a start I would be happy with a Postgres solution.
This is not integrated into your full query, but to count users for each category you can use:
with the_table(UserID , Label) as(
select 1 ,'Private' union all
select 1 ,'Public' union all
select 2 ,'Private' union all
select 3 ,'Hidden' union all
select 4 ,'Public' union all
select 5 ,'Hidden'
)
select result, count(*) from (
select UserID, case when min(Label) = 'Public' then 'BtoC' when max(Label) in('Private','Hidden') then 'Business' else 'both' end as result
from the_table
group by UserID
) t
group by result
with
my_table(user_id, label) as (values
(1,'Private'),
(1,'Public'),
(2,'Private'),
(3,'Hidden'),
(4,'Public'),
(5,'Hidden')),
t as (
select
user_id,
string_agg('{'||label||'}', '') as labels
from my_table
group by user_id),
tt as (
select
user_id,
labels,
case
when
position('{Public}' in labels) > 0 and (position('{Private}' in labels) > 0 or position('{Hidden}' in labels) > 0) then 'Both'
when
position('{Private}' in labels) > 0 or position('{Hidden}' in labels) > 0 then 'Business'
when
position('{Public}' in labels) > 0 then 'BtoC'
end as kind
from t)
select kind, count(*) from tt group by kind;
For MariaDB use GROUP_CONCAT() instead of PostgreSQL string_agg().
Note that the CASE expression checks conditions in order of appearance and returns the value for the first satisfied condition.
PS: Using PostgreSQL's arrays the conditions would be more elegant.
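A third variant, different from both answers above, avoids string matching by computing two per-user flags with conditional aggregation. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module so it can be run anywhere, and the names flags, has_public and has_business are made up for illustration. The SQL inside should carry over to PostgreSQL and MariaDB with little or no change, but that has not been verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (user_id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "Private"), (1, "Public"), (2, "Private"),
     (3, "Hidden"), (4, "Public"), (5, "Hidden")],
)

# Per user, flag whether any Public label and any Private/Hidden label exists,
# then map the two flags to a single category so each user is counted once.
query = """
WITH flags AS (
    SELECT user_id,
           MAX(CASE WHEN label = 'Public' THEN 1 ELSE 0 END) AS has_public,
           MAX(CASE WHEN label IN ('Private', 'Hidden') THEN 1 ELSE 0 END) AS has_business
    FROM my_table
    GROUP BY user_id
)
SELECT CASE
           WHEN has_public = 1 AND has_business = 1 THEN 'both'
           WHEN has_business = 1 THEN 'Business'
           WHEN has_public = 1 THEN 'BtoC'
       END AS kind,
       COUNT(*) AS users
FROM flags
GROUP BY kind
ORDER BY kind
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

As in the answers above, the order of the WHEN branches matters: the combined case has to be tested before the single-label cases.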

Sql query to return one single record per each combination in a table

For every combination of (from_id, to_id), I need the row with the minimum value among the loops matching a criterion.
So basically I need the loop that has the minimum value, e.g. from A to B I need the minimum value and its loop_id.
The table has the following fields:
value from_id to_id loop_id
-------------------------------------
2.3 A B 2
0.1 A C 2
2.1 A B 4
5.4 A C 4
So the result would be:
value from_id to_id loop_id
-------------------------------------
2.1 A B 4
0.1 A C 2
I have tried with the following:
SELECT t.value, t.from_id, t.to_id,t.loop_id
FROM myresults t
INNER JOIN (
SELECT min(m.value), m.from_id, m.to_id, m.loop_id
FROM myresults m where m.loop_id % 2 = 0
GROUP BY m.from_id, m.to_id, m.loop_id
) x
ON (x.from_id = t.from_id and x.to_id=t.to_id and x.loop_id=t.loop_id )
AND x.from_id = t.from_id and x.to_id=t.to_id and x.loop_id=t.loop_id
But it is returning all the loops.
Thanks in advance!
As I understand the problem this will work:
SELECT t.value, t.from_id, t.to_id, t.loop_id
FROM MyResults t
INNER JOIN
( SELECT From_ID, To_ID, MIN(Value) [Value]
FROM MyResults
WHERE Loop_ID % 2 = 0
GROUP BY From_ID, To_ID
) MinT
ON MinT.From_ID = t.From_ID
AND MinT.To_ID = t.To_ID
AND MinT.Value = t.Value
However, if you had duplicate values for a From_ID and To_ID combination e.g.
value from_id to_id loop_id
-------------------------------------
0.1 A B 2
0.1 A B 4
This would return both rows.
If you are using SQL-Server 2005 or later and you want the duplicate rows as stated above you could use:
SELECT Value, From_ID, To_ID, Loop_ID
FROM ( SELECT *, MIN(Value) OVER(PARTITION BY From_ID, To_ID) [MinValue]
FROM MyResults
WHERE Loop_ID % 2 = 0 -- same filter as in the first query
) t
WHERE Value = MinValue
If you did not want the duplicate rows you could use this:
SELECT Value, From_ID, To_ID, Loop_ID
FROM ( SELECT *, ROW_NUMBER() OVER(PARTITION BY From_ID, To_ID ORDER BY Value, Loop_ID) [RowNumber]
FROM MyResults
WHERE Loop_ID % 2 = 0 -- same filter as in the first query
) t
WHERE RowNumber = 1
Can't you do this a lot more simply?
SELECT
from_id,
to_id,
MIN(value)
FROM
myresults
WHERE
loop_id % 2 = 0
GROUP BY
from_id,
to_id
Or maybe I'm misunderstanding the question.
EDIT: To include loop_id
SELECT
m2.from_id,
m2.to_id,
m2.value,
m2.loop_id
FROM
myresults m2 INNER JOIN
(SELECT
m1.from_id,
m1.to_id,
MIN(m1.value) AS value
FROM
myresults m1
WHERE
m1.loop_id % 2 = 0
GROUP BY
m1.from_id,
m1.to_id) minset
ON
m2.from_id = minset.from_id
AND m2.to_id = minset.to_id
AND m2.value = minset.value
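The join-back-to-the-minimum approach can be checked against the sample rows from the question. The sketch below assumes Python's built-in sqlite3 module rather than SQL Server, and the alias min_value is chosen here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myresults (value REAL, from_id TEXT, to_id TEXT, loop_id INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO myresults VALUES (?, ?, ?, ?)",
    [(2.3, "A", "B", 2), (0.1, "A", "C", 2), (2.1, "A", "B", 4), (5.4, "A", "C", 4)],
)

# The subquery finds the minimum value per (from_id, to_id) among even loops;
# joining back on that value recovers the loop_id of the matching row.
query = """
SELECT t.value, t.from_id, t.to_id, t.loop_id
FROM myresults AS t
INNER JOIN (
    SELECT from_id, to_id, MIN(value) AS min_value
    FROM myresults
    WHERE loop_id % 2 = 0
    GROUP BY from_id, to_id
) AS m
    ON m.from_id = t.from_id
   AND m.to_id = t.to_id
   AND m.min_value = t.value
ORDER BY t.to_id
"""
best = conn.execute(query).fetchall()
for row in best:
    print(row)
```

Grouping by loop_id as well, as in the question's attempt, makes every loop its own group, which is why all loops were returned.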

speed up SQL Query

I have a query which takes a very long time to execute on anything older than, say, the past hour's worth of data. It is going to be used to create a view for data mining, so the expectation is that it can search back over weeks or months of data and return in a reasonable amount of time (even a couple of minutes is fine; I ran it for a date range of 10/3/2011 12:00pm to 10/3/2011 1:00pm and it took 44 minutes!).
The problem is with the two LEFT OUTER JOINs in the bottom. When I take those out, it can run in about 10 seconds. However, those are the bread and butter of this query.
This is all coming from one table. The ONLY thing this query returns differently than the original table is the column xweb_range. xweb_range is a calculated field column (range) which will only use the values from [LO,LC,RO,RC]_Avg where their corresponding [LO,LC,RO,RC]_Sensor_Alarm = 0 (do not include in range calculation if sensor alarm = 1)
WITH Alarm (sub_id,
LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm) AS (
SELECT sub_id, LO_Avg, LO_Sensor_Alarm, LC_Avg, LC_Sensor_Alarm, RO_Avg, RO_Sensor_Alarm, RC_Avg, RC_Sensor_Alarm
FROM dbo.some_table
where sub_id <> '0'
)
, AddRowNumbers AS (
SELECT rowNumber = ROW_NUMBER() OVER (ORDER BY LO_Avg)
, sub_id
, LO_Avg, LO_Sensor_Alarm
, LC_Avg, LC_Sensor_Alarm
, RO_Avg, RO_Sensor_Alarm
, RC_Avg, RC_Sensor_Alarm
FROM Alarm
)
, UnPivotColumns AS (
SELECT rowNumber, value = LO_Avg FROM AddRowNumbers WHERE LO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, LC_Avg FROM AddRowNumbers WHERE LC_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RO_Avg FROM AddRowNumbers WHERE RO_Sensor_Alarm = 0
UNION ALL SELECT rowNumber, RC_Avg FROM AddRowNumbers WHERE RC_Sensor_Alarm = 0
)
SELECT rowNumber.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
, COALESCE(range1.range, range2.range) AS xweb_range
FROM AddRowNumbers rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = MAX(value) - MIN(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) > 1) range1 ON range1.rowNumber = rowNumber.rowNumber
LEFT OUTER JOIN (SELECT rowNumber, range = AVG(value) FROM UnPivotColumns GROUP BY rowNumber HAVING COUNT(*) = 1) range2 ON range2.rowNumber = rowNumber.rowNumber
INNER JOIN dbo.some_table cds
ON rowNumber.sub_id = cds.sub_id
It's difficult to understand exactly what your query is trying to do without knowing the domain. However, it seems to me like your query is simply trying to find, for each row in dbo.some_table where sub_id is not 0, the range of the following columns in the record (or, if only one matches, that single value):
LO_AVG when LO_SENSOR_ALARM=0
LC_AVG when LC_SENSOR_ALARM=0
RO_AVG when RO_SENSOR_ALARM=0
RC_AVG when RC_SENSOR_ALARM=0
You constructed this query by assigning each row a sequential row number, unpivoting the _AVG columns along with their row number, computing the range aggregate grouped by row number, and then joining back to the original records by row number. CTEs don't materialize results (nor are they indexed, as discussed in the comments), so each reference to AddRowNumbers is expensive, because ROW_NUMBER() OVER (ORDER BY LO_Avg) requires a sort.
Instead of cutting this table up just to join it back together by row number, why not do something like:
SELECT cds.sub_id
, cds.equipment_id
, cds.read_time
, cds.LC_Avg
, cds.LC_Dev
, cds.LC_Ref_Gap
, cds.LC_Sensor_Alarm
, cds.LO_Avg
, cds.LO_Dev
, cds.LO_Ref_Gap
, cds.LO_Sensor_Alarm
, cds.RC_Avg
, cds.RC_Dev
, cds.RC_Ref_Gap
, cds.RC_Sensor_Alarm
, cds.RO_Avg
, cds.RO_Dev
, cds.RO_Ref_Gap
, cds.RO_Sensor_Alarm
--if the COUNT is 0, xweb_range will be null (since MAX will be null), if it's 1, then use MAX, else use MAX - MIN (as per your example)
, (CASE WHEN stats.[Count] < 2 THEN stats.[MAX] ELSE stats.[MAX] - stats.[MIN] END) xweb_range
FROM dbo.some_table cds
--cross join on the following table derived from values in cds - it will always contain 1 record per row of cds
CROSS APPLY
(
SELECT COUNT(*), MIN(Value), MAX(Value)
FROM
(
--construct a table using the column values from cds we wish to aggregate
VALUES (LO_AVG, LO_SENSOR_ALARM),
(LC_AVG, LC_SENSOR_ALARM),
(RO_AVG, RO_SENSOR_ALARM),
(RC_AVG, RC_SENSOR_ALARM)
) x (Value, Sensor_Alarm) --give a name to the columns for _AVG and _ALARM
WHERE Sensor_Alarm = 0 --filter our constructed table where _ALARM=0
) stats([Count], [Min], [Max]) --give our derived table and its columns some names
WHERE cds.sub_id <> '0' --this is a filter carried over from the first CTE in your example
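The per-row rule that the CROSS APPLY computes can be written out in a few lines of Python. This is only an illustration of the logic, assuming each row is a dict keyed by the column names from the question; the function name xweb_range and the sample row are hypothetical.

```python
def xweb_range(row):
    """Range of the *_Avg readings whose matching *_Sensor_Alarm is 0."""
    values = [
        row[f"{prefix}_Avg"]
        for prefix in ("LO", "LC", "RO", "RC")
        if row[f"{prefix}_Sensor_Alarm"] == 0
    ]
    if not values:
        return None       # no usable sensor: NULL in the SQL version
    if len(values) == 1:
        return values[0]  # a single usable sensor: report its value
    return max(values) - min(values)


# Hypothetical row: the RC sensor is in alarm, so its reading is excluded.
sample = {
    "LO_Avg": 10, "LO_Sensor_Alarm": 0,
    "LC_Avg": 14, "LC_Sensor_Alarm": 0,
    "RO_Avg": 12, "RO_Sensor_Alarm": 0,
    "RC_Avg": 99, "RC_Sensor_Alarm": 1,
}
print(xweb_range(sample))
```

Everything needed comes from the row itself, which is why the rewritten query can drop the row numbering and the self-joins.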