HPE Vertica live aggregate projection example for user retention

CREATE TABLE events (
    id CHAR(36) PRIMARY KEY,
    game_id VARCHAR(24) NOT NULL,
    user_device_id CHAR(36) NOT NULL,
    event_name VARCHAR(100) NOT NULL,
    generated_at TIMESTAMP WITH TIME ZONE NOT NULL
);
SELECT
    events.generated_at::DATE AS time_stamp,
    COUNT(DISTINCT (CASE WHEN events.event_name = 'new_user'
                         THEN events.user_device_id END)) AS new_users,
    COUNT(DISTINCT (CASE WHEN future_events.event_name <> 'new_user'
                         THEN future_events.user_device_id END)) AS returned_users,
    COUNT(DISTINCT (CASE WHEN future_events.event_name <> 'new_user'
                         THEN future_events.user_device_id END))
      / COUNT(DISTINCT (CASE WHEN events.event_name = 'new_user'
                             THEN events.user_device_id END))::FLOAT AS retention
FROM events
LEFT JOIN events AS future_events
       ON events.user_device_id = future_events.user_device_id
      AND events.generated_at = future_events.generated_at - INTERVAL '1 day'
      AND events.game_id = future_events.game_id
GROUP BY time_stamp
ORDER BY time_stamp;
I am trying to get Day N user retention ('N' being any number between 1 and 7) via the above SQL query. Since I am new to HPE Vertica, I have not been able to come up with the optimal statement for creating an aggregate projection, even though a projection should significantly improve the performance of the query.

An aggregate projection won't help with a join query.
You can create a regular projection, segmented and sorted by the join columns, to get a performance improvement:
CREATE PROJECTION events_p1 (
    id,
    game_id ENCODING RLE,
    user_device_id ENCODING RLE,
    event_name,
    generated_at ENCODING RLE
) AS
SELECT id,
       game_id,
       user_device_id,
       event_name,
       generated_at
FROM events
ORDER BY generated_at,
         game_id,
         user_device_id
SEGMENTED BY HASH(generated_at, game_id, user_device_id) ALL NODES KSAFE 1;
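After creating the projection, refresh it so it is populated with the rows already in the table, then verify the optimizer actually picks it. A quick sanity check (a hedged sketch; REFRESH and EXPLAIN are standard Vertica, but verify on your version):
SELECT REFRESH('events');  -- populates events_p1 from existing data
Then prepend EXPLAIN to the retention query and look for events_p1 among the projections chosen in the plan.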

Related

Oracle SQL - Timestamp splits query result into 2 rows, need all in one row with a timestamp

I need a time-based query (Random or Current) with all results in one row. My current query is as follows:
WITH started AS
(
SELECT f.*, CURRENT_DATE + ROWNUM / 24
FROM
(
SELECT
d.route_name,
d.op_name,
d.route_step_name,
nvl(MAX(DECODE(d.complete_reason, NULL, d.op_STARTS)), 0) started_units,
round(nvl(MAX(DECODE(d.complete_reason, 'PASS', d.op_complete)), 0) / d.op_starts * 100, 2) yield
FROM
(
SELECT route_name,
op_name,
route_step_name,
complete_reason,
complete_quantity,
sum(start_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON) op_starts,
sum(complete_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON ) op_complete
FROM FTPC_LT_PRDACT.tracked_object_history
WHERE route_name = 'HEADER FINAL ASSEMBLY'
AND OP_NAME NOT LIKE '%DISPOSITION%'
and (tobj_type = 'Lot')
AND xfr_insert_pid IN
(
SELECT xfr_start_id
FROM FTPC_LT_PRDACT.xfr_interval_id
WHERE last_modified_time <= SYSDATE
AND OP_NAME NOT LIKE '%DISPOSITION%'
AND (complete_reason = 'PASS' OR complete_reason IS NULL)
)
) d
GROUP BY d.route_name, d.op_name, d.route_step_name, complete_reason, d.op_starts
ORDER BY d.route_step_name
) f
),
queued AS
(
SELECT
ts.route_name,
ts.queue_name,
o.op_name,
sum (th.complete_quantity) queued_units
FROM
FTPC_LT_PRDACT.tracked_object_HISTORY th,
FTPC_LT_PRDACT.tracked_object_status ts,
FTPC_LT_PRDACT.route_arc a,
FTPC_LT_PRDACT.route_step r,
FTPC_LT_PRDACT.operation o,
FTPC_LT_PRDACT.lot l
WHERE r.op_key = o.op_key
and l.lot_key = th.tobj_key
AND a.to_node_key = r.route_step_key
AND a.from_node_key = ts.queue_key
and th.tobj_history_key = ts.tobj_history_key
AND a.main_path = 1
AND (ts.tobj_type = 'Lot')
AND O.OP_NAME NOT LIKE '%DISPOSITION%'
and th.route_name = 'HEADER FINAL ASSEMBLY'
GROUP BY ts.route_name, ts.queue_name, o.op_name
)
SELECT
started.route_name,
started.op_name,
started.route_step_name,
max(started.yield) started_yield,
max(started.started_units) started_units,
case when queued.queue_name is NULL then 'N/A' else queued.queue_name end QUEUE_NAME,
case when queued.queued_units is NULL then 0 else queued.queued_units end QUEUED_UNITS
FROM started
left JOIN queued ON started.op_name = queued.op_name
group by started.route_name, started.op_name, started.route_step_name, queued.queue_name, QUEUED_UNITS
order by started.route_step_name asc
;
Current query output (as expected), but missing a timestamp (screenshot omitted).
I need to have a timestamp on each individual row so that a different application can display the results. Any help would be greatly appreciated! When I try to add a timestamp, my query result is altered:
Query output once the timestamp is added (screenshot omitted).
Edit: I need to display the query in a visualization tool. That tool is time-based and will skew the table results unless there is a datetime associated with each field. The datetime value can be random, but it cannot be the same for every result.
The query is to be displayed on a live dashboard, and every time the application is refreshed the query results are expected to update.
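One possible direction (a hedged, untested sketch, not from the thread): since the datetime only needs to be distinct per row rather than meaningful, wrap the final SELECT and add an aliased ROWNUM offset, much like the started CTE already does:
SELECT f.*, SYSDATE + (ROWNUM / 24) AS time_stamp  -- synthetic value, one hour apart per row
FROM (
    -- the final SELECT ... FROM started LEFT JOIN queued ... ORDER BY ... from above
) f;
The alias matters here: it gives the visualization tool a stable column name to bind to, and f.* keeps the rest of the result unchanged.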

Access top query column from subquery in SQL

In my subquery I want to access the columns of the outer query to perform a coalesce. This is obviously not possible as written, but how can I restructure my query to make it work? I am out of ideas.
WITH selected_students AS
(
SELECT amka, course_code
FROM "Register"
WHERE EXISTS (SELECT course_code FROM "CourseRun"
WHERE semesterrunsin = 24
AND course_code = "Register".course_code
AND serial_number = "Register".serial_number)
)
SELECT
amka, course_code,
FLOOR(RANDOM()*11) :: numeric,
COALESCE((SELECT lab_grade
FROM
(SELECT lab_grade FROM "Register_tmp"
WHERE "Register_tmp".amka = <outer.amka>
AND course_code = <outer.course_code>
AND serial_number < 12
ORDER BY serial_number DESC
LIMIT(1)) a
WHERE lab_grade >= 5), FLOOR(RANDOM()*11) :: numeric)
FROM
selected_students
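One common rewrite (a hedged sketch, not from the thread): in PostgreSQL, a LATERAL subquery in the FROM clause may reference columns of FROM items that precede it, which removes the need for the <outer.*> placeholders. Keeping the selected_students CTE as above:
SELECT
    s.amka, s.course_code,
    FLOOR(RANDOM()*11)::numeric,
    COALESCE(g.lab_grade, FLOOR(RANDOM()*11)::numeric)
FROM selected_students s
LEFT JOIN LATERAL (
    SELECT a.lab_grade
    FROM (SELECT lab_grade
          FROM "Register_tmp"
          WHERE "Register_tmp".amka = s.amka      -- outer reference, legal inside LATERAL
            AND course_code = s.course_code
            AND serial_number < 12
          ORDER BY serial_number DESC
          LIMIT 1) a
    WHERE a.lab_grade >= 5                        -- grade filter stays outside the LIMIT, as in the original
) g ON TRUE;
LEFT JOIN ... ON TRUE keeps students with no qualifying grade, so COALESCE still falls back to the random value.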

User Life Cycle SQL Query Logic in Snowflake

I am working on building a query to track the life cycle of a user through the platform via events. The table EVENTS has 3 columns: USER_ID, DATE_TIME and EVENT_NAME. Below is a snapshot of the table (screenshot omitted).
My query should return the result below (screenshot omitted): the first timestamp of the registered event, followed by the immediate/next timestamp of the following log_in event, and finally the immediate/next timestamp of the final landing_page event.
Below is my query:
WITH FIRST_STEP AS
(SELECT
USER_ID,
MIN(CASE WHEN EVENT_NAME = 'registered' THEN DATE_TIME ELSE NULL END) AS REGISTERED_TIMESTAMP
FROM EVENTS
GROUP BY 1
),
SECOND_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'log_in'
ORDER BY DATE_TIME
),
THIRD_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'landing_page'
ORDER BY DATE_TIME
)
SELECT
a.USER_ID,
a.REGISTERED_TIMESTAMP,
(SELECT
CASE WHEN b.DATE_TIME >= a.REGISTERED_TIMESTAMP THEN b.DATE_TIME END AS LOG_IN_TIMESTAMP
FROM SECOND_STEP
LIMIT 1
),
(SELECT
CASE WHEN c.DATE_TIME >= LOG_IN_TIMESTAMP THEN c.DATE_TIME END AS LANDING_PAGE_TIMESTAMP
FROM THIRD_STEP
LIMIT 1
)
FROM FIRST_STEP AS a
LEFT JOIN SECOND_STEP AS b ON a.USER_ID = b.USER_ID
LEFT JOIN THIRD_STEP AS c ON b.USER_ID = c.USER_ID;
Unfortunately I am getting the "SQL compilation error: Unsupported subquery type cannot be evaluated" error when I try to run the query.
This is a perfect use case for MATCH_RECOGNIZE.
The pattern you are looking for is register anything* login anything* landing, and the measures are min(iff(event_name='x', date_time, null)) for each.
Check:
https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1
https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html
Set the output to one row per match.
Untested sample query:
select *
from data
match_recognize(
partition by user_id
order by date_time
measures min(iff(event_name='registered', date_time, null)) as t1
, min(iff(event_name='log_in', date_time, null)) as t2
, min(iff(event_name='landing_page', date_time, null)) as t3
one row per match
pattern(register anything* login anything* landing)
define
register as event_name = 'registered'
, login as event_name = 'log_in'
, landing as event_name = 'landing_page'
);

I want to reduce my SQL Query on big Query

I want to fetch data from a BigQuery database but I get an error:
=> The query is too large. The maximum query length is 256.000K characters, including comments and white space characters.
I will show the part of the query that I repeated 21 times:
WITH data AS
(
SELECT
IFNULL(department, 'UNKNOWN_DEPARTMENT') AS dept,
'C7s' AS campus,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1542565800000 AND 1543170599999) AS taskCount_0,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1542565800000 AND 1543170599999
AND IF (task.deadline.currentEscalationLevel NOT IN
(
'ESC_ACKNOWLEDGEMENT'
)
, task.deadline.currentEscalationLevel, 'NOT_ESCALATED') NOT IN
(
'NOT_ESCALATED'
)
) AS escCount_0,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1541961000000 AND 1542565799999) AS taskCount_1,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1541961000000 AND 1542565799999
AND IF (task.deadline.currentEscalationLevel NOT IN
(
'ESC_ACKNOWLEDGEMENT'
)
, task.deadline.currentEscalationLevel, 'NOT_ESCALATED') NOT IN
(
'NOT_ESCALATED'
)
) AS escCount_1,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1541356200000 AND 1541960999999) AS taskCount_2,
COUNTIF(task.taskRaised.raisedAt.milliSeconds BETWEEN 1541356200000 AND 1541960999999
AND IF (task.deadline.currentEscalationLevel NOT IN
(
'ESC_ACKNOWLEDGEMENT'
)
, task.deadline.currentEscalationLevel, 'NOT_ESCALATED') NOT IN
(
'NOT_ESCALATED'
)
) AS escCount_2
FROM
`nsimplbigquery.TaskManagement.C7s_*`
WHERE
_TABLE_SUFFIX IN
(
'2018_47_11',
'2018_45_11',
'2018_46_11'
)
AND IFNULL(department, 'UNKNOWN_DEPARTMENT') IN
(
'ENGG_AND_MAINT_DEPARTMENT',
'FNB_DEPARTMENT',
'TELECOM_DEPARTMENT',
'IT_DEPARTMENT',
'BILLING_AND_INSURANCE',
'HOUSEKEEPING_DEPARTMENT'
)
AND task.taskRaised.raisedAt.milliSeconds BETWEEN 1541356200000 AND 1543170599999
GROUP BY
dept
)
,
mainQuery AS
(
SELECT
dept,
campus,
SUM(taskCount_0) AS taskCount_0,
SUM(escCount_0) AS escCount_0,
CAST(SAFE_DIVIDE(SUM(escCount_0), SUM(taskCount_0)) * 10000 AS INT64) AS escPerc_0,
SUM(taskCount_1) AS taskCount_1,
SUM(escCount_1) AS escCount_1,
CAST(SAFE_DIVIDE(SUM(escCount_1), SUM(taskCount_1)) * 10000 AS INT64) AS escPerc_1,
SUM(taskCount_2) AS taskCount_2,
SUM(escCount_2) AS escCount_2,
CAST(SAFE_DIVIDE(SUM(escCount_2), SUM(taskCount_2)) * 10000 AS INT64) AS escPerc_2
FROM
data
GROUP BY
ROLLUP (campus, dept)
)
SELECT
dept,
campus,
taskCount_0,
escCount_0,
escPerc_0,
taskCount_1,
escCount_1,
escPerc_1,
taskCount_2,
escCount_2,
escPerc_2
FROM
mainQuery
WHERE
campus IS NOT NULL
ORDER BY
CASE
WHEN
dept IS NULL
THEN
1
ELSE
0
END
ASC, dept ASC, campus ASC;
This is the query that I repeat so many times, because I have so many IDs; where you see C7s, I substitute each of the following IDs:
C7z,
C7u,
H0B,
IDp,
ITR,
C7i,
C7j,
C7k,
C7l,
C7m,
C7o,
C71,
C7t,
F6qZ,
C7w,
GIui,
Fs,
C70,
C7p,
C7r
As I explained above, I quoted the line nsimplbigquery.TaskManagement.C7s_*; in the next copy of the query the table name changes, for example to nsimplbigquery.TaskManagement.C7z_*.
Instead of repeating your whole SELECT statement 21 times, rather use the approach below. You will have 3 x 21 = 63 entries in the list for _TABLE_SUFFIX, but you will get around your issue with query length:
FROM `nsimplbigquery.TaskManagement.*`
WHERE _TABLE_SUFFIX IN (
'C7s_2018_47_11',
'C7s_2018_45_11',
'C7s_2018_46_11',
'C7z_2018_47_11',
'C7z_2018_45_11',
'C7z_2018_46_11',
'C7u_2018_47_11',
'C7u_2018_45_11',
'C7u_2018_46_11',
...
...
...
'C7r_2018_47_11',
'C7r_2018_45_11',
'C7r_2018_46_11'
)
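If maintaining 63 literals gets unwieldy, a regular expression over _TABLE_SUFFIX should work as well (a hedged alternative, untested; REGEXP_CONTAINS is standard BigQuery SQL):
FROM `nsimplbigquery.TaskManagement.*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX,
    r'^(C7s|C7z|C7u|H0B|IDp|ITR|C7i|C7j|C7k|C7l|C7m|C7o|C71|C7t|F6qZ|C7w|GIui|Fs|C70|C7p|C7r)_2018_(45|46|47)_11$')
Note that the hard-coded 'C7s' AS campus then has to come from the table name instead, e.g. SPLIT(_TABLE_SUFFIX, '_')[OFFSET(0)] AS campus.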

Function-based index

select
<here I have functions like to_char, nvl, rtrim, ltrim, sum, decode>
from
table1,
table2
where
joining conditions 1
joining conditions 2
group by
<here I have functions like to_char, nvl, rtrim, ltrim, sum, decode>
I got this query from production and, looking at it, I need to propose a few tuning options. I am thinking of using a function-based index for the GROUP BY columns; I think the SELECT-only columns do not need an index. I will get an environment in a couple of days, but before that I need to come up with different approaches. What do I need to check to determine whether a function-based index would be useful? Also, apart from the explain plan, which other documents should I ask the DBAs for?
I am adding the actual SQL here; I have asked for the explain plan, which I will get in some time:
SELECT
D_E_TRADE.DATE_VALUE,
to_char(D_E_TRADE.DATE_VALUE,'Mon-yyyy'),
NVL(P_DIM.P_NAME,' '),
rtrim(ltrim(P_DIM.C_CTRY)),
D_E_TRADE.YEAR,
L_E_DIM.L_CODE,
NVL(D_DIM.DESCR,' '),
( decode(D_DIM.DEPT_ID,'-1',' ',D_DIM.DEPT_ID) ),
sum(A_CGE.TOTAL_CALC_NET_FEES),
L_E_DIM.L_NAME,
decode(A_CGE.E_M_CENTER,-9,0,A_CGE.E_M_CENTER),
NVL(F_DIM.S_DESC,'-1'),
sum(A_CGE.C_TOTAL_SHARES)
FROM
DATE_D D_E_TRADE,
P_DIM,
L_E_DIM,
D_DIM,
A_CGE,
F_DIM
WHERE
( D_E_TRADE.DATE_KEY=A_CGE.T_KEY )
AND ( P_DIM.PARTY_KEY=A_CGE.E_P_KEY )
AND ( F_DIM.F_T_KEY=A_CGE.F_T_KEY )
AND ( L_E_DIM.L_E_KEY=A_CGE.L_E_KEY )
AND ( D_DIM.DEPT_KEY=A_CGE.DEPT_KEY )
AND
(
rtrim(ltrim(P_DIM.C_CTRY)) = 'AC'
AND
( A_CGE.T_KEY >= (SELECT DATE_D_PROMPTS.DATE_KEY FROM DATE_D DATE_D_PROMPTS WHERE ( DATE_D_PROMPTS.DATE_VALUE = '01-01-2012 00:00:00' ) )
AND
A_CGE.T_KEY <= (SELECT DATE_D_PROMPTS.DATE_KEY FROM DATE_D DATE_D_PROMPTS WHERE ( DATE_D_PROMPTS.DATE_VALUE = '31-08-2012 00:00:00' ))
AND
A_CGE.TRANS_REGION_KEY IN (SELECT REGION_KEY FROM REGION_DIM WHERE REGION_DIM.REGION_NAME IN ('Americas') ) )
AND
( A_CGE.T_KEY >= (SELECT DATE_D_PROMPTS.DATE_KEY FROM DATE_D DATE_D_PROMPTS WHERE ( DATE_D_PROMPTS.DATE_VALUE = '01-01-2012 00:00:00' ) )
AND
A_CGE.T_KEY <= (SELECT DATE_D_PROMPTS.DATE_KEY FROM DATE_D DATE_D_PROMPTS WHERE ( DATE_D_PROMPTS.DATE_VALUE = '31-08-2012 00:00:00' ))
AND
A_CGE.TRANS_REGION_KEY IN (SELECT REGION_KEY FROM REGION_DIM WHERE REGION_DIM.REGION_NAME IN ('Americas') ) )
AND
( 'All Fees' IN ('2 - E','3 - P','4 - F','5 - C') OR A_CGE.F_T_KEY IN (SELECT F_T_KEY FROM F_DIM WHERE (F_DIM.s_id ) || ' - ' || ( F_DIM.CHARGE_LVL1_NAME ) IN ('2 - E','3 - P','4 - F','5 - C')) )
)
GROUP BY
D_E_TRADE.DATE_VALUE,
to_char(D_E_TRADE.DATE_VALUE,'Mon-yyyy'),
NVL(P_DIM.P_NAME,' '),
rtrim(ltrim(P_DIM.C_CTRY)),
D_E_TRADE.YEAR,
L_E_DIM.L_CODE,
NVL(D_DIM.DESCR,' '),
( decode(D_DIM.DEPT_ID,'-1',' ',D_DIM.DEPT_ID) ),
L_E_DIM.L_NAME,
decode(A_CGE.E_M_CENTER,-9,0,A_CGE.E_M_CENTER),
NVL(F_DIM.S_DESC,'-1')
Generally, indexes help you retrieve data quickly when you have filtering conditions that can use them.
(Another case would be when you retrieve only columns that are in the index, so the engine does not need to read anything from the table.)
In your case, you may need indexes on the filtering/join conditions in this part:
joining conditions 1
joining conditions 2
But keep in mind: if you fetch more than 15%-20% of the rows of a table, it is better to read from the table directly. In that case, the index may not be used at all.
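For example (a hedged sketch against the production query above; the index name is made up and this is untested): the filter rtrim(ltrim(P_DIM.C_CTRY)) = 'AC' can only use an index if the index is built on that exact expression:
-- Function-based index matching the WHERE-clause expression verbatim
CREATE INDEX p_dim_ctry_fbi ON P_DIM (RTRIM(LTRIM(C_CTRY)));
-- Regather statistics so the optimizer can cost the new index
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'P_DIM', cascade => TRUE);
Then compare the explain plan before and after; if P_DIM is still fully scanned, the 15%-20% selectivity rule above is the first thing to check.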