Postgres: Subquery with GROUP BY

I'm trying to optimize a query (instead of repeating it many times) with the NON-FUNCTIONAL code below (it fails because a subquery in the SELECT list can return only one column):
SELECT
e.pageview_identifier,
e.created_at,
e.pageview_current_url,
e.pageview_mobile,
(
SELECT event_type, COUNT(event_identifier)
FROM events v
WHERE
v.company_identifier = e.company_identifier AND
v.user_identifier = e.user_identifier AND
v.pageview_identifier = e.pageview_identifier
GROUP BY v.event_type
)
FROM events e
WHERE
company_identifier = 'xyz' AND
user_identifier = '01CDQZVSJFBDA8W444JS2CS3BA' AND
event_type = 'page:view';
Basically, I want to retrieve the columns as
pageview_identifier, created_at, ..., event_type_a_count, event_type_b_count, ...
A FUNCTIONAL version that works is:
SELECT
e.pageview_identifier,
e.created_at,
e.pageview_current_url,
e.pageview_mobile,
(
SELECT COUNT(event_identifier)
FROM events v
WHERE
v.company_identifier = e.company_identifier AND
v.user_identifier = e.user_identifier AND
v.pageview_identifier = e.pageview_identifier AND
v.event_type = 'mouse:move'
) as mouse_move_count
FROM events e
WHERE
company_identifier = 'xyz' AND
user_identifier = '01CDQZVSJFBDA8W444JS2CS3BA' AND
event_type = 'page:view';
But in this case, I would need to repeat this subquery once for each kind of event_type.
Edit 1 - More information:
In my WHERE clause, I restrict the rows to event_type = 'page:view'. There are several possible values for event_type, and for each page:view I need to count the related events (of each other event_type), matched on the condition e.pageview_identifier = v.pageview_identifier.

Just use a window function:
SELECT e.pageview_identifier,
e.created_at,
e.pageview_current_url,
e.pageview_mobile,
COUNT(*) OVER (PARTITION BY e.company_identifier, e.user_identifier, e.pageview_identifier) as cnt
FROM events e
WHERE e.company_identifier = 'xyz' AND
e.user_identifier = '01CDQZVSJFBDA8W444JS2CS3BA' AND
e.event_type = 'page:view';
Note: This counts only 'page:view' events, because the WHERE clause filters the rows before the window function is evaluated. If you want a count of each event type, then one way is:
SELECT e.*
FROM (SELECT e.pageview_identifier,
e.created_at,
e.pageview_current_url,
e.pageview_mobile,
-- event_type must be selected here so the outer WHERE can filter on it
e.event_type,
COUNT(*) FILTER (WHERE e.event_type = 'mouse:move') OVER (PARTITION BY e.company_identifier, e.user_identifier, e.pageview_identifier) as cnt_mouse_move,
COUNT(*) FILTER (WHERE e.event_type = 'page:view') OVER (PARTITION BY e.company_identifier, e.user_identifier, e.pageview_identifier) as cnt_page_view,
. . .
FROM events e
WHERE e.company_identifier = 'xyz' AND
e.user_identifier = '01CDQZVSJFBDA8W444JS2CS3BA'
) e
WHERE e.event_type = 'page:view';
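To see the FILTER-plus-window approach end to end, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (which accepts FILTER on window aggregates); the sample rows, the short identifiers, and the 'click' event type are made up for illustration, while the column names follow the question.

```python
import sqlite3

# In-memory database with a cut-down, hypothetical version of the events table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    company_identifier  TEXT,
    user_identifier     TEXT,
    pageview_identifier TEXT,
    event_identifier    INTEGER,
    event_type          TEXT
);
INSERT INTO events VALUES
    ('xyz', 'u1', 'pv1', 1, 'page:view'),
    ('xyz', 'u1', 'pv1', 2, 'mouse:move'),
    ('xyz', 'u1', 'pv1', 3, 'mouse:move'),
    ('xyz', 'u1', 'pv1', 4, 'click'),
    ('xyz', 'u1', 'pv2', 5, 'page:view'),
    ('xyz', 'u1', 'pv2', 6, 'mouse:move');
""")

# Count every event type per page view first, then keep only the page:view rows.
query = """
SELECT pageview_identifier, cnt_mouse_move, cnt_click
FROM (SELECT e.pageview_identifier,
             e.event_type,
             COUNT(*) FILTER (WHERE e.event_type = 'mouse:move')
                 OVER (PARTITION BY e.company_identifier, e.user_identifier, e.pageview_identifier)
                 AS cnt_mouse_move,
             COUNT(*) FILTER (WHERE e.event_type = 'click')
                 OVER (PARTITION BY e.company_identifier, e.user_identifier, e.pageview_identifier)
                 AS cnt_click
      FROM events e
      WHERE e.company_identifier = 'xyz' AND
            e.user_identifier = 'u1'
     ) e
WHERE e.event_type = 'page:view'
ORDER BY pageview_identifier;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The important detail is that the event_type filter sits in the outer query: the inner query has to see all event types, otherwise every count except the page:view one would be zero.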

Related

User Life Cycle SQL Query Logic in Snowflake

I am working on building a query to track the life cycle of a user through the platform via events. The table EVENTS has 3 columns: USER_ID, DATE_TIME and EVENT_NAME. Below is a snapshot of the table,
My query should return the below result (the first timestamp for the registered event, followed by the immediate/next timestamp of the following log_in event, and finally followed by the immediate/next timestamp of the final landing_page event),
Below is my query:
WITH FIRST_STEP AS
(SELECT
USER_ID,
MIN(CASE WHEN EVENT_NAME = 'registered' THEN DATE_TIME ELSE NULL END) AS REGISTERED_TIMESTAMP
FROM EVENTS
GROUP BY 1
),
SECOND_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'log_in'
ORDER BY DATE_TIME
),
THIRD_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'landing_page'
ORDER BY DATE_TIME
)
SELECT
a.USER_ID,
a.REGISTERED_TIMESTAMP,
(SELECT
CASE WHEN b.DATE_TIME >= a.REGISTRATIONS_TIMESTAMP THEN b.DATE_TIME END AS LOG_IN_TIMESTAMP
FROM SECOND_STEP
LIMIT 1
),
(SELECT
CASE WHEN c.DATE_TIME >= LOG_IN_TIMESTAMP THEN c.DATE_TIME END AS LANDING_PAGE_TIMESTAMP
FROM THIRD_STEP
LIMIT 1
)
FROM FIRST_STEP AS a
LEFT JOIN SECOND_STEP AS b ON a.USER_ID = b.USER_ID
LEFT JOIN THIRD_STEP AS c ON b.USER_ID = c.USER_ID;
Unfortunately, I am getting the error "SQL compilation error: Unsupported subquery type cannot be evaluated" when I try to run the query.

This is a perfect use case for MATCH_RECOGNIZE.
The pattern you are looking for is register anything* login anything* landing, and the measures are min(iff(event_name='x', date_time, null)) for each event name.
Check:
https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1
https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html
Set the output to one row per match.
Untested sample query:
select *
from events
match_recognize(
partition by user_id
order by date_time
measures min(iff(event_name='registered', date_time, null)) as t1
, min(iff(event_name='log_in', date_time, null)) as t2
, min(iff(event_name='landing_page', date_time, null)) as t3
one row per match
pattern(register anything* login anything* landing)
define
register as event_name = 'registered'
, login as event_name = 'log_in'
, landing as event_name = 'landing_page'
);
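To make the intended semantics concrete (first registered, then the next log_in after it, then the next landing_page after that), here is a plain Python sketch that is independent of Snowflake. It is not a translation of MATCH_RECOGNIZE, just the same stepwise logic; the function name life_cycle and the sample rows are hypothetical.

```python
from datetime import datetime

STEPS = ("registered", "log_in", "landing_page")

def life_cycle(events):
    """Map each user_id to (registered_ts, log_in_ts, landing_page_ts).

    events is an iterable of (user_id, date_time, event_name) tuples.
    A step that never happens in order stays None.
    """
    result = {}
    # Walk each user's events in chronological order.
    for user_id, ts, name in sorted(events, key=lambda e: (e[0], e[1])):
        found = result.setdefault(user_id, [None, None, None])
        if None not in found:
            continue  # all three steps already matched for this user
        next_step = found.index(None)
        # Only accept the event if it is the next step we are waiting for.
        if name == STEPS[next_step]:
            found[next_step] = ts
    return {user_id: tuple(found) for user_id, found in result.items()}

sample = [
    (1, datetime(2021, 1, 1, 10, 0), "log_in"),        # before registration: ignored
    (1, datetime(2021, 1, 1, 11, 0), "registered"),
    (1, datetime(2021, 1, 1, 12, 0), "landing_page"),  # before any log_in: ignored
    (1, datetime(2021, 1, 1, 13, 0), "log_in"),
    (1, datetime(2021, 1, 1, 14, 0), "landing_page"),
    (1, datetime(2021, 1, 1, 15, 0), "log_in"),        # after the funnel completed: ignored
]
timeline = life_cycle(sample)
print(timeline[1])
```

Comparing this output with what the MATCH_RECOGNIZE query returns on the same rows is a quick way to check that the pattern and measures behave as intended.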

How to select a single row for each unique ID

SQL novice here learning on the job, still a greenhorn. I have a problem I don't know how to overcome. Using IBM Netezza and Aginity Workbench.
My current output will try to return one row per case number based on when a task was created. It will only keep the row with the newest task. This gets me about 85% of the way there. The issue is that sometimes multiple tasks have a create day of the same day.
I would like to incorporate Task Followup Date to only keep the newest row if there are multiple rows with the same Case Number. I posted an example of what my current code outputs and what I would like it to output.
Current code
SELECT
A.PS_CASE_ID AS Case_Number
,D.CASE_TASK_TYPE_NM AS Task
,C.TASK_CRTE_TMS
,C.TASK_FLWUP_DT AS Task_Followup_Date
FROM VW_CC_CASE A
INNER JOIN VW_CASE_TASK C ON (A.CASE_ID = C.CASE_ID)
INNER JOIN VW_CASE_TASK_TYPE D ON (C.CASE_TASK_TYPE_ID = D.CASE_TASK_TYPE_ID)
INNER JOIN ADMIN.VW_RSN_CTGY B ON (A.RSN_CTGY_ID = B.RSN_CTGY_ID)
WHERE
(A.PS_Z_SPSR_ID LIKE '%EFT' OR A.PS_Z_SPSR_ID LIKE '%CRDT')
AND CAST(A.CASE_CRTE_TMS AS DATE) >= '2020-01-01'
AND B.RSN_CTGY_NM = 'Chargeback Initiation'
AND CAST(C.TASK_CRTE_TMS AS DATE) = (SELECT MAX(CAST(C2.TASK_CRTE_TMS AS DATE)) from VW_CASE_TASK C2 WHERE C2.CASE_ID = C.CASE_ID)
GROUP BY
A.PS_CASE_ID
,D.CASE_TASK_TYPE_NM
,C.TASK_CRTE_TMS
,C.TASK_FLWUP_DT
Current output
Desired output

You could use ROW_NUMBER here:
WITH cte AS (
SELECT DISTINCT A.PS_CASE_ID AS Case_Number, D.CASE_TASK_TYPE_NM AS Task,
C.TASK_CRTE_TMS, C.TASK_FLWUP_DT AS Task_Followup_Date,
ROW_NUMBER() OVER (PARTITION BY A.PS_CASE_ID ORDER BY C.TASK_FLWUP_DT DESC) rn
FROM VW_CC_CASE A
INNER JOIN VW_CASE_TASK C ON A.CASE_ID = C.CASE_ID
INNER JOIN VW_CASE_TASK_TYPE D ON C.CASE_TASK_TYPE_ID = D.CASE_TASK_TYPE_ID
INNER JOIN ADMIN.VW_RSN_CTGY B ON A.RSN_CTGY_ID = B.RSN_CTGY_ID
WHERE (A.PS_Z_SPSR_ID LIKE '%EFT' OR A.PS_Z_SPSR_ID LIKE '%CRDT') AND
CAST(A.CASE_CRTE_TMS AS DATE) >= '2020-01-01' AND
B.RSN_CTGY_NM = 'Chargeback Initiation' AND
CAST(C.TASK_CRTE_TMS AS DATE) = (SELECT MAX(CAST(C2.TASK_CRTE_TMS AS DATE))
FROM VW_CASE_TASK C2
WHERE C2.CASE_ID = C.CASE_ID)
)
SELECT
Case_Number,
Task,
TASK_CRTE_TMS,
Task_Followup_Date
FROM cte
WHERE rn = 1;
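As a small runnable illustration of the ROW_NUMBER idea, the sketch below uses an in-memory SQLite database from Python (SQLite 3.25 or later assumed). It is a variation rather than the exact query above: instead of the MAX subquery it puts both the creation day and the follow-up date into the ORDER BY, so the follow-up date acts as the tie-breaker. The tasks table and its rows are hypothetical stand-ins for the joined views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, flattened stand-in for the joined case/task views.
CREATE TABLE tasks (
    case_number        TEXT,
    task               TEXT,
    task_crte_day      TEXT,  -- ISO dates, so text ordering matches date ordering
    task_followup_date TEXT
);
INSERT INTO tasks VALUES
    ('C1', 'Call',   '2020-03-01', '2020-03-10'),
    ('C1', 'Email',  '2020-03-01', '2020-03-15'),  -- same create day, later follow-up
    ('C1', 'Review', '2020-02-01', '2020-04-01'),  -- older task, must not win
    ('C2', 'Call',   '2020-05-05', '2020-05-06');
""")

latest = conn.execute("""
WITH ranked AS (
    SELECT case_number, task, task_crte_day, task_followup_date,
           ROW_NUMBER() OVER (PARTITION BY case_number
                              ORDER BY task_crte_day DESC,
                                       task_followup_date DESC) AS rn
    FROM tasks
)
SELECT case_number, task, task_followup_date
FROM ranked
WHERE rn = 1
ORDER BY case_number;
""").fetchall()
print(latest)
```

If two rows tie on both dates, ROW_NUMBER still returns exactly one of them, but which one is arbitrary unless a further column is added to the ORDER BY.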

One method uses window functions:
with cte as (
< your query here >
)
select x.*
from (select cte.*,
row_number() over (partition by case_number, Task_Followup_Date
order by TASK_CRTE_TMS asc
) as seqnum
from cte
) x
where seqnum = 1;

Distinct keyword not fetching results in Oracle

I have the following query, where I need unique records for patient_id, meaning patient_id should not be duplicated. Each time I try executing the query, the DB seems to hang or take hours to execute; I'm not sure which. I need my records to load quickly. Any quick resolution will be highly appreciated.
SELECT DISTINCT a.patient_id,
a.study_id,
a.procstep_id,
a.formdata_seq,
0,
(SELECT MAX(audit_id)
FROM audit_info
WHERE patient_id =a.patient_id
AND study_id = a.study_id
AND procstep_id = a.procstep_id
AND formdata_seq = a.formdata_seq
) AS data_session_id
FROM frm_rg_ps_rg a,
PATIENT_STUDY_STEP pss
WHERE ((SELECT COUNT(*)
FROM frm_rg_ps_rg b
WHERE a.patient_id = b.patient_id
AND a.formdata_seq = b.formdata_seq
AND a.psdate IS NOT NULL
AND b.psdate IS NOT NULL
AND a.psresult IS NOT NULL
AND b.psresult IS NOT NULL) = 1)
OR NOT EXISTS
(SELECT *
FROM frm_rg_ps_rg c
WHERE a.psdate IS NOT NULL
AND c.psdate IS NOT NULL
AND a.psresult IS NOT NULL
AND c.psresult IS NOT NULL
AND a.patient_id = c.patient_id
AND a.formdata_seq = c.formdata_seq
AND a.elemdata_seq != c.elemdata_seq
AND a.psresult != c.psresult
AND ((SELECT (a.psdate - c.psdate) FROM dual)>=7
OR (SELECT (a.psdate - c.psdate) FROM dual) <=-7)
)
AND a.psresult IS NOT NULL
AND a.psdate IS NOT NULL;

For a start, you have a Cartesian product with PATIENT_STUDY_STEP (pss).
It is not joined to anything.
select *
from (select t.*
,count (*) over (partition by patient_id) as cnt
from frm_rg_ps_rg t
) t
where cnt = 1
;
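The COUNT(*) OVER (PARTITION BY ...) filter above can be tried on a toy table. This is only a sketch using an in-memory SQLite database from Python (SQLite 3.25 or later assumed); the two-column table and its rows are made up, and only the table and column names are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical cut-down table: patient 1 appears twice, patients 2 and 3 once.
CREATE TABLE frm_rg_ps_rg (patient_id INTEGER, formdata_seq INTEGER);
INSERT INTO frm_rg_ps_rg VALUES (1, 10), (1, 11), (2, 20), (3, 30);
""")

# Attach the per-patient row count to every row, then keep patients with one row.
unique_rows = conn.execute("""
SELECT patient_id, formdata_seq
FROM (SELECT t.*,
             COUNT(*) OVER (PARTITION BY patient_id) AS cnt
      FROM frm_rg_ps_rg t
     ) t
WHERE cnt = 1
ORDER BY patient_id;
""").fetchall()
print(unique_rows)
```

This reads the table once, whereas a correlated COUNT(*) subquery in the WHERE clause may be re-evaluated for each row, depending on the optimizer.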

Trying to Show The records with the most recent Date

I'm trying to use OVER (PARTITION BY ...) to create a row number per SupplierAccountNumber, sorted by DateTimeCreated, and then show only row 1. With my current script I get an error saying: Invalid column name 'RowNum'.
I have a list of email addresses for suppliers, some of which have multiple addresses; I only want to pick out the most recent email address. Is there a better way of doing it?
SELECT plsuppliercontact.plsuppliercontactid,
plsupplieraccount.supplieraccountnumber,
plsupplieraccount.supplieraccountname,
plsupplieraccount.supplieraccountshortname,
plsuppliercontactvalue.contactvalue,
syscontacttype.name,
Rownum = Row_number()
OVER(
partition BY plsupplieraccount.supplieraccountnumber
ORDER BY plsuppliercontactvalue.datetimecreated DESC)
FROM alops.dbo.plsupplieraccount PLSupplierAccount,
alops.dbo.plsuppliercontact PLSupplierContact,
alops.dbo.plsuppliercontactvalue PLSupplierContactValue,
alops.dbo.syscontacttype SYSContactType
WHERE plsupplieraccount.plsupplieraccountid =
plsuppliercontact.plsupplieraccountid
AND plsuppliercontactvalue.plsuppliercontactid =
plsuppliercontact.plsuppliercontactid
AND syscontacttype.syscontacttypeid =
plsuppliercontactvalue.syscontacttypeid
AND (( syscontacttype.name = 'E-mail Address' ))
AND rownum = 1;

You didn't specify which RDBMS you are using, but in most of them a column alias defined in the SELECT list is not visible in the WHERE clause of the same query, because WHERE is evaluated before the SELECT list.
One trick is to wrap the query in another query that takes care of this condition. E.g.:
SELECT *
FROM (
SELECT plsuppliercontact.plsuppliercontactid,
plsupplieraccount.supplieraccountnumber,
plsupplieraccount.supplieraccountname,
plsupplieraccount.supplieraccountshortname,
plsuppliercontactvalue.contactvalue,
syscontacttype.name,
Rownum = Row_number()
OVER(
partition BY plsupplieraccount.supplieraccountnumber
ORDER BY plsuppliercontactvalue.datetimecreated DESC)
FROM alops.dbo.plsupplieraccount PLSupplierAccount,
alops.dbo.plsuppliercontact PLSupplierContact,
alops.dbo.plsuppliercontactvalue PLSupplierContactValue,
alops.dbo.syscontacttype SYSContactType
WHERE plsupplieraccount.plsupplieraccountid =
plsuppliercontact.plsupplieraccountid
AND plsuppliercontactvalue.plsuppliercontactid =
plsuppliercontact.plsuppliercontactid
AND syscontacttype.syscontacttypeid =
plsuppliercontactvalue.syscontacttypeid
AND (( syscontacttype.name = 'E-mail Address' ))
) t
WHERE rownum = 1;

Use the MAX aggregate
SELECT plsuppliercontact.plsuppliercontactid,
plsupplieraccount.supplieraccountnumber,
plsupplieraccount.supplieraccountname,
plsupplieraccount.supplieraccountshortname,
plsuppliercontactvalue.contactvalue,
syscontacttype.name,
MAX(plsuppliercontactvalue.datetimecreated)
FROM alops.dbo.plsupplieraccount PLSupplierAccount,
alops.dbo.plsuppliercontact PLSupplierContact,
alops.dbo.plsuppliercontactvalue PLSupplierContactValue,
alops.dbo.syscontacttype SYSContactType
WHERE plsupplieraccount.plsupplieraccountid =
plsuppliercontact.plsupplieraccountid
AND plsuppliercontactvalue.plsuppliercontactid =
plsuppliercontact.plsuppliercontactid
AND syscontacttype.syscontacttypeid =
plsuppliercontactvalue.syscontacttypeid
AND (( syscontacttype.name = 'E-mail Address' ))
GROUP BY plsuppliercontact.plsuppliercontactid,
plsupplieraccount.supplieraccountnumber,
plsupplieraccount.supplieraccountname,
plsupplieraccount.supplieraccountshortname,
plsuppliercontactvalue.contactvalue,
syscontacttype.name

Have a subquery return a row instead of a column

I have this query:
SELECT
[Address]=e.Address,
[LastEmail] =
(
SELECT TOP 1 [Email]
FROM Email innerE
WHERE e.UserID = innerE.UserID
AND innerE.Contact = @emailId
AND (IsSent is null OR isSent = 0)
ORDER BY Timestamp DESC
)
FROM Emails e
This works fine, but I have now realized that I'd like to get the entire row containing that LastEmail column, if this is possible. Any ideas on how it could be done?

You can do this:
;WITH LastEmails
AS
(
SELECT *,
ROW_NUMBER() OVER(ORDER BY Timestamp DESC) rownum
FROM Emails
WHERE Contact = @emailId
AND (IsSent is null OR isSent = 0)
)
SELECT * FROM LastEmails
WHERE rownum = 1;

If your DBMS supports it, you can use APPLY (I think it does, as this looks like SQL Server syntax):
SELECT [Address]=e.Address,
[LastEmail] = ie.Email
FROM Emails e
OUTER APPLY
( SELECT TOP 1 *
FROM Email innerE
WHERE e.UserID = innerE.UserID
AND innerE.Contact = @emailId
AND (IsSent is null OR isSent = 0)
ORDER BY Timestamp DESC
) ie
This works similarly to a correlated subquery but allows multiple rows and multiple columns.
