Need help in forming a SQL query

We have 2 tables, tbl1 and tbl2. They contain columns such as Visit_ID, Customer_ID, and so on. There are instances where a Visit_ID is associated with multiple Customer_IDs.
For example, when a customer logs into the website, a unique Visit_ID is generated for each visit.
Within one visit, multiple customers can log into their accounts and make individual purchases, so a visit can be associated with multiple Customer_IDs. If there are more than 2, any other retail Customer_IDs should be appended into a separate column.
For instance, there are visits that have 200 Customer_IDs attached to them.
For example, if there are 7 Customer_IDs in one visit, the first column should show the 1st Customer_ID and the second column the 2nd Customer_ID.
The 3rd through 7th (the remaining 5) should be comma-separated in the last column.
Can someone help me frame a SQL query using this logic?
with CTE as (
    SELECT
        visit_id,
        B.visitpg_nbr::INT AS visitpg_nbr,
        CUSTOMER_ID,
        dense_rank() over (PARTITION BY VISIT_ID ORDER BY CUSTOMER_ID) as rank
    FROM db_name.schema_name.tbl_1 A
    JOIN db_name.schema_name.tbl_2 B
        ON B.id_column = A.id_column
    JOIN db_name.schema_name.tbl_3 C
        ON CAST(C.xid AS VARCHAR) = A.CUSTOMER_ID
    WHERE flg_col = '0'
      AND so_cd NOT IN ('0','1','2','3')
      AND DATE_COL = '2022-01-17'
      AND visit_id = '12345'
    ORDER BY visitpg_nbr
)
select VISIT_ID,
       arr[0],
       arr[1],
       array_to_string(array_slice(arr, 2, 99999), ', ')
from (
    select VISIT_ID,
           array_agg(distinct CUSTOMER_ID) within group (order by CUSTOMER_ID) arr
    from CTE
    group by 1
);
Thanks to those who have responded; I really appreciate the guidance. The logic worked fine. However, when I join the 3 tables inside the CTE, I get a lot of duplicates, and I want to eliminate the duplicate values.
When I run the query below (the SELECT that sits inside the CTE), I get duplicate records.
SELECT
    visit_id,
    B.visitpg_nbr::INT AS visitpg_nbr,
    CUSTOMER_ID,
    dense_rank() over (PARTITION BY VISIT_ID ORDER BY CUSTOMER_ID) as rank
FROM db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B
    ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C
    ON CAST(C.xid AS VARCHAR) = A.CUSTOMER_ID
WHERE flg_col = '0'
  AND so_cd NOT IN ('0','1','2','3')
  AND DATE_COL = '2022-01-17'
  AND visit_id = '12345'
ORDER BY visitpg_nbr
Row  VISIT_ID  CUSTOMER_ID  VISITPG_NBR  RANK
**1    12345     100          1            1**
  2    12345     100          2            1
  3    12345     100          3            1
  4    12345     100          4            1
  5    12345     100          5            1
**6    67891     101          6            2**
  7    67891     101          7            2
  8    67891     101          8            2
  9    67891     101          9            2
 10    67891     101         10            2
**11   78910     102         11            3**
 12    78910     102         12            3
 13    78910     102         13            3
 14    78910     102         14            3
Is there any way to get only distinct results in the CTE?
The final result should be populated as below.
VISIT_ID  First_Customer  Second_Customer  Other_Customers
1         100             101              102,103,104,105,106
2         200             201              202,203,204,205
The first Customer_ID should be displayed in the First_Customer column and the second Customer_ID in the Second_Customer column. All the other Customer_IDs should be displayed in the final column, comma-separated.
Also, I want the results to be ordered by visitpg_nbr.

You should be able to get this with array_agg(), and then choosing the first, second, and subsequent (array_slice()) elements:
with data as (
select *
from snowflake_sample_data.tpch_sf100.orders
where o_custkey between 5411266 and 5411290
)
select o_custkey, arr[0], arr[1], array_to_string(array_slice(arr, 2, 99999), ', ')
from (
select o_custkey, array_agg(o_orderkey) within group(order by o_orderdate) arr
from data
group by 1
);
You might need to de-duplicate the ids in case the joins produce repeated rows; you can solve that with a DISTINCT (or GROUP BY) subquery before the array_agg().
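For example, applied to the CTE from the question, the de-duplication could look roughly like this. This is only a sketch reusing the question's table and column names; using min(visitpg_nbr) as the per-customer ordering key is an assumption about how the visitpg_nbr ordering should carry over:

with CTE as (
    -- collapse the 3-way join to one row per (visit_id, customer_id),
    -- keeping the earliest page number for ordering
    select
        visit_id,
        CUSTOMER_ID,
        min(B.visitpg_nbr::INT) as first_visitpg_nbr
    from db_name.schema_name.tbl_1 A
    join db_name.schema_name.tbl_2 B
        on B.id_column = A.id_column
    join db_name.schema_name.tbl_3 C
        on cast(C.xid as varchar) = A.CUSTOMER_ID
    where flg_col = '0'
      and so_cd not in ('0','1','2','3')
      and DATE_COL = '2022-01-17'
      and visit_id = '12345'
    group by visit_id, CUSTOMER_ID
)
select visit_id,
       arr[0] as First_Customer,
       arr[1] as Second_Customer,
       array_to_string(array_slice(arr, 2, 99999), ', ') as Other_Customers
from (
    select visit_id,
           array_agg(CUSTOMER_ID) within group (order by first_visitpg_nbr) as arr
    from CTE
    group by 1
);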

Slightly different from Felipe's answer; I'm not sure which would be more performant (I suspect his), but here is another way to try it.
SELECT visit_id, first_customer, second_customer,
       array_agg(other_ids) within group (order by order_id) as other_customer
FROM (
    SELECT visit_id,
           order_id,
           first_value(customer_id) over (partition by visit_id order by order_id) as first_customer,
           -- second customer per visit; the explicit frame lets nth_value see the whole partition
           nth_value(customer_id, 2) over (partition by visit_id order by order_id
               rows between unbounded preceding and unbounded following) as second_customer,
           -- everything after the first two becomes an "other" id; the NULLs are skipped by array_agg
           IFF(row_number() over (partition by visit_id order by order_id) > 2, customer_id, null) as other_ids
    FROM VALUES
        (1, 100, 1),
        (1, 101, 2),
        (1, 102, 3),
        (1, 103, 5),
        (1, 104, 6),
        (1, 105, 6),
        (1, 106, 7),
        (2, 200, 1),
        (2, 201, 2),
        (2, 202, 3),
        (2, 203, 4)
        v(visit_id, customer_id, order_id)
)
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
VISIT_ID  FIRST_CUSTOMER  SECOND_CUSTOMER  OTHER_CUSTOMER
1         100             101              [ 102, 103, 104, 105, 106 ]
2         200             201              [ 202, 203 ]
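If the last column should be a comma-separated string rather than an array (as in the desired output above), the array_to_string() used in Felipe's answer can wrap the same aggregate. A tiny standalone sketch with made-up values:

-- aggregate ids per visit and render them as one comma-separated string
select visit_id,
       array_to_string(array_agg(customer_id) within group (order by customer_id), ',') as other_customers
from values (1, 102), (1, 103), (1, 104) v(visit_id, customer_id)
group by 1;
-- -> 1 | '102,103,104'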

Related

How to select rows where values changed for an ID

I have a table that looks like the following:
id   effective_date  number_of_int_customers
123  10/01/19        0
123  02/01/20        3
456  10/01/19        6
456  02/01/20        6
789  10/01/19        5
789  02/01/20        4
999  10/01/19        0
999  02/01/20        1
I want to write a query that looks at each ID to see if the salespeople have newly started working internationally between October 1st and February 1st.
The result I am looking for is the following:
id   effective_date  number_of_int_customers
123  02/01/20        3
999  02/01/20        1
The result would return only the salespeople who originally had 0 international customers and now have at least 1.
I have seen similar posts here that use nested queries to pull records where the first and last dates have different values. But I only want to pull records where the original value was 0. Is there a way to do this in one SQL query?
In your case, a simple aggregation would do -- assuming that 0 is the earliest value:
select id, max(number_of_int_customers)
from t
where effective_date in ('2019-10-01', '2020-02-01')
group by id
having min(number_of_int_customers) = 0;
Obviously, this is not correct if the values can decrease to zero. But this having clause fixes that problem:
having min(case when number_of_int_customers = 0 then effective_date end) = min(effective_date)
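Putting those two pieces together, the robust version would look roughly like this (a sketch; it keeps the same table name and two-date filter as above, so max() picks up the later, non-zero value):

select id, max(number_of_int_customers) as current_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
group by id
-- the earliest row in the window had 0 international customers ...
having min(case when number_of_int_customers = 0 then effective_date end) = min(effective_date)
-- ... and the salesperson has at least 1 now
   and max(number_of_int_customers) > 0;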
An alternative is to use window functions, such as first_value():
select distinct id, last_noic
from (select t.*,
             first_value(number_of_int_customers) over (partition by id order by effective_date) as first_noic,
             first_value(number_of_int_customers) over (partition by id order by effective_date desc) as last_noic
      from t
      where effective_date in ('2019-10-01', '2020-02-01')
     ) t
where first_noic = 0;
Hmmm, on second thought, I like lag() better:
select id, number_of_int_customers
from (select t.*,
lag(number_of_int_customers) over (partition by id order by effective_date) as prev_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where prev_noic = 0;

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department-level score table based on a deeper, product-URL-level score table.
Dates are not consecutive.
Not all URLs get a score update on the same day (they are independent of each other).
dist urls should be a running (cumulative) count distinct.
dist urls and urls score >=30 are both count distincts.
What I have now is:
Date  url  Store  Dept  Page  Score
10/1  a    US     A     X     10
10/1  b    US     A     X     30
10/1  c    US     A     X     60
10/4  a    US     A     X     20
10/4  d    US     A     X     60
10/6  b    US     A     X     22
10/9  a    US     A     X     40
10/9  e    US     A     X     10
The output I'm trying to get is:
Date  Store  Dept  Page  dist urls  urls score >=30
10/1  US     A     X     3          2
10/4  US     A     X     4          3
10/6  US     A     X     4          2
10/9  US     A     X     5          2
I think the dist urls column can be done with a window function; I'm just not sure of the query.
My current query is below, but it's wrong because it is not a cumulative count distinct:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.
Based on your question, you seem to want two levels of logic:
select date, store, dept, page,
       sum(sum(start)) over (partition by store, dept, page order by date) as distinct_urls,
       sum(sum(start_30)) over (partition by store, dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
       from t
       group by store, dept, page, url
      ) union all
      (select store, dept, page, url, min(date) as date, 0, 1
       from t
       where score >= 30
       group by store, dept, page, url
      )
     ) t
group by date, store, dept, page;
I don't understand how your query is related to your question, and try as I might, I don't get your output either.
But I think you can avoid the UNION SELECTs - does this do what you expect?
NULLs don't figure in COUNT DISTINCTs, and here you can combine an aggregate expression with an OLAP (window) expression.
Vertica also has named windows, which increase readability.
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms

How to find the number of distinct phones per customer and put the customers into buckets based on those counts?

Below is the table with customer_id and the different phone numbers each customer has.
customer_id  phone_number
101          123456789
102          234567891
103          345678912
102          456789123
101          567891234
104          678912345
105          789123456
106          891234567
106          912345678
106          456457234
101          655435664
107          453426782
Now, I want to find each customer_id and its distinct phone-number count, so I used this query:
select customer_id, count(distinct phone_number)
from customer_phone
group by customer_id;
customer_id  no of phones
101          3
102          2
103          1
104          1
105          1
106          3
107          1
And from the above result, my final goal is to achieve the output below, which takes those counts, puts them into buckets, and then counts the number of consumers that fall into each bucket.
Buckets  no of consumers
3        2
2        1
1        4
There are close to 200 million records. Can you please explain an efficient way to work on this?
You can use width_bucket for that:
select bucket, count(*)
from (
select width_bucket(count(distinct phone_number), 1, 10, 10) as bucket
from customer_phone
group by customer_id
) t
group by bucket;
width_bucket(..., 1, 10, 10) creates ten buckets for the values 1 through 10.
Online Example: http://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=1e6d55305570499f363837aba21bdc7e
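For instance, the distinct-phone counts that actually occur in the sample data (1, 2, and 3) map straight to buckets 1, 2, and 3. A quick check (Oracle syntax to match the dbfiddle; drop FROM dual on Postgres):

SELECT width_bucket(1, 1, 10, 10) AS one_phone,    -- bucket 1
       width_bucket(2, 1, 10, 10) AS two_phones,   -- bucket 2
       width_bucket(3, 1, 10, 10) AS three_phones  -- bucket 3
FROM dual;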
Use two aggregations:
select cnt, count(*), min(customer_id), max(customer_id)
from (select customer_id, count(distinct phone_number) as cnt
from customer_phone
group by customer_id
) c
group by cnt
order by cnt;

Firebird query - return the first row of each group

In a Firebird database with a table "Sales", I need to select the first sale of every customer. The sample below shows the table and the desired result of the query.
---------------------------------------
SALES
---------------------------------------
ID  CUSTOMERID  DTHRSALE
1   25          01/04/16 09:32
2   30          02/04/16 11:22
3   25          05/04/16 08:10
4   31          07/03/16 10:22
5   22          01/02/16 12:30
6   22          10/01/16 08:45
Result: only the first sale, based on the sale date.
ID  CUSTOMERID  DTHRSALE
1   25          01/04/16 09:32
2   30          02/04/16 11:22
4   31          07/03/16 10:22
6   22          10/01/16 08:45
I've already tested the code from "Select first row in each GROUP BY group?", but it did not work.
In Firebird 2.5 you can do this with the following query; this is a minor modification of the second part of the accepted answer of the question you linked to tailored to your schema and requirements:
select x.id,
x.customerid,
x.dthrsale
from sales x
join (select customerid,
min(dthrsale) as first_sale
from sales
group by customerid) p on p.customerid = x.customerid
and p.first_sale = x.dthrsale
order by x.id
The order by is not necessary, I just added it to make it give the order as shown in your question.
With Firebird 3 you can use the window function ROW_NUMBER which is also described in the linked answer. The linked answer incorrectly said the first solution would work on Firebird 2.1 and higher. I have now edited it.
Search for the sales with no earlier sales:
SELECT S1.*
FROM SALES S1
LEFT JOIN SALES S2 ON S2.CUSTOMERID = S1.CUSTOMERID AND S2.DTHRSALE < S1.DTHRSALE
WHERE S2.ID IS NULL
Define an index over (customerid, dthrsale) to make it fast.
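A sketch of that index (the index name is just an example):

-- composite index so the anti-join on CUSTOMERID and the DTHRSALE comparison can use it
CREATE INDEX IDX_SALES_CUST_DATE ON SALES (CUSTOMERID, DTHRSALE);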
In Firebird 3, get the first row for each customer by the minimum sales_date:
SELECT id, customer_id, total, sales_date
FROM (
SELECT id, customer_id, total, sales_date
, row_number() OVER(PARTITION BY customer_id ORDER BY sales_date ASC ) AS rn
FROM SALES
) sub
WHERE rn = 1;
If you want to get other related columns, this is where your self-answer fails:
select customer_id , min(sales_date)
, id, total -- what about the other columns?
from SALES
group by customer_id
As simple as:
select CUSTOMERID, min(DTHRSALE) from SALES group by CUSTOMERID

SQL Server group by: show which rows were included

I have a table called phonecalls which stores a company's phone call history. It looks like this:
ID  Number  Duration  Cost
-------------------------------------
1   123456  13        1
2   222222  39        1
3   333333  69        2
4   222222  36        1
What I want to do is get the sum of duration and cost for each number.
SELECT
Number,
SUM(Duration) as [Total_Duration],
SUM(Cost) as [Total_Cost]
FROM
phonecalls
GROUP BY
Number;
Now the catch is that I also need to know which IDs were included in each group, for auditing purposes, so that others will know which rows were processed. So my question is: how do you include the IDs when you do a GROUP BY?
Almighty Stack Overflow, please help me.
Thanks in advance.
EDIT:
Is it possible to get all IDs with the same number in one cell? Something like:
ID   Number  Total_Duration  Total_Cost
---------------------------------------------------
1    123456  13              1
2,4  222222  75              2
3    333333  69              2
Hope I'm not asking for too much.
You're looking for the SUM OVER() function:
SELECT
ID,
Number,
Total_Duration = SUM(Duration) OVER(PARTITION BY Number),
Total_Cost = SUM(Cost) OVER(PARTITION BY Number)
FROM phonecalls
If you want to concatenate the IDs, you can use FOR XML PATH(''):
SELECT
ID = STUFF((
SELECT ',' + CONVERT(VARCHAR(10), ID)
FROM phonecalls
WHERE Number = p.Number
FOR XML PATH('')
), 1, 1, ''),
Number,
Total_Duration = SUM(Duration),
Total_Cost = SUM(Cost)
FROM phonecalls p
GROUP BY Number
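On SQL Server 2017 or later, STRING_AGG() can do the same concatenation without the STUFF()/FOR XML PATH('') workaround; a minimal sketch under that version assumption:

-- STRING_AGG requires SQL Server 2017+
SELECT
    STRING_AGG(CONVERT(VARCHAR(10), ID), ',') WITHIN GROUP (ORDER BY ID) AS ID,
    Number,
    SUM(Duration) AS Total_Duration,
    SUM(Cost) AS Total_Cost
FROM phonecalls
GROUP BY Number;

Note that STRING_AGG inherits the input length limit (8000 bytes here); cast the input to VARCHAR(MAX) if the concatenated list can get long.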