Counting items after joining two tables - SQL

I have two tables,
First is the Product table:
+----+------+------+-------+------------+
| id | pnum | year | month | date       |
+----+------+------+-------+------------+
| 12 | S5   | 2021 | 2     | 2021-02-21 |
| 12 | S5   | 2021 | 2     | 2021-02-22 |
| 12 | S5   | 2021 | 2     | 2021-02-23 |
| 33 | A55  | 2021 | 3     | 2021-03-01 |
| 44 | B1   | 2021 | 6     | 2021-06-01 |
+----+------+------+-------+------------+
Second is the Deal table:
+----+------+------+-------+------------+
| id | pnum | year | month | date       |
+----+------+------+-------+------------+
| 12 | S5   | 2021 | 2     | 2021-02-28 |
| 12 | S5   | 2021 | 2     | 2021-02-01 |
| 33 | A55  | 2021 | 3     | 2021-03-01 |
+----+------+------+-------+------------+
I need a result that tells me, for each year-month, how many products were launched and how many deals fell in the first 15 days versus after the first 15 days:
+-----+-------+------------+-----------------+--------------------+
| num | count | year-month | deal_in_first15 | deal_after_first15 |
+-----+-------+------------+-----------------+--------------------+
| S5  | 3     | 2021-02    | 1               | 1                  |
| A55 | 1     | 2021-03    | 1               | 0                  |
+-----+-------+------------+-----------------+--------------------+
I was trying to do it like below:
select * from Product p inner join Deal d on
p.pnum=d.pnum AND p.id=d.id AND p.month=d.month
but it is not getting me the exact result I intended.
I have some Java and Python background and am not an expert in SQL, so applying COUNT and CASE statements is not working out.

You can try conditional aggregation in a subquery, then JOIN:
select p.pnum,
       COUNT(*) count,
       FORMAT(p.[date], 'yyyy-MM') 'year-month',
       deal_in_first15,
       deal_after_first15
from Product p
inner join (
    SELECT id, pnum, month, year,
           COUNT(CASE WHEN DATEPART(day, [date]) < 15 THEN 1 END) deal_in_first15,
           COUNT(CASE WHEN DATEPART(day, [date]) >= 15 THEN 1 END) deal_after_first15
    FROM Deal
    GROUP BY id, pnum, month, year
) d on p.pnum = d.pnum AND p.id = d.id AND p.month = d.month
group by FORMAT(p.[date], 'yyyy-MM'),
         p.pnum,
         deal_in_first15,
         deal_after_first15
There is another way you might want: aggregate both tables in subqueries, then JOIN.
select p.pnum,
       p.cnt 'count',
       CONCAT(p.year, '-', FORMAT(p.month, '0#')) 'year-month',
       deal_in_first15,
       deal_after_first15
from (
    SELECT id, pnum, month, year, count(*) cnt
    FROM Product
    GROUP BY id, pnum, month, year
) p
inner join (
    SELECT id, pnum, month, year,
           COUNT(CASE WHEN DATEPART(day, [date]) < 15 THEN 1 END) deal_in_first15,
           COUNT(CASE WHEN DATEPART(day, [date]) >= 15 THEN 1 END) deal_after_first15
    FROM Deal
    GROUP BY id, pnum, month, year
) d on p.pnum = d.pnum
   AND p.id = d.id
   AND p.month = d.month
   AND p.year = d.year
sqlfiddle
Note: I would include the year column in the JOIN condition; otherwise the result will be wrong whenever the same month occurs in different years.
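If you want to sanity-check the conditional-aggregation-plus-join approach locally, here is a minimal sketch using Python's sqlite3 with the question's sample data. SQLite stands in for SQL Server here, so DATEPART/FORMAT become strftime/printf; the `< 15` cutoff mirrors the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (id INT, pnum TEXT, year INT, month INT, date TEXT);
CREATE TABLE Deal    (id INT, pnum TEXT, year INT, month INT, date TEXT);
INSERT INTO Product VALUES
  (12,'S5',2021,2,'2021-02-21'), (12,'S5',2021,2,'2021-02-22'),
  (12,'S5',2021,2,'2021-02-23'), (33,'A55',2021,3,'2021-03-01'),
  (44,'B1',2021,6,'2021-06-01');
INSERT INTO Deal VALUES
  (12,'S5',2021,2,'2021-02-28'), (12,'S5',2021,2,'2021-02-01'),
  (33,'A55',2021,3,'2021-03-01');
""")

rows = conn.execute("""
SELECT p.pnum,
       p.cnt,
       p.year || '-' || printf('%02d', p.month) AS year_month,
       d.deal_in_first15,
       d.deal_after_first15
FROM (SELECT id, pnum, year, month, COUNT(*) AS cnt
      FROM Product GROUP BY id, pnum, year, month) p
JOIN (SELECT id, pnum, year, month,
             -- same cutoff as the answer above: day < 15 counts as "first 15"
             SUM(CASE WHEN CAST(strftime('%d', date) AS INT) < 15 THEN 1 ELSE 0 END) AS deal_in_first15,
             SUM(CASE WHEN CAST(strftime('%d', date) AS INT) >= 15 THEN 1 ELSE 0 END) AS deal_after_first15
      FROM Deal GROUP BY id, pnum, year, month) d
  ON p.id = d.id AND p.pnum = d.pnum AND p.year = d.year AND p.month = d.month
ORDER BY p.year, p.month
""").fetchall()

for r in rows:
    print(r)
# ('S5', 3, '2021-02', 1, 1)
# ('A55', 1, '2021-03', 1, 0)
```

Product B1 drops out of the result because the inner join finds no matching deal, exactly as in the expected output.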

This is it:
SELECT pnum, cnt, SUM(mn1), SUM(mn2)
FROM (
    SELECT d."pnum",
           (SELECT COUNT(*) FROM Product p WHERE p."pnum" = d."pnum") AS cnt,
           CASE WHEN EXTRACT(DAY FROM d."date") <= 15 THEN SUM(1) ELSE SUM(0) END AS mn1,
           CASE WHEN EXTRACT(DAY FROM d."date") > 15 THEN SUM(1) ELSE SUM(0) END AS mn2,
           d."year" || '-' || d."month"
    FROM Deal d
    GROUP BY 1, d."year" || '-' || d."month", d."date"
) abc
GROUP BY 1, 2;
Please check at: http://sqlfiddle.com/#!17/3bad9/18

Related

Getting duplication WITH partition by

I want to know which customers who ordered in June 2020 also ordered in June 2021. My code returns the correct DISTINCT orders, but discounted sales is incorrect for customers who placed more than one order in either year. For example, a customer who placed one order in 2020 and four orders in 2021 has 2020 discounted sales at 4x the actual amount. The four orders in 2021 have four rows, and the one 2020 order populates against each. I saw this by using ROW_NUMBER () which exposed the underlying problem. I cannot use DISTINCT with discounted sales because customers do place multiple orders for identical dollar amounts. How do I get the exact discounted sales using standard SQL for BQ?
SELECT
DISTINCT ly.cuid AS cuid,
COUNT(DISTINCT ly.order_id) OVER (PARTITION BY ly.cuid) AS ly_orders,
SUM(ly.discounted_sales) OVER (PARTITION BY ly.cuid) AS ly_demand,
COUNT(DISTINCT ty.order_id) OVER (PARTITION BY ty.cuid) AS ty_orders,
SUM(ty.discounted_sales) OVER (PARTITION BY ly.cuid) AS ty_demand
FROM table ly
LEFT JOIN table ty
ON ly.cuid = ty.cuid
WHERE ly.order_date BETWEEN '2020-06-01' AND '2020-06-30'
AND ty.order_date BETWEEN '2021-06-01' AND '2021-06-30'
AND ly.financial_status <> 'credit'
AND ty.financial_status <> 'credit'
AND ly.discounted_sales >0
AND ty.discounted_sales >0
AND ly.channel = 'b2b'
AND ty.channel = 'b2b'
ORDER BY ly.cuid asc
[Results]
+------+-----------+-----------+-----------+------------+--------------------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand  | comments           |
+------+-----------+-----------+-----------+------------+--------------------+
| D    | 1         | 22,466.40 | 4         | 154,596.24 | ly is 4x actual    |
| F    | 2         | 2,573.20  | 1         | 1,944.40   | ty is 2x actual    |
| G    | 1         | 32,134.40 | 4         | 1,632.00   | ly is 4x actual    |
| I    | 2         | 757.56    | 1         | 730.56     | ty is 2x actual    |
| J    | 2         | 54,859.00 | 2         | 23,822.32  | both are 2x actual |
+------+-----------+-----------+-----------+------------+--------------------+
THIS WORKED:
WITH prior_period AS (
SELECT
DISTINCT cuid AS ly_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ly_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ly_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2020 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc),
this_period AS (
SELECT
DISTINCT cuid AS ty_cuid,
COUNT(DISTINCT order_id) OVER (PARTITION BY cuid) AS ty_orders,
SUM(discounted_sales) OVER (PARTITION BY cuid) AS ty_demand
FROM TABLE
WHERE EXTRACT (YEAR FROM order_date) = 2021 AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales >0
AND channel = 'b2b'
GROUP BY cuid, order_id, discounted_sales
ORDER BY cuid asc)
SELECT *
FROM prior_period ly
JOIN this_period ty ON ly.ly_cuid = ty.ty_cuid
ORDER BY ly.ly_cuid
Updated with your schema and approximate data:
Try this...
WITH periods AS (
SELECT cuid AS cuid
, COUNT(*) AS orders
, SUM(discounted_sales) AS demand
, EXTRACT(YEAR FROM order_date) AS yr
FROM demand2
WHERE EXTRACT(YEAR FROM order_date) IN (2020, 2021) AND EXTRACT(MONTH FROM order_date) = 6
AND financial_status <> 'credit'
AND discounted_sales > 0
AND channel = 'b2b'
GROUP BY cuid, EXTRACT(YEAR FROM order_date)
)
SELECT ly.cuid
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM periods AS ly
JOIN periods AS ty
ON ly.cuid = ty.cuid
AND ly.yr = 2020
AND ty.yr = 2021
ORDER BY ly.cuid
;
The result:
+------+-----------+-----------+-----------+-----------+
| cuid | ly_orders | ly_demand | ty_orders | ty_demand |
+------+-----------+-----------+-----------+-----------+
| D | 1 | 5616.60 | 4 | 154596.24 |
| F | 2 | 2573.20 | 1 | 972.20 |
| G | 1 | 8033.60 | 4 | 1632.56 |
| I | 2 | 757.56 | 1 | 365.28 |
| J | 2 | 27429.50 | 2 | 11911.16 |
+------+-----------+-----------+-----------+-----------+
Here's a similar example with data, SQL and results to show both the incorrect result and the correct result.
The data:
SELECT * FROM demand ORDER BY account_id, period;
+----+------------+--------+--------+
| id | account_id | period | demand |
+----+------------+--------+--------+
| 1 | 1 | 202005 | 100 |
| 2 | 1 | 202005 | 120 |
| 3 | 1 | 202105 | 105 |
| 4 | 1 | 202105 | 125 |
| 5 | 1 | 202105 | 30 |
| 6 | 2 | 202005 | 200 |
| 7 | 2 | 202105 | 240 |
+----+------------+--------+--------+
The incorrect SQL, without SUMs to just show the join behavior:
SELECT t1.id, t1.account_id, t1.period, t1.demand AS demand1
, t2.id, t2.period, t2.demand AS demand2
FROM demand AS t1
LEFT JOIN demand AS t2
ON t1.account_id = t2.account_id
AND t2.period = 202105
WHERE t1.period = 202005
ORDER BY t1.account_id, t1.period, demand1, demand2
;
+----+------------+--------+---------+------+--------+---------+
| id | account_id | period | demand1 | id | period | demand2 |
+----+------------+--------+---------+------+--------+---------+
| 1 | 1 | 202005 | 100 | 5 | 202105 | 30 |
| 1 | 1 | 202005 | 100 | 3 | 202105 | 105 |
| 1 | 1 | 202005 | 100 | 4 | 202105 | 125 |
| 2 | 1 | 202005 | 120 | 5 | 202105 | 30 |
| 2 | 1 | 202005 | 120 | 3 | 202105 | 105 |
| 2 | 1 | 202005 | 120 | 4 | 202105 | 125 |
| 6 | 2 | 202005 | 200 | 7 | 202105 | 240 |
+----+------------+--------+---------+------+--------+---------+
Notice account 2 doesn't have a problem because only one demand was found last year and this year.
But account 1 found 2 demand rows for last year and 3 demand rows for this year, leading to (2 x 3) = 6 rows in the joined result. This is the source of the problem.
To correct this, we aggregate before the JOIN, so that we have at most one (1) row per account / period to be joined.
One form of the correct solution, with SUMs derived in the CTE term:
WITH demands AS (
SELECT account_id, period
, SUM(demand) AS demand
, COUNT(*) AS orders
FROM demand
GROUP BY account_id, period
)
SELECT ly.account_id, ly.period
, ly.orders AS ly_orders
, ly.demand AS ly_demand
, ty.orders AS ty_orders
, ty.demand AS ty_demand
FROM demands AS ly
LEFT JOIN demands AS ty
ON ly.account_id = ty.account_id
AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id, ly.period, ly_demand, ty_demand
;
The result:
+------------+--------+-----------+-----------+-----------+-----------+
| account_id | period | ly_orders | ly_demand | ty_orders | ty_demand |
+------------+--------+-----------+-----------+-----------+-----------+
| 1 | 202005 | 2 | 220 | 3 | 260 |
| 2 | 202005 | 1 | 200 | 1 | 240 |
+------------+--------+-----------+-----------+-----------+-----------+
Since we performed aggregation in the CTE term (demands), the join found at most one row for each period for each account.
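The aggregate-before-join fix is easy to verify locally. Here is a minimal sketch with Python's sqlite3 and the demand table above; the SQL is standard enough to carry over to BigQuery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demand (id INT, account_id INT, period INT, demand INT);
INSERT INTO demand VALUES
  (1,1,202005,100),(2,1,202005,120),(3,1,202105,105),
  (4,1,202105,125),(5,1,202105,30),(6,2,202005,200),(7,2,202105,240);
""")

rows = conn.execute("""
WITH demands AS (
  -- aggregate first: at most one row per account / period
  SELECT account_id, period, SUM(demand) AS demand, COUNT(*) AS orders
  FROM demand
  GROUP BY account_id, period
)
SELECT ly.account_id,
       ly.orders AS ly_orders, ly.demand AS ly_demand,
       ty.orders AS ty_orders, ty.demand AS ty_demand
FROM demands AS ly
LEFT JOIN demands AS ty
  ON ly.account_id = ty.account_id AND ty.period = 202105
WHERE ly.period = 202005
ORDER BY ly.account_id
""").fetchall()

for r in rows:
    print(r)
# (1, 2, 220, 3, 260)
# (2, 1, 200, 1, 240)
```

Because the CTE collapses each account/period to one row, the self-join can no longer multiply last-year sales by the number of this-year orders.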

SQL Query to apply a command to multiple rows

I am new to SQL and trying to write a statement similar to a 'for loop' in other languages, and I am stuck. I want to filter the table down to the rows where, for all rows sharing the same attribute 1, attribute 2 equals attribute 3, without using functions.
For example:
| Year | Month | Day|
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 4 |
| 2 | 3 | 4 |
| 2 | 3 | 3 |
| 2 | 4 | 4 |
| 3 | 4 | 4 |
| 3 | 4 | 4 |
| 3 | 4 | 4 |
I would only want the row
| Year | Month | Day |
| 3    | 4     | 4   |
because it is the only one where month and day are equal across all rows sharing that year.
So far I have
select year, month, day from dates
where month=day
but I am unsure how to apply the constraint across all rows of a year.
-- month/day need to appear in aggregate functions (since they are not in the GROUP BY clause),
-- but the HAVING clause ensures we only have one month/day value (per year) here, so MIN/AVG/SUM/... would all work too
SELECT year, MAX(month), MAX(day)
FROM my_table
GROUP BY year
HAVING COUNT(DISTINCT (month, day)) = 1;
year | max | max
-----+-----+-----
3    | 4   | 4
View on DB Fiddle
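Here is a quick check of the GROUP BY / HAVING approach using Python's sqlite3 and the question's sample rows. Note that COUNT(DISTINCT (month, day)) is Postgres row-value syntax; SQLite does not support row values inside COUNT, so the pair is emulated by concatenation below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (year INT, month INT, day INT);
INSERT INTO dates VALUES
  (1,1,1),(1,2,2),(1,4,4),
  (2,3,4),(2,3,3),(2,4,4),
  (3,4,4),(3,4,4),(3,4,4);
""")

rows = conn.execute("""
SELECT year, MAX(month), MAX(day)
FROM dates
GROUP BY year
-- keep a year only if it has exactly one distinct (month, day) pair
HAVING COUNT(DISTINCT month || ',' || day) = 1
""").fetchall()

print(rows)  # [(3, 4, 4)]
```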
So one way would be
select distinct [year], [month], [day]
from [Table] t
where [month]=[day]
and not exists (
    select * from [Table] x
    where t.[year] = x.[year] and (t.[month] <> x.[month] or t.[day] <> x.[day])
)
And another way would be
select distinct [year], [month], [day] from (
select *,
Lead([month],1) over(partition by [year] order by [month])m2,
Lead([day],1) over(partition by [year] order by [day])d2
from [table]
)x
where [month]=m2 and [day]=d2

SQL query grouping by range

Hi, I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 | 1 | N | 100 |
+------+-------+----+--------+
| 2020 | 2 | N | 100 |
+------+-------+----+--------+
| 2020 | 3 | O | 100 |
+------+-------+----+--------+
| 2020 | 4 | N | 100 |
+------+-------+----+--------+
| 2020 | 5 | N | 100 |
+------+-------+----+--------+
| 2020 | 6 | O | 100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM | TO | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 | 200 |
+---------+---------+--------+
| 2020-03 | 2020-03 | 100 |
+---------+---------+--------+
| 2020-04 | 2020-05 | 200 |
+---------+---------+--------+
| 2020-06 | 2020-06 | 100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried with ROW_NUMBER partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*,
           year * 100 + month year_month,
           row_number() over(order by year, month) rn1,
           row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2
order by year_month_start
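The row-number-difference trick can be sanity-checked with Python's sqlite3 (window functions need SQLite 3.25+). This sketch uses the question's sample data and formats year-month as text so the output matches the desired result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (year INT, month INT, pa TEXT, amount INT);
INSERT INTO a VALUES
  (2020,1,'N',100),(2020,2,'N',100),(2020,3,'O',100),
  (2020,4,'N',100),(2020,5,'N',100),(2020,6,'O',100);
""")

rows = conn.execute("""
SELECT MIN(ym) AS from_ym, MAX(ym) AS to_ym, SUM(amount) AS amount
FROM (
  SELECT year || '-' || printf('%02d', month) AS ym, pa, amount,
         -- the difference of the two row numbers is constant
         -- within each run of consecutive months with the same pa
         ROW_NUMBER() OVER (ORDER BY year, month)
       - ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year, month) AS grp
  FROM a
)
GROUP BY pa, grp
ORDER BY from_ym
""").fetchall()

for r in rows:
    print(r)
# ('2020-01', '2020-02', 200)
# ('2020-03', '2020-03', 100)
# ('2020-04', '2020-05', 200)
# ('2020-06', '2020-06', 100)
```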
You can try the below:
select min(year) || '-' || min(month) as from_date,
       max(year) || '-' || max(month) as to_date,
       sum(amount) as amount
from (
    select *,
           row_number() over(order by month) -
           row_number() over(partition by pa order by month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
     , MAX(Dte) [To]
     -- , PA
     , SUM(Amount)
FROM (
    SELECT year * 100 + month Dte
         , Pa
         , Amount
         , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month) +
           10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
       , rn
ORDER BY [From]
       , [To]
The trick is the ROW_NUMBER function, partitioned by PA and ordered by date. It counts up by one for each month; when added to a value that decreases by one per month (10000 - (year * 100 + month)), consecutive months with the same PA end up with the same rn. You then group by PA and that computed value, rn, to get the groups, and Bob's your uncle.

postgresql - cumul. sum active customers by month (removing churn)

I want to create a query to get the cumulative sum by month of our active customers. The tricky thing here is that (unfortunately) some customers churn and so I need to remove them from the cumulative sum on the month they leave us.
Here is a sample of my customers table :
customer_id | begin_date | end_date
-----------------------------------------
1 | 15/09/2017 |
2 | 15/09/2017 |
3 | 19/09/2017 |
4 | 23/09/2017 |
5 | 27/09/2017 |
6 | 28/09/2017 | 15/10/2017
7 | 29/09/2017 | 16/10/2017
8 | 04/10/2017 |
9 | 04/10/2017 |
10 | 05/10/2017 |
11 | 07/10/2017 |
12 | 09/10/2017 |
13 | 11/10/2017 |
14 | 12/10/2017 |
15 | 14/10/2017 |
Here is what I am looking to achieve :
month | active customers
-----------------------------------------
2017-09 | 7
2017-10 | 6
I've managed to achieve it with the following query. However, I'd like to know if there is a better way.
select
"begin_date" as "date",
sum((new_customers.new_customers-COALESCE(churn_customers.churn_customers,0))) OVER (ORDER BY new_customers."begin_date") as active_customers
FROM (
select
date_trunc('month',begin_date)::date as "begin_date",
count(id) as new_customers
from customers
group by 1
) as new_customers
LEFT JOIN(
select
date_trunc('month',end_date)::date as "end_date",
count(id) as churn_customers
from customers
where
end_date is not null
group by 1
) as churn_customers on new_customers."begin_date" = churn_customers."end_date"
order by 1
;
You may use a CTE to count the end dates per month, then subtract that from the count of begin dates using a LEFT JOIN.
SQL Fiddle
Query 1:
WITH edt
AS (
SELECT to_char(end_date, 'yyyy-mm') AS mon
,count(*) AS ct
FROM customers
WHERE end_date IS NOT NULL
GROUP BY to_char(end_date, 'yyyy-mm')
)
SELECT to_char(c.begin_date, 'yyyy-mm') as month
,COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON to_char(c.begin_date, 'yyyy-mm') = edt.mon
GROUP BY to_char(begin_date, 'yyyy-mm')
ORDER BY month;
Results:
| month | active_customers |
|---------|------------------|
| 2017-09 | 7 |
| 2017-10 | 6 |
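The same per-month arithmetic can be reproduced with Python's sqlite3, replacing to_char with strftime. The rows below follow the sample customers table, with the dates converted to ISO format:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT, begin_date TEXT, end_date TEXT)")
begins = ['2017-09-15','2017-09-15','2017-09-19','2017-09-23','2017-09-27',
          '2017-09-28','2017-09-29','2017-10-04','2017-10-04','2017-10-05',
          '2017-10-07','2017-10-09','2017-10-11','2017-10-12','2017-10-14']
ends = {6: '2017-10-15', 7: '2017-10-16'}  # customers 6 and 7 churn in October
for i, b in enumerate(begins, start=1):
    conn.execute("INSERT INTO customers VALUES (?,?,?)", (i, b, ends.get(i)))

rows = conn.execute("""
WITH edt AS (
  SELECT strftime('%Y-%m', end_date) AS mon, COUNT(*) AS ct
  FROM customers
  WHERE end_date IS NOT NULL
  GROUP BY 1
)
SELECT strftime('%Y-%m', c.begin_date) AS month,
       COUNT(*) - MAX(COALESCE(ct, 0)) AS active_customers
FROM customers c
LEFT JOIN edt ON strftime('%Y-%m', c.begin_date) = edt.mon
GROUP BY 1
ORDER BY month
""").fetchall()

print(rows)  # [('2017-09', 7), ('2017-10', 6)]
```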

Rolling 90-day active users in BigQuery, improving performance (DAU/MAU/WAU)

I'm trying to get the number of unique events on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query below, but for large data sets I get memory errors because the aggregated string becomes massive.
I'm looking for a more effective way of achieving the same result.
Table looks something like this:
+---+------------+-------------+
| | date | userid |
+---+------------+-------------+
| 1 | 2013-05-14 | xxxxx |
| 2 | 2017-03-14 | xxxxx |
| 3 | 2018-01-24 | xxxxx |
| 4 | 2013-03-21 | xxxxx |
| 5 | 2014-03-19 | xxxxx |
| 6 | 2015-09-03 | xxxxx |
| 7 | 2014-02-06 | xxxxx |
| 8 | 2014-10-30 | xxxxx |
| ..| ... | ... |
+---+------------+-------------+
Format of the desired result:
+---+------------+---------------------------------------------+
| | date | active_users_7_days | active_users_90_days |
+---+------------+---------------------------------------------+
| 1 | 2013-05-14 | 1240 | 34339 |
| 2 | 2017-03-14 | 4334 | 54343 |
| 3 | 2018-01-24 | ..... | ..... |
| 4 | 2013-03-21 | ..... | ..... |
| 5 | 2014-03-19 | ..... | ..... |
| 6 | 2015-09-03 | ..... | ..... |
| 7 | 2014-02-06 | ..... | ..... |
| 8 | 2014-10-30 | ..... | ..... |
| ..| ... | ..... | ..... |
+---+------------+---------------------------------------------+
My query looks like this:
#standardSQL
WITH
T1 AS(
SELECT
date,
STRING_AGG(DISTINCT userid) AS IDs
FROM
`consumer.events`
GROUP BY
date ),
T2 AS(
SELECT
date,
STRING_AGG(IDs) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 90 PRECEDING
AND CURRENT ROW) AS IDs
FROM
T1 )
SELECT
date,
(
SELECT
COUNT(DISTINCT (userid))
FROM
UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
FROM
T2
Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:
https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
For an exact count, this would work (but gets slower as the window gets larger):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT owner_user_id) unique_90_day_users
, COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
The approximate solution produces results way faster (14s vs 366s, but then the results are approximate):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
Updated query that gives correct results - removing rows with less than 90 days (works when no dates are missing):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
You can aggregate per user and take the most recent date, then apply conditional sums:
select count(*) as num_users,
       sum(case when max_date > date_sub(current_date, interval 30 day) then 1 else 0 end) as num_users_30days,
       sum(case when max_date > date_sub(current_date, interval 60 day) then 1 else 0 end) as num_users_60days,
       sum(case when max_date > date_sub(current_date, interval 90 day) then 1 else 0 end) as num_users_90days
from (select userid, max(date) as max_date
      from `consumer.events` e
      group by userid
     ) e;
If the most recent date for the user is in the period, then the user should be counted.
You can get this "as-of" a particular date by using a where clause in the subquery.
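The cross-join-with-offsets idea behind the exact-count query can be sketched with Python's sqlite3, using a recursive CTE in place of GENERATE_ARRAY. This variant maps each event date forward so date_grp is the end of a trailing window; the 3-day window length, events table, and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (date TEXT, userid TEXT);
INSERT INTO events VALUES
  ('2021-01-01','a'),('2021-01-01','b'),('2021-01-03','a'),('2021-01-05','c');
""")

rows = conn.execute("""
WITH RECURSIVE offsets(i) AS (
  SELECT 0 UNION ALL SELECT i + 1 FROM offsets WHERE i < 2  -- 3-day window
),
daily AS (SELECT DISTINCT date, userid FROM events)
-- each (date, user) row contributes to the windows ending on
-- date, date+1, ..., date+2; then count distinct users per window end
SELECT date(d.date, '+' || o.i || ' day') AS date_grp,
       COUNT(DISTINCT d.userid) AS unique_3_day_users
FROM daily d CROSS JOIN offsets o
GROUP BY 1
ORDER BY 1
""").fetchall()

active = dict(rows)
```

As in the updated query above, window ends within the first window-length days of the data (and past its last date) cover fewer than 3 real days, so a production version would filter them out the same way the HAVING window_days clause does.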