Distinct New Customers in SQL

The table below has customer ID, product purchased, and purchase date.
customer_id products purchase_date
93738117783 product a 5/24/2022
93738117783 product a 6/8/2022
93738117783 product a 7/19/2022
93738117783 product a 8/18/2022
93738117783 product a 9/22/2022
93738117783 product a 10/19/2022
93738117783 product a 11/17/2022
93738117783 product a 12/27/2022
93738554027 product a 5/5/2021
93738738408 product b 8/2/2021
93738738408 product b 9/20/2021
93738738408 product b 10/26/2021
93738738408 product b 12/2/2021
93738738408 product b 1/2/2022
93738738408 product b 3/27/2022
93738738408 product b 5/2/2022
93738738408 product b 6/10/2022
93738738408 product b 7/8/2022
93738738408 product b 7/31/2022
93738117783 product a 8/1/2022
93738117783 product a 9/5/2022
93738117783 product a 10/8/2022
93738117783 product a 11/16/2022
93738117783 product a 12/19/2022
93738943799 product a 10/21/2020
93738943799 product a 11/20/2020
93738943799 product a 1/24/2021
93739310547 product b 5/3/2022
93739310547 product b 8/19/2022
93739310547 product b 1/5/2023
From this table, I want to create a SQL query to get the following output -
product new_customers week_ending_Friday
product a 2 12/16/2022
product b 3 12/16/2022
product a 1 12/10/2022
product b 4 12/10/2022
new_customers = customers who had not purchased the product in the 1 year before the purchase_date
week_ending_Friday = purchase_date rolled up to the week ending Friday
Any ideas would help.

First step: get the Friday of a date's week. By the ISO definition a week starts on Monday, so you'll have to add 4 days to the ISO week's start day to get Friday.
DATE_TRUNC('weekiso', purchase_date) + INTERVAL '4 day'
Correction: the general docs on date and time functions in Snowflake show that weekiso is not allowed with DATE_TRUNC. This is a pity, for it means we must find out how many days to add to the purchase date. I think this should do the trick:
purchase_date + (interval '1 day' * (5 - date_part(dayofweekiso, purchase_date)))
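A quick sanity check of that expression with a sample literal date (2022-12-19 is a Monday, so dayofweekiso is 1 and 4 days are added):
select '2022-12-19'::date
       + (interval '1 day' * (5 - date_part(dayofweekiso, '2022-12-19'::date)));
-- returns 2022-12-23, which is a Friday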
Then, in order to find new customers, we would get the first purchase date per customer and product, a simple aggregation. But in your comments you say that you also consider a customer new when their last purchase was made more than a year before. So we need a lookup, which we do with [NOT] EXISTS.
select
    day as week_ending_friday,
    product,
    count_if(is_new) as new_customers
from
(
    select
        customer_id,
        product,
        -- roll the purchase date forward to the Friday of its ISO week
        purchase_date +
            (interval '1 day' * (5 - date_part(dayofweekiso, purchase_date))) as day,
        -- "new" = no purchase of the same product by the same customer within the previous year
        not exists
        (
            select null
            from purchases pp
            where pp.customer_id = p.customer_id
            and pp.product = p.product
            and pp.purchase_date < p.purchase_date
            and pp.purchase_date >= p.purchase_date - interval '1 year'
        ) as is_new
    from purchases p
) evaluated
group by day, product
order by day, product;

The qualify bit:
datediff(day,coalesce(lag(purchase_date)over(partition by customer_id,products order by purchase_date)+365,purchase_date),purchase_date)>=0
This identifies whether the customer is new for this product. As the requirements define "new" to include a last purchase made over 1 year ago (hence the +365), we grab the previous purchase (lag) and check how many days before the current purchase it was. The coalesce defaults the comparison date to the current purchase_date when there is no previous purchase, so the datediff is 0 for a very first purchase; anything 0 and above is therefore either brand new or over 1 year since the previous purchase.
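To see what that expression evaluates to row by row, you can run it as a plain select first (a minimal sketch; it reuses the cte defined further down and surfaces the datediff instead of filtering on it):
select customer_id, products, purchase_date,
       datediff(day,
                coalesce(lag(purchase_date)
                           over (partition by customer_id, products order by purchase_date) + 365,
                         purchase_date),
                purchase_date) as days_vs_prev_plus_365  -- >= 0 means first-ever purchase or a gap of over a year
from cte
order by customer_id, products, purchase_date;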
I prefer the qualify approach over nested queries as the SQL looks more elegant, and I'm a lazy typist (ask Thorsten Kettner). The other answer by Thorsten uses almost 50% more words/functions. However, if you find its readability better -> that's the solution for you. So (1) find SQL you can read, (2) of the SQL you understand, choose the version with the fewest characters.
The array functions array_agg() and array_size() are pretty cool - have a look at the Snowflake docs, they're pretty self-explanatory - and reach back if you have any questions. I prefer to link to the docs as Snowflake is always fiddling with these functions, so the docs can change. Note - I've never seen them break stuff -> but they sometimes add functionality.
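As a quick illustration of the array side (a minimal sketch with made-up values; the query below actually uses array_unique_agg, which keeps only distinct values):
select array_unique_agg($1) as distinct_ids,
       array_size(array_unique_agg($1)) as n_distinct
from values (1), (1), (2);
-- distinct_ids = [1, 2], n_distinct = 2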
The reason I add the cte at the front of the code is so that when smarter folks than I come and look at the answers, they can copy/paste into Snowflake and get the example working asap. This means we're not wasting their time - quite often actual Snowflake employees answer questions, so it's best to make things quick and easy for them to improve answers. I'd recommend all questions have this -> as you're likely to get more answers.
with cte as (select 93738117783 customer_id, 'product a' products, '5/24/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '6/8/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '7/19/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '8/18/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '9/22/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '10/19/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '11/17/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '12/27/2022'::date purchase_date
union all select 93738554027 customer_id, 'product a' products, '5/5/2021'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '8/2/2021'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '9/20/2021'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '10/26/2021'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '12/2/2021'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '1/2/2022'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '3/27/2022'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '5/2/2022'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '6/10/2022'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '7/8/2022'::date purchase_date
union all select 93738738408 customer_id, 'product b' products, '7/31/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '8/1/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '9/5/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '10/8/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '11/16/2022'::date purchase_date
union all select 93738117783 customer_id, 'product a' products, '12/19/2022'::date purchase_date
union all select 93738943799 customer_id, 'product a' products, '10/21/2020'::date purchase_date
union all select 93738943799 customer_id, 'product a' products, '11/20/2020'::date purchase_date
union all select 93738943799 customer_id, 'product a' products, '1/24/2021'::date purchase_date
union all select 93739310547 customer_id, 'product b' products, '5/3/2022'::date purchase_date
union all select 93739310547 customer_id, 'product b' products, '8/19/2022'::date purchase_date
union all select 93739310547 customer_id, 'product b' products, '1/5/2023'::date purchase_date)
select
DATE_TRUNC('week', purchase_date) + INTERVAL '4 day' week_number
,products
,array_unique_agg(customer_id)over (partition by week_number, products) new_custs
,array_size(new_custs) number_new_custs
from cte
qualify
datediff(day,coalesce(lag(purchase_date)over(partition by customer_id,products order by purchase_date)+365,purchase_date),purchase_date)>=0

Related

How to pivot multiple aggregations in Snowflake

I have the table structure as below
product_id Period Sales Profit
x1 L13 $100 $10
x1 L26 $200 $20
x1 L52 $300 $30
x2 L13 $500 $110
x2 L26 $600 $120
x2 L52 $700 $130
I want to pivot the period column over and have the sales value and profit in those columns. I need a table like below.
product_id SALES_L13 SALES_L26 SALES_L52 PROFIT_L13 PROFIT_L26 PROFIT_L52
x1 $100 $200 $300 $10 $20 $30
x2 $500 $600 $700 $110 $120 $130
I am using Snowflake to write the queries. I tried using the pivot function of Snowflake, but there I can only specify one aggregation function.
Can anyone help with how I can achieve this?
Any help is appreciated.
Thanks
How about we stack sales and profit before we pivot? I'll leave it up to you to fix the column names that I messed up.
with cte (product_id, period, amount) as
(select product_id, period||'_profit', profit from t
union all
select product_id, period||'_sales', sales from t)
select *
from cte
pivot(max(amount) for period in ('L13_sales','L26_sales','L52_sales','L13_profit','L26_profit','L52_profit'))
as p (product_id,L13_sales,L26_sales,L52_sales,L13_profit,L26_profit,L52_profit);
If you wish to pivot period twice, once for sales and once for profit, you'll need to duplicate the column so you have one copy for each instance of pivot. Obviously, this will create nulls because the duplicate column is still present after the first pivot. To handle that, we can use max in the final select. Here's what the implementation looks like:
select product_id,
max(L13_sales) as L13_sales,
max(L26_sales) as L26_sales,
max(L52_sales) as L52_sales,
max(L13_profit) as L13_profit,
max(L26_profit) as L26_profit,
max(L52_profit) as L52_profit
from (select *, period as period2 from t) t
pivot(max(sales) for period in ('L13','L26','L52'))
pivot(max(profit) for period2 in ('L13','L26','L52'))
as p (product_id, L13_sales,L26_sales,L52_sales,L13_profit,L26_profit,L52_profit)
group by product_id;
At this point, it's an eyesore. You might as well use conditional aggregation or, better yet, handle the pivoting inside the reporting application. A more compact alternative to conditional aggregation uses decode:
select product_id,
max(decode(period,'L13',sales)) as L13_sales,
max(decode(period,'L26',sales)) as L26_sales,
max(decode(period,'L52',sales)) as L52_sales,
max(decode(period,'L13',profit)) as L13_profit,
max(decode(period,'L26',profit)) as L26_profit,
max(decode(period,'L52',profit)) as L52_profit
from t
group by product_id;
Using conditional aggregation:
SELECT product_id
,SUM(CASE WHEN Period = 'L13' THEN Sales END) AS SALES_L13
,SUM(CASE WHEN Period = 'L26' THEN Sales END) AS SALES_L26
,SUM(CASE WHEN Period = 'L52' THEN Sales END) AS SALES_L52
,SUM(CASE WHEN Period = 'L13' THEN Profit END) AS PROFIT_L13
,SUM(CASE WHEN Period = 'L26' THEN Profit END) AS PROFIT_L26
,SUM(CASE WHEN Period = 'L52' THEN Profit END) AS PROFIT_L52
FROM tab
GROUP BY product_id
I'm not 100% happy with this answer ... pretty sure someone can improve on this approach.
Basically it's PIVOTING an ARRAY ... the list of aggregation functions available for an ARRAY is not huge ... there's just ARRAY_AGG. And PIVOT is only supposed to support AVG, COUNT, MAX, MIN, and SUM. So this shouldn't work ... it does, as I think PIVOT just requires an aggregation of some sort.
I'd recommend aggregating your metrics PRIOR to constructing the ARRAY ... but this does let you pivot multiple metrics at once - which, from reading Stack Overflow, shouldn't be possible!
Copy|Paste|Run| .. and IMPROVE please :-)
WITH CTE AS( SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT)
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM
(SELECT * FROM
(
SELECT PRODUCT_ID, PERIOD,ARRAY_CONSTRUCT(SALES,PROFIT) S FROM CTE)
PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52')
)
)
Example with aggregations (added 1700,1130 to L52 X2)
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM
(SELECT * FROM
(
SELECT PRODUCT_ID, PERIOD,ARRAY_CONSTRUCT(SUM(SALES),SUM(PROFIT)) S FROM CTE GROUP BY 1,2)
PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52')
)
)
Here's an alternative form using OBJECT_AGG with LATERAL FLATTEN that avoids the potential support issue of PIVOT with ARRAY_AGG proposed by Adrian White.
This should work for any aggregates on multiple input columns included within the initial ARRAY_CONSTRUCT in the OBJ_TALL CTE. I expect that the conditional aggregation option with CASE statements would be faster but you'd need to test at scale to see.
-- OBJECT FORM USING LATERAL FLATTEN
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
,OBJ_TALL AS ( SELECT PRODUCT_ID,
OBJECT_CONSTRUCT(PERIOD,
ARRAY_CONSTRUCT( SUM(SALES)
,SUM(PROFIT)
)
) S
FROM CTE
GROUP BY PRODUCT_ID, PERIOD)
-- SELECT * FROM OBJ_TALL;
,OBJ_WIDE AS ( SELECT PRODUCT_ID, OBJECT_AGG(KEY,VALUE) OA
FROM OBJ_TALL, LATERAL FLATTEN(INPUT => S)
GROUP BY PRODUCT_ID)
-- SELECT * FROM OBJ_WIDE;
SELECT
PRODUCT_ID
,OA:L13[0] SALES_L13
,OA:L13[1] PROFIT_L13
,OA:L26[0] SALES_L26
,OA:L26[1] PROFIT_L26
,OA:L52[0] SALES_L52
,OA:L52[1] PROFIT_L52
FROM OBJ_WIDE
ORDER BY 1;
For easy comparison to the above, here's Adrian's ARRAY_AGG and PIVOT version reformatted using CTEs.
-- ARRAY FORM - RE-WRITTEN WITH CTES FOR CLARITY AND COMPARISON TO OBJECT FORM
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
,ARR_TALL AS (SELECT PRODUCT_ID,
PERIOD,
ARRAY_CONSTRUCT( SUM(SALES)
,SUM(PROFIT)
) S
FROM CTE GROUP BY 1,2)
,ARR_WIDE AS (SELECT *
FROM ARR_TALL PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52') ) )
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM ARR_WIDE
ORDER BY 1;
I believe you can only have one pivot at a time, but you can check by running the first query below. Then run it separately with only one pivot to see if that works fine. Unfortunately, if multiple pivots are not allowed (i.e. the first query), then you can use the third query (i.e. the case when method) OR use a union first to combine the metrics (i.e. the Phil Culson method from above).
select *
from [table name]
pivot(sum(Sales) for PERIOD in ('L13', 'L26', 'L52'))
pivot(sum(Profit) for PERIOD in ('L13', 'L26', 'L52'))
order by product_id;
If the above doesn't work, try it with just one pivot; for example:
https://count.co/sql-resources/snowflake/pivot-tables
select *
from [table name]
pivot(sum(Sales) for PERIOD in ('L13', 'L26', 'L52'))
order by product_id;
Otherwise you will have to apply the manual case when logic:
select
product_id,
sum(case when Period = 'L13' then Sales end) as sales_l13,
sum(case when Period = 'L26' then Sales end) as sales_l26,
sum(case when Period = 'L52' then Sales end) as sales_l52,
sum(case when Period = 'L13' then Profit end) as profit_l13,
sum(case when Period = 'L26' then Profit end) as profit_l26,
sum(case when Period = 'L52' then Profit end) as profit_l52
from [table name]
group by 1

PARTITION BY first use of a particular product

I'm trying to produce a table that lists the month, account and product name from our billing database. However, I also want to understand (for subsequent cohort analysis) what the earliest use is of "Product A" for each line item too. I was hoping I could do the following:
SELECT
Month,
AccountID,
ProductName,
SUM(NetRevenue) AS NetRevenue,
MIN(Month) OVER(PARTITION BY AccountID, 'Product A') AS EarliestUse
FROM
<<my-billing-table>>
WHERE
NetRevenue > 0
AND AccountID IN (
SELECT DISTINCT AccountID
FROM <<my-billing-table>>
WHERE ProductName = 'Product A' AND NetRevenue > 0
)
GROUP BY 1,2,3
...but it seems that just using "Product A" within the OVER clause does not have the desired effect (it seems to just return the first month for AccountID).
While the syntax is fine and the query runs, I'm obviously missing something regarding PARTITIONing the OVER clause. Any help much appreciated!
I think you want conditional aggregation along with a window function:
SELECT Month, AccountID, ProductName,
SUM(NetRevenue) AS NetRevenue,
MIN(MIN(CASE WHEN ProductName = 'Product A' THEN month END)) OVER (PARTITION BY AccountID) AS EarliestUse
FROM <<my-billing-table>>
WHERE NetRevenue > 0 AND
AccountID IN (SELECT AccountID
FROM <<my-billing-table>>
WHERE ProductName = 'Product A' AND NetRevenue > 0
)
GROUP BY 1,2,3;
The key expression here is an aggregation function nestled inside a window function. The aggregation function is MIN(CASE WHEN ProductName = 'Product A' THEN month END). This calculates the earliest month for the specified product on each row. This could be a column in the result set, and you would see the minimum value on the product row.
The window function then "spreads" this value over all rows for a given AccountID.
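Here's a minimal, made-up example of that nested aggregate/window pattern (Snowflake-flavored syntax; the table and data are invented for illustration only):
with t (AccountID, Month, ProductName) as (
    select * from values
        (1, '2023-01', 'Product A'),
        (1, '2023-03', 'Product B'),
        (2, '2023-02', 'Product B')
)
select AccountID, ProductName,
       min(case when ProductName = 'Product A' then Month end)      as min_per_group,
       min(min(case when ProductName = 'Product A' then Month end))
           over (partition by AccountID)                             as earliest_use
from t
group by AccountID, ProductName;
-- both rows for AccountID 1 get earliest_use = '2023-01';
-- AccountID 2 never bought Product A, so it gets NULL (the IN (...) filter above would exclude it anyway)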
You are using a constant in the partition, so it will not affect your result. You should use the column ProductName in the partition to get the earliest use of the product:
SELECT
Month,
AccountID,
ProductName,
SUM(NetRevenue) AS NetRevenue,
MIN(Month) OVER(PARTITION BY AccountID, ProductName) AS EarliestUse
FROM
<<my-billing-table>>
WHERE
NetRevenue > 0
AND AccountID IN (
SELECT DISTINCT AccountID
FROM <<my-billing-table>>
WHERE ProductName = 'Product A' AND NetRevenue > 0
)
GROUP BY 1,2,3

SQL: Looking at Bundle of Products Sold

I have a sample DB below. I'm looking to see how many TV and Internet bundles we sold. In the sample data, only Bob and Trevor sold that bundle so we sold 2.
How do I write the query for the number of bundles sold by each Sales rep and the total price of the bundles sold?
Thanks
I imagine that, for a bundle to happen, the same sales person needs to have sold both products to the same customer.
I would approach this with two levels of aggregation. First group by sales person and customer in a subquery to identify the bundles, then, in an outer query, count how many such bundles happened for each sales person:
SELECT sales_person, COUNT(*) bundles_sold, SUM(total_price) total_price
FROM (
    SELECT sales_person, customer_name, SUM(total_price) total_price
    FROM mytable
    WHERE product_name IN ('TV', 'Internet')
    GROUP BY sales_person, customer_name
    HAVING COUNT(DISTINCT product_name) = 2
) x
GROUP BY sales_person
You can simply group the salespersons by counting the distinct products they sold -
SELECT Sales_Person, FLOOR(COUNT(DISTINCT product_name)/2) NO_OF_BUNDLES, sum(total_price)
FROM YOUR_TAB
WHERE product_name IN ('TV', 'Internet')
GROUP BY Sales_Person
HAVING COUNT(DISTINCT product_name) >= 2
Using cte as below:
with cte1(sales_person, customer_name, product_count) as
(
select sales_person, customer_name, count(product_name)
from sales
where product_name in ('TV', 'Internet')
group by sales_person, customer_name
having count(product_name) = 2
)
select sales_person, count(product_count)
from cte1
group by sales_person
I would suggest two levels of aggregation:
select sales_person, count(*), sum(total_price)
from (select sales_person, customer_name,
sum(total_price) as total_price,
max(case when product_name = 'tv' then 1 else 0 end) as has_tv,
max(case when product_name = 'phone' then 1 else 0 end) as has_phone,
max(case when product_name = 'internet' then 1 else 0 end) as has_internet
from t
group by sales_person, customer_name
) sc
where has_phone = 0 and
has_tv = 1 and
has_internet = 1
group by sales_person;
I recommend this structure because it is pretty easy to change the conditions in the where clause to return this for any bundle -- or even to aggregate by the three flags and return the totals for all bundles in one query.
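For instance, the "all bundles in one query" variant mentioned above could look something like this (a minimal sketch built on the same subquery, with the same assumed table name t and lowercase product names as in the query above):
select has_tv, has_phone, has_internet,
       count(*) as customers,
       sum(total_price) as total_price
from (select sales_person, customer_name,
             sum(total_price) as total_price,
             max(case when product_name = 'tv' then 1 else 0 end) as has_tv,
             max(case when product_name = 'phone' then 1 else 0 end) as has_phone,
             max(case when product_name = 'internet' then 1 else 0 end) as has_internet
      from t
      group by sales_person, customer_name
     ) sc
group by has_tv, has_phone, has_internet;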

calculating transaction amount for the following week

Transaction -> Transaction_id, buyer_id, seller_id, object_id, Shipping_id, Price, Quantity, site_id, transaction_date, expected_delivery_date, check_out_status, leaf_category_id, defect_id
Buyer -> Buyer_id, name, country
Seller -> Seller_id, name, country, segment, standard
Listing -> object_id, seller_id, auction_start_date, auction_end_date, listing_site_id, leaf_category_id, quantity
For the sellers from the UK who transacted in the second week of December (6 December 2015 to 12 December 2015), find the number of sellers who have at least twice the total transaction amount (qty * price) in the following week.
I have tried the query below to get sellers who transacted in the second week of December, but I'm facing an error when calculating which of those sellers have twice the transaction amount in the following week.
With trans_dec_uk as
(
select s.seller_id,t.transaction_date, sum(t.Qty * Price) trans_amount
from transaction t join seller s
on t.seller_id =s.seller_id
where s.country ='UK'
and t.transaction_date between '12-05-2015' and '12-18-2015'
group by s.seller_id,t.transaction_date
)
select count(seller) from  trans_dec_uk
where trans_amount =  2 * to_char(sysdate+7,'DD-MM')
with uk_sellers as (
select * from <dataset>.Seller where country = 'UK'
),
first_week_uk as (
select seller_id, sum(Price*Quantity) as first_week_total
from <dataset>.Transaction
inner join uk_sellers using(seller_id)
where transaction_date between '2015-12-05' and '2015-12-11'
group by 1
),
second_week_uk as (
select seller_id, sum(Price*Quantity) as second_week_total
from <dataset>.Transaction
inner join uk_sellers using(seller_id)
where transaction_date between '2015-12-12' and '2015-12-18'
group by 1
)
select count(distinct seller_id) as the_answer
from first_week_uk
inner join second_week_uk using(seller_id)
where second_week_total >= 2*first_week_total

populating table from different tables

Good day.
I have the following tables:
Order_Header(Order_id {pk}, customer_id {fk}, agent_id {fk}, Order_date(DATE FORMAT))
Invoice_Header (Invoice_ID {pk}, Customer_ID {fk}, Agent_ID{fk}, invoice_Date{DATE FORMAT} )
Stock( Product_ID {pk}, Product_description)
I created a table called AVG_COMPLETION_TIME_FACT and want to populate it with the following values regarding the previous 3 tables:
Product_ID
Invoice_month
Invoice_Year
AVG_Completion_Time (Invoice_date - Order_date)
I have the following code that doesn't work:
INSERT INTO AVG_COMPLETION_TIME_FACT(
SELECT PRODUCT_ID, EXTRACT (YEAR FROM INVOICE_DATE), EXTRACT (MONTH FROM INVOICE_DATE), (INVOICE_DATE - ORDER_DATE)
FROM STOCK, INVOICE_HEADER, ORDER_HEADER
GROUP BY PRODUCT_ID, EXTRACT (YEAR FROM INVOICE_DATE), EXTRACT (MONTH FROM INVOICE_DATE)
);
I want to group it by the product_id, year of invoice and month of invoice.
Is this possible?
Any advice would be much appreciated.
Regards
Short answer: it may be possible - if your database contains some more columns that are needed for writing the correct query.
There are several problems, apart from the syntactical ones. When we create some test tables, you can see that the answer you are looking for cannot be derived from the columns you have provided in your question. Example tables (Oracle 12c), all PK/FK constraints omitted:
-- 3 tables, similar to the ones described in your question,
-- including some test data
create table order_header (id, customer_id, agent_id, order_date )
as
select 1000, 100, 1, date'2018-01-01' from dual union all
select 1001, 100, 2, date'2018-01-02' from dual union all
select 1002, 100, 3, date'2018-01-03' from dual
;
create table invoice_header ( id, customer_id, agent_id, invoice_date )
as
select 2000, 100, 1, date'2018-02-01' from dual union all
select 2001, 100, 2, date'2018-03-11' from dual union all
select 2002, 100, 3, date'2018-04-21' from dual
;
create table stock( product_id, product_description)
as
select 3000, 'product3000' from dual union all
select 3001, 'product3001' from dual union all
select 3002, 'product3002' from dual
;
If you join the tables as you have done (using a cross join), you will see that you get more rows than expected ... But: neither the invoice_header table nor the order_header table contains any PRODUCT_ID data. Thus, we cannot tell which product_ids are associated with the stored order_ids or invoice_ids.
select
product_id
, extract( year from invoice_date )
, extract( month from invoice_date )
, invoice_date - order_date
from stock, invoice_header, order_header -- cross join -> too many rows in the resultset!
-- group by ...
;
...
27 rows selected.
For getting your query right, you should probably write INNER JOINs and conditions (keyword: ON). If we try to do this with your original table definitions (as provided in your question) you will see that we cannot join all 3 tables, as they do not contain all the columns needed: PRODUCT_ID (table STOCK) cannot be associated with ORDER_HEADER or INVOICE_HEADER.
One column that these 2 tables (ORDER_HEADER and INVOICE_HEADER) do have in common is: customer_id, but that's not enough for answering your question. However, we can use it for demonstrating how you could code the JOINs.
select
-- product_id
IH.customer_id as cust_id
, OH.id as OH_id
, IH.id as IH_id
, extract( year from invoice_date ) as year_
, extract( month from invoice_date ) as month_
, invoice_date - order_date as completion_time
from invoice_header IH
join order_header OH on IH.customer_id = OH.customer_id
-- the stock table cannot be joined at this stage
;
Missing columns:
Please regard the following just as "proof of concept" code. Assuming that somewhere in your database, you have tables that have columns that {1} link STOCK and ORDER_HEADER (name here: STOCK_ORDER) and {2} link ORDER_HEADER and INVOICE_HEADER (name here: ORDER_INVOICE), you could actually get the information you want.
-- each ORDER_HEADER is mapped to multiple product_ids
create table stock_order
as
select S.product_id, OH.id as oh_id -- STOCK and ORDER_HEADER
from stock S, order_header OH ; -- cross join, we use all possible combinations here
select oh_id, product_id
from stock_order
order by OH_id
;
PRODUCT_ID OH_ID
---------- ----------
3000 1000
3000 1001
3000 1002
3001 1000
3001 1001
3001 1002
3002 1000
3002 1001
3002 1002
9 rows selected.
-- each INVOICE_HEADER mapped to a single ORDER_HEADER
create table order_invoice ( order_id, invoice_id )
as
select 1000, 2000 from dual union all
select 1001, 2001 from dual union all
select 1002, 2002 from dual
;
For querying, make sure that you code the correct JOIN conditions (ON ...) eg
-- example query. NOTICE: conditions in ON ...
select
S.product_id
, IH.customer_id as cust_id
, OH.id as OH_id
, IH.id as IH_id
, extract( year from invoice_date ) as year_
, extract( month from invoice_date ) as month_
, invoice_date - order_date as completion_time
from invoice_header IH
join order_invoice OI on IH.id = OI.invoice_id -- <- new "link" table
join order_header OH on OI.order_id = OH.id
join stock_order SO on OH.id = SO.OH_id -- <- new "link" table
join stock S on S.product_id = SO.product_id
;
Now you can add the GROUP BY, and SELECT only the columns you need. Combined with an INSERT, you should write something like ...
-- example avg_completion_time_fact table.
create table avg_completion_time_fact (
product_id number
, year_ number
, month_ number
, avg_completion_time number
) ;
insert into avg_completion_time_fact ( product_id, year_, month_, avg_completion_time )
select
S.product_id
, extract( year from invoice_date ) as year_
, extract( month from invoice_date ) as month_
, avg( invoice_date - order_date ) as avg_completion_time
from invoice_header IH
join order_invoice OI on IH.id = OI.invoice_id
join order_header OH on OI.order_id = OH.id
join stock_order SO on OH.id = SO.OH_id
join stock S on S.product_id = SO.product_id
group by S.product_id, extract( year from invoice_date ), extract( month from invoice_date )
;
The AVG_COMPLETION_TIME_FACT table now contains:
SQL> select * from avg_completion_time_fact order by product_id ;
PRODUCT_ID YEAR_ MONTH_ AVG_COMPLETION_TIME
---------- ---------- ---------- -------------------
3000 2018 3 68
3000 2018 4 108
3000 2018 2 31
3001 2018 3 68
3001 2018 2 31
3001 2018 4 108
3002 2018 3 68
3002 2018 4 108
3002 2018 2 31
It is not completely clear what the final query for your database (or schema) will look like, as we don't know the definitions of all the tables it contains. However, if you apply the techniques and stick to the syntax of the examples, you should be able to obtain the required results. Best of luck!