I am having some trouble with a COUNT function. The problem comes from a LEFT JOIN that I am not sure I am doing correctly.
Variables are:
Customer_name (buyer)
Product_code (what the customer buys)
Store (where the customer buys)
The datasets are:
Customer_df (list of customers and product codes of their purchases)
Store1_df (list of product codes per week, for Store 1)
Store2_df (list of product codes per day, for Store 2)
Final output desired:
I would like to have a table with:
col1: Customer_name;
col2: Count of items purchased in store 1;
col3: Count of items purchased in store 2;
Filters: date range
My query looks like this:
SELECT
    DISTINCT
    C.customer_name,
    C.product_code,
    COUNT(S1.product_code) AS s1_sales,
    COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)
GROUP BY
    customer_name, product_code
HAVING
    S1_sales > 0
    OR S2_sales > 0
The output I expect is something like this:
Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        4                    8
James          100022        6                    10
But instead, I get:
Customer_name  Product_code  Store1_weekly_sales  Store2_weekly_sales
Luigi          120012        290                  60
James          100022        290                  60
It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I use COUNT DISTINCT over more than 1 week of data, I will not get the right numbers).
My hypotheses are:
I am joining the tables in the wrong way
There is a problem when joining two datasets with different time aggregations
What am I doing wrong?
The reason, as Philipxy indicated, is a common one. You are getting a Cartesian result from your data, which bloats your numbers. To simplify, let's consider a single customer purchasing one item from two stores. The first store has 3 purchases, the second store has 5 purchases. Your total count is 3 * 5 = 15, because each entry from the first store is joined to every matching entry from the second. So the 1st purchase is joined to second-store purchases 1-5, then the 2nd purchase is joined to second-store purchases 1-5, and so on; you can see the bloat. If each store is instead pre-queried for its aggregates per customer, you get AT MOST one record per customer per store (and per product, as per your desired outcome).
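To see the 3 * 5 bloat concretely, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration (one customer/product, 3 rows in store 1, 5 rows in store 2), and the join is on product_code as in the question's query.

```python
import sqlite3

# Hypothetical miniature of the question's tables: one customer/product,
# 3 matching rows in store 1 and 5 matching rows in store 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_df (customer_name TEXT, product_code INTEGER);
    CREATE TABLE store1_df (product_code INTEGER);
    CREATE TABLE store2_df (product_code INTEGER);
    INSERT INTO customer_df VALUES ('Luigi', 120012);
""")
conn.executemany("INSERT INTO store1_df VALUES (?)", [(120012,)] * 3)
conn.executemany("INSERT INTO store2_df VALUES (?)", [(120012,)] * 5)

# Same shape as the question's query: two LEFT JOINs, then COUNT.
naive_counts = conn.execute("""
    SELECT COUNT(s1.product_code), COUNT(s2.product_code)
    FROM customer_df c
    LEFT JOIN store1_df s1 ON s1.product_code = c.product_code
    LEFT JOIN store2_df s2 ON s2.product_code = c.product_code
    GROUP BY c.customer_name, c.product_code
""").fetchone()

# Counting each store on its own gives the real numbers.
store1_count = conn.execute("SELECT COUNT(*) FROM store1_df").fetchone()[0]
store2_count = conn.execute("SELECT COUNT(*) FROM store2_df").fetchone()[0]

print(naive_counts)                # (15, 15)
print(store1_count, store2_count)  # 3 5
```

Both counts come out as 15 because every store 1 row is paired with every store 2 row before COUNT runs. The query below avoids this by aggregating each store before joining.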
select
      c.customer_name,
      AllCustProducts.Product_Code,
      coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
      coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
   from
      customer_df c
         -- now, we need all possible UNIQUE instances of
         -- a given customer and product to prevent duplicates
         -- for subsequent queries of sales per customer and store
         JOIN
         ( select distinct customerid, product_code
              from store1_df
           union
           select distinct customerid, product_code
              from store2_df ) AllCustProducts
            on c.customerid = AllCustProducts.customerid
            -- NOW, we can join to a pre-query of sales at store 1
            -- by customer id and product code. You may also want to
            -- get sum( SalesDollars ) if available, just add respectively
            -- to each sub-query below.
            LEFT JOIN
            ( select
                    s1.customerid,
                    s1.product_code,
                    count(*) as SalesEntries
                 from
                    store1_df s1
                 group by
                    s1.customerid,
                    s1.product_code ) PQStore1
               on AllCustProducts.customerid = PQStore1.customerid
               AND AllCustProducts.product_code = PQStore1.product_code
            -- now, same pre-aggregation to store 2
            LEFT JOIN
            ( select
                    s2.customerid,
                    s2.product_code,
                    count(*) as SalesEntries
                 from
                    store2_df s2
                 group by
                    s2.customerid,
                    s2.product_code ) PQStore2
               on AllCustProducts.customerid = PQStore2.customerid
               AND AllCustProducts.product_code = PQStore2.product_code
There is no need for a GROUP BY or HAVING in the outer query, since each pre-aggregate yields at most 1 record per unique customer/product combination. As for your need to filter by date ranges, I would just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 sub-queries.
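As a rough sketch of where those WHERE clauses would go, the following runs the pre-aggregation pattern in SQLite through Python's sqlite3 module. The sale_date column, the sample rows, and the :date_from / :date_to parameter names are assumptions for illustration (the question never shows the date column), and the join to customer_df is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- sale_date is a hypothetical column name; dates are ISO text so they sort correctly
    CREATE TABLE store1_df (customerid INTEGER, product_code INTEGER, sale_date TEXT);
    CREATE TABLE store2_df (customerid INTEGER, product_code INTEGER, sale_date TEXT);
    INSERT INTO store1_df VALUES
        (1, 120012, '2023-01-02'), (1, 120012, '2023-01-03'), (1, 120012, '2023-02-01');
    INSERT INTO store2_df VALUES
        (1, 120012, '2023-01-02'), (1, 120012, '2023-01-05'),
        (1, 120012, '2023-01-09'), (1, 120012, '2023-03-01'),
        (2, 100022, '2023-01-10');
""")

# The same date filter is repeated inside every sub-query, so each
# pre-aggregate only sees rows from the requested range.
query = """
    SELECT acp.customerid, acp.product_code,
           COALESCE(p1.SalesEntries, 0) AS Store1SalesEntries,
           COALESCE(p2.SalesEntries, 0) AS Store2SalesEntries
    FROM (SELECT customerid, product_code FROM store1_df
          WHERE sale_date BETWEEN :date_from AND :date_to
          UNION
          SELECT customerid, product_code FROM store2_df
          WHERE sale_date BETWEEN :date_from AND :date_to) acp
    LEFT JOIN (SELECT customerid, product_code, COUNT(*) AS SalesEntries
               FROM store1_df
               WHERE sale_date BETWEEN :date_from AND :date_to
               GROUP BY customerid, product_code) p1
      ON p1.customerid = acp.customerid AND p1.product_code = acp.product_code
    LEFT JOIN (SELECT customerid, product_code, COUNT(*) AS SalesEntries
               FROM store2_df
               WHERE sale_date BETWEEN :date_from AND :date_to
               GROUP BY customerid, product_code) p2
      ON p2.customerid = acp.customerid AND p2.product_code = acp.product_code
    ORDER BY acp.customerid, acp.product_code
"""
rows = conn.execute(query, {"date_from": "2023-01-01", "date_to": "2023-01-31"}).fetchall()
print(rows)  # [(1, 120012, 2, 3), (2, 100022, 0, 1)]
```

The February and March rows fall outside the range and are not counted, and customer 2, who only bought at store 2, still appears with a 0 for store 1.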
I want to produce a table with two columns in the form of (country, total_revenue)
This is what the relational model looks like:
Each entry in the orderdetails table produces revenue equal to quantityordered (a column) * priceEach (also a column).
The revenue an order produces is the sum of the revenue from the orderdetails in the order, but only if the order's status is shipped. The two tables orderdetails and order are related by the column ordernumber.
An order has a customer number that references the customer table, and the customer table has a country field. The total_country_revenue is the sum over all shipped orders for customers in a country.
So far I have tried first using GROUP BY (on ordernumber or customer number?) to produce a table with the orderdetails revenue and the customer number, then joining that with customer and using GROUP BY again, but I keep getting weird results.
-orderdetails table-
ordernumber  quantityordered  price_each
1            10               2.39
1            12               1.79
2            12               1.79
3            12               1.79
-orders table-
ordernumber  status     customer_num
1            shipped    11
1            shipped    12
2            cancelled  13
3            shipped    11
-customers table-
custom_num  country
11          USA
12          France
13          Japan
11          USA
-Result table-
country  total_revenue
11       1300
12       1239
13       800
11       739
Your description is a bit weird. You are writing that you want to build the sum per country, but in your table which should show the desired outcome, you didn't build a sum and you also don't show the country.
Furthermore, you wrote you want to exclude orders that don't have the status "shipped", but your sample outcome includes them.
This query will produce the outcome you have described in words, not the one you have added as a table:
SELECT c.country,
SUM(d.quantityordered * d.price_each) AS total_revenue
FROM
orders o
JOIN orderdetails d ON o.ordernumber = d.ordernumber
JOIN customers c ON o.customer_num = c.custom_num
WHERE o.status = 'shipped'
GROUP BY c.country;
As you can see, you will need to JOIN your tables and apply a GROUP BY country clause.
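For a quick check of that query, here is a small sketch that runs it in SQLite via Python's sqlite3 module. The rows are a cleaned-up version of the question's sample data: I have assumed each ordernumber belongs to one customer and each customer appears once, which the original sample does not guarantee. ROUND and ORDER BY are added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderdetails (ordernumber INTEGER, quantityordered INTEGER, price_each REAL);
    CREATE TABLE orders (ordernumber INTEGER PRIMARY KEY, status TEXT, customer_num INTEGER);
    CREATE TABLE customers (custom_num INTEGER PRIMARY KEY, country TEXT);
    -- de-duplicated sample rows (an assumption, see the note above)
    INSERT INTO orderdetails VALUES (1, 10, 2.39), (1, 12, 1.79), (2, 12, 1.79), (3, 12, 1.79);
    INSERT INTO orders VALUES (1, 'shipped', 11), (2, 'cancelled', 13), (3, 'shipped', 12);
    INSERT INTO customers VALUES (11, 'USA'), (12, 'France'), (13, 'Japan');
""")

rows = conn.execute("""
    SELECT c.country,
           ROUND(SUM(d.quantityordered * d.price_each), 2) AS total_revenue
    FROM orders o
    JOIN orderdetails d ON o.ordernumber = d.ordernumber
    JOIN customers c ON o.customer_num = c.custom_num
    WHERE o.status = 'shipped'
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

# One row per country that has at least one shipped order.
for country, total_revenue in rows:
    print(country, total_revenue)
```

Japan does not appear because its only order is cancelled. France gets the single line of order 3 and the USA gets both lines of order 1.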
A note: You could remove the WHERE clause and add its condition to a JOIN. It's possible this will reduce the execution time of your query, but it might be less readable.
A further note: You could also consider using a window function with PARTITION BY c.country. Since you didn't tag your DB type, the exact syntax for that option is unclear.
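As one possible shape of the window-function variant, here is a sketch in SQLite syntax (window functions need SQLite 3.25 or later), run through Python's sqlite3 module. Other databases may differ in details, and the sample rows are again a de-duplicated version of the question's data, which is my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderdetails (ordernumber INTEGER, quantityordered INTEGER, price_each REAL);
    CREATE TABLE orders (ordernumber INTEGER PRIMARY KEY, status TEXT, customer_num INTEGER);
    CREATE TABLE customers (custom_num INTEGER PRIMARY KEY, country TEXT);
    INSERT INTO orderdetails VALUES (1, 10, 2.39), (1, 12, 1.79), (2, 12, 1.79), (3, 12, 1.79);
    INSERT INTO orders VALUES (1, 'shipped', 11), (2, 'cancelled', 13), (3, 'shipped', 12);
    INSERT INTO customers VALUES (11, 'USA'), (12, 'France'), (13, 'Japan');
""")

base = """
    FROM orders o
    JOIN orderdetails d ON o.ordernumber = d.ordernumber
    JOIN customers c ON o.customer_num = c.custom_num
    WHERE o.status = 'shipped'
"""
window_total = "ROUND(SUM(d.quantityordered * d.price_each) OVER (PARTITION BY c.country), 2)"

# A window function keeps one row per order line, each carrying its country total...
per_line = conn.execute(
    f"SELECT c.country, {window_total} {base} ORDER BY c.country"
).fetchall()

# ...so DISTINCT is needed to collapse them to one row per country.
totals = conn.execute(
    f"SELECT DISTINCT c.country, {window_total} {base} ORDER BY c.country"
).fetchall()

print(len(per_line), len(totals))  # 3 2
```

Because the window version needs the extra DISTINCT to get one row per country, the plain GROUP BY is usually the simpler choice when only the totals are wanted.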
A last note: Your sample data looks really strange. Is it really intended that an order should be counted for both France and the USA at the same time?
If the query above isn't what you were looking for, please review your description and fix it.
DATA Explanation
I have two data tables, one (PAGE VIEWS) which represents user events (CV 1,2,3 etc) and associated timestamp with member ID. The second table (ORDERS) represents the orders made - event time & order value. Membership ID is available on each table.
Table 1 - PAGE VIEWS (1,000 Rows in Total)
Event_Day   Member ID  CV1     CV2    CV3      CV4
11/5/2021   115126     APP     camp1  Trigger  APP-camp1-Trigger
11/14/2021  189192     SEARCH  camp4  Search   SEARCH-camp4-Search
11/5/2021   193320     SEARCH  camp5  Search   SEARCH-camp5-Search
Table 2 - ORDERS (249 rows in total)
Date        Purchase Order ID  Membership Number  Order Value
7/12/2021   0088               183300             29.34
18/12/2021  0180               132159             132.51
4/12/2021   0050               141542             24.35
What I'm trying to answer
I'd like to attribute the CV columns (PAGE VIEWS) with the (ORDERS) order value, by the earliest event date in (PAGE VIEWS). This would be a simple attribution use case.
Visual explanation of the two data tables
Issues
I've spent the weekend scrolling through a variety of online articles, but the closest I've got is the following query:
Select min (event_day) As "first date",member_id,cv2,order_value,purchase_order_id
from mta_app_allpages,mta_app_orders
where member_id = membership_number
group by member_id,cv2,order_value,purchase_order_id;
The resulting data is correct using the DISTINCT function as Row 2 is different to Row 1, but I'd like to associate the result to Row 1 for member_id 113290, and row 3 for member_id 170897 etc.
Date        member_id  cv2    Order Value
2021-11-01  113290     camp5  58.81
2021-11-05  113290     camp4  58.51
2021-11-03  170897     camp3  36.26
2021-11-09  170897     camp5  36.26
2021-11-24  170897     camp1  36.26
Image showing the results table
I've tried using partition and sub-query functions with little success. The correct call should return a maximum of 249 rows, as that is as many rows as I have in the ORDERS table.
First-time poster so hopefully I have the format right. Many thanks.
Using RANK() is the best approach:
select * from
(
select *, RANK()OVER(partition by membership_number order by Event_Day) as rnk
from page_views as pv
INNER JOIN orders as o
ON pv.Member_ID=o.Membership_Number
) as q
where rnk=1
This will only fetch the minimum event_day.
However, you can use MIN() to achieve the same result (at the cost of a more complex sub-query):
select *
from
(select pv.*
from page_views as pv
inner join
(
select Member_ID, min(event_day) as mn_dt
from page_views
group by member_id
) as mn
ON mn.Member_ID=pv.Member_ID and mn.mn_dt=pv.event_day
)as sq
INNER JOIN orders as o
ON sq.Member_ID=o.Membership_Number
Both queries return the same result.
See the demo in db<>fiddle
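If you want to try both approaches locally, here is a small sketch using SQLite through Python's sqlite3 module (RANK() needs SQLite 3.25 or later). The rows are loosely based on the question's result table, the column list is trimmed to cv2 and order_value, and dates are stored as ISO text so that MIN and ORDER BY sort them chronologically. All of that is my assumption, not the poster's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (event_day TEXT, member_id INTEGER, cv2 TEXT);
    CREATE TABLE orders (membership_number INTEGER, order_value REAL);
    INSERT INTO page_views VALUES
        ('2021-11-01', 113290, 'camp5'), ('2021-11-05', 113290, 'camp4'),
        ('2021-11-03', 170897, 'camp3'), ('2021-11-09', 170897, 'camp5'),
        ('2021-11-24', 170897, 'camp1');
    INSERT INTO orders VALUES (113290, 58.81), (170897, 36.26);
""")

# Approach 1: rank page views per member by date and keep rank 1.
by_rank = conn.execute("""
    SELECT member_id, event_day, cv2, order_value
    FROM (SELECT pv.member_id, pv.event_day, pv.cv2, o.order_value,
                 RANK() OVER (PARTITION BY o.membership_number
                              ORDER BY pv.event_day) AS rnk
          FROM page_views pv
          JOIN orders o ON pv.member_id = o.membership_number) q
    WHERE rnk = 1
    ORDER BY member_id
""").fetchall()

# Approach 2: find each member's earliest date with MIN(), then join back.
by_min = conn.execute("""
    SELECT pv.member_id, pv.event_day, pv.cv2, o.order_value
    FROM page_views pv
    JOIN (SELECT member_id, MIN(event_day) AS mn_dt
          FROM page_views
          GROUP BY member_id) mn
      ON mn.member_id = pv.member_id AND mn.mn_dt = pv.event_day
    JOIN orders o ON pv.member_id = o.membership_number
    ORDER BY pv.member_id
""").fetchall()

print(by_rank)
print(by_rank == by_min)  # True
```

Note that RANK() returns every tied row, so a member with two page views on the same earliest day would still produce two rows in both versions. ROW_NUMBER() would be one way to force a single row.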
I have two tables that I want to join together:
contracts:
id  customer_id_1  customer_id_2  customer_id_3  date
1   MAIN1          TRAN1          TRAN2          20201101
2   MAIN2                                        20201001
3   MAIN3          TRAN5                         20200901
4   MAIN4          TRAN7          TRAN8          20200801
customers:
id  customer_id  info  date
1   MAIN1        blah  20200930
2   TRAN2        blah  20200929
3   TRAN5        blah  20200831
4   TRAN7        blah  20200801
In my contracts table, each row represents a contract with a customer, who may have 1 or more different IDs they are referred to by in the customers table. In the customers table, I have info on customers (can be zero or multiple records on different dates for each customer). I want to perform a join from contracts onto customers such that I get the most recent info available on a customer at the time a contract is recorded, ignoring any potential customer info that may be available after the contract date. I am also not interested in contracts which have no info on the customers. The main problem here is that in customers, each customer record can reference any 1 of the 3 IDs that may exist.
I currently have the following query, which performs the task as intended, but the problem is that it is extremely slow when run on data in the 50-100k rows range. If I remove the OR conditions in the INNER JOIN and just join on the first ID, the query runs in seconds as opposed to ~ half an hour.
SELECT
DISTINCT ON (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date AS contract_date,
cst.info,
cst.date AS info_date
FROM
contracts ctr
INNER JOIN customers cst ON (
cst.customer_id = ctr.customer_id_1
OR cst.customer_id = ctr.customer_id_2
OR cst.customer_id = ctr.customer_id_3
)
AND ctr.date >= cst.date
ORDER BY
ctr.id,
cst.date DESC
Result:
id  customer_id_1  contract_date  info  info_date
1   MAIN1          20201101       blah  20200930
3   MAIN3          20200901       blah  20200831
4   MAIN4          20200801       blah  20200801
It seems like OR statements in JOINs aren't very common (I've barely found any examples online) and I presume this is because there must be a better way of doing this. So my question is, how can this be optimised?
OR often is a performance killer in SQL predicates.
One alternative unpivots before joining:
select distinct on (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date as contract_date,
cst.info,
cst.date as info_date
from contracts ctr
cross join lateral (values
(ctr.customer_id_1), (ctr.customer_id_2), (ctr.customer_id_3)
) as ctx(customer_id)
inner join customers cst on cst.customer_id = ctx.customer_id and ctr.date >= cst.date
order by ctr.id, cst.date desc
Needing this technique suggests that you could vastly improve your data model: the relation between contracts and customers should be stored in a separate table, with each customer/contract tuple on a separate row. Essentially, the lateral join in the query builds that derived table virtually.
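To illustrate what that separate table could look like, here is a sketch run in SQLite through Python's sqlite3 module. The contract_customers table and the contract_date / info_date column names are hypothetical. Since SQLite has no DISTINCT ON, ROW_NUMBER() (SQLite 3.25 or later) is swapped in to pick the most recent row per contract. The data is the question's sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (id INTEGER PRIMARY KEY, contract_date TEXT);
    -- hypothetical link table: one row per contract/customer-id pair
    CREATE TABLE contract_customers (contract_id INTEGER, customer_id TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, customer_id TEXT, info TEXT, info_date TEXT);
    INSERT INTO contracts VALUES
        (1, '20201101'), (2, '20201001'), (3, '20200901'), (4, '20200801');
    INSERT INTO contract_customers VALUES
        (1, 'MAIN1'), (1, 'TRAN1'), (1, 'TRAN2'), (2, 'MAIN2'),
        (3, 'MAIN3'), (3, 'TRAN5'), (4, 'MAIN4'), (4, 'TRAN7'), (4, 'TRAN8');
    INSERT INTO customers VALUES
        (1, 'MAIN1', 'blah', '20200930'), (2, 'TRAN2', 'blah', '20200929'),
        (3, 'TRAN5', 'blah', '20200831'), (4, 'TRAN7', 'blah', '20200801');
    CREATE INDEX idx_customers_customer_id ON customers (customer_id);
""")

# Plain equality joins (no OR); ROW_NUMBER keeps the latest info per contract.
rows = conn.execute("""
    SELECT contract_id, contract_date, info, info_date
    FROM (SELECT ctr.id AS contract_id, ctr.contract_date, cst.info, cst.info_date,
                 ROW_NUMBER() OVER (PARTITION BY ctr.id
                                    ORDER BY cst.info_date DESC) AS rn
          FROM contracts ctr
          JOIN contract_customers cc ON cc.contract_id = ctr.id
          JOIN customers cst ON cst.customer_id = cc.customer_id
                            AND ctr.contract_date >= cst.info_date) t
    WHERE rn = 1
    ORDER BY contract_id
""").fetchall()

for row in rows:
    print(row)
```

This returns contracts 1, 3 and 4 with the same info dates as the result table in the question. With the link table in place every join condition is a simple equality, which an index on customers.customer_id can serve.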
I am trying to create a query in which I start with an item number and a customer and I have to determine the last selling price.
The tables involved are
SOP30200 = Sales Header
SOP30300 = Sales Detail lines
Given the following code and results:
CODE:
SELECT
SOP30200.CUSTNMBR,
MAX(SOP30200.DOCDATE),
SOP30300.ITEMNMBR,
SOP30300.UNITPRCE
FROM
SOP30200
INNER JOIN
SOP30300 ON
SOP30300.SOPNUMBE = SOP30200.SOPNUMBE AND
SOP30300.SOPTYPE = SOP30200.SOPTYPE
WHERE
SOP30200.SOPTYPE = 3 AND
SOP30200.CUSTNMBR = 'FAKECUST' AND
SOP30300.ITEMNMBR = 'FAKEITEM'
GROUP BY
SOP30200.CUSTNMBR,
SOP30300.ITEMNMBR,
SOP30300.UNITPRCE
RESULTS:
CUSTNMBR (No column name) ITEMNMBR UNITPRCE
FAKECUST 2013-07-12 00:00:00.000 FAKEITEM 16.80000
FAKECUST 2014-02-14 00:00:00.000 FAKEITEM 17.14000
I am getting 2 records because the query is grouped by UNITPRCE and we have sold this item to this customer at two different prices. That much I know, however, I want to see those four fields but only one record that has the latest date.
Add ORDER BY MAX(SOP30200.DOCDATE) DESC to the end of the query and change SELECT to SELECT TOP 1, so that only the row with the latest date is returned.
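Here is a rough, runnable sketch of that idea using SQLite through Python's sqlite3 module. SQLite has no TOP, so LIMIT 1 stands in for SELECT TOP 1. The two invoices, their SOPNUMBE values, and the trimmed-down column lists are invented to mirror the results shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- cut-down stand-ins for the header and detail tables
    CREATE TABLE SOP30200 (SOPNUMBE TEXT, SOPTYPE INTEGER, CUSTNMBR TEXT, DOCDATE TEXT);
    CREATE TABLE SOP30300 (SOPNUMBE TEXT, SOPTYPE INTEGER, ITEMNMBR TEXT, UNITPRCE REAL);
    INSERT INTO SOP30200 VALUES
        ('INV1', 3, 'FAKECUST', '2013-07-12'), ('INV2', 3, 'FAKECUST', '2014-02-14');
    INSERT INTO SOP30300 VALUES
        ('INV1', 3, 'FAKEITEM', 16.80), ('INV2', 3, 'FAKEITEM', 17.14);
""")

# Same grouped query as in the question, sorted newest first and cut to one row.
last_sale = conn.execute("""
    SELECT h.CUSTNMBR, MAX(h.DOCDATE) AS LASTDATE, d.ITEMNMBR, d.UNITPRCE
    FROM SOP30200 h
    INNER JOIN SOP30300 d
        ON d.SOPNUMBE = h.SOPNUMBE AND d.SOPTYPE = h.SOPTYPE
    WHERE h.SOPTYPE = 3
      AND h.CUSTNMBR = 'FAKECUST'
      AND d.ITEMNMBR = 'FAKEITEM'
    GROUP BY h.CUSTNMBR, d.ITEMNMBR, d.UNITPRCE
    ORDER BY MAX(h.DOCDATE) DESC
    LIMIT 1   -- SQLite's counterpart of SELECT TOP 1
""").fetchone()

print(last_sale)  # ('FAKECUST', '2014-02-14', 'FAKEITEM', 17.14)
```

Only the 2014 row survives, which carries the most recent unit price for that customer and item.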
I have a state machine architecture, where a record will have many state transitions, the one with the greatest sort_key column being the current state. My problem is to determine which records held a particular state (or states) for a given date.
Example data:
items table
id
1
item_transitions table
id item_id created_at to_state sort_key
1 1 05/10 "state_a" 1
2 1 05/12 "state_b" 2
3 1 05/15 "state_a" 3
4 1 05/16 "state_b" 4
Problem:
Determine all records from items table which held state "state_a" on date 05/15. This should obviously return the item in the example data, but if you query with date "05/16", it should not.
I presume I'll be using a LEFT OUTER JOIN to join the item_transitions table to itself and narrow down the possibilities until I have something to query on that will give me the items that I need. Perhaps I am overlooking something much simpler.
Your question rephrased means "give me all items which have been changed to state_a on 05/15 or before and have not changed to another state afterwards". Please note that for the example I added 2001 as the year to get a valid date. If your "created_at" column is not a datetime, I strongly suggest changing it.
So first you can retrieve the last sort_key for all items before the threshold date:
SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id
Next step is to join this result back to the item_transitions table to see to which state the item was switched at this specific sort_key:
SELECT *
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
Finally you only want those who switched to 'state_a' so just add a condition:
SELECT DISTINCT it.item_id
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
WHERE it.to_state='state_a'
You did not mention which DBMS you use, but I think this query should work with the most common ones.
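To check the final query against the example data, here is a small sketch using SQLite through Python's sqlite3 module. The items_in_state helper is a hypothetical wrapper, and created_at is stored as ISO text (with 2001 as the year, as above) so that the <= comparison is chronological in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_transitions (
        id INTEGER PRIMARY KEY, item_id INTEGER,
        created_at TEXT, to_state TEXT, sort_key INTEGER);
    INSERT INTO item_transitions VALUES
        (1, 1, '2001-05-10', 'state_a', 1),
        (2, 1, '2001-05-12', 'state_b', 2),
        (3, 1, '2001-05-15', 'state_a', 3),
        (4, 1, '2001-05-16', 'state_b', 4);
""")


def items_in_state(conn, state, as_of):
    """Return ids of items whose latest transition on or before as_of went to state."""
    rows = conn.execute("""
        SELECT DISTINCT it.item_id
        FROM item_transitions it
        JOIN (SELECT item_id, MAX(sort_key) AS last_change_sort_key
              FROM item_transitions
              WHERE created_at <= ?
              GROUP BY item_id) tmp
          ON it.item_id = tmp.item_id
         AND it.sort_key = tmp.last_change_sort_key
        WHERE it.to_state = ?
        ORDER BY it.item_id
    """, (as_of, state)).fetchall()
    return [item_id for (item_id,) in rows]


print(items_in_state(conn, "state_a", "2001-05-15"))  # [1]
print(items_in_state(conn, "state_a", "2001-05-16"))  # []
```

On 05/15 the latest transition is the one to state_a, so item 1 is returned. On 05/16 the latest transition is to state_b, so nothing is returned, which matches the behaviour asked for in the question.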