How to retrieve the earliest date from columns that have different dimensions? - sql

DATA Explanation
I have two data tables, one (PAGE VIEWS) which represents user events (CV 1,2,3 etc) and associated timestamp with member ID. The second table (ORDERS) represents the orders made - event time & order value. Membership ID is available on each table.
Table 1 - PAGE VIEWS (1,000 Rows in Total)
Event_Day | Member ID | CV1 | CV2 | CV3 | CV4
11/5/2021 | 115126 | APP | camp1 | Trigger | APP-camp1-Trigger
11/14/2021 | 189192 | SEARCH | camp4 | Search | SEARCH-camp4-Search
11/5/2021 | 193320 | SEARCH | camp5 | Search | SEARCH-camp5-Search
Table 2 - ORDERS (249 rows in total)
Date | Purchase Order ID | Membership Number | Order Value
7/12/2021 | 0088 | 183300 | 29.34
18/12/2021 | 0180 | 132159 | 132.51
4/12/2021 | 0050 | 141542 | 24.35
What I'm trying to answer
I'd like to attribute the CV columns (PAGE VIEWS) with the (ORDERS) order value, by the earliest event date in (PAGE VIEWS). This would be a simple attribution use case.
Visual explanation of the two data tables
Issues
I've spent the weekend researching and scrolling through a variety of online articles, but the closest I've got is the following query:
Select min (event_day) As "first date",member_id,cv2,order_value,purchase_order_id
from mta_app_allpages,mta_app_orders
where member_id = membership_number
group by member_id,cv2,order_value,purchase_order_id;
The resulting data is correct using the DISTINCT function as Row 2 is different to Row 1, but I'd like to associate the result to Row 1 for member_id 113290, and row 3 for member_id 170897 etc.
Date | member_id | cv2 | Order Value
2021-11-01 | 113290 | camp5 | 58.81
2021-11-05 | 113290 | camp4 | 58.51
2021-11-03 | 170897 | camp3 | 36.26
2021-11-09 | 170897 | camp5 | 36.26
2021-11-24 | 170897 | camp1 | 36.26
Image showing the results table
I've tried using partition and sub-query functions with little success. The correct query should return a maximum of 249 rows, as that is how many rows I have in the ORDERS table.
First-time poster so hopefully I have the format right. Many thanks.

Using RANK() is the best approach:
select * from
(
select *, RANK()OVER(partition by membership_number order by Event_Day) as rnk
from page_views as pv
INNER JOIN orders as o
ON pv.Member_ID=o.Membership_Number
) as q
where rnk=1
This will only fetch the minimum event_day.
However, you can use MIN() to achieve the same (but with complex sub-query):
select *
from
(select pv.*
from page_views as pv
inner join
(
select Member_ID, min(event_day) as mn_dt
from page_views
group by member_id
) as mn
ON mn.Member_ID=pv.Member_ID and mn.mn_dt=pv.event_day
)as sq
INNER JOIN orders as o
ON sq.Member_ID=o.Membership_Number
Both the queries will get us the same answer.
See the demo in db<>fiddle
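To check that the two queries really agree, here is a minimal sketch using Python's built-in sqlite3 module (this assumes SQLite 3.25 or later for window functions). The sample rows reuse member IDs and order values from the question's result table; the ISO-formatted dates, order dates, and purchase order IDs are made up for illustration, and dates are stored as ISO text so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (event_day TEXT, member_id INTEGER, cv2 TEXT);
    CREATE TABLE orders (order_date TEXT, purchase_order_id TEXT,
                         membership_number INTEGER, order_value REAL);
    -- Hypothetical sample rows, ISO dates so text ordering = date ordering
    INSERT INTO page_views VALUES
        ('2021-11-01', 113290, 'camp5'),
        ('2021-11-05', 113290, 'camp4'),
        ('2021-11-03', 170897, 'camp3'),
        ('2021-11-09', 170897, 'camp5'),
        ('2021-11-24', 170897, 'camp1');
    INSERT INTO orders VALUES
        ('2021-12-07', '0001', 113290, 58.81),
        ('2021-12-04', '0002', 170897, 36.26);
""")

# Approach 1: rank page views per member and keep the earliest one.
rank_sql = """
    SELECT member_id, event_day, cv2, order_value
    FROM (
        SELECT pv.member_id, pv.event_day, pv.cv2, o.order_value,
               RANK() OVER (PARTITION BY o.membership_number
                            ORDER BY pv.event_day) AS rnk
        FROM page_views AS pv
        INNER JOIN orders AS o ON pv.member_id = o.membership_number
    ) AS q
    WHERE rnk = 1
    ORDER BY member_id
"""

# Approach 2: find MIN(event_day) per member, then join back.
min_sql = """
    SELECT pv.member_id, pv.event_day, pv.cv2, o.order_value
    FROM page_views AS pv
    INNER JOIN (
        SELECT member_id, MIN(event_day) AS mn_dt
        FROM page_views
        GROUP BY member_id
    ) AS mn ON mn.member_id = pv.member_id AND mn.mn_dt = pv.event_day
    INNER JOIN orders AS o ON pv.member_id = o.membership_number
    ORDER BY pv.member_id
"""

rank_rows = conn.execute(rank_sql).fetchall()
min_rows = conn.execute(min_sql).fetchall()
print(rank_rows)
print(rank_rows == min_rows)
```

With one order per member, each query returns one row per member, attributed to the earliest page view. Note that RANK() keeps ties, so two page views on the same earliest day would both come back.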


COUNT with multiple LEFT joins [duplicate]

I am having some troubles with a count function. The problem is given by a left join that I am not sure I am doing correctly.
Variables are:
Customer_name (buyer)
Product_code (what the customer buys)
Store (where the customer buys)
The datasets are:
Customer_df (list of customers and product codes of their purchases)
Store1_df (list of product codes per week, for Store 1)
Store2_df (list of product codes per day, for Store 2)
Final output desired:
I would like to have a table with:
col1: Customer_name;
col2: Count of items purchased in store 1;
col3: Count of items purchased in store 2;
Filters: date range
My query looks like this:
SELECT
DISTINCT
C.customer_name,
C.product_code,
COUNT(S1.product_code) AS s1_sales,
COUNT(S2.product_code) AS s2_sales,
FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)
GROUP BY
customer_name, product_code
HAVING
S1_sales > 0
OR S2_sales > 0
The output I expect is something like this:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales
Luigi | 120012 | 4 | 8
James | 100022 | 6 | 10
But instead, I get:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales
Luigi | 120012 | 290 | 60
James | 100022 | 290 | 60
It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I count distinct values and take into account more than 1 week of data, I will not get the right numbers).
My hypotheses are:
I am joining the tables in the wrong way
There is a problem when joining two datasets with different time aggregations
What am I doing wrong?
The cause, as Philipxy indicated, is a common one. You are getting a Cartesian result from your data, which bloats your numbers. To simplify, let's consider just a single customer purchasing one item from two stores. The first store has 3 purchases and the second store has 5. Your total count is 3 * 5 = 15, because each entry in the first store is joined to every entry with the same customer id in the second. So the 1st purchase is joined to second-store purchases 1-5, then the 2nd purchase is joined to second-store purchases 1-5, and so on, and you can see the bloat. If each store is instead pre-queried for its aggregates per customer, there will be AT MOST one record per customer per store (and per product, as per your desired outcome).
select
c.customer_name,
AllCustProducts.Product_Code,
coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
from
customer_df c
-- now, we need all possible UNIQUE instances of
-- a given customer and product to prevent duplicates
-- for subsequent queries of sales per customer and store
JOIN
( select distinct customerid, product_code
from store1_df
union
select distinct customerid, product_code
from store2_df ) AllCustProducts
on c.customerid = AllCustProducts.customerid
-- NOW, we can join to a pre-query of sales at store 1
-- by customer id and product code. You may also want to
-- get sum( SalesDollars ) if available, just add respectively
-- to each sub-query below.
LEFT JOIN
( select
s1.customerid,
s1.product_code,
count(*) as SalesEntries
from
store1_df s1
group by
s1.customerid,
s1.product_code ) PQStore1
on AllCustProducts.customerid = PQStore1.customerid
AND AllCustProducts.product_code = PQStore1.product_code
-- now, same pre-aggregation to store 2
LEFT JOIN
( select
s2.customerid,
s2.product_code,
count(*) as SalesEntries
from
store2_df s2
group by
s2.customerid,
s2.product_code ) PQStore2
on AllCustProducts.customerid = PQStore2.customerid
AND AllCustProducts.product_code = PQStore2.product_code
There is no need for a group by or having in the outer query, since each pre-aggregate returns a maximum of 1 record per unique combination. As for filtering by date range, I would just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 sub-queries.
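The 3 * 5 bloat described above is easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module; the customerid column follows the answer's assumption, and the rows are made up to match the "3 purchases in one store, 5 in the other" example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store1_df (customerid INTEGER, product_code INTEGER);
    CREATE TABLE store2_df (customerid INTEGER, product_code INTEGER);
    -- One customer, one product: 3 purchases in store 1, 5 in store 2
    INSERT INTO store1_df VALUES (1, 120012), (1, 120012), (1, 120012);
    INSERT INTO store2_df VALUES (1, 120012), (1, 120012), (1, 120012),
                                 (1, 120012), (1, 120012);
""")

# Joining the raw rows pairs every store-1 row with every store-2 row.
naive = conn.execute("""
    SELECT COUNT(s1.product_code), COUNT(s2.product_code)
    FROM store1_df AS s1
    JOIN store2_df AS s2
      ON s1.customerid = s2.customerid
     AND s1.product_code = s2.product_code
""").fetchone()

# Aggregating each store first leaves one row per customer and product,
# so the join can no longer multiply rows.
pre_aggregated = conn.execute("""
    SELECT a.n, b.n
    FROM (SELECT customerid, product_code, COUNT(*) AS n
          FROM store1_df GROUP BY customerid, product_code) AS a
    JOIN (SELECT customerid, product_code, COUNT(*) AS n
          FROM store2_df GROUP BY customerid, product_code) AS b
      ON a.customerid = b.customerid
     AND a.product_code = b.product_code
""").fetchone()

print(naive)           # (15, 15)
print(pre_aggregated)  # (3, 5)
```

The naive join reports 15 for both stores, while the pre-aggregated join reports the true 3 and 5.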

I need to get a total of the counts for a query in PostgreSQL

I created the select query below. Now I need the total of the "No.of Ideas generated" column in a separate row labelled Total, i.e. the sum of the individual counts for each idea_sector and idea_industry combination.
Query:
select c.idea_sector,c.idea_industry,
count(*) as "No.of Ideas generated"
from hackathon2k21.consolidated_report c
group by idea_sector,idea_industry
order by idea_sector ,idea_industry
Output:
----------------------------------------------------------------------
idea_sector idea_industry No.of Ideas generated
-----------------------------------------------------------------------
COMMUNICATION-ROC TELECOMMUNICATIONS 1
Cross Sector Cross Industry 5
DISTRIBUTION TRAVEL AND TRANSPORTATION 1
FINANCIAL SERVICES BANKING 1
PUBLIC HEALTHCARE 1
Required output:
----------------------------------------------------------------------
idea_sector idea_industry No.of Ideas generated
-----------------------------------------------------------------------
COMMUNICATION-ROC TELECOMMUNICATIONS 1
Cross Sector Cross Industry 5
DISTRIBUTION TRAVEL AND TRANSPORTATION 1
FINANCIAL SERVICES BANKING 1
PUBLIC HEALTHCARE 1
------------------------------------------------------------------------
Total 9
You can accomplish this with grouping sets. That's where we tell Postgres, in the GROUP BY clause, all of the different ways we would like to see our result set grouped for the aggregated column(s):
SELECT
c.idea_sector,
c.idea_industry,
count(*) as "No.of Ideas generated"
FROM hackathon2k21.consolidated_report c
GROUP BY
GROUPING SETS (
(idea_sector,idea_industry),
())
ORDER BY idea_sector ,idea_industry;
This generates two grouping sets. One that groups by idea_sector, idea_industry granularity like in your existing sql and another that groups by nothing, essentially creating a full table Total.
The easiest way seems to be adding a UNION ALL operator like this:
select c.idea_sector,c.idea_industry,
count(*) as "No.of Ideas generated"
from hackathon2k21.consolidated_report c
group by idea_sector,idea_industry
--order by idea_sector ,idea_industry
UNION ALL
SELECT 'Total', NULL, COUNT(*)
from hackathon2k21.consolidated_report
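One thing the UNION ALL version leaves open is row order, since nothing forces the Total row to come last. Here is a minimal sketch of one way to pin it down, run through Python's built-in sqlite3 module only so that it is self-contained. The is_total helper column and the sample rows are assumptions for illustration, and the schema prefix is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE consolidated_report (idea_sector TEXT, idea_industry TEXT);
    -- Hypothetical sample rows
    INSERT INTO consolidated_report VALUES
        ('COMMUNICATION-ROC', 'TELECOMMUNICATIONS'),
        ('Cross Sector', 'Cross Industry'),
        ('Cross Sector', 'Cross Industry'),
        ('PUBLIC', 'HEALTHCARE');
""")

rows = conn.execute("""
    SELECT idea_sector, idea_industry, ideas
    FROM (
        -- Per-group counts, flagged 0 so they sort first
        SELECT idea_sector, idea_industry, COUNT(*) AS ideas, 0 AS is_total
        FROM consolidated_report
        GROUP BY idea_sector, idea_industry
        UNION ALL
        -- Grand total, flagged 1 so it sorts last
        SELECT 'Total', NULL, COUNT(*), 1
        FROM consolidated_report
    )
    ORDER BY is_total, idea_sector, idea_industry
""").fetchall()

for row in rows:
    print(row)
# ('COMMUNICATION-ROC', 'TELECOMMUNICATIONS', 1)
# ('Cross Sector', 'Cross Industry', 2)
# ('PUBLIC', 'HEALTHCARE', 1)
# ('Total', None, 4)
```

The same wrapping subquery and ORDER BY should carry over to PostgreSQL, where the derived table additionally needs an alias on older versions.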

Determine records which held particular "state" on a given date

I have a state machine architecture, where a record will have many state transitions, the one with the greatest sort_key column being the current state. My problem is to determine which records held a particular state (or states) for a given date.
Example data:
items table
id
1
item_transitions table
id item_id created_at to_state sort_key
1 1 05/10 "state_a" 1
2 1 05/12 "state_b" 2
3 1 05/15 "state_a" 3
4 1 05/16 "state_b" 4
Problem:
Determine all records from items table which held state "state_a" on date 05/15. This should obviously return the item in the example data, but if you query with date "05/16", it should not.
I presume I'll be using a LEFT OUTER JOIN to join the items_transitions table to itself and narrow down the possibilities until I have something to query on that will give me the items that I need. Perhaps I am overlooking something much simpler.
Your question rephrased means "give me all items whose most recent state change on or before 05/15 was to state_a". Please note that for the example I added 2001 as the year to get a valid date. If your "created_at" column is not a datetime, I strongly suggest changing it.
So first you can retrieve the last sort_key for all items before the threshold date:
SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id
Next step is to join this result back to the item_transitions table to see to which state the item was switched at this specific sort_key:
SELECT *
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
Finally you only want those who switched to 'state_a' so just add a condition:
SELECT DISTINCT it.item_id
FROM item_transitions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transitions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
WHERE it.to_state='state_a'
You did not mention which DBMS you use, but I think this query should work with the most common ones.
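The final query can be tried against the question's sample data. This is a minimal sketch using Python's built-in sqlite3 module; the items_in_state helper is a hypothetical name, and the dates are stored as ISO text (with the assumed year 2001) so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_transitions (
        id INTEGER, item_id INTEGER, created_at TEXT,
        to_state TEXT, sort_key INTEGER);
    INSERT INTO item_transitions VALUES
        (1, 1, '2001-05-10', 'state_a', 1),
        (2, 1, '2001-05-12', 'state_b', 2),
        (3, 1, '2001-05-15', 'state_a', 3),
        (4, 1, '2001-05-16', 'state_b', 4);
""")


def items_in_state(state, as_of):
    """Return ids of items whose latest transition on or before as_of
    was into the given state."""
    sql = """
        SELECT DISTINCT it.item_id
        FROM item_transitions AS it
        JOIN (SELECT item_id, MAX(sort_key) AS last_change_sort_key
              FROM item_transitions
              WHERE created_at <= ?
              GROUP BY item_id) AS tmp
          ON it.item_id = tmp.item_id
         AND it.sort_key = tmp.last_change_sort_key
        WHERE it.to_state = ?
    """
    return [row[0] for row in conn.execute(sql, (as_of, state))]


print(items_in_state("state_a", "2001-05-15"))  # [1]
print(items_in_state("state_a", "2001-05-16"))  # []
```

Item 1 is returned for 05/15 but not for 05/16, which matches the behaviour the question asks for.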

How do you retrieve the top two records within each grouping

In my table, I have data that looks like this:
CODE DATE PRICE
100 1/1/13 $500
100 2/1/13 $521
100 3/3/13 $530
100 5/9/13 $542
222 3/3/13 $20
350 1/1/13 $200
350 3/1/13 $225
Is it possible to create query to pull out the TWO most recent records by DATE? AND only if there are 2+ dates for a specific code. So the result would be:
CODE DATE PRICE
100 5/9/13 $542
100 3/3/13 $530
350 3/1/13 $225
350 1/1/13 $200
Bonus points if you can put both prices/dates on the same line, like this:
CODE OLD_DATE OLD_PRICE NEW_DATE NEW_PRICE
100 3/3/13 $530 5/9/13 $542
350 1/1/13 $200 3/1/13 $225
Thank you!!!
I managed to solve it with 5 sub-queries and 1 rollup query.
First we have a subquery that gives us the MAX date for each code.
Next, we do the same subquery, except we exclude our previous results.
We assume that your data is already rolled up and you won't have duplicate dates for the same code.
Next we bring in the appropriate Code / Price for the latest and 2nd latest date. If a code doesn't exist in the 2nd Max query - then we don't include it at all.
In the union query we're combining the results of both. In the Rollup Query, we're sorting and removing null values generated in the union.
Results:
CODE MaxOfOLDDATE MaxOfOLDPRICE MaxOfNEWDATE MaxOfNEWPRICE
100 2013-03-03 $530.00 2013-05-09 542
350 2013-01-01 $200.00 2013-03-01 225
Using your Data in a table called "Table", create the following queries:
SUB_2ndMaxDatesPerCode:
SELECT Table.CODE, Max(Table.Date) AS MaxOfDATE1
FROM SUB_MaxDatesPerCode RIGHT JOIN [Table] ON (SUB_MaxDatesPerCode.MaxOfDATE = Table.DATE) AND (SUB_MaxDatesPerCode.CODE = Table.CODE)
GROUP BY Table.CODE, SUB_MaxDatesPerCode.CODE
HAVING (((SUB_MaxDatesPerCode.CODE) Is Null));
SUB_MaxDatesPerCode:
SELECT Table.CODE, Max(Table.Date) AS MaxOfDATE
FROM [Table]
GROUP BY Table.CODE;
SUB_2ndMaxData:
SELECT Table.CODE, Table.Date, Table.PRICE
FROM [Table] INNER JOIN SUB_2ndMaxDatesPerCode ON (Table.DATE = SUB_2ndMaxDatesPerCode.MaxOfDATE1) AND (Table.CODE = SUB_2ndMaxDatesPerCode.Table.CODE);
SUB_MaxData:
SELECT Table.CODE, Table.Date, Table.PRICE
FROM ([Table] INNER JOIN SUB_MaxDatesPerCode ON (Table.DATE = SUB_MaxDatesPerCode.MaxOfDATE) AND (Table.CODE = SUB_MaxDatesPerCode.CODE)) INNER JOIN SUB_2ndMaxDatesPerCode ON Table.CODE = SUB_2ndMaxDatesPerCode.Table.CODE;
SUB_Data:
SELECT CODE, DATE AS OLDDATE, PRICE AS OLDPRICE, NULL AS NEWDATE, NULL AS NEWPRICE FROM SUB_2ndMaxData
UNION ALL SELECT CODE, NULL AS OLDDATE, NULL AS OLDPRICE, DATE AS NEWDATE, PRICE AS NEWPRICE FROM SUB_MaxData;
Data (Rollup):
SELECT SUB_Data.CODE, Max(SUB_Data.OLDDATE) AS MaxOfOLDDATE, Max(SUB_Data.OLDPRICE) AS MaxOfOLDPRICE, Max(SUB_Data.NEWDATE) AS MaxOfNEWDATE, Max(SUB_Data.NEWPRICE) AS MaxOfNEWPRICE
FROM SUB_Data
GROUP BY SUB_Data.CODE
ORDER BY SUB_Data.CODE;
There you go - thanks for the challenge.
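For comparison, on an engine that supports window functions the same result can be had in a single statement. This is not the approach used above but a swapped-in alternative: a minimal sketch with ROW_NUMBER(), run through Python's built-in sqlite3 module (this assumes SQLite 3.25 or later). The table name prices and the column price_date are hypothetical, dates are stored as ISO text, and prices as whole numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (code INTEGER, price_date TEXT, price INTEGER);
    INSERT INTO prices VALUES
        (100, '2013-01-01', 500), (100, '2013-02-01', 521),
        (100, '2013-03-03', 530), (100, '2013-05-09', 542),
        (222, '2013-03-03', 20),
        (350, '2013-01-01', 200), (350, '2013-03-01', 225);
""")

rows = conn.execute("""
    WITH ranked AS (
        SELECT code, price_date, price,
               -- 1 = newest row for the code, 2 = second newest
               ROW_NUMBER() OVER (PARTITION BY code
                                  ORDER BY price_date DESC) AS rn,
               -- number of rows for the code, to drop single-date codes
               COUNT(*) OVER (PARTITION BY code) AS n
        FROM ranked_source
    ),
    ranked_source AS (SELECT * FROM prices)
    SELECT code,
           MAX(CASE WHEN rn = 2 THEN price_date END) AS old_date,
           MAX(CASE WHEN rn = 2 THEN price END)      AS old_price,
           MAX(CASE WHEN rn = 1 THEN price_date END) AS new_date,
           MAX(CASE WHEN rn = 1 THEN price END)      AS new_price
    FROM ranked
    WHERE rn <= 2 AND n >= 2
    GROUP BY code
    ORDER BY code
""").fetchall()

for row in rows:
    print(row)
# (100, '2013-03-03', 530, '2013-05-09', 542)
# (350, '2013-01-01', 200, '2013-03-01', 225)
```

Code 222 is dropped because it has only one date, and the two newest rows for each remaining code are pivoted onto one line.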
Accessing the recent data
To get the most recent rows, use TOP 2 together with a descending sort. Sorting by date in descending order puts the newest rows first, and TOP 2 then keeps the first two, much as reversing the alphabet to ZYX and taking the top 2 gives you ZY.
SELECT TOP 2 * FROM table_name ORDER BY column_time DESC;
This way, you reverse the table, and then select the most recent two from the top.
Joining the Tables
To combine columns from two tables into one result, you can use a JOIN (I prefer INNER JOIN), such as:
-- other_table is a placeholder name for the second table
SELECT TOP 2 * FROM table_name INNER JOIN other_table ON
table_name.column_name = other_table.column_name2
This way, you join the two tables wherever a value in one table's column matches the value in the other table's column.
You can run this query inside a for loop, or inside a foreach loop, to pull out the values for each code.
My suggestion
My preferred method would be to first select the data ordered by date.
Then, inside the foreach() loop where you write out that data, select the remaining data for that date and write it inside the loop.
Code (column_name) won't bother you
Once the query uses ORDER BY Time DESC, you no longer need to filter on CODE with something like WHERE Code = value; you will get the codes of the most recent rows. If you really need to filter on the code column, you can do it with an if/else block.
Reference:
http://technet.microsoft.com/en-us/library/ms190014(v=sql.105).aspx (Inner join)
http://www.w3schools.com/sql/sql_func_first.asp (top; check the Sql Server query)

Having problems fully understanding GROUP BY

I'm going over some practise questions for an exam that I have coming up and I'm having a problem fully understanding group by. I see GROUP BY as the following: group the result set by one or more columns.
I have the following database schema
My query
SELECT orders.customer_numb, sum(order_lines.cost_line), customers.customer_first_name, customers.customer_last_name
FROM orders
INNER JOIN customers ON customers.customer_numb = orders.customer_numb
INNER JOIN order_lines ON order_lines.order_numb = orders.order_numb
GROUP BY orders.customer_numb, order_lines.cost_line, customers.customer_first_name, customers.customer_last_name
ORDER BY order_lines.cost_line DESC
What I'm struggling to understand
Why can't I simply use just GROUP BY orders.cost_line and group the data by cost_line?
What I'm trying to achieve
I'd like to achieve the name of the customer who has spent the most money. I just don't fully understand how to achieve this. I understand how joins work, I just can't seem to get my head around why I can't simply GROUP BY customer_numb and cost_line (with sum() used to calculate the amount spent). I seem to always get "not a GROUP BY expression", if someone could explain what I'm doing wrong (not just give me the answer), that would be great - I'd really appreciate that, and of course any resources that you have for using GROUP by properly.
Sorry for the long essay and If I've missed anything I apologise. Any help would be greatly appreciated.
I just can't seem to get my head around why I can't simply GROUP BY
customer_numb and cost_line (with sum() used to calculate the amount
spent).
When you say group by customer_numb, you know that customer_numb uniquely identifies a row in the customers table (assuming customer_numb is either a primary or alternate key), so that any given customers.customer_numb will have one and only one value for customers.customer_first_name and customers.customer_last_name. But at parse time Oracle does not know that, or at least acts as if it does not. And it says, in a bit of panic, "What do I do if a single customer_numb has more than one value for customer_first_name?"
Roughly the rule is, expressions in the select clause can use expressions in the group by clause and/or use aggregate functions. (As well as constants and system variables that don't depend on the base tables, etc.) And by "use" I mean be the expression or part of the expression. So once you group on first name and last name, customer_first_name || customer_last_name would be a valid expression also.
When you have a table like customers and are grouping by its primary key, or by a column with a unique key and a not null constraint, you can safely add its other columns to the group by clause. In this particular instance, group by customers.customer_numb, customers.customer_first_name, customers.customer_last_name.
Also note that the order by in the first query will fail, since order_lines.cost_line doesn't have a single value for the group. You can order on sum(order_lines.cost_line), or use a column alias in the select clause and order on that alias:
SELECT orders.customer_numb,
sum(order_lines.cost_line),
customers.customer_first_name,
customers.customer_last_name
FROM orders
INNER JOIN customers ON customers.customer_numb = orders.customer_numb
INNER JOIN order_lines ON order_lines.order_numb = orders.order_numb
GROUP BY orders.customer_numb,
customers.customer_first_name,
customers.customer_last_name
ORDER BY sum(order_lines.cost_line)
or
SELECT orders.customer_numb,
sum(order_lines.cost_line) as sum_cost_line,
. . .
ORDER BY sum_cost_line
Note: I've heard that some RDBMSes will imply additional expressions for the grouping without them being explicitly stated. Oracle is not one of those RDBMSes.
As for grouping by both customer_numb and cost_line, consider a DB with two customers, 1 and 2, each with two orders of one line each:
Customer Number | Cost Line
1 | 20.00
1 | 20.00
2 | 35.00
2 | 30.00
select customer_number, cost_line, sum(cost_line)
FROM ...
group by customer_number, cost_line
order by sum(cost_line) desc
Customer Number | Cost Line | sum(cost_line)
1 | 20.00 | 40.00
2 | 35.00 | 35.00
2 | 30.00 | 30.00
The first row with highest sum(cost_line) is not the customer who spent the most.
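The four-row example above can be run directly. This is a minimal sketch using Python's built-in sqlite3 module; the single table named lines stands in for the joined orders and order_lines tables, and costs are stored as whole numbers, both simplifications for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lines (customer_number INTEGER, cost_line INTEGER);
    INSERT INTO lines VALUES (1, 20), (1, 20), (2, 35), (2, 30);
""")

# Grouping by customer AND cost_line only merges lines with equal cost.
by_both = conn.execute("""
    SELECT customer_number, cost_line, SUM(cost_line)
    FROM lines
    GROUP BY customer_number, cost_line
    ORDER BY SUM(cost_line) DESC
""").fetchall()

# Grouping by customer alone gives the total spent per customer.
by_customer = conn.execute("""
    SELECT customer_number, SUM(cost_line)
    FROM lines
    GROUP BY customer_number
    ORDER BY SUM(cost_line) DESC
""").fetchall()

print(by_both)      # [(1, 20, 40), (2, 35, 35), (2, 30, 30)]
print(by_customer)  # [(2, 65), (1, 40)]
```

Grouped by both columns, customer 1 appears on top with 40; grouped by customer alone, customer 2 is correctly first with 65.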
I understand how joins work, I just can't seem to get my head around
why I can't simply GROUP BY customer_numb and cost_line (with sum()
used to calculate the amount spent).
This should give you the sum for every customer.
SELECT orders.customer_numb, sum(order_lines.cost_line)
FROM orders
INNER JOIN order_lines ON order_lines.order_numb = orders.order_numb
GROUP BY orders.customer_numb
Note that every column in the SELECT clause that's not an argument to an aggregate function is also a column in the GROUP BY clause.
Now you can join that with other tables to get more detail. Here's one way using a common table expression. (There are other ways to express what you want.)
with customer_sums as (
-- We give the columns useful aliases here.
SELECT orders.customer_numb as customer_numb,
sum(order_lines.cost_line) as total_orders
FROM orders
INNER JOIN order_lines ON order_lines.order_numb = orders.order_numb
GROUP BY orders.customer_numb
)
select c.customer_numb, c.customer_first_name, c.customer_last_name, cs.total_orders
from customers c
inner join customer_sums cs
on cs.customer_numb = c.customer_numb
order by cs.total_orders desc
Why can't I simply use just GROUP BY orders.cost_line and group the
data by cost_line?
Applying GROUP BY to order_lines.cost_line will give you one row for each distinct value in order_lines.cost_line. (The column orders.cost_line doesn't exist.) Here's what that data might look like.
OL.ORDER_NUMB OL.COST_LINE O.CUSTOMER_NUMB C.CUSTOMER_FIRST_NAME C.CUSTOMER_LAST_NAME
--
1 1.45 2014 Julio Savell
1 2.33 2014 Julio Savell
1 1.45 2014 Julio Savell
2 1.45 2014 Julio Savell
2 1.45 2014 Julio Savell
3 13.00 2014 Julio Savell
You can group by order_lines.cost_line, but it won't give you any useful information. This query
select order_lines.cost_line, orders.customer_numb
from order_lines
inner join orders on orders.order_numb = order_lines.order_numb
group by order_lines.cost_line, orders.customer_numb;
should return something like this.
OL.COST_LINE O.CUSTOMER_NUMB
--
1.45 2014
2.33 2014
13.00 2014
Not terribly useful.
If you're interested in the sum of the order line items, you need to decide what column or columns to group (summarize) by. If you group (summarize) by order number, you'll get three rows. If you group (summarize) by customer number, you'll get one row.
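The row counts mentioned above can be confirmed with the sample data. This is a minimal sketch using Python's built-in sqlite3 module; the cost_cents column is an assumption that stores each cost in integer cents to sidestep floating-point rounding, and the tables carry only the columns needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_numb INTEGER, customer_numb INTEGER);
    CREATE TABLE order_lines (order_numb INTEGER, cost_cents INTEGER);
    INSERT INTO orders VALUES (1, 2014), (2, 2014), (3, 2014);
    INSERT INTO order_lines VALUES
        (1, 145), (1, 233), (1, 145),
        (2, 145), (2, 145),
        (3, 1300);
""")

# Summarize by order number: one row per order.
per_order = conn.execute("""
    SELECT o.order_numb, SUM(ol.cost_cents)
    FROM orders AS o
    INNER JOIN order_lines AS ol ON ol.order_numb = o.order_numb
    GROUP BY o.order_numb
    ORDER BY o.order_numb
""").fetchall()

# Summarize by customer number: one row per customer.
per_customer = conn.execute("""
    SELECT o.customer_numb, SUM(ol.cost_cents)
    FROM orders AS o
    INNER JOIN order_lines AS ol ON ol.order_numb = o.order_numb
    GROUP BY o.customer_numb
""").fetchall()

print(per_order)     # [(1, 523), (2, 290), (3, 1300)]
print(per_customer)  # [(2014, 2113)]
```

The choice of grouping column decides the granularity of the answer: three order totals in the first case, one customer total in the second.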