JOIN on Subquery Based on Date Evaluation - sql

I have a query that includes a subquery that references one of the tables that my query joins on, but I also need to do an evaluation on the field returned from the subquery in my WHERE clause.
Here's the current query (rough example) -
SELECT t1.first_name, t1.last_name,
(SELECT created_at FROM customer_order_status_history WHERE order_id=t2.order_id AND order_status=t2.order_status ORDER BY created_at DESC LIMIT 1) AS order_date
FROM customers AS t1
INNER JOIN customer_orders as t2 on t2.customer_id=t1.customer_id
My subquery currently returns the latest created_at date from the customer_order_status_history table, but I also want to evaluate that value in the WHERE clause so that a row is only returned if the most recent created_at date is greater than a specific date condition (e.g. system date - 5 days). So in a way this is a conditional join between the customer_orders and customer_order_status_history tables, where the final result should only be returned if the most recent record in customer_order_status_history (sorted by created_at in descending order) is newer than system date - 5 days.
Apologies in advance for the bad explanation but hopefully it is clear what I am trying to achieve here. Also I did not come up with this database schema and given the project constraints, I can not alter the schema.
Thanks!

Use a lateral join:
SELECT c.first_name, c.last_name, cosh.created_at
FROM customers c
INNER JOIN customer_orders co
  ON co.customer_id = c.customer_id
CROSS JOIN LATERAL
  (SELECT cosh.*
   FROM customer_order_status_history cosh
   WHERE cosh.order_id = co.order_id AND
         cosh.order_status = co.order_status AND
         cosh.created_at > now() - INTERVAL '5 DAY'
   ORDER BY cosh.created_at DESC
   LIMIT 1
  ) cosh
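If LATERAL is not available in your database, a rough alternative sketch (keeping your original correlated subquery and the same Postgres-style interval syntax as above) is to wrap the query in a derived table and filter on the computed column:
SELECT x.first_name, x.last_name, x.order_date
FROM (
  SELECT c.first_name, c.last_name,
         (SELECT h.created_at
          FROM customer_order_status_history h
          WHERE h.order_id = co.order_id
            AND h.order_status = co.order_status
          ORDER BY h.created_at DESC
          LIMIT 1) AS order_date
  FROM customers c
  INNER JOIN customer_orders co ON co.customer_id = c.customer_id
) x
WHERE x.order_date > now() - INTERVAL '5 DAY'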


Avoid using CROSS JOIN on my SQL query (too heavy)

I am working on an SQL query to define customer types. The goal is to differentiate the old active customers from the churned customers (churn = customers that stopped using your company's product or service during a certain time frame).
To do that, I came up with this query, which works perfectly:
WITH customers AS (
  SELECT
    DATE(ord.delivery_date) AS date,
    ord.customer_id
  FROM table_template AS ord
  WHERE cancel_date IS NULL
    AND order_type_id IN (1,3)
  GROUP BY DATE(ord.delivery_date), ord.customer_id, ord.delivery_date),
days AS (SELECT DISTINCT date FROM customers),
recap AS (
  SELECT * FROM (
    SELECT
      a1.date,
      a2.customer_id,
      MAX(a2.date) AS last_order,
      DATE_DIFF(a1.date, MAX(a2.date), day) AS days_since_last,
      MIN(a2.date) AS first_order,
      DATE_DIFF(a1.date, MIN(a2.date), day) AS days_since_first
    FROM days AS a1
    CROSS JOIN customers AS a2
    WHERE a2.date <= a1.date
    GROUP BY a1.date, customer_id)
)
SELECT * FROM recap
The result of the query:
The only issue with this query is that the calculation is too heavy (it uses a lot of CPU seconds); I think it is due to the CROSS JOIN.
I need some help finding another way to get the same result, one that doesn't need a CROSS JOIN to produce the same output. Do you think that is possible?
As you mentioned, the problem of the query taking a long time to load was because of an internet issue. I will also try to explain INNER JOIN further with a sample query below:
SELECT distinct a1.id,a1.date
FROM `table1` AS a1
INNER JOIN `table2` AS a2
ON a2.date <= a1.date
The INNER JOIN returns rows from both tables as long as the join condition is satisfied. In this sample query, a pair of rows is returned for the condition a2.date <= a1.date only when the date value in table1 is greater than or equal to the date value in table2.
Input Table 1:
Input Table 2:
Output Table:
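Applied to the recap CTE from the question, the CROSS JOIN plus WHERE can be rewritten as an INNER JOIN with the non-equi condition in the ON clause. This is only a sketch in BigQuery syntax, and since it is logically the same join the engine may still do a similar amount of work:
SELECT
  a1.date,
  a2.customer_id,
  MAX(a2.date) AS last_order,
  DATE_DIFF(a1.date, MAX(a2.date), day) AS days_since_last,
  MIN(a2.date) AS first_order,
  DATE_DIFF(a1.date, MIN(a2.date), day) AS days_since_first
FROM days AS a1
INNER JOIN customers AS a2
  ON a2.date <= a1.date
GROUP BY a1.date, customer_id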

SQL - Guarantee at least n unique users with 2 appearances each in query

I'm working with AWS Personalize and one of the service Quotas is to have "At least 1000 records containing a min of 25 unique users with at least 2 records each", I know my raw data has those numbers but I'm trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.
The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the "two records each" section by just counting the appearances, but I don't know how to guarantee the min of 25 users.
It is important to say that my data is not shuffled in any way at the time of saving.
My query
SELECT C.productid AS ITEM_ID,
A.userid AS USER_ID,
A.createdon AS "TIMESTAMP",
B.fromaddress_countryname AS "LOCATION"
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
WHERE orders.userid IN (
SELECT orders.userid
FROM A AS ORDERS
GROUP BY orders.userid
HAVING count(*) > 2
)
LIMIT 10
I use the LIMIT to just query a subset since I'm in AWS Athena.
The IN query is not very efficient, since in the worst case each row needs to be compared with all the elements of the subquery to find a match.
It would be easier to start by storing all users with at least 2 records in a common table expression (CTE) and do a join to select them.
To ensure at least 25 distinct users you will need a window function to count the unique users since the first row and add a condition on that count. Since you can't use a window function in the where clause, you will need a second CTE and a final query that queries it.
For example:
with users as (
select userid
from orders
group by 1
having count(*) > 1 -- this condition ensures at least 2 records
),
cte as (
SELECT C.productid AS ITEM_ID,
A.userid AS USER_ID,
A.createdon AS "TIMESTAMP",
B.fromaddress_countryname AS "LOCATION",
count(distinct A.userid) over (order by A.userid rows between unbounded preceding and current row) as n_distinct_users
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
JOIN users on A.userid = users.userid --> ensure only users with 2 records
order by A.userid
)
select * from cte where n_distinct_users < 26
Ordering by userid in the cte keeps all rows for each selected user together, which ensures that at least 2 records per userid will appear in the results.
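One caveat: some engines, including Athena/Presto, reject DISTINCT inside a window aggregate. If count(distinct A.userid) over (...) errors out, a hedged alternative is to rank the users instead; because the rows are ordered by userid, dense_rank gives the number of distinct users seen so far:
dense_rank() over (order by A.userid) as n_distinct_users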

confused with INNER JOIN and FULL JOIN with temporary table

WITH
longest_used_bike AS (
SELECT
bikeid,
SUM(duration_minutes) AS trip_duration
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY
bikeid
ORDER BY
trip_duration DESC
LIMIT 1
)
-- find station at which longest_used bike leaves most often
SELECT
trips.start_station_id,
COUNT(*) AS trip_ct
FROM
longest_used_bike AS longest
INNER JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
ON longest.bikeid = trips.bikeid
GROUP BY
trips.start_station_id
ORDER BY
trip_ct DESC
LIMIT 1
This query gives a result of 2575, but why does the result change to 3798 when you use a FULL JOIN instead of an INNER JOIN? I'm trying to figure that out but I am not sure what to think.
A full join will include all entries from the trips table - regardless of whether or not they are joinable to the longest_used_bike ID (they will have a NULL value for the columns in longest)
Also see here for an explanation on join-types.
A tip: If you encounter things like these try to look at the queries unaggregated (omit the GROUP BY clause and the COUNT function) - you would then notice here that you'll suddenly have more (unwanted) rows in the FULL JOIN query.
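For example, keeping the same longest_used_bike CTE, a sketch of the unaggregated comparison would be (the extra rows in the FULL JOIN are the trips whose bikeid does not match, with NULL for the longest columns):
SELECT
  longest.bikeid AS longest_bikeid,
  trips.bikeid,
  trips.start_station_id
FROM
  longest_used_bike AS longest
FULL JOIN
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
  ON longest.bikeid = trips.bikeid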
An INNER JOIN will return only rows where the JOIN condition is satisfied, so only rows where there is a match in both tables.
A FULL JOIN will return ALL rows from the left table and all rows from the right table, with NULL values in the fields where there is no match.

right way to alias count * in a subquery

I have query below as
select t.comment_count, count(*) as frequency
from
(select u.id, count(c.user_id) as comment_count
from users u
left join comments c
on u.id = c.user_id
and c.created_at between '2020-01-01' and '2020-01-31'
group by 1) t
group by 1
order by 1
When I also try to alias the count(*) as count(t.*) it gives an error. Can I not alias that with the t from the table? Not sure what I am missing.
Thank you
Count(*) stands for the count of all rows returned by a query (per GROUP BY group), so it makes no sense to tie it to one of the involved tables. Consider counting rows produced by a join, for example. If you need a count of rows of the specific table t you can use count(distinct t.<unique column>).
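A minimal illustration of the difference, using two hypothetical tables t and j (not from the question) where t.id is unique:
select count(*) as joined_rows, -- every row produced by the join
count(distinct t.id) as t_rows -- distinct rows of t that matched
from t
left join j on j.t_id = t.id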

SQL Query Help obtaining last order date from 2 tables

I have the following tables:
Orders_All
Account
Orders_All contains every order line (many thousands of records), and the two relevant columns are order_date and order_account_id.
This needs to join to the Account table so other queries can be run as well, but I want a report that shows the account_id and the last order date, with only one record per account.
How can I create a query to achieve this?
SELECT account_id, MAX(order_date) as last_order_date FROM Orders_All INNER JOIN Account ON order_account_id = account_id GROUP BY account_id
That will give you the account ID and the maximum (furthest in the future) date. The GROUP BY is what limits it - it's the maximum date "for each" account_id.
If an account has no orders and you still want that account to show, with a NULL in the date column, use a RIGHT OUTER JOIN instead of INNER JOIN there.
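For reference, a sketch of that outer-join variant (accounts with no orders then show up with a NULL last_order_date):
SELECT account_id, MAX(order_date) as last_order_date
FROM Orders_All
RIGHT OUTER JOIN Account ON order_account_id = account_id
GROUP BY account_id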
Try this:
Select account_id, Max(order_date) as last_order_date from Orders_All t1, Account t2
where t1.order_account_id = t2.account_id
group by account_id
Substitute * for the columns you want to show, or leave it and try the next code.
select distinct * from Account A
join Orders_All O on O.order_account_id = A.account_id
--so far we have joined both tables, so now we need the last order date from it
where o.order_date in (select MAX(order_date) from Orders_All)
--in the where clause I'm getting the max order date from the Orders_All table
But what if I want the last order date by customer?
Well then you could write this (not the best but it works):
select *
from Account A
join (select order_account_id as Acc_id, MAX(order_date) as last_order_date
from Orders_All
group by order_account_id)
as OD on A.account_id = OD.Acc_id