PostgreSQL: How to use a WITH subquery with JOIN

I have 2 tables: orders and contragents. Each contragent might have many orders. Each order has an order_date. I want to get the first order date for each contragent, but with a caveat: if there was a gap of more than 180 days between orders, I need to "forget" the orders before the gap (so the first order after the gap is considered "the first").
For this, I've implemented the following statement:
with o1 as (
select order_date, lag(order_date) over(order by order_date ASC) as prev_order_date
from orders o
where o.contragent_code = :code
order by order_date desc)
select o1.order_date from o1
where extract(day from o1.order_date-o1.prev_order_date)>=180 or o1.prev_order_date is null
order by order_date desc
limit 1
This returns a single value for the contragent with code :code, which is what I need.
But I cannot figure out how to run a select that would return this date for every contragent in the table!
The only way I was able to do it was with CREATE FUNCTION, but I won't be able to do that in production, so any advice is highly appreciated!

You want to add partition by, which is kinda like group by for over.
with o1 as (
select order_date, lag(order_date) over(partition by contragent_code order by order_date ASC) as prev_order_date
from orders o
order by order_date desc)
select o1.order_date from o1
where extract(day from o1.order_date-o1.prev_order_date)>=180 or o1.prev_order_date is null
order by order_date desc
Now lag looks for the previous order_date of rows with same contragent_code.
UPDATE: in the end, that turned out not to be quite enough. This is the final statement:
with s as (
select o.contragent_code, o.order_date,
case
when
extract(day from order_date-lag(order_date) over(partition by contragent_code order by order_date asc))>=180
then o.order_date else null
end as date_with_gap
from orders o
) select contragent_code, coalesce(max(date_with_gap), min(order_date)) from s
group by contragent_code
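To see what the final statement does, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. The sample rows are made up, and because SQLite has no interval type, the day difference is computed with julianday() instead of extract(day from ...); on PostgreSQL you would keep the original expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (contragent_code text, order_date text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        ("A", "2020-01-01"),
        ("A", "2020-02-01"),
        ("A", "2021-01-01"),  # more than 180 days after the previous order
        ("A", "2021-02-01"),
        ("B", "2020-05-01"),  # contragent B never has a long gap
        ("B", "2020-06-01"),
    ],
)

query = """
with s as (
    select contragent_code, order_date,
           case
               -- keep the date only when it follows a gap of 180+ days
               when julianday(order_date)
                    - julianday(lag(order_date) over (
                          partition by contragent_code order by order_date)) >= 180
               then order_date
           end as date_with_gap
    from orders
)
select contragent_code,
       -- latest post-gap date, or the very first order if there was no gap
       coalesce(max(date_with_gap), min(order_date)) as first_order_date
from s
group by contragent_code
order by contragent_code
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Contragent A gets the first order after its long gap, while B falls back to its earliest order.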

Related

SQL: select, sort and join tables

For example, I have a table 'orders' with columns ID, order_date, order_price. I need to sort the rows with past dates in descending order and the rows with future dates in ascending order.
For past dates it would be:
SELECT * FROM orders WHERE order_date < CURRENT_DATE() ORDER BY order_date DESC
For future dates it would be:
SELECT * FROM orders WHERE order_date >= CURRENT_DATE() ORDER BY order_date ASC
How can I combine these queries into one?
Assuming MySQL 8 or later (CURRENT_DATE() is a MySQL function), you can use the row_number window function:
select results.* from (
-- negative numbers for past orders: the most recent past order gets the lowest value
SELECT 0 - row_number() over (order by order_date asc) as RowNum
, orders.* FROM orders WHERE order_date < CURRENT_DATE()
union
SELECT row_number() over (order by order_date asc) as RowNum
, orders.* FROM orders WHERE order_date >= CURRENT_DATE()
) results
order by results.RowNum asc
Might need a tweak on the orders before the current date.
Since the DBMS wasn't specified, I'm just assuming it supports some DATEDIFF function. I'm sure there's a way to handle this in one clause, but I can't think of it.
ORDER
BY CASE WHEN current_date < order_date THEN 0
ELSE 1
END,
CASE WHEN current_date >= order_date THEN DATEDIFF('day',current_date,order_date)
ELSE DATEDIFF('day',current_date,order_date)*-1
END
I would use:
order by (case when order_date < current_date() then 1 else 2 end),
(case when order_date < current_date() then order_date end) desc,
order_date asc
The first key explicitly puts the older records first. The second explicitly orders them by descending date. And the third orders the rest in ascending order.
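A quick way to check the three-key ORDER BY is to run it on a few rows. This is only a sketch using SQLite from Python; the sample rows are invented and the "current date" is passed in as a fixed :today parameter (an assumption, so the result does not depend on when you run it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, order_date text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [
        (1, "2024-01-10"),
        (2, "2024-03-05"),
        (3, "2024-06-20"),
        (4, "2024-09-01"),
    ],
)

today = "2024-04-01"  # fixed stand-in for current_date()

query = """
select order_date
from orders
order by case when order_date < :today then 1 else 2 end,       -- past rows first
         case when order_date < :today then order_date end desc, -- past rows, newest first
         order_date asc                                          -- future rows, oldest first
"""
ordered = [row[0] for row in conn.execute(query, {"today": today})]
print(ordered)
```

The past dates come out newest first, followed by the future dates oldest first.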

SQL: Difference between consecutive rows

I have a table with 3 columns: order id, member id, order date.
I need to pull the distribution of orders broken down by the number of days between 2 consecutive orders for each member id.
What I have is this:
SELECT
a1.member_id,
count(distinct a1.order_id) as num_orders,
a1.order_date,
DATEDIFF(DAY, a1.order_date, a2.order_date) as days_since_last_order
from orders as a1
inner join orders as a2
on a2.member_id = a1.member_id+1;
It's not helping me completely as the output I need is:
You can use lag() to get the date of the previous order by the same customer:
select o.*,
datediff(
order_date,
lag(order_date) over(partition by member_id order by order_date, order_id)
) days_diff
from orders o
When there are two rows for the same date, the smallest order_id is considered first. Also note that I fixed your datediff() syntax: in Hive, the function just takes two dates, and no unit.
I just don't get the logic you want to compute num_orders.
Maybe something like this:
SELECT
a1.member_id,
count(distinct a1.order_id) as num_orders,
a1.order_date,
DATEDIFF(DAY, a1.order_date, a2.order_date) as days_since_last_order
from orders as a1
inner join orders as a2
on a2.member_id = a1.member_id
where not exists (
select 1
from orders as intermediate_order
where intermediate_order.order_date < a1.order_date and intermediate_order.order_date > a2.order_date) ;

How to summarize information over the dynamic period in sql?

I have a table with orders and the following fields:
create table orders2 (
orderID int,
customerID int,
date DateTime,
amount int)
engine=Memory;
Each customer can make 0 or many orders each day. I need to create an SQL query that will show for each customer how many orders he/she made during the period of 3 days starting from the day when the customer has made his/her first order.
So, for each customer, the query should detect the date of the first order, then compute the date that is 3 days in the future from the first date, then filter rows to take only orders with dates in the given range, and then perform counting of orders (orderID) in that time period. At the moment, I was able to just detect the date of the first order for each customer.
SELECT
O.customerID,
O.date AS first_day,
COUNT(O.orderID) AS first_day_orders_num,
SUM(O.amount) AS first_day_amount
FROM orders2 AS O
INNER JOIN
(
SELECT
customerID,
MIN(date) AS first_date
FROM orders2
GROUP BY customerID
) AS I ON (O.customerID = I.customerID) AND (O.date = I.first_date)
GROUP BY
O.customerID,
O.date
I don't really understand what result you need. It can probably be solved using arrays.
Here is a solution using vanilla SQL:
select customerID, min(first_date), sum(num_orders_per_day)
from (
select customerID, date, min(date) first_date, count() num_orders_per_day
from orders2
group by customerID, date
having date <= first_date + interval 3 days
)
group by customerID
You can use window functions to get the first order date:
select o.CustomerID, count(*) as num_orders_3_days
from (select o.*, min(date) over (partition by CustomerID) as min_date
from orders2 o
) o
where date < min_date + interval '3 day'
group by CustomerID;
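Here is a small runnable sketch of the window-function approach, using SQLite from Python. The rows are made up, and the three-day window is written with SQLite's datetime(min_date, '+3 days') rather than interval syntax, so treat the date arithmetic as an assumption to adapt to your DBMS. It also measures the window from the exact timestamp of the first order, not from the start of that day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders2 (orderID integer, customerID integer, date text, amount integer)"
)
conn.executemany(
    "insert into orders2 values (?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01 10:00:00", 5),  # first order of customer 1
        (2, 1, "2024-01-02 09:00:00", 7),
        (3, 1, "2024-01-04 09:59:59", 3),  # one second inside the window
        (4, 1, "2024-01-04 10:00:00", 2),  # exactly 3 days later: excluded by <
        (5, 2, "2024-03-01 00:00:00", 1),  # first order of customer 2
        (6, 2, "2024-03-10 00:00:00", 1),  # far outside the window
    ],
)

query = """
select customerID, count(*) as num_orders_3_days
from (
    select o.*, min(date) over (partition by customerID) as min_date
    from orders2 o
) t
where date < datetime(min_date, '+3 days')
group by customerID
order by customerID
"""
counts = conn.execute(query).fetchall()
print(counts)
```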
Try this query:
SELECT customerID, orders_count
FROM (
SELECT customerID,
arraySort(x -> x.1, groupArray((date, orderID))) sorted_date_per_order_pairs,
sorted_date_per_order_pairs[1].1 + INTERVAL 3 day AS end_date,
arrayFilter(x -> x.1 < end_date, sorted_date_per_order_pairs) orders_in_period,
length(orders_in_period) orders_count
FROM orders2
GROUP BY customerID);

Difference between multiple dates

I am working in a database with multiple orders from multiple suppliers. Now I would like to know the difference in days between order 1 and order 2, order 2 and order 3, order 3 and order 4, and so on, for each supplier on its own. I need this to compute the standard deviation of the days between orders for each supplier.
Hopefully someone can help.
What you describe is lag() with aggregation:
select supplier,
stddev(orderdate - prev_orderdate) as std_orderdate
from (select t.*,
lag(orderdate) over (partition by supplier order by orderdate) as prev_orderdate
from t
) t
group by supplier;
You would typically use the window function lag() and date arithmetic.
Assuming the following data structure for table orders:
order_id int primary key
supplier_id int
order_date date
You would go:
select
o.*,
order_date
- lag(order_date) over(partition by supplier_id order by order_date) date_diff
from orders o
Which gives you, for each order, the difference in days from the previous order of the same supplier (or null if this is the first order of the supplier).
You can then compute the standard deviation with aggregation:
select supplier_id, stddev(date_diff)
from (
select
o.*,
order_date
- lag(order_date) over(partition by supplier_id order by order_date) date_diff
from orders o
) x
group by supplier_id
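The lag-then-aggregate pattern can be tried out with a small sketch. SQLite has no built-in stddev(), so this version (an assumption made for portability, not part of the answer) computes the per-order gaps in SQL with lag() and then takes the sample standard deviation in Python with statistics.stdev; the table contents are invented.

```python
import sqlite3
import statistics
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (order_id integer primary key, supplier_id integer, order_date text)"
)
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [
        (1, 1, "2024-01-01"),
        (2, 1, "2024-01-11"),  # 10 days after the previous order
        (3, 1, "2024-01-31"),  # 20 days after the previous order
        (4, 2, "2024-02-01"),
        (5, 2, "2024-02-08"),  # 7 days
        (6, 2, "2024-02-15"),  # 7 days
    ],
)

query = """
select supplier_id,
       julianday(order_date)
       - julianday(lag(order_date) over (
             partition by supplier_id order by order_date)) as date_diff
from orders
"""

# Collect the gaps per supplier, skipping the null gap of each first order.
gaps = defaultdict(list)
for supplier_id, date_diff in conn.execute(query):
    if date_diff is not None:
        gaps[supplier_id].append(date_diff)

# Sample standard deviation of the gaps for each supplier.
std_by_supplier = {s: statistics.stdev(d) for s, d in gaps.items()}
print(std_by_supplier)
```

Note that statistics.stdev needs at least two gaps per supplier, which means at least three orders.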

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then, in SQL. I'm able to generate the list of users, but I'm unable to determine what their first purchase was or pull in other variables. It appears to be an issue because the columns I'm trying to include are neither grouped nor aggregated, so I'm hoping someone can point me in the right direction as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchase from this list into a separate row and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND their sum(total_price) > 100? something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
select
o.*,
min(processed_at) over(partition by contact_email, billing_address.name) min_processed_at,
sum(total_price) over(partition by contact_email, billing_address.name) sum_total_price
from (
select
o.*,
row_number() over(partition by id) instance
from orders o
) o
where instance = 1
) o
where
min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
and sum_total_price > 100
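As a rough illustration of filtering on window results while keeping every order row, here is a sketch that runs in SQLite from Python. The table is a simplified, hypothetical stand-in for the real one (no nested billing_address, plain date strings), and the date range follows the "December 2018" wording of the question rather than the timestamps in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (id integer, contact_email text, processed_at text, total_price real)"
)
conn.executemany(
    "insert into orders values (?, ?, ?, ?)",
    [
        (1, "a@example.com", "2018-12-05", 60),
        (2, "a@example.com", "2019-02-10", 70),   # a: first purchase in Dec 2018, total 130
        (3, "b@example.com", "2018-11-20", 200),  # b: first purchase in Nov 2018
        (4, "b@example.com", "2018-12-15", 50),
        (5, "c@example.com", "2018-12-20", 40),   # c: total is only 40
    ],
)

query = """
select id
from (
    select o.*,
           min(processed_at) over (partition by contact_email) as first_purchase,
           sum(total_price)  over (partition by contact_email) as total_spent
    from orders o
) t
where first_purchase >= '2018-12-01'
  and first_purchase <  '2019-01-01'
  and total_spent > 100
order by id
"""
kept_ids = [row[0] for row in conn.execute(query)]
print(kept_ids)
```

Both orders of the qualifying user are returned, including the one placed after December, because the filter is on the per-user window values rather than on each row's own date.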
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
--rank each ID by processed_at date first to last
select *, row_number() over(partition by id order by processed_at asc) as rn
from `table.orders`
),
first_criteria as (
-- select IDs where first processed_at date is in 2018-12
select id, processed_at as first_order_date
from ordered_orders
where rn = 1
and extract(year from processed_at) = 2018
and extract(month from processed_at) = 12
),
second_criteria as (
-- further select IDs who meet first criteria and have a total of > 100
select id, sum(total_price) as total_revenue
from ordered_orders
inner join first_criteria using(id)
group by id
having total_revenue > 100
),
orders_with_criteria as (
-- get all orders for users who meet both criteria
select ordered_orders.* except(rn), first_order_date, total_revenue
from ordered_orders
inner join first_criteria using(id)
inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the ID column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc), then I would use that instead of id in the query.
No need to use row_number() in BigQuery for this:
SELECT billing_address.name, contact_email,
MIN(processed_at) as First_Purchase_Date,
SUM(total_price) as Total_Revenue,
ARRAY_AGG(o ORDER BY processed_at LIMIT 1)[OFFSET(0)] as first_order
FROM `table.orders` o
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.