Add columns to SQL query and filter by min(date) and sum(price) - sql

I am trying to generate, in SQL, a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then. I'm able to generate the list of users, but I'm unable to pull in what their first purchase was or other variables, because the columns I'm trying to include are neither grouped nor aggregated. I'm hoping someone can point me in the right direction as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email,
       min(processed_at) as First_Purchase_Date,
       sum(total_price) as Total_Revenue
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
    FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
   and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchases from this list into separate rows and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND a sum(total_price) > 100? Something like this:
SELECT contact_email, billing_address, line_items, min(processed_at),
       sum(total_price) OVER (PARTITION BY contact_email)
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
    FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?

I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
    select
        o.*,
        min(processed_at) over(partition by contact_email, billing_address.name) min_processed_at,
        sum(total_price) over(partition by contact_email, billing_address.name) sum_total_price
    from (
        select
            o.*,
            row_number() over(partition by id) instance
        from orders o
    ) o
    where instance = 1
) o
where
    min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
    and sum_total_price > 100

Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery Standard SQL
with ordered_orders as (
    -- rank each id's orders by processed_at date, first to last
    select *, row_number() over(partition by id order by processed_at asc) as rn
    from `table.orders`
),
first_criteria as (
    -- select ids where the first processed_at date is in 2018-12
    select id, processed_at as first_order_date
    from ordered_orders
    where rn = 1
      and extract(year from processed_at) = 2018
      and extract(month from processed_at) = 12
),
second_criteria as (
    -- further select ids that meet the first criteria and have a total of > 100
    select id, sum(total_price) as total_revenue
    from ordered_orders
    inner join first_criteria using(id)
    group by id
    having total_revenue > 100
),
orders_with_criteria as (
    -- get all orders for users who meet both criteria
    select ordered_orders.* except(rn), first_order_date, total_revenue
    from ordered_orders
    inner join first_criteria using(id)
    inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the id column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account number, etc.), I would use that instead of id in the query.
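For illustration, the ranking CTE would then partition on that column instead, and the later using(id) joins would switch to it as well. A sketch of just the first two steps (customer_id is a guessed column name, substitute whatever your schema actually uses):
-- sketch only: rank orders per customer rather than per order id
-- customer_id is a hypothetical column name
with ordered_orders as (
    select *, row_number() over(partition by customer_id order by processed_at asc) as rn
    from `table.orders`
)
select customer_id, processed_at as first_order_date
from ordered_orders
where rn = 1
  and extract(year from processed_at) = 2018
  and extract(month from processed_at) = 12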

No need to use row_number() in BigQuery for this:
SELECT billing_address.name, contact_email,
       MIN(processed_at) as First_Purchase_Date,
       SUM(total_price) as Total_Revenue,
       ARRAY_AGG(o ORDER BY processed_at LIMIT 1) as first_order
FROM `table.orders` o
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
       MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
       SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.
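For example, to pull a couple of individual fields instead of the whole row, something along these lines should work in BigQuery (just a sketch, reusing the column names from the question):
-- sketch: grab individual fields of the first order instead of the whole struct
SELECT billing_address.name, contact_email,
       MIN(processed_at) AS First_Purchase_Date,
       SUM(total_price) AS Total_Revenue,
       ARRAY_AGG(o.id ORDER BY processed_at LIMIT 1)[OFFSET(0)] AS first_order_id,
       ARRAY_AGG(o.total_price ORDER BY processed_at LIMIT 1)[OFFSET(0)] AS first_order_total
FROM `table.orders` o
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC'
   AND MIN(processed_at) < '2019-02-01 00:00:00 UTC'
   AND SUM(total_price) > 100
ORDER BY SUM(total_price) DESC;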

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to group users who have overlapping voucher periods into a single row instead (showing the minimum start date and the maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that since row 1 and row 2 have overlapping dates, I would like to just combine them together and get the min(Start_date) and max(End_Date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
    id,
    start_date,
    end_date,
    lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
    row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0 end) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
             sum(case when prev_end_date >= start_date then 0 else 1
                 end) over (partition by id
                            order by start_date, voucher_code
                            rows between unbounded preceding and current row
                           ) as grp
      from (select u.*,
                   max(end_date) over (partition by id
                                       order by start_date, voucher_code
                                       rows between unbounded preceding and 1 preceding
                                      ) as prev_end_date
            from users u
           ) u
     ) u
group by id, grp;
Another approach would be using a recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date.
Iterate over them, calculating group_start_date for each row (rows which have to be merged in the final result will have the same group_start_date).
Finally, group the CTE by id and group_start_date, taking the max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
    -- Get Values Order
    SELECT *, start_date AS group_start_date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
    FROM users),
Recursion AS (
    -- Anchor - the first value in groups
    SELECT *
    FROM cteSequencing
    WHERE iSequence = 1
    UNION ALL
    -- Remaining items
    SELECT b.id, b.start_date, b.end_date,
           CASE WHEN a.end_date > b.start_date THEN a.group_start_date
                ELSE b.start_date
           END AS group_start_date,
           b.iSequence
    FROM Recursion AS a
    INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date

Find rows with similar date values

I want to find customers where, for example, the system registered duplicate orders by mistake.
It's pretty easy if reg_date is EXACTLY the same, but I have no idea how to write a query that counts rows as duplicates if, for example, there is up to a 1 second difference between the transactions.
select * from
    (select customer_id, reg_date, count(*) as cnt
     from orders
     group by 1, 2
    ) x
where cnt > 1
Here is example dataset:
https://www.db-fiddle.com/f/m6PhgReSQbVWVZhqe8n4mi/0
Currently only customer 104's orders are counted as duplicates because their reg_date is identical. I want to also count orders 1, 2 and 4, 5, as there's just a 1 second difference.
demo:db<>fiddle
SELECT
    customer_id,
    reg_date
FROM (
    SELECT
        *,
        reg_date - lag(reg_date) OVER (PARTITION BY customer_id ORDER BY reg_date) <= interval '1 second' as is_duplicate
    FROM
        orders
) s
WHERE is_duplicate
Use the lag() window function. It allows you to look at the previous record. With this value you can compute the difference and filter the records where the gap to the previous transaction is at most one second.
Try the following script. It will return the duplicates per day and customer.
SELECT
    TO_CHAR(reg_date :: DATE, 'dd/mm/yyyy') reg_date,
    customer_id,
    count(*) as cnt
FROM orders
GROUP BY
    TO_CHAR(reg_date :: DATE, 'dd/mm/yyyy'),
    customer_id
HAVING count(*) > 1

Finding lowest two minimum values and finding difference between the two in SQL Server?

I have a transaction table where I have to find the first and second transaction date for every customer. Finding the first date is simple, as I can use the MIN() function, but getting the second date, and in particular the difference between the two, is proving very challenging and I'm not able to find a feasible way:
select a.customer_id, a.transaction_date, a.Row_Count2
from (select
          transaction_date as transaction_date,
          reference_no as customer_id,
          row_number() over (partition by reference_no
                             ORDER BY reference_no, transaction_date) AS Row_Count2
      from transaction_detail
     ) a
where a.Row_Count2 < 3
ORDER BY a.customer_id, a.transaction_date, a.Row_Count2
Gives me this:
What I want is the following columns:
CustomerID | FirstDateofPurchase | SecondDateofPurchase | Diff. between Second & First Date
You can use the window functions LEAD/LAG to return the results you are looking for.
First, find the next transaction date for each row by reference number using LEAD, and generate a row number for each row using your original logic. You can then take the date difference on the rows where the row number is 1.
Example (I'm not excluding same-day transactions: they are treated as separate, and the row number is generated based on the result set from your query above. You can easily change the SQL below to treat them as one and remove them, so that you get the next distinct date as the second date):
declare #tbl table(reference_no int, transaction_date datetime)
insert into #tbl
select 1000, '2018-07-11'
UNION ALL
select 1001, '2018-07-12'
UNION ALL
select 1001, '2018-07-12'
UNIOn ALL
select 1001, '2018-07-13'
UNIOn ALL
select 1002, '2018-07-11'
UNIOn ALL
select 1002, '2018-07-15'
select customer_id, transaction_date as firstdate,
transaction_date_next seconddate,
datediff(day, transaction_date, transaction_date_next) diff_in_days
from
(
select reference_no as customer_id, transaction_date,
lead(transaction_date) over (partition by reference_no
order by transaction_date) transaction_date_next,
row_number() over (partition by reference_no ORDER BY transaction_date) AS Row_Count
from #tbl
) src
where Row_Count = 1
You can do this with CROSS APPLY.
SELECT td.customer_id, MIN(ca.transaction_date), MAX(ca.transaction_date),
       DATEDIFF(day, MIN(ca.transaction_date), MAX(ca.transaction_date))
FROM transaction_detail td
CROSS APPLY (SELECT TOP 2 *
             FROM transaction_detail
             WHERE customer_id = td.customer_id
             ORDER BY transaction_date) ca
GROUP BY td.customer_id

sql return 1st day of each month in table

I have a sql table like so with two columns...
3/1/17 100
3/2/17 200
3/3/17 300
4/3/17 600
4/4/17 700
4/5/17 800
I am trying to run a query that returns the earliest day present in each month of the table above, and grabs the corresponding value.
results should be
3/1/17 100
4/3/17 600
then once I have these results... do something with each one.
any ideas how I can get started?
In standard SQL, you would use row_number():
select t.*
from (select t.*,
             row_number() over (partition by extract(year from dte), extract(month from dte)
                                order by dte asc) as seqnum
      from t
     ) t
where seqnum = 1;
Most databases support this functionality, but the exact functions (particularly for dates) may differ depending on the database.
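For instance, in Postgres the two extract() calls could be collapsed into a single date_trunc() in the partition clause (just a sketch, assuming the same t/dte names as above):
-- same idea, partitioning by the truncated month instead of (year, month)
select t.*
from (select t.*,
             row_number() over (partition by date_trunc('month', dte)
                                order by dte asc) as seqnum
      from t
     ) t
where seqnum = 1;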
An alternative (SQL Server flavour):
SELECT t.*
FROM YourTable t
JOIN (
    select MIN(DateColumn) as MinimumDate
    from YourTable
    group by FORMAT(DateColumn, 'yyyyMM')
) q on (t.DateColumn = q.MinimumDate)
ORDER BY t.DateColumn;
For the GROUP BY this will also be fine:
group by YEAR(DateColumn), MONTH(DateColumn)
or
group by DATEPART(YEAR,DateColumn), DATEPART(MONTH,DateColumn)

Finding a date with the largest sum

I have a database of transactions, accounts, profit/loss, and date. I need to find the date on which the largest profit occurs for each account. I have already found a way to get the actual max/min values, but I can't seem to pull the corresponding date. My code so far is like this:
Select accountnum, min(ammount)
from table
where date > '02-Jan-13'
group by accountnum
order by accountnum
Ideally I would like to see the account number, the min or max, and the date on which it occurred.
Try something like this to get the min and max amount for each customer and the date it happened.
WITH max_amount as (
    SELECT accountnum, max(amount) amount, date
    FROM TABLE
    GROUP BY accountnum, date
),
min_amount as (
    SELECT accountnum, min(amount) amount, date
    FROM TABLE
    GROUP BY accountnum, date
)
SELECT t.accountnum, ma.amount, ma.date, mi.amount, mi.date
FROM table t
JOIN max_amount ma
    ON ma.accountnum = t.accountnum
JOIN min_amount mi
    ON mi.accountnum = t.accountnum
If you want the data for just this year you could add a where clause to the end of the statement
WHERE t.date > '02-Jan-13'
The easiest way to do this is using window/analytic functions. These are ANSI standard and most databases support them (MySQL and Access being two notable exceptions).
Here is one way:
select t.accountnum, min_amount, max_amount,
       min(case when amount = min_amount then date end) as min_amount_date,
       min(case when amount = max_amount then date end) as max_amount_date
from (select t.*,
             min(amount) over (partition by accountnum) as min_amount,
             max(amount) over (partition by accountnum) as max_amount
      from table t
      where date > '02-Jan-13'
     ) t
group by accountnum, min_amount, max_amount
order by accountnum;
The subquery calculates the minimum and maximum amount for each account, using min() and max() as window functions. The outer query selects these values. It then uses conditional aggregation to get the first date when each of those values occurred.
;with cte as
(
    select accountnum, ammount, date,
           row_number() over (partition by accountnum order by ammount desc) rn,
           max(ammount) over (partition by accountnum) maxamount,
           min(ammount) over (partition by accountnum) minamount
    from table
    where date > '20130102'
)
select accountnum,
       ammount as amount,
       date as date_of_max_amount,
       minamount,
       maxamount
from cte
where rn = 1