transform large price table to table with startdate and enddate - sql

I have a table with prices per article per date with a lot of redundancy: even if the price does not change, I still have a line for each date. What I would like to do is transform this table to a table where for every different price, there will be a new line with a startdate and enddate.
Source example:
article_ID date price
1 01/01/15 2.99
1 02/01/15 2.99
1 03/01/15 2.49
2 01/01/15 12.29
2 02/01/15 12.29
2 03/01/15 12.29
I am looking for an SQL query to create the following result:
article_ID startdate enddate price
1 01/01/15 02/01/15 2.99
1 03/01/15 03/01/15 2.49
2 01/01/15 03/01/15 12.29
I work with SQL Server and Oracle SQL Developer.

You need to identify rows of consecutive dates with the same price, and then group on the resulting identifier. A simpler way to get the group is to subtract an increasing sequence, generated by row_number():
select article_id, min(date) as startdate, max(date) as enddate, price
from (select s.*,
dateadd(day,
- row_number() over (partition by article_id, price
order by date
),
date) as grp
from source s
) s
group by grp, article_id, price;
If you have the possibility of missed dates, then a difference of row numbers works:
select article_id, min(date) as startdate, max(date) as enddate, price
from (select s.*,
(row_number() over (partition by article_id order by date) -
row_number() over (partition by article_id, price order by date)
) as grp
from source s
) s
group by grp, article_id, price;
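As a rough check of the difference-of-row-numbers query, here is a minimal sketch that runs it against the question's sample data. It assumes SQLite (via Python's sqlite3 module) as a stand-in for SQL Server or Oracle, ISO-formatted dates, and one hypothetical extra row in which article 1 reverts to its earlier price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (article_id INTEGER, date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", 2.99),
        (1, "2015-01-02", 2.99),
        (1, "2015-01-03", 2.49),
        (1, "2015-01-04", 2.99),  # hypothetical extra row: price reverts to 2.99
        (2, "2015-01-01", 12.29),
        (2, "2015-01-02", 12.29),
        (2, "2015-01-03", 12.29),
    ],
)

# Within one article, the two row numbers advance together as long as the
# price is unchanged, so their difference is constant inside each price run.
query = """
SELECT article_id, MIN(date) AS startdate, MAX(date) AS enddate, price
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY article_id ORDER BY date)
           - ROW_NUMBER() OVER (PARTITION BY article_id, price ORDER BY date) AS grp
      FROM source s) s
GROUP BY article_id, price, grp
ORDER BY article_id, startdate
"""
price_ranges = conn.execute(query).fetchall()
for row in price_ranges:
    print(row)
```

The reverted price on the fourth day comes out as its own range instead of being merged with the first 2.99 run.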

You could try this:
INSERT INTO destinationtable (article_ID, startdate, enddate, price)
SELECT article_ID, MIN(date) AS startdate, MAX(date) AS enddate, price
FROM sourcetable
GROUP BY article_ID, price
This will not work properly if a price changes back to a previous value. If that can happen, you will have to run procedural code that loops while the price stays constant and tracks the start and end dates.
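To make the caveat concrete, the sketch below runs the SELECT part of that statement on three hypothetical rows where the price drops and then returns to its old value. SQLite (through Python's sqlite3 module) and ISO-formatted dates are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourcetable (article_ID INTEGER, date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sourcetable VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", 2.99),
        (1, "2015-01-02", 2.49),  # price drops for one day
        (1, "2015-01-03", 2.99),  # and then reverts
    ],
)

# Grouping on price alone merges both 2.99 periods into a single range.
merged = conn.execute(
    """
    SELECT article_ID, MIN(date) AS startdate, MAX(date) AS enddate, price
    FROM sourcetable
    GROUP BY article_ID, price
    ORDER BY startdate
    """
).fetchall()
for row in merged:
    print(row)
```

The 2.99 range spans all three days and therefore overlaps the day on which the price was actually 2.49.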

Related

How to differentiate iterations using a date field in BigQuery

I have a process that occurs every 30 days but can take a few days to complete.
How can I differentiate between iterations in order to sum the output of each one?
For example, the output I expect is:
Name | Date | amount | iteration (optional)
Sophia Liu | 2016-01-01 | 4 | 1
Sophia Liu | 2016-02-01 | 5 | 2
Nikki Leith | 2016-01-02 | 5 | 1
Nikki Leith | 2016-02-01 | 10 | 2
I tried using the lag function on the date field and taking the difference between that column and the date column.
WITH base AS
(SELECT 'Sophia Liu' as name, DATE '2016-01-01' as date, 3 as amount
UNION ALL SELECT 'Sophia Liu', DATE '2016-01-02', 1
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-01', 3
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-02', 2
UNION ALL SELECT 'Nikki Leith', DATE '2016-01-02', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-01', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-02', 3
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-03', 1
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-04', 1)
select
name
,date
,lag(date) over (partition by name order by date) as lag_func
,date_diff(date,lag(date) over (partition by name order by date),day) date_difference
,case when date_diff(date,lag(date) over (partition by name order by date),day) >= 10
or date_diff(date,lag(date) over (partition by name order by date),day) is null then true else false end as new_iteration
,amount
from base
Edited answer
After your clarification, and looking at what's actually in your SQL code, I'm guessing you are looking for a solution to what's called a gaps and islands problem. That is, you want to identify the "islands" of activity and sum the amount for each iteration or island. Taking your example, you can first identify the start of a new iteration (the first row after a "gap") and then use that to create a unique iteration ("island") identifier for each user. You can then use that identifier to perform a SUM().
-- appended to the question's WITH clause, directly after the base CTE
, gaps as (
select
name,
date,
amount,
if(date_diff(date, lag(date,1) over(partition by name order by date), DAY) >= 10, 1, 0) new_iteration
from base
),
islands as (
select
*,
1 + sum(new_iteration) over(partition by name order by date) iteration_id
from gaps
)
select
*,
sum(amount) over(partition by name, iteration_id) iteration_amount
from islands
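The gaps-and-islands steps can be tried end to end with the sketch below. It is an adaptation under stated assumptions, not the BigQuery query itself: it runs on SQLite through Python's sqlite3 module, replaces date_diff with a julianday() subtraction, and groups by the island identifier so that the result has one row per iteration, as in the expected output of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (name TEXT, date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?)",
    [
        ("Sophia Liu", "2016-01-01", 3),
        ("Sophia Liu", "2016-01-02", 1),
        ("Sophia Liu", "2016-02-01", 3),
        ("Sophia Liu", "2016-02-02", 2),
        ("Nikki Leith", "2016-01-02", 5),
        ("Nikki Leith", "2016-02-01", 5),
        ("Nikki Leith", "2016-02-02", 3),
        ("Nikki Leith", "2016-02-03", 1),
        ("Nikki Leith", "2016-02-04", 1),
    ],
)

query = """
WITH gaps AS (
  -- flag a row as the start of a new iteration when it follows a gap of 10+ days
  SELECT name, date, amount,
         CASE WHEN julianday(date)
                   - julianday(LAG(date) OVER (PARTITION BY name ORDER BY date)) >= 10
              THEN 1 ELSE 0 END AS new_iteration
  FROM base
),
islands AS (
  -- the running sum of the flags numbers the iterations 1, 2, 3, ... per name
  SELECT *,
         1 + SUM(new_iteration) OVER (PARTITION BY name ORDER BY date) AS iteration_id
  FROM gaps
)
SELECT name, MIN(date) AS start_date, SUM(amount) AS amount, iteration_id
FROM islands
GROUP BY name, iteration_id
ORDER BY name DESC, iteration_id
"""
iterations = conn.execute(query).fetchall()
for row in iterations:
    print(row)
```

The first row of each name has no previous date, so the comparison is NULL and the flag falls through to 0, which is why the numbering starts at 1.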
Previous answer
Sounds like you just need a RANK() to count the iterations in your window functions. Depending on your need you can then sum cumulative or total amounts in a similar window function. Something like this:
select
name
,date
,rank() over (partition by name order by date) as iteration
,sum(amount) over (partition by name order by date) as cumulative_amount
,sum(amount) over (partition by name) as total_amount
,amount
from base

How to get min value at max date in sql?

I have a table with snapshot data. It has productid and date and quantity columns. I need to find min value in the max date. Let's say, we have product X: X had the last snapshot at Y date but it has two snapshots at Y with 9 and 8 quantity values. I need to get
product_id | date | quantity
X Y 8
So far I came up with this.
select
productid
, max(snapshot_date) max_date
, min(quantity) min_quantity
from snapshot_table
group by 1
It works, but I don't know why. Why does this not return the min value for each date?
I would use RANK here along with a scalar subquery:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY quantity) rnk
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
)
SELECT productid, snapshot_date, quantity
FROM cte
WHERE rnk = 1;
Note that this solution caters to the possibility that two or more records are tied for the lowest quantity among the most recent records.
Edit: We could simplify by doing away with the CTE and instead using the QUALIFY clause for the restriction on the RANK:
SELECT productid, snapshot_date, quantity
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
QUALIFY RANK() OVER (ORDER BY quantity) = 1;
Consider also below approach
select distinct product_id,
max(snapshot_date) over product as max_date,
first_value(quantity) over(product order by snapshot_date desc, quantity) as min_quantity
from your_table
window product as (partition by product_id)
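Use row_number(), ordering by date descending and then by quantity so that the smallest quantity on the latest date ranks first:
with cte as (select *,
row_number() over(partition by product_id order by date desc, quantity) rn
from table_name) select * from cte where rn=1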
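The difference between the question's GROUP BY query and a per-product ranking can be seen on a small hypothetical dataset. The sketch below assumes SQLite (through Python's sqlite3 module) and made-up rows; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_table (productid TEXT, snapshot_date TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO snapshot_table VALUES (?, ?, ?)",
    [
        ("X", "2023-01-01", 5),  # older snapshot with the overall smallest quantity
        ("X", "2023-01-02", 9),
        ("X", "2023-01-02", 8),
        ("Z", "2023-01-01", 3),
        ("Z", "2023-01-01", 7),
    ],
)

# The question's query: MAX and MIN are computed independently per product,
# so MIN(quantity) looks at all dates, not only the latest one.
naive = conn.execute(
    """
    SELECT productid, MAX(snapshot_date), MIN(quantity)
    FROM snapshot_table
    GROUP BY productid
    ORDER BY productid
    """
).fetchall()

# Ranking rows per product by latest date first, then smallest quantity.
ranked = conn.execute(
    """
    WITH cte AS (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY productid
                                ORDER BY snapshot_date DESC, quantity ASC) AS rn
      FROM snapshot_table
    )
    SELECT productid, snapshot_date, quantity
    FROM cte
    WHERE rn = 1
    ORDER BY productid
    """
).fetchall()
print(naive)
print(ranked)
```

For product X the GROUP BY version pairs the latest date with a quantity taken from an earlier snapshot, which suggests the original query only appears to work when the smallest quantity happens to fall on the latest date.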

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then. I'm able to generate the list of users, but I'm unable to determine what their first purchase was or to include other variables. The issue appears to be that the columns I'm trying to include are neither grouped nor aggregated, so I'm hoping someone can point me in the right direction, as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchase from this list into a separate row and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND their sum(total_price) > 100? something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
select
o.*,
min(processed_at) over(partition by contact_email, billing_address.name) min_processed_at,
sum(total_price) over(partition by contact_email, billing_address.name) sum_total_price
from (
select
o.*,
row_number() over(partition by id) instance
from orders o
) o
where instance = 1
) o
where
min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
and sum_total_price > 100
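A small sketch of the window-function approach, run on hypothetical orders, shows how every purchase of a qualifying user is kept. It assumes SQLite through Python's sqlite3 module, a flat orders table with no duplicates, partitioning by contact_email only, and a filter on the windowed minimum date rather than on each row's own date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, contact_email TEXT, processed_at TEXT, total_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "2019-01-05", 60.0),   # first purchase inside the window
        (2, "a@example.com", "2019-03-10", 70.0),   # later purchase, total is 130
        (3, "b@example.com", "2018-11-20", 200.0),  # first purchase too early
        (4, "b@example.com", "2019-01-15", 50.0),
        (5, "c@example.com", "2019-01-20", 40.0),   # total does not exceed 100
    ],
)

query = """
SELECT id, contact_email, processed_at, total_price
FROM (
  SELECT o.*,
         MIN(processed_at) OVER (PARTITION BY contact_email) AS min_processed_at,
         SUM(total_price)  OVER (PARTITION BY contact_email) AS sum_total_price
  FROM orders o
) o
WHERE min_processed_at >= '2019-01-01'
  AND min_processed_at <  '2019-02-01'
  AND sum_total_price > 100
ORDER BY id
"""
qualifying_orders = conn.execute(query).fetchall()
for row in qualifying_orders:
    print(row)
```

Filtering on each row's own processed_at instead would drop order 2 and would no longer exclude the user whose first purchase was in November.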
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
--rank each ID by processed_at date first to last
select *, row_number() over(partition by id order by processed_at asc) as rn
from `table.orders`
),
first_criteria as (
-- select IDs where first processed_at date is in 2018-12
select id, processed_at as first_order_date
from ordered_orders
where rn = 1
and extract(year from processed_at) = 2018
and extract(month from processed_at) = 12
),
second_criteria as (
-- further select IDs who meet first criteria and have a total of > 100
select id, sum(total_price) as total_revenue
from ordered_orders
inner join first_criteria using(id)
group by id
having total_revenue > 100
),
orders_with_criteria as (
-- get all orders for users who meet both criteria
select ordered_orders.* except(rn), first_order_date, total_revenue
from ordered_orders
inner join first_criteria using(id)
inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the ID column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc), then I would use that instead of id in the query.
No need to use row_number() in BigQuery for this:
SELECT billing_address.name, contact_email,
MIN(processed_at) as First_Purchase_Date,
SUM(total_price) as Total_Revenue,
ARRAY_AGG(o ORDER BY processed_at LIMIT 1) as first_order
FROM `table.orders` o
-- assumes duplicate rows are already removed; the instance column exists only in the question's subquery
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.

Cumulative distinct count

I'm having trouble getting a cumulative distinct count so let's just assume the below dataset.
DATE RID
1/1/18 1
1/1/18 2
1/1/18 3
1/1/18 3
So if we run this query
SELECT DATE, COUNT(DISTINCT RID) FROM TABLE GROUP BY DATE;
we would expect it to return 3, however let's assume that the data for the next day is as follows.
DATE RID
1/2/18 1
1/2/18 6
1/2/18 9
How would you write a query to get the following results where the data for 1/1/18 is considered when returning the distinct for 1/2/18.
So it would be the following results.
Date Count(*)
1/1/18 3
1/2/18 5 <- 1/1/18 distinct plus 1/2/18 distinct.
Hope that makes sense, keep in mind this is a very large dataset if that changes things.
You can do a cumulative count of the earliest date for each rid:
select mindate, count(*), sum(count(*)) over (order by mindate)
from (select rid, min(date) as mindate
from t
group by rid
) t
group by mindate
order by mindate;
Note: This will be missing any date that is not the mindate of at least one rid. Here is one way to get all the dates, if that is an issue:
select mindate, count(rid), sum(count(rid)) over (order by mindate)
from ((select rid, min(date) as mindate
from t
group by rid
)
union all
(select distinct NULL, date
from t
)
) rd
group by mindate
order by mindate;
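The first-appearance idea can be checked against the question's sample data with the sketch below. It assumes SQLite through Python's sqlite3 module and ISO-formatted dates, and it splits the nested sum(count(*)) into two named steps for readability; that restructuring is a choice made here, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, rid INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("2018-01-01", 1),
        ("2018-01-01", 2),
        ("2018-01-01", 3),
        ("2018-01-01", 3),
        ("2018-01-02", 1),
        ("2018-01-02", 6),
        ("2018-01-02", 9),
    ],
)

query = """
WITH first_seen AS (
  -- each rid is counted once, on the earliest date it appears
  SELECT rid, MIN(date) AS mindate
  FROM t
  GROUP BY rid
),
new_per_day AS (
  SELECT mindate, COUNT(*) AS new_rids
  FROM first_seen
  GROUP BY mindate
)
SELECT mindate,
       new_rids,
       SUM(new_rids) OVER (ORDER BY mindate) AS cumulative_distinct
FROM new_per_day
ORDER BY mindate
"""
cumulative = conn.execute(query).fetchall()
for row in cumulative:
    print(row)
```

Because rid 1 was already seen on the first day, only 6 and 9 are new on the second day, giving the running totals 3 and 5 from the question.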
Below query can give required cumulative distinct count.
--Step 3:
SELECT dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT rid,
dt,
COUNT(CASE WHEN row_num = 1 THEN rid END) OVER (ORDER BY dt ROWS BETWEEN Unbounded PRECEDING AND CURRENT ROW) cum_distinct_cnt
FROM (
--Step 1:
SELECT rid,
dt,
ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
FROM table) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cum_distinct_cnt DESC) = 1
Since your dataset is very large, you can break the above query into the steps marked in its comments and create work tables to populate innerTab1 and innerTab2 before producing the final output.

Finding a date with the largest sum

I have a database of transactions with accounts, profit/loss, and date. I need to find the dates on which the largest profit occurs for each account. I have already found a way to get the actual max/min values, but I can't seem to pull the corresponding date. My code so far is:
Select accountnum, min(ammount)
from table
where date > '02-Jan-13'
group by accountnum
order by accountnum
Ideally I would like to see account num, the min or max, and then the date which this occurred on.
Try something like this to get the min and max amount for each customer and the date it happened.
WITH max_amount as (
SELECT accountnum, max(amount) amount, date
FROM TABLE
GROUP BY accountnum, date
),
min_amount as (
SELECT accountnum, min(amount) amount, date
FROM TABLE
GROUP BY accountnum, date
)
SELECT t.accountnum, ma.amount, ma.date, mi.amount, mi.date
FROM table t
JOIN max_amount ma
ON ma.accountnum = t.accountnum
JOIN min_amount mi
ON mi.accountnum = t.accountnum
If you want the data for just this year you could add a where clause to the end of the statement
WHERE t.date > '02-Jan-13'
The easiest way to do this is using window/analytic functions. These are ANSI standard and most databases support them (Access and MySQL before version 8.0 being two notable exceptions).
Here is one way:
select t.accountnum, min_amount, max_amount,
min(case when amount = min_amount then date end) as min_amount_date,
min(case when amount = max_amount then date end) as max_amount_date
from (Select t.*,
min(amount) over (partition by accountnum) as min_amount,
max(amount) over (partition by accountnum) as max_amount
from table t
where date > '02-Jan-13'
) t
group by accountnum, min_amount, max_amount
order by accountnum;
The subquery calculates the minimum and maximum amount for each account, using min() and max() as window functions. The outer query selects these values. It then uses conditional aggregation to get the first date when each of those values occurred.
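Here is a minimal runnable sketch of the window-plus-conditional-aggregation pattern on hypothetical rows. It assumes SQLite through Python's sqlite3 module, ISO-formatted dates, and a table named transactions (a made-up name, since table is a reserved word).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (accountnum INTEGER, amount REAL, date TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (100, -50.0, "2013-01-05"),
        (100, 200.0, "2013-01-10"),
        (100, -50.0, "2013-02-01"),  # same minimum again on a later date
        (200, 10.0, "2013-01-07"),
        (200, 75.0, "2013-03-01"),
    ],
)

query = """
SELECT accountnum, min_amount, max_amount,
       -- the CASE yields a date only on rows that hold the extreme value,
       -- so MIN picks the earliest date on which that value occurred
       MIN(CASE WHEN amount = min_amount THEN date END) AS min_amount_date,
       MIN(CASE WHEN amount = max_amount THEN date END) AS max_amount_date
FROM (SELECT t.*,
             MIN(amount) OVER (PARTITION BY accountnum) AS min_amount,
             MAX(amount) OVER (PARTITION BY accountnum) AS max_amount
      FROM transactions t
      WHERE date > '2013-01-02') t
GROUP BY accountnum, min_amount, max_amount
ORDER BY accountnum
"""
extremes = conn.execute(query).fetchall()
for row in extremes:
    print(row)
```

Account 100 reaches its minimum twice, and the query reports the earlier of the two dates.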
;with cte as
(
select accountnum, ammount, date,
row_number() over (partition by accountnum order by ammount desc) rn,
max(ammount) over (partition by accountnum) maxamount,
min(ammount) over (partition by accountnum) minamount
from table
where date > '20130102'
)
select accountnum,
ammount as amount,
date as date_of_max_amount,
minamount,
maxamount
from cte where rn = 1