Populate one column based on the previous values of another - sql

I am trying to create a column which populates the transaction ID for every row up until the row where that transaction was completed - in this example every "add to basket" event before an order.
So far I have tried using FIRST_VALUE:
SELECT
  UserID, date, session_id, hitnumber, add_to_basket, transactionid,
  first_value(transactionid) over (partition by trans_part order by date, transactionid) AS t_id
FROM (
  select UserID, date, session_id, hitnumber, add_to_basket, transactionid,
    SUM(CASE WHEN transactionid IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY date, transactionid) AS trans_part,
    FIRST_VALUE(transactionid IGNORE NULLS)
      OVER (PARTITION BY userid ORDER BY hitnumber ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS t_id
  from q1
  join q2 using (session_id)
  order by 1,2,3,4
)
But the result I am getting is the inverse of what I want, populating the transaction ID of the previous order against the basket events which happened after this transaction.
How can I change my code so that I will see the transaction id of the order AFTER the basket events that led up to it? For example, in the table below I want to see the transaction id ending in ...095 instead of the id ending in ...383 for the column t_id.
Based on Gordon's answer below I have also tried:
last_value(transactionid ignore nulls) over(
order by hitnumber
rows between unbounded preceding and current row) as t_id2,
But this is not populating the event rows which precede a transaction with a transaction id (seen below as t_id2):

You can use last_value(ignore nulls):
select . . . ,
       last_value(transactionid ignore nulls) over (
         order by hitnumber
         rows between unbounded preceding and current row
       ) as t_id
from q1 join
     q2 using (session_id);
The difference from your query is the windowing clause, which ends at the current row.
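Note that this fills each row with the id of the most recent transaction at or before it. If the goal is instead to tag the basket events that come before a transaction, as the question describes, one possible variant is to mirror the frame and look forward with first_value(ignore nulls). A sketch, assuming the same column names:
select . . . ,
       first_value(transactionid ignore nulls) over (
         order by hitnumber
         rows between current row and unbounded following
       ) as t_id
from q1 join
     q2 using (session_id);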
EDIT:
It looks like there is one t_id per session_id, so just use max():
select . . . ,
       max(transactionid) over (partition by session_id) as t_id
from q1 join
     q2 using (session_id);

Related

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
Rank #1 is assigned to the ID with the lowest date. The later dates for that particular ID can be higher, but that ID still keeps its incremental ranks ahead of the other IDs.
(E.g. the ADF32 series is ranked first because it has the lowest starting date, even though it runs through 09-Nov; RT659, which starts on 13-Aug, is ranked after it.)
For a particular ID, if the days are consecutive then the ranks are the same; otherwise the rank increases by 1.
For a particular ID, ranks are assigned in ascending date order.
How to formulate a query?
You need two steps:
select
   id_col
   ,dt_col
   ,dense_rank()
    over (order by min_dt, id_col, dt_col - rnk) as part_col
from
 (
   select
      id_col
      ,dt_col
      ,min(dt_col)
       over (partition by id_col) as min_dt
      ,rank()
       over (partition by id_col
             order by dt_col) as rnk
   from tab
 ) as dt
dt_col - rnk calculates the same result for consecutive dates -> same rank
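For intuition, here is the same calculation with the intermediate columns spelled out; the dates in the comments are made up for illustration:
select
   id_col
   ,dt_col
   ,rnk
   ,dt_col - rnk as grp   -- constant within each run of consecutive dates
from
 (
   select
      id_col
      ,dt_col
      ,rank() over (partition by id_col order by dt_col) as rnk
   from tab
 ) as t
-- e.g. 2019-08-01 -> rnk 1 -> grp 2019-07-31
--      2019-08-02 -> rnk 2 -> grp 2019-07-31   (consecutive date: same grp)
--      2019-08-05 -> rnk 3 -> grp 2019-08-02   (gap: new grp, so the dense_rank above advances)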
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL, t.dt_col,
       rank() over (partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from ( SELECT ID_COL, dt_col,
              DATEDIFF(day, Lag(dt_col, 1) OVER (ORDER BY dt_col), dt_col) as date_diff
       FROM table1 ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
       sum(case when prev_dt_col = dt_col - 1 then 0 else 1 end) over
           (order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
             lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
             min(dt_col) over (partition by id_col) as min_dt_col
      from t
     ) t;

Can I create a field that is conditionally assigned by date in SQL?

I'm dealing with some subscription data. When the user upgrades/downgrades, the system overwrites the level of the subscription with the new value. I am trying to assign the historical values when the user has upgraded. My data set looks like the following where one user can upgrade or downgrade multiple times.
I am trying to get what is outlined in the "desired value" column.
Essentially, any transactions that happened before an upgrade should be assigned the "original_product" captured on the upgrade transaction, and transactions that happen after it should be assigned the "new_product" value.
I've been trying to join the data to itself, but I can't find a way to avoid getting multiple rows for each invoice.
You can use window functions:
select t.*,
       coalesce(last_value(case when event = 'Upgrade' then new_product end ignore nulls) over
                    (partition by sub_id order by created),
                first_value(original_product ignore nulls) over
                    (partition by sub_id order by created)
               ) as desired_value
from t;
This gets the most recent new_product from an "Upgrade" row. If that doesn't exist, then it gets the overall original_product.
I think you want first_value():
select
  t.*,
  coalesce(
    first_value(new_product ignore nulls) over(
      order by created desc
      rows between unbounded preceding and current row
    ),
    first_value(original_product ignore nulls) over(
      order by created
      rows between current row and unbounded following
    )
  ) desired_value
from mytable t
The idea is to first try to get the first non-null new_product value on preceding rows (current row included). If there is no such row, then we look up the first non-null original_product in the following rows.
In theory, you would also need a partition by clause on the column that represents the user. Your data shows no sign of such a column though, so I left it out.
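If such a column did exist, each window would simply gain a partition by. A sketch under that assumption, where user_id is a hypothetical column name:
select
  t.*,
  coalesce(
    first_value(new_product ignore nulls) over(
      partition by user_id   -- hypothetical user column
      order by created desc
      rows between unbounded preceding and current row
    ),
    first_value(original_product ignore nulls) over(
      partition by user_id   -- hypothetical user column
      order by created
      rows between current row and unbounded following
    )
  ) desired_value
from mytable t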
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
  IFNULL(
    FIRST_VALUE(original_product IGNORE NULLS) OVER(original_product_lookup),
    FIRST_VALUE(new_product IGNORE NULLS) OVER(new_product_lookup)
  ) AS desired_value
FROM `project.dataset.table`
WINDOW
  original_product_lookup AS (ORDER BY created ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
  new_product_lookup AS (ORDER BY created DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
You can test and play with the above using simplified data from your question (only the used/relevant data points), as in the example below
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 created, NULL original_product, NULL new_product UNION ALL
  SELECT 2, NULL, NULL UNION ALL
  SELECT 3, 'Level 1', 'Level 2' UNION ALL
  SELECT 4, NULL, NULL UNION ALL
  SELECT 5, 'Level 2', 'Level 1' UNION ALL
  SELECT 6, NULL, NULL
)
SELECT *,
  IFNULL(
    FIRST_VALUE(original_product IGNORE NULLS) OVER(original_product_lookup),
    FIRST_VALUE(new_product IGNORE NULLS) OVER(new_product_lookup)
  ) AS desired_value
FROM `project.dataset.table`
WINDOW
  original_product_lookup AS (ORDER BY created ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
  new_product_lookup AS (ORDER BY created DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
ORDER BY created
with result
Row  created  original_product  new_product  desired_value
1    1        null              null         Level 1
2    2        null              null         Level 1
3    3        Level 1           Level 2      Level 2
4    4        null              null         Level 2
5    5        Level 2           Level 1      Level 1
6    6        null              null         Level 1
I was able to solve it with a combination of the answers:
SELECT e.*,
  coalesce(
    last_value(case when event IN ('Upgrade', 'Downgrade', 'Crossgrade') then new_product end ignore nulls)
      over (partition by subscription order by created),
    first_value(original_product ignore nulls) over (
      order by created
      rows between current row and unbounded following
    )
  ) desired_value
FROM e

How to create a calculated column in Google BigQuery?

I have data in Google BigQuery like this
id           yearmonth  value
00007BR0011  201705     8.0
00007BR0011  201701     3.0
and I need to create a table that shows, per id, the subtraction between the two yearmonth values, producing something like this
id           value
00007BR0011  5
The value 5 is the value in 201705 minus the value in 201701.
I am using standard SQL, but I don't know how to create the column with the calculation.
Sorry in advance if this is too basic, but I haven't found anything useful yet.
Perhaps a single table/result set would work for your purposes:
select id,
       (max(case when yearmonth = 201705 then value end) -
        max(case when yearmonth = 201701 then value end)
       ) as value
from t
where yearmonth in (201705, 201701)
group by id;
It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
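As a rough sketch of those two options (the destination names below are placeholders):
-- Save the results as a new table
CREATE OR REPLACE TABLE `your-project.your_dataset.value_diffs` AS
SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id;

-- Or, to have it recomputed on the fly as the underlying data changes:
-- CREATE OR REPLACE VIEW `your-project.your_dataset.value_diffs` AS <the same SELECT>;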
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the lower date is what matters (and there are always two), another way to do this would be to use analytic (or window) functions to select the value with the lowest date:
SELECT DISTINCT
  id,
  (
    FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    -
    LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  ) AS new_value
FROM
  `your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT DISTINCT
  id,
  (
    FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    -
    (
      SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      -
      FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    )
  ) AS new_value
FROM
  `your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
  id,
  (
    ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
    -
    (
      SUM(value)
      -
      ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
    )
  ) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id

SQL query on redshift to get the first and the last value

I have a data set like this.
I need to write a query which gives me the below output
For every SessionID and VisitID, it should sort based on the date_time column and provide me with the first Category and the last Category.
I have used the following code
rank() OVER (PARTITION BY SessionID, VisitID
             ORDER BY date_Time DESC) as click_rank_last
where click_rank_last = 1
to get the last Category. But what I need is to get the first and the last in a single query with minimum hits to the database, as the data is huge and querying is costly.
Need the most optimum query!
One way would be:
select distinct
       sessionid,
       visitid,
       first_value(category) over (
         partition by sessionid, visitid
         order by date_time
         rows between unbounded preceding and unbounded following) as first_category,
       last_value(category) over (
         partition by sessionid, visitid
         order by date_time
         rows between unbounded preceding and unbounded following) as last_category
from tbl

SQL - Window function to get values from previous row where value is not null

I am using Exasol. In other DBMSs it is possible to use analytical functions such as LAST_VALUE() and specify a windowing clause for the ORDER BY within the OVER() function, like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
The same does not happen if, instead of AND 1 PRECEDING, I use CURRENT ROW.
Basically, what I want is to get the last value according to the ORDER BY that is NOT the current row. In this example it would be the $customer of the previous row.
I know that I could use LAG(customer, 1) OVER (...), but the problem is that I want the previous customer that is NOT null, so the offset is not always 1...
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
                           order by (case when customer is not null then 1 else 0 end),
                                    date
                           )
You can do this with two steps:
select t.*,
       max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
             max(case when customer is not null then date end) over
                 (partition by id order by date) as max_date
      from t
     ) t;
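To make the two steps concrete, here is a small trace with made-up rows for a single id (dates d1 < d2 < d3 < d4):
date  customer  max_date  max_customer
d1    A         d1        A
d2    null      d1        A     (grouped with d1 via the running max_date, so it inherits A)
d3    null      d1        A
d4    B         d4        B
The inner query carries the date of the most recent non-null customer forward as max_date, and the outer max() then copies that customer onto every row in the same (id, max_date) group.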