SQL Show the previous value until the given value changes - sql

I have a table ordered by ID with column Value.
| ID | Value |
| -------- | -------------- |
| 1 | 50 |
| 2 | 50 |
| 3 | 62 |
| 4 | 62 |
| 5 | 62 |
| 6 | 79 |
| 7 | 90 |
| 8 | 90 |
I would like to create another column Prev_Value that for each row of column Value takes the previous/preceding number that differs from the current row value, as in the table below.
Output table:
| ID | Value |Prev_Value |
| -------- | -------------- |---------------|
| 1 | 50 |NULL |
| 2 | 50 |NULL |
| 3 | 62 |50 |
| 4 | 62 |50 |
| 5 | 62 |50 |
| 6 | 79 |62 |
| 7 | 90 |79 |
| 8 | 90 |79 |
Should I use modified LAG() function, the CROSS APPLY or nested CASE and what approach would be the most time-efficient? Any help would be appreciated.
Here are some references that unfortunately do not solve my problem:
LAG(offset) until value is reached in BigQuery and
SQL Server : select distinct until the value is changed

One method uses apply:
select t.*, tprev.value
from t outer apply
(select top (1) tprev.*
from t tprev
where tprev.value <> t.value and
tprev.id < t.id
order by tprev.id desc
) tprev;
The above is not the most efficient method on a large dataset. For that, I would suggest getting the first time that a value changes and marking that.
select t.*,
max(case when prev_value <> value then prev_value end) over (partition by grp) as prev_value
from (select t.*,
sum(case when prev_value = value then 0 else 1 end) over (order by id) as grp
from (select t.*,
lag(value) over (order by id) as prev_value
from t
) t
) t;
Here is a db<>fiddle.
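As a rough, runnable check of the second query, the sketch below loads the sample rows into an in-memory SQLite database through Python's `sqlite3` module. SQLite stands in for the asker's database here (window functions need SQLite 3.25 or later), and the names `lag_value`, `s1`, and `s2` are illustrative, not taken from the answer.

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, value integer)")
conn.executemany(
    "insert into t values (?, ?)",
    [(1, 50), (2, 50), (3, 62), (4, 62), (5, 62), (6, 79), (7, 90), (8, 90)],
)

query = """
select id, value,
       -- only the first row of each run has lag_value <> value,
       -- so max() spreads that single value over the whole run
       max(case when lag_value <> value then lag_value end)
           over (partition by grp) as prev_value
from (select id, value, lag_value,
             -- running count of value changes = run number
             sum(case when lag_value = value then 0 else 1 end)
                 over (order by id) as grp
      from (select id, value,
                   lag(value) over (order by id) as lag_value
            from t) s1) s2
order by id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first run has no preceding value, so its `prev_value` stays NULL (`None` in Python), which matches the requested output table.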

Here is one way to do it. (Check it live on this fiddle)
select
t1.*,
t2.value prev_value
from
t t1
left join t t2 on t2.id = (select max(id) from t where value <> t1.value and id < t1.id)

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this, I need a record for each user for each day. If a day is missing for a user, the last recorded score should be carried forward, so I would have something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using StandardSQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
If applied to the sample data from your question, the output has one row per user per day, with each missing day carrying the last recorded score.
One option uses generate_date_array() to create the series of dates of each user, then brings in the table with a left join. This assumes that the date column has the DATE type.
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select r.user_id, date
from (
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) r
cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as date
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
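The idea behind the last query (expand each record up to the day before the same user's next record) can be sketched outside BigQuery as well. The plain-Python version below is only an illustration of that logic on the sample rows from the question; `fill_missing_days` and `records` are hypothetical names, and it is not a substitute for running the SQL in BigQuery.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, user_id, score).
records = [
    (date(2020, 11, 20), 1, 26),
    (date(2020, 11, 21), 1, 14),
    (date(2020, 11, 25), 1, 0),
    (date(2020, 11, 14), 2, 32),
    (date(2020, 11, 16), 2, 0),
    (date(2020, 11, 20), 2, 23),
]


def fill_missing_days(rows):
    """Repeat each score for every day up to the day before the user's next record."""
    by_user = {}
    for day, user_id, score in rows:
        by_user.setdefault(user_id, []).append((day, score))

    filled = []
    for user_id in sorted(by_user):
        history = sorted(by_user[user_id])
        for i, (day, score) in enumerate(history):
            # A user's last record covers only its own day.
            if i + 1 < len(history):
                stop = history[i + 1][0]
            else:
                stop = day + timedelta(days=1)
            while day < stop:
                filled.append((day, user_id, score))
                day += timedelta(days=1)
    return filled


filled = fill_missing_days(records)
for day, user_id, score in filled:
    print(day.strftime("%Y%m%d"), user_id, score)
```

On the sample data this yields 13 rows, the same count as the desired table in the question.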

Combining multiple rows into a single row SQL

I have a table like this.
|InvID| Client | Group | PricedDate | TotalFee | RepricedFee | CompanyFee|
|1 | A | A.1 | 02-24-2020 | 100 | 80 | 8 |
|1 | A | A.1 | 01-05-2020 | 100 | 75 | 1 |
|2 | A | A.1 | 01-09-2020 | 100 | 60 | 1 |
|3 | B | B.1 | 01-11-2020 | 150 | 95 | 10 |
|4 | B | B.1 | 01-01-2020 | 100 | 55 | 11 |
|4 | B | B.1 | 02-01-2020 | 100 | 90 | 10 |
I need to display a single row based on the latest PricedDate and Sum of Company Fee
|InvID| Client | Group | PricedDate | TotalFee | RepricedFee | CompanyFee|
|1 | A | A.1 | 02-24-2020 | 100 | 80 | 9 |
|2 | A | A.1 | 01-09-2020 | 100 | 60 | 1 |
|3 | B | B.1 | 01-11-2020 | 150 | 95 | 10 |
|4 | B | B.1 | 02-01-2020 | 100 | 90 | 21 |
Is it the latest row per InvID you want? I would probably just get the maximum date and the sum in an aggregation query and then join that row:
select
t.invid,
t.client,
t.[group],
t.priceddate,
t.totalfee,
t.repricedfee,
agg.sum_fee as companyfee
from
(
select invid, max(priceddate) as max_date, sum(companyfee) as sum_fee
from mytable
group by invid
) agg
join mytable t on t.invid = agg.invid and t.priceddate = agg.max_date
order by t.invid;
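Just do aggregation. Note that each aggregate is computed independently, so max(Totalfee) and min(repricedFee) need not come from the row with the latest PricedDate: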
select invId,client,[group],max(priceddate),max(Totalfee),min(repricedFee),sum(companyfee)
from table
group by invId,client,[group]
Try it like this:
select *
, (select sum(mt3.CompanyFee) from my_table mt3 where mt3.InvID = mt1.InvID) SumCompanyFee
from my_table mt1
where mt1.PricedDate = (select max(mt2.PricedDate)
from my_table mt2
where mt2.InvID = mt1.InvID);
This part will make sure your data is from the row that has the largest PricedDate:
mt1.PricedDate = (select max(mt2.PricedDate)
from my_table mt2
where mt2.InvID = mt1.InvID)
Also, if InvID alone is not enough to identify a group, you can add the other columns to the correlated conditions.
Here is a demo
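Try this,
-- @CompanyFee holds the sum over the whole table, not per InvID
declare @CompanyFee int = (select sum(CompanyFee) from table1);
select InvID,Client,[Group],PricedDate,TotalFee,RepricedFee,@CompanyFee as CompanyFee from table1
where priceddate=(select max(priceddate) from table1)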
Try this.
select *
from my_table mt1
cross apply (
select CompanyFee=sum(CompanyFee) from my_table mt3 where mt3.invid=mt1.invid
) as CompanyFeeTbl
where mt1.PricedDate = (select max(mt2.PricedDate)
from my_table mt2
where mt2.InvID = mt1.InvID)
You can use window functions:
select t.InvID, t.Client, t.[Group], t.PricedDate,
t.TotalFee, t.RepricedFee, t.SumCompanyFee as CompanyFee
from(select t.*, sum(t.companyfee) over (partition by t.client, t.invId) as SumCompanyFee,
row_number() over (partition by t.client, t.invId order by t.PricedDate desc) as seq
from table t
) t
where seq = 1;
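To see the window-function approach reproduce the requested result, here is a small sketch that runs it against the sample rows in an in-memory SQLite database via Python's `sqlite3`. SQLite is only a stand-in for the asker's database, the table name `invoices` is made up for the example, and the dates are stored as ISO strings so that text ordering is chronological.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table invoices (InvID int, Client text, "Group" text, '
    "PricedDate text, TotalFee int, RepricedFee int, CompanyFee int)"
)
# Sample rows from the question, with dates rewritten as yyyy-mm-dd.
conn.executemany(
    "insert into invoices values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "A", "A.1", "2020-02-24", 100, 80, 8),
        (1, "A", "A.1", "2020-01-05", 100, 75, 1),
        (2, "A", "A.1", "2020-01-09", 100, 60, 1),
        (3, "B", "B.1", "2020-01-11", 150, 95, 10),
        (4, "B", "B.1", "2020-01-01", 100, 55, 11),
        (4, "B", "B.1", "2020-02-01", 100, 90, 10),
    ],
)

query = """
select InvID, Client, "Group", PricedDate, TotalFee, RepricedFee, SumCompanyFee
from (select i.*,
             -- total fee per invoice, repeated on every row of the invoice
             sum(CompanyFee) over (partition by InvID) as SumCompanyFee,
             -- 1 marks the row with the latest PricedDate per invoice
             row_number() over (partition by InvID order by PricedDate desc) as seq
      from invoices i) ranked
where seq = 1
order by InvID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Partitioning by `InvID` alone is enough for the sample data; the answer above also partitions by client, which gives the same result here.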

How to de-duplicate SQL table rows by multiple columns with hierarchy?

I have a table with multiple records for each patient.
My end goal is a table that is 1-to-1 between Patient_id and Value.
I would like to de-duplicate (in respect to patient_id) my rows based on "a hierarchical series of aggregate functions" (if someone has a better way to phrase this, I'd appreciate that as well.)
+----+------------+------------+------------+----------+-----------------+-------+
| ID | patient_id | Date | Date2 | Priority | Source | Value |
+----+------------+------------+------------+----------+-----------------+-------+
| 1 | 1 | 2017-09-09 | 2018-09-09 | 1 | 'verified' | 55 |
| 2 | 1 | 2017-09-09 | 2018-11-11 | 2 | 'verified' | 78 |
| 3 | 1 | 2017-11-11 | 2018-09-09 | 3 | 'verified' | 23 |
| 4 | 1 | 2017-11-11 | 2018-11-11 | 1 | 'self_reported' | 11 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 2 | 'self_reported' | 90 |
| 5 | 1 | 2017-09-09 | 2018-09-09 | 3 | 'self_reported' | 34 |
| 6 | 2 | 2017-11-11 | 2018-09-09 | 2 | 'self_reported' | 21 |
+----+------------+------------+------------+----------+-----------------+-------+
For each patient_id, I would like to get the row(s) that has/have the MAX(Date). In the case that there are still duplicated patient_id, I would like to get the row(s) with the MIN(Priority). In the case that there are still duplicated rows I would like to get the row(s) with the MIN(Date2).
The way I've approached this problem is using a series of queries like this to de-duplicate on the columns one at a time.
SELECT *
FROM #table t1
LEFT JOIN
(SELECT
patient_id,
MIN(priority) AS min_priority
FROM #table
GROUP BY patient_id) t2 ON t2.patient_id = t1.patient_id
WHERE t2.min_priority = t1.priority
Is there a way to do this that allows me to de-dup on multiple columns at once? Is there a more elegant way to do this?
I'm able to get my results, but my solution feels very inefficient, and I keep running into this. Thank you for any input.
You could use row_number(), if your RDBMS supports it:
select ID, patient_id, Date, Date2, Priority, Source, Value
from (
select
t.*,
row_number() over(partition by patient_id order by Date desc, Priority, Date2) rn
from mytable t
) x where rn = 1
Another option is to filter with a correlated subquery that sorts the record according to your criteria, like so:
select t.*
from mytable t
where id = (
select id
from mytable t1
where t1.patient_id = t.patient_id
order by t1.Date desc, t1.Priority, t1.Date2
limit 1
)
The actual syntax for limit varies across RDBMSs (SQL Server, for example, uses TOP (1) instead).
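The `row_number()` variant can be tried on the sample rows with the sketch below, which uses an in-memory SQLite database through Python's `sqlite3` as a stand-in for the asker's RDBMS. The three tie-breakers map directly onto the `order by` list: latest `Date`, then lowest `Priority`, then earliest `Date2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (ID int, patient_id int, Date text, Date2 text, "
    "Priority int, Source text, Value int)"
)
# Sample rows from the question (ID 5 appears twice there as well).
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "2017-09-09", "2018-09-09", 1, "verified", 55),
        (2, 1, "2017-09-09", "2018-11-11", 2, "verified", 78),
        (3, 1, "2017-11-11", "2018-09-09", 3, "verified", 23),
        (4, 1, "2017-11-11", "2018-11-11", 1, "self_reported", 11),
        (5, 1, "2017-09-09", "2018-09-09", 2, "self_reported", 90),
        (5, 1, "2017-09-09", "2018-09-09", 3, "self_reported", 34),
        (6, 2, "2017-11-11", "2018-09-09", 2, "self_reported", 21),
    ],
)

query = """
select ID, patient_id, Value
from (select t.*,
             -- rank rows per patient by the three criteria, in order
             row_number() over (partition by patient_id
                                order by Date desc, Priority, Date2) as rn
      from mytable t) ranked
where rn = 1
order by patient_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For patient 1, two rows share the latest `Date`, and the one with `Priority` 1 wins, so the third criterion is never consulted on this data.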

How to flatten a table from row to columns

I use MariaDB 10.2.21
I have not seen this exact case elsewhere, hence my request for assistance.
I have a History table containing one record per change to any of the fields in a JIRA issue:
+----------+---------------+----------+-----------------+---------------------+
| IssueKey | OriginalValue | NewValue | Field | ChangeDate |
+----------+---------------+----------+-----------------+---------------------+
| HRSK-184 | (NULL) | 2 | Risk Detection | 2019-10-24 10:57:27 |
| HRSK-184 | (NULL) | 2 | Risk Occurrence | 2019-10-24 10:57:27 |
| HRSK-184 | (NULL) | 2 | Risk Severity | 2019-10-24 10:57:27 |
| HRSK-184 | 2 | 4 | Risk Detection | 2019-10-25 11:54:07 |
| HRSK-184 | 2 | 6 | Risk Detection | 2019-10-25 11:54:07 |
| HRSK-184 | 2 | 3 | Risk Severity | 2019-10-24 11:54:07 |
| HRSK-184 | 6 | 5 | Risk Detection | 2019-10-26 09:11:01 |
+----------+---------------+----------+-----------------+---------------------+
Every record contains the old and new value and the fieldtype that has changed ('Field') and, of course, the corresponding timestamp of that change.
I want to query the point-in-time status, giving me the combination of the most recent values of each of the fields 'Risk Severity', 'Risk Occurrence' and 'Risk Detection'.
The result should be like this:
+----------+----------------+-------------------+------------------+----------------------+
| IssueKey | Risk Severity | Risk Occurrence | Risk Detection | ChangeDate |
+----------+----------------+-------------------+------------------+----------------------+
| HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01 |
+----------+----------------+-------------------+------------------+----------------------+
Any ideas? I'm stuck...
Thanks in advance for your effort!
You could use a couple of inline queries
select
IssueKey,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Severity'
order by ChangeDate desc limit 1
) `Risk Severity`,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Occurrence'
order by ChangeDate desc limit 1
) `Risk Occurrence`,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Detection'
order by ChangeDate desc limit 1
) `Risk Detection`,
max(ChangeDate) ChangeDate
from mytable t
group by IssueKey
With an index on (IssueKey, Field, ChangeDate, NewValue), this should be an efficient option.
Demo on DB Fiddle:
IssueKey | Risk Severity | Risk Occurrence | Risk Detection | ChangeDate
:------- | ------------: | --------------: | -------------: | :------------------
HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01
MariaDB 10.2 has introduced some Window Functions for analytical queries.
One of them is RANK() OVER (PARTITION BY ...ORDER BY...) function.
Firstly, you can apply it, and then pivot through Conditional Aggregation :
SELECT IssueKey,
MAX(CASE WHEN Field = 'Risk Severity' THEN NewValue END ) AS RiskSeverity,
MAX(CASE WHEN Field = 'Risk Occurrence' THEN NewValue END ) AS RiskOccurrence,
MAX(CASE WHEN Field = 'Risk Detection' THEN NewValue END ) AS RiskDetection,
MAX(ChangeDate) AS ChangeDate
FROM
(
SELECT RANK() OVER (PARTITION BY IssueKey, Field ORDER BY ChangeDate Desc) rnk,
t.*
FROM mytable t
) t
WHERE rnk = 1
GROUP BY IssueKey;
IssueKey | RiskSeverity | RiskOccurrence | RiskDetection | ChangeDate
-------- + --------------+-----------------+----------------+--------------------
HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01
Demo
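The rank-then-pivot query can be reproduced with the sketch below, which loads the sample history rows into an in-memory SQLite database via Python's `sqlite3`. SQLite is used here only as a convenient stand-in for MariaDB 10.2, and the table name `mytable` follows the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (IssueKey text, OriginalValue int, NewValue int, "
    "Field text, ChangeDate text)"
)
# Sample rows from the question; None stands for (NULL).
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?)",
    [
        ("HRSK-184", None, 2, "Risk Detection", "2019-10-24 10:57:27"),
        ("HRSK-184", None, 2, "Risk Occurrence", "2019-10-24 10:57:27"),
        ("HRSK-184", None, 2, "Risk Severity", "2019-10-24 10:57:27"),
        ("HRSK-184", 2, 4, "Risk Detection", "2019-10-25 11:54:07"),
        ("HRSK-184", 2, 6, "Risk Detection", "2019-10-25 11:54:07"),
        ("HRSK-184", 2, 3, "Risk Severity", "2019-10-24 11:54:07"),
        ("HRSK-184", 6, 5, "Risk Detection", "2019-10-26 09:11:01"),
    ],
)

query = """
select IssueKey,
       max(case when Field = 'Risk Severity' then NewValue end) as RiskSeverity,
       max(case when Field = 'Risk Occurrence' then NewValue end) as RiskOccurrence,
       max(case when Field = 'Risk Detection' then NewValue end) as RiskDetection,
       max(ChangeDate) as ChangeDate
from (select t.*,
             -- rank 1 = most recent change per issue and field
             rank() over (partition by IssueKey, Field
                          order by ChangeDate desc) as rnk
      from mytable t) ranked
where rnk = 1
group by IssueKey
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because `rank()` keeps ties, two changes to the same field with an identical latest timestamp would both survive the filter, and `max()` would then pick the larger value; `row_number()` with an extra tie-breaker is one way to avoid that.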

SQL-Server query to select last and previous information for multiple columns

After looking on Stack Overflow I can't find a solution to this problem.
I'm using this query:
SELECT *
FROM(
SELECT DISTINCT *
FROM Table_01
ORDER BY ID, StartDate
UNION ALL(
SELECT DISTINCT * FROM Table_02
ORDER BY ID, StartDate
)
UNION ALL (...
) a ORDER BY a.ID, a.StartDate
I get something like this. For each ID, I would like to keep the last and previous dates and the other columns, to record a history.
+------+------------+-----------+-------+-------+
| ID | StartDate | EndDate | Value | rate |
+------+------------+-----------+-------+-------+
| 1 | 2018-06-29 |2018-10-22 | 15 | 77.2 |
| 1 | 2018-04-28 |2018-06-21 | 23 | 55.3 |
| 1 | 2018-02-24 |2018-04-15 | 41 | 44.3 |
| 1 | 2017-06-29 |2017-11-29 | 55 | 44.1 |
| 2 | 2018-07-29 |2018-11-22 | 15 | 106.1 |
| 2 | 2018-03-28 |2018-07-21 | 23 | 10.8 |
| 2 | 2017-12-28 |2018-03-28 | 22 | 11.0 |
| 3 | 2017-09-28 |2018-01-28 | 11 | 87.09 |
| 3 | 2017-06-27 |2018-09-28 | 58 | 100 |
| ... | ... | ... | ... | ... |
+------+------------+-----------+-------+-------+
And I would like to have the next table, to keep the previous information
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+
| ID | StartDate | EndDate | StartDateP | EndDateP | Value | rate | ValueP| rateP |
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+
| 1 | 2018-06-29 |2018-10-22 | 2018-04-28 |2018-06-21 | 15 | 77.2 | 23 | 55.3 |
| 2 | 2018-07-29 |2018-11-22 | 2018-03-28 |2018-07-21 | 15 | 106.1 | 23 | 10.8 |
| 3 | 2017-09-28 |2018-01-28 | 2017-06-27 |2018-09-28 | 11 | 87.09 | 58 | 100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
+------+------------+-----------+------------+-----------+-------+--------+-------+--------+
If I understand you correctly you want the row with the latest start date combined with the row with the startdate just before that? This might do the trick
WITH results AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY StartDate DESC) r
FROM (
-- start of your original query
SELECT DISTINCT *
FROM Table_01
ORDER BY ID, StartDate
UNION ALL
(
SELECT DISTINCT *
FROM Table_02
ORDER BY ID, StartDate
)
UNION ALL
(...) a
ORDER BY a.ID, a.StartDate
-- end of your original query
)
)
SELECT
r1.id, r1.startDate, r1.enddate,
r2.startDate startDateP, r2.enddate enddateP,
r1.value, r1.rate,
r2.value valueP, r2.rate rateP
FROM results r1
LEFT JOIN results r2 ON r2.id = r1.id AND r2.r = 2
WHERE r1.r = 1
Another option is using Row_Number() in concert with a conditional aggregation
Example
Select ID
,StartDate = max(case when RN=1 then StartDate end)
,EndDate = max(case when RN=1 then EndDate end)
,StartDateP = max(case when RN=2 then StartDate end)
,EndDateP = max(case when RN=2 then EndDate end)
,Value = max(case when RN=1 then Value end)
,Rate = max(case when RN=1 then Rate end)
,ValueP = max(case when RN=2 then Value end)
,RateP = max(case when RN=2 then Rate end)
From (
Select *
,RN = Row_Number() over (Partition By ID Order by StartDate Desc)
From YourTable
) A
Group By ID
This returns one row per ID, with the latest and the previous values side by side.
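The conditional-aggregation approach can be checked against the sample rows with the sketch below, run in an in-memory SQLite database through Python's `sqlite3`. SQLite is a stand-in for SQL Server here, so the `alias = expression` column syntax is written as `expression as alias`; the table name `history` is made up for the example, and rows are ranked by `StartDate`, which is what the desired output in the question implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table history (ID int, StartDate text, EndDate text, Value int, Rate real)"
)
# Sample rows from the question.
conn.executemany(
    "insert into history values (?, ?, ?, ?, ?)",
    [
        (1, "2018-06-29", "2018-10-22", 15, 77.2),
        (1, "2018-04-28", "2018-06-21", 23, 55.3),
        (1, "2018-02-24", "2018-04-15", 41, 44.3),
        (1, "2017-06-29", "2017-11-29", 55, 44.1),
        (2, "2018-07-29", "2018-11-22", 15, 106.1),
        (2, "2018-03-28", "2018-07-21", 23, 10.8),
        (2, "2017-12-28", "2018-03-28", 22, 11.0),
        (3, "2017-09-28", "2018-01-28", 11, 87.09),
        (3, "2017-06-27", "2018-09-28", 58, 100),
    ],
)

query = """
select ID,
       max(case when rn = 1 then StartDate end) as StartDate,
       max(case when rn = 1 then EndDate end) as EndDate,
       max(case when rn = 2 then StartDate end) as StartDateP,
       max(case when rn = 2 then EndDate end) as EndDateP,
       max(case when rn = 1 then Value end) as Value,
       max(case when rn = 1 then Rate end) as Rate,
       max(case when rn = 2 then Value end) as ValueP,
       max(case when rn = 2 then Rate end) as RateP
from (select h.*,
             -- 1 = latest row per ID, 2 = the row just before it
             row_number() over (partition by ID order by StartDate desc) as rn
      from history h) ranked
group by ID
order by ID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

An ID with only one row would simply get NULL in all of the `...P` columns, since no row has `rn = 2`.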