SQL Server SQL - From and To Min and Max to remove overlap

I am trying to implement SCD type 2 using the data set below. As you can see, there are multiple records with the same class (highlighted in red), and I wish to combine them into one. Basically, when this occurs, I want to take the min of ValidFrom and the max of ValidTo for the same class.
Is this achievable in SQL? I am using SQL Server 2014, so the lead or lag function could be used for this, I guess, but what if there are more than 2 consecutive records with the same class?
And lastly, I want to set the last record's ValidTo to NULL.
Any help would be appreciated!

Your sample data has holes but no overlaps. If this is generally the case, then this is not too hard:
select productid, class, min(validfrom) as validfrom,
lead(min(validfrom)) over (partition by productid order by min(validfrom)) as validto
from (select scd.*,
row_number() over (partition by productid, class order by validfrom) as seqnum_pc,
row_number() over (partition by productid order by validfrom) as seqnum_c
from scd
) s
group by productid, class, (seqnum_c - seqnum_pc);
Understanding how this works requires "getting" how the difference in row numbers identifies the group of adjacent class values. My advice is to run the subquery (perhaps on a subset of your data) to see how the difference works.
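To see the row-number difference in action, here is a minimal sketch that runs the same subquery through Python's sqlite3 module. SQLite (3.25 or later, for window functions) is only a stand-in for SQL Server here, and the four sample rows are invented for illustration. As a simplification, the outer query takes max(validto) per group instead of the lead() shown above, and it exposes the difference as a grp column so the islands are visible.

```python
import sqlite3


def collapse_islands():
    """Collapse adjacent rows with the same class into one validity range."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table scd (productid int, class text, validfrom text, validto text)"
    )
    # Hypothetical sample rows: class A twice in a row, then B, then A again.
    con.executemany(
        "insert into scd values (?, ?, ?, ?)",
        [
            (1, "A", "2020-01-01", "2020-01-31"),
            (1, "A", "2020-02-01", "2020-02-29"),
            (1, "B", "2020-03-01", "2020-03-31"),
            (1, "A", "2020-04-01", "2020-04-30"),
        ],
    )
    return con.execute(
        """
        select productid, class, seqnum_c - seqnum_pc as grp,
               min(validfrom) as validfrom, max(validto) as validto
        from (select scd.*,
                     row_number() over (partition by productid, class
                                        order by validfrom) as seqnum_pc,
                     row_number() over (partition by productid
                                        order by validfrom) as seqnum_c
              from scd
             ) s
        group by productid, class, seqnum_c - seqnum_pc
        order by min(validfrom)
        """
    ).fetchall()


for row in collapse_islands():
    print(row)
```

The two leading A rows share the difference 0, so they collapse into one range; the later A row gets the difference 1 and stays a separate range.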

Related

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row, and that is usually not desirable. In Oracle, you can tweak this using the keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you "can't get it to work (still returns multiple rows per single order_id)" is kind of difficult to believe. The way you put it, it can't be true.
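The "mixed values" point is easy to check on a tiny table. The sketch below uses Python's sqlite3 module as a stand-in database (SQLite 3.25 or later is assumed for row_number()); the table t and its three rows are made up for illustration.

```python
import sqlite3


def one_row_per_order():
    """Compare max() per column with picking the most recent row."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (order_id int, ship_address text, created_at text)")
    # Hypothetical rows: order 1 has two spellings of its address.
    con.executemany(
        "insert into t values (?, ?, ?)",
        [
            (1, "Zeta St", "2021-01-01"),
            (1, "Alpha Ave", "2021-02-01"),
            (2, "Main Rd", "2021-01-15"),
        ],
    )
    # Aggregation: each column gets its own maximum, possibly from different rows.
    mixed = con.execute(
        """
        select order_id, max(ship_address), max(created_at)
        from t
        group by order_id
        order by order_id
        """
    ).fetchall()
    # row_number(): all columns come from the single most recent row.
    latest = con.execute(
        """
        select order_id, ship_address, created_at
        from (select t.*,
                     row_number() over (partition by order_id
                                        order by created_at desc) as seqnum
              from t
             ) x
        where seqnum = 1
        order by order_id
        """
    ).fetchall()
    return mixed, latest


mixed, latest = one_row_per_order()
print(mixed)
print(latest)
```

Both queries return one row per order_id, but for order 1 the aggregate version pairs the older address with the newer date, while the row_number() version keeps the address that belongs to the newest row.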

Case when in Rank partition By

I've been relearning SQL, but I'm not sure if this code can work. Can someone please provide feedback or an alternative for this case?
Overall, I'm looking for duplicate orders: orders submitted on the same day, at different times, by the same user.
I was thinking that for the second step I would rank them, so that a second row with the same date but a different time would be ranked two?
Select * ( including orderDate)
RANK() OVER(PARTITION BY
Customer,
case
when (Orderstart(Datetimestamp) > OrderEnd(Datetimestamp) and OrderEnd<Orderstart ) AS Rank_Items
From FirstStep
This just ranks everything, going up to 500+ ranks.
Sample Data
Desired Result:
I would use row_number() to identify multiple orders by the same customer on the same date:
row_number() over (partition by customer, cast(orderdatetime as date) order by orderdatetime)
The cast-to-date might vary by database.
This enumerates the orders for a customer on a given date, which seems to be what you want to accomplish.
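As a rough illustration of that expression, the sketch below runs it through Python's sqlite3 module. SQLite's date() plays the role of the cast-to-date, the orders table and its rows are hypothetical, and SQLite 3.25 or later is assumed.

```python
import sqlite3


def number_orders_per_day():
    """Number each customer's orders within a calendar day."""
    con = sqlite3.connect(":memory:")
    con.execute("create table orders (customer text, orderdatetime text)")
    # Hypothetical rows: ann orders twice on 2022-05-01.
    con.executemany(
        "insert into orders values (?, ?)",
        [
            ("ann", "2022-05-01 09:00:00"),
            ("ann", "2022-05-01 17:30:00"),
            ("ann", "2022-05-02 08:00:00"),
            ("bob", "2022-05-01 10:00:00"),
        ],
    )
    return con.execute(
        """
        select customer, orderdatetime,
               row_number() over (partition by customer, date(orderdatetime)
                                  order by orderdatetime) as seqnum
        from orders
        order by customer, orderdatetime
        """
    ).fetchall()


for row in number_orders_per_day():
    print(row)
```

Any row with seqnum greater than 1 is a repeat order by the same customer on the same date, and the numbering restarts each day instead of climbing into the hundreds.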

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest error.
If there are multiple measurements with the same error, the one that was created most recently is considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?
Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.
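The tie-breaking can be checked with a small sketch. It runs the same CTE through Python's sqlite3 module rather than Oracle (SQLite 3.25 or later assumed), with column types adapted to SQLite and four invented measurements, two of which tie on the lowest error.

```python
import sqlite3


def best_measurements():
    """Pick the lowest-error row per category, newest first on ties."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table measurements (category text, value real, error real, created text)"
    )
    # Hypothetical rows: two 'temp' measurements tie on error 0.2.
    con.executemany(
        "insert into measurements values (?, ?, ?, ?)",
        [
            ("temp", 20.1, 0.5, "2020-01-01"),
            ("temp", 20.3, 0.2, "2020-01-02"),
            ("temp", 20.4, 0.2, "2020-01-05"),
            ("mass", 5.0, 0.1, "2020-01-03"),
        ],
    )
    return con.execute(
        """
        with cte as (
            select m.*,
                   row_number() over (partition by category
                                      order by error, created desc) as rn
            from measurements m
        )
        select category, value, error, created
        from cte
        where rn = 1
        order by category
        """
    ).fetchall()


for row in best_measurements():
    print(row)
```

For the temp category, the two rows with error 0.2 tie on the first sort key, so created DESC decides and the measurement from 2020-01-05 is returned.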

SQL 'partition by order by' turns count() into rank()?

I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares, count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15
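count() with an order by does a cumulative count. If the shares values are unique within a company, the result matches row_number(). With ties, it depends on the window frame (rows between or range between): under the default frame, range between unbounded preceding and current row, tied rows are peers and each gets the count up to and including the last of them, so the result resembles rank() but shows the highest position in each tie rather than the lowest.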
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
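The cumulative behaviour can be seen on company b from the test data, which has a tie at 10 shares. This is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or later assumed); SQLite applies the standard default frame when order by is present, and other databases may need checking.

```python
import sqlite3


def running_count_vs_rank():
    """Show count() over (... order by shares) next to rank() for company b."""
    con = sqlite3.connect(":memory:")
    con.execute("create table testdata (ID int, company text, shares int)")
    # Rows for company b, taken from the test data above.
    con.executemany(
        "insert into testdata values (?, ?, ?)",
        [(1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30)],
    )
    return con.execute(
        """
        select ID, shares,
               count(ID) over (partition by company order by shares) as newvar,
               rank() over (partition by company order by shares) as rnk
        from testdata
        order by shares, ID
        """
    ).fetchall()


for row in running_count_vs_rank():
    print(row)
```

The two rows with 10 shares both get newvar = 2 (the count through the end of the tie), while rank() gives them 1; from the next row on, the two columns agree.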

SQL Server: I have multiple records per day and I want to return only the first of the day

I have some records that track inquiries by DATETIME. There is a glitch in the system, and sometimes a record is entered multiple times on the same day. I have a query with a bunch of correlated subqueries attached to these, but the numbers are off because, when those glitches occur, the leads show up multiple times. I need the first entry of the day. I tried fooling around with MIN, but I couldn't quite get it to work.
I currently have this, I am not sure if I am on the right track though.
SELECT SL.UserID, MIN(SL.Added) OVER (PARTITION BY SL.UserID)
FROM SourceLog AS SL
Here's one approach using row_number():
select *
from (
select *,
row_number() over (partition by userid, cast(added as date) order by added) rn
from sourcelog
) t
where rn = 1
You could use group by along with min to accomplish this.
Depending on how your data is structured, if you are assigning a unique sequential number to each record created, you could just return the lowest number created per day. Otherwise you would need to return the ID of the record with the earliest DATETIME value per day.
--Assumes sequential IDs
select
min(Id)
from
[YourTable]
group by
--the conversion is used to strip the time value out of the date/time
convert(date, [YourDateTime])
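To compare the two approaches on the same data, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later assumed, with date() in place of the cast or convert). The rows are invented, and the group by variant here groups by userid as well as the date and takes min(added), which is an adaptation of the query above rather than a copy of it.

```python
import sqlite3


def first_entry_per_day():
    """Return each user's first log entry per day, computed two ways."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table sourcelog (id integer primary key, userid int, added text)"
    )
    # Hypothetical rows: user 7 was logged twice within seconds on 2023-03-01.
    con.executemany(
        "insert into sourcelog (userid, added) values (?, ?)",
        [
            (7, "2023-03-01 08:15:00"),
            (7, "2023-03-01 08:15:02"),
            (7, "2023-03-02 11:00:00"),
            (9, "2023-03-01 09:30:00"),
        ],
    )
    by_row_number = con.execute(
        """
        select userid, added
        from (select s.*,
                     row_number() over (partition by userid, date(added)
                                        order by added) as rn
              from sourcelog s
             ) t
        where rn = 1
        order by userid, added
        """
    ).fetchall()
    by_group = con.execute(
        """
        select userid, min(added)
        from sourcelog
        group by userid, date(added)
        order by userid, min(added)
        """
    ).fetchall()
    return by_row_number, by_group


by_row_number, by_group = first_entry_per_day()
print(by_row_number)
print(by_group)
```

On this data both queries return the same three rows. The row_number() form is the one to reach for when other columns from the first row are needed too, since the group by form only yields the grouped and aggregated values.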