Case when in Rank partition By - sql

I've been relearning SQL, but I'm not sure whether this can be done. Can someone please provide feedback or an alternative for this case?
Overall, I'm looking for duplicate orders: orders submitted by the same user on the same day at different times.
For the second step, I was thinking of ranking the rows by date and time, so that a second order on the same day would get rank 2.
Select * ( including orderDate)
RANK() OVER(PARTITION BY
Customer,
case
when (Orderstart(Datetimestamp) > OrderEnd(Datetimestamp) and OrderEnd<Orderstart ) AS Rank_Items
From FirstStep
This just ranks everything, going up to 500+ ranks.

I would use row_number() to identify multiple orders by the same customer on the same date:
row_number() over (partition by customer, cast(orderdatetime as date) order by orderdatetime)
The cast-to-date might vary by database.
This enumerates the orders for a customer on a given date, which seems to be what you want to accomplish.
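A minimal sketch of that idea, run through Python's sqlite3 module so it can be tried without a database server. The orders table, its rows, and the seqnum alias are made up for illustration, and SQLite's date() stands in for the cast to date (window functions need SQLite 3.25 or later). Any row with seqnum greater than 1 is an additional order by the same customer on the same day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, only for illustration.
conn.execute("CREATE TABLE orders (customer TEXT, orderdatetime TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", "2023-01-05 09:00:00"),
        ("alice", "2023-01-05 14:30:00"),  # second order on the same day
        ("alice", "2023-01-06 10:00:00"),
        ("bob", "2023-01-05 11:00:00"),
    ],
)

# Number each customer's orders within a calendar day, earliest first,
# then keep only the rows that are not the first order of that day.
dup_rows = conn.execute(
    """
    SELECT customer, orderdatetime, seqnum
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (
                   PARTITION BY customer, date(orderdatetime)
                   ORDER BY orderdatetime
               ) AS seqnum
        FROM orders o
    )
    WHERE seqnum > 1
    ORDER BY customer, orderdatetime
    """
).fetchall()
print(dup_rows)
```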

Partition by order of partition columns

I have a window function applied that looks like this:
SUM(value) OVER (
PARTITION BY product, service, site
ORDER BY region, site, service, product, year, week ASC
ROWS BETWEEN 12 PRECEDING AND 0 PRECEDING
) AS value
The query works fine, but I want to understand window functions better. I have two questions:
1. Does the order of the partition columns matter (PARTITION BY product, service, site)?
2. Do I need to specify the columns from point 1 in the ORDER BY clause, or can I omit them?
Does the partition columns order matter?
No.
It makes no difference whether you have PARTITION BY product, service, site or PARTITION BY site, service, product or any other ordering of those three columns.
The members of a partition will be the rows that share the same values for all three of the columns and it doesn't matter how you order them.
Do I need to specify columns from point 1 in ORDER BY clause or can I omit them?
You don't need to specify the columns.
Don't do
PARTITION BY product, service, site
ORDER BY region, site, service, product, year, week
Just do
PARTITION BY product, service, site
ORDER BY region, year, week
All of the rows within a partition will have the same values for all of product, service, site so it doesn't add anything to include these in the ordering.
The query is clearer with these "no-op" elements removed.
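Both points can be checked on a toy table. The sketch below uses Python's sqlite3 module; the sales table, its rows, and the rolling_sum helper are hypothetical, and the frame is shortened to 1 PRECEDING so the effect is visible on five rows. It assumes, as in the question, that rows within a partition share one region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, only for illustration.
conn.execute(
    "CREATE TABLE sales (product TEXT, service TEXT, site TEXT, "
    "region TEXT, year INTEGER, week INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("p1", "s1", "x", "north", 2023, 1, 10),
        ("p1", "s1", "x", "north", 2023, 2, 20),
        ("p1", "s1", "x", "north", 2023, 3, 30),
        ("p2", "s1", "x", "north", 2023, 1, 5),
        ("p2", "s1", "x", "north", 2023, 2, 7),
    ],
)


def rolling_sum(partition_by, order_by):
    """Run the rolling SUM with the given PARTITION BY / ORDER BY lists."""
    sql = f"""
        SELECT product, service, site, year, week,
               SUM(value) OVER (
                   PARTITION BY {partition_by}
                   ORDER BY {order_by}
                   ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
               ) AS rolling
        FROM sales
        ORDER BY product, service, site, year, week
    """
    return conn.execute(sql).fetchall()


long_order = "region, site, service, product, year, week"
original = rolling_sum("product, service, site", long_order)
reordered = rolling_sum("site, service, product", long_order)
simplified = rolling_sum("product, service, site", "region, year, week")

# All three variants return identical rows.
print(original == reordered == simplified)
```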
The partition by clause specifies the granularity. In your case it is saying look at all the rows of a given product for a given service at a given site.
Since it only defines the granularity, the order of the columns doesn't matter, so any of these clauses is equivalent:
PARTITION BY product, service, site
PARTITION BY service, site, product
PARTITION BY site, product, service
The ORDER BY clause, on the other hand, is really important. You have to think carefully about how you want the function to operate. For example:
If you are using one of the ranking functions row_number(), dense_rank(), or rank():
row_number() over(partition by user order by sales desc)
generates a number starting from 1 within each partition. In this case the numbering restarts for each user, and because the ordering is desc, the row with the highest sales gets 1; with asc, the row with the lowest sales gets 1. The same applies to the other ranking functions.
first_value() works the same way: it fetches the first value of a given column according to the ORDER BY. If you want the user_id with the lowest sales, use first_value() with ORDER BY sales ASC. With DESC it returns the id with the highest sales, which is the value you might otherwise look for with last_value() over an ascending order.
I would suggest creating a small data set and playing with the ORDER BY clause for each function you are interested in.
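As one such small experiment, here is a sketch using Python's sqlite3 module. The user_sales table, its region column, and the rows are invented for illustration; the sales values are distinct so there are no ties to worry about. Flipping ASC to DESC is all it takes to switch first_value() from the lowest seller to the highest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, only for illustration.
conn.execute("CREATE TABLE user_sales (user_id TEXT, region TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO user_sales VALUES (?, ?, ?)",
    [
        ("u1", "east", 100),
        ("u2", "east", 300),
        ("u3", "east", 200),
        ("u4", "west", 50),
        ("u5", "west", 80),
    ],
)

# Same function, same partition; only the sort direction differs.
rows = conn.execute(
    """
    SELECT DISTINCT region,
           FIRST_VALUE(user_id) OVER (
               PARTITION BY region ORDER BY sales ASC
           ) AS lowest_seller,
           FIRST_VALUE(user_id) OVER (
               PARTITION BY region ORDER BY sales DESC
           ) AS highest_seller
    FROM user_sales
    ORDER BY region
    """
).fetchall()
print(rows)
```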

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (it still returns multiple rows per order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row and that is usually not desirable. In Oracle, you can tweak this using the keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you
can't get it to work (still returns multiple rows per single order_id)
is kind of difficult to believe. The way you put it, it can't be true.
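The difference between the two approaches shows up on a tiny example. This is a sketch using Python's sqlite3 module; the table t and its rows are invented, and the addresses are chosen so that the alphabetical maximum is not the most recent one. The row_number() query returns a real row, while the MAX() query stitches together values from different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, only for illustration.
conn.execute("CREATE TABLE t (order_id INTEGER, ship_address TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "Zeta Road 5", "2023-01-01"),
        (1, "Alpha Ave 9", "2023-02-01"),  # most recent row for order 1
        (2, "Oak Ln 3", "2023-01-15"),
    ],
)

# One real row per order_id: the most recent one.
latest = conn.execute(
    """
    SELECT order_id, ship_address, created_at
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY created_at DESC
               ) AS seqnum
        FROM t
    )
    WHERE seqnum = 1
    ORDER BY order_id
    """
).fetchall()

# One row per order_id, but each column is maximised independently.
aggregated = conn.execute(
    """
    SELECT order_id, MAX(ship_address), MAX(created_at)
    FROM t
    GROUP BY order_id
    ORDER BY order_id
    """
).fetchall()

print(latest)
print(aggregated)
```

For order 1 the aggregated row pairs the older address with the newer date, a combination that never existed in the table.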

SQL 'partition by order by' turns count() into rank()?

I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15
count() with an order by does a cumulative count. The result looks like rank() or row_number(); how ties in the shares value are counted depends on the window frame (rows between or range between) that the database applies when none is given.
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
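To see the cumulative count next to rank(), here is a sketch using Python's sqlite3 module with companies a and b from the test data above. SQLite is an assumption here; like the SQL standard, it defaults to a RANGE frame ending at the current row when only ORDER BY is given, so tied rows are counted together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (ID INTEGER, company TEXT, shares INTEGER)")
# Companies a and b from the question's test data.
conn.executemany(
    "INSERT INTO testdata VALUES (?, ?, ?)",
    [
        (1, "a", 10), (2, "a", 20), (3, "a", 70),
        (1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30),
    ],
)

rows = conn.execute(
    """
    SELECT company, shares,
           COUNT(ID) OVER (PARTITION BY company) AS total,
           COUNT(ID) OVER (PARTITION BY company ORDER BY shares) AS running,
           RANK() OVER (PARTITION BY company ORDER BY shares) AS rnk
    FROM testdata
    ORDER BY company, shares
    """
).fetchall()
for row in rows:
    print(row)
```

In company b the two rows with 10 shares both get a running count of 2 while rank() gives them 1, so under this default frame the cumulative count only coincides with rank() when there are no ties.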

SQL Server SQL - From and To Min and Max to remove overlap

I am trying to implement SCD type 2 using the data set below. As you can see, there are multiple consecutive records with the same class (highlighted in red), and I want to combine them into one. Basically, when this occurs, I want to take the min of ValidFrom and the max of ValidTo for the same class.
Is this achievable in SQL? I am using SQL Server 2014, so the lead or lag functions could be used for this, I guess, but what if there are more than two consecutive records with the same class?
And lastly, I want to set the last record's ValidTo to NULL.
Any help would be appreciated!
Your sample data has holes but no overlaps. If this is generally the case, then this is not too hard:
select productid, class, min(validfrom) as validfrom,
lead(min(validfrom)) over (partition by productid order by min(validfrom)) as validto
from (select scd.*,
row_number() over (partition by productid, class order by validfrom) as seqnum_pc,
row_number() over (partition by productid order by validfrom) as seqnum_c
from scd
) s
group by productid, class, (seqnum_c - seqnum_pc);
Understanding how this works requires "getting" how the difference in row numbers identifies the group of adjacent class values. My advice is to run the subquery (perhaps on a subset of your data) to see how the difference works.
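Following that advice, here is a sketch using Python's sqlite3 module on four invented rows for one product. It is not the exact SQL Server query: the grouping is nested in a derived table before lead() is applied, purely to keep the example portable, and the scd rows are hypothetical. The first result shows the row-number difference per row, the second the merged periods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: class A twice in a row, then B, then A again.
conn.execute(
    "CREATE TABLE scd (productid INTEGER, class TEXT, validfrom TEXT, validto TEXT)"
)
conn.executemany(
    "INSERT INTO scd VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01", "2020-01-31"),
        (1, "A", "2020-02-01", "2020-02-29"),
        (1, "B", "2020-03-01", "2020-03-31"),
        (1, "A", "2020-04-01", "2020-04-30"),
    ],
)

# The subquery from the answer: two row numbers per row.
numbered = """
    SELECT scd.*,
           ROW_NUMBER() OVER (PARTITION BY productid, class ORDER BY validfrom) AS seqnum_pc,
           ROW_NUMBER() OVER (PARTITION BY productid ORDER BY validfrom) AS seqnum_c
    FROM scd
"""

# The difference is constant within each run of adjacent equal classes.
diffs = conn.execute(
    f"SELECT class, validfrom, seqnum_c - seqnum_pc AS grp "
    f"FROM ({numbered}) ORDER BY validfrom"
).fetchall()
print(diffs)

# Collapse each run to its first validfrom, then take the next run's
# start as validto; the last run gets NULL.
merged = conn.execute(
    f"""
    SELECT productid, class, validfrom,
           LEAD(validfrom) OVER (PARTITION BY productid ORDER BY validfrom) AS validto
    FROM (
        SELECT productid, class, MIN(validfrom) AS validfrom
        FROM ({numbered})
        GROUP BY productid, class, seqnum_c - seqnum_pc
    )
    ORDER BY productid, validfrom
    """
).fetchall()
print(merged)
```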

SQL Server: I have multiple records per day and I want to return only the first of the day

I have some records that track inquiries by DATETIME. There is a glitch in the system, and sometimes a record is entered multiple times on the same day. I have a query with a bunch of correlated subqueries attached to these, but the numbers are off because those glitches make the leads show up multiple times. I need the first entry of the day. I tried fooling around with MIN but couldn't quite get it to work.
I currently have this, though I am not sure I am on the right track.
SELECT SL.UserID, MIN(SL.Added) OVER (PARTITION BY SL.UserID)
FROM SourceLog AS SL
Here's one approach using row_number():
select *
from (
select *,
row_number() over (partition by userid, cast(added as date) order by added) rn
from sourcelog
) t
where rn = 1
You could use group by along with min to accomplish this.
How you do this depends on how your data is structured. If you assign a unique sequential number to each record as it is created, you can just return the lowest number created per day. Otherwise you would need to return the ID of the record with the earliest DATETIME value per day.
--Assumes sequential IDs
select
min(Id)
from
[YourTable]
group by
--the conversion is used to strip the time value out of the date/time
convert(date, [YourDateTime])
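The two suggestions can be compared side by side. This is a sketch using Python's sqlite3 module: the sourcelog rows are invented, the id column is assumed to be sequential, SQLite's date() replaces convert(date, ...), and userid is added to the grouping so that each user gets their own first entry per day. Under those assumptions both queries pick the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; id is assumed to grow with insertion order.
conn.execute("CREATE TABLE sourcelog (id INTEGER PRIMARY KEY, userid TEXT, added TEXT)")
conn.executemany(
    "INSERT INTO sourcelog (userid, added) VALUES (?, ?)",
    [
        ("u1", "2023-03-01 08:00:00"),
        ("u1", "2023-03-01 08:05:00"),  # duplicate entry from the glitch
        ("u1", "2023-03-02 09:00:00"),
        ("u2", "2023-03-01 10:00:00"),
    ],
)

# GROUP BY approach: lowest id per user per day, joined back for the full row.
by_min_id = conn.execute(
    """
    SELECT s.id, s.userid, s.added
    FROM sourcelog s
    JOIN (
        SELECT MIN(id) AS first_id
        FROM sourcelog
        GROUP BY userid, date(added)
    ) f ON f.first_id = s.id
    ORDER BY s.id
    """
).fetchall()

# row_number() approach: earliest timestamp per user per day.
by_row_number = conn.execute(
    """
    SELECT id, userid, added
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (
                   PARTITION BY userid, date(added) ORDER BY added
               ) AS rn
        FROM sourcelog s
    )
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(by_min_id == by_row_number)
```

If ids are not guaranteed to follow the timestamps, the row_number() version is the safer of the two because it orders by the DATETIME itself.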