I have a window function applied that looks like this:
SUM(value) OVER (
PARTITION BY product, service, site
ORDER BY region, site, service, product, year, week ASC
ROWS BETWEEN 12 PRECEDING AND 0 PRECEDING
) AS value
The query is working fine, but I want to understand the window function better. I have two questions:
Does the order of the partition columns matter (PARTITION BY product, service, site)?
Do I need to specify the columns from point 1 in the ORDER BY clause, or can I omit them?
Does the partition columns order matter?
No.
It makes no difference whether you have PARTITION BY product, service, site or PARTITION BY site, service, product or any other ordering of those three columns.
The members of a partition will be the rows that share the same values for all three of the columns and it doesn't matter how you order them.
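To check this yourself, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in database, and the sales table and its rows are made up for illustration. It runs the same running sum with two different partition column orders and compares the results.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
# Hypothetical table, only for illustration.
conn.execute(
    "CREATE TABLE sales (product TEXT, service TEXT, site TEXT, week INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "s1", "x", 1, 10),
        ("p1", "s1", "x", 2, 20),
        ("p1", "s2", "x", 1, 5),
        ("p2", "s1", "y", 1, 7),
        ("p2", "s1", "y", 2, 3),
    ],
)

query = """
SELECT product, service, site, week,
       SUM(value) OVER (PARTITION BY {cols} ORDER BY week) AS running
FROM sales
ORDER BY product, service, site, week
"""

# Same three partition columns, listed in two different orders.
a = conn.execute(query.format(cols="product, service, site")).fetchall()
b = conn.execute(query.format(cols="site, service, product")).fetchall()

print(a == b)  # True: both queries return identical rows
```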
Do I need to specify columns from point 1 in the ORDER BY clause or can I omit them?
You don't need to specify the columns.
Don't do
PARTITION BY product, service, site
ORDER BY region, site, service, product, year, week
Just do
PARTITION BY product, service, site
ORDER BY region, year, week
All of the rows within a partition will have the same values for all of product, service, site so it doesn't add anything to include these in the ordering.
The query is clearer with these "no-op" elements removed.
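A small sketch of why the extra columns are no-ops, assuming SQLite via Python's sqlite3 module and a made-up table t with the same column names as in the question. The frame is written as CURRENT ROW, which is what 0 PRECEDING means for a ROWS frame.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
# Hypothetical table and data, only for illustration.
conn.execute(
    "CREATE TABLE t (region TEXT, product TEXT, service TEXT, site TEXT, "
    "year INTEGER, week INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("eu", "p1", "s1", "x", 2023, 51, 1),
        ("eu", "p1", "s1", "x", 2023, 52, 2),
        ("eu", "p1", "s1", "x", 2024, 1, 4),
        ("us", "p2", "s1", "y", 2024, 1, 8),
        ("us", "p2", "s1", "y", 2024, 2, 16),
    ],
)

template = """
SELECT product, service, site, year, week,
       SUM(value) OVER (
           PARTITION BY product, service, site
           ORDER BY {order_cols}
           ROWS BETWEEN 12 PRECEDING AND CURRENT ROW
       ) AS rolling
FROM t
ORDER BY product, service, site, year, week
"""

# ORDER BY with the partition columns repeated, and without them.
long_form = conn.execute(
    template.format(order_cols="region, site, service, product, year, week")
).fetchall()
short_form = conn.execute(
    template.format(order_cols="region, year, week")
).fetchall()

print(long_form == short_form)  # True: the partition columns add nothing to the ordering
```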
The partition by clause specifies the granularity. In your case it is saying look at all the rows of a given product for a given service at a given site.
Since it only defines the granularity, the order of the columns doesn't matter, so any of these clauses is equivalent:
PARTITION BY product, service, site
PARTITION BY service, site, product
PARTITION BY site, product, service
The ORDER BY clause, on the other hand, is really important. You have to think carefully about how you want the function to operate. For example:
If you are using one of the ranking functions row_number(), dense_rank() or rank():
row_number() over(partition by user order by sales desc)
would generate a rank starting from 1 within each partition. In this case the numbering restarts for each user and follows the sales value; because the order is desc, rank 1 goes to the highest sales and the ranks run from higher to lower (with asc it is the other way round). The same applies to all of the functions mentioned above.
first_value() works the same way: it fetches the first value of a given column according to the ORDER BY. If you would like to get the user_id with the lowest sales, use first_value() with ORDER BY ... ASC. With DESC it gives you the id that has the highest sales instead, i.e. the row that comes last in ascending order.
I would suggest creating a small data set and playing with the ORDER BY clause for all the functions that you are interested in.
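As one such small data set, here is a sketch assuming SQLite via Python's sqlite3 module. The sales table, the region column used for partitioning, and the rows are all hypothetical. It shows how the direction of ORDER BY changes what row_number() and first_value() return.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per user, partitioned by a made-up region column.
conn.execute("CREATE TABLE sales (user_id INTEGER, region TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "north", 100),
        (2, "north", 300),
        (3, "north", 200),
        (4, "south", 50),
        (5, "south", 75),
    ],
)

result = conn.execute(
    """
    SELECT user_id, region, sales,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS rn_desc,
           FIRST_VALUE(user_id) OVER (PARTITION BY region ORDER BY sales ASC) AS lowest_seller,
           FIRST_VALUE(user_id) OVER (PARTITION BY region ORDER BY sales DESC) AS top_seller
    FROM sales
    ORDER BY region, sales DESC
    """
).fetchall()

for row in result:
    print(row)
# (2, 'north', 300, 1, 1, 2)
# (3, 'north', 200, 2, 1, 2)
# (1, 'north', 100, 3, 1, 2)
# (5, 'south', 75, 1, 4, 5)
# (4, 'south', 50, 2, 4, 5)
```

Flipping ASC and DESC swaps which user first_value() picks, while the partition decides where the numbering restarts.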
Related
I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (it still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
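A runnable sketch of the row_number() approach, assuming SQLite via Python's sqlite3 module. The table t and its rows are made up, and only the column names follow the query above.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
# Hypothetical data: two spellings of the address per order.
conn.execute("CREATE TABLE t (order_id INTEGER, ship_address TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "12 Main St", "2024-01-01"),
        (1, "12 Main Street", "2024-03-01"),
        (2, "9 Elm Rd", "2024-02-01"),
        (2, "9 Elm Road", "2024-01-15"),
    ],
)

# Number the rows of each order from newest to oldest, then keep number 1.
latest = conn.execute(
    """
    SELECT order_id, ship_address, created_at
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY created_at DESC) AS seqnum
          FROM t
         ) AS ranked
    WHERE seqnum = 1
    ORDER BY order_id
    """
).fetchall()

print(latest)
# [(1, '12 Main Street', '2024-03-01'), (2, '9 Elm Rd', '2024-02-01')]
```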
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row, and that is usually not desirable. You can tweak this using Oracle's keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
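To see the "mixing" concretely, here is a sketch assuming SQLite via Python's sqlite3 module, with a made-up table t holding two rows for a single order. The aggregated result combines the address of one row with the date of the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: the older row has the alphabetically larger address.
conn.execute("CREATE TABLE t (order_id INTEGER, ship_address TEXT, created_at TEXT)")
source_rows = [
    (1, "Zeta Road 5", "2024-01-01"),
    (1, "Alpha Street 5", "2024-03-01"),
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", source_rows)

# MAX is evaluated per column, independently of the other columns.
mixed = conn.execute(
    """
    SELECT order_id, MAX(ship_address), MAX(created_at)
    FROM t
    GROUP BY order_id
    """
).fetchall()

print(mixed)                    # [(1, 'Zeta Road 5', '2024-03-01')]
print(mixed[0] in source_rows)  # False: no such row exists in the table
```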
Also, saying that you
can't get it to work (still returns multiple rows per single order_id)
is kind of difficult to believe. The way you put it, it can't be true.
I've been relearning SQL, but I'm not sure if this can be done. Can someone please provide feedback or an alternative for this case?
Overall, I'm looking for any duplicate orders that were submitted on the same day, at a different time, by the same user.
For the second step, I was thinking I would rank them to find out whether there is another row, based on the time and date, that would be ranked two?
Select * ( including orderDate)
RANK() OVER(PARTITION BY
Customer,
case
when (Orderstart(Datetimestamp) > OrderEnd(Datetimestamp) and OrderEnd<Orderstart ) AS Rank_Items
From FirstStep
This is just ranking everything now going up to 500+ ranks.
Sample Data
Desired Result:
I would use row_number() to identify multiple orders by the same customer on the same date:
row_number() over (partition by customer, cast(orderdatetime as date) order by orderdatetime)
The cast-to-date might vary by database.
This enumerates the orders for a customer on a given date, which seems to be what you want to accomplish.
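A sketch of this, assuming SQLite via Python's sqlite3 module, where DATE() plays the role of the cast-to-date. The orders table and its rows are hypothetical, since the sample data from the question is not available here.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
# Hypothetical table and data, only for illustration.
conn.execute("CREATE TABLE orders (customer TEXT, orderdatetime TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", "2024-05-01 09:00:00"),
        ("alice", "2024-05-01 17:30:00"),
        ("alice", "2024-05-02 10:00:00"),
        ("bob", "2024-05-01 12:00:00"),
    ],
)

numbered = conn.execute(
    """
    SELECT customer, orderdatetime,
           ROW_NUMBER() OVER (
               PARTITION BY customer, DATE(orderdatetime)
               ORDER BY orderdatetime
           ) AS seqnum
    FROM orders
    ORDER BY customer, orderdatetime
    """
).fetchall()

# Any row with seqnum > 1 is a repeat order by the same customer on the same day.
repeats = [row for row in numbered if row[2] > 1]
print(repeats)  # [('alice', '2024-05-01 17:30:00', 2)]
```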
I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares, count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15
count() with an order by does a cumulative count. It will end up looking like either rank() or row_number(), depending on ties in the shares value and on how the database handles a missing window frame (rows between or range between).
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
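The cumulative behaviour can be reproduced with the test data from the question. This is a sketch assuming SQLite via Python's sqlite3 module; other databases may pick a different default frame, so treat the RANGE column as SQLite's behaviour. Company b is used because it has a tie (two rows with 10 shares).

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (ID INTEGER, company TEXT, shares INTEGER)")
conn.executemany(
    "INSERT INTO testdata VALUES (?, ?, ?)",
    [
        (1, "a", 10), (2, "a", 20), (3, "a", 70),
        (1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30),
        (2, "c", 80), (3, "c", 10), (7, "c", 10),
        (1, "d", 20), (2, "d", 30), (3, "d", 25), (6, "d", 10), (7, "d", 15),
    ],
)

rows_b = conn.execute(
    """
    SELECT ID, shares,
           COUNT(ID) OVER (PARTITION BY company) AS total,
           COUNT(ID) OVER (PARTITION BY company ORDER BY shares) AS running_default,
           COUNT(ID) OVER (PARTITION BY company ORDER BY shares
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_rows
    FROM testdata
    WHERE company = 'b'
    ORDER BY shares, ID
    """
).fetchall()

for row in rows_b:
    print(row)
# total is 4 on every row.
# running_default counts tied rows together: 2, 2, 3, 4.
# running_rows counts row by row: the two tied rows get 1 and 2 in some order, then 3, 4.
```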
I am having issues with a query, I want it to rank the result based on the time the last change was recorded.
SELECT
ROW_NUMBER() OVER (PARTITION BY ph.pricingHistoryId ORDER BY ph.changeRecorded DESC),
ph.*
FROM
PriceHistory ph
It returns all 1 for the ranking.
If pricingHistoryId is the primary key, partitioning by it always returns the rank as 1, because there cannot be duplicate primary keys: every partition contains exactly one row.
The row number is applied to each partition and reset for the next partition. You need to "partition over" the group you want numbered. If you want one sequence over the entire result set, remove the "PARTITION BY ph.pricingHistoryId" entirely and just keep the "ORDER BY" part.
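A sketch of both variants, assuming SQLite via Python's sqlite3 module. The PriceHistory rows are invented; only the table and column names come from the question.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PriceHistory (pricingHistoryId INTEGER PRIMARY KEY, changeRecorded TEXT)"
)
# Hypothetical rows, only for illustration.
conn.executemany(
    "INSERT INTO PriceHistory VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-02-01"), (3, "2024-01-15")],
)

# Partitioning by the primary key: every partition holds a single row.
by_pk = conn.execute(
    """
    SELECT ph.pricingHistoryId,
           ROW_NUMBER() OVER (PARTITION BY ph.pricingHistoryId
                              ORDER BY ph.changeRecorded DESC) AS rn
    FROM PriceHistory ph
    ORDER BY ph.pricingHistoryId
    """
).fetchall()

# No PARTITION BY: one sequence over the whole result set.
overall = conn.execute(
    """
    SELECT ph.pricingHistoryId,
           ROW_NUMBER() OVER (ORDER BY ph.changeRecorded DESC) AS rn
    FROM PriceHistory ph
    ORDER BY rn
    """
).fetchall()

print(by_pk)    # [(1, 1), (2, 1), (3, 1)]
print(overall)  # [(2, 1), (3, 2), (1, 3)]
```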
Tell me some big differences between ORDER BY and GROUP BY.
For example: ORDER BY sorts column data.
GROUP BY groups similar data for use in aggregation, and ORDER BY can be used on the grouped items.
Please tell me 5 differences.
The order by clause is used to order your data set. For example,
select *
from customers
order by customer_id asc
will give you a list of customers in order of customer id from lowest to highest.
The group by clause is used to aggregate your data. For example,
select customer_id, sum(sale_price), max(sale_price)
from customers
group by customer_id
order by customer_id asc
will give you each customer along with their total sales and maximum sale, again ordered by customer id.
In other words, grouping allows you to combine multiple rows from the database into a single output row, based on some criteria, and select functions of those fields not involved in the grouping (minimum, maximum, total, average and so on).
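The difference in output shape is easy to see on a tiny table. This is a sketch assuming SQLite via Python's sqlite3 module, with a made-up customers table that holds one row per sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two sales for each of two customers.
conn.execute("CREATE TABLE customers (customer_id INTEGER, sale_price INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(2, 30), (1, 10), (1, 25), (2, 5)],
)

# ORDER BY only sorts: the number of rows stays the same.
ordered = conn.execute(
    "SELECT customer_id, sale_price FROM customers ORDER BY customer_id ASC, sale_price ASC"
).fetchall()

# GROUP BY collapses each customer's rows into one output row.
grouped = conn.execute(
    """
    SELECT customer_id, SUM(sale_price), MAX(sale_price)
    FROM customers
    GROUP BY customer_id
    ORDER BY customer_id ASC
    """
).fetchall()

print(ordered)  # [(1, 10), (1, 25), (2, 5), (2, 30)]
print(grouped)  # [(1, 35, 25), (2, 35, 30)]
```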
group by groups data by one or more columns, and order by orders the data by one or more columns? i don't really get the question?
using group by is similar to select distinct in the respect that only unique combinations of the given columns will be returned. furthermore you can use aggregate functions to calculate e.g. the sum for each group.
what do you want to hear? tell me five differences between apples and oranges?