SQL 'partition by order by' turns count() into rank()? - sql

I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares, count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15

count() with an order by does a cumulative count. It is going to make the value look like either rank() or row_number(), depending on ties in the shares value and how the database handles a missing window frame (rows between or range between).
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
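To see the cumulative count on the question's own data, here is a minimal sketch that loads the test rows into an in-memory SQLite database through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later); the column aliases default_frame, rows_frame and rnk are made up for this illustration. It compares count() with no explicit frame, count() with an explicit rows frame, and rank() for company b, which has two tied rows with 10 shares.

```python
import sqlite3

# Test data copied from the question (ID;company;shares).
ROWS = [
    (1, "a", 10), (2, "a", 20), (3, "a", 70),
    (1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30),
    (2, "c", 80), (3, "c", 10), (7, "c", 10),
    (1, "d", 20), (2, "d", 30), (3, "d", 25), (6, "d", 10), (7, "d", 15),
]

# Aliases below are illustrative names, not part of the original question.
QUERY = """
select ID, shares,
       count(ID) over (partition by company order by shares) as default_frame,
       count(ID) over (partition by company order by shares
                       rows between unbounded preceding and current row) as rows_frame,
       rank() over (partition by company order by shares) as rnk
from testdata
where company = 'b'
order by shares, ID
"""


def run_demo():
    """Load the test data into an in-memory database and run the comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table testdata (ID integer, company text, shares integer)")
    conn.executemany("insert into testdata values (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

In SQLite the missing frame defaults to a range frame, so the two tied rows both get a count of 2, while rank() gives them both 1. With the explicit rows frame the tied rows get 1 and 2 in an arbitrary order, which is what row_number() would do. Other databases may behave differently, as noted above.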

Related

How to Group By column, while keeping a naming column in as well

I am trying to show the most popular TV show in each country. However, the resulting table outputs multiple shows from the same country if I include the column that has the show's name. If I don't include this column, it correctly outputs the MAX for each country, but without the show name. Can I include both?
This is the script that gets the result I want without the names.
SELECT
origin_country, MAX(popularity) as Most_popular
FROM TV_data
WHERE origin_country not like '%(%'
GROUP BY origin_country
order by Most_popular DESC
This is the script that results in multiple shows from the same country, since the name column is grouped as well.
SELECT
origin_country, name, MAX(popularity) as Most_popular
FROM TV_data
WHERE origin_country not like '%(%'
GROUP BY origin_country, name
order by Most_popular DESC
Thank you, still learning SQL so any advice is greatly appreciated.
Your idea is correct to GROUP BY origin_country and use MAX to find the highest popularity per country.
All you need to do now is put this in a subquery, build a main query that shows the other columns too, and JOIN the two:
SELECT
tv1.origin_country,
tv1.name,
tv1.popularity Most_Popular
FROM tv_data tv1
JOIN (
SELECT origin_country, MAX(popularity) popularity
FROM tv_data
GROUP BY origin_country) tv2
ON tv1.origin_country = tv2.origin_country
AND tv1.popularity = tv2.popularity
WHERE tv1.origin_country NOT LIKE '%(%'
ORDER BY tv1.popularity DESC;
The above query should run on virtually any DB.
Today, DBs usually provide window functions as another, maybe easier, option. The exact syntax depends on the DB you use, since functions often differ between Oracle, MySQL, etc.
Here is an example for SQL Server using RANK:
SELECT
origin_country,
name,
popularity Most_Popular
FROM (SELECT origin_country,
name,
popularity,
RANK() OVER(PARTITION BY origin_country ORDER BY popularity DESC) dest_rank
FROM tv_data) sub
WHERE dest_rank = 1
AND origin_country NOT LIKE '%(%'
ORDER BY popularity DESC;
The PARTITION BY clause works like the GROUP BY in the first query.
If you change for example the condition dest_rank = 1 to dest_rank < 3, you will get the two most popular shows per country.
Try out here: db<>fiddle
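As a rough check that the two approaches agree, the sketch below runs both queries against an in-memory SQLite database via Python's sqlite3 module. The sample rows are invented for illustration (the question does not show its data), the NOT LIKE filter is left out because the sample has no such rows, and a SQLite build with window functions (3.25 or later) is assumed.

```python
import sqlite3

# Hypothetical sample rows; the question does not show its data.
SHOWS = [
    ("US", "Show A", 90.0),
    ("US", "Show B", 75.5),
    ("GB", "Show C", 60.0),
    ("GB", "Show D", 82.0),
    ("JP", "Show E", 40.0),
]

# Subquery with MAX per country, joined back to fetch the name.
JOIN_QUERY = """
SELECT tv1.origin_country, tv1.name, tv1.popularity AS Most_Popular
FROM tv_data tv1
JOIN (SELECT origin_country, MAX(popularity) AS popularity
      FROM tv_data
      GROUP BY origin_country) tv2
  ON tv1.origin_country = tv2.origin_country
 AND tv1.popularity = tv2.popularity
ORDER BY tv1.popularity DESC
"""

# Window-function variant: rank shows within each country, keep rank 1.
RANK_QUERY = """
SELECT origin_country, name, popularity AS Most_Popular
FROM (SELECT origin_country, name, popularity,
             RANK() OVER (PARTITION BY origin_country
                          ORDER BY popularity DESC) AS dest_rank
      FROM tv_data) sub
WHERE dest_rank = 1
ORDER BY popularity DESC
"""


def top_shows(query):
    """Run one of the queries against a fresh in-memory copy of the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tv_data (origin_country TEXT, name TEXT, popularity REAL)")
    conn.executemany("INSERT INTO tv_data VALUES (?, ?, ?)", SHOWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(top_shows(JOIN_QUERY))
    print(top_shows(RANK_QUERY))
```

Note that both variants return every tied show when two shows in a country share the highest popularity; if exactly one row per country is required, ROW_NUMBER() with a tie-breaking column in the ORDER BY is one way to get it.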

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row, and that is usually not desirable. You can tweak this using Oracle's keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you "can't get it to work (still returns multiple rows per single order_id)" is kind of difficult to believe. The way you put it, it can't be true.
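The "mixed values" point is easier to see on data. The following is a small sketch using Python's sqlite3 module with an invented orders table (the table name, columns and rows are assumptions for illustration, and SQLite 3.25 or later is assumed for window functions). It compares the row_number() approach with plain MAX aggregation.

```python
import sqlite3

# Hypothetical duplicate rows per order_id: (order_id, ship_address, updated_at).
ORDERS = [
    (1, "12 Main Street", "2021-01-05"),
    (1, "12 Main St", "2021-03-01"),
    (2, "9 Oak Ave", "2021-02-10"),
    (2, "9 Oak Avenue", "2021-02-01"),
]

# Keeps the whole most recent row for each order_id.
LATEST_ROW_QUERY = """
select order_id, ship_address, updated_at
from (select o.*,
             row_number() over (partition by order_id
                                order by updated_at desc) as seqnum
      from orders o) numbered
where seqnum = 1
order by order_id
"""

# Takes the MAX of each column independently, so values can come from different rows.
MAX_QUERY = """
select order_id, max(ship_address), max(updated_at)
from orders
group by order_id
order by order_id
"""


def run(query):
    """Run a query against a fresh in-memory copy of the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table orders (order_id integer, ship_address text, updated_at text)")
    conn.executemany("insert into orders values (?, ?, ?)", ORDERS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run(LATEST_ROW_QUERY))
    print(run(MAX_QUERY))
```

Both queries return one row per order_id, but for this sample the MAX version pairs the latest date with an address taken from an older row, which is the mixing described above.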

SQL - select specific results

I have a table containing 2 columns for example. First column has unique values and the second column duplicates. Is there any way for me to select the first unique value only from the first column in relation to the second column?
For example: The results should get: Apple, Tire, and Fork only since they are the first results of the second column (category)
Details | Category
Apple | Fruits
Banana | Fruits
Tire | Car
Engine | Car
Fork | Silverware
Spoon | Silverware
Knife | Silverware
Usually we can use windowing functions like ROW_NUMBER() to simplify these types of queries, however your requested record set does not have a natural sort order that could be used that would result in the output you are expecting.
The following is a simple solution that uses ROW_NUMBER(), however it will not return the result you requested:
SELECT Category, Details
FROM
(
SELECT Category, Details, row_number() over (partition by category order by details) as rn
FROM SpecificResults
) as numberedRecords
WHERE rn = 1;
Results:
Category | Details
Car | Engine
Fruits | Apple
Silverware | Fork
You requested an output of: Apple, Tire, and Fork
The next query might produce the expected output because we do not specify the sort. However, that makes the output non-deterministic: we cannot guarantee it, and due to database internals the result might differ over time or even between immediately repeated queries.
There are many discussions on non-deterministic queries in SQL, have a read through this thread on SO: The order of a SQL Select statement without Order By clause
SELECT Category, details.Details
FROM SpecificResults byCategory
CROSS APPLY (
SELECT TOP 1 Details
FROM SpecificResults lookup
WHERE lookup.Category = byCategory.Category
--ORDER BY Details
) as details
GROUP BY Category, details.Details;
Results in:
Category | Details
Car | Tire
Fruits | Apple
Silverware | Fork
I have set up a SQL Fiddle for you to explore this further: http://sqlfiddle.com/#!18/68530/12
Real World Solution
In the real world, your dataset will have a primary key, and in many cases that key value will be an incrementing number. If not, there may be other columns that could be used to determine a sort order that matches your expected results.
Assuming that your dataset has an integer column called Id and that column is an Identity column, then a simple change to the original query using ROW_NUMBER() will achieve the desired result:
SELECT Category, Details
FROM
(
SELECT Category, Details, row_number() over (partition by category order by Id) as rn
FROM OrderedResults
) as numberedRecords
WHERE rn = 1;
I have updated the SQL Fiddle with this variation: http://sqlfiddle.com/#!18/3f7bd/2
If there is a Created date or some other Timestamp or DateTime based column in your recordset, then you could consider those as candidates for your ORDER BY clause.
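The difference between the two orderings can be reproduced locally. This is only a sketch: it uses Python's sqlite3 module instead of SQL Server, assumes SQLite 3.25 or later, and assumes the Id values were assigned in the order the rows are listed in the question.

```python
import sqlite3

# Rows from the question; the Id values are an assumption (insertion order).
ROWS = [
    (1, "Apple", "Fruits"),
    (2, "Banana", "Fruits"),
    (3, "Tire", "Car"),
    (4, "Engine", "Car"),
    (5, "Fork", "Silverware"),
    (6, "Spoon", "Silverware"),
    (7, "Knife", "Silverware"),
]

# {order_col} is filled in with a fixed column name below, never with user input.
QUERY_TEMPLATE = """
SELECT Category, Details
FROM (SELECT Category, Details,
             row_number() over (partition by Category order by {order_col}) as rn
      FROM OrderedResults) as numberedRecords
WHERE rn = 1
ORDER BY Category
"""


def first_per_category(order_col):
    """Return the first row per category, ordering within each category by order_col."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE OrderedResults (Id INTEGER PRIMARY KEY, Details TEXT, Category TEXT)")
    conn.executemany("INSERT INTO OrderedResults VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(QUERY_TEMPLATE.format(order_col=order_col)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(first_per_category("Details"))  # alphabetical: Car resolves to Engine
    print(first_per_category("Id"))       # insertion order: Car resolves to Tire
```

Ordering by Details picks Engine for Car because it sorts first alphabetically, while ordering by Id picks Tire, matching the output requested in the question.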
SQL tables represent unordered sets. There is no "first" value unless a column specifies the ordering. If you have such a column, then you can use row_number():
select t.*
from (select t.*,
row_number() over (partition by category order by <ordering col>) as seqnum
from t
) t
where seqnum = 1;
If you don't have such a column, then you simply cannot ask such a question in a relational database. The data doesn't support the question.
If I understand it correctly, try this:
select category, details from (select t.*, row_number() over (partition by category order by details) as rn from tablename t) numbered where rn = 1

Case when in Rank partition By

I've been relearning SQL, but I'm not sure if this code can work. Can someone please provide feedback or an alternative for this case?
Overall, I'm looking for duplicate orders: orders submitted on the same day, at different times, by the same user.
For the second step, I was thinking I would rank them, so that another row for the same date with a different time would be ranked two?
Select * ( including orderDate)
RANK() OVER(PARTITION BY
Customer,
case
when (Orderstart(Datetimestamp) > OrderEnd(Datetimestamp) and OrderEnd<Orderstart ) AS Rank_Items
From FirstStep
This just ranks everything, now going up to 500+ ranks.
I would use row_number() to identify multiple orders by the same customer on the same date:
row_number() over (partition by customer, cast(orderdatetime as date) order by orderdatetime)
The cast-to-date might vary by database.
This enumerates the orders for a customer on a given date, which seems to be what you want to accomplish.
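Here is a minimal sketch of that expression in a full query, run through Python's sqlite3 module. The rows and the customer and orderdatetime column names are invented for illustration, SQLite's date() function stands in for the cast-to-date, and SQLite 3.25 or later is assumed.

```python
import sqlite3

# Hypothetical orders: (customer, orderdatetime as ISO-8601 text).
ORDERS = [
    ("alice", "2023-05-01 09:00:00"),
    ("alice", "2023-05-01 14:30:00"),
    ("alice", "2023-05-02 10:00:00"),
    ("bob", "2023-05-01 11:00:00"),
]

# date(...) is SQLite's way to cut a datetime down to its date part.
QUERY = """
select customer, orderdatetime,
       row_number() over (partition by customer, date(orderdatetime)
                          order by orderdatetime) as seqnum
from FirstStep
order by customer, orderdatetime
"""


def number_orders():
    """Number each customer's orders within a calendar day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table FirstStep (customer text, orderdatetime text)")
    conn.executemany("insert into FirstStep values (?, ?)", ORDERS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in number_orders():
        print(row)
```

Any row with seqnum greater than 1 is a second or later order by the same customer on the same date, so filtering on that in an outer query would list the suspected duplicates.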

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number, supported in Firebird 3.0)? I am by no means an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
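To find the "row" number of a row with a given id (the rows are numbered in a subquery first, because a where clause is applied before window functions and filtering on id directly would always give 1):
select rownum
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) t
where id = <id>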
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) t join
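-- number all rows first; filtering on id before numbering would always give 1
(select id, rownum
from (select id, row_number() over (partition by NULL order by name, id) as rownum
from t
) numbered
where id = <id>
) tid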
on t.rownum between tid.rownum - 5 and tid.rownum + 5
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.
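The row-number lookup and the neighbour query can be tried out with the sketch below. It uses SQLite through Python's sqlite3 module rather than Firebird, assumes SQLite 3.25 or later, and builds a made-up people table of 26 rows whose names sort in the reverse order of their ids. It also shows what happens when the id filter is applied in the same query level as row_number().

```python
import sqlite3

# Filtering in the same query level: the where clause runs before the window
# function, so the single remaining row is always numbered 1.
NAIVE_QUERY = """
select row_number() over (order by name, id)
from people
where id = ?
"""

# Number every row first, then pick the row with the given id.
ROWNUM_QUERY = """
select rownum
from (select id, row_number() over (order by name, id) as rownum
      from people) numbered
where id = ?
"""

# Join the numbered rows to the target row and keep those within the radius.
NEIGHBOURS_QUERY = """
with numbered as (
    select id, name, row_number() over (order by name, id) as rownum
    from people
)
select n.id, n.name, n.rownum
from numbered n
join numbered target
  on target.id = ?
 and n.rownum between target.rownum - ? and target.rownum + ?
order by n.rownum
"""


def build_people():
    """Create a hypothetical table: ids 1..26 get names 'z'..'a'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table people (id integer primary key, name text)")
    rows = [(i, chr(ord("z") - i + 1)) for i in range(1, 27)]
    conn.executemany("insert into people values (?, ?)", rows)
    return conn


def row_number_of(conn, person_id, query=ROWNUM_QUERY):
    """Return the position of person_id according to the given query."""
    return conn.execute(query, (person_id,)).fetchone()[0]


def neighbours(conn, person_id, radius=5):
    """Return the rows within `radius` positions of person_id, in sort order."""
    return conn.execute(NEIGHBOURS_QUERY, (person_id, radius, radius)).fetchall()


if __name__ == "__main__":
    conn = build_people()
    print(row_number_of(conn, 10, NAIVE_QUERY))
    print(row_number_of(conn, 10))
    for row in neighbours(conn, 10):
        print(row)
    conn.close()
```

This still numbers the whole table on every call, so on a few million rows it is unlikely to be fast without further work; it is meant to show the shape of the queries, not a tuned solution.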