How to select 1 row per id? - sql

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (it still returns multiple rows per single order_id). What am I doing wrong?

If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row and that is usually not desirable. You can tweak this using keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
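To make the difference between the two approaches concrete, here is a minimal sketch that runs both queries against an in-memory SQLite database from Python. The table contents are made up for illustration, and SQLite (3.25 or later, for window functions) stands in for whatever database the question uses; the Oracle-specific keep syntax is not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (order_id INTEGER, ship_address TEXT, created_at TEXT)")
# Hypothetical sample data: order 1 has two rows with different addresses.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "Zebra Street 1", "2020-01-01"),
        (1, "Apple Road 9", "2020-03-01"),  # most recent row for order 1
        (2, "Main St 5", "2020-02-01"),
    ],
)

# row_number(): one row per order_id, every column taken from the most recent row.
latest = conn.execute(
    """
    SELECT order_id, ship_address, created_at
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY order_id
                                    ORDER BY created_at DESC) AS seqnum
          FROM t) AS ranked
    WHERE seqnum = 1
    ORDER BY order_id
    """
).fetchall()

# Plain aggregation: each MAX is computed independently of the other columns.
mixed = conn.execute(
    """
    SELECT order_id, MAX(ship_address), MAX(created_at)
    FROM t
    GROUP BY order_id
    ORDER BY order_id
    """
).fetchall()

print(latest)  # order 1 -> ('Apple Road 9', '2020-03-01')
print(mixed)   # order 1 -> ('Zebra Street 1', '2020-03-01'), values from two different rows
```

Both queries return one row per order_id, but only the row_number() version guarantees that the address and the date come from the same source row.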

The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you
can't get it to work (still returns multiple rows per single order_id)
is kind of difficult to believe. The way you put it, it can't be true.

Related

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest error.
If there are multiple measurements with the same error, the one that was created most recently is considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?
Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.
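The tie-breaking behaviour can be checked with a small sketch. It runs the same query against an in-memory SQLite database from Python (SQLite 3.25+ assumed for window functions, standing in for Oracle); the sample rows and the TEXT/REAL column types are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (category TEXT, value REAL, error REAL, created TEXT)"
)
# Hypothetical data: category 'a' has two rows tied on the lowest error.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        ("a", 1.0, 0.5, "2021-01-01"),
        ("a", 2.0, 0.1, "2021-01-02"),
        ("a", 3.0, 0.1, "2021-01-03"),  # same error as the row above, but newer
        ("b", 4.0, 0.2, "2021-01-01"),
    ],
)

best = conn.execute(
    """
    WITH cte AS (
        SELECT m.*,
               ROW_NUMBER() OVER (PARTITION BY category
                                  ORDER BY error, created DESC) AS rn
        FROM measurements m
    )
    SELECT category, value, error, created
    FROM cte
    WHERE rn = 1
    ORDER BY category
    """
).fetchall()

print(best)
```

For category 'a' the newer of the two rows with error 0.1 is selected, because created DESC only comes into play once error is tied.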

SQL 'partition by order by' turns count() into rank()?

I am trying to figure out how to use partition by properly, and looking for a brief explanation to the following results. (I apologize for including the test data without proper SQL code.)
Example 1: Counts the IDs (e.g. shareholders) for each company and adds it to the original data frame (as "newvar").
select ID, company,
count(ID) over(partition by company) as newvar
from testdata;
Example 2: When I now add order by shares count() somehow seems to turn into rank(), so that the output is merely a ranking variable.
select ID, company,
count(ID) over(partition by company order by shares) as newvar
from testdata;
I thought order by just orders the data, but it seems to have an impact on "newvar".
Is there a simple explanation to this?
Many thanks in advance!
.csv file that contains testdata:
ID;company;shares
1;a;10
2;a;20
3;a;70
1;b;50
4;b;10
5;b;10
6;b;30
2;c;80
3;c;10
7;c;10
1;d;20
2;d;30
3;d;25
6;d;10
7;d;15
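count() with an order by does a cumulative count. When there are no ties in the shares value, the result matches row_number(). With ties, the result depends on how the database handles a missing window frame (rows between or range between): under the standard default (range between unbounded preceding and current row), tied rows are counted together, so each tied row gets the count up to and including its last peer, which looks similar to a ranking.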
If you want to just order the data, then the order by should be after the from clause:
select ID, company,
count(ID) over(partition by company) as newvar
from testdata
order by shares;
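The effect of the window frame is easiest to see side by side. The sketch below loads the test data from the question into an in-memory SQLite database from Python (SQLite 3.25+ assumed; its default frame is the standard range-based one) and compares three variants for company b, which has a tie at 10 shares. The column aliases are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdata (id INTEGER, company TEXT, shares INTEGER)")
conn.executemany(
    "INSERT INTO testdata VALUES (?, ?, ?)",
    [
        (1, "a", 10), (2, "a", 20), (3, "a", 70),
        (1, "b", 50), (4, "b", 10), (5, "b", 10), (6, "b", 30),
        (2, "c", 80), (3, "c", 10), (7, "c", 10),
        (1, "d", 20), (2, "d", 30), (3, "d", 25), (6, "d", 10), (7, "d", 15),
    ],
)

result = conn.execute(
    """
    SELECT id, shares,
           -- order by without an explicit frame: cumulative count, ties counted together
           COUNT(id) OVER (PARTITION BY company ORDER BY shares) AS default_frame,
           -- explicit rows frame: cumulative count row by row, like row_number()
           COUNT(id) OVER (PARTITION BY company ORDER BY shares
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame,
           -- no order by: the count covers the whole partition
           COUNT(id) OVER (PARTITION BY company) AS whole_partition
    FROM testdata
    WHERE company = 'b'
    ORDER BY shares, id
    """
).fetchall()

for row in result:
    print(row)
```

With the default frame, both rows with 10 shares get a count of 2. With the rows frame they get 1 and 2, in an order the database is free to choose, since the two rows are tied on shares.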

SQL Server: I have multiple records per day and I want to return only the first of the day

I have some records that track inquiries by DATETIME. There is a glitch in the system and sometimes a record will be entered multiple times on the same day. I have a query with a bunch of correlated subqueries attached to these, but the numbers are off because when those glitches occur, the leads show up multiple times. I need the first entry of the day. I tried fooling around with MIN but I couldn't quite get it to work.
I currently have this, I am not sure if I am on the right track though.
SELECT SL.UserID, MIN(SL.Added) OVER (PARTITION BY SL.UserID)
FROM SourceLog AS SL
Here's one approach using row_number():
select *
from (
select *,
row_number() over (partition by userid, cast(added as date) order by added) rn
from sourcelog
) t
where rn = 1
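As a quick check of the idea, here is a sketch of the same query run against an in-memory SQLite database from Python. It is not SQL Server: SQLite's date() function stands in for cast(added as date), the timestamps are stored as text, and the sample rows are made up. SQLite 3.25+ is assumed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourcelog (userid INTEGER, added TEXT)")
# Hypothetical data: user 1 was logged twice on 2023-05-01 because of the glitch.
conn.executemany(
    "INSERT INTO sourcelog VALUES (?, ?)",
    [
        (1, "2023-05-01 09:00:00"),
        (1, "2023-05-01 09:00:05"),
        (1, "2023-05-02 10:00:00"),
        (2, "2023-05-01 12:00:00"),
    ],
)

first_of_day = conn.execute(
    """
    SELECT userid, added
    FROM (SELECT userid, added,
                 -- restart the numbering for every user and calendar day
                 ROW_NUMBER() OVER (PARTITION BY userid, date(added)
                                    ORDER BY added) AS rn
          FROM sourcelog) AS t
    WHERE rn = 1
    ORDER BY userid, added
    """
).fetchall()

print(first_of_day)
```

The duplicate entry at 09:00:05 is dropped, while the entry for the same user on the next day is kept.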
You could use group by along with min to accomplish this.
Depending on how your data is structured, if you assign a unique sequential number to each record created, you could just return the lowest number created per day. Otherwise you would need to return the ID of the record with the earliest DATETIME value per day.
--Assumes sequential IDs
select
min(Id)
from
[YourTable]
group by
--the conversion is used to strip the time value out of the date/time
convert(date, [YourDateTime])

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no means an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
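To find the "row" number of a row with a given id, number all the rows in a subquery first and filter afterwards (a where clause at the same query level is applied before row_number() is computed, so the matching row would always get number 1):
select rownum
from (select id, row_number() over (order by name, id) as rownum
from t
) t
where id = <id>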
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, row_number() over (order by name, id) as rownum
from t
) t join
-- the target row must be numbered before filtering on id
(select rownum
from (select id, row_number() over (order by name, id) as rownum
from t
) numbered
where id = <id>
) tid
on t.rownum between tid.rownum - 5 and tid.rownum + 5
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.
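The row-number approach can be tried out with a small sketch. It uses an in-memory SQLite database from Python (SQLite 3.25+ assumed, standing in for Firebird 3.0) and 20 made-up rows whose name order differs from their id order. The key point is that the rows are numbered over the whole table first and only then filtered by id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
# Hypothetical data: 20 people whose alphabetical order differs from their id order.
rows = [(i, f"name{(i * 7) % 20:02d}") for i in range(1, 21)]
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

target_id = 10

# Number every row in the full sort order; id is the tie-breaker.
numbered = "SELECT id, name, ROW_NUMBER() OVER (ORDER BY name, id) AS rn FROM people"

# Problem 1: the position of the target row (filter applied outside the numbering).
position = conn.execute(
    f"SELECT rn FROM ({numbered}) AS n WHERE id = ?", (target_id,)
).fetchone()[0]

# Problem 2: the target row plus up to 5 rows on either side.
neighbours = conn.execute(
    f"""
    WITH numbered AS ({numbered})
    SELECT n.id, n.name, n.rn
    FROM numbered AS n
    JOIN numbered AS target ON target.id = ?
    WHERE n.rn BETWEEN target.rn - 5 AND target.rn + 5
    ORDER BY n.rn
    """,
    (target_id,),
).fetchall()

print(position)
for row in neighbours:
    print(row)
```

This still numbers the entire table on every call, so it illustrates correctness rather than speed; whether it meets the latency goal from the question would have to be measured on the real data.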

Using a DISTINCT clause to filter data but still pull other fields that are not DISTINCT

I am trying to write a query in Postgresql that pulls a set of ordered data and filters it by a distinct field. I also need to pull several other fields from the same table row, but they need to be left out of the distinct evaluation. example:
SELECT DISTINCT(user_id) user_id,
created_at
FROM creations
ORDER BY created_at
LIMIT 20
I need the user_id to be DISTINCT, but don't care whether the created_at date is unique or not. Because the created_at date is being included in the evaluation, I am getting duplicate user_id in my result set.
Also, the data must be ordered by the date, so using DISTINCT ON is not an option here. It requires that the DISTINCT ON field be the first field in the ORDER BY clause and that does not deliver the results that I seek.
How do I properly use the DISTINCT clause but limit its scope to only one field while still selecting other fields?
As you've discovered, standard SQL treats DISTINCT as applying to the whole select-list, not just one column or a few columns. The reason for this is that it's ambiguous what value to put in the columns you exclude from the DISTINCT. For the same reason, standard SQL doesn't allow you to have ambiguous columns in a query with GROUP BY.
But PostgreSQL has a nonstandard extension to SQL to allow for what you're asking: DISTINCT ON (expr).
SELECT DISTINCT ON (user_id) user_id, created_at
FROM creations
ORDER BY user_id, created_at
LIMIT 20
You have to include the distinct expression(s) as the leftmost part of your ORDER BY clause.
See the manual on DISTINCT Clause for more information.
If you want the most recent created_at for each user then I suggest you aggregate like this:
SELECT user_id, MAX(created_at)
FROM creations
WHERE ....
GROUP BY user_id
ORDER BY MAX(created_at) DESC
This will return the most recent created_at for each user_id.
If you only want the top 20, then append
LIMIT 20
EDIT: This is basically the same thing Unreason said above... define from which row you want the data by aggregation.
The GROUP BY should ensure distinct values of the grouped columns, this might give you what you are after.
(Note I'm putting in my 2 cents even though I am not familiar with PostgreSQL, but rather MySQL and Oracle)
In MySql
SELECT user_id, created_at
FROM creations
GROUP BY user_id
ORDER BY user_id
In Oracle sqlplus
SELECT user_id, MIN(created_at)
FROM creations
GROUP BY user_id
ORDER BY user_id
These will give you the user_id followed by a created_at associated with that user_id: the earliest one in the Oracle query, and an arbitrary one in the MySQL query (which only runs when ONLY_FULL_GROUP_BY is disabled). If you want a different created_at in Oracle, you can substitute MIN with other aggregate functions such as AVG or MAX.
Your question is not well defined - when you say you need also other data from the same row you are not defining which row.
You do say you need to order the results by created_at, so I will assume that you want values from the row with min created_at (earliest).
This now becomes one of the most common SQL questions on SO - retrieving rows containing some aggregate value (MIN, MAX).
For example
SELECT user_id, MIN(created_at) AS created_at
FROM creations
GROUP BY user_id
ORDER BY MIN(created_at)
LIMIT 20
This approach will not let you (easily) pick other values from the same row.
One approach that will let you pick other values is
SELECT c.user_id, c.created_at, c.other_columns
FROM creations c LEFT JOIN creations c_help
ON c.user_id = c_help.user_id AND c.created_at > c_help.created_at
WHERE c_help.user_id IS NULL
ORDER BY c.created_at
LIMIT 20
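The self-join keeps a row only when no earlier row exists for the same user. A minimal sketch of that anti-join, run against an in-memory SQLite database from Python rather than PostgreSQL, with a made-up title column standing in for other_columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE creations (user_id INTEGER, created_at TEXT, title TEXT)")
# Hypothetical data: two creations per user.
conn.executemany(
    "INSERT INTO creations VALUES (?, ?, ?)",
    [
        (1, "2020-01-03", "c"),
        (1, "2020-01-01", "a"),  # earliest row for user 1
        (2, "2020-01-02", "b"),  # earliest row for user 2
        (2, "2020-01-05", "d"),
    ],
)

earliest = conn.execute(
    """
    SELECT c.user_id, c.created_at, c.title
    FROM creations AS c
    LEFT JOIN creations AS c_help
           ON c.user_id = c_help.user_id
          AND c.created_at > c_help.created_at
    -- no match in c_help means no earlier row exists for this user
    WHERE c_help.user_id IS NULL
    ORDER BY c.created_at
    LIMIT 20
    """
).fetchall()

print(earliest)
```

Each user_id appears once, the title comes from the same row as the earliest created_at, and the result is ordered by date. If two rows for one user share the same earliest created_at, both are returned, so a tie-breaker column would be needed in that case.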
Using a sub-query was suggested by someone on the irc #postgresql channel. It worked:
SELECT user_id
FROM (SELECT DISTINCT ON (user_id) * FROM creations) ss
ORDER BY created_at DESC
LIMIT 20;