order by field with more than 10000 ids - sql

I need a specific ordering, which I produce with ORDER BY FIELD:
select * from table order by field(id,3,4,1,2.......upto 10000 ids)
The required ordering cannot be derived from anything in the data. How much does a list this long affect performance, and is it feasible at all?
Updates from the comments:
Ordering depends on user and category IDs and can be anything the user wants.
The ordering specification changes (about) daily.
So, we need a custom ordering that depends on the user and category and this ordering needs to change daily.

The easiest way would be to put your ordering in a separate table (called ordering_table in this example):
id | position
----+----------
1 | 11
2 | 42
3 | 23
etc.
The above would mean "put an id of 1 at position 11, 2 at position 42, 3 at position 23, ...". Then you can join that ordering table in:
SELECT t.id, t.col1, t.col2
FROM some_table t
JOIN ordering_table o ON (t.id = o.id)
ORDER BY o.position
Where ordering_table is the table (as above) that defines your strange ordering. This approach simply represents your ordering function as a table (any function with a finite domain is, essentially, just a table after all).
This "ordering table" approach should work fine as long as the ordering table is complete.
If you only need this strange ordering in one place then you could merge the position column into your main table and add NOT NULL and UNIQUE constraints on that column to make sure you cover everything and have a consistent ordering.
Further commenting indicates that you want different orderings for different users and categories and that the ordering will change on a daily basis. You could make separate tables for each condition (which would lead to a combinatorial explosion) or, as Mikael Eriksson and ypercube suggest, add a couple more columns to the ordering table to hold the user and category:
CREATE TABLE ordering_table (
thing_id INT NOT NULL,
position INT NOT NULL,
user_id INT NOT NULL,
category_id INT NOT NULL
);
The thing_id, user_id, and category_id would be foreign keys to their respective tables. You'd probably want to index all the columns in ordering_table, but it is worth spending a couple of minutes looking at the query plans to see whether the indexes actually get used. You could also make all four columns the primary key to avoid duplicates. Then the lookup query would be something like this:
SELECT t.id, t.col1, t.col2
FROM some_table t
LEFT JOIN ordering_table o
ON (t.id = o.thing_id AND o.user_id = $user AND o.category_id = $cat)
ORDER BY COALESCE(o.position, 99999)
Where $user and $cat are the user and category IDs (respectively). Note the change to a LEFT JOIN and the addition of COALESCE to allow for missing rows in ordering_table; these changes push anything that doesn't have a specified position to the bottom of the list rather than removing it from the results completely.
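The LEFT JOIN / COALESCE behaviour can be checked with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient test bed (the question is about MySQL). The sample rows, the ordered_ids helper, the three-column primary key and the extra t.id tie-breaker are assumptions made for illustration, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE some_table (id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE ordering_table (
        thing_id    INT NOT NULL,
        position    INT NOT NULL,
        user_id     INT NOT NULL,
        category_id INT NOT NULL,
        -- Assumption: one position per (thing, user, category).
        PRIMARY KEY (thing_id, user_id, category_id)
    );
    INSERT INTO some_table VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    -- Hypothetical user 7, category 1 wants the order 3, 1, 2; id 4 has no position.
    INSERT INTO ordering_table VALUES (3, 1, 7, 1), (1, 2, 7, 1), (2, 3, 7, 1);
""")

def ordered_ids(user_id, category_id):
    """Return ids in the custom order; unpositioned ids sink to the bottom."""
    rows = conn.execute("""
        SELECT t.id
        FROM some_table t
        LEFT JOIN ordering_table o
               ON t.id = o.thing_id AND o.user_id = ? AND o.category_id = ?
        ORDER BY COALESCE(o.position, 99999), t.id  -- t.id only breaks ties
    """, (user_id, category_id)).fetchall()
    return [r[0] for r in rows]

print(ordered_ids(7, 1))
print(ordered_ids(8, 1))  # a user with no ordering rows at all
```

Changing the daily ordering then means replacing the rows for one (user_id, category_id) pair rather than rebuilding a 10000-element FIELD() list in the query text.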

Related

How to group by one column and limit to rows where another column has the same value for all rows in group?

I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are many rows with the same userid value and the same dobyr value, one of them is kept (it doesn't matter which one) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with a NOT IN, NOT EXISTS or EXISTS condition. You can also choose which row of each (userid, dobyr) combination to keep by adding columns at the end of the ORDER BY.
If you don't need the other columns and only want something to use later as a filter or whitelist, the plain userid values are enough: once a userid is known to have a single dobyr, the userid alone identifies the records that match your criteria:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
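As a quick check of the HAVING approach, here is a sketch that loads the sample rows from the question into an in-memory SQLite database (used only because it ships with Python; the question targets Postgres) and runs the same query. The ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userinteractions (userid INTEGER, dobyr INTEGER)")
conn.executemany(
    "INSERT INTO userinteractions VALUES (?, ?)",
    [(1, 1995), (1, 1995), (2, 1999), (3, 1990), (3, 1999), (4, 1989), (4, 1989)],
)

# Keep only users whose rows all agree on dobyr; MAX() just picks that one value.
clean_rows = conn.execute("""
    SELECT userid, MAX(dobyr) AS dobyr
    FROM userinteractions
    GROUP BY userid
    HAVING COUNT(DISTINCT dobyr) = 1
    ORDER BY userid
""").fetchall()

print(clean_rows)  # [(1, 1995), (2, 1999), (4, 1989)]
```

One caveat worth keeping in mind: COUNT(DISTINCT dobyr) ignores NULLs, so a user with one real dobyr plus some NULL rows would still pass the filter.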

How to break ties when comparing columns in SQL

I am trying to delete duplicates in Postgres. I am using this as the base of my query:
DELETE FROM case_file as p
WHERE EXISTS (
SELECT FROM case_file as p1
WHERE p1.serial_no = p.serial_no
AND p1.cfh_status_dt < p.cfh_status_dt
);
It works well, except that when the cfh_status_dt dates are equal, neither of the records is removed.
For rows that have the same serial_no and the date is the same, I would like to keep the one that has a registration_no (if any do, this column also has NULLS).
Is there a way I can do this with all one query, possibly with a case statement or another simple comparison?
DELETE FROM case_file AS p
WHERE id NOT IN (
SELECT DISTINCT ON (serial_no) id -- id = PK
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no
);
This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.
NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.
If you want the greatest registration_no instead, and still want NULL values sorted last, use:
...
ORDER BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST
See:
Select first row in each GROUP BY group?
Sort by column ASC, but NULL values first?
If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:
How to delete duplicate rows without unique identifier
If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.
Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:
How to delete duplicate entries?
Surviving rows are simply:
SELECT DISTINCT ON (serial_no) *
FROM case_file
ORDER BY serial_no, cfh_status_dt DESC, registration_no;
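DISTINCT ON is specific to Postgres. The same "keep one row per serial_no" rule can be sketched with ROW_NUMBER(), shown here against SQLite (assuming a build with window functions, 3.25 or later) so it runs with the Python standard library alone. The sample rows are made up. Because SQLite sorts NULL first in ascending order, unlike Postgres, the sketch sorts on registration_no IS NULL first to keep non-null values ahead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE case_file (
        id INTEGER PRIMARY KEY,
        serial_no INT,
        cfh_status_dt TEXT,
        registration_no INT
    );
    -- Hypothetical rows: ids 2 and 3 tie on the latest date for serial_no 100.
    INSERT INTO case_file VALUES
        (1, 100, '2020-01-01', NULL),
        (2, 100, '2020-02-01', NULL),
        (3, 100, '2020-02-01', 555),
        (4, 200, '2021-05-05', NULL);
""")

# Rank rows per serial_no: latest date first, then non-NULL registration_no,
# then the smallest registration_no. Everything not ranked first is deleted.
conn.execute("""
    DELETE FROM case_file
    WHERE id NOT IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY serial_no
                       ORDER BY cfh_status_dt DESC,
                                registration_no IS NULL,
                                registration_no
                   ) AS rn
            FROM case_file
        ) AS ranked
        WHERE rn = 1
    )
""")

survivors = [r[0] for r in conn.execute("SELECT id FROM case_file ORDER BY id")]
print(survivors)  # [3, 4]
```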

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
1. Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
2. Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table was the size indicated this doesn't matter but once you're on your 5th left outer join using range scans rather than unique scans begins to have an effect.
3. Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
4. Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :new_dob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
5. For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "fewer DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4 with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with an extra join or union. You can also define a view, so that all queries are straightforward from the end user's perspective.
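A minimal sketch of the trigger-plus-view idea, written against SQLite through Python's sqlite3 module so it runs as-is. Oracle trigger syntax differs (for example :OLD.col instead of OLD.col). The trigger names, the archived_at column, the student_history view and the ISO date strings are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
    CREATE TABLE student_archive (
        id INTEGER, name TEXT, dob TEXT,
        archived_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Row-level triggers copy the old values before they are changed or removed.
    CREATE TRIGGER students_archive_update BEFORE UPDATE ON students
    FOR EACH ROW BEGIN
        INSERT INTO student_archive (id, name, dob)
        VALUES (OLD.id, OLD.name, OLD.dob);
    END;
    CREATE TRIGGER students_archive_delete BEFORE DELETE ON students
    FOR EACH ROW BEGIN
        INSERT INTO student_archive (id, name, dob)
        VALUES (OLD.id, OLD.name, OLD.dob);
    END;

    -- The view hides the UNION from anyone who needs the full history.
    CREATE VIEW student_history AS
        SELECT id, name, dob, 1 AS is_current FROM students
        UNION ALL
        SELECT id, name, dob, 0 AS is_current FROM student_archive;

    INSERT INTO students VALUES
        (1, 'Alice', '1989-12-01'),
        (2, 'Bob', '1990-06-04'),
        (3, 'Cuthbert', '1988-01-23');
""")

# The application issues a single UPDATE; the trigger does the archiving.
conn.execute("UPDATE students SET dob = ? WHERE id = ?", ("1990-06-05", 2))

current = conn.execute("SELECT dob FROM students WHERE id = 2").fetchone()[0]
archived = conn.execute("SELECT id, name, dob FROM student_archive").fetchall()
history = conn.execute(
    "SELECT dob, is_current FROM student_history WHERE id = 2 ORDER BY is_current"
).fetchall()
print(current, archived, history)
```

With this layout, reads of current data stay a single primary key lookup on students, and only history queries pay for the union.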

How to search on levelOrder values in SQL?

I have a table in SQL Server that contains the following columns :
Id Name ParentId LevelOrder
8 vehicle 0 0/8/
9 car 8 0/8/9/
10 bike 8 0/8/10/
11 House 0 0/11/
...
This creates a tree.
Say that I have the LevelOrder 0/8/, this should return only the car and bike rows, but how do I handle this in SQL Server?
I have tried :
Select * FROM MyTable WHERE LevelOrder >= '0/8/'
but that does not work.
The underscore character will guarantee at least one character comes after '0/8/', so you don't get a match on the "vehicle" row.
SELECT *
FROM MyTable
WHERE LevelOrder LIKE '0/8/_%'
This selects the values that start with 0/8/. Note that it also matches the row whose LevelOrder is exactly 0/8/ (the vehicle row itself).
Select * FROM MyTable WHERE LevelOrder like '0/8/%'
Okay -
While @Joe's answer is the simplest and easiest to implement (and possibly better performing than what I'm about to propose...), there are some issues with update anomalies.
Specifically:
You already have a parentId column. You need to synchronize both this and the levelOrder column, or risk inconsistent data. (I believe this also violates 1NF, although my understanding of the exact definition is a little sketchy...)
levelOrder contains the entire hierarchy. If any one parent is moved, all child rows must have levelOrder modified to reflect this (potentially very messy).
In light of this, here's what I recommend:
Drop the levelOrder column, as its existence will (generally) cause problems.
Use a recursive CTE and the parentId column to build the hierarchy dynamically. Either leave the column where it is, or move it to a dedicated relationship table. Moving one parent then requires only one cell to be updated, and cannot result in any (data, not semantic) anomalies. The CTE should look similar to this form (will need to be adjusted for purpose):
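-- @startId is a placeholder for the id whose subtree you want
WITH heir_parent (parentId, id) AS (
    SELECT parentId, id
    FROM MyTable
    WHERE id = @startId
    UNION ALL
    SELECT b.parentId, b.id
    FROM heir_parent AS a
    JOIN MyTable AS b
        ON b.parentId = a.id
)
SELECT parentId, id
FROM heir_parent;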
At the moment, the CTE returns the row for the given id plus all of its descendants, each with its id and its immediate parent. It can be adjusted to return a number of other things as well - although I recommend that the CTE be used only to generate the relationship, and join externally to get the remaining data.
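The recursive approach can be tried out with the sketch below. It runs against SQLite via Python's sqlite3 module, so it needs the RECURSIVE keyword, which SQL Server omits. It also differs from the CTE above in one deliberate way: the anchor selects rows whose ParentId is the starting id, so the starting row itself is left out, matching the "only car and bike" requirement. The descendants helper and the 'racing bike' row are hypothetical additions, the latter to show more than one level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, Name TEXT, ParentId INT);
    INSERT INTO MyTable VALUES
        (8, 'vehicle', 0),
        (9, 'car', 8),
        (10, 'bike', 8),
        (11, 'House', 0),
        (12, 'racing bike', 10);  -- hypothetical grandchild of 'vehicle'
""")

def descendants(root_id):
    """Names of all rows below root_id, found through ParentId alone."""
    rows = conn.execute("""
        WITH RECURSIVE heir_parent (parentId, id) AS (
            SELECT ParentId, Id FROM MyTable WHERE ParentId = ?
            UNION ALL
            SELECT b.ParentId, b.Id
            FROM heir_parent AS a
            JOIN MyTable AS b ON b.ParentId = a.id
        )
        -- The CTE only builds the relationship; names come from an outer join.
        SELECT m.Name
        FROM heir_parent AS h
        JOIN MyTable AS m ON m.Id = h.id
        ORDER BY m.Id
    """, (root_id,)).fetchall()
    return [r[0] for r in rows]

print(descendants(8))   # ['car', 'bike', 'racing bike']
print(descendants(11))  # []
```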

unique pair in a "friendship" database

I'm posting this question which is somewhat a summary of my other question.
I have two databases:
1) db_users.
2) db_friends.
I stress that they're stored in separate databases on different servers and therefore no foreign keys can be used.
In 'db_friends' I have the table 'tbl_friends' which has the following columns:
- id_user
- id_friend
Now how do I make sure that each pair is unique at this table ('tbl_friends')?
I'd like to enforce that at the table level, and not through a query.
For example these are invalid rows:
1 - 2
2 - 1
I'd like this to be impossible to add.
Additionally - how would I search for all of the friends of user 713, given that on some friendship rows he could appear in the second column ('id_friend')?
You're probably not going to be able to do this at the database level -- your application code is going to have to do this. If you make sure that your tbl_friends records always go in with (lowId, highId), then a typical PK/Unique Index will solve the duplicate problem. In fact, I'd go so far as to rename the columns in your tbl_friends to (id_low, id_high) just to reinforce this.
Your query to find anything with user 713 would then be something like
SELECT id_low AS friend FROM tbl_friends WHERE (id_high = ?)
UNION ALL
SELECT id_high AS friend FROM tbl_friends WHERE (id_low = ?)
For efficiency, you'd probably want to index it forward and backward -- that is by (id_user, id_friend) and (id_friend, id_user).
If you must do this at a DB level, then a stored procedure to swap arguments to (low,high) before inserting would work.
You'd have to use a trigger to enforce that business rule.
Making the two columns in tbl_friends the primary key (unique constraint failing that) would only ensure there can't be duplicates of the same set: 1, 2 can only appear once but 2, 1 would be valid.
how would I search for all of the friends of user 713 while he could be mentioned, on some friendship rows, at the second column ('id_friend')?
You could use an IN:
WHERE 713 IN (id_user, id_friend)
..or a UNION:
JOIN (SELECT id_user AS user
FROM TBL_FRIENDS
UNION ALL
SELECT id_friend
FROM TBL_FRIENDS) x ON x.user = u.user
Well, a unique constraint on the pair of columns will get you halfway there. I think the easiest way to ensure you don't get the reversed version would be to add a constraint ensuring that id_user < id_friend. You will need to compensate for this ordering at insertion time, but it will get you the database-level constraint you desire without duplicating data or relying on foreign keys.
As for the second question, to find all friends for id=1 you could select id_user, id_friend from tbl_friends where id_user = 1 or id_friend = 1 and then, in your client code, throw out all the 1's regardless of column.
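The CHECK-plus-unique-key idea can be sketched as follows, again using SQLite through Python's sqlite3 module as a stand-in for whatever engine hosts db_friends. The add_friendship and friends_of helpers are hypothetical names. Note that this relies on the engine enforcing CHECK constraints; some do not (MySQL before 8.0.16, for example, parses them but ignores them).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_friends (
        id_user   INTEGER NOT NULL,
        id_friend INTEGER NOT NULL,
        CHECK (id_user < id_friend),      -- rejects (2, 1) and (1, 1)
        PRIMARY KEY (id_user, id_friend)  -- rejects a second (1, 2)
    )
""")

def add_friendship(a, b):
    """Normalise the pair to (low, high) before inserting."""
    low, high = min(a, b), max(a, b)
    conn.execute("INSERT INTO tbl_friends VALUES (?, ?)", (low, high))

def friends_of(user_id):
    """Look in both columns, returning the id from the other column."""
    rows = conn.execute("""
        SELECT id_friend AS friend FROM tbl_friends WHERE id_user = ?
        UNION ALL
        SELECT id_user FROM tbl_friends WHERE id_friend = ?
        ORDER BY friend
    """, (user_id, user_id)).fetchall()
    return [r[0] for r in rows]

add_friendship(713, 2)
add_friendship(900, 713)
print(friends_of(713))  # [2, 900]

try:
    add_friendship(2, 713)  # same pair, other order
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The UNION ALL form also avoids having to discard the searched-for id in client code, since each branch returns only the other side of the pair.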
One way you could do it is to store the two friends on two rows:
CREATE TABLE FriendPairs (
pair_id INT NOT NULL,
friend_id INT NOT NULL,
PRIMARY KEY (pair_id, friend_id)
);
INSERT INTO FriendPairs (pair_id, friend_id)
VALUES (1234, 317), (1234, 713);
See? It doesn't matter which order you insert them, because both friends go in the friend_id column. So you can enforce uniqueness easily.
You can also query easily for friends of 713:
SELECT f2.friend_id
FROM FriendPairs AS f1
JOIN FriendPairs AS f2 ON (f1.pair_id = f2.pair_id)
WHERE f1.friend_id = 713