I have a table with the columns (int id pkey, int group, double sortOrder), where sortOrder implements a user-specified sort order within one group. On some modifications, I need to re-number the sort order of all items in the group. SortOrder values in different groups are independent of each other (i.e. they are only ever compared within one group).
SELECT id, sortOrder FROM tbl WHERE group=X ORDER BY SortOrder
Gives me all elements in that group, in the current order, in which I would then have to assign sequential values (1, 2, 3, ...) to SortOrder.
Q: Is there any reasonable way - preferably portable to other SQL implementations - to do that without updating every row individually?
More info: I am using doubles because that allows me to trivially assign a new sort order without modifying other items (MIN-1 to insert at the start, MAX+1 to insert at the end, and (A+B)/2 to insert between A and B). This is of course limited by double resolution (~52 inserts in the worst case). I'm not sure yet if this is worth the additional checking for overflow, but I'd have the same problem with any other data type anyway.
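The ~52-insert limit is easy to verify empirically. A small sketch (plain Python, no database needed) that simulates the worst case, always inserting directly after the same item so the gap halves each time:

```python
# Worst case for the (A+B)/2 trick: always insert directly after A,
# so the gap between A and its successor halves on every insert.
a, b = 1.0, 2.0
inserts = 0
while True:
    mid = (a + b) / 2
    if mid == a or mid == b:  # double resolution exhausted
        break
    inserts += 1
    b = mid  # the freshly inserted item becomes the new successor
print(inserts)  # 52: one insert per mantissa bit of an IEEE 754 double
```

Once the resolution runs out, the whole group has to be renumbered, which is exactly the operation the question asks about.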
The only other idea I came up with was simulating infinite resolution with strings and custom COLLATE and ADD/SUB/AVG functions. However, this seems immensely non-portable.
I don't believe that SQLite has a trivial way of doing this.
What you may be better looking at (if this really is a concern) is sticking with integer sorting and implementing an algorithm that maintains the sequence.
If, for example, all changes to sorting involve moving an item up/down one place, you just need to swap the sort orders of both values...
UPDATE table
SET sort_order = CASE WHEN sort_order = 3 THEN 4 ELSE 3 END
WHERE sort_order IN (3,4)
Or if you can move an item to any position, use something like...
UPDATE table
SET sort_order = CASE WHEN sort_order = #old THEN #new ELSE sort_order + 1 END
WHERE sort_order >= #new AND sort_order <= #old
-- (for moving an item to an earlier position, i.e. #new < #old;
--  use sort_order - 1 and swap the bounds to move an item later)
I'm aware that you said that you don't want to update other items each time, but this has actually proven to be pretty efficient in most cases for me in the past.
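The move-to-any-position update can be tried end to end against an in-memory SQLite database; in this sketch (table and values made up for the demo) the item at position 4 is moved up to position 2:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, sort_order INTEGER)")
con.executemany("INSERT INTO items (id, sort_order) VALUES (?, ?)",
                [(i, i) for i in range(1, 6)])

old_pos, new_pos = 4, 2  # move the item at position 4 up to position 2
con.execute(
    """UPDATE items
       SET sort_order = CASE WHEN sort_order = ? THEN ?
                             ELSE sort_order + 1 END
       WHERE sort_order BETWEEN ? AND ?""",
    (old_pos, new_pos, new_pos, old_pos))

order = [r[0] for r in con.execute("SELECT id FROM items ORDER BY sort_order")]
print(order)  # [1, 4, 2, 3, 5]
```

A single UPDATE touches only the rows between the old and new positions, so the rest of the group stays untouched.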
Related
I need to display a random last name of a person who entered into an employment contract in a specified month, using the RAND function.
go
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
    Declare @surname nvarchar(50)
    Set @surname = (Select top(1) surname from dbo.Tenants
        inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
        where MONTH(lease_agreements.rental_start_date) = @mont and dbo.Tenants.tenant_code = (select * from randNumber))
    return @surname
end
go
select dbo.User_Surname(1)

create or alter view randNumber as
Select FLOOR((RAND() * (MAX(tenant_code + 1) - 1)) + 1) as value from Tenants
So what if tenant #42 has been removed? If the random number function returns 42, then your query will yield nothing.
To fix this problem, one approach (which would be quite difficult to implement correctly) would involve a row-sequence-number column: an integer which increments sequentially and contains no gaps. To avoid a gap when a row is deleted, you must pick the last row in the table and give it the row-sequence-number of the deleted row. Consistently doing so without ever forgetting seems like a tough proposition, and achieving it without concurrency problems while rows are being concurrently deleted seems tougher still. Furthermore, the possibility that the last row may be re-sequenced means that you cannot use an SQL SEQUENCE for issuing row sequence numbers, or that your RDBMS must support counting down on a sequence, which is also a tough proposition.
A better approach would be to create a random number N between zero and the number of rows instead of the maximum row id number, and then to pick the Nth row from the table. That would be something like SELECT BOTTOM 1 FROM (SELECT TOP N FROM...
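That count-based approach translates naturally to dialects with LIMIT/OFFSET. A sketch against an in-memory SQLite table (table and column names borrowed from the question; note the deliberate gap in tenant_code, which no longer matters):

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Tenants (tenant_code INTEGER PRIMARY KEY, surname TEXT)")
con.executemany("INSERT INTO Tenants VALUES (?, ?)",
                [(1, "Smith"), (3, "Jones"), (42, "Brown")])  # ids with gaps

n_rows = con.execute("SELECT COUNT(*) FROM Tenants").fetchone()[0]
n = random.randrange(n_rows)  # 0 .. n_rows-1; gaps in tenant_code don't matter
row = con.execute(
    "SELECT surname FROM Tenants ORDER BY tenant_code LIMIT 1 OFFSET ?",
    (n,)).fetchone()
print(row[0])
```

Because N is drawn from the row count rather than the key range, a deleted tenant can never make the query come back empty.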
An SQL-only solution (involving no stored procedures) would be very inefficient. It would involve joining the table of interest with the random-number function, (just real random numbers between 0.0 and 1.0,) essentially creating a new table which also contains a random number field, then using ORDER BY the random field, and then using TOP 1 to get the first row. To achieve this, your RDBMS would be performing a full table scan, and creating an entire new sorted temporary table, and it would be doing that each time you ask for a row at random, so it would be preposterously inefficient.
A performance improvement on the above idea would be to permanently add the random number column to each row, (and to issue a new random number between 0.0 and 1.0 to each row later inserted,) and then use a SEQUENCE for issuing sequential row index numbers, so that each time you want a new random row you pick the next number N from the sequence, you compute its modulus by the number of rows in the table, and you get the Nth row from the table sorted by random-number-column. It will probably be a good idea to make that random number column indexed. The problem with this approach is that it does not truly yield records at random, it yields all records in random order. Truly yielding records at random means that the same row might be yielded twice in two successive queries. This approach will only yield a record again once all other records have first been yielded.
As you want only one tenant, use `ORDER BY NEWID()` (in SQL Server, RAND() is evaluated only once per query, so `ORDER BY RAND()` would not shuffle the rows).
As always with randomness, you could also get the same tenant 100 times in a row, especially when only a small number of tenants fit the bill.
This will never be fast, as the table needs to be fully scanned,
but you should at least have an index on (tenant_code, rental_start_date) so that selecting the qualifying tenants is faster.
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
    Declare @surname nvarchar(50)
    Set @surname = (Select top(1) surname from dbo.Tenants
        inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
        where MONTH(lease_agreements.rental_start_date) = @mont
        ORDER BY NEWID())
    return @surname
end
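The same join-filter-shuffle-pick shape can be exercised against SQLite, whose RANDOM() is evaluated per row (playing the role NEWID() plays in SQL Server); schema and data here are improvised from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Tenants (tenant_code INTEGER PRIMARY KEY, surname TEXT)")
con.execute("CREATE TABLE lease_agreements "
            "(tenant_code INTEGER, rental_start_date TEXT)")
con.executemany("INSERT INTO Tenants VALUES (?, ?)",
                [(1, "Smith"), (2, "Jones"), (3, "Brown")])
con.executemany("INSERT INTO lease_agreements VALUES (?, ?)",
                [(1, "2023-01-10"), (2, "2023-01-20"), (3, "2023-02-05")])

surname = con.execute(
    """SELECT surname
       FROM Tenants
       JOIN lease_agreements
         ON Tenants.tenant_code = lease_agreements.tenant_code
       WHERE strftime('%m', rental_start_date) = '01'  -- January leases only
       ORDER BY RANDOM()  -- per-row in SQLite, like NEWID() in SQL Server
       LIMIT 1""").fetchone()[0]
print(surname)  # either "Smith" or "Jones"
```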
Using MS SQL 2008, all tables contain a Status varchar(1) column that indicates "I" for inserted record, "U" for updated record, and "D" for deleted record as well as a DateCreated datetime column and a DateUpdated datetime column.
In most cases, we want to query tables for active records only and we would do something like:
SELECT column FROM table WHERE Status <> 'D'
To provide a perspective on usage, this is most frequently used filter as it appears in nearly every query and multiple times when tables are joined.
We're developing a new web application and database with a focus on maximizing performance. One proposal is to, starting with this and future projects, have the varchar(1) Status column pattern replaced with a bit like "IsDeleted" to indicate if the record was deleted or not and infer updated status from the two datetime fields.
In other words...
SELECT column as InsertedRecords FROM table WHERE Status = 'I' -- Rare case
SELECT column as UpdatedRecords FROM table WHERE Status = 'U' -- Rare case
SELECT column as ActiveRecords FROM table WHERE Status <> 'D'
SELECT column as DeletedRecords FROM table WHERE Status = 'D'
...would instead look something like...
SELECT column as InsertedRecords FROM table WHERE IsDeleted = 0 AND DateCreated = DateUpdated -- Rare case
SELECT column as UpdatedRecords FROM table WHERE IsDeleted = 0 AND DateCreated <> DateUpdated -- Rare case
SELECT column as ActiveRecords FROM table WHERE IsDeleted = 0
SELECT column as DeletedRecords FROM table WHERE IsDeleted = 1
Are there any tangible performance benefits/implications (primarily around indexes and large queries), or are both implementations perfectly acceptable? Are there any disadvantages to continuing the current pattern, for consistency's sake, to align with the previously created applications/databases?
I think there's not much to lose or gain from using a bit column instead of a char(1) column.
In terms of indexing, an index on just a bit column won't give you much value as it may have only 2 possible values: 1 and 0 (I assume your column is not nullable).
A query with WHERE condition DateCreated <> DateUpdated won't work very well as it won't be able to use indexes efficiently and will most likely behave worse than your existing char(1) field.
All in all, I think your existing solution will work better than a bit field and a date field. If you want to use numbers, you can store your values in a tinyint field instead (e.g. I = 0, U = 1, D = 2).
There are two further things you can do to improve performance:
Create indexes on the bit/char column together with other columns, depending on the queries you run (e.g. on IsDeleted and DateCreated)
Include the columns returned by the SELECT in your index so that the query doesn't have to look up records in the table.
Without going into too much detail (you can look it up yourself), other ways to improve performance over non-selective data are filtered indexes and table partitioning.
For example, if you're looking for specific data within WHERE Status <> 'D', that might benefit from a filtered index. Basically it only indexes the records you're interested in, making the index smaller (and possibly faster).
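SQLite's counterpart of a filtered index is a partial index, which makes the idea easy to demonstrate (table name and data are made up for the sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, Status TEXT NOT NULL)")
con.executemany("INSERT INTO t (Status) VALUES (?)",
                [("I",), ("U",), ("D",), ("U",)])

# A partial index stores only the rows matching its WHERE clause,
# so the "active records" predicate gets a smaller index.
con.execute("CREATE INDEX idx_active ON t (Status) WHERE Status <> 'D'")

active = con.execute("SELECT COUNT(*) FROM t WHERE Status <> 'D'").fetchone()[0]
print(active)  # 3
```

The deleted rows simply never enter idx_active, which is what keeps it small.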
Personally I prefer the I/U/D pattern over the bit pattern as to me it is 'orthogonal' and it's what I'm used to.
Also if you don't look at the deleted records much you may wish to split them off into a different partition. It's transparent to the user (they see just one table) but behind the scenes you can actually put it on a slower cheaper disk, back it up less etc. Also it knows which partition to go to and doesn't bother looking in the other (deleted) partitions.
I would also consider why you have these deleted records cluttering up this table if you rarely ever use them. Perhaps you could move them into a data warehouse and report from there instead.
I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
// Images
ImageID// PK
// Persons
PersonID // pk
// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've read quite a bit of negative feedback suggesting that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these answers from @Mitch Wheat. But with just 10k rows, your development time will be longer than any saved execution time...
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential consecutive unique identifiers, however, one can choose a random value between MIN(id) and MAX(id), and then SEEK that value out. The requirement that all values be consecutive, however, is often a constraint too far; you're never allowed to delete a row mid-table, for example...
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE
    @row_count INT

SELECT
    @row_count = COUNT(*)
FROM
    Images_Persons

SELECT
    *
FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
    FROM
        Images_Persons
)
AS data
WHERE
    data.id = CAST(@row_count * RAND() + 1 AS INT)
-- Assuming that `ImageID, PersonID` is the clustered index.
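The same count-then-seek idea can be run end to end against SQLite (which supports ROW_NUMBER() since version 3.25); the rows here are invented for the demo:

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Images_Persons (
                   ImageID INTEGER, PersonID INTEGER,
                   PRIMARY KEY (ImageID, PersonID))""")
con.executemany("INSERT INTO Images_Persons VALUES (?, ?)",
                [(i, p) for i in (1, 2) for p in (10, 20, 30)])

n = con.execute("SELECT COUNT(*) FROM Images_Persons").fetchone()[0]
pick = random.randint(1, n)  # stand-in for CAST(@row_count * RAND() + 1 AS INT)
row = con.execute(
    """SELECT ImageID, PersonID
       FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS rn
             FROM Images_Persons) AS data
       WHERE rn = ?""", (pick,)).fetchone()
print(row is not None)  # True: pick is always in 1..n, so a row is found
```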
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
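That experiment is easy to script. A sketch against SQLite, whose per-row RANDOM() plays the role of newid(): draw the "first row after shuffling" many times and check that every value gets a turn.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nums (n INTEGER)")
con.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(1, 11)])

# Repeatedly shuffle and take the first row, counting which value wins.
firsts = Counter(
    con.execute("SELECT n FROM nums ORDER BY RANDOM() LIMIT 1").fetchone()[0]
    for _ in range(2000))
print(len(firsts))  # 10: every value leads sometimes, roughly 200 times each
```

If the counts are roughly uniform, the shuffle is random "enough" for a quiz.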
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().
I have a table with unique values within it and once a stored procedure is called, I use the following code within a sub-query to get a random value from the table:
SELECT TOP 1 UniqueID FROM UniqueValues
WHERE InitiatingID is NULL
ORDER BY NewID() ASC
I have however noticed that now and then (I'm guessing when two calls run simultaneously) I retrieve the same unique value twice, which causes some issues within the program.
Is there any way (preferably not locking the table) to make the unique values ID generation completely unique - or unique enough to not affect two simultaneous calls? As a note, I need to keep the unique values and cannot use GUIDs directly here.
Edit for clarification:
I am buffering the unique values. That's what the WHERE InitiatingID is NULL is all about. As a value gets picked out of the query, the InitiatingID is set and therefore cannot be used again until released. The problem is that in the milliseconds of that process setting the InitiatingID it seems that the value is getting picked up again, thus harming the process.
Random implies that you will get the same value twice randomly.
Why not use IDENTITY columns?
I wrote a blog post about manual ID generation some days ago here. Maybe that helps.
What you're doing isn't really generating random unique values - which has a low probability of generating duplicates if you use the appropriate routines, but randomly selecting one item from a population - which, depending on the size of your population, will have a much higher chance of repeat occurrences. In fact, given enough repeated drawing, there will occasionally be repeats - if there weren't, it wouldn't be truly random.
If what you want is to never draw the same unique id in a row, you might consider buffering the 'old' unique id somewhere, and discarding your draw if it matches (or running a WHERE <> currentlydrawuniqueID).
What about using update with the output clause to select the UniqueId and set InitiatingId all at once. http://msdn.microsoft.com/en-US/library/ms177564(v=SQL.90).aspx
Something like: (Though I don't have SQL Server handy, so not tested.)
DECLARE @UniqueIDTable TABLE
(
    UniqueId int
)

UPDATE UniqueValues
SET InitiatingID = @InitiatingID
OUTPUT INSERTED.UniqueId into @UniqueIDTable
WHERE UniqueID =
    (SELECT TOP 1 UniqueID FROM UniqueValues
     WHERE InitiatingID is NULL
     ORDER BY NewID() ASC)
AND InitiatingID is NULL
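The essential trick is claiming and marking the row in one statement, with the "InitiatingID IS NULL" guard repeated in the outer WHERE. A sketch of that pattern in SQLite (schema improvised from the question; a single connection can't truly race, so this only illustrates the shape of the statement):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE UniqueValues "
            "(UniqueID INTEGER PRIMARY KEY, InitiatingID INTEGER)")
con.executemany("INSERT INTO UniqueValues (UniqueID) VALUES (?)",
                [(i,) for i in range(1, 6)])

def claim(initiating_id):
    # Pick a random free value and mark it taken in ONE statement; the
    # repeated "InitiatingID IS NULL" guard keeps two callers from
    # claiming the same row.
    cur = con.execute(
        """UPDATE UniqueValues
           SET InitiatingID = ?
           WHERE UniqueID = (SELECT UniqueID FROM UniqueValues
                             WHERE InitiatingID IS NULL
                             ORDER BY RANDOM() LIMIT 1)
             AND InitiatingID IS NULL""", (initiating_id,))
    if cur.rowcount == 0:
        return None  # nothing left to claim
    return con.execute(
        "SELECT UniqueID FROM UniqueValues WHERE InitiatingID = ?",
        (initiating_id,)).fetchone()[0]

claimed = {claim(k) for k in (100, 101, 102)}
print(len(claimed))  # 3 distinct values, never a duplicate
```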
MySQL
Suppose you want to retrieve just a single record by some id, but you want to know what its position would have been if you'd encountered it in a large ordered set.
Case in point is a photo gallery. You land on a single photo, but the system must know what its offset is in the entire gallery.
I suppose I could use custom indexing fields to keep track of positions, but there must be a more graceful way in SQL alone.
So, first you create a virtual table with the position # ordered by whatever your ORDER BY is, then you select the highest one from that set. That's the position in the greater result set. You can run into problems if you don't order by a unique value/set of values...
If you create an index on (photo_gallery_id, date_created_on) it may do an index scan (depending on the distribution of photos), which ought to be faster than a table scan (provided your gallery_id isn't 90% of the photos or whatnot).
SELECT @row := 0;
SELECT MAX( position )
FROM ( SELECT @row := @row + 1 AS position
       FROM photos
       WHERE photo_gallery_id = 43
         AND date_created_on <= 'the-date-time-your-photo-was'
       ORDER BY date_created_on ) positions;
Not really. I think Oracle gives you a "ROWID" or something like that, but most don't. A custom ordering, like a column in your database that tells you what position the entry has in the gallery, is good because you can never be sure that SQL will store things in the table in the order you think they should be in.
As you are not specific about what database you're using, in SQL Server 2005 you could use
SELECT
ROW_NUMBER() OVER (ORDER BY PhotoID)
, PhotoID
FROM dbo.Photos
You don't say what DBMS you are using, and the "solution" will vary accordingly. In Oracle you could do this (but I would urge you not to!):
select photo, offset
from
( select id
  , photo
  , row_number() over (partition by gallery_id order by photo_seq) as offset
  from photos
)
where id = 123
That query will select all photos (full table scan) and then pick out the one you asked for - not a performant query!
I would suggest if you really need this information it should be stored.
Assuming the position is determined solely by the id, would it not be as simple as counting all records with a smaller id value?:
select
po.[id]
...
((select count(pi.[id]) from photos pi where pi.[id] < po.[id]) + 1) as index
...
from photos po
...
I'm not sure what the performance implications of such a query would be, but I would think returning a lot of records could be a problem.
You must understand the difference between a "application key" and a "technical key".
The technical key exists for the sole purpose of making an item unique. It's usually an INTEGER or BIGINT, generated (identity, whatever). This key is used to locate objects in the database, to quickly figure out whether an object has already been persisted (IDs must be > 0, so an object with the default ID == 0 is not in the DB yet), etc.
The application key is something which you need to make sense of an object within the context of your application. In this case, it's the ordering of the photos in the gallery. This has no meaning whatsoever for the database.
Think ordered list: This is the default in most languages. You have a set of items, accessed by an index. For a database, this index is an application key since sets in the database are unordered (or rather the database doesn't guarantee any ordering unless you specify ORDER BY). For the very same reason, paging through results from a query is such a pain: Databases really don't like the idea of "position".
So what you must do is add an index column (i.e. an INTEGER which says at which position in the gallery your image is; not a database index for quicker access, even though you should create an index on this column ...) and maintain it. For every insertion, you must UPDATE index = index + 1 WHERE index >= insertion_point, etc.
Yes, it sucks. The only solution I know of: Use an ORM framework which solves this for you.
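The maintenance described above is short to write, if tedious to keep up. A sketch in SQLite (table and values made up for the demo) that inserts a new photo at position 2:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gallery (photo TEXT, pos INTEGER)")
con.executemany("INSERT INTO gallery VALUES (?, ?)",
                [("a", 1), ("b", 2), ("c", 3)])

# Insert photo "x" at position 2: shift everything at or after it, then insert.
con.execute("UPDATE gallery SET pos = pos + 1 WHERE pos >= 2")
con.execute("INSERT INTO gallery VALUES ('x', 2)")

order = [r[0] for r in con.execute("SELECT photo FROM gallery ORDER BY pos")]
print(order)  # ['a', 'x', 'b', 'c']
```

Every insertion pays for an UPDATE over the tail of the gallery, which is the "it sucks" part.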
There's no need for an extra table, why not just count the records instead?
You know the order in which they are displayed (even though it can vary).
You also know the ID of the current record; let's say it's ordered on date:
The offset of the record is the total number of records with a date < that date.
SELECT COUNT(1) FROM ... WHERE date < "the-date"
This gives you the number you can use as the offset for the other queries...
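A runnable version of that counting approach against SQLite (schema and rows invented for the demo; photo 3 sits third in gallery 43 when ordered by date):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE photos "
            "(id INTEGER PRIMARY KEY, gallery_id INTEGER, created TEXT)")
con.executemany("INSERT INTO photos VALUES (?, ?, ?)",
                [(1, 43, "2009-01-01"), (2, 43, "2009-01-05"),
                 (3, 43, "2009-01-09"), (4, 99, "2009-01-02")])

# Offset of photo 3 inside gallery 43 = how many gallery-43 photos precede it.
target = con.execute("SELECT created FROM photos WHERE id = 3").fetchone()[0]
offset = con.execute(
    "SELECT COUNT(*) FROM photos WHERE gallery_id = 43 AND created < ?",
    (target,)).fetchone()[0]
print(offset)  # 2
```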