custom sorting or ordering a table without resorting the whole shebang - sql

For ten years we've been using the same custom sorting on our tables. I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like our replication to replicate unnecessary entries. I had a look into nested sets, but they don't seem to do the job for us.
Base table:
id | a_sort
---+-------
 1 |     10
 2 |     20
 3 |     30
After inserting:
insert into table (a_sort) values(15)
This new entry belongs at the second position.
id | a_sort
---+-------
 1 |     10
 2 |     20
 3 |     30
 4 |     15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
 1 |     10
 4 |     20
 2 |     30
 3 |     40
The column names, the column count, the datatypes, a possible join, possible triggers, or the way the resorting is done are irrelevant to the problem. Also, we've found some pretty neat ways to do this task fast.
The question is only: how the heck can we reduce the number of updates in the db to 1 or 2 at most?
Seems like an awfully common problem.
The captain obvious in me once thought: "use an a_sort float(53), and insert using a fixed value of ordervaluefirstentry + abs(ordervaluefirstentry - ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
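For illustration, a minimal sketch of that midpoint insert (the items table and its columns are made up here; it assumes the two neighbouring sort values are known):
-- hypothetical items(id, a_sort float(53)): insert a row between a_sort 10 and 20
-- by taking the midpoint of the two neighbouring sort values
insert into items (a_sort)
select (p.a_sort + n.a_sort) / 2
from items p, items n
where p.a_sort = 10   -- sort value of the row to insert after
  and n.a_sort = 20;  -- sort value of the row to insert before
As noted above, repeatedly bisecting the same gap exhausts the float precision fairly quickly, so this only postpones the resort.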

You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list
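A minimal sketch of that linked-list idea, assuming a hypothetical items(id, next_id) layout where next_id references the row that follows in sort order (NULL for the last row):
-- insert a new row (id 4) between id 1 and id 2: one insert plus a single update
insert into items (id, next_id) values (4, 2);   -- new row points at the old successor
update items set next_id = 4 where id = 1;       -- predecessor now points at the new row

-- reading the rows back in order then needs a recursive query (WITH RECURSIVE syntax)
with recursive ordered (id, next_id, pos) as (
    select id, next_id, 1
    from items
    where id not in (select next_id from items where next_id is not null)  -- list head
    union all
    select i.id, i.next_id, o.pos + 1
    from items i
    join ordered o on i.id = o.next_id
)
select id from ordered order by pos;
The trade-off is that a plain ORDER BY no longer works; the order has to be reconstructed by walking the chain.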

Related

Select N unique rows randomly while excluding 1 specific row (in SQLite DB)

I need to select x unique rows randomly from a table with n rows, while excluding 1 specific row. (x is small, 3 for example.) This can be done in several queries if needed, and I can also compute anything in a programming language (Java). The one important thing is that it must be done faster than O(n) and consume only O(x) memory; indefinite looping (retrying) is also undesirable.
Probability of selection should be equal for all rows (except the one which is excluded, of course)
Example:
| id | some data |
|----|-----------|
|  1 | …         |
|  2 | …         |
|  3 | …         |
|  4 | …         |
|  5 | …         |
The algorithm is run with arguments (x = 3, exclude_id = 4), so:
it should select 3 different random rows from rows with id in 1,2,3,5.
(end of example)
I tried the following approach:
get row count (= n);
get the position of the excluded row by something like select count(*) where id < excluded_id, assuming id is monotonically increasing;
pick the x numbers from 0..n, obeying all the rules, using some "clever" algorithm; this is roughly O(x), in other words fast enough;
select these x rows one by one using the LIMIT index, 1 SQL clause (sketched below).
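A minimal sketch of those steps in SQLite flavour (table and column names are hypothetical; it assumes a monotonically increasing key row_pos, and the x positions themselves are picked in Java):
-- step 1: total row count n
select count(*) from rows_tbl;

-- step 2: position of the excluded row among the ordered keys
select count(*) from rows_tbl where row_pos < :excluded_pos;

-- steps 3-4: for each of the x positions k chosen in application code
-- (skipping the excluded position), fetch exactly one row
select * from rows_tbl order by row_pos limit 1 offset :k;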
However, it turned out that it's possible for rows to change positions (I'm not sure why), so the auto-generated ids are not monotonically increasing. In this case the second step (getting the position of the excluded row) produces a wrong result and the algorithm fails to do its job correctly.
How can this be improved?
Also, if this is vastly easier with a different SQL-like database system, it would be interesting, too. (the DB is on a server and I can install any software there as long as it's compatible with Ubuntu 14.04 LTS)
(I'm sorry for a bit of confusion ;) The algorithm I used is actually correct if the id is monotonically increasing; I just forgot that the id was not itself auto-generated - it was taken from another table where it is auto-generated, and it was possible to add these rows in a different order.
So I added another id for this table, which is auto-generated, and used it for row selection, and now it works as it should.

in theory: would this be possible: SQL query for efficient use of build parts

I have a couple of tables, which I have simplified to the extreme for the purpose of this example.
Table 1, 'Units'
ID | UnitName | PartNumber
---+----------+-----------
 1 | UnitX    | 1
 2 | UnitX    | 1
 3 | UnitX    | 2
 4 | UnitX    | 3
 5 | UnitX    | 3
 6 | UnitY    | 1
 7 | UnitY    | 2
 8 | UnitY    | 3
Table 2, 'Parts'
ID | PartName | Quantity
---+----------+---------
 1 | Screw    | 2
 2 | Wire     | 1
 3 | Ducttape | 1
I would like to query these tables to find which of these units would be possible to build and, if so, which one should ideally be built first to make efficient use of these parts.
Now the question is: can this be done using SQL, or is a background application required/more efficient?
So in this example it is easy, because only one unit (UnitY) can be built. But I guess you get the idea. (I'm not looking for a shake-and-bake answer, just your thoughts on this.)
Thanks
As you present it, it is efficient to use SQL. As you describe it, the PartNumber column of the Units table is a foreign key to the ID column of the Parts table, so a simple outer join, or selecting the units whose PartNumber is "NOT IN" the Parts table, would give you the units that cannot be built.
However, if your db schema consists of many non-normalised tables, or is very complex, without indexes and with other "bad" things etc.,
it could be examined whether specific application code is faster. But I really doubt it for this particular case; the query seems trivial.
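As a rough sketch of such a query (using the example tables above; exact syntax will vary by dialect), you can count, per unit and part, how many pieces are needed and compare that against the stock:
-- a unit is buildable only if every part it requires is available in sufficient quantity
select UnitName
from (
    select u.UnitName,
           u.PartNumber,
           count(*)                as needed,
           coalesce(p.Quantity, 0) as in_stock
    from Units u
    left join Parts p on p.ID = u.PartNumber
    group by u.UnitName, u.PartNumber, p.Quantity
) req
group by UnitName
having min(case when needed <= in_stock then 1 else 0 end) = 1;
Deciding which of the buildable units should be built first to use the parts most efficiently is the harder part, and is probably easier to handle in application code.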

Efficient SELECT with complex WHERE condition - do I need to store a column with the calculated value?

Suppose I have a large table to store ranges of integers. I can do this with two fields:
start | end
   10 | 210    (represents 10 to 210)
    5 |  55    (represents 5 to 55)
(quick to select by end column), or:
start | length
   10 |    200    (represents 10 to 210)
    5 |     50    (represents 5 to 55)
(quick to select by length column).
What if sometimes I need to select by end, and sometimes by length, and both queries need to be fast? I could store both:
start | length | end
   10 |    200 | 210
    5 |     50 |  55
But then this is not normalised, everyone has to remember to update both fields, and it is just bad design.
I know I can select by start + length or end - start but for a very large table, isn't this extremely slow?
How can I query by calculated values quickly without storing redundant data - or should I just store the extra column?
Depending on the database type you are using, you might want to use a trigger to calculate the derived field. That way, they can never get out of synch.
This means that the field (length) could be re-calculated every time start or end changes.
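For example, a minimal MySQL-flavoured sketch (the ranges table and the trigger names are made up for illustration):
-- keep the derived length column in sync whenever start or end changes
CREATE TRIGGER ranges_length_ins BEFORE INSERT ON ranges
FOR EACH ROW SET NEW.length = NEW.`end` - NEW.`start`;

CREATE TRIGGER ranges_length_upd BEFORE UPDATE ON ranges
FOR EACH ROW SET NEW.length = NEW.`end` - NEW.`start`;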
I'd store the length, but I'd make sure the calculation was done in my insert and update sprocs so that as long as everyone uses your sprocs there is no more overhead for them.
Unfortunately neither of your target databases supports computed columns. I would do the following:
First, determine whether you really have a performance problem. It is true that WHERE end - start = ? will perform more slowly than WHERE length = ?, but you don't define what a "really big table" is in your application, nor what the required performance is. No need to optimize away a problem that may not exist.
Determine whether you can support any latency in your searches. If so, you can add the calculated column to the table but dedicate a separate task, running every five minutes, each hour, or whatever, to fill in the values.
In PostgreSQL you could consider a materialized view, which I believe is supported at the engine level. (See Catcall's comment, below.)
Finally, if all else fails, consider using a trigger to maintain the calculated column.

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to create another column in the table which is a rolling sum of your weights, then pull your records by generating a random number n with 0 <= n < (the total of all your weights) and pulling the first row whose rolling sum is greater than n.
For example, if you had four rows with the following weights:
+-----+--------+------------+
| row | weight | rollingsum |
+-----+--------+------------+
| a   |      3 |          3 |
| b   |      3 |          6 |
| c   |      4 |         10 |
| d   |      1 |         11 |
+-----+--------+------------+
Then, choose a random integer n with 0 <= n < 11, and return row a if 0 <= n < 3, b if 3 <= n < 6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
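For what it's worth, a hedged sketch of the same idea computed on the fly with a window function rather than a stored rolling-sum column (assuming MySQL 8+ and a hypothetical weighted_rows(id, weight) table):
-- draw a random threshold between 0 and the total weight, held stable in a variable
SET @threshold = (SELECT RAND() * SUM(weight) FROM weighted_rows);

-- return the first row whose rolling sum exceeds the threshold
SELECT id
FROM (
    SELECT id, SUM(weight) OVER (ORDER BY id) AS rollingsum
    FROM weighted_rows
) w
WHERE rollingsum > @threshold
ORDER BY rollingsum
LIMIT 1;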
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with 3 tables: users, data and user-data. User-data contains which rows should be preferred for each user.
Then create one view based on the data rows that are preferred by the user.
Create a second view that has the non-preferred data.
Create a third view which is a union of the first 2. The union should select more rows from the preferred data.
Then finally select random rows from the third view.

Query Performance with NULL

I would like to know about how NULL values affect query performance in SQL Server 2005.
I have a table similar to this (simplified):
ID | ImportantData | QuickPickOrder
---+---------------+---------------
 1 | 'Some Text'   | NULL
 2 | 'Other Text'  | 3
 3 | 'abcdefg'     | NULL
 4 | 'whatever'    | 4
 5 | 'it is'       | 2
 6 | 'technically' | NULL
 7 | 'a varchar'   | NULL
 8 | 'of course'   | 1
 9 | 'but that'    | NULL
10 | 'is not'      | NULL
11 | 'important'   | 5
And I'm doing a query on it like this:
SELECT *
FROM MyTable
WHERE QuickPickOrder IS NOT NULL
ORDER BY QuickPickOrder
So the QuickPickOrder is basically a column used to single out some commonly chosen items from a larger list. It also provides the order in which they will appear to the user. NULL values mean that it doesn't show up in the quick pick list.
I've always been told that NULL values in a database are somehow evil, at least from a normalization perspective, but is it an acceptable way to filter out unwanted rows in a WHERE constraint?
Would it be better to use specific number value, like -1 or 0, to indicate items that aren't wanted? Are there other alternatives?
EDIT:
The example does not accurately represent the ratio of real values to NULLs. A better example might show at least 10 NULLs for every non-NULL. The table size might be 100 to 200 rows. It is a reference table, so updates are rare.
SQL Server indexes NULL values, so this will most probably just use an index seek over an index on QuickPickOrder, both for filtering and for ordering.
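For instance, an index along these lines (hypothetical name; SQL Server syntax) gives the optimizer the option to seek the non-NULL rows and read them already in order:
-- plain nonclustered index on the filter/sort column; NULL rows are included in the index
CREATE INDEX IX_MyTable_QuickPickOrder ON MyTable (QuickPickOrder);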
Another alternative would be two tables:
MyTable:
ID | ImportantData
---+--------------
 1 | 'Some Text'
 2 | 'Other Text'
 3 | 'abcdefg'
 4 | 'whatever'
 5 | 'it is'
 6 | 'technically'
 7 | 'a varchar'
 8 | 'of course'
 9 | 'but that'
10 | 'is not'
11 | 'important'
QuickPicks:
MyTableID | QuickPickOrder
----------+---------------
        2 | 3
        4 | 4
        5 | 2
        8 | 1
       11 | 5
SELECT MyTable.*
FROM MyTable
JOIN QuickPicks ON QuickPicks.MyTableID = MyTable.ID
ORDER BY QuickPicks.QuickPickOrder
This would allow updating QuickPickOrder without locking anything in MyTable or logging a full row transaction for that table. So depending how big MyTable is, and how often you are updating QuickPickOrder, there may be a scalability advantage.
Also, having a separate table will allow you to add a unique index on QuickPickOrder to ensure no duplication, and could be more easily scaled later to allow different kinds of QuickPicks, having them specific to certain contexts or users, etc.
They do not have a negative performance hit on the database. Remember, NULL is more of a state than a value. Checking for NOT NULL vs. setting that value to -1 makes no difference, other than that the -1 is probably breaking your data integrity, imo.
SQL Server's performance can be affected by using NULLS in your database. There are several reasons for this.
First, NULLS that appear in fixed length columns (CHAR) take up the entire size of the column. So if you have a column that is 25 characters wide, and a NULL is stored in it, then SQL Server must store 25 characters to represent the NULL value. This added space increases the size of your database, which in turn means that it takes more I/O overhead to find the data you are looking for. Of course, one way around this is to use variable length fields instead. When NULLs are added to a variable length column, space is not unnecessarily wasted as it is with fixed length columns.
Second, use of the IS NULL clause in your WHERE clause means that an index cannot be used for the query, and a table scan will be performed. This can greatly reduce performance.
Third, the use of NULLS can lead to convoluted Transact-SQL code, which can mean code that doesn't run efficiently or that is buggy.
Ideally, NULLs should be avoided in your SQL Server databases.
Instead of using NULLs, use a coding scheme similar to this in your databases:
NA: Not applicable
NYN: Not yet known
TUN: Truly unknown
Such a scheme provides the benefits of using NULLs, but without the drawbacks.
NULL looks fine to me for this purpose. Performance is likely to be basically the same as with a non-null column and constant value, or maybe even better for filtering out all NULLs.
The alternative is to normalize QuickPickOrder into a table with a foreign key, and then perform an inner join to filter the nulls out (or a left join with a where clause to filter the non-nulls out).
NULL looks good to me as well. SQL Server has many kinds of indices to choose from. I forget which ones do this, but some only index values in a given range. If you had that kind of index on the column being tested, the NULL valued records would not be in the index, and the index scan would be fast.
Having a lot of NULLs in a column which has an index on it (or starting with it) is generally beneficial to this kind of query.
NULL values are not entered into the index, which means that inserting / updating rows with NULL in there doesn't take the performance hit of having to update another secondary index. If, say, only 0.001% of your rows have a non-null value in that column, the IS NOT NULL query becomes pretty efficient as it just scans a relatively small index.
Of course all of this is relative, if your table is tiny anyway, it makes no appreciable difference.