LINQ OrderBy. Does it always return the same ordered list? - sql

I was trying out a simple OrderBy statement.
The target data to order is something like below:
[
{"id":40, "description":"aaa", "rate":1},
{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2},
{"id":19, "description":"aaa", "rate":1}
]
Then I order items by the rate property.
After ordering them, I 'skip' some items by a given offset and then 'take' only a portion of the data.
For example,
var result = items.OrderBy(i => i.rate)
                  .Skip(2)
                  .Take(2);
The result looks fine for the most part, but the 'edge case' item is not returned at all.
For example,
if the first result came back as
[{"id":40, "description":"aaa", "rate":1}, {"id":1, "description":"bbb", "rate":1}]
the second result comes back like
[{"id":1, "description":"bbb", "rate":1}, {"id":4, "description":"ccc", "rate":2}]
Item "id: 19" has not been returned with the second query call. Instead item "id: 1" has returned twice.
My guess is that the SQL OrderBy statement doesn't produce the same ordered list every single time OrderBy orders by a given property, but the exact order within a group that shares the same property can change.
What is the exact mechanism under the hood?

Short answer: LINQ to Objects uses a stable sort algorithm, so we can say it is deterministic; LINQ to SQL depends on the database's implementation of ORDER BY, which is usually nondeterministic.
A deterministic sort is one that always produces the same output on different runs over the same input.
In your example, you have duplicate values in your OrderBy key. For a guaranteed, predictable order, one of the ordering keys, or the combination of ordering keys, must be unique.
In LINQ, you can achieve this by adding a ThenBy clause on a unique property, as in
items.OrderBy(i => i.Rate).ThenBy(i => i.ID).
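On SQL Server, a paged query with that tiebreaker would correspond roughly to the following (the table name Items and the OFFSET/FETCH paging are assumptions for illustration, not necessarily the exact SQL your provider emits):
-- The combination (rate, id) is unique, so the page boundaries are deterministic
SELECT id, description, rate
FROM Items
ORDER BY rate, id
OFFSET 2 ROWS FETCH NEXT 2 ROWS ONLY;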
Long answer:
LINQ to Objects uses a stable sort, as documented on MSDN.
In LINQ to SQL, it depends on the sort algorithm of the underlying database, and it is usually an unstable sort, as in MS SQL Server (MSDN).
In a stable sort, if the keys of two elements are equal, the order of the elements is preserved. In contrast, an unstable sort does not preserve the order of elements that have the same key.
So, for LINQ to SQL, the sorting is usually nondeterministic because the RDBMS (Relational Database Management System, e.g. MS SQL Server) may use an unstable sort algorithm with random pivot selection, or the randomness can come from which row the database happens to access first in the file system.
For example, imagine that the size of a page in the file system can hold up to 4 rows.
The page will be full if you insert the following data:
Page 1
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |
If you need to insert a new row, the RDBMS has two options:
Create a new page to allocate the new row.
Split the current page in two pages. So the first page will hold the Names A and B and the second page will hold C and D.
Suppose that the RDBMS chooses option 1 (to reduce index fragmentation). If you insert a new row with Name C and Value 9, you will get:
Page 1
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |

Page 2
| Name | Value |
|------|-------|
| C | 9 |
Probably, an ORDER BY on the Name column will return the following:
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 3 |
| C | 9 | -- Value 9 appears after because it was on another page
| D | 4 |
Now, suppose that the RDBMS chooses option 2 (to increase insert performance on a storage system with many spindles). If you insert a new row with Name C and Value 9, you will get:
Page 1
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 9 |

Page 2
| Name | Value |
|------|-------|
| C | 3 |
| D | 4 |
Probably, an ORDER BY on the Name column will return the following:
| Name | Value |
|------|-------|
| A | 1 |
| B | 2 |
| C | 9 | -- Value 9 appears before because it was on the first page
| C | 3 |
| D | 4 |
Regarding your example:
I believe that you have mistyped something in your question, because you have used items.OrderBy(i => i.rate).Skip(2).Take(2); and the first result does not show a row with rate = 2. That is not possible: Skip will ignore the first two rows, which both have rate = 1, so your output must include the row with rate = 2.
You've tagged your question with database, so I believe that you are using LINQ to SQL. In this case, results can be nondeterministic and you could get the following:
Result 1:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
Result 2:
[{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
If you had used items.OrderBy(i => i.rate).ThenBy(i => i.ID).Skip(2).Take(2); then the only possible result would be:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]

Related

What is the most efficient way to store a variable number of columns in SQL Server?

I have a requirement to store a large number (several million) of records in Microsoft SQL Server (via C#). Most columns are standard, but certain groups of users will need to add their own custom columns and record data in them.
The data in each custom column field will not be large, but the number of records with a certain set of custom columns will be in the millions.
I do not know ahead of time what these columns might be (in terms of name or datatype), but I'll need to pull reports based on these columns as efficiently as possible.
What is the most efficient way of storing the new varying columns and data?
Entity-Attribute-Value model?
Cons: Efficiency if there's a large number of custom columns (= a large number of rows)?
An extra table "CustomColumns"?
Storing columnName, Data, Datatype each time an entry has a custom column, for each column.
Cons: A table with a large number of records; perhaps not the most efficient storage.
Serialise the extra columns for each record into a single field
Cons: Lookup efficiency, and stored procedures become complicated when running reports based on a custom field.
Any other?
Edit: I think I may be confusing options (1) and (2). What I actually meant is: is the following the best approach?
Entity (User Groups)
id | name | description
-- | ---- | ------------
1 | user group 1 | user group 1
2 | user group 2 | user group 2
Attribute
id | name | type | entityIds (best way to do this for 2 user groups using the same attribute?)
-- | ---- | ---- | ----------
1 | att1 | string | 1,2
2 | att2 | int | 2
3 | att3 | string | 1
4 | att4 | numeric | 2
5 | att5 | string | 1
Value
id | entityId| attributeId | value
-- | --------| ----------- | -----
1 | 1 | 1 | a
2 | 1 | 2 | 1
3 | 1 | 3 | b
4 | 1 | 3 | c
5 | 1 | 3 | d
6 | 1 | 3 | 75
7 | 1 | 5 | Inches
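To illustrate the reporting cost of the EAV layout above: a report has to pivot the Value rows back into columns. A hedged sketch of what that looks like in T-SQL, using the table and column names from the edit (each reported attribute needs its own conditional aggregate or a PIVOT clause):
-- Pivot EAV rows back into one row per entity
SELECT
    v.entityId,
    MAX(CASE WHEN a.name = 'att1' THEN v.value END) AS att1,
    MAX(CASE WHEN a.name = 'att2' THEN v.value END) AS att2,
    MAX(CASE WHEN a.name = 'att5' THEN v.value END) AS att5
FROM [Value] v
JOIN Attribute a ON a.id = v.attributeId
GROUP BY v.entityId;
This is why reports over option (1) get slower and more awkward as the number of custom columns grows.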

Efficient Classification of records by common letters in impala

I have a table in Impala (TBL1) that contains different names with different numbers of common first letters. The table contains about 3M records. I would like to add a new attribute to the table, where each group of common first letters gets a class. It works the same way as DENSE_RANK, but with a dynamic number of first letters. The number of shared first letters should not be less than p = 3 letters (p is a parameter).
Here is an example for the table and the required results:
| ID | Attr1 | New_Attr1 | Some more attribute...
+-------+--------------+-------------+-----------------------
| 1 | ZXA-12 | 1 |
| 2 | YL3300 | 2 |
| 3 | ZXA-123 | 1 |
| 4 | YL3400 | 2 |
| 5 | YL3-aaa | 2 |
| 6 | TSA 789 | 3 |
...
Does this do what you want?
select t.*,
dense_rank() over (order by strleft(attr1, 3)) as newcol
from . . .;
The "3" is your parameter.
As a note: In your example, you seem to have assigned the new value in reverse alphabetic order. Hence, you would want desc for the order by.
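Putting that note into the query, a sketch with p = 3 and descending order so the classes come out as in the example (TBL1 and Attr1 are the names from the question):
-- p = 3: classify by the first 3 characters of Attr1;
-- desc so that ZXA -> 1, YL3 -> 2, TSA -> 3, as in the example
select t.*,
       dense_rank() over (order by strleft(attr1, 3) desc) as New_Attr1
from TBL1 t;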

Primary key auto-increment manipulation

Is there any way to have a primary key with a feature that increments it but fills in gaps? Assuming I have the following table:
____________________
| ID | Value |
| 1 | A |
| 2 | B |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Note that the Value column is only an example; the order has nothing to do with the question.
Once I remove the row with the ID of 2 (the table will look like this):
____________________
| ID | Value |
| 1 | A |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
And then I add another row; with the regular auto-increment feature it will look like this:
____________________
| ID | Value |
| 1 | A |
| 3 | C |
| 4 | D |
^^^^^^^^^^^^^^^^^^^^^
As expected.
The output I'd want would be:
____________________
| ID | Value |
| 1 | A |
| 2 | D |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Where the gap is filled with the new row. Also note that maybe, in memory, it would look different. But the point is that the primary key would fill the gaps.
With primary keys of (for instance) 1, 2, 3, 6, 7, 10, 11, the value 4 should be filled in first, then 5, then 8, and so on... When the table is empty (even if it held a million rows before) it should start over from 1.
How do I accomplish that? Is there any built-in feature similar to that? Can I implement it?
EDIT: If it's not possible, why not?
No, you don't want to do that, as juergen-d said. It's unlikely to do what you think it is doing, and it will do it even less in a multi-user environment.
In a multi-user environment you are likely to get gaps even when there are no deletes, just from aborted inserts.
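For illustration of why this is awkward: filling gaps means every insert first has to search for the lowest free ID, roughly like the sketch below (standard SQL; MyTable is a placeholder name, ID is the column from the example), and the search plus the insert have to be serialized so that two sessions cannot claim the same gap:
-- Find the lowest unused ID (1 if the table is empty or 1 itself is free)
SELECT MIN(c.candidate) AS next_free_id
FROM (
    SELECT 1 AS candidate
    UNION ALL
    SELECT ID + 1 FROM MyTable
) AS c
WHERE NOT EXISTS (SELECT 1 FROM MyTable t WHERE t.ID = c.candidate);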

Database design - efficient text searching

I have a table that contains URL strings, i.e.
/A/B/C
/C/E
/C/B/A/R
Each string is split into tokens, where the separator in my case is '/'. Then I assign an integer value to each token and put them into a dictionary (a different database table), i.e.
A : 1
B : 2
C : 3
E : 4
D : 5
G : 6
R : 7
My problem is to find those rows in the first table which contain a given sequence of tokens. An additional complication is that my input is a sequence of ints, i.e. I have
3, 2
and I'd like to find following rows
/A/B/C
/C/B/A/R
How can I do this in an efficient way? By this I mean: how should I design the proper database structure?
I use PostgreSQL; the solution should work well for 2 million rows in the first table.
To clarify my example - I need both 'B' AND 'C' to be in the URL. Also 'B' and 'C' can occur in any order in the URL.
I need an efficient SELECT; INSERT does not have to be efficient. I do not have to do all the work in SQL, if that changes anything.
Thanks in advance
I'm not sure how to do this, but I'm just giving you some ideas that might be useful. You already have your initial table. You process it and create the token table:
+------------+---------+
| TokenValue | TokenId |
+------------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
| E | 4 |
| D | 5 |
| G | 6 |
| R | 7 |
+------------+---------+
That's ok for me. Now, what I would do is to create a new table in which I would match the original table with the tokens of the token table (OrderedTokens). Something like:
+-------+---------+---------+
| UrlID | TokenId | AnOrder |
+-------+---------+---------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 3 |
| 2 | 5 | 1 |
| 2 | 2 | 2 |
| 2 | 1 | 3 |
| 2 | 7 | 4 |
| 3 | 3 | 1 |
| 3 | 4 | 2 |
+-------+---------+---------+
This way you can even recreate your original table as long as you use the order field. For example:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join tokens t on t.tokenId = ot.tokenId
group by ot.urlId
The previous query would result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C |
| D/B/A/R |
| C/E |
+-------------+
So, you don't even need your original table anymore. If you want to get URLs that have any of the provided token ids (in this case B OR C), you could use this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(case when ot.tokenId in (2, 3) then 1 end) > 0
This results in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
| D/B/A/R | => It has only B
| C/E | => It has only C
+-------------+
Now, if you want to get all Urls that have BOTH ids, then try this:
select string_agg(t.tokenValue, '/' order by ot.anOrder) as OriginalUrl
from OrderedTokens as ot
join Tokens t on t.tokenId = ot.tokenId
group by urlid
having count(distinct case when ot.tokenId in (2, 3) then ot.tokenId end) = 2
Add to the count all the ids you want to filter on, and then compare that count to the number of ids you added. The previous query will result in:
+-------------+
| OriginalUrl |
+-------------+
| A/B/C | => It has both B and C
+-------------+
The funny thing is that none of the solutions I provided results in your expected result. So, have I misunderstood your requirements or is the expected result you provided wrong?
Let me know if this is correct.
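If this layout matches what you need, a minimal DDL sketch for it in PostgreSQL (column names as used above; the extra index is an assumption to make the token-to-URL lookup fast):
-- Token dictionary and the per-URL ordered token list described above
CREATE TABLE Tokens (
    tokenId    integer PRIMARY KEY,
    tokenValue text    NOT NULL UNIQUE
);

CREATE TABLE OrderedTokens (
    urlId   integer NOT NULL,
    tokenId integer NOT NULL REFERENCES Tokens (tokenId),
    anOrder integer NOT NULL,
    PRIMARY KEY (urlId, anOrder)
);

-- Lets the "which URLs contain token X" queries start from the token side
CREATE INDEX orderedtokens_token_idx ON OrderedTokens (tokenId, urlId);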
It really depends on what you mean by efficient. It will be a trade-off between query performance and storage.
If you want to efficiently store this information, then your current approach is appropriate. You can query the data by doing something like this:
SELECT DISTINCT
    u.url
FROM
    urls u
INNER JOIN
    dictionary d
ON
    d.id IN (3, 2)
    AND u.url ~ (E'\\m' || d.url_component || E'\\m')
This query will take some time, as it will be required to do a full table scan, and perform regex logic on each URL. It is, however, very easy to insert and store data.
If you want to optimize for query performance, though, you can create a reference table of the URL components; it would look something like this:
/A/B/C A
/A/B/C B
/A/B/C C
/C/E C
/C/E E
/D/B/A/R D
/D/B/A/R B
/D/B/A/R A
/D/B/A/R R
You can then create a clustered index on this table, on the URL component. This query would retrieve your results very quickly:
SELECT DISTINCT
u.full_url
FROM
url_components u
INNER JOIN
dictionary d
ON
d.id IN (3, 2)
AND u.url_component = d.url_component
Basically, this approach moves the complexity of the query up front. If you are doing few inserts, but lots of queries against this data, then that is appropriate.
Creating this URL component table is trivial, depending on what tools you have at your disposal. A simple awk script could work through your 2M records in a minute or two, and the subsequent copy back into the database would be quick as well. If you need to support real-time updates to this table, I would recommend a non-SQL solution: whatever your app is coded in could use regular expressions to parse the URL and insert the components into the component table. If you are limited to using the database, then an insert trigger could fulfill the same role, but it will be a more brittle approach.
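As a hedged sketch, the component table and index could be created in PostgreSQL like this (names taken from the query above); PostgreSQL has no clustered indexes in the SQL Server sense, but a plain index plus CLUSTER gives a similar on-disk ordering:
-- One row per (full URL, component) pair, as in the listing above
CREATE TABLE url_components (
    full_url      text NOT NULL,
    url_component text NOT NULL
);

CREATE INDEX url_components_component_idx ON url_components (url_component);

-- Physically reorder the table by component; CLUSTER is a one-time operation
-- and has to be repeated after large batches of changes.
CLUSTER url_components USING url_components_component_idx;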

Relative incremental ID by reference field

I have a table to store reservations for certain events; relevant part of it is:
class Reservation(models.Model):
    # django creates an auto-increment field "id" by default
    event = models.ForeignKey(Event)
    # Some other reservation-specific fields..
    first_name = models.CharField(max_length=255)
Now, I wish to retrieve the sequential ID of a given reservation relative to reservations for the same event.
Disclaimer: Of course, we assume reservations are never deleted, or their relative position might change.
Example:
+----+-------+------------+--------+
| ID | Event | First name | Rel.ID |
+----+-------+------------+--------+
| 1 | 1 | AAA | 1 |
| 2 | 1 | BBB | 2 |
| 3 | 2 | CCC | 1 |
| 4 | 2 | DDD | 2 |
| 5 | 1 | EEE | 3 |
| 6 | 3 | FFF | 1 |
| 7 | 1 | GGG | 4 |
| 8 | 1 | HHH | 5 |
+----+-------+------------+--------+
The last column is the "Relative ID", that is, a sequential number, with no gaps, for all reservations of the same event.
Now, what's the best way to accomplish this, without having to manually calculate relative id for each import (I don't like that)? I'm using postgresql as underlying database, but I'd prefer to stick with django abstraction layer in order to keep this portable (i.e. no database-specific solutions, such as triggers etc.).
Filtering using Reservation.objects.filter(event_id = some_event_id) should suffice. This will give you a QuerySet that should have the same ordering each time. Or am I missing something in your question?
I hate always being the one who answers his own question, but I solved it using this:
from django.db.models import Q

class Reservation(models.Model):
    # ...
    def relative_id(self):
        return self.id - Reservation.objects.filter(id__lt=self.id).filter(~Q(event=self.event)).count()
Assuming records from reservations are never deleted, we can safely assume the "relative id" is the incremental id minus the count of reservations before this one that do not belong to the same event.
I've thought about possible drawbacks, but I haven't found any.
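For comparison, the relative id this method computes is what a PostgreSQL window function would produce directly (myapp_reservation stands in for whatever table name Django generates for the model; the column names follow Django's defaults):
-- Number the reservations of each event in insertion (id) order
SELECT id,
       event_id,
       first_name,
       ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY id) AS relative_id
FROM myapp_reservation;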