What is the LINQ operator to perform a division operation on tables? - sql

To select elements belonging to a particular group, when the elements and their group type are stored in one table and all the group types are listed in another table, we perform a division on the tables.
I am trying to write a LINQ query to perform the same operation. Please tell me how I can do it.

Apparently, from the definition in that blog post, you'd want Intersect and Except.
Table1.Except(Table1.Intersect(Table2));
or rather in your case I'd guess
Table1.Where(d => !Table2.Any(t => t.Type == d.Type));
not so hard.
I don't think performance can be made much better, actually. Maybe with a GroupBy.
Table1.GroupBy(t => t.Type).Where(g => !Table2.Any(t => t.Type == g.Key)).SelectMany(g => g);
This should be better for performance: it only searches the second table once per distinct type, rather than once for every row in Table1.
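To make that concrete, here is a minimal, self-contained C# sketch of both variants. The Item and GroupType classes and the sample data are assumptions made purely for illustration; only the two query shapes come from the answer above:
// Hypothetical schema: Table1 holds elements with their group type,
// Table2 lists group types. Both are in-memory here (LINQ to Objects).
using System;
using System.Linq;

class Item { public string Name; public string Type; }
class GroupType { public string Type; }

class DivisionDemo
{
    static void Main()
    {
        var Table1 = new[]
        {
            new Item { Name = "a", Type = "X" },
            new Item { Name = "b", Type = "Y" },
            new Item { Name = "c", Type = "Z" }
        };
        var Table2 = new[] { new GroupType { Type = "X" }, new GroupType { Type = "Y" } };

        // Per-row variant: probes Table2 for every row of Table1.
        var perRow = Table1.Where(d => !Table2.Any(t => t.Type == d.Type));

        // Grouped variant: probes Table2 only once per distinct Type.
        var perType = Table1.GroupBy(t => t.Type)
                            .Where(g => !Table2.Any(t => t.Type == g.Key))
                            .SelectMany(g => g);

        Console.WriteLine(perRow.Count());  // 1 row survives: the one with Type "Z"
        foreach (var item in perType)
            Console.WriteLine($"{item.Name} ({item.Type})");  // prints: c (Z)
    }
}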

It's a bit difficult to determine exactly what you're asking. But it sounds like you are looking to determine the elements that are common to two tables or streams. If so, I think you want Intersect.
Take a look at the documentation for Intersect.
It works something like this:
int[] array1 = { 1, 2, 3 };
int[] array2 = { 2, 3, 4 };
var intersect = array1.Intersect(array2);
Returns 2 and 3.
The opposite of this would be Except().
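For completeness, here is the Except side with the same two arrays (like the Intersect example, this assumes a using System.Linq; directive):
int[] array1 = { 1, 2, 3 };
int[] array2 = { 2, 3, 4 };
var except = array1.Except(array2);  // elements of array1 that are not in array2
// Returns 1. Swapping the operands, array2.Except(array1) returns 4.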

Related

Efficient way to select one from each category - Rails

I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
@exercises = BODYPARTS.map do |bp|
  Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
@exercises = Exercise.all
BODYPARTS.map do |bp|
  @exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale (columns are user, system, total and real time):
Queries:            0.072902  0.020728  0.093630 ( 0.088008)
Select:             0.000962  0.000225  0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072  0.000008  0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do this (thanks to PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
        .order('bodypart, random()')
Postgres' DISTINCT ON is very handy, and performance is typically great, too - for many distinct bodyparts with few rows each. But for only a few distinct values of bodypart with many rows each (a big table - and your use case), there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM   unnest(enum_range(null::bodypart)) b(bodypart)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   exercises
   WHERE  bodypart = b.bodypart
   -- ORDER BY ???  -- for a deterministic pick
   LIMIT  1  -- arbitrary pick!
   ) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row per bodypart, a simple index on exercises(bodypart) does the job. But you can get a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause, at almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?

What's the most efficient way to store sets in a database?

I want to store sets in a such a way that I can query for sets that are a superset of, subset of, or intersect with another set.
For example, if my database has the sets { 1, 2, 3 }, { 2, 3, 5 }, { 5, 10, 12 } and I query it for:
Sets which are supersets of { 2, 3 } it should give me { 1, 2, 3 }, { 2, 3, 5 }
Sets which are subsets of { 1, 2, 3, 4 } it should give me { 1, 2, 3 }
Sets which intersect with { 1, 10, 20 } it should give me { 1, 2, 3 }, { 5, 10, 12 }
Since some sets are unknown in advance (your comment suggests they come from the client as search criteria), you cannot "precook" the set relationships into the database. Even if you could, that would introduce redundancy and therefore an opportunity for inconsistencies.
Instead, I'd do something like this:
CREATE TABLE "SET" (
  ELEMENT INT,  -- Or whatever the element type is.
  SET_ID  INT,
  PRIMARY KEY (ELEMENT, SET_ID)
)
Additional suggestions:
Note how the ELEMENT field is at the primary key's leading edge. This should aid the queries below better than PRIMARY KEY (SET_ID, ELEMENT) would. You can still add the latter if desired, but if you don't, then you should also...
Cluster the table (if your DBMS supports it), which means that the whole table is just a single B-tree (with no separate table heap). That way, you maximize the performance of the queries below, minimize storage requirements, and improve cache effectiveness.
You can then find the IDs of sets that are equal to or supersets of (for example) the set {2, 3} like this (the HAVING count must equal the number of elements in the query set):
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID
HAVING COUNT(*) = 2;
And sets that intersect {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID;
And sets that are equal to or are subsets of {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE SET_ID NOT IN (
    SELECT SET_ID
    FROM "SET" S2
    WHERE S2.ELEMENT NOT IN (2, 3)
)
GROUP BY SET_ID;
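As an aside, the superset test above translates almost word for word into LINQ over an in-memory list of rows. This is only an illustrative sketch; the tuple shape and the sample data are my own assumptions:
using System;
using System.Linq;

class SupersetDemo
{
    static void Main()
    {
        // Rows of the "SET" table as (Element, SetId) pairs.
        var rows = new[]
        {
            (Element: 1, SetId: 1), (Element: 2, SetId: 1), (Element: 3, SetId: 1),
            (Element: 2, SetId: 2), (Element: 3, SetId: 2), (Element: 5, SetId: 2),
            (Element: 5, SetId: 3), (Element: 10, SetId: 3), (Element: 12, SetId: 3)
        };

        var query = new[] { 2, 3 };

        // Ids of sets containing every element of the query set, mirroring:
        // WHERE ELEMENT IN (...) GROUP BY SET_ID HAVING COUNT(*) = n
        var supersetIds = rows.Where(r => query.Contains(r.Element))
                              .GroupBy(r => r.SetId)
                              .Where(g => g.Count() == query.Length)
                              .Select(g => g.Key);

        Console.WriteLine(string.Join(", ", supersetIds));  // prints: 1, 2
    }
}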
"Efficient" can mean a lot of things, but the normalized way would be to have an Items table with all the possible elements and a Sets table with all the sets, and an ItemsSets lookup table. If you have sets A and B in your Sets table, queries like (doing this for clarity rather than optimization... also "Set" is a bad name for a table or field, given it is a keyword)
SELECT itemname FROM Items i
WHERE i.itemname IN
  (SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'A')
AND i.itemname IN
  (SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'B')
That, for instance, is the intersection of A and B (you can almost certainly speed this up as a JOIN; again, "efficient" can mean a lot of things, and you'll want an architecture that allows a query like that). Similar queries can be written to compute the difference or the complement, to test for equality, etc.
Now, I know you asked about efficiency, and this is a horribly slow way to query, but this is the only reliably scalable architecture for the tables to do this, and the query was just an easy one to show how the tables are built. You can do all sorts of crazy things to, say, cache intersections, or store multiple items that are in a set in one field and process that, or what have you. But don't. Cached info will eventually get stale; static limits on the number of items in the field size will be surpassed; ad-hoc members of new tuples will be misinterpreted.
Again, "efficient" can mean a lot of different things, but ultimately an information architecture you as a programmer can understand and reason about is going to be the most efficient.

JOIN datasets via GROUP ALL

This question might sound very weird, but it is always on my mind. Let's assume that I have a one-column set of data. How could I place a static string as a second column on every line? So if the first line of the single-column dataset says "hello", the two-column equivalent should say "hello", "world".
In case you are wondering why I want to do that: later in my script I need to join the single-column dataset with another one, and the former has no point of reference. This is what I have done so far:
fnl2 = FOREACH fnl1 GENERATE
    var1,
    (var1 == var1 ? 'World' : 'World') AS var2;
In case this can be done by group all or something similar, please feel to provide your hints.
You're on the right track, but the bincond is unnecessary. You can just do
fnl2 = FOREACH fnl1 GENERATE var1, 'World' AS var2;
But even this is not necessary if you are doing this so you can perform a JOIN later. JOIN takes expressions as well as fields, so you can just do
joined = JOIN fnl1 BY 1, other BY 1;
But even THIS is unnecessary, because you are just performing a cross-product, and Pig is one step ahead of you:
crossed = CROSS fnl1, other;
The last statement is what I think you are looking for, but hopefully the others illustrate some helpful points for you.

SQL query performance for a 'map' statement

I am using Ruby on Rails 3, and I would like to know what the performance difference is between these query statements:
# Case 1
accounts = ids.map { |id| Account.find_by_id(id) }
# Case 2
accounts = ids.map { |id| Account.where(:id => id).first }
Is there a better way to do this? Also, if there are 100 ids, how can I limit the search so that only 5 accounts are returned?
As @RubyFanatic said, there's no real difference between those two (they'd both generate the same query), but there is a considerably better way of doing it:
accounts = Account.where(:id => ids)
This will generate SQL like select * from accounts where accounts.id in (1,2,3) and will be considerably faster than finding them one at a time.
And if you only want to use, say, 5 of the ids from the array, you'd need to decide which 5 to use. For example, if you wanted to use the first 5:
accounts = Account.where(:id => ids[0..4])
Or you could use limit, but then the query still has to do a little more work if the ids array is large:
accounts = Account.where(:id => ids).limit(5)
There should be no performance difference between these two queries. They are literally doing the same thing. The second statement might be slightly slower, but the difference is so minuscule that it does not even matter.

LINQ exclusion

Is there a direct LINQ syntax for finding the members of set A that are absent from set B? In SQL I would write this:
SELECT A.* FROM A LEFT JOIN B ON A.ID = B.ID WHERE B.ID IS NULL
See the MSDN documentation on the Except operator.
var results = from itemA in A
              where !B.Any(itemB => itemB.Id == itemA.Id)
              select itemA;
I believe your LINQ would be something like the following.
var items = A.Except(
    from itemA in A
    from itemB in B
    where itemA.ID == itemB.ID
    select itemA);
Update
As indicated by Maslow in the comments, this may well not be the most performant query. As with any code, it is important to carry out some level of profiling to remove bottlenecks and inefficient algorithms. In this case, chaowman's answer provides a better performing result.
The reasons can be seen with a little examination of the queries. In the example I provided, there are at least two loops over the A collection - one to combine the A and B lists, and the other to perform the Except operation - whereas in chaowman's answer (reproduced below), the A collection is only iterated once.
// chaowman's solution only iterates A once and partially iterates B
var results = from itemA in A
              where !B.Any(itemB => itemB.Id == itemA.Id)
              select itemA;
Also, in my answer, the B collection is iterated in its entirety for every item in A, whereas in chaowman's answer it is only iterated up to the point at which a match is found.
As you can see, even before looking at the SQL generated, you can spot potential performance issues just from the query itself. Thanks again to Maslow for highlighting this.
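One further option, not taken from the answers above but worth noting when the query runs as LINQ to Objects rather than against a database: build a HashSet of the B keys once, so each membership test is a constant-time lookup instead of a scan of B. A minimal sketch, assuming both sides expose an int Id:
using System;
using System.Collections.Generic;
using System.Linq;

class Entity { public int Id; public string Name; }

class ExclusionDemo
{
    static void Main()
    {
        var A = new List<Entity> { new Entity { Id = 1, Name = "a" },
                                   new Entity { Id = 2, Name = "b" } };
        var B = new List<Entity> { new Entity { Id = 2, Name = "x" } };

        // Build the set of B ids once, then keep only A items whose id is absent.
        var bIds = new HashSet<int>(B.Select(b => b.Id));
        var results = A.Where(a => !bIds.Contains(a.Id));

        foreach (var item in results)
            Console.WriteLine(item.Name);  // prints: a
    }
}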