Sorting with many to many relationship - sql

I have a 3 tables person, person_speaks_language and language.
person has 80 records
language has 2 records
I have the following records
the first 10 persons speaks one language
the first 70 persons (include the first group) speaks 2 languages
the last 10 persons dont speaks any language
Following with the example I want sort the persons by language, How I can do it correctly.
I'm trying to use the the following SQL but seems quite strange
SELECT "person".*
FROM "person"
LEFT JOIN "person_speaks_language" ON "person"."id" = "person_speaks_language"."person_id"
LEFT JOIN "language" ON "person_speaks_language"."language_id" = "language"."id"
ORDER BY "language"."name"
ASC
dataset
71,Catherine,Porter,male,NULL
72,Isabelle,Sharp,male,NULL
73,Scott,Chandler,male,NULL
74,Jean,Graham,male,NULL
75,Marc,Kennedy,male,NULL
76,Marion,Weaver,male,NULL
77,Melvin,Fitzgerald,male,NULL
78,Catherine,Guerrero,male,NULL
79,Linnie,Strickland,male,NULL
80,Ann,Henderson,male,NULL
11,Daniel,Boyd,female,English
12,Ora,Beck,female,English
13,Hulda,Lloyd,female,English
14,Jessie,McBride,female,English
15,Marguerite,Andrews,female,English
16,Maurice,Hamilton,female,English
17,Cecilia,Rhodes,female,English
18,Owen,Powers,female,English
19,Ivan,Butler,female,English
20,Rose,Bishop,female,English
21,Franklin,Mann,female,English
22,Martha,Hogan,female,English
23,Francis,Oliver,female,English
24,Catherine,Carlson,female,English
25,Rose,Sanchez,female,English
26,Danny,Bryant,female,English
27,Jim,Christensen,female,English
28,Eric,Banks,female,English
29,Tony,Dennis,female,English
30,Roy,Hoffman,female,English
31,Edgar,Hunter,female,English
32,Matilda,Gordon,female,English
33,Randall,Cruz,female,English
34,Allen,Brewer,female,English
35,Iva,Pittman,female,English
36,Garrett,Holland,female,English
37,Johnny,Russell,female,English
38,Nina,Richards,female,English
39,Mary,Ballard,female,English
40,Adrian,Sparks,female,English
41,Evelyn,Santos,female,English
42,Bess,Jackson,female,English
43,Nicholas,Love,female,English
44,Fred,Perkins,female,English
45,Cynthia,Dunn,female,English
46,Alan,Lamb,female,English
47,Ricardo,Sims,female,English
48,Rosie,Rogers,female,English
49,Susan,Sutton,female,English
50,Mary,Boone,female,English
51,Francis,Marshall,male,English
52,Carl,Olson,male,English
53,Mario,Becker,male,English
54,May,Hunt,male,English
55,Sophie,Neal,male,English
56,Frederick,Houston,male,English
57,Edwin,Allison,male,English
58,Florence,Wheeler,male,English
59,Julia,Rogers,male,English
60,Janie,Morgan,male,English
61,Louis,Hubbard,male,English
62,Lida,Wolfe,male,English
63,Alfred,Summers,male,English
64,Lina,Shaw,male,English
65,Landon,Carroll,male,English
66,Lilly,Harper,male,English
67,Lela,Gordon,male,English
68,Nina,Perry,male,English
69,Dean,Perez,male,English
70,Bertie,Hill,male,English
1,Nelle,Gill,female,Spanish
2,Lula,Wright,female,Spanish
3,Anthony,Jensen,female,Spanish
4,Rodney,Alvarez,female,Spanish
5,Scott,Holmes,female,Spanish
6,Daisy,Aguilar,female,Spanish
7,Elijah,Olson,female,Spanish
8,Alma,Henderson,female,Spanish
9,Willie,Barrett,female,Spanish
10,Ada,Huff,female,Spanish
11,Daniel,Boyd,female,Spanish
12,Ora,Beck,female,Spanish
13,Hulda,Lloyd,female,Spanish
14,Jessie,McBride,female,Spanish
15,Marguerite,Andrews,female,Spanish
16,Maurice,Hamilton,female,Spanish
17,Cecilia,Rhodes,female,Spanish
18,Owen,Powers,female,Spanish
19,Ivan,Butler,female,Spanish
20,Rose,Bishop,female,Spanish
21,Franklin,Mann,female,Spanish
22,Martha,Hogan,female,Spanish
23,Francis,Oliver,female,Spanish
24,Catherine,Carlson,female,Spanish
25,Rose,Sanchez,female,Spanish
26,Danny,Bryant,female,Spanish
27,Jim,Christensen,female,Spanish
28,Eric,Banks,female,Spanish
29,Tony,Dennis,female,Spanish
30,Roy,Hoffman,female,Spanish
31,Edgar,Hunter,female,Spanish
32,Matilda,Gordon,female,Spanish
33,Randall,Cruz,female,Spanish
34,Allen,Brewer,female,Spanish
35,Iva,Pittman,female,Spanish
36,Garrett,Holland,female,Spanish
37,Johnny,Russell,female,Spanish
38,Nina,Richards,female,Spanish
39,Mary,Ballard,female,Spanish
40,Adrian,Sparks,female,Spanish
41,Evelyn,Santos,female,Spanish
42,Bess,Jackson,female,Spanish
43,Nicholas,Love,female,Spanish
44,Fred,Perkins,female,Spanish
45,Cynthia,Dunn,female,Spanish
46,Alan,Lamb,female,Spanish
47,Ricardo,Sims,female,Spanish
48,Rosie,Rogers,female,Spanish
49,Susan,Sutton,female,Spanish
50,Mary,Boone,female,Spanish
51,Francis,Marshall,male,Spanish
52,Carl,Olson,male,Spanish
53,Mario,Becker,male,Spanish
54,May,Hunt,male,Spanish
55,Sophie,Neal,male,Spanish
56,Frederick,Houston,male,Spanish
57,Edwin,Allison,male,Spanish
58,Florence,Wheeler,male,Spanish
59,Julia,Rogers,male,Spanish
60,Janie,Morgan,male,Spanish
61,Louis,Hubbard,male,Spanish
62,Lida,Wolfe,male,Spanish
63,Alfred,Summers,male,Spanish
64,Lina,Shaw,male,Spanish
65,Landon,Carroll,male,Spanish
66,Lilly,Harper,male,Spanish
67,Lela,Gordon,male,Spanish
68,Nina,Perry,male,Spanish
69,Dean,Perez,male,Spanish
70,Bertie,Hill,male,Spanish
Update
the expect results are: each person must be appears only one time using the language order
For explain the case further, I'll take a new and small dataset, using only the person id and the language name
1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish
4,Spanish
5,Spanish
14,Spanish
15,Spanish
16,Spanish
19,Spanish
21,Spanish
25,Spanish
I'm using the same order but if I use a limit for example LIMIT 8 the results will be
1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish
And the expected result is
1,English
2,English
3,English
4,English
19,English
5,Spanish
14,Spanish
15,Spanish
What I'm trying to do
What I'm trying to do is sorting, paginating and filtering a list of X that may have a many-to-many relationship with Y, in this case X is a person and Y is the language. I need do it in a general way. I found a trouble if I want ordering the list by some Y properties.
The list will show in this way:
firstname, lastname, gender , languages
Daniel , Boyd , female , English Spanish
Ora , Beck , female , English
Anthony , Jensen , female , Spanish
....
I only need return a array with the IDs in the correct order
this is the main reason I need that the results only appears the person one time is because the ORM (that I'm using) try to hydrate each result and if I paginate the results using offset and limit. the results maybe aren't the expected. I'm doing assumptions many to many relationships
I can't use the string_agg or group_concat because I dont know the real data, I dont know if are integers or strings

If you want each person to appear only once, then you need to aggregate by that person. If you then want the list of languages, you need to combine them in some way, concatenation comes to mind.
The use of double quotes suggests Postgres or Oracle to me. Here is Postgres syntax for this:
SELECT p.id, string_agg(l.name) as languages
FROM person p LEFT JOIN
person_speaks_language psl
ON p.id = psl.person_id LEFT JOIN
language l
ON psl.language_id = l.id
GROUP BY p.id
ORDER BY COUNT(l.name) DESC, languages;
Similar functionality to string_agg() exists in most databases.

There is nothing wrong with Bertie Hill appearing in two rows, with one language each, that is the Tabular View of Data per the Relational Model. There are no dependencies on data values or number of data values. It is completely correct and un-confused.
But here, the requirement is confused, because you really want three separate lists:
speaks one language
speaks two languages [or the number of languages currently in the language file]
speaks no language [on file] ) ...
But you want those three lists in one list.
Concatenating data values is never, ever a good idea. It is a breach of rudimentary standards, specifically 1NF. It may be common, but it is a gross error. It may be taught by the so-called "theoreticians", but it remains a gross error. Even in a result set, yes.
It creates confusion, such as I have detailed at the top.
With concatenated strings, as the number of languages changes, the width of that concatenated field will grow, and eventually exceed space, wherever it appears (eg. the width of the field on the screen).
Just two of the many reasons why it is incorrect, not expandable, sub-standard.
By the way, in your "dataset" (it isn't the result set produced by your code), the sexes appear to be nicely mixed up.
Therefore the answer, and the only correct one, even if it isn't popular, is that your code is correct (it can be cleaned it up, sure), and you have to educate the user re the dangers of sub-standard code or reports.
You can sort by person.name (rather than by language.name) and then write smarter SQL such that (eg) the person.name is not repeated on the second and subsequent row for persons who speak more than one language, etc. That is just pretty printing.
The non-answer, for those who insist on sub-standard code that will break one day when, is Gordon's response.
Response to Comments
In the Relational Model:
There is no order to the rows, that is deemed a physical or implementation aspect, which we have no control over, and which changes anyway, and which we are warned not to rely upon. If order is sought in the output result set, then we must us ORDER BY, that is its purpose in life.
The data has meaning, and that meaning is carried in Relational Keys. Meaning cannot be carried in surrogates (ie. ID columns).
Limiting myself to the files (they are not tables) that you have given, there is no such thing in the data as:
the first 10 persons who speaks one language
Obtaining persons who speak one language is simple, I believe you already understand that:
SELECT person.first_name,
person.last_name
FROM person P,
(SELECT person_id
FROM person_speaks_language
GROUP BY person_id
HAVING COUNT(*) = 1 -- change this for 2 languages, etc
) AS PL
WHERE P.person_id = PL.person_id
But "first" ? "first" by what criteria ? Record creation date ?
ORDER BY date_created -- if it exists in the data
Record ID does not give first anything: as records are added and deleted, any "order" that may exist initially is completely lost.
You cannot extract meaning out of, or assign meaning to something that, by definition, has no meaning. If the Record ID is relevant, ie. you are going to use it for some purpose, then it is not a Record ID, name the field for what it actually is.
I fail to see, I do not understand, the relevance of the difference between the "dataset" and the updated "small dataset". The "dataset" size is irrelevant, the field headings are irrelevant, what the result set means, is relevant.
The problem is not some "limitation" in the Relational Model, the problem is (a) your fixed view of data values, and (b) your lack of understanding about what the Relational Model is, what it does, understanding of which makes this whole question disappear, and we are left with a simple SQL (as tagged) "how to" question. Eg. If I had a Relational Database, with persons and languages, with no ID columns, there is nothing that I cannot do with it, no report that I cannot produce from it, from the data.
Please try to use an example that conveys the meaning in the data, in what you are trying to do.
the expect results are: each person must be appear only one time
They already appear only once (for each language)
using the language order
Well, there is no order in the language file. We can give it some order, whatever order is meaning-ful, to you, in the result set, based on the data. Eg. language.name. Of course, many persons speak each language, so what order would you like within language.name? How about last_name, first_name. The Record IDs are meaningless to the user, so I won't display them in the result set. NULL is also meaningless, and ambiguous, so I will make the meaning here explicit. This is pretty much what you have, tidied up:
SELECT [language] = CASE name
WHEN NULL THEN "[None]"
ELSE name
END,
last_name,
first_name
FROM person P
LEFT JOIN person_speaks_language PL
ON P.id = PL.person_id
LEFT JOIN language L
ON PL.language_id = L.id
ORDER BY name,
last_name,
first_name
But then you have:
And the expected result is
The example data of which contradicts your textual descriptions:
the expect results are: each person must be appear only one time using the language order
So now, if I ignore the text, and examine the example data re what you want
(which is a horrible thing to do, because I am joining you in the incorrect activity of focussing on the data values, rather than understanding the meaning),
it appears you want the person to appear only once, full stop, regardless of how many languages they speak. Your example data is meaningless, so I cannot be asked to reproduce it. See if this has some meaning.
SELECT last_name,
first_name,
[language] = ( -- correlated subquery
SELECT TOP 1 -- get the "first" language
CASE name -- make meaning of null explicit
WHEN NULL THEN "[None]"
ELSE name
END
FROM person_speaks_language PL
JOIN language L
ON PL.language_id = L.id
WHERE P.id = PL.person_id -- the subject person
ORDER BY name -- id would be meaningless
)
FROM person P -- vector for person, once
ORDER BY last_name,
first_name
Now if you wanted only persons who speak a language (on file):
SELECT last_name,
first_name,
[language] = ( -- correlated subquery
SELECT TOP 1 -- get the "first" language
name
FROM person_speaks_language PL
JOIN language L
ON PL.language_id = L.id
WHERE P.id = PL.person_id -- the subject person
ORDER BY name -- id would be meaningless
)
FROM person P,
(
SELECT DISTINCT person_id -- just one occ, thanks
FROM person_speaks_language PL -- vector for speakers
) AS PL_1
WHERE P.id = PL_1.person_id -- join them to person fields
There, not an outer join anywhere to be seen, in either solution. LEFT or RIGHT will confuse you. Do not attempt to "get everything", so that you can "see" the data values, and then mangle, hack and chop away at the result set, in order to get what you want from that. No, forget about the data values and get only what you want from the record filing system.
Response to Update
I was trying to explain the case with a data set, I think I made things tougher than they actually were
Yes, you did. Reviewing the update then ...
The short answer is, get rid of the ORM. There is nothing in it of value:
you can access the RDB from the queries that populate your objects directly. The way we did for decades before the flatulent beast came along. Especially if you understand and implement Open Architecture Standards.
Further, as evidenced, it creates masses of problems. Here, you are trying to work around the insane restrictions of the ORM.
Pagination is a straight-forward issue, if you have your data Normalised, and Relational Keys.
The long answer is ... please read this Answer. I trust you will understand that the approach you take to designing your app components, your design of windows, will change. All your queries will be simplified, you get only what you require for the specific window or object.
The problem may well disappear entirely (except for possibly the pagination, you might need a method).
Then please think about those architectural issues carefully, and make specific comments of questions.

Related

My Joins in query not pulling through correctly

Good evening. Could someone please help me with the following. I am trying to join two tables.The first id wbr_global.gl_ap_details. This stores historic GL information. The second table sandbox.utr_fixed_mapping is where account mapping is stored. For example, ana ccount number 60820 is mapped as Employee relation. The first table needs the mapping from the second table linked on the account number. The output I am getting is not right and way to bug. Any help would be appreciated!
Output
select sandbox.utr_fixed_mapping_na.new_mapping_1,sum(wbr_global.gl_ap_details.amount)
from wbr_global.gl_ap_details
LEFT JOIN sandbox.utr_fixed_mapping_na ON wbr_global.gl_ap_details.account_number = sandbox.utr_fixed_mapping_na.account_number
Where gl_ap_details.cost_center = '1172'
and gl_ap_details.period_name = 'JUL-21'
and gl_ap_details.ledger_name = 'Amazon.com, Inc.'
Group by 1;
I tried adding the cast function but after 5000 seconds of the query running I canceled it.
The query itself appears ok, but minor changes. Learn to use table "aliases". This way you don't have to keep typing long database.table.column all over. Additionally, SQL is easier to read doing it that way anyhow.
Notice the aliases "gl" and "fm" after the tables are declared, then these aliases are used to represent the columns.. Easier to read, would you agree.
Added GL Account number as described below the query.
select
gl.account_number,
fm.new_mapping_1,
sum(gl.amount)
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
gl.account_number,
fm.new_mapping_1
Now, as for your query and getting null. This just means that there are records within the gl_ap_details table with an account number that is not found in the utr_fixed_mapping_na table. So, to see WHAT gl account number does NOT exist, I have added it to the query. Its possible there are MULTIPLE records in the gl_ap_details that are not found in the mapping table. So, you may get
GLAccount Description SumOfAmount
glaccount1 null $someAmount
glaccount37 null $someAmount
glaccount49 null $someAmount
glaccount72 Depreciation $someAmount
glaccount87 Real Estate $someAmount
glaccount92 Building $someAmount
glaccount99 Salaries $someAmount
I obviously made-up glaccounts just to show the purpose. You may have multiple where the null's total amount is actually masking how many different gl account numbers were NOT found.
Once you find which are missing, you can check / confirm they SHOULD be in the mapping table.
FEEDBACK.
Since you do realize the missing numbers, lets consider a Cartesian result. If there are multiple entries in the mapping table for the same G/L account number, you will get a Cartesian result thus bloating your numbers. To clarify, lets say your mapping table has
Mapping file.
GL Descr1 NewMapping
1 test Salaries
1 testView Buildings
1 Another Depreciation
And your GL_AP_Details has
GL Amount
1 $100
Your total for the query would result in $300 because the query is trying to join the AP Details GL #1 to EACH of the entries in the mapping file thus bloating the amount. You could also add a COUNT(*) as NumberOfEntries to the query to see how many transactions it THINKS it is processing. Is there some "unique ID" in the GL_AP_Details table? If so, then you could also do a count of DISTINCT ID values. If they are different (distinct is lower than # of entries), I think THAT is your culprit.
select
fm.new_mapping_1,
sum(gl.amount),
count(*) as NumberOfEntries,
count( distinct gl.UniqueIdField ) as DistinctTransactions
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
fm.new_mapping_1
Might you also need to limit the mapping table for a specific prophecy or mec view?
If you "think" that the result of an aggregate is wrong, then the easiest way to verify this is to select the individual rows that correlate to 1 record in the aggregate output and inspect the records, looking for duplications.
For instance, pick 'Building Management':
SELECT fixed.new_mapping_1,details.amount,*
FROM wbr_global.gl_ap_details details
LEFT JOIN sandbox.utr_fixed_mapping_na fixed ON details.account_number = fixed.account_number
WHERE details.cost_center = '1172'
AND details.period_name = 'JUL-21'
AND details.ledger_name = 'Amazon.com, Inc.'
AND details.account_number = 'Building Management'
Notice that we tack on a ,* to the end of the projection, this will show you everything that the query has access to, you should look for repeating sections of data that you were not expecting, then depending on which table they originate from your might add additional criteria to the JOIN, or to the WHERE or you might need to group by additional columns.
This type of issue is really hard to comment on in a forum like this because it is highly specific to your schema, and the data contained within it, making solutions highly subjective to criteria you are not likely to publish online.
Generally if you think a calculation is wrong, you need to manually compute it to verify, this above advice helps you to inspect the data your query is using, you should either construct your own query or use other tools to build the data set that helps you to manually compute the correct values, then work them back into or replace your original query.
The speed issues are out of scope here, we can comment on the poor schema design but I suspect you don't have a choice. In the utr_fixed_mapping_na table you should make the account_number have the same column type as the source data, or add a new column that has the data in the original type, then you can setup indexes on the columns to improve the speed of the join.

Matching observations based on similarity of categorical variables

I was wondering, if someone has a good way how to match two observations based on categorical (non-ordinal) variables.
The exercise I am working on is matching mentees with mentors based on interests and other characteristics that are (non-ordinal or ordinal) categorical variables.
Variable Possible Values
Sport “Baseball”, “Football”, “Basketball” (…)
Marital Status “Single, no kids”, “Single, young kids”, “Married, no kids”, “Married, young kids”, (…)
Job Level 1, 2, 3, 4, 5, 6
Industry “Retail”, “Finance”, “Wholesale”, (…)
There are also indicators if any of the variables is important to the person. I understand, I could force marital status into one or two ordinal variables like (“Single”, “Married”, “Widow”) and (“no kids”, “young kids”, “grown kids”). But I don’t know how to handle industry and sport as there is no logical order to them. My plan was originally to use a clustering technique to find a match between the mentor and the mentee set based on the shortest distance or the given points. But that would ignore the fact that people could decide, if the variable is important to them or not (“Yes”, “No”).
Now, I am thinking to just brute force logic on it by using nested IF statements that check, if there is a perfect match based on the importance and the actual values. ELSE check if there is a matching record that has all matches, but one category etc. This seems very inefficient, so I was hoping if someone came across a similar problem, I would find a better way how to handle this.
Would it make sense to create two variables one for the importance sequence (eg: "YesNoYesNoNo") and one for the interests (eg "BasketballSingleNokids6Retail") and then employ fuzzy matching?
Best regards,
One approach would be to decide first on which variables you must have an exact match, do a cartesian join on those, then generate a score based on other non-mandatory matches and output records where the score exceeds a threshold. The more mandatory matches you require, the better the query will perform.
E.g.
%let MATCH_THRESHOLD = 2; /*At least this many optional variables must match*/
proc sql;
create table matches as
select * from mentors a inner join mentees b
/*Mandatory matches*/
on a.m_var1 = b.m_var1
and a.m_var2 = b.m_var2
and ...
/*Optional threshold-based matches*/
where a.o_var1 = b.o_var1
+ a.o_var2 = b.o_var2
+ ...
>= &MATCH_THRESHOLD;
quit;
Going further - if you have inconsistently entered data, you could use soundex or edit distance matching rather than exact matching for the optional conditions. If some optional matches are worth more than others, you can weight their contribution to the score.

display the order in relational algebra

Suppose we have this relational schema
homebuilder(hID, hName, hStreet, hCity, hZip, hPhone)
model(hID, mID, mName, sqft, story) subdivision(sName,
sCity, sZip) offered(sName, hID, mID, price) lot(sName,
lotNum, lStAddr, lSize, lPremium) sold(sName, lotNum, hID, mID,
status)
I have problem by doing relational algebra for each subdivision , find the number of models offered and the average, minimum and maximum price of the models offered at that subdivision. Also display the result in descending order on the average price of a home.
I am done with SQL formula, but it hard for me to translate this SQL to relational algebra. Can someone help me?
Here is what I got so far:
SQL:=
SELECT S, avg (O.price), min (O.price), max (O.price), count(*)
FROM offered O, subdivision S
WHERE O.sName = S.sName
GROUP BY S.sName
ORDER BY 4 desc;
+1 to DPenner's comment: quite true that you can't do ordering in RA. (Although those q's and a's referenced seem to have some 'difficulties'.)
Another thing you can't do in RA (contra the SQL that JaveLeave shows) is to have anonymous columns referenced by position. If SQL were a sensible language (or indeed any sort of language at all), you could name the column in the SELECT clause ..., max (O.price) AS maxPrice, ... then ORDER BY maxPrice desc. But no, you can't do that. In SQL you have to repeat ORDER BY max (O.price) desc. (By the way, the question asked for ordering by average price, not max(?) That's column 2.)
Contrast that the RA Group operation returns a relation. And being a relation it must have attributes only addressable by name.
Back to the question as asked. The nearest you can get to an ordering is to put a column on each row with the ordinal position of this row relative to the overall table. Since the question asks for descending sequence, the first step is to find the subdivision with minimum average price and tag it with ordinal 1. Then select all but that one, get the minmium of those, tag it with 2. And in general: take all not tagged so far; get the minmum; tag it with highest tag so far +1; recurse. So you need the transitive closure operation (which is another 'missing' feature of standard RA). You can find some SQL code to achieve this sort of thing in comp.database.theory -- from memory Joe Celko gives examples.
Off-topic: I'm puzzled why courses/professors/textbooks in SQL also ask you to do impossible things in RA. Certainly it's good to have a grounding in RA. It's a powerful mental model to understand data structures that SQL only obscures. RA (as an algebra) underpins most SQL engines. But then why leave the impression that RA is some sort of 'poor cousin' to SQL? There are no commercial implementations of RA; there are no job advertisements for RA programmers. Why try to make it what it isn't?

Filtering simultaneously on count of related objects and on count of related objects that satisfy a condition in Django

So I have models amounting to this (very simplified, obviously):
class Mystery(models.Model):
name = models.CharField(max_length=100)
class Character(models.Model):
mystery = models.ForeignKey(Mystery, related_name="characters")
required = models.BooleanField(default=True)
Basically, in each mystery there are a number of characters, which can be essential to the story or not. The minimum number of actors that can stage a mystery is the number of required characters for that mystery; the maximum number is the number of characters total for the mystery.
Now I'm trying to query for mysteries that can be played by some given number of actors. It seemed straightforward enough using the way Django's filtering and annotation features function; after all, both of these queries work fine:
# Returns mystery objects with at least x characters in all
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x)
# Returns mystery objects with no more than x required characters
Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x)
However, when I try to combine the two...
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x, max_actors__gte=x)
...it doesn't work. Both min_actors and max_actors come out containing the maximum number of actors. The relevant parts of the actual query being run look like this:
SELECT `mysteries_mystery`.`id`,
`mysteries_mystery`.`name`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `max_actors`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `min_actors`
FROM `mysteries_mystery`
LEFT OUTER JOIN `mysteries_character` ON (`mysteries_mystery`.`id` = `mysteries_character`.`mystery_id`)
INNER JOIN `mysteries_character` T5 ON (`mysteries_mystery`.`id` = T5.`mystery_id`)
WHERE T5.`required` = True
GROUP BY `mysteries_mystery`.`id`, `mysteries_mystery`.`name`
...which makes it clear that while Django is creating a second join on the character table just fine (the second copy of the table being aliased to T5), that table isn't actually being used anywhere and both of the counts are being selected from the non-aliased version, which obviously yields the same result both times.
Even when I try to use an extra clause to select from T5, I get told there is no such table as T5, even as examining the output query shows that it's still aliasing the second character table to T5. Another attempt to do this with extra clauses went like this:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).extra(select={'min_actors': "SELECT COUNT(*) FROM mysteries_character WHERE required = True AND mystery_id = mysteries_mystery.id"}).extra(where=["`min_actors` <= %s", "`max_actors` >= %s"], params=[x, x])
But that didn't work because I can't use a calculated field in the WHERE clause, at least on MySQL. If only I could use HAVING, but alas, Django's .extra() does not and will never allow you to set HAVING parameters.
Is there any way to get Django's ORM to do what I want?
How about combining your Count()s:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True),min_actors=Count('characters', distinct=True)).filter(characters__required=True).filter(min_actors__lte=x, max_actors__gte=x)
This seems to work for me but I didn't test it with your exact models.
It's been a couple of weeks with no suggested solutions, so here's how I ended up going about it, for anyone else who might be looking for an answer:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x, id__in=Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x).values('id'))
In other words, filter on the first count and on IDs that match those in an explicit subquery that filters on the second count. Kind of clunky, but it works well enough for my purposes.

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT LOCATION, TRUNC(TRANSACTION_DATE), SUM(TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY LOCATION,
TRUNC(TRANSACTION_DATE)
ORDER BY LOCATION,
TRUNC(TRANSACTION_DATE)
In this case the query should return 365 (or fewer, if stores are not open every day) rows for each store in district 23. If there's 25 stores in the district it'll return 9125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATIONS L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many location were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by using pagination by utilizing the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N number of records before taking the next page.
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection when traversing the object graph. It is recommended that you would issue a separate and explicit query for the contents of the collection, which will allow you to page through that collection without loading too much data into memory.