display the order in relational algebra - sql

Suppose we have this relational schema
homebuilder(hID, hName, hStreet, hCity, hZip, hPhone)
model(hID, mID, mName, sqft, story) subdivision(sName,
sCity, sZip) offered(sName, hID, mID, price) lot(sName,
lotNum, lStAddr, lSize, lPremium) sold(sName, lotNum, hID, mID,
status)
I have problem by doing relational algebra for each subdivision , find the number of models offered and the average, minimum and maximum price of the models offered at that subdivision. Also display the result in descending order on the average price of a home.
I am done with SQL formula, but it hard for me to translate this SQL to relational algebra. Can someone help me?
Here is what I got so far:
SQL:=
SELECT S, avg (O.price), min (O.price), max (O.price), count(*)
FROM offered O, subdivision S
WHERE O.sName = S.sName
GROUP BY S.sName
ORDER BY 4 desc;

+1 to DPenner's comment: quite true that you can't do ordering in RA. (Although those q's and a's referenced seem to have some 'difficulties'.)
Another thing you can't do in RA (contra the SQL that JaveLeave shows) is to have anonymous columns referenced by position. If SQL were a sensible language (or indeed any sort of language at all), you could name the column in the SELECT clause ..., max (O.price) AS maxPrice, ... then ORDER BY maxPrice desc. But no, you can't do that. In SQL you have to repeat ORDER BY max (O.price) desc. (By the way, the question asked for ordering by average price, not max(?) That's column 2.)
Contrast that the RA Group operation returns a relation. And being a relation it must have attributes only addressable by name.
Back to the question as asked. The nearest you can get to an ordering is to put a column on each row with the ordinal position of this row relative to the overall table. Since the question asks for descending sequence, the first step is to find the subdivision with minimum average price and tag it with ordinal 1. Then select all but that one, get the minmium of those, tag it with 2. And in general: take all not tagged so far; get the minmum; tag it with highest tag so far +1; recurse. So you need the transitive closure operation (which is another 'missing' feature of standard RA). You can find some SQL code to achieve this sort of thing in comp.database.theory -- from memory Joe Celko gives examples.
Off-topic: I'm puzzled why courses/professors/textbooks in SQL also ask you to do impossible things in RA. Certainly it's good to have a grounding in RA. It's a powerful mental model to understand data structures that SQL only obscures. RA (as an algebra) underpins most SQL engines. But then why leave the impression that RA is some sort of 'poor cousin' to SQL? There are no commercial implementations of RA; there are no job advertisements for RA programmers. Why try to make it what it isn't?

Related

SQL - HAVING (execution vs structure)

I'm a beginner, studying on my own... please help me to clarify something about a query: I am working with a soccer database and trying to answer this question: list all seasons with an avg goal per Match rate of over 1, in Matchs that didn’t end with a draw;
The right query for it is:
select season,round((sum(home_team_goal+away_team_goal) *1.0) /count(id),3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I *1.0? why is it necessary?
I know that the execution in SQL is by this order:
from
where
group
having
select
So how does this query include: having ratio>1 if the "ratio" is only defined in the "select" which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT because by default sum of ints is int and the division looses decimal places after dividing 2 ints.
HAVING. You can consider HAVING as WHERE but applied to the query results. Imagine the query is executed first without HAVING and then the HAVING condition is applied to result rows leaving only suitable ones.
In you case you first select grouped data and calculate aggregated results and then skip unnecessary results of aggregation.
the *1.0 is used for its ".0" part so that it tells the system to treat the expression as a decimal, and thus not make an integer division which would cut-off the decimal part (eg 1 instead of 1.33).
About the second part: select being at the end just means that the last thing
to be done is showing the data. Hoewever, assigning an alias to a calculated field is being done, you could say, at first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the where/group by/having in, say, sql server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation". This is more simply described by simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
As for the * 1.0, it is because some databases -- such as SQLite -- do integer arithmetic. However, the logic that you want is probably more simply expressed as:
round((avg(home_team_goal + away_team_goal * 1.0), 3)

How to represent GROUP BY with HAVING COUNT(*)>1 in relational algebra?

For an exam, I am asked to get the list of clients having more than one rent, both as an SQL query and as an algebraic expression.
For some reasons, the correction doesn't provide the algebraic version.
So now I am left with:
SELECT IdClient, Name, ...
FROM Client
WHERE IdClient IN (
SELECT IdClient
FROM Rental
GROUP BY IdClient
HAVING COUNT(*) > 1
)
I don't know if there is a standard for algebraic notations, therefore:
Π projection
× Cartesian product
⋈ natural join
σ selection
Then I translate as:
Π IdClient, Name, ... (
σ (count(IdClient)>1) (Π Rental) ⋈ (Client ⋈ Rental)
)
But I find no source to prove me right or wrong, especially for:
the logic behind the math
Π Rental seems like a shady business
I saw the use of count() at https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra and while it isn't used the same way, I couldn't figure out a way to use it without the projection (which I wanted to avoid.)
There are many variants of "relational algebra", differing even on what a relation is. You need to tell us which one you are supposed to use.
Also you don't explain what it means for a pair of RA & SQL queries to "have the form of" or "be the same as" each other. (Earlier versions.) Same result? Or also some kind of parallel structure?
Also you don't explain what "get the list of clients" means. What attributes does the result have?
If you try to write a definition of the count you are trying to use in σ count(IdClient)>1 (...)--what it inputs & what it outputs based on that--you will see that you can't. That kind of count that takes just an attribute does not correspond to a relational operator. It is used in a grouping expression--which you are missing. Such count & group are not actually relational operators, they are non-terminals in so-called relational algebras that are really query languages, designed by SQL apologists, suggesting it is easy to map SQL to relational algebra, but begging the question of how we aggregate in an algebra. Still, maybe that's the sort of "relational algebra" you were told to use.
I saw the use of count() there https://cs.stackexchange.com/questions/29897/use-count-in-relational-algebra
The nature of algebras is that the only sense in which we "use" operators "with" other operators is to pass outputs of operator calls as inputs to other operator calls. (Hence some so-called algebras are not.) In your linked answer, grouping operator G inputs aggregate name count and attribute name name and that affects the output. The answer quotes Database System Concepts, 5th Ed:
G1, G2, ..., Gn G F1(A1), F2(A2), ..., Fm(Am) (E)
where E is any relational-algebra expression; G1, G2, ..., Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute name.
G returns rows with attributes G1, ..., A1, ... where one or more rows with the same G1, ... subrows are in E and each Ai holds the output from aggregating Fi on Ai over those rows.
But the answer when you read & linked it used that definition improperly. (I got it fixed since.) Correct is:
π name (σ phone>1 (name G count(phone) (Person)))
This is clear if you carefully read the definition.
G has misleading syntax. count(phone) is not a call of an operator; it's just a pair of arguments, an aggregate name count & an attribute name phone. Less misleading syntax would be
π name (σ phone>1 (name G count phone (Person)))
One does not need a grouping operator to write your query. That makes it all the more important to know what "relational algebra" means in the exam. It is harder if you can't use a grouping operator.
"Π Rental seems like a shady business" is unclear. You do use projection incorrectly; proper use is π attributes (relation). I guess you are using π in an attempt to involve a grouping operator like G. Re "the logic behind the math" see this.

Sorting with many to many relationship

I have a 3 tables person, person_speaks_language and language.
person has 80 records
language has 2 records
I have the following records
the first 10 persons speaks one language
the first 70 persons (include the first group) speaks 2 languages
the last 10 persons dont speaks any language
Following with the example I want sort the persons by language, How I can do it correctly.
I'm trying to use the the following SQL but seems quite strange
SELECT "person".*
FROM "person"
LEFT JOIN "person_speaks_language" ON "person"."id" = "person_speaks_language"."person_id"
LEFT JOIN "language" ON "person_speaks_language"."language_id" = "language"."id"
ORDER BY "language"."name"
ASC
dataset
71,Catherine,Porter,male,NULL
72,Isabelle,Sharp,male,NULL
73,Scott,Chandler,male,NULL
74,Jean,Graham,male,NULL
75,Marc,Kennedy,male,NULL
76,Marion,Weaver,male,NULL
77,Melvin,Fitzgerald,male,NULL
78,Catherine,Guerrero,male,NULL
79,Linnie,Strickland,male,NULL
80,Ann,Henderson,male,NULL
11,Daniel,Boyd,female,English
12,Ora,Beck,female,English
13,Hulda,Lloyd,female,English
14,Jessie,McBride,female,English
15,Marguerite,Andrews,female,English
16,Maurice,Hamilton,female,English
17,Cecilia,Rhodes,female,English
18,Owen,Powers,female,English
19,Ivan,Butler,female,English
20,Rose,Bishop,female,English
21,Franklin,Mann,female,English
22,Martha,Hogan,female,English
23,Francis,Oliver,female,English
24,Catherine,Carlson,female,English
25,Rose,Sanchez,female,English
26,Danny,Bryant,female,English
27,Jim,Christensen,female,English
28,Eric,Banks,female,English
29,Tony,Dennis,female,English
30,Roy,Hoffman,female,English
31,Edgar,Hunter,female,English
32,Matilda,Gordon,female,English
33,Randall,Cruz,female,English
34,Allen,Brewer,female,English
35,Iva,Pittman,female,English
36,Garrett,Holland,female,English
37,Johnny,Russell,female,English
38,Nina,Richards,female,English
39,Mary,Ballard,female,English
40,Adrian,Sparks,female,English
41,Evelyn,Santos,female,English
42,Bess,Jackson,female,English
43,Nicholas,Love,female,English
44,Fred,Perkins,female,English
45,Cynthia,Dunn,female,English
46,Alan,Lamb,female,English
47,Ricardo,Sims,female,English
48,Rosie,Rogers,female,English
49,Susan,Sutton,female,English
50,Mary,Boone,female,English
51,Francis,Marshall,male,English
52,Carl,Olson,male,English
53,Mario,Becker,male,English
54,May,Hunt,male,English
55,Sophie,Neal,male,English
56,Frederick,Houston,male,English
57,Edwin,Allison,male,English
58,Florence,Wheeler,male,English
59,Julia,Rogers,male,English
60,Janie,Morgan,male,English
61,Louis,Hubbard,male,English
62,Lida,Wolfe,male,English
63,Alfred,Summers,male,English
64,Lina,Shaw,male,English
65,Landon,Carroll,male,English
66,Lilly,Harper,male,English
67,Lela,Gordon,male,English
68,Nina,Perry,male,English
69,Dean,Perez,male,English
70,Bertie,Hill,male,English
1,Nelle,Gill,female,Spanish
2,Lula,Wright,female,Spanish
3,Anthony,Jensen,female,Spanish
4,Rodney,Alvarez,female,Spanish
5,Scott,Holmes,female,Spanish
6,Daisy,Aguilar,female,Spanish
7,Elijah,Olson,female,Spanish
8,Alma,Henderson,female,Spanish
9,Willie,Barrett,female,Spanish
10,Ada,Huff,female,Spanish
11,Daniel,Boyd,female,Spanish
12,Ora,Beck,female,Spanish
13,Hulda,Lloyd,female,Spanish
14,Jessie,McBride,female,Spanish
15,Marguerite,Andrews,female,Spanish
16,Maurice,Hamilton,female,Spanish
17,Cecilia,Rhodes,female,Spanish
18,Owen,Powers,female,Spanish
19,Ivan,Butler,female,Spanish
20,Rose,Bishop,female,Spanish
21,Franklin,Mann,female,Spanish
22,Martha,Hogan,female,Spanish
23,Francis,Oliver,female,Spanish
24,Catherine,Carlson,female,Spanish
25,Rose,Sanchez,female,Spanish
26,Danny,Bryant,female,Spanish
27,Jim,Christensen,female,Spanish
28,Eric,Banks,female,Spanish
29,Tony,Dennis,female,Spanish
30,Roy,Hoffman,female,Spanish
31,Edgar,Hunter,female,Spanish
32,Matilda,Gordon,female,Spanish
33,Randall,Cruz,female,Spanish
34,Allen,Brewer,female,Spanish
35,Iva,Pittman,female,Spanish
36,Garrett,Holland,female,Spanish
37,Johnny,Russell,female,Spanish
38,Nina,Richards,female,Spanish
39,Mary,Ballard,female,Spanish
40,Adrian,Sparks,female,Spanish
41,Evelyn,Santos,female,Spanish
42,Bess,Jackson,female,Spanish
43,Nicholas,Love,female,Spanish
44,Fred,Perkins,female,Spanish
45,Cynthia,Dunn,female,Spanish
46,Alan,Lamb,female,Spanish
47,Ricardo,Sims,female,Spanish
48,Rosie,Rogers,female,Spanish
49,Susan,Sutton,female,Spanish
50,Mary,Boone,female,Spanish
51,Francis,Marshall,male,Spanish
52,Carl,Olson,male,Spanish
53,Mario,Becker,male,Spanish
54,May,Hunt,male,Spanish
55,Sophie,Neal,male,Spanish
56,Frederick,Houston,male,Spanish
57,Edwin,Allison,male,Spanish
58,Florence,Wheeler,male,Spanish
59,Julia,Rogers,male,Spanish
60,Janie,Morgan,male,Spanish
61,Louis,Hubbard,male,Spanish
62,Lida,Wolfe,male,Spanish
63,Alfred,Summers,male,Spanish
64,Lina,Shaw,male,Spanish
65,Landon,Carroll,male,Spanish
66,Lilly,Harper,male,Spanish
67,Lela,Gordon,male,Spanish
68,Nina,Perry,male,Spanish
69,Dean,Perez,male,Spanish
70,Bertie,Hill,male,Spanish
Update
the expect results are: each person must be appears only one time using the language order
For explain the case further, I'll take a new and small dataset, using only the person id and the language name
1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish
4,Spanish
5,Spanish
14,Spanish
15,Spanish
16,Spanish
19,Spanish
21,Spanish
25,Spanish
I'm using the same order but if I use a limit for example LIMIT 8 the results will be
1,English
2,English
3,English
4,English
19,English
1,Spanish
2,Spanish
3,Spanish
And the expected result is
1,English
2,English
3,English
4,English
19,English
5,Spanish
14,Spanish
15,Spanish
What I'm trying to do
What I'm trying to do is sorting, paginating and filtering a list of X that may have a many-to-many relationship with Y, in this case X is a person and Y is the language. I need do it in a general way. I found a trouble if I want ordering the list by some Y properties.
The list will show in this way:
firstname, lastname, gender , languages
Daniel , Boyd , female , English Spanish
Ora , Beck , female , English
Anthony , Jensen , female , Spanish
....
I only need return a array with the IDs in the correct order
this is the main reason I need that the results only appears the person one time is because the ORM (that I'm using) try to hydrate each result and if I paginate the results using offset and limit. the results maybe aren't the expected. I'm doing assumptions many to many relationships
I can't use the string_agg or group_concat because I dont know the real data, I dont know if are integers or strings
If you want each person to appear only once, then you need to aggregate by that person. If you then want the list of languages, you need to combine them in some way, concatenation comes to mind.
The use of double quotes suggests Postgres or Oracle to me. Here is Postgres syntax for this:
SELECT p.id, string_agg(l.name) as languages
FROM person p LEFT JOIN
person_speaks_language psl
ON p.id = psl.person_id LEFT JOIN
language l
ON psl.language_id = l.id
GROUP BY p.id
ORDER BY COUNT(l.name) DESC, languages;
Similar functionality to string_agg() exists in most databases.
There is nothing wrong with Bertie Hill appearing in two rows, with one language each, that is the Tabular View of Data per the Relational Model. There are no dependencies on data values or number of data values. It is completely correct and un-confused.
But here, the requirement is confused, because you really want three separate lists:
speaks one language
speaks two languages [or the number of languages currently in the language file]
speaks no language [on file] ) ...
But you want those three lists in one list.
Concatenating data values is never, ever a good idea. It is a breach of rudimentary standards, specifically 1NF. It may be common, but it is a gross error. It may be taught by the so-called "theoreticians", but it remains a gross error. Even in a result set, yes.
It creates confusion, such as I have detailed at the top.
With concatenated strings, as the number of languages changes, the width of that concatenated field will grow, and eventually exceed space, wherever it appears (eg. the width of the field on the screen).
Just two of the many reasons why it is incorrect, not expandable, sub-standard.
By the way, in your "dataset" (it isn't the result set produced by your code), the sexes appear to be nicely mixed up.
Therefore the answer, and the only correct one, even if it isn't popular, is that your code is correct (it can be cleaned it up, sure), and you have to educate the user re the dangers of sub-standard code or reports.
You can sort by person.name (rather than by language.name) and then write smarter SQL such that (eg) the person.name is not repeated on the second and subsequent row for persons who speak more than one language, etc. That is just pretty printing.
The non-answer, for those who insist on sub-standard code that will break one day when, is Gordon's response.
Response to Comments
In the Relational Model:
There is no order to the rows, that is deemed a physical or implementation aspect, which we have no control over, and which changes anyway, and which we are warned not to rely upon. If order is sought in the output result set, then we must us ORDER BY, that is its purpose in life.
The data has meaning, and that meaning is carried in Relational Keys. Meaning cannot be carried in surrogates (ie. ID columns).
Limiting myself to the files (they are not tables) that you have given, there is no such thing in the data as:
the first 10 persons who speaks one language
Obtaining persons who speak one language is simple, I believe you already understand that:
SELECT person.first_name,
person.last_name
FROM person P,
(SELECT person_id
FROM person_speaks_language
GROUP BY person_id
HAVING COUNT(*) = 1 -- change this for 2 languages, etc
) AS PL
WHERE P.person_id = PL.person_id
But "first" ? "first" by what criteria ? Record creation date ?
ORDER BY date_created -- if it exists in the data
Record ID does not give first anything: as records are added and deleted, any "order" that may exist initially is completely lost.
You cannot extract meaning out of, or assign meaning to something that, by definition, has no meaning. If the Record ID is relevant, ie. you are going to use it for some purpose, then it is not a Record ID, name the field for what it actually is.
I fail to see, I do not understand, the relevance of the difference between the "dataset" and the updated "small dataset". The "dataset" size is irrelevant, the field headings are irrelevant, what the result set means, is relevant.
The problem is not some "limitation" in the Relational Model, the problem is (a) your fixed view of data values, and (b) your lack of understanding about what the Relational Model is, what it does, understanding of which makes this whole question disappear, and we are left with a simple SQL (as tagged) "how to" question. Eg. If I had a Relational Database, with persons and languages, with no ID columns, there is nothing that I cannot do with it, no report that I cannot produce from it, from the data.
Please try to use an example that conveys the meaning in the data, in what you are trying to do.
the expect results are: each person must be appear only one time
They already appear only once (for each language)
using the language order
Well, there is no order in the language file. We can give it some order, whatever order is meaning-ful, to you, in the result set, based on the data. Eg. language.name. Of course, many persons speak each language, so what order would you like within language.name? How about last_name, first_name. The Record IDs are meaningless to the user, so I won't display them in the result set. NULL is also meaningless, and ambiguous, so I will make the meaning here explicit. This is pretty much what you have, tidied up:
SELECT [language] = CASE name
WHEN NULL THEN "[None]"
ELSE name
END,
last_name,
first_name
FROM person P
LEFT JOIN person_speaks_language PL
ON P.id = PL.person_id
LEFT JOIN language L
ON PL.language_id = L.id
ORDER BY name,
last_name,
first_name
But then you have:
And the expected result is
The example data of which contradicts your textual descriptions:
the expect results are: each person must be appear only one time using the language order
So now, if I ignore the text, and examine the example data re what you want
(which is a horrible thing to do, because I am joining you in the incorrect activity of focussing on the data values, rather than understanding the meaning),
it appears you want the person to appear only once, full stop, regardless of how many languages they speak. Your example data is meaningless, so I cannot be asked to reproduce it. See if this has some meaning.
SELECT last_name,
first_name,
[language] = ( -- correlated subquery
SELECT TOP 1 -- get the "first" language
CASE name -- make meaning of null explicit
WHEN NULL THEN "[None]"
ELSE name
END
FROM person_speaks_language PL
JOIN language L
ON PL.language_id = L.id
WHERE P.id = PL.person_id -- the subject person
ORDER BY name -- id would be meaningless
)
FROM person P -- vector for person, once
ORDER BY last_name,
first_name
Now if you wanted only persons who speak a language (on file):
SELECT last_name,
first_name,
[language] = ( -- correlated subquery
SELECT TOP 1 -- get the "first" language
name
FROM person_speaks_language PL
JOIN language L
ON PL.language_id = L.id
WHERE P.id = PL.person_id -- the subject person
ORDER BY name -- id would be meaningless
)
FROM person P,
(
SELECT DISTINCT person_id -- just one occ, thanks
FROM person_speaks_language PL -- vector for speakers
) AS PL_1
WHERE P.id = PL_1.person_id -- join them to person fields
There, not an outer join anywhere to be seen, in either solution. LEFT or RIGHT will confuse you. Do not attempt to "get everything", so that you can "see" the data values, and then mangle, hack and chop away at the result set, in order to get what you want from that. No, forget about the data values and get only what you want from the record filing system.
Response to Update
I was trying to explain the case with a data set, I think I made things tougher than they actually were
Yes, you did. Reviewing the update then ...
The short answer is, get rid of the ORM. There is nothing in it of value:
you can access the RDB from the queries that populate your objects directly. The way we did for decades before the flatulent beast came along. Especially if you understand and implement Open Architecture Standards.
Further, as evidenced, it creates masses of problems. Here, you are trying to work around the insane restrictions of the ORM.
Pagination is a straight-forward issue, if you have your data Normalised, and Relational Keys.
The long answer is ... please read this Answer. I trust you will understand that the approach you take to designing your app components, your design of windows, will change. All your queries will be simplified, you get only what you require for the specific window or object.
The problem may well disappear entirely (except for possibly the pagination, you might need a method).
Then please think about those architectural issues carefully, and make specific comments of questions.

Joining Tables in Oracle 8i to Limit Results

My background in SQL is limited, but my Googling isn't, I feel that I might just be missing the vocabulary to ask this question properly so hopefully beyond an answer to my question I can get the vocabulary I need to research this issue further.
I have a parts table - PARTS
I have a Purchase Order table - PO
and I have a PO Line Item table - PO_LINEITEM
The question I'm attempting to answer is given a particular part I want to get the latest purchase order and then look at the price we paid.
The PO table holds the date (PO_DATE) of when the Purchase Order was filled and the PO_LINEITEM table holds the information regarding the particular line item such as part primary key (PART_PRIMARYKEY) and price (PART_PRICE). The PARTS table isn't as important as the rest except that I need to return a PART primary key so that I can base the resulting view off the PARTS table
I've been through sub-queries and scalable sub-queries and the like, but I can't seem to find the right combination.
I started from a base of:
SELECT a.PO_DATE, b.PART_PRIMARYKEY, b.PART_PRICE
FROM PO a, PO_LINEITEM b
WHERE a.PO_PRIMARYKEY = b.PO_PRIMARYKEY
As you would expect this returns a list of every instance of an object in a PO with it's price and the date the Purchase Order was filled. The closest I have come to crack this is by using the MAX function on the date such as:
SELECT MAX(a.PO_DATE) AS DATE, b.PART_PRIMARYKEY, b.PART_PRICE
FROM PO a, PO_LINEITEM b
WHERE a.PO_PRIMARYKEY = b.PO_PRIMARYKEY
GROUP BY b.PART_PRIMARYKEY, b.PART_PRICE
This returns a Max date for each price a we paid for a particular part, so:
PART 1234, £12.95, 12/08/2012
PART 1234, £13.00, 14/08/2012
PART 1234. £11.15, 17/08/2012
PART 2345, £5.25, 12/08/2012
PART 2345, £5.65, 13/08/2012
etc.
What I need is:
PART 1234, £11.15, 17/08/2012
PART 2345. £5.65, 13/08/2012
If I could just group by the PART_PRIMARYKEY that would be excellent, but I get an ORA-00979 not a GROUP BY expression when I try.
Like I said I feel that my lack of vocabulary around this issue is impeding me finding an answer, so if anyone could point me in the right direction I'd be grateful
So hopefully I'm not asking a question that is asked every other day, but haven't found because I didn't use the magical combination of words to find.
Thank you for any help you can offer.
Look up Analytic Functions. They were introduced in 8i though I'm not sure how advanced they were at the time compared to how very good they can be in 11. A few links that I've used to understand them:
http://www.oracle-base.com/articles/misc/analytic-functions.php
http://www.orafaq.com/node/55
Though, a sub-query such as this might suffice (I may have your column naming mixed up):
select A.PART_PRIMARYKEY, B.PART_PRICE, A.PO_DATE
from PO A, PO_LINEITEM B
where A.PO_PRIMARYKEY = B.PO_PRIMARYKEY
and (A.PART_PRIMARYKEY, A.PO_DATE) in
( select A.PART_PRIMARYKEY, max(A.PO_DATE)
from PO A, PO_LINEITEM B
where A.PO_PRIMARYKEY = B.PO_PRIMARYKEY
group by A.PART_PRIMARYKEY);

minimizing runtime of calculating several ranks (based on grades) on large sql tables, how short can it get (ms access)

I've been stuck with the rather famous problem of ranking students by grade for a couple weeks now, and while I've learned much, I still haven't solved my problem (the ranks are generated, but the process is too slow):
I have a large table (320,000 rows) that contains the student codes (serves as an identifier, instead of their names), the students classroom, the test , the tests date, the subject, the question number and the students grade on that question. This table is the base for everything else that is calculated and its size makes all these calculations very very slow, to the point where I find me almost breaking everything here at work.
First, some intel on the school (very little info, required to understand the problem)
Here at the school we have weekly tests over several subjects. The school is also separated in classrooms with different purposes (one is focused on math, physics and chemistry, another one is focused on biology, and the last one focuses on history, Portuguese and geography). But they all do the same tests every week.
What we want to do is calculate the standard deviation for each question for everyone in the school (not per-classroom) and the average grade per question (also for everyone in the school), and then generate the following ranks (all of them per date):
-Rank per subject per classroom (with "raw" grades), Rank per subject considering the whole school (with "raw" grades) and Rank per subject considering the whole school (using normalized grades, with the standard deviation per question and the average grade per question information)
-The same ranks that were mentioned above, but not per Subject, considering instead all subjects
As you can see, after calculating the average grades and the standard deviations, we still need to calculate the sums of the grades on each question, and rank according to these sums (the actual subject/test grades). I've attacked this problem in a few ways:
1)
Created two tables, one with the grades per student per subject (fields: Students code, Students classroom, Date of test, Subject, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades) and another with the grades per student per test (all subjects taken into account, fields: Students code, Students classroom, Date of test, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades).
The insertion of data in these tables takes about 50 seconds
Then, I tried using SQL to rank, however, I ran into some problems:
-Access has no ROW_NUMBER or RANK functions, and thus I have to use queries with COUNT, like (below is just a simplified version):
SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;
There still is the rank with the normalized grades in the query above, but I omitted it.
The table grades_table_per_subject has about 45,000 lines and this query takes more than 15 minutes here, even with indexing (tried many different index combinations, even some odd ones when I saw that the ones that should work didn't).
I also tried to ORDER BY Count() DESC the inner selects, but I hit ctrl+break after 7 minutes and no results.
2)
Added the following fields to the tables above:
Rank in Classroom, Rank in School, Rank in School using normalized grades
Then I tried using VBA with DAO and manually update the Rank fields, running the following code (simplified version):
Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] FROM
grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.movefirst
i=1
While Not rs.eof
'Verifies if there was a change on either one of Subject, Classroom, Date and if so:
...
i = 1
...
rs.Edit
rs![Rank in classroom]=i
rs.Update
i = i + 1
rs.movenext
Wend
rs.close
This obviously builds only one of the ranks (in this case per subject per classroom), and it takes alone 3min 10sec.
I verified that it takes so long due to the writes on the table (rs.Edit and rs.Update are the culprits, commenting them makes the whole thing run in only 4 seconds), but I need the ranks written to the table to generate an access report later.
FINALLY:
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly. The times we have achieved, however, make this impossible.
Overall, the question to be asked is the following:
-Is there a way to calculate the ranks shown above through an Access Query under 10 seconds, or to use VBA and calculate-insert these ranks to the table in a similar time considering the size of the tables used here?
Also, I would love to see a list of efficient ranking algorithms, so that even if I can't do everything quickly, I can improve it as much as possible.
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly.
Why?
Why bother regenerating the same data over and over? It's most likely preferable to generate these statistics when the data changes and just look them up every other time. Redoing work you've already done whenever somebody wants to check something is just silly.
I just saw you say ms access only
so ignore this answer -- or consider moving to a real DB if you want to be able to do this type of power processing.
original answer below
I don't have access to your test data, but how fast does this run?
SELECT RANK () OVER (PARTITION BY [Date],[Subject] ORDER BY Grade) AS [Global Rank],
RANK () OVER (PARTITION BY [Date],[Subject], Classroom ORDER BY Grade) AS [Rank in classroom]
FROM grades_table_per_subject
My guess is you are not going to be able to beat SQL Servers ranking speed in VBA, if this is not fast enough then you need to look in the profiler and see what indexes it suggests you make.