Recursive geometric query: five closest entities - SQL

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::Point datatype with an optimally tessellated index.
User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine a second query: a conservatory of music has sent us a list of 250 singers -- students accepted to the school for the upcoming year -- and wants us to send them the five closest voice coaches for each person on the list, so those students can arrange some coaching before they arrive on campus. We would have to scan the teachers database (i.e. scan the geolocation index) 250 times, because those students all live in different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
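For illustration, the closest non-iterative formulation I can think of is to try every candidate radius at once and keep, per student, the smallest radius whose circle holds at least five teachers - a rough, untested sketch only (the RADII list, the 1609.34 metres-per-mile constant and the student key columns are placeholders, and students with fewer than five teachers inside the largest radius drop out entirely). It also still probes the index once per radius per student, so it hardly seems better than looping:
;with RADII as (
    select r from (values (10),(20),(30),(40),(50),(60),(70),(80),(90),(100)) as v(r)
),
CANDIDATES as (
    select SINGER.name, SINGER.zipcode, RADII.r as radiusMiles, TEACHERS.teacherid,
           TEACHERS.geolocation.STDistance(geography::Point(SINGER.lat, SINGER.lon, 4326)) / 1609.34 as DistanceInMiles
    from SINGER
    cross join RADII
    join TEACHERS
      on TEACHERS.primaryInstrument = 'voice'
     and TEACHERS.geolocation.STDistance(geography::Point(SINGER.lat, SINGER.lon, 4326)) < RADII.r * 1609.34
),
SMALLESTUSABLERADIUS as (
    -- per student, the smallest candidate radius whose circle contains at least 5 teachers
    select name, zipcode, min(radiusMiles) as radiusMiles
    from (
        select name, zipcode, radiusMiles
        from CANDIDATES
        group by name, zipcode, radiusMiles
        having count(*) >= 5
    ) as q
    group by name, zipcode
)
select c.*,
       row_number() over(partition by c.name, c.zipcode order by c.DistanceInMiles) as rowpos
from CANDIDATES as c
join SMALLESTUSABLERADIUS as s
  on s.name = c.name and s.zipcode = c.zipcode and s.radiusMiles = c.radiusMiles
-- wrap this in an outer select and keep rowpos < 6 to get the final five per student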
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.*
from
(
    select STUDENTSANDTEACHERSINRADIUS.*,
           rowpos = row_number()
                    over(partition by STUDENTSANDTEACHERSINRADIUS.zipcode + STUDENTSANDTEACHERSINRADIUS.streetaddress
                         order by DistanceInMiles)
    from
    (
        select
            SINGER.name,
            SINGER.streetaddress,
            SINGER.city,
            SINGER.state,
            SINGER.zipcode,
            TEACHERS.name as TEACHERname,
            TEACHERS.streetaddress as TEACHERaddress,
            TEACHERS.city as TEACHERcity,
            TEACHERS.state as TEACHERstate,
            TEACHERS.zipcode as TEACHERzip,
            TEACHERS.teacherid,
            geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
                / (1.6 * 1000) as DistanceInMiles
        from
            SINGER left join TEACHERS
            on TEACHERS.geolocation.STDistance(geography::Point(SINGER.lat, SINGER.lon, 4326))
                   < (SINGER.radius * (1.6 * 1000))
               and TEACHERS.primaryInstrument = 'voice'
    ) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us

I think that if you just need to get the closest 5 teachers regardless of radius, you could write something like this. Each student will be duplicated up to 5 times in the result; I'm not sure exactly what output you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
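If scanning every teacher for every student turns out to be too slow, a possible variation (untested) is to keep the cheap radius pre-filter inside the apply so the spatial index can prune before the sort - at the price of reintroducing the "circle with too few teachers" problem described in the question:
outer apply (
    select top 5 TT.*
    from TEACHERS as TT
    where TT.primaryInstrument = 'voice'
      -- optional pre-filter using the per-student radius supplied in the SINGER data
      and TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) < S.radius * (1.6 * 1000)
    order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T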

Related

Complex query in Oracle SQL

I have the following tables and their fields
I've been asked for a query that seems quite complex to me; I've been going around in circles for two days trying things. The task says:
It is desired to obtain the average age of female athletes, medal winners (gold, silver or bronze), for the different modalities of 'Artistic Gymnastics'. Analyze the possible contents of the result field in order to return only the expected values, even when there is no data of any specific value for the set of records displayed by the query. Specifically, we want to show the gender indicator of the athletes, the medal obtained, and the average age of these athletes. The age will be calculated by subtracting from the system date (SYSDATE), the date of birth of the athlete, dividing said value by 365. In order to avoid showing decimals, truncate (TRUNC) the result of the calculation of age. Order the results by the average age of the athletes.
Well right now I have this:
select person.gender,score.score
from person,athlete,score,competition,sport
where person.idperson = athlete.idathlete and
athlete.idathlete= score.idathlete and
competition.idsport = sport.idsport and
person.gender='F' and competition.idsport=18 and score.score in
('Gold','Silver','Bronze')
group by
person.gender,
score.score;
And I got 18 records out - one for each of the 18 people who have a medal.
But when I add the person.birthdate field, instead of keeping those 18 records I get many more records.
Apart from that, I still have to work out the average age with SYSDATE and TRUNC, which I have tried in many ways without success.
Either it really is very complicated or I'm a bit saturated from going around in circles so much; I need some help.
Reading the task you got, it seems that you're quite close to the solution. Have a look at the following query and its explanation, note the differences from your query, see if it helps.
select p.gender,
((sysdate - p.birthday) / 365) age,
s.score
from person p join athlete a on a.idathlete = p.idperson
left join score s on s.idathlete = a.idathlete
left join competition c on c.idcompetition = s.idcompetition
where p.gender = 'F'
and s.score in ('Gold', 'Silver', 'Bronze')
and c.idsport = 18
order by age;
When two dates are subtracted, the result is a number of days. Dividing it by 365, you - roughly - get the number of years (roughly because, for simplicity, we pretend every year has 365 days, which is not quite true - hint: leap years). The result is usually a decimal number, e.g. 23.912874918724. To avoid that, you were told to remove the decimals, so use TRUNC and get 23 as the result.
Although the data model contains 5 tables, you don't have to use all of them in a query. Maybe the best approach is to go step by step. The first step would be to simply select all female athletes and calculate their age:
select p.gender,
((sysdate - p.birthday) / 365) age
from person p
where p.gender = 'F'
Note that I've used a table alias - I'd suggest you use them too, as they make queries easier to read (tables can have really long names, which doesn't help readability). Also, always prefix columns with the table alias to avoid confusion about which column belongs to which table.
Once you're satisfied with that result, move on to another table - athlete. It is there just as a joining mechanism to the score table, which contains ... well, scores. Note that I've used an outer join for the score table because not all athletes have won a medal. I presume that this is what this part of your task is about:
... even when there is no data of any specific value for the set of records displayed by the query.
It is suggested that we - as developers - use explicit table joins, which let you see all joins separated from the filters (which should be part of the WHERE clause). So:
NO : from person p, athlete a
where a.idathlete = p.idperson
and p.gender = 'F'
YES: from person p join athlete a on a.idathlete = p.idperson
where p.gender = 'F'
Then move to yet another table, and so forth.
Test frequently, all the time - don't skip steps. Move on to another one only when you're sure that the previous step's result is correct, as - in most cases - it won't automagically fix itself.
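Once each step works, putting the pieces together might look roughly like this - an untested sketch following the column names above; whether TRUNC should be applied per athlete or to the final average is something the task description leaves open:
select p.gender,
       s.score,
       trunc(avg((sysdate - p.birthday) / 365)) as avg_age
from person p
join athlete a on a.idathlete = p.idperson
left join score s on s.idathlete = a.idathlete
left join competition c on c.idcompetition = s.idcompetition
where p.gender = 'F'
  and s.score in ('Gold', 'Silver', 'Bronze')
  and c.idsport = 18
group by p.gender, s.score
order by avg_age;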

Difference between using rownum before group by and after group by

I want to know the difference between the following two queries:
SELECT INVN46, MTER46,CHRE46,CTSQ46, INFR46, INTO46
FROM AULFXAF2.SSP46L01#FXA400HA10G
inner join Invoice_line_in line on line.invoice_line_id = trim(RECT46)|| '+' || trim(INVN46) || '+' || trim(INVL46)
where line.invoice_line_id like 'IN+Q%'
and rownum <=1000
group by INVN46, MTER46,CHRE46,CTSQ46, INFR46, INTO46;
and
select * from
(SELECT INVN46, MTER46,CHRE46,CTSQ46, INFR46, INTO46
FROM AULFXAF2.SSP46L01#FXA400HA10G
inner join Invoice_line_in line on line.invoice_line_id = trim(RECT46)|| '+' || trim(INVN46) || '+' || trim(INVL46)
where line.invoice_line_id like 'IN+Q%'
group by INVN46, MTER46,CHRE46,CTSQ46, INFR46, INTO46)
where rownum <=1000
Which one is accurate and fast? As I understand it, the first one gets 1000 records and then groups them together. I am not sure about the second one, though.
The first chooses 1000 records and then aggregates.
The second aggregates all the data and then chooses 1000 records from the result set.
Which is accurate? That depends on what you want to do. Which is faster? Probably the first because it is filtering the data before aggregating it.
Imagine you have a database with residents of a country, with their town and annual salary. You group by town and you add salaries together.
The first query selects 1000 random residents - they may all live in one town, or they may be from five or ten or thirty towns. Then you group by town and sum their salaries. Some towns will not be included in the result set; and for the towns that are, you get the sum of salaries of "some" (random) residents of that town. In most cases, this kind of query is useless.
The second query groups the residents by town and computes the total of salaries for each town. Then the outer query selects 1000 random towns from the country. (OK, it is a very large country, with over 1000 towns.) This query is a bit more helpful - although the fact that you select 1000 random towns makes it somewhat useless still.
Obviously the first query should be much faster, since it only looks at 1000 residents instead of, perhaps, 120 million residents. But this comparison is completely irrelevant, since the two queries do very different things.
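To put the analogy into code (a hypothetical RESIDENTS(town, salary) table, just for illustration):
-- filter first, then aggregate: partial sums over 1000 arbitrary residents
select town, sum(salary)
from residents
where rownum <= 1000
group by town;

-- aggregate first, then filter: full per-town totals, but only 1000 arbitrary towns
select * from
(select town, sum(salary) as total_salary
from residents
group by town)
where rownum <= 1000;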

How can I find the number of disk accesses needed for a relational algebra query?

Hello, are any of you very nice people able to explain the concept of query optimization to me with regard to relational algebra?
My preferred method of constructing relational algebra queries is to use temporary values step by step, but the only resources I can find that explain how query optimization works (to find the number of disk accesses needed) use a different notation for relational algebra queries, which confuses me.
So if I am given the following relations:
department(deptNo, deptName, location)
Employee(empNo, empName, empAddress, jobDesc, deptNo*)
and have produced the following relational algebra query to find all the programmers who work in a Manchester department, like so:
temp1 = department JOIN employee
temp2 = SELECT(jobdesc = 'programmer')(temp1)
result = SELECT(location = 'Manchester')(temp2)
And I can assume that there are 10,000 tuples in the employee relation, 50 tuples in the department relation, 100 programmers (2 in each department) and one department located in Manchester, how would I work out how many disk accesses are needed?
Thank you in advance!
Yup - Gordon's right. However, this is an academic exercise: you're building sets of data - assume each element/tuple returned by a sub-query is one disk access. General rule of thumb: eliminate as much data as early as possible. Let's assume you do the JOIN first (10000 employees + 50 departments = 10050 disk accesses, even though the number of rows returned is 10000!), then you do the SELECTs (assuming that the sub-query is perfectly indexed) = (100 programmers + 1 department in Manchester); thus the total number of "accesses" = 10050+101 = 10151.
If you do the SELECTs first, the whole exercise changes dramatically: (temp1 = get programmers = 100 rows/disk accesses), (temp2 = get departments = 1 row/disk access), JOIN (again, assuming perfect indexing on temporary views/queries, etc.) = 50 rows; therefore the total number of "accesses" = 100+1+50 = 151.
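In the step-by-step notation used in the question, that cheaper "selects first" plan would be written roughly as:
temp1 = SELECT(jobdesc = 'programmer')(employee)
temp2 = SELECT(location = 'Manchester')(department)
result = temp1 JOIN temp2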
Same results, but the way it is interpreted and executed can influence the amount of work the database engine has to perform.
It's been many years, and there's every possibility that I might have got this wrong - I don't mind being corrected.

How do I use the MAX function over three tables?

So, I have a problem with a SQL Query.
It's about getting weather data for German cities. I have 4 tables: staedte (the cities, with primary key loc_id), gehoert_zu (contains the city key and the key of the weather station that is closest to that city (stations_id)), wettermessung (contains all the weather information and the station's key value) and wetterstation (contains the station's key and location). And I'm using PostgreSQL.
Here is what the tables look like:
wetterstation
s_id[PK] standort lon lat hoehe
----------------------------------------
10224 Bremen 53.05 8.8 4
wettermessung
stations_id[PK] datum[PK] max_temp_2m ......
----------------------------------------------------
10224 2013-3-24 -0.4
staedte
loc_id[PK] name lat lon
-------------------------------
15 Asch 48.4 9.8
gehoert_zu
loc_id[PK] stations_id[PK]
-----------------------------
15 10224
What I'm trying to do is get the name of the city with the (for example) highest temperature for a specified date range (could be a whole month, or a day). Since the weather data is bound to a station, I actually need to get the station's ID and then just choose one of the cities corresponding to that station. A possible question would be: "In which city was it hottest in June?" - say the highest measured temperature was at station number 10224; as a result I want to get the city Asch. What I have so far is this:
SELECT name, MAX (max_temp_2m)
FROM wettermessung, staedte, gehoert_zu
WHERE wettermessung.stations_id = gehoert_zu.stations_id
AND gehoert_zu.loc_id = staedte.loc_id
AND wettermessung.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX (max_temp_2m) DESC
LIMIT 1
There are two problems with the results:
1) it's taking waaaay too long. The tables are not that big (cities has about 70k entries), but it needs between 1 and 7 minutes to get things done (depending on the time span)
2) it ALWAYS produces the same city and I'm pretty sure it's not the right one either.
I hope I managed to explain my problem clearly enough and I'd be happy for any kind of help. Thanks in advance ! :D
If you want to get the max temperature per city use this statement:
SELECT * FROM (
SELECT gz.loc_id, MAX(max_temp_2m) as temperature
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY gz.loc_id) as subselect
INNER JOIN staedte as std
ON std.loc_id = subselect.loc_id
ORDER BY subselect.temperature DESC
Use this statement to get the city with the highest temperature (only 1 city):
SELECT * FROM(
SELECT name, MAX(max_temp_2m) as temp
FROM wettermessung as wm
INNER JOIN gehoert_zu as gz
ON wm.stations_id = gz.stations_id
INNER JOIN staedte as std
ON gz.loc_id = std.loc_id
WHERE wm.datum BETWEEN '2012-8-1' AND '2012-12-1'
GROUP BY name
ORDER BY MAX(max_temp_2m) DESC
LIMIT 1) as subselect
ORDER BY temp desc
LIMIT 1
Always use explicit joins (INNER, LEFT, RIGHT JOIN) rather than comma-separated table lists, so your SQL server does not have to guess your table references and the join conditions stay separate from the filters.
This is a general example of how to get the item with the highest, lowest, biggest, smallest, whatever value. You can adjust it to your particular situation.
select fred, barney, wilma
from bedrock
join (select fred, max(dino) as maxdino
      from bedrock
      where whatever
      group by fred) flinstone on bedrock.fred = flinstone.fred
where dino = maxdino
  and other conditions
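Adjusted to the tables in this question (a rough, untested sketch reusing the column names shown above), the same pattern might look like:
select std.name, wm.max_temp_2m
from wettermessung as wm
join (select stations_id, max(max_temp_2m) as maxtemp
      from wettermessung
      where datum >= '2012-8-1' and datum < '2012-12-1'
      group by stations_id) as hottest
  on hottest.stations_id = wm.stations_id
 and hottest.maxtemp = wm.max_temp_2m
join gehoert_zu as gz on gz.stations_id = wm.stations_id
join staedte as std on std.loc_id = gz.loc_id
where wm.datum >= '2012-8-1' and wm.datum < '2012-12-1'
order by wm.max_temp_2m desc;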
I propose you use a consistent naming convention. Singular terms for tables holding a single item per row are a good convention. Your only table breaking this is staedte; it should be stadt.
And I suggest using station_id consistently instead of both s_id and stations_id.
Building on these premises, for your question:
... get the name of the city with the ... highest temperature at a specified date
SELECT s.name, w.max_temp_2m
FROM (
SELECT station_id, max_temp_2m
FROM wettermessung
WHERE datum >= '2012-8-1'::date
AND datum < '2012-12-1'::date -- exclude upper border
ORDER BY max_temp_2m DESC, station_id -- id as tie breaker
LIMIT 1
) w
JOIN gehoert_zu g USING (station_id) -- assuming normalized names
JOIN stadt s USING (loc_id)
Use explicit JOIN conditions for better readability and maintenance.
Use table aliases to simplify your query.
Use x >= a AND x < b to include the lower border and exclude the upper border, which is the common use case.
Aggregate first and pick your station with the highest temperature, before you join to the other tables to retrieve the city name. Much simpler and faster.
You did not specify what to do when multiple "wettermessungen" tie on max_temp_2m in the given time frame. I added station_id as tiebreaker, meaning the station with the lowest id will be picked consistently if there are multiple qualifying stations.
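For the "whole month" variant of the question ("In which city was it hottest in June?"), the same query works unchanged with a one-month half-open range, e.g.:
WHERE datum >= '2012-6-1'::date
AND   datum <  '2012-7-1'::date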

Minimizing runtime of calculating several ranks (based on grades) on large SQL tables, how short can it get (MS Access)

I've been stuck with the rather famous problem of ranking students by grade for a couple weeks now, and while I've learned much, I still haven't solved my problem (the ranks are generated, but the process is too slow):
I have a large table (320,000 rows) that contains the student's code (it serves as an identifier, instead of their name), the student's classroom, the test, the test's date, the subject, the question number, and the student's grade on that question. This table is the base for everything else that is calculated, and its size makes all these calculations very, very slow, to the point where I find myself almost breaking everything here at work.
First, some intel on the school (very little info, required to understand the problem)
Here at the school we have weekly tests over several subjects. The school is also separated into classrooms with different purposes (one is focused on math, physics and chemistry, another one is focused on biology, and the last one focuses on history, Portuguese and geography). But they all take the same tests every week.
What we want to do is calculate the standard deviation for each question for everyone in the school (not per-classroom) and the average grade per question (also for everyone in the school), and then generate the following ranks (all of them per date):
-Rank per subject per classroom (with "raw" grades), Rank per subject considering the whole school (with "raw" grades) and Rank per subject considering the whole school (using normalized grades, with the standard deviation per question and the average grade per question information)
-The same ranks that were mentioned above, but not per Subject, considering instead all subjects
As you can see, after calculating the average grades and the standard deviations, we still need to calculate the sums of the grades on each question, and rank according to these sums (the actual subject/test grades). I've attacked this problem in a few ways:
1)
Created two tables, one with the grades per student per subject (fields: Student's code, Student's classroom, Date of test, Subject, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades) and another with the grades per student per test (all subjects taken into account; fields: Student's code, Student's classroom, Date of test, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades).
The insertion of data in these tables takes about 50 seconds
Then, I tried using SQL to rank, however, I ran into some problems:
-Access has no ROW_NUMBER or RANK functions, and thus I have to use queries with COUNT, like (below is just a simplified version):
SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;
The real query also includes the rank using normalized grades, but I omitted it above.
The table grades_table_per_subject has about 45,000 lines and this query takes more than 15 minutes here, even with indexing (tried many different index combinations, even some odd ones when I saw that the ones that should work didn't).
I also tried to ORDER BY Count() DESC the inner selects, but I hit ctrl+break after 7 minutes and no results.
2)
Added the following fields to the tables above:
Rank in Classroom, Rank in School, Rank in School using normalized grades
Then I tried using VBA with DAO, manually updating the Rank fields by running the following code (simplified version):
Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] " & _
    "FROM grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.MoveFirst
i = 1
While Not rs.EOF
    'Verifies if there was a change in any of Subject, Classroom, Date and if so:
    ...
    i = 1
    ...
    rs.Edit
    rs![Rank in classroom] = i
    rs.Update
    i = i + 1
    rs.MoveNext
Wend
rs.Close
This obviously builds only one of the ranks (in this case, per subject per classroom), and it alone takes 3 min 10 sec.
I verified that it takes so long because of the writes to the table (rs.Edit and rs.Update are the culprits; commenting them out makes the whole thing run in only 4 seconds), but I need the ranks written to the table to generate an Access report later.
FINALLY:
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly. The times we have achieved, however, make this impossible.
Overall, the question to be asked is the following:
-Is there a way to calculate the ranks shown above through an Access query in under 10 seconds, or to use VBA to calculate and insert these ranks into the table in a similar time, considering the size of the tables used here?
Also, I would love to see a list of efficient ranking algorithms, so that even if I can't do everything quickly, I can improve it as much as possible.
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly.
Why?
Why bother regenerating the same data over and over? It's most likely preferable to generate these statistics when the data changes and just look them up every other time. Redoing work you've already done whenever somebody wants to check something is just silly.
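A minimal sketch of that idea in Access SQL, assuming a pre-created ranks_cache table with matching columns (the table and the student_code column are made-up names, and the statements are untested; run them as two separate queries, e.g. once after each weekly import, and point the report at ranks_cache):
DELETE FROM ranks_cache;

INSERT INTO ranks_cache (student_code, [Date], Classroom, Subject, Grade, [Rank in classroom])
SELECT t1.student_code, t1.[Date], t1.Classroom, t1.Subject, t1.Grade,
       1 + (SELECT COUNT(*) FROM grades_table_per_subject AS t2
            WHERE t2.Grade > t1.Grade
              AND t2.[Date] = t1.[Date]
              AND t2.Subject = t1.Subject
              AND t2.Classroom = t1.Classroom)
FROM grades_table_per_subject AS t1;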
I just saw you say MS Access only, so ignore this answer -- or consider moving to a real DB if you want to be able to do this type of power processing.
Original answer below.
I don't have access to your test data, but how fast does this run?
SELECT RANK () OVER (PARTITION BY [Date], [Subject] ORDER BY Grade DESC) AS [Global Rank],
       RANK () OVER (PARTITION BY [Date], [Subject], Classroom ORDER BY Grade DESC) AS [Rank in classroom]
FROM grades_table_per_subject
My guess is you are not going to be able to beat SQL Server's ranking speed with VBA; if this is not fast enough, then you need to look in the profiler and see which indexes it suggests you create.