I'm trying to do some query optimization; taking an SQL query into relational algebra and optimizing it.
My database table schemas are as follows:
Hills(MId, Mname, Long, Lat, Height, Rating,... )
Runners(HId, HName, Age, Skill,... )
Runs(MId, HId, Date, Duration)
Where there may be many columns in Runners and Hills.
My SQL query is:
SELECT DISTINCT Runners.HName, Runners.Age
FROM Hills, Runners, Runs
WHERE Runners.HId = Runs.HId AND Runs.MID = Hills.MId AND Height > 1200
So I could start by doing:
π Name, Age(σ Height > 1200 (Hills × Runners × Runs))
or something like this, and then optimize it with a good choice of joins, but I'm not sure where to start.
You could start by using the SQL join notation:
SELECT DISTINCT P.HName, P.Age
FROM Hills AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
WHERE H.Height > 1200
You can then observe that the WHERE condition applies only to Hills, so you could push down the search criterion:
SELECT DISTINCT P.HName, P.Age
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
JOIN Runners AS P ON P.HId = R.HId
This is a standard optimization - and one which the SQL optimizer will do automatically. In fact, it probably isn't worth doing much rewriting of the first query shown because the optimizer can deal with it. The other optimization I see as possible is pushing the DISTINCT operation down a level:
SELECT P.HName, P.Age
FROM (SELECT DISTINCT R.HId
FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
JOIN Runs AS R ON H.MId = R.MId
) AS R1
JOIN Runners AS P ON P.HId = R1.HId
This keeps the intermediate result set as small as possible: R1 contains a list of ID-values for the people who have run at least one 1200 metre (or is that 1200 feet?) hill, and can be joined 1:1 with the details in the Runners table. It would be interesting to see whether an optimizer is able to deduce the push-down of the DISTINCT for itself.
Of course, in relational algebra, the DISTINCT operation is done 'automatically' - every result and intermediate result is always a relation with no duplicates.
Given the original 'relational algebra' notation:
π Name, Age(σ Height > 1200 (Hills × Runners × Runs))
This corresponds to the first SQL statement above.
The second SQL statement then corresponds (more or less) to:
π Name, Age ((π MId (σ Height > 1200 (Hills))) × Runners × Runs)
The third SQL statement then corresponds (more or less) to:
π Name, Age ((π HId ((π MId (σ Height > 1200 (Hills))) × Runs)) × Runners)
Where I'm assuming that parentheses force the relational algebra to evaluate expressions in order. I'm not sure that I've got the minimum possible number of parentheses in there, but the ones that are there don't leave much wriggle room for ambiguity.
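A quick way to check that the three SQL formulations above are equivalent is to run them against a small in-memory database. The sketch below uses Python's sqlite3 module; the sample rows (hill names, runner names, dates) are invented purely for illustration, and only the columns the query touches are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Hills   (MId INTEGER PRIMARY KEY, Mname TEXT, Height INTEGER);
    CREATE TABLE Runners (HId INTEGER PRIMARY KEY, HName TEXT, Age INTEGER);
    CREATE TABLE Runs    (MId INTEGER, HId INTEGER, Date TEXT, Duration INTEGER);
    -- Hypothetical sample data, not from the question
    INSERT INTO Hills VALUES (1, 'Low Fell', 800), (2, 'High Crag', 1500), (3, 'Tall Pike', 1300);
    INSERT INTO Runners VALUES (10, 'Ann', 30), (11, 'Bob', 41), (12, 'Cy', 25);
    INSERT INTO Runs VALUES (1, 10, '2020-01-01', 50), (2, 10, '2020-02-01', 90),
                            (3, 10, '2020-03-01', 80), (2, 11, '2020-02-02', 95),
                            (1, 12, '2020-01-03', 55);
""")

queries = {
    # The query as asked in the question
    "original": """
        SELECT DISTINCT Runners.HName, Runners.Age
        FROM Hills, Runners, Runs
        WHERE Runners.HId = Runs.HId AND Runs.MId = Hills.MId AND Height > 1200
    """,
    # Selection on Height pushed down into a derived table
    "pushed_selection": """
        SELECT DISTINCT P.HName, P.Age
        FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
        JOIN Runs AS R ON H.MId = R.MId
        JOIN Runners AS P ON P.HId = R.HId
    """,
    # DISTINCT pushed down so Runners is joined 1:1
    "pushed_distinct": """
        SELECT P.HName, P.Age
        FROM (SELECT DISTINCT R.HId
              FROM (SELECT MId FROM Hills WHERE Height > 1200) AS H
              JOIN Runs AS R ON H.MId = R.MId) AS R1
        JOIN Runners AS P ON P.HId = R1.HId
    """,
}

# Sort the rows so the comparison does not depend on row order
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

On this sample data all three queries return the same two runners. Prefixing a query with EXPLAIN QUERY PLAN in SQLite is one way to see whether the optimizer treats them differently.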
I have two tables, A and B.
Table A has the columns hotelcode_id, latitude, longitude.
Table B has the columns latitude, longitude.
I need to retrieve hotelcode_id for the rows whose latitude and longitude match in both tables.
I wrote the following query, but its performance is poor:
SELECT a.hotelcode_id, a.latitude,b.latitude,b.longitude,b.longitude
FROM A
JOIN B
ON a.latitude like concat ('%', b.latitude, '%') AND a.longitude like concat ('%', b.longitude, '%')
I also wrote another query, but it does not return accurate data.
This query runs for a very long time and I still cannot retrieve the data.
NOTE:
A table has 150k records
B table has 250k records
I have set DECIMAL(10,6) for the latitude and longitude columns in both tables.
I have done the following, but query performance is still a problem:
added indexes and checked them with EXPLAIN
applied hash partitioning to these tables
I think the leading wildcard characters prevent the index from being used.
Also, LIKE performance in SELECT queries is very poor in MySQL.
Is there another solution that avoids the wildcard and LIKE issues in the SELECT query?
If you are sure that the numeric values of the LAT/LON pairs are equal across the two tables, the simple approach would be
SELECT a.hotelcode_id, a.latitude,b.latitude,b.longitude,b.longitude
FROM A JOIN B
WHERE a.latitude = b.latitude
AND a.longitude = b.longitude
If there is some inaccuracy in the data, you may want to define the maximum deviation (here 0.001 degrees, i.e. 3.6 arc seconds) which you would regard as the "same place", e.g.
SELECT a.hotelcode_id, a.latitude,b.latitude,b.longitude,b.longitude
FROM A JOIN B
WHERE ABS(a.latitude-b.latitude) < 0.001
AND ABS(a.longitude-b.longitude) < 0.001
Mind that in the second case the actual distance (in km) covered by a given longitude difference is not the same at every latitude: the higher the latitude, the smaller the distance.
Also review the sizing of the LON and LAT columns; you know that (usually)
-180 <= LON <= 180
-90 <= LAT <= 90
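To put numbers on the latitude remark, here is a rough sketch of how large a 0.001 degree tolerance is on the ground. It assumes a spherical Earth with a mean radius of 6371 km (my assumption, not something from the question), so treat the figures as approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def box_size_km(lat_deg, tolerance_deg=0.001):
    """Return the (north-south, east-west) size in km of a tolerance given in degrees."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = tolerance_deg * km_per_degree
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


for lat in (0, 45, 60):
    ns, ew = box_size_km(lat)
    print(f"lat {lat:2d}: {ns * 1000:.1f} m north-south, {ew * 1000:.1f} m east-west")
```

At the equator the tolerance is roughly 111 m in both directions; at 60 degrees latitude the east-west extent is about half of that.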
Hello, are any of you very nice people able to explain the concept of query optimization to me with regard to relational algebra?
My preferred method of constructing relational algebra queries is to use temporary values step by step, but the only resources I can find that explain how query optimization works out the number of disk accesses needed use a different notation for relational algebra queries, which confuses me.
So if I am given the following relations:
department(deptNo, deptName, location)
Employee(empNo, empName, empAddress, jobDesc, deptNo*)
and have produced the following relational algebra query to find all the programmers who work in a Manchester department:
temp1 = department JOIN employee
temp2 = SELECT(jobDesc = 'programmer')(temp1)
result = SELECT(location = 'Manchester')(temp2)
And I can assume that there are 10,000 tuples in the employee relation, 50 tuples in the department relation, 100 programmers (2 in each department) and one department located in Manchester, how would I work out how many disk accesses are needed?
Thank you in advance!
Yup - Gordon's right. However, this is an academic exercise: you're building sets of data, so assume each element/tuple returned by a sub-query is one disk access. General rule of thumb: filter out as much data as early as possible. Let's assume you do the JOIN first (10000 employees + 50 departments = 10050 disk accesses, even though the number of rows returned is 10000!), then you do the SELECTs (assuming that the sub-query is perfectly indexed) = (100 programmers + 1 department in Manchester); thus the total number of "accesses" = 10050 + 101 = 10151.
If you do the SELECTs first, the whole exercise changes dramatically: (temp 1 = get programmers = 100 rows/disk accesses), (temp 2 = get departments = 1 row/disk access), JOIN (again, assuming perfect indexing on temporary views/queries, etc.) = 50 rows; therefore the total number of "accesses" = 100 + 1 + 50 = 151.
The results are the same, but the way the query is interpreted and executed can influence the amount of work the database engine has to perform.
It's been many years, so there is every possibility that I have got this wrong; I don't mind being corrected.
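The arithmetic above can be written down as a tiny script. This only reproduces the simplified cost model used in this answer (one "access" per tuple touched, perfect indexing); it is not how a real optimizer estimates cost, and the function names are just illustrative.

```python
# Figures taken from the question
EMPLOYEES = 10_000
DEPARTMENTS = 50
PROGRAMMERS = 100
MANCHESTER_DEPARTMENTS = 1


def join_first():
    """JOIN the full relations, then apply both SELECTs."""
    join_cost = EMPLOYEES + DEPARTMENTS
    select_cost = PROGRAMMERS + MANCHESTER_DEPARTMENTS
    return join_cost + select_cost


def select_first():
    """Apply both SELECTs, then JOIN the much smaller intermediate results."""
    select_cost = PROGRAMMERS + MANCHESTER_DEPARTMENTS
    join_cost = DEPARTMENTS  # the figure this answer uses for the join step
    return select_cost + join_cost


print("join first:  ", join_first())
print("select first:", select_first())
```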
I am working on SQL and relational algebra these days, and I am stuck on the question below. I am able to write the SQL for it, but the relational algebra I have come up with doesn't look right.
Below are my tables:
Employee (EmployeeId, EmployeeName, EmployeeCountry)
Training (TrainingCode, TrainingName, TrainingType, TrainingInstructor)
Outcome (EmployeeId, TrainingCode, Grade)
All the keys are specified with star *.
Below is the question and its SQL query, which works fine:
Find an Id of the Employee who has taken every training.
SQL Query:
SELECT X.EmployeeID
FROM (SELECT EmployeeID, COUNT(*) AS NumClassesTaken
FROM OutCome GROUP BY EmployeeID )
AS X
JOIN (SELECT COUNT(*) AS ClassesAvailable
FROM Training)
AS Y
ON X.NumClassesTaken = Y.ClassesAvailable
I am not able to work out what the relational algebra for the above query would be. Can anyone help me with that?
Relational algebra for:
Find an Id of the Employee who has taken every training.
Actually, you need the division operator ÷ in relational algebra:
r ÷ s is used when we wish to express queries with “all”:
Example:
Which persons have a bank account at ALL the banks in the country?
Retrieve the names of employees who work on ALL the projects that Jon Smith works on.
Read also this slide on the division operator:
You also need the division operator ÷ for your query: "Employee who has taken all training".
First, list all the training codes:
Training (TrainingCode, TrainingName, TrainingType, TrainingInstructor)
The primary key is TrainingCode:
TC = ∏ TrainingCode(Training)
Then list the pairs of EmployeeId and TrainingCode, each meaning that an employee took a training:
ET = ∏ EmployeeId, TrainingCode(Outcome)
Apply the division operation ÷, which gives you the IDs of the employees that are paired with every training code; the projection then just makes it explicit that only the employee ID is kept.
Result = ∏ EmployeeId(ET ÷ TC)
"Fundamentals of Database Systems" is the book I always keep in my hand.
6.3.4 The DIVISION Operation
The DIVISION operation is defined for convenience for dealing with
queries that involve universal quantification or the all
condition. Most RDBMS implementations with SQL as the primary query
language do not directly implement division. SQL has a roundabout way of
dealing with this type of query using the EXISTS, CONTAINS and NOT EXISTS
key words.
In general, the DIVISION operation is applied to two relations, T(Y) = R(Z) ÷ S(X), where X ⊆ Z and Y = Z - X (and hence Z = X ∪ Y); that is, Y is the set of attributes of R that are not attributes of S. For example, if X = {A} and Z = {A, B}, then Y = {B}: the attribute B is not present in relation S.
T(Y), the result of DIVISION, is a relation that includes a tuple t if tuples t_R appear in relation R with t_R[Y] = t, and with t_R[X] = t_S for every tuple t_S in S. This means that, for a tuple t to appear in the result T of the DIVISION, the values in t must appear in R in combination with every tuple in S.
I would also like to add that the set of relational algebra operations {σ, ∏, ∪, -, Χ}, namely Selection, Projection, Union, Minus and Cartesian product, is a complete set; that is, any of the other original relational algebra operations can be expressed as a sequence of operations from this set. The division operation ÷ can also be expressed in terms of the ∏, Χ and - operations as follows:
T1 <-- ∏Y(R)
T2 <-- ∏Y((S Χ T1) - R)
T3 <-- T1 - T2
To express your question using basic relational algebra operations, just replace R by ET (the projection of Outcome onto EmployeeId and TrainingCode), S by TC and the attribute set Y by EmployeeId; T3 is then the result.
I hope this helps.
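If it helps to see the T1/T2/T3 steps run, here is a minimal sketch of division over Python sets. The relation R is modelled as a set of (EmployeeId, TrainingCode) pairs and S as a set of training codes; the sample IDs and codes are made up for illustration.

```python
def divide(r, s):
    """R(Y, X) ÷ S(X), with R a set of (y, x) pairs and S a set of x values."""
    t1 = {y for (y, _) in r}                        # T1 = ∏Y(R)
    # (T1 Χ S) - R: the pairings that would have to exist but do not
    # (written with Y first, so the column order matches R)
    missing = {(y, x) for y in t1 for x in s} - r
    t2 = {y for (y, _) in missing}                  # T2 = ∏Y((S Χ T1) - R)
    return t1 - t2                                  # T3 = T1 - T2


# Hypothetical sample data: ET = ∏ EmployeeId, TrainingCode(Outcome)
et = {
    (1, "SQL"), (1, "GIT"), (1, "PY"),
    (2, "SQL"), (2, "PY"),
    (3, "GIT"),
}
# TC = ∏ TrainingCode(Training)
tc = {"SQL", "GIT", "PY"}

result = divide(et, tc)
print(result)
```

Only employee 1 is paired with every training code, so that is the only ID left after subtracting T2.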
The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
A user wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
If you just need the five closest teachers regardless of radius, you could write something like this. Each student will appear up to five times in the result (once per teacher); I'm not sure exactly what output you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
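To make the semantics of the OUTER APPLY ... TOP 5 query concrete, here is a rough brute-force equivalent in Python. It is only a sketch of the logic: it uses a haversine distance on an assumed spherical Earth (radius 3958.8 miles) rather than SQL Server's geography type, it has no spatial index, and the dictionary keys and sample coordinates are invented for the example.

```python
import heapq
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def closest_teachers(student, teachers, instrument="voice", n=5):
    """Return the n nearest teachers of the given instrument, nearest first."""
    candidates = [t for t in teachers if t["primaryInstrument"] == instrument]
    return heapq.nsmallest(
        n,
        candidates,
        key=lambda t: haversine_miles(student["lat"], student["lon"], t["lat"], t["lon"]),
    )


# Hypothetical data: seven voice teachers spread along the equator, one guitar teacher
student = {"name": "S1", "lat": 0.0, "lon": 0.0}
teachers = [
    {"teacherid": i, "lat": 0.0, "lon": float(i), "primaryInstrument": "voice"}
    for i in range(1, 8)
]
teachers.append({"teacherid": 99, "lat": 0.0, "lon": 0.5, "primaryInstrument": "guitar"})

nearest = closest_teachers(student, teachers)
print([t["teacherid"] for t in nearest])
```

Like the query, this ranks every matching teacher for every student, so the cost grows with students times teachers; the point of the spatial index in SQL Server is to avoid exactly that.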
I have an expense table in SQLite. I would like to construct a query that picks out some random rows, using the sum of the amount column of the table.
Sample Expense table
Clients Amounts
A 1000
B 3000
C 5000
D 2000
E 6000
Assuming I would like the total sum from the table above to be 10,000, I would like to construct a query that returns any number of random rows that add up to 10,000.
So far I tried
SELECT *
FROM Expense Table
GROUP BY (Clients)
HAVING SUM(AMOUNT)=10000
but I got nothing back.
I have also had a go with the random function, but I'm assuming I need to specify a LIMIT.
SQLite versions before 3.8.3 do not support CTEs (recursive ones in particular), so I can't think of an easy way of doing this there. Perhaps you would be better off doing this in your presentation logic.
One option via SQL would be to string together a number of SELECT statements with UNION. Using your sample data above, you would need three SELECTs joined by UNION to get your results:
select clients
from expense
where amounts = 10000
union
select e.clients || e2.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
where e.amounts + e2.amounts = 10000
union
select e.clients || e2.clients || e3.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
inner join expense e3 on e3.rowid > e2.rowid
where e.amounts + e2.amounts + e3.amounts = 10000
Resulting in ABE and BCD. This would work for any group of clients, 1 to 3, whose sum is 10000. You could string more unions to get more clients -- this is just an example.
(Here's a sample with up to 4 clients - http://sqlfiddle.com/#!7/b01cf/2).
You can probably use dynamic SQL to construct a longer query if needed; however, I do think this is better suited to the presentation side.
What you are describing is the knapsack problem (in your case, the value is equal to the weight, a special case known as the subset sum problem).
This can be solved in SQL (see sgeddes's answer), but due to SQL's set-oriented design, the computation is rather complex and very slow.
You would be better off by reading the amounts into your program and solving the problem there (see the pseudocode on the Wikipedia page).
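As a rough illustration of doing it in the program, here is a brute-force sketch that tries every combination of rows from the sample table. This is not the dynamic-programming pseudocode from the Wikipedia page; it is exponential in the number of rows, so it is only reasonable for small tables.

```python
from itertools import combinations


def subsets_with_sum(rows, target):
    """Yield the client names of every combination of rows whose amounts add up to target."""
    for size in range(1, len(rows) + 1):
        for combo in combinations(rows, size):
            if sum(amount for _, amount in combo) == target:
                yield [client for client, _ in combo]


# Sample data from the question: (client, amount)
expense = [("A", 1000), ("B", 3000), ("C", 5000), ("D", 2000), ("E", 6000)]

matches = list(subsets_with_sum(expense, 10_000))
print(matches)
```

This finds the same two groups as the UNION query (A, B, E and B, C, D); calling random.choice(matches) would then pick one of them at random.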