Optimize annotated query in Django

I'm trying to convert this simple SQL query into something Django can handle:
SELECT *
FROM location AS a
WHERE a.travel_distance = (
SELECT MAX(travel_distance)
FROM location AS b
WHERE b.person_id = a.person_id
)
ORDER BY a.travel_distance DESC
This fetches all traveled locations and keeps only the rows that contain each person's maximum travel distance.
This is what I have so far:
travels = Location.objects.filter(pk__in=Location.objects.order_by().values('person_id').annotate(max_id=Max('id')).values('max_id')).order_by('travel_distance')[::-1]
The results match, but the second method takes a lot longer to return them.
Is there any way I can rewrite this query so it becomes faster?

If I understand correctly you want the maximum distance travelled for each person. Assuming there is a Person model, perhaps ask from the other direction. Something like:
Person.objects.values('id').annotate(max_distance=Max('location__travel_distance'))
I haven't tested this since I don't have an equivalent data schema handy, but does this work for you?
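A minimal sketch of the greatest-row-per-person idea, using Python's built-in sqlite3 instead of Django so that it runs standalone. The table layout and sample rows are assumptions made up for illustration. It only checks that joining against a single GROUP BY pass returns the same rows as the correlated subquery in the question; it says nothing about speed on a real database.

```python
import sqlite3

# Hypothetical in-memory stand-in for the "location" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE location ("
    "id INTEGER PRIMARY KEY, person_id INTEGER, travel_distance REAL)"
)
conn.executemany(
    "INSERT INTO location (person_id, travel_distance) VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 7.5), (2, 3.0), (3, 40.0)],
)

# The correlated subquery from the question.
correlated = conn.execute(
    """
    SELECT a.id, a.person_id, a.travel_distance
    FROM location AS a
    WHERE a.travel_distance = (
        SELECT MAX(b.travel_distance)
        FROM location AS b
        WHERE b.person_id = a.person_id
    )
    ORDER BY a.travel_distance DESC
    """
).fetchall()

# The same rows via one GROUP BY pass joined back to the table.
joined = conn.execute(
    """
    SELECT a.id, a.person_id, a.travel_distance
    FROM location AS a
    JOIN (
        SELECT person_id, MAX(travel_distance) AS max_distance
        FROM location
        GROUP BY person_id
    ) AS m
      ON m.person_id = a.person_id
     AND m.max_distance = a.travel_distance
    ORDER BY a.travel_distance DESC
    """
).fetchall()

print(joined)
```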

Doesn't this work? Something like:
select max(id), sum(travel_distance) from table group by person_id;

Related

Need a SQL query explained

I'm learning the Databricks platform at the moment, and I'm on a lesson about CTEs. This specific query defines a CTE inside another CTE definition, and the presenter in the video is not doing the best job of breaking down what exactly this query is doing.
WITH lax_bos AS (
WITH origin_destination (origin_airport, destination_airport) AS (
SELECT
origin,
destination
FROM
external_table
)
SELECT
*
FROM
origin_destination
WHERE
origin_airport = 'LAX'
AND destination_airport = 'BOS'
)
SELECT
count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
lax_bos;
The output of the query is 684, which I know comes from the last SELECT statement. It's mostly everything above it that I don't fully understand.

First you choose the 2 needed columns from external_table and name this CTE "origin_destination":
SELECT
origin,
destination
FROM
external_table
Next you filter it in another CTE named "lax_bos":
SELECT
*
FROM
origin_destination -- the CTE you already made
WHERE
origin_airport = 'LAX'
AND destination_airport = 'BOS'
And this is the main query, where you use the CTE "lax_bos" that you made in the previous step. Here you just count the number of flights:
SELECT
count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
lax_bos

Nesting CTEs is weird. Normally they form a single-level transformation pipeline, like this:
WITH origin_destination (origin_airport, destination_airport) AS
(
SELECT origin, destination
FROM external_table
), lax_bos AS
(
SELECT *
FROM origin_destination
WHERE origin_airport = 'LAX'
AND destination_airport = 'BOS'
)
SELECT count(origin_airport) AS `Total Flights from LAX to BOS`
FROM lax_bos;

I do not understand why you are using a common table expression (CTE).
I am going to give you a quick overview of how this can be done without a CTE.
Always use some type of sample data set. There are plenty installed with Databricks. In fact, there is one for delayed airplane departures.
The next step is to read in the file and convert it to a temporary view.
At this point, we can use the Spark SQL magic command to query the data.
The query shows plane flights from LAX to BOS. We can remove the limit 10 option and change the '*' to "count(*) as Total" to get your answer. Thus, we solved the problem without a CTE.
The above image uses a CTE to pull the origin, destination and delay for all flights from LAX to BOS. Then it bins the delays from -9 to 9 hours with counts.
Again, this can all be done in one SQL statement that might be cleaner.
I reserve CTEs for more complex situations. For instance, calculating a complex math formula using a range of data and pairing it with the base data set.

A CTE can be a recursive query or a plain subquery. Here, they are only simple subqueries.
First, the origin_destination query runs. Second, the lax_bos query runs over the origin_destination result. Finally, the main query runs on the lax_bos result.
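To make that order concrete, here is a small sketch using Python's built-in sqlite3 (not Databricks). The five sample rows are invented for illustration. It runs the single-level CTE pipeline and a plain WHERE filter against the same table to show that both give the same count.

```python
import sqlite3

# Hypothetical stand-in for external_table with a handful of flights.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE external_table (origin TEXT, destination TEXT)")
conn.executemany(
    "INSERT INTO external_table VALUES (?, ?)",
    [
        ("LAX", "BOS"),
        ("LAX", "BOS"),
        ("LAX", "SFO"),
        ("JFK", "BOS"),
        ("LAX", "BOS"),
    ],
)

# Step 1 renames the columns, step 2 filters, the final SELECT counts.
cte_count = conn.execute(
    """
    WITH origin_destination (origin_airport, destination_airport) AS (
        SELECT origin, destination
        FROM external_table
    ), lax_bos AS (
        SELECT *
        FROM origin_destination
        WHERE origin_airport = 'LAX'
          AND destination_airport = 'BOS'
    )
    SELECT count(origin_airport)
    FROM lax_bos
    """
).fetchone()[0]

# The same answer without any CTE.
plain_count = conn.execute(
    """
    SELECT count(*)
    FROM external_table
    WHERE origin = 'LAX'
      AND destination = 'BOS'
    """
).fetchone()[0]

print(cte_count, plain_count)
```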

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions that contain both repair costs and an odometer reading at the time of repair on the master level. To get an accurate Cost per mile reading I need to do a subquery to get both the first meter reading between a starting date and an end date, and an ending meter.
(select top 1 wf2.ro_num
from wotrans wotr2
left join wofile wf2
on wotr2.rop_ro_num = wf2.ro_num
and wotr2.rop_fac = wf2.ro_fac
where wotr.rop_veh_num = wotr2.rop_veh_num
and wotr.rop_veh_facility = wotr2.rop_veh_facility
AND ((@sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
OR ([dbo].[udf_RTA_ConvertDateInt](@sdate) <= wotr2.rop_tran_date
AND [dbo].[udf_RTA_ConvertDateInt](@edate) >= wotr2.rop_tran_date))
order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct; it returns the second one (keep in mind that the first and second have the same date). But when I copy and paste the subquery into its own query and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and not found much (other than the fact that people don't like ORDER BY in subqueries). Talking to a friend who also does quite a bit of SQL scripting, it looks to us as if the subquery orders rows differently inside the main query than it does on its own when multiple rows have the same value for the ORDER BY column (e.g. 10 dates of 08/05/2016).
Any ideas would be helpful!
Like I said I have a work around that works in this one case, but don't know yet if it will work on a larger dataset.
Let me know if you want more code.
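One likely explanation, offered as an assumption because the full query is not shown: when several rows tie on the ORDER BY column, SQL does not define which of them comes first, so TOP 1 can legitimately pick a different row under a different execution plan. A sketch of the usual remedy, adding a tie-breaker column to the ORDER BY, using Python's sqlite3 with LIMIT 1 standing in for TOP 1 and made-up rows:

```python
import sqlite3

# Hypothetical, trimmed-down wotrans table; two rows share the same date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wotrans (rop_ro_num INTEGER, rop_tran_date INTEGER)")
conn.executemany(
    "INSERT INTO wotrans VALUES (?, ?)",
    [(102, 20160805), (101, 20160805), (103, 20160806)],
)

# Ordering by date alone leaves the two 20160805 rows in an undefined order.
# Adding rop_ro_num as a second sort key makes the first row unambiguous.
first_ro = conn.execute(
    """
    SELECT rop_ro_num
    FROM wotrans
    ORDER BY rop_tran_date ASC, rop_ro_num ASC
    LIMIT 1
    """
).fetchone()[0]

print(first_ro)
```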

What do OrientDB's functions do when applied to the results of another function?

I am getting very strange behavior on 2.0-M2. Consider the following against the GratefulDeadConcerts database:
Query 1
SELECT name, in('written_by') AS wrote FROM V WHERE type='artist'
This query returns a list of artists and the songs each has written; a majority of the rows have at least one song.
Query 2
Now try:
SELECT name, count(in('written_by')) AS num_wrote FROM V WHERE type='artist'
On my system (OSX Yosemite; Orient 2.0-M2), I see just one row:
name num_wrote
---------------------------
Willie_Cobb 224
This seems wrong. But I tried to better understand. Perhaps the count() causes the in() to look at all written_by edges...
Query 3
SELECT name, in('written_by') FROM V WHERE type='artist' GROUP BY name
Produces results similar to the first query.
Query 4
Now try count()
SELECT name, count(in('written_by')) FROM V WHERE type='artist' GROUP BY name
Wrong path -- So try LET variables...
Query 5
SELECT name, $wblist, $wbcount FROM V
LET $wblist = in('written_by'),
$wbcount = count($wblist)
WHERE type='artist'
Produces seemingly meaningless results:
You can see that the $wblist and $wbcount columns are inconsistent with one another, and the $wbcount values don't show any obvious progression like a cumulative result.
Note that the strange behavior is not limited to count(). For example, first() does similarly odd things.

count(), as in an RDBMS, is an aggregate: it collapses all the records into a single value. For your purpose, .size() seems the right method to call:
in('written_by').size()
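The difference can be pictured with a plain-Python analogy. This is not OrientDB code, and the artist and song names are invented. An aggregate produces one value for the whole result set, while a per-row size produces one value per artist.

```python
# Hypothetical rows: each artist maps to the list of songs they wrote.
artists = {
    "artist_a": ["song_1", "song_2"],
    "artist_b": ["song_3"],
    "artist_c": [],
}

# Aggregate, like count(): one number for the entire result set.
aggregate_count = sum(1 for songs in artists.values() if songs is not None)

# Per-row, like .size(): one number for each artist.
songs_per_artist = {name: len(songs) for name, songs in artists.items()}

print(aggregate_count)
print(songs_per_artist)
```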

Subqueries and AVG() on a subtraction

Working on a query to return the average time from when an employee begins his/her shift and then arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME, //avg(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew,
WHERE l.LOCAL_OFFICE_ID = crew1.LOCAL_OFFICE_ID
You can see the AVG() command is commented out, because I know the time that they arrive at work, and the time they get to the first house, and can find the value using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.

As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a lot of values that are not needed (and so be inefficient), most RDBMS optimise this so only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)

Don't think in terms of subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based ones.
Join all the tables together
I've used a derived table to calculate the first arrival time
Aggregate
Something like
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON os.USER_ID = crew.CREW_ID
GROUP BY
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"
Edit:
This will give you the average in minutes.
You can add this back to minARRIVE using DATEADD, or change it to hh:mm with some %60 (modulo) and /60 (integer divide):
AVG(
DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)
)
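A small runnable sketch of the derived-table approach, using Python's sqlite3. It simplifies the schema and makes two loud assumptions: times are stored as integer minutes since midnight, and ORDER_STATUS has a hypothetical work_day column. Grouping the first arrival by user and day is one way around the "first arrival for all time" caveat mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical versions of the tables in the question.
    CREATE TABLE crew_work_schedule (crew_id INTEGER, employee_name TEXT, work_shift_code TEXT);
    CREATE TABLE crew_shift_codes (work_shift_code TEXT, start_time INTEGER);
    CREATE TABLE order_status (user_id INTEGER, work_day TEXT, arrive INTEGER);

    INSERT INTO crew_work_schedule VALUES (1, 'Alice', 'A'), (2, 'Bob', 'A');
    INSERT INTO crew_shift_codes VALUES ('A', 480);  -- shift starts at 08:00
    INSERT INTO order_status VALUES
        (1, '2016-08-01', 510), (1, '2016-08-01', 600),
        (1, '2016-08-02', 530),
        (2, '2016-08-01', 500);
    """
)

# The derived table finds the first arrival per user per day; the outer
# query averages (first arrival - shift start) for each employee.
avg_minutes = conn.execute(
    """
    SELECT crew.employee_name,
           AVG(os.min_arrive - cs.start_time) AS avg_minutes_to_first_home
    FROM crew_work_schedule AS crew
    JOIN crew_shift_codes AS cs
      ON cs.work_shift_code = crew.work_shift_code
    JOIN (
        SELECT user_id, work_day, MIN(arrive) AS min_arrive
        FROM order_status
        GROUP BY user_id, work_day
    ) AS os
      ON os.user_id = crew.crew_id
    GROUP BY crew.employee_name
    ORDER BY crew.employee_name
    """
).fetchall()

print(avg_minutes)
```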

Limiting MySQL results within query

I'm looking to see if I can get the results I need with a single query, and my MySQL skills are still in their adolescence over here.
I have 4 tables: shows, artists, venues and tours. A simplified version of my main query right now looks like this:
SELECT *
FROM artists AS a,
venues AS v,
shows AS s
LEFT JOIN tours AS t ON s.show_tour_id = t.tour_id
WHERE s.show_artist_id = a.artist_id
AND s.show_venue_id = v.venue_id
ORDER BY a.artist_name ASC, s.show_date ASC;
What I want to add is a limit on how many shows are returned per artist. I know I could SELECT * FROM artists, and then run a query with a simple LIMIT clause for each returned row, but I figure there must be a more efficient way.
UPDATE: to put this more simply, I want to select up to 5 shows for each artist. I know I could do this (stripping away all irrelevancies):
<?php
$artists = $db->query("SELECT * FROM artists");
foreach($artists as $artist) {
$db->query("SELECT * FROM shows WHERE show_artist_id = $artist->artist_id LIMIT 5");
}
?>
But it seems wrong to be putting another query within a foreach loop. I'm looking for a way to achieve this within one result set.

This is the kind of thing stored procedures are for.
Select a list of artists, then loop through that list, adding 5 or fewer shows for each artist to a temp table.
Then, return the temp table.

As a plan-B, if you can't figure the proper SQL statement to use you can read the whole thing into a memory construct (array, class, etc) and loop it that way. If the data is sufficiently small and memory available sufficiently large this would let you do only one query. Not elegant, but may work for you.
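A rough sketch of that plan-B in Python (the question uses PHP, so this is only an illustration of the idea). It assumes the rows have already been fetched by one query and are sorted by artist and then date; the sample rows and the limit_per_artist helper are hypothetical.

```python
from itertools import groupby, islice

# Hypothetical result of a single query, ordered by artist then show date.
rows = [
    ("Artist A", "2020-01-01"),
    ("Artist A", "2020-02-01"),
    ("Artist A", "2020-03-01"),
    ("Artist B", "2020-05-01"),
]


def limit_per_artist(sorted_rows, limit):
    """Keep at most `limit` rows per artist; input must be sorted by artist."""
    limited = []
    for _, shows in groupby(sorted_rows, key=lambda row: row[0]):
        limited.extend(islice(shows, limit))
    return limited


result = limit_per_artist(rows, 2)
print(result)
```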

Well I hesitate to suggest this because it certainly won't be computationally efficient (see the stored procedures answer for that...) but it will all be in one query like you wanted. I'm also taking some liberties and assuming that you want the 5 most recent shows...hopefully you can modify to your actual requirements.
SELECT *
FROM artists AS a,
venues AS v,
shows AS s
LEFT JOIN tours AS t ON s.show_tour_id = t.tour_id
WHERE s.show_artist_id = a.artist_id
AND s.show_venue_id = v.venue_id
AND s.show_id IN
(SELECT subS.show_id FROM shows subS
WHERE subS.show_artist_id = s.show_artist_id
ORDER BY subS.show_date DESC
LIMIT 5)
ORDER BY a.artist_name ASC, s.show_date ASC;
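One caveat: MySQL has historically rejected LIMIT inside an IN subquery, so the query above may fail depending on the server version. A portable alternative is to keep a show only if fewer than N later shows exist for the same artist. The sketch below tries this with Python's sqlite3 on invented rows, with N = 2 instead of 5 to keep the data small. It assumes show dates are distinct per artist; ties would need an extra tie-breaker column.

```python
import sqlite3

# Hypothetical, trimmed-down shows table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (show_id INTEGER PRIMARY KEY, show_artist_id INTEGER, show_date TEXT)")
conn.executemany(
    "INSERT INTO shows (show_artist_id, show_date) VALUES (?, ?)",
    [
        (1, "2020-01-01"),
        (1, "2020-02-01"),
        (1, "2020-03-01"),
        (2, "2020-05-01"),
    ],
)

max_shows = 2  # the question asks for 5; 2 keeps the sample data short

# Keep a show when fewer than max_shows newer shows exist for its artist,
# which leaves the max_shows most recent shows per artist.
recent_shows = conn.execute(
    """
    SELECT s.show_artist_id, s.show_date
    FROM shows AS s
    WHERE (
        SELECT COUNT(*)
        FROM shows AS newer
        WHERE newer.show_artist_id = s.show_artist_id
          AND newer.show_date > s.show_date
    ) < ?
    ORDER BY s.show_artist_id, s.show_date
    """,
    (max_shows,),
).fetchall()

print(recent_shows)
```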
