SQL - JOIN on two result tables, ideas to refactor?


This may already have been asked, but StackOverflow is massive and trying to google for something specific enough to actually help is a nightmare!
I've ended up with a fairly large SQL query, and was wondering if SO could maybe point out easier methods for doing it that I might have missed.
I have a table called usage with the following structure:
host | character varying(32) |
usage | integer |
logtime | timestamp without time zone | default now()
I want to get the usage value for both the MAX and MIN logtimes. After working through some of my old textbooks (been a while since I really used SQL properly), I've ended up with this JOIN:
SELECT *
FROM (SELECT u.host,u.usage AS min_val,r2.min
FROM usage u
JOIN (SELECT host,min(logtime) FROM usage GROUP BY host) r2
ON u.host = r2.host AND u.logtime = r2.min) min_table
NATURAL JOIN (SELECT u.host,u.usage AS max_val,r1.max
FROM usage u
JOIN (SELECT host,max(logtime) FROM usage GROUP BY host) r1
ON u.host = r1.host AND u.logtime = r1.max) max_table
;
This seems like a messy way to do it, as I'm basically running the same query twice, once with MAX and once with MIN. I can get both logtime columns in one query by doing SELECT usage,MAX(logtime),MIN(logtime) FROM ..., but I couldn't work out how to then show the usage values that correspond to the two different records.
Any ideas?

Window functions have been available since PostgreSQL 8.4, so with 9.1 you have them at your disposal:
SELECT DISTINCT
u.host
,first_value(usage) OVER w AS first_usage
,last_value(usage) OVER w AS last_usage
FROM usage u
WINDOW w AS (PARTITION BY host ORDER BY logtime, usage
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
I sort the partition by logtime, and additionally by usage, to break any ties and arrive at a stable result. Read about window functions in the manual.
For some more explanation and links you might want to refer to recent related answers here or here.
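To try the window-function approach without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions arrived). The sample hosts, timestamps and values are made up for illustration. It also runs last_value with the default frame, to show why the explicit ROWS BETWEEN ... UNBOUNDED FOLLOWING frame is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (host TEXT, usage INTEGER, logtime TEXT)")
# Hypothetical sample data: three readings for one host, two for another.
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [
        ("alpha", 10, "2012-01-01 00:00:00"),
        ("alpha", 20, "2012-01-02 00:00:00"),
        ("alpha", 30, "2012-01-03 00:00:00"),
        ("beta", 5, "2012-01-01 00:00:00"),
        ("beta", 7, "2012-01-02 00:00:00"),
    ],
)

# Full-partition frame: every row sees the whole partition, so
# first_value/last_value are the same for all rows of a host and
# DISTINCT collapses them to one row per host.
first_last = conn.execute(
    """
    SELECT DISTINCT
           host,
           first_value(usage) OVER (
               PARTITION BY host ORDER BY logtime, usage
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS first_usage,
           last_value(usage) OVER (
               PARTITION BY host ORDER BY logtime, usage
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_usage
    FROM usage
    ORDER BY host
    """
).fetchall()

# Default frame: it ends at the current row, so last_value simply
# returns each row's own usage instead of the partition's last one.
default_frame = conn.execute(
    """
    SELECT host,
           last_value(usage) OVER (PARTITION BY host ORDER BY logtime, usage)
    FROM usage
    ORDER BY host, logtime
    """
).fetchall()

print(first_last)
print(default_frame)
```

With the default frame the query would need some other way to reach the final row, which is why the answer spells the frame out.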

Related

How do I find the previous line without writing an inefficient subquery?

So I have this query (and have encountered or coded a bunch of similar ones during my life :) ) which is extremely inefficient in terms of performance, due to the subquery.
I'm running pgsql currently but have had this issue with mysql and mssql as well.
Sometimes I can use MAX() but here I have 2 different columns: runners.id (the one I need to find my data) and runners.updated_at (the one on which I could do MAX()).
Any tips?
SELECT
ROUND(CAST(AVG(DATE_PART('day', current_claim_event.updated_at - claims.created_at)) AS NUMERIC),1)
AS average, count(*)
FROM claims_events current_claim_event
INNER JOIN claims ON claims.id = current_claim_event.claim_id
WHERE current_claim_event.id = (
SELECT runners.id
FROM claims_events runners
WHERE runners.claim_id = current_claim_event.claim_id
ORDER BY runners.updated_at DESC
LIMIT 1
);

How to join records by date range

I need to match scrap records in one table with records indicating the material that was running at the same time on a machine. I have a table with the scrap counts and a table with records showing whenever the material changed on a machine.
I have a working query of which I will include a simplified version below, but it is very slow when applied to a large data set. I would like to try one of Oracle's analytical functions to make it faster, but I can't figure out how. I tried FIRST_VALUE, and ROW_NUMBER in a few different forms, but I couldn't get them right. Looking for any suggestions.
Please let me know if you would like more details.
Following are simplified versions of the tables:
Scrap readings table (~41m rows): Machine, ScrapReasonCode, ScrapQuantity, ReportTime
Material numbers table (~3m rows): Machine, MaterialNumber, MEASUREMENT_TIMESTAMP
SELECT Scrap.Machine,
Materials.MaterialNumber,
Scrap.ScrapReasonCode,
Scrap.ScrapQuantity,
Scrap.ReportTime
FROM Scrap, Materials
WHERE Scrap.Machine = Materials.Machine
AND Materials.MEASUREMENT_TIMESTAMP =
(SELECT MAX (M2.MEASUREMENT_TIMESTAMP)
FROM Materials M2
WHERE M2.Machine = Scrap.Machine
AND M2.MEASUREMENT_TIMESTAMP <= Scrap.ReportTime)

I think this is what you are trying to do. You can use the FIRST_VALUE window function.
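SELECT DISTINCT
s.Machine,
s.ScrapReasonCode,
s.ScrapQuantity,
s.ReportTime,
-- partition per scrap reading so each one gets the latest material change at or before its own ReportTime
FIRST_VALUE(m.MaterialNumber) OVER (PARTITION BY s.Machine, s.ReportTime ORDER BY m.MEASUREMENT_TIMESTAMP DESC) AS MaterialNumber,
FIRST_VALUE(m.MEASUREMENT_TIMESTAMP) OVER (PARTITION BY s.Machine, s.ReportTime ORDER BY m.MEASUREMENT_TIMESTAMP DESC) AS MaterialTimestamp
--or you can use the `MAX` window function for the timestamp too.
--MAX(m.MEASUREMENT_TIMESTAMP) OVER (PARTITION BY s.Machine, s.ReportTime)
FROM Scrap s
JOIN Materials m
ON s.Machine = m.Machine AND m.MEASUREMENT_TIMESTAMP <= s.ReportTime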

I may be misunderstanding your requirements but I believe the following query should work in terms of implementing using ROW_NUMBER:
SELECT q.*
FROM (
-- number the candidate material rows per scrap reading, latest first
SELECT ROW_NUMBER() OVER (PARTITION BY Scrap.Machine, Scrap.ReportTime, Scrap.ScrapReasonCode ORDER BY Materials.MEASUREMENT_TIMESTAMP DESC) AS RNO,
Scrap.Machine,
Materials.MaterialNumber,
Scrap.ScrapReasonCode,
Scrap.ScrapQuantity,
Scrap.ReportTime
FROM Scrap, Materials
WHERE Scrap.Machine = Materials.Machine
AND Materials.MEASUREMENT_TIMESTAMP <= Scrap.ReportTime
) q
WHERE q.RNO = 1
Edit: if you need the measurement timestamp before (rather than on-or-before) the Scrap ReportTime, you could just change the <= sign to a < sign in the query above.
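A minimal runnable sketch of the ROW_NUMBER idea, using Python's sqlite3 module rather than Oracle (it assumes SQLite 3.25+ for window functions). The table layouts are cut down, timestamps are plain integers, and the scrap_id key is a hypothetical column added so that each scrap reading can be its own partition. None of that comes from the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scrap (scrap_id INTEGER PRIMARY KEY, machine TEXT, report_time INTEGER);
    CREATE TABLE materials (machine TEXT, material TEXT, ts INTEGER);
    """
)
# Hypothetical data: M1 switches from steel to brass at t=5.
conn.executemany(
    "INSERT INTO materials VALUES (?, ?, ?)",
    [("M1", "steel", 1), ("M1", "brass", 5), ("M2", "zinc", 2)],
)
conn.executemany(
    "INSERT INTO scrap VALUES (?, ?, ?)",
    [(1, "M1", 3), (2, "M1", 7), (3, "M2", 4)],
)

# For every scrap reading, rank the material changes on the same machine
# that happened at or before the reading, newest first, and keep rank 1.
matched = conn.execute(
    """
    SELECT q.scrap_id, q.material
    FROM (
        SELECT s.scrap_id,
               m.material,
               ROW_NUMBER() OVER (
                   PARTITION BY s.scrap_id
                   ORDER BY m.ts DESC
               ) AS rno
        FROM scrap s
        JOIN materials m
          ON m.machine = s.machine
         AND m.ts <= s.report_time
    ) q
    WHERE q.rno = 1
    ORDER BY q.scrap_id
    """
).fetchall()

print(matched)
```

Partitioning by the scrap row (not only by machine) is what keeps one result per reading. Partitioning by machine alone would leave a single row per machine.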

PostgreSQL - Get related columns of an aggregated column

I have a table called "places"
origin | destiny | distance
---------------------------
A | X | 5
A | Y | 8
B | X | 12
B | Y | 9
For each origin, I want to find out which is the closest destiny. In MySQL I could do
SELECT origin, destiny, MIN(distance) FROM places GROUP BY origin
And I could expect the following result
origin | destiny | distance
---------------------------
A | X | 5
B | Y | 9
Unfortunately, this query does not work in PostgreSQL. Postgres forces me either to put "destiny" in its own aggregate function or to add it as another argument of the GROUP BY clause. Both "solutions" completely change my desired result.
How could I translate the above MySQL query to PostgreSQL?

MySQL is one of the very few DBMS that allow this broken ("loose" in MySQL terms) GROUP BY handling. Most other DBMS (including Postgres) reject your original statement.
In Postgres you can use DISTINCT ON to achieve the same thing:
select distinct on (origin)
origin,
destiny,
distance
from places
order by origin, distance;
The ANSI solution would be something like this:
select p.origin,
p.destiny,
p.distance
from places p
join (select p2.origin, min(p2.distance) as distance
from places p2
group by origin
) t on t.origin = p.origin and t.distance = p.distance
order by origin;
Or without a join using window functions
select t.origin,
t.destiny,
t.distance
from (
select origin,
destiny,
distance,
min(distance) over (partition by origin) as min_dist
from places
) t
where distance = min_dist
order by origin;
Or another solution with window functions:
select distinct origin,
first_value(destiny) over (partition by origin order by distance) as destiny,
min(distance) over (partition by origin) as distance
from places
order by origin;
My guess is that the first one (Postgres specific) is probably the fastest one.
Here is an SQLFiddle for all three solutions: http://sqlfiddle.com/#!12/68308/2
Note that the MySQL result might actually be incorrect as it will return an arbitrary (=random) value for destiny. The value returned by MySQL might not be the one that belongs to the lowest distance.
More details on the broken group by handling in MySQL can be found here: http://www.mysqlperformanceblog.com/2006/09/06/wrong-group-by-makes-your-queries-fragile/
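The join and window-function variants can be checked against the question's sample data with a small sketch in Python's sqlite3 module (DISTINCT ON is Postgres-specific, so it is left out here; SQLite 3.25+ is assumed for the window query). The extra row 'A | Z | 5' added at the end is a made-up tie, included to show that matching on the minimum distance returns every tied row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (origin TEXT, destiny TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [("A", "X", 5), ("A", "Y", 8), ("B", "X", 12), ("B", "Y", 9)],
)

# Join each row against the per-origin minimum distance.
JOIN_QUERY = """
    SELECT p.origin, p.destiny, p.distance
    FROM places p
    JOIN (SELECT origin, MIN(distance) AS distance
          FROM places
          GROUP BY origin) t
      ON t.origin = p.origin AND t.distance = p.distance
    ORDER BY p.origin, p.destiny
"""

# Same result without a join: compute the minimum as a window value.
WINDOW_QUERY = """
    SELECT origin, destiny, distance
    FROM (SELECT origin, destiny, distance,
                 MIN(distance) OVER (PARTITION BY origin) AS min_dist
          FROM places) t
    WHERE distance = min_dist
    ORDER BY origin, destiny
"""

join_rows = conn.execute(JOIN_QUERY).fetchall()
window_rows = conn.execute(WINDOW_QUERY).fetchall()

# Hypothetical tie: a second destiny at distance 5 from origin A.
conn.execute("INSERT INTO places VALUES ('A', 'Z', 5)")
tied_rows = conn.execute(JOIN_QUERY).fetchall()

print(join_rows)
print(window_rows)
print(tied_rows)
```

If exactly one row per origin is required even when distances tie, a variant that picks a single row (DISTINCT ON or row_number with a tie-breaking sort column) is the safer choice.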

The neatest (in my opinion) way to do this in PostgreSQL is to use an aggregate function which clearly specifies which value of destiny should be selected.
The desired value can be described as "the first matching destiny, if you order the matching rows by their distance".
You therefore need two things:
1. A "first" aggregate, which simply returns the "first" of a list of values. This is very easy to define, but is not included as standard.
2. The ability to specify what order those matches come in (otherwise, like the MySQL "loose Group By", it will be undefined which value you actually get). This was added in PostgreSQL 9.0, and the syntax is documented under "Aggregate Expressions".
Once the first() aggregate is defined (which you need do only once per database, while you're setting up your initial tables), you can then write:
Select
origin,
first(destiny Order by distance Asc) as closest_destiny,
min(distance) as closest_destiny_distance
-- Or, equivalently: first(distance Order by distance Asc) as closest_destiny_distance
from places
group by origin
order by origin;
Here is a SQLFiddle demo showing the whole thing in operation.

Just to add another possible solution to a_horse_with_no_name's answer, using the window function row_number():
with cte as (
select
row_number() over(partition by origin order by distance) as row_num,
*
from places
)
select
origin,
destiny,
distance
from cte
where row_num = 1
It'll work in SQL Server or other RDBMS supporting row_number too. In PostgreSQL, though, I prefer distinct on syntax.
sql fiddle demo

Subqueries and AVG() on a subtraction

Working on a query to return the average time from when an employee begins his/her shift and then arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME -- , AVG(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew
WHERE l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
You can see the AVG() command is commented out, because I know the time that they arrive at work, and the time they get to the first house, and can find the value using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.

As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a lot of values that are not needed (and so be inefficient), most RDBMS optimise this so only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
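The inline-view form can be tried out with a minimal sketch in Python's sqlite3 module. The tables main and sub mirror the placeholder names above, the column is called grp because GROUP is a reserved word, and the sub table is reduced to id and val. The data is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE main (id INTEGER, grp TEXT);
    CREATE TABLE sub (id INTEGER, val INTEGER);
    """
)
conn.executemany("INSERT INTO main VALUES (?, ?)", [(1, "g1"), (2, "g1"), (3, "g2")])
conn.executemany(
    "INSERT INTO sub VALUES (?, ?)",
    [(1, 4), (1, 2), (2, 6), (2, 10), (3, 7)],
)

# Inner query: one minimum per id. Outer query: average those minima per group.
averages = conn.execute(
    """
    SELECT m.grp, AVG(sq.val) AS avg_of_minima
    FROM main m
    JOIN (SELECT id, MIN(val) AS val
          FROM sub
          GROUP BY id) sq
      ON sq.id = m.id
    GROUP BY m.grp
    ORDER BY m.grp
    """
).fetchall()

print(averages)
```

Here the minima are 2 and 6 for group g1 and 7 for group g2, so the averages come out as 4.0 and 7.0.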

Don't think of subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based ones.
Join all the tables together
I've used a derived table to calculate the 1st arrival time
Aggregate
Something like:
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON os.USER_ID = crew.CREW_ID
GROUP BY
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"
Edit:
This will give you the average in minutes.
You can add this back to minARRIVE using DATEADD, or change to hh:mm with some %60 (modulo) and /60 (integer divide).
AVG(
DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)
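One way to address the "which day" caveat is to take the first arrival per user per day before averaging. The sketch below does that with Python's sqlite3 module. It is only an illustration under assumptions: the tables are cut down to the columns used, START_TIME is stored as an 'HH:MM' string, ARRIVE as a 'YYYY-MM-DD HH:MM' string, and every shift starts and ends on the same calendar day. SQLite has no DATEDIFF, so the minutes are derived from strftime('%s', ...) epoch seconds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crew (crew_id INTEGER, start_time TEXT);
    CREATE TABLE order_status (user_id INTEGER, arrive TEXT);
    """
)
# Hypothetical data: crew 1 starts at 08:00, crew 2 at 09:00.
conn.executemany("INSERT INTO crew VALUES (?, ?)", [(1, "08:00"), (2, "09:00")])
conn.executemany(
    "INSERT INTO order_status VALUES (?, ?)",
    [
        (1, "2024-01-01 08:30"),
        (1, "2024-01-01 10:00"),
        (1, "2024-01-02 08:50"),
        (1, "2024-01-02 09:30"),
        (2, "2024-01-01 09:15"),
    ],
)

# Derived table f: earliest arrival per user and calendar day.
# Outer query: minutes between that day's shift start and the first
# arrival, averaged per crew member.
avg_minutes = conn.execute(
    """
    SELECT c.crew_id,
           AVG((CAST(strftime('%s', f.first_arrive) AS INTEGER)
                - CAST(strftime('%s', f.day || ' ' || c.start_time) AS INTEGER)
               ) / 60.0) AS avg_minutes
    FROM crew c
    JOIN (SELECT user_id,
                 date(arrive) AS day,
                 MIN(arrive) AS first_arrive
          FROM order_status
          GROUP BY user_id, date(arrive)) f
      ON f.user_id = c.crew_id
    GROUP BY c.crew_id
    ORDER BY c.crew_id
    """
).fetchall()

print(avg_minutes)
```

Crew 1 arrives 30 and 50 minutes after the shift start on the two days, giving an average of 40 minutes. Crew 2 has a single day with 15 minutes.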

SQL conundrum, how to select latest date for part, but only 1 row per part (unique)

I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display, the result of which must be X # of rows (1 row per part showing latest inventory entry).
table is made up of dated entries of inventory changes (so I only need the LATEST date entry per part).
all data contained in this single table, so no joins necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
That gets me my TOP 1 row, which is simple for one part, but I need to show all X parts (let's say 30), so I need 30 rows of that result. The simple solution would be to loop X SQL calls in my code, and that would suffice, but it would be costly. I would love to work this SQL some more and reduce the X calls to the database down to just 1 query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough. Let me know if you have an idea. I cannot do a DISTINCT as the rows are not the same, the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...

SELECT *
FROM (SELECT i.*,
ROW_NUMBER() OVER(PARTITION BY ldPart ORDER BY ldDate DESC) r
FROM inventoryReport i
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
) t
WHERE r = 1

EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate

You need to join to a subquery:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate
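To replace the loop of X calls with a single round trip, the part list can be passed as bound parameters to the max-date join. The following is a minimal sketch with Python's sqlite3 module. The helper name latest_entries is hypothetical, the table is reduced to three of the columns from the question, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventoryReport (ldPart TEXT, ldDate TEXT, inQtyAvail INTEGER)"
)
conn.executemany(
    "INSERT INTO inventoryReport VALUES (?, ?, ?)",
    [
        ("ABC123", "2013-01-01", 5),
        ("ABC123", "2013-02-01", 3),
        ("BFD21", "2013-01-15", 9),
        ("ZZZ999", "2013-03-01", 1),  # not requested below
    ],
)


def latest_entries(conn, parts):
    """Return the newest inventory row for each requested part (hypothetical helper)."""
    # One "?" placeholder per part keeps the values out of the SQL text.
    placeholders = ", ".join("?" for _ in parts)
    sql = f"""
        SELECT i.ldPart, i.ldDate, i.inQtyAvail
        FROM inventoryReport i
        JOIN (SELECT ldPart, MAX(ldDate) AS lastDate
              FROM inventoryReport
              WHERE ldPart IN ({placeholders})
              GROUP BY ldPart) x
          ON x.ldPart = i.ldPart AND x.lastDate = i.ldDate
        ORDER BY i.ldPart
    """
    return conn.execute(sql, list(parts)).fetchall()


rows = latest_entries(conn, ["ABC123", "BFD21"])
print(rows)
```

Filtering inside the subquery keeps the aggregation limited to the requested parts. If two entries for a part share the same latest date, this join returns both of them.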
