SQL: select the rows with the MAX value of a column and return all columns

OK, so I have this table:
+----+--------------+------------------+----------+
| id | business_key | other columns... | creation |
+----+--------------+------------------+----------+
| 1 | 1 | ... | 01/01/14 |
| 2 | 1 | ... | 12/02/14 |
| 3 | 1 | ... | 13/03/14 | <--
| 4 | 2 | ... | 01/01/14 |
| 5 | 2 | ... | 12/02/14 | <--
| 6 | 8 | ... | 01/01/14 | <--
| 7 | 10 | ... | 01/01/14 |
| 8 | 10 | ... | 12/02/14 |
| 9 | 10 | ... | 13/03/14 |
| 10 | 10 | ... | 13/03/14 | <--
+----+--------------+------------------+----------+
For each business key, I want to return the most recent row, as determined by the "creation" column (see the arrows above). The simple answer would be:
SELECT business_key, MAX(creation) FROM mytable GROUP BY business_key;
The thing is, I need to return ALL the columns. Then I learned of the greatest-n-per-group tag on Stack Overflow and found this topic: SQL Select only rows with Max Value on a Column. The top answer is great and provides this query:
SELECT mt1.*
FROM mytable mt1
LEFT OUTER JOIN mytable mt2
ON (mt1.business_key = mt2.business_key AND mt1.creation < mt2.creation)
WHERE mt2.business_key IS NULL;
Sadly, it doesn't work because my situation is a little trickier: if you look at the rows with id 9 and 10 in my table, you will see that they have the same business key and the same creation date. While this should be avoided in my application, I still have to handle it if it happens.
With the query above, this is what I get:
+----+--------------+------------------+----------+
| id | business_key | other columns... | creation |
+----+--------------+------------------+----------+
| 3 | 1 | ... | 13/03/14 |
| 5 | 2 | ... | 12/02/14 |
| 6 | 8 | ... | 01/01/14 |
| 9 | 10 | ... | 13/03/14 | <--
| 10 | 10 | ... | 13/03/14 | <--
+----+--------------+------------------+----------+
whereas I want this:
+----+--------------+------------------+----------+
| id | business_key | other columns... | creation |
+----+--------------+------------------+----------+
| 3 | 1 | ... | 13/03/14 |
| 5 | 2 | ... | 12/02/14 |
| 6 | 8 | ... | 01/01/14 |
| 10 | 10 | ... | 13/03/14 | <--
+----+--------------+------------------+----------+
I know it's a poor choice to take a MAX() on a technical column like "id", but right now it's the only way for me to prevent duplicates when the business key AND the creation date are the same. The problem is, I have no idea how to do it. Any ideas? Keep in mind that it must return all the columns (we have a lot of them, so a SELECT * will be necessary).
Thanks a lot.

The first thought is that your id seems to increment along with the date, so just use that:
SELECT mt1.*
FROM mytable mt1 LEFT OUTER JOIN
mytable mt2
ON mt1.business_key = mt2.business_key AND mt2.id > mt1.id
WHERE mt2.business_key IS NULL;
You can apply the same idea with two columns, using id only to break ties on creation:
SELECT mt1.*
FROM mytable mt1 LEFT OUTER JOIN
     mytable mt2
     ON mt1.business_key = mt2.business_key AND
        (mt2.creation > mt1.creation OR
         (mt2.creation = mt1.creation AND mt2.id > mt1.id)
        )
WHERE mt2.business_key IS NULL;
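To see the tie-breaking in action, here is a minimal sketch that runs the date-only join and the date-then-id join against the question's sample rows, using an in-memory SQLite database through Python's sqlite3 module. Several things are assumptions for the demo and not part of the original post: the "other columns" are dropped, the dates are rewritten as ISO strings so that string comparison follows date order, and `latest_ids` is a hypothetical helper name.

```python
import sqlite3

# Sample rows from the question (id, business_key, creation).
# Dates are ISO strings (an assumption) so text comparison matches date order.
ROWS = [
    (1, 1, "2014-01-01"), (2, 1, "2014-02-12"), (3, 1, "2014-03-13"),
    (4, 2, "2014-01-01"), (5, 2, "2014-02-12"),
    (6, 8, "2014-01-01"),
    (7, 10, "2014-01-01"), (8, 10, "2014-02-12"),
    (9, 10, "2014-03-13"), (10, 10, "2014-03-13"),
]

# Anti-join on creation only: ties on creation survive.
DATE_ONLY = """
SELECT mt1.id
FROM mytable mt1
LEFT OUTER JOIN mytable mt2
  ON mt1.business_key = mt2.business_key
 AND mt2.creation > mt1.creation
WHERE mt2.business_key IS NULL
ORDER BY mt1.id
"""

# Anti-join on creation, with id as the tie-breaker.
DATE_THEN_ID = """
SELECT mt1.id
FROM mytable mt1
LEFT OUTER JOIN mytable mt2
  ON mt1.business_key = mt2.business_key
 AND (mt2.creation > mt1.creation
      OR (mt2.creation = mt1.creation AND mt2.id > mt1.id))
WHERE mt2.business_key IS NULL
ORDER BY mt1.id
"""

def latest_ids(query):
    """Hypothetical helper: load the sample rows and return the ids a query keeps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable "
        "(id INTEGER PRIMARY KEY, business_key INTEGER, creation TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)
    ids = [row[0] for row in conn.execute(query)]
    conn.close()
    return ids

print(latest_ids(DATE_ONLY))     # [3, 5, 6, 9, 10]
print(latest_ids(DATE_THEN_ID))  # [3, 5, 6, 10]
```

The second query keeps exactly one row per business key because no two rows can tie on both creation and id.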

Related

Is it faster to do WHERE IN or INNER JOIN in Redshift

I have 2 tables in redshift:
table1
| ids |
|------:|
| 1 |
| 2 |
| 6 |
| 9 |
| 12 |
table2
| id | value |
|-----:|---------:|
| 1 | 0.134435 |
| 2 | 0.767417 |
| 3 | 0.779567 |
| 4 | 0.726051 |
| 5 | 0.405138 |
| 6 | 0.775206 |
| 7 | 0.699945 |
| 8 | 0.499433 |
| 10 | 0.457386 |
| 9 | 0.227511 |
| 10 | 0.369292 |
| 11 | 0.653735 |
| 12 | 0.537251 |
| 2 | 0.953539 |
| 13 | 0.377625 |
| 14 | 0.973905 |
| 4 | 0.104643 |
| 1 | 0.450627 |
And I basically want to get the rows in table2 where id is in table1 and I have 2 possibilities:
SELECT *
FROM table2
WHERE id IN (SELECT ids FROM table1)
or
SELECT t2.id, t2.value
FROM table2 t2
INNER JOIN table1 t1
ON t2.id = t1.ids
I want to know if there is any performance difference between them.
(I know I could just test in this example to find out but I would like to know if there is one which is always faster)
Edit: table1.ids is a unique column
The two queries do different things.
The JOIN can multiply the number of rows if ids is duplicated in table1.
The IN will never duplicate rows.
If ids can be duplicated, you should use the version that does what you want. If ids is guaranteed to be unique (as your edit says), then the two are functionally equivalent.
In my experience, JOIN is typically at least as fast as IN. Of course, you should test on your data, but that is a starting point.
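The semantic difference (not the performance one) is easy to reproduce. The sketch below uses SQLite via Python's sqlite3 module as a stand-in for Redshift, and it deliberately puts a duplicate into table1, which is an assumption for the demo since the question says table1.ids is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ids INTEGER);
CREATE TABLE table2 (id INTEGER, value REAL);
-- table1 contains 1 twice on purpose (assumption for this demo only).
INSERT INTO table1 VALUES (1), (1), (2);
INSERT INTO table2 VALUES (1, 0.134435), (2, 0.767417), (3, 0.779567);
""")

# IN acts as a filter: each table2 row appears at most once.
in_rows = conn.execute(
    "SELECT id, value FROM table2 "
    "WHERE id IN (SELECT ids FROM table1) ORDER BY id"
).fetchall()

# JOIN produces one output row per matching pair, so id 1 appears twice.
join_rows = conn.execute(
    "SELECT t2.id, t2.value FROM table2 t2 "
    "INNER JOIN table1 t1 ON t2.id = t1.ids ORDER BY t2.id"
).fetchall()

print(len(in_rows), len(join_rows))  # 2 3
```

Timing either form is something only a test on the real cluster and data can settle.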

Merge columns on two left joins

I have 3 tables as shown:
Video
+----+--------+-----------+
| id | name | videoSize |
+----+--------+-----------+
| 1 | video1 | 1MB |
| 2 | video2 | 2MB |
| 3 | video3 | 3MB |
+----+--------+-----------+
Survey
+----+---------+-----------+
| id | name | questions |
+----+---------+-----------+
| 1 | survey1 | 1 |
| 2 | survey2 | 2 |
| 3 | survey3 | 3 |
+----+---------+-----------+
Sequence
+----+---------+-----------+----------+
| id | videoId | surveyId | sequence |
+----+---------+-----------+----------+
| 1 | null | 1 | 1 |
| 2 | 2 | null | 2 |
| 3 | null | 3 | 3 |
+----+---------+-----------+----------+
I would like to query Sequence and join on both of video and survey tables and merge common columns without specifying the column names (in this case name) like this:
Query Result:
+----+---------+-----------+----------+---------+-----------+-----------+
| id | videoId | surveyId | sequence | name | videoSize | questions |
+----+---------+-----------+----------+---------+-----------+-----------+
| 1 | null | 1 | 1 | survey1 | null | 1 |
| 2 | 2 | null | 2 | video2 | 2MB | null |
| 3 | null | 3 | 3 | survey3 | null | 3 |
+----+---------+-----------+----------+---------+-----------+-----------+
Is this possible?
BTW the below sql doesn't work as it doesn't merge on the name field:
SELECT * FROM "Sequence"
LEFT JOIN "Survey" ON "Survey"."id" = "Sequence"."surveyId"
LEFT JOIN "Video" ON "Video"."id" = "Sequence"."videoId"
This query will show what you want:
select
s.*,
coalesce(y.name, v.name) as name, -- picks the right column
v.videoSize,
y.questions
from sequence s
left join survey y on y.id = s.surveyId
left join video v on v.id = s.videoId
Note, however, that standard SQL requires you to name the columns you want; the only exception is *, as in s.* above.
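As a quick check of the COALESCE approach, here is a sketch that loads the three sample tables into an in-memory SQLite database via Python's sqlite3 module. SQLite and the unquoted lower-case table names are assumptions for the demo; the question's quoted identifiers suggest a different engine, where quoting rules may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE video (id INTEGER, name TEXT, videoSize TEXT);
CREATE TABLE survey (id INTEGER, name TEXT, questions INTEGER);
CREATE TABLE sequence (id INTEGER, videoId INTEGER, surveyId INTEGER, sequence INTEGER);
INSERT INTO video VALUES (1, 'video1', '1MB'), (2, 'video2', '2MB'), (3, 'video3', '3MB');
INSERT INTO survey VALUES (1, 'survey1', 1), (2, 'survey2', 2), (3, 'survey3', 3);
INSERT INTO sequence VALUES (1, NULL, 1, 1), (2, 2, NULL, 2), (3, NULL, 3, 3);
""")

# COALESCE returns the first non-NULL argument, so whichever side of the
# two left joins matched supplies the merged name.
merged = conn.execute("""
SELECT s.id,
       COALESCE(y.name, v.name) AS name,
       v.videoSize,
       y.questions
FROM sequence s
LEFT JOIN survey y ON y.id = s.surveyId
LEFT JOIN video  v ON v.id = s.videoId
ORDER BY s.id
""").fetchall()

for row in merged:
    print(row)
```

Each Sequence row references either a survey or a video, so at most one of the two name columns is non-NULL and the merge is unambiguous.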

db2, roll up unknown number of rows from case statement result

I am trying to write a query where I can concatenate some rows into a single column based on the result of the case statement in DB2 v9.5
The contractId can be a variable number of rows as well.
Given I have the following table structure
Table1
+------------+------------+------+
| ContractId | Reference | Code |
+------------+------------+------+
| 12 | P123456789 | A |
| 12 | A987654321 | B |
| 12 | 9995559971 | C |
| 12 | 3215654778 | D |
| 13 | abcdef | A |
| 15 | asdfa | B |
| 37 | 282jd | B |
| 89 | asdf82 | C |
+------------+------------+------+
I would like to get the output of the result like so
+-------------+-----------------------+------------------------------------+
| ContractId | Reference with Code A | Other References |
+-------------+-----------------------+------------------------------------+
| 12 | P123456789 | A987654321, 9995559971, 3215654778 |
| 13 | abcdef | asdfa, 282jd, asdf82 |
+-------------+-----------------------+------------------------------------+
I've tried queries like
select t1.contractId,
max(case when t1.code = 'A' then t1.reference end) as "reference with code a",
max(case when t1.code in ('B','C','D') then t1.reference end) as "other references"
from table1 t1
group by t1.contractId
however, this is still giving me an output like
+-------------+-----------------------+------------------+
| ContractId | Reference with Code A | Other References |
+-------------+-----------------------+------------------+
| 12 | P123456789 | null |
| 12 | null | A987654321 |
| 12 | null | 9995559971 |
| 12 | null | 3215654778 |
+-------------+-----------------------+------------------+
I've also attempted using some of the XML Agg functions but can't seem to get it to format the way I want it to.
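For illustration only, the shape of the desired roll-up (one conditional MAX plus one conditional string aggregate per ContractId) can be sketched in SQLite through Python's sqlite3 module. This is not DB2 syntax: GROUP_CONCAT is used here as SQLite's string aggregate and merely stands in for whatever aggregation the target database offers, such as the XML aggregation route mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (ContractId INTEGER, Reference TEXT, Code TEXT);
INSERT INTO Table1 VALUES
  (12, 'P123456789', 'A'), (12, 'A987654321', 'B'),
  (12, '9995559971', 'C'), (12, '3215654778', 'D'),
  (13, 'abcdef', 'A'), (15, 'asdfa', 'B'),
  (37, '282jd', 'B'), (89, 'asdf82', 'C');
""")

# The CASE expressions yield NULL for non-matching rows; both MAX and
# GROUP_CONCAT skip NULLs, so each aggregate only sees "its" codes.
rolled_up = conn.execute("""
SELECT ContractId,
       MAX(CASE WHEN Code = 'A' THEN Reference END) AS ref_code_a,
       GROUP_CONCAT(CASE WHEN Code <> 'A' THEN Reference END, ', ') AS other_refs
FROM Table1
GROUP BY ContractId
ORDER BY ContractId
""").fetchall()

for row in rolled_up:
    print(row)
```

SQLite does not guarantee the order of the concatenated values, so the references for contract 12 may come out in any order.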

SQL Inner Join based on MAX of timestamp

Amended Once
Amended Twice: The headers of the remaining 9 tables except for reports are always called "what".
I have about 10 tables with the following structure:
reports (165k rows)
+-----------+-----------+
| identifier| category |
+-----------+-----------+
| 1 | fixed |
| 2 | wontfix |
| 3 | fixed |
| 4 | invalid |
| 5 | later |
| 6 | wontfix |
| 7 | duplicate |
| 8 | later |
| 9 | wontfix |
+-----------+-----------+
status (300k rows, all identifiers from reports come up at least once)
+-----------+-----------+----------+
| identifier| time | what |
+-----------+-----------+----------+
| 1 | 12 | RESOLVED |
| 1 | 9 | NEW |
| 2 | 7 | ASSIGNED |
| 3 | 10 | RESOLVED |
| 5 | 4 | REOPEN |
| 7 | 9 | ASSIGNED |
| 4 | 9 | ASSIGNED |
| 7 | 11 | RESOLVED |
| 8 | 3 | NEW |
| 4 | 3 | NEW |
| 7 | 6 | NEW |
+-----------+-----------+----------+
priority (300k rows, all identifiers from reports come up at least once)
+-----------+-----------+----------+
| identifier| time | what |
+-----------+-----------+----------+
| 3 | 12 | LOW |
| 1 | 9 | LOW |
| 9 | 2 | HIGH |
| 8 | 7 | HIGH |
| 3 | 10 | HIGH |
| 5 | 4 | MEDIUM |
| 4 | 9 | MEDIUM |
| 4 | 3 | LOW |
| 7 | 9 | LOW |
| 7 | 11 | HIGH |
| 8 | 3 | LOW |
| 6 | 12 | MEDIUM |
| 7 | 6 | LOW |
| 6 | 9 | HIGH |
| 2 | 6 | HIGH |
| 2 | 1 | LOW |
+-----------+-----------+----------+
What I need is:
reportsfinal (165k rows)
+-----------+-----------+--------------+------------+
| identifier| category | what11 | what22 |
+-----------+-----------+--------------+------------+
| 1 | fixed | RESOLVED | LOW |
| 2 | wontfix | ASSIGNED | HIGH |
| 3 | fixed | RESOLVED | LOW |
| 4 | invalid | ASSIGNED | MEDIUM |
| 5 | later | REOPEN | MEDIUM |
| 6 | wontfix | | MEDIUM |
| 7 | duplicate | RESOLVED | HIGH |
| 8 | later | NEW | HIGH |
| 9 | wontfix | | HIGH |
+-----------+-----------+--------------+------------+
That is, reports (after query = reportsfinal) serves as the basis table and I have to add one or two columns from 9 other tables. The identifier is the key, but in some tables, the identifier comes up multiple times. In these cases I want to use the entry with the highest time only.
I tried several queries, but none of them worked. If possible, I want to run one query to get different columns from the 9 other tables with this approach.
What I tried based on the answer below:
select T.identifier,
T.category,
t.what AS what11,
t.what AS what22 from (
select R.identifier,
R.category,
COALESCE(S.what,'NA')what,
COALESCE(P.what,'NA')what,
ROW_NUMBER()OVER(partition by R.identifier,R.category ORDER by (select null))RN
from reports R
LEFT JOIN bugstatus S
ON S.identifier = R.identifier
LEFT JOIN priority P
ON P.identifier = s.identifier
GROUP BY R.identifier,R.category,S.what,P.what)T
Where T.RN = 1
ORDER BY T.identifier;
This gives the error:
Error: near "(": syntax error.
Basically, you need correlated subqueries in the select list.
From the hip, something like:
Select a.Identifier
,a.Category
,(select what
from status where status.identifier = a.Identifier order by time desc limit 1) Process
,(select what
from priority where priority.identifier = a.Identifier order by time desc limit 1) prio
From Reports a
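A minimal sketch of the scalar-subquery idea, run on a small subset of the sample data with SQLite through Python's sqlite3 module. Only three reports and the status table are loaded, which is a simplification for the demo; the column name `what` follows the amended question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reports (identifier INTEGER, category TEXT);
CREATE TABLE status (identifier INTEGER, time INTEGER, what TEXT);
INSERT INTO reports VALUES (1, 'fixed'), (4, 'invalid'), (6, 'wontfix');
INSERT INTO status VALUES
  (1, 12, 'RESOLVED'), (1, 9, 'NEW'),
  (4, 9, 'ASSIGNED'), (4, 3, 'NEW');
""")

# For every report, the subquery sorts that report's status rows by time
# and keeps the newest one; reports with no status rows get NULL.
latest_status = conn.execute("""
SELECT r.identifier,
       r.category,
       (SELECT s.what
        FROM status s
        WHERE s.identifier = r.identifier
        ORDER BY s.time DESC
        LIMIT 1) AS status
FROM reports r
ORDER BY r.identifier
""").fetchall()

print(latest_status)
# [(1, 'fixed', 'RESOLVED'), (4, 'invalid', 'ASSIGNED'), (6, 'wontfix', None)]
```

One such subquery is needed per column pulled from each of the other tables.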
For each associated table, just add a join predicate based on a subquery that identifies the latest timestamp for that identifier.
The single-letter tokens r, s, and p are aliases for the tables reports, status, and priority respectively:
Select r.Identifier, r.category,
coalesce(s.what, 'NA') status,
coalesce(p.what, 'NA') priority
From reports r
left join status s
on s.identifier = r.identifier
and s.time =
(Select max(time) from status
where identifier = r.identifier)
left join priority p
on p.identifier = r.identifier
and p.time =
(Select max(time) from priority
where identifier = r.identifier);
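To check this against the sample data, the sketch below loads the three tables from the question into an in-memory SQLite database via Python's sqlite3 module and runs the same join. The only assumption is the choice of SQLite as the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reports (identifier INTEGER, category TEXT);
CREATE TABLE status (identifier INTEGER, time INTEGER, what TEXT);
CREATE TABLE priority (identifier INTEGER, time INTEGER, what TEXT);
INSERT INTO reports VALUES
  (1,'fixed'),(2,'wontfix'),(3,'fixed'),(4,'invalid'),(5,'later'),
  (6,'wontfix'),(7,'duplicate'),(8,'later'),(9,'wontfix');
INSERT INTO status VALUES
  (1,12,'RESOLVED'),(1,9,'NEW'),(2,7,'ASSIGNED'),(3,10,'RESOLVED'),
  (5,4,'REOPEN'),(7,9,'ASSIGNED'),(4,9,'ASSIGNED'),(7,11,'RESOLVED'),
  (8,3,'NEW'),(4,3,'NEW'),(7,6,'NEW');
INSERT INTO priority VALUES
  (3,12,'LOW'),(1,9,'LOW'),(9,2,'HIGH'),(8,7,'HIGH'),(3,10,'HIGH'),
  (5,4,'MEDIUM'),(4,9,'MEDIUM'),(4,3,'LOW'),(7,9,'LOW'),(7,11,'HIGH'),
  (8,3,'LOW'),(6,12,'MEDIUM'),(7,6,'LOW'),(6,9,'HIGH'),(2,6,'HIGH'),
  (2,1,'LOW');
""")

# Each left join keeps only the row whose time equals the maximum time
# recorded for that identifier in the joined table.
final_rows = conn.execute("""
SELECT r.identifier, r.category,
       COALESCE(s.what, 'NA') AS status,
       COALESCE(p.what, 'NA') AS priority
FROM reports r
LEFT JOIN status s
  ON s.identifier = r.identifier
 AND s.time = (SELECT MAX(time) FROM status
               WHERE identifier = r.identifier)
LEFT JOIN priority p
  ON p.identifier = r.identifier
 AND p.time = (SELECT MAX(time) FROM priority
               WHERE identifier = r.identifier)
ORDER BY r.identifier
""").fetchall()

for row in final_rows:
    print(row)
```

Reports 6 and 9 have no status rows, so they come back with 'NA'. If two rows in a joined table share the same maximum time for one identifier, this form returns both, so it relies on the latest timestamp being unique per identifier.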
QUESTION: Why did you rename the columns from status and priority to what? You might as well name them something, or data, or information. At least the original names (status and prio) communicated something. The word what is meaningless.
NOTE: I reversed (undid) the edit for the aliases what11 and what22, as these names are meaningless.
Using ROW_NUMBER() also works, provided the rows in each partition are ordered by time so that row 1 is the most recent:
select T.identifier,
T.category,
T.what11,
T.what22 from (
select R.identifier,
R.category,
COALESCE(S.what,'NA') AS what11,
COALESCE(P.what,'NA') AS what22,
ROW_NUMBER() OVER (partition by R.identifier, R.category ORDER by S.time DESC, P.time DESC) RN
from reports R left join status S
ON S.identifier = R.identifier
LEFT JOIN priority P
ON P.identifier = R.identifier) T
Where T.RN = 1
ORDER BY T.identifier

Create a combined list from two tables

I have a table with CostCenter_ID (int) and a second table with Process_ID (int).
I'd like to combine the results of both tables so that each cost center ID is assigned to all process IDs, like so:
|CostCenterID | ProcessID |
---------------------------
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
I've done it before but I'm drawing a blank. I've tried this:
SELECT CostCenter_ID,NULL FROM dbo.Cost_Centers
UNION ALL
SELECT NULL,Process_ID FROM dbo.Processes
which returns this:
|CostCenterID | ProcessID |
---------------------------
| 1 | NULL |
| NULL | 1 |
| NULL | 2 |
| NULL | 3 |
Try:
select a.CostCenter_ID, b.Process_ID
from dbo.Cost_Centers a
cross join dbo.Processes b
or:
select a.CostCenter_ID, b.Process_ID
from dbo.Cost_Centers a
,dbo.Processes b
NB: cross join is the better method as it makes it clearer to the reader what your intentions are.
More info (with pics) here: http://www.w3resource.com/sql/joins/cross-join.php
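A small runnable sketch of the cross join, using SQLite through Python's sqlite3 module instead of SQL Server (so the dbo. schema prefix is dropped). The three sample ids in each table are assumed from the desired output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cost_Centers (CostCenter_ID INTEGER);
CREATE TABLE Processes (Process_ID INTEGER);
INSERT INTO Cost_Centers VALUES (1), (2), (3);
INSERT INTO Processes VALUES (1), (2), (3);
""")

# CROSS JOIN pairs every cost center with every process: 3 x 3 = 9 rows.
pairs = conn.execute("""
SELECT c.CostCenter_ID, p.Process_ID
FROM Cost_Centers c
CROSS JOIN Processes p
ORDER BY c.CostCenter_ID, p.Process_ID
""").fetchall()

print(len(pairs))  # 9
print(pairs[:3])   # [(1, 1), (1, 2), (1, 3)]
```

The UNION ALL attempt stacks the two lists on top of each other, whereas the cross join multiplies them, which is what the desired output needs.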