Distinct multi-columns - sql

For this table:
mysql> select * from work;
+------+---------+-------+
| code | surname | name |
+------+---------+-------+
| 1 | John | Smith |
| 2 | John | Smith |
+------+---------+-------+
I'd like to get the pairs of codes where the names are equal, so I do this:
select distinct A.code, B.code from work A, work B where A.name = B.name group by A.code, B.code;
However, I get the following result back:
+------+------+
| code | code |
+------+------+
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
+------+------+
As you can see, this result contains redundant rows (the self-pairs and the mirrored pair), obviously from a cartesian product. I'd like to find out how I can do this so that it outputs only:
+------+------+
| code | code |
+------+------+
| 1 | 2 |
+------+------+
Any clue? Thanks!

This should work (assuming code is the primary key):
SELECT A.code, B.code
FROM work A, work B
WHERE A.name = B.name AND A.code < B.code

Try this:
Select A.Code, B.Code
From work a
Join work b
On A.surname = b.surname
And A.Name = B.Name
And A.Code > B.Code
You need to use A.Code > B.Code rather than != to eliminate dupes of the type
{1, 2} and {2, 1}
(If you only care about when the name is the same and not the surname, eliminate that predicate from the join condition)
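To see the ordering trick in action, here is a minimal sketch that runs the self-join against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and it assumes code is unique, as the first answer does.

```python
import sqlite3

# In-memory database standing in for the MySQL table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work (code INTEGER PRIMARY KEY, surname TEXT, name TEXT);
    INSERT INTO work VALUES (1, 'John', 'Smith'), (2, 'John', 'Smith');
""")

# A.code < B.code drops the self-pairs (1,1), (2,2) and keeps only one
# of the mirrored pairs (1,2) / (2,1).
pairs = conn.execute("""
    SELECT A.code, B.code
    FROM work A
    JOIN work B
      ON A.name = B.name
     AND A.code < B.code
""").fetchall()

print(pairs)
```

With < the smaller code comes first in each pair; with > the same pairs come out reversed.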

Related

SQL select all rows that are not equal to an id, and replace the id column with the value - without cross join

Say I have a table like this:
+----+-------+
| id | value |
+----+-------+
| 1 | a |
| 1 | b |
| 2 | c |
| 2 | d |
| 3 | e |
| 3 | f |
+----+-------+
And I want to select all rows whose id is not 1 and change their id to 1; select all rows whose id is not 2 and change their id to 2; and select all rows whose id is not 3 and change their id to 3.
Here is the output I want:
+----+-------+
| id | value |
+----+-------+
| 1 | c |
| 1 | d |
| 1 | e |
| 1 | f |
| 2 | a |
| 2 | b |
| 2 | e |
| 2 | f |
| 3 | a |
| 3 | b |
| 3 | c |
| 3 | d |
+----+-------+
The only solution I can think of is through cross join and distinct:
select distinct a.id, b.value
from table a
cross join table b
where a.id != b.id
Is there any other way that avoids such an expensive operation?
I think the typical way to write this is to generate all pairs of id and value and then remove the ones that exist:
select i.id, v.value
from (select distinct id from t) i cross join
(select distinct value from t) v left join
t
on t.id = i.id and t.value = v.value
where t.id is null;
First, I don't think this is what your query does. But this is what you seem to be describing.
From a performance perspective, you might have other sources for i and v that don't require subqueries. If so, use those for performance.
Finally, I don't think you can do much to improve the performance of this, apart from using explicit tables -- and perhaps having appropriate indexes on all the tables.
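A quick way to check the "all pairs minus existing pairs" idea is to run it on the sample data. This is a sketch using Python's sqlite3 module with an in-memory SQLite database (an assumption, the question does not name an engine), with the join comparing t.value to v.value because the i subquery only carries id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, value TEXT);
    INSERT INTO t VALUES (1,'a'),(1,'b'),(2,'c'),(2,'d'),(3,'e'),(3,'f');
""")

# Build every (id, value) combination, then keep only the combinations
# that have no matching row in t (anti-join via LEFT JOIN ... IS NULL).
rows = conn.execute("""
    SELECT i.id, v.value
    FROM (SELECT DISTINCT id FROM t) i
    CROSS JOIN (SELECT DISTINCT value FROM t) v
    LEFT JOIN t ON t.id = i.id AND t.value = v.value
    WHERE t.id IS NULL
    ORDER BY i.id, v.value
""").fetchall()

for row in rows:
    print(row)
```

On this data it returns the twelve rows from the desired output, starting with (1, 'c') and ending with (3, 'd').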

Combine different tables with the same column without ambiguity

I have a query problem. I have 3 tables.
table A
----------------------------
NAME | CODE
----------------------------
bob | PL
david | AA
susan | PL
joe | AB
table B
----------------------------
CODE | DESCRIPTION
----------------------------
PL | code 1
PB | code 2
PC | code 3
table C
----------------------------
CODE | DESCRIPTION
----------------------------
AA | code 4
AB | code 5
AC | code 6
Tables B and C have unique rows.
The result I need:
----------------------------
NAME | CODE | DESCRIPTION
----------------------------
bob | PL | code 1
david | AA | code 4
susan | PL | code 1
joe | AB | code 5
What I have tried so far
http://sqlfiddle.com/#!9/ffb2eb/9
You are close. I think you just need COALESCE():
select A.*, coalesce(B.DESCRIPTION, C.DESCRIPTION) as description
from A left join
B
on A.CODE = B.CODE left join
C
on A.CODE = C.CODE
order by A.NAME;
I think UNION will do this. Additionally, it will remove duplicates if any exist.
SELECT A.NAME , UN.CODE ,UN.DESCRIPTION
FROM A,
(SELECT CODE,DESCRIPTION FROM B
UNION
SELECT CODE,DESCRIPTION FROM C ) UN
WHERE A.CODE = UN.CODE;
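For anyone who wants to try the COALESCE approach without the fiddle, here is a small sketch on the sample data. It uses Python's sqlite3 module with an in-memory SQLite database instead of the MySQL fiddle from the question, and the column list is written out instead of A.* so the result order is explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (NAME TEXT, CODE TEXT);
    CREATE TABLE B (CODE TEXT PRIMARY KEY, DESCRIPTION TEXT);
    CREATE TABLE C (CODE TEXT PRIMARY KEY, DESCRIPTION TEXT);
    INSERT INTO A VALUES ('bob','PL'),('david','AA'),('susan','PL'),('joe','AB');
    INSERT INTO B VALUES ('PL','code 1'),('PB','code 2'),('PC','code 3');
    INSERT INTO C VALUES ('AA','code 4'),('AB','code 5'),('AC','code 6');
""")

# Each code lives in either B or C, so one of the two LEFT JOINs yields
# NULL and COALESCE picks the description that was actually found.
rows = conn.execute("""
    SELECT A.NAME, A.CODE,
           COALESCE(B.DESCRIPTION, C.DESCRIPTION) AS DESCRIPTION
    FROM A
    LEFT JOIN B ON A.CODE = B.CODE
    LEFT JOIN C ON A.CODE = C.CODE
    ORDER BY A.NAME
""").fetchall()

for row in rows:
    print(row)
```

Unlike the UNION version with an inner join, the LEFT JOIN version also keeps rows of A whose code appears in neither lookup table (with a NULL description).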

Return multiple columns with 3 distinct fields in SQL query for Access DB

I am trying to make a query that returns multiple fields, keeping the first 3 as distinct columns and returning the values for the last modified date. Some of the variables in the query fields should come from more than one table, and one of them has a True/False criterion too. The three distinct fields are needed because the combination of these is associated with the other returned parameters.
The tables look roughly as follows...
Table a:
ID | Sc | Country | TechID | VarA | ... | VarX(T/F) | LastModified
1 | 1 | AA | 1 | x | ... | T | 1-1-2017
2 | 1 | AA | 1 | z | ... | T | 1-1-2017
3 | 1 | AA | 2 | y | ... | T | 1-1-2018
4 | 1 | AB | 1 | u | ... | T | 1-1-2017
5 | 2 | AB | 2 | v | ... | T | 1-1-2018
6 | 3 | AB | 1 | w | ... | F | 1-1-2018
Table b:
TechID | TechName | Categ | Units
1 | Tech1 | Cat1 | M
2 | Tech2 | Cat2 | N
3 | Tech3 | Cat3 | P
The idea is that the query returns something like this (when the T/F criterion is met), where each combination of Sc-Country-Tech shows up only once, with the last modified row taking precedence:
Sc' | Country' |TechName'| Units | Cat | VarA... | LastModified |
1 | AA | 1 | ... | ... | ... | 1-1-2018
1 | AB | 2 | ... | ... | ... | 1-1-2017
2 | AB | 1 | ... | ... | ... | 1-1-2018
So far I've tried a few SQL lines to no avail. First, with Select DISTINCT but the option was too "all inclusive".
SELECT DISTINCT a.Sc, a.Country, b.TechName, b.Units, b.Cat, a.VarA,..,a.VarX, Max(a.LastModified) AS MaxOfLastModified
FROM a INNER JOIN (b INNER JOIN a ON b.TechName =
a.TechID) ON b.Cat = a.TechID
GROUP BY a.Sc, a.Country, b.TechName, b.Units, b.Cat, a.VarA,..,a.VarX
HAVING (((a.VarX)=True));
Also tried this but it prompts errors related to aggregate functions:
SELECT a.Sc, a.Country, b.TechID, b.Units, b.Cat, a.VarA,..,a.VarX, Max(a.LastModified) AS MaxOfLastModified
FROM a INNER JOIN (b INNER JOIN a ON b.TechName =
a.TechID) ON b.Cat = a.TechID
GROUP BY a.Sc, a.Country, a.TechID
HAVING (((a.VarX)=True));
Any thoughts/suggestions on how to go about this?? Any pointers to previous related answers are also much appreciated.
Thanks in advance! :)
EDIT (2017.09.29):
This certainly cleared things up a bit!
I managed to get the query going with some of the fields, only when calling fields from a single table with the following:
SELECT a.Sc, a.Country, a.Tech, a.LastModified, a.VarA
FROM a INNER JOIN (SELECT Sc, Country, Tech, max(LastModified) AS lm FROM a GROUP BY Sc, Country, Tech) AS dt ON (dt.lm=a.LastModified) AND (dt.Tech=a.Tech) AND (dt.Country=a.Country) AND (dt.Sc=a.Sc)
GROUP BY a.Sc, a.Country, a.Tech, a.LastModified, a.VarA, a.VarX
HAVING (((a.VarX)=Yes));
I'm still running into a syntax error on JOIN when trying to add fields from a lookup table using the INNER JOIN command as suggested. The code I tried looked something like:
SELECT a.Sc, a.Country, a.Tech, a.LastModified, a.VarA b.TechCategory
FROM a INNER JOIN (SELECT Sc, Country, Tech, max(LastModified) AS lm FROM a GROUP BY Sc, Country, Tech) AS dt ON (dt.lm=a.LastModified) AND (dt.Tech=a.Tech) AND (dt.Country=a.Country) AND (dt.Sc=a.Sc)
INNER JOIN b ON Tech.Category=a.Tech
GROUP BY a.Sc, a.Country, a.Tech, b.TechCategory, a.LastModified, a.VarA, a.VarX
HAVING (((a.VarX)=Yes));
Any additional pointers are much appreciated!
Use an aggregate query to get the maximum date for each combination of Sc, Country, and TechID, then use this as a subquery and join it back to tables a and b to get the data in your final query. Something like this:
select
a.Sc, a.Country, b.TechName,
b.Units, b.Category, a.VarA, a.LastModified
from
(Table_a as a
inner join (
select Sc, Country, TechID, max(LastModified) as lm
from Table_a
group by Sc, Country, TechID
) as dt on dt.Sc=a.Sc and dt.Country=a.Country and dt.TechID=a.TechID and dt.lm=a.LastModified)
inner join Table_b as b on b.TechID=a.TechID
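Here is a rough, runnable sketch of that join-back pattern. It uses Python's sqlite3 module with an in-memory SQLite database, not Access, and the rows are hypothetical sample data (dates stored as ISO text so MAX compares them correctly). The VarX filter is left out to keep it short, and the category column is named Categ as in the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_a (ID INTEGER PRIMARY KEY, Sc INTEGER, Country TEXT,
                          TechID INTEGER, VarA TEXT, LastModified TEXT);
    CREATE TABLE Table_b (TechID INTEGER PRIMARY KEY, TechName TEXT,
                          Categ TEXT, Units TEXT);
    -- Hypothetical rows, not the data from the question.
    INSERT INTO Table_a VALUES
        (1, 1, 'AA', 1, 'x', '2017-01-01'),
        (2, 1, 'AA', 1, 'z', '2018-01-01'),
        (3, 1, 'AB', 2, 'y', '2017-01-01'),
        (4, 1, 'AB', 2, 'u', '2017-06-01'),
        (5, 2, 'AB', 1, 'v', '2018-01-01');
    INSERT INTO Table_b VALUES
        (1, 'Tech1', 'Cat1', 'M'), (2, 'Tech2', 'Cat2', 'N');
""")

# dt holds the latest date per (Sc, Country, TechID); joining it back to
# Table_a keeps only the rows carrying that date, then Table_b adds the
# lookup columns.
rows = conn.execute("""
    SELECT a.Sc, a.Country, b.TechName, b.Units, b.Categ,
           a.VarA, a.LastModified
    FROM Table_a AS a
    JOIN (SELECT Sc, Country, TechID, MAX(LastModified) AS lm
          FROM Table_a
          GROUP BY Sc, Country, TechID) AS dt
      ON dt.Sc = a.Sc AND dt.Country = a.Country
     AND dt.TechID = a.TechID AND dt.lm = a.LastModified
    JOIN Table_b AS b ON b.TechID = a.TechID
    ORDER BY a.Sc, a.Country, b.TechName
""").fetchall()

for row in rows:
    print(row)
```

One thing to watch: if two rows in the same group share the maximum LastModified, the join-back returns both of them.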

Filter array depending on other table

I'm trying to filter values from an array. The information about which values should be kept is in another table.
table_a table_b
___________________ ___________
| id | values | | keyword |
------------------- -----------
| 1 | [a, b, c] | | b |
| 2 | [d, e, f] | | e |
| 3 | [a, g] | | f |
------------------- -----------
I expect the following output:
output
________________________
| id | filtered_values |
------------------------
| 1 | [b] |
| 2 | [e, f] |
| 3 | [] |
------------------------
At the moment, I am using the following query:
SELECT
id,
array_intersect(ta.values, tb.filter_keywords) AS filtered_values -- brickhouse UDF
FROM
table_a ta
CROSS JOIN (
SELECT
collect_set(keyword) as filter_keywords
FROM (
SELECT
"dummy" as grouping_dummy,
keyword
FROM
table_b
) tmp
GROUP BY
grouping_dummy
) tb
table_a has a couple million rows, table_b contains less than 1000 rows.
I guess the cross join is the bottleneck, because it uses only one reducer.
Is there any way to optimize this query?
Thanks!
I have a different assumption.
The reducer is needed in order to generate filter_keywords, not for the CROSS JOIN which is a map side operation.
So no problem here.
My guess is that the performance penalty comes from the use of array_intersect with an array of 1000 elements, therefore the solution would be to avoid it.
P.s.
There is no need for grouping_dummy.
You don't need to use GROUP BY in order to use aggregate functions.
select a.id
,collect_list (case when b.keyword is not null then a.val end) as vals
from (select a.id
,e.val
from table_a a
lateral view outer
explode (a.vals) e as val
) a
left join table_b b
on b.keyword = a.val
group by a.id
+----+-----------+
| id | vals |
+----+-----------+
| 1 | ["b"] |
| 2 | ["e","f"] |
| 3 | [] |
+----+-----------+
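If the explode / left join / collect_list steps are hard to picture, the same logic can be sketched in plain Python on the sample data. This is only an illustration of what the Hive query computes, not Hive code, and the variable names are made up for the example.

```python
# Sample data from the question: id -> array of values, plus the keyword list.
table_a = {1: ["a", "b", "c"], 2: ["d", "e", "f"], 3: ["a", "g"]}
table_b = ["b", "e", "f"]

keywords = set(table_b)

# "lateral view outer explode": one (id, val) row per array element.
exploded = [(row_id, val) for row_id, vals in table_a.items() for val in vals]

# "left join" + "collect_list ... group by id": keep a value only when it
# matched a keyword; ids with no match still get an (empty) list.
filtered = {row_id: [] for row_id in table_a}
for row_id, val in exploded:
    if val in keywords:
        filtered[row_id].append(val)

print(filtered)
```

Each array element is checked against the keywords once via the join, instead of intersecting every array with the full keyword list.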

Postgres group by columns and within group select other columns by max aggregate

This is probably a standard problem, and I've keyed off some other greatest-n-per-group answers, but so far been unable to resolve my current problem.
A B C
+----+-------+ +----+------+ +----+------+-------+
| id | start | | id | a_id | | id | b_id | name |
+----+-------+ +----+------+ +----+------+-------+
| 1 | 1 | | 1 | 1 | | 1 | 1 | aname |
| 2 | 2 | | 2 | 1 | | 2 | 2 | aname |
+----+-------+ | 3 | 2 | | 3 | 3 | aname |
+----+------+ | 4 | 3 | bname |
+----+------+-------+
In English what I'd like to accomplish is:
For each c.name, select its newest entry based on the start time in a.start
The SQL I've tried is the following:
SELECT a.id, a.start, c.id, c.name
FROM a
INNER JOIN (
SELECT id, MAX(start) as start
FROM a
GROUP BY id
) a2 ON a.id = a2.id AND a.start = a2.start
JOIN b
ON a.id = b.a_id
JOIN c
on b.id = c.b_id
GROUP BY c.name;
It fails with errors such as:
ERROR: column "a.id" must appear in the GROUP BY clause or be used in an aggregate function Position: 8
To be useful I really need the ids from the query, but cannot group on them since they are unique. Here is an example of output I'd love for the first case above:
+------+---------+------+--------+
| a.id | a.start | c.id | c.name |
+------+---------+------+--------+
| 2 | 2 | 3 | aname |
| 2 | 2 | 4 | bname |
+------+---------+------+--------+
Here is a Sqlfiddle
Edit - removed second case
Case 1
select distinct on (c.name)
a.id, a.start, c.id, c.name
from
a
inner join
b on a.id = b.a_id
inner join
c on b.id = c.b_id
order by c.name, a.start desc
;
id | start | id | name
----+-------+----+-------
2 | 2 | 3 | aname
2 | 2 | 4 | bname
Case 2
select distinct on (c.name)
a.id, a.start, c.id, c.name
from
a
inner join
b on a.id = b.a_id
inner join
c on b.id = c.b_id
where
b.a_id in (
select a_id
from b
group by a_id
having count(*) > 1
)
order by c.name, a.start desc
;
id | start | id | name
----+-------+----+-------
1 | 1 | 1 | aname
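DISTINCT ON is Postgres-specific. As a side note, the Case 1 result can also be reproduced with ROW_NUMBER(), which is a different technique from the one in the answer. The sketch below runs it through Python's sqlite3 module on an in-memory SQLite database, assuming the bundled SQLite is 3.25 or newer so that window functions are available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, start INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 1), (2, 2);
    INSERT INTO b VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO c VALUES (1, 1, 'aname'), (2, 2, 'aname'),
                         (3, 3, 'aname'), (4, 3, 'bname');
""")

# Rank the joined rows per name by start (newest first) and keep rank 1,
# which plays the role of DISTINCT ON (c.name) ... ORDER BY a.start DESC.
rows = conn.execute("""
    SELECT a_id, start, c_id, name
    FROM (
        SELECT a.id AS a_id, a.start AS start, c.id AS c_id, c.name AS name,
               ROW_NUMBER() OVER (PARTITION BY c.name
                                  ORDER BY a.start DESC) AS rn
        FROM a
        JOIN b ON a.id = b.a_id
        JOIN c ON b.id = c.b_id
    ) AS ranked
    WHERE rn = 1
    ORDER BY name
""").fetchall()

for row in rows:
    print(row)
```

As with DISTINCT ON, ties on start within a name are broken arbitrarily unless more columns are added to the ORDER BY.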