Select Distinct returning multiple rows - sql

I've got a query that is:
SELECT DISTINCT a.field1, a.field2, a.field3, b.field1, b.field2, a.field4
FROM table1 a
JOIN table2 b ON b.fielda = a.fieldb
WHERE a.field1 = 'xxxx'
I run this and it returns three xxxx rows. I need all of the information listed above with the first field being distinct. Do I have the correct syntax for this?

In Postgres, you can use distinct on:
select distinct on (a.field1) a.field1, a.field2, a.field3, b.field1, b.field2, a.field4
from table1 a join
table2 b
on b.fielda = a.fieldb
where a.field1 = 'xxxx'
order by a.field1;
In either Postgres or SQL Server, you can use row_number():
select ab.*
from (select a.field1, a.field2, a.field3, b.field1 as b_field1, b.field2 as b_field2, a.field4,
row_number() over (partition by a.field1 order by a.field1) as seqnum
from table1 a join
table2 b
on b.fielda = a.fieldb
where a.field1 = 'xxxx'
) ab
where seqnum = 1;
Or, since you only want one row, you can use limit/top:
select a.field1, a.field2, a.field3, b.field1, b.field2, a.field4
from table1 a join
table2 b
on b.fielda = a.fieldb
where a.field1 = 'xxxx'
limit 1;
In SQL Server:
select top 1 a.field1, a.field2, a.field3, b.field1, b.field2, a.field4
from table1 a join
table2 b
on b.fielda = a.fieldb
where a.field1 = 'xxxx';

One option is to use row_number():
with cte as (
select a.field1, a.field2, a.field3, b.field1 as b_field1, b.field2 as b_field2, a.field4,
row_number() over (partition by a.field1 order by a.field1) rn
from table1 a
join table2 b on b.fielda = a.fieldb
where a.field1 = 'xxxx'
)
select *
from cte
where rn = 1
But you need to define which record to take. Ordering by field1, which is also the partition column, picks an arbitrary row from each group; order by a column that actually ranks the rows instead.
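To see the effect of a meaningful ORDER BY inside the window, here is a minimal sketch that runs the same pattern against an in-memory SQLite database through Python's sqlite3 module. The sample rows and the choice of field2 as the ranking column are assumptions for illustration only, and it needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample data: three joined rows share field1 = 'xxxx'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (field1 TEXT, field2 INTEGER, fieldb INTEGER);
    CREATE TABLE table2 (fielda INTEGER, field1 TEXT);
    INSERT INTO table1 VALUES ('xxxx', 30, 1), ('xxxx', 10, 2), ('xxxx', 20, 3);
    INSERT INTO table2 VALUES (1, 'p'), (2, 'q'), (3, 'r');
""")

# Ordering the window by field2 makes "the first row" well defined:
# the row with the smallest field2 gets rn = 1.
picked = conn.execute("""
    WITH cte AS (
        SELECT a.field1, a.field2, b.field1 AS b_field1,
               ROW_NUMBER() OVER (PARTITION BY a.field1
                                  ORDER BY a.field2) AS rn
        FROM table1 a
        JOIN table2 b ON b.fielda = a.fieldb
        WHERE a.field1 = 'xxxx'
    )
    SELECT field1, field2, b_field1 FROM cte WHERE rn = 1
""").fetchall()

print(picked)
```

With these rows the query keeps the one with field2 = 10. Swap the ORDER BY column for whatever defines "first" in your data.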

As you can read from your comments, DISTINCT cannot work for you. It gives you distinct rows. What you need instead is an aggregation, so as to get from three records to only one.
So the first comment you got (by sgeddes) is already the answer you need: "what values should the other columns have?". How shall the dbms know? You didn't tell it.
One row per field1 usually means GROUP BY field1. Then for every other field decide what you want to see: The maximum of field2 maybe? The minimum of field3? The average of field4?
select a.field1, max(a.field2), min(a.field3), count(b.field1), sum(b.field2), avg(a.field4)
from table1 a
join table2 b on b.fielda = a.fieldb
where a.field1 = 'xxxx'
group by a.field1;
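As a rough illustration of how the aggregation collapses three rows into one, here is a small sketch using Python's sqlite3 module with an in-memory database. The column types and sample values are made up; only the shape of the query follows the one above.

```python
import sqlite3

# Hypothetical sample data: three rows for field1 = 'xxxx', each with one match in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (field1 TEXT, field2 INTEGER, field4 REAL, fieldb INTEGER);
    CREATE TABLE table2 (fielda INTEGER, field2 INTEGER);
    INSERT INTO table1 VALUES ('xxxx', 30, 1.0, 1), ('xxxx', 10, 2.0, 2), ('xxxx', 20, 3.0, 3);
    INSERT INTO table2 VALUES (1, 5), (2, 7), (3, 9);
""")

# Every non-grouped column goes through an aggregate, so the result has one row per field1.
summary = conn.execute("""
    SELECT a.field1, MAX(a.field2), SUM(b.field2), AVG(a.field4), COUNT(*)
    FROM table1 a
    JOIN table2 b ON b.fielda = a.fieldb
    WHERE a.field1 = 'xxxx'
    GROUP BY a.field1
""").fetchall()

print(summary)
```

The single output row carries the maximum, sum, average and row count of the three source rows.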

Related

SQL Server - Subquery in Join vs Subquery in Where clause

I have a situation where I have source tables that get dumped with all the historic data on a daily basis. The way for me to extract the latest dump is by filtering the records using a date field.
Now I have scenarios where I may need to fetch data from about 4-5 tables in the same query. In that case, which one of the below options would be better for the tables that have high number of records:
SELECT A.col1,
B.col2,
C.col3
FROM (SELECT col1, x
FROM tableA
WHERE posting_date = (SELECT max(posting_date) from tableA)
) A
JOIN
(SELECT col2, y, z
FROM tableB
WHERE posting_date = (SELECT max(posting_date) from tableB)
) B
ON B.y = A.x
JOIN
(SELECT col3, w
FROM tableC
WHERE posting_date = (SELECT max(posting_date) from tableC)
) C
ON C.w = B.z
Or should I use simple subqueries in the WHERE clause:
SELECT A.col1,
B.col2,
C.col3
FROM tableA A,
tableB B,
tableC C
WHERE A.posting_date = (SELECT max(posting_date) from tableA)
AND B.posting_date = (SELECT max(posting_date) from tableB)
AND C.posting_date = (SELECT max(posting_date) from tableC)
AND A.x = B.y
AND B.z = C.w
From the readability perspective, I find the second option better. But I am not too sure of the performance when there will be a lot of records in all the required tables.
I, personally, think that using the ANSI-92 JOIN syntax and then putting the clauses in the WHERE would be the most readable though.
SELECT A.col1,
B.col2,
C.col3
FROM dbo.tableA A
JOIN dbo.tableB B ON A.x = B.y
JOIN dbo.tableC C ON B.z = C.w
WHERE A.posting_date = (SELECT MAX(sq.posting_date) from tableA sq)
AND B.posting_date = (SELECT MAX(sq.posting_date) from tableB sq)
AND C.posting_date = (SELECT MAX(sq.posting_date) from tableC sq);
I wouldn't do either of them. Window functions will serve you better.
Obviously use proper join syntax, not those awful, deprecated comma-joins.
SELECT A.col1,
B.col2,
C.col3
FROM (
SELECT *, maxdate = MAX(a.posting_date) OVER ()
FROM dbo.tableA a
) A
JOIN (
SELECT *, maxdate = MAX(b.posting_date) OVER ()
FROM dbo.tableB b
) B ON A.x = B.y
JOIN (
SELECT *, maxdate = MAX(c.posting_date) OVER ()
FROM dbo.tableC c
) C ON B.z = C.w
WHERE A.posting_date = A.maxdate
AND B.posting_date = B.maxdate
AND C.posting_date = C.maxdate;
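For a single table, the two ways of picking the latest dump can be compared with a small sketch. It uses Python's sqlite3 module as a stand-in for SQL Server, so it says nothing about SQL Server's execution plans; the sample rows are made up, and the window version needs SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical history table: one old dump and one new dump of the same record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (col1 TEXT, x INTEGER, posting_date TEXT);
    INSERT INTO tableA VALUES ('old', 1, '2024-01-01'), ('new', 1, '2024-01-02');
""")

# Variant 1: scalar subquery in the WHERE clause.
by_subquery = conn.execute("""
    SELECT col1 FROM tableA
    WHERE posting_date = (SELECT MAX(posting_date) FROM tableA)
""").fetchall()

# Variant 2: MAX() OVER () attaches the overall maximum to every row,
# so the table is referenced only once.
by_window = conn.execute("""
    SELECT col1
    FROM (SELECT col1, posting_date,
                 MAX(posting_date) OVER () AS maxdate
          FROM tableA) A
    WHERE posting_date = maxdate
""").fetchall()

print(by_subquery, by_window)
```

Both variants return only the row from the latest dump; which one performs better on large tables is something to check against the actual plans.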

Converting a PostgreSQL query with IN / NOT IN to JOINs

I currently have two tables with the same structure. Table A (as an example) has 10,000 rows, table B has 100,000 rows. I need to obtain the rows that are in Table B that are not in Table A, but only if certain fields are the same (and one is not).
Right now, the query is something like:
select *
from tableA A
where (A.field1, A.field2) in (select field1, field2 from tableB B)
and A.field3 not in (select field3 from tableB)
This works, but a better-performing solution could probably be written with JOINs. I have tried, but all I get is a huge list of duplicated rows. Could someone point me in the right direction?
Based on your current query this is what it translates to as joins:
select *
from tableA A
inner join tableB B on A.field1 = B.field1 and A.field2 = B.field2
left outer join tableB C on A.field3 = C.field3
where c.field3 is null
A faster query would be:
select A.pk
from tableA A
inner join tableB B on A.field1 = B.field1 and A.field2 = B.field2
left outer join tableB C on A.field3 = C.field3
where c.field3 is null
group by A.pk
This would give you the rows you need to add to tableB because they aren't found.
Or you can just get the fields you want to pull over:
select A.field1, A.field2, A.field3
from tableA A
inner join tableB B on A.field1 = B.field1 and A.field2 = B.field2
left outer join tableB C on A.field3 = C.field3
where c.field3 is null
group by A.field1, A.field2, A.field3
[NOT] EXISTS is your friend:
SELECT *
FROM tableA A
WHERE EXISTS ( SELECT * FROM tableB B
WHERE A.field1 = B.field1
AND A.field2 = B.field2
)
AND NOT EXISTS ( SELECT * FROM tableB B
WHERE A.field3 = B.field3
);
Note: if the joined columns are NOT NULLable, the [NOT] EXISTS() version will behave exactly the same as the [NOT] IN version. If the subquery returns even one NULL, NOT IN matches no rows, while NOT EXISTS is unaffected.
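The NULL difference between NOT IN and NOT EXISTS is easy to reproduce. A minimal sketch, using Python's sqlite3 module instead of PostgreSQL and made-up single-column tables:

```python
import sqlite3

# Hypothetical data: tableB.field3 contains a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (field3 INTEGER);
    CREATE TABLE tableB (field3 INTEGER);
    INSERT INTO tableA VALUES (1), (2);
    INSERT INTO tableB VALUES (2), (NULL);
""")

# NOT IN: "1 NOT IN (2, NULL)" evaluates to unknown, so the row is filtered out.
not_in = conn.execute("""
    SELECT field3 FROM tableA
    WHERE field3 NOT IN (SELECT field3 FROM tableB)
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so value 1 is returned.
not_exists = conn.execute("""
    SELECT field3 FROM tableA A
    WHERE NOT EXISTS (SELECT 1 FROM tableB B WHERE B.field3 = A.field3)
""").fetchall()

print(not_in, not_exists)
```

Here NOT IN returns nothing, while NOT EXISTS returns the row with field3 = 1.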
Reading the question text again (and again):
I need to obtain the rows that are in Table B that are not in Table A, but only if certain fields are the same (and one is not).
SELECT *
FROM tableB B
WHERE EXISTS ( SELECT * FROM tableA A
WHERE A.field1 = B.field1
AND A.field2 = B.field2
AND A.field3 <> B.field3
);

Remove Redundant Values in Cell built from LISTAGG, Oracle SQL [duplicate]

I'm very new to querying in oracle. I have built an oracle query using LISTAGG like so:
select a.field1,
LISTAGG(d.field2, ';') WITHIN GROUP (ORDER BY d.field2) AS FIELD_ALIAS
from
table1 a, table2 b,
table4 c, table5 d
where
a.field2 = b.field2
and
b.field2 = c.field2
and
c.field3 = d.field3
group by
a.field1
which returns:
field1 field2
----------------
504482 Labour;Labour;Labour;Labour;Labour;Labour;Labour;Labour
What I would like to do is simplify the second field and remove the redundant values so that I get:
field1 field2
----------------
504482 Labour
Is that possible?
As far as I know, listagg() does not take the distinct keyword before Oracle 19c. One approach is to use a subquery:
select field1, LISTAGG(field2, ';') WITHIN GROUP (ORDER BY field2)
from (select distinct a.field1, d.field2
from table1 a join
table2 b
on a.field2 = b.field2 join
table4 c
on b.field2 = c.field2 join
table5 d
on c.field3 = d.field3
) t
group by field1
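The "deduplicate first, then aggregate" idea can be tried out with a short sketch. It uses Python's sqlite3 module, where group_concat() stands in for Oracle's LISTAGG, and a hypothetical table named joined that plays the role of the four-table join result.

```python
import sqlite3

# "joined" is a made-up stand-in for the rows produced by the four-table join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined (field1 INTEGER, field2 TEXT);
    INSERT INTO joined VALUES (504482, 'Labour'), (504482, 'Labour'), (504482, 'Labour');
""")

# Aggregating the raw rows repeats the value once per joined row.
raw = conn.execute("""
    SELECT field1, group_concat(field2, ';')
    FROM joined
    GROUP BY field1
""").fetchall()

# Applying DISTINCT in a subquery first leaves one row per (field1, field2) pair.
deduped = conn.execute("""
    SELECT field1, group_concat(field2, ';')
    FROM (SELECT DISTINCT field1, field2 FROM joined) t
    GROUP BY field1
""").fetchall()

print(raw, deduped)
```

The first query yields the repeated list, the second a single 'Labour'.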

How to LEFT JOIN in DB2 iseries with first row?

I need a query that joins a table with the first matching row of another table:
SELECT * FROM TABLEA A LEFT JOIN
(SELECT * from TABLEB
WHERE FIELD1 <> '3' and FIELD2 = 'D' AND A.CODE=CODE
FETCH FIRST 1 ROW ONLY
) B
on a.FIELDA = b.FIELDA
and A.FIELDB = B.FIELDB
but DB2 returns an error because A.CODE cannot be referenced inside the subquery.
How can I solve this?
You need to use a LATERAL nested table expression, which lets the subquery reference columns of TABLEA:
SELECT * FROM TABLEA A LEFT JOIN
LATERAL (SELECT * from TABLEB
WHERE FIELD1 <> '3' and FIELD2 = 'D' AND A.CODE=CODE
FETCH FIRST 1 ROW ONLY
) B
on a.FIELDA = b.FIELDA
and A.FIELDB = B.FIELDB
This is a highly optimized statement.
You're not getting any data from tableb and you're only going for the first row, so you just need an exists clause.
select a.* from tablea a
where exists (select * from tableb b
where a.fielda = b.fielda
and a.fieldb = b.fieldb
and b.code = a.code
and b.field2 = 'D' and b.field1 <> '3')
You can use the OLAP function row_number() to rank the records according to somefield(s) within a (fielda,fieldb,code) group. Somefield might be a transaction id, or sequence, for example. The order by clause is optional there, but without it, you might be randomly picking which record is the first in the group.
WITH B AS
(SELECT T.*,
row_number() over (partition by fielda,fieldb,code
order by somefield
) as pick
from TABLEB T
WHERE FIELD1 <> '3'
and FIELD2 = 'D'
)
SELECT *
FROM TABLEA A LEFT JOIN B
on a.FIELDA = b.FIELDA
and A.FIELDB = B.FIELDB
and A.CODE = B.CODE
and B.pick = 1
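One thing to watch with a LEFT JOIN to a ranked subquery: a filter on the ranked table placed in WHERE discards the TABLEA rows that have no match, while the same filter in the ON clause keeps them. A minimal sketch, using Python's sqlite3 module as a stand-in for DB2 (window functions need SQLite 3.25 or later), with made-up data and a reduced set of columns:

```python
import sqlite3

# Hypothetical data: 'a1' has two candidate rows in TABLEB, 'a2' has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLEA (CODE TEXT);
    CREATE TABLE TABLEB (CODE TEXT, SOMEFIELD INTEGER);
    INSERT INTO TABLEA VALUES ('a1'), ('a2');
    INSERT INTO TABLEB VALUES ('a1', 2), ('a1', 1);
""")

# Same query twice; only the position of the "pick = 1" filter changes.
ranked = """
    WITH B AS (
        SELECT T.*,
               ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY SOMEFIELD) AS pick
        FROM TABLEB T
    )
    SELECT A.CODE, B.SOMEFIELD
    FROM TABLEA A LEFT JOIN B
      ON A.CODE = B.CODE
    {filter}
    ORDER BY A.CODE
"""

# Filter in WHERE: the NULL-extended row for 'a2' fails "pick = 1" and disappears.
in_where = conn.execute(ranked.format(filter="WHERE B.pick = 1")).fetchall()

# Filter in ON: 'a2' is kept, with NULL for the TABLEB column.
in_on = conn.execute(ranked.format(filter="AND B.pick = 1")).fetchall()

print(in_where, in_on)
```

Which behaviour is wanted depends on whether TABLEA rows without a match should appear in the result.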

Invalid identifier error on field created with select statement

Why doesn't the SQL below work?
select
a.field1, a.field2, a.field3,
(select count(*)
from table2 b
where b.field1 = a.field1
) as field4,
(select count(*)
from table3 b
where b.field1 = a.field1
) as field5,
(select count(*)
from table4 b
where b.field1 = a.field1
) as field6
from table1 a
order by field4
Oracle says: ORA-00904: "field4": invalid identifier
Try wrapping it in a subquery:
select * from
(
select
a.field1, a.field2, a.field3,
(select count(*)
from table2 b
where b.field1 = a.field1
) as field4,
(select count(*)
from table3 b
where b.field1 = a.field1
) as field5,
(select count(*)
from table4 b
where b.field1 = a.field1
) as field6
from table1 a
)
order by field4