Drop duplicated rows in postgresql - sql

I am querying some data from the database, and my code looks as below
select
a.id,
a.party,
a.date,
a.name,
a.revenue,
b.company,
c.cost
from a
left join b on a.id = b.id
left join c on a.id = c.id
where a.party = 'cat' and a.date > '2000-01-01'
I got a returned table, but the table has duplicated rows. Is there any way I can remove all duplicated rows (meaning the entire row is the same: if row 1 = row 2, remove row 1)?
I put SELECT DISTINCT at the top, but then it took forever to run. Not sure if some fundamental logic is wrong in this code.

If one row in a is related to two rows in b (because b.id is not unique), and both of those rows have the same company, your query result will contain duplicate rows. There is nothing wrong with that.
Removing duplicate rows with DISTINCT is expensive for big result sets, because the set has to be sorted.
Ideas to improve performance:
increase work_mem, which makes sorting faster
perhaps you don't need all the result rows; then adding a LIMIT clause will make the query faster
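To make the fan-out concrete, here is a minimal runnable sketch (using SQLite via Python purely for illustration; the tables and data are made up, with only the id/party/company columns from the question):

```python
# Sketch: a non-unique join key duplicates rows; DISTINCT collapses exact duplicates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, party TEXT);
    CREATE TABLE b (id INTEGER, company TEXT);
    INSERT INTO a VALUES (1, 'cat');
    -- two rows in b share id = 1 and carry the same company
    INSERT INTO b VALUES (1, 'acme'), (1, 'acme');
""")

plain = conn.execute(
    "SELECT a.id, a.party, b.company FROM a LEFT JOIN b ON a.id = b.id"
).fetchall()
deduped = conn.execute(
    "SELECT DISTINCT a.id, a.party, b.company FROM a LEFT JOIN b ON a.id = b.id"
).fetchall()

print(plain)    # the single row of a comes back once per matching b row
print(deduped)  # DISTINCT collapses the two identical result rows into one
```

The duplicate rows are produced by the join itself, which is why DISTINCT (a sort or hash over the whole result set) is the only way to remove them after the fact.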

The usual deduplication approach with better performance is row_number() over (partition by ... order by ...), then keeping only the rows where rownum = 1.
select a_id, b_id, name, ...
from (
    select a.id a_id, b.id b_id, b.name name,
           row_number() over (partition by a.id, b.id, b.name order by create_date desc) rownum
    from a
    join b on a.id = b.id
) rs
where rs.rownum = 1
Please note you can partition by and order by any key columns (whatever you want treated as unique); I used create_date to pick the latest row.
Also note that a huge number of rows can hamper performance, but this is still better than DISTINCT. Also check whether you can apply DISTINCT before joining.
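As a runnable sketch of the row_number() pattern above (SQLite via Python for illustration; the a_id/name/create_date data is invented, and the duplicate key deliberately has two different create_date values):

```python
# Sketch: deduplicate with row_number(), keeping the latest create_date per key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (a_id INTEGER, name TEXT, create_date TEXT);
    INSERT INTO b VALUES
        (1, 'x', '2020-01-01'),
        (1, 'x', '2021-06-01'),   -- duplicate key, newer create_date
        (2, 'y', '2020-03-01');
""")

rows = conn.execute("""
    SELECT a_id, name, create_date FROM (
        SELECT b.*,
               ROW_NUMBER() OVER (PARTITION BY a_id, name
                                  ORDER BY create_date DESC) AS rownum
        FROM b
    ) rs
    WHERE rs.rownum = 1
    ORDER BY a_id
""").fetchall()

print(rows)  # one row per (a_id, name), keeping the latest create_date
```

Unlike DISTINCT, this lets you choose *which* of the duplicates survives via the ORDER BY inside the window.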

SQL function to create a one-to-one match between two tables?

I am trying to join 2 tables. Table_A has ~145k rows whereas Table_B has ~205k rows.
They have two columns in common (i.e. ISIN and date). However, when I execute this query:
SELECT A.*,
B.column_name
FROM Table_A
JOIN
Table_B ON A.date = B.date
WHERE A.isin = B.isin
I get a table with more than 147k rows. How is it possible? Shouldn't it return a table with at most ~145k rows?
What you are seeing indicates that, for some of the records in Table_A, there are several records in Table_B that satisfy the join conditions (equality on the (date, isin) tuple).
To exhibit these records, you can do:
select B.date, B.isin
from Table_A
join Table_B on A.date = B.date and A.isin = B.isin
group by B.date, B.isin
having count(*) > 1
It's up to you to define how to handle those duplicates. For example:
if the duplicates have different values in column column_name, then you can decide to pull out the maximum or minimum value
or use another column to filter on the top or lower record within the duplicates
if the duplicates are true duplicates, then you can use select distinct in a subquery to dedup them before joining
... other solutions are possible ...
If you want one row per table A, then use outer apply:
SELECT A.*,
B.column_name
FROM Table_A a OUTER APPLY
(SELECT TOP (1) b.*
FROM Table_B b
WHERE A.date = B.date AND A.isin = B.isin
ORDER BY ? -- you can specify *which* row you want when there are duplicates
) b;
OUTER APPLY implements a lateral join. The TOP (1) ensures that at most one row is returned. The OUTER (as opposed to CROSS) ensures that nothing is filtered out. In this case, you could also phrase it as a correlated subquery.
All that said, your data does not seem to be what you really expect. You should figure out where the duplicates are coming from. The place to start is:
select b.date, b.isin, count(*)
from tableb b
group by b.date, b.isin
having count(*) >= 2;
This will show you the duplicates, so you can figure out what to do about them.
The duplicate possibilities are already discussed above.
When millions of records are used in a join, a poor cardinality estimate often
means the optimizer's row estimates are inaccurate, producing an inefficient plan.
For this, just change the join order:
SELECT A.*,
B.column_name
FROM Table_A
JOIN
Table_B ON A.isin = B.isin
and
A.date = B.date
Also create a nonclustered index on both tables:
CREATE NONCLUSTERED INDEX isin_date_table_A ON Table_A (isin, date)
    INCLUDE (/* comma-separated list of the Table_A columns required in the result set */);
CREATE NONCLUSTERED INDEX isin_date_table_B ON Table_B (isin, date)
    INCLUDE (column_name);
UPDATE STATISTICS Table_A;
UPDATE STATISTICS Table_B;
Keep the DATE columns of both tables in the same format in the JOIN condition and you should get the expected result:
Select A.*, B.column_name
from Table_A
join Table_B on to_date(a.date,'DD-MON-YY') = to_date(b.date,'DD-MON-YY')
where A.isin = B.isin

Oracle SQL: Selecting all, plus an extra column with a complex query

I have these tables setup:
NOMINATIONS: A table of award nominations
NOMINATION_NOMINEES: A table of users with a FK on NOMINATIONS.ID
One Nomination can be referenced by many nominees via the ID field.
SELECT a.*, COUNT(SELECT all records from NOMINATION_NOMINEES with this ID) AS "b"
FROM NOMINATIONS a
LEFT JOIN NOMINATION_NOMINEES b on a.ID = b.ID
The results would look like:
ID | NOMINATION_DESCRIPTION | ... | NUMBER_NOMINEES
Where NUMBER_NOMINEES is the number of rows in the NOMINATION_NOMINEES table with the current row's ID.
This is a tricky one, we are feeding this into a larger system so I'm hoping to get this in one query with a bunch of subqueries. Implementing subqueries into this has twisted my mind. Anyone have an idea of where to head with this?
I'm sure the above way is not close to a decent approach to this one, but I can't quite wrap my mind around this one.
It can be done with a single correlated subquery in the SELECT clause:
SELECT a.*,
( SELECT COUNT(b.ID) FROM NOMINATION_NOMINEES b WHERE a.ID= b.ID )
FROM NOMINATIONS a
You should be able to use count as an analytic function:
select a.*,
count(b.id) over (partition by b.id)
from nominations a
left join nomination_nominees b on a.id = b.id
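A runnable sketch of the correlated-count approach (SQLite via Python for illustration; the nomination descriptions and nominee names are invented, and a nomination with zero nominees is included to show the count comes back as 0):

```python
# Sketch: one output row per nomination, with a scalar subquery counting nominees.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nominations (id INTEGER, description TEXT);
    CREATE TABLE nomination_nominees (id INTEGER, user_name TEXT);
    INSERT INTO nominations VALUES (1, 'best dev'), (2, 'best ops');
    INSERT INTO nomination_nominees VALUES (1, 'ann'), (1, 'bob');
    -- nomination 2 has no nominees
""")

rows = conn.execute("""
    SELECT a.*,
           (SELECT COUNT(*) FROM nomination_nominees b
            WHERE b.id = a.id) AS number_nominees
    FROM nominations a
    ORDER BY a.id
""").fetchall()

print(rows)  # each nomination appears once, with its nominee count
```

Note the difference between the two answers above: the scalar subquery yields exactly one row per nomination, while the left-join-plus-window version yields one row per nominee (each carrying the count).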

Have two tables A,B that has one to many relationship. I want to link A to B by only taking the most recent instance in B

I have two tables, A and B. For a particular value in A in a particular column there will be multiple rows in B that correspond to it. I want to extract the info from B, but not all the rows but just the most recent.
What is the easiest way to do this?
Thanks
A typical way to do this is to use the ANSI standard row_number() function. Here is a sketch of what the query might look like:
select *
from a left join
(select b.*, row_number() over (partition by b.aid order by b.date desc) as seqnum
from b
) b
on a.aid = b.aid and b.seqnum = 1;
You can also approach this with aggregation:
select *
from a left join
b
on a.aid = b.aid join
(select b.aid, max(b.date) as maxdate
 from b
 group by b.aid
) bmax
on b.aid = bmax.aid and b.date = bmax.maxdate;
Assuming B.id is auto-incremented and B.a_id references A, you can:
SELECT B.id,B.a_id,B.data FROM A
JOIN B
ON B.a_id = A.id
WHERE B.id IN (
SELECT MAX(B.id)
FROM B
GROUP BY B.a_id
)
See SQLFiddle. I'm assuming PostgreSQL here, but I'm sure you can adapt accordingly.
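A runnable sketch of the MAX(id)-per-group approach above (SQLite via Python for illustration; the table contents are invented, with two B rows for one A row so the older one gets filtered out):

```python
# Sketch: link A to B keeping only the most recent B row, via MAX(id) per a_id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY AUTOINCREMENT, a_id INTEGER, data TEXT);
    INSERT INTO a VALUES (1, 'first'), (2, 'second');
    INSERT INTO b (a_id, data) VALUES (1, 'old'), (1, 'new'), (2, 'only');
""")

rows = conn.execute("""
    SELECT b.id, b.a_id, b.data
    FROM a JOIN b ON b.a_id = a.id
    WHERE b.id IN (SELECT MAX(b.id) FROM b GROUP BY b.a_id)
    ORDER BY b.a_id
""").fetchall()

print(rows)  # only the highest-id (most recent) b row per a row survives
```

This relies on the assumption stated above: IDs are assigned monotonically and never recycled, so a larger id really does mean a more recent row.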
If the rows are only added and never updated, and the IDs are not recycled, then you can select the record from B with the highest ID.
If you are able to edit the schema, add a lastUpdatedTime column to B and select according to that column.
I will explain by giving an example.
Suppose your A table is for ORDERS and B is for PRODUCT DETAILS.
In your A table, only one row is inserted at a time,
but a single order can contain many products.
So this is the general scenario where we need to get the most recent records from the database.
There are 2 ways:
1. On creation of an order, insert a date into the database, and fetch the most recent records using that date.
2. An auto-incremented ID is the other option; the highest ID will be your most recent record.
But the most standard way is to use a date column in the database.

Efficient way to check if row exists for multiple records in postgres

I saw answers to a related question, but couldn't really apply what they are doing to my specific case.
I have a large table (300k rows) that I need to join with another even larger (1-2M rows) table efficiently. For my purposes, I only need to know whether a matching row exists in the second table. I came up with a nested query like so:
SELECT
id,
CASE cnt WHEN 0 then 'NO_MATCH' else 'YES_MATCH' end as match_exists
FROM
(
SELECT
A.id as id, count(*) as cnt
FROM
A, B
WHERE
A.id = B.foreing_id
GROUP BY A.id
) AS id_and_matches_count
Is there a better and/or more efficient way to do it?
Thanks!
You just want a left outer join:
SELECT
A.id as id, count(B.foreing_id) as cnt
FROM A
LEFT OUTER JOIN B ON
A.id = B.foreing_id
GROUP BY A.id
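A runnable sketch of this left-join approach (SQLite via Python for illustration; the data is invented, and the column keeps the asker's spelling `foreing_id`). The point is that COUNT over the B-side column ignores the NULLs produced by unmatched rows, so the count is 0 exactly when no match exists:

```python
# Sketch: existence check via LEFT OUTER JOIN + COUNT of the B-side column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (foreing_id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1), (1);   -- id 2 has no match in b
""")

rows = conn.execute("""
    SELECT a.id,
           CASE COUNT(b.foreing_id) WHEN 0 THEN 'NO_MATCH' ELSE 'YES_MATCH' END
    FROM a LEFT OUTER JOIN b ON a.id = b.foreing_id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

print(rows)  # one row per a.id, labelled by whether a match exists
```

This collapses the asker's two-level query into one pass; since only existence matters (not the exact count), an EXISTS-style check could work as well.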

Efficient latest record query with Postgresql

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest records for a large (thousands of entries) number of records, but only the latest entry.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
This could be more efficient. The difference: the query against table b is executed only once, whereas your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) maxDate
FROM table
GROUP BY ID) b
ON a.ID = b.ID AND a.date = b.maxDate
WHERE ID IN $LIST
What do you think about this?
select * from (
    SELECT a.*, row_number() over (partition by a.id order by date desc) r
    FROM table a
    WHERE ID IN $LIST
) t
WHERE t.r = 1
I used it a lot in the past.
One method: create a small derived table containing the most recent update/insertion times for table a - call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use:
CREATE TABLE a_latest
( id   INTEGER   NOT NULL,
  date TIMESTAMP NOT NULL,
  PRIMARY KEY (id, date) );
Then use a query similar to that suggested by najmeddine:
SELECT a.*
FROM a
JOIN a_latest USING (id, date);
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in PL/pgSQL is fairly easy to write. I am happy to provide an example if you wish.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
If you have many rows per id, you definitely want a correlated subquery.
It will make 1 index lookup per id, but this is faster than sorting the whole table.
Something like :
SELECT a.id,
       (SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2 a;
The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct id's for good performance. Since your ids are probably FKs into another table, use this one.
You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)