Efficient latest record query with Postgresql - sql

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest records for a large (thousands of entries) number of records, but only the latest entry.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);

If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.

this could be more eficient. Difference: query for table b is executed only 1 time, your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) maxDate
FROM table
GROUP BY ID) b
ON a.ID = b.ID AND a.date = b.maxDate
WHERE ID IN $LIST

what do you think about this?
select * from (
SELECT a.*, row_number() over (partition by a.id order by date desc) r
FROM table a where ID IN $LIST
)
WHERE r=1
i used it a lot on the past

On method - create a small derivative table containing the most recent update / insertion times on table a - call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use
CREATE TABLE
a_latest
( id INTEGER NOT NULL,
date TSTAMP NOT NULL,
PRIMARY KEY (id, max_time) );
Then use a query similar to that suggested by najmeddine :
SELECT a.*
FROM TABLE a, TABLE a_latest
USING ( id, date );
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in plppgsql is fairly easy to write. I am happy to provide an example if you wish.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.

If you have many rows per id's you definitely want a correlated subquery.
It will make 1 index lookup per id, but this is faster than sorting the whole table.
Something like :
SELECT a.id,
(SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2;
The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct id's for good performance. Since your ids are probably FKs into another table, use this one.

You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)

Related

SQL - select * given count from another table

I'm trying to select * from two tables (a and b) using a join (column a.id and b.id), given that the count of a column (b.owner) in b is lower than 3, i.e. the occurence of a person's name can be max 2.
I've tried:
SELECT a.*, COUNT(b.owner) AS b_count
FROM a LEFT JOIN b on a.id = b.id
GROUP BY b.owner HAVING COUNT(b_count) <3
As im pretty new to SQL, im pretty stuck here. How can i resolve this issue? The result should be all columns for owners who do not appear more than twice in the data.
The query you are trying to run is not working due to the columns missing in the GROUP BY clause.
As you are outputting all columns from table a (with SELECT a.*), you need to include all those columns in the GROUP BY statement, so that the database understand the group of fields to group by and perform the aggregation required (in your case COUNT(b.owner)).
Example
Considering that your table a has 3 columns below:
CREATE TABLE persons (
id INTEGER,
name VARCHAR(50),
birthday DATE,
PRIMARY KEY (id)
);
.. and your table b the following and referencing the first table as below:
CREATE TABLE sales (
id INTEGER,
person_id INTEGER,
sale_value DECIMAL,
PRIMARY KEY (id),
FOREIGN KEY (person_id) REFERENCES persons(id)
);
.. you should query it aggregating the COUNT() by those 3 columns:
SELECT a.id, a.name, a.birthday, COUNT(b.person_id) AS b_count
FROM persons a
LEFT JOIN sales b ON a.id = b.person_id
GROUP BY a.id, a.name, a.birthday
HAVING COUNT(b.person_id) < 3
Alternative
In case the total of records on the 2nd table is not important to you, you could use a different "strategy" here to avoid performing the JOIN between the tables (useful when joining two huge tables) and rewriting all the columns from a on the SELECT+GROUP BY.
By identifying the records that has less than the 3 occurrences firstly:
SELECT b.person_id
FROM sales b
GROUP BY b.person_id
HAVING COUNT(b.id) < 3;
.. and using it in the WHERE clause to retrieve all the columns from the 1st table only for the ids that resulted from the previous query:
SELECT a.*
FROM persons a
WHERE a.id IN (....other query here....);
.. the execution happens in a more chronological and, perhaps, easier way to visualize while getting more familiar with SQL:
SELECT a.*
FROM persons a
WHERE a.id IN (SELECT b.person_id
FROM sales b
GROUP BY b.person_id
HAVING COUNT(b.id) < 3);
DB Fiddle here
In Standard SQL, you can use:
SELECT a.*, COUNT(b.owner) AS b_count
FROM a LEFT JOIN
b
ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.owner) < 3;
This may not work in all databases (and it assumes that a.id is unique/primary key). An alternative would be to use a correlated subquery:
SELECT a.*
FROM (SELECT a.*,
(SELECT COUNT(*)
FROM b
WHERE a.id = b.id
) as b_count
FROM a
) a
WHERE b_count < 3;

Drop duplicated rows in postgresql

I am querying some data from the database, and my code looks as below
select
a.id
a.party
a.date
a.name
a.revenue
b.company
c.cost
from a
left join b on a.id = b.id
left join c on a.id = c.id
where a.party = 'cat' and a.date > '2000-01-01'
I got a returned table but the table has duplicated rows. Is there anyway I can remove all duplicated rows (meaning the entire row is the same, row 1 = row2, remove row1)
I put select distinct at the top, but then it took forever to run. Not sure if some fundamental programming logic was wrong in this code.
If one row in a id related to two rows in b because b.id is not unique, and both these rows have the same company, your query result will contain duplicate rows. There is nothing wrong with that.
Removing duplicate rows with DISTINCT is expensive for big result sets, because the set has to be sorted.
Ideas to improve performance:
increase work_mem, that makes sorting faster
perhaps you don't need all the result rows, then adding a LIMIT clause will make the query faster
Normal deduplicate process with better performance is using row_number() over(partition by order by) and then rownumber =1.
Select a_id, b_id, name, ...
From
( Select a.id a_id, b.id b_id, b.name name,
row_number() over (partition by a.id,b.id, b.name order by create_date desc) rownum from
A
Join b on a.id=b.id) rs
Where rs.rownum=1
Please note you can partition by, order by any key column(whatever you want as unique). I used create date to pick latest row.
Also note huge number of row can hamper performance but it's better than distinct. Also plss check if you can get distinct first before joining.

Finding Most Recent Value in Table B Relative to Table A's Date

I currently have two tables: table a and table b.
My goal is to take the newest score from table b and add it as a new column using a join in table a (however when I say "newest" I really mean "latest in relation to Event_Date listed in table a)
I'm assuming it will be a left join but am having trouble pulling the Score. All I know how to do is pull the date:
select
a.Entity_ID,
a.Event_Date,
max(b.date_processed) --I want to change this to the score correlated to the max date_processed
from myTable a
left join myTable b
on a.Entity_ID = b.Entity_ID and b.date_processed < a.event_date
Group By a.Entity_ID, a.Event_Date, b.Date_Processed
Any help would be much appreciated
I understand that you want the most recent score from tabebprior to the event_date of tablea.
One option uses a correlated subquery with a row-limiting clause:
select
a.*,
(
select b.score
from tableb b
where b.entity_id = a.entity_id and b.date_processed <= a.event_date
order by b.date_processed desc
fetch first 1 row only
) most_recent_score
from tablea a

Filter by count from another table

This query works fine only without WHERE, otherwise there is an error:
column "cnt" does not exist
SELECT
*,
(SELECT count(*)
FROM B
WHERE A.id = B.id) AS cnt
FROM A
WHERE cnt > 0
Use a subquery:
SELECT a.*
FROM (SELECT A.*,
(SELECT count(*)
FROM B
WHERE A.id = B.id
) AS cnt
FROM A
) a
WHERE cnt > 0;
Column aliases defined in the SELECT cannot be used by the WHERE (or other clauses) for that SELECT.
Or, if the id on a is unique, you can more simply do:
SELECT a.*, COUNT(B.id)
FROM A LEFT JOIN
B
ON A.id = B.id
GROUP BY A.id
HAVING COUNT(B.id) > 0;
Or, if you don't really need the count, then:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
Assumptions:
You need all columns from A in the result, plus the count from B. That's what your demonstrated query does.
You only want rows with cnt > 0. That's what started your question after all.
Most or all B.id exist in A. That's the typical case and certainly true if a FK constraint on B.id references to A.id.
Solution
Faster, shorter, correct:
SELECT * -- !
FROM (SELECT id, count(*) AS cnt FROM B) B
JOIN A USING (id) -- !
-- WHERE cnt > 0 -- this predicate is implicit now!
Major points
Aggregate before the join, that's typically (substantially) faster when processing the whole table or major parts of it. It also defends against problems if you join to more than one n-table. See:
Aggregate functions on multiple joined tables
You don't need to add the predicate WHERE cnt > 0 any more, that's implicit with the [INNER] JOIN.
You can simply write SELECT *, since the join only adds the column cnt to A.* when done with the USING clause - only one instance of the joining column(s) (id in the example) is added to the out columns. See:
How to drop one join key when joining two tables
Your added question in the comment
postgres really allows to have outside aggregate function attributes that are not behind group by?
That's only true if the PK column(s) is listed in the GROUP BY clause - which covers the whole row. Not the case for a UNIQUE or EXCLUSION constraint. See:
Return a grouped list with occurrences using Rails and PostgreSQL
SQL Fiddle demo (extended version of Gordon's demo).

Returning only duplicate rows from two tables

Every thread I've seen so far has been to check for duplicate rows and avoiding them. I'm trying to get a query to only return the duplicate rows. I thought it would be as simple as a subquery, but I was wrong. Then I tried the following:
SELECT * FROM a
WHERE EXISTS
(
SELECT * FROM b
WHERE b.id = a.id
)
Was a bust too. How do I return only the duplicate rows? I'm currently going through two tables, but I'm afraid there are a large amount of duplicates.
use this query, maybe is better if you check the relevant column.
SELECT * FROM a
INTERSECT
SELECT * FROM b
I am sure your posted code would work too like
SELECT * FROM a
WHERE EXISTS
(
SELECT 1 FROM b WHERE id = a.id
)
You can as well do a INNER JOIN like
SELECT a.* FROM a
JOIN b on a.id = b.id;
You can as well use a IN operator saying
SELECT * FROM a where id in (select id from b);
If none of them, then you can use UNION if both table satisfies the union restriction along with ROW_NUMBER() function like
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS rn
FROM (
select * from a
union all
select * from b) xx ) yy
WHERE rn = 1;
Note: there's an ambiguity as to what you mean by a duplicate row, and whether you're talking about duplicate keys, or all fields being the same. My answer deals with all fields being the same; some of the others are assuming it's just the keys. It's unclear which you intend.
You might try
SELECT id, col1, col2 FROM a INNER JOIN b ON a.id = b.id
WHERE a.col1 = b.col1 AND a.col2 = b.col2
adding in other columns as necessary. The database engine should be intelligent enough to do the comparisons on the indexed columns first, so it'll be efficient as long as you don't have rows that are different only on lots of non-indexed fields. (If you do, then I don't think anything will do it particularly efficiently.)