postgres update value based on count of associated join table - sql

I have a simple scenario: TableA, TableB, and a JoinTable that joins TableA and TableB. For each row in TableA, I want to store the count of records in JoinTable that reference its TableAId. I can select it correctly as follows:
SELECT "Id", (SELECT COUNT(*) FROM "JoinTable" WHERE "JoinTable"."TableAId" = "TableA"."Id")
AS TOT FROM "TableA" LIMIT 100
However I'm having a hard time writing an update query. I want to update TableA.JoinCount with this result.

You can use a correlated subquery:
update tablea a
set tot = (
    select count(*)
    from jointable j
    where j.tableaid = a.id
)
This updates all rows of tablea with the count of matches from jointable; if there are no matches, tot is set to 0.
I would not necessarily recommend storing such derived information, however. While it can easily be initialized with the above statement, maintaining it is tedious: you will soon find yourself creating triggers for every DML operation on the join table (insert, update, delete). Instead, you could put the information in a view:
create view viewa as
select id,
(select count(*) from jointable j where j.tableaid = a.id) as tot
from tablea a
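To illustrate why maintaining the stored count is tedious, here is a rough, untested sketch of the kind of trigger you would otherwise end up writing (assuming the JoinCount column from the question, with unquoted lower-case names):

```sql
-- Hypothetical PL/pgSQL trigger keeping tablea.joincount in sync.
-- Handles insert and delete; an UPDATE that changes tableaid
-- would need to decrement the old row and increment the new one.
create function sync_join_count() returns trigger as $$
begin
    if tg_op = 'INSERT' then
        update tablea set joincount = joincount + 1 where id = new.tableaid;
    elsif tg_op = 'DELETE' then
        update tablea set joincount = joincount - 1 where id = old.tableaid;
    end if;
    return null;
end;
$$ language plpgsql;

create trigger jointable_sync_count
after insert or delete on jointable
for each row execute function sync_join_count();
```

Note that `execute function` requires Postgres 11+; older versions spell it `execute procedure`.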
Side note: in general, don't use quoted identifiers in Postgres. See this link for more.

You can use the group by query as the source for an UPDATE statement:
update "TableA" a
set "JoinCount" = t.cnt
from (
    select "TableAId" as id, count(*) as cnt
    from "JoinTable"
    group by "TableAId"
) t
where t.id = a."Id"

Related

Snowflake, SQL where clause

I need to write query with where clause:
where
pl.ods_site_id in (select id from table1 where ...)
But if the subquery on table1 returns no rows, the where clause should not filter anything at all (i.e. it should behave as if it were TRUE).
How can I do this? (I'm using the Snowflake SQL dialect.)
You could include a second condition:
where pl.ods_site_id in (select id from table1 where ...) or
not exists (select id from table1 where ...)
This explicitly checks for the subquery returning no rows.
If you are willing to use a join instead, Snowflake supports the QUALIFY clause, which might come in handy here. You can run this on Snowflake to see how it works:
with
pl (ods_site_id) as (select 1 union all select 5),
table1 (id) as (select 5) --change this to 7 to test if it returns ALL on no match
select a.*
from pl a
left join table1 b on a.ods_site_id = b.id -- and other conditions you want to add
qualify b.id = a.ods_site_id --either match the join condition
or count(b.id) over () = 0; --or make sure there is 0 match from table1

Filter by count from another table

This query works fine only without the WHERE clause; otherwise there is an error:
column "cnt" does not exist
SELECT
*,
(SELECT count(*)
FROM B
WHERE A.id = B.id) AS cnt
FROM A
WHERE cnt > 0
Use a subquery:
SELECT a.*
FROM (SELECT A.*,
(SELECT count(*)
FROM B
WHERE A.id = B.id
) AS cnt
FROM A
) a
WHERE cnt > 0;
Column aliases defined in the SELECT cannot be used by the WHERE (or other clauses) for that SELECT.
Or, if the id on a is unique, you can more simply do:
SELECT a.*, COUNT(B.id)
FROM A LEFT JOIN
B
ON A.id = B.id
GROUP BY A.id
HAVING COUNT(B.id) > 0;
Or, if you don't really need the count, then:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
Assumptions:
You need all columns from A in the result, plus the count from B. That's what your demonstrated query does.
You only want rows with cnt > 0. That's what started your question after all.
Most or all B.id values exist in A. That's the typical case, and certainly true if a FK constraint on B.id references A.id.
Solution
Faster, shorter, correct:
SELECT * -- !
FROM (SELECT id, count(*) AS cnt FROM B GROUP BY id) B
JOIN A USING (id) -- !
-- WHERE cnt > 0 -- this predicate is implicit now!
Major points
Aggregate before the join, that's typically (substantially) faster when processing the whole table or major parts of it. It also defends against problems if you join to more than one n-table. See:
Aggregate functions on multiple joined tables
You don't need to add the predicate WHERE cnt > 0 any more, that's implicit with the [INNER] JOIN.
You can simply write SELECT *, since with the USING clause the join only adds the column cnt to A.* - only one instance of the joining column(s) (id in the example) appears in the output columns. See:
How to drop one join key when joining two tables
Your added question in the comment
does Postgres really allow attributes outside aggregate functions that are not in the GROUP BY?
Yes, but only if the PK column(s) are listed in the GROUP BY clause - the primary key covers the whole row. That does not work for a UNIQUE or EXCLUSION constraint. See:
Return a grouped list with occurrences using Rails and PostgreSQL
SQL Fiddle demo (extended version of Gordon's demo).

How to aggregate on a left join in a Postgres CTE?

In this CTE, each row in mytable can have 0 or many rows joined to it in jointable. I'm trying to return an array_agg of the jointable's value column in this query, but I get an error saying I can't have an aggregate in a RETURNING.
WITH updated AS (
    UPDATE mytable SET status = 'A'
    FROM (
        SELECT id FROM mytable
        WHERE status = 'B'
        ORDER BY mycolumn
        LIMIT 100
        FOR UPDATE
    ) sub
    LEFT JOIN jointable j USING (id)
    WHERE mytable.id = sub.id
    GROUP BY (mytable.id)
    RETURNING mytable.id, array_agg(j.value)
)
select *
from updated
ORDER BY mycolumn
You cannot have a GROUP BY clause in an UPDATE statement. Also, the UPDATE won't necessarily visit all matching rows in the joined table, so it wouldn't be able to return them anyway.
You will have to join jointable again in the outer query to get the desired result.
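A sketch of that rewrite (untested; it assumes mycolumn is returned from the UPDATE so the final ordering still works):

```sql
WITH updated AS (
    UPDATE mytable SET status = 'A'
    FROM (
        SELECT id FROM mytable
        WHERE status = 'B'
        ORDER BY mycolumn
        LIMIT 100
        FOR UPDATE
    ) sub
    WHERE mytable.id = sub.id
    RETURNING mytable.id, mytable.mycolumn
)
-- aggregate in the outer query, where GROUP BY is allowed
SELECT u.id, array_agg(j.value) AS vals
FROM updated u
LEFT JOIN jointable j USING (id)
GROUP BY u.id, u.mycolumn
ORDER BY u.mycolumn;
```

Rows with no match in jointable come back as `{NULL}` from array_agg; wrap it in a FILTER or COALESCE if you want an empty array instead.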

SQL Update subquery returns no results run different sub query

I am trying to do an update and I'm having problems (using Microsoft SQL Server):
update mytable
set myvalue=
(
(select myvalue from someothertable
where someothertable.id=mytable.id)
)
from mytable
where mytable.custname='test'
Basically the subquery could return no results if that does happen i want to call a different subquery:
(select myvalue from oldtable
where oldtable.id=mytable.id)
Well, you could run the second query first, and then the first query.
That way, you override the values only when the first query can produce them; rows where the first query returns no result keep the value from the second (fallback) query.
Also, I think you have a typo in your second query with the table name.
update mytable
set myvalue=
(
select myvalue from oldtable
where oldtable.id=mytable.id
)
from mytable
where mytable.custname='test'
and exists (select 1 from oldtable
where oldtable.id=mytable.id)
update mytable
set myvalue=
(
select myvalue from someothertable
where someothertable.id=mytable.id
)
from mytable
where mytable.custname='test'
and exists ( select 1 from someothertable
where someothertable.id=mytable.id)
Edit: you will need to add the EXISTS clause, because otherwise it will update with NULL values.
You can simply join both tables:
UPDATE a
SET a.myValue = b.myValue
FROM myTable a
INNER JOIN someOtherTable b
ON a.ID = b.ID
WHERE a.CustName = 'test'

Efficient latest record query with Postgresql

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest records for a large (thousands of entries) number of records, but only the latest entry.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
This could be more efficient. The difference: the query on table b is executed only once, while your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) maxDate
FROM table
GROUP BY ID) b
ON a.ID = b.ID AND a.date = b.maxDate
WHERE ID IN $LIST
What do you think about this?
select * from (
    SELECT a.*, row_number() over (partition by a.id order by date desc) r
    FROM table a where ID IN $LIST
) t
WHERE r = 1
(Note that the subquery needs an alias in Postgres.) I used it a lot in the past.
One method: create a small derivative table containing the most recent update/insertion time for each row in table a - call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use:
CREATE TABLE a_latest
( id    INTEGER   NOT NULL,
  date  TIMESTAMP NOT NULL,
  PRIMARY KEY (id) );
Then use a query similar to that suggested by najmeddine:
SELECT a.*
FROM table a
JOIN a_latest USING (id, date);
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in PL/pgSQL is fairly easy to write. I am happy to provide an example if you wish.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
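As a minimal, untested sketch of such a trigger (assuming a_latest has columns id and date with a primary key on id, as defined above):

```sql
-- Hypothetical PL/pgSQL trigger keeping a_latest current on writes to a.
create function track_latest() returns trigger as $$
begin
    -- upsert: insert a new entry, or keep the greater of the two dates
    insert into a_latest (id, date)
    values (new.id, new.date)
    on conflict (id) do update
        set date = greatest(a_latest.date, excluded.date);
    return null;
end;
$$ language plpgsql;

create trigger a_track_latest
after insert or update on a
for each row execute function track_latest();
```

The `ON CONFLICT` upsert requires Postgres 9.5+, and `execute function` requires 11+ (`execute procedure` on older versions).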
If you have many rows per id, you definitely want a correlated subquery.
It will make 1 index lookup per id, but this is faster than sorting the whole table.
Something like :
SELECT a.id,
(SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2 a;
The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct id's for good performance. Since your ids are probably FKs into another table, use that one.
You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)