Aggregate with a limit in Postgres

In this query, each row of table a could have hundreds of rows of table b associated with it, so the array_agg contains all of those values. I'd like to be able to set a limit on it, but while array_agg accepts an order by, there is no way to give it a limit.
select a.column1, array_agg(b.column2)
from a left join b using (id)
where a.column3 = 'value'
group by a.column1
I could use the "slice" syntax on the resulting array, but that's quite expensive since it first has to aggregate all the rows and then discard the rest. What's the proper, efficient way to do this?

I would use a lateral join.
select a.column1, array_agg(b.column2)
from a left join lateral
(select id, column2 from b where b.id=a.id order by something limit 10) b using (id)
where a.column3 = 'value'
group by a.column1
Since the "id" restriction is already inside the lateral query, you could make the join condition on true rather than using (id). I don't know which is less confusing.

I think you need to enumerate the rows first and then aggregate:
select ab.column1, array_agg(ab.column2)
from (select a.column1, b.column2,
             row_number() over (partition by a.column1 order by b.column2) as seqnum
      from a left join
           b
           using (id)
      where a.column3 = 'value'
     ) ab
where seqnum <= 10
group by ab.column1

Related

SQL function to create a one-to-one match between two tables?

I am trying to join 2 tables. Table_A has ~145k rows whereas Table_B has ~205k rows.
They have two columns in common (i.e. ISIN and date). However, when I execute this query:
SELECT A.*,
       B.column_name
FROM Table_A A
JOIN
     Table_B B ON A.date = B.date
WHERE A.isin = B.isin
I get a table with more than 147k rows. How is it possible? Shouldn't it return a table with at most ~145k rows?
What you are seeing indicates that, for some of the records in Table_A, there are several records in Table_B that satisfy the join conditions (equality on the (date, isin) tuple).
To exhibit these records, you can do:
select B.date, B.isin
from Table_A A
join Table_B B on A.date = B.date and A.isin = B.isin
group by B.date, B.isin
having count(*) > 1
It's up to you to define how to handle those duplicates. For example:
if the duplicates have different values in column column_name, then you can decide to pull out the maximum or minimum value
or use another column to pick the top or bottom record within each set of duplicates
if the duplicates are true duplicates, then you can use select distinct in a subquery to dedup them before joining (a sketch of this is shown below)
... other solutions are possible ...
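A minimal sketch of the dedup-before-join option, assuming the duplicate Table_B rows are exact copies across (isin, date, column_name):
SELECT A.*,
       B.column_name
FROM Table_A A
JOIN (SELECT DISTINCT isin, date, column_name
      FROM Table_B) B
     ON A.date = B.date AND A.isin = B.isin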
If you want one row per table A, then use outer apply:
SELECT A.*,
B.column_name
FROM Table_A a OUTER APPLY
(SELECT TOP (1) b.*
FROM Table_B b
WHERE A.date = B.date AND A.isin = B.isin
ORDER BY ? -- you can specify *which* row you want when there are duplicates
) b;
OUTER APPLY implements a lateral join. The TOP (1) ensures that at most one row is returned. The OUTER (as opposed to CROSS) ensures that nothing is filtered out. In this case, you could also phrase it as a correlated subquery.
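For reference, a sketch of that correlated-subquery phrasing; as in the OUTER APPLY version, the ORDER BY placeholder decides which of the duplicate rows wins:
SELECT A.*,
       (SELECT TOP (1) B.column_name
        FROM Table_B B
        WHERE B.date = A.date AND B.isin = A.isin
        ORDER BY ? -- pick which duplicate you want
       ) AS column_name
FROM Table_A A;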
All that said, your data does not seem to be what you really expect. You should figure out where the duplicates are coming from. The place to start is:
select b.date, b.isin, count(*)
from Table_B b
group by b.date, b.isin
having count(*) >= 2;
This will show you the duplicates, so you can figure out what to do about them.
The duplicate possibilities have already been discussed.
When millions of records are involved in a join, the optimizer's cardinality estimates are often poor, which can lead to a bad plan and slow results. You can try putting both conditions in the join itself:
SELECT A.*,
       B.column_name
FROM Table_A A
JOIN
     Table_B B ON A.isin = B.isin
              AND A.date = B.date
Also create a nonclustered index on each table and refresh the statistics:
CREATE NONCLUSTERED INDEX isin_date_table_A ON Table_A (isin, date)
    INCLUDE ( /* comma-separated list of the Table_A columns required in the result set */ )
CREATE NONCLUSTERED INDEX isin_date_table_B ON Table_B (isin, date)
    INCLUDE (column_name)
UPDATE STATISTICS Table_A
UPDATE STATISTICS Table_B
If you keep the DATE columns of both tables in the same format in the JOIN condition, you should get the result you expect:
select A.*, B.column_name
from Table_A A
join Table_B B on to_date(A.date,'DD-MON-YY') = to_date(B.date,'DD-MON-YY')
where A.isin = B.isin

Can you use a Select statement in a join's ON clause?

I am trying to use a subquery to pick the column to join on. Is this even possible if, let's say, table_b has a value that equals a column name of table_a?
Please note that the example below does specify table_b.Column_A, but that is only to make my question clearer and less cluttered. The where condition will always return a single value/record.
EDIT: I am basically trying to create a dynamic ON clause, if that makes any sense.
Furthermore, the only relationship the tables have is that table_b contains table_a's column names as values.
SELECT *
FROM table_a a
INNER JOIN table_b b
ON a.(select column1 FROM table_b WHERE Column1 ='Column_A') = b.Column_A
Is this what you want?
SELECT *
FROM table_a a INNER JOIN
table_b b
ON (b.Column1 = 'Column_A' AND a.column1 = b.column_A) OR
(b.Column1 = 'Column_B' AND a.column1 = b.column_B) OR
(b.Column1 = 'Column_C' AND a.column1 = b.column_C)
You would have to list out all the columns directly. Also, JOINs with ORs generally have very poor performance.
You can express this more concisely using APPLY:
SELECT *
FROM table_b b CROSS APPLY
(SELECT 'Column_A' as colname, b.Column_A as colval FROM DUAL UNION ALL
SELECT 'Column_B', b.Column_B FROM DUAL UNION ALL
SELECT 'Column_C', b.Column_C FROM DUAL
) v JOIN
table_a a
ON a.column1 = v.colval
WHERE v.colname = b.Column1
Note that this version works in Oracle 12C+.
You can't use a select statement in the way you are trying to, but looking at your code you could add an AND condition, e.g.:
SELECT *
FROM table_a a
INNER JOIN table_b b ON a.column1 = b.Column_A
and b.Column1 ='Column_A'
Otherwise, if you want to build the query dynamically, you should build the SQL text as a string on the server side and then execute that string as a command.
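A minimal sketch of that dynamic approach, assuming Oracle to match the DUAL / 12c example above, and following the question's pseudocode where the looked-up value names the table_a column to join on; dbms_assert validates the identifier before it is concatenated:
declare
  l_col varchar2(128);
  l_cur sys_refcursor;
begin
  -- single-row lookup of the table_a column name stored in table_b
  select Column1 into l_col
  from table_b
  where Column1 = 'Column_A';

  open l_cur for
    'select a.* from table_a a join table_b b on a.'
    || dbms_assert.simple_sql_name(l_col)  -- guard against SQL injection
    || ' = b.Column_A';
end;
On SQL Server the same idea would use QUOTENAME on the column name and sp_executesql to run the built string.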

Oracle - Can this query be optimized?

I want to get the latest date of a set of rows. Which is more performant: Query1 or Query2?
Query1
select *
from(
select column_date
from table1 a join table2 b on a.column1=b.column1
where id= '1234'
order by column_date desc) c
where rownum=1
Query2
select column_date
from table1 a join table2 b on a.column1=b.column1
where id= '1234'
order by column_date desc
and take the first row in the backend.
Or maybe there is another way to take just the first row in Oracle? I know that subselects normally perform poorly; that's why I am trying to remove the subselect.
I tried the following, but I am not getting the expected result:
select column_date
from table1 a join table2 b on a.column1=b.column1
where id= '1234' and rownum=1
order by column_date desc
First, you can't really optimize a query. Queries are always rewritten by the optimizer and may give very different results depending on how much data there is, indexes, etc. So if you have a query that is slow, you must look at the execution plan to see what's happening. And if you have a query that is not slow, you shouldn't be optimizing it.
There's nothing wrong with subselects, per se. As Wernfriend Domscheit suggests, an aggregate will give you the latest column_date, which I assume resides in table2.
SELECT MAX( b.column_date )
FROM table1 a
INNER JOIN table2 b on a.column1 = b.column1
WHERE a.id = '1234'
That is guaranteed to give you a single row. If you need more than just the date field, this will select the rows with the latest date:
SELECT a.*, b.column_date
FROM table1 a
INNER JOIN table2 b on a.column1 = b.column1
WHERE a.id = '1234'
AND b.column_date = ( SELECT MAX( b2.column_date )
                      FROM table2 b2
                      WHERE b2.column1 = b.column1 )
But if your column_date is not unique, this may return multiple rows. If that's possible, you'll need something in the data to differentiate the rows and pick one. This is guaranteed to give you a single row:
SELECT * FROM (
    SELECT a.*, b.column_date
    FROM table1 a
    INNER JOIN table2 b on a.column1 = b.column1
    WHERE a.id = '1234'
    AND b.column_date = ( SELECT MAX( b2.column_date )
                          FROM table2 b2
                          WHERE b2.column1 = b.column1 )
    ORDER BY a.some_other_column
)
WHERE ROWNUM = 1
In a recent enough version of Oracle, you can use FETCH FIRST 1 ROW ONLY instead of the ROWNUM query. I don't think it makes a difference.
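For reference, a sketch of the FETCH FIRST form (Oracle 12c+), applied to the original ordered query and assuming, as above, that id lives on table1 and column_date on table2:
select b.column_date
from table1 a
join table2 b on a.column1 = b.column1
where a.id = '1234'
order by b.column_date desc
fetch first 1 row only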

Join based on temp column

I have two tables that need to be joined, but the only similar column has excess data that needs to be stripped. I would just modify the tables, but I only have read access to them. So, I strip the unneeded text out of the table and add a temp column, but I cannot join to it. I get the error:
Invalid column name 'TempJoin'
SELECT
CASE WHEN CHARINDEX('- ExtraText',a.Column1)>0 THEN LEFT(a.Column1, (CHARINDEX('- ExtraText', a.Column1))-1)
WHEN CHARINDEX('- ExtraText',a.Column1)=0 THEN a.Column1
END AS TempJoin
,a.Column1
,b.Column2
FROM Table1 as a
LEFT JOIN Table2 as b WITH(NOLOCK) ON b.Column2=TempJoin
The easiest way would be to wrap this in a CTE. Also, be careful using NOLOCK unless you have an explicit reason for it.
WITH cte AS (
SELECT
CASE WHEN CHARINDEX('- ExtraText',a.Column1) > 0
THEN LEFT(a.Column1, (CHARINDEX('- ExtraText', a.Column1))-1)
WHEN CHARINDEX('- ExtraText',a.Column1) = 0
THEN a.Column1
END AS TempJoin,
a.Column1
FROM Table1 AS a
)
SELECT *
FROM cte
LEFT JOIN Table2 AS b WITH(NOLOCK) ON b.Column2 = TempJoin;

How to limit records considered in a nested select within a join?

Curious to see if there is a way to write the following T-SQL statement (this one errors, saying it cannot bind TableA in the nested select). Removing the error line works, but it then seems to consider all records from TableB before performing the join.
select *
from TableA A
join (
select TableAid, TableBinfo
from TableB
where TableB.TableAid = A.TableAid -- error line
group by TableAid, TableBinfo
) B on
A.TableAid = B.TableAid
where A.TableAid = 123
Is the following SQL the best I can hope for?
I'd really like to limit the distinct comparison to just the one column in the one table rather than all the columns I select. I don't control the database and it doesn't have indexes on anything but primary keys.
select A.TableAid, B.TableBinfo
from TableA A
join TableB B on
A.TableAid = B.TableAid
where A.TableAid = 123
group by A.TableAid, B.TableBinfo
Your first example looks like you're trying to do an APPLY over a correlated subquery:
SELECT *
FROM TableA a
CROSS APPLY
(
SELECT t.TableBInfo
FROM TableB t
WHERE t.TableAid = a.TableAid
GROUP BY t.TableBinfo
) b
WHERE a.TableAId = 123