Query for a table with large text columns - SQL

I've got a table in which some columns hold big text data. A query for 10 rows (the table has only 31 records) takes more than 20 seconds. If I remove the fields with big data, the query executes quickly. A query for a single row (by id) always executes quickly.
How can I make the query for many rows run faster?
The query looks like this:
SELECT DISTINCT (a.id), a.field_1, a.field_2, a.field_3
, a.field_4, a.field_5, a.field_6, ...
FROM table_a a, table_b b
WHERE a.field_8 = 'o'
ORDER BY a.field_2 DESC
LIMIT 10;

@a_horse already hinted at the likely syntax error. Try:
SELECT DISTINCT ON (a.id) a.id, a.field_1, a.field_2, a.field_3, ...
FROM table_a a
-- JOIN table_b b ON ???
WHERE a.field_8 = 'o'
ORDER BY a.id, a.field_2 DESC
LIMIT 10;
Note DISTINCT ON (a.id) as opposed to DISTINCT (a.id), and read up on the DISTINCT clause in the manual.
Also, an index on field_8 might help.
A multicolumn index on (field_8, id, field_2) might help even more, if you can narrow it down to that (and if that is the sort order you want, which I doubt).
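For example (a sketch; the index names are made up for illustration):
CREATE INDEX table_a_field_8_idx ON table_a (field_8);
CREATE INDEX table_a_f8_id_f2_idx ON table_a (field_8, id, field_2);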
If you want the result sorted by a.field_2 DESC first:
In PostgreSQL 9.1, if id is the primary key:
SELECT a.id, a.field_1, a.field_2, a.field_3, ...
FROM table_a a
-- JOIN table_b b ON ???
WHERE a.field_8 = 'o'
GROUP BY a.id -- primary key takes care of all columns in table a
ORDER BY a.field_2 DESC
LIMIT 10;

Why are you selecting from table_b? You don't join these tables!
Make a real join, like this:
SELECT DISTINCT
(a.id), a.field_1, a.field_2, a.field_3, a.field_4, a.field_5, a.field_6
FROM table_a a
INNER JOIN table_b b
ON b.field_on_table_b = a.field_on_table_a
WHERE a.field_8 = 'o'
ORDER BY a.field_2 DESC LIMIT 10
Then make sure that field_8 (in the WHERE clause) is backed by an index!

Drop duplicated rows in PostgreSQL

I am querying some data from the database, and my code looks as below:
select
    a.id,
    a.party,
    a.date,
    a.name,
    a.revenue,
    b.company,
    c.cost
from a
left join b on a.id = b.id
left join c on a.id = c.id
where a.party = 'cat' and a.date > '2000-01-01'
I got a returned table, but it has duplicated rows. Is there any way I can remove all duplicated rows (meaning the entire row is the same: if row 1 = row 2, remove row 1)?
I put select distinct at the top, but then it took forever to run. Not sure if some fundamental logic is wrong in this code.
If one row in a is related to two rows in b (because b.id is not unique), and both of these rows have the same company, your query result will contain duplicate rows. There is nothing wrong with that.
Removing duplicate rows with DISTINCT is expensive for big result sets, because the set has to be sorted.
Ideas to improve performance:
increase work_mem, that makes sorting faster
perhaps you don't need all the result rows, then adding a LIMIT clause will make the query faster
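For example (a sketch using the query from the question; the work_mem value and the LIMIT are illustrative assumptions, not recommendations):
SET work_mem = '256MB'; -- per-session setting; gives the sort behind DISTINCT more memory
select distinct a.id, a.party, a.date, a.name, a.revenue, b.company, c.cost
from a
left join b on a.id = b.id
left join c on a.id = c.id
where a.party = 'cat' and a.date > '2000-01-01'
limit 1000;             -- only if you don't actually need every row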
The usual deduplication approach with better performance is row_number() over (partition by ... order by ...) and then filtering on row_number = 1.
SELECT a_id, b_id, name, ...
FROM (
    SELECT a.id AS a_id, b.id AS b_id, b.name AS name,
           row_number() OVER (PARTITION BY a.id, b.id, b.name
                              ORDER BY create_date DESC) AS rownum
    FROM a
    JOIN b ON a.id = b.id
) rs
WHERE rs.rownum = 1
Note that you can partition by and order by any key columns (whatever you want to treat as unique); I used create_date to pick the latest row.
Also note that a huge number of rows can still hamper performance, but it's better than DISTINCT. And check whether you can apply DISTINCT before joining, as sketched below.
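For instance, a hypothetical sketch of deduplicating b before the join (columns assumed from the query above):
SELECT a.id AS a_id, b.id AS b_id, b.name
FROM a
JOIN (SELECT DISTINCT id, name FROM b) b ON a.id = b.id;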

SQL function to create a one-to-one match between two tables?

I am trying to join 2 tables. Table_A has ~145k rows whereas Table_B has ~205k rows.
They have two columns in common (i.e. ISIN and date). However, when I execute this query:
SELECT A.*,
       B.column_name
FROM Table_A A
JOIN Table_B B ON A.date = B.date
WHERE A.isin = B.isin
I get a table with more than 147k rows. How is it possible? Shouldn't it return a table with at most ~145k rows?
What you are seeing indicates that, for some of the records in Table_A, there are several records in Table_B that satisfy the join conditions (equality on the (date, isin) tuple).
To exhibit these records, you can do:
select B.date, B.isin
from Table_A A
join Table_B B on A.date = B.date and A.isin = B.isin
group by B.date, B.isin
having count(*) > 1
It's up to you to define how to handle those duplicates. For example:
if the duplicates have different values in column column_name, then you can decide to pull out the maximum or minimum value
or use another column to filter on the top or lower record within the duplicates
if the duplicates are true duplicates, then you can use select distinct in a subquery to dedup them before joining (see the sketch after this list)
... other solutions are possible ...
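For instance, a hedged sketch of that select distinct option (assuming column_name is the only Table_B column you need):
SELECT A.*, B.column_name
FROM Table_A A
JOIN (SELECT DISTINCT date, isin, column_name FROM Table_B) B
  ON A.date = B.date AND A.isin = B.isin;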
If you want one row per row of Table_A, then use OUTER APPLY:
SELECT A.*,
B.column_name
FROM Table_A a OUTER APPLY
(SELECT TOP (1) b.*
FROM Table_B b
WHERE A.date = B.date AND A.isin = B.isin
ORDER BY ? -- you can specify *which* row you want when there are duplicates
) b;
OUTER APPLY implements a lateral join. The TOP (1) ensures that at most one row is returned. The OUTER (as opposed to CROSS) ensures that nothing is filtered out. In this case, you could also phrase it as a correlated subquery.
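A hypothetical sketch of that correlated-subquery phrasing (the inner ORDER BY is an assumption; pick the column that defines which duplicate you want):
SELECT a.*,
       (SELECT TOP (1) b.column_name
        FROM Table_B b
        WHERE b.date = a.date AND b.isin = a.isin
        ORDER BY b.column_name) AS column_name
FROM Table_A a;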
All that said, your data does not seem to be what you really expect. You should figure out where the duplicates are coming from. The place to start is:
select b.date, b.isin, count(*)
from Table_B b
group by b.date, b.isin
having count(*) >= 2;
This will show you the duplicates, so you can figure out what to do about them.
Duplicate possibilities have already been discussed.
When millions of records are used in a join, poor cardinality estimates often mean the rows returned are not what you expect.
To address this, just change the join order:
SELECT A.*,
       B.column_name
FROM Table_A A
JOIN Table_B B
  ON A.isin = B.isin
 AND A.date = B.date
Also create a nonclustered index on both tables:
Create NonClustered index isin_date_table_A on Table_A (isin, date) include (/* comma-separated list of the Table_A columns required in the result set */)
Create NonClustered index isin_date_table_B on Table_B (isin, date) include (column_name)
Update STATISTICS Table_A
Update STATISTICS Table_B
If you keep the DATE columns of both tables in the same format in the JOIN condition, you should get the result as expected:
Select A.*, B.column_name
from Table_A A
join Table_B B on to_date(A.date, 'DD-MON-YY') = to_date(B.date, 'DD-MON-YY')
where A.isin = B.isin

Not able to join two tables with limit in Postgres

I have table A with col1, col2, col3 and table B with col1.
I want to join both tables using a limit.
I want something like:
select a.col1,a.col2,a.col3,b.col1
from tableA a, tableB b limit 5 and a.col1 between 1 AND 10;
So I have 10 records in table B and 10 in table A. I should get a total of 50 records by limiting to only 5 records from table B.
Your description translates to a CROSS JOIN:
SELECT a.col1, a.col2, a.col3, b.b_col1 -- unique column names
FROM tablea a
CROSS JOIN ( SELECT col1 AS b_col1 FROM tableb LIMIT 5 ) b;
-- WHERE a.col1 BETWEEN 1 AND 10; -- see below
... and LIMIT for tableb like a_horse already demonstrated. LIMIT without ORDER BY returns arbitrary rows. The result can change from one execution to the next.
To select random rows from tableb:
...
CROSS JOIN ( SELECT col1 AS b_col1 FROM tableb ORDER BY random() LIMIT 5) b;
If your table is big consider:
Best way to select random rows PostgreSQL
While you ...
have 10 records in ... table a
... the added WHERE condition is either redundant or wrong to get 50 rows.
And while SQL allows it, it rarely makes sense to have multiple result columns of the same name. Some clients throw an error right away. Use a column alias to make names unique.
You need a derived table (aka "sub-query") for that. In the derived table, you can limit the number of rows.
select a.col1, a.col2, b.col3, b.col1
from tablea a
join (
    select col3, col1
    from tableb
    limit 5 -- makes no sense without an ORDER BY
) b on b.some_column = a.some_column --<< you need a join condition
where a.col1 between 1 and 10;
Note that using LIMIT without an ORDER BY usually makes no sense.

Oracle - Can this query be optimized?

I want to get the last date of a set of rows. Which is more performant: Query1 or Query2?
Query1
select *
from (
    select column_date
    from table1 a join table2 b on a.column1 = b.column1
    where id = '1234'
    order by column_date desc
) c
where rownum = 1
Query2
select column_date
from table1 a join table2 b on a.column1=b.column1
where id= '1234'
order by column_date desc
and take the first row in the backend.
Or maybe there is another way to take the first row in Oracle? I know that subselects normally perform badly; that's why I am trying to remove the subselect.
I tried this, but I am not getting the expected result:
select column_date
from table1 a join table2 b on a.column1=b.column1
where id= '1234' and rownum=1
order by column_date desc
First, you can't really optimize a query. Queries are always rewritten by the optimizer and may behave very differently depending on how much data there is, what indexes exist, etc. So if you have a query that is slow, you must look at the execution plan to see what's happening. And if you have a query that is not slow, you shouldn't be optimizing it.
There's nothing wrong with subselects, per se. As Wernfried Domscheit suggests, this will give you the latest column_date, which I assume resides in table2:
SELECT MAX( b.column_date )
FROM table1 a
INNER JOIN table2 b ON a.column1 = b.column1
WHERE a.id = '1234'
That is guaranteed to give you a single row. If you need more than just the date field, this will select the rows with the latest date:
SELECT a.*, b.column_date
FROM table1 a
INNER JOIN table2 b ON a.column1 = b.column1
WHERE a.id = '1234'
AND b.column_date = ( SELECT MAX( b2.column_date )
                      FROM table2 b2
                      WHERE b2.column1 = b.column1 ) -- correlate, or the overall maximum may not belong to this id
But if your column_date is not unique, this may return multiple rows. If that's possible, you'll need something in the data to differentiate the rows to select. This is guaranteed to give you a single row:
SELECT * FROM (
    SELECT a.*, b.column_date
    FROM table1 a
    INNER JOIN table2 b ON a.column1 = b.column1
    WHERE a.id = '1234'
    AND b.column_date = ( SELECT MAX( b2.column_date )
                          FROM table2 b2
                          WHERE b2.column1 = b.column1 )
    ORDER BY a.some_other_column
)
WHERE ROWNUM = 1
In a recent enough version of Oracle, you can use FETCH FIRST 1 ROW ONLY instead of the ROWNUM query. I don't think it makes a difference.
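For example (a sketch against the same assumed tables and columns; Oracle 12c or later):
SELECT b.column_date
FROM table1 a
JOIN table2 b ON a.column1 = b.column1
WHERE a.id = '1234'
ORDER BY b.column_date DESC
FETCH FIRST 1 ROW ONLY;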

Efficient latest record query with PostgreSQL

I need to do a big query, but I only want the latest records.
For a single entry I would probably do something like
SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;
But I need to pull the latest record for each of a large number (thousands) of entries.
Here's what I have. It's not very efficient. I was wondering if there's a better way.
SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.
You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
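For instance, a hedged sketch of the extra-table idea (names assumed; you would still need to keep it refreshed):
CREATE TABLE b_latest AS
SELECT DISTINCT ON (id) *
FROM b
ORDER BY id, date DESC;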
This could be more efficient. The difference: the query for table b is executed only once, whereas your correlated subquery is executed for every row:
SELECT *
FROM table a
JOIN (SELECT ID, max(date) AS maxDate
      FROM table
      GROUP BY ID) b
  ON a.ID = b.ID AND a.date = b.maxDate
WHERE a.ID IN $LIST
What do you think about this? I used it a lot in the past:
select * from (
    SELECT a.*, row_number() over (partition by a.id order by date desc) r
    FROM table a where ID IN $LIST
) t -- PostgreSQL requires an alias for the derived table
WHERE t.r = 1
One method: create a small derived table containing the most recent update/insertion times on table a; call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use:
CREATE TABLE a_latest
( id   integer   NOT NULL,
  date timestamp NOT NULL,
  PRIMARY KEY (id, date) );
Then use a query similar to that suggested by najmeddine:
SELECT a.*
FROM table a
JOIN a_latest USING (id, date);
The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in PL/pgSQL is fairly easy to write; a hypothetical sketch follows below.
The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
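A minimal sketch of such a trigger (hypothetical, written against the a_latest definition above; the question's base table is literally called "table" here, hence the quoting):
CREATE OR REPLACE FUNCTION refresh_a_latest() RETURNS trigger AS $$
BEGIN
    -- drop any older entry for this id, then record the new time if none newer exists
    DELETE FROM a_latest WHERE id = NEW.id AND date < NEW.date;
    INSERT INTO a_latest (id, date)
    SELECT NEW.id, NEW.date
    WHERE NOT EXISTS (SELECT 1 FROM a_latest
                      WHERE id = NEW.id AND date >= NEW.date);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER a_latest_refresh
AFTER INSERT OR UPDATE ON "table"
FOR EACH ROW EXECUTE PROCEDURE refresh_a_latest();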
If you have many rows per id you definitely want a correlated subquery.
It will do one index lookup per id, which is faster than sorting the whole table.
Something like:
SELECT a.id,
       (SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2 a;
The table2 you use here is not the table from your query above, because you need a list of distinct ids for good performance. Since your ids are probably FKs into another table, use that one.
You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":
SELECT t.id FROM table t
WHERE NOT EXISTS
(SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)