Optimising PostgreSQL Queries

I have two tables in my PostgreSQL database that I need to query. Table 1 has about 140 million records and table 2 has around 50 million records.
Table 1 has the following structure:
tr_id bigint NOT NULL, -- primary key
query_id numeric(20,0), -- indexed column
descrip_id numeric(20,0) -- indexed column
Table 2 has the following structure:
query_pk bigint, -- primary key
query_id numeric(20,0), -- indexed column
query_token numeric(20,0)
Sample data for table1:
tr_id | query_id | descrip_id
------|----------|-----------
1     | 25       | 96
2     | 28       | 97
3     | 27       | 98
4     | 26       | 99
Sample data for table2:
query_pk | query_id | query_token
---------|----------|------------
1        | 25       | 9554
2        | 25       | 9456
3        | 25       | 9785
4        | 25       | 9514
5        | 26       | 7412
6        | 26       | 7433
7        | 27       | 545
8        | 27       | 5789
9        | 27       | 1566
10       | 28       | 122
11       | 28       | 1456
I prefer queries that let me process tr_id in blocks, in ranges of 10,000, as this is my requirement.
I would like to get output in the following manner
25 {9554,9456,9785,9514}
26 {7412,7433}
27 {545,5789,1566}
28 {122,1456}
I tried the following:
select query_id,
       array_agg(query_token)
from sch.table2
where query_id in (select query_id
                   from sch.table1
                   where tr_id between 90001 and 100000)
group by query_id;
This query takes about 121346 ms, and when some 4 such queries are fired together it takes even longer. Can you please help me optimise it?
My machine runs Windows 7 with a 2nd-gen i7 processor and 8 GB of RAM.
The following is my PostgreSQL configuration:
shared_buffers = 1GB
effective_cache_size = 5000MB
work_mem = 2000MB
What should I do to optimise it?
Thanks
EDIT: it would be great if the results were ordered according to the following format:
25 {9554,9456,9785,9514}
28 {122,1456}
27 {545,5789,1566}
26 {7412,7433}
i.e. according to the order of the query_id values in table1 when ordered by tr_id. If this is computationally expensive, I could try to handle it in the client code instead, but I am not sure how efficient that would be.
Thanks

Query
I expect a JOIN to be much faster than the IN condition you have presently:
SELECT t2.query_id
      ,array_agg(t2.query_token) AS tokens
FROM   t1
JOIN   t2 USING (query_id)
WHERE  t1.tr_id BETWEEN 1 AND 10000
GROUP  BY t1.tr_id, t2.query_id
ORDER  BY t1.tr_id;
This also sorts the results as requested. query_token remains unsorted per query_id.
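If you also need the tokens sorted inside each array, a minimal variation (same tables and aliases as above) is to add an ORDER BY inside the aggregate:
SELECT t2.query_id
      ,array_agg(t2.query_token ORDER BY t2.query_token) AS tokens -- sorts tokens per array
FROM   t1
JOIN   t2 USING (query_id)
WHERE  t1.tr_id BETWEEN 1 AND 10000
GROUP  BY t1.tr_id, t2.query_id
ORDER  BY t1.tr_id;
Sorting inside the aggregate adds some cost, so only do it if the client actually needs ordered arrays.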
Indexes
Obviously you need indexes on t1.tr_id and t2.query_id.
You obviously have that one already:
CREATE INDEX t2_query_id_idx ON t2 (query_id);
A multicolumn index on t1 may improve performance (you'll have to test):
CREATE INDEX t1_tr_id_query_id_idx ON t1 (tr_id, query_id);
Server configuration
If this is a dedicated database server, you can raise the setting for effective_cache_size some more.
@Frank already gave advice on work_mem. I quote the manual:
Note that for a complex query, several sort or hash operations might
be running in parallel; each operation will be allowed to use as much
memory as this value specifies before it starts to write data into
temporary files. Also, several running sessions could be doing such
operations concurrently. Therefore, the total memory used could be
many times the value of work_mem;
It should be just big enough to be able to sort your queries in RAM. 10 MB is more than plenty to hold 10000 of your rows at a time. Set it higher, if you have queries that need more at a time.
With 8 GB on a dedicated database server, I would be tempted to set shared_buffers to at least 2 GB.
shared_buffers = 2GB
effective_cache_size = 7000MB
work_mem = 10MB
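These are cluster-wide settings in postgresql.conf. If a single query needs more sort memory, a sketch of a per-session or per-transaction override (standard PostgreSQL syntax; the 100MB figure is only illustrative):
-- raise work_mem for the current session only
SET work_mem = '100MB';
-- or only for the current transaction
BEGIN;
SET LOCAL work_mem = '100MB';
-- ... run the heavy query here ...
COMMIT;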
More advice on performance tuning in the Postgres Wiki.

Related

A more efficient way to sum the difference between columns in postgres?

For my application I have a table with these three columns: user, item, value
Here's some sample data:
user  item  value
------------------
1     1     50
1     2     45
1     23    35
2     1     88
2     23    44
3     2     12
3     1     27
3     5     76
3     23    44
What I need to do is, for a given user, perform simple arithmetic against everyone else's values.
Let's say I want to compare user 1 against everyone else. The calculation looks something like this:
first_user  second_user  result
1           2            SUM(ABS(50-88) + ABS(35-44))
1           3            SUM(ABS(50-27) + ABS(45-12) + ABS(35-44))
This is currently the bottleneck in my program. For example, many of my queries are starting to take 500+ milliseconds, with this algorithm taking around 95% of the time.
I have many rows in my database, and this is O(n^2) (it has to compare all of user 1's values against everyone else's matching values).
I believe I have only two options for how to make this more efficient. First, I could cache the results. But the resulting table would be huge because of the NxN space required, and the values need to be relatively fresh.
The second way is to make the algorithm much quicker. I searched for "postgres SIMD" because I think SIMD sounds like the perfect solution to optimize this. I found a couple related links like this and this, but I'm not sure if they apply here. Also, they seem to both be around 5 years old and relatively unmaintained.
Does Postgres have support for this sort of feature? Where you can "vectorize" a column or possibly import or enable some extension or feature to allow you to quickly perform these sorts of basic arithmetic operations against many rows?
I'm not sure where you get O(n^2) for this. You need to look up the rows for user 1 and then read the data for everyone else. Assuming there are few items and many users, this would be essentially O(n), where "n" is the number of rows in the table.
The query could be phrased as:
select t1.user, t.user, sum(abs(t.value - t1.value))
from t left join
     t t1
     on t1.item = t.item and
        t1.user <> t.user and
        t1.user = 1
group by t1.user, t.user;
For this query, you want an index on t(item, user, value).
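A sketch of that index (the index name is hypothetical, and user is quoted because it is a reserved word in Postgres):
-- composite index covering the join columns and the aggregated value
CREATE INDEX t_item_user_value_idx ON t (item, "user", value);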

Execution time select * vs select count(*)

I'm trying to measure execution time of a query, but I have a feeling that my results are wrong.
Before every query I execute: sync; echo 3 > /proc/sys/vm/drop_caches
My server log file results are:
2014-02-08 14:28:30 EET LOG: duration: 32466.103 ms statement: select * from partsupp
2014-02-08 14:32:48 EET LOG: duration: 9785.503 ms statement: select count(*) from partsupp
Shouldn't select count(*) take more time to execute since it makes more operations?
To output all the results from select * I need 4 minutes (not 32 seconds, as indicated by the server log). I understand that the client has to output a lot of data and it will be slow, but what about the server's log? Does it count output operations too?
I also used explain analyze and the results are (as expected):
select *: Total runtime: 13254.733 ms
select count(*): Total runtime: 13463.294 ms
I have run it many times and the results are similar.
What exactly does the log measure?
Why is there such a big difference for the select * query between explain analyze and the server's log, although it doesn't count I/O operations?
What is the difference between log measurement and explain analyze?
I have a dedicated server with Ubuntu 12.04 and PostgreSQL 9.1
Thank you!
Any aggregate function has some small overhead, but on the other hand SELECT * sends a lot of data to the client, depending on the number and size of the columns.
The log measures total query time. It can be similar to EXPLAIN ANALYZE, but it is often significantly faster, because EXPLAIN ANALYZE collects execution times (and execution statistics) for every subnode of the execution plan, which is usually significant overhead. On the other hand, EXPLAIN ANALYZE has no overhead from transporting data from the server to the client.
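For reference, duration lines like the ones in the question come from statement-duration logging; a minimal postgresql.conf sketch (the 0 ms threshold is only illustrative, a higher value logs only slow statements):
# log every completed statement together with its total duration
log_min_duration_statement = 0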
The first query asks for all rows in a table. Therefore, the entire table must be read.
The second query only asks for how many rows there are. The database can answer this by reading the entire table, but can also answer this by reading any index it has for that table. Since the index is smaller than the table, doing that would be faster. In practice, nearly all tables have indexes (because a primary key constraint creates an index, too).
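A quick way to see which approach the planner picks for the table from the question (note that on PostgreSQL 9.1, as in the question, count(*) still has to read the table; index-only scans only arrived in 9.2):
EXPLAIN SELECT count(*) FROM partsupp;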
select * = select all data, all columns included
select count(*) = count how many rows there are
For example, take this table:
name | id | address
--------------------
s    | 12 | abc
x    | 14 | cc
y    | 15 | vv
z    | 16 | ll
select * will display the whole table
select count(*) will display the total number of rows = 4
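As a concrete sketch against that example table (the table name people is only illustrative):
SELECT * FROM people;        -- returns all 4 rows with every column
SELECT count(*) FROM people; -- returns a single number: 4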

Performance of MERGE vs. UPDATE with subquery

Note that I've modified table/field names etc. for readability. Some of the original names are quite confusing.
I have three different tables:
Retailer (Id+Code is a unique key)
- Id
- Code
- LastReturnDate
- ...
Delivery/DeliveryHistory (combination of Date+RetailerId is unique)
- Date
- RetailerId
- HasReturns
- ...
Delivery and DeliveryHistory are almost identical. Data is periodically moved to the history table, and there's no surefire way to know when this last happened. In general, the Delivery-table is quite small -- usually less than 100,000 rows -- while the history table will typically have millions of rows.
My task is to update the LastReturnDate field for each retailer based on the current highest date value for which HasReturns is true in Delivery or DeliveryHistory.
Previously this has been solved with a view defined as follows:
SELECT Id, Code, MAX(Date) Date
FROM Delivery
WHERE HasReturns = 1
GROUP BY Id, Code
UNION
SELECT Id, Code, MAX(Date) Date
FROM DeliveryHistory
WHERE HasReturns = 1
GROUP BY Id, Code
And the following UPDATE statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code AND EXISTS (
SELECT * FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code
HAVING
MAX(Date) > LastReturnDate OR
(LastReturnDate IS NULL AND MAX(Date) IS NOT NULL))
The EXISTS-clause guards against updating fields where the current value is greater than the new one, but this is actually not a significant concern, because it's hard to see how that could ever happen during normal program execution. Note also how the AND Max(Date) IS NOT NULL part is in fact superfluous, since it's impossible for Date to be null in DeliveryView. But the EXISTS-clause appears to actually improve performance slightly.
However, the performance of the UPDATE has recently been horrendous. In a database where the Retailer table contains only 1000-2000 relevant entries, the UPDATE has been taking more than five minutes to run. It does this even if I remove the entire EXISTS clause, i.e. with this very simple statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code
I've therefore been looking into a better solution. My first idea was to create a temporary table, but after a while I tried to write it as a MERGE statement:
MERGE INTO Retailer
USING (SELECT Id, Code, MAX(Date) Date FROM DeliveryView GROUP BY Id, Code)
ON (Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHEN MATCHED THEN
UPDATE SET LastReturnDate = Date WHERE Code = :Code
This seems to work, and it's more than an order of magnitude faster than the UPDATE.
I have three questions:
Can I be certain that this will have the same effect as the UPDATE in all cases (disregarding the edge case of LastReturnDate already being larger than MAX(Date))?
Why is it so much faster?
Is there some better solution?
Query plans
MERGE plan
Cost: 25,831, Bytes: 1,143,828
Plain language
Every row in the table SCHEMA.Delivery is read.
The rows were sorted in order to be grouped.
Every row in the table SCHEMA.DeliveryHistory is read.
The rows were sorted in order to be grouped.
Return all rows from steps 2, 4 - including duplicate rows.
The rows from step 5 were sorted to eliminate duplicate rows.
A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by steps 6.
The rows were sorted in order to be grouped.
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 8.
Every row in the table SCHEMA.Retailer is read.
The result sets from steps 9, 10 were joined (hash).
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 11.
Rows were merged.
Rows were remotely merged.
Technical
Plan Cardinality Distribution
14 MERGE STATEMENT REMOTE ALL_ROWS
Cost: 25 831 Bytes: 1 143 828 3 738
13 MERGE SCHEMA.Retailer ORCL
12 VIEW SCHEMA.
11 HASH JOIN
Cost: 25 831 Bytes: 1 192 422 3 738
9 VIEW SCHEMA.
Cost: 25 803 Bytes: 194 350 7 475
8 SORT GROUP BY
Cost: 25 803 Bytes: 194 350 7 475
7 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 25 802 Bytes: 194 350 7 475
6 SORT UNIQUE
Cost: 25 802 Bytes: 134 550 7 475
5 UNION-ALL
2 SORT GROUP BY
Cost: 97 Bytes: 25 362 1 409
1 TABLE ACCESS FULL TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 94 Bytes: 210 654 11 703
4 SORT GROUP BY
Cost: 25 705 Bytes: 109 188 6 066
3 TABLE ACCESS FULL TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 16 827 Bytes: 39 333 636 2 185 202
10 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 653 390 2 230
UPDATE plan
Cost: 101,492, Bytes: 272,060
Plain language
Every row in the table SCHEMA.Retailer is read.
One or more rows were retrieved using index SCHEMA.DeliveryHasReturns . The index was scanned in ascending order.
Rows from table SCHEMA.Delivery were accessed using rowid got from an index.
The rows were sorted in order to be grouped.
One or more rows were retrieved using index SCHEMA.DeliveryHistoryHasReturns . The index was scanned in ascending order.
Rows from table SCHEMA.DeliveryHistory were accessed using rowid got from an index.
The rows were sorted in order to be grouped.
Return all rows from steps 4, 7 - including duplicate rows.
The rows from step 8 were sorted to eliminate duplicate rows.
A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by steps 9.
The rows were sorted in order to be grouped.
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 11.
Rows were updated.
Rows were remotely updated.
Technical
Plan Cardinality Distribution
14 UPDATE STATEMENT REMOTE ALL_ROWS
Cost: 101 492 Bytes: 272 060 1 115
13 UPDATE SCHEMA.Retailer ORCL
1 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 272 060 1 115
12 VIEW SCHEMA.
Cost: 90 Bytes: 52 2
11 SORT GROUP BY
Cost: 90 Bytes: 52 2
10 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 90 Bytes: 52 2
9 SORT UNIQUE
Cost: 90 Bytes: 36 2
8 UNION-ALL
4 SORT GROUP BY
Cost: 15 Bytes: 18 1
3 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 14 Bytes: 108 6
2 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHasReturns [Analyzed] ORCL
Cost: 2 12
7 SORT GROUP BY
Cost: 75 Bytes: 18 1
6 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 74 Bytes: 4 590 255
5 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHistoryHasReturns [Analyzed] ORCL
Cost: 6 509

Join More Than 2 Tables

I have three tables:
Table Data contains data for individual parts, coming from a "data.txt" file.
Table Limits contains the limits for the Data table, from a "limits.txt" file.
Table Files is a listing of each individual .txt file above.
So the "Files" table looks like this. As you can see it is a listing of each file that exists. The LimitsA file will contain the limits for every Data file of type A.
ID  File_Name  Type  Sub-Type
1   DataA_10   A     10
2   DataA_20   A     20
3   DataA_30   A     30
4   LimitsA    A     NONE
5   DataB_10   B     10
6   DataB_20   B     20
7   LimitsB    B     NONE
The "Data" table looks like this. The File_ID is the foreign key from the "Files" table. Specifically, this would be data for DataA_10 above:
ID  File_ID  Dat1  Dat2  Dat3 ... Dat20
1   1        50    52    53
2   1        12    43    52
3   1        32    42    62
The "Limits" table looks like this. The File_ID is the foreign key from the "Files" table. Specifically, this would be data for LimitsA above:
ID  File_ID  Sub-Type  Lim1  Lim2
1   4        10        40    60
2   4        20        20    30
3   4        30        10    20
So what I want to do is JOIN the correct limits from the "Limit" table to the data from the corresponding "Data" table. Each row of DataA_10 would have the limits of "40" and "60" from the LimitsA table. Unfortunately there is no way to directly link the limits table to the data table. The only way to do this would be to look back to the files table and see that LimitsA and DataA_10 are of type A. Once I link those two together I then need to specifically only grab the Limits for Sub-Type 10.
In the end I would like to have a result that looks like this.
Result:
ID  File_ID  Dat1  Dat2  Dat3 ... Dat20  Lim1  Lim2
1   1        50    52    53              40    60
2   1        12    43    52              40    60
3   1        32    42    62              40    60
I hope this is clear enough to understand. It seems to me like an issue of joining more than 2 tables, but I have been unable to find a suitable solution online as of yet. If you have a solution or any advice it would be greatly appreciated.
Your 'Files' table is actually 2 separate (but related) concepts that have been merged. If you break them out using subqueries you'll have a much easier time making a join. Note that joining like this is not the most efficient method, but then again neither is the given schema...
SELECT Data.*, Limits.Lim1, Limits.Lim2
FROM (SELECT * FROM Files WHERE SubType IS NOT NULL) DataFiles
JOIN (SELECT * FROM Files WHERE SubType IS NULL) LimitFiles
ON LimitFiles.Type = DataFiles.Type
JOIN Data
ON DataFiles.ID = Data.File_ID
JOIN Limits
ON LimitFiles.ID = Limits.File_ID
AND DataFiles.SubType = Limits.SubType
ORDER BY Data.File_ID
UPDATE
To be more specific on how to improve the schema: Currently, the Files table doesn't have a clear way to differentiate between Data and Limit file entries. Aside from this, the Data entries don't have a clear link to a single Limit file entry. Although both of these can be figured out as in the SQL above, such logic might not play well with the query optimizer, and certainly can't guarantee the Data-Limit link that you require.
Consider these options:
Instead of linking to a 'Limit' file via Type, link directly to a Limit entry Id. Set a foreign key on that link to ensure the expected Limit entry is available.
Separate the 'Limit' entries from the 'Data' entries by putting them in a separate table.
Create an index on the foreign key. For that matter, add indices for all foreign keys - SQL Server doesn't do this by default.
Of these, I would consider having a foreign key as essential, and the others as modest improvements.
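A minimal sketch of the direct-link and indexing suggestions (the constraint and index names are hypothetical, Limit_ID is the proposed new link column, and the syntax may need small tweaks for your RDBMS):
-- link each Data row directly to the Limits row that applies to it
ALTER TABLE Data ADD Limit_ID int;
ALTER TABLE Data ADD CONSTRAINT FK_Data_Limits
    FOREIGN KEY (Limit_ID) REFERENCES Limits (ID);
-- index the foreign keys used in joins
CREATE INDEX IX_Data_File_ID ON Data (File_ID);
CREATE INDEX IX_Data_Limit_ID ON Data (Limit_ID);
CREATE INDEX IX_Limits_File_ID ON Limits (File_ID);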

how does a SQL query work?

How does a SQL query work?
How does it get compiled?
Is the from clause compiled first to see if the table exists?
How does it actually retrieve data from the database?
How and in what format are the tables stored in a database?
I am using phpMyAdmin; is there any way I can peek into the files where the data is stored?
I am using MySQL
SQL execution order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
A SQL query mainly works in three phases:
1) Row filtering - phase 1: done by the FROM, WHERE, GROUP BY and HAVING clauses.
2) Column filtering: columns are filtered by the SELECT clause.
3) Row filtering - phase 2: done by the DISTINCT, ORDER BY and LIMIT clauses.
Here I will explain with an example. Suppose we have a students table as follows:
id_ | name_    | marks | section_
----|----------|-------|---------
1   | Julia    | 88    | A
2   | Samantha | 68    | B
3   | Maria    | 10    | C
4   | Scarlet  | 78    | A
5   | Ashley   | 63    | B
6   | Abir     | 95    | D
7   | Jane     | 81    | A
8   | Jahid    | 25    | C
9   | Sohel    | 90    | D
10  | Rahim    | 80    | A
11  | Karim    | 81    | B
12  | Abdullah | 92    | D
Now we run the following SQL query:
select section_,sum(marks) from students where id_<10 GROUP BY section_ having sum(marks)>100 order by section_ LIMIT 2;
Output of the query is:
section_ | sum
---------|----
A        | 247
B        | 131
But how did we get this output?
I have explained the query step by step. Please read below:
1. FROM, WHERE clause execution
The FROM clause works first, so from students where id_<10 eliminates the rows whose id_ is greater than or equal to 10. The following rows remain after executing from students where id_<10:
id_ | name_    | marks | section_
----|----------|-------|---------
1   | Julia    | 88    | A
2   | Samantha | 68    | B
3   | Maria    | 10    | C
4   | Scarlet  | 78    | A
5   | Ashley   | 63    | B
6   | Abir     | 95    | D
7   | Jane     | 81    | A
8   | Jahid    | 25    | C
9   | Sohel    | 90    | D
2. GROUP BY clause execution
Now the GROUP BY clause comes into play, so after executing GROUP BY section_ the rows are grouped as below:
id_ | name_    | marks | section_
----|----------|-------|---------
9   | Sohel    | 90    | D
6   | Abir     | 95    | D
1   | Julia    | 88    | A
4   | Scarlet  | 78    | A
7   | Jane     | 81    | A
2   | Samantha | 68    | B
5   | Ashley   | 63    | B
3   | Maria    | 10    | C
8   | Jahid    | 25    | C
3. HAVING clause execution
having sum(marks)>100 eliminates groups. sum(marks) of group D is 185, of group A is 247, of group B is 131, and of group C is 35. Group C's sum is not greater than 100, so group C is eliminated and the table looks like this:
id_ | name_    | marks | section_
----|----------|-------|---------
9   | Sohel    | 90    | D
6   | Abir     | 95    | D
1   | Julia    | 88    | A
4   | Scarlet  | 78    | A
7   | Jane     | 81    | A
2   | Samantha | 68    | B
5   | Ashley   | 63    | B
4. SELECT clause execution
select section_, sum(marks) only decides which columns to print; here it prints the section_ and sum(marks) columns.
section_ | sum
---------|----
D        | 185
A        | 247
B        | 131
5. ORDER BY clause execution
order by section_ sorts the rows in ascending order.
section_ | sum
---------|----
A        | 247
B        | 131
D        | 185
6. LIMIT clause execution
LIMIT 2 prints only the first 2 rows.
section_ | sum
---------|----
A        | 247
B        | 131
This is how we got our final output.
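A minimal, self-contained sketch to reproduce the walkthrough (MySQL syntax, since that is what the question uses; table and column names match the example above):
-- create and populate the example table
CREATE TABLE students (
    id_      INT PRIMARY KEY,
    name_    VARCHAR(50),
    marks    INT,
    section_ CHAR(1)
);
INSERT INTO students (id_, name_, marks, section_) VALUES
    (1, 'Julia', 88, 'A'), (2, 'Samantha', 68, 'B'), (3, 'Maria', 10, 'C'),
    (4, 'Scarlet', 78, 'A'), (5, 'Ashley', 63, 'B'), (6, 'Abir', 95, 'D'),
    (7, 'Jane', 81, 'A'), (8, 'Jahid', 25, 'C'), (9, 'Sohel', 90, 'D'),
    (10, 'Rahim', 80, 'A'), (11, 'Karim', 81, 'B'), (12, 'Abdullah', 92, 'D');
-- the query from the walkthrough; returns (A, 247) and (B, 131)
SELECT section_, SUM(marks)
FROM students
WHERE id_ < 10
GROUP BY section_
HAVING SUM(marks) > 100
ORDER BY section_
LIMIT 2;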
Well...
First you have a syntax check, followed by the generation of an expression tree - at this stage you can also test whether elements exist and "line up" (i.e. fields do exist WITHIN the table). This is the first step - any error here and you just tell the submitter to get real.
Then you have... analysis. A SQL query is different from a program in that it does not say HOW to do something, just WHAT THE RESULT IS. Set-based logic. So a query analyzer kicks in (ranging from bad to good depending on the product - Oracle had crappy ones for a long time, DB2 the most sophisticated ones, even measuring disc speed) to decide how best to produce this result. This is a really complicated beast - it may try dozens or hundreds of approaches to find the one it believes to be fastest (cost based, basically using statistics).
Then that gets executed.
The query analyzer, by the way, is where you see huge differences. Not sure about MySQL - SQL Server (Microsoft) shines not because it has the best one (though it is one of the good ones), but because it has really nice visual tools to SHOW the query plan and to compare the analyzer's estimates to the real figures (if they differ too much, table statistics may be off, so the analyzer THINKS a large table is small). They present that nicely visually.
DB2 has had a great optimizer for some time, measuring - as I already said - disc speed to feed into its estimates. Oracle went "left to right" (no real analysis) for a long time and relied on user-provided query hints (a crap approach). I think MySQL was VERY primitive at the start too - not sure where it is now.
The table format in the database etc. - that is really something you should not care about. It is documented (clearly, especially for an open-source database), but why should you care? I have done SQL work for nearly 15 years or so and never had that need, and that includes doing quite high-end work in some areas. Unless you are trying to build a database file repair tool... it makes no sense to bother.
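Since the question is about MySQL, a quick sketch of how to look at what the analyzer decided (the table and column names here are only illustrative):
-- shows the access type, the chosen index and the estimated row count
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;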
The order of SQL statement clause execution:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
My answer is specific to the Oracle database, which provides tutorials pertaining to your queries. When the SQL database engine processes any SQL query/statement, it first starts parsing, and within parsing it performs three checks: syntax, semantics and shared pool. To learn how these checks work, follow the link below.
Once parsing is done, the execution plan is produced. But the database engine is smart enough to check whether this SQL query has already been parsed (a soft parse); if so, it jumps straight to the execution plan, otherwise it dives deeper and optimizes the query (a hard parse). During a hard parse it also uses a component called the row source generator, which turns the optimizer's output into an iterative execution plan. The SQL query processing stages are summarized below.
Note: before execution it also performs bind operations for variable values, and once the query is executed it performs a fetch to obtain the records and finally stores them in the result set. So in short, the order is:
PARSE -> BIND -> EXECUTE -> FETCH
And for in-depth details, this tutorial is waiting for you.
This may be helpful to someone.
If you're using SSMS for SQL Server and want to know where your data files are stored, you can use this query:
SELECT
    mdf.database_id,
    mdf.name,
    mdf.physical_name AS data_file,
    ldf.physical_name AS log_file,
    db_size  = CAST((mdf.size * 8.0) / 1024 AS DECIMAL(8,2)),
    log_size = CAST((ldf.size * 8.0 / 1024) AS DECIMAL(8,2))
FROM (SELECT * FROM sys.master_files WHERE type_desc = 'ROWS') mdf
JOIN (SELECT * FROM sys.master_files WHERE type_desc = 'LOG') ldf
    ON mdf.database_id = ldf.database_id