Note that I've modified table/field names etc. for readability. Some of the original names are quite confusing.
I have three different tables:
Retailer (Id+Code is a unique key)
- Id
- Code
- LastReturnDate
- ...
Delivery/DeliveryHistory (combination of Date+RetailerId is unique)
- Date
- RetailerId
- HasReturns
- ...
Delivery and DeliveryHistory are almost identical. Data is periodically moved to the history table, and there's no surefire way to know when this last happened. In general, the Delivery table is quite small -- usually fewer than 100,000 rows -- while the history table will typically have millions of rows.
My task is to update the LastReturnDate field for each retailer based on the current highest date value for which HasReturns is true in Delivery or DeliveryHistory.
Previously this has been solved with a view defined as follows:
SELECT Id, Code, MAX(Date) Date
FROM Delivery
WHERE HasReturns = 1
GROUP BY Id, Code
UNION
SELECT Id, Code, MAX(Date) Date
FROM DeliveryHistory
WHERE HasReturns = 1
GROUP BY Id, Code
And the following UPDATE statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code AND EXISTS (
SELECT * FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code
HAVING
MAX(Date) > LastReturnDate OR
(LastReturnDate IS NULL AND MAX(Date) IS NOT NULL))
The EXISTS clause guards against overwriting a LastReturnDate that is already greater than the new value. That is not actually a significant concern, because it's hard to see how that could ever happen during normal program execution. Note also that the MAX(Date) IS NOT NULL check is in fact superfluous, since Date can never be NULL in DeliveryView. Still, the EXISTS clause appears to improve performance slightly.
However, the performance of the UPDATE has recently been horrendous. In a database where the Retailer table contains only 1,000-2,000 relevant entries, the UPDATE has been taking more than five minutes to run. It does this even if I remove the entire EXISTS clause, i.e. with this very simple statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code
I've therefore been looking into a better solution. My first idea was to create a temporary table, but after a while I tried to write it as a MERGE statement:
MERGE INTO Retailer
USING (SELECT Id, Code, MAX(Date) Date FROM DeliveryView GROUP BY Id, Code) DeliveryView
ON (Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHEN MATCHED THEN
UPDATE SET LastReturnDate = DeliveryView.Date WHERE Retailer.Code = :Code
This seems to work, and it's more than an order of magnitude faster than the UPDATE.
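For completeness, the guard from the EXISTS clause could presumably be folded into the update branch of the MERGE as well. This is just a sketch and I haven't verified that it behaves identically:
MERGE INTO Retailer
USING (SELECT Id, Code, MAX(Date) Date FROM DeliveryView GROUP BY Id, Code) DeliveryView
ON (Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHEN MATCHED THEN
UPDATE SET LastReturnDate = DeliveryView.Date
WHERE Retailer.Code = :Code
-- only move the date forward, mirroring the EXISTS guard
AND (Retailer.LastReturnDate IS NULL OR DeliveryView.Date > Retailer.LastReturnDate)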
I have three questions:
Can I be certain that this will have the same effect as the UPDATE in all cases (disregarding the edge case of LastReturnDate already being larger than MAX(Date))?
Why is it so much faster?
Is there some better solution?
Query plans
MERGE plan
Cost: 25,831, Bytes: 1,143,828
Plain language
1. Every row in the table SCHEMA.Delivery is read.
2. The rows were sorted in order to be grouped.
3. Every row in the table SCHEMA.DeliveryHistory is read.
4. The rows were sorted in order to be grouped.
5. Return all rows from steps 2, 4 - including duplicate rows.
6. The rows from step 5 were sorted to eliminate duplicate rows.
7. A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by step 6.
8. The rows were sorted in order to be grouped.
9. A view definition was processed, either from a stored view SCHEMA. or as defined by step 8.
10. Every row in the table SCHEMA.Retailer is read.
11. The result sets from steps 9, 10 were joined (hash).
12. A view definition was processed, either from a stored view SCHEMA. or as defined by step 11.
13. Rows were merged.
14. Rows were remotely merged.
Technical
Plan Cardinality Distribution
14 MERGE STATEMENT REMOTE ALL_ROWS
Cost: 25 831 Bytes: 1 143 828 3 738
13 MERGE SCHEMA.Retailer ORCL
12 VIEW SCHEMA.
11 HASH JOIN
Cost: 25 831 Bytes: 1 192 422 3 738
9 VIEW SCHEMA.
Cost: 25 803 Bytes: 194 350 7 475
8 SORT GROUP BY
Cost: 25 803 Bytes: 194 350 7 475
7 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 25 802 Bytes: 194 350 7 475
6 SORT UNIQUE
Cost: 25 802 Bytes: 134 550 7 475
5 UNION-ALL
2 SORT GROUP BY
Cost: 97 Bytes: 25 362 1 409
1 TABLE ACCESS FULL TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 94 Bytes: 210 654 11 703
4 SORT GROUP BY
Cost: 25 705 Bytes: 109 188 6 066
3 TABLE ACCESS FULL TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 16 827 Bytes: 39 333 636 2 185 202
10 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 653 390 2 230
UPDATE plan
Cost: 101,492, Bytes: 272,060
Plain language
1. Every row in the table SCHEMA.Retailer is read.
2. One or more rows were retrieved using index SCHEMA.DeliveryHasReturns. The index was scanned in ascending order.
3. Rows from table SCHEMA.Delivery were accessed using rowid got from an index.
4. The rows were sorted in order to be grouped.
5. One or more rows were retrieved using index SCHEMA.DeliveryHistoryHasReturns. The index was scanned in ascending order.
6. Rows from table SCHEMA.DeliveryHistory were accessed using rowid got from an index.
7. The rows were sorted in order to be grouped.
8. Return all rows from steps 4, 7 - including duplicate rows.
9. The rows from step 8 were sorted to eliminate duplicate rows.
10. A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by step 9.
11. The rows were sorted in order to be grouped.
12. A view definition was processed, either from a stored view SCHEMA. or as defined by step 11.
13. Rows were updated.
14. Rows were remotely updated.
Technical
Plan Cardinality Distribution
14 UPDATE STATEMENT REMOTE ALL_ROWS
Cost: 101 492 Bytes: 272 060 1 115
13 UPDATE SCHEMA.Retailer ORCL
1 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 272 060 1 115
12 VIEW SCHEMA.
Cost: 90 Bytes: 52 2
11 SORT GROUP BY
Cost: 90 Bytes: 52 2
10 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 90 Bytes: 52 2
9 SORT UNIQUE
Cost: 90 Bytes: 36 2
8 UNION-ALL
4 SORT GROUP BY
Cost: 15 Bytes: 18 1
3 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 14 Bytes: 108 6
2 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHasReturns [Analyzed] ORCL
Cost: 2 12
7 SORT GROUP BY
Cost: 75 Bytes: 18 1
6 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 74 Bytes: 4 590 255
5 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHistoryHasReturns [Analyzed] ORCL
Cost: 6 509
Related
I have a project where I am taking Documents from one system and importing them into another.
The first system has the documents and associated keywords stored. I have a query that returns the results, which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved, so this means 1.8 million rows (one per document).
I need to divide the returned results into blocks of 40,000 so I can import them in batches of 40,000 at a time, rather than in one long import.
I have the query to return the results I need; I just need to know how to break it up for easier import. My apologies if I have included too little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
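If the batches are driven from a script or a loop, the same predicate can also take the bucket number as a bind variable instead of being hard-coded 45 times (a sketch):
-- :bucket runs from 0 to 44
select * from some_table where ora_hash(id, 44) = :bucket;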
A preferred way to bucket the IDs is to use the NTILE analytic function.
I'll demonstrate this on a simplified example with a table of 18 rows that should be divided into four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note that the IDs are not consecutive, so simple arithmetic on the ID can't be used. NTILE takes the requested number of buckets (4) as a parameter and calculates the chunk_id:
select id,
ntile(4) over (order by ID) as chunk_id
from tab
order by id;
ID CHUNK_ID
---------- ----------
1 1
2 1
3 1
7 1
8 1
9 2
10 2
15 2
16 2
17 2
18 3
19 3
20 3
21 3
23 4
24 4
25 4
26 4
18 rows selected.
The bucket sizes differ by at most one row; NTILE assigns the larger buckets first (here 5, 5, 4, 4).
If you want to calculate the ranges, use simple aggregation:
with chunk as (
select id,
ntile(4) over (order by ID) as chunk_id
from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
CHUNK_ID ID_FROM ID_TO
---------- ---------- ----------
1 1 8
2 9 17
3 18 21
4 23 26
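Applied to the original scenario, roughly the same idea could look like the sketch below. The table and column names (documents, document_id) are placeholders, and ntile(45) yields chunks of about 40,000 rows for 1.8 million documents:
with chunk as (
  select d.*,
         ntile(45) over (order by d.document_id) as chunk_id
  from documents d)
select *
from chunk
where chunk_id = :chunk_id;  -- import one chunk (1..45) at a time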
With example tables:
create table user_login (
user_id integer not null,
login_time numeric not null, -- seconds since epoch or similar
unique (user_id, login_time)
);
create table user_page_visited (
page_id integer not null,
page_visited_at numeric not null -- seconds since epoch or similar
);
with example data:
> user_login
user_id login_time
1 1 100
2 1 140
> user_page_visited
page_id page_visited_at
1 1 100
2 1 200
3 2 120
4 2 130
5 3 160
6 3 150
I wish to return all rows of user_page_visited that fall into a range based on user_login.login_time; for example, return all pages accessed within 20 seconds of an existing login_time:
> user_page_visited
page_id page_visited_at
1 1 100
3 2 120
5 3 160
6 3 150
How would I do this efficiently when both tables have lots of rows? For example, the following query does something similar (it returns duplicate rows when ranges overlap), but seems to be very slow:
select * from
user_login l cross join
user_page_visited v
where v.page_visited_at >= l.login_time
and v.page_visited_at <= l.login_time + 20;
First, use regular join syntax:
select *
from user_login l join
user_page_visited v
on v.page_visited_at >= l.login_time and
v.page_visited_at <= l.login_time + 20;
Next, be sure that you have indexes on the columns used for the join: user_login(login_time) and user_page_visited(page_visited_at).
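For example (the index names are just placeholders):
create index idx_user_login_login_time on user_login (login_time);
create index idx_user_page_visited_visited_at on user_page_visited (page_visited_at);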
If these don't work, then you still have a couple of options. If the "20" is fixed, you can vary the type of index. There are also tricks if you are only looking for one match between, say, the login and the page visited.
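For instance, if each page row should only appear once even when several login windows cover it, an EXISTS query along these lines avoids the duplicates from the join (a sketch, untested against your data):
select v.*
from user_page_visited v
where exists (
    select 1
    from user_login l
    where v.page_visited_at between l.login_time and l.login_time + 20
);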
This solution is based on the comments of the answer from Gordon Linoff.
First, we retrieve the tuples that were accessed in the same time slice as a user login, or in the following time slice, using the following query:
SELECT DISTINCT page_id, page_visited_at
FROM user_login
INNER JOIN user_page_visited ON login_time::INT / 20 = page_visited_at::INT / 20 OR login_time::INT / 20 = page_visited_at::INT / 20 - 1;
We now need indexes in order to get a good query plan:
CREATE INDEX i_user_login_login_time_20 ON user_login ((login_time::INT / 20));
CREATE INDEX i_user_page_visited_page_visited_at_20 ON user_page_visited ((page_visited_at::INT / 20));
CREATE INDEX i_user_page_visited_page_visited_at_20_minus_1 ON user_page_visited ((page_visited_at::INT / 20 - 1));
If you EXPLAIN the query with these indexes, you get a BitmapOr on two Bitmap Index Scan operations, with some low constant cost. On the other hand, without these indexes you get a sequential scan with a way higher cost (I tested with tables of ~100k tuples each).
However, this query returns too many results. We need to filter it again to get the final result:
SELECT DISTINCT page_id, page_visited_at
FROM user_login
INNER JOIN user_page_visited ON login_time::INT / 20 = page_visited_at::INT / 20 OR login_time::INT / 20 = page_visited_at::INT / 20 - 1
WHERE page_visited_at BETWEEN login_time AND login_time + 20;
Using EXPLAIN on this query shows that PostgreSQL still uses the Bitmap Index Scans.
With ~100k rows in user_login and ~200k rows in user_page_visited the query needs ~1.4s to retrieve ~200k rows versus 3.5s without the slice prefilter.
(uname -a: Linux shepwork 4.4.26-gentoo #8 SMP Mon Nov 21 09:45:10 CET 2016 x86_64 AMD FX(tm)-6300 Six-Core Processor AuthenticAMD GNU/Linux)
I was wondering if the community could help me optimize this query while keeping the same results. Currently, it takes roughly 22 minutes to return. I have tried a few different things, but they took longer.
Any help is appreciated!
GL_TYPE - SIZE 1MB 2409 ROWS
GL_DEFINITION - SIZE 1MB 53 ROWS
GL_JOURNAL - SIZE 1.24GB 5,725,500 or greater ROWS
SELECT MAX (CT.ENTITY_ID) ENTITY_ID,
MAX (CT.CASH_TYPE) RULE_CODE,
SUM (
(SELECT NVL (SUM (JB.CLOSING_BALANCE), 0)
FROM GL_JOURNAL JB
WHERE JB.GL_PRIME_ACCT = CD.GL_PRIME_ACCT
AND JB.GL_SUB_ACCT = CD.GL_SUB_ACCT
AND JB.ENTITY_ID = CT.ENTITY_ID
AND JB.GL_SYS_PERIOD = '201509'
AND JB.GL_BASIS = 'NA'
AND JB.GL_SOURCE <> '000')) as BEG_BALANCE
FROM GL_TYPE CT, GL_DEFINITION CD
WHERE CT.TYPE_CODE = CD.TYPE_CODE
GROUP BY CT.ENTITY_ID, CT.TYPE_CODE
SELECT STATEMENT
COST 7
SORT AGGREGATE
BYTES: 45 Cardinality: 1
TABLE ACCESS BY INDEX ROWID TABLE GL_JOURNAL
COST 10 BYTES: 45 Cardinality: 10
INDEX RANGE SCAN INDEX
COST 3 Cardinality: 14
HASH GROUP BY
COST 7 BYTES: 69,861 CARDINALITY:2,409
HASH JOIN
COST 5 BYTES: 412,467 CARDINALITY:14,223
INDEX FULL SCAN INDEX (UNIQUE)
COST 1 BYTES: 848 CARDINALITY:53
INDEX FAST FULL SCAN INDEX (UNIQUE)
COST 3 BYTES:31,317 CARDINALITY:2,409
I suspect that moving the subquery to the from clause would be a win:
SELECT MAX(CT.ENTITY_ID) as ENTITY_ID,
MAX(CT.CASH_TYPE) as RULE_CODE,
COALESCE(SUM(JB.CLOSING_BALANCE), 0) as BEG_BALANCE
FROM GL_TYPE CT JOIN
GL_DEFINITION CD
ON CT.TYPE_CODE = CD.TYPE_CODE LEFT JOIN
GL_JOURNAL JB
ON JB.GL_PRIME_ACCT = CD.GL_PRIME_ACCT AND
JB.GL_SUB_ACCT = CD.GL_SUB_ACCT AND
JB.ENTITY_ID = CT.ENTITY_ID AND
JB.GL_SYS_PERIOD = '201509' AND
JB.GL_BASIS = 'NA' AND
JB.GL_SOURCE <> '000'
GROUP BY CT.ENTITY_ID, CT.TYPE_CODE;
This query also wants to take advantage of indexes: GL_TYPE(TYPE_CODE), GL_DEFINITION(TYPE_CODE), GL_JOURNAL(GL_PRIME_ACCT, GL_SUB_ACCT, ENTITY_ID, GL_SYS_PERIOD, GL_BASIS, GL_SOURCE, CLOSING_BALANCE).
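For example, those indexes could be created along these lines (the names are placeholders, and the GL_JOURNAL column order is just one reasonable guess):
create index ix_gl_type_type_code on gl_type (type_code);
create index ix_gl_definition_type_code on gl_definition (type_code);
create index ix_gl_journal_lookup
    on gl_journal (gl_prime_acct, gl_sub_acct, entity_id, gl_sys_period,
                   gl_basis, gl_source, closing_balance);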
I am using SQLite.
I have a query which gets records after going through 6 different tables.
Each table contains many records.
The query below has been written based on the PK-FK relationships, but it's taking too much time to retrieve the data.
I can't do any altering or indexing on the database.
Select distinct A.LINK_ID as LINK_ID,
B.POI_ID
from RDF_LINK as A,
RDF_POI as B,
RDF_POI_ADDRESS as c,
RDF_LOCATION as d,
RDF_ROAD_LINK as e,
RDF_NAV_LINK as f
where B.[CAT_ID] = '5800'
AND B.[POI_ID] = c.[POI_ID]
AND c.[LOCATION_ID] = d.[LOCATION_ID]
AND d.[LINK_ID] = A.[LINK_ID]
AND A.[LINK_ID] = e.[LINK_ID]
AND A.[LINK_ID] = f.[LINK_ID]
Am I using the wrong method? Do I need to use IN?
EXPLAIN QUERY PLAN command output:
0 0 3 SCAN TABLE RDF_LOCATION AS d (~101198 rows)
0 1 0 SEARCH TABLE RDF_LINK AS A USING COVERING INDEX sqlite_autoindex_RDF_LINK_1 (LINK_ID=?) (~1 rows)
0 2 5 SEARCH TABLE RDF_NAV_LINK AS f USING COVERING INDEX sqlite_autoindex_RDF_NAV_LINK_1 (LINK_ID=?) (~1 rows)
0 3 4 SEARCH TABLE RDF_ROAD_LINK AS e USING COVERING INDEX NX_RDFROADLINK_LINKID (LINK_ID=?) (~2 rows)
0 4 1 SEARCH TABLE RDF_POI AS B USING AUTOMATIC COVERING INDEX (CAT_ID=?) (~7 rows)
0 5 2 SEARCH TABLE RDF_POI_ADDRESS AS c USING COVERING INDEX sqlite_autoindex_RDF_POI_ADDRESS_1 (POI_ID=? AND LOCATION_ID=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR DISTINCT
There is an AUTOMATIC index on RDF_POI.CAT_ID.
This means that the database thinks it is worthwhile to create a temporary index just for this query.
You should create this index permanently:
CREATE INDEX whatever ON RDF_POI(CAT_ID);
Furthermore, the CAT_ID lookup does not appear to have a high selectivity.
Run ANALYZE so that the database has a better idea of the shape of your data.
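In SQLite that is just the following statement, optionally naming a single table:
ANALYZE;            -- gather statistics for the whole database
ANALYZE RDF_POI;    -- or just for this table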
I have two tables in my PostgreSQL database that I have to query. Table 1 has about 140 million records and table 2 has around 50 million records, with the structures shown below.
Table 1 has the following structure:
tr_id bigint NOT NULL, # this is the primary key
query_id numeric(20,0), # indexed column
descrip_id numeric(20,0) # indexed column
and table 2 has the following structure:
query_pk bigint # this is the primary key
query_id numeric(20,0) # indexed column
query_token numeric(20,0)
Sample data for table1:
1 25 96
2 28 97
3 27 98
4 26 99
Sample data for table2:
1 25 9554
2 25 9456
3 25 9785
4 25 9514
5 26 7412
6 26 7433
7 27 545
8 27 5789
9 27 1566
10 28 122
11 28 1456
I prefer queries that work in blocks of tr_id, in ranges of 10,000, as that is my requirement.
I would like to get output in the following format:
25 {9554,9456,9785,9514}
26 {7412,7433}
27 {545,5789,1566}
28 {122,1456}
I tried the following:
select query_id,
array_agg(query_token)
from sch.table2
where query_id in (select query_id
from sch.table1
where tr_id between 90001 and 100000)
group by query_id
This query takes about 121,346 ms, and when some four such queries are fired together it takes even longer. Can you please help me optimise it?
My machine runs Windows 7 with a 2nd-gen i7 processor and 8 GB of RAM.
The following is my PostgreSQL configuration:
shared_buffers = 1GB
effective_cache_size = 5000MB
work_mem = 2000MB
What should I do to optimise it?
Thanks
EDIT: it would be great if the results were ordered in the following format:
25 {9554,9456,9785,9514}
28 {122,1456}
27 {545,5789,1566}
26 {7412,7433}
i.e. according to the order of the query_id values in table1 when ordered by tr_id. If this is computationally expensive, I could perhaps optimise it in the client code instead, but I am not sure how efficient that would be.
Thanks
Query
I expect a JOIN to be much faster than the IN condition you have presently:
SELECT t2.query_id
,array_agg(t2.query_token) AS tokens
FROM t1
JOIN t2 USING (query_id)
WHERE t1.tr_id BETWEEN 1 AND 10000
GROUP BY t1.tr_id, t2.query_id
ORDER BY t1.tr_id;
This also sorts the results as requested. query_token remains unsorted per query_id.
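If you also need the tokens sorted inside each array, you can add an ORDER BY inside the aggregate; a minor variation of the query above:
SELECT t2.query_id
,array_agg(t2.query_token ORDER BY t2.query_token) AS tokens
FROM t1
JOIN t2 USING (query_id)
WHERE t1.tr_id BETWEEN 1 AND 10000
GROUP BY t1.tr_id, t2.query_id
ORDER BY t1.tr_id;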
Indexes
Obviously you need indexes on t1.tr_id and t2.query_id.
You have the latter already:
CREATE INDEX t2_query_id_idx ON t2 (query_id);
A multicolumn index on t1 may improve performance (you'll have to test):
CREATE INDEX t1_tr_id_query_id_idx ON t1 (tr_id, query_id);
Server configuration
If this is a dedicated database server, you can raise the setting for effective_cache_size some more.
@Frank already gave advice on work_mem. I quote the manual:
Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem.
It should be just big enough to be able to sort your queries in RAM. 10 MB is more than plenty to hold 10000 of your rows at a time. Set it higher, if you have queries that need more at a time.
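You can also raise it just for one session or transaction when a particular query needs more, rather than globally (a sketch):
SET work_mem = '50MB';            -- for the rest of the session
-- or, limited to a single transaction:
BEGIN;
SET LOCAL work_mem = '50MB';
-- run the memory-hungry query here
COMMIT;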
With 8 GB on a dedicated database server, I would be tempted to set shared_buffers to at least 2 GB.
shared_buffers = 2GB
effective_cache_size = 7000MB
work_mem = 10MB
More advice on performance tuning in the Postgres Wiki.