I have the following PostgreSQL table with about 67 million rows, which stores EOD prices for all US stocks starting in 1985:
Table "public.eods"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
stk | character varying(16) | | not null |
dt | date | | not null |
o | integer | | not null |
hi | integer | | not null |
lo | integer | | not null |
c | integer | | not null |
v | integer | | |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to query the table above efficiently, based on either the stock name or the date. The primary key of the table is (stock name, date). I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance between the queries below. While getting all the records for a specific stock completes in a reasonable time (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have analyzed both queries with explain analyze and got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first. They retrieve a similar number of records, and both seem to be using an index. So could somebody please explain this difference in performance, and can I do something to improve the performance of the queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are packed onto a handful of pages that pretty much contain only that stk.
So the execution engine only reads a few hundred pages for the stock query (the first plan shows Heap Blocks: exact=367).
On the other hand, hardly any single page contains two records for the same date. When you read by date, you have to read roughly one page per row, about 8,000 pages. That is over twenty times the number of pages.
The scaling must also take into account the work of loading and reading the index. That work should be about the same for the two queries, which is why the observed ratio of about 28x is of the same order as the page ratio.
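You can check this theory directly: PostgreSQL records, for each column, the correlation between the column's values and the physical row order. A minimal sketch against the standard pg_stats view:

-- Correlation near 1.0 means rows with equal values sit together on
-- disk; near 0 means they are scattered across the whole table.
SELECT attname, correlation
FROM   pg_stats
WHERE  tablename = 'eods'
AND    attname IN ('stk', 'dt');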
There can be other issues, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, there may be one of the following problems:
an unhealthy index - try running REINDEX INDEX indexname
bad statistics - try running ANALYZE tablename
a suboptimal state of the table - try running VACUUM tablename
a too low or too high setting of effective_cache_size
IO issues - some systems have problems with high random IO; try increasing random_page_cost
Investigating the issue is a little bit of alchemy, but it is possible - there is only a closed set of very probable causes. A good start is:
VACUUM ANALYZE tablename
benchmarking your IO if possible (e.g. with bonnie++)
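Applied to the table from the question, that first pass might look like the following sketch (REINDEX locks the index while it runs, so pick a quiet period):

VACUUM ANALYZE eods;        -- reclaim dead space and refresh planner statistics
REINDEX INDEX eods_dt_idx;  -- rebuild the date index in case it is bloated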
To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.
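Applied to the slow query from the question, that is:

-- BUFFERS reports how many blocks each plan node read, and whether
-- they came from shared buffers (cache) or from disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';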
I can think of two reasons:
Bad statistics that make PostgreSQL believe that dt has a high correlation when it does not. If the correlation is low, a bitmap index scan is often more efficient.
To see if that is the problem, run
ANALYZE eods;
and see if that changes the execution plans chosen.
Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:
SET enable_indexscan = off;
Then repeat the query.
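In full, with a RESET so the setting does not leak into the rest of the session:

SET enable_indexscan = off;   -- steer the planner away from a plain index scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';
RESET enable_indexscan;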
We have 2 tables, Dispense and DispenseDetail. On Dispense there is a foreign key PatientID.
Most processes and queries are in the context of a patient.
Currently the dispense table has a non-clustered index (DispenseID, PatientID).
The DispenseDetail has DispenseDetailID as primary key and a non-clustered index (DispenseID).
We are noticing some slowness caused by page IO latch (SH) waits, as SQL Server has to bring data from disk into memory.
I am thinking about a clustered index (DispenseID, DispenseDetailID), which could help retrieve the dispense details of a particular patient, but it may slow down dispense inserts. Dispense inserts are more important, as without them there won't be data to query.
Will a non-clustered index (DispenseID,DispenseDetailID) help any?
Any comments or thoughts will be much appreciated.
Thanks!
Information for sqlonly:
The database is on a VM. There are 4 physical CPUs, each with 6 cores, totalling 24 cores, and there are 32 virtual CPUs.
The Dispense table has 4,000,000+ rows. The DispenseDetail table has 11,000,000+ rows.
I don't know how to compute or get the average page io latch wait time. Querying sys.dm_os_latch_stats and ordering by wait time, this is the result set:
latch_class waiting_requests_count wait_time_ms max_wait_time_ms
BUFFER 62658377 97584783 12051
ACCESS_METHODS_DATASET_PARENT 950195 7870081 19652
ACCESS_METHODS_HOBT_VIRTUAL_ROOT 799403 5071290 5692
BACKUP_OPERATION 785245 372930 206
LOG_MANAGER 7 40403 11235
ACCESS_METHODS_HOBT_COUNT 7959 19728 1587
NESTING_TRANSACTION_FULL 122342 7969 59
ACCESS_METHODS_ACCESSOR_CACHE 67877 5143 65
ACCESS_METHODS_BULK_ALLOC 1644 734 49
ACCESS_METHODS_HOBT 15 76 15
SPACEMGR_ALLOCEXTENT_CACHE 169 71 10
SPACEMGR_IAM_PAGE_RANGE_CACHE 68 49 4
NESTING_TRANSACTION_READONLY 1942 11 1
SERVICE_BROKER_WAITFOR_MANAGER 31 9 4
TRACE_CONTROLLER 1 1 1
APPEND_ONLY_STORAGE_FIRST_ALLOC 11 1 1
In dev I used the current indexes to get just the DispenseID and DispenseDetailID for the patient; the outcome is an index seek. However, that result set must be inserted into a temp table to get the other fields, and the insert into the temp table is costly, so the net result is no improvement.
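A covering non-clustered index might remove that temp-table step by carrying the extra fields at the leaf level. A sketch, where OtherField1 and OtherField2 are placeholders for the fields the query actually needs:

-- Sketch only: replace the INCLUDE list with the real columns.
-- Included columns are stored in the index leaf pages, so the seek
-- covers the query without key lookups or a temp table.
CREATE NONCLUSTERED INDEX IX_DispenseDetail_DispenseID_Covering
    ON dbo.DispenseDetail (DispenseID)
    INCLUDE (OtherField1, OtherField2);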
Thanks!
Hi, I'm curious why the index isn't used when the table holds more rows, even just 100.
Here's the select with 10 rows in the table:
mydb> explain select * from data where user_id=1;
+-----------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------|
| Index Scan using ix_data_user_id on data (cost=0.14..8.15 rows=1 width=2043) |
| Index Cond: (user_id = 1) |
+-----------------------------------------------------------------------------------+
EXPLAIN
Here's the select with 100 rows in the table:
mydb> explain select * from data where user_id=1;
+------------------------------------------------------------+
| QUERY PLAN |
|------------------------------------------------------------|
| Seq Scan on data (cost=0.00..44.67 rows=1414 width=945) |
| Filter: (user_id = 1) |
+------------------------------------------------------------+
EXPLAIN
How can I get the index to be used when there are 100 rows?
100 is not a large amount of data. Think 10,000 or 100,000 rows for a respectable amount.
To put it simply, records in a table are stored on data pages. A data page typically has about 8k bytes (it depends on the database and on settings). A major purpose of indexes is to reduce the number of data pages that need to be read.
If all the records in a table fit on one page, there is no need to reduce the number of pages being read. The one page will be read. Hence, the index may not be particularly useful.
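You can confirm that this is a cost-based decision rather than a broken index by penalizing sequential scans for the session and comparing the estimated costs:

SET enable_seqscan = off;  -- make sequential scans artificially expensive
EXPLAIN SELECT * FROM data WHERE user_id = 1;
RESET enable_seqscan;
-- If the index scan shown now has a higher estimated cost than the
-- earlier seq scan, the planner's original choice was correct.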
I have a table with list partitions on two columns, in the order MY_ID (an integer with values 1,2,3,5,8...1100) and RUN_DATE (the past few days).
My query is
select * from my_partitioned_table
where run_date = '10-sep-2014'
and my_id in (select my_id from mapping_table where category = 1)
;
It is going for a full table scan, with the following explain plan:
PX RECEIVE 115K 4M 600 1,01 PCWP
PX SEND BROADCAST :TQ10000 115K 4M 600 1,00 P->P BROADCAST
PX BLOCK ITERATOR 115K 4M 600 1,00 PCWC
TABLE ACCESS FULL MAPPING_TABLE 115K 4M 600 1,00 PCWP
PX BLOCK ITERATOR 1G 412G 34849 1,01 PCWC 1 16
TABLE ACCESS FULL MY_PARTITIONED_TABLE 1G 412G 34849 1,01 PCWP KEY KEY
How can I force it to access only the relevant partitions rather than doing a full table scan?
Sorry, I am a little new to Oracle hints and couldn't find a specific question on this before.
That query plan indicates that it is going after one (or more) partitions of my_partitioned_table. So partition pruning is happening already.
You've cut off the column headers when you posted your explain plan (it would also be helpful to get a fixed width version). But the last two columns are almost assuredly the start and end partitions. When you see KEY for a start or an end partition, that means that Oracle is determining the set of partitions that it actually needs to scan at runtime. In this case, it needs to determine the set of my_id values that your subquery will return before it can determine which partitions from your table need to be accessed. The TABLE ACCESS FULL bit merely indicates that it is going to do a full scan of the partition(s) that it needs to access.
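You can see the contrast by supplying the my_id values as literals (a sketch using the names from the question): with literals the optimizer can prune at parse time, and the plan shows fixed partition numbers instead of KEY.

-- Hard-coded IN list: partition pruning happens at parse time, so the
-- plan's partition start/stop columns show concrete numbers.
SELECT *
FROM   my_partitioned_table
WHERE  run_date = DATE '2014-09-10'
AND    my_id IN (1, 2, 3);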
Guys, I have the following Oracle SQL query that gives me a month-wise report between two dates. Basically, for November I want the sum of values between 01 Nov and 30 Nov.
The table being queried resides in another database and is accessed over a dblink. The DT column is of NUMBER type (for example 20101201).
SELECT /*+ PARALLEL(A 8) DRIVING_SITE(A) */
       TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM')-1,'MM'),'MONYYYY') "MONTH",
       TYPE AS "TYPE", COLUMN, COUNT(DISTINCT A) AS "A_COUNT",
       COUNT(COLUMN) AS NO_OF_COLS, SUM(DURATION) AS "SUM_DURATION",
       SUM(COST) AS "COST"
FROM   A#LN_PROD A
WHERE  DT >= TO_NUMBER(TO_CHAR(add_months(SYSDATE,-1),'YYYYMM"01"'))
AND    DT <  TO_NUMBER(TO_CHAR(SYSDATE,'YYYYMM"01"'))
GROUP  BY TYPE, COLUMN
The query has been executing for a day and has not completed. Kindly suggest any optimisation I can pass on to my DBA for the dblink, any tuning that can be done on the query, or a rewrite of it.
UPDATES ON THE TABLE
The table is partitioned on the date column and has almost 1 billion records.
Below I have given the EXPLAIN PLAN from TOAD:
Plan
SELECT STATEMENT REMOTE ALL_ROWSCost: 1,208,299 Bytes: 34,760 Cardinality: 790
12 PX COORDINATOR
11 PX SEND QC (RANDOM) SYS.:TQ10002 Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
10 SORT GROUP BY Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
9 PX RECEIVE Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
8 PX SEND HASH SYS.:TQ10001 Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
7 SORT GROUP BY Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
6 PX RECEIVE Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
5 PX SEND HASH SYS.:TQ10000 Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
4 SORT GROUP BY Cost: 1,208,299 Bytes: 34,760 Cardinality: 790
3 FILTER
2 PX BLOCK ITERATOR Cost: 1,203,067 Bytes: 15,066,833,144 Cardinality: 342,428,026 Partition #: 11 Partitions accessed #1 - #5
1 TABLE ACCESS FULL TABLE CDRR.FRD_CDF_DATA_INTL_IN_P Cost: 1,203,067 Bytes: 15,066,833,144 Cardinality: 342,428,026 Partition #: 11
The following are the things I am going to do today; any additional tips would be helpful.
Gather table-level statistics for this table, which may yield a better execution plan.
Check whether a local index is created for the partition.
Use BETWEEN instead of >= and <.
As usual for this type of question, an explain plan would be useful. It would help us work out what is actually going on in the database.
Ideally you want to make sure the query runs on the remote database and then sends the result set back, rather than sending the data across the link and running the query locally, so that less data travels over the link. The DRIVING_SITE hint can help with this, although Oracle is usually fairly smart about it, so it might not help at all.
Oracle seems to have got better at running remote queries, but there can still be problems.
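For reference, a sketch of the hint usage, assuming the remote table is reached through a dblink named remote_db (a placeholder; adjust to your environment). Note that Oracle only honours the first hint comment after SELECT, so multiple hints must share one comment:

-- DRIVING_SITE(a) asks Oracle to execute the query at the site that
-- owns alias a, sending only the result set back over the link.
SELECT /*+ DRIVING_SITE(a) PARALLEL(a 8) */
       a.type, SUM(a.cost) AS total_cost
FROM   a#ln_prod@remote_db a
WHERE  a.dt >= 20101101
AND    a.dt <  20101201
GROUP  BY a.type;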
Also, it might pay to simplify some of your date conversions.
For example, replace this:
TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM')- 1,'MM'),'MONYYYY')
with this:
TO_CHAR(add_months(TRUNC(SYSDATE,'MM'), -1),'MONYYYY')
It is probably slightly more efficient but also is easier to read.
Likewise replace this:
WHERE DT >=TO_NUMBER(TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM')-1,'MM'),'YYYYMMDD'))
AND DT < TO_NUMBER(TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM'),'MM'),'YYYYMMDD'))
with
WHERE DT >=TO_NUMBER(TO_CHAR(add_months(TRUNC(SYSDATE,'MM'), -1),'YYYYMMDD'))
AND DT < TO_NUMBER(TO_CHAR(TRUNC(SYSDATE,'MM'),'YYYYMMDD'))
or even
WHERE DT >=TO_NUMBER(TO_CHAR(add_months(SYSDATE,-1),'YYYYMM"01"'))
AND DT < TO_NUMBER(TO_CHAR(SYSDATE,'YYYYMM"01"'))
It may be caused by several issues:
1. Network speed, because the database may be residing on different hardware.
You can also refer to this link, which covers a similar issue:
http://www.experts-exchange.com/Database/Oracle/Q_21799513.html
Impossible to answer without knowing the table structure, constraints, indexes, data volume, resultset size, network speed, level of concurrency, execution plans etcetera.
Some things I would investigate:
If the table is partitioned, do statistics exist for the partition the query is hitting? A common problem is that statistics are gathered on an empty partition before data has been inserted. Then, when you query it (before the statistics are refreshed), Oracle chooses an index scan when in fact it should use an FTS on that partition.
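A sketch of gathering statistics for one partition (the partition name is a placeholder; use the one the query actually hits):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'CDRR',
    tabname     => 'FRD_CDF_DATA_INTL_IN_P',
    partname    => 'P_201012',        -- placeholder partition name
    granularity => 'PARTITION');
END;
/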
Also related to statistics: Make sure that
WHERE DT >=TO_NUMBER(TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM')-1,'MM'),'YYYYMMDD'))
AND DT < TO_NUMBER(TO_CHAR(TRUNC(TRUNC(SYSDATE,'MM'),'MM'),'YYYYMMDD'))
generates the same execution plan as:
WHERE DT >= 20101201
AND DT < 20110101
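One way to check is to explain each variant and compare the output (a sketch; the table name is taken from the plan in the question):

EXPLAIN PLAN FOR
SELECT COUNT(*)
FROM   FRD_CDF_DATA_INTL_IN_P
WHERE  DT >= 20101201
AND    DT <  20110101;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Repeat with the TO_NUMBER(TO_CHAR(...)) version of the predicates and compare the two plans.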
Updated
What version of Oracle are you on? The reason I'm asking is that on Oracle 10g and later there is another implementation of GROUP BY that should have been selected in this case (hashing rather than sorting). It looks like you are basically sorting the 342 million rows returned by the date filter (14 gigabytes). Do you have the RAM to back that up? If not, you will be doing a multipass sort, spilling to disk. This is likely what is happening.
According to the plan, about 790 rows will be returned. Is that in the right ballpark?
If so, you can rule out network issues :)
Also, I'm not entirely familiar with the format of that plan. Is the table sub-partitioned? Otherwise I don't get the partition #11 reference.