How to understand and avoid a FULL TABLE SCAN - SQL

I have a table that is responsible for storing log records.
The DDL is this:
CREATE TABLE LOG(
"ID_LOG" NUMBER(12,0) NOT NULL ENABLE,
"DATA" DATE NOT NULL ENABLE,
"OPERATOR_CODE" VARCHAR2(20 BYTE),
"STRUCTURE_CODE" VARCHAR2(20 BYTE),
CONSTRAINT "LOG_PK" PRIMARY KEY ("ID_LOG")
);
with these two indices:
CREATE INDEX STRUCTURE_CODE ON LOG ("OPERATOR_CODE");
CREATE INDEX LOG_01 ON LOG ("STRUCTURE_CODE", "DATA") ;
but this query produces a FULL TABLE SCAN:
SELECT log.data AS data1,
       OPERATOR_CODE,
       STRUCTURE_CODE
FROM log
WHERE data BETWEEN to_date('03/03/2008', 'DD/MM/YYYY')
               AND to_date('08/03/2015', 'DD/MM/YYYY')
  AND STRUCTURE_CODE = '1601';
Why do I always see a FULL TABLE SCAN on the DATA and STRUCTURE_CODE columns?
(I have also tried creating two separate indexes on STRUCTURE_CODE and DATA, but I still get a full table scan.)

Did you run stats on your new index and the table?
How much data is in that table and what percentage of it is likely to be returned by that query? Sometimes a full table scan is better for small tables or for queries that will return a large percentage of the data.

How many rows do you have in that table?
How many is returned by this query?
Please include the explain plan.
If loading the table and doing a full table scan (FTS) is cheaper (in IO cost) than using an index, the table will be loaded and an FTS will happen. [Basically the same as what Necreaux said.]
This can happen either if the table is small or if the expected result set is big.
What is small? An FTS will almost always happen if the table is smaller than DB_FILE_MULTIBLOCK_READ_COUNT. In this case, the table can usually be loaded into memory with one big read. It is not always an issue; check the IO cost in the explain plan.
What is big? If the table is pretty big and you will return most of it, it is cheaper to read the whole table in a few large IO calls than to make some index reads followed by a lot of tiny IO calls all around the table.
Blind guessing from your query (without the explain plan results): I think it would first consider an index range scan (over LOG_01) followed by a table access by rowid (to get OPERATOR_CODE, as it is not in the index), but it decides either that your table is too small, or that so many rows would be returned for that date range/structure_code that rolling through the table is cheaper (in IO cost terms).
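If the estimates rather than the data are the problem, two things are worth trying (a sketch only: the index name LOG_02 is made up, and whether the optimizer picks it still depends on how selective the predicates are):
-- refresh the optimizer statistics for the table and its indexes (from SQL*Plus)
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'LOG', cascade => TRUE);
-- a covering index: equality column first, range column second, and OPERATOR_CODE as a
-- trailing column, so the query can be answered from the index with no table access by rowid
CREATE INDEX LOG_02 ON LOG (STRUCTURE_CODE, DATA, OPERATOR_CODE);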

Related

Indexing not working with column using range operations in oracle

I have created an index on the timestamp column of my table, but when I query it and check the explain plan in Oracle, it does a full table scan rather than a range scan.
Below is the DDL script for the table:
CREATE TABLE EVENT (
event_id VARCHAR2(100) NOT NULL,
status VARCHAR2(50) NOT NULL,
timestamp NUMBER NOT NULL,
action VARCHAR2(50) NOT NULL
);
ALTER TABLE EVENT ADD CONSTRAINT PK_EVENT PRIMARY KEY ( event_id ) ;
CREATE INDEX IX_EVENT$timestamp ON EVENT (timestamp);
Below is the query used to get the explain plan:
EXPLAIN PLAN SET STATEMENT_ID = 'test3' FOR
select * from EVENT where timestamp between 1620741600000 and 1621900800000 and status = 'CANC';
SELECT * FROM PLAN_TABLE WHERE STATEMENT_ID = 'test3';
Here is the explain plan that Oracle returned:
I am not sure why the index is not being used here; it still does a full table scan even after creating the index on the timestamp column.
Can someone please help me with this?
Gordon is correct. You need this index to speed up the query you showed us.
CREATE INDEX IX_EVENT$status_timestamp ON EVENT (status, timestamp);
Why? Your query requires an equality match on status and then a range scan on timestamp. Without the possibility of using the index for the equality match, Oracle's optimizer seems to have decided it's cheaper to scan the table than the index.
Why did it decide that?
Who knows? Hundreds of programmers have been working on the optimizer for many decades.
Who cares? Just use the right index for the query.
The optimizer is cost based. So, conceptually, the optimizer evaluates all the available plans, estimates the cost of each, and picks the one with the lowest estimated cost. The costs are estimated based on statistics. The index statistics are automatically collected when an index is built. However, your table statistics may not reflect real life.
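If the table statistics are stale, a sketch of refreshing them from SQL*Plus (gathering table stats and cascading to the indexes; schema defaults assumed):
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'EVENT', cascade => TRUE);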
An Active SQL Monitor report will help you diagnose the issue.
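If you have the Tuning Pack licensed, a sketch of pulling that report for the statement (the sql_id placeholder has to be looked up first, e.g. in V$SQL):
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(sql_id => '&sql_id', type => 'ACTIVE') AS report
FROM dual;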

PostgreSQL index reduces data size but makes the query slower

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to perform aggregations on the whole table on a daily basis; the aggregation results will later be used for analytical reports in Google Data Studio.
One of the queries I'm trying to run looks as follows:
explain analyze
select tender->>'procurementMethodType' as procurement_method,
tender->>'status' as tender_status,
sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2
The query plan and execution time are the following:
The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:
create index on tenders((tender->>'procurementMethodType'), (tender->>'status'), (cast(tender->'value'->>'amount' as decimal)))
The size of the index is 44MB, which is much smaller than the size of the entire table, so I expected the query to be much faster. However, when I run the same query with the index in place, I get the following result:
The query with index is slower! How can this be possible?
EDIT: the table itself contains two columns: the ID column and the jsonb data column:
create table tenders (
id uuid primary key,
tender jsonb
)
The code that does an index only scan is somewhat deficient in this case. It thinks it needs "tender" to be available in the index in order to fulfill the demand for cast(tender->'value'->>'amount' as decimal). It fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender" itself. So it is doing a regular index scan, in which it has to jump from the index to the table for every row it will return, to fish out "tender" and then compute cast(tender->'value'->>'amount' as decimal). This means it is jumping all over the table doing random io, which is much slower than just reading the table sequentially and then doing a sort.
You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table) if it can even be built, but would take away the need for a sort.
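For reference, that suggested index would look something like this (a sketch; as noted above, it stores the whole tender document, so expect it to be roughly as large as the table, if it can be built at all):
create index on tenders ((tender->>'procurementMethodType'), (tender->>'status'), tender);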
But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?

Index Decreases Number of Rows Read; No performance Gain

I created a non-clustered, non-unique index on a date column of a large table (16 million rows), but I am getting very similar query speeds compared to the exact same query when it is forced not to use any indexes.
Query 1 (uses index):
SELECT *
FROM testtable
WHERE date BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY date
Query 2 (no index):
SELECT *
FROM testtable WITH(INDEX(0))
WHERE date BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY date
Both queries take the same amount of time to run and return the same result. Looking at the execution plan for each, Query 1 reads ~4 million rows, whereas Query 2 reads 106 million rows. It appears that the index is working, but I'm not gaining any performance benefit from it.
Any ideas as to why this is, or how to increase my query speed in this case would be much appreciated.
Create indexes with included columns (covering index):
This topic describes how to add included (or nonkey) columns to extend the functionality of nonclustered indexes in SQL Server by using SQL Server Management Studio or Transact-SQL. By including nonkey columns, you can create nonclustered indexes that cover more queries. This is because the nonkey columns have the following benefits:
They can be data types not allowed as index key columns.
They are not considered by the Database Engine when calculating the number of index key columns or index key size.
An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed resulting in fewer disk I/O operations.
CREATE NONCLUSTERED INDEX IX_your_index_name
ON testtable (date)
INCLUDE (col1,col2,col3);
GO
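A covering index only pays off if the query asks for nothing but the covered columns; with SELECT * the table still has to be touched for every row. A sketch under the assumption that only a few columns are actually needed (col1, col2, col3 are the same placeholders as in the index above):
SELECT date, col1, col2, col3
FROM testtable
WHERE date BETWEEN '01/01/2017' AND '03/01/2017'
ORDER BY date;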
You need to build an index around the needs of your query; this quick and free video course should bring you up to speed really quickly:
https://www.brentozar.com/archive/2016/10/think-like-engine-class-now-free-open-source/

Efficient inserts with duplicate checks for large tables in Postgres

I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
modem_serial CHARACTER VARYING(255) NOT NULL,
channel1 INTEGER NOT NULL,
channel2 INTEGER NOT NULL,
signal_strength INTEGER,
battery INTEGER,
excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial) created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS(
SELECT * FROM readings
WHERE timestamp=temporary_readings.timestamp
AND modem_serial=temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000 row block to insert. My question is twofold:
Is this the best way to approach the problem? I'm relatively new to projects with these sorts of performance demands, so I'm curious to know if there are better solutions.
What steps can I take to speed up the insert process?
Thanks in advance!
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the distinct on is slowing things down. Here are two ideas.
The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
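A sketch of that supporting index on the staging table (non-unique, matching the columns used in the duplicate check):
CREATE INDEX ON temporary_readings (timestamp, modem_serial);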
There is a third solution, which is to use ON CONFLICT. That allows the insertion itself to ignore duplicate values. This is only available in Postgres 9.5 and later, though.
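A sketch of the ON CONFLICT variant, relying on the existing _timestamp_modemserial_uc unique index; it silently skips rows that already exist as well as duplicates within the batch itself:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
ON CONFLICT (timestamp, modem_serial) DO NOTHING;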
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes, but they all cover the same two columns. Can't you manage with just the unique index?
I don't know what your other queries are like but your WHERE NOT EXISTS query can make use of this unique index.
If you have queries whose WHERE clause only filters on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index, it will be!
CREATE UNIQUE INDEX _modemserial_timestamp_uc ON readings USING BTREE (modem_serial, timestamp);
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The order of the columns in the index matters.

SQL Server 2005 - Index on updt_tmstmp field

I am investigating indexes and have read through many articles and would like some expert advice. As a warning, index fields are fairly new to me and a bit confusing even after reading up on the subject!
To simplify, I have a table that has a guid (transaction_id), an event_id, and an updt_tmstmp (there are many other fields, but they are unimportant to this question).
My PK is transaction_id plus event_id, and the table is ordered by these keys. Since transaction_id is a guid, the updt_tmstmp field is very randomized. As the table has grown to 6 million records, the query has slowed. My idea was to add an index on the updt_tmstmp field. Our extracts frequently search the table for the transaction_ids that have had updates in the past 24 hours, and the query is scanning the entire table to find the records that have been updated. Average time: 1 minute.
Details Current:
Table size: 6.2 million records
Index: transaction_id + event_id (clustered)
Details Attempted:
Additional Index: updt_tmstmp (non-unique, non-clustered)
When I added this index and ran the query, it improved by about 10%, and the execution plan indicates it is still scanning an entire index. I expected somewhat better performance than this. My updt_tmstmp is not guaranteed to be unique (I blame the application programmer for that :) ).
The query I am using to access this is a standard start_time/end_time range: updt_tmstmp >= @start_time AND updt_tmstmp < @end_time
Thanks in advance and have a great day!
Chris
Given a clustered index on event_id and non-clustered indexes on transaction_id and updt_tmstmp, I would recommend the following:
-- drop the clustered index on event_id and the non-clustered index on updt_tmstmp.
-- Create the clustered index on updt_tmstmp.
Logic: The SQL Server query optimizer has always favored clustered indexes over non-clustered indexes. The showplan for the query most likely shows a clustered index SCAN when event_id is the clustered index key. By moving the clustered index to updt_tmstmp, the query showplan should show a range search for all transactions in the last 24 hours and should do it quickly, because the cluster key is sorted and physically adjacent on disk... perfect for a range scan of a cluster.
By doing this, you will have accomplished a few key design goals.
Defined the clustered index key with as few columns as possible, using a column that is very selective or contains many distinct values.
-- Also... the updt_tmstmp field data will be accessed sequentially and rarely, if at all, updated. Perfect for a clustered index. The updt_tmstmp field will also be used frequently to sort the data retrieved from the table.
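A sketch of that re-clustering; the table and constraint names here are hypothetical, since the real ones are not shown in the question:
ALTER TABLE transactions DROP CONSTRAINT PK_transactions;            -- drop the clustered PK (transaction_id, event_id)
DROP INDEX IX_transactions_updt_tmstmp ON transactions;              -- drop the old non-clustered index on updt_tmstmp
CREATE CLUSTERED INDEX IX_transactions_updt_tmstmp ON transactions (updt_tmstmp);
ALTER TABLE transactions ADD CONSTRAINT PK_transactions
    PRIMARY KEY NONCLUSTERED (transaction_id, event_id);             -- keep the PK, just not clustered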
I suggest that you use the following before the query to assist you in understanding query optimizer behavior.
set statistics io on
go
set statistics time on
go
SET STATISTICS IO reports, for each table in the query:
Table - Name of the table.
Scan count - Number of index or table scans performed.
logical reads - Number of pages read from the data cache.
physical reads - Number of pages read from disk.
read-ahead reads - Number of pages placed into the cache for the query.
lob logical reads - Number of text, ntext, image, or large value type (varchar(max), nvarchar(max), varbinary(max)) pages read from the data cache.
lob physical reads - Number of text, ntext, image, or large value type pages read from disk.
lob read-ahead reads - Number of text, ntext, image, or large value type pages placed into the cache for the query.
SET STATISTICS TIME reports the SQL Server parse and compile time and the SQL Server execution time, each as CPU and elapsed time.