PostgreSQL index reduces data size but makes the query slower - sql

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to perform aggregations on the whole table on a daily basis; the aggregation results will later be used for analytical reports in Google Data Studio.
One of the queries I'm trying to run looks as follows:
explain analyze
select tender->>'procurementMethodType' as procurement_method,
tender->>'status' as tender_status,
sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2
The query plan and execution time are the following:
The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:
create index on tenders((tender->>'procurementMethodType'), (tender->>'status'), (cast(tender->'value'->>'amount' as decimal)))
The size of the index is 44MB, which is much smaller than the size of the entire table, so I expect that the query should be much faster. However, when I run the same query with the index created, I get the following result:
The query with index is slower! How can this be possible?
EDIT: the table itself contains two columns: the ID column and the jsonb data column:
create table tenders (
id uuid primary key,
tender jsonb
)

The code that does an index-only scan is somewhat deficient in this case. It thinks it needs "tender" to be available in the index in order to satisfy cast(tender->'value'->>'amount' as decimal). It fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender". So it does a regular index scan, in which it has to jump from the index to the table for every row it returns, to fish out "tender" and then compute cast(tender->'value'->>'amount' as decimal). This means it jumps all over the table doing random I/O, which is much slower than just reading the table sequentially and then doing a sort.
You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table) if it can even be built, but would take away the need for a sort.
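As a rough sketch of the index suggested above: the index name is made up for illustration, and whether the build succeeds at all depends on how large the individual jsonb values are, since B-tree entries have a size limit. Whether the planner then chooses an index-only scan should be checked on the real data rather than assumed.

```sql
-- Hypothetical index name. The third key is the whole jsonb column,
-- so the index will be roughly as large as the table itself.
CREATE INDEX tenders_method_status_tender_idx
    ON tenders ((tender->>'procurementMethodType'),
                (tender->>'status'),
                tender);

-- Index-only scans rely on an up-to-date visibility map.
VACUUM (ANALYZE) tenders;

-- Re-check the plan and timing with the new index in place.
EXPLAIN (ANALYZE, BUFFERS)
SELECT tender->>'procurementMethodType' AS procurement_method,
       tender->>'status' AS tender_status,
       sum(cast(tender->'value'->>'amount' AS decimal)) AS total_expected_value
FROM tenders
GROUP BY 1, 2;
```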
But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?

Related

Index not used when Oracle SQL query has large number of selections in the IN(...) statement

I have a SELECT statement with a lot of selections in the IN(...) statement. The column for the IN statement has a nonunique index and is VARCHAR(50). When the number of elements in the IN statement goes over a certain threshold, the index is not used.
My select is structured like this
SELECT T.*, RANK() OVER
(PARTITION BY KEY_ID ORDER BY OBS_DATE ASC) AS XRANK
FROM "MY_TABLE" T WHERE KEY_ID IN('A','B','C')
But in reality there are a few hundred more elements in the IN statement and they are not called A, B, C.
If I reduce the number of items in my IN statement to 50, the index is used and the query takes 0.003s, returning 7k rows.
If I double the items in my IN statement to 100, the index is not used and a full table scan is performed, taking 0.4s to return 14k rows.
I'm not sure why the index is not used, but I want to see what would happen if it was, so I tried to experiment with a hint:
SELECT /*+ index(MY_TABLE,MY_INDEX) */ O.*, RANK() OVER ...blah blah
But the hint is ignored. When I run the explain plan it is not used and the query is the same speed.
Any advice would be appreciated, especially
Why is the index not being used when there is a higher number of elements in the IN statement?
Why is the hint being ignored?
Thanks.
It's because of selectivity. If the query is not selective enough, then a full table scan is usually better than multiple index range scans (not unique).
If the number of values is low, the cost-based optimizer will estimate that the combined cost of the multiple index range scans is still lower than the cost of a full table scan. If you add too many values, the cost of the index range scans will surpass the cost of the full table scan.
Now, the costs are relative, and it depends on how the optimizer is configured (hints), but also depends on the histogram and table stats. Are they up to date?
By the way, you say 7k and 14k rows are selected. What percentage of the table does that represent? If it's too high, the engine will read the table directly with a full scan and ignore the index.
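The questions about statistics and selectivity can be checked with a few dictionary queries. This is only a sketch: it assumes the table lives in the current schema under the name MY_TABLE, and the three-value IN list stands in for the real one.

```sql
-- When were statistics last gathered, and how many rows does the optimizer assume?
SELECT table_name, num_rows, last_analyzed
FROM   user_tables
WHERE  table_name = 'MY_TABLE';

-- Is there a histogram on the filtered column?
SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  table_name  = 'MY_TABLE'
AND    column_name = 'KEY_ID';

-- Regather table and index statistics if they look stale.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER,
                                tabname => 'MY_TABLE',
                                cascade => TRUE);
END;
/

-- Percentage of the table matched by the IN list.
SELECT ROUND(100 * SUM(CASE WHEN key_id IN ('A', 'B', 'C') THEN 1 ELSE 0 END)
             / COUNT(*), 2) AS pct_selected
FROM   "MY_TABLE";
```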
Having said all this, I think this is a really bad design. Instead of sending 100 parameters over the wire, can you produce those values from another table/select instead?
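One way to source the values from a table instead of a long IN list is sketched below. The table wanted_keys is hypothetical; a global temporary table is just one option, and an existing lookup table or subquery would serve the same purpose.

```sql
-- Hypothetical helper table holding the keys of interest for this session.
CREATE GLOBAL TEMPORARY TABLE wanted_keys (
    key_id VARCHAR2(50) PRIMARY KEY
) ON COMMIT PRESERVE ROWS;

INSERT INTO wanted_keys (key_id) VALUES ('A');
INSERT INTO wanted_keys (key_id) VALUES ('B');
INSERT INTO wanted_keys (key_id) VALUES ('C');

-- Same result as the IN list, expressed as a join.
SELECT t.*,
       RANK() OVER (PARTITION BY t.key_id ORDER BY t.obs_date ASC) AS xrank
FROM   "MY_TABLE" t
JOIN   wanted_keys k ON k.key_id = t.key_id;
```

The optimizer still decides between index access and a full scan based on its estimates, so this changes how the values are supplied, not the underlying selectivity.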

poorly performing query on order lines table

I have this query on the order lines table. It's a fairly large table. I am trying to get quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan:
Stats are up to date, we regather once a week.
Query taking 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using index now:
The query runs faster but not 'fast.' Completing in about 6 minutes.
UPDATE2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder why or if a function-based index would have also been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
Materialized view, possibly for the prior 11 months together with a live query against OE_ORDER_LINES_ALL for the current month-to-date.
A covering index (see below).
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
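The materialized view option listed above could look roughly like the following. This is a sketch only: the view name is invented, and it pre-aggregates per day rather than per month (an assumption made here so that the rolling 365-day window stays exact). Results are only as fresh as the last refresh.

```sql
-- Hypothetical materialized view: shipped quantity per item per day.
CREATE MATERIALIZED VIEW mv_shipped_by_item_day
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT inventory_item_id,
       TRUNC(actual_shipment_date)   AS ship_day,
       SUM(NVL(shipped_quantity, 0)) AS shipped_quantity
FROM   oe_order_lines_all
WHERE  cancelled_flag = 'N'
AND    fulfilled_flag = 'Y'
AND    actual_shipment_date IS NOT NULL
GROUP  BY inventory_item_id, TRUNC(actual_shipment_date);

-- The report then reads the much smaller pre-aggregated data.
SELECT inventory_item_id,
       SUM(shipped_quantity) AS shipped_quantity_last_365
FROM   mv_shipped_by_item_day
WHERE  ship_day >= TRUNC(SYSDATE) - 365
GROUP  BY inventory_item_id;

-- Complete refresh, for example from a nightly job.
BEGIN
  DBMS_MVIEW.REFRESH('MV_SHIPPED_BY_ITEM_DAY', 'C');
END;
/
```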
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can add the additional columns inventory_item_id and shipped_quantity to the index as well.
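Put together, that recommendation could be written as follows (the index name is hypothetical). The two equality columns come first and the range column after them, followed by the columns that are only read, so the query can be answered from the index alone.

```sql
-- Hypothetical index name; column order follows the recommendation above.
CREATE INDEX oe_order_lines_ship_idx
    ON oe_order_lines_all (cancelled_flag,
                           fulfilled_flag,
                           actual_shipment_date,
                           inventory_item_id,
                           shipped_quantity);
```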
Let's recapitulate the facts:
a) You access about 300K rows from your table (see the cardinality in the 3rd line of the execution plan)
b) you use a FULL TABLE SCAN to get the data
c) the query is very slow
The first thing is to check why the FULL TABLE SCAN is so slow. If the table is extremely large (check the BYTES in user_segments), you need to optimize the access to your data.
But remember that no index will help you get 300K rows out of, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not used much and a large part of it is on disk.
What you need is partitioning: in your case, range partitioning on actual_shipment_date, on a monthly or yearly basis depending on your data size.
This will eliminate the need to scan the old data (partition pruning) and make the query much more efficient.
Another possibility: if the number of rows is small but the table size is very large, you need to reorganize the table to get a better full scan time.
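The range partitioning suggested above might be declared along these lines. The table below is a simplified, hypothetical stand-in (not the real oe_order_lines_all definition), and the yearly boundaries are arbitrary examples.

```sql
-- Hypothetical, simplified table partitioned by shipment year.
CREATE TABLE order_lines_part (
    line_id              NUMBER NOT NULL,
    inventory_item_id    NUMBER NOT NULL,
    shipped_quantity     NUMBER,
    actual_shipment_date DATE,
    cancelled_flag       VARCHAR2(1),
    fulfilled_flag       VARCHAR2(1)
)
PARTITION BY RANGE (actual_shipment_date) (
    PARTITION p2021 VALUES LESS THAN (DATE '2022-01-01'),
    PARTITION p2022 VALUES LESS THAN (DATE '2023-01-01'),
    PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION pmax  VALUES LESS THAN (MAXVALUE)  -- also receives rows with a NULL date
);

-- A date-range predicate lets Oracle skip partitions that cannot match.
SELECT inventory_item_id,
       SUM(NVL(shipped_quantity, 0)) AS shipped_quantity_last_365
FROM   order_lines_part
WHERE  actual_shipment_date >= TRUNC(SYSDATE) - 365
AND    cancelled_flag = 'N'
AND    fulfilled_flag = 'Y'
GROUP  BY inventory_item_id;
```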

how to speed up a clustered index scan while selecting all fields on range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know select * is an evil performance killer.
I have a situation to select range of rows or all the rows having all the columns in it.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. On checking the execution plan, it still uses the clustered index scan.
Obviously 2 seconds may not be that bad; however, when this table has to be joined with other tables in a batch, execution takes over 15 minutes. (There are no duplicate records in the final result, as the count matches. The join criteria are pretty simple and yield no duplication.)
Excluding this table alone lets the batch complete in under a second.
Is there a way to optimize this, given that I have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance: one SELECT specifying all columns and one using *. There is no evidence (nor should there be) that there is any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using; they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions. They might only apply to some older version (of SQL Server, etc.) or to another method.

Oracle sql statement on very large table

I'm relatively new to SQL and I have a statement which takes forever to run.
SELECT
sum(a.amountcur)
FROM
custtrans a
WHERE
a.transdate <= '2013-12-31';
It's a large table, but the statement takes about 6 minutes!
Any ideas why?
Your select, as posted, will read 99% of the whole table (2013-12-31 was just a week ago, and I assume most entries are before that date and only very few after). If your table has many large columns (like varchar2(4000)), all that data will be read as well when Oracle scans the table. So you might read several KB per row just to get the 30 bytes you need for amountcur and transdate.
If this is your scenario, create a combined index on transdate and amountcur:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
With the combined index, Oracle can read the index to fulfill your query and doesn't have to touch the main table at all, which might result in considerably less data that needs to be read from disk.
Make sure the table has an index on transdate.
create index custtrans_idx on custtrans (transdate);
Also if this field is defined as a date in the table then do
SELECT sum(a.amountcur)
FROM custtrans a
WHERE a.transdate <= to_date('2013-12-31', 'yyyy-mm-dd');
If the table is really large, the query has to scan every row with a transdate on or before the given date.
Even if you have an index on transdate and it helps to stop the scan early (which it may not), when the number of matching rows is very high, it would take considerable time to scan them all and sum the values.
To speed things up, you could calculate partial sums, e.g. for each past month, assuming that your data is historical and the past does not change. Then you'd need to scan custtrans for only the last 1-2 months, quickly scan the table with monthly sums, and add the results.
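A minimal sketch of the partial-sum idea follows. The summary table custtrans_monthly and the cut-over date are assumptions chosen for illustration, and transdate is assumed to be a DATE column.

```sql
-- Hypothetical summary table: one row per closed month.
CREATE TABLE custtrans_monthly AS
SELECT TRUNC(transdate, 'MM') AS month_start,
       SUM(amountcur)         AS amount_sum
FROM   custtrans
WHERE  transdate < DATE '2013-12-01'
GROUP  BY TRUNC(transdate, 'MM');

-- Closed months come from the small summary table,
-- only the open month is read from custtrans itself.
-- NVL turns an empty side into 0 so the addition does not yield NULL.
SELECT (SELECT NVL(SUM(amount_sum), 0)
        FROM   custtrans_monthly
        WHERE  month_start < DATE '2013-12-01')
     + (SELECT NVL(SUM(amountcur), 0)
        FROM   custtrans
        WHERE  transdate >= DATE '2013-12-01'
        AND    transdate <= DATE '2013-12-31') AS total_amount
FROM   dual;
```

The summary table has to be extended whenever a month closes, and rebuilt if historical rows are ever changed.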
Try to create an index only on column amountcur:
CREATE INDEX myindex ON custtrans(amountcur)
In this case Oracle will most probably read only the index (Index Full Scan) and nothing else.
Correction, as mentioned in comment. It must be a composite Index:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
But maybe it is a bit useless to create an index just for a single select statement.
One option is to create an index on the column used in the where clause (this is useful if you want to retrieve only 10-15% rows by using indexed column).
Another option is to partition your table if it has millions of rows. Even then, it won't help if you try to retrieve 70-80% of the data.
The best option is first to analyze your requirements and then make a choice.
Whenever you deal with dates, it's better to use the to_date() function. Do not rely on implicit data type conversion.

Optimize a Sql query: filter an unindexed field

I have a table Orders that stores orders, with fields:
Id
Date
Amount
Cost
Currency
I tried the following query:
SELECT SUM(Amount)-SUM(NFC1)
FROM Orders
WHERE Date BETWEEN '20121101' AND '20121231'
AND Currency = 'EUR'
Now, according to Oracle SQL Developer, what makes the query slow is the Currency = 'EUR' filter, since the other operations have much lower cost.
I checked the indexes and I have an index on Id, and another index on Date.
It seems to me, by the query analysis, that the DBMS first finds the records matching the required dates and then scans the whole table to find the records having Currency='EUR'. Currency is a VARCHAR.
Is there any way to optimize the query? I mean, is there a way to avoid the full scan?
From a general point of view, is it possible to prevent the DBMS from performing a full table scan after records have already been filtered by date, but rather find the records that match the Currency among those who have already been filtered by date?
Thanks a lot
"It seems to me, by the query analysis, that the DBMS first finds the records matching the required dates and then scans the whole table to find the records having Currency='EUR'. Currency is a VARCHAR."
It does not scan the whole table.
Rather, it takes the row pointers (rowids, or the PRIMARY KEY values if the table is an IOT) from the index records and looks up the currency in the table rows in a nested loop. Since the index you're using does not contain Currency, it has to be looked up in the table to do the filtering.
To work around this, you would need to create a composite index on (Currency, Date).
If creating another index is not an option, you may try creating a MATERIALIZED VIEW and create an index on that.
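The composite index mentioned above could be created as sketched here. DATE is a reserved word in Oracle, so the date column is assumed to be called order_date in this example; the index name is likewise made up, and the query is simplified to a single SUM.

```sql
-- Hypothetical names: order_date stands in for the question's Date column.
-- The equality column goes first, the range column second.
CREATE INDEX orders_currency_date_idx
    ON orders (currency, order_date);

-- Both predicates can now be evaluated inside the index.
SELECT SUM(amount)
FROM   orders
WHERE  currency = 'EUR'
AND    order_date BETWEEN DATE '2012-11-01' AND DATE '2012-12-31';
```

Appending amount to the index would additionally let this simplified query avoid the table lookups, at the cost of a larger index.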
Build an index on the Currency field, or a compound index on Date and Currency if you often filter using both fields.
"Is there any way to optimize the query? I mean, is there a way to avoid the full scan?"
A full scan might just be the most optimized plan available. Filtering data from a large portion of the table is usually faster by full scanning the table. The database can use fast, sequential disk reads.