poorly performing query on order lines table - sql

I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but it is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan:
Stats are up to date; we regather once a week.
The query takes 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using the index now:
The query runs faster, but not 'fast', completing in about 6 minutes.
UPDATE2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder whether a function-based index would also have been a viable solution, but I don't have time to play with it right now.

As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
Materialized view, possibly for the prior 11 months together with a live query against OE_ORDER_LINES_ALL for the current month-to-date (a sketch follows after the covering-index example below).
A covering index (see below).
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
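For the materialized view option, here is a minimal sketch. The view name, the daily grain, and the on-demand refresh policy are illustrative assumptions, not anything taken from your system; the idea is simply to pre-aggregate shipped quantity per item per day so the rolling 365-day total only has to read the small summary rows:
CREATE MATERIALIZED VIEW mv_shipped_qty_by_item_day
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND   -- schedule refreshes with DBMS_SCHEDULER; FAST refresh would need MV logs
AS
SELECT ool.inventory_item_id,
       TRUNC(ool.actual_shipment_date) AS shipment_day,
       SUM(NVL(ool.shipped_quantity, 0)) AS shipped_quantity
FROM   oe_order_lines_all ool
WHERE  ool.cancelled_flag = 'N'
AND    ool.fulfilled_flag = 'Y'
GROUP BY ool.inventory_item_id, TRUNC(ool.actual_shipment_date);
The rolling total then becomes a scan of the summary rows only:
SELECT inventory_item_id,
       SUM(shipped_quantity) AS shipped_quantity_last_365
FROM   mv_shipped_qty_by_item_day
WHERE  shipment_day >= TRUNC(SYSDATE) - 365
GROUP BY inventory_item_id;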

For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can also add the columns inventory_item_id and shipped_quantity to the index so that it covers the query.
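Spelled out as DDL, the two variants would be something like this (the index names are placeholders):
CREATE INDEX oe_lines_flags_ship_date_ix
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date);

CREATE INDEX oe_lines_flags_ship_cover_ix
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date,
                           inventory_item_id, shipped_quantity);
You would create one or the other, not both; the second is essentially the covering index described in the first answer, just with the flag columns leading.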

Let's recapitulate the facts:
a) You access about 300K rows from your table (see the cardinality in the 3rd line of the execution plan).
b) You use a FULL TABLE SCAN to get the data.
c) The query is very slow.
The first thing is to check why the FULL TABLE SCAN is so slow - if the table is extremely large (check BYTES in user_segments), you need to optimize the access to your data.
But remember: no index will help you get 300K rows from, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not much used and a large part of it is on disk.
What you need is partitioning - in your case, range partitioning on actual_shipment_date - for your data size on a monthly or yearly basis.
This will eliminate the need to scan the old data (partition pruning) and make the query much more effective.
The other possibility - if the number of rows is small but the table size is very large - is to reorganize the table to get a better full table scan time.
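A hedged sketch of what yearly range partitioning could look like here (the partition names and boundary dates are made up, and a real Oracle EBS table like OE_ORDER_LINES_ALL would normally be partitioned in place with DBMS_REDEFINITION rather than rebuilt with a CTAS like this):
CREATE TABLE oe_order_lines_all_part
PARTITION BY RANGE (actual_shipment_date)
( PARTITION p_2015    VALUES LESS THAN (DATE '2016-01-01'),
  PARTITION p_2016    VALUES LESS THAN (DATE '2017-01-01'),
  PARTITION p_2017    VALUES LESS THAN (DATE '2018-01-01'),
  PARTITION p_current VALUES LESS THAN (MAXVALUE) )  -- MAXVALUE also catches rows with a NULL shipment date
AS
SELECT * FROM oe_order_lines_all;
With the table laid out like this, the predicate actual_shipment_date >= trunc(sysdate) - 365 lets the optimizer prune to the newest one or two partitions instead of scanning the whole table.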

Related

PostgreSQL index reduces data size but makes the query slower

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to perform aggregations on the whole table on a daily basis, the aggregation results will later be used for analytical reports in Google Data Studio.
One of the queries I'm trying to run looks as follows:
explain analyze
select tender->>'procurementMethodType' as procurement_method,
tender->>'status' as tender_status,
sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2
The query plan and execution time are the following:
The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:
create index on tenders((tender->>'procurementMethodType'), (tender->>'status'), (cast(tender->'value'->>'amount' as decimal)))
The size of the index is 44MB, which is much smaller than the size of the entire table, so I expect that the query should be much faster. However, when I run the same query with the index created, I get the following result:
The query with index is slower! How can this be possible?
EDIT: the table itself contains two columns: the ID column and the jsonb data column:
create table tenders (
id uuid primary key,
tender jsonb
)
The code that does an index only scan is somewhat deficient in this case. It thinks it needs "tender" to be available in the index in order to fulfill the demand for cast(tender->'value'->>'amount' as decimal). It fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender" itself. So it is doing a regular index scan, in which it has to jump from the index to the table for every row it will return, to fish out "tender" and then compute cast(tender->'value'->>'amount' as decimal). This means it is jumping all over the table doing random io, which is much slower than just reading the table sequentially and then doing a sort.
You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table) if it can even be built, but would take away the need for a sort.
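Written out, that index would be the same expression index as before, with the whole tender column appended so an index-only scan becomes possible:
create index on tenders ((tender->>'procurementMethodType'), (tender->>'status'), tender);
Be aware that a btree entry cannot exceed roughly a third of a page, so sufficiently large tender documents can make this index fail to build - hence the caveat above.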
But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?

Possibilities of Query tuning in my case using SQL Server 2012

I have 2 tables called Sales and SalesDetails; SalesDetails has 90 million rows.
When I want to retrieve all records for 1 year, it takes almost 15 minutes and still hasn't completed.
I tried to retrieve records for 1 month; it took 1 minute 20 seconds and returned around 2.5 million records. I know that's huge.
Is there any solution to reduce the execution time?
Note
I don't want to create any index, because it already has enough indexes by default
I don't know what you mean when you say that you have indices "by default." As far as I know, creating the two tables you showed us above would not create any indices by default (other than maybe the clustered index).
That being said, your query is tough to optimize, because you are aggregating and taking sums. This behavior generally requires touching every record, so an index may not be usable. However, we may still be able to speed up the join using something like this:
CREATE INDEX idx ON sales (ID, Invoice) INCLUDE (Date, Register, Customer)
Assuming SQL Server chooses to use this index, it could scan salesDetails and then quickly lookup every record against this index (instead of the sales table itself) to complete the join. Note that the index covers all columns required by the select statement.

SQL-SERVER Query is taking too much time

I have this simple query, but it is taking 1 minute for just 0.5M records even though all the columns mentioned in the select are in a non-clustered index.
Both tables have approximately 1M records and about 200 columns each.
Is having lots of records in the table, or having lots of indexes, causing the slowness?
SELECT catalog_items.id,
catalog_items.store_code,
catalog_items.web_code AS web_code,
catalog_items.name AS name,
catalog_items.name AS item_description,
catalog_items.image_thumnail AS image_thumnail,
catalog_items.purchase_description AS purchase_description,
catalog_items.sale_description AS sale_description,
catalog_items.taxable,
catalog_items.is_unique_item,
ISNULL(catalog_items.inventory_posting_flag, 'Y') AS inventory_posting_flag,
catalog_item_extensions.total_cost,
catalog_item_extensions.price
FROM catalog_items
LEFT OUTER JOIN catalog_item_extensions ON catalog_items.id = catalog_item_extensions.catalog_item_id
WHERE catalog_items.trans_flag = 'A';
Update: the execution plan shows a missing index, but the same index is already there. Why?
I'm not convinced that the plan is wrong currently, on the basis that you mention selecting 500k rows out of a table of 1m rows. Even with an index as suggested by others, the selectivity of that index is pretty weak from a tipping-point perspective (https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/) - even with 200 columns, I wouldn't expect 500k out of 1m rows per table to result in index seeks with lookups; a full scan would be faster in the CBO's view.
The missing index question - if you look closely, it's not just suggesting an index on trans_flag; it's suggesting to index that field and then INCLUDE a number more. We can't see how many it's suggesting to include, but I would expect it to be all of the columns in the query - it's basically suggesting you create a covering index. Even in an NC index scan scenario, this would be faster to scan than the base table.
We also have no information about the physical layout as yet - how the pages are constructed, the level of fragmentation, or even what kind of disks the data is on and the overall size. That image_thumbnail field, for example, is suggestive of a large overall row size, which means we are dealing with off-page storage into LOB / SLOB.
In short - even with a query plan, there is no 'easy' answer here in my view.
For this query
select . . .
from catalog_items ci left outer join
catalog_item_extensions cie
on ci.id = cie.catalog_item_id
where ci.trans_flag = 'A'
I would recommend an index on catalog_items(trans_flag, id) and catalog_item_extensions(catalog_item_id).
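As T-SQL, that would be along these lines (the index names are placeholders, and the INCLUDE on the second index goes beyond the recommendation above, covering the two columns the query selects from that table):
CREATE INDEX IX_catalog_items_trans_flag
    ON catalog_items (trans_flag, id);

CREATE INDEX IX_catalog_item_ext_item
    ON catalog_item_extensions (catalog_item_id) INCLUDE (total_cost, price);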

Two very similar select statements, different performance

I've just come across some weird performance differences.
I have two selects:
SELECT s.dwh_end_date,
t.*,
'-1' as PROMOTION_DROP_EMP_CODE,
trunc(sysdate +1) as PROMOTION_END_DATE,
'K01' as PROMOTION_DROP_REASON,
-1 as PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
Which takes approximately 20 seconds.
And this one:
SELECT s.dwh_end_date,
s.dwh_product_key,
s.promotion_expire_date,
s.PROMOTION_DROP_EMP_CODE,
s.PROMOTION_END_DATE,
s.PROMOTION_DROP_REASON,
s.PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
That takes approximately 400 seconds
They are basically the same - it's just to assure that I've updated my data correctly (the first select is to update the FCT tables; the second select is to make sure everything updated correctly).
The only difference between these two selects is the columns I select. (The STG table has two columns - dwh_p_key and prom_expire_date.)
First select explain plan
Second select explain plan
What can cause this weird behaviour?
The FCT table is indexed UNIQUE (dwh_product_key, dwh_end_date) and partitioned by dwh_end_date (250 million records); the STG table doesn't have any indexes (and it's only 15k records).
Thanks in advance.
The plans are not exactly the same. The first query uses a fast full scan of the index on fct_customer_services and doesn't need to access any blocks from the actual table, since you only refer to the two indexed columns.
The second query does have to access the table blocks to get the other unindexed column values. It's doing a full table scan - slower and more expensive than a full index scan. The optimiser doesn't see any improvement from using the index and then accessing specific table rows, presumably because the cardinality is too high - it would need to access too many table rows to save any effort by hitting the index first. Doing so would be even slower.
So the second query is slower because it has to read the whole table from disk/cache rather than just the whole index, and the table is much larger than the index. You can look at the segments assigned to both objects (index and table) to see the ratio of their sizes.
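For example, something like this shows the relative sizes (assuming the objects are in your own schema; otherwise use dba_segments / dba_indexes with an owner filter):
SELECT segment_name,
       segment_type,
       ROUND(SUM(bytes) / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name = 'FCT_CUSTOMER_SERVICES'
   OR  segment_name IN (SELECT index_name
                        FROM   user_indexes
                        WHERE  table_name = 'FCT_CUSTOMER_SERVICES')
GROUP BY segment_name, segment_type;
Since the table is partitioned, the SUM rolls the individual partition segments up into one figure per object.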

how to speed up a clustered index scan while selecting all fields on range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know select * is an evil performance killer...
I have a situation to select range of rows or all the rows having all the columns in it.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. On checking the execution plan, it still uses a clustered index scan.
Obviously 2 seconds may not be that bad; however, when this table has to be joined with other tables in a batch execution, it takes over 15 minutes (there are no duplicate records on the final result at completion, as the counts match). The join criteria are pretty simple and yield no duplication.
Excluding this table alone, the batch execution completes in sub-seconds.
Is there a way to optimize this, given that I will have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance: one SELECT specifying all columns and one using *. There is no evidence (nor should there be) that there is any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using; they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server, etc.) or another method.
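One way to check that kind of assumption yourself is to compare the two forms with I/O and timing statistics switched on. A quick sketch against the Books table from the question (the explicit column list is abbreviated here and would need to name all 30 columns for a like-for-like comparison):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM Books;
SELECT BookId, Name, PublishedYear FROM Books;  -- list every column for a fair test

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
If both statements report the same logical reads and similar elapsed times, the * versus explicit-list distinction is doing nothing for performance; the real win only ever comes from dropping columns you don't need.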