Does it make sense to partition my database table with 10,000,000 rows if I have a query that goes through every row? - sql

I have a postgres database with a table of reviews that has 15 columns and 10,000,000 rows of data.
**Columns**
id
product_id
_description
stars
comfort_level
fit
quality
recommend
created_at
email
_yes
_no
report
I want to get every review and put it onto my front end, but since that is a bit impractical, I decided to only get 4,000 using this query:
SELECT * FROM reviews ORDER BY created_at LIMIT 4000;
With an index, this is pretty fast (6.819 ms). I think this could be faster, so would partitioning help in this case? Or even in the case of retrieving all 10,000,000 reviews? Or would it make more sense to split my table and use JOIN clauses in my queries?

This query will certainly become slower with partitioning:
At best, you have the table partitioned by created_at, so that only one partition would be scanned. But the other partitions have to be excluded, which takes extra time during query planning or execution.
At worst, the table is not partitioned by created_at, and you have to scan all partitions.
Note that the speed of an index scan does not depend on the size of the table.
I don't know what your problem is. 7 milliseconds for 4000 rows doesn't sound so bad to me.
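For completeness, here is a sketch of the index that makes the question's ORDER BY ... LIMIT query fast, plus a keyset-pagination variant in case the front end pages through reviews. The tie-breaker on id and the example cursor values ('2024-01-01', 12345) are assumptions for illustration, not something the asker posted:

```sql
-- Index supporting ORDER BY created_at ... LIMIT n in Postgres
CREATE INDEX reviews_created_at_id_idx ON reviews (created_at, id);

-- First page (the query from the question, with id as a tie-breaker)
SELECT * FROM reviews ORDER BY created_at, id LIMIT 4000;

-- Subsequent pages: keyset pagination instead of OFFSET, so every page
-- is a short index scan regardless of table size.
-- ('2024-01-01', 12345) stands for the created_at and id of the last
-- row on the previous page.
SELECT * FROM reviews
WHERE (created_at, id) > ('2024-01-01', 12345)
ORDER BY created_at, id
LIMIT 4000;
```

With this shape, neither page depends on how many rows the table holds, which matches the point above that index scan speed does not depend on table size.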

10M rows is at the top end of the range for a table classed as "small". I wouldn't worry about anything fancy until you get to at least 50M rows.

Related

poorly performing query on order lines table

I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID,
       SUM(NVL(OOL.shipped_QUANTITY, 0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where OOL.actual_shipment_date >= trunc(sysdate) - 365
  and cancelled_flag = 'N'
  and fulfilled_flag = 'Y'
group by OOL.inventory_item_id;
Explain plan:
Stats are up to date; we regather once a week.
The query takes 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using index now:
The query runs faster, but not 'fast', completing in about 6 minutes.
UPDATE2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder why, or whether, a function-based index would also have been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
Materialized view, possibly for the prior 11 months together with a live query against OE_ORDER_LINES_ALL for the current month-to-date.
A covering index (see below).
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can add the additional columns inventory_item_id and quantity_shipped to the index as well.
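Written out as DDL, the two-step suggestion above might look like this (the index names are made up):

```sql
-- Step 1: index that identifies the qualifying rows
CREATE INDEX ix_ool_flags_date ON oe_order_lines_all
  (cancelled_flag, fulfilled_flag, actual_shipment_date);

-- Step 2 (optional): extend it into a covering index so Oracle can
-- answer the query from the index alone, never touching the table
CREATE INDEX ix_ool_covering ON oe_order_lines_all
  (cancelled_flag, fulfilled_flag, actual_shipment_date,
   inventory_item_id, shipped_quantity);
```

Note the column order differs from the index shown earlier: leading with the two equality predicates (the flags) before the range predicate on actual_shipment_date keeps the matching rows in one contiguous slice of the index.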
Let's recapitulate the facts:
a) You access about 300K rows from your table (see the cardinality in the 3rd line of the execution plan).
b) You use a FULL TABLE SCAN to get the data.
c) The query is very slow.
The first thing is to check why the FULL TABLE SCAN is so slow: if the table is extremely large (check BYTES in USER_SEGMENTS), you need to optimize the access to your data.
But remember, no index will help you get 300K rows from, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not much used and a large part of it is on disk.
What you need is partitioning: in your case, range partitioning on actual_shipment_date, on a monthly or yearly basis for your data size.
This will eliminate the need to scan the old data (partition pruning) and make the query much more effective.
Another possibility, if the number of rows is small but the table size is very large: reorganize the table to get a better full-scan time.
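A sketch of what that range partitioning could look like, using Oracle interval partitioning so new monthly partitions are created automatically. The table name, abbreviated column list, and partition name here are illustrative, not taken from the actual schema:

```sql
CREATE TABLE oe_order_lines_part (
  line_id              NUMBER,
  inventory_item_id    NUMBER,
  shipped_quantity     NUMBER,
  actual_shipment_date DATE,
  cancelled_flag       VARCHAR2(1),
  fulfilled_flag       VARCHAR2(1)
  -- ... remaining columns
)
PARTITION BY RANGE (actual_shipment_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_hist VALUES LESS THAN (DATE '2015-01-01'));
```

With this in place, a predicate such as actual_shipment_date >= TRUNC(SYSDATE) - 365 prunes the scan down to roughly the last 12 monthly partitions instead of the whole table.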

Possibilities of Query tuning in my case using SQL Server 2012

I have 2 tables called Sales and SalesDetails; SalesDetails has 90 million rows.
When I want to retrieve all records for 1 year, it takes almost 15 minutes and still hasn't completed.
I tried to retrieve records for 1 month; it took 1 minute 20 seconds and returned around 2.5 million records. I know that's huge.
Is there any solution to reduce the execution time?
Note
I don't want to create any index, because it already has enough indexes by default
I don't know what you mean when you say that you have indices "by default." As far as I know, creating the two tables you showed us above would not create any indices by default (other than maybe the clustered index).
That being said, your query is tough to optimize, because you are aggregating and taking sums. This behavior generally requires touching every record, so an index may not be usable. However, we may still be able to speed up the join using something like this:
CREATE INDEX idx ON sales (ID, Invoice) INCLUDE (Date, Register, Customer)
Assuming SQL Server chooses to use this index, it could scan salesDetails and then quickly lookup every record against this index (instead of the sales table itself) to complete the join. Note that the index covers all columns required by the select statement.

how to speed up a clustered index scan while selecting all fields on range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know SELECT * is an evil performance killer...
I have a situation where I need to select a range of rows, or all the rows, with all the columns.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. On checking the execution plan, it still uses a Clustered Index Scan.
Obviously 2 seconds may not be that bad; however, when this table has to be joined with other tables in a batch, the execution takes over 15 minutes (there are no duplicate records in the final result, as the counts match). The join criteria are pretty simple and yield no duplication.
Excluding this table alone, the batch execution completes in sub-seconds.
Is there a way to optimize this, given that I will have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance, one SELECT specifying all columns and one using *. There is no evidence (nor should there be) of any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using; they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server, etc.) or another method.

How to utilize table partition in oracle database in effective manner?

I have created a partitioned table as
CREATE TABLE orders_range(order_id NUMBER
,client_id NUMBER
,order_date DATE)
PARTITION BY RANGE(order_date)
(PARTITION orders2011 VALUES LESS THAN (to_date('1/1/2012','dd/mm/yyyy'))
,PARTITION orders2012 VALUES LESS THAN (to_date('1/1/2013','dd/mm/yyyy'))
,PARTITION orders2013 VALUES LESS THAN (MAXVALUE));
When I select the records using
SELECT * FROM ORDERS_RANGE PARTITION (orders2011);
the CPU cost in the explain plan is 75, but when I go for a normal query using a WHERE clause, the CPU cost is only 6. So what is the advantage of table partitioning when it comes to performance?
Can anyone explain this to me in detail?
Thanks in advance.
First, you generally can't directly compare the cost of two different plans running against two different objects. It is entirely possible that one plan with a cost of 10,000 will run much more quickly than a different plan with a cost of 10. You can compare the cost of two different plans for a single SQL statement within a single 10053 trace (so long as you remember that these are estimates and if the optimizer estimates incorrectly, many cost values are incorrect and the optimizer is likely to pick a less efficient plan). It may make sense to compare the cost between two different queries if you are trying to work out the algorithm the optimizer is using for a particular step but that's pretty unusual.
Second, in your example, you're not inserting any data. Generally, if you're going to partition a table, you're doing so because you have multiple GB of data in that table. If you compare something like
SELECT *
FROM unpartitioned_table_with_1_billion_rows
vs
SELECT *
FROM partitioned_table_with_1_billion_rows
WHERE partition_key = date '2014-04-01' -- Restricts the data to only 10 million rows
the partitioned approach will, obviously, be more efficient not least of all because you're only reading the 10 million rows in the April 1 partition rather than the 1 billion rows in the table.
If the table has no data, it's possible that the query against the partitioned table would be a tiny bit less efficient since you've got to do more things in the course of parsing the query. But reading 0 rows from a 0 row table is going to take essentially no time either way so the difference in parse time is likely to be irrelevant.
In general, you wouldn't ever use the ORDERS_RANGE partition(orders2011) syntax to access data. In addition to hard-coding the partition name, which means that you'd often be resorting to dynamic SQL to assemble the query, you'd be doing a lot more hard parsing, putting more pressure on the shared pool, and risking a mistake if someone changed the partitioning on the table. It makes far more sense to supply a predicate on the partition key and let Oracle work out how to appropriately prune the partitions. In other words
SELECT *
FROM orders_range
WHERE order_date < date '2012-01-01'
would be a much more sensible query.

Selecting 'highest' X rows without sorting

I've got a table with a huge amount of data, let's say 10 GB of lines, containing a bunch of crap. I need to select, for example, the X rows (X is usually below 10) with the highest amount column.
Is there any way to do this without sorting the whole table? Sorting this amount of data is extremely time-expensive. I'd be OK with one scan through the whole table, selecting the X highest values and leaving the rest untouched. I'm using SQL Server.
Create an index on amount; then SQL Server can select the top 10 from that and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
If it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
Whether an index is useful or not depends on how often you do that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.
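As a sketch, the index the answer describes, with Amount stored descending so the TOP 10 reduces to reading the first 10 index entries plus 10 bookmark lookups. The index name and the dbo.myTable name are made up for illustration:

```sql
-- Nonclustered index ordered by Amount descending
CREATE NONCLUSTERED INDEX IX_myTable_Amount
    ON dbo.myTable (Amount DESC);

-- The optimizer can now satisfy this with a 10-row index scan
-- plus key lookups for the remaining columns
SELECT TOP 10 *
FROM dbo.myTable
ORDER BY Amount DESC;
```

This avoids sorting the whole table entirely: the sort order is maintained by the index, so the query touches only the rows it returns.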