Could adding a column with a month identifier improve query performance? - sql

The situation is as follows. There is a table with about 40 000 000 rows per month times 24 months, so let's say almost 1 000 000 000 rows. Each row has a timestamp column with an index created on it.
The most frequent queries are the ones that aggregate data for a specific month - for example January 2016. Suppose we assign a separate identifier to every month, let's call it "idm", and for January 2016 make it equal to 1 (February 2016 = 2 and so on), then create an index on idm. Would that have any effect on query performance when comparing these WHERE clauses:
timestamp >= '20160101' AND timestamp < '20160201'
idm = 1
?
Would using idm be faster?

If you have an index on timestamp and an index on the proposed idm column, then the two would probably perform identically. This is an approximate answer. If you have other conditions in the WHERE clause, then idm = 1 is better for performance, because it gives the optimizer more options for using indexes.
However, indexes are not the right approach. Because of the nature of your data and queries, you should consider table partitions. This would allow each month of data to be stored separately. You can read about table partitioning here.
If you don't want to partition the table, I would recommend making idm or timestamp the clustered index. This will help queries even when the WHERE clause selects a relatively high proportion of the rows in the table.
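As a rough sketch of that last suggestion (assuming SQL Server, since a clustered index is being discussed, and a hypothetical table name big_table):
-- cluster the table on the timestamp column so each month's rows are stored contiguously
CREATE CLUSTERED INDEX ix_big_table_timestamp ON dbo.big_table ([timestamp]);
Building a clustered index on a billion-row table is a heavy one-off operation, so it would normally be done in a maintenance window.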

Related

poorly performing query on order lines table

I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan:
Stats are up to date, we regather once a week.
Query taking 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using the index now:
The query runs faster, but not 'fast', completing in about 6 minutes.
UPDATE2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder whether a function-based index would also have been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
A materialized view, possibly covering the prior 11 months, combined with a live query against OE_ORDER_LINES_ALL for the current month-to-date (a sketch follows at the end of this answer).
A covering index (see below).
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
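For the materialized-view option mentioned above, a minimal sketch could look like this (the view name and monthly granularity are assumptions, and the refresh strategy is up to you):
CREATE MATERIALIZED VIEW mv_shipped_by_item_month
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT inventory_item_id,
       TRUNC(actual_shipment_date, 'MM') AS ship_month,
       SUM(NVL(shipped_quantity, 0))     AS shipped_quantity
FROM   oe_order_lines_all
WHERE  cancelled_flag = 'N'
AND    fulfilled_flag = 'Y'
GROUP  BY inventory_item_id, TRUNC(actual_shipment_date, 'MM');
The rolling 365-day figure would then combine the pre-aggregated prior months from the view with a live query against only the current month's rows.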
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can add the additional columns inventory_item_id and shipped_quantity to the index as well.
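As a sketch (the index name is arbitrary; Oracle has no INCLUDE clause, so the extra columns are simply appended as trailing key columns):
CREATE INDEX oe_order_lines_cov_idx
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date,
                           inventory_item_id, shipped_quantity);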
Let's recapitulate the facts:
a) You access about 300K rows from your table (see cardinality in the 3rd line of the execution plan)
b) you use a FULL TABLE SCAN to get the data
c) the query is very slow
The first thing is to check why the FULL TABLE SCAN is so slow - if the table is extremely large (check the BYTES in user_segments), you need to optimize the access to your data.
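For example, assuming the table is in your own schema, you could check its size with something like:
SELECT segment_name, ROUND(bytes / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name = 'OE_ORDER_LINES_ALL';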
But remember that no index will help you get 300K rows out of, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not used much and a large part of it is on disk.
What you need is partitioning - in your case range partitioning on actual_shipment_date - for your data size on a monthly or yearly basis.
This will eliminate the need to scan the old data (partition pruning) and make the query much more effective.
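A minimal sketch of monthly interval partitioning (the new table name and the starting boundary are assumptions; migrating a production table in place would instead use DBMS_REDEFINITION or a similar approach):
CREATE TABLE oe_order_lines_part
  PARTITION BY RANGE (actual_shipment_date)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
  ( PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01') )
AS SELECT * FROM oe_order_lines_all;
With that in place, a predicate like actual_shipment_date >= TRUNC(SYSDATE) - 365 lets Oracle prune every partition older than the last 13 or so months.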
Another possibility - if the number of rows is small but the table size is very large - is to reorganize the table to get a better full-scan time.

Possibilities of Query tuning in my case using SQL Server 2012

I have 2 tables called Sales and SalesDetails; SalesDetails has 90 million rows.
When I want to retrieve all records for 1 year, it takes almost 15 minutes and still has not completed.
I tried to retrieve records for 1 month; it took 1 minute 20 seconds and returned around 2.5 million records. I know it's huge.
Is there any solution to reduce the execution time?
Note
I don't want to create any index, because it already has enough indexes by default
I don't know what you mean when you say that you have indices "by default." As far as I know, creating the two tables you showed us above would not create any indices by default (other than maybe the clustered index).
That being said, your query is tough to optimize, because you are aggregating and taking sums. This behavior generally requires touching every record, so an index may not be usable. However, we may still be able to speed up the join using something like this:
CREATE INDEX idx ON sales (ID, Invoice) INCLUDE (Date, Register, Customer)
Assuming SQL Server chooses to use this index, it could scan salesDetails and then quickly look up every record against this index (instead of the sales table itself) to complete the join. Note that the index covers all columns required by the SELECT statement.

PostgreSQL: Effective group by any column with timestamp

I have one big table (about 1 million records) with 50 columns (one timestamp column, the rest being entity parameters), and I want to make queries like:
select param_name, count(*)
from big_table
where timestamp > {{start}} and timestamp < {{end}}
group by param_name
So how can I make the execution of these queries as fast as possible? The time bounds can be arbitrary. I am using PostgreSQL.
I am now thinking about creating 50 indexes of the form (timestamp, param_name), but that could produce huge indexes.
Is there any better solution?
A million records with 50 columns is not particularly large.
I would start with the obvious index on big_table(timestamp, param_name) and see how that works.
Actually, an index on big_table(timestamp) alone would be sufficient under many circumstances -- particularly if you are only trying to summarize a small minority of the rows.
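A sketch of both suggestions (the index names are arbitrary; the column names are taken from the query in the question):
-- composite index matching the time filter and the grouping column
CREATE INDEX big_table_ts_param_idx ON big_table (timestamp, param_name);
-- often enough on its own when the time range is selective
CREATE INDEX big_table_ts_idx ON big_table (timestamp);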

Best Index For Partitioned Table

I am querying a fairly large table that has been range partitioned (by someone else) by date into one partition per day. On average there are about 250,000 records per day. Queries frequently cover a range of days -- usually one day, a 7-day week, or a calendar month. Right now querying for more than 2 weeks is not performing well -- I have a normal date index created. If I query for more than 5 days it doesn't use the index; if I use an index hint it performs OK from about 5 days to 14 days, but beyond that the index hint doesn't help much.
Given that the hint does better than the optimizer, I am gathering statistics on the table.
However, my question going forward is, in general, if I wanted to create an index on the date field in the table, is it best to create a range partitioned index? Is it best to create a range index with a daily range similar to the table partition? What would be the best strategy?
This is Oracle 11g.
Thanks,
Related to your question: the partitioning strategy will depend on how you are going to query the data; the best strategy is to query as few partitions as possible. E.g. if you are going to run monthly reports, you'd rather create monthly range partitioning and not daily range partitioning. If all your queries target data within a couple of days, then daily range partitioning would be fine.
Given the numbers you provided, in my opinion you are over-partitioning the data.
P.S. Querying each partition requires additional reading (compared with just one partition), so the optimizer opts for a full table scan to reduce index reads.
Try to create a global index on the date column. If the index is partitioned and you select -- let's say -- 14 days, then Oracle has to read 14 index partitions. With a single index on the entire table, i.e. a "global index", it has to read only one index.
Note, when you truncate or drop a partition then you have to rebuild the index afterwards.
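For illustration, using the trx table from the example below (the index names are placeholders), the two variants would be:
-- one global index spanning all partitions
CREATE INDEX trx_date_gidx ON trx (trx_date);
-- a local index, equipartitioned with the table (one index segment per daily partition)
CREATE INDEX trx_date_lidx ON trx (trx_date) LOCAL;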
I'm guessing that you could be writing your SQL wrong.
You said you're querying by date. If your date column has a time part and you want to extract records from one day, for a specific time of day, e.g. 20:00-21:00, then yes, an index would be beneficial, and I would recommend a local index for this (partitioned by day, just like the table).
But since your queries span a range of days, it seems this is not the case and you just want all data (maybe filtered by some other attributes). If so, a partition full scan will always be much faster than index access... provided you benefit from partition pruning! Because if not - and you're actually performing a full table scan - this is expected to be very, very slow (in most cases).
So what could go wrong? Are you using plain date in WHERE clause? Note that:
SELECT * FROM trx WHERE trx_date = to_date('2014-04-03', 'YYYY-MM-DD');
will scan only one partition, whereas:
SELECT * FROM trx WHERE trunc(trx_date) = to_date('2014-04-03', 'YYYY-MM-DD');
will scan all partitions, because applying a function to the partitioning key prevents the optimizer from determining which partitions to scan.
It would be much easier to tell for sure if you provided table definition, total number of partitions, sample data and your queries with explain plans. If possible, please edit your question and include more details.

Averaging large amounts of data in SQL Server

We need to perform averaging calculations on a large set of data. Data is captured from the devices quite often, and we want to get the last day's average, the last week's average, the last month's average and the last year's average.
Unfortunately, taking the average of the last year's data takes several minutes to complete. I only have a basic knowledge of SQL and am hoping that there is some good information here to speed things up.
The table has a timestamp, an ID that identifies which device the data belongs to and a floating point data value.
The query I've been using follows this general example:
select avg(value)
from table
where id in (1, 2, 3, 4) and timestamp > last_year
Edit: I should also clarify that they are requesting that these averages be calculated on a rolling basis, as in "year to date" averages. I do realize that, simply due to the sheer volume of results, we may have to compromise.
For this kind of problem you can always try the following solutions:
1) Optimize the query: look at the query plan, create some indexes (see the sketch after this list), defragment the existing ones, run the query when the server is free, etc.
2) Create a cache table.
To populate the cache table, choose one of the following strategies:
1) Use triggers on the tables that affect the result, and on insert/update/delete refresh the cache table. The trigger should run very, very, very fast. Another condition is that it must not block any records (otherwise you will end up in a deadlock if the server is busy).
2) Populate the cache table with a job once per day/hour/etc.
3) One solution that I like is to populate the cache from a stored procedure when the result is needed (e.g. when the report is requested by the user), with some logic to serialize the process (only one user at a time can generate the cache) plus some optimization to avoid recomputing the same rows the next time (e.g. if no row was added for yesterday, and the cache already holds yesterday's result, I don't recalculate that value - I calculate only the new values since the last run).
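For option 1, a minimal sketch of an index tailored to the example query above (the table name readings is a placeholder, since the question only shows a generic query; SQL Server syntax):
-- filter columns first, with the averaged value carried as an included column
CREATE INDEX ix_readings_id_timestamp
    ON readings (id, [timestamp])
    INCLUDE (value);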
You might want to consider making the clustered index on the timestamp column. Typically a clustered index is wasted on the id. One caution on this: the sort order of the output of other SQL statements may change if there was no explicit sort.
You can make a caching table, for statistics cache, it should have something similar to this structure:
year | reads_sum  | total_reads | avg
=====|============|=============|=====
2009 | 6817896234 | 564345      |
At the end of the year, you fill the avg (average) field with the now quick-to-calculate value.
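A minimal sketch of how such a cache table could be maintained (the stats_cache and readings table names and their columns are hypothetical, following the structure above; SQL Server syntax):
-- accumulate the year's totals once...
INSERT INTO stats_cache ([year], reads_sum, total_reads)
SELECT YEAR([timestamp]), SUM(value), COUNT(*)
FROM   readings
WHERE  [timestamp] >= '20090101' AND [timestamp] < '20100101'
GROUP  BY YEAR([timestamp]);

-- ...then derive the average cheaply whenever it is needed
UPDATE stats_cache
SET    [avg] = reads_sum / NULLIF(total_reads, 0)
WHERE  [year] = 2009;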