Averaging large amounts of data in SQL Server

We need to perform averaging calculations on a large set of data. Data is captured from the devices quite frequently, and we want the last day's average, the last week's average, the last month's average and the last year's average.
Unfortunately, taking the average of the last year's data takes several minutes to complete. I only have a basic knowledge of SQL and am hoping there is some good advice here to speed things up.
The table has a timestamp, an ID that identifies which device the data belongs to and a floating point data value.
The query I've been using follows this general example:
select avg(value)
from table
where id in (1, 2, 3, 4) and timestamp > last_year
Edit: I should also clarify that these averages are requested on a rolling basis, as in "year to date" averages. I do realize that, simply due to the sheer volume of data, we may have to compromise.

For this kind of problem you can always try the following solutions:
1) optimize the query: look at the query plan, create some indexes, defragment the existing ones, run the query when the server is free, etc.
2) create a cache table.
To populate the cache table, choose one of the following strategies:
1) use triggers on the tables that affect the result, and refresh the cache table on insert, update and delete. The trigger must run very, very fast. Another condition is that it must not block any records (otherwise you will end up in a deadlock if the server is busy).
2) populate the cache table with a job once per day/hour/etc. (a sketch of this follows the list)
3) one solution that I like is to populate the cache from a stored procedure when the result is needed (e.g. when the report is requested by the user), use some logic to serialize the process (only one user at a time can generate the cache), plus some optimization so the same rows are not recomputed next time (e.g. if no rows were added for yesterday and the cache already holds yesterday's result, I don't recalculate that value; I calculate only the new values since the last run).
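For illustration, here is a minimal T-SQL sketch of strategy 2. The table and column names (readings, device_id, reading_time, value) are assumptions, since the post does not give them, and DATEFROMPARTS requires SQL Server 2012 or later. Storing per-day sums and counts (rather than per-day averages) keeps the overall average exact:
CREATE TABLE daily_stats (
    device_id   INT    NOT NULL,
    stat_date   DATE   NOT NULL,
    value_sum   FLOAT  NOT NULL,
    value_count BIGINT NOT NULL,
    PRIMARY KEY (device_id, stat_date)
);

-- Run once per day (e.g. from a SQL Server Agent job) to add yesterday's totals:
INSERT INTO daily_stats (device_id, stat_date, value_sum, value_count)
SELECT device_id,
       CAST(reading_time AS DATE),
       SUM(value),
       COUNT(*)
FROM readings
WHERE reading_time >= CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
  AND reading_time <  CAST(GETDATE() AS DATE)
GROUP BY device_id, CAST(reading_time AS DATE);

-- A rolling "year to date" average then only touches the small cache table:
SELECT SUM(value_sum) / SUM(value_count) AS ytd_avg
FROM daily_stats
WHERE device_id IN (1, 2, 3, 4)
  AND stat_date >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1);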

You might want to consider making the clustered index on the timestamp column; typically the clustered index is wasted on the id. One caution: the sort order of the output of other SQL statements may change if they have no explicit ORDER BY.
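For example (a sketch only; the table and column names are assumed, since the question does not give them):
CREATE CLUSTERED INDEX cix_readings_time ON readings (reading_time);
With the rows physically ordered by time, a filter such as reading_time > @last_year reads one contiguous range of pages instead of jumping all over the table.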

You can make a caching table for the statistics; it should have something similar to this structure:
year | reads_sum  | total_reads | avg
=====|============|=============|=====
2009 | 6817896234 | 564345      |
At the end of the year, you fill the avg (average) field with the now quick-to-calculate value.
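A rough sketch of that cache table and the year-end update (names are illustrative):
CREATE TABLE yearly_stats (
    stat_year   INT    PRIMARY KEY,
    reads_sum   FLOAT  NOT NULL,
    total_reads BIGINT NOT NULL,
    avg_value   FLOAT  NULL
);

-- At the end of the year, the average falls out of the two running totals:
UPDATE yearly_stats
SET avg_value = reads_sum / total_reads
WHERE stat_year = 2009;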

Related

A query with MIN(date) not finished in 20 hours: should it be like that, or did I do something wrong?

Inspired by a post by Tommaso Pifferi, I've created a PostgreSQL (11) database to operate on my time series data: 316K financial instruments, 139M records in total. Time series of different instruments vary in length and time period, and often have gaps. There are two tables: a description of the instruments and the time series records. The structure is very simple:
TABLE instruments has
instr_id INT PRIMARY KEY and
9 more columns describing each instrument,
TABLE timeseries has
PRIMARY KEY (instr_id, date) where
instr_id INT REFERENCES instruments(instr_id) connects time series records with instrument description,
date DATE NOT NULL is the date of time series records
There is no index on date.
5 more columns containing indicators such as price, trading volume, etc.
I work in Python 3.7, use psycopg2 as the driver and sqlalchemy as the ORM (but this is probably irrelevant). First I filled in the database using DataFrame.to_sql, ran VACUUM and checked that simple queries work correctly. Then I wanted to add some columns to the instruments table summarizing time series properties. Here is the first query I ran using cursor.execute() in order to test this idea. It is supposed to find, for each time series, the date of the earliest record:
ALTER TABLE instruments
ADD begin DATE;
UPDATE instruments SET
begin = (
SELECT MIN(date) FROM timeseries
WHERE timeseries.instr_id=instruments.instr_id
);
This query has been running on a desktop PC (Intel i5, 8GB memory, Windows 7) for about 20 hours with no result. The server activity displayed in pgAdmin 4 looks as below.
I am new to relational databases and SQL. Is it normal that such a query runs for so long, or am I doing something wrong?
Updates like that are typically faster if you aggregate once over everything and join that into the UPDATE statement:
UPDATE instruments
SET "begin" = t.start_date
FROM (
    SELECT instr_id, MIN(date) AS start_date
    FROM timeseries
    GROUP BY instr_id
) t
WHERE t.instr_id = instruments.instr_id;
The answer by a_horse_with_no_name is the correct one, but if you want to speed up the query without rewriting it, you should
CREATE INDEX ON timeseries (date);
That would speed up the repeated subselect and hence the whole query considerably.
What has to be done to get MIN(date)? Well, the whole table of 139M records has to be scanned, for every instrument, and that is the explanation.
To see how the query is executed, use EXPLAIN (see the PostgreSQL documentation). Note that EXPLAIN ANALYZE can itself take hours, since the query has to be executed in order to collect all the information.
What to do? You can create an index. The question is whether it would be used: PostgreSQL will typically use an index when the query fetches less than about 2% of the table; otherwise it goes with a seqscan, i.e. a read of the whole table. If you think the seqscan is your case, you can consider adding date to an index so that, instead of reading the table, the database can answer from the index alone. To check, use EXPLAIN.
That is a general answer; just try to play with it. If you have more questions, we can try to build up a final answer.
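For example, a minimal way to inspect the plan of the aggregation on its own (table and column names are taken from the question; plain EXPLAIN only shows the estimated plan, while EXPLAIN ANALYZE actually runs the statement):
EXPLAIN
SELECT instr_id, MIN(date)
FROM timeseries
GROUP BY instr_id;

-- To see actual timings and row counts (this executes the query, so it can take a long time):
EXPLAIN (ANALYZE, BUFFERS)
SELECT instr_id, MIN(date)
FROM timeseries
GROUP BY instr_id;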

Possibilities of Query tuning in my case using SQL Server 2012

I have 2 tables called Sales and SalesDetails; SalesDetails has 90 million rows.
When I want to retrieve all records for 1 year, it takes almost 15 minutes and still doesn't complete.
I tried to retrieve records for 1 month; it took 1 minute 20 seconds and returned around 2.5 million records. I know that's huge.
Is there any solution to reduce the execution time?
Note
I don't want to create any index, because it already has enough indexes by default
I don't know what you mean when you say that you have indices "by default." As far as I know, creating the two tables you showed us above would not create any indices by default (other than maybe the clustered index).
That being said, your query is tough to optimize, because you are aggregating and taking sums. This behavior generally requires touching every record, so an index may not be usable. However, we may still be able to speed up the join using something like this:
CREATE INDEX idx ON sales (ID, Invoice) INCLUDE (Date, Register, Customer)
Assuming SQL Server chooses to use this index, it could scan salesDetails and then quickly lookup every record against this index (instead of the sales table itself) to complete the join. Note that the index covers all columns required by the select statement.

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer decides that the query will return a large share of all the records in your table, it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records are in 2017 out of all records in the table? This calculation gives you the cardinality of your query, which in turn gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
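For illustration, a function-based index matching that predicate could look like this (the table name events is made up here, and since you cannot modify the vendor's database, this is something the vendor would have to create):
CREATE INDEX idx_events_perf_year
    ON events (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));

-- With this index in place, the predicate can stay as written:
SELECT *
FROM events
WHERE Event_Code = 102225120
  AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017;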
A function applied to the column can also cause slowness given the number of records involved. I'm not sure whether a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, try adding one and updating it using the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider table partitioning for data this big. For starters, see the link below:
https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
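For illustration only, an interval-partitioned layout could look roughly like this (the table name and column list are made up, and as noted elsewhere, only the vendor could actually restructure the table):
CREATE TABLE events_part (
    event_code          NUMBER,
    performed_date_time DATE
    -- remaining columns omitted for brevity
)
PARTITION BY RANGE (performed_date_time)
INTERVAL (NUMTOYMINTERVAL(1, 'YEAR'))
(
    PARTITION p_before_2017 VALUES LESS THAN (DATE '2017-01-01')
);
-- A query filtering on a date range then only touches the matching yearly partition(s).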
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the former would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date values AND the PK of the original table in another database, query that to find the relevant PK values, and then JOIN those back to your original table to find whatever you need. The biggest problem would be keeping the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
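A rough sketch of the shadow idea, under the assumption that the source table is called events with primary key EVENT_ID (both names invented here):
-- One-off initial load (heavy with ~6B rows):
CREATE TABLE events_shadow AS
SELECT event_id, performed_date_time
FROM events;

CREATE INDEX idx_shadow_date ON events_shadow (performed_date_time);

-- Periodic refresh; in practice you would restrict this to rows added since the last run:
INSERT INTO events_shadow (event_id, performed_date_time)
SELECT e.event_id, e.performed_date_time
FROM events e
WHERE NOT EXISTS (SELECT 1 FROM events_shadow s WHERE s.event_id = e.event_id);

-- Filter by date on the shadow, then join back to the original table by primary key:
SELECT e.*
FROM events_shadow s
JOIN events e ON e.event_id = s.event_id
WHERE s.performed_date_time >= DATE '2017-01-01'
  AND s.performed_date_time <  DATE '2018-01-01';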
Maybe this could be useful (because you avoid functions, a cause for context switching, and if you have an index on your date field, it could be used):
with dt as
(
    select
        to_date('01/01/2017', 'DD/MM/YYYY') as d1,
        to_date('31/01/2017', 'DD/MM/YYYY') as d2
    from dual
),
dates as
(
    select dt.d1 + rownum - 1 as d
    from dt
    connect by dt.d1 + rownum - 1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Apply the range condition directly to the raw column, with the date literals on the right-hand side:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

Best Index For Partitioned Table

I am querying a fairly large table that has been range partitioned (by someone else) by date into one partition per day. On average there are about 250,000 records per day. Frequently queries will be by a range of days, usually looking for one day, a 7-day week or a calendar month. Right now querying for more than 2 weeks is not performing well; I have a normal date index created. If I query for more than 5 days it doesn't use the index; if I use an index hint it performs OK from about 5 days to 14 days, but beyond that the hint doesn't help much.
Given that the hint does better than the optimizer, I am gathering statistics on the table.
However, my question going forward is, in general, if I wanted to create an index on the date field in the table, is it best to create a range partitioned index? Is it best to create a range index with a daily range similar to the table partition? What would be the best strategy?
This is Oracle 11g.
Thanks,
Related to your question, the partitioning strategy will depend on how you are going to query the data; the best strategy is to query as few partitions as possible. E.g. if you are going to run monthly reports you'd rather create monthly range partitioning and not daily range partitioning. If all your queries will be over data that falls within a couple of days, then daily range partitioning is fine.
Given the numbers you provided, in my opinion you over-partition the data.
P.S. Querying each partition requires additional reading (compared to a single partition), so the optimizer opts for a full table scan to reduce the reading of indexes.
Try to create a global index on the date column. If the index is partitioned and you select, let's say, 14 days, then Oracle has to read 14 index partitions. With a single index on the entire table, i.e. a "global index", it has to read only one index.
Note that when you truncate or drop a partition, you have to rebuild the index afterwards.
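For reference, a minimal sketch, assuming the partitioned table is called trx and the date column trx_date (names borrowed from the example in the answer below):
-- A single, non-partitioned ("global") index over the whole table:
CREATE INDEX idx_trx_date ON trx (trx_date);

-- The locally partitioned alternative (one index segment per partition) would use the
-- LOCAL keyword instead; Oracle allows only one index per column list, so pick one:
-- CREATE INDEX idx_trx_date ON trx (trx_date) LOCAL;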
I'm guessing that you could be writing your SQL wrong.
You said you're querying by date. If your date column has a time part and you want to extract records from one day, from a specific time of day, e.g. 20:00-21:00, then yes, an index would be beneficial, and I would recommend a local index for this (partitioned by day, just like the table).
But since your queries span a range of days, it seems this is not the case and you just want all data (maybe filtered by some other attributes). If so, a partition full scan will always be much faster than index access... provided you benefit from partition pruning! Because if not - and you're actually performing a full table scan - this is expected to be very, very slow (in most cases).
So what could go wrong? Are you using a plain date in the WHERE clause? Note that:
SELECT * FROM trx WHERE trx_date = to_date('2014-04-03', 'YYYY-MM-DD');
will scan only one partition, whereas:
SELECT * FROM trx WHERE trunc(trx_date) = to_date('2014-04-03', 'YYYY-MM-DD');
will scan all partitions, as you apply a function to partitioning key and the optimizer can no longer determine which partitions to scan.
It would be much easier to tell for sure if you provided table definition, total number of partitions, sample data and your queries with explain plans. If possible, please edit your question and include more details.

Ad hoc queries against high cardinality columns

How to improve the performance of ad hoc queries against tables having hundreds of high cardinality columns and millions of records?
In my case, I have a table with one indexed DATE column SDATE, one VARCHAR2 column NE and 750 numeric columns most of them high cardinality columns with values in the range of 0 to 100. The table is updated with almost 20000 new records every hour. The queries against this table look like:
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND V1 > :V1 AND V3 < :V3
or
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND NE = :NE AND V4 > :V4
etc.
So far, I have always advised users not to enter big date intervals, so as to limit the number of records resulting from the date index access path; however, from time to time it becomes necessary to specify bigger intervals.
If V1, V2, ..., V750 were all low cardinality columns, I would have been able to utilize bitmap indexes. Unfortunately they are not.
What's the advice on this? How should I tackle this problem?
Thanks.
I assume you're stuck with the design, so here are a few thoughts that I'd probably look at -
1) use partitions - if you have the partitioning option
2) use some triggers to denormalise (or normalise, in this case) a query table which is more optimised for the query usage
3) make some snapshots (a rough sketch follows below)
4) look at having a current table or set of tables which holds the day's records (or some suitable subset), and roll them over to a big table that stores history.
It depends on usage patterns and all the other constraints the system has; this may get you started, and if you have more details a better solution is probably out there.
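For illustration, option 3 could be a materialized view (the current name for snapshots) that pre-aggregates per day and NE; the column names follow the question, while the view name and the hourly complete refresh are assumptions:
CREATE MATERIALIZED VIEW tab_daily_mv
BUILD IMMEDIATE
REFRESH COMPLETE
START WITH SYSDATE NEXT SYSDATE + 1/24   -- complete refresh every hour
AS
SELECT TRUNC(SDATE) AS sday,
       NE,
       COUNT(*)     AS cnt,
       AVG(V1)      AS avg_v1,
       MAX(V1)      AS max_v1,
       AVG(V4)      AS avg_v4,
       MAX(V4)      AS max_v4
FROM TAB
GROUP BY TRUNC(SDATE), NE;
This does not answer arbitrary row-level predicates such as V1 > :V1 directly, but it can serve the common report-style queries without touching the 750-column base table.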
I think the big problem would be the inserts. You have an index on SDATE, which slows the inserts and speeds up the selects. But, returning to your problem:
If users specify an interval which is large (let's say > 5% of the table), it is better to have the table partitioned by SDATE in a daily, weekly or monthly manner.
Oracle partitioning docs
(If you partition the table, don't forget to also partition the index. And if you want to do it live, use exchange partition.)
Also, as a workaround, if you have a powerful machine, you may use parallel queries.
Oracle Parallel docs
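For example, a parallel full scan for the large-interval case could be hinted like this (the degree of 8 is arbitrary and should be sized to the machine):
SELECT /*+ FULL(t) PARALLEL(t, 8) */ *
FROM TAB t
WHERE SDATE BETWEEN :SDATE AND :EDATE
  AND V1 > :V1
  AND V3 < :V3;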