I have a SQL Server database with tables containing temperature measurements.
The tables have columns for MeasurementId (primary key), SensorId, Timestamp, and Value.
We now have enough measurements that our queries are starting to get a bit slow, and I'm trying to improve this.
The Timestamp values are not necessarily in order across the table, but for each SensorId they are ordered.
My question is: is there any way I can use this knowledge to improve the performance of a query like SELECT * FROM MeasurementTable WHERE SensorId=xx AND Timestamp>yy ?
I.e., can I hint to SQL Server that once the results have been narrowed to a single SensorId, the rows are guaranteed to be ordered by Timestamp?
For your query, you simply want a composite index:
create index idx_measurementtable_sensorid_timestamp
on MeasurementTable(SensorId, Timestamp);
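If the query really is SELECT * and MeasurementId is the clustered primary key (an assumption on my part, not something the answer states), a covering variant of the same index avoids the key lookups by also carrying Value:

-- Sketch: covering variant, assuming MeasurementId is the clustered PK
-- and is therefore already carried in the nonclustered index.
create index idx_measurementtable_sensorid_timestamp_value
on MeasurementTable(SensorId, Timestamp)
include (Value);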
I would like to run the following query about once every 5 minutes, so that I can run an incremental MERGE into another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or will the columnar nature of BigQuery alone make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS table within your dataset. Doc here.
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata, one record per partition. It is therefore much smaller than your table, and querying it is an easy way to cut your query costs (it is also much faster to query).
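For the incremental MERGE use case you can then take the maximum over the partitions; a minimal sketch, assuming myTable is partitioned on its timestamp column:

-- Sketch: latest modification time across all partitions of myTable.
SELECT MAX(LAST_MODIFIED_TIME) AS last_modified
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"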
I have a table records with the following fields:
id - the row id
value - the row value
source - the source of the value
timestamp - the time when the row was inserted (should this be a unix timestamp or a datetime?)
And I want to perform a query like this:
SELECT timestamp, value FROM records WHERE timestamp >= a AND timestamp <= b
However, in a table with millions of records this query is super inefficient!
I am using Azure SQL Server as the DBMS. Can this be optimised?
If so, can you provide a step-by-step guide to do it (please don't skip "small" steps)? Be it creating indexes, rewriting the query, redesigning the table (partitioning?)...
Thanks!
After creating an index on the field you want to search, you can use a BETWEEN operator so the range check is a single predicate, which SQL Server can handle efficiently.
SELECT XXX FROM ABC WHERE DateField BETWEEN '1/1/2015' AND '12/31/2015'
Also, in SQL Server 2016 you can create range indexes for things like timestamps using memory-optimized tables. That's really the way to do it.
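As a rough sketch of that approach (table and column names are assumptions based on the question, and the database first needs a MEMORY_OPTIMIZED_DATA filegroup):

-- Sketch: memory-optimized table with a range (nonclustered) index
-- on the timestamp column; adjust names and types to your schema.
CREATE TABLE dbo.RecordsInMemory
(
    Id         INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Value      DECIMAL(18,4) NOT NULL,
    Source     NVARCHAR(50) NOT NULL,
    RecordedAt DATETIME2(3) NOT NULL,
    INDEX IX_RecordsInMemory_RecordedAt NONCLUSTERED (RecordedAt)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);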
I would recommend using the datetime data type, or even better datetime2, to store the date data (datetime2 is better as it has a higher level of precision, and at lower precision levels it uses less storage).
As for your query, based on the statement you posted you would want timestamp to be the key column, with value as an included column. This is because you are using timestamp as your predicate and returning value along with it.
CREATE NONCLUSTERED INDEX IX_Records_Timestamp on Records (Timestamp) INCLUDE (Value)
That being said, be careful with your column names. I would highly recommend not using reserved keywords for column names, as they can be a lot more difficult to work with.
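For example, a hedged sketch of switching the existing column over (it assumes the table is dbo.Records, that the column is currently a date/time type with no NULLs, and that RecordedAt is an acceptable new name):

-- Sketch: convert to datetime2 and rename away from the keyword-like name.
-- Run this before creating the index above (or drop and recreate the index).
ALTER TABLE dbo.Records ALTER COLUMN [timestamp] DATETIME2(3) NOT NULL;
EXEC sp_rename 'dbo.Records.timestamp', 'RecordedAt', 'COLUMN';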
I have a data file with 800 million rows and 3 columns. The CSV file size is 30 GB.
I need to do some analysis on the data. It took a very long time to load it into SQL Server.
Also, it took about 10 minutes to run a SQL query like:
SELECT MAX(VALUE) AS max_s
FROM [myDB].[dbo].[myTable]
I also need to do other statistical analysis for each column.
SELECT COUNT(*) as num_rows, COUNT(DISTINCT VARIABLE1) as num_var1 FROM [myDB].[dbo].[myTable]
If I want to improve the analysis/query efficiency, can SQL Server or other tools help me?
How about R? But my laptop has only 8 GB of memory, so it is impossible to load the whole dataset into a data frame.
More info about the data is here:
get statistics information by SQL query efficiently for table with 3 columns and 800 million rows
Some solutions have been given, and I really appreciate them, but I would like to find out whether there are more efficient solutions.
You can greatly speed up your SQL queries by indexing your data, especially with large tables.
CREATE CLUSTERED INDEX index_name
ON [myDB].[dbo].[myTable] (value, cardID, locationID)
The command above creates a clustered index for your table. Place your actual column names within the round brackets. A clustered index sorts your rows in the order specified within the round brackets. You can create additional non-clustered indexes, but it is generally advisable to have at least one clustered index on your table.
If you have a unique identifier (e.g. an id for each observation that is truly distinct) in your data, you can create a UNIQUE INDEX by using the CREATE UNIQUE INDEX statement. This is generally the best way to speed up your queries.
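For example, a minimal sketch, assuming the data really does have a truly distinct id column (the question does not say so):

-- Sketch: unique index on an assumed distinct id column.
CREATE UNIQUE INDEX UX_myTable_id
ON [myDB].[dbo].[myTable] (id);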
Generally speaking, again, you should index your data in descending order of cardinality; this means that the column with the most distinct values goes first in your "ON table (...)" statement, followed by columns with gradually fewer distinct values.
Index syntax
Some more information on indexes
I'm having a problem with a slow query. Consider the table tblVotes, which has two columns: VoterGuid and CandidateGuid. It holds votes cast by voters for any number of candidates.
There are over 3 million rows in this table - with about 13,000 distinct voters casting votes to about 2.7 million distinct candidates. The total number of rows in the table is currently 6.5 million.
What my query is trying to achieve is getting - in the quickest and most cache-efficient way possible (we are using SQL Express) - the top 1000 candidates based on the number of votes they have received.
The code is:
SELECT CandidateGuid, COUNT(*) CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
HAVING COUNT(*) > 1
ORDER BY CountOfVotes DESC
... but this takes a scarily long time to run on SQL Express when the table is this full.
Can anybody suggest a good way to speed this up and get it running in quick time? CandidateGuid is indexed individually - and there is a composite primary key on CandidateGuid+VoterGuid.
If you have only two columns in a table, a "normal" index on those two fields won't help you much, because it is in fact a copy of your entire table, only ordered. First, check in the execution plan whether your index is being used at all.
Then consider changing your index to a clustered index.
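A sketch of what that could look like, assuming the existing single-column index is named IX_tblVotes_CandidateGuid and the composite primary key was created as nonclustered (both assumptions; otherwise the clustered slot is already taken):

-- Sketch: swap the single-column index for a clustered one.
DROP INDEX IX_tblVotes_CandidateGuid ON dbo.tblVotes;
CREATE CLUSTERED INDEX CIX_tblVotes_CandidateGuid
ON dbo.tblVotes (CandidateGuid);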
Try using a TOP n instead of a HAVING clause, like so:
SELECT TOP 1000 CandidateGuid, COUNT(*) CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
ORDER BY CountOfVotes DESC
I don't know if SQL Server is able to use the composite index to speed this query, but if it is able to do so you would need to express the query as SELECT CandidateGUID, COUNT(VoterGUID) FROM . . . in order to get the optimization. This is "safe" because you know VoterGUID is never NULL, since it's part of a PRIMARY KEY.
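Spelled out against the table from the question, and combined with the TOP 1000 form from the other answer, that would be something along the lines of:

SELECT TOP 1000 CandidateGuid, COUNT(VoterGuid) AS CountOfVotes
FROM dbo.tblVotes
GROUP BY CandidateGuid
ORDER BY CountOfVotes DESC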
If your composite primary key is specified as (CandidateGUID, VoterGUID) you will not get any added benefit of a separate index on just CandidateGUID -- the existing index can be used to optimize any query that the singleton index would assist in.
I have a table Orders that stores orders, with fields:
Id
Date
Amount
Cost
Currency
I tried the following query:
SELECT SUM(Amount)-SUM(NFC1)
FROM Orders
WHERE Date BETWEEN '20121101' AND '20121231'
AND Currency = 'EUR'
Now, according to Oracle SQL Developer, what makes the query slow is the Currency = 'EUR' filter, since the other operations have much lower cost.
I checked the indexes and I have an index on Id, and another index on Date.
It seems to me, by the query analysis, that the DBMS first finds the records matching the required dates and then scans the whole table to find the records having Currency='EUR'. Currency is a VARCHAR.
Is there any way to optimize the query? I mean, is there a way to avoid the full scan?
More generally, is it possible to prevent the DBMS from performing a full table scan, and instead have it look for the records matching the Currency only among those already filtered by date?
Thanks a lot
It seems to me, by the query analysis, that the DBMS first finds the records matching the required dates and then scans the whole table to find the records having Currency='EUR'. Currency is a VARCHAR.
It does not scan the whole table.
Rather, it takes the row pointers (rowids, or the PRIMARY KEY values if the table is an IOT) from the index records and looks up the currency in the table rows in a nested loop. Since the index you're using does not contain Currency, it has to be looked up in the table to do the filtering.
To work around this, you would need to create a composite index on (Currency, Date).
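For example (a sketch; the index name is mine, and because DATE is a reserved word in Oracle the real date column is presumably named something else, e.g. Order_Date):

-- Sketch (Oracle): composite index matching the Currency + date-range filter.
CREATE INDEX idx_orders_currency_date
ON Orders (Currency, Order_Date);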
If creating another index is not an option, you may try creating a MATERIALIZED VIEW and create an index on that.
Build an index on the Currency field, or a compound index on Date and Currency if you often filter on both fields.
Is there any way to optimize the query? I mean, is there a way to avoid the full scan?
A full scan might just be the most optimized plan available. When a query filters data from a large portion of the table, a full scan is usually faster than index lookups, because the database can use fast, sequential disk reads.