I have a database table that receives close to 1 million inserts a day and needs to stay searchable for at least a year. That means a big hard drive and a lot of data, and the hardware it runs on isn't that great either.
The table looks like this:
id     | tag_id | value | time
-------|--------|-------|--------------------
279571 | 55     | 0.57  | 2013-06-18 12:43:22
...
A tag_id might refer to something like AmbientTemperature or AmbientHumidity, and the time is captured when the reading is taken from the sensor.
I query this table for reporting. For example, I want to see all data for tags 1, 55, 72, and 4 between 2013-11-01 and 2013-11-28 at 1-hour intervals.
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name,
ROW_NUMBER() over (partition by tag_id,datediff(hour, 0, time)/1 order by time desc) as seqnum
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
) k
WHERE seqnum = 1
ORDER BY time;
Can I optimize this table or my query at all? How should I set up my indexes?
It's pretty slow with a table size of 100+ million rows: it can take several minutes to return a data set of 7 days at an hourly interval with 3 tags in the query.
Filtering on the result of the ROW_NUMBER() function makes the query painfully slow, and it also prevents optimal index use.
If your primary reporting need is hourly information you might want to consider storing which rows are the first sensor reading for a tag in a specific hour.
ALTER TABLE tag_values ADD IsHourlySensorReading BIT NULL;
In an hourly process, you calculate this column for new rows.
DECLARE @CalculateFrom DATETIME = (SELECT MIN(time) FROM tag_values WHERE IsHourlySensorReading IS NULL);
-- round down to the start of that hour
SET @CalculateFrom = dateadd(hour, datediff(hour, 0, @CalculateFrom), 0);
UPDATE k
SET IsHourlySensorReading = CASE seqnum WHEN 1 THEN 1 ELSE 0 END
FROM (
SELECT id, IsHourlySensorReading,
       row_number() over (partition by tag_id, datediff(hour, 0, time)/1 order by time desc) as seqnum
FROM tag_values tv
WHERE tv.time >= @CalculateFrom
AND tv.IsHourlySensorReading IS NULL
) as k;
Your reporting query then becomes much simpler:
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
AND IsHourlySensorReading=1
) k
ORDER BY time;
The following index will help with calculating the IsHourlySensorReading column. But remember, indexes will also make your million inserts per day take more time. Test thoroughly!
CREATE NONCLUSTERED INDEX tag_values_ixnc01 ON tag_values (time, IsHourlySensorReading) WHERE (IsHourlySensorReading IS NULL);
Use this index for reporting if you need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc02 ON tag_values (time, tag_id, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Use this index for reporting if you don't need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc03 ON tag_values (tag_id, time, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Some additional things to consider:
Is ORDER BY time really required?
Table partitioning can seriously improve both insert and query performance. Depending on your situation I would partition on either tag_id or date.
Instead of creating a column with an IsHourlySensorReading indicator, you can also create a separate table/database for specific reporting requirements and only load the relevant data into that.
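To make that concrete, something along these lines could seed such a reporting table once the indicator column is populated (the target table name is made up here); a scheduled job would then append only newly flagged rows:
SELECT tv.time, tv.tag_id, tn.tag_name, tn.friendly_name, tv.value
INTO tag_values_hourly_report   -- hypothetical reporting table, created by SELECT INTO
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE tv.IsHourlySensorReading = 1;
Your reports would then hit this much smaller table instead of the 100-million-row one.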
I'm not an expert on SQL Server, but I would seriously consider setting this up as a partitioned table. This would also make archiving easier, as partitions can simply be dropped (rather than running an expensive DELETE ... WHERE ...).
Also (with a bit of luck) the optimiser will only look in the partitions required for the data.
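To illustrate what date-based partitioning could look like in SQL Server (the boundary values and object names below are just examples, not a tested recommendation):
-- Monthly partitions on the time column (boundary values are examples only)
CREATE PARTITION FUNCTION pf_tag_values_monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2013-10-01', '2013-11-01', '2013-12-01');

CREATE PARTITION SCHEME ps_tag_values_monthly
    AS PARTITION pf_tag_values_monthly ALL TO ([PRIMARY]);

-- The table's clustered index then has to be (re)built on the partition scheme,
-- e.g. after dropping the existing clustered index:
CREATE CLUSTERED INDEX cix_tag_values_time ON tag_values (time)
    ON ps_tag_values_monthly (time);
Archiving a month then becomes a quick metadata operation (switch the oldest partition out to a staging table and merge its range) instead of a huge DELETE.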
Related
I'm trying to create the most optimal query so that the database returns the names of readers who often borrow sci-fi books. This is what I'm trying to optimize:
SELECT reader.name,
COUNT (CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END)
FROM reader
JOIN book ON book.reader_id = reader.id
GROUP BY reader.name
ORDER BY COUNT (CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END) DESC
LIMIT 10;
How can I improve my query, other than by using an INNER JOIN or increasing memory consumption?
This is my ERD diagram:
You could try adding your criteria to the JOIN condition and just use the total count. It really depends on how much data you have, etc.
SELECT reader.name,
COUNT(1) AS COUNTER
FROM reader
JOIN book ON book.reader_id = reader.id
AND book.status_id = 1
AND book.category_id = 2
GROUP BY reader.name
ORDER BY COUNTER DESC
LIMIT 10;
This assumes at least 10 readers pass the criteria (as another answer also silently assumes); otherwise you get fewer than 10 result rows.
Start with the filter. Aggregate & limit before joining to the second table. Much cheaper:
SELECT r.reader_id, r.surname, r.name, b.ct
FROM (
SELECT reader_id, count(*) AS ct
FROM book
WHERE status_id = 1
AND category_id = 2
GROUP BY reader_id
ORDER BY ct DESC, reader_id -- tiebreaker
LIMIT 10
) b
JOIN reader r ON r.id = b.reader_id
ORDER BY b.ct DESC, r.reader_id; -- tiebreaker
A multicolumn index on (status_id, category_id) would help a lot, or an index on just one of the two columns if either predicate is very selective. If performance of this particular query is your paramount objective, use this partial multicolumn index:
CREATE INDEX book_reader_special_idx ON book (reader_id)
WHERE status_id = 1 AND category_id = 2;
If you typically vary the query, though, this last index is too specialized.
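In that case, the plain multicolumn index mentioned above is the more general choice (the index name here is made up):
CREATE INDEX book_status_category_idx ON book (status_id, category_id);
It supports the filter in this query as well as other queries filtering on status and category, just without the pre-filtered, reader_id-ordered rows the partial index provides.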
Additional points:
Group by reader_id, which is the primary key (I assume) and guaranteed to be unique, as opposed to reader.name! Your original is likely to fail completely: name appears to be just the "first name" from the looks of your ERD, so different readers sharing the same name would be lumped together.
It's also typically substantially faster to group by a single integer than by one or two varchar(25) columns. But that's secondary; correctness comes first.
Also output surname and reader_id to disambiguate identical names. (Even name & surname are not reliably unique.)
count(*) is slightly faster than count(1) and does exactly the same thing.
Add a tiebreaker to the ORDER BY clause to get a stable sort order and deterministic results. (Else, the result can be different every time with ties on the count.)
I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EVENT_DATE for a set of USER_IDs (~1000 different ids) where the SCORE is larger than a specific value, considering only the entries with a specific VERSION.
SELECT M.*
FROM (VALUES
( 5237 ),
………1000 more
( 27054 ) ) C (USER_ID)
CROSS APPLY
(SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
FROM MY_HUGE_TABLE M
WHERE C.USER_ID = M.USER_ID
AND M.VERSION = 'xxxx-xx-xx'
AND M.SCORE > 2 -- removing this filter makes the query much faster
ORDER BY M.EVENT_DATE DESC) M
When I execute the query, the runtime is poor, due to a missing index on the SCORE column (I suppose).
If I remove the filter on M.SCORE > 2, I get my results ten times faster; however, the latest scores may then be less than 2.
Could anyone please give me a hint on how to set up an index that would improve my query performance?
Thank you very much in advance
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match: only its first and third columns appear in that list, and index columns need to match in order from the left. So only USER_ID can help, and that probably doesn't do much to filter the data.
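If you can add a nonclustered index, a sketch of that suggestion would be (the index name is made up):
CREATE NONCLUSTERED INDEX IX_MY_HUGE_TABLE_USER_VERSION_DATE
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);
With that index, each CROSS APPLY probe can seek to a USER_ID/VERSION pair and walk EVENT_DATE from newest to oldest until it finds a SCORE above the threshold.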
I have a PostgreSQL table with about 50 million entries. Now I want to find two ids, by looking for the first time and the last time a timestamp appears, to give me a range of entries.
Finding the "first" id takes about 100 milliseconds.
But finding the second id takes about 3 minutes.
Query for finding the first id
SELECT id
FROM transactions
WHERE "hashBlock" =
(SELECT hash
FROM blocks
WHERE n_time > 1262300400
ORDER BY id ASC
LIMIT 1)
ORDER BY id ASC
LIMIT 1
Query for finding the second id
Select id
FROM transactions
WHERE "hashBlock" =
(SELECT hash
FROM blocks
WHERE n_time < 1306879200
ORDER BY id DESC
LIMIT 1)
ORDER BY id DESC
LIMIT 1
I guess the longer runtime comes from the first query scanning forward from the first id until it finds one that satisfies the condition, while the second query has to start from the last id.
Is there any way to speed up the second query?
I would create the following indexes, if you don't have them already:
create index ix1 on transactions ("hashBlock", id);
create index ix2 on blocks (n_time, id, hash);
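After creating them, it's worth checking with EXPLAIN ANALYZE that the slow query actually uses the new indexes instead of sequential scans, for example:
EXPLAIN ANALYZE
SELECT id
FROM transactions
WHERE "hashBlock" =
    (SELECT hash
     FROM blocks
     WHERE n_time < 1306879200
     ORDER BY id DESC
     LIMIT 1)
ORDER BY id DESC
LIMIT 1;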
I have a table with 1 billion rows that holds possible solutions to a goal setting program.
The combination of each column's values forms a successful goal path. I want to filter records and show the top 10 rows, ordered by the user's preference. Someone may want the lowest possible retirement age, then the lowest deposit amount; someone else may want the highest possible survival chance, then the highest ending balance, ...
Here are my columns:
age tinyint
retirement_age tinyint
retirement_length tinyint
survival smallint
deposit int
balance_start int
balance_end int
SLOW 10 MIN QUERY:
select top(10) age,retirement_age,retirement_length,survival,deposit,balance_start,balance_end
from TABLE
where
age >= 30
and survival >= 8000 --OUT OF 10000
and balance_start <= 20000
and retirement_age >= 60
and retirement_age <= 75
and retirement_length >= 10
and retirement_length <= 25
and deposit >= 1000
and deposit <= 20000
ORDER BY -- (COLUMN ORDER PREFERENCES UNKNOWN)
retirement_age,
deposit,
retirement_length desc,
balance_end desc,
age desc,
survival desc
That query takes 10 min.
All of the records are generated once, so there is no more writing/updating to the database. I was thinking I should index each column, but have not done so. The database is 30GB right now, but space is not an issue.
I have run the Estimated Execution plan:
select: 0%
parallelism: 0%
sort: 23%
table scan: 77%
Have you tried creating an index like this one?
CREATE INDEX IX_TABLE ON [TABLE]
(age,survival,balance_start,retirement_age,retirement_length,deposit)
INCLUDE (balance_end)
The order of the index fields (age,survival,balance_start,retirement_age,retirement_length,deposit) will make a difference if not all the fields are used in the WHERE clause, so make sure to put them in order of most used.
Also, the order of the included columns does not make any difference.
Seeing as the table values will not change, you can create more than one such index to improve the performance of other queries that do not use all the fields in the WHERE clause.
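For example, if many lookups filter mainly on retirement_age and deposit, a second index leading with those columns might serve them better (this is only an illustration, not a tested recommendation):
CREATE INDEX IX_TABLE_RETIREMENT ON [TABLE]
(retirement_age, deposit, retirement_length, survival, age, balance_start)
INCLUDE (balance_end)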
I ended up creating separate indexes on each of the columns in my WHERE and ORDER BY clauses, using the sort direction from the ORDER BY:
CREATE INDEX IX_age ON TABLE (age desc)
CREATE INDEX IX_retirement_age ON TABLE (retirement_age)
CREATE INDEX IX_retirement_length ON TABLE (retirement_length desc)
CREATE INDEX IX_survival ON TABLE (survival desc)
CREATE INDEX IX_deposit ON TABLE (deposit)
CREATE INDEX IX_balance_start ON TABLE (balance_start)
CREATE INDEX IX_balance_end ON TABLE (balance_end desc)
We have a table which keeps a log of internet usage inside our company. This table is filled by software we bought, and we cannot make any changes to its table. The table does not have a unique key or index (to make writing the data faster, as its developers say).
I need to read the data in this table to create real-time reports of internet usage by our users.
Currently I'm reading data from this table in chunks of 1000 records. My problem is keeping track of the last record I have read, so I can read the next 1000 records.
What is the best possible solution to this problem?
By the way, earlier records may get deleted by the software as needed if the database file gets big.
Depending on your version of SQL Server, you can use row_number(). Once the row_number() is assigned, then you can page through the records:
select *
from
(
select *,
row_number() over(order by id) rn
from yourtable
) src
where rn between 1 and 1000
Then when you want to get the next set of records, you could change the values in the WHERE clause to:
where rn between 1001 and 2000
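On SQL Server 2012 or later, you can express the same paging with OFFSET ... FETCH instead of wrapping the query in a derived table (this still assumes there is an id column to order by, as above):
select *
from yourtable
order by id
offset 1000 rows fetch next 1000 rows only;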
Based on your comment that the data gets deleted, I would do the following.
First, insert the data into a temptable:
select *, row_number() over(order by id) rn
into #temp
from yourtable
Then you can select the data by row number in any block as needed.
select *
from #temp
where rn between 1 and 1000
This would also help:
declare @numRecords int = 1000 --Number of records needed per request
declare @requestCount int = 0  --Request number, starting from 0 and increasing by 1

select top (@numRecords) *
from
(
select *, row_number() over(order by id) rn
from yourtable
) T
where rn > @requestCount * @numRecords
order by rn
EDIT: As per comments
CREATE PROCEDURE [dbo].[select_myrecords]
--Number of records needed per request
@NumRecords int,                --(e.g. 1000)
--Datetime of the LAST RECORD of the previous result set, or null for the first request
@LastDateTime datetime = null
AS
BEGIN
select top (@NumRecords) *
from yourtable
where LOGTime < isnull(@LastDateTime, getdate())
order by LOGTime desc
END
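Called roughly like this; the second call passes the LOGTime of the last row returned by the previous call (the datetime literal is just an example):
EXEC dbo.select_myrecords @NumRecords = 1000, @LastDateTime = NULL;                    -- first request
EXEC dbo.select_myrecords @NumRecords = 1000, @LastDateTime = '2013-06-18 12:43:22';   -- next request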
Without any index you cannot efficiently select the "last" records. The solution will not scale. You cannot use "real-time" and "repeated table scans of a big logging table" in the same sentence.
Actually, without any unique identifying attribute for each row you cannot even determine what's new (proof: say you had a table full of thousands of booleans; how would you determine which ones are new? They cannot be told apart). There must be something you can use, like a combination of DateTime and IP, or so. Or you can add an IDENTITY column, which is likely to be transparent to the software you use.
Probably, the software you use will tolerate you creating an index on some ID or DateTime column as this is transparent to the software. It might create more load, so be sure to test it (my guess: you'll be fine).
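A sketch of that idea, assuming the log table is called usage_log (the table, column, and index names here are made up):
-- Adds a surrogate key without touching existing columns; existing rows are numbered automatically
ALTER TABLE usage_log ADD log_id bigint IDENTITY(1,1) NOT NULL;

-- Lets you cheaply fetch rows newer than the last one you processed
CREATE NONCLUSTERED INDEX ix_usage_log_log_id ON usage_log (log_id);
You then remember the highest log_id you have read and, on the next poll, select only rows with a larger value.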