I am trying to see if using a custom index for a specific type of data might reduce fragmentation in my database.
[Edit: we are using MS SQL Server 2008 R2]
I have an SQL database containing timestamped measurement data. Lots of data is inserted all the time, but once inserted it practically never needs to be updated. These timestamps are, however, not unique, as several devices (around 50 of them) measure the data at the same time.
This means that each group of about 50 rows in the table shares the same timestamp value. This data is received more or less simultaneously, although I could take additional care to ensure that rows are written as sequentially as possible (if that would help), perhaps by keeping them in memory for some time and then writing only when I get the data from all the devices for a single timestamp.
We are using NHibernate with Guid.Comb to avoid index lookups we would have with plain bigint IDs. As opposed to plain GUIDs, this should reduce fragmentation, but for so many inserts, fragmentation nevertheless happens very soon.
Since my data is timestamped, and data is inserted almost sequentially (increasing timestamps), I am wondering if there is a more clever way to create a primary key with a unique clustered index for this table. The timestamp column is basically a bigint number (.NET DateTime ticks).
I have also noticed that a non-clustered index over that same timestamp column also gets pretty fragmented. So what index strategy would you recommend to reduce heap fragmentation in this case?

A separate column for a key doesn't make a lot of sense for this table since you won't be updating any of the data. I imagine you'll be doing a lot of queries though, probably based on that timestamp column.
You could try making the primary key a combination of the timestamp column and a device id column. You could try making that clustered. That should allow you to write nearly as fast as possible. If you query by device however, you may need another index on device id and timestamp (the reverse). I wouldn't make the reverse the clustered one though, as that will make the writes happen all over the table rather than on the trailing pages. And if most queries involve a date range and more than one device, clustering on timestamp first should give you the best performance.
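As a rough illustration of that layout, here is a minimal sketch in Python. It uses SQLite as a stand-in for SQL Server (an assumption made only so the example runs anywhere): a WITHOUT ROWID table is stored in primary-key order, which is the closest SQLite analogue of a clustered primary key. The table, column and index names are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server here (an assumption for illustration only).
# A WITHOUT ROWID table is stored in primary-key order, similar in spirit to
# a clustered primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement (
        ts        INTEGER NOT NULL,   -- e.g. .NET DateTime ticks
        device_id INTEGER NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (ts, device_id)
    ) WITHOUT ROWID;

    -- Secondary index in the reverse order, for per-device queries.
    CREATE INDEX ix_measurement_device_ts ON measurement (device_id, ts);
""")

# 3 timestamps x 50 devices, inserted in increasing (ts, device_id) order,
# so every new row lands at the end of the primary-key order.
rows = [(ts, dev, float(dev)) for ts in (1000, 2000, 3000) for dev in range(50)]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)

# Time range across all devices: follows the primary-key order.
in_range = conn.execute(
    "SELECT COUNT(*) FROM measurement WHERE ts BETWEEN ? AND ?", (1000, 2000)
).fetchone()[0]

# History of a single device: can use the secondary index.
device_7 = conn.execute(
    "SELECT ts FROM measurement WHERE device_id = ? ORDER BY ts", (7,)
).fetchall()
```

The point of the sketch is only the key order: timestamp first for the main storage order, device first for the secondary index.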


SQL Server - can GUID be a good choice as part of a clustered index?

I have a large domain set of tables in a database - over 100 tables. Every single one uses a uniqueidentifier as a PK.
I'm realizing now that my mistake is that these are also, by default, the clustered index.
Consider a table with this type of structure:
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on UserId,Id...that way the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts are happening "in the middle" of the table. And such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
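To make the "middle of the page" effect a bit more concrete, here is a toy model in Python. It is not how SQL Server manages pages internally; it only assumes fixed-size sorted pages that split in half when a key lands inside a full page, and it compares keys arriving in order against the same keys arriving in random order. The page capacity and key count are arbitrary.

```python
import bisect
import random

PAGE_CAPACITY = 100  # rows per page; an arbitrary number for this toy model

def count_page_splits(keys, capacity=PAGE_CAPACITY):
    """Insert keys into sorted fixed-size pages; return (splits, page_count)."""
    pages = [[]]   # pages in key order; each page is a sorted list of keys
    lows = []      # lows[i] is the smallest key stored in pages[i + 1]
    splits = 0
    for key in keys:
        i = bisect.bisect_right(lows, key)   # page whose range covers the key
        page = pages[i]
        if len(page) < capacity:
            bisect.insort(page, key)
            continue
        if i == len(pages) - 1 and key > page[-1]:
            # Insert past the end of the last page: start a new page,
            # no existing rows have to move.
            pages.append([key])
            lows.append(key)
            continue
        # Full page hit in the middle: split it in half, then insert.
        half = capacity // 2
        left, right = page[:half], page[half:]
        pages[i] = left
        pages.insert(i + 1, right)
        lows.insert(i, right[0])
        splits += 1
        bisect.insort(left if key < right[0] else right, key)
    return splits, len(pages)

n = 10_000
sequential_keys = list(range(n))
random_keys = list(range(n))
random.Random(42).shuffle(random_keys)

seq_splits, seq_pages = count_page_splits(sequential_keys)
rand_splits, rand_pages = count_page_splits(random_keys)

# Average page fill: 1.0 means every page is completely full.
seq_fill = n / (seq_pages * PAGE_CAPACITY)
rand_fill = n / (rand_pages * PAGE_CAPACITY)
```

In this model the in-order keys never cause a split and leave every page full, while the shuffled keys cause many splits and leave pages partly empty, which is the fragmentation described above.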
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
A clustered index helps when you want to retrieve rows for a range of values of a given column. As the data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent for a primary key, could be positively detrimental to performance as a clustering key, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection (for later replay) of electrical parameters for hundreds to thousands of devices, with at least 20 parameters to monitor. This amounts to a huge data collection, very similar to a time series. I have to support a resolution of 1 second. Thinking about 1 year: 365*24*60*60*1000 = 31536000000 rows.
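The estimate can be checked with a few lines of Python. This is only a sketch: it assumes the factor 1000 is the device count, and the bytes-per-row figure (an 8-byte timestamp, a 4-byte device id and twenty 4-byte values, with no row or index overhead) is a hypothetical number used to get a feel for raw size.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31_536_000

def rows_per_year(devices, rows_per_device_per_second=1):
    """Rows collected in one year at 1-second resolution."""
    return SECONDS_PER_YEAR * devices * rows_per_device_per_second

# "Wide" layout: one row per device per second, 20 parameter columns.
wide_rows = rows_per_year(1000)

# "Narrow" layout: one row per parameter per device per second.
narrow_rows = rows_per_year(1000, rows_per_device_per_second=20)

# Hypothetical raw row size for the wide layout, ignoring all overhead.
ASSUMED_BYTES_PER_WIDE_ROW = 8 + 4 + 20 * 4
wide_raw_bytes = wide_rows * ASSUMED_BYTES_PER_WIDE_ROW
```

Under these assumptions the wide layout gives the 31536000000 rows mentioned above, and the narrow layout is 20 times larger.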
I did my research but still have few questions
As the data will be huge, is it good to keep it all in the same table, or should the tables be split (the data structure is the same)? Or should I rely on indexes?
Data inserts will also be very frequent, but I can batch them. Still, what is the best way: writing directly to the same database, or using a temporary database for writes and syncing with it?
Does SQL Server have a specific schema recommendation for optimizing time series for selects, updates and inserts? Is there any out-of-the-box help for a daily average or other common aggregate functions? I can write my own, but as this is a standard problem there might be some best practices and samples out of the box.
Please let me know; any help is appreciated. Thanks in advance.
1) You probably want to explore the use of partitions. This will allow very effective and very fast inserts (it's a metadata operation if you do the partitioning correctly).
2) You may want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve, but it's very doable. There is a lot of code on the internet describing the use of date functions in SQL Server.
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables, it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device) then fine, but if you can get two readings for the same device in the same second you cannot do that. If this is the PK, then load the data in that sort order if you can, even if you have to stage each second and then load. You just cannot afford to fragment a table that big. If you cannot load in sort order, then use a fill factor of 50%.
If you cannot have a PK then just use date as clustered index but not as PK and put a non clustered index on device.
I have some tables of 3,000,000,000 and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert from row 1 to row 3,000,000,000.
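The "stage each second and load" idea could look roughly like the following Python sketch. The class and method names are hypothetical, and it assumes the set of expected devices is known up front; it buffers readings per timestamp and only releases rows in (timestamp, device) order, so the load always appends at the end of the clustered key.

```python
from collections import defaultdict

class SecondStager:
    """Buffer readings per timestamp; release them in (timestamp, device) order."""

    def __init__(self, expected_devices):
        self.expected = set(expected_devices)
        self.pending = defaultdict(dict)   # timestamp -> {device_id: value}

    def add(self, timestamp, device_id, value):
        """Store one reading and return the rows that are now ready to load."""
        self.pending[timestamp][device_id] = value
        ready = []
        # Walk timestamps oldest first and stop at the first incomplete one,
        # so released rows never jump ahead of rows still being collected.
        for ts in sorted(self.pending):
            if not self.expected.issubset(self.pending[ts]):
                break
            readings = self.pending.pop(ts)
            ready.extend((ts, dev, readings[dev]) for dev in sorted(readings))
        return ready

stager = SecondStager(expected_devices=[1, 2])
first = stager.add(20, 1, 0.5)    # nothing ready yet
second = stager.add(10, 2, 1.0)   # nothing ready yet
third = stager.add(20, 2, 0.7)    # second 20 is complete, but 10 is not
fourth = stager.add(10, 1, 0.9)   # completes 10, which also releases 20
```

A real loader would also need a timeout for devices that never report, otherwise one missing reading would hold back everything after it.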

Improve performance of queries in PostgreSQL with an index

I have tables in PostgreSQL, each with millions of records and more than one hundred fields.
One of them is a date field, which we filter on in our queries. Creating an index on this date field improved the performance of the queries that read a small range of dates, but for big ranges of dates the performance decreased...
Must I prioritize one over the other? Can the performance for small ranges be improved without slowing down the big-range queries?
Queries in PostgreSQL (before index-only scans were added in 9.2) cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
1) Navigate the index to determine which data blocks are used
2) Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered faster to just access it sequentially. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that estimate turns out to be wrong; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description: if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The cause of most of them shows up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is: it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics are the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
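The trade-off can be sketched with a deliberately simplified cost model in Python. This is not PostgreSQL's actual cost formula; the only values taken from PostgreSQL are the default seq_page_cost of 1.0 and random_page_cost of 4.0. The page counts and function names are made up for illustration.

```python
def seq_scan_cost(table_pages, seq_page_cost=1.0):
    """Toy cost of reading every table page once, sequentially."""
    return table_pages * seq_page_cost

def index_scan_cost(table_pages, index_pages, selectivity, random_page_cost=4.0):
    """Toy cost of reading the matching share of index and table pages,
    charging each page as a random read."""
    return selectivity * (index_pages + table_pages) * random_page_cost

def cheaper_plan(selectivity, table_pages=10_000, index_pages=500,
                 random_page_cost=4.0):
    """Pick the plan with the lower toy cost for a given selectivity."""
    index_cost = index_scan_cost(table_pages, index_pages, selectivity,
                                 random_page_cost)
    return "index scan" if index_cost < seq_scan_cost(table_pages) else "seq scan"

small_range = cheaper_plan(0.01)                        # 1% of the table
big_range = cheaper_plan(0.50)                          # 50% of the table
big_range_cheap_random = cheaper_plan(0.50, random_page_cost=1.5)

# A mis-estimate: the planner expects 5% of the table, the query reads 60%.
estimated, actual = 0.05, 0.60
chosen = cheaper_plan(estimated)
chosen_cost = index_scan_cost(10_000, 500, actual)
best_cost = seq_scan_cost(10_000)
```

In this model the index wins for the small range and loses for the big one, lowering random_page_cost moves the crossover point, and a plan chosen on a bad estimate ends up costing more than the sequential scan it avoided.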
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (although this suggestion might seem obvious, it is usually the first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to fewer disk seeks.
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index_scan's on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term but will take some work to setup (but will scale/run forever), but suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date < '2011-06-01';
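If you go the partial-index route for several months, the statements are easy to generate. Below is a small Python sketch with hypothetical helper names; it builds the DDL text with a half-open range (from the first day of the month up to, but not including, the first day of the next month) so that adjacent months never overlap. It only produces strings and assumes the table and column names are trusted.

```python
from datetime import date

def month_bounds(year, month):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def partial_index_ddl(table, column, year, month):
    """Build a CREATE INDEX statement covering exactly one calendar month."""
    start, end = month_bounds(year, month)
    name = f"{table}_{column}_{year:04d}_{month:02d}_idx"
    return (
        f"CREATE INDEX {name} ON {table}({column}) "
        f"WHERE {column} >= '{start.isoformat()}' "
        f"AND {column} < '{end.isoformat()}';"
    )

may_2011 = partial_index_ddl("tbl", "date", 2011, 5)
december_2011 = month_bounds(2011, 12)
```

The same bounds helper could be reused for the CHECK constraints on per-month child tables in suggestion #1.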

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device I'm polling for lots of different fields, every x milliseconds
the device returns a list of ids and values which I need to store with a time stamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
and let users select
Select value from t where value_id=x order by timestamp desc limit 1
and just push everything there with an index on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale with millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
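A composite index of that shape can be tried out quickly. The sketch below uses Python with SQLite as a stand-in for MySQL (an assumption made so it runs without a server); the table and column names follow the query in the question, and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (
        value_id  INTEGER NOT NULL,
        timestamp INTEGER NOT NULL,
        value     REAL    NOT NULL
    );
    -- Composite index: id first, then time within each id.
    CREATE INDEX ix_t_id_ts ON t (value_id, timestamp);
""")

# 3 ids, 100 readings each; the value encodes id and time for easy checking.
rows = [(vid, ts, vid * 1000.0 + ts) for ts in range(100) for vid in (1, 2, 3)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

latest_sql = (
    "SELECT value FROM t WHERE value_id = ? "
    "ORDER BY timestamp DESC LIMIT 1"
)
latest_for_2 = conn.execute(latest_sql, (2,)).fetchone()[0]

# History of one id over a time window, also served by the same index.
window_for_2 = conn.execute(
    "SELECT COUNT(*) FROM t WHERE value_id = ? AND timestamp BETWEEN ? AND ?",
    (2, 10, 19),
).fetchone()[0]

# Ask SQLite how it would run the "latest value" query; the last column of
# each plan row is a human-readable description.
plan = " ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + latest_sql, (2,))
)
```

With the id as the leading column, both the "latest value for one id" query and the per-id history query can walk a single contiguous slice of the index.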
If you rarely need to sort multiple products but are most often selecting based on a single product code, then a non-traditional DBMS that uses a hashed-key sparse table rather than a b-tree might be a very viable, even superior, alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
Simplest way is to have two tables, one with the ongoing history, which is always having new values inserted, and the other, containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
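A minimal sketch of the two-table approach, in Python with SQLite standing in for the real database (an assumption for runnability). The table names and the record helper are hypothetical, and the sketch assumes readings for a given id arrive in time order, so the newest write is always the latest value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Append-only log of every reading.
    CREATE TABLE history (
        id        INTEGER NOT NULL,
        timestamp INTEGER NOT NULL,
        value     REAL    NOT NULL
    );
    -- One row per id, overwritten on every new reading.
    CREATE TABLE latest (
        id        INTEGER PRIMARY KEY,
        timestamp INTEGER NOT NULL,
        value     REAL    NOT NULL
    );
""")

def record(conn, id_, timestamp, value):
    """Append to history and overwrite the latest row in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO history VALUES (?, ?, ?)",
                     (id_, timestamp, value))
        conn.execute("INSERT OR REPLACE INTO latest VALUES (?, ?, ?)",
                     (id_, timestamp, value))

record(conn, 1, 10, 1.5)
record(conn, 1, 20, 2.5)
record(conn, 2, 15, 9.0)

latest_rows = conn.execute(
    "SELECT id, timestamp, value FROM latest ORDER BY id"
).fetchall()
history_count = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
```

Reading the latest value for every id is then a scan of a tiny table rather than a "max per group" query over the whole history.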
You have a choice of
indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get 'max per group' (search SO, especially the MySQL version with variables)
triggers - you might use triggers to maintain max row values in another table (best performance of further selects; this is redundant and could be kept in memory)
lazy statistics/triggers: since your database is updated quite often, you can save cycles if you update your statistics periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you save up to y * 1000 / x updates per refresh; this can be noticeable, especially in terms of scalability)
The above is true if you are looking for last bit of performance, if not keep it simple.
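For the lazy-statistics option, the saving is simple arithmetic: polling every x milliseconds means 1000 / x polls per second, so y seconds of allowed staleness covers y * 1000 / x polls, of which only one needs to refresh the statistics. A small Python sketch, with hypothetical names:

```python
def updates_saved(poll_interval_ms, max_staleness_s):
    """Statistics updates avoided per refresh window.

    Polling every poll_interval_ms gives 1000 / poll_interval_ms polls per
    second. If the stats may be max_staleness_s old, one update per window
    is enough and the rest are saved.
    """
    polls_per_window = max_staleness_s * 1000 / poll_interval_ms
    return polls_per_window - 1

# Polling every 10 ms with stats allowed to be 5 seconds old.
saved = updates_saved(poll_interval_ms=10, max_staleness_s=5)
```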

Indexing a 'non guessable' key for quick retrieval?

I'm not fully getting all I want from Google Analytics, so I'm making my own simple tracking system to fill in some of the gaps.
I have a session key that I send to the client as a cookie. This is a GUID.
I also have a surrogate IDENTITY int column.
I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.
I only send the GUID to the client browser:
a) I don't want my technical 'hacker' users to be able to gauge what 'user id' they are, i.e. to know how many visitors we have had to the site in total
b) I want to make sure no one messes with data maliciously - nobody can guess a GUID
I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear how to maximize the efficiency of multiple updates to the same row.
I don't know which of the following I should do :
Index the GUID column and always use that to find the row
Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find). Do this by reverse date order (if that's even possible!)
Avoid a GUID index and keep a hashtable in my application tier of active sessions : IDictionary<GUID, int> to allow the 'secret' IDENTITY surrogate key to be found from the 'non secret' GUID key.
There may be several thousand sessions a day.
PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things like only writing to the table on session expiration etc., but please keep answers SQL/index related.
In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.
Some notes:
If you create the GUID index as non-clustered, the index will be small and probably be cached in memory. By default most databases cluster on primary key.
A GUID column is larger than an integer. But this is hardly a big issue nowadays. And you need a GUID for the application.
An index on a GUID is just like an index on a string, for example Last Name. That works efficiently.
The B-tree of an index on a GUID is harder to balance than an index on an identity column. (But not harder than an index on Last Name.) This effect can be countered by starting with a low fill factor, and reorganizing the index in a weekly job. This is a micro-optimization for databases that handle a million inserts an hour or more.
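As a rough sketch of that setup, here is a Python example with SQLite standing in for SQL Server (an assumption for runnability): the integer surrogate key is the table's storage key, the random GUID gets its own unique secondary index, and every update finds the row through that index. The table, column and function names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE session (
        id          INTEGER PRIMARY KEY,        -- surrogate key, never sent out
        session_key TEXT    NOT NULL,           -- random GUID stored in the cookie
        page_views  INTEGER NOT NULL DEFAULT 0
    );
    -- Secondary (non-clustered) unique index used for every lookup by GUID.
    CREATE UNIQUE INDEX ix_session_key ON session (session_key);
""")

def start_session(conn):
    """Create a session row and return the unguessable key for the cookie."""
    key = str(uuid.uuid4())
    conn.execute("INSERT INTO session (session_key) VALUES (?)", (key,))
    return key

def record_page_view(conn, key):
    """Update the session found via the GUID index; False if the key is unknown."""
    cur = conn.execute(
        "UPDATE session SET page_views = page_views + 1 WHERE session_key = ?",
        (key,),
    )
    return cur.rowcount == 1

key = start_session(conn)
first_hit = record_page_view(conn, key)
second_hit = record_page_view(conn, key)
unknown_hit = record_page_view(conn, str(uuid.uuid4()))
views = conn.execute(
    "SELECT page_views FROM session WHERE session_key = ?", (key,)
).fetchone()[0]
```

At a few thousand sessions a day, a lookup like this is a handful of index page reads, so an application-side dictionary buys very little.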
Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.
Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId
"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."
Declare your table as
create table MyTable(
id uniqueidentifier default newsequentialid() not null primary key clustered
)
However, keep in mind, as Andomar noted, that the sequentiality of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make this better than applying the same techniques to sequential integer keys.
Like the other authors I seriously doubt that the overheads of using straight newid() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing roundtrips to your DB than on implementing custom caching scenarios such as the dictionary you propose.
If I understand what you're asking, you're worrying that indexing and looking up your users by their hashed GUID might slow your application down? I'm with Andomar, this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. Only on something like a logging table might that happen, and then only for complicated indices.
More importantly, did you profile it first? You don't have to guess why your program is slow, you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either A) never used or B) already fast enough.