Fact/Dim Table Time Value - sql

I'm setting up Fact and Dim tables and trying to figure out the best way to set up my time values. AdventureWorksDW uses a TimeKey (UID) for each time entry in the DimTime table. I'm wondering if there's any reason I shouldn't just use a time value instead, i.e. 0106090800 (my granularity is hourly)?

"Intelligent keys" (in this case, a coded date and hour number) can lead to problems when you want to change definitions in your dimension. For example, your users might insist on a change from local time to UTC. Now your key is no longer actually a useful number, it's the old value in the dimension.
Further, with a midnight roll-over issue, the date part of your intelligent key might not match the actual date of the UTC vs. local time change.
To prevent the key from becoming a problem, you can't use it for any calculation of any kind. In which case, it's little better than a simple GUID or auto-increment number.
Auto-increment keys (or GUIDS) are fast and simple. Most important, they are trivially consistent across all dimensions.
Time happens to have a numeric mapping, but it helps to look at this as a weird coincidence, not a basis for a good design.

Here's Ralph Kimball's latest on time dimension. It's dated 2004, but it's still good.
This one will help, too.

The primary key should be surrogate, meaningless -- however, using YYYYMMDD for the date dimension key is hard to resist, and it also allows for easy table partitioning. The trick is that it should still be regarded as meaningless -- the fact that it looks like a date should be regarded as purely coincidental. This key should never be exposed to business users.
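For example, a minimal T-SQL sketch of that compromise (the table and column names, such as DimDate, DateKey and FactSales, are illustrative, not from the original question):

CREATE TABLE dbo.DimDate
(
    DateKey      int      NOT NULL PRIMARY KEY,  -- e.g. 20240615; handy for partitioning, but treated as meaningless
    FullDate     date     NOT NULL,
    [Year]       smallint NOT NULL,
    [Month]      tinyint  NOT NULL,
    DayOfMonth   tinyint  NOT NULL
);

CREATE TABLE dbo.FactSales
(
    DateKey      int           NOT NULL REFERENCES dbo.DimDate (DateKey),  -- join on the surrogate key only
    HourOfDay    tinyint       NOT NULL,                                   -- hourly grain kept as an attribute, not coded into the key
    SalesAmount  decimal(18,2) NOT NULL
);
-- All date logic goes through DimDate's attribute columns; never do arithmetic on DateKey itself.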

Related

Bigtable: Avoiding hotspotting when using timestamps on row keys

Cloud Bigtable docs on schema design for time series say:
In the vast majority of cases, time-series queries are accessing a given dataset for a given time period. Therefore, make sure that all of the data for a given time period is stored in contiguous rows, unless doing so would cause hotspotting.
Additionally, here's what they recommend to avoid hotspotting:
If you're storing a cell phone's battery status, and your row key consists of the word "BATTERY" plus a timestamp, the row key will always increase in sequence. Because Cloud Bigtable stores adjacent row keys on the same server node, all writes will focus only on one node until that node is full, at which point writes will move to the next node in the cluster.
Field promotion is suggested:
Move fields from the column data into the row key to make writes non-contiguous.
For example:
BATTERY#20150301124501001 --> BATTERY#Corrie#20150301124501001
Questions:
Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
On the other hand, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? I don't think so, right?
Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
That depends what your query looks like. For example, if you want to query Corrie's battery status from T1 to T2, you can construct a row range easily: [BATTERY#Corrie#T1, BATTERY#Corrie#T2]. However, if you want to query the battery status of all the users, then all the rows with prefix BATTERY will be scanned.
So, the most important queries you have should dictate which fields you promote to the row key. Also, fields with high cardinality help more when promoted to row key, as they distribute load to a larger number of tablets.
On the other hand, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? I don't think so, right?
I am not entirely sure what you mean by "query a range only by the timestamp"; can you provide an example?
A lot will depend on what "TIMESTAMP" means. If you always want to query for last 10 minutes, then all of your queries will go to a single server at any given time and you will experience hotspotting.
Another thing to keep in mind is that if you don't design the row key properly, writes will encounter hotspotting and you will not get good write throughput. It's recommended to design row keys to avoid hotspotting.

Is using a timestamp as a hash key on a GSI in DynamoDB a good approach

I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values?
* Possible hash key problems with numeric values that don't vary much in the most significant digits.
This is less of a problem than you might think. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that the index_due_at might actually be better as a GSI sort key rather than hash key. Data is sorted within a partition by sort key. So you could have a GSI hash key of index_due_at_flag which would be Y if present, then a sort key of index_due_at. This would mean all your data would be sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items and try and process them all. In which case you would never even use the keys. Just having a key attribute present would put the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore, if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for using a DynamoDB sparse index.
Use 'index_due_at' as the sort key in the GSI; only the items you are interested in will be in the index, greatly reducing the space needed and improving performance.

Why do database designers not make IDENTITY columns start from the min value rather than 1?

As we know, in SQL Server, IDENTITY(n, m) means that the values will start from n with an increment of m, but I noticed that all database designers define identity columns as IDENTITY(1,1), without taking advantage of all the values of the int data type, which range from -2,147,483,648 to 2,147,483,647.
I am planning to make all identity columns IDENTITY(-2147483648, 1) (the identity columns are hidden from the application user).
Is that a good idea?
If you find that 2 billion values isn't enough, you're going to find that 4 billion isn't enough either (needing more than twice as many of anything over the lifetime of a project as it was first designed for is hardly rare*), so you need to take a different approach entirely (possibly long values, possibly something totally different).
Otherwise you're just being strange and unreadable for no gain.
Also, who doesn't have a database where they know that e.g. item 312 is the one with some nice characteristics for testing particular things? I know I have some arbitrary ids burned in my head. They may call it "so good they named it twice", but I'll always know New York as "city 657, covers most of our test cases". It's only a shorthand, but -2147482991 wouldn't be as handy.
*To add a bit to that: with some things you might say "ah, about 100" and find it's actually 110; okay. With others you'll find it's actually 100,000, and you were off by orders of magnitude. The higher the number, the more often the mistake is of this sort, because the sort of problems that end up with estimates in the billions is different from the sort that end up with answers in the dozens. If you estimate 200 is your max in a given case, you should probably leave room for maybe a few hundred more. If you estimate 2 billion in a given case, you should probably leave room for a few quadrillion more. That said, the only time I saw someone actually start an id at minus 2 billion, they ended up having about 3,000 rows.
If you have a class that represents your table in your code (which is very likely), every time you create a new object it will be assigned the ID 0 by default. That could lead to mistakes that overwrite data in the database if the ID 0 is already assigned. It also makes it easy to determine whether an object is new or came from the database by just doing if (myObject.ID != 0)
On the SQL Server side, negative IDs are OK and handled like positive numbers, so you could do that.
The others are right, you should think about a different approach, but the major problem is the applications connected to your database.
Let's take the MS environment. Here is an example:
The .NET DataSet uses negative IDs on auto-incremented IDs to track changes in code, so you may run into trouble, because:
The negative keys are used for temporary instances for the rows.
Here is the reference: MSDN
So it is definitely not a good idea to design a database like this for MSSQL in an MS environment.
One of the 'nice' side effects of working with integers close to zero is that they are easy on the eye and easy for devs, testers etc to remember, especially with debugging and unit testing in mind.
Also, surrogate keys have a nasty habit of creeping into business terminology, e.g. users may be able to see the PK sitting in the URL querystring at the top of the browser - the more digits in the number, the more likely they are to misquote something in a helpdesk query.
So this is one of the reasons why I'm quite happy to seed my identities at 1, and not at -2147483231, and instead, as @Jon suggests, move up to a BIGINT any time I may ever need more than 2 billion rows in my table.
As you transition from the negative numbers to positive ids, you will cross zero. That means (assuming you are actually inserting a couple of billion rows) that you will eventually have an identifier of zero. This is not intrinsically bad, but could present a potential edge case for ORM tools or simply sloppy application code that has difficulty differentiating between a zero and a null.
Negative IDs can be useful for testing in a live environment where dummy data needs to be mixed with real data for testing purposes, then disposed of once the testing is complete. This should be done only with very good reason - I've only used the technique once.
Negative IDs can also be useful for Administrator purposes in single-row, read-only tables (i.e., no transactions, no executable SQL run on the table).
Aside from those specific purposes, identity values <= 0 will generate more heat than light.
And just to add, according to MS Docs (https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/implementing-identity-in-a-memory-optimized-table?view=sql-server-2014), IDENTITY(1, 1) is supported on a memory-optimized table. However, identity columns with definition of IDENTITY(x, y) where x != 1 or y != 1 are NOT supported on memory-optimized tables.
So in my opinion, and the other reasons alluded to by the other users, IDENTITY(1, 1) is more practical and memory efficient for memory-optimized tables.
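To illustrate that restriction with a minimal sketch (SQL Server 2014+ syntax; the table and its columns are made up for the example):

CREATE TABLE dbo.SessionLog
(
    SessionId  int IDENTITY(1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,  -- the only identity form allowed here
    StartedAt  datetime2          NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
-- Per the linked doc, defining the column as IDENTITY(-2147483648, 1) instead would be rejected on a memory-optimized table.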
Start with INT IDENTITY(1, 1); then, when you have maxed out, you can follow these steps to 'upgrade' to BIGINT:
Drop current PK constraint:
ALTER TABLE [dbo].[tbName] DROP CONSTRAINT [PK_tbName_Id]
GO
Upgrade the PK column to BIGINT:
ALTER TABLE [dbo].[tbName] ALTER COLUMN [Id] BIGINT NOT NULL
GO
Recreate the PK Constraint:
ALTER TABLE [dbo].[tbName] ADD CONSTRAINT [PK_tbName_Id] PRIMARY KEY
CLUSTERED
(
[Id] ASC
)
Hope this helps, and good luck!

Is it preferred to use end-time or duration for events in sql? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
My gut tells me that start time and end time would be better than start time and duration in general, but I'm wondering if there are some concrete advantages or disadvantages to the differing methods.
The advantage I see for start time and end time is that if you want to find all events active during a certain time period, you don't have to look outside that time period.
(this is for events that are not likely to change much after initial input and are tied to a specific time, if that makes a difference)
I do not see it as a preference or a personal choice. Computer Science is, well, a science, and we are programming machinery, not a sensitive child.
Re-inventing the Wheel
Entire books have been written on the subject of Temporal Data in Relational Databases, by giants of the industry. Codd has passed on, but his colleague and co-author C J Date, and more recently H Darwen, carry on the work of progressing and refining the Relational Model in The Third Manifesto. The seminal book on the subject is Temporal Data & the Relational Model by C J Date, Hugh Darwen, and Nikos A Lorentzos.
There are many who post opinions and personal choices re CS subjects as if they were choosing ice cream. This is due to not having had any formal training, and thus treating their CS task as if they were the only person on the planet who had come across that problem, and found a solution. Basically they re-invent the wheel from scratch, as if there were no other wheels in existence. A lot of time and effort can be saved by reading technical material (that excludes Wikipedia and MS publications).
Buy a Modern Wheel
Temporal Data has been a problem that has been worked with by thousands of data modellers following the RM and trying to implement good solutions. Some of them are good and others not. But now we have the work of giants, seriously researched, and with solutions and prescribed treatment provided. As before, these will eventually be implemented in the SQL Standard. PostgreSQL already has a couple of the required functions (the authors are part of TTM).
Therefore we can take those solutions and prescriptions, which will be (a) future-proofed and (b) reliable (unlike the thousands of not-so-good Temporal databases that currently exist), rather than relying on either personal opinion, or popular votes on some web-site. Needless to say, the code will be much easier as well.
Inspect Before Purchase
If you do some googling, beware that there are also really bad "books" available. These are published under the banner of MS and Oracle, by PhDs who spend their lives at the ice cream parlour. Because they did not read and understand the textbooks, they have a shallow understanding of the problem, and invent quite incorrect "solutions". Then they proceed to provide massive solutions, not to Temporal data, but to the massive problems inherent in their "solutions". You will be locked into problems that have been identified and solved, and into implementing triggers and all sorts of unnecessary code. Anything available free is worth exactly the price you paid for it.
Temporal Data
So I will try to simplify the Temporal problem, and paraphrase the guidance from the textbook, for the scope of your question. Simple rules, taking both Normalisation and Temporal requirements into account, as well as usage that you have not foreseen.
First and foremost, use the correct Datatype for any kind of Temporal column. That means DATETIME or SMALLDATETIME, depending on the resolution and range that you require. Where only the DATE or TIME portion is required, you can use that. This allows you to perform date & time arithmetic using SQL functions directly in your WHERE clause.
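For instance, a minimal sketch of what that buys you (illustrative table and column names), with the range arithmetic done by SQL functions rather than string handling:

-- Events in the last 24 hours, computed directly on the DATETIME column.
SELECT EventId, EventDateTime
FROM   dbo.EventLog
WHERE  EventDateTime >= DATEADD(HOUR, -24, GETDATE())
  AND  EventDateTime <  GETDATE();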
Second, make sure that you use really clear names for the columns and variables.
There are three types of Temporal Data. It is all about categorising them properly, so that the treatment (planned and unplanned) is easy (which is why yours is a good question, and why I provide a full explanation). The advantage is much simpler SQL using inline Date/Time functions (you do not need the planned Temporal SQL functions). Always store:
Instant as SMALL/DATETIME, eg. UpdatedDtm
Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec or NumDays
There are some technicians who argue that Interval should be stored in DATETIME, regardless of the component being used, as (eg) seconds or months since midnight 01 Jan 1900, etc. That is fine, but requires more unwieldy (not complex) code both in the initial storage and whenever it is extracted.
Whatever you choose, be consistent.
Period or Duration. This is defined as the time period between two separate Instants. Storage depends on whether the Period is conjunct or disjunct.
For conjunct Periods, as in your Event requirement: use one SMALL/DATETIME for EventDateTime; the end of the Period can be derived from the beginning of the Period of the next row (see the sketch below), and EndDateTime should not be stored.
For disjunct Periods, with gaps in between, yes, you need two SMALL/DATETIMEs in the same row, eg. a RentedFrom and a RentedTo.
A Period or Duration across rows merely needs the ending Instant to be stored in some other row: ExerciseStart is the Event.DateTime of the X1 Event row, and ExerciseEnd is the Event.DateTime of the X9 Event row.
Therefore Period or Duration stored as an Interval is simply incorrect, not subject to opinion.
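A minimal sketch of deriving the end of a conjunct Period from the next row, assuming SQL Server 2012 or later for LEAD() and illustrative table/column names:

-- Each Period ends where the next row's Period begins; EndDateTime is derived, never stored.
SELECT  ProductId,
        StatusDateTime AS PeriodStart,
        LEAD(StatusDateTime) OVER (PARTITION BY ProductId
                                   ORDER BY StatusDateTime) AS PeriodEnd  -- NULL for the current, still-open Period
FROM    dbo.ProductStatus;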
Data Duplication
Separately, in a Normalised database, ie. where EndDateTime is not stored (unless disjunct, as per above), storing a datum that can be derived will introduce an Update Anomaly where there was none.
With one EndDateTime, you have one version of the truth in one place; whereas with duplicated data, you have a second version of the fact in another column:
which breaks 1NF
the two facts need to be maintained (updated) together, transactionally, and are at the risk of being out of synch
different queries could yield different results, due to two versions of the truth
All easily avoided by maintaining the science. The return (an insignificant increase in the speed of a single query) is not worth destroying the integrity of the data for.
Response to Comments
could you expand a little bit on the practical difference between conjunct and disjunct and the direct practical effect of these concepts on db design? (as I understand the difference, the exercise and temp-basal in my database are disjunct because they are distinct events separated by whitespace.. whereas basal itself would be conjunct because there's always a value)
Not quite. In your Db (as far as I understand it so far):
All the Events are Instants, not conjunct or disjunct Periods
The exceptions are Exercise and TempBasal, for which the ending Instant is stored, and therefore they have Periods, with whitespace between the Periods; thus they are disjunct.
I think you want to identify more Durations, such as ActiveInsulinPeriod and ActiveCarbPeriod, etc, but so far they only have an Event (Instant) that is causative.
I don't think you have any conjunct Periods (there may well be, but I am hard pressed to identify any). I retract what I said (when they were Readings, they looked conjunct, but we have progressed).
For a simple example of conjunct Periods, that we can work with re practical effect, please refer to this time-series question. The text and perhaps the code may be of value, so I have linked the Q/A, but I particularly want you to look at the Data Model. Ignore the three implementation options; they are irrelevant to this context.
Every Period in that database is Conjunct. A Product is always in some Status. The End-DateTime of any Period is the Start-DateTime of the next row for the Product.
It entirely depends on what you want to do with the data. As you say, you can filter by end time if you store that. On the other hand, if you want to find "all events lasting more than an hour" then the duration would be most useful.
Of course, you could always store both if necessary.
The important thing is: do you know how you're going to want to use the data?
EDIT: Just to add a little more meat, depending on the database you're using, you may wish to consider using a view: store only (say) the start time and duration, but have a view which exposes the start time, duration and computed end time. If you need to query against all three columns (whether together or separately) you'll want to check what support your database has for indexing a view column. This has the benefits of convenience and clarity, but without the downside of data redundancy (having to keep the "spare" column in sync with the other two). On the other hand, it's more complicated and requires more support from your database.
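A minimal sketch of that view approach in T-SQL (the names are illustrative; whether the view column can be indexed depends on your engine):

CREATE TABLE dbo.ScheduledEvent
(
    EventId         int      NOT NULL PRIMARY KEY,
    StartTime       datetime NOT NULL,
    DurationMinutes int      NOT NULL       -- the only stored temporal facts are start + duration
);
GO
CREATE VIEW dbo.ScheduledEventWithEnd
AS
SELECT  EventId,
        StartTime,
        DurationMinutes,
        DATEADD(MINUTE, DurationMinutes, StartTime) AS EndTime   -- derived, so it can never drift out of sync
FROM    dbo.ScheduledEvent;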
End - Start = Duration.
One could argue you could even use End and Duration, so there really is no difference in any of the combinations.
Except for the triviality that you need a column present in order to filter on it, so include (see the sketch after this list):
duration: if you need to filter by duration of execution time
start + end: if you need to trap for events that both start and end within a timeframe
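In other words (reusing the illustrative view sketched above), the stored or derived columns only need to support the filters you actually run:

DECLARE @From datetime = '2024-01-01', @To datetime = '2024-01-08';

-- Filtering by duration needs a duration (stored or computed):
SELECT EventId FROM dbo.ScheduledEventWithEnd WHERE DurationMinutes > 60;

-- Trapping events that both start and end within a timeframe needs start and end:
SELECT EventId FROM dbo.ScheduledEventWithEnd
WHERE  StartTime >= @From AND EndTime <= @To;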

SQL Db index recommendation

I am trying to see if using a custom index for a specific type of data might reduce fragmentation in my database.
[Edit: we are using MS SQL Server 2008 R2]
I have an SQL database containing timestamped measurement data. Lots of data is inserted all the time, but once inserted it practically never needs to be updated. These timestamps are, however, not unique, as several devices (around 50 of them) measure the data at the same time.
This means that every 50 rows in the table contain equal timestamp values. This data is received more or less simultaneously, although I could take additional care to ensure that rows are written as sequentially as possible (if that would help), perhaps by keeping them in memory for some time and then writing only when I get the data from all the devices for a single timestamp.
We are using NHibernate with Guid.Comb to avoid index lookups we would have with plain bigint IDs. As opposed to plain GUIDs, this should reduce fragmentation, but for so many inserts, fragmentation nevertheless happens very soon.
Since my data is timestamped, and data is inserted almost sequentially (increasing timestamps), I am wondering if there is a more clever way to create a primary key with a unique clustered index for this table. Timestamp column is basically a bigint number (.NET DateTime ticks).
I have also noticed that a non-clustered index over that same timestamp column also gets pretty fragmented. So what index strategy would you recommend to reduce heap fragmentation in this case?
Maybe take a look at this answer, HiLo looks interesting.
Also, maybe your fragmentation is not a result of the discrepancy between the ordering of the index values and the order in which they are added, but a natural file-growth effect (as explained here)?
A separate column for a key doesn't make a lot of sense for this table, since you won't be updating any of the data. I imagine you'll be doing a lot of queries though, probably based on that timestamp column.
You could try making the primary key a combination of the timestamp column and a device id column. You could try making that clustered. That should allow you to write nearly as fast as possible. If you query by device however, you may need another index on device id and timestamp (the reverse). I wouldn't make the reverse the clustered one though, as that will make the writes happen all over the table rather than on the trailing pages. And if most queries involve a date range and more than one device, clustering on timestamp first should give you the best performance.
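A minimal sketch of that layout (illustrative names; the ticks column mirrors the bigint timestamp described in the question):

CREATE TABLE dbo.Measurement
(
    MeasurementTicks bigint NOT NULL,   -- .NET DateTime ticks, nearly sequential on insert
    DeviceId         int    NOT NULL,
    MeasuredValue    float  NOT NULL,
    CONSTRAINT PK_Measurement PRIMARY KEY CLUSTERED (MeasurementTicks, DeviceId)  -- writes stay on the trailing pages
);

-- Only if per-device queries need it; non-clustered, so it does not scatter the writes.
CREATE NONCLUSTERED INDEX IX_Measurement_DeviceTime
    ON dbo.Measurement (DeviceId, MeasurementTicks);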