Database ETL Design Question


A dataset I receive for routine refresh purposes contains a date field that's actually VARCHAR.
As this will be an indexed/searched field, I'm left with...
1) Converting the field to DATETIME and validating and normalizing the data values when refreshing
or...
2) Leaving the data as-is and forming my queries to accommodate various valid date formats, i.e.,
WHERE DateField = 'CCYYMMDD' OR DateField = 'MM/DD/CCYY' OR ....
The refresh would be on a monthly basis; "cleaning" the data would add about 35% time to the ETL cycle. My queries on the date field would all be equalities; I do not need to range search.
Also, I'm a one-man shop, so the more hands-off the overall solution, the better.
So which scenario am I better off doing? All opinions appreciated.

I think this is a great question. Here's my opinion:
I'm a big believer in the idea that in the long run you'll save more time and have fewer headaches by using data types for the purpose for which they were intended. That means dates in date fields, characters in character fields, etc. If you go with option 2 you'll need to remember to code for all the various possible date formats every time you query the table. If you set this down and come back a year from now, are you going to remember?
By contrast, if you use a date field and do the upfront work in the ETL process of dealing with the dates properly, you will always know just how to interact with the field. And I'm not even going into performance implications.
And in this case, I'm not sure you'll even see a short-term benefit. If there are, for example 5 different possible date formats in the source data, you'll need to account for those one way or another; either in the ETL or in the output queries. The code to transform those 5 formats in ETL is not materially more complicated than the code to manage those 5 formats in the output queries.
And if the data could literally arrive in an infinite number of formats, you have big problems either way. Either your ETL will break or your queries will break. It is, to a certain extent, an irreducible complexity.
I would suggest that you take the time to code the proper transforms into your ETL. But do yourself a favor and code a preprocessing step that identifies dates in formats that won't properly transform and alerts you to them. If you see patterns, i.e., if any format shows up more than once, code a transform for it. Over time you'll be left manually cleaning fewer and fewer of those nasty dates. With luck, your 35% will drop to 5% or less.
Good luck!
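As a rough illustration of the preprocessing step suggested above, here is a minimal Python sketch. The format list and the function names (`KNOWN_FORMATS`, `normalize_date`, `split_clean_and_rejects`) are assumptions made up for the example, not anything from the original poster's ETL. It tries each known format in turn and sets aside anything that won't transform so it can be reviewed.

```python
from datetime import date, datetime

# Hypothetical list of accepted source formats; extend it as new patterns appear.
KNOWN_FORMATS = ("%Y%m%d", "%m/%d/%Y", "%Y-%m-%d")


def normalize_date(raw):
    """Return a date for the first matching format, or None if none match."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def split_clean_and_rejects(raw_values):
    """Separate values that transform cleanly from those needing attention."""
    clean, rejects = [], []
    for raw in raw_values:
        parsed = normalize_date(raw)
        if parsed is None:
            rejects.append(raw)
        else:
            clean.append(parsed)
    return clean, rejects


clean, rejects = split_clean_and_rejects(
    ["20240131", "01/31/2024", "2024-02-30", "31 Jan 24"]
)
```

The `rejects` list is what you would alert on; a value that keeps turning up there is a candidate for a new entry in the format list.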

You are better off cleaning the data. First, dates that are not valid dates are meaningless, so it's pointless to store them. Second, it is harder to fix a bad datatype choice later than it is to never make it. Querying will be not only easier but also faster than with a varchar, and things like ordering and date functions will work correctly. Third, I can't imagine that cleaning this would add that much to your import; I clean data all the time without it being a problem. But if it does, clean the data in a staging table that no other process is using (so you aren't affecting users on prod), and then load the prod tables from the clean data.
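The staging-table idea might look something like the sketch below. It uses Python's built-in sqlite3 module purely so the example is runnable; the table and column names (`staging_orders`, `orders`, `raw_date`, `order_date`) are hypothetical, and SQLite has no true date type, so the clean value is stored as an ISO-8601 string here. On a real server the production column would be a proper date type.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_orders (id INTEGER, raw_date TEXT);      -- raw feed lands here
    CREATE TABLE orders (id INTEGER, order_date TEXT NOT NULL);   -- "prod" table
    """
)
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [(1, "20240131"), (2, "not a date"), (3, "20240215")],
)


def to_iso(raw):
    """Convert a CCYYMMDD string to ISO format, or None if it is not a valid date."""
    try:
        return datetime.strptime(raw, "%Y%m%d").date().isoformat()
    except ValueError:
        return None


# Clean inside staging first; only rows that validate are copied to prod.
staged = conn.execute("SELECT id, raw_date FROM staging_orders ORDER BY id").fetchall()
good = [(row_id, to_iso(raw)) for row_id, raw in staged if to_iso(raw) is not None]
conn.executemany("INSERT INTO orders VALUES (?, ?)", good)
conn.commit()

loaded = conn.execute("SELECT id, order_date FROM orders ORDER BY id").fetchall()
```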

Clean the data up front and store the dates as dates.
I work with systems that store dates as strings, and there appears to be an unlimited number of ways to store the dates. This makes it very difficult to create a query that will work against a future new date format.
If you store dates as strings then you should apply constraints to make sure the data is stored in the proper format. Or, just convert the date strings to dates and let the database apply the valid date constraint itself. It is usually best to let the database do the work for you.

You are definitely better off cleaning the data and loading it into a date column, as this will ensure data integrity.

Related

What is the advantage of using a date dimension table over directly storing a date?

I have a need to store a fairly large history of data. I have been researching the best ways to store such an archive. It seems that a data warehouse approach is what I need to tackle. It seems highly recommended to use a date dimension table rather than a date itself. Can anyone please explain to me why a separate table would be better? I don't have a need to summarize any of the data, just access it quickly and efficiently for any given day in the past. I'm sure I'm missing something, but I just can't see how storing the dates in a separate table is any better than just storing a date in my archive.
I have found these enlightening posts, but nothing that quite answers my question.
What should I have in mind when building OLAP solution from scratch?
Date Table/Dimension Querying and Indexes
What is the best way to store historical data in SQL Server 2005/2008?
How to create history fact table?

Well, one advantage is that as a dimension you can store many other attributes of the date in that other table - is it a holiday, is it a weekday, what fiscal quarter is it in, what is the UTC offset for a specific (or multiple) time zone(s), etc. etc. Some of those you could calculate at runtime, but in a lot of cases it's better (or only possible) to pre-calculate.
Another is that if you just store the DATE in the table, you only have one option for indicating a missing date (NULL) or you need to start making up meaningless token dates like 1900-01-01 to mean one thing (missing because you don't know) and 1899-12-31 to mean another (missing because the task is still running, the person is still alive, etc). If you use a dimension, you can have multiple rows that represent specific reasons why the DATE is unknown/missing, without any "magic" values.
Personally, I would prefer to just store a DATE, because it is smaller than an INT (!) and it keeps all kinds of date-related properties, the ability to perform date math etc. If the reason the date is missing is important, I could always add a column to the table to indicate that. But I am answering with someone else's data warehousing hat on.
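To make the dimension idea a little more concrete, here is a small Python sketch that generates date dimension rows with a few precomputed attributes, plus special rows for the different "missing" reasons. All names (`build_date_dimension`, `date_key`, `UNKNOWN_ROWS`, the negative keys) are assumptions for illustration; a real warehouse would add whatever attributes it needs, such as holidays or fiscal periods.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Build one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append(
            {
                "date_key": current.year * 10000 + current.month * 100 + current.day,
                "full_date": current,
                "day_of_week": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
                "is_weekend": current.isoweekday() >= 6,
                "calendar_quarter": (current.month - 1) // 3 + 1,
            }
        )
        current += timedelta(days=1)
    return rows


# Hypothetical special members: distinct reasons for a missing date,
# instead of NULL or "magic" token dates in the fact table.
UNKNOWN_ROWS = [
    {"date_key": -1, "full_date": None, "reason": "not recorded"},
    {"date_key": -2, "full_date": None, "reason": "has not happened yet"},
]

dim = build_date_dimension(date(2024, 1, 1), date(2024, 1, 7))
```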

Let's say you've got a thousand entries per day for the last year. If you have a date dimension, your query grabs the date in the date dimension and then uses the join to collect the one thousand entries you're interested in. If there's no date dimension, your query reads all 365 thousand rows to find the one thousand you want. Quicker, more efficient.

Computed datetime for select - split or concatenate?

I'm designing the database layout for an app which will make heavy use of time-based queries. I'm trying to figure out which would be the optimal choice for DB layout. I control the insert/update process, but the data will be feeding back into various Excel spreadsheets, and the consumers will be varied enough that it's not really realistic to expect to be able to do much on the client side. Any of the fields could be used for either WHEREs or ORDER BYs. My options are:
1) A stored datetime field, and separate calculated date / time fields
2) Stored time and date fields, and a calculated datetime field
3) No calculated fields - store all 3 fields separately at INSERT/UPDATE time
It seems more sensible to calculate a field rather than duplicate data and risk inconsistencies, which leaves me to decide whether to split a datetime, or concatenate separate fields to get the desired calculated field.
My gut tells me that concatenating should be more efficient than splitting, but is there really much in it?

Option 1, storing a datetime field and calculating the date and time parts from it, should be fine.
I would always prefer having less data stored physically.
If that computation turns out to be complex enough to slow you down, however, you might consider making it a PERSISTED COMPUTED COLUMN, which is in a way a compromise between Option 1 and Option 3 (except that you do not have to insert it manually).
See Point 3 in this link for more information on persisted computed columns.
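Outside the database, the two directions being weighed (splitting a stored datetime versus combining stored parts) can be sketched in a few lines of Python. This is only an illustration of the round trip, with a made-up sample value; it says nothing about the relative cost inside any particular DBMS.

```python
from datetime import date, datetime, time

stored = datetime(2024, 3, 15, 14, 30, 0)

# Option 1: store the datetime, derive the parts.
date_part = stored.date()
time_part = stored.time()

# Option 2: store the parts, derive the datetime.
recombined = datetime.combine(date_part, time_part)
```

Either direction is lossless, so the choice mostly comes down to which value you want to be the single stored source of truth.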

database - date - multiple columns or one?

I'm working on a database, and can see that the table was set up with multiple columns (day,month,year) as opposed to one date column.
I'm thinking I should convert that to one, but wanted to check if there's much point to it.
I'm rewriting the site, so I'm updating the code that deals with it anyway, but I'm curious if there is any advantage to having it that way?
The only thing it gets used for is to compare data, where all columns get compared, and I think that an integer comparison might be faster than a date comparison.

Consolidate them to a single column - an index on a single date will be more compact (and therefore more efficient) than the compound index on 3 ints. You'll also benefit from type safety and date-related functions provided by the DBMS.
Even if you want to query on month of year or day of month (which doesn't seem to be the case, judging by your description), there is no need to keep them separate - simply create the appropriate computed columns and index them.

The date column makes sense for temporal data because it is fit for purpose.
However, if you have a specific use-case where you are more often comparing month-to-month data instead of using the full date, then there is a little bit of advantage - as you mentioned - int columns are much leaner to store into index pages and faster to match.
The downsides are that with 3 separate int columns, validation of dates is pretty much a front-end affair without resorting to additional coding on the SQL Server side.
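The validation point is easy to see in application code: three free-standing integers will happily hold February 30th, while a real date type refuses it. A minimal Python sketch, where `to_date` is a hypothetical helper you might use when migrating the three columns:

```python
from datetime import date


def to_date(year, month, day):
    """Combine separate year/month/day integers, or return None if they are not a real date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


valid = to_date(2024, 2, 29)     # leap day exists in 2024
invalid = to_date(2023, 2, 29)   # no such day in 2023
```

Rows that come back as `None` during a migration are the ones the three-column layout allowed in without anyone noticing.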

Normally, a single date field is ideal, as it allows for more efficient comparison, validity-checks at a low level, and database-side date-math functions.
The only significant advantage of separating the components is when a day or month first search (comparison) is frequently needed. Maybe an "other events that happened on this day" sort of thing. Or a monthly budgeting application or something.
(Even then, a proper date field could probably be made to work efficiently with proper indexing.)

Yes, I would suggest you replace the 3 columns with a single column that contains the date in Julian which is a floating point number. The part before the dot gives the day, the part after the dot gives the time within the day. Calculations will be easy and you can also easily convert Julian back into month/day/year etc. I believe that MS Excel stores dates internally as a floating point number so you will be in good company.
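For reference, a day-plus-fraction encoding of this kind could be sketched as follows in Python. It uses the astronomical Julian Date (days since noon on a fixed epoch, with the time of day as the fractional part); the function names and the `JD_OFFSET` constant name are assumptions for the example. Note that Excel's serial numbers follow the same whole-plus-fraction idea but count from a different starting date, so the two are not interchangeable.

```python
from datetime import datetime, timedelta

# Julian Date at midnight corresponding to proleptic Gregorian ordinal 0
# (Python's ordinal 1 is 0001-01-01).
JD_OFFSET = 1721424.5


def to_julian_date(dt):
    """Encode a datetime as a float: whole days plus fraction of the day."""
    day_fraction = (dt.hour * 3600 + dt.minute * 60 + dt.second) / 86400
    return dt.toordinal() + JD_OFFSET + day_fraction


def from_julian_date(jd):
    """Decode the float back to a datetime, rounded to the nearest second."""
    days = jd - JD_OFFSET
    whole_days = int(days)
    seconds = round((days - whole_days) * 86400)
    return datetime.fromordinal(whole_days) + timedelta(seconds=seconds)


noon = to_julian_date(datetime(2000, 1, 1, 12, 0, 0))
round_trip = from_julian_date(to_julian_date(datetime(2024, 3, 15, 14, 30, 15)))
```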

Is it ok to store all types of values as image data type in sql?

Is it ok to store data of different data types as a single universal data type, say 'image' in SQL? I will also store the data type value in another column and use this value inside my code to convert the data back into its proper type.
The advantage I get by doing this - I can avoid joining n number of tables.
Can someone point me to the downsides of storing data in this way?

This is a pretty bad idea for several reasons:
Query performance will be bad because the optimizer can't utilize indexes, foreign keys etc. because they don't exist.
You don't get referential integrity and thus can make no assumption about your data.
You introduce a new additional point of failure, because what if the data doesn't correspond to the specified data type?
The person having to maintain the code after you will hate you.

I think you would be ill-advised to take this approach, for at least 2 reasons:
You would, as you note, constantly be type casting on input and output. I can't see this as a useful operation, though I can see it as a time-consuming one.
You would exchange the (modest) difficulties of joining N tables for the (much less modest) difficulty of making N joins on your one mother-of-all-tables.
And then there is the more philosophical argument, along the lines that you are proposing to use a multi-tool (SQL) as if it were a hammer. Not every data type is always a nail. You will be much more productive, and I would assert enjoy your work more, if you work with rather than against the nature of your tools.
And I agree with what Daniel Hilgarth has already written.

Integer representation of a date

In a recent project, we had an issue with the performance of a few queries that relied heavily on ordering the results by a datetime field (MSSQL 2008 database).
When we executed the queries with ORDER BY RecordDate DESC (or ASC) the queries executed 10x slower than without that. Ordering by any other field didn't produce such slow results.
We tried all the indexing options, used the tuning wizard, nothing really made any difference.
One of the suggested solutions was converting the datetime field to an integer field representing the number of seconds or milliseconds in that datetime field. It would be calculated by a simple algorithm, something like "get me the number of seconds from 1980-01-01 to RecordDate". This value would be stored at insertion, and all the sorting would be done on the integer field, and not on the datetime field.
We never tried it, but I'm curious what do you guys think?
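For anyone wondering what the "seconds since 1980-01-01" value would look like, here is a minimal Python sketch of the calculation. The names `EPOCH` and `to_sort_key` are made up for the example, and it assumes naive datetimes all in the same time zone.

```python
from datetime import datetime

EPOCH = datetime(1980, 1, 1)  # the reference date mentioned in the question


def to_sort_key(record_date):
    """Whole seconds elapsed between the epoch and record_date."""
    return int((record_date - EPOCH).total_seconds())


dates = [datetime(2008, 6, 1, 9, 0), datetime(1999, 12, 31, 23, 59), datetime(2008, 6, 1, 8, 59)]
keys = [to_sort_key(d) for d in dates]
```

Sorting by the integer key gives the same order as sorting by the datetime itself, which is the property the suggestion relies on.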

I always store dates as ints, using the standardised unix timestamp as most languages I program in use that as a default date-time representation. Obviously, this makes sorting on a date much more efficient.
So, yes, I recommend it :)

I think basically that's how the SQL datetime datatype is stored behind the scenes in SQL Server, so I'd be surprised about these results.
Can you replicate the slowness in Northwind or Pubs? If so, it might be worth a call to MS, as it shouldn't be 10x slower. If not, then there may be something odd about your table.
If you are using SQL 2008 and you only need to store dates (not the time portion) you could try using the new date datatype. This has less precision and so should be quicker to sort.

Are the inserts coming from .NET code?
You could store the DateTime.Ticks value in a bigint column on the DB and index on that.
In terms of updating your existing Database, it should be relatively trivial to write a CLR Function for converting existing DateTimes to TickCount along the lines of
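ALTER TABLE dbo.MyTable ADD TickCount bigint NULL;
GO
-- CLRFunction is a placeholder for your own CLR scalar function (scalar UDFs need the schema prefix)
UPDATE dbo.MyTable SET TickCount = dbo.CLRFunction(DateTimeColumn);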
It's definitely feasible and would dramatically improve your sorting ability.
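For context, a .NET tick is 100 nanoseconds, counted from midnight on 0001-01-01. A rough Python equivalent of that calculation is sketched below (`TICKS_EPOCH` and `to_ticks` are hypothetical names); it only shows what number would land in the bigint column and is not the CLR function itself.

```python
from datetime import datetime, timedelta

TICKS_EPOCH = datetime(1, 1, 1)  # DateTime.Ticks counts from 0001-01-01 00:00:00


def to_ticks(dt):
    """Number of 100-nanosecond intervals since 0001-01-01, as an integer."""
    microseconds = (dt - TICKS_EPOCH) // timedelta(microseconds=1)
    return microseconds * 10


unix_epoch_ticks = to_ticks(datetime(1970, 1, 1))
```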

Aren't datetimes stored as a number already?

Do you actually need the DateTime or, more specifically, the 'time' part? If not, I would investigate storing the date either as the integer or string representation of an ISO date format (YYYYMMDD) and see if this gives you the required performance boost. Storing ticks/time_t values etc. would give you the ability to store the time as well, but I wouldn't really bother with this unless you really need the time component. Plus, the added value of storing a human-readable date is that it is somewhat easier to debug data-related problems, simply because you can read and understand the data your program is operating on.
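A quick Python sketch of the YYYYMMDD integer encoding, in case it helps to see it: the helper names are invented for the example. The useful property is that the integers sort in the same order as the dates and stay readable when you look at the raw data.

```python
from datetime import date


def to_yyyymmdd(d):
    """Encode a date as an integer such as 20240315."""
    return d.year * 10000 + d.month * 100 + d.day


def from_yyyymmdd(key):
    """Decode the integer back to a date (raises ValueError for impossible dates)."""
    year, rest = divmod(key, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


key = to_yyyymmdd(date(2024, 3, 15))
```

One thing the integer does not give you is validation: 20240230 is a perfectly good integer, so the check has to happen wherever the value is written.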

The only sensible way to store dates is as Julian days - unix timestamps are way too short in scope.
By sensible I mean really in the code - it's generally (but not always) better to store dates in the database as datetime.
The database problem that you are experiencing sounds like a different problem. I doubt that changing the field type is going to make a massive difference.
It is hard to be specific without seeing detailed information such as the queries, the amount of records etc, but general advice would be to restructure the order and method of the query to reduce the number of records being ordered - as that can impact massively on performance.

I don't really understand why indexing doesn't help, if SQL behind the covers stores the date as integer representation.
Sorting by the ID columns produces excellent results, or by any other indexed field.

I vote indexing. As I said in the comments above, your dates are stored as two ints behind the scenes anyway (in SQL 2000, at least). I can't see this making a difference. Hard to say what the real problem is w/o more info, but my gut feeling is that this isn't the problem. If you have a dev environment (and you should :) ), try making the int field there and running the raw queries. It shouldn't be difficult to do, and you'll have conclusive results on that idea.

Is your RecordDate one of the fields in the WHERE clause? Also, is RecordDate your only ORDER BY criteria? Thirdly, is your Query a multi-table join or a single table query? If you are not SELECTING on RecordDate, and using it as the ORDER BY criteria, this could be the cause of the performance issue, as the indexes would not really contribute to the sort in this case. The indexes would try to solve the join issues, and then the sort would happen afterwards.
If this is the case, then changing the data-type of your RecordDate may not help you much, as you are still applying a sort on a recordset after the fact.
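One way to check whether an index is actually serving the ORDER BY is to look at the query plan before and after creating it. The sketch below does this with Python's built-in sqlite3 module, only because it is self-contained and runnable; the table and index names are hypothetical, and SQL Server's optimizer and plan output are different, so treat it as an illustration of the idea rather than a reproduction of the original problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Records (Id INTEGER PRIMARY KEY, Category TEXT, RecordDate TEXT)"
)


def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the description


query = "SELECT Id, Category, RecordDate FROM Records ORDER BY RecordDate DESC"

plan_without_index = plan(query)  # expect a separate sort step here
conn.execute("CREATE INDEX IX_Records_RecordDate ON Records (RecordDate)")
plan_with_index = plan(query)     # expect the index to supply the order
```

In SQL Server the equivalent check would be to compare the actual execution plans and look for an explicit Sort operator.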

I've seen a BI database where the dates are stored as an integer in YYYYMMDD format. A separate table is used to relate these ints to the equivalent datetime, formatted string, year number, quarter number, month number, day of week, holiday status, etc. All you have to do is join to that table to get anything date related that you need. Very handy.

I would advise you to use a Julian date as used in Excel. All financial applications are using this representation to gain performance and it provides a relatively good range of values.

SELECT CAST(REPLACE(CONVERT(varchar(10), GETDATE(), 102), '.', '') AS int)
-- works quite well (and quickly!).

I believe the datetime is physically stored as float so the improvement would be the same as when converting float to INT.
I would rather use indexes, as that is what they are designed for, and the datetime type is designed for storing dates with times. There is a set of functions associated with datetime, so if you decide to use a custom storage type you will need to take care of that yourself.
