Computed datetime for select - split or concatenate? - sql

I'm designing the database layout for an app which will make heavy use of time-based queries. I'm trying to figure out which would be the optimal choice for DB layout. I control the insert/update process, but the data will be feeding back into various Excel spreadsheets, and the consumers will be varied enough that it's not really realistic to expect to be able to do much on the client side. Any of the fields could be used for either WHEREs or ORDER BYs. My options are:
A stored datetime field, and separate calculated date / time fields
Stored time and date fields, and a calculated datetime field
No calculated fields - store all 3 fields separately at INSERT/UPDATE time
It seems more sensible to calculate a field rather than duplicate data and risk inconsistencies, which leaves me to decide whether to split a datetime or concatenate separate fields to get the desired calculated field.
My gut tells me that concatenating should be more efficient than splitting, but is there really much in it?

Option 1 - storing a datetime field and calculating the date and time parts from it - should be fine.
I would always prefer having less data stored physically.
If that computation turns out to be complex enough to slow you down, however, you might consider making it a PERSISTED COMPUTED COLUMN, which is in a way a compromise between Option 1 and Option 3 (except that you do not have to insert the value manually).
See Point 3 in this link for more information on persisted computed columns.
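For illustration, a minimal sketch of Option 1 with the derived parts as persisted computed columns, assuming SQL Server 2008 or later for the date and time types (all names here are hypothetical):

create table Events (
    EventID int identity primary key,
    EventDateTime datetime not null,
    -- PERSISTED stores the computed values physically, so they can be indexed
    EventDate as cast(EventDateTime as date) persisted,
    EventTime as cast(EventDateTime as time) persisted
);

create index IX_Events_EventDate on Events (EventDate);

Only EventDateTime is ever written; WHEREs and ORDER BYs can then use EventDate and EventTime (and the index) directly.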

Related

Does using calculated fields in Access increase efficiency

I have a database that keeps track of attendance for students in a school. There's one table (SpecificClasses) with dates of all of the classes, and another table (Attendance) with a list of all the students in each class and their attendance on that day.
The school wants to be able to view that data in many different ways and to filter it according to many different parameters. (I won't paste the entire query here because it is quite complicated and the details are not important for my question.) One of the options they want is to view the attendance of a specific student on a certain day of the week. Meaning, they want to be able to notice if a student is missing every Tuesday or something like that.
To make the query be able to do that, I have used DatePart("w",[SpecificClasses]![Day]). However, running this on every class (when we may be talking about hundreds of classes taken by one student in one semester) is quite time-consuming. So I was thinking of storing the day of the week manually in the SpecificClasses table, or perhaps even in the Attendance table to be able to avoid making a join, and just being very careful in my events to keep this data up-to-date (meaning to fill in the info when the secretaries insert a new SpecificClass or fix the Day field).
Then I was wondering whether I could just make a calculated field that would store this value. (The school has Access 2010 so I don't have to worry about compatibility). If I create a calculated field, does Access actually store that field and remember it for the future and not have to recalculate it each time?
As HansUp mentions in his answer, a Calculated field cannot be indexed, so it might not give you much of a performance boost. However, since you are using Access 2010 you could create a "real" Integer field named [WeekdayNumber], put an index on it, and then use a Before Change data macro to fill in the Weekday() value for you. (The Weekday() function gives the same result as DatePart("w", ...).)
I was wondering whether I could just make a calculated field that would store this value.
No, not for a calculated field expression which uses DatePart(). Access supports a limited set of functions for calculated fields, and DatePart() is not one of those.
If I create a calculated field, does Access actually store that field and remember it for the future and not have to recalculate it each time?
Doesn't apply to your current case. But for a calculated field which Access would accept, yes, that is the way it works.
However, a calculated field cannot be indexed, which limits how much improvement it can offer in terms of data retrieval speed. If you encounter another situation where you can create a valid calculated field, test the performance to see whether you notice any improvement (vs. calculating the value in a query).
For your DatePart() query problem, consider creating a calendar table with a row for each date and include the weekday number as a separate indexed field. Then you could join the calendar table into your query, avoid the need to compute DatePart() again, and allow Access to use the indexed weekday number to quickly identify which rows match the weekday of interest.
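A minimal sketch of that calendar-table idea in Access SQL; the table, field, and join names are hypothetical, and Weekday() returns 3 for Tuesday under the default Sunday week start:

CREATE TABLE CalendarDates (
    CalDate DATETIME NOT NULL CONSTRAINT PK_CalendarDates PRIMARY KEY,
    WeekdayNumber INTEGER
);

CREATE INDEX idxWeekdayNumber ON CalendarDates (WeekdayNumber);

SELECT a.StudentID, sc.[Day]
FROM (Attendance AS a
INNER JOIN SpecificClasses AS sc ON a.ClassID = sc.ClassID)
INNER JOIN CalendarDates AS cd ON sc.[Day] = cd.CalDate
WHERE cd.WeekdayNumber = 3;

The CalendarDates rows are generated once (say, for a range of school years), so nothing has to call DatePart() per row at query time, and the index on WeekdayNumber does the filtering.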

database - date - multiple columns or one?

I'm working on a database, and can see that the table was set up with multiple columns (day,month,year) as opposed to one date column.
I'm thinking I should convert that to one, but wanted to check if there's much point to it.
I'm rewriting the site, so I'm updating the code that deals with it anyway, but I'm curious if there is any advantage to having it that way?
The only thing it gets used for is to compare data, where all columns get compared, and I think that an integer comparison might be faster than a date comparison.
Consolidate them to a single column - an index on a single date will be more compact (and therefore more efficient) than the compound index on 3 ints. You'll also benefit from type safety and date-related functions provided by the DBMS.
Even if you want to query on month of year or day of month (which doesn't seem to be the case, judging by your description), there is no need to keep them separate - simply create the appropriate computed columns and index them, as sketched below.
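A hedged sketch of that, assuming SQL Server (the Orders table and OrderDate column are hypothetical):

alter table Orders add OrderMonth as month(OrderDate) persisted;

create index IX_Orders_OrderMonth on Orders (OrderMonth);

Queries that filter on month of year can then seek on the index instead of recomputing month() for every row.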
The date column makes sense for temporal data because it is fit for purpose.
However, if you have a specific use-case where you are more often comparing month-to-month data instead of using the full date, then there is a little bit of an advantage: as you mentioned, int columns are leaner to store in index pages and faster to match.
The downside is that with 3 separate int columns, validating dates becomes pretty much a front-end affair, unless you resort to additional coding on the SQL Server side.
Normally, a single date field is ideal, as it allows for more efficient comparison, validity-checks at a low level, and database-side date-math functions.
The only significant advantage of separating the components is when a day-first or month-first search (comparison) is frequently needed. Maybe an "other events that happened on this day" sort of thing. Or a monthly budgeting application or something.
(Even then, a proper date field could probably be made to work efficiently with proper indexing.)
Yes, I would suggest you replace the 3 columns with a single column that contains the date as a Julian-style serial number, which is a floating-point value. The part before the decimal point gives the day; the part after it gives the time within the day. Calculations will be easy, and you can also easily convert the Julian value back into month/day/year etc. I believe MS Excel stores dates internally as a floating-point number, so you will be in good company.

Compare queries on converted columns

Certain parts of my database are required to be extremely flexible to the point that the user might decide to manipulate number and/or data types of columns in a table. The data that is already in the table though should be preserved.
That leaves me with the only option of using nvarchar(max) as the data type for any column in any of those tables.
Say the user chooses to store integers in a certain column and then wants to get all rows with that field in a certain range. Then I would have to run a compare query over the values of that column converted to int.
I am afraid that would be a performance disaster. Assuming that I am left with no other design alternatives, what can I do to improve performance in this scenario?
I can relate to this problem. An application, for instance, might be taking user input from an Excel spreadsheet and need to store it in the format the user sees. Once in the database, though, you might have other requirements on filtering and combining data.
You've solved half the problem. By storing the value in a character field, you can store what the user wants.
The second half is to also store the value in a form the database can reasonably manipulate. I would decide on a set of base types, perhaps just float and datetime, depending on the application. Then, when a user inserts a value, you can do the conversion and set the value in separate columns. Your table might look like this:
create table t (
    ColumnX_WhatTheUserSees nvarchar(max),
    ColumnX_Type char(1) not null default 'C', -- 'C'haracter, 'F'loat, 'D'atetime
    ColumnX_Float float,
    ColumnX_Datetime datetime
)
The insertion logic then goes something like this:
declare @ColX nvarchar(4000) = '2014-01-15'  -- example input; in practice this comes from the app

insert into t(ColumnX_WhatTheUserSees, ColumnX_Type, ColumnX_Float, ColumnX_Datetime)
select @ColX,
       (case when isnumeric(@ColX) = 1 then 'F'
             when isdate(@ColX) = 1 then 'D'
             else 'C'
        end),
       (case when isnumeric(@ColX) = 1 then cast(@ColX as float) end),
       (case when isdate(@ColX) = 1 then cast(@ColX as datetime) end)
The above code is meant for illustrative purposes only. You may need to handle special cases it does not cover (perhaps you think '1e5' should be a string, or you might want to treat numbers in parentheses as negatives).
You can handle the extra columns through a trigger (an INSTEAD OF insert/update trigger in SQL Server, which has no BEFORE triggers), so the user would never see the extra complexity. You can provide a view so the user sees only the "WhatTheUserSees" columns.
Finally, SQL does offer the sql_variant data type. This provides an alternative route for what you want. However, it would lose the initial user formatting (which has been important when I've encountered similar problems).
Given what you said, perhaps you could add an additional int column for each column, and a trigger that populates it as an int when the user puts one in the nvarchar(max) column. Then at least you would only have to convert the data once, rather than each time you query it. Otherwise, yes, you are stuck with the poorly performing conversion to an integer (which is problematic since you have to preserve earlier information that may not be int) in order to do any kind of ordering or mathematical calculation.

Another possibility is to have a string column and an int column (and a trigger to make sure only one of the two is populated), and then a view that coalesces them for display when you need to show all records; see the sketch below. A meta table that tells you which one each client is using could help you in writing queries.

No matter what, this is a mess. Have you considered that a NoSQL solution might be better for your requirement? That is the use case for NoSQL: data that is unstructured. If we knew the real use for this data, it is possible we could suggest a better design alternative.
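A minimal sketch of that coalescing view (all names hypothetical):

create table FlexibleData (
    Id int identity primary key,
    StringValue nvarchar(max) null,
    IntValue int null
);

create index IX_FlexibleData_IntValue on FlexibleData (IntValue);

create view FlexibleDataDisplay as
select Id,
       coalesce(cast(IntValue as nvarchar(12)), StringValue) as DisplayValue
from FlexibleData;

Reports read DisplayValue as text, while range queries and math go straight against the typed, indexed IntValue column.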
(Turn Rant on - Personally, without knowing more, I would question the need for any application to be that flexible. Often requirements add more flexibility than users actually require or will use and developers dutifully build it. I have seen this in every single COTS program I have had to support. Users in general think they want flexibility - making it a sales point, but find it so hard to use that they will not use it in practice. Sometimes we need to do a better job of pushing back when the requirement will make the software run slowly or be virtually unusable. Turn Rant off. )

Database ETL Design Question

A dataset I receive for routine refresh purposes contains a date field that's actually VARCHAR.
As this will be an indexed/searched field, I'm left with...
1) Converting the field to DATETIME and validating and normalizing the data values when refreshing
or...
2) Leaving the data as-is and forming my queries to accommodate various valid date formats, i.e.,
WHERE DateField = 'CCYYMMDD' OR DateField = 'MM/DD/CCYY' OR ....
The refresh would be on a monthly basis; "cleaning" the data would add about 35% time to the ETL cycle. My queries on the date field would all be equalities; I do not need to range search.
Also, I'm a one man shop, so the more hands-off the overall solution the better.
So which scenario am I better off doing? All opinions appreciated.
I think this is a great question. Here's my opinion:
I'm a big believer in the idea that in the long run you'll save more time and have fewer headaches by using data types for the purpose for which they were intended. That means dates in date fields, characters in character fields, etc. If you go with option 2 you'll need to remember to code for all the various possible date formats every time you query the table. If you set this aside and come back a year from now, are you going to remember?
By contrast, if you use a date field and do the upfront work in the ETL process of dealing with the dates properly, you will always know just how to interact with the field. And I'm not even going into performance implications.
And in this case, I'm not sure you'll even see a short-term benefit. If there are, for example 5 different possible date formats in the source data, you'll need to account for those one way or another; either in the ETL or in the output queries. The code to transform those 5 formats in ETL is not materially more complicated than the code to manage those 5 formats in the output queries.
And if the data could literally arrive in an infinite number of formats, you have big problems either way. Either your ETL will break or your queries will break. It is, to a certain extent, an irreducible complexity.
I would suggest that you take the time to code the proper transforms into your ETL. But do yourself a favor and code a preprocessing step that identifies dates in formats that won't properly transform and alerts you to them. If you see patterns; i.e., if any format shows up more than once, code a transform for it. Over time you'll be left manually cleaning fewer and fewer of those nasty dates. With luck, your 35% will drop to 5% or less.
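A sketch of such a preprocessing step, assuming SQL Server 2012 or later (for TRY_CONVERT) and a hypothetical staging table StagingImport with the raw varchar column DateField:

select DateField, count(*) as Occurrences
from StagingImport
where try_convert(date, DateField, 112) is null  -- CCYYMMDD
  and try_convert(date, DateField, 101) is null  -- MM/DD/CCYY
group by DateField
order by Occurrences desc;

Anything this returns is a date none of your current transforms understands; a format that shows up repeatedly is a candidate for a new transform rule.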
Good luck!
You are better off cleaning the data. First, dates which are not good dates are meaningless, so it's pointless to store them. Second, it is harder to fix a bad datatype choice later than it is never to make it. Querying will not only be easier, it will be faster than with a varchar, and things like ordering will work correctly, as will date functions. Third, I can't imagine that cleaning this would add that much to your import; I clean data all the time without it being a problem. But if it does, then clean the data in a staging table that no other process is using (so you aren't affecting users on prod) and then load the prod tables from nice clean data.
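A hedged sketch of that staging-to-prod load, again assuming SQL Server 2012+ and hypothetical table/column names; each known source format gets its own try_convert:

insert into ProdEvents (EventDate)
select coalesce(try_convert(date, DateField, 112),  -- CCYYMMDD
                try_convert(date, DateField, 101))  -- MM/DD/CCYY
from StagingImport
where coalesce(try_convert(date, DateField, 112),
               try_convert(date, DateField, 101)) is not null;

Rows that match none of the formats stay behind in staging for manual review.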
Clean the data up front and store the dates as dates.
I work with systems that store dates as strings, and there appear to be an unlimited number of ways to store the dates. This makes it very difficult to create a query that will work against a future new date format.
If you store dates as strings then you should apply constraints to make sure the data is stored in the proper format. Or, just convert the date strings to dates and let the database apply the valid date constraint itself. It is usually best to let the database do the work for you.
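If you did keep the strings, a constraint along these lines would at least pin the format down (a sketch assuming SQL Server 2012+; EventLog and DateField are hypothetical names):

alter table EventLog add constraint CK_EventLog_DateField
check (try_convert(date, DateField, 112) is not null);

A real date column gives you that validity check for free, which is the point above.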
You are definitely better off cleaning the data and loading it into a date column, as this will ensure its integrity.

Are there any advantages to using varchar over decimal for Price and Value

I was arguing with my friend against his suggestion to store price, value and other similar information in varchar.
My point of view are on the basis of
Calculations will become difficult as we need to cast back and forth.
Integrity of the data will be lost.
Poor performance of Indexes
Sorting and aggregate functions will also need casting
etc. etc.
But he was saying that at his previous employment everybody used to store such values in varchar, because the communication between the DB and the app would be very effective with this approach. (I still can't accept this.)
Are there really some advantages in storing such values in varchar ?
Note : I'm not talking about columns like PhoneNo, IDs, ZIP Code, SSN etc. I know varchar is best suited for those. The columns are value based, and will for sure be involved in calculations some way or other.
None at all.
Try casting values back and forth and see how much data you lose.
DECLARE @foo TABLE (bar varchar(30))
INSERT @foo VALUES ('11.2222222222')
INSERT @foo VALUES ('22.3333333333')
INSERT @foo VALUES ('33.1111111111')
SELECT CAST(CAST(bar AS float) AS varchar(30)) FROM @foo
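(With SQL Server's default float-to-varchar conversion, which keeps at most six significant digits, those come back as 11.2222, 22.3333 and 33.1111 - the rest of the precision is gone.)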
I would also mention that his current employment does things differently... he isn't at his previous employment any more....
I think a big part of the reason to use the APPROPRIATE (in this case decimal) data type is to prevent invalid data. There's nothing to stop someone entering "The King" as a price in a varchar field.
I can see no advantages, and a whole heap of very severe disadvantages - the most pressing of which is performance (particularly when sorting).
Consider if you want to get a list of the N most expensive products, and you are storing your price as a VARCHAR. Here are some sample values (sorted in descending order)
SELECT Price FROM Table ORDER BY Price DESC
Price
-----
90
600
50
1000
Whoops! The sort order is, well, wrong! (Alphanumerical sorting, rather than value sorting).
If we want to do the sort properly then this means we either need to pad values with zeroes at the start, or convert each value to a double before we sort - but if we have to do a convert on every row, this means that SQL Server has no way of using statistics to predict what the results will be! This in turn means extremely poor performance, probably a table scan.
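For instance, the per-row conversion that forces, with a hypothetical Products table:

SELECT TOP 10 Price
FROM Products
ORDER BY CAST(Price AS decimal(18,2)) DESC

Every row must be converted before the sort, so an index on Price is useless here - and a single non-numeric value such as 'The King' would make the whole query fail.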
As Kragen notes, sorts will not necessarily come out in the right order.
Compares won't necessarily work either. If a field is defined as, say, decimal(8,2) and I give it the value "37.20", and later I write "select ... where price=37.2", the result will be true. But if I store a varchar 37.20 and compare it to 37.2, it will not be equal. Similarly if one or the other has leading zeros.
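A quick T-SQL demonstration of the difference:

SELECT CASE WHEN CAST('37.20' AS decimal(8,2)) = 37.2 THEN 'equal' ELSE 'not equal' END AS decimal_compare,
       CASE WHEN '37.20' = '37.2' THEN 'equal' ELSE 'not equal' END AS varchar_compare

The first column comes back 'equal' and the second 'not equal', because the string comparison is character-by-character.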
You could solve these problems by having the application ensure that you always store the numbers with a fixed number of decimal places, padded with leading zeros. Oh, and make sure you have a consistent convention about storing minus signs. But then every place in the app that writes to this field must be sure that it follows exactly the same rules. We could do this of course, but why? The database engine will do it for us if we just declare the field numeric. Like, yes, I COULD mow my lawn with a pair of scissors, but why would I want to?
I don't understand what your friend is saying the advantage is supposed to be. Easier communication between app and database? How? Maybe he was using some unconventional language or database interface that couldn't read numeric values from the DB. I've never had an issue with this. Actually, just saying that gets me wondering if that isn't what happened: at his previous company they were using some language or tool that couldn't read decimals from the database because of an implementation problem, the only way they could get it to work was to declare all the numbers as varchar, and now he walks away thinking that's a generally good idea.
OK. One-word answer: don't.
You are right about correct data types having an impact on performance (the SQL optimizer works differently for INT vs. VARCHAR), data consistency, integrity, etc.
If all we needed was VARCHAR, I don't think we would ever have invented the other types.
SQL is not dynamically typed. Static typing makes optimization better, index pages smaller, and query operators more efficient.
It is not the source's problem that a consumer needs all strings as input; it is up to the consumer to do type checking when consuming the data. A DB should always have correct types.
(Forget about choosing between INT and VARCHAR; you should also think about whether you need INT or TINYINT.) These considerations make a lot of difference.
Data types are best stored in fields that match the type between the two systems involved - in this case, between your .NET objects and MS SQL Server. You are correct about the loss of data integrity and about the need to cast/convert data types into usable forms. Other types such as phone number, ZIP code and SSN would also benefit from dedicated data types; the main reason these are stored in VARCHAR/NVARCHAR is the number of different possibilities that are not needed in every system. But if you have a type that is commonly used and you want to constrain it, you can build custom data types, called user-defined types, to store that data in SQL Server. (Even more fun are CLR-defined types; see the example on Code Project.)
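For instance, a minimal user-defined alias type (the name dbo.Price is hypothetical):

CREATE TYPE dbo.Price FROM decimal(10,2) NOT NULL;

CREATE TABLE Products (
    ProductID int IDENTITY PRIMARY KEY,
    UnitPrice dbo.Price
);

Every column declared as dbo.Price then shares the same precision and nullability rule, so the definition lives in one place.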
The only advantage I can see with using any sort of variable-sized string-ish format would be if the field had to accommodate an unknown amount of additional information. For example, "49.95#1/39.95#5/29.95#20/14.95#100,match=true/24.95#100" to indicate that this particular product has price points at 1, 5, 20, and 100 units, and the best 100-unit price is only available when all items are identical. Using strings to store such things is icky, but if the number of price points is open-ended, using a variable-sized field might be better than having to create another table with one row per product/price-point combination. If you do go that route, it may be better to use XML serialization for the data rather than an ad-hoc format as shown above. An ad-hoc approach might allow faster parsing in some cases, but if things really are open-ended it could become a real pain to maintain.
Addendum: If you want to be able to do any type of sorting or searching based on price, you'll need separate columns for that. If you want to allow users to, e.g., find the ten cheapest items at 100-piece mix/match quantity, and the database holds 10,000 possible items, the only way to satisfy the query with varchar-stored data would be to read all 10,000 items and evaluate what the best price would be given the restrictions. If users can only query based upon a small number of price/restriction combinations, it may be helpful to have a column for each one to allow direct queries.