What is the underlying data structure of datetime values stored in SQL Server (2000 and 2005, if they differ)? I.e., down to the byte representation?
Presumably the default representation you get when you select a datetime column is a culture-specific value, subject to change. That is, some underlying structure that we don't see is being formatted as YYYY-MM-DD HH:MM:SS.mmm.
The reason I ask is that there's a generally held view in my department that it's stored in memory literally as YYYY-MM-DD HH:MM:SS.mmm, but I'm sure this isn't the case.
It's stored as an 8-byte field, covering the range 1753-01-01 through 9999-12-31, accurate to 0.00333 seconds (1/300 of a second).
The details are supposedly opaque, but most resources (1), (2) that I've found on the web state the following:
The first 4 bytes store the number of days since SQL Server's epoch (1 January 1900), and the second 4 bytes store the number of ticks after midnight, where a "tick" is 1/300 of a second, roughly 3.3 milliseconds.
The first four bytes are signed (can be positive or negative), which explains why dates earlier than the epoch can be represented.
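One way to poke at this yourself is to cast a datetime to binary(8); the hex values in the comments below are what the layout described above would produce:
SELECT CAST(CAST('19000101' AS datetime) AS binary(8));              -- 0x0000000000000000 (the epoch)
SELECT CAST(CAST('19000102' AS datetime) AS binary(8));              -- 0x0000000100000000 (one day later)
SELECT CAST(CAST('18991231' AS datetime) AS binary(8));              -- 0xFFFFFFFF00000000 (negative day count)
SELECT CAST(CAST('19000101 00:00:01.000' AS datetime) AS binary(8)); -- 0x000000000000012C (300 ticks = 1 second)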
I'm trying to convert and upload latitude and longitude data into a database through an ETL process I created, where we take the source data from a .csv file and convert it to DECIMAL. Here is an example of what the two values look like:
Latitude (first column): 41.896585191199556
Longitude (second column): -87.66454238198166
I set the data types in the database as follows:
Latitude DECIMAL(10,8)
Longitude DECIMAL(11,8)
The main problem arises when I try to convert the data from the file to the database; I then get this message:
[Flat File Source [85]] Error: Data conversion failed. The data conversion for column "Latitude" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
View of my process:
When I try to ignore the error, the Latitude and Longitude values in the database end up as NULL... The flat file encoding is 65001 (UTF-8).
I tried conversions to the data types float, DECIMAL, and int, and nothing helped.
My questions are:
What data type should I use for the above values in the target database?
What data type should I choose on input for the flat file?
What data type should I set for the conversion (I suspect the same one we will have in the database)?
Please note that some records in the file are missing the location.
view from Data:
view from Data Conversion:
UPDATE
When FastParse is enabled, I receive an error message as below:
What data type should I choose in this case? I set everything up as @billinkc suggested. When I set an integer type, for example DT_I4, the result is NULL and I get the same error as before (in that dialog there is no option to select a data type such as DECIMAL or STRING for the Latitude value).
You need DECIMAL(11,8). That has three digits before the decimal point and eight after.
The conversion failure is no doubt happening when you have longitudes above 100 or below -100.
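To see the difference the extra integer digit makes on the SQL side (a quick sketch with illustrative values, separate from the SSIS error itself):
SELECT CAST(-87.66454238198166 AS DECIMAL(11,8)); -- fits: up to three digits before the decimal point
SELECT CAST(-104.9 AS DECIMAL(10,8));             -- fails with an arithmetic overflow: only two digits allowed before the point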
The error reported indicates that the failure point is the Flat File Source:
[Flat File Source [85]] Error: Data conversion failed. The data conversion for column "Latitude" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
I'm on a US locale machine so you could be running into issues with the decimal separator. If that's the case, then in your Flat File Source, right click and select Show Advanced Editor. Go to Input and Output Properties, and under the Flat File Source Output, expand Output Columns and for each column that is a floating point number, check the FastParse option.
If that works, great, you have a valid Flat File Source.
I was able to get this working two different ways. I defined two Flat File Connection Managers in my package: FFCM Dec and FFCM String. While I prefer to minimize the number of operations and transforms I apply in my packages, declaring the data types as strings can help you get past the hurdle of "I can't even get my data flow to start because of bad data."
Source data
I created a CSV saved as UTF-8
Latitude,Longitude
41.896585191199556,-87.66454238198166
FFCM Dec
I configured a standard CSV
I defined my columns with the DataType of DT_DECIMAL
FFCM String
The front page is the same, but for the columns in the Advanced section I left the data type as DT_WSTR with a length of 50.
At this point, we've defined the basic properties of how the source data is structured.
Destination
I went with consistency on the size for the destination. You're not going to save anything by using 10 vs 11, and I'm too lazy to look up the allowable domain for lat/long numbers.
CREATE TABLE dbo.SO_65909630
(
[Latitude] decimal(18,15)
, [Longitude] decimal(18,15)
)
Data Flow
I need to run, but in short: you either use correctly typed data when you bring it in (DFT DEC) or you transform it afterwards.
The blanks I see in your source data will likely need to be dealt with (either you have a column that needed to be escaped, or there is no data), because they will cause the data conversion to fail, so I'd advocate this approach.
The Row Count components are there just to provide a place to attach a data viewer while I was building the answer.
What data type should I use for lat and long
Decimal is an exact data type, so it will store the exact value you supply. When used, it takes the form decimal(precision, scale). Before my current role, I had never used any other data type for non-whole numbers.
Books Online on decimal and numeric (Transact-SQL): https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15
Precision
The maximum total number of decimal digits to be stored. This number includes both the left and the right sides of the decimal point. The precision must be a value from 1 through the maximum precision of 38. The default precision is 18.
Scale
The number of decimal digits that are stored to the right of the decimal point. This number is subtracted from p to determine the maximum number of digits to the left of the decimal point. Scale must be a value from 0 through p, and can only be specified if precision is specified. The default scale is 0 and so 0 <= s <= p. Maximum storage sizes vary, based on the precision.
Precision    Storage bytes
1-9          5
10-19        9
20-28        13
29-38        17
For the table I defined above, it will cost us 18 bytes (2 * 9) for each lat/long to store.
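If you want to see that size directly, DATALENGTH should report the storage of a decimal value based on its declared precision (a quick check using the question's latitude):
SELECT DATALENGTH(CAST(41.896585191199556 AS decimal(18,15))); -- 9 bytes per value at precision 18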
But let's look at the actual domain for latitude and longitude (on Earth). This magnificent answer on GIS.se is printed out and hangs from my work monitor: https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude
Pasting the relevant bits here
The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. It should be more than good enough for tracking movements of glaciers and rivers. This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.
The seventh decimal place is worth up to 11 mm: this is good for much surveying and is near the limit of what GPS-based techniques can achieve.
The eighth decimal place is worth up to 1.1 mm: this is good for charting motions of tectonic plates and movements of volcanoes. Permanent, corrected, constantly-running GPS base stations might be able to achieve this level of accuracy.
The ninth decimal place is worth up to 110 microns: we are getting into the range of microscopy. For almost any conceivable application with earth positions, this is overkill and will be more precise than the accuracy of any surveying device.
Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!
Your input values show more than 10 decimal places, so I'm guessing they're calculated values and not "true observations". That's good; it gives us more wiggle room to work with.
Why, we could dial that decimal declaration down to the following for roughly half the storage cost of the first one:
CREATE TABLE dbo.SO_65909630_alt
(
[Latitude] decimal(8,5)
, [Longitude] decimal(8,5)
);
Well that's good, we've stored the "same" data at a lower cost. Maybe your use case is just "where are my stores", and even if you're Walmart with under 12,000 stores, who cares? That's a trivial cost. But if you need to also store the coordinates of their customers, the storage cost per record might start to matter. Or substitute Amazon or Alibaba or whatever very large consumer retailer exists when you read this.
In my work, I deal with meteorological data, and it comes in all shapes and sizes, but a common source for me is Stage IV data. It's just hourly rainfall amounts across the contiguous US, so 24 readings per coordinate, per day. The coordinate system is 1121 x 881 (987,601 points), so expressing hourly rainfall in the US for a single day is 23,702,424 rows. The difference between 18 bytes and 10 bytes can quickly become apparent given that Stage IV data is available back to 2008.
We actually use a float (or real) to store latitude and longitude values because it saves us 2 bytes per coordinate pair.
CREATE TABLE dbo.SO_65909630_float
(
[Latitude] float(24)
, [Longitude] float(24)
);
INSERT INTO dbo.SO_65909630_alt
(
Latitude
, Longitude
)
SELECT * FROM dbo.SO_65909630 AS S
Now, this has caused me pain because I can't use an exact filter in queries because of the fun of floating point numbers.
My decimal typed table has this in it
41.89659 -87.66454
And my floating type table has this in it
41.89658 -87.66454
Did you notice the change to the last digit in Latitude? It's 8, not 9 as in the decimal table. Either way, it doesn't matter:
SELECT * FROM dbo.SO_65909630_float AS S WHERE S.Latitude = 41.89658
This won't find a row because of floating-point rounding and exact-match nonsense. Instead, your queries become very tight range queries, like
SELECT * FROM dbo.SO_65909630_float AS S WHERE S.Latitude >= (41.89658 - .00005) AND S.Latitude <= (41.89658 + .00005)
where .00005 is a value that you'll have to experiment with given your data to find out how much you need to adjust the numbers to find it again.
Finally, for what it's worth, if you convert lat and long into a geography Point, it's going to coerce the input data type to float anyway.
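For example (a small sketch; SRID 4326 is WGS 84, and the values are the ones from the question):
DECLARE @pt geography = geography::Point(41.896585191199556, -87.66454238198166, 4326);
SELECT @pt.Lat AS Latitude, @pt.Long AS Longitude; -- both properties come back as float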
I work on an application that stores datetimes in a SQL Server database. Some of these are a point in time stored in UTC (such as log item datetimes), while others are a literal date/time (such as "take medication X at 4pm on 20 July", irrespective of your timezone).
Problem is that these both have a date and time component, so using a datetime2 column type makes sense for both. We're now in a situation where it is often unclear in our app whether a date/time column is a UTC point in time or a literal date/time.
What is the most common practice to distinguish between these 2 cases? I can think of these options:
1) End all UTC columns in ...Utc, while literal date/time columns have no special ending.
2) End all literal columns in ...Literal, while UTC date/time columns have no special ending.
3) Give UTC columns the data type datetime2 and literal date/time columns datetimeoffset.
Always try to use the appropriate type first and then good naming. If datetime2(0) is a good fit, use it.
In my system I add a suffix to the column name, for example: PlaybackStartedLocal datetime2(0), PlaybackStartedUTC datetime2(0). In my case I have to store both local and UTC values for the same event, because some reports need the local value, some need the UTC value, and it is very difficult to convert between them later.
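A minimal sketch of that convention (the table and the identity column are made up; only the naming pattern matters here):
CREATE TABLE dbo.PlaybackEvent
(
    PlaybackEventId      int IDENTITY(1,1) PRIMARY KEY
  , PlaybackStartedUTC   datetime2(0) NOT NULL -- point in time, stored in UTC
  , PlaybackStartedLocal datetime2(0) NOT NULL -- wall-clock time where the event happened
);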
In general it is a good practice to include units of measurement into the column/variable name.
What do you prefer to see:
PlaybackDurationMSec or PlaybackDuration
LengthMeters / LengthMiles or Length
A well-known example is the one where two teams of programmers didn't notice that they were interpreting metric values as imperial and vice versa: a disaster investigation board reported that NASA's Mars Climate Orbiter burned up in the Martian atmosphere because engineers failed to convert units from English to metric.
The software calculated the force the thrusters needed to exert in
pounds of force. A separate piece of software took in the data
assuming it was in the metric unit: newtons.
Looking at the date and time functions here: the formatting shows fractional seconds available to the millisecond (0.001), but are they accurate to the millisecond? I have not been able to find the resolution of these calls in any of the documentation.
https://www.sqlite.org/datatype3.html#section_2_2
SQLite does not have a storage class set aside for storing dates
and/or times. Instead, the built-in Date And Time Functions of SQLite
are capable of storing dates and times as TEXT, REAL, or INTEGER
values:
TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS").
REAL as Julian day numbers, the number of days since noon in Greenwich on November
24, 4714 B.C. according to the proleptic Gregorian calendar.
INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.
Applications can choose to store dates and times in any of these
formats and freely convert between formats using the built-in date and
time functions.
It appears that you can reach the highest resolution with ISO8601 strings. There should be no problem with accuracy with these strings, as long as you're not mixing storage representations.
This depends on the date format.
INTEGER numbers are accurate to the second.
TEXT values are accurate to the millisecond. You could specify more digits in the fractional-seconds field, but the built-in functions will ignore everything after the first three.
The resolution of Julian day numbers is better than a millisecond, but when formatting them, the built-in functions will not output more than three fractional digits.
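A quick way to see those limits with the built-in functions (SQLite; the comments show the expected output):
SELECT strftime('%Y-%m-%d %H:%M:%f', '2021-01-01 12:34:56.123456');
-- 2021-01-01 12:34:56.123  (fractional seconds truncated to milliseconds)
SELECT strftime('%s', '2021-01-01 12:34:56.789');
-- 1609504496               (unix time, whole seconds only)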
If you want to store time values with higher resolution than is permitted by the date and time functions, you can opt for INTEGER storage, but you're on your own for conversion functions and the like.
Storing unix-epoch timestamps as 64-bit integers with nanosecond precision will cover dates from 1970 until at least the year 2262. You're on your own for conversion to/from ISO date strings and for time-unit arithmetic, but if you're content with that, then you'll have a fast, high-precision, and compact storage format.
I have a system that receives sensor values from devices at a rate timed in nanoseconds, so storing and indexing on those timestamps has been very helpful. I do have to provide a query interface that converts from ISO, Python datetime and other formats into ns and back, but that's the deal.
Working with 8-byte timestamps helped keep the record length short, with quick inserts and queries.
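A rough sketch of that setup (SQLite; the table and column names are made up): nanosecond epoch values live in an INTEGER column and are converted to a readable timestamp only when needed.
CREATE TABLE sensor_readings (ts_ns INTEGER PRIMARY KEY, value REAL);
INSERT INTO sensor_readings VALUES (1609504496789123456, 23.5);
SELECT datetime(ts_ns / 1000000000, 'unixepoch') AS ts_utc, value
FROM sensor_readings;
-- 2021-01-01 12:34:56 | 23.5  (the sub-second part is kept only in ts_ns)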
I have a database with integer fields (columns) named fSystemDate, fOpenned, fStatusDate, etc. I think they represent dates, but I don't know their format. The values in those fields look like this: 76505, 76530, 76554, 76563.
I do not have examples with the real date associated with them.
Solved. See answers.
I found that this format is part of a programming language called Clarion, whose date numbering starts at 28 December 1800.
I can convert Clarion dates to SQL dates in two ways:
SELECT DATEADD(day, 76505, '18001228') -- '18001228' is 28-Dec-1800 in the unambiguous YYYYMMDD form
where the result would be 2010-06-15 00:00:00.
SELECT CONVERT(DateTime,76505 - 36163)
where the result is the same. The number 36163 is the adjustment for SQL Server: it is the number of days between 1 January 1900 (where SQL Server's datetime numbering starts) and 28 December 1800 (where Clarion's date numbering starts).
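If you want to double-check that offset yourself (not from the original answer, just a sanity check):
SELECT DATEDIFF(day, '18001228', '19000101'); -- 36163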
The result in my case is correct because I asked them (my customer) for example dates from their application and compared the information.
It's rather hard to help you given just a number. It looks like your dates are some sort of serial number, but without any other data points you'd at least need to know the following:
epoch. An epoch is the zero point of a calendrical system.
increment. How big is a tick in the serial number? 1 day? 1 hour, 1 minute? A week? A month?
source hardware/operating system. From what computer system did the value originate? Different systems represent dates differently, using different calendrical systems with different epochs.
source software system. What software created the value? Was it custom software? What language was it written in? When? What is the backing store for the data? Databases, filesystems, etc., might all have their own internal date representation.
the represented value. If 76563 is indeed a representation of a date, what date does it represent? Or at least, does it represent a recent date? A date in the past? A date in the future?
It's impossible to answer your question. This page might help you:
http://www.itworld.com/article/2823335/data-center/128452-Just-dating-The-stories-behind-12-computer-system-reference-dates.html
It lists some common epochs for different computer systems:
Edited to note: here's one data point for you: Adding 76,563 days to 1 Jan 1800 yields the date 16 August 2009.
We had this programming discussion on Freenode, and this question came up when I was trying to use a VARCHAR(255) to store a date variable in this format: D/MM/YYYY. So the question is: why is it so bad to use a VARCHAR to store a date? Here are the advantages:
It's faster to code. Previously I used DATE, but date formatting was a real pain.
It's more power-hungry to use a string than a date? Who cares, we live in the GHz era.
It's not ethically correct (lolwut?). This is what the other user told me...
So what would you prefer to use to store a date? SQL VARCHAR or SQL DATE?
Why not put screws in with a hammer?
Because it isn't the right tool for the job.
Some of the disadvantages of the VARCHAR version:
You can't easily add / subtract days to the VARCHAR version.
It is harder to extract just month / year.
There is nothing stopping you putting non-date data in the VARCHAR column in the database.
The VARCHAR version is culture specific.
You can't easily sort the dates.
It is difficult to change the format if you want to later.
It is unconventional, which will make it harder for other developers to understand.
In many environments, using VARCHAR will use more storage space. This may not matter for small amounts of data, but in commercial environments with millions of rows of data this might well make a big difference.
Of course, in your hobby projects you can do what you want. In a professional environment I'd insist on using the right tool for the job.
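To make the first and third points in the list above concrete (a small sketch; the values are made up):
DECLARE @d date = '2009-08-16';
DECLARE @s varchar(255) = '16/8/2009';

SELECT DATEADD(day, 30, @d);  -- trivial arithmetic on the date type
-- DATEADD(day, 30, @s) either errors or guesses at the format, depending on locale settings

DECLARE @junk varchar(255) = 'not a date at all'; -- nothing stops this reaching a varchar "date" column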
When you have a database with more than 2-3 million rows, you'll know why it's better to use DATETIME than VARCHAR :)
The simple answer is that with databases, processing power isn't the problem anymore; database size is, because of the HDD's seek time.
Basically, with modern hard disks you can read about 100 records per second if they're read in random order (usually the case), so you must do everything you can to minimize DB size, because:
The HDD's heads won't have to "travel" this much
You'll fit more data in RAM
In the end it's always the HDD's seek times that will kill you. E.g., a simple GROUP BY query over many rows could take a couple of hours when done on disk, compared to a couple of seconds when done in RAM, because of seek times.
With VARCHARs you can't do any useful date searches. If you hate the way SQL deals with dates so much, just use a unix timestamp in a 32-bit integer field. You'll have (basically) all the advantages of a SQL DATE field; you'll just have to manipulate and format dates using your chosen programming language, not SQL functions.
Two reasons:
Sorting results by the dates
Not sensitive to date formatting changes
So let's take for instance a set of records that looks like this:
5/12/1999 | Frank N Stein
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
If we were to store the data your way, but sorted the dates in ascending order, SQL will respond with a result set that looks like this:
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
Whereas if we stored the dates as a DATETIME, SQL will order them correctly, like this:
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
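For concreteness, the query in both cases is just an ORDER BY; only the column's data type differs (the table and column names here are hypothetical):
SELECT event_date, guest_name
FROM dbo.Guests
ORDER BY event_date; -- string order with VARCHAR, chronological order with DATETIME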
Additionally, if somewhere down the road you needed to display dates in a different format, for example YYYY-MM-DD, then you would need to transform all your data or deal with mixed content. When it's stored as a SQL DATE, you are forced to do the transformation in code, and you very likely have a single place where you can change the format used to display all dates, essentially for free.
Between DATE/DATETIME and VARCHAR for dates, I would go with DATE/DATETIME every time. But there is an overlooked third option: storing the date as an unsigned INTEGER!
I decided to go with unsigned INTEGER in my last project, and I am really satisfied with that choice instead of storing it as a DATE/DATETIME. Because I was passing dates between client and server, it was the ideal type for me to use. Instead of having to store it as a DATE and convert it back every time I select, I just select it and use it however I want. If you want to select the date as a "human-readable" date, you can use the FROM_UNIXTIME() function.
Also, an integer takes up 4 bytes while a DATETIME takes up 8 bytes, saving 50% of the storage.
The sorting problem that Berin describes is also solved by using an integer to store dates.
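A rough sketch of that approach (MySQL syntax, since FROM_UNIXTIME() is mentioned above; the table and column names are illustrative):
CREATE TABLE events (
    event_ts INT UNSIGNED NOT NULL -- unix timestamp, 4 bytes
);

INSERT INTO events (event_ts) VALUES (UNIX_TIMESTAMP('2009-08-16 00:00:00'));

SELECT FROM_UNIXTIME(event_ts) AS event_datetime
FROM events
ORDER BY event_ts; -- plain integers also sort chronologically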
I'd vote for using the date/datetime types, just for the sake of simplicity/consistency.
If you do store it as a character string, store it in ISO 8601 format:
http://www.iso.org/iso/date_and_time_format
http://xml.coverpages.org/ISO-FDIS-8601.pdf
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Among other things, ISO 8601 date/time strings (A) collate properly, (B) are human-readable, (C) are locale-independent, and (D) are readily convertible to other formats. To crib from the ISO blurb, ISO 8601 strings offer
representations for the following:
Date
Time of the day
Coordinated universal time (UTC)
Local time with offset to UTC
Date and time
Time intervals
Recurring time intervals
Representations can be in one of two formats: a basic format
that has a minimal number of characters and an extended format
that adds characters to enhance human readability. For example,
the third of January 2003 can be represented as either 20030103
or 2003-01-03.
[and]
offer the following advantages over many of the locally used
representations:
Easily readable and writeable by systems
Easily comparable and sortable
Language independent
Larger units are written in front of smaller units
For most representations the notation is short and of constant length
One last thing: if all you need to do is store a date, then storing it in the ISO 8601 short form YYYYMMDD in a char(8) column takes no more storage than a datetime value (and you don't need to worry about the 3-millisecond gap between the last tick of one day and the first tick of the next, but that's a matter for another discussion). If you break it up into 3 columns (YYYY char(4), MM char(2), DD char(2)), you'll use the same amount of storage and get more options for indexing. Even better, store the fields as a smallint for yyyy (2 bytes) and a tinyint for each of MM and DD; now you're down to 4 bytes for the date. The drawback, of course, to decomposing the date into its constituent parts is that conversion to proper date/time data types is complicated.
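Rough sketches of the char(8) form and the decomposed form described above (SQL Server syntax; the table names are made up):
CREATE TABLE dbo.DateAsChar8 ( d char(8) NOT NULL ); -- '20030103', 8 bytes per date

CREATE TABLE dbo.DateDecomposed
(
    yyyy smallint NOT NULL -- 2 bytes
  , mm   tinyint  NOT NULL -- 1 byte
  , dd   tinyint  NOT NULL -- 1 byte
); -- 4 bytes per date, but converting back to a real date/time type takes work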