I'm converting date to datetime in several DBs, but the final file size changes from 300 MB to 3 GB. Does anyone know the reason for that huge growth?
I'm a bit of a self-taught database person, so if anyone has a more decent explanation, you should mark that as an answer. Here's what I've learned over the past two years.
(My)SQL is a pessimist by nature; if you change the type and collation of a field, it's going to inflate the size of the table to fit the maximum amount of data the field could contain. (You'd have the same effect of inflating the table size with varchar fields where you change from UTF8_general_ci to UTF8MB4_general_ci.)
The difference between changing collation and changing type is that by changing the type from date to datetime, you're asking your system to generate new data. Today's date stored in a MySQL date field is written as 2019-12-12; if you change that to a datetime type, it becomes 2019-12-12 00:00:00.
Changing the type thus adds new data: the hours, minutes and seconds. That data is meaningless, however (all 00), as there was nothing there in the first place.
So now you have all those 00's in your database for hours, minutes and seconds. That's probably taking up quite a lot of space, but my gut feeling is that it shouldn't inflate your database from 300mb to 3gb.
The DATE type takes about 3 bytes; DATETIME takes about 8 bytes. If there are a million records, that adds up. :)
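For a rough sense of scale, here is a back-of-the-envelope sketch of that arithmetic. The million rows is the hypothetical figure from above, the per-value sizes are the approximate 3 and 8 bytes quoted, and index and page overhead are ignored.

```python
# Rough estimate of extra storage when a DATE column becomes DATETIME.
# Sizes are the approximate figures quoted above; the row count is hypothetical.
DATE_BYTES = 3
DATETIME_BYTES = 8

def extra_bytes(rows: int, columns: int = 1) -> int:
    """Extra bytes for the converted columns, ignoring indexes and page overhead."""
    return rows * columns * (DATETIME_BYTES - DATE_BYTES)

growth = extra_bytes(1_000_000)
print(f"{growth / 1_000_000:.1f} MB extra per converted column")  # 5.0 MB
```

On these numbers the wider column accounts for megabytes rather than gigabytes, so the type change alone probably does not explain a jump to 3 GB.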
I've got a timestamp column in BigQuery, and now I realize I could have used a date data type to represent this column instead (I don't need fine time granularity). My table is large and expensive to query so I'm wondering whether I'll save money by converting it to a new column of type DATE instead.
However, the official BigQuery documentation on data types doesn't seem to indicate how many bytes a date object requires. Does anyone here know?
DATE and TIMESTAMP both require 8 bytes
You can see more details at Data size calculation
In my app, I'm just using a SQLite database for development. In the migration, I declare a DATE datatype, which Laravel seems to handle without any problem, and the database itself creates it as a varchar.
According to this nice article (http://www.sqlitetutorial.net/sqlite-date/) SQLite has basically got three options for handling dates:
Using the TEXT storage class to store SQLite date and time values
Using the REAL storage class to store SQLite date and time values
Using the INTEGER storage class to store SQLite date and time values
So as I'm trying to formulate my approach, I'm thinking ahead: at some point I will likely need to step up to a higher-performance SQL database (MySQL, Postgres, etc.), and may then have datatype translation challenges.
But then also, at the application layer, Laravel itself has some manipulations.
Now, the question I'm asking is this: what is the benefit of one type over another? Is there some reason to choose one type over another? My thinking is that TEXT is nice and human-readable for backend support, but it may require additional coding to manipulate strings.
INTEGERS are probably more efficient, and would be translatable to a bigger SQL server easier than text.
Does anyone know of a comparison of the pros and cons of the various choices?
Any advice? Thanks in advance.
An integer timestamp takes about 4 bytes (SQLite stores integers in 1 to 8 bytes depending on magnitude). Each letter in text takes 1 byte.
To represent a date and time as an integer you need a single number, for example seconds since the Unix epoch in UTC. So it's much better to use 4 bytes of integer than 8 or more bytes of text. I don't see how REAL can be better than INTEGER, for the same reason. I would say you should use INTEGER.
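To make the comparison concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names (events, as_text, as_int) are made up for illustration; the SQLite functions used (strftime, datetime, typeof, length) are standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (as_text TEXT, as_int INTEGER)")

# Store the same instant twice: as ISO 8601 text and as Unix seconds (UTC).
conn.execute(
    "INSERT INTO events VALUES ('2015-06-22 18:59:59', "
    "CAST(strftime('%s', '2015-06-22 18:59:59') AS INTEGER))"
)

# Inspect the storage class of each value and convert the integer back to text.
row = conn.execute(
    "SELECT typeof(as_text), length(as_text), typeof(as_int), as_int, "
    "datetime(as_int, 'unixepoch') FROM events"
).fetchone()
print(row)
```

The text form is 19 characters here, while the integer is a single number that SQLite can turn back into a readable string on demand.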
My table has a category 'timestamp' where the timestamps are formatted 2015-06-22 18:59:59
However, using DBVisualizer Free 9.2.8 and Vertica, when I try to pull up rows by timestamp with a
SELECT * FROM table WHERE timestamp = '2015-06-22 18:59:59';
(directly copy-pasting the stamp), nothing comes up. Why is this happening and is there a way around it?
FYI, saying "the timestamps are formatted 2015-06-22 18:59:59" is incorrect if you are indeed using a TIMESTAMP type. Such types have their own internal representation of a date-time value, almost always a count since epoch. In your case with Vertica, 8 bytes are used for such storage. The formatting of the date-time value happens when a string representation is generated. Never confuse the string representation with the date-time value. Conflating the two may well be related to your problem/confusion.
A few different thoughts about possible problems…
String Literals
Are you sure Vertica takes strings as timestamp literals? That format you used is common SQL format. But given that Vertica seems to be a specialized database, I would double-check that.
If strings are not allowed, you may need to call some kind of function to transform the string into a date-time value.
Fractional Second
As the comment by Martin Smith points out, the doc for Timestamp-related data types in Vertica 7.1 says those types can have a fractional second to a resolution of microseconds. That means up to 6 decimal places of a fraction.
So if you are searching for "2015-06-22 18:59:59" but the stored value is "2015-06-22 18:59:59.012345", no match on the query.
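The same mismatch is easy to reproduce outside the database. This is only an illustration in Python, not Vertica behavior; the stored value is the hypothetical one from the sentence above.

```python
from datetime import datetime

# Stored value carries a fractional second (.012345); the search value does not.
stored = datetime(2015, 6, 22, 18, 59, 59, 12345)
searched = datetime.fromisoformat("2015-06-22 18:59:59")

exact_match = stored == searched                          # compares microseconds too
second_match = stored.replace(microsecond=0) == searched  # compare at whole seconds

print(exact_match, second_match)  # False True
```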
Half-Open
The fractional seconds issue described above is often the cause of problems people have when handling a span of time. If you naïvely try to pinpoint the ending time, you are likely to have problems. Seeing the "59:59" in your example string makes me think this applies to you.
The better approach to spans of time is "Half-Open" (or Half-Closed, whatever) where the beginning is inclusive while the ending is exclusive. Common notation for this is [). In comparison logic this means: value >= start AND value < stop. Notice the lack of EQUALS SIGN in the stop comparison. In English we would say "look for an hour's worth of invoices starting at 2:00 PM and going up to, but not including, 3:00 PM".
Half-Open for a week means Monday-Monday, for a month the first of one month to the first of the next month, and for a year the January 1 of one year to January 1 of the following year.
Half-Open means not using BETWEEN in SQL. SQL's BETWEEN has often been criticized. Instead do something like the following to look for an hour's worth of invoices. Notice the Z on the end of the string literal, which means "UTC time zone" ("Z" for "Zulu"). (But verify, as my SQL syntax may need fixing.)
SELECT *
FROM some_table_
WHERE invoice_received_ >= '2015-06-22 18:00:00Z'
AND invoice_received_ < '2015-06-22 19:00:00Z'
;
This query will catch any values such as '2015-06-22 18:59:59.654321', which seem to be eluding you.
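The comparison logic can be sketched in Python as well. The helper name in_half_open and the sample values are made up for illustration; the point is only the >= start AND < stop shape.

```python
from datetime import datetime, timedelta, timezone

def in_half_open(value: datetime, start: datetime, stop: datetime) -> bool:
    """True when start <= value < stop (beginning inclusive, ending exclusive)."""
    return start <= value < stop

start = datetime(2015, 6, 22, 18, 0, 0, tzinfo=timezone.utc)
stop = start + timedelta(hours=1)

received = [
    datetime(2015, 6, 22, 18, 0, 0, tzinfo=timezone.utc),            # at start: included
    datetime(2015, 6, 22, 18, 59, 59, 654321, tzinfo=timezone.utc),  # fractional second: included
    datetime(2015, 6, 22, 19, 0, 0, tzinfo=timezone.utc),            # at stop: excluded
]
matches = [r for r in received if in_half_open(r, start, stop)]
print(len(matches))  # 2
```

Because the stop value is excluded, the next hour's range can start at exactly 19:00:00 without counting any row twice.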
Reserved Word
I hope you have not really named your table 'table' and your column 'timestamp'. Such use of keywords and reserved words can cause explicit errors or more subtle weird problems.
Tip: The easy way to avoid any of the over a thousand reserved words in various databases is to append a trailing underscore. The SQL standard explicitly promises never to use a trailing underscore in its reserved words. So use "timestamp_" rather than "timestamp". Another example: "invoice_" table and "received_" column. I recommend doing that as a habit on everything you name in SQL: columns, tables, constraints, indexes, and so on.
Time Zone
You are using TIMESTAMP, which is short for TIMESTAMP WITHOUT TIME ZONE. Or so I presume; the Vertica doc is vague, but that is the common usage as seen in the Postgres doc, and may even be standard SQL.
Anyway, TIMESTAMP WITHOUT TIME ZONE is usually the wrong type for most business purposes. The WITH time zone type is misnamed and often misunderstood as a consequence: it means "with respect for time zone", where data inputs that include an offset from UTC or other time zone information are adjusted to UTC during the INSERT/UPDATE operations. The WITHOUT type simply ignores any such offset or time zone information.
The WITHOUT type should only be used for the concept of a date-time generally without being tied to any one locality. For example, saying "Christmas this year starts at beginning of December 25, 2015". That means in any time zone rather than a specific time zone. Obviously Christmas starts earlier in Paris, for example, than in Montréal.
If you are timestamping legal documents such as invoices, or booking appointments with people across time zones, or scheduling shipments in various localities, you should be using WITH time zone type.
So back to your possible problem: Test how Vertica or your client app or your database driver is handling your input string. It may be adjusting time zones as part of the parsing of the string using your client machine’s current default time zone. When sent to the database, that value will not match the stored value if during storage no adjustment to UTC was made.
Tip: Generally best practice is to do all your storage and business logic in UTC, adjusting to local time zones only where expected by user.
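As a rough illustration of "adjusting to UTC" versus "ignoring the offset", here is a Python sketch. The input string and its -04:00 offset are made up, and this shows the general idea rather than what Vertica or any particular driver actually does.

```python
from datetime import datetime, timezone

# Input carrying an explicit offset (hypothetical value, UTC-04:00).
local = datetime.fromisoformat("2015-06-22 18:59:59-04:00")

# "WITH time zone" style: respect the offset and adjust to UTC.
as_utc = local.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2015-06-22T22:59:59+00:00

# "WITHOUT time zone" style: the offset is simply dropped.
ignored_offset = local.replace(tzinfo=None)
print(ignored_offset.isoformat())  # 2015-06-22T18:59:59
```

The two results differ by four hours, which is the kind of gap that makes an equality comparison silently return nothing.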
We had this programming discussion on Freenode and this question came up when I was trying to use a VARCHAR(255) to store a Date Variable in this format: D/MM/YYYY. So the question is why is it so bad to use a VARCHAR to store a date. Here are the advantages:
It's faster to code. Previously I used DATE, but date formatting was a real pain.
It's more power hungry to use a string than a DATE? Who cares, we live in the GHz era.
It's not ethically correct (lolwut?). This is what the other user told me...
So what would you prefer to use to store a date? SQL VARCHAR or SQL DATE?
Why not put screws in with a hammer?
Because it isn't the right tool for the job.
Some of the disadvantages of the VARCHAR version:
You can't easily add / subtract days to the VARCHAR version.
It is harder to extract just month / year.
There is nothing stopping you putting non-date data in the VARCHAR column in the database.
The VARCHAR version is culture specific.
You can't easily sort the dates.
It is difficult to change the format if you want to later.
It is unconventional, which will make it harder for other developers to understand.
In many environments, using VARCHAR will use more storage space. This may not matter for small amounts of data, but in commercial environments with millions of rows of data this might well make a big difference.
Of course, in your hobby projects you can do what you want. In a professional environment I'd insist on using the right tool for the job.
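A few of those disadvantages show up even in application code. The sketch below is in Python rather than SQL and assumes an M/D/YYYY text format; the sample value and the is_valid_date helper are made up for illustration.

```python
from datetime import date, datetime, timedelta

as_text = "5/12/1999"        # what a VARCHAR column would hold (M/D/YYYY assumed)
as_date = date(1999, 5, 12)  # what a DATE column would hold

# With a real date, arithmetic and field extraction are direct.
next_week = as_date + timedelta(days=7)
month, year = as_date.month, as_date.year

# With text, every operation needs a parse step and an agreed format.
parsed = datetime.strptime(as_text, "%m/%d/%Y").date()

def is_valid_date(text: str, fmt: str = "%m/%d/%Y") -> bool:
    """Validation the application must do itself; a VARCHAR column accepts anything."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(next_week, month, year, parsed == as_date)
print(is_valid_date("2/30/1999"), is_valid_date("not a date"))  # False False
```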
When you have a database with more than 2-3 million rows, you'll know why it's better to use DATETIME than VARCHAR :)
The simple answer is that with databases, processing power isn't a problem anymore. Database size is, because of the HDD's seek time.
Basically, with modern hard disks you can read about 100 records per second if they're read in random order (usually the case), so you must do everything you can to minimize DB size, because:
The HDD's heads won't have to "travel" this much
You'll fit more data in RAM
In the end it's always the HDD's seek times that will kill you. E.g., a simple GROUP BY query with many rows could take a couple of hours when done on disk compared to a couple of seconds when done in RAM, because of seek times.
With VARCHARs you can't do any meaningful date searches, such as range comparisons. If you hate the way SQL deals with dates so much, just use a Unix timestamp in a 32-bit integer field. You'll have (basically) all the advantages of using a SQL DATE field; you'll just have to manipulate and format dates using your chosen programming language, not SQL functions.
Two reasons:
Sorting results by the dates
Not sensitive to date formatting changes
So let's take for instance a set of records that looks like this:
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
If we were to store the data your way and sort on the dates in ascending order, SQL will respond with a result set that looks like this:
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
Where if we stored the dates as a DATETIME, SQL will respond correctly ordering them like this:
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
Additionally, if somewhere down the road you needed to display dates in a different format, for example like YYYY-MM-DD, then you would need to transform all your data or deal with mixed content. When it's stored as a SQL DATE, you are forced to make the transform in code, and very likely have one spot to change the format to display all dates--for free.
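The ordering difference above can be reproduced in a few lines of Python. This is just an illustration of string comparison versus date comparison, using the same three sample records.

```python
from datetime import datetime

rows = [
    ("5/12/1999", "Frank N. Stein"),
    ("1/22/2005", "Drake U. La"),
    ("10/4/1962", "Goul Friend"),
]

# Sorting the raw strings compares character by character.
by_text = sorted(rows, key=lambda r: r[0])

# Sorting parsed dates follows actual calendar order.
by_date = sorted(rows, key=lambda r: datetime.strptime(r[0], "%m/%d/%Y"))

print([r[0] for r in by_text])  # ['1/22/2005', '10/4/1962', '5/12/1999']
print([r[0] for r in by_date])  # ['10/4/1962', '5/12/1999', '1/22/2005']
```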
Between DATE/DATETIME and VARCHAR for dates I would go with DATE/DATETIME every time. But there is an overlooked third option: storing it as an unsigned INTEGER!
I decided to go with INTEGER unsigned in my last project, and I am really satisfied with making that choice instead of storing it as a DATE/DATETIME. Because I was passing along dates between client and server it made the ideal type for me to use. Instead of having to store it as DATE and having to convert back every time I select, I just select it and use it however I want it. If you want to select the date as a "human-readable" date you can use the FROM_UNIXTIME() function.
Also, an integer takes up 4 bytes while DATETIME takes up 8 bytes, saving 50% of the storage.
The sorting problem that Berin proposes is also solved using integer as storage for dates.
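For what the application side of this looks like, here is a minimal Python sketch of converting to and from Unix seconds. The helper names to_unix and from_unix are made up, and the last two lines show the well-known limits of a 32-bit field.

```python
from datetime import datetime, timezone

def to_unix(dt: datetime) -> int:
    """Seconds since 1970-01-01 UTC; dt must be timezone-aware."""
    return int(dt.timestamp())

def from_unix(seconds: int) -> datetime:
    """Timezone-aware UTC datetime for the given Unix seconds."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

stamp = to_unix(datetime(2019, 12, 12, tzinfo=timezone.utc))
print(stamp, from_unix(stamp).isoformat())

# Last instants representable in 32 bits.
print(from_unix(2**31 - 1).isoformat())  # signed:   2038-01-19T03:14:07+00:00
print(from_unix(2**32 - 1).isoformat())  # unsigned: 2106-02-07T06:28:15+00:00
```

Since the stored values are plain numbers, sorting them numerically is the same as sorting chronologically.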
I'd vote for using the date/datetime types, just for the sake of simplicity/consistency.
If you do store it as a character string, store it in ISO 8601 format:
http://www.iso.org/iso/date_and_time_format
http://xml.coverpages.org/ISO-FDIS-8601.pdf
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Among other things, ISO 8601 date/time strings (A) collate properly, (B) are human readable, (C) are locale-independent, and (D) are readily convertible to other formats. To crib from the ISO blurb, ISO 8601 strings offer
representations for the following:
Date
Time of the day
Coordinated universal time (UTC)
Local time with offset to UTC
Date and time
Time intervals
Recurring time intervals
Representations can be in one of two formats: a basic format
that has a minimal number of characters and an extended format
that adds characters to enhance human readability. For example,
the third of January 2003 can be represented as either 20030103
or 2003-01-03.
[and]
offer the following advantages over many of the locally used
representations:
Easily readable and writeable by systems
Easily comparable and sortable
Language independent
Larger units are written in front of smaller units
For most representations the notation is short and of constant length
One last thing: If all you need to do is store a date, then storing it in the ISO 8601 short form YYYYMMDD in a char(8) column takes no more storage than a datetime value (and you don't need to worry about the 3 millisecond gap between the last tick of one day and the first tick of the next, but that's a matter for another discussion). If you break it up into 3 columns, YYYY char(4), MM char(2) and DD char(2), you'll use the same amount of storage and get more options for indexing. Even better, store the year as a 4-byte integer and MM and DD as a tinyint each: now you're down to 6 bytes for the date (or 4 bytes if a 2-byte smallint holds the year). The drawback, of course, to decomposing the date into its constituent parts is that conversion to proper date/time data types is complicated.
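The "collates properly" property of the YYYYMMDD form is easy to check. Here is a small Python sketch with made-up sample dates; it only illustrates the string behavior, not any particular database's char(8) handling.

```python
from datetime import date, datetime

dates = [date(1999, 5, 12), date(2005, 1, 22), date(1962, 10, 4)]

# ISO 8601 basic format: YYYYMMDD, always 8 characters.
basic = [d.strftime("%Y%m%d") for d in dates]

# Plain string sorting matches chronological order.
same_order = sorted(basic) == [d.strftime("%Y%m%d") for d in sorted(dates)]

# Converting back to a date is a single parse with a fixed format.
round_trip = [datetime.strptime(s, "%Y%m%d").date() for s in basic]

print(sorted(basic), same_order, round_trip == dates)
```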
What is the underlying data structure of datetime values stored in SQL Server (2000 and 2005, if different)? I.e., down to the byte representation?
Presumably the default representation you get when you select a datetime column is a culture specific value / subject to change. That is, some underlying structure that we don't see is getting formatted to YYYY-MM-DD HH:MM:SS.mmm.
Reason I ask is that there's a generally held view in my department that it's stored in memory literally as YYYY-MM-DD HH:MM:SS.mmm but I'm sure this isn't the case.
It's stored as an 8 byte field, capable of a range from 1753-01-01 through 9999-12-31, accurate to 0.00333 seconds.
The details are supposedly opaque, but most resources (1), (2) that I've found on the web state the following:
The first 4 bytes store the number of days since SQL Server's epoch (1 January 1900), and the second 4 bytes store the number of ticks after midnight, where a "tick" is 1/300 of a second (about 3.33 milliseconds).
The first four bytes are signed (can be positive or negative), which explains why dates earlier than the epoch can be represented.
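To see how such a layout would work, here is a Python sketch that packs and unpacks a value as 4 signed bytes of days plus 4 bytes of ticks. It follows the description above only; the big-endian byte order and the encode/decode helpers are assumptions for illustration, not SQL Server's documented internals.

```python
import struct
from datetime import datetime, timedelta

EPOCH = datetime(1900, 1, 1)
TICKS_PER_SECOND = 300  # one tick is 1/300 s, about 3.33 ms

def encode(dt: datetime) -> bytes:
    """Pack dt as 4-byte signed days since 1900-01-01 plus 4-byte ticks after midnight."""
    delta = dt - EPOCH
    ticks = round((delta.seconds + delta.microseconds / 1_000_000) * TICKS_PER_SECOND)
    # Big-endian byte order is an assumption made for this sketch.
    return struct.pack(">iI", delta.days, ticks)

def decode(raw: bytes) -> datetime:
    """Reverse of encode: days and ticks back to a datetime."""
    days, ticks = struct.unpack(">iI", raw)
    micros = round(ticks * 1_000_000 / TICKS_PER_SECOND)
    return EPOCH + timedelta(days=days, microseconds=micros)

raw = encode(datetime(2015, 6, 22, 18, 59, 59))
print(len(raw), raw.hex(), decode(raw))

# A date before the epoch is stored as a negative day count.
print(decode(encode(datetime(1753, 1, 1))))
```

Either way, nothing in this layout resembles the formatted YYYY-MM-DD HH:MM:SS.mmm string; that text only appears when the value is displayed.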