How many bytes does a BigQuery Date require - google-bigquery

I've got a timestamp column in BigQuery, and now I realize I could have used a date data type to represent this column instead (I don't need fine time granularity). My table is large and expensive to query so I'm wondering whether I'll save money by converting it to a new column of type DATE instead.
However, the official BigQuery documentation on data types doesn't seem to indicate how many bytes a date object requires. Does anyone here know?

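DATE and TIMESTAMP both require 8 bytes, so converting the column would not reduce the bytes scanned for it.
You can see more details under "Data size calculation" in the BigQuery documentation.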

Related

Invalid Time String Error when trying to change type of data from string to time

I am very new to data analytics and need some help troubleshooting a SQL error. I have a column in this table that was transferred from Excel to SQL as a string type rather than as a time type. I want to convert it to a time type so I can analyze it further.
I ran the attached query to try to change the data type using the CAST function. However, the query could not complete because of an outlier in the data set. I have yet to clean the data, and this was one of my first steps toward doing so. How do I remove the particular row that contains the invalid time string so that the query can actually work? Or is there a better way to convert this entire column from a text string to a time?
BigQuery Time types adjust values outside the 24 hour boundary - 00:00:00 to 24:00:00; for example, if you subtract an hour from 00:30:00, the returned value is 23:30:00.
Based on your screenshot it looks like you are storing a duration? So 330 hours, 25 minutes and 55 seconds?
You would probably be best using timestamp, converting the hours to days and adding the remainder to your minutes and seconds.
You can then cast the resulting string to timestamp.
Edit
A much simpler solution is just cast('330:25:55' as interval) - thanks to @MatBailie
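To illustrate why '330:25:55' behaves as a duration rather than a time of day, here is a minimal Python sketch (not BigQuery code) that parses an H:MM:SS string whose hour part may exceed 24. The function name parse_duration is hypothetical and only used for illustration.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse 'H:MM:SS' where the hour part may be larger than 23."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


duration = parse_duration("330:25:55")
print(duration)  # 13 days, 18:25:55
```

The 330 hours roll over into 13 days and 18 hours, which is the "convert the hours to days" step described above.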

Bulk insert into BigQuery from a CSV gives an error on the timezone format but a direct insert works fine

I'm having some trouble with BigQuery and going around in circles with support but it boils down to this.
I have manually entered, handwritten form data that I want to get into BigQuery. Later I will create some electronic means to replace the paper forms, but for the moment I have thousands of rows of data, largely in Excel. I have converted them to CSV and fixed any formatting errors Excel created, but I am stuck when it comes to timezones. BigQuery has 2 formats: timestamp and datetime. The first is a UTC time, and the second (after some digging) is intended to be just 'what someone would see on a watch', so call it local time.
I can create a table with a field of datatype timestamp and do an insert with "2019-12-01 11:34:00 Australia/Adelaide", and it will convert the time to UTC and insert it. If I take the exact same value and put it in a quick CSV file (created manually, not through Excel), BigQuery produces the error 'Unrecognised timezone: Australia/Adelaide; Could not parse "2019-12-01 11:34:00 Australia/Adelaide" as timestamp for field tsfield (position 0) starting at location 0'.
To get my historical information in, I'm probably going to have to give up and convert everything to UTC before I upload, but I'm now concerned about what happens when I try to do this programmatically later too.
edit: For any wayward future traveler searching for the same answer: I decided to just convert the values myself and lean into it. Before uploading, I converted all times to UTC and stored them in a 'timestamp' column. For 'local time' I used the 'datetime' type, which does not include a timezone suffix, and left the value in its original local time. I don't store anything that says where 'local time' is, but it can hopefully be derived from the UTC value in most cases.
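The pre-upload conversion described in the edit could look roughly like this in Python. This is a sketch, not the poster's actual script: the helper name to_utc is hypothetical, and it assumes every value has the form 'YYYY-MM-DD HH:MM:SS Zone/Name' and that the system has IANA timezone data available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def to_utc(value: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS Zone/Name' to a UTC string without a zone suffix."""
    stamp, zone_name = value.rsplit(" ", 1)
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=ZoneInfo(zone_name)
    )
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(to_utc("2019-12-01 11:34:00 Australia/Adelaide"))  # 2019-12-01 01:04:00
```

The original local value can be written unchanged into the datetime column, while the converted value goes into the timestamp column.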

File size grows when modify date to datetime

I'm converting date columns to datetime in several DBs, but the final file size grows from 300 MB to 3 GB. Does anyone know the reason for that huge growth?
I'm a bit of a self-taught database person, so if anyone has a more decent explanation, you should mark that as an answer. Here's what I've learned over the past two years.
(My)SQL is a pessimist by nature: if you change the type or collation of a field, it inflates the table to fit as much data as the field could contain at its maximum. (You'd see the same inflation with varchar fields when changing from utf8_general_ci to utf8mb4_general_ci.)
The difference between changing collation and changing type is that, by changing the type from date to datetime, you're asking your system to generate new data. Today's date stored in a MySQL date field is written as 2019-12-12; if you change that to a datetime type, it becomes 2019-12-12 00:00:00.
Changing the type thus adds new data: the hours, minutes and seconds. That data is meaningless (all 00), as there was nothing there in the first place.
So now you have all those 00's in your database for hours, minutes and seconds. That's probably taking up quite a lot of space, but my gut feeling is that it shouldn't inflate your database from 300 MB to 3 GB.
The date type takes about 3 bytes and datetime about 8 bytes. If there are a million records... that's it :)
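As a rough check of the figures above, the extra storage from the type change alone can be estimated with simple arithmetic. The byte sizes are the approximate ones quoted in the answer, the row count is only an example, and index or row overhead is ignored.

```python
DATE_BYTES = 3      # approximate size quoted above
DATETIME_BYTES = 8  # approximate size quoted above


def extra_bytes(rows: int, columns: int = 1) -> int:
    """Extra storage when `columns` date columns become datetime columns."""
    return rows * columns * (DATETIME_BYTES - DATE_BYTES)


print(extra_bytes(1_000_000) / 1_000_000, "MB")  # 5.0 MB
```

At about 5 MB per million rows per converted column, the type change by itself only explains a jump of several gigabytes if the tables hold hundreds of millions of date values.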

what is realdate in SQL?

I have a SQLite database in which one of the columns has the data type realdate, and the column holds the value 2453137.5.
Can anyone please comment on this?
Any help is appreciated :)
From the SQLite docs:
SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in Date And Time Functions of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values:
TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS").
REAL as Julian day numbers, the number of days since noon in Greenwich on November 24, 4714 B.C. according to the proleptic Gregorian calendar.
INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.
Applications can choose to store dates and times in any of these formats and freely convert between formats using the built-in date and time functions.
In your example, the REAL datatype is used to store dates, so the stored value is not human readable.
For example, if I store the current date and time:
CREATE TABLE IF NOT EXISTS DATEREAL (d1 REAL);
INSERT INTO DATEREAL (d1) VALUES (julianday('now'));
SELECT * FROM DATEREAL;
Output: 2458792.7882345
You can read this using the built-in date() and time() functions, as shown below:
SELECT date(d1), time(d1) FROM DATEREAL;
Output:
date(d1)    time(d1)
2019-11-05  06:55:03
Check demo here
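The same functions can be applied to the value from the question. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a convenient way to run the SQL; only SQLite's documented date(), time() and julianday() functions are involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Decode the Julian day number from the question, then convert the date back.
row = conn.execute(
    "SELECT date(2453137.5), time(2453137.5), julianday('2004-05-12')"
).fetchone()
conn.close()

print(row)  # ('2004-05-12', '00:00:00', 2453137.5)
```

So 2453137.5 is midnight UTC on 12 May 2004, stored as a Julian day number.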
One of the powerful features of SQLite is allowing you to choose the storage type.
A real number has 2 advantages:
High precision for fractional seconds
Longest time range
I got this answer from a user named Zso.
Here's the link to the original post How do DATETIME values work in SQLite?.
Hope this might help you to understand better.

When to use VARCHAR and DATE/DATETIME

We had this programming discussion on Freenode, and this question came up when I was trying to use a VARCHAR(255) to store a date variable in this format: D/MM/YYYY. So the question is: why is it so bad to use a VARCHAR to store a date? Here are the advantages:
It's faster to code. Previously I used DATE, but date formatting was a real pain.
It's more power hungry to use a string than a DATE? Who cares, we live in the GHz era.
It's not ethically correct (lolwut?). This is what the other user told me...
So what would you prefer to use to store a date: SQL VARCHAR or SQL DATE?
Why not put screws in with a hammer?
Because it isn't the right tool for the job.
Some of the disadvantages of the VARCHAR version:
You can't easily add / subtract days to the VARCHAR version.
It is harder to extract just month / year.
There is nothing stopping you putting non-date data in the VARCHAR column in the database.
The VARCHAR version is culture specific.
You can't easily sort the dates.
It is difficult to change the format if you want to later.
It is unconventional, which will make it harder for other developers to understand.
In many environments, using VARCHAR will use more storage space. This may not matter for small amounts of data, but in commercial environments with millions of rows of data this might well make a big difference.
Of course, in your hobby projects you can do what you want. In a professional environment I'd insist on using the right tool for the job.
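A few of the disadvantages listed above can be seen in a short Python sketch. It uses Python's datetime module as a stand-in for a database date type, and the helper names parse_text_date and is_valid_text_date are hypothetical. The format string assumes the day/month/year layout from the question.

```python
from datetime import date, datetime, timedelta

# With a real date type, arithmetic and field extraction are direct.
stored_as_date = date(2020, 1, 31)
next_day = stored_as_date + timedelta(days=1)
print(next_day, stored_as_date.month, stored_as_date.year)  # 2020-02-01 1 2020


def parse_text_date(text: str) -> date:
    """A text column has to be parsed before any of that is possible."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def is_valid_text_date(text: str) -> bool:
    """Nothing stops bad values being stored; they only fail when parsed."""
    try:
        parse_text_date(text)
        return True
    except ValueError:
        return False


print(is_valid_text_date("31/01/2020"))  # True
print(is_valid_text_date("31/02/2020"))  # False
```

A date column rejects the invalid value at insert time, whereas a VARCHAR column accepts it and pushes the failure to every reader.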
When you have a database with more than 2-3 million rows, you'll know why it's better to use DATETIME than VARCHAR :)
The simple answer is that with databases, processing power isn't a problem anymore. Database size is, because of the HDD's seek time.
Basically, with modern hard disks you can read about 100 records per second if they're read in random order (usually the case), so you must do everything you can to minimize DB size, because:
The HDD's heads won't have to "travel" as much
You'll fit more data in RAM
In the end it's always the HDD's seek times that will kill you. E.g. a simple GROUP BY query over many rows could take a couple of hours when done on disk compared to a couple of seconds when done in RAM, because of seek times.
With VARCHARs you can't do any meaningful date searches (ranges, for example). If you hate the way SQL deals with dates so much, just use a Unix timestamp in a 32-bit integer field. You'll have (basically) all the advantages of a SQL DATE field; you'll just have to manipulate and format dates in your chosen programming language instead of with SQL functions.
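The "couple of hours" figure follows from the read rate quoted above. Here is a back-of-the-envelope sketch that takes the 100 random reads per second at face value and assumes one random read per row; the row count is only an example.

```python
READS_PER_SECOND = 100  # random-read rate quoted above


def random_read_hours(rows: int) -> float:
    """Hours needed to touch every row once at one random read per row."""
    return rows / READS_PER_SECOND / 3600


print(round(random_read_hours(1_000_000), 1))  # 2.8
```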
Two reasons:
Sorting results by the dates
Not sensitive to date formatting changes
So let's take for instance a set of records that looks like this:
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
If we were to store the data your way and sort on the dates in ascending order, SQL would respond with a result set that looks like this:
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
Whereas if we stored the dates as DATETIME, SQL would order them correctly, like this:
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
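The two orderings can be reproduced with a short Python sketch, using plain string comparison as a stand-in for sorting a VARCHAR column and parsed dates as a stand-in for a DATETIME column. The M/D/YYYY format string is an assumption based on the sample values.

```python
from datetime import datetime

records = [
    ("5/12/1999", "Frank N. Stein"),
    ("1/22/2005", "Drake U. La"),
    ("10/4/1962", "Goul Friend"),
]

# Sorting the raw text compares character by character.
by_text = sorted(records, key=lambda r: r[0])
# Sorting parsed dates compares actual points in time.
by_date = sorted(records, key=lambda r: datetime.strptime(r[0], "%m/%d/%Y"))

print([r[0] for r in by_text])  # ['1/22/2005', '10/4/1962', '5/12/1999']
print([r[0] for r in by_date])  # ['10/4/1962', '5/12/1999', '1/22/2005']
```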
Additionally, if somewhere down the road you needed to display dates in a different format, for example like YYYY-MM-DD, then you would need to transform all your data or deal with mixed content. When it's stored as a SQL DATE, you are forced to make the transform in code, and very likely have one spot to change the format to display all dates--for free.
Between DATE/DATETIME and VARCHAR for dates, I would go with DATE/DATETIME every time. But there is an overlooked third option: storing it as an unsigned INTEGER!
I decided to go with an unsigned INTEGER in my last project, and I am really satisfied with that choice over DATE/DATETIME. Because I was passing dates between client and server, it was the ideal type for me to use. Instead of storing it as DATE and converting back every time I select, I just select it and use it however I want. If you want to select the date in "human-readable" form, you can use the FROM_UNIXTIME() function.
Also, an integer takes up 4 bytes while DATETIME takes up 8 bytes, saving 50% of the storage.
The sorting problem that Berin proposes is also solved using integer as storage for dates.
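A minimal Python sketch of the integer approach follows. The helper names to_unix and from_unix are hypothetical; from_unix plays the role that FROM_UNIXTIME() plays in SQL, and all values are assumed to be UTC. The struct call is only there to show that the value fits in 4 bytes.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix(moment: datetime) -> int:
    """Whole seconds since 1970-01-01 00:00:00 UTC."""
    return int((moment - EPOCH).total_seconds())


def from_unix(seconds: int) -> datetime:
    """Turn a stored integer back into a readable UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


ts = to_unix(datetime(2019, 12, 12, tzinfo=timezone.utc))
packed = struct.pack("<I", ts)  # unsigned 32-bit integer

print(ts, len(packed))       # 1576108800 4
print(from_unix(2**32 - 1))  # 2106-02-07 06:28:15+00:00
```

The last line shows the trade-off: an unsigned 32-bit field cannot represent dates before 1970 or after early 2106.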
I'd vote for using the date/datetime types, just for the sake of simplicity/consistency.
If you do store it as a character string, store it in ISO 8601 format:
http://www.iso.org/iso/date_and_time_format
http://xml.coverpages.org/ISO-FDIS-8601.pdf
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Among other things, ISO 8601 date/time strings (A) collate properly, (B) are human readable, (C) are locale-independent, and (D) are readily convertible to other formats. To crib from the ISO blurb, ISO 8601 strings offer
representations for the following:
Date
Time of the day
Coordinated universal time (UTC)
Local time with offset to UTC
Date and time
Time intervals
Recurring time intervals
Representations can be in one of two formats: a basic format
that has a minimal number of characters and an extended format
that adds characters to enhance human readability. For example,
the third of January 2003 can be represented as either 20030103
or 2003-01-03.
[and]
offer the following advantages over many of the locally used
representations:
Easily readable and writeable by systems
Easily comparable and sortable
Language independent
Larger units are written in front of smaller units
For most representations the notation is short and of constant length
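The two formats and the sortability claim from the blurb can be checked with Python's standard date type. This is only an illustration; the sample dates other than 3 January 2003 are made up for the example.

```python
from datetime import date

d = date(2003, 1, 3)
extended = d.isoformat()      # extended format with separators
basic = d.strftime("%Y%m%d")  # basic format, minimal characters
print(extended, basic)        # 2003-01-03 20030103

# Larger units come first, so plain text order equals chronological order.
stamps = ["2005-01-22", "1962-10-04", "1999-05-12"]
text_order = sorted(stamps)
date_order = sorted(stamps, key=date.fromisoformat)
print(text_order == date_order)  # True
```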
One last thing: if all you need to do is store a date, then storing it in the ISO 8601 short form YYYYMMDD in a char(8) column takes no more storage than a datetime value (and you don't need to worry about the 3 millisecond gap between the last tick of one day and the first tick of the next, but that's a matter for another discussion). If you break it up into 3 columns, YYYY char(4), MM char(2) and DD char(2), you'll use the same amount of storage and get more options for indexing. Even better, store the year as a 2-byte short (smallint) and use a 1-byte tinyint for each of MM and DD: now you're down to 4 bytes for the date. The drawback, of course, of decomposing the date into its constituent parts is that conversion to proper date/time data types is complicated.