I've got a huge (1.5 GB) CSV file with dates in it in the format 2014-12-25. I have managed to upload it to BigQuery, with this column loaded as a string. I'm wondering if I can transform it in situ to a datetime format, without having to download the data, parse it and send it back?
I have used the BigQuery GUI (newbie) but am happy to use the CLI if this will make it easier.
You can use BigQuery's date and time functions to "transform" a string-represented date into a TIMESTAMP.
For example:
SELECT '2014-12-25', TIMESTAMP('2014-12-25')
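-- returns the original string alongside the parsed value: 2014-12-25 | 2014-12-25 00:00:00 UTC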
Added:
If you feel that you really need your data with the date as a TIMESTAMP rather than a STRING, and this data (as strings) is already in BigQuery, you can run a query similar to the one below and write the result to a new table.
SELECT
  TIMESTAMP(date_string) AS date_timestamp,
  < list all the rest of the fields >
FROM original_table
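Since you mentioned being happy to use the CLI: a minimal sketch of writing the query result to a new table with bq (mydataset, new_table, and other_field are placeholder names):

bq query --destination_table=mydataset.new_table --use_legacy_sql=false \
  'SELECT TIMESTAMP(date_string) AS date_timestamp, other_field
   FROM mydataset.original_table'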
Related
I've been struggling with some datasets I want to use that have a problem with the date format.
BigQuery could not load the files and returned the following error:
Could not parse '4/12/2016 2:47:30 AM' as TIMESTAMP for field date (position 1) starting at location 21 with message 'Invalid time zone: AM'
I have been able to upload the file manually, but only with the fields as strings, and I would now like to set them back to the proper format. However, I could not find a way to change the date column from STRING to a proper DATETIME format.
Would love to know if this is possible, as the file is just too large to be formatted in Excel or Sheets (as I have done with the smaller files from this dataset).
now would like to set the fields back to the proper format ... from string to proper DateTime format
Use parse_datetime('%m/%d/%Y %r', string_col) to parse a DATETIME out of the string.
Applied to the sample string in your question, you get:
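A runnable check against the literal from the error message:

SELECT PARSE_DATETIME('%m/%d/%Y %r', '4/12/2016 2:47:30 AM') AS dt
-- dt: 2016-04-12T02:47:30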
As @Mikhail Berlyant rightly said, using the parse_datetime('%m/%d/%Y %r', string_col) function will convert your badly formatted dates to the standard ISO 8601 format accepted by Google BigQuery. The best option is then to save the query results to a new table in your BigQuery project.
I had a similar issue.
Below is an image of my table, which I uploaded with all columns in STRING format.
Next, I applied the settings below to my query.
The settings stored the query output in a new table called heartrateSeconds_clean in the same dataset.
The 'Write if empty' option is a good way to avoid overwriting the existing raw data or arbitrarily writing output over an existing table, unless that is what you want. Save the settings and proceed to run your query.
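For reference, the query was along these lines (a sketch; Id and Value are illustrative stand-ins for the other columns in my table):

SELECT
  Id,
  PARSE_DATETIME('%m/%d/%Y %r', Time) AS Time,
  Value
FROM mydataset.heartrateSeconds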
As seen below, the output schema of the new table is automatically updated.
Below is the new preview of the resulting table.
NB: I did not apply an ORDER BY clause to the results, hence the data is not ordered by any specific column in either version of the table.
This dataset has over 2M rows.
I am trying to store a DATE column in BigQuery via Spark:
cast(from_unixtime(eventtime*60) as date) as createdDate
I have tried to_date as well, like below, but no luck:
to_date(from_unixtime(eventtime*60)) as createdDate
Now, when I try to save this dataset using the Spark-BigQuery connector, it gives me an error that the field createdDate has changed type from DATE to INTEGER. But when I print the schema in Spark, it correctly shows that the column's data type is date:
|-- createdDate: date (nullable = false)
Not sure why it's failing while loading the data into BigQuery.
The same thing works if I change the data type from Date to Timestamp. Please advise.
The resolution is to set intermediateFormat to ORC. With Avro as the intermediate format it does not work, and we can't use the default Parquet format because we have an ARRAY data type in our table, for which BigQuery creates an intermediate structure as explained here:
Save Array<T> in BigQuery using Java
I'm uploading a CSV file to Google BigQuery using bq load on the command line. It's working great, but I've got a question about converting timestamps on the fly.
In my source data, my timestamps are formatted as YYYYMM, e.g. 201303 meaning March 2013.
However, Google BigQuery's timestamp fields are documented as only supporting Unix timestamps and YYYY-MM-DD HH:MM:SS format strings. So unsurprisingly, when I load the data, these fields don't convert to the correct date.
Is there any way I can convey to BigQuery that these are YYYYMM strings?
If not I can convert them before loading, but I have about 1TB of source data, so I'm keen to avoid that if possible :)
Another alternative is to load this field as STRING and convert it to TIMESTAMP inside BigQuery itself, copying the data into another table (and deleting the original one afterwards), with the following transformation:
SELECT TIMESTAMP(your_ts_str + "01") AS ts
An alternative to Mosha's answer:
SELECT DATE(CONCAT(your_ts_str, "01")) AS ts
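In standard SQL, the same idea can be written with PARSE_TIMESTAMP (a sketch; your_ts_str is the STRING column from the answers above):

SELECT PARSE_TIMESTAMP('%Y%m%d', CONCAT(your_ts_str, '01')) AS ts
-- e.g. '201303' -> 2013-03-01 00:00:00 UTC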
I have an SSIS package which contains the following flow. The package has two Data Flow Tasks.
Data Flow Task 1:
In the first data flow task, output from a stored procedure involving multiple tables in different databases is written to two tables in a database.
Data Flow Task 2:
In the second data flow task, output from one of the above mentioned tables is written to a Flat File.
All the tables involved in the package have a column named BirthDate of data type VARCHAR(10). This column contains date values as strings in the format YYYYMMDD.
Data written to the flat file is being saved in the format YYYY-MM-DD. However, I want the date value to be written to the flat file in the format MM-DD-YYYY.
Questions:
How can I achieve the date format MM-DD-YYYY within the flat file?
Do I need to change the column data type from VARCHAR(10) to some other data type in the tables? I have lots of tables in various databases.
I use a data conversion tool for situations like this; there is no need to change the source table data types, and it helps maintain source and destination table integrity.
Best of luck.
The .NET Framework has some very nice string manipulation capabilities, and a Script Transformation would take care of this problem easily. Add a Script Transformation Component to your dataflow; in the Input Columns section of the Script Transformation Editor, make sure you include BirthDate in the Available Input Columns grid and set its UsageType to ReadWrite. Then, this little bit of C# code will convert your YYYYMMDD strings to MM-DD-YYYY:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Parse the incoming YYYYMMDD string strictly, using the invariant culture.
    DateTime birthDate = DateTime.ParseExact(
        Row.BirthDate,
        "yyyyMMdd",
        System.Globalization.CultureInfo.InvariantCulture);

    // Write the value back in MM-DD-YYYY form.
    Row.BirthDate = birthDate.ToString("MM-dd-yyyy");
}
With respect to your second question: Ideally, if you're storing a date in a database, you should be using the appropriate datatype. You might want to ask around your organization to find out why all those BirthDate columns were defined as VARCHAR - there may well have been a good reason at the time.
If it's always YYYYMMDD -- that is, zero-filled like 20130221, I'd just use simple string functions:
SUBSTRING([Birthdate],5,2) + '-'
+ SUBSTRING([Birthdate],7,2) + '-'
+ SUBSTRING([Birthdate],1,4)
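For example, run against a literal (a quick T-SQL check):

SELECT SUBSTRING('20130221', 5, 2) + '-'
     + SUBSTRING('20130221', 7, 2) + '-'
     + SUBSTRING('20130221', 1, 4);
-- returns '02-21-2013'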
I agree that it would be better to use an actual datetime or date datatype. But if you can't or won't, then a simple expression like this seems like the easiest way to me.
Here's one solution; optimally, you should alter the column to the correct data type. The inner CONVERT parses the string into a DATETIME, and the outer CONVERT with style 110 renders it as mm-dd-yyyy:
SELECT CONVERT(VARCHAR, CONVERT(DATETIME, wrongtypecol, 120), 110) AS date
FROM tbl
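A quick check against a zero-filled YYYYMMDD literal from the question:

SELECT CONVERT(VARCHAR, CONVERT(DATETIME, '20130221', 120), 110) AS date;
-- returns '02-21-2013'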