Load CSV with a timestamp column into an Athena table - SQL

I have started using the Athena query engine on top of my S3 files;
some of the columns are in timestamp format.
I have created a simple table with 2 columns:
CREATE EXTERNAL TABLE `test`(
`date_x` timestamp,
`clicks` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-athena-query-results-123-us-east-1/test'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1525003090')
I have tried to load a file and query it with Athena.
The file looks like this:
"2018-08-09 06:00:00.000",12
"2018-08-09 06:00:00.000",42
"2018-08-09 06:00:00.000",22
I have tried different timestamp formats such as DD/MM/YYYY and YYYY-MM-DD..., and I tried setting the time zone for each row, but none of them worked.
Every value I have tried shows up in Athena like this:
date_x clicks
1 12
2 42
3 22
I have tried using a CSV file with and without headers,
and with and without quotation marks,
but all of them show a broken timestamp.
My column in Athena must be TIMESTAMP, preferably without a time zone.
Please don't suggest using a STRING or DATE column; that is not what I need.
What should the CSV file look like so that Athena will recognize the timestamp column?

Try the format yyyy-MM-dd HH:mm:ss.SSSSSS.
The article https://docs.amazonaws.cn/en_us/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html suggests:
"Timestamp values in text files must be in the format yyyy-MM-dd HH:mm:ss.SSSSSS, as the following timestamp value shows: 2017-05-01 11:30:59.000000."

Related

AWS Athena mis-interpreting timestamp column

I'm processing CSV files and outputting Parquet files using Pandas in an AWS Lambda function, saving the data to an S3 bucket to query with Athena. The raw input to the Lambda function is CSV, with a Unix timestamp in UTC that looks like:
Timestamp,DeviceName,DeviceUUID,SignalName,SignalValueRaw,SignalValueScaled,SignalType,Valid
1605074410110,F2016B1E.CAP.0 - 41840982B40192,323da038-bb49-4f3a-a045-925194364e5b,X.ALM.FLG,0,0,INTEGER,true
I parse the Timestamp like:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
df.head()
Timestamp DeviceName DeviceUUID SignalName SignalValueRaw SignalValueScaled SignalType SubstationId StationBankId FeederId year month day hour DeviceNameClean DeviceType
0 2020-11-11 06:00:10.110 F2016B2W.MLR.0 - 41841005000073 3c4839b1-ab99-4164-b415-4653948360ef CVR_X_ENGAGED_A 0 0 BOOLEAN Kenton FR2016B2 F2016B2W 2020 11 11 6 MLR.0 - 41841005000073 MLR
I process the data further in the Lambda function, then output a parquet file.
I then run a Glue crawler against the parquet files that this script outputs, and in S3 I can query the data fine:
2020-11-14T05:00:43.609Z,02703ee8-b08a-4c49-9581-706f905aa192,FR22607.REG.0,REG,REG.0,ROSS,FR22607,,0,0,0,0,0,0,0,0,,0.0,,,,0.0,,,,1.0,,
The Glue crawler correctly identifies the column as timestamp:
CREATE EXTERNAL TABLE `cvr_event_log`(
`timestamp` timestamp,
`deviceuuid` string,
`devicename` string,
`devicetype` string,
...
But when I then query the table in Athena, I get this for the date:
"timestamp","deviceuuid","devicename","devicetype",
"+52840-11-19 16:56:55.000","0ca4ed37-930d-4778-b3a8-f49d9b498364","FR22606.REG.0","REG",
What has Athena so confused about the timestamp?
For a TIMESTAMP column to work in Athena you need to use a specific format, which unfortunately is not ISO 8601. It looks like this: "2020-11-14 20:33:42".
You can use from_iso8601_timestamp(ts) to parse ISO 8601 timestamps in queries.
Glue crawlers sadly misinterpret things quite often and create tables that don't work properly with Athena.
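As a minimal sketch, here is how the ISO 8601 value from the question can be parsed inside an Athena query (from_iso8601_timestamp returns a timestamp with time zone, which can be cast to a plain timestamp if needed):
SELECT
  from_iso8601_timestamp('2020-11-14T05:00:43.609Z') AS parsed_with_tz,
  CAST(from_iso8601_timestamp('2020-11-14T05:00:43.609Z') AS timestamp) AS parsed_plain;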

Redshift COPY command returns stl_load_error 1205 Invalid Date Format - length must be 10 or more

I am copying a .csv file from S3 into Redshift and the Redshift COPY command returns
stl_load_error 1205
Invalid Date Format - length must be 10 or more.
My dates are all 10 characters long and in the default 'YYYY-MM-DD' format.
Command:
COPY [table]
FROM [file location]
ACCESS_KEY_ID [___]
SECRET_ACCESS_KEY [____]
FORMAT AS CSV
IGNOREHEADER 1;
The table was created using:
CREATE TABLE finance.commissions_life (
submitted_date date,
campaign varchar(40),
click_id varchar(40),
local_id varchar(40),
num_apps float);
And the .csv is in that exact format as well.
Is anyone else having a similar issue?
When I have run into this error in the past, I always fall back on explicitly defining both the delimiter to be used, and the date format:
COPY db.schema.table
FROM 's3://bucket/folder/file.csv'
CREDENTIALS 'aws_access_key_id=[access_key];aws_secret_access_key=[secret_access_key]'
DELIMITER AS ','
DATEFORMAT 'YYYY-MM-DD'
IGNOREHEADER 1
;
If you have the ability to alter the S3 file's structure/format, you should explicitly wrap the dates in quotes, and save it as a tab-delimited text file instead of a CSV. If you can do this, your COPY command would then be:
COPY db.schema.table
FROM 's3://bucket/folder/file.csv'
CREDENTIALS 'aws_access_key_id=[access_key];aws_secret_access_key=[secret_access_key]'
DELIMITER AS '\t'
DATEFORMAT 'YYYY-MM-DD'
IGNOREHEADER 1
REMOVEQUOTES
;
Additionally, you should be able to query the system table stl_load_errors to gather additional information on the exact row/text that is causing the load to fail:
SELECT *
FROM stl_load_errors
ORDER BY starttime DESC
;
Extending the answer provided by @John: use DATEFORMAT 'auto' in the COPY command to have more flexibility. Also, if his answer resolved the issue, please mark it as accepted so that we know.
If not, can you query the system error table to see the erroneous records and edit your question to publish the "raw_line" or "raw_field_value" value?
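A minimal sketch combining both suggestions, reusing the table and placeholder credentials from the question and answer above (the stl_load_errors columns are standard; everything else is illustrative):
COPY finance.commissions_life
FROM 's3://bucket/folder/file.csv'
CREDENTIALS 'aws_access_key_id=[access_key];aws_secret_access_key=[secret_access_key]'
FORMAT AS CSV
DATEFORMAT 'auto'
IGNOREHEADER 1;

-- then inspect the offending rows and field values
SELECT starttime, filename, line_number, colname, type, raw_line, raw_field_value, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;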
The issue was that the table I was uploading had an index column that was offsetting the columns: the field that was supposed to be the ten-character date wasn't aligned with the date column in the database table.
Thank you for your help!
In my case I was adding the date with quotes, "2018-01-03"; it accepted the date 2018-01-03 without the quotes.
The error message is really misleading.
In my case I was receiving the error message because of a type mismatch:
the target column was DATE while the actual value to be loaded was a TIMESTAMP. Converting the column to the correct data type fixed the issue for me.
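A minimal sketch of that fix applied to the table from the original question, assuming the incoming values really do carry a time component:
CREATE TABLE finance.commissions_life (
    submitted_date timestamp,   -- timestamp instead of date, to match the loaded values
    campaign varchar(40),
    click_id varchar(40),
    local_id varchar(40),
    num_apps float);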

BigQuery extract does not return dates in ISO format

I want to export a BQ table to a JSON file. The table has a lot of nested fields, arrays, etc., with TIMESTAMP columns.
When I run a
bq extract --destination_format NEWLINE_DELIMITED_JSON data:example.table gs://test/export_json.json
I get JSON back, but the dates are not in the format I would expect. They look like "ProcessedTimestamp":"2019-03-06 20:20:52.588 UTC", but I would expect "ProcessedTimestamp":"2019-03-06T20:20:52.588Z".
If I do
SELECT
TO_JSON_STRING(t, TRUE) AS json_example
FROM
Table
then I can see that the dates are in ISO format. Shouldn't the same happen on extract, since I have specified the format as NEWLINE_DELIMITED_JSON?
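One possible workaround (a sketch, not from the original thread; it assumes ProcessedTimestamp is a top-level TIMESTAMP column and uses the table path from the bq command) is to format the column explicitly in a query and export the query result instead of the raw table:
SELECT
  FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%E3SZ', ProcessedTimestamp, 'UTC') AS ProcessedTimestamp
FROM `data.example.table`;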

How to convert timestamp from different timezones in Hive

I am querying a table in Hive with JSON payloads and am extracting the timestamp from these payloads. The problem is that the timestamps come in different time zone formats, and I'm trying to extract them all in my time zone.
I am currently using the following:
select
from_unixtime(unix_timestamp(get_json_object (table.payload,
'$.timestamp'), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
FROM table
This returns the correct value if the timestamp is in this format: 2018-08-16T08:54:05.543Z --> 2018-08-16 18:54:05 (format changed and converted to my time zone).
However, the query above returns null if the payload contains a timestamp in one of these formats:
2018-09-13T01:35:08.460+0000
2018-09-13T11:35:09+10:00
How can I adjust my query to handle all of these timestamp variants, converting them all to the proper time zone (+10 is my time zone!) and to the same format?
Thanks in advance!
How about the following macro:
create temporary macro extract_ts(ts string)
from_unixtime(unix_timestamp(regexp_extract(ts, '(.*)\\+(.*)', 1), "yyyy-MM-dd'T'HH:mm:ss") + 3600*cast(regexp_extract(ts, '(.*)\\+(.*)\\:(.*)', 2) as int));
e.g.,
hive> select extract_ts('2018-09-13T11:35:09+10:00');
OK
2018-09-13 21:35:09
Without a regexp, use Z for offsets like +1000 or XXX for offsets like +10:00:
select unix_timestamp('2016-07-30T10:29:33.000+03:00', "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") as t1
select unix_timestamp('2016-07-30T10:29:33.000+0300', "yyyy-MM-dd'T'HH:mm:ss.SSSZ") as t2
Full docs about time formats:
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
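As a minimal sketch, applying those two patterns to the exact values from the question (from_unixtime then renders the result in the Hive session's time zone):
select
  from_unixtime(unix_timestamp('2018-09-13T01:35:08.460+0000', "yyyy-MM-dd'T'HH:mm:ss.SSSZ")) as ts1,
  from_unixtime(unix_timestamp('2018-09-13T11:35:09+10:00', "yyyy-MM-dd'T'HH:mm:ssXXX")) as ts2;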

How to process a text timestamp in Hive

I have a column in a Hive table stored as text. The text looks as shown below:
2007-01-01T00:00:00+00:00
I am trying to find the time difference between two timestamp values stored as text in the above format.
Suppose we've got a Hive table dateTest with two columns, date1 string and date2 string,
and suppose that table contains a row with these values:
2007-01-01T00:00:00+00:00,2007-02-01T00:00:00+00:00
The dates are in ISO 8601 UTC format, so if you run this query:
select datediff(from_unixtime(unix_timestamp(date2, "yyyy-MM-dd'T'HH:mm:ss")),from_unixtime(unix_timestamp(date1, "yyyy-MM-dd'T'HH:mm:ss"))) as days
from datetest;
the result is 31.
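If a day-level difference isn't granular enough, a similar sketch against the same dateTest table gives the difference in hours:
select (unix_timestamp(date2, "yyyy-MM-dd'T'HH:mm:ss") - unix_timestamp(date1, "yyyy-MM-dd'T'HH:mm:ss")) / 3600 as hours_diff
from datetest;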