AWS Athena misinterpreting timestamp column - pandas

I'm processing CSV files, outputting parquet files using Pandas in an AWS Lambda function, and saving the data to an S3 bucket to query with Athena. The raw input to the Lambda function is CSV, with a Unix timestamp (milliseconds, UTC) that looks like:
Timestamp,DeviceName,DeviceUUID,SignalName,SignalValueRaw,SignalValueScaled,SignalType,Valid
1605074410110,F2016B1E.CAP.0 - 41840982B40192,323da038-bb49-4f3a-a045-925194364e5b,X.ALM.FLG,0,0,INTEGER,true
I parse the Timestamp like:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
df.head()
Timestamp DeviceName DeviceUUID SignalName SignalValueRaw SignalValueScaled SignalType SubstationId StationBankId FeederId year month day hour DeviceNameClean DeviceType
0 2020-11-11 06:00:10.110 F2016B2W.MLR.0 - 41841005000073 3c4839b1-ab99-4164-b415-4653948360ef CVR_X_ENGAGED_A 0 0 BOOLEAN Kenton FR2016B2 F2016B2W 2020 11 11 6 MLR.0 - 41841005000073 MLR
I process the data further in the Lambda function, then output a parquet file.
I then run a Glue crawler against the parquet files that this script outputs; in S3, I can query the data fine:
2020-11-14T05:00:43.609Z,02703ee8-b08a-4c49-9581-706f905aa192,FR22607.REG.0,REG,REG.0,ROSS,FR22607,,0,0,0,0,0,0,0,0,,0.0,,,,0.0,,,,1.0,,
The glue crawler correctly identifies the column as timestamp:
CREATE EXTERNAL TABLE `cvr_event_log`(
`timestamp` timestamp,
`deviceuuid` string,
`devicename` string,
`devicetype` string,
...
But when I then query the table in Athena, I get this for the date:
"timestamp","deviceuuid","devicename","devicetype",
"+52840-11-19 16:56:55.000","0ca4ed37-930d-4778-b3a8-f49d9b498364","FR22606.REG.0","REG",
What has Athena so confused about the timestamp?

For a TIMESTAMP column to work in Athena you need to use a specific format, which unfortunately is not ISO 8601. It looks like this: "2020-11-14 20:33:42".
You can use from_iso8601_timestamp(ts) to parse ISO 8601 timestamps in queries.
Glue crawlers sadly misinterpret things quite often and create tables that don't work properly with Athena.

Related

Quicksight data from Athena for correct TimeStamp

I am trying to use a datetime in AWS QuickSight from Athena, parsing it out of each file name (path), which has a format like this: '2022-06-02 19:26:48.491730xxxxx'
Using this
date(date_format(AT_TIMEZONE(cast (regexp_extract("$path", '\w{4}-\w{2}-\w{2} \w{2}:\w{2}:\w{2}.w{6}') as timestamp),'America/Los_Angeles'),'yyyy-MM-dd HH:mm:ss:SSSSSS')) AS time_stamp,
I get null
This should work for you:
date_parse(date_format(AT_TIMEZONE(cast (regexp_extract("$path", '\w{4}-\w{2}-\w{2} \w{2}:\w{2}:\w{2}') as timestamp),'America/Los_Angeles'),'%Y-%m-%d %H:%i:%s'),'%Y-%m-%d %H:%i:%s') AS time_stamp
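The null comes from the pattern itself: in the original query, .w{6} is missing a backslash (it would need to be \.\w{6}), so regexp_extract never matches and the cast propagates the null. A quick illustration with Python's re module, whose syntax is close enough here (the path string is made up for the example):

```python
import re

path = "s3://bucket/2022-06-02 19:26:48.491730xxxxx"  # hypothetical $path value

broken = r"\w{4}-\w{2}-\w{2} \w{2}:\w{2}:\w{2}.w{6}"  # '.w{6}' wants six literal w's
fixed = r"\w{4}-\w{2}-\w{2} \w{2}:\w{2}:\w{2}"        # drop the fractional part

print(re.search(broken, path))          # None -> regexp_extract returns null
print(re.search(fixed, path).group())   # 2022-06-02 19:26:48
```

Dropping the fractional seconds entirely, as the working query above does, sidesteps the escaping problem.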

Why does a timestamp column return empty values in Redshift?

I have some JSON files in S3 that I am trying to analyze in Redshift and Redshift Spectrum.
The S3 file has a key with a timestamp value in the format "Thu, 18 Mar 2021 08:50:35 +0000", and when I try to query this particular column in Redshift it returns an empty value.
Note that the other keys in the S3 file are fetched by Redshift fine; only the published-date key, which has a timestamp data type, comes back empty.
SELECT b.published_date
FROM jatinspectrum.extable a, a.enteries b
The query executes successfully but produces no output.
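Redshift Spectrum expects text timestamps in the yyyy-MM-dd HH:mm:ss[.SSSSSS] form; an RFC 2822 date like "Thu, 18 Mar 2021 08:50:35 +0000" does not parse, which typically surfaces as empty values. If the files can be rewritten before loading, Python's stdlib can convert the field; a sketch:

```python
from email.utils import parsedate_to_datetime

raw = "Thu, 18 Mar 2021 08:50:35 +0000"

dt = parsedate_to_datetime(raw)           # timezone-aware datetime
fixed = dt.strftime("%Y-%m-%d %H:%M:%S")  # form Spectrum can parse as timestamp
print(fixed)  # 2021-03-18 08:50:35
```

The alternative, if the files cannot be touched, is declaring the column as VARCHAR and parsing it at query time.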

BigQuery extract does not return dates in ISO format

I want to export a BQ table to a JSON file. The table has a lot of nested fields, arrays, etc., including TIMESTAMP columns.
When I run a
bq extract --destination_format NEWLINE_DELIMITED_JSON data:example.table gs://test/export_json.json
I get JSON back, but the dates are not in the format I would expect: they appear as "ProcessedTimestamp":"2019-03-06 20:20:52.588 UTC", whereas I would expect "ProcessedTimestamp":"2019-03-06T20:20:52.588Z".
If I do
SELECT
TO_JSON_STRING(t, TRUE) AS json_example
FROM
Table
then I can see that the dates are in ISO format. Shouldn't the same happen on extract, since I specified the format as NEWLINE_DELIMITED_JSON?
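If the extract format can't be changed, post-processing the exported JSON is an option; a sketch of normalizing one timestamp string with the stdlib (the input format is taken from the output shown above):

```python
from datetime import datetime, timezone

raw = "2019-03-06 20:20:52.588 UTC"  # format that bq extract produced

dt = datetime.strptime(raw.removesuffix(" UTC"), "%Y-%m-%d %H:%M:%S.%f")
dt = dt.replace(tzinfo=timezone.utc)

# Re-emit as RFC 3339 / ISO 8601 with millisecond precision and a Z suffix
iso = dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"
print(iso)  # 2019-03-06T20:20:52.588Z
```

In practice this would run over each line of the exported newline-delimited JSON, rewriting only the timestamp fields.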

Load csv with timestamp column to athena table

I have started using the Athena query engine on top of my S3 files; some of their columns are in a timestamp format.
I have created a simple table with 2 columns
CREATE EXTERNAL TABLE `test`(
`date_x` timestamp,
`clicks` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-athena-query-results-123-us-east-1/test'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1525003090')
I tried to load a file, which looks like this, and query it with Athena:
"2018-08-09 06:00:00.000",12
"2018-08-09 06:00:00.000",42
"2018-08-09 06:00:00.000",22
I have tried different timestamp formats, such as DD/MM/YYYY and YYYY-MM-DD..., and tried setting the time zone for each row, but none of them worked.
Every value I have tried shows up in Athena like this:
date_x clicks
1 12
2 42
3 22
I have tried CSV files with and without headers, and with and without quotation marks, but all of them show a broken timestamp.
The Athena column must be TIMESTAMP, without a time zone.
Please don't suggest a STRING or DATE column; that is not what I need.
What should the CSV file look like so that Athena recognizes the timestamp column?
Try the format yyyy-MM-dd HH:mm:ss.SSSSSS.
The article https://docs.amazonaws.cn/en_us/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html suggests:
"Timestamp values in text files must be in the format yyyy-MM-dd HH:mm:ss.SSSSSS, as the following timestamp value shows: 2017-05-01 11:30:59.000000 . "
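One more thing worth checking: the sample rows above wrap the timestamp in quotation marks, and the plain LazySimpleSerDe that this DDL uses does not strip quotes, which on its own can break parsing. A sketch of producing a line in the form the table can actually read, with six fractional digits and no quotes:

```python
from datetime import datetime

dt = datetime(2018, 8, 9, 6, 0, 0)

# yyyy-MM-dd HH:mm:ss.SSSSSS, with no quotes around the value
line = f"{dt.strftime('%Y-%m-%d %H:%M:%S.%f')},12"
print(line)  # 2018-08-09 06:00:00.000000,12
```

If quoted fields are unavoidable, switching the table to OpenCSVSerde is the usual route, though that SerDe has its own timestamp quirks.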

correct timestamp dtype to cast BigQuery TIMESTAMP

I have a table with a column timestamp in type TIMESTAMP in BigQuery. When I display it on my console, I can see timestamps as follows: 2015-10-19 21:25:35 UTC
I then download my table using the BigQuery API, and when I display the result of the query, I notice that the timestamp has been converted into a very large number like 1.445289935E9.
Therefore, in order to load this table as a pandas.DataFrame, I have to convert it back to a pandas compatible timestamp. How can I do that? In other words, what numpy or pandas dtype shall I use in my pandas.read_csv to load my bigquery timestamp in a numpy/pandas timestamp?
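That large number is just the epoch time in seconds (1.445289935E9 s corresponds to 2015-10-19 21:25:35 UTC, matching the console display). Rather than hunting for a special dtype, one approach is to let the column come in as float64 and convert afterwards; a sketch with a made-up two-column CSV:

```python
import io

import pandas as pd

# Hypothetical CSV export where the TIMESTAMP column came out as epoch seconds
csv = "ts,value\n1.445289935E9,1\n"

df = pd.read_csv(io.StringIO(csv))                       # ts is parsed as float64
df["ts"] = pd.to_datetime(df["ts"], unit="s", utc=True)  # now datetime64[ns, UTC]
print(df["ts"].iloc[0])  # 2015-10-19 21:25:35+00:00
```

pd.to_datetime with unit="s" handles fractional seconds too, so sub-second precision in the export survives the conversion.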