Epoch with milliseconds to timestamp with milliseconds conversion in Hive

How can I convert a Unix epoch with milliseconds to a timestamp with milliseconds in Hive?
Neither the cast() nor the from_unixtime() function is working to get a timestamp with milliseconds.
I tried .SSS, but from_unixtime() just inflates the year rather than treating the trailing digits as milliseconds:
scala> spark.sql("select from_unixtime(1598632101000, 'yyyy-MM-dd hh:mm:ss.SSS')").show(false)
+-----------------------------------------------------+
|from_unixtime(1598632101000, yyyy-MM-dd hh:mm:ss.SSS)|
+-----------------------------------------------------+
|52628-08-20 02:00:00.000                             |
+-----------------------------------------------------+

I think you can just cast():
select cast(1598632101000 / 1000.0 as timestamp)
Note that this produces a timestamp datatype rather than the string that from_unixtime() returns.
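A quick check of that expression (a sketch; the rendered value assumes a UTC session, since Hive displays timestamps in the local time zone):
select cast(1598632101123 / 1000.0 as timestamp);
-- 2020-08-28 16:28:21.123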

from_unixtime() works with seconds, not milliseconds. Convert the seconds part to a timestamp with from_unixtime(ts div 1000), concatenate a '.' plus the milliseconds (mod(ts, 1000)), and cast the result as timestamp. Tested in Hive:
with your_data as (
select stack(2,1598632101123, 1598632101000) as ts
)
select cast(concat(from_unixtime(ts div 1000),'.',mod(ts,1000)) as timestamp)
from your_data;
Result:
2020-08-28 16:28:21.123
2020-08-28 16:28:21.0
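One caveat with this approach (my addition, not from the original answer): mod(ts, 1000) drops leading zeros, so 7 milliseconds would come out as '.7' and be read back as 700 ms. Left-padding the fraction to three digits with lpad() avoids that:
select cast(concat(from_unixtime(ts div 1000), '.', lpad(mod(ts, 1000), 3, '0')) as timestamp)
from (select stack(2, 1598632101123, 1598632101007) as ts) t;
-- 2020-08-28 16:28:21.123
-- 2020-08-28 16:28:21.007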

Here's another way in pure Spark Scala, using a UDF to wrap the Java constructor new Timestamp(ms):
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for toDF and $ (already in scope in spark-shell)
val fromMilli = udf((ms: Long) => new Timestamp(ms))
// Test
val df = Seq(1598632101123L).toDF("ts")
df.select(fromMilli($"ts")).show(false)
Result:
+-----------------------+
|UDF(ts)                |
+-----------------------+
|2020-08-28 16:28:21.123|
+-----------------------+
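If your Spark is 3.1 or newer (an assumption about your environment), the built-in timestamp_millis() does the same conversion without a UDF:
spark.sql("select timestamp_millis(1598632101123)").show(false)
This should print 2020-08-28 16:28:21.123, rendered in the session time zone.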

Related

Apply a function to a column in PySpark dataframe

I'm new to Spark, PySpark to be specific. I have a dataframe that looks like this:
col_1 | col_2 | col_3
apple | red | 2016-01-28 00:56:55
banana | yellow | 2011-01-14 10:26:33.231
I have a function convert() that turns a datetime string like 2016-01-28 00:56:55 (which may or may not have milliseconds) into a float representing Unix time, like 1453971415. What's the PySpark way of applying this function to col_3, so that all the timestamps in col_3 become Unix times?
You can use unix_timestamp to parse the string into a timestamp and then cast("long") to get the Unix time.
If all your timestamps end with milliseconds, you can use the "yyyy-MM-dd HH:mm:ss.SSS" format directly for the conversion:
from pyspark.sql.functions import *
df.withColumn('col_3', from_unixtime(unix_timestamp('col_3', 'yyyy-MM-dd HH:mm:ss.SSS'))\
.cast("timestamp")).withColumn('col_3', col('col_3').cast("long")).show()
But if you have a mix of timestamps with and without milliseconds, you can use substring to truncate them all to the "yyyy-MM-dd HH:mm:ss" format:
df.withColumn('col_3', from_unixtime(unix_timestamp(substring(col("col_3"),0,19), 'yyyy-MM-dd HH:mm:ss'))\
.cast("timestamp")).withColumn('col_3', col('col_3').cast("long")).show()
+------+------+----------+
| col_1| col_2| col_3|
+------+------+----------+
| apple| red|1453960615|
|banana|yellow|1295018793|
+------+------+----------+
Milliseconds can be stripped off because they don't affect the Unix timestamp.
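As a side note (my simplification, not part of the original answer): unix_timestamp already returns a long, so the round trip through from_unixtime and the extra casts can be skipped when the Unix time is all you need. The same idea in Spark SQL, with my_table as a placeholder name:
select unix_timestamp(substring(col_3, 1, 19), 'yyyy-MM-dd HH:mm:ss') as col_3
from my_table;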

Difference between unix_timestamp and casting to timestamp

I have a Hive table with two numeric-string fields (T1 and T2) that I need to convert to the timestamp format "YYYY-MM-DD hh:mm:ss.SSS" and then find the difference between the two.
I have tried two methods:
Method 1: Through CAST
Select CAST(regexp_replace(substring(t1, 1,17),'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{3})','$1-$2-$3 $4:$5:$6.$7') as timestamp),
       CAST(regexp_replace(substring(t2, 1,17),'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{3})','$1-$2-$3 $4:$5:$6.$7') as timestamp),
       CAST(regexp_replace(substring(t1, 1,17),'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{3})','$1-$2-$3 $4:$5:$6.$7') as timestamp)
     - CAST(regexp_replace(substring(t2, 1,17),'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{3})','$1-$2-$3 $4:$5:$6.$7') as timestamp) as time_diff
from tab1
And I get one set of output.
Method 2: Through unix_timestamp
Select from_unixtime(unix_timestamp(substring(t1,1,17),'yyyyMMddhhmmssSSS'),'yyyy-MM-dd hh:mm:ss.SSS'),
       from_unixtime(unix_timestamp(substring(t2,1,17),'yyyyMMddhhmmssSSS'),'yyyy-MM-dd hh:mm:ss.SSS'),
       from_unixtime(unix_timestamp(substring(t1,1,17),'yyyyMMddhhmmssSSS'),'yyyy-MM-dd hh:mm:ss.SSS')
     - from_unixtime(unix_timestamp(substring(t2,1,17),'yyyyMMddhhmmssSSS'),'yyyy-MM-dd hh:mm:ss.SSS') as time_diff
from tab1;
And I get a different set of output.
I don't understand why the two outputs differ.
unix_timestamp() gives you epoch time, i.e. the time in seconds since the Unix epoch (1970-01-01 00:00:00 UTC),
whereas the timestamp provides a date and time of the form YYYY-MM-DD HH:MI:SS.
Hence an accurate way is to convert each string to unix_timestamp(), subtract, and then convert back using from_unixtime(),
e.g.:
select from_unixtime(unix_timestamp('2020-04-12 01:30:02.000') - unix_timestamp('2020-04-12 01:29:43.000'))
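This renders the 19-second gap as a time of day (1970-01-01 00:00:19 in a UTC session). If you just want the difference in seconds, subtracting the unix_timestamp() values directly is enough (a sketch reusing the same literals; the milliseconds are lost because unix_timestamp() is second-based):
select unix_timestamp('2020-04-12 01:30:02.000') - unix_timestamp('2020-04-12 01:29:43.000') as diff_seconds;
-- 19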
Method 2 ultimately boils down to something like this:
select ('2020-04-12 01:30:02.000' - '2020-04-12 01:29:43.000') as time_diff;
You cannot subtract date strings like this; you have to use datediff().
Note that in Hive, datediff() returns a value greater than zero only if the dates fall on different days; otherwise you get zero.
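To illustrate that last point (a small sketch, not part of the original answer):
-- datediff(enddate, startdate) compares only the date parts
select datediff('2020-04-13', '2020-04-12');                   -- 1
select datediff('2020-04-12 23:59:59', '2020-04-12 00:00:01'); -- 0: same day, so the ~24 h gap is invisible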

How do I convert gps time to date in BigQuery

Currently I am using
SELECT TIMESTAMP_SECONDS(CAST(my_time_unix_ns/1000 AS int64)) AS my_date,...
But some of the columns store time in GPS nanoseconds. How do I convert those into a date?
There is no support for nanosecond precision in BigQuery. BigQuery's CURRENT_TIMESTAMP() returns only up to microseconds (Example 1), and the CAST() function supports only up to microsecond precision (Examples 2, 3, and 4). For more context on timestamp precision, please refer to the supported range of BigQuery timestamps [1], which is 0001-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999.
On the other hand, I assume the Unix-time value you are using may be an integer larger than INT64 can hold; please refer to the numeric data type documentation [2].
Example-1:
SELECT CURRENT_TIMESTAMP() AS Current_Time
Result: 2019-12-24 17:51:44.419542 UTC
Example-2:
SELECT CAST('2019-12-24 00:00:00.000000' AS TIMESTAMP)
Result: 2019-12-24 00:00:00 UTC
Example-3:
SELECT CAST('2019-12-24 11:12:47.145482+00' AS TIMESTAMP)
Result: 2019-12-24 11:12:47.145482 UTC
Example-4:
SELECT CAST('2019-12-24 11:12:47.14548200' AS TIMESTAMP)
Result: error: "Could not cast literal "2019-12-24 11:12:47.14548200" to type TIMESTAMP "
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp-type
[2] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#integer-type
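That said, if microsecond precision is enough, one workaround is to divide the nanosecond value down and use TIMESTAMP_MICROS. The sketch below is mine, not the answerer's: my_time_gps_ns is a placeholder column name, and the GPS-to-Unix shift assumes the current 18-second leap-second offset, which changes whenever a new leap second is announced:
SELECT TIMESTAMP_MICROS(
  DIV(my_time_gps_ns, 1000)   -- nanoseconds -> microseconds (integer division)
  + 315964800 * 1000000       -- shift from the GPS epoch (1980-01-06) to the Unix epoch
  - 18 * 1000000              -- GPS runs ahead of UTC by 18 leap seconds (as of writing)
) AS my_date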

How to convert "2019-11-02T20:18:00Z" to timestamp in HQL?

I have datetime string "2019-11-02T20:18:00Z". How can I convert it into timestamp in Hive HQL?
try this:
select from_unixtime(unix_timestamp("2019-11-02T20:18:00Z", "yyyy-MM-dd'T'HH:mm:ss"))
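Result (rendered in the session time zone; the trailing Z is simply ignored by this pattern):
2019-11-02 20:18:00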
If you want to preserve milliseconds, then remove the Z, replace the T with a space, and convert to timestamp:
select timestamp(regexp_replace("2019-11-02T20:18:00Z", '^(.+?)T(.+?)Z$','$1 $2'));
Result:
2019-11-02 20:18:00
It also works with milliseconds:
select timestamp(regexp_replace("2019-11-02T20:18:00.123Z", '^(.+?)T(.+?)Z$','$1 $2'));
Result:
2019-11-02 20:18:00.123
The from_unixtime(unix_timestamp()) solution does not work with milliseconds.
Demo:
select from_unixtime(unix_timestamp("2019-11-02T20:18:00.123Z", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"));
Result:
2019-11-02 20:18:00
Milliseconds are lost. The reason is that unix_timestamp() returns the seconds elapsed since the Unix epoch (1970-01-01 00:00:00 UTC).

Hive date cast chopping off milliseconds

The date cast below is not displaying milliseconds.
select from_unixtime(unix_timestamp("2017-07-31 23:48:25.957" , "yyyy-MM-dd HH:mm:ss.SSS"));
2017-07-31 23:48:25
What is the way to get milliseconds?
Thanks.
Since this string is in ISO format, the cast can be done directly:
hive> select cast("2017-07-31 23:48:25.957" as timestamp);
OK
2017-07-31 23:48:25.957
or
hive> select timestamp("2017-07-31 23:48:25.957");
OK
2017-07-31 23:48:25.957
Because unix_timestamp() is based on seconds, it truncates milliseconds.
Instead, you can transform the string using date_format(), which preserves milliseconds, and then apply from_utc_timestamp():
select from_utc_timestamp(date_format("2017-07-31 23:48:25.957",'yyyy-MM-dd HH:mm:ss.SSS'),'UTC') as datetime
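For completeness (my run-through, assuming a UTC session so the from_utc_timestamp() shift is a no-op), this should return:
2017-07-31 23:48:25.957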