Hive Date functions not properly handling the dates - hive

I have a daily job that loads data based on a date I derive with Hive date functions. It ran fine until two days ago; the problem started on 12/30/2019. When I format the date with date_format, the year comes out as 2020, whereas current_date on its own shows 2019. See below.
hive> select current_date;
OK
2019-12-31
Time taken: 0.182 seconds, Fetched: 1 row(s)
hive> select date_format(current_date,'dd-MMM-YYYY');
OK
31-Dec-2020
Time taken: 0.429 seconds, Fetched: 1 row(s)
hive> select cast(date_format(date_sub(CURRENT_DATE,1),'YYYYMMdd') AS string);
OK
20201230
Has anyone else faced this issue?

It looks like you have hit a classic mistake:
A common mistake is to use YYYY. yyyy specifies the calendar year
whereas YYYY specifies the year (of “Week of Year”), used in the ISO
year-week calendar. In most cases, yyyy and YYYY yield the same
number, however they may be different. Typically you should use the
calendar year.
Change your code as below (lower case yyyy) to get correct results:
hive> select date_format(current_date,'dd-MMM-yyyy');
OK
31-Dec-2019
select cast(date_format(date_sub(CURRENT_DATE,1),'yyyyMMdd') AS string);
OK
20191230
Make sure you change CURRENT_DATE to '2019-12-31' for testing purposes.
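The gap between the calendar year and the week-based year can be reproduced outside Hive. The following is a minimal Python sketch, not Hive code. It uses isocalendar(), which returns the ISO week-based year, as a stand-in for YYYY; Hive's YYYY follows Java's week-year rules, which can depend on locale, so the analogy is an assumption. The helper name calendar_and_week_year is made up for illustration.

```python
from datetime import date


def calendar_and_week_year(d):
    """Return (calendar year, ISO week-based year) for a date."""
    # d.year corresponds to 'yyyy'; the ISO week-based year is the
    # closest Python analogue of 'YYYY' (assumption, see lead-in).
    return d.year, d.isocalendar()[0]


if __name__ == "__main__":
    for d in (date(2019, 12, 29), date(2019, 12, 30), date(2019, 12, 31)):
        cal_year, week_year = calendar_and_week_year(d)
        print(d.isoformat(), "calendar year:", cal_year, "week year:", week_year)
```

Only the few days around New Year differ, which is why a job like this can run correctly for most of the year before the mismatch shows up.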

Related

Calculate difference between start_time and end_time in seconds from unix_time yyyy-MM-dd HH:mm:ss

I'm still learning SQL. I found a couple of solutions for SQL Server and Postgres, but they don't seem to work in Hue.
DATEDIFF only lets me calculate the difference in days;
seconds and minutes are not available. Help is very welcome.
I was able to split the timestamp with substring_index, but I can't find the right way to subtract start_time from end_time and get the exact number of seconds. I can't find any time functions, so I assume I should calculate it from the timestamp, which I obtain as
from_unixtime(unix_timestamp(start_time, "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"), 'yyyy-MM-dd HH:mm:ss')
substring_index(start_time, 'T', -1)s_tm,
substring_index(end_time, 'T', -1)e_tm
start_date 2018-06-19 13:59:41
end_date 2018-06-19 14:01:17
desired output
01:36
Solution for Hive.
Difference in seconds:
select UNIX_TIMESTAMP('2018-06-19T14:01:17.000000',"yyyy-MM-dd'T'HH:mm:ss.SSSSSS")-
UNIX_TIMESTAMP('2018-06-19T13:59:41.000000',"yyyy-MM-dd'T'HH:mm:ss.SSSSSS") as seconds_diff
Result:
96
Now calculate difference in HH:mm:ss:
select concat_ws(':',lpad(floor(seconds_diff/3600),2,'0'), --HH
lpad(floor(seconds_diff%3600/60),2,'0'), --mm
lpad(floor(seconds_diff%3600%60),2,'0') --ss
)
from
(
select --calculate seconds difference
UNIX_TIMESTAMP('2018-06-19T14:01:17.000000',"yyyy-MM-dd'T'HH:mm:ss.SSSSSS")-
UNIX_TIMESTAMP('2018-06-19T13:59:41.000000',"yyyy-MM-dd'T'HH:mm:ss.SSSSSS") as seconds_diff
) s
Result:
OK
00:01:36
Time taken: 1.071 seconds, Fetched: 1 row(s)
See also this answer about format conversion: https://stackoverflow.com/a/23520257/2700344
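The same arithmetic (subtract the two instants, then split the seconds into hours, minutes and seconds) can be checked with a small Python sketch. It is only an illustration of the calculation, not a replacement for the Hive query; the function names seconds_between and as_hh_mm_ss are hypothetical, and the sketch assumes the end time is not before the start time.

```python
from datetime import datetime

# Same layout as the Hive pattern yyyy-MM-dd'T'HH:mm:ss.SSSSSS
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def seconds_between(start, end):
    """Whole seconds elapsed from start to end (both given as strings)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def as_hh_mm_ss(seconds):
    """Format a non-negative number of seconds as HH:mm:ss."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    diff = seconds_between("2018-06-19T13:59:41.000000",
                           "2018-06-19T14:01:17.000000")
    print(diff, as_hh_mm_ss(diff))
```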

Convert 17 digit with decimal point timestamp in SQL to date

Trying to convert 43439.961377314816 into date. Currently I am using this code:
SELECT
(timestamp '1970-01-01 00:00:00 GMT' +
numtodsinterval(WRITETIMESTAMP, 'SECOND')) at time zone 'CST',
WRITETIMESTAMP
FROM
t.table
but I am getting this result:
01-JAN-70 06.03.59.961377315 AM CST
Date should be:
12/05/2018
This produces the date that you want:
select date '1899-12-30' + 43439.961377314816
from dual;
It looks like you are using Excel dates or something similar.
You have two problems in your query. First, you used the wrong base time. As pointed out by @GordonLinoff, the base date to use is 1899-12-30 rather than 1970-01-01. Excel numbers its days from 1900-01-01 (serial 1) and treats 1900 as a leap year, so for any date after February 1900 the serial number is effectively a count of days since 1899-12-30. This is not an error in Excel, per se, but a conscious design decision which was made to copy the (buggy) behavior of Lotus 1-2-3, which did have this bug. So - in Lotus 1-2-3 it's a bug, but in Excel it's a feature. :-)
Second, in Excel dates the integer portion represents the number of days since the base date, and the fractional portion represents the fraction of a day. In your NUMTODSINTERVAL call, however, you specified the interval_unit argument as 'SECOND'; it should have been 'DAY'.
Putting these things together we get
WITH cte AS (SELECT 43439.961377314816 AS WRITETIMESTAMP FROM DUAL)
SELECT
(timestamp '1899-12-30 00:00:00 GMT' + numtodsinterval(WRITETIMESTAMP, 'DAY')) at time zone 'CST',
WRITETIMESTAMP
FROM
cte
dbfiddle here
Best of luck.
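To see both points side by side, here is a minimal Python sketch of the two interpretations of the same number. It is an illustration only, not Oracle code; the function names are hypothetical, and time zones are deliberately left out (the values are naive timestamps).

```python
from datetime import datetime, timedelta

# Effective day zero for Excel serial dates after February 1900.
EXCEL_EPOCH = datetime(1899, 12, 30)
UNIX_EPOCH = datetime(1970, 1, 1)


def excel_serial_to_datetime(serial):
    """Interpret the value as days (with a fractional day) since 1899-12-30."""
    return EXCEL_EPOCH + timedelta(days=serial)


def misread_as_unix_seconds(value):
    """Reproduce the original mistake: treat the value as seconds since 1970."""
    return UNIX_EPOCH + timedelta(seconds=value)


if __name__ == "__main__":
    raw = 43439.961377314816
    print(excel_serial_to_datetime(raw))  # a timestamp on 2018-12-05
    print(misread_as_unix_seconds(raw))   # a timestamp on 1970-01-01
```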
This looks like expected behavior to me. 43439 seconds / 60 / 60 is roughly 12 hours, and you're getting about 12 hours from the starting timestamp.
SELECT numtodsinterval('43439.961377314816', 'SECOND') as i FROM dual;
I
----------------------
+00 12:03:59.961377315
Why would you think that would give you a date in 2018?
Here is a working Excel formula for timestamps from Chromium-based browsers, which count microseconds since 1601-01-01 (Cell is the cell holding the raw value):
Chrome/Edge: =((Cell/1000000-11644473600)*1000000)/86400000000+DATE(1970,1,1)
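The constant 11644473600 in that formula is the number of seconds between 1601-01-01 and 1970-01-01. A minimal Python sketch of the same conversion, assuming the input is a Chromium-style count of microseconds since 1601-01-01 (the function name is hypothetical):

```python
from datetime import datetime, timedelta

# Chromium-style timestamps count microseconds from this date.
WEBKIT_EPOCH = datetime(1601, 1, 1)
# Seconds between 1601-01-01 and 1970-01-01, as used in the Excel formula.
SECONDS_1601_TO_1970 = 11644473600


def chromium_timestamp_to_datetime(microseconds):
    """Convert microseconds since 1601-01-01 to a naive datetime."""
    return WEBKIT_EPOCH + timedelta(microseconds=microseconds)


if __name__ == "__main__":
    # The offset itself, expressed in microseconds, lands on the Unix epoch.
    print(chromium_timestamp_to_datetime(SECONDS_1601_TO_1970 * 1_000_000))
```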

Calculate time difference between two columns of string type in hive without changing the data type string

I am trying to calculate the time difference between two columns of a row which are of string data type. If the time difference between them is less than 2 hours then select the first column of that row else if the time difference is greater than 2 hours then select the second column of that row. It can be done by converting the columns to datetime format, but I want the result to be in string only. How can I do that? The data looks like this:
col1(string type)
2018-07-16 02:23:00
2018-07-26 12:26:00
2018-07-26 15:32:00
col2(string type)
2018-07-16 02:36:00
2018-07-26 14:29:00
2018-07-27 15:38:00
I don't think you need to convert the columns to a datetime type, since your data is already in a sortable layout (yyyy-MM-dd HH:mm:ss). You can strip everything but the digits to get one number per value (yyyyMMddHHmmss) and then test whether the difference is below 20000, which stands for 2 hours because the hour digits are followed by mmss. Be aware that the digit difference only matches elapsed time when both values fall on the same calendar day; across midnight it overstates the gap. Looking at your example (and assuming col2 > col1), this query would work:
SELECT case when regexp_replace(col2,'[^0-9]', '')-regexp_replace(col1,'[^0-9]', '') < 20000 then col1 else col2 end as col3 from your_table;
Use unix_timestamp() to convert string timestamp to seconds.
The difference in hours will be:
hive> select (unix_timestamp('2018-07-16 02:23:00')- unix_timestamp('2018-07-16 02:36:00'))/60/60;
OK
-0.21666666666666667
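The selection rule from the question (keep col1 when the gap is under two hours, otherwise col2, with both staying strings) can be sketched in Python as follows. This only illustrates the logic; pick_column is a hypothetical name, the values are parsed as naive timestamps with no time zone handling, and a gap of exactly two hours is treated as "not less than", which the question leaves open.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
TWO_HOURS = timedelta(hours=2)


def pick_column(col1, col2):
    """Return col1 if col2 is less than two hours after col1, else col2.

    Both inputs and the result are plain strings; parsing is only used
    for the comparison.
    """
    gap = datetime.strptime(col2, FMT) - datetime.strptime(col1, FMT)
    return col1 if gap < TWO_HOURS else col2


if __name__ == "__main__":
    rows = [
        ("2018-07-16 02:23:00", "2018-07-16 02:36:00"),
        ("2018-07-26 12:26:00", "2018-07-26 14:29:00"),
        ("2018-07-26 15:32:00", "2018-07-27 15:38:00"),
    ]
    for col1, col2 in rows:
        print(pick_column(col1, col2))
```

Because the comparison uses elapsed time, a pair such as 23:30:00 and 00:30:00 on the following day is correctly seen as one hour apart.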
Important update: the unix_timestamp() approach works correctly only if the time zone is configured as UTC. In time zones that observe DST, Hive shifts the time during timestamp operations in some edge cases. Consider this example for the PDT time zone:
hive> select hour('2018-03-11 02:00:00');
OK
3
Note the hour is 3, not 2. This is because 2018-03-11 02:00:00 does not exist in the PDT time zone: at exactly 02:00:00 the clock is moved forward to 03:00:00.
The same happens when converting to unix_timestamp. For PDT time zone unix_timestamp('2018-03-11 03:00:00') and unix_timestamp('2018-03-11 02:00:00') will return the same timestamp:
hive> select unix_timestamp('2018-03-11 03:00:00');
OK
1520762400
hive> select unix_timestamp('2018-03-11 02:00:00');
OK
1520762400
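Why both calls return the same number can be shown with a small Python sketch. It hard-codes the two UTC offsets (-08:00 before the switch, -07:00 after it) instead of looking them up in a time zone database, which is a simplifying assumption; the names PST, PDT and the variables are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets in force just before and just after the 2018-03-11 switch.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")

# 02:00 read with the old offset and 03:00 read with the new offset
# describe the same instant, so they map to the same Unix timestamp.
two_am_old_offset = datetime(2018, 3, 11, 2, 0, tzinfo=PST)
three_am_new_offset = datetime(2018, 3, 11, 3, 0, tzinfo=PDT)

if __name__ == "__main__":
    print(int(two_am_old_offset.timestamp()))
    print(int(three_am_new_offset.timestamp()))
```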
A few links for your reference:
https://community.hortonworks.com/questions/82511/change-default-timezone-for-hive.html
http://boristyukin.com/watch-out-for-timezones-with-sqoop-hive-impala-and-spark-2/
Also have a look at this jira please: Hive should carry out timestamp computations in UTC

date_trunc in hive is working incorrectly

I am running below query:
select a.event_date,
date_format(date_trunc('month', a.event_date), '%m/%d/%Y') as date
from monthly_test_table a
order by 1;
Output:
2017-09-15 | 09/01/2017
2017-10-01 | 09/30/2017
2017-11-01 | 11/01/2017
Can anyone tell me why, for the date "2017-10-01", date_trunc gives me "09/30/2017"?
Thanks in advance!
Your format string is reversed, which is why the result looks wrong.
Use the code below:
select a.event_date,
date_format(date_trunc('month', a.event_date), '%Y/%m/%d') as date
from monthly_test_table a
order by 1;
You can replicate the truncation with date_add by adding 1 - day(your_date) days to the date.
For example:
for 2017-10-01, day('2017-10-01') is 1, so you add 1-1=0 days;
for 2017-08-30, day('2017-08-30') is 30, so you add 1-30=-29 days.
I faced the same issue recently and resorted to this logic:
date_add(from_unixtime(unix_timestamp(event_date,'yyyy-MM-dd'),'yyyy-MM-dd'),
1-day(from_unixtime(unix_timestamp(event_date,'yyyy-MM-dd'),'yyyy-MM-dd'))
)
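The arithmetic behind this workaround is easy to verify with a minimal Python sketch (an illustration of the idea, not Hive code; first_of_month is a hypothetical name):

```python
from datetime import date, timedelta


def first_of_month(d):
    """Truncate a date to the first day of its month by adding 1 - day(d) days."""
    return d + timedelta(days=1 - d.day)


if __name__ == "__main__":
    for d in (date(2017, 10, 1), date(2017, 8, 30)):
        print(d.isoformat(), "->", first_of_month(d).isoformat())
```

Since the calculation works on plain dates and never touches a time zone, it cannot slip back to the previous day the way the 09/30/2017 result did.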
PS: As far as I know, there is no date_trunc function in the Hive documentation.
According to the source code below, the UTC_CHRONOLOGY time is translated with respect to the locale, and the Description annotation states that truncation happens in the session time zone. See also the URL below.
@Description("truncate to the specified precision in the session timezone")
@ScalarFunction("date_trunc")
@LiteralParameters("x")
@SqlType(StandardTypes.DATE)
public static long truncateDate(ConnectorSession session, @SqlType("varchar(x)") Slice unit, @SqlType(StandardTypes.DATE) long date)
{
    long millis = getDateField(UTC_CHRONOLOGY, unit).roundFloor(DAYS.toMillis(date));
    return MILLISECONDS.toDays(millis);
}
See https://prestodb.io/docs/current/release/release-0.66.html:
Time Zones:
This release has full support for time zone rules, which are needed to perform date/time calculations correctly. Typically, the session time zone is used for temporal calculations. This is the time zone of the client computer that submits the query, if available. Otherwise, it is the time zone of the server running the Presto coordinator.
Queries that operate with time zones that follow daylight saving can
produce unexpected results. For example, if we run the following query
to add 24 hours in the America/Los_Angeles time zone:
SELECT date_add('hour', 24, TIMESTAMP '2014-03-08 09:00:00');
Output: 2014-03-09 10:00:00.000
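The one-hour shift in that output follows from adding 24 elapsed hours across the spring clock change. A minimal Python sketch of the same calculation, using hard-coded offsets (-08:00 before the change on 2014-03-09, -07:00 after it) rather than a time zone database; the function name and the PST/PDT constants are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def add_elapsed_hours(start, hours, display_tz):
    """Add elapsed hours to an aware datetime and show the result in display_tz."""
    return (start + timedelta(hours=hours)).astimezone(display_tz)


if __name__ == "__main__":
    start = datetime(2014, 3, 8, 9, 0, tzinfo=PST)
    # 24 elapsed hours later the wall clock in Los Angeles has moved to PDT.
    print(add_elapsed_hours(start, 24, PDT))
```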

Oracle timestamp, timezone and utc

I have an application that uses an Oracle 11g (11.2.0.2.0 64 bit) database.
I have a lot of entries in a Person table. I access the data through several different applications (same data).
In this example I'm using the birth_time field of my Person table.
Some applications query birth_time directly, some use to_char to reformat it, and others use a UTC function.
The problem is this: with the same data and the same query, the results are different.
In this screenshot you can see the result in Oracle SQL Developer (3.2.20.09).
All the timestamps were inserted as midnight, and indeed the to_char(..) and birth_time results are at midnight. The UTC values come back one hour earlier (correct for my time zone!), but some entries (one example here, the last one) are TWO HOURS earlier, and a few in a thousand are three!
The same query executed in SQL*Plus returns the correct result, with a one-hour difference for all entries.
Does anyone have a suggestion for how to approach this problem?
The issue came up because one of our applications, built with Adobe Flex, seems to execute queries in UTC time, and the problem appears when you look at the data through this component.
ps.:
"BIRTH_TIME" is TIMESTAMP (6)
Would it be possible for you to change the query used? If so, you could use the AT TIME ZONE expression to tell Oracle that this date is in UTC time zone:
SELECT SYS_EXTRACT_UTC(CAST(TRUNC(SYSDATE) AS TIMESTAMP)) AS val FROM dual;
Output:
VAL
----------------------------
13/11/20 23:00:00,000000000
Now, using AT TIME ZONE 'UTC' gets you the date you need:
SELECT SYS_EXTRACT_UTC(
CAST(
TRUNC(SYSDATE) AS TIMESTAMP)
AT TIME ZONE 'UTC') AS val FROM dual;
Output:
VAL
----------------------------
13/11/21 00:00:00,000000000
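The effect of the two queries can be mimicked with a short Python sketch. It assumes a session offset of UTC+01:00 (the question mentions a one-hour difference) and reads the output 13/11/21 as 2013-11-21; both are assumptions, and extract_utc is a hypothetical helper, not an Oracle function.

```python
from datetime import datetime, timedelta, timezone

# Assumed session time zone: one hour ahead of UTC.
SESSION_TZ = timezone(timedelta(hours=1))


def extract_utc(naive, assumed_zone):
    """Treat a zone-less timestamp as being in assumed_zone, then express it in UTC."""
    return naive.replace(tzinfo=assumed_zone).astimezone(timezone.utc).replace(tzinfo=None)


if __name__ == "__main__":
    midnight = datetime(2013, 11, 21)
    # Without AT TIME ZONE: the value is read in the session zone and shifts.
    print(extract_utc(midnight, SESSION_TZ))
    # With AT TIME ZONE 'UTC': the value is declared to be UTC and stays put.
    print(extract_utc(midnight, timezone.utc))
```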