Issue with Hive Data types - sql

We have 3 columns from the source: colA has 3 digits, colB has 5 digits, and colC has 5 digits.
We need to create a 13-digit unique ID based on these 3 columns.
Query used: select colA*1000000000000 + colB*100000 + colC
Example -
hive> select 123*1000000000000 + 12345*100000 + 12345;
OK
123001234512345 -- Not Expected
Time taken: 0.091 seconds, Fetched: 1 row(s)
On checking further, the Hive queries below do not give the correct results.
hive> !hive --version;
Hive 2.3.3-mapr-1904-r9
Git git://738a1fde0d37/root/opensource/mapr-hive-2.3/dl/mapr-hive-2.3 -r 265b539b942d0b9f4811b15880204dec5c0c7e1b
Compiled by root on Tue Aug 6 05:36:17 PDT 2019
From source with checksum 88f44b7532ffd7141c15cb5742e9cb51
hive> select cast(12345*1000000 as bigint);
OK
-539901888
Time taken: 0.126 seconds, Fetched: 1 row(s)
hive> select cast(12345*10000000 as bigint);
OK
-1104051584
Time taken: 0.02 seconds, Fetched: 1 row(s)
hive> select cast(12345*100000000 as bigint);
OK
1844386048
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive> select cast(12345*1000000000 as bigint);
OK
1263991296
Time taken: 0.032 seconds, Fetched: 1 row(s)
Whereas the queries below work:
hive> select cast(12345*10000000000 as bigint);
OK
123450000000000
Time taken: 0.017 seconds, Fetched: 1 row(s)
hive> select cast(12345*1000 as bigint);
OK
12345000
Time taken: 0.025 seconds, Fetched: 1 row(s)
hive> select cast(12345*10000 as bigint);
OK
123450000
Time taken: 0.035 seconds, Fetched: 1 row(s)
hive> select cast(12345*100000 as bigint);
OK
1234500000
Time taken: 0.247 seconds, Fetched: 1 row(s)

As the documentation explains:
Integral literals are assumed to be INT by default, unless the number exceeds the range of INT in which case it is interpreted as a BIGINT, or if one of the following postfixes is present on the number.
In this expression:
cast(12345*1000000 as bigint)
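Here the cast is applied to the result of 12345*1000000. That does not mean the multiplication is done using that type: both literals are INT, so the product is computed as an INT and overflows before the cast. To avoid that, you need to cast before multiplying: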
12345 * cast(1000000 as bigint)
Or, you can use the suffixes:
12345L * 1000000L
Note that no explicit cast() is required because the values are already bigint.
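The wrapped values in the question can be reproduced outside Hive. The following Python sketch is only an illustration: wrap_int32 and make_id are hypothetical helpers, and the sketch assumes that Hive multiplies two INT operands in signed 32-bit arithmetic, which is consistent with the results shown in the question.

```python
def wrap_int32(value: int) -> int:
    """Emulate signed 32-bit overflow (two's complement wrap-around)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def make_id(col_a: int, col_b: int, col_c: int) -> int:
    """Build a 13-digit ID from a 3-digit, a 5-digit and a 5-digit value."""
    # Python integers do not overflow, so this mirrors BIGINT arithmetic.
    return col_a * 10**10 + col_b * 10**5 + col_c


# Both operands fit in INT, so the product is computed as INT and wraps.
for exponent in (6, 7, 8, 9):
    print(f"12345 * 10^{exponent} ->", wrap_int32(12345 * 10**exponent))

# 10^10 does not fit in INT, so the literal is already a BIGINT: no wrap.
print("12345 * 10^10 ->", 12345 * 10**10)

print("id ->", make_id(123, 12345, 12345))
```

Note that a 13-digit ID needs a multiplier of 10^10 for colA. The 10^12 multiplier in the question's query produces a 15-digit value even when nothing overflows, which is why the first result was not the expected one.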

Related

How can I print milliseconds in Hive without ignoring zeros

I am trying to print 'YYYY-MM-d HH:mm:ss.S', which has exactly 3 millisecond digits at the end.
This is what I get normally.
hive> select current_timestamp();
OK
2020-09-22 12:00:26.658
But in edge cases I also get
hive> select current_timestamp();
OK
2020-09-22 12:00:25.5
Time taken: 0.065 seconds, Fetched: 1 row(s)
hive> select cast(current_timestamp() as timestamp);
OK
2020-09-22 12:00:00.09
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive> select current_timestamp() as string;
OK
2020-09-22 11:07:12.27
Time taken: 0.076 seconds, Fetched: 1 row(s)
What I am expecting is for the trailing 0's not to be dropped, like:
hive> select current_timestamp();
OK
2020-09-22 12:00:25.500
Time taken: 0.065 seconds, Fetched: 1 row(s)
hive> select cast(current_timestamp() as timestamp);
OK
2020-09-22 12:00:00.090
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive> select current_timestamp();
OK
2020-09-22 11:07:12.270
Time taken: 0.076 seconds, Fetched: 1 row(s)
What I tried:
hive> select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:MM:ss.S');
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
2020-09-22 11:09:30.0
Time taken: 0.064 seconds, Fetched: 1 row(s)
I also tried converting current_timestamp() to a string so that it won't drop the trailing 0's, but that doesn't work either.
Try rpad(string str, int len, string pad).
Doc:
Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.
Does it work if you use date_format()?
select date_format(current_timestamp, 'yyyy-MM-dd HH:mm:ss.SSS')
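To see what the rpad suggestion would do to the values from the question, here is a small Python sketch. The rpad function below is a hypothetical emulation written from the documentation text quoted above, not Hive's implementation, and it assumes the timestamp string always contains the decimal point (a value with no fractional part at all would be padded incorrectly).

```python
def rpad(s: str, length: int, pad: str):
    """Emulate rpad as described in the quoted documentation."""
    if pad == "":
        return None  # empty pad string: the result is NULL
    if len(s) >= length:
        return s[:length]  # longer input is shortened to `length`
    fill = pad * (length - len(s))
    return (s + fill)[:length]


# Target width: the length of 'yyyy-MM-dd HH:mm:ss.SSS'.
TARGET_LEN = len("yyyy-MM-dd HH:mm:ss.SSS")

samples = (
    "2020-09-22 12:00:26.658",
    "2020-09-22 12:00:25.5",
    "2020-09-22 12:00:00.09",
)
for ts in samples:
    print(rpad(ts, TARGET_LEN, "0"))
```

In Hive the corresponding call would be along the lines of rpad(cast(current_timestamp() as string), 23, '0'); treat that as an untested suggestion.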

conversion from string to timestamp is not working

The data in the table is as below.
The data type of the column jobdate is string.
jobdate
1536945012211.kc
1536945014231.kc
1536945312809.kc
I want to convert it to a timestamp in the format 2018-12-205 06:15:10.505
I have tried the following queries, but they return NULL.
select jobdate,from_unixtime(unix_timestamp(substr(jobdate,1,14),'YYYY-MM-DD HH:mm:ss.SSS')) from job_log;
select jobdate,from_unixtime(unix_timestamp(jobdate,'YYYY-MM-DD HH:mm:ss.SSS')) from job_log;
select jobdate,cast(date_format(jobdate,'YYYY-MM-DD HH:mm:ss.SSS') as timestamp) from job_log;
Please help me.
Thanks in advance
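The original values are epoch timestamps in milliseconds (13 digits), which is too long for from_unixtime because it expects seconds. Use the first 10 digits. Also note that DD in the pattern means day of year; use dd for day of month: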
hive> select from_unixtime(cast(substr('1536945012211.kc',1,10) as int),'yyyy-MM-DD HH:mm:ss.SSS');
OK
2018-09-257 10:10:12.000
Time taken: 0.832 seconds, Fetched: 1 row(s)
hive> select from_unixtime(cast(substr('1536945012211.kc',1,10) as int),'yyyy-MM-dd HH:mm:ss.SSS');
OK
2018-09-14 10:10:12.000
Time taken: 0.061 seconds, Fetched: 1 row(s)
hive>
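Taking only the first 10 digits drops the milliseconds, which is why the output above ends in .000. As a sketch of how the millisecond part could be kept, the Python below splits the 13-digit value into seconds and milliseconds. parse_jobdate is a hypothetical helper, and it formats in UTC, whereas the Hive session above evidently used a local time zone.

```python
from datetime import datetime, timezone


def parse_jobdate(jobdate: str) -> str:
    """Convert a value like '1536945012211.kc' to 'yyyy-MM-dd HH:mm:ss.SSS' (UTC)."""
    digits = jobdate.split(".", 1)[0]            # drop the '.kc' suffix
    seconds, millis = divmod(int(digits), 1000)  # epoch millis -> (seconds, millis)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{base:%Y-%m-%d %H:%M:%S}.{millis:03d}"


for value in ("1536945012211.kc", "1536945014231.kc", "1536945312809.kc"):
    print(value, "->", parse_jobdate(value))
```

Integer division is used instead of dividing by 1000.0 so that the millisecond digits are not affected by floating-point rounding.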

How to set decimal values in hive stack command

I am trying to execute the Hive stack command below:
select stack(2,'A',10.1, '2015-01-01','B',20.123, '2016-01-01');
But it gives an error because of inconsistent decimal precisions. The error message is below:
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'stack(2, 'A', 10.1BD, '2015-01-01', 'B', 20.123BD, '2016-01-01')' due to data type mismatch: Argument 2 (decimal(3,1)) != Argument 5 (decimal(5,3)); line 1 pos 7;
'Project [unresolvedalias(stack(2, A, 10.1, 2015-01-01, B, 20.123, 2016-01-01), None)]
+- OneRowRelation (state=,code=0)
Cast explicitly to double, or to decimal with the required precision and scale:
hive> select stack(2,'A',cast(10.1 as double), '2015-01-01','B',cast(20.123 as double), '2016-01-01');
OK
A 10.1 2015-01-01
B 20.123 2016-01-01
Time taken: 2.818 seconds, Fetched: 2 row(s)
hive> select stack(2,'A',cast(10.1 as decimal(5,3)), '2015-01-01','B',cast(20.123 as decimal(5,3)), '2016-01-01');
OK
A 10.1 2015-01-01
B 20.123 2016-01-01
Time taken: 0.066 seconds, Fetched: 2 row(s)
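The error message reports decimal(3,1) for 10.1 and decimal(5,3) for 20.123. The Python sketch below shows where those numbers come from and why decimal(5,3) is wide enough for both values. decimal_type and common_type are hypothetical helpers for illustration only; they are not how Spark or Hive infer types.

```python
from decimal import Decimal


def decimal_type(literal: str):
    """Return (precision, scale) of a decimal literal such as '10.1'."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    scale = max(-exponent, 0)            # digits after the decimal point
    precision = max(len(digits), scale)  # total number of digits
    return precision, scale


def common_type(*literals: str):
    """Smallest (precision, scale) that can hold every literal."""
    types = [decimal_type(lit) for lit in literals]
    scale = max(s for _, s in types)
    int_digits = max(p - s for p, s in types)
    return int_digits + scale, scale


print(decimal_type("10.1"))
print(decimal_type("20.123"))
print(common_type("10.1", "20.123"))
```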

How to trim leading zero in Hive

How do I trim leading zeros in Hive? I searched a lot on Google but didn't find anything that solves my problem.
If the value is "00000012300234", I want a result like "12300234".
You can achieve it by using the regexp_replace string function:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
The following removes leading zeroes, but leaves one if necessary (i.e. it wouldn't just turn "0" to a blank string).
hive> SELECT regexp_replace( "00000012300234","^0+(?!$)","") ;
OK
12300234
Time taken: 0.156 seconds, Fetched: 1 row(s)
hive> SELECT regexp_replace( "000000","^0+(?!$)","") ;
OK
0
Time taken: 0.157 seconds, Fetched: 1 row(s)
hive> SELECT regexp_replace( "0","^0+(?!$)","") ;
OK
0
Time taken: 0.12 seconds, Fetched: 1 row(s)
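The same pattern can be tried outside Hive. This Python sketch uses the standard re module, whose negative lookahead behaves the same way for this pattern; strip_leading_zeros is a hypothetical helper name.

```python
import re

# ^0+    one or more zeros at the start of the string
# (?!$)  ...but the match must not end at the end of the string,
#        so an all-zero input keeps its last "0"
LEADING_ZEROS = re.compile(r"^0+(?!$)")


def strip_leading_zeros(value: str) -> str:
    """Remove leading zeros but keep a single '0' for all-zero input."""
    return LEADING_ZEROS.sub("", value)


for sample in ("00000012300234", "000000", "0"):
    print(repr(sample), "->", repr(strip_leading_zeros(sample)))
```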
Or use CAST: cast to int, then to string:
hive> SELECT CAST(CAST( "00000012300234" AS INT) as string);
OK
12300234
Time taken: 0.115 seconds, Fetched: 1 row(s)
hive> SELECT CAST( "00000012300234" AS INT);
OK
12300234
Time taken: 0.379 seconds, Fetched: 1 row(s)
hive>
There is nothing more to do than cast the string to INT:
SELECT CAST( "00000012300234" AS INT);
It will return 12300234.
SELECT CAST( "00000012300234" AS INT) FROM <your_table>;
-- The SQL above works. But if the number goes above the INT range, you need BIGINT instead of INT; otherwise you will see NULLs.
SELECT CAST( "00000012300234" AS BIGINT) FROM <your_table>;
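To make the range limit concrete, the sketch below emulates the behaviour described in this answer, where a value outside the target type's range comes back as NULL. cast_integral is a hypothetical helper, None stands in for NULL, and the NULL-on-overflow behaviour is taken from the answer rather than verified here.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
BIGINT_MIN, BIGINT_MAX = -2**63, 2**63 - 1


def cast_integral(text: str, lo: int, hi: int):
    """Return the integer value, or None (NULL) if it is invalid or out of range."""
    try:
        value = int(text)  # int() accepts leading zeros in a string
    except ValueError:
        return None
    return value if lo <= value <= hi else None


print(cast_integral("00000012300234", INT_MIN, INT_MAX))
print(cast_integral("0000012345678901", INT_MIN, INT_MAX))
print(cast_integral("0000012345678901", BIGINT_MIN, BIGINT_MAX))
```

The second call uses a made-up 11-digit value to show the case where INT is too small and BIGINT is needed.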

Get current unix_timestamp in Hive

According to the post "How to select current date in Hive SQL", unix_timestamp can be used to get the current date in Hive.
But I tried
select unix_timestamp();
and just,
unix_timestamp();
both give the error messages
FAILED: ParseException line 1:23 mismatched input '<EOF>' expecting FROM near ')' in from clause
FAILED: ParseException line 1:0 cannot recognize input near 'unix_timestamp' '(' ')'
respectively.
How can I use unix_timestamp properly in Hive?
UPDATED!
https://issues.apache.org/jira/browse/HIVE-178 has resolved this issue.
If you use 0.13 (released on 21 April 2014) or above, you can run
-- unix_timestamp() is deprecated
select current_timestamp();
select 1+1;
without from <table>.
As Hive doesn't expose a dual table, you may want to create a single-row table and use that table for this kind of query.
You'll then be able to execute queries like
select unix_timestamp() from hive_dual;
A workaround is to use any existing table, with a LIMIT 1 or a TABLESAMPLE clause, but, depending on the size of your table, it will be less efficient.
-- any_existing_table contains 10 lines
-- hive_dual contains 1 line
select unix_timestamp() from any_existing_table LIMIT 1;
-- Time taken: 17.492 seconds, Fetched: 1 row(s)
select unix_timestamp() from any_existing_table TABLESAMPLE(1 ROWS);
-- Time taken: 15.273 seconds, Fetched: 1 row(s)
select unix_timestamp() from hive_dual;
-- Time taken: 16.144 seconds, Fetched: 1 row(s)
select unix_timestamp() from hive_dual LIMIT 1;
-- Time taken: 14.086 seconds, Fetched: 1 row(s)
select unix_timestamp() from hive_dual TABLESAMPLE(1 ROWS);
-- Time taken: 16.148 seconds, Fetched: 1 row(s)
Update
There is no need to pass a table name or a LIMIT clause. Hive now supports select unix_timestamp() directly.
More details :
Does Hive have something equivalent to DUAL?
BLOG POST : dual table in hive
To get the date out of a timestamp, use the to_date function.
Try the query below:
select to_date(FROM_UNIXTIME(UNIX_TIMESTAMP())) as time from table_name;
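The chain in that query has three steps: epoch seconds, then a formatted timestamp string, then the date part. As a rough illustration, the Python sketch below mirrors each step with a hypothetical function of the same name. It formats in UTC, whereas Hive would use the session's time zone, so results near midnight can differ.

```python
import time
from datetime import datetime, timezone


def unix_timestamp() -> int:
    """Current time as whole seconds since the epoch."""
    return int(time.time())


def from_unixtime(seconds: int) -> str:
    """Epoch seconds -> 'yyyy-MM-dd HH:mm:ss' (UTC in this sketch)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def to_date(timestamp: str) -> str:
    """Keep only the 'yyyy-MM-dd' part of a timestamp string."""
    return timestamp[:10]


print(to_date(from_unixtime(unix_timestamp())))
```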