Hive SQL - Test for \u0000 (ascii 00) without `chr()` - hive

I have a dataset with some corrupted data: a string column has some values containing \u0000. I need to filter all of them out, and the only thing I have at my disposal is the WHERE clause.
I tried WHERE field NOT LIKE concat('%', chr(00), '%'), but my Hive distribution (AWS EMR) doesn't recognize chr(). Is there another way to write the WHERE clause so that it filters out values containing \u0000, without using chr()?

You could try the following:
SELECT '\u0000' AS text;
+-------+--+
| text |
+-------+--+
| |
+-------+--+
-- NOT EMPTY
SELECT '\u0000abc' AS text;
+-------+--+
| text |
+-------+--+
| abc |
+-------+--+
-- NOT EMPTY
So:
SELECT text
FROM(SELECT '\u0000abc' AS text) AS t
WHERE text NOT LIKE('\u0000%');
+-------+--+
| text |
+-------+--+
+-------+--+
-- EMPTY
SELECT text
FROM(SELECT '\u0000abc' AS text) AS t
WHERE text LIKE('\u0000%');
+-------+--+
| text |
+-------+--+
| abc |
+-------+--+
-- NOT EMPTY

Try out the following:
WHERE field NOT LIKE '%\000%'
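
Note that the pattern '\u0000%' in the first answer only matches values that start with the character, whereas the question asks about values that contain it anywhere, which needs a wildcard on both sides as in '%\000%'. A minimal Python sketch of the difference between the two predicates (the helper names and sample rows are made up for illustration; this only mirrors the LIKE semantics and is not Hive code):

```python
NUL = "\u0000"

def starts_with_nul(value):
    # Same idea as: value LIKE '\u0000%'
    return value.startswith(NUL)

def contains_nul(value):
    # Same idea as: value LIKE '%\u0000%'
    return NUL in value

# Hypothetical sample rows: NUL at the start, NUL in the middle, clean.
rows = ["\u0000abc", "ab\u0000c", "abc"]

# Filtering on the prefix test alone lets the middle-NUL row through.
kept_by_prefix_test = [r for r in rows if not starts_with_nul(r)]

# Filtering on the contains test keeps only the clean row.
kept_by_contains_test = [r for r in rows if not contains_nul(r)]
```
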

Related

Updating the left side of a string up to a delimiter

My column "ColumnOne" in my table "MyTable" has values like this (the delimiter is the '-' character):
|Something |
|Something - SomeOtherThing |
|Something - SomethingElse |
|Something - Whatever |
|OtherThing - |
I want to update the values so that they eventually look like this:
|Something |
| SomeOtherThing |
| SomethingElse |
| Whatever |
| |
So basically the algorithm is: replace each character with a space until you reach the '-', then replace the '-' with a space as well.
I tried the REPLACE function, like this:
UPDATE MyTable SET ColumnOne = REPLACE(ColumnOne, ' - ', ' ' + ColumnOne) but that's wrong. I couldn't figure out what to pass as its second argument.
Any suggestions are appreciated.
Use charindex to find the number of characters to change, stuff to perform the change, and replicate to generate a string of N spaces. Try this:
stuff(ColumnOne,1,charindex('-',ColumnOne),replicate(' ',charindex('-',ColumnOne)))
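
To make the logic of that expression easier to follow, here is a rough Python sketch of the same steps. The function name blank_through_delimiter is hypothetical, and this only emulates what charindex, replicate and stuff do in the expression above; it is not T-SQL.

```python
def blank_through_delimiter(value, delimiter="-"):
    # charindex: 1-based position of the first delimiter, 0 when absent.
    pos = value.find(delimiter) + 1
    # replicate + stuff: overwrite the first `pos` characters with `pos` spaces.
    # When pos is 0 the value is returned unchanged.
    return " " * pos + value[pos:]

padded = blank_through_delimiter("Something - SomeOtherThing")
untouched = blank_through_delimiter("Something")
all_blank = blank_through_delimiter("OtherThing - ")
```

The string length is preserved, so the remaining text stays in the same column position as in the desired output.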

hive regex find pattern and return it in select statement

I would like to extract the 3 words before "selay dervice", but the query returns an empty column:
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\s+){3}selay dervice',0) from a
Update 1:
As Raid pointed out earlier, in Hive we cannot write \s in the string literal and have to write \\s instead. I updated the regex above accordingly and it works:
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\\s+){3}selay dervice',0) from a
Try the following:
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'(?:[a-zA-Z0-9]+ ){3}selay dervice',0) from a
Note that if there are fewer than 3 words before selay dervice you will get an empty result.
I tested a similar query in the latest Apache Hive and got something like this:
+----------------------------------+-----------------------------+
| key | regex_ext |
+----------------------------------+-----------------------------+
| rlk1 selay dervice | |
| selay dervice k4 | |
| k5 selay dervice ew | |
| thre word b4 selay dervice | thre word b4 selay dervice |
| four word be four selay dervice | word be four selay dervice |
+----------------------------------+-----------------------------+
Edit 1:
The result does not vary with or without ?:
All 3 versions below give the same result.
'(?:[a-zA-Z0-9]+ )'
'([a-zA-Z0-9]+ )'
'([a-zA-Z0-9]+\\s)'
As per the docs, \s matches any whitespace character, not just a space.
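
For readers who want to try the pattern outside Hive, here is a small Python sketch of the same extraction. Hive's regex functions use Java regular expressions; the constructs used here behave the same way in Python's re module. The function name three_words_before is made up, and returning an empty string on no match is chosen to mirror the empty cells in the result table above.

```python
import re

# Three "word + whitespace" groups followed by the literal phrase.
# In a Python raw string a single backslash is enough for \s.
PATTERN = re.compile(r"(?:[a-zA-Z0-9]+\s+){3}selay dervice")

def three_words_before(text):
    match = PATTERN.search(text.lower())
    # Group 0 is the whole match, like index 0 in regexp_extract.
    return match.group(0) if match else ""

exactly_three = three_words_before("thre word b4 selay dervice")
more_than_three = three_words_before("four word be four selay dervice")
too_few = three_words_before("k5 selay dervice ew")
```
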

Display substring separated by / in Hive

I have a column in my table with entries like:
this/is/my/dir/file
this/is/my/another/dir/file
I want to display the string without the filename:
this/is/my/dir/
This is the query which I am using:
select regexp_replace('this/is/my/another/dir/file','[^/]+','');
You can use regexp_replace to remove the file name and keep only the directory path. The file name never contains the character '/' and always sits at the end of the path, so the regexp can be written as '[^/]+$'. The examples below replace the substring matching '[^/]+$' with an empty string ''.
select regexp_replace('/this/is/my/dir/file','[^/]+$','') as dir;
+-------------------+
| dir |
+-------------------+
| /this/is/my/dir/ |
+-------------------+
select regexp_replace('this/is/my/another/dir/file','[^/]+$','') as dir;
+--------------------------+
| dir |
+--------------------------+
| this/is/my/another/dir/ |
+--------------------------+
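
The effect of the $ anchor can also be checked outside Hive. The following Python sketch (function name strip_file_name is hypothetical) compares the anchored pattern from the answer with the unanchored pattern from the question, which removes every path segment and leaves only the slashes.

```python
import re

def strip_file_name(path):
    # '[^/]+$' matches the last run of non-slash characters, i.e. the file name.
    return re.sub(r"[^/]+$", "", path)

with_anchor = strip_file_name("this/is/my/another/dir/file")

# Without the '$' anchor every segment matches and is removed.
without_anchor = re.sub(r"[^/]+", "", "this/is/my/another/dir/file")
```

A path that already ends in '/' has no trailing file name to match, so it comes back unchanged.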

PostgreSQL and PHP. Fetching from query adds space to char string

I have a table with a few fields. One field is char[128], and I store the string 'hello' in it.
In PHP I call arr = pg_fetch_array(pg_query('select * from table')), but when I read the value of this column I get 'hello '. When I execute 'select char_length(this_field) from table' using pgAdmin, I get 5, not 6. Do you know why there is extra space in the PHP value?
Using VARCHAR instead of CHAR solves this problem.
Padding to the declared length is documented:
https://www.postgresql.org/docs/current/static/datatype-character.html
character(n), char(n) fixed-length, blank padded
Example:
t=# with c(t) as (values('abc'::char(3)),('a'::char(3)))
select t,concat(t,'.') from c;
  t  | concat
-----+--------
 abc | abc.
 a   | a  .
(2 rows)
Regarding length:
t=# with c(t) as (values('abc'::char(3)),('a'::char(3)))
select t,concat(t,'.'),octet_length(t),char_length(t) from c;
  t  | concat | octet_length | char_length
-----+--------+--------------+-------------
 abc | abc.   |            3 |           3
 a   | a  .   |            3 |           1
(2 rows)
Using character varying or text does indeed change this behaviour.
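
As a rough illustration of why the client sees the extra spaces while char_length does not, the Python sketch below emulates a blank-padded char(n) value. The helper as_char_n is hypothetical and only models the padding. Trimming trailing spaces on the client side is shown as one possible workaround, under the assumption that the stored values never end in meaningful spaces.

```python
def as_char_n(value, n):
    # Emulates char(n) storage: the value is padded with blanks up to n characters.
    return value.ljust(n)

fetched = as_char_n("hello", 128)   # what a client receives for a char(128) column
trimmed = fetched.rstrip(" ")       # padding removed on the client side

fetched_length = len(fetched)
trimmed_length = len(trimmed)
```
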

Replacing first occurrence of character in a string using HiveQL

I am trying to replace the first occurrence of '-' in a string in Hive table. I am using HiveQL. I searched this topic here and other websites, but could not find clear explanation how to use metacharacters with regexp_replace() to do that.
This is a string from which I need to remove the first '-' (replace it with an empty string): 16-001-02707
The result should be like this: 16001-02707
This is the method I used:
select regexp_replace ('16-001-02707','[^[:digit:]]', '');
However, this doesn't do anything.
select regexp_replace ('16-001-02707','^(.*?)-', '$1');
16001-02707
Following up on the OP's question in the comments:
with t as (select '111-22-333333-4-555-6-7-8888-999999' as col)
select regexp_replace (col,'^(.*?)-','$1')
,regexp_replace (col,'^(.*?-.*?)-','$1')
,regexp_replace (col,'^((.*?-){2}.*?)-','$1')
,regexp_replace (col,'^((.*?-){3}.*?)-','$1')
,regexp_replace (col,'^((.*?-){4}.*?)-','$1')
,regexp_replace (col,'^((.*?-){5}.*?)-','$1')
from t
+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+
| _c0 | _c1 | _c2 | _c3 | _c4 | _c5 |
+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+
| 11122-333333-4-555-6-7-8888-999999 | 111-22333333-4-555-6-7-8888-999999 | 111-22-3333334-555-6-7-8888-999999 | 111-22-333333-4555-6-7-8888-999999 | 111-22-333333-4-5556-7-8888-999999 | 111-22-333333-4-555-67-8888-999999 |
+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+------------------------------------+
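
The family of patterns above generalises to "remove the n-th occurrence". The Python sketch below builds the same kind of pattern for an arbitrary n. The function name remove_nth is made up for illustration, and Python writes the backreference as \1 where Hive (Java regex) uses $1.

```python
import re

def remove_nth(text, n, delimiter="-"):
    d = re.escape(delimiter)
    # Lazily capture everything up to the n-th delimiter, then drop that delimiter.
    # For n = 3 and '-' this yields ^((?:.*?-){2}.*?)-
    pattern = rf"^((?:.*?{d}){{{n - 1}}}.*?){d}"
    return re.sub(pattern, r"\1", text, count=1)

col = "111-22-333333-4-555-6-7-8888-999999"
first_removed = remove_nth(col, 1)
third_removed = remove_nth(col, 3)
```

If the string has fewer than n delimiters, the pattern does not match and the string is returned unchanged.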