Remove junk characters from Hive tables or from a Unix file - sql

We have tables in Hive like the one below, and we generate flat files from the Hive data. While generating them, we found junk characters within the data, such as the ones below; many columns contain many such characters. Can anyone help us remove these junk characters from the Hive table or from the Unix file?
ÿ,ä,í,ã
The problem: the same data needs to be sent downstream, and when they load it into their DB it shows up as double dollar, but we designed the code with double dollar ($$) as the column delimiter.

Basic concept
hive> select regexp_replace('Hÿelloä íworlãd','[^a-zA-Z ]','');
OK
Hello world
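If only a handful of columns are affected, the same function can simply be applied per column (a minimal sketch; t, s1 and s2 refer to the demo table created below, and the character class should be widened to whatever you need to keep):
select i
      ,regexp_replace(s1,'[^a-zA-Z ]','') as s1
      ,regexp_replace(s2,'[^a-zA-Z ]','') as s2
from t
;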
Demo
Removing undesired characters from the whole table and exporting it to a file.
create table t (i int,s1 string,s2 string);
insert into t values (1,'Hÿelloä','íworlãd'),(2,'ãGããood','Byÿe');
select * from t;
+---+---------+---------+
| i | s1      | s2      |
+---+---------+---------+
| 1 | Hÿelloä | íworlãd |
| 2 | ãGããood | Byÿe    |
+---+---------+---------+
-- the external table reads every full row into a single `rec` column:
-- the field delimiter '0' is simply a character assumed absent from the data
create external table t_ext (rec string)
row format delimited
fields terminated by '0'
location '/user/hive/warehouse/t'
;
-- strip everything except letters, digits, spaces and the \001 field delimiter,
-- then render the delimiter as '<--->' so the columns stay visible
insert overwrite local directory '/tmp/t_ext'
select regexp_replace(regexp_replace(rec,'[^a-zA-Z0-9 \\01]',''),'\\x01','<--->')
from t_ext
;
! ls /tmp/t_ext;
000000_0
! cat /tmp/t_ext/000000_0;
1<--->Hello<--->world
2<--->Good<--->Bye
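Since the downstream expects '$$' as the column delimiter, the same export can emit it directly instead of the '<--->' marker (a sketch; note that '$' is a group reference in a Java replacement string, hence the escaping):
insert overwrite local directory '/tmp/t_ext'
select regexp_replace(regexp_replace(rec,'[^a-zA-Z0-9 \\01]',''),'\\x01','\\$\\$')
from t_ext
;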

This works as long as your tables contain only "primitive" types (no structs, arrays, maps etc.).
I really pushed the envelope here.
Demo
create table t (i int, dt date, str string, ts timestamp, bl boolean);
insert into t select 1,current_date,'Hello world',current_timestamp,true;
select * from t;
+-----+------------+-------------+-------------------------+------+
| t.i | t.dt       | t.str       | t.ts                    | t.bl |
+-----+------------+-------------+-------------------------+------+
| 1   | 2017-03-14 | Hello world | 2017-03-14 14:37:28.889 | true |
+-----+------------+-------------+-------------------------+------+
select regexp_replace
(
    printf(concat('%s',repeat('$$%s',field(unhex(1),*,unhex(1))-2)),*)
   ,'(\\$\\$)|[^a-zA-Z0-9 .-]'
   ,'$1'
)
from t
;
1$$2017-03-14$$Hello world$$2017-03-14 143728.889$$true
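The capture group is what protects the delimiter: '$$' is matched by the first alternative and put back by the '$1' back-reference, while every character outside the keep-list matches the character class and is replaced with the (empty) group. A minimal standalone check:
select regexp_replace('a$$b:c','(\\$\\$)|[^a-zA-Z0-9 .-]','$1'); -- returns a$$bc (':' stripped, '$$' kept)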

Related

Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT
+---------+---------+
| node_1  | node_2  |
+---------+---------+
| abc     | abcd    |
| abcd    | abcde   |
| abcde   | abcdef  |
| bcd     | bcde    |
| bcde    | bcdef   |
| cdef    | cdefg   |
| def     | defg    |
+---------+---------+
EXPECTED OUTPUT
["abc","abcd","abcde","abcdef"]
["bcd","bcde","bcdef"]
["cdef","cdefg"]
["def","defg"]
The logic is that since "abc" is connected to "abcd" in the first row, and "abcd" is connected to "abcde" in the second row, and so on, "abc", "abcd", "abcde" and "abcdef" are all connected and are put in one array. The same applies to the remaining rows. The number of connected neighboring rows is arbitrary.
The question is how to do that using a Hive script, without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag the rows first.
How to do that using a Hive script only?
This is an example of a CONNECT BY query, which is not supported in Hive or Spark, unlike DB2 or Oracle, et al.
You can simulate such a query in Spark Scala, but it is far from handy. Putting a tag in first means the question becomes less relevant, imo.
Here is a work-around using Hive script to get the intermediate table.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
  select split(u.tmp,",")[0] as node_1, split(u.tmp,",")[1] as node_2
  from
  (
    select stack (7,
      "abc,abcd",
      "abcd,abcde",
      "abcde,abcdef",
      "bcd,bcde",
      "bcde,bcdef",
      "cdef,cdefg",
      "def,defg"
    ) as tmp
  ) u
)
select node_1,
       node_2,
       if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag,
       row_number() over (order by node_1) as row_num
from src;
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
  SELECT cast(v.tag as int) as tag
  FROM (
    SELECT split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx - start_idx), ",$", ""), ",") as tags -- repeat each key by the number of rows in its group
    FROM (
      SELECT COALESCE(lag(row_num, 1) over (ORDER BY row_num), 0) as start_idx,
             row_num as end_idx,
             row_number() over (ORDER BY row_num) as key
      FROM step1
      WHERE tag = 0
    ) a
  ) b
  LATERAL VIEW explode(tags) v as tag
) c;
drop table if exists step3;
create table step3 STORED as orc as
SELECT a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
ON a.row_num = b.row_num;
The final table looks like this:
select * from step3;
+---------------+---------------+------------+
| step3.node_1  | step3.node_2  | step3.tag  |
+---------------+---------------+------------+
| abc           | abcd          | 1          |
| abcd          | abcde         | 1          |
| abcde         | abcdef        | 1          |
| bcd           | bcde          | 2          |
| bcde          | bcdef         | 2          |
| cdef          | cdefg         | 3          |
| def           | defg          | 4          |
+---------------+---------------+------------+
The third column can be used to collect node pairs.
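From here, collecting the connected nodes into arrays is one more aggregation (a sketch; sort_array and collect_set are standard Hive UDFs, and sorting recovers the connection order here only because these node names happen to sort that way):
select tag
      ,sort_array(collect_set(node)) as nodes
from
(
  select tag, node_1 as node from step3
  union all
  select tag, node_2 as node from step3
) x
group by tag
;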

Load data to hive array of struct

I have data in CSV looks like
David,"""SMARTPHONE,6""|""COMPUTER,3""|""LAPTOP,1"""
I try to load this into my Hive table:
create table user_device(name string, devices array<struct<devicename: string, number: int>>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'maprfs:///user/david/';
I expected to see
[{"devicename":"SMARTPHONE","number":6},{"devicename":"COMPUTER","number":3},{"devicename":"LAPTOP","number":1}]
But when I query the table, the array of structs comes back as
[{"devicename":"\"\"\"SMARTPHONE","number":null}]
The rest of the array and structs are gone.
Does anyone know how I can achieve this?
Thanks
David
Here is the code I used. In this approach I used Python for cleaning before proceeding to the HQL queries. After some wrangling steps, I have a DataFrame like the one below (saved to a file without index or header) in my local file system, since it's a small file:
import pandas as pd
import numpy as np

    Name  devicename  number
0  David  SMARTPHONE       6
1           COMPUTER       3
2             LAPTOP       1
Then a temp table tempt is created and populated with data from LFS or HDFS:
create table tempt
(
name string,
devicename string,
number int
)
row format delimited
FIELDS TERMINATED BY ',';
load data local inpath '/path_to_file' overwrite into table tempt;
select * from tempt;
+--------------------+--------------------------+----------------------+--+
| tempt.name         | tempt.devicename         | tempt.number         |
+--------------------+--------------------------+----------------------+--+
| David              | SMARTPHONE               | 6                    |
|                    | COMPUTER                 | 3                    |
|                    | LAPTOP                   | 1                    |
+--------------------+--------------------------+----------------------+--+
And now:
Insert overwrite table user_device
select name,
array(named_struct("devicename",devicename,"number",number)) from tempt;
select * from user_device;
and the output now contains the structs (note: one single-element array per source row, rather than one three-element array):
+------------------+------------------------------------------+--+
| user_device.name | user_device.devices                      |
+------------------+------------------------------------------+--+
| David            | [{"devicename":"SMARTPHONE","number":6}] |
|                  | [{"devicename":"COMPUTER","number":3}]   |
|                  | [{"devicename":"LAPTOP","number":1}]     |
+------------------+------------------------------------------+--+
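If you want the single three-struct array originally expected, fill the name down during the Python wrangling and aggregate instead (a sketch; assumes every tempt row carries the name, and Hive 1.2+, where collect_list accepts struct values):
Insert overwrite table user_device
select name
      ,collect_list(named_struct("devicename",devicename,"number",number))
from tempt
group by name
;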
Cheers!

How to concatenate all tables' columns to a delimited format, without specifying each column separately

E.g.
Given a table with only primitive types, e.g. -
create table t (i int, dt date, str string, ts timestamp, bl boolean)
;
insert into t
select 1,date '2017-03-14','Hello world',timestamp '2017-03-14 14:37:28.889',true
;
select * from t
;
+-----+------------+-------------+-------------------------+------+
| t.i | t.dt       | t.str       | t.ts                    | t.bl |
+-----+------------+-------------+-------------------------+------+
| 1   | 2017-03-14 | Hello world | 2017-03-14 14:37:28.889 | true |
+-----+------------+-------------+-------------------------+------+
... and ||| as the requested delimiter
(for simplicity we can assume it does not appear anywhere in the data)
The requested result would be a single delimited string
1|||2017-03-14|||Hello world|||2017-03-14 14:37:28.889|||true
This solution is limited to tables that contain "primitive" types only (no structs, arrays, maps etc.).
select printf(concat('%s',repeat('|||%s',field(unhex(1),*,unhex(1))-2)),*)
from t
;
1|||2017-03-14|||Hello world|||2017-03-14 14:37:28.889|||true
P.S.
concat(*) works, but the values are not separated by a delimiter.
concat_ws(...,*) yields an exception.
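The counting trick in the printf call: field() returns the 1-based position of its first argument among the remaining arguments, and unhex(1) (the \x01 character) is assumed to equal none of the column values, so with n columns expanded by * it is found at position n+1, and field(...)-2 = n-1 is the number of '|||%s' repeats needed after the leading '%s'. A standalone check:
select field(unhex(1),'a','b',unhex(1))
;
3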

How to use Regex in SQL for extracting values after repetitive numbers

I have the following table (table1):
+---+---------------------------------------+
|   | att1                                  |
+---+---------------------------------------+
| 1 | 10.2.5.4 4.3.2.1.in-addr.arpa         |
| 2 | asd 100.99.98.97 97.3.2.1.a.b.c fsdf  |
| 3 | fd 95.94.93.92 92.5.7.1.a.b.c         |
| 4 | a 11.4.99.75 75.77.52.41.in-addr.arpa |
+---+---------------------------------------+
I would like to get the following values (that are located after the repetitive numbers): in-addr.arpa, a.b.c, a.b.c, in-addr.arpa.
I tried to use the following format with no success:
SELECT att1
FROM table1
WHERE REGEXP_LIKE(att1 , '^(\d+?)\1$')
I would like it to run in Impala and Oracle.
Use REGEXP_SUBSTR (assuming you are using an Oracle DB).
select regexp_substr(att1,'[0-9]\.([^0-9 ]+)',1,1,null,1)
from table1
[0-9]\. matches a digit followed by a literal dot.
([^0-9 ]+) matches any characters other than digits or spaces, up to the next digit or space. The parentheses mark the (first) group, and only that part of the string is extracted.
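For Impala, which has no regexp_substr, the same pattern and group can be used with regexp_extract (a sketch; note the doubled backslash required in an Impala string literal):
select regexp_extract(att1,'[0-9]\\.([^0-9 ]+)',1)
from table1;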

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
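The same implicit join reads more clearly with aliases; item is the name Impala gives to the scalar element of an unnested array of a primitive type (a sketch on the same table):
select m.id, a.item
from id_member_of_parquet m, m.member_of a;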