How to handle comma separated decimal values in Hive? - hive

I have a CSV file and its metadata. Columns in this CSV are delimited by the pipe (|) symbol. Sample data is as follows:
name|address|age|salary|doj
xyz | abcdef|29 |567,34|12/02/2001
Here the salary column is of type decimal, but a comma (,) is used as the decimal separator instead of a period (.).
I created a Hive external table as below, and for this data Hive shows NULL for the salary column.
create external table employee (
name string,
address string,
age int,
salary decimal(7,3),
doj string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://bucket/folder_having_many_csv_files/';
If I change the data type of the salary column to string then, as expected, Hive works fine.
I would like to know how to tell Hive that this particular column is of type DECIMAL and that the decimal separator is a comma (,) and not a period (.).

You could build the table with salary as a string and replace the comma in a view on top. This is probably the easiest thing to do, since the data is big and likely someone else owns it.
create view employee_decimal as
select name
, address
, age
, cast(regexp_replace(salary, ',', '.') as decimal(7,3)) as salary
, doj
from employee;
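To see what the view does to each value, the same replace-then-cast step can be sketched outside Hive. This is only an illustration in Python: the helper name parse_comma_decimal is made up, and it assumes the comma is always the decimal separator and never a thousands separator. It returns None where the cast would fail.

```python
from decimal import Decimal, InvalidOperation

def parse_comma_decimal(raw):
    """Swap the decimal comma for a period, then parse; None if unparseable."""
    if raw is None:
        return None
    try:
        return Decimal(raw.replace(",", "."))
    except InvalidOperation:
        # Not a number even after the replacement.
        return None

print(parse_comma_decimal("567,34"))  # 567.34
print(parse_comma_decimal("abc"))     # None
```

If the files can also contain thousands separators, a plain replace like this is not enough and the values need stricter cleaning first.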

Related

Hive columns-newline

Some columns in my Hive table have values that spread over multiple lines because they contain newlines.
For example:
Empid Empname Dept company year month day
1234 ASD Finance qqq null null null
2015 6 3
But when I query the table by year it gives the correct answer:
select year from tbl_name where year='2015'
What could be the reason for these multiline values, and how can I align them into the proper columns?
Whether this can be fixed with SQL depends on how the table is stored.
If the table is based on a text file (STORED AS TEXTFILE, OpenCSVSerDe, JSON...), the SerDe reads rows using newlines as the delimiter, so a column value that contains a newline is split at the lowest level.
If the table is stored in a binary format like ORC, rows are not delimited by newlines. Values with newlines are read without splitting rows, but the newlines split rows on output. The same happens if the storage format is JSON and a value contains the backslash + n combination (\n): such combinations are interpreted as newlines on output. Newlines can be replaced with spaces or empty strings using regexp_replace:
insert overwrite table tbl_name
select
Empid,
Empname,
Dept,
regexp_replace(company, '\\n',' ') company, --replace newline with space
`year`,
`month`,
`day`
from tbl_name ;
Also, if a column contains tabs, it is better to replace them with spaces or remove them, because \t causes columns to shift. Use regexp_replace(col_name, '\\t',' ')
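The effect of the two replacements can be checked on a sample string outside Hive. The Python sketch below is an illustration only: the function name flatten_whitespace is hypothetical, and also replacing carriage returns (\r) is an assumption added here for files with Windows line endings, not something the query above does.

```python
import re

def flatten_whitespace(value):
    """Replace each newline, carriage return or tab with a single space."""
    return re.sub(r"[\n\r\t]", " ", value)

# A company value with an embedded newline and a tab.
print(repr(flatten_whitespace("qqq\nltd\tgroup")))  # 'qqq ltd group'
```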

how to separate columns in hive

I have a file:
id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica
I have to create a Hive table with three columns (id, name and address) using a comma delimiter, but the address column itself contains commas. How can this be handled?
One possible solution is using RegexSerDe:
CREATE TABLE my_table (
id string,
name string,
address string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$')
location 'put location here'
;
Replace the location property with your table location and put the file(s) into that location.
The first group (.*?) matches everything before the first comma, the second group matches everything between the first and second commas, and the third group matches everything after the second comma.
Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip the header and it always exists in the file. If the header can be absent, you can filter header rows using where id !='id'
You can also test the regex for extracting columns without creating a table, like this:
select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
Result:
002|michael|57-D,costa rica
In this example the query returns three groups separated by |. This way you can test your regular expression and check that the groups are defined correctly before creating the table with it.
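The same check can be run outside Hive. The sketch below uses Python's re module, which handles this particular pattern the same way as the Java regex engine Hive uses; the names split_row and lines are made up for illustration. The header row is dropped the same way the where id !='id' filter would drop it.

```python
import re

# Same pattern as the 'input.regex' value in the DDL.
PATTERN = re.compile(r"^(.*?),(.*?),(.*?)$")

def split_row(line):
    """Return (id, name, address), or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

lines = [
    "id,name,address",
    "001,adam,1-A102,mont vert",
    "002,michael,57-D,costa rica",
]

# Keep matching rows and skip the header row.
rows = [r for r in map(split_row, lines) if r is not None and r[0] != "id"]
for row in rows:
    print(row)
# ('001', 'adam', '1-A102,mont vert')
# ('002', 'michael', '57-D,costa rica')
```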
Answering the question in the comment: you can have an address with a comma and one more column without a comma, like this:
select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102, mont vert|sydney
Checking that the comma is optional in the address column:
select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102 mont vert|sydney
Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562
[^,] means "not a comma", so the last column can contain anything except a comma.
And of course, add one more column to the DDL.
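The four-column pattern can be tried outside Hive as well. This is a Python sketch with made-up names (FOUR_COLS, split_four); it shows that the third group absorbs an optional comma in the address, while a row with too few commas does not match at all.

```python
import re

# Three lazy groups plus a last group that may not contain a comma.
FOUR_COLS = re.compile(r"^(.*?),(.*?),(.*?),([^,]*?)$")

def split_four(line):
    """Return the four captured fields, or None if the line does not match."""
    match = FOUR_COLS.match(line)
    return match.groups() if match else None

print(split_four("001,adam,1-A102, mont vert,sydney"))
# ('001', 'adam', '1-A102, mont vert', 'sydney')
print(split_four("001,adam,1-A102 mont vert,sydney"))
# ('001', 'adam', '1-A102 mont vert', 'sydney')
print(split_four("001,adam,sydney"))
# None
```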

Hive table taking decimal value as NULL

I am facing a strange issue. I tried a tab delimiter, both in the file and in the table definition, and a comma as well.
But in both cases it reads the decimal values as NULL. When I define this field as INT it works fine.
Sample data with comma delimited values:
1,22.334
2,445.322
3,999.233
I defined the table as
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by '\t' location '\tmp\data\'
and similarly for the comma-delimited file
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by ',' location '\tmp\data\'
But in both cases it reads the decimal values as NULL. What is the issue?
First, the decimal data type does not accept a comma inside the value.
Second, you have to increase decimal(3,3) to at least decimal(6,3) for the sample data provided (decimal(7,3) leaves some headroom). decimal(3,3) means three digits in total, all of them after the decimal point, so it cannot hold any of the three values.
If your raw values do contain commas, load the data into a table with all columns as the string data type.
Then use a regular expression to remove the commas and load the result into a second-level Hive table with the decimal data type.
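The precision argument can be checked with a few lines of Python. This is a rough sketch, not Hive's own logic: the helper fits_decimal is hypothetical, it only counts digits before the decimal point, and it assumes that extra fractional digits would be rounded rather than rejected.

```python
from decimal import Decimal

def fits_decimal(value, precision, scale):
    """True if the integer part of value fits in DECIMAL(precision, scale)."""
    d = Decimal(value)
    if d == 0:
        return True
    _, digits, exponent = d.as_tuple()
    # Number of digits to the left of the decimal point.
    integer_digits = max(len(digits) + exponent, 0)
    return integer_digits <= precision - scale

for sample in ["22.334", "445.322", "999.233"]:
    print(sample, fits_decimal(sample, 3, 3), fits_decimal(sample, 6, 3))
# 22.334 False True
# 445.322 False True
# 999.233 False True
```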

How to get column names without getting truncated in SQLPLUS

How do I get column names without truncation in SQLPLUS on Unix for a select statement? This might look like a duplicate question, but I have been searching for hours and couldn't find a convenient solution.
So far what I have found is
COLUMN COLUMN_NAME FORMAT SIZE;
Or
SELECT COLUMN1|| ',' || COLUMN2 || ',' || COLUMN3 FROM TABLE;
Both involve hardcoding. Is there any simpler solution without hardcoding?
Sorry for making it hard
Query: select * from Employee;
It has column names as Name,Salary,Age
What I get is:
Name Sala Ag
Steve 1000 30
John 2000 25
What I want is:
Name Salary Age
Steve 1000 30
John 2000 25
Setting size (width) of a column in SQL*Plus output.
SQL> column sex format a5
Seeing the current settings in effect.
SQL> column
Getting further help on usage.
SQL> help column
UPDATE
Setting the format for all columns (in an awkward way). Assume my users table is defined as follows.
create table users(
id number
, username varchar2(20)
, credentials varchar2(90)
, lastname varchar2(20) not null
, firstname varchar2(20)
, emailaddress varchar2(42)
, emailisvalid number(1)
, sex char(1)
, created date default sysdate
);
We could issue this command putting the output into the file login.sql which is automatically executed every time you start SQL*Plus.
SQL> spool login.sql
SQL> select 'column ' || column_name || ' format a' || length(column_name) || ';'
from user_tab_cols where table_name = 'USERS';
column ID format a2;
column USERNAME format a8;
column PASSWORD format a8;
column LASTNAME format a8;
column FIRSTNAME format a9;
column EMAILADDRESS format a12;
column EMAILISVALID format a12;
column SEX format a3;
column CREATED format a7;
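The generated commands simply pair each column name with the length of that name. A small Python sketch of the same string-building step, using a made-up helper column_commands and a hand-written list of names instead of user_tab_cols:

```python
def column_commands(column_names):
    """Build one SQL*Plus COLUMN command per name, as wide as the name itself."""
    return [f"column {name} format a{len(name)};" for name in column_names]

for command in column_commands(["ID", "USERNAME", "FIRSTNAME"]):
    print(command)
# column ID format a2;
# column USERNAME format a8;
# column FIRSTNAME format a9;
```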
This problem:
Name Sala Ag
Steve 1000 30
John 2000 25
Can be solved like this:
SELECT Name,
CAST(Salary AS VARCHAR(6)) AS Salary,
CAST(Age AS VARCHAR(3)) AS Age
FROM Employee;
Note, once again: this is not about what the query returns, but about how the client displays the results. A different client would display them differently. In this case the client formats the columns based on the column data type. When it sees a varchar (like Name) it uses the maximum data size, so if we give it a bigger string type the output looks OK.
This is NOT part of what happens on the server; it comes down to how the client displays results. So if the query is going to be used by a different client (e.g. a web page or an application call), this won't matter when it is actually used by those clients.

SQL : Varchar add space to the end of the text

When I insert a value into a varchar column I get extra trailing spaces. For example, if I have a table Student with name varchar(10) and I do:
INSERT INTO Student (id,name) VALUES (1,'student')
SELECT name FROM Student WHERE id=1
I get student[space][space][space].
How can I fix this without changing the type to text?
Most databases output results from a query in a fixed tabular format. To see where a string really begins and ends, you need to concatenate another character. Here are three ways, depending on the database:
select '"'+name+'"'
select '"'||name||'"'
select concat('"', name, '"')
You'll probably find that the spaces are an artifact of the query tool.