Some columns in my Hive table contain multi-line values, so a single row is printed across several lines.
For example:
Empid  Empname  Dept     company  year  month  day
1234   ASD      Finance  qqq      null  null  null
2015   6        3
But when I query the table by year, it returns the correct answer:
select year from tbl_name where year='2015'
What could be the reason for these multi-line values, and how can I align them into the proper columns?
Whether this can be fixed with SQL depends on how the table is stored.
If the table is based on text files (STORED AS TEXTFILE, OpenCSVSerDe, JSON...), the SerDe reads rows using the newline as the row delimiter, so a column value that contains a newline is split into separate rows at the lowest level.
If the table storage is a binary format like ORC, rows are not delimited by newlines. Values containing newlines are read without splitting rows, but the newlines still split the rows on output. The same happens if the storage format is JSON and a value contains the combination backslash + n (\n): such combinations are interpreted as newlines on output. In these cases you can replace newlines with spaces or empty strings using regexp_replace:
insert overwrite table tbl_name
select
Empid,
Empname,
Dept,
regexp_replace(company, '\\n',' ') company, --replace newline with space
`year`,
`month`,
`day`
from tbl_name ;
If a column contains TABs, it is also better to replace them with spaces or remove them, because \t shifts the columns. Use regexp_replace(col_name, '\\t',' ')
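The effect can be reproduced outside Hive. The sketch below is plain Python, not Hive code: it assumes a tab-delimited text dump in which the company value ends with a newline, and shows that a line-based reader sees two broken rows until the newline is replaced. The helper name clean and the sample values are hypothetical.

```python
import re

def clean(value: str) -> str:
    """Replace each newline or tab inside a field with a single space."""
    return re.sub(r"[\n\t]", " ", value)

# One logical row whose 'company' field holds an embedded newline.
row = ["1234", "ASD", "Finance", "qqq\n", "2015", "6", "3"]

# Written as tab-delimited text, the embedded newline splits the row in two.
raw = "\t".join(row)
broken = [line.split("\t") for line in raw.split("\n")]

# After cleaning each field, the row stays on a single line.
fixed_raw = "\t".join(clean(v) for v in row)
fixed = [line.split("\t") for line in fixed_raw.split("\n")]

print(len(broken), "lines before cleaning")
print(len(fixed), "line after cleaning")
```

This only illustrates why the output looks misaligned; the actual fix in Hive is the regexp_replace rewrite above.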
I have a file:
id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica
I have to create a Hive table with three columns: id, name and address, using a comma as the delimiter, but the address column itself contains a comma. How can this be handled?
One possible solution is using RegexSerDe:
CREATE TABLE my_table (
id string,
name string,
address string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$')
location 'put location here'
;
Replace the location property with your table location and put the file(s) into that location.
The first group (.*?) matches everything before the first comma, the second group matches everything between the first and second commas, and the third group matches everything after the second comma.
Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip the header and it always exists in the file. If the header can be absent, you can filter header rows using where id !='id'
You can also test the regex for extracting columns without creating a table, like this:
select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
Result:
002|michael|57-D,costa rica
In this example the query returns three groups separated by |. This way you can test your regular expression and check that the groups are defined correctly before creating the table with it.
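The same check can be run without a Hive session. This is a minimal sketch in Python rather than Hive; the pattern uses only syntax shared by Java and Python regular expressions, and the function name split_three is hypothetical.

```python
import re

# Same pattern as in the SERDEPROPERTIES above: two lazy groups up to the
# first two commas, then a third group that takes the rest of the line.
PATTERN = re.compile(r"^(.*?),(.*?),(.*?)$")

def split_three(line: str):
    """Return the three captured columns, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

for line in ["001,adam,1-A102,mont vert", "002,michael,57-D,costa rica"]:
    print(split_three(line))
```

A line with fewer than two commas does not match at all and yields None here, so it is worth testing such lines before relying on the pattern.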
Answering the question in the comment: you can have an address with a comma and one more column without a comma, like this:
select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102, mont vert|sydney
Checking that the comma is optional in the address column:
hive> select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102 mont vert|sydney
Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562
[^,] means "not a comma": the last column can contain anything except a comma.
And of course, add one more column to the DDL.
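To see how the four-group pattern assigns commas, it can be exercised on a few inputs. The sketch below is Python, not Hive, and the function name split_four is hypothetical. Because the last group excludes commas, every extra comma ends up in the address group.

```python
import re

# Three lazy groups followed by a final group that may not contain a comma.
PATTERN = re.compile(r"^(.*?),(.*?),(.*?),([^,]*?)$")

def split_four(line: str):
    """Return (id, name, address, city) or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

print(split_four("001,adam,1-A102, mont vert,sydney"))  # comma inside the address
print(split_four("001,adam,1-A102 mont vert,sydney"))   # no comma inside the address
```

Note that a comma in the last column cannot be recognised as such: the text before it is absorbed into the address group.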
I am facing a strange issue. I tried a tab delimiter and a comma delimiter, in both the file and the table definition.
In both cases Hive reads the decimal values as NULL, but when I define these fields as INT it works fine.
Sample data with comma delimited values:
1,22.334
2,445.322
3,999.233
I defined the table as:
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by '\t' location '\tmp\data\'
and similarly for the comma-delimited file:
create table x(ID INT,SAL DECIMAL(3,3)) row format delimited fields terminated by ',' location '\tmp\data\'
But in both cases the decimal values are read as NULL. What is the issue?
First, the DECIMAL data type does not accept a comma inside the value.
Second, DECIMAL(3,3) has to be widened: with precision 3 and scale 3 there are no digits left before the decimal point, so it cannot hold any of the three sample values. The sample data needs at least DECIMAL(6,3).
If your raw data contains commas inside the values, load it into a table with all columns defined as string, then use a regular expression to remove the commas and load the result into a second-level Hive table with the decimal data type.
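The precision and scale arithmetic can be checked with a short sketch. This is Python, not Hive, and fits_decimal is a hypothetical helper. It assumes that a value with more integer digits than precision minus scale cannot be stored, and it ignores rounding of extra fractional digits.

```python
from decimal import Decimal

def fits_decimal(value: str, precision: int, scale: int) -> bool:
    """True if the integer part of value fits into DECIMAL(precision, scale)."""
    integer_digits = precision - scale  # digits allowed before the decimal point
    return abs(Decimal(value)) < Decimal(10) ** integer_digits

samples = ["22.334", "445.322", "999.233"]

# DECIMAL(3,3) leaves 0 integer digits, so none of the samples fit.
print([fits_decimal(v, 3, 3) for v in samples])

# DECIMAL(7,3) leaves 4 integer digits, so all of the samples fit.
print([fits_decimal(v, 7, 3) for v in samples])
```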
I have a CSV file and its metadata. Columns in this CSV are delimited by the pipe symbol |. Sample data is as follows:
name|address|age|salary|doj
xyz | abcdef|29 |567,34|12/02/2001
Here the salary column is of type decimal, but a comma (,) is used as the decimal separator instead of a period (.).
I created a Hive external table as below, and for this data Hive shows NULL for the salary column.
create external table employee (
name string,
address string,
age int,
salary decimal(7,3),
doj string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://bucket/folder_having_many_csv_files/';
If I change the data type of the salary column to string, Hive returns the values as expected.
I would like to know how to tell Hive that this particular column is of type DECIMAL and that its decimal separator is a comma (,) and not a period (.).
You could build the table with salary as a string and replace the comma in a view on top of it. This is probably the easiest thing to do, since the data is big and likely someone else owns it.
create view employee_decimal as
select name
, address
, age
, cast(regexp_replace(salary, ',', '.') as decimal(7,3)) as salary
, doj
from employee;
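The conversion performed by the view can be sketched outside Hive. The Python below is only an illustration of the replace-then-cast idea; parse_comma_decimal is a hypothetical name, and returning None for unparseable input is an assumption chosen to mirror a NULL result.

```python
from decimal import Decimal, InvalidOperation

def parse_comma_decimal(text: str):
    """Swap the comma separator for a period, then parse; None if not a number."""
    try:
        return Decimal(text.replace(",", "."))
    except InvalidOperation:
        return None

print(parse_comma_decimal("567,34"))    # salary from the sample row
print(parse_comma_decimal("1.234,56"))  # becomes '1.234.56', which is not a number
```

A value that also uses a period as a thousands separator ends up with two periods after the replacement, so such data needs an extra cleaning step first.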
One field of my table is made up of many values separated by commas.
For example, one record of this field is:
598423,4803510,599121,98181856,1666529,106317962,4061964,7828860,598752,728067,599809,8799578,1666528,3253720,601990,601235
I want to spread out the values in every record of this field in Hive.
Which function or method can I use to do this?
Thanks.
I'm not entirely sure what you mean by "spread".
If you want an output table that has a value in every row like:
598423
4803510
599121
Then you could use explode(split(data,','))
Otherwise, if each input row has exactly 16 numbers and you want each of the numbers to reside in a different column, you have two options:
1. Define the comma as a delimiter for the input table: ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
2. Split a single column into 16 columns using the split UDF: SELECT split(data,',')[0] as col1, split(data,',')[1] as col2, ...
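The difference between the two readings of "spread" can be sketched in Python (not Hive). Both function names are hypothetical. Padding short rows with None is an assumption made for this illustration, standing in for a NULL column.

```python
def explode_rows(rows):
    """One output row per value, as with explode(split(data, ','))."""
    for row in rows:
        for value in row.split(","):
            yield value

def to_columns(row, width):
    """One output column per value, as with split(data, ',')[0], [1], ..."""
    parts = row.split(",")
    # Pad short rows with None so every row has the same number of columns.
    return parts[:width] + [None] * (width - len(parts))

data = ["598423,4803510,599121", "98181856,1666529"]

print(list(explode_rows(data)))
print([to_columns(row, 3) for row in data])
```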
I have a scenario where I have to convert a varchar column into a number column. While doing that I get an "invalid number" error. After inspecting the values, I found that some contain only whitespace and others were entered like 56.678.90. Below are the queries I tried to convert varchar into number:
select cast('45.56.78' as number) from dual or
select cast (' ' as number) from dual
Both of the values used in the queries above occur in column 'lddfc' of table entry_header. The column has values such as 456.99, 456.89.43, and whitespace. How can I display these values as numbers?
You haven't mentioned which variant of SQL you are using, but if it's T-SQL you can remove leading and trailing spaces using LTRIM(RTRIM(yourValue)). I'm not sure about the syntax for PL/SQL.
So your code would be: select cast(LTRIM(RTRIM('45.56.78')) AS NUMBER) FROM DUAL
I don't think you can convert '45.56.78' to a number, though, without removing one of the decimal points.
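The cleaning rule can be stated as a small sketch. This is Python rather than SQL, and to_number is a hypothetical helper. It trims whitespace and accepts a value only if it has at most one decimal point. Returning None for everything else is an assumption about how invalid entries should be treated.

```python
import re
from decimal import Decimal

# Optional sign, then digits with at most one decimal point.
NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

def to_number(text: str):
    """Trim the value and convert it, or return None if it is not a valid number."""
    cleaned = text.strip()
    if NUMERIC.fullmatch(cleaned):
        return Decimal(cleaned)
    return None

for raw in ["456.99", "456.89.43", " "]:
    print(repr(raw), "->", to_number(raw))
```

Whether a value like 456.89.43 should be rejected or repaired (for example by dropping one of the points) is a data decision that the sketch does not make.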