Import Data Into Hive Containing Whitespace

I am importing data from a CSV file into Hive. My table contains both strings and ints. However, in my input file the ints have whitespace around them, so a row looks like this:
some string, 2 ,another string , 7 , yet another string
Unfortunately I cannot control the formatting of the program providing the file.
When I import the data using (e.g.):
CREATE TABLE MYTABLE(string1 STRING, alpha INT, string2 STRING, beta INT, string3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then all my integers get set to NULL. I am assuming this is because the extra whitespace makes the parsing fail. Is there a way around this?

You can perform a multi-stage import. In the first stage, load all of your data as STRING; in the second stage, use trim() to remove the whitespace and then save the data as INT. You could also look into using Pig to read the data from your source files as raw text and then write it to Hive with the correct data types.
Edit
You can also do this in one pass if you can point to your source file as an external table.
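-- Staging table: every column is read as STRING so no value is lost to a failed parse.
-- Hive expects LOCATION to name the directory that holds the file, not the file itself.
CREATE EXTERNAL TABLE myTable(
string1 STRING, alpha STRING, string2 STRING, beta STRING, string3 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '\\server\path\file.csv';
-- Trim the padded numeric columns and cast them while copying into the typed table.
INSERT INTO TABLE myOtherTable
SELECT string1,
CAST(TRIM(alpha) AS INT),
string2,
CAST(TRIM(beta) AS INT),
string3
FROM myTable;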
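The two stages can be illustrated outside Hive on a single row. This is only a sketch of the trim-then-cast idea in Python, not the Hive import itself; the function name parse_row is hypothetical, and the string columns are deliberately left untrimmed, as in the query above.

```python
def parse_row(line):
    """Split one comma-delimited line and cast the padded numeric fields."""
    # Stage 1: treat every field as raw text, as the all-STRING staging table does.
    string1, alpha, string2, beta, string3 = line.split(",")
    # Stage 2: trim spaces, then cast. Python's int() would tolerate the padding
    # by itself; strip(" ") is kept only to mirror TRIM() in the query.
    return (string1, int(alpha.strip(" ")), string2, int(beta.strip(" ")), string3)


row = parse_row("some string, 2 ,another string , 7 , yet another string")
print(row)
```

The integer fields come back as real integers, while the string fields keep whatever padding the source file had.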

Related

Can't remove empty space on SQL table using replace function

I received a CSV file with a column called "Amount", which should be of type MONEY in my table.
My first step was loading the CSV file as is, so my table uses a string type for the Amount column, because I know the source does not format it as money. Because of some spaces in that column, I can't convert from NVARCHAR to MONEY.
Here's the initial table structure:
CREATE TABLE #TestReplace (
Amount NVARCHAR(100)
)
Here's an example of the value the client inserted for the column:
INSERT INTO #TestReplace VALUES('2 103.74')
Because there is a space in that string, I need to remove it so I can convert it to the MONEY type.
However, if I try the REPLACE SQL function, nothing happens. It's as if the value does not change:
SELECT REPLACE(Amount, ' ','') FROM #TestReplace
Amount after the replace command is still: 2 103.74
Am I missing something that does not catch the space after the number 2? Is there a better way to remove that space and convert from NVARCHAR to MONEY?
Appreciate all the help!

You have a character that looks like a space but is not one. If the column holds single-byte characters (varchar()), you can find the character's code with:
select ascii(substring(amount, 2, 1)) from #TestReplace
If it is an nvarchar() (as in your example):
select unicode(substring(amount, 2, 1)) from #TestReplace
Once you know what the character is, you can replace it.

To add to Gordon's answer, once you know the integer ASCII/Unicode value, you can use that in the replace function with the CHAR() and NCHAR() functions like so:
--For ASCII:
REPLACE(Amount, CHAR( /*int value*/ ), '')
--For Unicode:
REPLACE(Amount, NCHAR( /*int value*/ ), '')
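The same inspect-then-replace steps can be sketched in Python. The sample value below is an assumption: it uses U+00A0 (no-break space) as the look-alike character, which may not be what the real data contains, and the helper name strip_code_point is hypothetical.

```python
def strip_code_point(value, code_point):
    """Remove every occurrence of one code point, like REPLACE(col, NCHAR(n), '')."""
    return value.replace(chr(code_point), "")


# Hypothetical sample: the "space" is assumed to be U+00A0, not an ordinary U+0020.
amount = "2\u00a0103.74"

# Replacing the ordinary space changes nothing, which matches the symptom described.
unchanged = amount.replace(" ", "")

# SUBSTRING(amount, 2, 1) is 1-based; the same character is index 1 in Python.
suspect = ord(amount[1])

cleaned = strip_code_point(amount, suspect)
print(suspect, cleaned)
```

Once the code point is known, the cleaned string no longer contains the separator and can be converted to a numeric type.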

Decimal input get rounded in create external table in Hive from CSV

I am creating a table in Hive from a CSV (comma-separated) file that I have in HDFS. It has three columns: two strings and a decimal (with at most 18 digits after the decimal point and one before). This is what I do:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table(
col1 STRING, col2 STRING, col_decimal DECIMAL(19,18))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/hdfs_path/folder/';
My col_decimal is being rounded to 0 when below 1 and to 1 when it's not actually decimal (I only have numbers between 0 and 1).
Any idea of what is wrong? By the way, I have also tried DECIMAL(38,37)
Thank you.

regex_replace to append to end of line?

I have a postgres table which contains rows that each hold multiple lines of text (split by new lines), for example...
The table name is formats, column is called format, an example format (1 table row) would look like the following:
list1=text1;
list2=text2;
list3=text3;
etc etc
I would like a way to identify the list2 string and then append additional text to the end of the same line.
So the outcome would be:
list1=text1;
list2=text2;additionaltext
list3=text3;
I have tried the below to try and pull in the 'capture string' into the replace string but have been unsuccessful so far.
regexp_replace(format, 'list2=.*', '\1 additionaltext','n');

To capture a pattern, you must enclose it in parentheses.
regexp_replace(format, '(list2=.*)', '\1additionaltext', 'n')
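For comparison, here is the same capture-and-backreference idea sketched with Python's re module rather than Postgres. The sample text is taken from the question; the variable names are only illustrative. In Python the dot does not match a newline by default, so the match stops at the end of the list2 line without any extra flag.

```python
import re

# Sample column value from the question: three lines separated by newlines.
fmt = "list1=text1;\nlist2=text2;\nlist3=text3;"

# The parentheses create group 1; \1 in the replacement re-inserts the matched
# line, and the extra text is appended directly after it.
result = re.sub(r"(list2=.*)", r"\1additionaltext", fmt)
print(result)
```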

Hive array specifying multiple delimiter in collection

I have a dataset that contains two arrays, each separated by a different delimiter.
Ex: 14-20-50-60 is the 1st array, separated by -
12#2#333#4 is the 2nd array, separated by #
When creating the table, how do we specify the delimiter in
Collection items terminated by '' ?
input
14-20-50-60,12#2#333#4
create table test(first array<string>, second array<string>)
row format delimited
fields terminated by ','
collection items terminated by '-' -- how to specify two delimiters in the collection?

You cannot use multiple delimiters for the collection items. You can achieve what you are trying to do as below though. I have used the SPLIT function to create the array using different delimiters.
Data
14-20-50-60,12#2#333#4
SQL - CREATE TABLE
create external table test1(first string, second string)
row format delimited
fields terminated by ','
LOCATION '/user/cloudera/ramesh/test1';
SQL - SELECT
WITH v_test_array AS
(SELECT split(first, "-") AS first_array,
split(second, "#") AS second_array
FROM test1)
SELECT first_array[0], second_array[0]
FROM v_test_array;
OUTPUT
14 12
Hope this helps.
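The approach is: read each field as one string, then split each field on its own delimiter. A small Python sketch of that logic on the sample row follows; the function name parse_line is hypothetical. Note that Hive's split() treats its second argument as a regular expression, whereas str.split here takes a literal separator.

```python
def parse_line(line):
    """Split a row into its two fields, then split each field on its own delimiter."""
    first, second = line.split(",", 1)
    return first.split("-"), second.split("#")


first_array, second_array = parse_line("14-20-50-60,12#2#333#4")
print(first_array[0], second_array[0])
```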

invalid input syntax for integer with postgres

I have a table:
id | detail
1 | ddsffdfdf ;df, deef,"dgfgf",/dfdf/
When I ran: insert into details values(1,'ddsffdfdf ;df, deef'); the row was inserted properly.
When I copied that inserted value from the database to a file, the file had: 1 ddsffdfdf ;df, deef
Then I loaded the whole CSV file into the pgsql database, with values in the format: 1 ddsffdfdf ;df, deef
I get: ERROR: invalid input syntax for integer: "1 ddsffdfdf ;df, deef". How can I solve the problem?

CSVs need a delimiter that Postgres will recognize to break the text into respective fields. Your delimiter is a space, which is insufficient. Your CSV file should look more like:
1,"ddsffdfdf ;df, deef"
And your SQL should look like:
COPY details FROM 'filename' WITH CSV;
The WITH CSV is important because it tells Postgres to use a comma as the delimiter and to parse your values based on that. Because your second field contains a comma, you want to enclose its value in quotes so that its comma is not mistaken for a delimiter.
To look at a good example of a properly formatted CSV file, you can output your current table:
COPY details TO '/your/filename.csv' WITH CSV;
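The effect of quoting can be seen with Python's csv module, used here only as an illustration. Its conventions resemble CSV mode in Postgres but are not guaranteed to be identical, and the row is the one from the question.

```python
import csv
import io

# One row from the question: an integer id and a text field that contains a comma.
rows = [(1, "ddsffdfdf ;df, deef")]

# The writer quotes the second field because it contains the delimiter.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# A CSV-aware reader recovers exactly two fields ...
parsed = list(csv.reader(io.StringIO(text)))

# ... while a naive split on commas breaks the text field in two.
naive = text.strip().split(",")

print(text.strip())
print(parsed)
print(len(naive))
```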
