Importing data with commas in numeric fields into Redshift - SQL

I am importing data into Redshift using the SQL COPY statement. The data has comma thousands separators in its numeric fields, which the COPY statement rejects.
The COPY statement has a number of options to specify field separators, date and time formats, and NULL values. However, I do not see anything for specifying number formatting.
Do I need to preprocess the data before loading, or is there a way to get Redshift to parse the numbers correctly?

Import the affected columns with a TEXT data type into a temporary staging table.
Then insert from the temporary table into your target table. Have the SELECT statement for the INSERT replace the commas with empty strings and cast the values to the correct numeric type.
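A minimal sketch of that staging approach; the table and column names (staging_import, sales_target, amount_raw) and the DECIMAL(12,2) target type are assumptions for illustration:
-- staging_import and sales_target are hypothetical names
CREATE TEMP TABLE staging_import (id VARCHAR(20), amount_raw VARCHAR(32));
-- COPY into staging_import exactly as you do today, then:
INSERT INTO sales_target (id, amount)
SELECT id, CAST(REPLACE(amount_raw, ',', '') AS DECIMAL(12,2))
FROM staging_import;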

Related

Insert comma-delimited string into a string column of an Athena table

I created an Amazon Athena table based on a CSV file in S3. I want to insert a row of data where one column includes commas, but the string gets truncated at the first comma each time.
For example:
insert into table_names (id, key_string) values (1, '{key1=1,key2=3}')
Each time, the column key_string only stores {key1=1.
I tried using double quotes ("{key1=1,key2=3}") and escape characters (\"{key1=1,key2=3}\").
They don't work.
Any suggestion?

Hive table taking decimal value as NULL

I am facing a strange issue. I tried a tab delimiter both in the file and in the table definition, and a comma as well.
In both cases it reads the decimal values as NULL, but when I define these fields as INT it works fine.
Sample data with comma delimited values:
1,22.334
2,445.322
3,999.233
I defined the table as:
create table x(ID INT, SAL DECIMAL(3,3)) row format delimited fields terminated by '\t' location '/tmp/data/'
and similarly for the comma-delimited file:
create table x(ID INT, SAL DECIMAL(3,3)) row format delimited fields terminated by ',' location '/tmp/data/'
But in both cases it reads the decimal values as NULL. What is the issue?
First, the DECIMAL datatype does not accept commas in the data.
Second, you have to increase DECIMAL(3,3) to at least DECIMAL(7,3) for the sample data provided,
as DECIMAL(3,3) cannot hold any of the three values.
Since your raw data contains commas,
you have to load it into a table with all columns defined as the STRING datatype.
Then use a regular expression to remove the commas and load the result into a second-level Hive table with the DECIMAL datatype, as sketched below.
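A rough sketch of that two-step load, reusing the comma-delimited layout from the question; x_raw and x_final are made-up table names:
-- x_raw and x_final are hypothetical staging and final tables
CREATE TABLE x_raw (id STRING, sal STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/data/';

CREATE TABLE x_final (id INT, sal DECIMAL(7,3));

INSERT OVERWRITE TABLE x_final
SELECT CAST(id AS INT), CAST(regexp_replace(sal, ',', '') AS DECIMAL(7,3))
FROM x_raw;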

How to spread the values from a column in Hive?

One field of the table is made up of many values separated by commas;
for example, a record of this field is:
598423,4803510,599121,98181856,1666529,106317962,4061964,7828860,598752,728067,599809,8799578,1666528,3253720,601990,601235
I want to spread the values in every record of this field in Hive.
Which function or method can I use to achieve this?
Thanks.
I'm not entirely sure what you mean by "spread".
If you want an output table that has a value in every row like:
598423
4803510
599121
Then you could use explode(split(data, ',')), as in the sketch after this answer.
Otherwise, if each input row has exactly 16 numbers and you want each of the numbers to reside in a different column, you have two options:
Either define the comma as the delimiter for the input table: ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Or split the single column into 16 columns using the split UDF: SELECT split(data,',')[0] as col1, split(data,',')[1] as col2, ...
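A minimal sketch of the explode approach, assuming a hypothetical table source_table with the comma-separated string in a column named data:
-- source_table and data are assumed names for illustration
SELECT single_value
FROM source_table
LATERAL VIEW explode(split(data, ',')) vals AS single_value;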

Is there a way to define replacement of one string with another in external table creation in Greenplum?

I need to create an external table for an HDFS location. The data has the literal string null instead of an empty string for a few fields. For such fields, if the defined field length is less than 4, it throws an error when selecting the data. Is there a way to define a replacement of all such nulls with an empty string while creating the table itself?
I am trying this in Greenplum; I just tagged Hive to see what can be done for such cases in Hive.
You could use the serialization property to map the NULL string to an empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
In this case, when you query it from Hive you would get an empty value for that field, and HDFS would have "\N".
Or
If you want an empty string represented instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE TABLE tabname SELECT NULL, COALESCE(NULL, "") FROM data_table;
The answer to the problem is using the NULL AS 'null' clause in the CREATE EXTERNAL TABLE syntax for Greenplum. As I mentioned, I wanted to get a few inputs from people who have faced such issues in Hive, so I tagged Hive as well. But Greenplum's external table syntax supports a NULL AS phrase in which we can specify the form of NULL that we want to keep.
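For reference, a minimal Greenplum external table sketch using that phrase; the gphdfs location, delimiter, and column list are assumptions for illustration:
-- the column list, protocol, host, and path are placeholders
CREATE EXTERNAL TABLE ext_sample (id int, code text, descr text)
LOCATION ('gphdfs://namenode:8020/data/sample.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL AS 'null');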

Leading zeros disappear when using bulk insert from file

I am using bulk insert to insert data from a csv file into a SQL table. One of the columns in the csv file is an "ID" column, i.e. each cell in the column is an "ID number" that may have leading zeros. Example: 00117701, 00235499, etc.
The equivalent column in the SQL table is of varchar(255) type.
When I bulk insert the data into the table, the leading zeros in each element of the "ID" column disappear. In other words, 00117701 becomes 117701, etc.
Is this a column type problem? If not, what's the best way to overcome this problem?
Thanks!
I am not sure what is causing it to strip off the leading zeroes, but I had to 'fix' some data in the past and did something like this:
UPDATE <table> SET <field> = RIGHT('00000000'+cast(<field> as varchar(8)),8)
You may need to adjust it a bit for your purposes, but maybe you get the idea from it?
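For example, with a hypothetical table imported_data whose id_number column should hold eight-digit IDs, the re-padding would look like:
-- imported_data and id_number are made-up names for illustration
UPDATE imported_data
SET id_number = RIGHT('00000000' + CAST(id_number AS VARCHAR(8)), 8)
WHERE LEN(id_number) < 8;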