How to insert latin data into a Snowflake Table - sql

We have a scenario, that we need to insert some special characters coming from file to the Snowflake table.
For exp:
emp_id| emp_name
110|Famille immédiate
As the snowflake only allow UTF-8 format, when running the dml operation the data is not getting inserted into the table and throwing an error.
Have tried updating the file format command but no solution yet.
CREATE OR REPLACE FILE FORMAT DB.LayOut01_FORMAT TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1 ESCAPE_UNENCLOSED_FIELD = NONE REPLACE_INVALID_CHARACTERS = TRUE VALIDATE_UTF8 = FAlSE
What will be changes required to allow special charectors into the table as it is coming from source file ??
Insert Statement:
INSERT INTO DB.EMP_T ( emp_id, emp_name)
SELECT
(temp.$1) AS emp_id , (temp.$2) AS emp_name
from
$AZURE_FILE_STORAGE_LOCATION (file_format => DB.LayOut01_FORMAT, pattern=>'filename.csv') temp

UTF-8 is the only format for semi-structured data, but for structured you can insert data with different encodings.
Use on the file format the ENCODING parameter and set it to IS-8859-1, like:
CREATE FILE FORMAT ... ENCODING='ISO-8859-1'
For more information have a look here.

Related

Create External Table pointing to S3

How do we create an external table using Snowflake sql that points to a directory in S3? Below is the code I tried so far, but didn't work. Any help is highly appreciated.
create external table my_table
(
column1 varchar(4000),
column2 varchar(4000)
)
LOCATION 's3a://<externalbucket>'
Note : The file that I have in the S3 bucket is a csv file (comma seperated, double quotes enclosed and with header).
You will need to update your location to be an external stage, include the file_format parameter, and include the proper expression for the columns.
The location Parameter:
Specifies the external stage where the files containing data to be read are staged.
Additionally you'll need to define the file_format
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#required-parameters
So your statement should look more like this:
create external table my_table
(
column1 varchar as (value:c1::varchar),
column2 varchar as (value:c2::varchar)
)
location = #[namespace.]ext_stage_name[/path]
file_format = (type = CSV)
You may need to define additional paramaters in the file format to handle your file appropriately
Finally I sorted this out. Posting this answer as to make the answer simple to understand especially for the beginners.
Say that I have a csv file in the S3 location in the below format.
Step 1 :
Create a file format in which you can define what type of file it is, field delimiter, data enclosed in double quotes, skip the header of the file etc.
create or replace file format schema_name.pipeformat
type = 'CSV'
field_delimiter = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
skip_header = 1
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Step 2 :
Create a Stage to specify the S3 details and file format.
create or replace stage schema_name.stage_name
url='s3://<path where file is kept>'
credentials=(aws_key_id='****' aws_secret_key='****')
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#required-parameters
Step 3 :
Create the external table based on the Stage name and file format.
create or replace external table schema_name.table_name
(
RollNumber INT as (value:c1::int),
Name varchar(20) as ( value:c2::varchar),
Marks int as (value:c3::int)
)
with location = #stage_name
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html
Step 4 :
Now you should be able to query from the external table.
select *
from schema_name.table_name

How to handle the embedded commas in hive?

For example if I have a csv file with three cols,
sno,name,salary
1,latha, 2000
2,Bhavish, Chaturvedi, 3000
How to load this type of file in hive. I tried few of the posts from stackoverflow, but it didn't worked.
I have created a external table:
create external table test(
id int,
name string,
salary int
)
fields terminated by '\;'
stored as text file;
and loaded the data into it.
But when done select * from table, I got all null's into it.
I think CSV file has column name then you have to skip header to avoid the error follow the following steps:
Step 1: Create table e.g
CREATE TABLE salary (sno INT, name STRING, salary INT)
row format delimited fields terminated BY ',' stored as textfile
tblproperties("skip.header.line.count"="1");
Step 2: load the CSV file into table e.g
load data local inpath 'file path' into table salary;
Step 3: Test the records
select * from salary;

Import CSV into SQL (CODE)

I want to import several CSV files automatically using SQL-code (i.e. without using the GUI). Normally, I know the dimensions of my CSV file. So, in many cases I create an empty table with, let say, x columns with the corresponding data types. Then, I import the CSV file into this table using BULK INSERT. However, in this case I don't know much about my files, i.e. information about data types and dimensions are not given.
To summerize the problem:
I receive a file path, e.g. C:...\DATA.csv. Then, I want to use this path in SQL-code to import the file to a table without knowing anything about it.
Any ideas on how to solve this problem?
Use something like this:
BULK INSERT tbl
FROM 'csv_full_path'
WITH
(
FIRSTROW = 2, --Second row if header row in file
FIELDTERMINATOR = ',', --CSV field delimiter
ROWTERMINATOR = '\n', --Use to shift the control to next row
ERRORFILE = 'error_file_path',
TABLOCK
)
If columns are not known, you could try with:
select * from OpenRowset
Or, do a bulk insert with only the first row as one big column, then parse it to create the dynamic main insert. Or bulk insert the whole file into a table with just one column, then parse that...
You can use OPENROWSET (documantation).
SELECT *
INTO dbo.MyTable
FROM
OPENROWSET(
BULK 'C:\...\mycsvfile.csv',
SINGLE_CLOB) AS DATA;
In addition, you can use dynamic SQL to parameterize table name and location of csv file.

How do I upload a key=value format file into a Hive table?

I am new to data engineering, so this might be a basic question, appreciate your help here.
I have a file which is in the following format -
first_name=A1 last_name=B1 city=Austin state=TX Zip=78703
first_name=A2 last_name=B2 city=Seattle state=WA
Note: No zip code available for the second row.
I need to upload this into Hive, in the following format:
First_name Last_name City State Zip
A1 B1 Austin TX 78703
A2 B2 Seattle WA NULL
Thanks for your help!!
I figured a way to do this in Hive. The idea is to first upload the entire data into a n*1 table (n is the number of rows), and then parsing the key names in the second step using the str_to_map function.
Step 1: Upload all data into 1 column table. Input a delimiter which you are sure will not parse your data, and doesn't exist (\002 in this case)
DROP TABLE IF EXISTS kv_001;
CREATE EXTERNAL TABLE kv_001 (
col_import string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
LOCATION 's3://location/directory/';
Step 2: Using the str_to_map function, extract the keys that are needed
DROP TABLE IF EXISTS required_table;
CREATE TABLE required_table
(first_name STRING
, last_name STRING
, city STRING
, state STRING
, zip INT);
INSERT OVERWRITE TABLE required_table
SELECT
params["first_name"] AS first_name
, params["last_name"] AS last_name
, params["city"] AS city
, params["state"] AS state
, params["zip"] AS zip
FROM
(SELECT str_to_map(col_import, '\001', '=') params FROM kv_001) A;
You can transform your file using python3 script and then upload it to hive table
Try this steps:
Script for example:
import sys
for line in sys.stdin:
line = line.split()
res = []
for item in line:
res.append(item.split("=")[1])
if len(line) == 4:
res.append("NULL")
print(",".join(res))
If only zip field can be empty, it works.
To apply it, use something like
cat file | python3 script.py > output.csv
Then upload this file to hdfs using
hadoop fs -copyFromLocal ./output.csv hdfs:///tmp/
And create the table in hive using
CREATE TABLE my_table
(first_name STRING, last_name STRING, city STRING, state STRING, zip STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/output.csv'
OVERWRITE INTO TABLE my_table;

Postgres Copy - Importing an integer with a comma

I am importing 50 CSV data files into postgres. I have an integer field where sometimes the value is a regular number (comma-delimited) and sometimes it is in quotations and uses a comma for the thousands.
For instance, I need to import both 4 and "4,000".
I'm trying:
COPY race_blocks FROM '/census/race-data/al.csv' DELIMITER ',' CSV HEADER;
And get the error:
ERROR: invalid input syntax for integer: "1,133"
How can I do this?
Let's assume you have only one column in your data.
First create temporary table with varchar column:
CREATE TEMP TABLE race_blocks_temp (your_integer_field VARCHAR);
Copy your data from file
COPY race_blocks_tmp FROM '/census/race-data/al.csv' DELIMITER ',' CSV HEADER;
Remove ',' from varchar field, convert data to numeric and insert into your table.
INSERT INTO race_blocks regexp_replace(your_integer_field, ',', '') :: numeric AS some_colun FROM race_blocks_tmp;