Create External Table pointing to S3 - sql

How do we create an external table using Snowflake SQL that points to a directory in S3? Below is the code I tried so far, but it didn't work. Any help is highly appreciated.
create external table my_table
(
column1 varchar(4000),
column2 varchar(4000)
)
LOCATION 's3a://<externalbucket>'
Note: The file that I have in the S3 bucket is a CSV file (comma separated, double-quote enclosed, with a header).

You will need to update your location to be an external stage, include the file_format parameter, and include the proper expression for the columns.
The location parameter:
Specifies the external stage where the files containing data to be read are staged.
Additionally, you'll need to define the file_format:
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#required-parameters
So your statement should look more like this:
create external table my_table
(
column1 varchar as (value:c1::varchar),
column2 varchar as (value:c2::varchar)
)
location = @[namespace.]ext_stage_name[/path]
file_format = (type = CSV)
You may need to define additional parameters in the file format to handle your file appropriately.
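For instance, a file format matching the file described in the question (comma separated, double-quote enclosed, with a header) might look roughly like the sketch below; my_csv_format is a placeholder name, not an object from the question.
create or replace file format my_csv_format
  type = 'CSV'
  field_delimiter = ','
  field_optionally_enclosed_by = '"'
  skip_header = 1;
The external table can then reference it with file_format = (format_name = 'my_csv_format') instead of spelling the options out inline.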

Finally I sorted this out. Posting this answer to make it simple to understand, especially for beginners.
Say that I have a pipe-delimited CSV file in the S3 location, with a header row and three columns: RollNumber, Name and Marks.
Step 1 :
Create a file format in which you define what type of file it is, the field delimiter, whether data is enclosed in double quotes, whether to skip the header of the file, etc.
create or replace file format schema_name.pipeformat
type = 'CSV'
field_delimiter = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
skip_header = 1
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
Step 2 :
Create a Stage to specify the S3 details and file format.
create or replace stage schema_name.stage_name
url='s3://<path where file is kept>'
credentials=(aws_key_id='****' aws_secret_key='****')
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#required-parameters
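As an optional sanity check, listing the stage confirms that Snowflake can see the file before you build the external table on top of it (stage name as created in Step 2):
list @schema_name.stage_name;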
Step 3 :
Create the external table based on the Stage name and file format.
create or replace external table schema_name.table_name
(
RollNumber INT as (value:c1::int),
Name varchar(20) as ( value:c2::varchar),
Marks int as (value:c3::int)
)
with location = @stage_name
file_format = pipeformat
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html
Step 4 :
Now you should be able to query the external table.
select *
from schema_name.table_name
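If the query unexpectedly returns no rows even though the file is in the stage, the external table's metadata may simply not have been refreshed yet; a manual refresh is one thing to try (alternatively, auto_refresh can be configured on the table):
alter external table schema_name.table_name refresh;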

Related

Amazon Athena set location to single csv file

I would like to set the location value in my Athena SQL create table statement to a single CSV file, as I do not want to query every file in the path. I can set and successfully query an S3 directory (object) path and all files in that path, but not a single file. Is setting a single file as the location supported?
Successfully queries CSV files in path:
LOCATION 's3://my_bucket/path/'
Returns zero results:
LOCATION 's3://my_bucket/path/filename.csv.gz'
Create table statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_db` (
`name` string,
`occupation` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ','
) LOCATION 's3://bucket-name/path/filename.csv.gz'
TBLPROPERTIES ('has_encrypted_data'='false');
I have read this Q&A and this, but it doesn't seem to address the same issue.
Thank you.
You could try adding the path of that particular object in a WHERE condition while querying:
SELECT * FROM default.my_db
WHERE "$path" = 's3://bucket-name/path/filename.csv.gz'

How to insert latin data into a Snowflake Table

We have a scenario where we need to insert some special characters coming from a file into a Snowflake table.
For example:
emp_id| emp_name
110|Famille immédiate
As Snowflake only allows the UTF-8 format, the data does not get inserted into the table when running the DML operation and an error is thrown.
I have tried updating the file format command, but no solution yet.
CREATE OR REPLACE FILE FORMAT DB.LayOut01_FORMAT TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1 ESCAPE_UNENCLOSED_FIELD = NONE REPLACE_INVALID_CHARACTERS = TRUE VALIDATE_UTF8 = FALSE
What changes are required to allow special characters into the table as they come from the source file?
Insert Statement:
INSERT INTO DB.EMP_T ( emp_id, emp_name)
SELECT
(temp.$1) AS emp_id , (temp.$2) AS emp_name
from
$AZURE_FILE_STORAGE_LOCATION (file_format => DB.LayOut01_FORMAT, pattern=>'filename.csv') temp
UTF-8 is the only supported character set for semi-structured data, but for structured data you can load data in different encodings.
Use the ENCODING parameter on the file format and set it to ISO-8859-1, like:
CREATE FILE FORMAT ... ENCODING='ISO-8859-1'
For more information have a look here.
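A minimal sketch of how that could look applied to the file format from the question, assuming the source file really is ISO-8859-1 (Latin-1) encoded; the REPLACE_INVALID_CHARACTERS / VALIDATE_UTF8 workarounds are left out on the assumption that declaring the right encoding makes them unnecessary:
CREATE OR REPLACE FILE FORMAT DB.LayOut01_FORMAT
  TYPE = CSV
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1
  ESCAPE_UNENCLOSED_FIELD = NONE
  ENCODING = 'ISO-8859-1'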

How to handle the embedded commas in hive?

For example if I have a csv file with three cols,
sno,name,salary
1,latha, 2000
2,Bhavish, Chaturvedi, 3000
How do I load this type of file in Hive? I tried a few of the posts from Stack Overflow, but it didn't work.
I have created a external table:
create external table test(
id int,
name string,
salary int
)
fields terminated by '\;'
stored as text file;
and loaded the data into it.
But when I ran select * from the table, I got all NULLs.
I think the CSV file has a header row with the column names, so you have to skip the header to avoid the error. Follow these steps:
Step 1: Create the table, e.g.
CREATE TABLE salary (sno INT, name STRING, salary INT)
row format delimited fields terminated BY ',' stored as textfile
tblproperties("skip.header.line.count"="1");
Step 2: Load the CSV file into the table, e.g.
load data local inpath 'file path' into table salary;
Step 3: Test the records
select * from salary;
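Note that skipping the header does not by itself fix the embedded comma in the sample row (2,Bhavish, Chaturvedi, 3000). If the source file actually quotes fields that contain commas (e.g. "Bhavish, Chaturvedi"), which is an assumption about the data, the OpenCSVSerde can parse them; a sketch:
CREATE TABLE salary_csv (sno STRING, name STRING, salary STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count" = "1");
OpenCSVSerde reads every column as a string, so numeric columns need a cast when querying, e.g. CAST(salary AS INT).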

Hive select csv files only from directory

I have the following file structure
/base/{yyyy-mm-dd}/
folder1/
folderContainingCSV/
logs/
I want to load the data from my base directory for all dates. But the problem is that there are files in non-csv.gz formats in the logs/ directory. Is there a way to select only csv.gz files while querying from the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');
You cannot do this at the table-creation level. You need to copy all the *.gz files separately into another folder.
This can be done within the Hive script (the one containing the create table statement) itself. Just add the commands below at the beginning of the Hive script (just before the create table):
dfs -mkdir -p /new/path/folder;
dfs -cp /regular/log/file/*.gz /new/path/folder;
Now you can create the external table pointing to /new/path/folder.
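For instance, reusing the table definition from the question but pointing it at the copied files (the path is the same placeholder used in the dfs commands above):
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION '/new/path/folder'
TBLPROPERTIES ('has_encrypted_data'='true');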

How do I upload a key=value format file into a Hive table?

I am new to data engineering, so this might be a basic question, appreciate your help here.
I have a file which is in the following format -
first_name=A1 last_name=B1 city=Austin state=TX Zip=78703
first_name=A2 last_name=B2 city=Seattle state=WA
Note: No zip code available for the second row.
I need to upload this into Hive, in the following format:
First_name Last_name City State Zip
A1 B1 Austin TX 78703
A2 B2 Seattle WA NULL
Thanks for your help!!
I figured out a way to do this in Hive. The idea is to first upload the entire data into an n*1 table (n being the number of rows), and then parse out the key names in a second step using the str_to_map function.
Step 1: Upload all data into a one-column table. Use a field delimiter which you are sure does not appear in your data (\002 in this case).
DROP TABLE IF EXISTS kv_001;
CREATE EXTERNAL TABLE kv_001 (
col_import string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
LOCATION 's3://location/directory/';
Step 2: Using the str_to_map function, extract the keys that are needed
DROP TABLE IF EXISTS required_table;
CREATE TABLE required_table
(first_name STRING
, last_name STRING
, city STRING
, state STRING
, zip INT);
INSERT OVERWRITE TABLE required_table
SELECT
params["first_name"] AS first_name
, params["last_name"] AS last_name
, params["city"] AS city
, params["state"] AS state
, params["zip"] AS zip
FROM
(SELECT str_to_map(col_import, '\001', '=') params FROM kv_001) A;
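For reference, str_to_map(text, delimiter1, delimiter2) splits the text into pairs on delimiter1 and each pair into key and value on delimiter2, so delimiter1 has to match whatever separates the key=value pairs in the raw line (a single space, if the file looks exactly like the sample above). A quick check against a literal:
SELECT str_to_map('first_name=A1 last_name=B1 city=Austin state=TX Zip=78703', ' ', '=')['city'];
-- returns 'Austin'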
You can transform your file using a python3 script and then upload it to a Hive table.
Try these steps:
Script for example:
import sys

for line in sys.stdin:
    # split the "key=value" tokens on whitespace
    line = line.split()
    res = []
    for item in line:
        res.append(item.split("=")[1])
    # pad the missing zip field for short rows
    if len(line) == 4:
        res.append("NULL")
    print(",".join(res))
If only the zip field can be empty, this works.
To apply it, use something like
cat file | python3 script.py > output.csv
Then upload this file to HDFS using
hadoop fs -copyFromLocal ./output.csv hdfs:///tmp/
And create the table in hive using
CREATE TABLE my_table
(first_name STRING, last_name STRING, city STRING, state STRING, zip STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/output.csv'
OVERWRITE INTO TABLE my_table;