How do you bulk load parquet files into Snowflake from AWS S3?

I'm trying to bulk load 28 parquet files into Snowflake from an S3 bucket using the COPY command and regex pattern matching. But each time I run the command in my worksheet, I'm getting the following bad response:
Copy executed with 0 files processed.
Inside a folder in my S3 bucket, the files I need to load into Snowflake are named as follows:
S3://bucket/foldername/filename0000_part_00.parquet
S3://bucket/foldername/filename0001_part_00.parquet
S3://bucket/foldername/filename0002_part_00.parquet
...
S3://bucket/foldername/filename0026_part_00.parquet
S3://bucket/foldername/filename0027_part_00.parquet
Using the Snowflake worksheet, I'm trying to load data into a pre-existing table, using the following commands:
CREATE OR REPLACE FILE FORMAT myparquetformat TYPE = 'parquet';

COPY INTO [Database].[Schema].[Table] FROM (
  SELECT $1:field1::VARCHAR(512), $1:field2::INTEGER, $1:field3::VARCHAR(512),
         $1:field4::DOUBLE, $1:field5::VARCHAR(512), $1:field6::DOUBLE
  FROM @AWS_Snowflake_Stage/foldername/
  (FILE_FORMAT => 'myparquetformat', PATTERN => 'filename00[0-9]+_part_00.parquet')
)
ON_ERROR = 'continue';
I'm not sure why these commands fail to run.
In every example I've seen in the Snowflake documentation, "PATTERN" is only used within the COPY command outside of a SELECT query. I'm not sure if it's possible to use PATTERN inside a SELECT query.
In this case, I think it's necessary to use the SELECT query within the COPY command, since I'm loading in parquet data that would first need to be cast from a single column ($1) into multiple columns with appropriate data types for the table (varchar, integer, double). The SELECT query is what enables the importing of the parquet file into the existing table -- is it possible to find a way around this using a separate staging table?
It's a huge pain to load the parquet files one at a time. Is there any way to bulk load these 28 parquet files using the Snowflake worksheet? Or is it better to try to do this using a Python script and the Snowflake API?

For me the below worked. I agree my pattern is quite simple (it just selects every parquet file in the location), but you can verify whether your regex pattern is valid.
COPY INTO <TABLE_NAME> FROM (
  SELECT
    $1:col_name_1,
    $1:col_name_2
  FROM @STAGE_NAME/<PATH_TO_FILES>
)
PATTERN = '.*.parquet'
FORCE = TRUE
FILE_FORMAT = (
  TYPE = 'parquet'
);
Side note: keep in mind that Snowflake has a safety check that skips files which have already been staged and loaded successfully once; that is what the FORCE = TRUE above overrides.
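Adapting that answer to the file names in the question, a sketch might look like the following (the stage, table, and column names are the ones from the question and are assumptions about your setup; PATTERN is applied as a COPY option outside the SELECT and is matched against the path under the stage, hence the leading .*):

COPY INTO [Database].[Schema].[Table] FROM (
  SELECT $1:field1::VARCHAR(512), $1:field2::INTEGER, $1:field3::VARCHAR(512),
         $1:field4::DOUBLE, $1:field5::VARCHAR(512), $1:field6::DOUBLE
  FROM @AWS_Snowflake_Stage/foldername/
)
PATTERN = '.*filename00[0-9]+_part_00[.]parquet'  -- leading .* so the folder prefix also matches
FILE_FORMAT = (FORMAT_NAME = 'myparquetformat')
ON_ERROR = 'continue';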

Related

insert into hive external table as select and ensure it generates single file in table directory

My question is somewhat similar to the post below. I want to download some data from a Hive table using a SELECT query. But because the data is large, I want to write it as an external table at a given path, so that I can create a CSV file. I use the code below:
create external table output(col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';

INSERT OVERWRITE TABLE output
SELECT col1, col2 FROM atable LIMIT 1000;
This works fine and creates a file in the 0000_ format, which can be copied out as a CSV file.
But my question is: how do I ensure that the output will always be a single file? If there is no partition defined, will it always be a single file? What rule does it use to split files?
I saw a few similar questions, like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the below alternative, but I use a hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the property below before doing the overwrite:
set mapreduce.job.reduces=1;
Note: if the Hive engine doesn't allow this parameter to be modified at runtime, whitelist it by setting the property below in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=|mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*
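Putting the two pieces together, a minimal sketch that assumes the external table and query from the question (table and column names are the ones shown there):

-- Force a single reducer before the overwrite, as suggested above,
-- so the result should land in a single output file
set mapreduce.job.reduces=1;

INSERT OVERWRITE TABLE output
SELECT col1, col2 FROM atable LIMIT 1000;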

Deduplication on Amazon Athena

We have streaming applications storing data on S3. The S3 partitions might have duplicated records. We query the data in S3 through Athena.
Is there a way to remove duplicates from S3 files so that we don't get them while querying from Athena?
You can write a small bash script that executes a Hive/Spark/Presto query to read the data, remove the duplicates, and then write it back to S3.
I don't use Athena, but since it is just Presto under the hood, I will assume you can do whatever can be done in Presto.
The bash script does the following:
1. Read the data, apply a distinct filter (or whatever logic you want to apply), and then insert it into another location.
For example:
CREATE TABLE mydb.newTable AS
SELECT DISTINCT *
FROM hive.schema.myTable
If it is a recurring task, then INSERT OVERWRITE would be better (see the sketch after these steps).
Don't forget to set the location of the Hive DB so you can easily identify the data destination.
Syntax reference: https://prestodb.io/docs/current/sql/create-table.html
2. Remove the old data directory using the aws s3 CLI.
3. Move the new data into the old directory.
Now you can safely read the same table, but the records will be distinct.
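For the recurring case mentioned in step 1, a minimal Hive/Spark SQL sketch (reusing the hypothetical table names from the CREATE TABLE example above; in Hive the source is referenced without the Presto catalog prefix):

-- Rewrite the deduplicated result over the existing target table on each run
INSERT OVERWRITE TABLE mydb.newTable
SELECT DISTINCT *
FROM schema.myTable;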
Please use CTAS:
CREATE TABLE new_table
WITH (
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT DISTINCT *
FROM old_table;
Reference: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We cannot remove duplicates in Athena directly, since it works on the files themselves, but there are workarounds.
So the duplicate records somehow have to be deleted from the files in S3; the easiest way would be a shell script.
Or write a SELECT query with the DISTINCT option.
Note: both are costly operations.
Athena can create an EXTERNAL TABLE over data stored in S3. If you want to modify the existing data, use Hive:
Create a table in Hive, then run:
INSERT OVERWRITE TABLE new_table_name SELECT DISTINCT * FROM old_table;

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just query the file as if it were a database table, like this?
SELECT * FROM gz_File_Name WHERE key = 'keyname' LIMIT 10;
But it always comes back with an error.
You need to create a Hive external table over this file's location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
See the manual on Hive tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; file names containing slashes are represented by tools such as Hive as a folder structure. See here: https://stackoverflow.com/a/42877381/2700344
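Once the external table exists, the query from the question can be run against it; a sketch, assuming the placeholder column names above and that col_one holds the key:

SELECT *
FROM hive_schema.your_table
WHERE col_one = 'keyname'   -- filter on whichever column actually holds the key
LIMIT 10;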

Writing data using PIG to HIVE external table

I wanted to create an external table and load data into it through a Pig script. I followed the approach below:
OK. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say:
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create a table like the one above; there is no need to load any file into that directory.
Now write a Pig script that reads data from some input directory, and when you store the output of that Pig script, do it as below:
A = LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A by id >= 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, delimiter, and schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
My problem is: when I first created the table, it created a directory in HDFS, and when I tried to store a file using the script, it threw an error saying "folder already exists". It looks like Pig STORE always writes to a new directory with only a specific name?
Is there any way to avoid this issue?
And are there any other attributes we can use with the STORE command in Pig to write to a specific directory/file every time?
Thanks
Ram
Yes, you can use HCatalog to achieve your result.
Remember you have to run your Pig script like:
pig -useHCatalog your_pig_script.pig
Or if you are using the Grunt shell, then simply use:
pig -useHCatalog
Next is your STORE command; to store your relation directly into a Hive table, use:
STORE C INTO 'HIVE_DATABASE.EXTERNAL_TABLE_NAME' USING org.apache.hive.hcatalog.pig.HCatStorer();
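After the script finishes, the rows should be queryable straight from Hive; a quick sanity check might look like this (the database and table names are the placeholders from the STORE command above):

-- Hive-side check that the Pig job actually loaded rows
SELECT COUNT(*) FROM HIVE_DATABASE.EXTERNAL_TABLE_NAME;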

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt), but the way it currently works is that the above script creates a folder named 'MyData.txt' and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory, for the reasons above.
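Given that, a possible rewrite of the S3 table from the question points LOCATION at a directory instead of a file name (the directory name here is illustrative, not prescribed by anything above):

CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyDataOutput/';  -- a directory, not a file; Hive writes its own part files inside it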