Getting Null values while loading parquet data from s3 to snowflake - amazon-s3

Problem Statement: Load Parquet data from AWS S3 into a Snowflake table.
Command I am using:
COPY INTO schema.test_table FROM (
  SELECT $1:ID::INTEGER, $1:DATE::TIMESTAMP, $1:TYPE::VARCHAR
  FROM @s3_external_stage/folder/part-00000-c000.snappy.parquet
)
file_format = (type = parquet);
As a result, I am getting null values.
I queried the Parquet data directly in S3 and it does have values in it.
Not sure what I am missing.
Also, is there any way to load data from Parquet files into tables recursively?
For example:
s3_folder/
|---- fileabc.parquet
|---- file_xyz.parquet
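One likely cause of the NULLs (not stated in the original post) is that element names in Snowflake's $1:<name> notation are case-sensitive for Parquet data, so if the columns in the file are actually lowercase, $1:ID returns NULL. A minimal sketch, assuming the Parquet schema uses lowercase names (id, date, type) -- adjust to the actual schema -- and using PATTERN so every .parquet file under the folder is picked up instead of naming each file:

COPY INTO schema.test_table FROM (
  -- quoted element names must match the Parquet column names exactly
  SELECT $1:"id"::INTEGER, $1:"date"::TIMESTAMP, $1:"type"::VARCHAR
  FROM @s3_external_stage/folder/
)
pattern = '.*[.]parquet'
file_format = (type = parquet);

PATTERN is matched against the file paths under the stage, so this also covers the "load all files in a folder" part of the question.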

Related

Trino S3 partitions folder structure

I do not understand what paths Trino needs in order to create a table from existing files. I use S3 + a Hive metastore.
My JSON file:
{"a":1,"b":2,"snapshot":"partitionA"}
Create table command:
create table trino.partitioned_jsons (a INTEGER, b INTEGER, snapshot VARCHAR)
with (external_location = 's3a://bucket/test/partitioned_jsons/*',
      format = 'JSON',
      partitioned_by = ARRAY['snapshot']);
What I have tried:
Store JSON file in s3://bucket/test/partitioned_jsons/partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot/partitionA.json
But all of them return just an empty table.
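This question is unanswered in the thread. As a hedged sketch: the Hive connector expects the second layout (snapshot=partitionA/ directories), an external_location pointing at the table root without a wildcard, and the partitions to be registered in the metastore, for example via sync_partition_metadata. The schema name 'trino' is taken from the question; the session catalog is assumed to be the Hive connector catalog (prefix the procedure with the catalog name otherwise):

create table trino.partitioned_jsons (a INTEGER, b INTEGER, snapshot VARCHAR)
with (external_location = 's3a://bucket/test/partitioned_jsons',
      format = 'JSON',
      partitioned_by = ARRAY['snapshot']);

-- expected data layout (Hive-style partition directories):
--   s3a://bucket/test/partitioned_jsons/snapshot=partitionA/file.json

-- register partition directories that already exist under the table location
CALL system.sync_partition_metadata(
  schema_name => 'trino',
  table_name  => 'partitioned_jsons',
  mode        => 'ADD');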

How can I load the same file into a Hive table using Beeline

I needed to create a large amount of test data in a Hive table. I tried the following commands, but they only insert data for one partition at a time.
Connect with Beeline:
beeline --force=true -u 'jdbc:hive2://<host>:<port>/<hive database name>;ssl=true;user=<username>;password=<pw>'
Create a partitioned table:
CREATE TABLE p101(
Name string,
Age string)
PARTITIONED BY(fi string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
I created an ins.csv file with data and copied it to an HDFS location; its contents are as follows:
Name,Age
aaa,33
bbb,22
ccc,55
Then I tried to load the same file for multiple partition IDs with the following command:
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE p101 PARTITION(fi=1,fi=2,fi=3,fi=4,fi=5);
but it loads records only for partition fi=5.
You can only specify one partition for each LOAD DATA or INSERT statement.
What you can do in order to have different partitions is to add the partition column to your CSV file, like this:
Name,Age,fi
aaa,33,1
bbb,22,2
ccc,55,3
Hive will automatically recognize that this is the partition column.
LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE tmp.p101;
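Whether LOAD DATA can derive partitions from a trailing column depends on the Hive version. A more broadly supported route (not from the original answer) is to load the file into a plain staging table and then use a dynamic-partition insert; p101_staging below is a hypothetical staging table:

-- hypothetical staging table with the same delimited layout as ins.csv
CREATE TABLE p101_staging(
Name string,
Age string,
fi string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

LOAD DATA INPATH 'hdfs_path/ins.csv' INTO TABLE p101_staging;

-- enable dynamic partitioning and fan the rows out into partitions of p101
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE p101 PARTITION(fi)
SELECT Name, Age, fi FROM p101_staging;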

Hive Managed table getting "java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)"

I have created a Hive managed table in ORC and PARQUET format. While selecting values from the table with "SELECT * FROM table_name" I get the error below.
java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
Check the DDL of the table. The table appears to be bucketed, but the underlying folders/files do not match the bucketing defined on the table.
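To confirm the mismatch, the bucketing definition can be read straight off the table (field names vary slightly by Hive version):

-- full DDL, including any CLUSTERED BY (...) INTO n BUCKETS clause
SHOW CREATE TABLE table_name;

-- the storage section lists "Num Buckets" and "Bucket Columns"
DESCRIBE FORMATTED table_name;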

Hive : Overwrite queries on s3 external tables is failing

Overwrite queries are failing on external tables whose data is located on S3. I am using Hive 1.2.
Steps to reproduce:
1) Create a file with the 3 rows below and place it at some location in S3:
a,b,c
x,y,z
c,d,e
2) Create an external table:
create external table test(col1 string, col2 string, col3 string)
row format delimited fields terminated by ',' location '<S3LocationOfAboveFile>';
3) Do an insert overwrite on this table:
insert overwrite table test select * from test order by col1;
I get an error and I see that the S3 file is deleted.
Job Submission failed with exception 'java.io.FileNotFoundException
(No such file or directory:<S3 location> )
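This question has no answer in the thread. A workaround that is commonly used for this situation (an assumption here, not from the original post) is to avoid reading and overwriting the same S3 location in one statement by staging the sorted result in an intermediate table first:

-- stage the sorted rows in a separate table first
create table test_staging as select * from test order by col1;

-- the overwrite no longer has to read the files it is about to delete
insert overwrite table test select * from test_staging;

drop table test_staging;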

Parquet file generation with Hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet file; instead I find files named like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but those files are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>;
Additional note: You can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
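As a quick verification (not from the original answer; exact strings vary by Hive version), the DESCRIBE FORMATTED output for a Parquet table should reference the Parquet SerDe and input/output formats:

-- verify that the new table really is stored as Parquet
DESCRIBE FORMATTED parquet_region;
-- expect storage entries similar to:
--   SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
--   InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat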