How to read Parquet data with partitions from AWS S3 using Presto?

I have data stored in S3 in the form of Parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of a partition's Parquet files. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned into different sections, i.e. the s3://bucket/presto folder contains multiple subfolders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table is created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks

You created the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
explicit declaration of partitions allows you to publish a partition "atomically", once you're done writing
section=a, section=b is only a convention; the partition location may be different. In fact, a partition can be located in a different S3 bucket, or in different storage
To auto-discover partitions in a case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto's Hive connector.
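For example, a minimal sketch (assuming your session is already using the Hive catalog that holds default.sample; otherwise qualify the procedure as <catalog>.system.sync_partition_metadata; the mode can be ADD, DROP, or FULL):
presto> CALL system.sync_partition_metadata('default', 'sample', 'ADD');
After the call, selecting from default.sample should return rows from all of the section=... subfolders.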

Related

Athena returns blank response for Partitioned data, what am I missing?

I have created a partitioned table. I tried two ways for my S3 bucket folder, as follows, but both ways I get no records when I query with a WHERE clause containing the partition columns.
My S3 bucket looks like the following. part*.csv is what I want to query in Athena. There are other folders at the same location alongside output, and within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following.
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS shows this partition, but no records are found with the WHERE clause.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partition columns matching the four levels of my S3 path. Also, the ALTER TABLE command had an issue with the location: I had used a Hive-format location, which was incorrect since my actual S3 layout is not in Hive format. Correcting the command to the following worked for me.
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.

Data appears as NULL in a Redshift external table while working fine on Athena

So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table are NULL). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is ORC.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters for what should be treated as NULL.
Add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns containing '' will be treated as NULL in your external table in Spectrum:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values.
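A minimal sketch of that alternative, using the same placeholder table and location as above (LazySimpleSerDe and the ',' delimiter are assumptions here, for a delimited text file):
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.null.format'='')
STORED AS TEXTFILE
LOCATION [filelocation]
TABLE PROPERTIES ('numRows'='100', 'skip.header.line.count'='1');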
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena, pointing to an S3 bucket that contained heavily nested JSON data. To access them from Redshift I set json_serialization_enable to true before my queries, to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
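For reference, the session setting mentioned above is enabled like this (a sketch; json_serialization_enable is a session-level Redshift configuration):
SET json_serialization_enable TO true;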

Parquet file generation with Hive

I'm trying to generate some Parquet files with Hive. To accomplish this I loaded a regular Hive table from some .tbl files, through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet file; instead I find file names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with a .parquet extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: You can also create a new table from a source table using the below query:
CREATE TABLE new_test STORED AS PARQUET AS select * from source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
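For example (a hedged sketch; the exact output layout varies by Hive version), you can confirm the storage format of such a table:
hive> DESCRIBE FORMATTED parquet_region;
The SerDe Library line should show org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, and the InputFormat/OutputFormat lines should show the corresponding Mapred Parquet formats.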

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should be in this sequence:
int,string
Here your file contents are in the below sequence:
string,int
Change your file to the below:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the ALTER TABLE command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a CSV file corresponding to the structure of the table, the data will be loaded automatically and you can already use SELECT queries to see the data.
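As a quick check (a hedged example, using the partition added above):
select * from google_analytics where date_string = '2016-09-06' limit 10;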
Solved!

Buckets are not being created in Hadoop Hive

I'm trying to create a bucketed table in Hive by using the following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully: when I load data into this table, that also succeeds, and all data is shown when using select * from emp.
However, on HDFS only one table directory is created, with a single file containing all the data. That is, there is no folder for the records of a specific country.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have mentioned 2 buckets; similarly, you can mention any number you want.
Still, you are not done!
After that, while loading data into the table, you also have to set the below property in Hive:
set hive.enforce.bucketing = true;
That should do it.
After this you should be able to see that the number of files created under the table directory is the same as the number of buckets mentioned in the DDL statement.
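You can verify this from the Hive CLI (a hedged sketch; /user/hive/warehouse/emp is the default warehouse location for the emp table and may differ in your setup):
hive> dfs -ls /user/hive/warehouse/emp;
With 2 buckets you would expect two data files, typically named something like 000000_0 and 000001_0.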
Bucketing doesn't create HDFS folders; rather, if you want a separate folder to be created for each country, then you should PARTITION (see the sketch below).
Please go through Hive partitioning and bucketing in detail.
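A minimal sketch of the partitioned variant (hypothetical table name; column types taken from the question, and the two set statements enable dynamic partition inserts):
create table emp_partitioned( id int, name string)
partitioned by( country string)
row format delimited
fields terminated by ','
stored as textfile;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert into table emp_partitioned partition (country)
select id, name, country from emp;
After this, HDFS will contain one country=... subfolder per distinct country value.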