Trino S3 partitions folder structure

I do not understand what folder structure Trino needs in order to create a table from existing files. I am using S3 with a Hive metastore.
My JSON file:
{"a":1,"b":2,"snapshot":"partitionA"}
Create table command:
create table trino.partitioned_jsons (a INTEGER, b INTEGER, snapshot varchar) with (external_location = 's3a://bucket/test/partitioned_jsons/*', format = 'JSON', partitioned_by = ARRAY['snapshot'])
What I have tried:
Store JSON file in s3://bucket/test/partitioned_jsons/partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot/partitionA.json
But all of them return just an empty table.
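For what it's worth, a minimal sketch of what Trino's Hive connector conventionally expects: Hive-style snapshot=<value>/ folders, external_location pointing at the parent directory without a wildcard, and an explicit partition registration afterwards via the sync_partition_metadata procedure mentioned in the Presto answer further down. The hive.default catalog/schema names are assumptions, not taken from the question.
-- Assumed layout: s3a://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
-- Assumed catalog.schema: hive.default (adjust to your setup)
CREATE TABLE hive.default.partitioned_jsons (
    a INTEGER,
    b INTEGER,
    snapshot VARCHAR
)
WITH (
    external_location = 's3a://bucket/test/partitioned_jsons', -- parent folder, no wildcard
    format = 'JSON',
    partitioned_by = ARRAY['snapshot']
);
-- Partitions are not auto-discovered; register the ones already present in S3:
CALL hive.system.sync_partition_metadata('default', 'partitioned_jsons', 'ADD');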

Related

Using logic to assign values to an external table

I am using the Greenplum external table feature to create a temporary readable table that loads all files from a specific folder in S3. Here is how the external table is created:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_data text)
LOCATION ('s3://some_s3_location/2021/06/17/ config=some_config.conf')
FORMAT 'TEXT' (DELIMITER 'OFF' NULL E'' ESCAPE E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 100 PERCENT;
with file names in S3 like:
appData_A2342342342.json
(where A2342342342 is the ID of the JSON file)
The full path based on the above example would be:
s3://some_s3_location/2021/06/17/appData_A2342342342.json
Note that the ext table only contains a single column (the JSON data).
I would like to have another column for the ID of the JSON file (in this case A2342342342)
How would I set up a create ext table statement so that I can grab the file name from S3 and parse it for the JSON_ID column value?
Something like this:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_id text, json_data text)

How to rename a column when creating an external table in Athena based on Parquet files in S3?

Does anybody know how to rename a column when creating an external table in Athena based on Parquet files in S3?
The Parquet files I'm trying to load have both a column named export_date as well as an export_date partition in the s3 structure.
An example file path is: 's3://bucket_x/path/to/data/export_date=2020-08-01/platform=platform_a'
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
So what I would like to do, is to rename the export_date column to export_date_exp. The AWS documentation indicates that:
To make Parquet read by index, which will allow you to rename
columns, you must create a table with parquet.column.index.access
SerDe property set to true.
https://docs.amazonaws.cn/en_us/athena/latest/ug/handling-schema-updates-chapter.html#parquet-read-by-name
But the following code does not load any data in the export_date_exp column:
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date_exp` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'parquet.column.index.access'='true')
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
This question has been asked already, but did not receive an answer:
How to rename AWS Athena columns with parquet file source?
I am asking again because the documentation explicitly says it is possible.
As a side note: in my particular use case I can simply skip loading the export_date column, as I've learned that reading Parquet by name does not require you to load every column. Since I don't actually need export_date as a data column, omitting it avoids the conflict with the partition name.
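To illustrate that side note, here is a rough sketch of the workaround: keep the default read-by-name behaviour, drop export_date from the data columns altogether, and keep it only as a partition key. Names are copied from the question; treat this as a sketch, not a verified fix.
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
-- Partitions still need to be loaded afterwards, e.g.:
-- MSCK REPAIR TABLE user_john_doe.new_table;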

Hive overwrite table with new s3 location

I have a Hive external table pointing to a location on S3. My requirement is that I will be uploading a new file to this S3 location every day, and the data in my Hive table should be overwritten.
Every day my script will create a folder under 's3://employee-data/' and place a csv file there.
eg. s3://employee-data/20190812/employee_data.csv
Now I want my Hive table to pick up this new file under the new folder every day and overwrite the existing data. I can get the folder name ('20190812') from my ETL.
Can someone help?
I tried ALTER table set location 'new location'. However, this does not overwrite the data.
create external table employee
(
name String,
hours_worked Integer
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
Set new location and the data will be accessible:
ALTER TABLE employee SET LOCATION 's3://employee-data/20190812/';
This statement points the table at the new location; nothing is being overwritten, of course.
Or alternatively make the table partitioned:
create external table employee
(
name String,
hours_worked Integer
)
partitioned by (load_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
then run ALTER TABLE employee RECOVER PARTITIONS;
and all dates will be mounted as separate partitions, which you can query using
WHERE load_date='20190812'
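One caveat, hedged: partition auto-discovery (RECOVER PARTITIONS / MSCK) generally relies on the key=value folder naming, e.g. s3://employee-data/load_date=20190812/. If the daily folders keep plain date names like 20190812/, the ETL can instead register each day explicitly; the table and folder names below come from the question.
-- Register one day's folder as a partition, pointing at its actual location:
ALTER TABLE employee ADD IF NOT EXISTS PARTITION (load_date='20190812')
LOCATION 's3://employee-data/20190812/';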

How to read parquet data with partitions from Aws S3 using presto?

I have data stored in S3 in the form of Parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of the Parquet files for one partition. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned into different sections, i.e. the s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table is created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks
You are creating the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in the "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
explicit declaration of partitions allows you to publish a partition "atomically", once you are done writing it
section=a, section=b is only a convention; the partition location may be different. In fact, a partition can be located in some other S3 bucket, or in different storage
To auto-discover partitions in a case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.
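A short sketch of how that procedure is typically called, assuming the Hive catalog is named hive and the schema is default (adjust both to your setup):
CALL hive.system.sync_partition_metadata('default', 'sample', 'ADD');
-- the mode argument is one of 'ADD' (register partitions found in storage),
-- 'DROP' (remove partitions whose folders are gone) or 'FULL' (both)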

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this I loaded a regular Hive table from some .tbl files, through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files; instead I find file names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: you can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
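On the file-size remark in the question: depending on the Hive version, Parquet may be written uncompressed unless compression is requested, which could explain output larger than the original .tbl file. A hedged variation of the CTAS above, with SNAPPY picked purely as an example:
-- 'parquet_region_snappy' is an illustrative name; parquet.compression also appears
-- as a table property in the Athena example earlier on this page
CREATE TABLE parquet_region_snappy
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM region;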