Trino S3 partitions folder structure

I do not understand what folder structure Trino needs in order to create a table from existing files. I am using S3 with a Hive metastore.
My JSON file:
{"a":1,"b":2,"snapshot":"partitionA"}
Create table command:
create table trino.partitioned_jsons (a INTEGER, b INTEGER, snapshot varchar) with (external_location = 's3a://bucket/test/partitioned_jsons/*', format = 'JSON', partitioned_by = ARRAY['snapshot'])
What I have tried:
Store JSON file in s3://bucket/test/partitioned_jsons/partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot/partitionA.json
But all of them return just an empty table.
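For what it's worth, a minimal sketch of what Trino's Hive connector conventionally expects: Hive-style snapshot=<value>/ folders, external_location pointing at the parent directory without a wildcard, and an explicit partition registration afterwards via the sync_partition_metadata procedure mentioned in the Presto answer further down. The hive.default catalog/schema names are assumptions, not taken from the question.
-- Assumed layout: s3a://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
-- Assumed catalog.schema: hive.default (adjust to your setup)
CREATE TABLE hive.default.partitioned_jsons (
    a INTEGER,
    b INTEGER,
    snapshot VARCHAR
)
WITH (
    external_location = 's3a://bucket/test/partitioned_jsons', -- parent folder, no wildcard
    format = 'JSON',
    partitioned_by = ARRAY['snapshot']
);
-- Partitions are not auto-discovered; register the ones already present in S3:
CALL hive.system.sync_partition_metadata('default', 'partitioned_jsons', 'ADD');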

Related

Using logic to assign values to an external table

I am using the Greenplum external table feature to create a temporary readable table that loads all files from a specific folder in S3. Here is how the external table is created:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_data text)
LOCATION ('s3://some_s3_location/2021/06/17/ config=some_config.conf')
FORMAT 'TEXT' (DELIMITER 'OFF' NULL E'' ESCAPE E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 100 PERCENT;
with file names in S3 like:
appData_A2342342342.json
(where A2342342342 is the ID of the JSON file)
The full path based on the above example would be:
s3://some_s3_location/2021/06/17/appData_A2342342342.json
Note that the ext table only contains a single column (the JSON data).
I would like to have another column for the ID of the JSON file (in this case A2342342342)
How would I set up a create ext table statement so that I can grab the file name from S3 and parse it for the JSON_ID column value?
Something like this:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_id text, json_data text)

How to rename a column when creating an external table in Athena based on Parquet files in S3?

Does anybody know how to rename a column when creating an external table in Athena based on Parquet files in S3?
The Parquet files I'm trying to load have both a column named export_date as well as an export_date partition in the s3 structure.
An example file path is: 's3://bucket_x/path/to/data/export_date=2020-08-01/platform=platform_a'
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
So what I would like to do, is to rename the export_date column to export_date_exp. The AWS documentation indicates that:
To make Parquet read by index, which will allow you to rename
columns, you must create a table with parquet.column.index.access
SerDe property set to true.
https://docs.amazonaws.cn/en_us/athena/latest/ug/handling-schema-updates-chapter.html#parquet-read-by-name
But the following code does not load any data in the export_date_exp column:
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date_exp` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'parquet.column.index.access'='true')
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
This question has been asked already, but did not receive an answer:
How to rename AWS Athena columns with parquet file source?
I am asking again because the documentation explicitly says it is possible.
As a side note: in my particular use case I can simply skip loading the export_date column, as I've learned that reading Parquet by name does not require you to load every column. Since I don't actually need export_date as a data column, omitting it avoids the conflict with the partition name.
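To illustrate that side note, here is a rough sketch of the workaround: keep the default read-by-name behaviour, drop export_date from the data columns altogether, and keep it only as a partition key. Names are copied from the question; treat this as a sketch, not a verified fix.
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
-- Partitions still need to be loaded afterwards, e.g.:
-- MSCK REPAIR TABLE user_john_doe.new_table;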

Hive overwrite table with new s3 location

I have a Hive external table pointing to a location on S3. My requirement is that I will be uploading a new file to this S3 location every day, and the data in my Hive table should be overwritten.
Every day my script will create a folder under 's3://employee-data/' and place a csv file there.
eg. s3://employee-data/20190812/employee_data.csv
Now I want my Hive table to pick up this new file under the new folder every day and overwrite the existing data. I can get the folder name ('20190812') from my ETL.
Can someone help?
I tried ALTER table set location 'new location'. However, this does not overwrite the data.
create external table employee
(
name String,
hours_worked Integer
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
Set new location and the data will be accessible:
ALTER TABLE employee SET LOCATION 's3://employee-data/20190812/';
This statement points the table at the new location; nothing is being overwritten, of course.
Or alternatively make the table partitioned:
create external table employee
(
name String,
hours_worked Integer
)
partitioned by (load_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://employee-data/';
then run ALTER TABLE employee RECOVER PARTITIONS;
and all dates will be mounted as separate partitions, which you can query using
WHERE load_date='20190812'
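One caveat, hedged: partition auto-discovery (RECOVER PARTITIONS / MSCK) generally relies on the key=value folder naming, e.g. s3://employee-data/load_date=20190812/. If the daily folders keep plain date names like 20190812/, the ETL can instead register each day explicitly; the table and folder names below come from the question.
-- Register one day's folder as a partition, pointing at its actual location:
ALTER TABLE employee ADD IF NOT EXISTS PARTITION (load_date='20190812')
LOCATION 's3://employee-data/20190812/';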

How to read parquet data with partitions from Aws S3 using presto?

I have data stored in S3 in the form of Parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of the Parquet files for one partition. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned into different sections, i.e. the s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table is created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks
You are creating the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in the "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
explicit declaration of partitions allows you to publish a partition "atomically", once you are done writing it
section=a, section=b is only a convention; the partition location may be different. In fact, a partition can be located in some other S3 bucket, or in different storage
To auto-discover partitions in a case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.
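A short sketch of how that procedure is typically called, assuming the Hive catalog is named hive and the schema is default (adjust both to your setup):
CALL hive.system.sync_partition_metadata('default', 'sample', 'ADD');
-- the mode argument is one of 'ADD' (register partitions found in storage),
-- 'DROP' (remove partitions whose folders are gone) or 'FULL' (both)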

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this I loaded a regular Hive table from some .tbl files, through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files; instead I find file names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: you can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
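On the file-size remark in the question: depending on the Hive version, Parquet may be written uncompressed unless compression is requested, which could explain output larger than the original .tbl file. A hedged variation of the CTAS above, with SNAPPY picked purely as an example:
-- 'parquet_region_snappy' is an illustrative name; parquet.compression also appears
-- as a table property in the Athena example earlier on this page
CREATE TABLE parquet_region_snappy
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM region;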