Trino does not automatically scan newly loaded data in the source (Hive)

I am using Trino to read partitioned Parquet files with the following two queries. One thing I observed is that Trino does not automatically scan the partition directories, meaning that when new Parquet files are loaded into a partition folder, the data does not automatically show up in the Trino table I created. I checked the Hive connector settings hive.cache.enabled and hive.file-status-cache-tables, and both are at their default values. Has anyone run into this situation before, or any idea why this happens?
CREATE TABLE IF NOT EXISTS minio.xxx.currency (
currency_code VARCHAR,
currency_name VARCHAR,
firmid VARCHAR,
source VARCHAR,
year VARCHAR,
month VARCHAR,
day VARCHAR,
time VARCHAR,
correlationid VARCHAR
)
with (
external_location = 's3a://xxx/currency/',
partitioned_by = ARRAY['firmid','source','year','month','day','time','correlationid'],
format = 'PARQUET');
CALL minio.system.sync_partition_metadata
(
schema_name=>'xxx',
table_name=>'currency',
mode=>'FULL',
case_sensitive=>false
);

Related

Athena returns a blank response for partitioned data, what am I missing?

I have created a partitioned table. I tried two layouts for my S3 bucket folder, shown below, but with both I get no records found when I query with a WHERE clause on the partition columns.
My S3 bucket looks like the following. part*.csv is what I want to query in Athena. There are other folders at the same level alongside output, and within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following:
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS shows this partition, but no records are found with the WHERE clause.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partition columns matching my S3 path depth. Also, the ALTER TABLE command had an issue with the location: I had used a Hive-style (key=value) location, which was incorrect since my actual S3 layout is not in Hive format. Correcting the command to the following worked for me:
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
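For completeness, a query against the corrected Version 2 table then filters on all four partition columns; a sketch, using the partition values registered above:
SELECT * FROM mytable_trial1
WHERE casename = 'ABC-CASE'
  AND report = 'report'
  AND id = 'f78dea49-2c3a-481b-a1eb-5169d2a97747'
  AND foldername = 'output';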

How to read Parquet data with partitions from AWS S3 using Presto?

I have data stored in S3 in the form of Parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of a Parquet file including the partition. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned into different sections, i.e. the s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table is created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks
You created the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in the "Hive table format" partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
Explicit declaration of partitions allows you to publish a partition "atomically", once you're done writing it.
section=a, section=b is only a convention; the partition location may be different. In fact, a partition can be located in some other S3 bucket, or on different storage.
To auto-discover partitions in a case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.
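A minimal sketch of the call, assuming your Hive catalog is named hive (adjust the catalog, schema, and mode to your setup; mode can be ADD, DROP, or FULL):
CALL hive.system.sync_partition_metadata(
    schema_name => 'default',
    table_name => 'sample',
    mode => 'FULL'
);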

Data appears as NULL in a Redshift external table while working fine on Athena

So I'm trying to run the following simple query on Redshift Spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table are null). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift are using the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is ORC.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters that define what should be treated as NULL.
Add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL in your external table in Spectrum:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values.
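For example, a minimal sketch of that alternative, assuming LazySimpleSerDe over comma-delimited text files (the table, column, and location names are placeholders):
CREATE EXTERNAL TABLE external_schema.your_table_name (
  your_column varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',', 'serialization.null.format' = '')
STORED AS TEXTFILE
LOCATION 's3://your-bucket/your-prefix/';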
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena, pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
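For reference, a sketch of a Spectrum external table with nested columns, following the linked tutorial (the schema, table, and column names are illustrative; Spectrum supports nested data in Parquet, ORC, JSON, and Ion):
CREATE EXTERNAL TABLE spectrum_schema.customers (
  id     int,
  name   struct<given:varchar(20), family:varchar(20)>,
  phones array<varchar(20)>
)
STORED AS PARQUET
LOCATION 's3://your-bucket/customers/';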

Hive: load an external table from a dynamic (partitioned) location

I need to create a Hive table (external) to load data generated by another process. It needs to be partitioned by date, but the problem is that there is a random string in the path.
Example input paths :
/user/hadoop/output/FDQM9N4RCQ3V2ZYT/20170314/
/user/hadoop/output/FDPWMUVVBO2X74CA/20170315/
/user/hadoop/output/FDPGNC0ENA6QOF9G/20170316/
.........
.........
Here the 4th field in the directory is dynamic (it cannot be guessed). Each of these directories will have multiple .gz files.
What location would I give while creating the table?
CREATE EXTERNAL TABLE user (
userId BIGINT,
type INT,
date String
)
PARTITIONED BY (date String)
LOCATION '/user/hadoop/output/';
Is this correct? If so, how do I partition it based on the date (the last field in the directory)?
Since you are not using the partition naming convention, you'll have to add each partition manually.
The table location does not matter, but for clarity leave it as it is now.
I would recommend using the date type for the partition, or at least the ISO format, YYYY-MM-DD.
I would not use date as a column name (nor int, string, etc.).
PARTITIONED BY (dt date)
alter table user add if not exists partition (dt=date '2017-03-14') location '/user/hadoop/output/FDQM9N4RCQ3V2ZYT/20170314';
alter table user add if not exists partition (dt=date '2017-03-15') location '/user/hadoop/output/FDPWMUVVBO2X74CA/20170315';
alter table user add if not exists partition (dt=date '2017-03-16') location '/user/hadoop/output/FDPGNC0ENA6QOF9G/20170316';
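Putting those recommendations together, a sketch of the full table definition (the non-partition columns come from the question; adjust the ROW FORMAT to whatever the .gz files actually contain):
CREATE EXTERNAL TABLE `user` (
  userId BIGINT,
  type INT
)
PARTITIONED BY (dt date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/output/';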

Read multiple files in Hive table by date range

Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming all the files follow the same schema, I would suggest that you store them with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write an alter table yourtable add partition ... query for each one, you can simply use the repair command, which will automatically add the partitions.
msck repair table yourtable
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
Loading files into tables
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03')
Moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
load partition with
alter table tableName add partition (dt='2016-07-31');
See Add partitions
In spark-shell, read the Hive table
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it
spark.sql(sql)
3. Load the data
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table
val df = spark.sql("select * from user_info")