In our project we use AWS Glue Catalog tables with the data stored on S3 as Parquet files. We apply ETL transformations with Spark-SQL, which reads data from these tables and derives the final result set by creating multiple temporary views at every step. So, how do we collect statistics on these tables that Spark-SQL can use to generate better plans?
I know that in Hive we can use the ANALYZE TABLE mytable COMPUTE STATISTICS; command to collect statistics. However, when I try the same command on the Glue Catalog tables, it throws an error: Can not create path from an empty string.
Can anyone please let me know how to collect stats on these Glue tables?
Sample Table DDL:
CREATE TABLE mydatabase.mytable
(
  empid INT,
  emp_name VARCHAR(50),
  emp_age VARCHAR(10)
)
STORED AS PARQUET
LOCATION 's3://some_path/somefolder';
PS: Please ignore syntax errors in DDL. Please let me know if additional information is necessary
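For reference, this is roughly the statement I would expect to run from Spark-SQL, assuming its own ANALYZE TABLE syntax applies to Glue Catalog tables (table and column names are from the sample DDL above):
ANALYZE TABLE mydatabase.mytable COMPUTE STATISTICS;
ANALYZE TABLE mydatabase.mytable COMPUTE STATISTICS FOR COLUMNS empid, emp_name, emp_age;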
Related
I have an external table with a complex datatype, map(string, array(struct)), and I'm able to select and query this external table without any issue.
However, when I try to load this data into a managed table, it runs forever. Is there a best approach for loading this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to the warehouse dir restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load the data into another managed table, you can simply copy the files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
);
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This works most efficiently and is much faster than insert..select:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
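For example, with the managed table created above (FOR COLUMNS without a column list computes column statistics for all columns):
ANALYZE TABLE db.tbl_managed COMPUTE STATISTICS;
ANALYZE TABLE db.tbl_managed COMPUTE STATISTICS FOR COLUMNS;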
Is it even possible to add a partition to an existing table in Athena that currently has no partitions? If so, please also include the syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads the data onto its own disks or into memory, Athena works by scanning data in S3. This is how you enjoy the scale and low cost of the service.
What this means is that you have to keep your data in different folders with a meaningful structure, such as year=2019, year=2020, and make sure that the data for each year is all, and only, in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
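As a sketch (the target bucket path and column names are placeholders, and in an Athena CTAS the partition columns must come last in the SELECT list):
CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
)
AS SELECT col1, col2, year
FROM table1;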
I tried Redshift Spectrum. Both of the queries below completed without any error message, but I can't get the right count for the file uploaded to S3; it just returns a row count of 0, even though the file has over 3 million records.
-- Create External Schema
CREATE EXTERNAL SCHEMA spectrum_schema FROM data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
create external database if not exists;
-- Create External Table
create EXTERNAL TABLE spectrum_schema.principals(
tconst VARCHAR (20),
ordering BIGINT,
nconst VARCHAR (20),
category VARCHAR (500),
job VARCHAR (500),
characters VARCHAR(5000)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://xxxxx/xxxxx/'
I also tried the option 'stored as parquet'; the result was the same.
My IAM role has "s3:*", "athena:*", "glue:*" permissions, and the Glue table was created successfully.
And just in case, I confirmed that the same S3 file could be copied into a table in a Redshift cluster successfully, so I concluded the file/data has no issue by itself.
Is there something wrong with my procedure or query? Any advice would be appreciated.
Since your DDL is not scanning any data, the issue seems to be that the table definition does not match the actual data in S3. To figure this out, you can simply generate the table using an AWS Glue crawler.
Once the crawler has created the table, you can compare its properties in the Glue Data Catalog with the table you created manually with DDL. That will show you the difference and what is missing from your hand-written definition.
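As an additional sketch (schema and table names are taken from the question), you can also check what Redshift itself has registered for the external table and compare it with the crawler-generated one:
SELECT schemaname, tablename, location, input_format, serialization_lib, parameters
FROM svv_external_tables
WHERE schemaname = 'spectrum_schema';
SELECT columnname, external_type, part_key
FROM svv_external_columns
WHERE schemaname = 'spectrum_schema' AND tablename = 'principals';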
I want to copy an external table's schema and all its partition info from one database to another in Hive and in Presto (AWS Athena). To be clear, I don't want to copy any underlying data, just the metadata.
What is the best way to do this?
In Athena you could generate the DDL with SHOW CREATE TABLE database1.tablename and just execute this statement, replacing database1 with database2. It will copy the schema but not the data or the partitions. To populate the partitions you should execute MSCK REPAIR TABLE on database2.tablename. The same works for Presto.
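For example (database and table names are placeholders):
-- 1. Dump the DDL of the source table
SHOW CREATE TABLE database1.tablename;
-- 2. Re-run the generated CREATE EXTERNAL TABLE statement, replacing database1 with database2
-- 3. Load the partition metadata from the table's S3 location
MSCK REPAIR TABLE database2.tablename;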
If you are unable to populate the partitions with MSCK REPAIR TABLE, you could copy them with the Glue API:
import boto3

glue = boto3.client('glue')
paginator = glue.get_paginator('get_partitions')

DB_NAME_SRC = 'src'
DB_NAME_DST = 'dst'
TABLE = 'tablename'

# Read every partition definition from the source table
partitions = []
for page in paginator.paginate(DatabaseName=DB_NAME_SRC, TableName=TABLE):
    for partition in page['Partitions']:
        # Drop fields that are not valid in a PartitionInput structure
        del partition['DatabaseName']
        del partition['TableName']
        del partition['CreationTime']
        partitions.append(partition)

print("Got %d partitions" % len(partitions))

# batch_create_partition accepts at most 100 partitions per call, so write in chunks
for i in range(0, len(partitions), 100):
    glue.batch_create_partition(DatabaseName=DB_NAME_DST, TableName=TABLE,
                                PartitionInputList=partitions[i:i + 100])
In PrestoSQL you can use the CREATE TABLE ... LIKE syntax. See https://prestosql.io/docs/current/sql/create-table.html.
CREATE TABLE bigger_orders (
LIKE orders INCLUDING PROPERTIES
)
I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or multiple partitions on top of the files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we currently handle Hive is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions on non-external tables.
I have a few questions:
is there support for creating external tables with partition(s)? Can you please point me in the right direction?
is loading the data into BigQuery preferred to having it stored in GS avro files?
if yes, how would we deal with schema evolution?
thank you very much in advance
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
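As a sketch of both paths (dataset, table, and column names are placeholders; newer BigQuery releases also expose column addition through DDL):
-- Recreate the table when a column's type has to change
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * REPLACE (CAST(some_col AS INT64) AS some_col)
FROM mydataset.mytable;
-- Add a new nullable column
ALTER TABLE mydataset.mytable ADD COLUMN new_col STRING;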
See also a relevant blog post about lazy data loading.