Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR? - hive

Are two hive tables (native, external) always required for querying a DynamoDB table from an AWS EMR?
I have created a native hive table (CTAS, create table as select) using an hive external table that was mapped to a DynamoDB table. My (read) query times against external tables are slow and it uses up the read throughput versus native table are fast and read throughput is not consumed.
My questions:
Is this a standard practice/best practice i.e., create an external table mapped to a dynamodb table and then create a CTAS and query against CTAS for all read query use cases?
Where or how GSI's on dynamodb come into picture on hive side of things? Toward this curiosity I have tried to map my external hive table column to dynamodb GSI and some what expectedly saw NULLs.
So, back to #2 question was wondering how are GSI's used with a native or external hive table?
Thanks,

Answer is no.
However, from my observation if a hive native table data is backed (CTAS) by hive external table that is referencing a DynamoDb table: Read data is not accounted if you are querying hive native table from EMR. If you to take into account the periodic update (refresh data) of hive native table.

Related

Bigquery - Updating GCS path of Bigquery external tables

I have a requirement where I might have to update the Bigquery External tables on a periodic basis.
The GCS location has timestamp for every incremental run, I would like to update to the latest timestamp folder as the path of External table.
One way i see is only dropping the table and creating again by pointing it to latest folder. But, is there any other way to update it without dropping the table
As suggested by #Samuel , you can use the SQL statement CREATE or REPLACE EXTERNAL TABLES for your requirement. Scheduled queries support DML and DDL statements which can be used to create the new tables. You can use the below mentioned query parameter to create the table according to your schedule :
My_database_name.my_table_name.my_results_{run_date}
For more information you can refer to this documentation.

How to createOrReplaceTempView in Delta Lake?

I want to use Delta Lake tables in my Hive Metastore on Azure Data Lake Gen2 as basis for my company's lakehouse.
Previously, I used "regular" hive catalog tables. I would load data from parquet into a spark dataframe, and create a temp table using df.CreateOrReplaceTempView("TableName"), so I could use Spark SQL or %%sql magic to do ETL. After doing this, I can use spark.sql or %%sql on the TableName. When I was done, I would write my tables to the hive metastore.
However, what If I don't want to perform this saveAsTable operation, and write to my Data Lake? What would be the best way to perform ETL with SQL?
I know I can persist Delta Tables in the Hive Metastore through a multitude of ways, for instance by creating a Managed catalog table through df.write.format("delta").saveAsTable("LakeHouseDB.TableName")
I also know that I can create a DeltaTable object through the DeltaTable(spark, table_path_data_lake), but then I can only use the Python API and not sql.
Does there exist some equivalent of CreateOrReplaceTempView(), or is there a better way to achieve ETL with SQL without 'writing' to the data lake first?
However, what If I don't want to perform this saveAsTable operation, and write to my Data Lake? What would be the best way to perform ETL with SQL?
Not possible with Delta Lake since it relies heavily on a transaction log (_delta_log) under the data directory of a delta table.

AWS S3 - Inserting into bucketed ORC table

I'm looking at storing data in S3 in ORC format for querying with Athena.
I want to partition the data like so ...
.../year=2019/month=7/
... and bucketing the data further by id (each id will have multiple records for each month, there are lots of id's)
I want to be able to insert new data into this structure daily... I understand that I can't use the INSERT INTO statement from Athena because bucketed tables are not supported.
What would be the best way to insert data daily into a table of this structure? Is it even possible to do with bucketed data?
Cheers
Presto allows inserts into existing partitions of bucketed partitioned tables since Presto 312. If Athena does not support this, you can very easily run a Presto cluster yourself, e.g. using Starburst Presto AWS integration (I can recommend this for other reasons too, as it can be way cheaper than using Athena if you run more than just few queries. Disclaimer: I'm from Starburst)

unioning tables from ec2 with aws glue

I have two mysql databases each on their own ec2 instance. Each database has a table ‘report’ under a schema ‘product’. I use a crawler to get the table schemas into the aws glue data catalog in a database called db1. Then I’m using aws glue to copy the tables from the ec2 instances into an s3 bucket. Then I’m querying the tables with redshift. I get the external schema in to redshift from the aws crawler using the script below in query editor. I would like to union the two tables together in to one table and add a column ’source’ with a flag to indicate the original table each record came from. Does anyone know if it’s possible to do that with aws glue during the etl process? Or can you suggest another solution? I know I could just union them with sql in redshift but my end goal is to create an etl pipeline that does that before it gets to redshift.
script:
create external schema schema1 from data catalog
database ‘db1’
iam_role 'arn:aws:iam::228276743211:role/madeup’
region 'us-west-2';
You can create a view that unions the 2 tables using Athena, then that view will be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1,cole2,col3 from db1.mysql_table_1
union all
SELECT col1,cole2,col3 from db1.mysql_table_2
;
run the above using Athena (not Redshift)

create BigQuery external tables partitioned by one/multiple columns

I am porting a java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes avro files to hdfs and then creates Hive external tables with one/multiple partitions on top of the files.
I understand Big Query only supports date/timestamp partitions for now, and no nested partitions.
The way we now handle hive is that we generate the ddl and then execute it with a rest call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is however support for partitions for non external tables.
I have a few questions questions:
is there support for creating external tables with partition(s)? Can you please point me in the right direction
is loading the data into BigQuery preferred to having it stored in GS avro files?
if yes, how would we deal with schema evolution?
thank you very much in advance
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
See also a relevant blog post about lazy data loading.