unioning tables from ec2 with aws glue - amazon-s3

I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog in a database called db1. Then I'm using AWS Glue to copy the tables from the EC2 instances into an S3 bucket, and I'm querying the tables with Redshift. I get the external schema into Redshift from the Glue Data Catalog using the script below in the query editor. I would like to union the two tables into one table and add a column 'source' with a flag to indicate the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is to create an ETL pipeline that does that before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';

You can create a view that unions the two tables using Athena, and that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2
;
Run the above using Athena (not Redshift).
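If you do want the union and the 'source' flag to happen inside the Glue ETL job itself, as the question asks, it can be done with a few lines of PySpark in the job script. Below is a minimal sketch, assuming the two tables are registered in the Glue catalog database db1 as mysql_table_1 and mysql_table_2 and writing Parquet to a placeholder path s3://my-bucket/combined/ (adjust all names to your setup):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

glue_context = GlueContext(SparkContext.getOrCreate())

# Read both catalog tables as Spark DataFrames (placeholder database/table names)
df1 = glue_context.create_dynamic_frame.from_catalog(
    database="db1", table_name="mysql_table_1").toDF()
df2 = glue_context.create_dynamic_frame.from_catalog(
    database="db1", table_name="mysql_table_2").toDF()

# Tag each record with the table it came from, then union the two DataFrames
combined = (df1.withColumn("source", lit("mysql_table_1"))
            .unionByName(df2.withColumn("source", lit("mysql_table_2"))))

# Write the combined data back to S3 as Parquet for Spectrum/Athena to query
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(combined, glue_context, "combined"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/combined/"},
    format="parquet")

Crawl (or register) the combined S3 location once, and the unioned table, including the source column, is available in Redshift Spectrum without any SQL-side union.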

Related

Update table in Athena

I have a table in Athena created from S3. I wanted to update the column values using the UPDATE command. Is UPDATE not supported in Athena?
Is there any other way to update the table?
Thanks
Athena only supports External Tables, which are tables created on top of some data on S3. Since the S3 objects are immutable, there is no concept of UPDATE in Athena. What you can do is create a new table using CTAS or a view with the operation performed there, or maybe use Python to read the data from S3, then manipulate it and overwrite it.
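For the last suggestion (rewriting the data with Python), here is a minimal sketch using the awswrangler library, assuming the table is stored as Parquet; the path, database, table, and column names are all placeholders:

import awswrangler as wr

path = "s3://my-bucket/my-table/"  # placeholder S3 location of the table data

# Read the current data, apply the "update" in pandas, and overwrite the dataset
df = wr.s3.read_parquet(path)
df.loc[df["status"] == "old_value", "status"] = "new_value"  # example column update

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="my_database",  # placeholder Glue/Athena database
    table="my_table",        # placeholder table name
)

Because S3 objects are immutable, this is a full rewrite of the data rather than an in-place UPDATE; for large tables a CTAS in Athena is usually cheaper.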

cannot create a view in redshift spectrum external schema

I am facing an issue in creating a view in an external schema on a spectrum external table. Below is the script I am using to create the view
create or replace view external_schema.test_view as
select id, name from external_schema.external_table with no schema binding;
I'm getting the error below:
ERROR: Operations on local objects in external schema are not enabled.
Please help in creating a view on a Spectrum external table.
External tables are created in an external schema. An Amazon Redshift external schema references a database in an external Data Catalog in AWS Glue or Amazon Athena, or a database in a Hive metastore such as one on Amazon EMR.
External schemas are not stored in the Redshift cluster; they are looked up from their sources. External tables are also read-only for the same reason.
As a result, you will not be able to bind a view that you are creating to a schema that is not stored in the cluster. You can create a view on top of external tables (using the WITH NO SCHEMA BINDING clause), but the view will reside in a schema local to Redshift.
TL;DR Redshift doesn’t support creating views in external schemas yet, so the view can only reside in a schema local to Redshift.
Replace external_schema with internal_schema as follows:
create or replace view internal_schema.test_view as
select id, name from external_schema.external_table with no schema binding;

Mapping AWS glue table columns to target RDS instance table columns

I have created a Glue job that takes data from an S3 bucket and inserts it into an **RDS Postgres instance**.
In the S3 bucket I have created different folders (partitions).
Can I map different columns in different partitions to same target RDS instance?
When you say partition in S3, is it indicated using the Hive style? e.g. bucket/folder1/folder2/partition1=xx/partition2=xx/partition3=yy/..
If so, you shouldn't be storing data with different structures of information in S3 partitions and then mapping them to a single table. However, if it's just data in different folders, like s3://bucket/folder/folder2/folder3/.., and these are genuinely different datasets, then yes, it is possible to map them to a single table. You cannot do it via the UI, though. You will need to read these datasets as separate dynamic/data frames, join them using a key in Glue/Spark, and load them to RDS (see the sketch below).
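A rough sketch of that approach in a Glue job script, assuming CSV data with headers, a shared key column id, and a Glue JDBC connection to the Postgres instance; all paths, column, connection, and table names are placeholders:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read each S3 folder as its own DynamicFrame (placeholder paths)
frame_a = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/folder1/"]},
    format="csv", format_options={"withHeader": True})
frame_b = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/folder2/"]},
    format="csv", format_options={"withHeader": True})

# Join the two datasets on a shared key column
joined = frame_a.toDF().join(frame_b.toDF(), on="id", how="inner")

# Load the joined result into the target Postgres table via a Glue JDBC connection
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(joined, glue_context, "joined"),
    catalog_connection="my-postgres-connection",
    connection_options={"dbtable": "target_table", "database": "mydb"})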

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two Hive tables (native, external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My (read) queries against the external table are slow and consume read throughput, whereas queries against the native table are fast and do not consume read throughput.
My questions:
1. Is it a standard/best practice to create an external table mapped to a DynamoDB table, then create a native table with CTAS and query against it for all read use cases?
2. Where or how do GSIs on DynamoDB come into the picture on the Hive side of things? Out of curiosity I tried mapping an external Hive table column to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question #2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no, two tables are not always required.
However, from my observation: if a native Hive table is populated (via CTAS) from a Hive external table that references a DynamoDB table, read throughput is not consumed when you query the native table from EMR. You do have to take into account the periodic update (data refresh) of the native table, though.

using glue to get data from ec2 mysql to redshift

I'm trying to pull a table from a MySQL database on an EC2 instance through to S3 to query in Redshift. My current pipeline is: I crawl the MySQL database table with an AWS Glue crawler to get the schema into the Data Catalog. Then I set up an AWS Glue ETL job to pull the data into an S3 bucket. Then I crawl the data in the S3 bucket with another crawler to get the schema for the S3 data into the Data Catalog, and then run the script below in the Redshift query window to pull the schema into Redshift. It seems like a lot of steps. Is there a more efficient way to do this? For example, is there a way to re-use the schema from the first crawler so I don't have to crawl the data twice? It's the same table and columns.
script:
create external schema schema1
from data catalog database 'database1'
iam_role 'arn:aws:iam::228276746111:role/sfada'
region 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;