Using Glue to get data from EC2 MySQL to Redshift

I'm trying to pull a table from a MySQL database on an EC2 instance through to S3 so I can query it in Redshift. My current pipeline is: I crawl the MySQL table with an AWS Glue crawler to get its schema into the Data Catalog, then run an AWS Glue ETL job to copy the data into an S3 bucket, then crawl the S3 data with a second crawler to get its schema into the Data Catalog, and finally run the script below in the Redshift query editor to pull the schema into Redshift. That seems like a lot of steps. Is there a more efficient way to do this? For example, is there a way to reuse the schema from the first crawler so I don't have to crawl the same data twice? It's the same table and columns.
script:
create external schema schema1
from data catalog
database 'database1'
iam_role 'arn:aws:iam::228276746111:role/sfada'
region 'us-west-2'
create external database if not exists;
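Once the external schema exists, the tables that the crawler registered in database1 can be queried from Redshift through Spectrum. A minimal sketch, assuming the crawled S3 data shows up in the catalog as a table named table1 (a hypothetical name):
-- table1 is a placeholder for whatever table the crawler created
select count(*)
from schema1.table1;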

Related

Query S3 Bucket With Amazon Athena and modify values

I have an S3 bucket with 500 CSV files that are identical except for the number values in each file.
How do I write a query that takes dividendsPaid, makes it positive for each file, and sends the result back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you)
Use CREATE TABLE AS to select data from the table and write it to a different location in S3. The command can include an SQL SELECT statement that modifies the data (flipping the negative values), as sketched below.
See: Creating a table from query results (CTAS) - Amazon Athena
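A minimal CTAS sketch of that second step, assuming the crawler registered the CSV files as a table dividends_raw in a database finance_db, that each row also carries a ticker column, and that the rewritten output should land under a hypothetical S3 prefix (all of these names are assumptions, not from the question):
CREATE TABLE finance_db.dividends_fixed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-output-bucket/dividends_fixed/'
) AS
SELECT
    ticker,                              -- hypothetical pass-through column
    ABS(dividendsPaid) AS dividendsPaid  -- make the negative values positive
FROM finance_db.dividends_raw;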

Update Athena Table from 2 external tables in Athena from s3

I am relatively new to Athena and S3.
I have an S3 bucket which contains 2 folders, each with CSV files in it. I have created 2 external tables in Athena, one for each folder.
I want to create a final table in Athena which joins the two and picks up more rows automatically as more files are added to the S3 bucket. Could you please advise the best way to get the output needed?
I have tried "create table from query" in Athena, but that table remains static as I upload more files to S3 and doesn't update.
For this use case I would suggest creating a view in Athena; unlike a table created with CTAS, a view is re-evaluated against S3 every time it is queried, so new files are picked up automatically. You can read more about views in the Athena documentation.
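A minimal sketch, assuming the two external tables are called folder1_table and folder2_table in a database mydb and share an id column to join on (all of these names are assumptions):
CREATE OR REPLACE VIEW mydb.final_view AS
SELECT a.id, a.col_from_folder1, b.col_from_folder2
FROM mydb.folder1_table a
JOIN mydb.folder2_table b ON a.id = b.id;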

How to read data partitions in S3 from Trino

I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data, with all of its partitions, into S3. I have a specified Avro schema, which I put on the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema on the local file system.
The table is created.
Normally, I should then be able to query my data and its partitions in S3 from Trino.
trino> select * from hive.default.my_table;
This returns only the column names.
trino> select * from hive.default."my_table$partitions";
This returns only the names of the partition columns.
Could you please suggest a solution for how I can read the data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to list its partitions, it returns OK but displays nothing. I think that with Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; the partition metadata also has to be created. Normally you can have folders that are not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
This will create the partition metadata in the Hive metastore, and the partitions will become available.
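Applied to the table from the question (my_table), the repair plus a quick check that the partitions are now mounted would look like this in Hive:
MSCK REPAIR TABLE my_table;   -- registers the partition folders found under the table location
SHOW PARTITIONS my_table;     -- should now list the mounted partitions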
Read more details about both commands here: RECOVER PARTITIONS
Faced the same issue. Once the table is created, we need to manually sync the partition metadata to the metastore using the following Trino command:
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html
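For the table in the question (catalog hive, schema default), a concrete invocation would look roughly like this; the hive. catalog prefix is needed when hive is not the session's current catalog:
CALL hive.system.sync_partition_metadata('default', 'my_table', 'ADD');
-- the partitions should now be visible
SELECT * FROM hive.default."my_table$partitions";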

Is there a way PigActivity in AWS Pipeline can read schema from Athena tables created on S3 buckets

I have a lot of legacy Pig scripts that run on an on-prem cluster. We are trying to move to AWS Data Pipeline (PigActivity) and want these Pig scripts to read data from the S3 buckets where my source data will reside. The on-prem Pig scripts use the HCatalog loader to read Hive table schemas. So, if I create Athena tables on those S3 buckets, is there a way to read the schema from those Athena tables inside the Pig scripts, using some sort of loader similar to HCatLoader?
Current: the code below works, but I have to define the schema inside the Pig script.
%default SOURCE_LOC 's3://s3bucket/input/abc'
inp_data = LOAD '$SOURCE_LOC' USING PigStorage('\001') AS
(id: bigint, val_id: int, provision: chararray);
Want:
Read from an Athena table instead
Athena table: database_name.abc (schema as id:bigint, val_id:int, provision:string)
So I'm looking for something like the code below, so that I do not have to define the schema inside the Pig script:
%default SOURCE_LOC 'database_name.abc'
inp_data = LOAD '$SOURCE_LOC' USING athenaloader();
Is there a loader utility to read from Athena, or is there an alternative solution to my need? Please help.

Unioning tables from EC2 with AWS Glue

I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog, in a database called db1. Then I'm using AWS Glue to copy the tables from the EC2 instances into an S3 bucket, and I'm querying the tables with Redshift. I get the external schema into Redshift from the AWS crawler using the script below in the query editor. I would like to union the two tables into one table and add a column 'source' with a flag to indicate the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is to create an ETL pipeline that does that before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
You can create a view that unions the two tables using Athena; that view will then be available in Redshift Spectrum. A literal column in each branch of the union can supply the 'source' flag you want:
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3, 'mysql_table_1' AS source FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3, 'mysql_table_2' AS source FROM db1.mysql_table_2;
Run the above using Athena (not Redshift).
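A quick check of the combined view, run from Athena, confirms that both sources come through with the flag (names as in the view sketch above):
SELECT source, COUNT(*) AS record_count
FROM db1.combined_view
GROUP BY source;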