Mapping AWS Glue table columns to target RDS instance table columns - amazon-s3

I have created a Glue job that takes data from an S3 bucket and inserts it into an **RDS Postgres instance**.
In the S3 bucket I have created different folders (partitions).
Can I map different columns in different partitions to the same target RDS instance?

When you say partition in S3, do you mean Hive-style partitions? E.g. bucket/folder1/folder2/partition1=xx/partition2=xx/partition3=yy/..
If so, you shouldn't be storing data with different structures in S3 partitions and then mapping it to a single table. However, if it is just data in different folders, like s3://bucket/folder/folder2/folder3/.., and these are genuinely different datasets, then yes, it is possible to map those datasets to a single table. You cannot do it via the UI, though. You will need to read these datasets as separate dynamic frames or data frames, join them on a key in Glue/Spark, and load the result into RDS.
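The mapping-and-join logic described above can be sketched in plain Python. This is only an illustration of the idea, not Glue code: the folder names, column names, and sample records are all made up, and lists of dicts stand in for the dynamic/data frames.

```python
# Hypothetical per-folder column mappings: source column -> target RDS column.
FOLDER_MAPPINGS = {
    "folder_a": {"cust_id": "customer_id", "full_name": "name"},
    "folder_b": {"id": "customer_id", "mail": "email"},
}


def to_target_columns(rows, mapping):
    """Rename source columns to target names, dropping unmapped columns."""
    return [
        {mapping[col]: value for col, value in row.items() if col in mapping}
        for row in rows
    ]


def join_on_key(left_rows, right_rows, key):
    """Merge rows that share the same key value (a full outer join on `key`)."""
    merged = {}
    for row in left_rows + right_rows:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Made-up sample records standing in for the contents of two S3 folders.
folder_a = [{"cust_id": 1, "full_name": "Ann"}, {"cust_id": 2, "full_name": "Bob"}]
folder_b = [{"id": 1, "mail": "ann@example.com"}]

rows_a = to_target_columns(folder_a, FOLDER_MAPPINGS["folder_a"])
rows_b = to_target_columns(folder_b, FOLDER_MAPPINGS["folder_b"])
target_rows = join_on_key(rows_a, rows_b, "customer_id")
print(target_rows)
```

In an actual Glue job, the renaming step would roughly correspond to an ApplyMapping transform on each dynamic frame and the merge to a Join transform (or a Spark DataFrame join) before writing to RDS.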

Related

Query S3 Bucket With Amazon Athena and modify values

I have an S3 bucket with 500 csv files that are identical except for the number values in each file.
How do I write a query that grabs dividendsPaid, makes it positive for each file, and sends the result back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
1. Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you).
2. Use CREATE TABLE AS to select data from the table and write it to a different location in S3. The command can include an SQL SELECT statement to modify the data (changing the negatives).
See: Creating a table from query results (CTAS) - Amazon Athena
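As a rough sketch of the second step, the helper below assembles a CTAS statement that applies abs() to the dividendsPaid column. The table names, the extra column names, and the output location are hypothetical placeholders, and the column is written in lowercase on the assumption that the catalog stores it that way; adjust everything to match your own table definition.

```python
def build_ctas_query(source_table, target_table, output_location, other_columns):
    """Build an Athena CTAS statement that writes abs(dividendspaid) to a new location."""
    select_list = ", ".join(other_columns + ["abs(dividendspaid) AS dividendspaid"])
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'TEXTFILE', field_delimiter = ',',\n"
        f"      external_location = '{output_location}') AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# All names below are illustrative assumptions, not taken from the question.
query = build_ctas_query(
    source_table="mydb.dividends_raw",
    target_table="mydb.dividends_positive",
    output_location="s3://my-output-bucket/dividends-positive/",
    other_columns=["symbol", "report_date"],
)
print(query)
```

The printed statement can be pasted into the Athena query editor or submitted through whatever client you already use.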

Are the relations preserved when transferring data from a relational DB to S3?

There are options for transferring a DB snapshot from a relational database to S3 in AWS.
But S3 is an object store, so it only stores files (e.g. parquet).
Are the relationships (like keys) between tables in the relational DB somehow carried over to S3? Can queries still be made against the files in S3 that would allow joins to be made between tables?
There are no keys (such as foreign keys or primary keys) in the exported Parquet files in S3, but you can still query the exported data directly through tools like Amazon Athena or Amazon Redshift Spectrum. For more information on using Athena to read Parquet data, see Parquet SerDe in the Amazon Athena User Guide. For more information on using Redshift Spectrum to read Parquet data, see COPY from columnar data formats in the Amazon Redshift Database Developer Guide.
The time it takes for the export to complete depends on the data stored in the database. For example, tables with well-distributed numeric primary key or index columns export the fastest. Tables that don't contain a column suitable for partitioning, and tables with only one index on a string-based column, take longer because the export uses a slower single-threaded process. For example, if a table has a numeric primary key and 100,000 rows, the export "partitions" the data into a few portions, each written to its own directory in the S3 bucket, so that when you query the data in Athena or Redshift Spectrum by that id, AWS knows which directories to scan, which improves performance.
In summary, once the data has been exported to S3 in a columnar format like Parquet, you can query it in place with Athena, load it into Redshift or another data store for further analytics, and so on.

Update Athena Table from 2 external tables in Athena from s3

I am relatively new to athena & s3.
I have an s3 bucket which contains 2 folders with csv files in both. I have created 2 external tables for each folder in athena.
I want to create another final table in athena which joins the two files and updates with more rows automatically as more files are added into the s3 bucket. Please could you advise the best way to get the output needed?
I have tried "create table from query" in Athena, but the table remains static as I upload more files to S3 and doesn't update.
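For this use case I would suggest creating a view in Athena. A view is re-evaluated every time it is queried, so it reflects new files as they are added to the folders behind the two external tables. You can read more in the Athena documentation on views.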
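To illustrate the difference between a table created from a query and a view, here is a small sketch that uses Python's built-in sqlite3 module as a local stand-in for Athena. The table and column names are invented, and inserting rows simulates new files arriving in the two S3 folders; the Athena syntax for the view itself would be similar but is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE folder1 (id INTEGER, amount REAL);
    CREATE TABLE folder2 (id INTEGER, label TEXT);
    INSERT INTO folder1 VALUES (1, 10.0);
    INSERT INTO folder2 VALUES (1, 'a');

    -- Snapshot: comparable to "create table from query", fixed at creation time.
    CREATE TABLE joined_snapshot AS
        SELECT f1.id, f1.amount, f2.label
        FROM folder1 f1 JOIN folder2 f2 ON f1.id = f2.id;

    -- View: the join is re-run every time the view is queried.
    CREATE VIEW joined_view AS
        SELECT f1.id, f1.amount, f2.label
        FROM folder1 f1 JOIN folder2 f2 ON f1.id = f2.id;

    -- Simulate new files arriving in both folders.
    INSERT INTO folder1 VALUES (2, 20.0);
    INSERT INTO folder2 VALUES (2, 'b');
""")

snapshot_rows = conn.execute("SELECT COUNT(*) FROM joined_snapshot").fetchone()[0]
view_rows = conn.execute("SELECT COUNT(*) FROM joined_view").fetchone()[0]
print(snapshot_rows, view_rows)
```

The snapshot table keeps only the row that existed when it was created, while the view also returns the row added afterwards.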

How does OVERWRITE_EXISTING insert mode work in RedshiftCopyActivity for AWS Data Pipeline

I am new to AWS Data Pipeline. We have a use case where we copy updated data into Redshift. I wanted to know whether I can use the OVERWRITE_EXISTING insert mode for RedshiftCopyActivity. Also, please explain the internal working of OVERWRITE_EXISTING.
AWS Data Pipeline can be used to move data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.
OVERWRITE_EXISTING overwrites the data that already exists in the destination table, but it requires a unique identifier (primary key) to be defined on that table in the Redshift cluster.
You can use TRUNCATE if you don't want to change your table structure by adding a primary key.
You can find more details here: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
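For orientation, a RedshiftCopyActivity object with this insert mode might look roughly like the following, written here as a Python dict and dumped to JSON. The object ids and the referenced node and resource names are hypothetical, and a real pipeline definition needs the referenced data nodes, database, schedule, and resource objects as well.

```python
import json

# Hypothetical activity object; the ids and "ref" targets are placeholders.
redshift_copy_activity = {
    "id": "CopyUpdatedRows",
    "type": "RedshiftCopyActivity",
    "insertMode": "OVERWRITE_EXISTING",  # alternatives include TRUNCATE
    "input": {"ref": "S3InputDataNode"},
    "output": {"ref": "RedshiftTargetNode"},
    "runsOn": {"ref": "Ec2ResourceForCopy"},
}

# A pipeline definition file holds its objects in an "objects" list.
pipeline_definition = {"objects": [redshift_copy_activity]}
definition_json = json.dumps(pipeline_definition, indent=2)
print(definition_json)
```

Switching between the insert modes is then a one-field change in the activity object.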

Unioning tables from EC2 with AWS Glue

I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog in a database called db1. Then I use AWS Glue to copy the tables from the EC2 instances into an S3 bucket, and I query the tables with Redshift. I get the external schema into Redshift from the crawler using the script below in the query editor. I would like to union the two tables into one table and add a column 'source' with a flag indicating the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is an ETL pipeline that does this before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
You can create a view that unions the two tables using Athena; that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2
;
Run the above in Athena (not Redshift).
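The question also asks for a 'source' column that flags which table each record came from. One way to get it is to add a string literal to each branch of the union. The sketch below checks that idea locally with Python's sqlite3 module as a stand-in for Athena; the column names are placeholders and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mysql_table_1 (col1 INTEGER, col2 TEXT, col3 REAL);
    CREATE TABLE mysql_table_2 (col1 INTEGER, col2 TEXT, col3 REAL);
    INSERT INTO mysql_table_1 VALUES (1, 'a', 1.5);
    INSERT INTO mysql_table_2 VALUES (1, 'a', 1.5), (2, 'b', 2.5);

    -- Each branch tags its rows with the name of the table they came from.
    CREATE VIEW combined_view AS
    SELECT col1, col2, col3, 'mysql_table_1' AS source FROM mysql_table_1
    UNION ALL
    SELECT col1, col2, col3, 'mysql_table_2' AS source FROM mysql_table_2;
""")

rows = conn.execute(
    "SELECT col1, source FROM combined_view ORDER BY source, col1"
).fetchall()
print(rows)
```

UNION ALL keeps the identical rows from both tables, so the source flag is what tells them apart.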