Currently I'm listening to events from AWS Kinesis and writing them to S3. Then I query them using AWS Glue and Athena.
Is there a way to import that data, possibly with some transformation, to an RDS instance?
There are several general approaches to this task.
Read data from an Athena query into a custom ETL script (using a JDBC connection) and load it into the database
Mount the S3 bucket holding the data to a file system (perhaps using s3fs-fuse), read the data using a custom ETL script, and push it to the RDS instance(s)
Download the data to a local filesystem using the AWS CLI or an SDK, process it there, and then push it to RDS
As you suggest, use AWS Glue to import the data from Athena to the RDS instance (a minimal job sketch follows the networking notes below). If you are building an application that is tightly coupled to AWS (and if you are using Kinesis and Athena, you are), then such a solution makes sense.
When connecting Glue to RDS, there are a couple of things to keep in mind (mostly on the networking side):
Ensure that DNS hostnames are enabled in the VPC hosting the target RDS instance
You'll need to set up a self-referencing rule in the security group associated with the target RDS instance
For some examples of code targeting a relational database, see the AWS Glue tutorials.
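As a rough illustration of the Glue option, a minimal PySpark job sketch that reads the catalog table Athena queries and writes it to RDS through a Glue JDBC connection might look like the following; the database, table, and connection names are placeholders, not your actual resources:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Glue Data Catalog table that Athena queries (placeholder names).
events = glue_context.create_dynamic_frame.from_catalog(
    database="events_db",
    table_name="kinesis_events",
)

# Optional transformation step, e.g. dropping fields you don't need in RDS.
events = events.drop_fields(["raw_payload"])

# Write to the RDS instance via a Glue connection defined for it.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=events,
    catalog_connection="rds-postgres-connection",
    connection_options={"dbtable": "events", "database": "appdb"},
)

job.commit()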
One approach for Postgres:
Install the S3 extension in Postgres:
psql=> CREATE EXTENSION aws_s3 CASCADE;
Run the query in Athena and find the CSV result file's location in S3 (the S3 output location is configured in the Athena settings; you can also inspect the "Download results" button to get the S3 path)
Create your table in Postgres
Import from S3:
SELECT aws_s3.table_import_from_s3(
'newtable', '', '(format csv, header true)',
aws_commons.create_s3_uri('bucketname', 'reports/Unsaved/2021/05/10/aa9f04b0-d082-328g-5c9d-27982d345484.csv', 'us-east-1')
);
If you want to convert empty values to null, you can use this: (format csv, FORCE_NULL (columnname), header true)
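If you would rather script steps 3 and 4 instead of running them by hand in psql, a rough Python sketch using psycopg2 could look like this; the connection details, table definition, and S3 key are placeholders and must match the columns of your Athena result CSV:

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="appdb", user="app", password="secret",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Step 3: create a table whose columns match the Athena result CSV.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS newtable (
            event_id   text,
            event_time timestamptz,
            payload    text
        )
    """)

    # Step 4: the same aws_s3.table_import_from_s3 call, just parameterised.
    cur.execute(
        "SELECT aws_s3.table_import_from_s3("
        "  %s, '', '(format csv, header true)',"
        "  aws_commons.create_s3_uri(%s, %s, %s))",
        ("newtable", "bucketname", "reports/Unsaved/2021/05/10/results.csv", "us-east-1"),
    )

conn.close()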
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html
Related
I am looking for a way to directly export SQL query results to a CSV file from AWS Lambda. I have found this similar question - Exporting table from Amazon RDS into a csv file - but it will not work with the AWS Golang API.
I want to schedule a Lambda function that will query some views/tables from RDS (SQL Server) daily and put the results into an S3 bucket in CSV format. So I want to fetch the query results as CSV directly in the Lambda and then upload the file to S3.
I have also found AWS Data Pipeline, which can copy RDS data to S3 directly, but I am not sure whether I can make use of it here.
It would be helpful if anyone could suggest the right process and references for implementing it.
You can transfer files between a DB instance running Amazon RDS for SQL Server and an Amazon S3 bucket. By doing this, you can use Amazon S3 with SQL Server features such as BULK INSERT. For example, you can download .csv, .xml, .txt, and other files from Amazon S3 to the DB instance host and import the data from D:\S3\ into the database. All files are stored in D:\S3\ on the DB instance.
Reference:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/User.SQLServer.Options.S3-integration.html
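If you prefer to keep everything inside the Lambda function, as the question describes (the question mentions Go, but the same flow translates), a rough Python sketch is to run the query, write the rows to a CSV under /tmp, and upload that file to S3. The connection string, query, bucket, and the availability of a SQL Server ODBC driver (e.g. packaged as a Lambda layer) are all assumptions:

import csv
import datetime

import boto3
import pyodbc  # assumes a SQL Server ODBC driver is available to the Lambda

def handler(event, context):
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com,1433;"
        "DATABASE=reporting;UID=report_user;PWD=secret"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM dbo.DailySalesView")  # placeholder view

    local_path = "/tmp/daily_sales.csv"
    with open(local_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        for row in cursor:
            writer.writerow(list(row))

    key = "exports/daily_sales_{}.csv".format(datetime.date.today().isoformat())
    boto3.client("s3").upload_file(local_path, "my-export-bucket", key)
    conn.close()
    return {"s3_key": key}

The daily schedule can then be attached with an EventBridge (CloudWatch Events) rule that triggers the function.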
I have huge CSV files in zipped format in S3 storage. I need just a subset of the columns from the data for machine learning purposes. How should I extract those columns into EMR and then into Redshift without transferring the whole files?
My idea is to process all the files in EMR, extract the subset, and push the required columns into Redshift. But this is taking a lot of time. Please let me know if there is a more efficient way of handling this data.
Edit: I am trying to automate this pipeline using Kafka. Say a new folder is added to S3; it should be processed in EMR using Spark and stored in Redshift without any manual intervention.
Edit 2: Thanks for the input. I was able to create a pipeline from S3 to Redshift using PySpark in EMR. Currently I am trying to integrate Kafka into this pipeline.
I would suggest:
Create an external table in Amazon Athena (an AWS Glue crawler can do this for you) that points to where your data is stored
Use CREATE TABLE AS to select the desired columns and store them in a new table, with the data automatically stored in Amazon S3 (a sketch follows after the links below)
Amazon Athena can handle gzip format, but you'll have to check whether this includes zip format.
See:
CREATE TABLE - Amazon Athena
Examples of CTAS Queries - Amazon Athena
Compression Formats - Amazon Athena
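As a rough sketch of the CTAS step (database, table, column, and bucket names are placeholders), you can start the query from code with boto3 and let Athena write the narrowed-down data back to S3:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Keep only the columns needed for the ML work; Athena stores the new
# table's data at external_location.
ctas = """
CREATE TABLE ml_subset
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/ml-subset/'
) AS
SELECT col_a, col_b, col_c
FROM source_table
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)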
If the goal is to materialise a subset of the file columns in a table in Redshift, then one option is Redshift Spectrum, which allows you to define an "external table" over the CSV files in S3.
You can then select the relevant columns from the external tables and insert them into actual Redshift tables.
You'll have an initial cost hit when Spectrum scans the CSV files to query them, which will vary depending on how big the files are, but that's likely to be significantly less than spinning up an EMR cluster to process the data.
Getting Started with Amazon Redshift Spectrum
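A rough sketch of that flow, run here through psycopg2 against the Redshift cluster (the schema, table, column names, bucket, and IAM role are placeholders):

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret",
)
conn.autocommit = True

with conn.cursor() as cur:
    # External schema backed by an external data catalog database.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
        FROM DATA CATALOG DATABASE 'my_database'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """)

    # External table over the CSV files in S3 (columns are placeholders).
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum_schema.source_table (
            col_a varchar(256),
            col_b varchar(256),
            col_c varchar(256),
            col_d varchar(256)
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 's3://my-bucket/raw-csv/'
    """)

    # Materialise only the needed columns into a real Redshift table.
    cur.execute("""
        CREATE TABLE ml_subset AS
        SELECT col_a, col_b, col_c
        FROM spectrum_schema.source_table
    """)

conn.close()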
We are planning to offload events from Kafka to S3 (e.g. via Kafka Connect). The goal is to spin up a service (e.g. Amazon Athena) and provide a query interface on top of the exported Avro events. The obstacle is that the Amazon Athena Avro SerDe (org.apache.hadoop.hive.serde2.avro.AvroSerDe) does not support the magic bytes that the Schema Registry uses to store the schema ID. Do you know of any alternative that plays nicely with the Confluent Schema Registry?
Thanks!
Using the S3 Connector's AvroConverter does not put any schema ID in the file. In fact, after the message is written, you lose the schema ID entirely.
We have lots of Hive tables that are working fine with these files, and users are querying them using Athena, Presto, Spark SQL, etc.
Note: if you want to use AWS Glue, the S3 Connector doesn't (currently, as of 5.x) offer automatic Hive partition creation the way the HDFS Connector does, so you might want to look for alternatives if you want to use it that way.
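For reference, a rough sketch of registering such an S3 sink through the Kafka Connect REST API (the Connect URL, topic, bucket, and Schema Registry address are placeholders; the property names follow the Confluent S3 sink connector):

import json
import requests

config = {
    "name": "events-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",
        "s3.bucket.name": "my-event-archive",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        # Writes plain Avro container files with the schema embedded
        # (no Schema Registry magic bytes), which Athena's AvroSerDe can read.
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
        "flush.size": "1000",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
resp.raise_for_status()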
Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog.
So we can easily use Athena, Redshift, or EMR to query the data on S3 with Glue as the metastore.
My question is: is it possible to expose the Glue Data Catalog as the metastore for external services like Databricks hosted on AWS?
Databricks now provides documentation for using the Glue Data Catalog as the metastore. It can be done with the following steps:
Create an IAM role and policy to access a Glue Data Catalog
Create a policy for the target Glue Catalog
Look up the IAM role used to create the Databricks deployment
Add the Glue Catalog IAM role to the EC2 policy
Add the Glue Catalog IAM role to a Databricks workspace
Launch a cluster with the Glue Catalog IAM role
Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
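The setting that ultimately matters on the cluster is the Spark configuration flag that switches the metastore to Glue, combined with the instance profile created in the IAM steps. As a rough sketch (the workspace URL, token, instance type, runtime version, and ARNs are placeholders), such a cluster could also be created through the Databricks Clusters REST API:

import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"                               # placeholder token

cluster_spec = {
    "cluster_name": "glue-catalog-cluster",
    "spark_version": "11.3.x-scala2.12",   # any supported runtime
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "aws_attributes": {
        # Instance profile carrying the Glue Catalog IAM role from the steps above.
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/glue-catalog-role",
    },
    "spark_conf": {
        # The switch that makes the cluster use Glue as its metastore.
        "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
    },
}

resp = requests.post(
    WORKSPACE + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])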
There have been a couple of decent documentation/write-up pieces provided by Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.
Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do one-way replication to Glue through the Java SDK (or perhaps the other way around as well, mapping AWS API responses to sequences of create table / create partition statements). Of course this is riddled with rather complex corner cases, like cascading partition/table deletions, but for simple create-only scenarios it seems approachable at least.
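As a sketch of what the create-only direction of that Plan B might look like (shown here with boto3 rather than the Java SDK; the database, table, columns, and S3 location are placeholders), it is mostly a matter of mapping table definitions onto glue.create_table calls:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A minimal table definition lifted from the Databricks metastore
# (e.g. from SHOW CREATE TABLE output); everything here is a placeholder.
glue.create_table(
    DatabaseName="analytics",
    TableInput={
        "Name": "events",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_id", "Type": "string"},
                {"Name": "payload", "Type": "string"},
            ],
            "Location": "s3://my-bucket/events/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            },
        },
    },
)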
Currently I am trying to figure out a strategy to automate the testing scenarios below:
Compare data integrity for a set of CSV files between an on-premises server and Amazon S3, i.e. compare the two files' entire contents while they reside on two different servers.
My thought: I have thought of using Java to compare both, but I am not sure how to perform a runtime comparison across two different servers; otherwise I have to bring both files onto the same server for the comparison.
Compare data integrity between Amazon S3 and Amazon Redshift (after the data is loaded from S3 to Redshift). Can I use Java to query the Amazon S3 object, create a table first, and then compare it with Redshift? Although they are part of the same environment, I think S3 and Redshift are still two different servers.
Please suggest whether there is any SIT test framework for testing an on-premises to AWS cloud migration.
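For the first scenario, one way to avoid copying both files onto the same server is to compare checksums: hash the on-premises file locally and stream the S3 object through the same hash, then compare the digests. A rough Python sketch (the question mentions Java, but the flow translates; the bucket, key, and path are placeholders):

import hashlib

import boto3

def md5_of_s3_object(bucket, key, chunk_size=8 * 1024 * 1024):
    # Stream the object body through the hash without saving it to disk.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    digest = hashlib.md5()
    for chunk in iter(lambda: body.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def md5_of_local_file(path, chunk_size=8 * 1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholders for the actual file pair being verified.
if md5_of_s3_object("my-bucket", "landing/file1.csv") == md5_of_local_file("/data/file1.csv"):
    print("file contents match")
else:
    print("file contents differ")

For the second scenario, a similar idea can be expressed in SQL by defining an external table over the S3 data (for example with Redshift Spectrum, as in the earlier answer) and comparing row counts or aggregates against the loaded Redshift table.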