Zipped Data in S3 that needs to be used for Machine Learning on EMR or Redshift - amazon-s3

I have huge CSV files in zipped format in S3 storage. I need just a subset of columns from the data for machine learning purposes. How should I extract those columns into EMR and then into Redshift without transferring the whole files?
My idea is to process all the files in EMR, then extract the subset and push the required columns into Redshift. But this is taking a lot of time. Please let me know if there is a more optimized way of handling this data.
Edit: I am trying to automate this pipeline using Kafka. Let's say a new folder is added to S3; it should be processed in EMR using Spark and stored in Redshift without any manual intervention.
Edit 2: Thanks for the input, guys. I was able to create a pipeline from S3 to Redshift using PySpark in EMR. Currently, I am trying to integrate Kafka into this pipeline.
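For anyone following along, a minimal PySpark sketch of that column-pruning step, with made-up bucket paths and column names (not the original pipeline):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("subset-columns").getOrCreate()

    # Spark decompresses .csv.gz files transparently on read
    df = spark.read.csv("s3://my-input-bucket/raw/*.csv.gz", header=True)

    # Keep only the columns needed for the ML model (placeholder names)
    subset = df.select("customer_id", "feature_a", "feature_b")

    # Write the pruned data back to S3 as Parquet; a Redshift COPY can load it from there
    subset.write.mode("overwrite").parquet("s3://my-output-bucket/subset/")

Gzip files are not splittable, so each .csv.gz is read by a single task, but only the selected columns ever get written back out for the load into Redshift.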

I would suggest:
Create an external table in Amazon Athena (an AWS Glue crawler can do this for you) that points to where your data is stored
Use CREATE TABLE AS to select the desired columns and store them in a new table (with the data automatically stored in Amazon S3); a sketch follows the links below
Amazon Athena can handle gzip format, but you'll have to check whether this includes zip format.
See:
CREATE TABLE - Amazon Athena
Examples of CTAS Queries - Amazon Athena
Compression Formats - Amazon Athena
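For illustration only (the database, table, column names, and bucket paths below are invented), the CTAS step could be kicked off from Python with boto3's Athena client:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # CREATE TABLE AS SELECT: keep only the needed columns; Athena writes the
    # result data to S3 automatically (Parquet by default for CTAS)
    ctas_sql = """
    CREATE TABLE ml_subset
    WITH (external_location = 's3://my-results-bucket/ml_subset/') AS
    SELECT customer_id, feature_a, feature_b
    FROM raw_csv_table
    """

    athena.start_query_execution(
        QueryString=ctas_sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
    )

The Parquet output of the CTAS is also a convenient format for a subsequent Redshift COPY.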

If the goal is to materialise a subset of the file columns in a table in Redshift, then one option you have is Redshift Spectrum, which will allow you to define an "external table" over the CSV files in S3.
You can then select the relevant columns from the external tables and insert them into actual Redshift tables.
You'll have an initial cost hit when Spectrum scans the CSV files to query them, which will vary depending on how big the files are, but that's likely to be significantly less than spinning up an EMR cluster to process the data.
Getting Started with Amazon Redshift Spectrum
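Roughly, the SQL involved looks like the sketch below, run here with psycopg2; the cluster endpoint, IAM role, and all table/column names are placeholders, and ml_features is assumed to already exist as a regular Redshift table:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="dev", user="admin", password="...",
    )
    conn.autocommit = True  # external DDL cannot run inside a transaction block
    cur = conn.cursor()

    # External schema backed by the Glue Data Catalog
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'my_glue_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """)

    # External table over the CSV files sitting in S3
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.raw_events (
            customer_id VARCHAR(64),
            feature_a   DOUBLE PRECISION,
            feature_b   DOUBLE PRECISION,
            unused_col  VARCHAR(256)
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 's3://my-input-bucket/raw/'
    """)

    # Materialise just the wanted columns in a real Redshift table
    cur.execute("""
        INSERT INTO ml_features
        SELECT customer_id, feature_a, feature_b
        FROM spectrum.raw_events
    """)
    conn.close()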

Related

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler database using parquet from S3

I have created parquet files from SQL data using Python. I am able to read the data on my local machine, so I know the parquet files are valid. I created a Glue Crawler that creates a database from the Parquet files in S3, and the database shows the correct number of records in the Glue dashboard.
When I query that database in Athena it shows "No Results", but does show the column names.
(Screenshots were attached for reference: the Glue table properties and the Athena query.)
I figured it out. You cannot point the crawler at the root of the S3 bucket with the parquet files sitting directly in that location. Each parquet file needs to be in a folder that has the same name as the file. I don't think the matching name is strictly required, but for automation purposes it makes the most sense...
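A tiny boto3 sketch of that layout, with invented bucket and file names, where each parquet file is uploaded under its own prefix instead of the bucket root:

    import os
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-parquet-bucket"

    for local_file in ["customers.parquet", "orders.parquet"]:
        table_name = os.path.splitext(local_file)[0]
        # e.g. s3://my-parquet-bucket/customers/customers.parquet
        s3.upload_file(local_file, bucket, f"{table_name}/{local_file}")

The crawler is then pointed at the bucket (or at the individual prefixes) rather than at loose files in the root.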

In-place query of AWS S3 data without provisioning DB or creating tables

We are exploring use cases where we want to achieve in-place transformation and querying of data in an S3 data lake. We don't want to provision a database or create tables (so we are not keen to consider Redshift or Athena), and we want the querying to be as cost-efficient as possible. While we can use S3 Select to query S3 data directly, it has its own limitations, such as only being able to query a single object per request. Are there any alternatives to achieve this? Please guide.
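For context, this is roughly what a single-object S3 Select call looks like from boto3 (bucket, key, and column names are made up); the single-object limitation means you would have to list and loop over objects yourself:

    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="my-data-lake",
        Key="raw/events/part-0001.csv.gz",
        ExpressionType="SQL",
        Expression="SELECT s.\"customer_id\", s.\"amount\" FROM s3object s WHERE s.\"country\" = 'US'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
        OutputSerialization={"CSV": {}},
    )

    # The response is an event stream; the Records events carry the result bytes
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"), end="")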

How to transform data from S3 bucket before writing to Redshift DW?

I'm creating a (modern) data warehouse in redshift. All of our infrastructure is hosted at Amazon. So far, I have setup DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly to S3.
Now I must transform and enrich this data from S3 before I can write it to Redshift. Our DW has some tables for facts and dimensions (star schema), so imagine a Customer dimension: it should contain not only the customer's basic info but also address info, city, state, etc. This data is spread among a few tables in our business database.
So here's my problem: I don't have a clear idea of how to query the S3 staging area in order to join these tables and write the result to my Redshift DW. I want to do it using AWS services like Glue, Kinesis, etc., i.e. fully serverless.
Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift, since all of our data is highly relational in nature anyway? If so, the question remains: how do I transform/enrich the data before saving it to our DW schemas? I have searched everywhere for this particular topic, but information on it is scarce.
Any help is appreciated.
There are a lot of ways to do this, but one idea is to query the data using Redshift Spectrum. Spectrum is a way to query data in S3 (exposed as an external database) using your Redshift cluster.
Really high-level, one way to do this would be to create a Glue Crawler job to crawl your S3 bucket, which creates the External Database that Redshift Spectrum can query.
This way, you don't need to move your data into Redshift itself. Likely, you will want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.
Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
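If you'd rather set the crawler up from code than in the console, a hedged boto3 sketch (the crawler name, role, database, and path are placeholders):

    import boto3

    glue = boto3.client("glue")

    # One-time setup: a crawler that catalogs the DMS output sitting in S3
    glue.create_crawler(
        Name="dms-staging-crawler",
        Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
        DatabaseName="dms_staging",  # becomes the external database Spectrum queries
        Targets={"S3Targets": [{"Path": "s3://my-dms-bucket/staging/"}]},
    )

    # Run it now (or attach a schedule)
    glue.start_crawler(Name="dms-staging-crawler")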
To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift but you can do that in a few ways:
1) Get an ETL tool, or set up cron jobs on a server or in Glue, to schedule SQL scripts to be run. I do this with a Python script that connects to the database and then runs the SQL text. This would be a little bit more of a bulk operation. You can also do this in a Lambda function and have it scheduled via a CloudWatch Events trigger, which can run on a cron schedule.
2) Use a Lambda function that runs the SQL script you want, triggered by S3 PUT events on that bucket. That way the script runs right when a file drops, making this essentially a real-time operation. DMS drops files very quickly, so you may have files arriving multiple times per minute, which might be more difficult to handle.
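A bare-bones sketch of option 2 (the connection details and SQL file name are placeholders, and psycopg2 would need to be bundled into the Lambda deployment package or a layer):

    import psycopg2

    # SQL that joins the staged DMS tables into the Customer dimension (placeholder file)
    TRANSFORM_SQL = open("build_customer_dim.sql").read()

    def handler(event, context):
        # Triggered by an S3 PUT event (or a scheduled CloudWatch rule)
        conn = psycopg2.connect(
            host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
            port=5439, dbname="dw", user="etl_user", password="...",
        )
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(TRANSFORM_SQL)
        conn.close()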
One option is to load the 'raw' data into Redshift as 'staging' tables. Then, run SQL commands to manipulate the data (JOINs, etc) into the desired format.
Finally, copy the resulting data into the 'public' tables that users query.
This is a normal Extract-Load-Transform process (slightly different to ETL) that uses the capabilities of Redshift to do the transform.
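A compressed sketch of that Extract-Load-Transform flow; every table name, path, and role ARN below is invented:

    import psycopg2

    statements = [
        # Load: pull the raw files from S3 straight into a staging table
        """COPY staging_customer
           FROM 's3://my-dms-bucket/staging/customer/'
           IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
           FORMAT AS CSV""",
        # Transform: join the staging tables into the shape users actually query
        """INSERT INTO public.dim_customer (customer_id, name, address, city, state)
           SELECT c.customer_id, c.name, a.address, a.city, a.state
           FROM staging_customer c
           JOIN staging_address a ON a.customer_id = c.customer_id""",
    ]

    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="etl_user", password="...")
    conn.autocommit = True
    with conn.cursor() as cur:
        for sql in statements:
            cur.execute(sql)
    conn.close()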

Convert JSON to ORC [AWS]

This is my situation:
I have an application that rotates JSON files into an S3 bucket. I need to convert those files to ORC format so they can be queried from Athena or EMR.
My first attempt was a Lambda function written in Node, but I didn't find any module for the conversion.
I think it can be done more easily with Glue or EMR, but I cannot find a solution.
Any help?
Thanks!
You can use Glue. You will need a Glue Data Catalog table that describes the schema of your data; you can create this automatically with a Glue crawler.
Then create a Glue job. If you follow the Add Job wizard, you can select ORC as the data output format in the data targets section of the wizard.
The AWS Glue tutorials will step you through doing something similar but converting into Parquet format; if you go through the same steps with your data and select ORC instead, it should do what you want.
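A hedged sketch of what such a Glue job script boils down to if written by hand (the catalog database, table name, and output path are placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the JSON data via the Data Catalog table the crawler created
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db", table_name="json_source"
    )

    # Write it back to S3 as ORC for Athena/EMR to query
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/orc-output/"},
        format="orc",
    )
    job.commit()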

Test Amazon S3 and Redshift

Currently I am trying to figure out a strategy to automate the below testing scenarios:
Compare data integrity for a set of CSV files between an on-premises server and Amazon S3, i.e. comparing each file's entire content while the two copies reside on two different servers.
My thought: I have thought of using Java to compare both, but I am not sure how to perform a runtime comparison between two different servers. Otherwise I would have to bring both files onto the same server for the comparison.
Compare data integrity between Amazon S3 and Amazon Redshift (after the data is loaded from S3 to Redshift). Can I use Java to query the Amazon S3 object, create a table first, and then compare it with Redshift? But I think that although they are part of the same environment, S3 and Redshift are two different systems.
Please suggest if there is any SIT test framework to test an on-premises to AWS Cloud migration.
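Not a ready-made SIT framework, but as a rough illustration of the second check, here is one way to compare row counts between a CSV object in S3 and the Redshift table it was loaded into (all names and credentials are placeholders):

    import boto3
    import psycopg2

    # Row count of the CSV object in S3 (minus the header line)
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="my-landing-bucket", Key="exports/customer.csv")["Body"]
    s3_rows = sum(1 for _ in body.iter_lines()) - 1

    # Row count of the target table in Redshift
    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="qa_user", password="...")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM public.customer")
        redshift_rows = cur.fetchone()[0]
    conn.close()

    print("match" if s3_rows == redshift_rows else f"mismatch: {s3_rows} vs {redshift_rows}")

A stricter content check could hash sorted rows on both sides; row counts plus a few column aggregates are a common first pass.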