What are the differences/benefits between S3->Redshift and S3->EMR->Redshift?

I have a simple question: what are the pros/cons of importing data from S3 directly into Redshift versus going S3 -> EMR -> Redshift? The reason I post this question is that if I choose S3->Redshift I can do the transform in Redshift using SQL, whereas if I choose EMR it means using Pig/Hive or Java instead of SQL. Which way should I go?
Thanks.

It depends on the amount of data you have to process and how much of the processing you can offload to Hadoop. Redshift has pretty good performance, but it doesn't support many concurrent operations, so data transformation in Redshift may affect your users' query performance. Also, in Hadoop you can process many types of data and file formats; Redshift is obviously more limited.

I am using S3 -> Redshift, and the performance is pretty good. As the previous answer says, there is a trade-off: if you don't want to block user queries, either use Redshift WLM or EMR. With Redshift WLM your process will be throttled, whereas with EMR you will be charged for the AWS resources.
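To make the S3 -> Redshift option concrete, here is a minimal ELT sketch: COPY the raw files into a staging table, then transform with plain SQL inside Redshift. The cluster endpoint, IAM role, bucket, and table names are all hypothetical.

    # Minimal sketch of the S3 -> Redshift (ELT) path; every name below is a placeholder.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="loader", password="...",
    )
    conn.autocommit = True
    cur = conn.cursor()

    # 1) Bulk-load the raw files from S3 into a staging table
    cur.execute("""
        COPY staging.events_raw
        FROM 's3://my-bucket/events/2015/06/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV GZIP;
    """)

    # 2) Transform inside Redshift with SQL (the "T" happens after the load)
    cur.execute("""
        INSERT INTO analytics.events
        SELECT event_id,
               CAST(event_ts AS TIMESTAMP),
               LOWER(event_type)
        FROM staging.events_raw;
    """)

With the EMR route, the equivalent of step 2 would instead be a Pig/Hive/Spark job that writes already-transformed files to S3, followed by a single COPY.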

Related

What tools to use for reporting on terabyte-scale data in AWS

We have data in the range of 100 terabytes. We need to run pre-defined reports on this data (not a lot of ad hoc reporting). For the pre-defined reports, the expected turnaround time is tens of seconds; it should not be minutes.
We are all AWS serverless so far. The original data source is DynamoDB. What toolset should we use? Is Redshift Serverless overkill for this? I hear data should be at least PB scale to use it.
The 100 terabytes of data are expected to accumulate over the next 5 years.
I have been leaning towards streaming data to S3 and using Athena for querying, but the unpredictability of query performance is making us anxious.
Is Redshift / Redshift Serverless / Spectrum overkill for this use case?
Any better toolset that I should explore?
Given the light query load but large data volume, with the data stored in S3, I'd go with Redshift Serverless, keeping the bulk of the data in S3 and accessing it via Spectrum. This gives you the full functionality of Redshift without paying for a cluster that is up full time.
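To sketch what that could look like with the boto3 Redshift Data API against a serverless workgroup: the workgroup, IAM role, Glue catalog database, and table names below are placeholders, and the external table is assumed to already exist in the Glue catalog.

    # Rough sketch of Redshift Serverless + Spectrum via the Redshift Data API.
    import boto3

    rsd = boto3.client("redshift-data", region_name="us-east-1")

    def run(sql):
        # Submit one statement to the serverless workgroup (fire-and-forget)
        rsd.execute_statement(WorkgroupName="reporting-wg", Database="dev", Sql=sql)

    # Expose the S3 data lake to Redshift through an external (Spectrum) schema
    run("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'datalake_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # Pre-defined reports then query the S3-resident data like any other table
    run("""
        SELECT report_date, SUM(amount)
        FROM lake.transactions
        WHERE report_date >= DATEADD(day, -7, CURRENT_DATE)
        GROUP BY report_date;
    """)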
If you can stream data to S3 and your pre-defined reports aggregate over a certain time window (like hour/day/week), I would suggest using a database that supports rollups or materialized views, so your queries run fast (one of your requirements). With rollups or materialized views you are querying a much smaller table, which saves you money on storage and on compute at query time. At the same time, pay attention to the ease of use and cost of creating rollups or materialized views.
Granted I don't know what your data looks like, but being able to handle things like nested JSON which is common with DynamoDB is probably another important requirement. Some systems can handle nested JSON out of the box and some will require ETL tooling to 'flatten' the data.
Here are a few I know that can do rollups and/or materialized views:
Rockset
Druid
Elasticsearch
Clickhouse
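As a rough illustration of the rollup idea, here is what an hourly rollup materialized view could look like; Redshift syntax is shown only because Redshift is already on your shortlist (Druid, ClickHouse, etc. have their own equivalents), and every table and column name below is made up.

    # Sketch of a pre-aggregated rollup as a materialized view (hypothetical names).
    import psycopg2

    conn = psycopg2.connect(host="workgroup-endpoint", port=5439,
                            dbname="dev", user="admin", password="...")
    conn.autocommit = True
    cur = conn.cursor()

    # Pre-aggregate the raw events by hour so reports scan a tiny table
    cur.execute("""
        CREATE MATERIALIZED VIEW hourly_activity AS
        SELECT DATE_TRUNC('hour', event_ts) AS event_hour,
               device_type,
               COUNT(*) AS events
        FROM raw_events
        GROUP BY 1, 2;
    """)

    # Refresh on whatever cadence the reports need (Redshift can also auto-refresh)
    cur.execute("REFRESH MATERIALIZED VIEW hourly_activity;")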
Good luck!

Using AWS Glue to Create a Table and move the dataset

I've never used AWS Glue; however, I believe it will deliver what I want, and I am after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated with it. I want Glue to perform a Create Table As (with all the necessary converts/casts) against this dataset in Parquet format, and then move that dataset from one S3 bucket to another, so the primary Athena table can access the data.
As stated, I've never used Glue before and want a starter for ten, so I don't go down rabbit holes.
I currently perform all these steps manually, so I want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, Glue doesn't run CREATE TABLE AS SELECT queries; instead it does the transformation with ETL jobs based on Spark. There is a GitHub repo that describes such a process in quite a detailed way, and the official AWS documentation covers ETL programming based on the AWS Glue service. After the initial setup, you can define trigger events/scheduling to run your Glue ETL jobs automatically.
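For orientation, a skeleton of such a Glue ETL job (PySpark) might look like the sketch below; the Glue database name, table name, column mappings, and target bucket are placeholders you would replace with your own.

    # Sketch of a Glue ETL job: read the all-strings staging table, cast types,
    # write Parquet to the bucket backing the primary Athena table.
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the staging table registered in the Glue/Athena catalog
    staging = glue_context.create_dynamic_frame.from_catalog(
        database="monthly_uploads", table_name="staging_csv"
    )

    # Cast columns to their proper types (the equivalent of CAST/CONVERT in a CTAS)
    typed = ApplyMapping.apply(
        frame=staging,
        mappings=[
            ("order_id", "string", "order_id", "long"),
            ("order_date", "string", "order_date", "date"),
            ("amount", "string", "amount", "double"),
        ],
    )

    # Write the result as Parquet into the target bucket
    glue_context.write_dynamic_frame.from_options(
        frame=typed,
        connection_type="s3",
        connection_options={"path": "s3://curated-bucket/monthly/"},
        format="parquet",
    )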
However, one thing to remember is the cost of using AWS Glue services. Since pricing is based on execution time, it is sometimes not trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In that case you would know the price exactly, since it depends on the size of your data. Then you can use the AWS API to do some manipulation of the metadata catalog, so that the new information is accessible and in one place.
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can come up with quite complicated queries) and looking into the open source project Apache Airflow for automation/scheduling and orchestration. This is the approach I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retry logic. It even has hooks to interact with AWS services. Hell, Airflow even provides you with a dedicated operator for sending queries to Athena. I wrote a little bit more about this approach here.
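If you go the CTAS route (whether triggered from Airflow's Athena operator or from a plain script), a minimal sketch with boto3 might look like this; the database, table, and bucket names are made up.

    # Sketch of an Athena CTAS that casts the staging columns and writes Parquet
    # into a different S3 bucket (placeholder names throughout).
    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    ctas = """
        CREATE TABLE curated.monthly_parquet
        WITH (
            format = 'PARQUET',
            external_location = 's3://curated-bucket/monthly/'
        ) AS
        SELECT CAST(order_id AS BIGINT) AS order_id,
               CAST(order_date AS DATE) AS order_date,
               CAST(amount AS DOUBLE)   AS amount
        FROM staging.monthly_csv
    """

    athena.start_query_execution(
        QueryString=ctas,
        ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
    )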

Inserting large amount of data into Redshift

I'm looking to build a RESTful API in Go that would be in charge of inserting data sent by multiple mobile apps into an Amazon Redshift cluster, possibly receiving tens of thousands of requests per second.
From what I have read, Redshift gives slow insert speeds.
That's why a few people have advised me to use an intermediate store like DynamoDB or S3, into which I'd write the data first and from which I'd later import it into Redshift.
I'm wondering why I would need to use Redshift in that case, as the data would already be stored in a database. Do you think I can proceed differently?
I have also thought of a simpler solution: writing to a queue and progressively inserting the data into Redshift. But I think it might become a problem if the queue keeps growing because the insert speed isn't fast enough to keep up with the incoming data.
Thanks in advance for your help! :-)
Advice like this is normally off-topic for StackOverflow, but...
Amazon Redshift is a massively parallel processing (MPP) database with an SQL interface. It can be used to query TBs and even PBs of data and it can do it very efficiently.
You ask "why would I need to use Redshift" -- the answer is if your querying requirements cannot be satisfied with a traditional database. If you can satisfactorily use a normal database for your queries, then there's no real reason to use Redshift.
However, if your queries need Redshift, then you should continue to use it. The design of Redshift is such that the most efficient way to insert data is to load from Amazon S3 via the COPY command. It is inefficient to insert data via normal INSERT statements unless they are inserting many rows per INSERT statement (eg hundreds or thousands).
So, some questions to ask:
Do I need the capabilities of Amazon Redshift for my queries, or can a traditional database suffice?
Do I need to load data in real-time, or is it sufficient to load in batches?
If using batches, how often do I need to load the batch? Can I do it hourly or daily, or does it need to be within a few minutes of the data arriving?
You could also consider using Amazon Kinesis Firehose, which can accept a stream of data and insert it into an Amazon Redshift database automatically.
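To sketch what that Firehose hand-off could look like from the API side (shown in Python for illustration; the AWS SDK for Go exposes the same PutRecordBatch operation), with a placeholder delivery stream name; Firehose then stages the records in S3 and issues the COPY into Redshift for you:

    # Sketch: the API buffers incoming events and hands them to Firehose in batches.
    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    def flush(events):
        # Send up to 500 buffered API events to the delivery stream in one call
        firehose.put_record_batch(
            DeliveryStreamName="mobile-events-to-redshift",
            Records=[{"Data": (json.dumps(e) + "\n").encode()} for e in events],
        )

    flush([{"user_id": 42, "action": "open", "ts": "2016-01-01T00:00:00Z"}])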

How to do Real-time loading into Amazon Redshift?

We are evaluating Amazon Redshift for real time data warehousing.
Data will be streamed and processed through a Java service and it should be stored in the database. We process row by row (real time) and we will only insert one row per transaction.
What is best practice for real time data loading to Amazon Redshift?
Shall we use JDBC and perform INSERT INTO statements, or try to use Kinesis Firehose, or perhaps AWS Lambda?
I'm concerned about using one of these services because both will use Amazon S3 as a middle layer and perform the COPY command which is suitable for bigger data sets, not for "one-row" inserts.
It is not efficient to use individual INSERT statements with Amazon Redshift. It is designed as a Data Warehouse, providing very fast SQL queries. It is not a transaction-processing database where data is frequently updated and inserted.
The best practice is to load batches (or micro-batches) via the COPY command. Kinesis Firehose uses this method. This is much more efficient, because multiple nodes are used to load the data in parallel.
If you are seriously looking at processing data in real-time, then Amazon Redshift might not be the best database to use. Consider using a traditional SQL database (eg those provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB) or even Elasticsearch. You should only choose to use Redshift if your focus is on reporting across large volumes of data, typically involving many table joins.
As mentioned in the Amazon Redshift Best Practices for Loading Data:
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
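For the rare case where COPY really isn't an option, a multi-row insert from a Python client might look like the following sketch (hypothetical table and columns):

    # Sketch: one INSERT statement carrying many rows, instead of one INSERT per row.
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(host="my-cluster-endpoint", port=5439,
                            dbname="dw", user="loader", password="...")
    cur = conn.cursor()

    rows = [
        (1, "2017-05-01 10:00:00", 9.99),
        (2, "2017-05-01 10:00:01", 4.50),
        (3, "2017-05-01 10:00:02", 12.00),
    ]

    # execute_values expands this into INSERT ... VALUES (...), (...), (...),
    # i.e. a single round trip for many rows
    execute_values(cur,
                   "INSERT INTO sales (id, sold_at, amount) VALUES %s",
                   rows)
    conn.commit()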
The best option is Kinesis Firehose, which works on batches of events. You write the events into Firehose one by one, and it batches them in an optimal way based on your definition. You can define how many minutes to batch the events, or the size of the batch in MB.
You might be able to insert a single event into Redshift faster with INSERT, but this method is not scalable. COPY is designed to work at almost any scale.
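As a rough sketch of where that batch definition lives, here is what creating such a delivery stream with boto3 could look like; every ARN, credential, and name below is a placeholder. Firehose buffers the incoming rows in S3 until the size or time threshold is hit, then issues one COPY into Redshift.

    # Sketch: a Firehose delivery stream that micro-batches events into Redshift.
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    firehose.create_delivery_stream(
        DeliveryStreamName="events-to-redshift",
        DeliveryStreamType="DirectPut",
        RedshiftDestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseToRedshift",
            "ClusterJDBCURL": "jdbc:redshift://my-cluster:5439/dw",
            "CopyCommand": {"DataTableName": "events",
                            "CopyOptions": "FORMAT AS JSON 'auto'"},
            "Username": "loader",
            "Password": "...",
            "S3Configuration": {
                "RoleARN": "arn:aws:iam::123456789012:role/FirehoseToS3",
                "BucketARN": "arn:aws:s3:::firehose-staging-bucket",
                # The batch definition: flush every 5 minutes or every 64 MB,
                # whichever comes first
                "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 64},
            },
        },
    )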

Easiest way to persist Cassandra data to S3 using Spark

I am trying to figure out how to best store and retrieve data, from S3 to Cassandra, using Spark. I have log data that I store in Cassandra. I run Spark using DSE to perform analysis of the data, and it works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least 6 months, and after research, S3 with Glacier looks like the most promising solution.

I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format to save the Cassandra rows to a file, such that I can one day potentially load the file back into Spark and run an analysis if I have to. I only want to run the analysis in Spark one day, not persist the data back into Cassandra.

JSON seems to be an obvious solution, but is there any other format that I am not considering? Should I use Spark SQL? Any advice appreciated before I commit to one format or another.
Apache Parquet is designed for this kind of use case. It is a columnar storage format. It provides column compression and some indexing.
It is becoming a de facto standard. Many big data platforms are adopting it or at least providing some support for it.
You can query it efficiently directly in S3 using SparkSQL, Impala or Apache Drill. You can also run EMR jobs against it.
To write data to Parquet using Spark, use DataFrame.saveAsParquetFile (on newer Spark versions, DataFrame.write.parquet).
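As a rough sketch of the whole round trip, assuming the spark-cassandra-connector that ships with DSE and hypothetical keyspace, table, and bucket names (the modern DataFrame writer API is used here, the successor to saveAsParquetFile):

    # Sketch: pull one day of logs from Cassandra, archive as Parquet on S3,
    # and show how the archive can be loaded back into Spark later.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("archive-old-logs").getOrCreate()

    # Read the day-15 logs from Cassandra
    logs = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace="logging", table="raw_logs")
        .load()
        .where(F.col("log_date") == "2015-06-01")   # the partition being archived
    )

    # Write them to S3 as Parquet
    logs.write.mode("append").parquet("s3a://log-archive-bucket/logs/date=2015-06-01/")

    # Later, the archived data can be loaded straight back into Spark for analysis
    archived = spark.read.parquet("s3a://log-archive-bucket/logs/")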
Depending on your specific requirements you may even end up not needing a separate Cassandra instance.
You may also find this post interesting