We are evaluating Amazon Redshift for real time data warehousing.
Data will be streamed and processed through a Java service and it should be stored in the database. We process row by row (real time) and we will only insert one row per transaction.
What is best practice for real time data loading to Amazon Redshift?
Shall we use JDBC and perform INSERT INTO statements, or try to use Kinesis Firehose, or perhaps AWS Lambda?
I'm concerned about using either of these services because both use Amazon S3 as an intermediate layer and run the COPY command, which is suited to larger data sets, not to "one-row" inserts.
It is not efficient to use individual INSERT statements with Amazon Redshift. It is designed as a Data Warehouse, providing very fast SQL queries. It is not a transaction-processing database where data is frequently updated and inserted.
The best practice is to load batches (or micro-batches) via the COPY command. Kinesis Firehose uses this method. This is much more efficient, because multiple nodes are used to load the data in parallel.
If you are seriously looking at processing data in real-time, then Amazon Redshift might not be the best database to use. Consider using a traditional SQL database (eg those provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB) or even Elasticsearch. You should only choose to use Redshift if your focus is on reporting across large volumes of data, typically involving many table joins.
As mentioned in the Amazon Redshift Best Practices for Loading Data:
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
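To make the multi-row insert advice concrete, here is a minimal sketch that folds several rows into one parameterized INSERT statement. The table name `events`, its columns, and the `%s` placeholder style (as used by drivers such as psycopg2) are assumptions for illustration; nothing here connects to a database.

```python
from typing import Any, List, Sequence, Tuple


def build_multi_row_insert(
    table: str, columns: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Tuple[str, List[Any]]:
    """Build one parameterized INSERT statement covering every row in `rows`."""
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one value per column")

    # One "(%s, %s, ...)" group per row. Values are returned separately so the
    # database driver can handle quoting.
    group = "(" + ", ".join(["%s"] * width) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([group] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table and data, for illustration only.
sql, params = build_multi_row_insert(
    "events", ["id", "payload"], [(1, "a"), (2, "b"), (3, "c")]
)
print(sql)     # INSERT INTO events (id, payload) VALUES (%s, %s), (%s, %s), (%s, %s)
print(params)  # [1, 'a', 2, 'b', 3, 'c']
```

The resulting pair could be passed to a DB-API cursor's `execute(sql, params)`, so three rows cost one statement instead of three.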
The best option is Kinesis Firehose, which works on batches of events. You write events to Firehose one by one, and it batches them according to your configuration: you can define how many minutes to buffer events for, or the size of the batch in MB.
You might be able to get a single event into Redshift faster with INSERT, but this method does not scale. COPY is designed to work at almost any scale.
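The "flush by time or by size" idea can be sketched in a few lines. This is not how Firehose is implemented, just a rough in-process illustration; the class name `MicroBatcher` and its parameters are made up for the example, and the age limit is only checked when a record arrives (there is no background timer).

```python
import time
from typing import Callable, List


class MicroBatcher:
    """Buffer records and flush when a size or age limit is reached."""

    def __init__(
        self,
        flush: Callable[[List[bytes]], None],
        max_bytes: int = 1024 * 1024,
        max_age_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[bytes] = []
        self._size = 0
        self._started = 0.0

    def add(self, record: bytes) -> None:
        if not self._buffer:
            # Batch age is measured from its first record.
            self._started = self._clock()
        self._buffer.append(record)
        self._size += len(record)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer, self._size = self._buffer, [], 0
            self._flush(batch)


# Tiny limits so the size-based flush is visible.
batches: List[List[bytes]] = []
batcher = MicroBatcher(batches.append, max_bytes=10, max_age_seconds=60.0)
for record in [b"aaaa", b"bbbb", b"cccc", b"dddd"]:
    batcher.add(record)
batcher.flush()  # push out whatever is left
print(batches)   # [[b'aaaa', b'bbbb', b'cccc'], [b'dddd']]
```

In a real pipeline the `flush` callback would be the place to write the batch to S3 and trigger a load, which is the work Firehose does for you.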
We have data in the range of 100 terabytes. We need to run pre-defined reports on this data (not a lot of ad hoc reporting). For the pre-defined reports, the expected turnaround time is tens of seconds, not minutes.
We are all AWS serverless so far. The original data source is DynamoDB. What toolset should we use? Is Redshift Serverless overkill for this? I hear data should be at least PB scale to use it.
The 100 terabytes of data are expected to accumulate over the next 5 years.
I have been leaning towards streaming data to S3 and using Athena for querying but the unpredictability of query performance is making us anxious.
Is Redshift / Redshift Serverless / Spectrum overkill for this use case?
Any better toolset that I should explore?
Given the light query load, the large data volume, and the data being stored in S3, I'd go with Redshift Serverless, keeping the large data in S3 and accessing it via Spectrum. This gives you the full functionality of Redshift without paying for a cluster that is up full time.
If you can stream data to S3 and your pre-defined reports aggregate over a certain time window (like hour/day/week), I would suggest using a database that creates rollups or materialized views so your queries run fast (that is one of your requirements). With rollups or materialized views, you are looking at a much smaller table to query, saving you money on storage and on compute resources during query time. At the same time, you might want to pay attention to the ease of use and cost of creating rollups or materialized views.
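As a rough illustration of what a rollup buys you, the sketch below collapses raw events into one row per hour. It is plain Python rather than any of the systems listed further down, and the event shape (a timestamp plus a numeric value) is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


def hourly_rollup(
    events: Iterable[Tuple[datetime, float]]
) -> Dict[datetime, Dict[str, float]]:
    """Collapse raw (timestamp, value) events into one row per hour."""
    rollup: Dict[datetime, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "sum": 0.0}
    )
    for ts, value in events:
        # Truncate the timestamp to the start of its hour.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        rollup[bucket]["count"] += 1
        rollup[bucket]["sum"] += value
    return dict(rollup)


# Made-up sample events.
events = [
    (datetime(2020, 3, 13, 12, 5, tzinfo=timezone.utc), 2.0),
    (datetime(2020, 3, 13, 12, 55, tzinfo=timezone.utc), 3.0),
    (datetime(2020, 3, 13, 13, 1, tzinfo=timezone.utc), 10.0),
]
for hour, row in sorted(hourly_rollup(events).items()):
    print(hour.isoformat(), row)
```

A report that only needs hourly counts and sums then reads one row per hour instead of every raw event, which is where the speed and cost savings come from.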
Granted I don't know what your data looks like, but being able to handle things like nested JSON which is common with DynamoDB is probably another important requirement. Some systems can handle nested JSON out of the box and some will require ETL tooling to 'flatten' the data.
Here are a few I know that can do rollups and/or materialized views:
Rockset
Druid
Elasticsearch
Clickhouse
Good luck!
I have a lot of time-series data chunked into one-hour intervals in Parquet files stored in AWS S3 (one file for every hour). The purpose is to have a web application display that data. Since we cannot crawl every Parquet file on S3 on request, my approach would be to use ETL processes to aggregate those series and store them as a single Parquet file and in a DynamoDB table for different aggregated views such as year, month, week, day, hour, or even minute. Furthermore, the processed Parquet would be made available through AWS Athena for querying (not from the frontend, since I expect the queries to take too long to execute).
Sounds like a reasonable plan. I assume you want some pointers to how to achieve that?
Depending on the size of your current data set, and your requirements for querying it with Athena you may not need to do the ETL.
You can set up a table with a location that is a prefix of all the Parquet files (e.g. s3://example/dir/ if the files are stored with keys like s3://example/dir/2020/03/13/12/file.parquet). If your total data set is not more than a couple of gigabytes I would recommend this. If your data set is larger, and it is organised into prefixes that contain each day or hour, you can create a partitioned table and add partitions with locations that use the prefix structure (e.g. s3://example/dir/2020/03/13, s3://example/dir/2020/03/12 for daily partitions or s3://example/dir/2020/03/13/11 and s3://example/dir/2020/03/13/12 for hourly partitions). Unless you have hundreds of gigabytes of data per day, or the queries you will run with Athena almost always only look at a few hours of data, I would recommend partitioning by date rather than hour to keep the number of partitions down.
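For the daily-partition variant, a small sketch like the following can generate the `ALTER TABLE ... ADD PARTITION` statements from the prefix structure. The table name `series` and the partition column `dt` are hypothetical, and the sketch assumes the table was created with a matching `PARTITIONED BY (dt string)` clause. It only builds the statements and does not run them.

```python
from datetime import date, timedelta
from typing import Iterator


def daily_partition_ddl(
    table: str, prefix: str, start: date, end: date
) -> Iterator[str]:
    """Yield one ALTER TABLE ... ADD PARTITION statement per day, inclusive."""
    base = prefix.rstrip("/")
    day = start
    while day <= end:
        # Mirrors the key layout s3://example/dir/YYYY/MM/DD/...
        location = "{}/{:%Y/%m/%d}/".format(base, day)
        yield (
            "ALTER TABLE {} ADD IF NOT EXISTS "
            "PARTITION (dt = '{}') LOCATION '{}'".format(
                table, day.isoformat(), location
            )
        )
        day += timedelta(days=1)


statements = list(
    daily_partition_ddl(
        "series", "s3://example/dir/", date(2020, 3, 12), date(2020, 3, 13)
    )
)
for statement in statements:
    print(statement)
```

Each statement could then be submitted to Athena, for example from a scheduled job that adds the partition for the new day.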
If your existing Parquet files are very small, less than a hundred megabytes, and performance for the Athena queries is very important, you could try to ETL the files into bigger files to see if it helps. It may or may not, it will depend. I recommend that you use Athena itself for the ETL. You can use INSERT INTO to create new partitions in a table based on data in another table. I suggest automating this by creating an Event Bridge rule with a schedule that either triggers a Lambda function that runs the conversion query in Athena, or a Step Functions state machine if you want to avoid paying for the Lambda sitting idle waiting for the query to finish (or need to wait more than 15 minutes). AWS has a service called Glue ETL which was made for this kind of thing, but in my experience it is not worth it. Using Athena, Lambda, and Step Functions beats it in terms of usability and flexibility.
You can use the same mechanism for loading pre-calculated time series into DynamoDB – use Event Bridge to schedule a Lambda function that runs queries in Athena, and convert the results to store in DynamoDB. Use Step Functions to avoid paying for idle time when waiting for the queries to finish.
If Amazon Timestream ever gets released it may be a better target for storing the time series than DynamoDB. Also consider storing the pre-calculated time series as JSON, CSV, or Apache Arrow on S3 instead of DynamoDB, that can be cheaper and in some ways simpler depending on your use case.
Please let me know if you want more details on how to set up the partitioned Athena tables, but provide some more context so I can make it more concrete.
I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We are already using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
I'm looking to build a RESTful API in Go that would be in charge of inserting data sent by multiple mobile apps (the data would be stored in an Amazon Redshift cluster), possibly receiving tens of thousands of requests per second.
From what I have read, Redshift gives slow insert speeds.
That's why a few people have advised me to use an intermediate store like DynamoDB or S3, in which I'd perform the inserts first. Then, in a second step, I'd import the data into Redshift.
I'm wondering why I would need Redshift in that case, as the data would already be stored in a database. Do you think I can proceed differently?
I have also thought of a simpler solution: writing to a queue and progressively inserting the data into Redshift. But I think it might be a problem if the queue keeps growing because the insert speed isn't fast enough to keep up with the incoming data.
Thanks in advance for your help! :-)
Advice like this is normally off-topic for StackOverflow, but...
Amazon Redshift is a massively parallel processing (MPP) database with an SQL interface. It can be used to query TBs and even PBs of data and it can do it very efficiently.
You ask "why would I need to use Redshift" -- the answer is if your querying requirements cannot be satisfied with a traditional database. If you can satisfactorily use a normal database for your queries, then there's no real reason to use Redshift.
However, if your queries need Redshift, then you should continue to use it. The design of Redshift is such that the most efficient way to insert data is to load from Amazon S3 via the COPY command. It is inefficient to insert data via normal INSERT statements unless they are inserting many rows per INSERT statement (eg hundreds or thousands).
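As a rough sketch of what the load step looks like, the helper below builds a COPY statement for files under an S3 prefix. The table name, bucket, and IAM role ARN are placeholders, and the `FORMAT AS JSON 'auto' GZIP` options assume gzipped JSON files, so adjust them to whatever format you actually write to S3.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a COPY statement that loads gzipped JSON files under s3_prefix."""
    # The values are interpolated into SQL, so reject anything suspicious.
    for value in (table, s3_prefix, iam_role_arn):
        if "'" in value or ";" in value:
            raise ValueError("unexpected quote or semicolon in {!r}".format(value))
    return (
        "COPY {table} FROM '{prefix}' "
        "IAM_ROLE '{role}' "
        "FORMAT AS JSON 'auto' GZIP"
    ).format(table=table, prefix=s3_prefix, role=iam_role_arn)


# Placeholder names, for illustration only.
copy_sql = build_copy_statement(
    "events",
    "s3://example-bucket/events/2020/03/13/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

One COPY like this loads every file under the prefix in a single operation, which is why accumulating many rows in S3 first is so much cheaper than inserting them one by one.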
So, some questions to ask:
Do I need the capabilities of Amazon Redshift for my queries, or can a traditional database suffice?
Do I need to load data in real-time, or is it sufficient to load in batches?
If using batches, how often do I need to load the batch? Can I do it hourly or daily, or does it need to be within a few minutes of the data arriving?
You could also consider using Amazon Kinesis Firehose, which can accept a stream of data and insert it into an Amazon Redshift database automatically.
I have a simple question: what are the pros and cons of importing data from S3 directly into Redshift versus going from S3 to Redshift via EMR? The reason I ask is that if I choose S3 -> Redshift, I can do the transformation in Redshift using SQL. On the other hand, if I choose EMR, that means using Pig/Hive or Java instead of SQL. Which way should I go?
Thanks.
It depends on the amount of data you have to process and how much of the processing you can offload to Hadoop. Redshift has pretty good performance but it doesn't support too many concurrent operations, so data transformation in Redshift may affect your users' query performance. Also, in Hadoop you can process many types of data and file formats; Redshift is more limited.
I am using S3 -> Redshift, and the performance is pretty good. As the previous comment says, there is a trade-off: if you don't want to block user queries, use either Redshift WLM or EMR. With Redshift WLM your process will be throttled, whereas with EMR you will be charged for the AWS resources.