Our data is stored in S3 as JSON, without partitions. Until today we were using only Athena, but now we have tried Redshift Spectrum as well.
We are running the same query twice: once using Redshift Spectrum and once using Athena. Both connect to the same data in S3.
Using Redshift Spectrum, this report takes forever (more than 15 minutes) to run, while using Athena it takes only 10 seconds.
The query that we are running in both cases in the AWS console is:
SELECT "events"."persistentid" AS "persistentid",
SUM(1) AS "sum_number_of_reco"
FROM "analytics"."events" "events"
GROUP BY "events"."persistentid"
Any idea what's going on?
Thanks
Redshift Spectrum's processing power is limited by the size of your Redshift cluster.
You can find this information in Improving Amazon Redshift Spectrum Query Performance:
The Amazon Redshift query planner pushes predicates and aggregations
to the Redshift Spectrum query layer whenever possible. When large
amounts of data are returned from Amazon S3, the processing is limited
by your cluster's resources. Redshift Spectrum scales automatically to
process large requests. Thus, your overall performance improves
whenever you can push processing to the Redshift Spectrum layer.
On the other hand, Athena provisions an optimized amount of resources for each query, which may be more than the Spectrum layer of a small Redshift cluster can get.
This has been confirmed by our testing of Redshift Spectrum performance with different Redshift cluster sizes.
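If you want to see how much work is actually being pushed to the Spectrum layer, you can inspect Redshift's S3 query summary system view right after running the query. A minimal sketch (SVL_S3QUERY_SUMMARY and PG_LAST_QUERY_ID are standard Redshift objects, but verify the column names against your cluster version):
SELECT query,
       segment,
       elapsed,
       s3_scanned_rows,
       s3_scanned_bytes,
       avg_request_parallelism
FROM svl_s3query_summary
WHERE query = pg_last_query_id()
ORDER BY segment;
Low parallelism values here would be consistent with the query being limited by a small cluster's Spectrum allowance.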
Related
I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data sets is the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). Is there a way to get distributed writing of a regular SELECT's result instead of a single file, i.e. without bucketing or partitioning?
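For comparison, the bucketed CTAS route that the question is trying to avoid looks roughly like this. A sketch only; the table name, column name, bucket count, and S3 path are hypothetical:
CREATE TABLE my_results
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-bucket/my_results/',
    bucketed_by = ARRAY['id'],
    bucket_count = 16
) AS
SELECT *
FROM source_table;
Here bucket_count is what spreads the output across multiple Parquet files; a plain SELECT, as far as I know, has no equivalent knob.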
We are joining two tables, each 20 TB, in a Spark job.
We are running it on an r5.4xlarge EMR cluster.
How can we optimize it? Can anyone share a few parameters to fine-tune the job and help it run faster?
More details:
Both input tables are timestamp partitioned.
The files in each partition are around 128 MB each.
The input and output file format is Parquet.
The output is written to an external Glue table and stored on S3.
Thanks in advance.
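Not a full answer, but a starting point: for a large shuffle-heavy join these are the settings usually worth checking first. A sketch only, expressed as Spark SQL SET statements; the values are placeholders to tune against your executor count and data skew:
-- Let Spark resize shuffle partitions and handle skewed join keys at runtime
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.coalescePartitions.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
-- Raise the shuffle partition count for multi-TB inputs (placeholder value)
SET spark.sql.shuffle.partitions=4000;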
Why does AWS Athena need a 'spill-bucket' when it dumps results in the target S3 location?
CREATE TABLE my_data
WITH (
    format = 'Parquet',
    parquet_compression = 'SNAPPY',
    external_location = 's3://target_bucket_name/my_data'
)
AS
WITH my_data_2 AS (
    SELECT * FROM "existing_tablegenerated_data" LIMIT 10
)
SELECT *
FROM my_data_2;
Since it already has a bucket to store the data, why does Athena need the spill bucket, and what does it store there?
Trino/Presto developer here who was directly involved in Spill development.
In Trino (formerly known as Presto SQL), the term "spill" refers to dumping to disk data that does not fit in memory. It is an opt-in feature that lets you process larger queries. Of course, if all your queries require spilling, it is more efficient to simply provision a bigger cluster with more memory, but the functionality is useful when larger queries are rare.
Spilling involves saving temporary data, not the final query results. The spilled data is re-read back and deleted before the query completes execution.
Athena uses Lambda functions to connect to external Hive data stores:
Because of the limit on Lambda function response sizes, responses larger than the threshold spill into an Amazon S3 location that you specify when you create your Lambda function. Athena reads these responses from Amazon S3 directly.
https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html
Is there a way for us to check how frequently a table has been accessed/queried in AWS Redshift?
The frequency can be daily, monthly, hourly, or whatever. Can someone help me?
It could be SQL queries using AWS Redshift system tables or some Python script. What is the best way?
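One way to approximate this from the system tables is to count scans per table from STL_SCAN. A sketch, assuming the standard STL_SCAN and SVV_TABLE_INFO system views; note that the STL log tables keep only a few days of history, so you would need to snapshot the result regularly for longer-term tracking:
SELECT ti."schema",
       ti."table",
       DATE_TRUNC('day', s.starttime) AS scan_day,
       COUNT(DISTINCT s.query)        AS query_count
FROM stl_scan s
JOIN svv_table_info ti ON ti.table_id = s.tbl
GROUP BY 1, 2, 3
ORDER BY scan_day DESC, query_count DESC;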
I need to move my BigQuery tables to Redshift.
Currently I have a Python job that fetches the data and incrementally loads it into Redshift.
This Python job reads the BigQuery data, creates a CSV file on the server, drops it on S3, and the Redshift table reads the data from the file on S3. But now the data size will be very big, so the server won't be able to handle it.
Do you happen to know anything better than this?
The 7 new tables on BigQuery that I need to move are around 1 TB each, with repeated columns. (I am doing an UNNEST join to flatten them.)
You could actually move the data from BigQuery to a Cloud Storage bucket by following the instructions here. After that, you can easily move the data from the Cloud Storage bucket to the Amazon S3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here.
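Once the files are in the S3 bucket, the last step would be a Redshift COPY. A sketch only; the schema, table, S3 path, and IAM role ARN are placeholders, and it assumes you export the flattened data from BigQuery as Parquet (or another format COPY supports):
COPY my_schema.my_table
FROM 's3://your-s3-bucket/export/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftCopyRole'
FORMAT AS PARQUET;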