Is there any way to auto-refresh a Redshift materialized view based on an external table?

I have a materialized view in Redshift that is based on data from an external Redshift Spectrum table, so it's impossible to use Redshift's auto refresh feature.
I just don't want to refresh it by hand.
I don't care much about data consistency, so a delay of some time (up to an hour or more) is fine for me.
So, is there any way to update the materialized view automatically? Maybe it's possible to configure some TTL for it? Any other ideas?

Usually, the location of the Spectrum files is Amazon S3.
You can automate the refresh with a scheduled Lambda function that runs periodically.
You can also set up a trigger that fires every time a file is created under a given prefix of the bucket. That event invokes a Lambda function which issues the desired REFRESH statement.
I'd start simple if you can and work up to the Redshift Data API and Step Functions.
The first (scheduled) option might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
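A minimal sketch of the scheduled option, assuming a Python Lambda that calls the Redshift Data API; the cluster identifier, database, secret ARN and view name below are placeholders:

import boto3

# Hypothetical names - replace with your own cluster, database, secret and view.
CLUSTER_ID = "my-redshift-cluster"
DATABASE = "dev"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds"
VIEW_NAME = "my_spectrum_mv"

client = boto3.client("redshift-data")

def handler(event, context):
    # Invoked on a schedule (or from an S3 event) to refresh the view.
    resp = client.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        SecretArn=SECRET_ARN,
        Sql=f"REFRESH MATERIALIZED VIEW {VIEW_NAME};",
    )
    # The Data API is asynchronous; the refresh keeps running after we return.
    return {"statement_id": resp["Id"]}

The same handler works for the event-driven variant; only the trigger changes.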

Related

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the 2 following points please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible, just slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage, so you don't have to define it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage, making that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of that code. This is nice for Dev-to-Prod migrations as well.
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.
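A rough sketch of the stage-based flow via the Python connector, with made-up object names (account, credentials, bucket, file format and table are all placeholders):

import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    database="MY_DB", schema="PUBLIC", warehouse="MY_WH",
)
cur = conn.cursor()

# Define the file format and external stage once, so COPY INTO stays short
# and the S3 location/credentials live in a single place.
cur.execute("CREATE OR REPLACE FILE FORMAT my_csv_format TYPE = CSV SKIP_HEADER = 1")
cur.execute("""
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/exports/'
      CREDENTIALS = (AWS_KEY_ID = '***' AWS_SECRET_KEY = '***')
      FILE_FORMAT = my_csv_format
""")

# Load everything under the stage path into the target table.
cur.execute("COPY INTO my_table FROM @my_s3_stage")

If the stage later moves to a different bucket, only the CREATE STAGE statement changes; the COPY INTO calls stay the same.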

Lambda function triggered by S3 events

I have a Lambda function that gets triggered every time a file is written to an S3 bucket. My understanding is that every time a single file comes in (a realistic scenario, as opposed to a batch of files being sent), an API call is fired and I am charged for it. My question is: can I batch multiple files so that the Lambda is only invoked when, for example, I have a batch of 10 files? Is this good practice? Processing should not take more than 15 minutes, so using Lambda is still fine.
Thank you
You can use SQS to decouple this scenario: the Lambda's trigger becomes the SQS queue, and there you can set whatever batch size you want (a sketch of this setup follows below).
1 - One solution is to group your files into an archive (e.g. a rar) and put that into S3. That way, for multiple files, your Lambda will be triggered only once.
2 - The other solution, as kamprasad said, is to use SQS.
3 - One last solution I can think of is to use a cron job to trigger the Lambda on whatever schedule you need. Inside your Lambda, do the processing with threads to finish the task faster. Keep in mind that you have to choose memory and timeout carefully in this scenario.
I've personally used the last solution quite frequently.
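A small sketch of the SQS option, assuming S3 event notifications are sent to an SQS queue and the queue triggers the Lambda with a batch size of 10 configured on the event source mapping (bucket and processing logic are hypothetical):

import json

def handler(event, context):
    # Triggered by SQS; each invocation receives up to batch-size messages.
    for record in event["Records"]:            # one entry per SQS message
        body = json.loads(record["body"])       # the S3 event notification
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            process_file(bucket, key)           # your processing logic

def process_file(bucket, key):
    print(f"processing s3://{bucket}/{key}")

This way you pay for one invocation per batch of messages rather than one per file.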

How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery in a Beam job from an unbounded source. I'm using STREAMING_INSERTS as the method. I was looking at how to throttle the rows to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the micro-batch size.
I was looking at using triggers, but I'm not sure whether everything in a pane gets grouped into a single request. I've set up the trigger as below:
Window.<Long>into(new GlobalWindows())
    .triggering(
        Repeatedly.forever(
            AfterFirst.of(
                AfterPane.elementCountAtLeast(5),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(2)))))
    .discardingFiredPanes();
Q1. Does Beam support micro batches or does it create one request for each element in the PCollection?
Q2. Does the above trigger make sense? Even if I set the window/trigger, it could still be sending one request for every element.
I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or via streaming.
Basically, batch loads are subject to quotas and streaming loads are a bit more expensive.
Once you set the insertion method for your BigQueryIO, the documentation states:
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
Never tried it, but withTriggeringFrequency seems to be what you need here.
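For illustration, here is a sketch of that combination using the Beam Python SDK (project, dataset, table and schema below are made up); the Java BigQueryIO.Write exposes the corresponding withMethod(FILE_LOADS) and withTriggeringFrequency calls:

import apache_beam as beam

# Hypothetical write step: batch loads from a streaming pipeline, flushed
# roughly every 5 minutes instead of one streaming request per element/pane.
write_step = beam.io.WriteToBigQuery(
    table="my_project:my_dataset.my_table",
    schema="id:INT64,payload:STRING",
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # seconds between load jobs
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
)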

Different ways of updating bigquery table

In GCP, I need to update a BigQuery table whenever a file (in multiple formats such as JSON or XML) gets uploaded into a bucket. I have two options but am not sure what the pros/cons of each of them are. Can someone suggest which is the better solution and why?
Approach 1:
File uploaded to bucket --> Trigger Cloud Function (which updates the BigQuery table) --> BigQuery
Approach 2:
File uploaded to bucket --> Trigger Cloud Function (which triggers a Dataflow job) --> Dataflow --> BigQuery
In a production environment, which approach is better suited and why? If there are alternative approaches, please let me know.
This is quite a broad question, so I wouldn't be surprised if it gets voted to be closed. That said, however, I'd always go with #2 (GCS -> CF -> Dataflow -> BigQuery).
Remember, with Cloud Functions there is a maximum execution time. If you kick off a load job from the Cloud Function, you'll need to bake logic into it to poll and check the status (load jobs in BigQuery are async). If it fails, you'll need to handle that. But what if it's still running and you hit the maximum execution time of your Cloud Function?
At least by using Dataflow, you don't have the problem of max execution times and you can simply rerun your pipeline if it fails for some transient reason e.g. network issues.
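To make the trade-off concrete, here is a sketch of approach 1 as a Python Cloud Function (bucket, dataset and table names are hypothetical, and the file is assumed to be newline-delimited JSON); note how waiting on the asynchronous load job eats into the function's execution time limit:

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my_project.my_dataset.my_table"   # hypothetical target table

def on_file_uploaded(event, context):
    # Background Cloud Function triggered by a GCS object-finalize event.
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    # Load jobs are asynchronous; result() blocks until the job finishes,
    # which counts against the function's maximum execution time.
    job.result()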

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field, especially in a production setting. Experimental fields could have bugs, etc. Things I can think of to be careful about, off the top of my head, are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
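For what it's worth, a sketch of how the template suffix can be used from today's Python client (base table, dataset and rows are placeholders; the client library postdates this answer). The suffix is appended to the base table's name, and BigQuery creates the per-day table from the base table's schema on first insert:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
BASE_TABLE = "my_project.my_dataset.events"   # hypothetical template table

def stream_rows(rows):
    # e.g. events_20240115 - created automatically on first insert.
    suffix = "_" + datetime.date.today().strftime("%Y%m%d")
    errors = client.insert_rows_json(BASE_TABLE, rows, template_suffix=suffix)
    if errors:
        raise RuntimeError(f"failed inserts: {errors}")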
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables in advance for 1 year, or on some projects 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script also deletes all of those empty future tables and simply recreates them. We didn't complicate our lives with a cron, as we know the exact moment the schema changes (the deploy), and there is no disadvantage to creating tables in advance for such a long period. On SaaS-based systems we also do this per tenant, when a user account is created or closed.
This way we don't need a cron; we just need to know that the deploy has to do this additional step when the schema changes.
As for not losing streaming inserts while you do some maintenance on your tables, you need to address that in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, to queue all the rows into a tube so that a worker later pushes them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry; it's easy to do with a simple message queue. You would then rely on this retry phase when you stop or rename a table for a while: the streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not only helps with retrying failed BigQuery calls, but also gives you a maintenance window.
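As a rough illustration of that retry phase, a worker-side helper (queue plumbing omitted, names hypothetical) that keeps retrying the streaming insert so rows survive a short maintenance window:

import time
from google.cloud import bigquery

client = bigquery.Client()

def push_with_retry(table_id, rows, attempts=5, backoff_seconds=30):
    # Try the streaming insert; on failure, wait and retry so rows survive
    # renamed/recreated tables and transient API errors.
    for _ in range(attempts):
        try:
            errors = client.insert_rows_json(table_id, rows)
            if not errors:
                return True
        except Exception:
            pass  # API-level error, e.g. the table is temporarily missing
        time.sleep(backoff_seconds)
    return False  # leave the rows on the queue for a later attempt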
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely App Engine won't go over the free quotas, and it has a 99.95% uptime SLO, so the cron will never go down.
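A minimal sketch of such a task (dataset, table prefix and schema are made up, and it uses the current Python client for illustration); creating a table with exists_ok=True makes the hourly run idempotent:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
DATASET = "my_project.my_dataset"             # hypothetical dataset
SCHEMA = [
    bigquery.SchemaField("ts", "TIMESTAMP"),
    bigquery.SchemaField("payload", "STRING"),
]

def ensure_daily_tables():
    # Run hourly from the cron; creating an existing table is a no-op.
    today = datetime.date.today()
    for day in (today, today + datetime.timedelta(days=1)):
        table_id = f"{DATASET}.events_{day.strftime('%Y%m%d')}"
        client.create_table(bigquery.Table(table_id, schema=SCHEMA), exists_ok=True)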