Serverless.yml Aurora RDS: Infrastructure as code - serverless-framework

I'm looking for a way to declare an Aurora DB, with all its tables, via serverless.yml.
I would like to be able to deploy a new Aurora instance with all the tables via serverless deploy.
Thanks,

The easiest way I found is to declare the Aurora DB resource with YAML like this:
Resources:
  RDSCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      MasterUsername: ${self:custom.dbLogin}
      MasterUserPassword: ${self:custom.dbPassword}
      DatabaseName: MagaDB
      Engine: aurora
      EngineMode: serverless
      ScalingConfiguration:
        AutoPause: true
        MaxCapacity: 2
        MinCapacity: 1
        SecondsUntilAutoPause: 300
      EnableHttpEndpoint: true
      StorageEncrypted: true
And then create an init.sql script that creates all the tables, and run it against the new cluster.
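A minimal sketch of such an init.sql, assuming the MySQL-compatible aurora engine from the snippet above (table and column names are purely illustrative):

-- init.sql: run once against the new cluster, e.g. via the Data API (EnableHttpEndpoint) or a MySQL client.
-- Table and column names below are illustrative placeholders.
CREATE TABLE IF NOT EXISTS users (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    email      VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS orders (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    total   DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);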
The difference between Aurora and DynamoDB is that DynamoDB tables have to be declared in the deployment template, whereas with Aurora you don't declare tables at deploy time; you create them afterwards with SQL.

To do this with the Serverless Framework, you'll need to write a CloudFormation template and include it inside the resources block of your serverless.yml file.
Here are the docs, so you can learn more about including CloudFormation in your serverless.yml file.
Here's a set of examples from AWS that can help, although they're extremely verbose and include lots of extra things you may not need.
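For orientation, here is a rough sketch of how the cluster resource above could sit inside serverless.yml (the service name and the custom values sourced from environment variables are assumptions for illustration):

service: aurora-example

custom:
  dbLogin: ${env:DB_LOGIN}       # assumed to come from environment variables
  dbPassword: ${env:DB_PASSWORD}

provider:
  name: aws

resources:
  Resources:
    RDSCluster:
      Type: AWS::RDS::DBCluster
      Properties:
        # same properties as in the snippet above
        MasterUsername: ${self:custom.dbLogin}
        MasterUserPassword: ${self:custom.dbPassword}
        DatabaseName: MagaDB
        Engine: aurora
        EngineMode: serverless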

Related

Can Matillion ETL convert multiple CSV files to Parquet?

I have the following folder hierarchy in an S3 bucket:
January/10
16b516c0-8f2a-eabd-770a-b8bbc83c5859.csv, 16b516c0-8f2a-eabd-770a-b8bbc83c5859.csv, …
In other words, every folder represents a calendar day.
I would like Matillion ETL to do the following transform:
January/10
AsingleParquetFile.Parquet
How can I implement this in Matillion ETL?
Matillion ETL would do this by using the CDW (cloud data warehouse) it's attached to. The exact answer depends on which CDW you are using, but it would typically involve an external table and an unload component.
For example, using Matillion ETL for Redshift, you would use an external table component followed by an unload component (the original answer showed screenshots of these two steps).
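For context, the SQL those two components would push down to Redshift looks roughly like this (the external schema is assumed to already exist via CREATE EXTERNAL SCHEMA, and the bucket, column and IAM role names are placeholders):

-- 1. External table over the folder of CSV files (Redshift Spectrum)
CREATE EXTERNAL TABLE spectrum_schema.january_10 (
    id      VARCHAR(64),
    payload VARCHAR(65535)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/January/10/';

-- 2. Unload the result to S3 as Parquet
-- (Redshift may split the output into several part files)
UNLOAD ('SELECT * FROM spectrum_schema.january_10')
TO 's3://my-bucket/parquet/January/10/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;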

'DBT docs generate' does not populate model column-level comments in the catalog

I use dbt-snowflake 1.1.0 with the corresponding dbt-core 1.1.0.
I added documentation for my models in YAML files, e.g.:
> models/stage/docs.yml
version: 2
models:
  - name: raw_weblogs
    description: Web logs of customer interaction, broken down by attribute (bronze). The breakdown is performed using regular expressions.
    columns:
      - name: ip
        description: IP address from which the request reached the server (might be direct customer IP or the address of a VPN/proxy).
      ...
These details show up correctly in the dbt UI when I run dbt docs generate and then dbt docs serve, yet they are not listed in target/catalog.json:
cat target/catalog.json | grep identity
(no results)
According to the DBT documentation, I understand that column comments should be part of catalog.json.
LATER EDIT: I tried running dbt --debug docs generate and it seems that all data is retrieved directly from the target environment (in my case, Snowflake). Looking at the columns of my model in Snowflake, they indeed do NOT have any comments posted on them in Snowflake.
It thus seems to me that the underlying error might be with the fact that dbt run does not correctly persist the column metadata to Snowflake.
After further investigation, I found the reason for the missing comments: dbt docs generate writes the comments to catalog.json based on what it receives from the database, while dbt docs serve populates the UI by combining catalog.json with metadata (in my case, the documented column comments) from the local dbt models.
The solution to persist such metadata in the database with dbt run was to add the following DBT configuration:
> dbt_project.yml
models:
  <project>:
    ...
    +persist_docs:
      relation: true
      columns: true
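If you prefer not to enable it project-wide, the same setting can be applied per model in a config block; a sketch (the file name and model body are illustrative):

> models/stage/raw_weblogs.sql
{{ config(persist_docs={"relation": true, "columns": true}) }}

select * from {{ source('weblogs', 'raw') }}  -- illustrative model body

Either way, re-run dbt run so the comments get persisted to Snowflake, and only then will dbt docs generate pick them up in catalog.json.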

How to best implement dynamic dbt datasets

I'm cleaning up a dbt + BigQuery environment and trying to implement a staging environment that pulls from a staging dataset. Problem is that the current .yml files with source information all explicitly point to a production dataset.
One option that I am considering is a source wrapper function that will serve as an adapter and inject the proper dataset depending on some passed CLI var or profile target (which is different for the staging vs prod environments).
However, I'm fairly new to dbt so unsure if this is the best way to go about this. Would appreciate any insight you kind folks have :)
EDIT: I'm realizing that a source wrapper is not the way to go as it would mess with the generated DAG
You can supply the name of the schema for a source in a variable or environment variable, and set that variable at runtime.
In your sources.yml:
version: 2
sources:
  - name: jaffle_shop
    schema: "{{ var('source_jaffle_shop_schema') }}"
    tables:
      - name: orders
In your dbt_project.yml:
vars:
  source_jaffle_shop_schema: MY_DEFAULT_SCHEMA
And then to override at runtime:
dbt run --vars "{source_jaffle_shop_schema: my_other_schema}"
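Since environment variables are mentioned as an alternative, the same thing can be done with env_var and a default value (the variable name here is an assumption):

sources:
  - name: jaffle_shop
    schema: "{{ env_var('DBT_JAFFLE_SHOP_SCHEMA', 'MY_DEFAULT_SCHEMA') }}"
    tables:
      - name: orders

And at runtime:
DBT_JAFFLE_SHOP_SCHEMA=my_other_schema dbt run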

How to use CETAS (Synapse Serverless Pool) in dbt?

In Synapse Serverless Pool, I can use CETAS to create an external table and export the results to Azure Data Lake Storage.
CREATE EXTERNAL TABLE external_table
WITH (
    LOCATION = 'location/',
    DATA_SOURCE = staging_zone,
    FILE_FORMAT = SynapseParquetFormat
)
AS
SELECT * FROM table
It will create an external table named external_table in Synapse and write a Parquet file to my staging zone in Azure Data Lake.
How can I do this in dbt?
I was trying to do something very similar and run my dbt project with Synapse Serverless Pool, but ran into several issues. Ultimately I was misled by CETAS. When you create the external table, it creates a folder hierarchy in which it places the Parquet file. If you ran a script like the one in your example a second time, it would fail because you cannot overwrite with CETAS. So dbt would be able to run it like any other model, but it wouldn't be easy to overwrite. Maybe you could dynamically create a new Parquet file every time the script runs and delete the old one, but that seems like putting a small bandage on the hemorrhaging wound that is the Synapse and serverless pool interaction. I had to switch up my architecture for this reason.
I was trying to export as Parquet to preserve the column datatypes and descriptions so I didn't have to re-schematize, and also so I could create tables based on incremental points in my pipeline. I ended up finding a way to pull from a database that already had the datatype schemas, using the dbt-synapse adapter. Then, if I needed an incremental table, I could materialize it as a table via dbt and dbt-synapse and access it that way.
What is your goal with the exported parquet file?
Maybe we can find another solution?
Here's the dbt-synapse-serverless adapter github where it lists caveats for serverless pools.
I wrote a materialization for CETAS (Synapse Serverless Pool) here: https://github.com/intheroom/dbt-synapse-serverless
It's forked from dbt-synapse-serverless here: https://github.com/dbt-msft/dbt-synapse-serverless
You can also use hooks in dbt to run CETAS.
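As a rough illustration of the hook approach (not a tested recipe): a model-level post-hook could issue the CETAS statement after dbt builds the model. The external data source and file format are assumed to already exist, and the no-overwrite caveat above still applies, so re-runs would need the external table dropped or the location changed first.

> models/export_weblogs.sql (sketch)
{{ config(
    materialized='view',
    post_hook=[
        "CREATE EXTERNAL TABLE export_weblogs_ext
         WITH (
             LOCATION = 'exports/weblogs/',
             DATA_SOURCE = staging_zone,
             FILE_FORMAT = SynapseParquetFormat
         )
         AS SELECT * FROM {{ this }}"
    ]
) }}

select * from some_source_table  -- illustrative model body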

When I run snowflake stage query I get aws error

I've created an S3-linked stage on Snowflake called csv_stage with my AWS credentials, and the creation was successful.
Now I'm trying to query the stage like below
select t.$1, t.$2 from @sandbox_ra.public.csv_stage/my_file.csv t
However the error I'm getting is
Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.]
Any idea why? Do I have to pass something in the query itself?
Thanks for your help!
Ultimately let's say my s3 location has 3 different csv files. I would like to load each one of them individually to different snowflake tables. What's the best way to go about doing this?
Regarding the last part of your question: You can load multiple files with one COPY INTO command by listing the file names or using a regex pattern. But as you have 3 different files for 3 different tables, you also need three separate COPY INTO commands.
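A sketch of what that looks like with the stage from the question (the target table names and the second file name are placeholders):

-- one COPY INTO per target table; FILES or PATTERN selects the right file from the stage
COPY INTO sandbox_ra.public.table_one
  FROM @sandbox_ra.public.csv_stage
  FILES = ('my_file.csv')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

COPY INTO sandbox_ra.public.table_two
  FROM @sandbox_ra.public.csv_stage
  PATTERN = '.*second_file.*[.]csv'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);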
Regarding querying your stage, you can find some more hints in these questions:
Missing List-permissions on AWS - Snowflake - Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.] and
https://community.snowflake.com/s/question/0D50Z00008EKjkpSAD/failure-using-stage-area-cause-access-denied-status-code-403-error-code-accessdeniedhow-to-resolve-this-error
https://aws.amazon.com/de/premiumsupport/knowledge-center/access-key-does-not-exist/
I found out the AWS credential I provided was not right. After fixing that, the query worked.
This approach works to import data into a Snowflake table from a public S3 bucket:
COPY INTO SNOW_SCHEMA.table_name FROM 's3://test-public/new/solution/file.csv'
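If the CSV doesn't match Snowflake's defaults, a FILE_FORMAT clause can be added; a minimal sketch:

COPY INTO SNOW_SCHEMA.table_name
  FROM 's3://test-public/new/solution/file.csv'
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);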