Creation of Table using DBT - dbt

Can we create a new table in DBT?
Can we copy the table structure which is present in the dev environment in the database to another environment using DBT?

Yes. However, dbt needs a "reason" to create a table, for example to materialize the data produced by one of its models. dbt cannot create a table just for the sake of creating it.
Strictly speaking, you can do this by putting CREATE TABLE ... in a pre-hook or post-hook, but I suppose this is not what you want, since dbt adds nothing in that case.
You can define your existing table as a source, where the database, schema and table name can differ from the target location where dbt writes data. Then define a model something like:
{{ config(materialized="table") }}
select *
from {{ source('your_source', 'existed_table_name') }}
limit 1 /* add "limit 1" if you only want the structure */
Put the necessary connection credentials in profiles.yml and build the model. dbt will copy one row from the source table into the model table, and the model table itself gets created as a side effect.
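As a rough illustration of why this works: dbt's table materialization wraps the model's select in a CREATE TABLE ... AS statement, so the new table takes its columns from the query. The sketch below imitates that with Python's built-in sqlite3 module standing in for the warehouse. The helper name copy_structure and the columns id and name are hypothetical, and a row limit of 0 is shown as a variant that copies no rows at all.

```python
import sqlite3

def copy_structure(conn, source, target, row_limit=1):
    """Create `target` from `source` with a CREATE TABLE AS statement.

    row_limit=1 mirrors the model above; row_limit=0 copies the columns only.
    """
    conn.execute(
        f"CREATE TABLE {target} AS SELECT * FROM {source} LIMIT {int(row_limit)}"
    )
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({target})")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE existed_table_name (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO existed_table_name VALUES (?, ?)", [(1, "a"), (2, "b")]
)

columns = copy_structure(conn, "existed_table_name", "model_table", row_limit=1)
row_count = conn.execute("SELECT COUNT(*) FROM model_table").fetchone()[0]
print(columns, row_count)
```

Note that in SQLite a CREATE TABLE AS copies column names and declared types but not constraints or indexes, so treat the result as a structural copy of the columns only.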

Related

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another within the same Azure SQL db. So far, I have created a lookup pipeline and passed the parameters to the ForEach loop and the Copy activity. But my sink dataset is not taking the parameter value I have given in the "table option" field; instead it is taking the dummy table I chose when creating the sink dataset. Can someone tell me how I can pass a dynamic table name to a sink dataset?
I have given concat('dest_schema.STG_',#{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded, variables, parameters, or expressions, so very flexible.
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

DBT select Big Query table from different Google Project

I am using DBT to read and write tables in Big Query, all running in my Google project X.
I have one table which I want to read in from a different Google project Y and put in a DBT model (which will then be saved as a table in project X).
Is it possible to do? And if yes, where do I define the different project in FROM {{ source('dataset_project_y', 'table_to_read')}}?
First, you need to declare the source in a source.yml file.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#declaring-a-source
For example, create a source_y.yml:
sources:
  - name: dataset_project_y
    schema: dataset_y
    database: 'project_y'
    tables:
      - name: table_to_read
        identifier: table_to_read
After that, you can refer to the source table_to_read in any dbt model and select from it in the model's SQL statement.
https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#selecting-from-a-source
For example, to use table_to_read in dbt_model_x.sql:
{{
  config(
    materialized = "view",
  )
}}
SELECT * FROM {{ source('dataset_project_y', 'table_to_read') }}
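To make the mapping concrete: on BigQuery, dbt's database corresponds to the Google project and schema to the dataset, so the source() call should compile to a fully qualified table name in project Y. The sketch below is a simplified, hypothetical imitation of that lookup in Python, not dbt's actual implementation. The SOURCES dict and resolve_source function are made up for illustration.

```python
# Simplified stand-in for the parsed source_y.yml (assumption: plain dicts).
SOURCES = {
    "dataset_project_y": {
        "database": "project_y",   # BigQuery project
        "schema": "dataset_y",     # BigQuery dataset
        "tables": {"table_to_read": {"identifier": "table_to_read"}},
    }
}

def resolve_source(source_name, table_name, default_database="project_x"):
    """Return the backtick-quoted relation a source() call would point at."""
    source = SOURCES[source_name]
    table = source["tables"][table_name]
    # Without an explicit database, the target project is used.
    database = source.get("database", default_database)
    # Schema falls back to the source name, identifier to the table name.
    schema = source.get("schema", source_name)
    identifier = table.get("identifier", table_name)
    return f"`{database}`.`{schema}`.`{identifier}`"

print(resolve_source("dataset_project_y", "table_to_read"))
# -> `project_y`.`dataset_y`.`table_to_read`
```

The model itself is still written to project X, because only the source declaration points at project Y.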

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column, you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table.
Or export your table data to Cloud Storage, delete the data corresponding to the column (or columns) you want to remove, and then load the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table.
There is a guide published for Manually Changing Table Schemas.
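The SELECT * EXCEPT option can be generated mechanically. The helper below is a minimal sketch that only builds the query text. The function name build_drop_columns_query and the column name obsolete_col are hypothetical, and running the query with a destination table (or overwrite) is left to the Console or the client of your choice.

```python
def build_drop_columns_query(table, columns_to_drop):
    """Build a query that selects every column except the given ones."""
    if not columns_to_drop:
        raise ValueError("at least one column to drop is required")
    excluded = ", ".join(columns_to_drop)
    return f"SELECT * EXCEPT ({excluded}) FROM `{table}`"

query = build_drop_columns_query("project.dataset.table", ["obsolete_col"])
print(query)
# -> SELECT * EXCEPT (obsolete_col) FROM `project.dataset.table`
```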
Edit:
In order to change a Partitioned table to a Non-partitioned table, you can use the Console to query your data and overwrite your current table or copy to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table,
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
The query above reads the data across all of the table's partitions and adds an extra column showing which partition each row came from. Before executing it, you have two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset, and type your new table's name > under Destination table write preference, check Write if empty.
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > type the name of the table you want to overwrite > under Destination table write preference, check Overwrite table.

Using SSIS Package, How to validate the source records for duplicate before inserting?

SQL Server 2012: using an SSIS package, how can we validate the source records for duplicates before inserting?
Our source file is a .csv. We are seeing duplicate records loaded into the staging table.
At present, we follow a manual process for loading data.
How can we validate the source file data against the destination table before loading, and load only the valid records? Duplicates can arise not only because the source file contains duplicate records, but also because the same file is reloaded into the staging table.
We do not truncate the staging table. We keep the records as is.
Second question: how can we pick up the name of the source file and pass it into the load? Possibly with a derived column such as "FileName" that gets loaded along with the raw data into the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS, run a SQL Task with TRUNCATE TABLE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
    (PrimaryKey, Column1, Column2, Column3)
SELECT
    PrimaryKey, Column1, Column2, Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey = SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component, but I strongly discourage its use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'.
Note that in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result.
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
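The load pattern above can be tried end to end without SSIS. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the staging table is cleared with DELETE (SQLite has no TRUNCATE TABLE). The two-column table layout and the load_file helper are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StagingTable (PrimaryKey INTEGER, Column1 TEXT);
    CREATE TABLE FinalTable   (PrimaryKey INTEGER PRIMARY KEY, Column1 TEXT);
    INSERT INTO FinalTable VALUES (1, 'already loaded');
""")

def count_final(conn):
    return conn.execute("SELECT COUNT(*) FROM FinalTable").fetchone()[0]

def load_file(conn, rows):
    """Clear staging, load the rows, then insert only keys missing from FinalTable."""
    before = count_final(conn)
    conn.execute("DELETE FROM StagingTable")  # stand-in for TRUNCATE TABLE
    conn.executemany("INSERT INTO StagingTable VALUES (?, ?)", rows)
    conn.execute("""
        INSERT INTO FinalTable (PrimaryKey, Column1)
        SELECT PrimaryKey, Column1
        FROM StagingTable SRC
        WHERE NOT EXISTS (
            SELECT 1 FROM FinalTable TGT WHERE TGT.PrimaryKey = SRC.PrimaryKey
        )
    """)
    return count_final(conn) - before  # number of rows actually added

file_rows = [(1, "changed value"), (2, "new row")]
first = load_file(conn, file_rows)
second = load_file(conn, file_rows)  # reloading the same file adds nothing
print(first, second)
```

Key 1 keeps its original value, which matches the note that this pattern only adds new rows. Duplicates inside a single file are not filtered by NOT EXISTS alone, because both copies are absent from the final table when the statement runs; those need to be removed in staging first.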
Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? For most of the data loads I have dealt with, I designed a For Each Loop container in which the file name/path is populated in a user variable. As you said, you could just use a Derived Column transform to add a new column that gets its value from a variable. If you don't have the file name in a user variable, you could use an Expression Task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.
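The file-name check can be sketched as follows, again with Python's sqlite3 standing in for the real database. The LoadedFiles table, the should_load helper and the example path are hypothetical; in SSIS the equivalent lookup would live in the "Execute SQL" task at the start of the package.

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect(":memory:")
# Hypothetical log table: one row per file that has already been staged.
conn.execute("CREATE TABLE LoadedFiles (FileName TEXT PRIMARY KEY)")

def should_load(conn, file_path):
    """Return True and record the file if its name has not been loaded before."""
    file_name = Path(file_path).name
    seen = conn.execute(
        "SELECT 1 FROM LoadedFiles WHERE FileName = ?", (file_name,)
    ).fetchone()
    if seen:
        return False
    conn.execute("INSERT INTO LoadedFiles (FileName) VALUES (?)", (file_name,))
    return True

first_attempt = should_load(conn, "/data/in/customers_2020_01.csv")
second_attempt = should_load(conn, "/data/in/customers_2020_01.csv")
print(first_attempt, second_attempt)
```

This only guards against reloading a file with the same name; a renamed copy of the same data would still pass the check.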
Two years ago I faced the same problem when importing TSV files.
I tried many other solutions, but the best I could design was a C# script for this validation.
What I did as a solution:
Create a C# DataTable object in memory with primary key constraints,
like:
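DataColumn[] keyColumn = new DataColumn[30]; // size this to the actual number of key columns
keyColumn[intJ] = dtFilterdPK.Columns["Column name"]; // repeat for each key column index intJ
dtFilterdPK.PrimaryKey = keyColumn; // register the key so that duplicate rows raise an error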
Then try to add the rows from your CSV to this DataTable one by one.
Whenever a row duplicates an existing primary key, adding it raises an error.
Handle this error in a try..catch block and log the duplicate according to your logging requirements.
Do not import those error records into the DataTable object.
Finally, import the data into your table with a bulk copy,
like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
    bulkCopy.DestinationTableName = "Your DB Table Name"; // Assign table name
    bulkCopy.WriteToServer(dtToBeImport); // Write into actual table
}
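The same idea can be sketched outside C#. The Python version below swaps the DataTable primary key constraint for a plain set of keys already seen, which is a different mechanism with the same effect: the first row per key is kept and later duplicates are set aside for logging. The function name split_duplicates and the sample data are made up for the example.

```python
import csv
import io

def split_duplicates(csv_text, key_columns):
    """Return (unique_rows, duplicate_rows), keeping the first row seen per key."""
    seen_keys = set()
    unique_rows, duplicate_rows = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row[col] for col in key_columns)
        if key in seen_keys:
            duplicate_rows.append(row)  # log or report these instead of importing
        else:
            seen_keys.add(key)
            unique_rows.append(row)
    return unique_rows, duplicate_rows

sample = "id,name\n1,alice\n2,bob\n1,alice again\n"
unique_rows, duplicate_rows = split_duplicates(sample, ["id"])
print(len(unique_rows), len(duplicate_rows))
# -> 2 1
```

Only unique_rows would then be handed to the bulk import step.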
Hope this will help you.

Doctrine schema changes while keeping data?

We're developing a Doctrine backed website using YAML to define our schema. Our schema changes regularly (including fk relations) so we need to do a lot of:
Doctrine::generateModelsFromYaml(APPPATH . 'models/yaml', APPPATH . 'models', array('generateTableClasses' => true));
Doctrine::dropDatabases();
Doctrine::createDatabases();
Doctrine::createTablesFromModels();
We would like to keep existing data and store it back in the re-created database. So I copy the data into a temporary database before the main db is dropped.
How do I get the data from the "old-scheme DB copy" to the "new-scheme DB"? (the new scheme only contains NEW columns, NO COLUMNS ARE REMOVED)
NOTE:
This obviously doesn't work because the column count doesn't match.
SELECT * FROM copy.Table INTO newscheme.Table
This obviously does work, however this is consuming too much time to write for every table:
SELECT old.col, old.col2, old.col3,'somenewdefaultvalue' FROM copy.Table as old INTO newscheme.Table
Have you looked into Migrations? They allow you to alter your database schema programmatically, without losing data (unless you remove columns, of course).
How about writing a script (using the Doctrine classes for example) which parses the yaml schema files (both the previous version and the "next" version) and generates the sql scripts to run? It would be a one-time job and not require that much work. The benefit of generating manual migration scripts is that you can easily store them in the version control system and replay version steps later on. If that's not something you need, you can just gather up changes in the code and do it directly through the database driver.
Of course, the fancier your schema changes become (column renames, null to not null, etc.), the harder the maintenance will get.
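A script of the kind suggested above could look roughly like this. It is a minimal sketch in Python rather than PHP, and it assumes the two YAML schema versions have already been parsed into plain dicts of the form {table: {column: sql_type}}. The function name generate_migration, the article table and the column types are hypothetical. It only handles added columns, which matches the case in the question where no columns are removed.

```python
def generate_migration(old_schema, new_schema):
    """Return ALTER TABLE statements for columns that exist only in new_schema."""
    statements = []
    for table, new_columns in new_schema.items():
        if table not in old_schema:
            continue  # brand-new tables need CREATE TABLE, out of scope here
        old_columns = old_schema[table]
        for column, sql_type in new_columns.items():
            if column not in old_columns:
                statements.append(
                    f"ALTER TABLE {table} ADD COLUMN {column} {sql_type};"
                )
    return statements

old = {"article": {"id": "INT", "title": "VARCHAR(255)"}}
new = {"article": {"id": "INT", "title": "VARCHAR(255)", "summary": "TEXT"}}
statements = generate_migration(old, new)
for statement in statements:
    print(statement)
# -> ALTER TABLE article ADD COLUMN summary TEXT;
```

Because the statements alter the tables in place, the existing data stays where it is and no copy into a temporary database is needed. The generated SQL can be stored in version control alongside the schema files.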