Multiple Models Reference Single Schema - google-bigquery

I've just started building our BigQuery data warehouse transformations with dbt, and so far I love every bit of it.
There is a requirement to incrementally merge 3 source tables into one final fact table.
Example: sale, deliverynotes, invoices -----> to be merged into the sale_transactions table.
When I created two models pointing to the same table, I got this error:
Compilation Error
dbt found two resources with the database representation "`iprocure-edw`.`iprocure_wh_marts`.`sale_transactions`".
dbt cannot create two resources with identical database representations. To fix this,
change the configuration of one of these resources:
- model.sales_marketing.int_sales_saletransactions (models/intermediate/sales/int_sales_saletransactions.sql)
- model.sales_marketing.int_invoiced_saletransactions (models/intermediate/sales/int_invoiced_saletransactions.sql)
How can one achieve this requirement?

Related

How to create sharded table in GCP BigQuery

As we started working with GCP BigQuery, our code has to retrieve data from a so-called sharded table in a dataset. This table group appears under a name like sometablename_(3000), with its own icon. The number in parentheses is the total count of dated tables created in the dataset so far; other publishers add tables every day, so the count increases daily. Our code uses a wildcard query to limit the date range it reads from this table, which works fine. The only other option we see when creating a table from the console is a partitioned table, which is represented differently.
But the question we are curious about is: how are these tables getting created daily in the first place? When we manually tried creating another table with the same name format, it was created as a separate table but ended up in this group. The documentation may cover this, but we can't find any reference.
So any help in understanding this background is appreciated.
BigQuery groups tables into a sharded table automatically once it finds tables that share the following characteristics:
Exist in the same dataset
Have the exact same table schema
The same prefix
Have a suffix of the form _YYYYMMDD (e.g. _20210130)
You can find additional information about sharded tables in the official documentation, under "Partitioning versus sharding".
So, if I create 3 tables named Business_YYYYMMDD, they will be grouped once the UI is refreshed:
* Business_(3)
- Business_20211201
- Business_20211202
- Business_20211203
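The grouping rule can be imitated in a few lines of Python. This is only a sketch of the naming convention described above: it checks the shared prefix and the _YYYYMMDD suffix, and it assumes the tables already live in the same dataset with the same schema. The function name `group_shards` and the sample table list are made up for illustration.

```python
import re
from collections import defaultdict

# Matches names such as "Business_20211201": a prefix, an underscore, 8 digits.
SHARD_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<date>\d{8})$")


def group_shards(table_names):
    """Group table names by prefix when they end in _YYYYMMDD."""
    groups = defaultdict(list)
    for name in table_names:
        match = SHARD_PATTERN.match(name)
        if match:  # names without a date suffix are left ungrouped
            groups[match.group("prefix")].append(name)
    return {prefix: sorted(names) for prefix, names in groups.items()}


# Hypothetical dataset listing: three shards plus one ordinary table.
tables = ["Business_20211201", "Business_20211202", "Business_20211203", "customers"]
grouped = group_shards(tables)
for prefix, names in grouped.items():
    print(f"{prefix}_({len(names)})")
```

The ordinary table is skipped because it has no date suffix, which mirrors how only the dated tables show up under the group in the UI.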
If I want to query those tables, I can either go through the UI and select the table:
# UI under schema tab
BUSINESS_20211203 2021-12-03 v # Filter tables under the shard
Table schema
...
Or go directly to the query editor, compose a new query, and run it:
SELECT * FROM `my-project-id.my-dataset.Business_20211203` LIMIT 1
So if publishers or your organization create tables inside the same dataset that fit the conditions mentioned at the top, those tables will be grouped.
As for querying these groups, Google recommends partitioning instead of sharding. The official documentation describes the process of converting sharded tables into a partitioned table.
I also found a post that compares the two approaches.
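For the date-range read mentioned in the question, a wildcard query over the shards could be built like this. It is a sketch only: the helper `build_wildcard_query` is hypothetical and just assembles a Standard SQL string using the `_TABLE_SUFFIX` pseudo-column, and the project, dataset and prefix are the placeholder names used above.

```python
from datetime import date


def build_wildcard_query(project, dataset, prefix, start, end):
    """Return a Standard SQL query over the shards dated start..end (inclusive)."""
    # Backticks are required because the project ID contains hyphens.
    table = f"`{project}.{dataset}.{prefix}_*`"
    return (
        f"SELECT * FROM {table} "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )


query = build_wildcard_query(
    "my-project-id", "my-dataset", "Business", date(2021, 12, 1), date(2021, 12, 3)
)
print(query)
```

Filtering on `_TABLE_SUFFIX` limits which shards are scanned, so the query does not read every table in the group.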

Will replacing merge statements over several tables in a data vault model with a conditional insert all reduce ingest time?

I am loading data on daily basis into a data vault model on Snowflake data warehouse.
I have split the ingest script (a JavaScript procedure) into 3 main parts for logging purposes:
1. Getting data into a temporary table
2. Metadata part, where I add data into several hubs and links that hold the metadata of each row (location, data entry name…)
3. Loading the main data, holding indicators and their values, into specific hubs and satellites.
Here is the average loading time of each part of a data file having around 2000 rows with ~300k indicator values:
3 to 5 seconds for adding data from stage into temporary table
19 to 25 seconds for adding metadata into 9 hubs, satellites and links
3 to 5 seconds for adding 2000 rows into a hub and then 300k values into sat and link.
For part 2, the merge statement takes the same time whether or not anything needs to be inserted.
This puzzles me: loading thousands of records takes a few seconds, while merging into a few hubs (tables), which inserts only when a value is not already there, takes far longer.
Can I replace all the merge statements for the tables in part 2 with one conditional multi-table insert:
insert all
into table1 … where…
into table2 … where …
Could a conditional insert, with conditions similar to the merge statement's WHEN NOT MATCHED clause, reduce the ingest time, considering that the WHERE clause for each table would contain a SELECT subquery to ensure existing data is not added again?
I was reading this article on optimizing loads into a data vault model, along with its related scripts on GitHub, but I am still not sure it would reduce ingest time efficiently.
Admirable as Galavan's article is, it comes with some fatal flaws around loading to the same hub in the case of same-as links or hierarchical links: you will load duplicates. I would discourage you from using multi-table inserts (MTI) to load hubs, links and satellites. For analysis and testing on this, please visit: https://patrickcuba.medium.com/data-vault-test-automation-52c0316e8e3a
That's not to say MTIs don't have a place in DV; they do! In the case of loading logarithmic PIT structures, absolutely! An in-depth article on this is published here: https://patrickcuba.medium.com/data-vault-pit-flow-manifold-e2b68df26628
Now, the merge vs. insert conversation in particular should not be in a Data Vault 2.0 vocabulary, because DV2.0 is INSERT-ONLY. I wrote another piece on that, focusing on hashing, which has a segment discussing what happens at the micro-partition level in Snowflake that should discourage you from using MERGE INTO: https://patrickcuba.medium.com/data-vault-2-0-on-snowflake-5b25bb50ed9e
Seeing as you are building out your own DV automation tool, these two blogs are worth a read too:
https://patrickcuba.medium.com/you-might-be-doing-datavault-wrong-888e9b0fa07d
https://medium.com/snowflake/decided-to-build-your-own-data-vault-automation-tool-a9a6273b9f9b
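To make the insert-only idea concrete, here is a minimal sketch of loading a hub with a plain INSERT that skips keys already present. It runs on SQLite through Python's `sqlite3` module purely as a stand-in for Snowflake, and the table and column names (`stage`, `hub_location`, `location_code`, `load_ts`) are invented for the example. It is not taken from the linked articles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage (location_code TEXT);
    CREATE TABLE hub_location (location_code TEXT, load_ts TEXT);
    INSERT INTO hub_location VALUES ('NBO', '2024-01-01');
    INSERT INTO stage VALUES ('NBO'), ('MBA'), ('MBA');
""")

# Insert-only load: add only the business keys the hub does not hold yet.
# DISTINCT guards against duplicates inside the staged batch itself.
conn.execute("""
    INSERT INTO hub_location (location_code, load_ts)
    SELECT DISTINCT s.location_code, '2024-01-02'
    FROM stage AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM hub_location AS h
        WHERE h.location_code = s.location_code
    )
""")

hub_keys = [
    row[0]
    for row in conn.execute(
        "SELECT location_code FROM hub_location ORDER BY location_code"
    )
]
print(hub_keys)
```

Each hub gets its own statement here, rather than one multi-table insert, which keeps the duplicate check local to the table being loaded.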

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website corresponds to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we will eventually end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Batch jobs that we run every 10 minutes would then perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add 1 partitioning column and up to 4 clustering columns (the maximum BigQuery allows) as well. Eventually you could even use yearly suffixed tables. This way you have several dimensions along which to limit the number of rows scanned for rematerialization.
The other option would be to change your model and add an event-processing middle layer. You could first route all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter for the columns you want to insert into the BQ table.
With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations step and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
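The routing logic in those steps can be sketched in plain Python. This is not Apache Beam code and does not use the Dynamic Destinations API; it only imitates the flow (flatten, route per event type, set aside rows whose columns the table does not know yet). The event shape with `type` and `payload` keys, and the `known_columns` mapping, are assumptions made for the example.

```python
from collections import defaultdict


def flatten(event, parent=""):
    """Flatten nested dicts into a single level, joining keys with '_'."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def route(events, known_columns):
    """Split events per destination table; divert rows with unknown columns."""
    accepted, failed = defaultdict(list), defaultdict(list)
    for event in events:
        table = event["type"]  # hypothetical field naming the destination table
        row = flatten(event["payload"])
        fits = set(row) <= known_columns.get(table, set())
        (accepted if fits else failed)[table].append(row)
    return accepted, failed


events = [
    {"type": "signup", "payload": {"user": {"id": 1}}},
    {"type": "signup", "payload": {"user": {"id": 2, "plan": "pro"}}},
]
accepted, failed = route(events, {"signup": {"user_id"}})

# Columns seen only in failed rows are the ones the schema update must add.
new_columns = {col for row in failed["signup"] for col in row} - {"user_id"}
print(sorted(new_columns))
```

In a real pipeline the failed rows would be written out per event type and reloaded after the schema update, as the last two steps describe.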

Bigquery caching when hitting table would provide a different result?

As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and creates more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting BigQuery's internal cache, even though running the query against the underlying table would provide a different result. This caching also occurs if I run the query in the web interface of the Google Cloud console.
If I specify for the query not to cache using the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that BigQuery's automatic caching would not occur if running the query against the underlying table would provide a different result. Is this assumption incorrect (in which case, when does caching occur?), or is this a bug?
The answer to your question is that your approach is conceptually wrong. You always need to set the no-cache parameter if you don't want cached data. Even the web UI has an option you need to use. The default is to use the cached version.
But fundamentally, you should change the process and use a more recent feature:
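For reference, the same switch exists in the REST API as the `useQueryCache` field of a `jobs.query` request body. The sketch below only builds that body as a dictionary; the helper name `build_query_request` is made up, and sending the request (authentication, HTTP client) is left out.

```python
import json


def build_query_request(sql, use_cache=False):
    """Build a jobs.query request body with caching explicitly controlled."""
    return {
        "query": sql,
        "useLegacySql": True,  # the query below uses legacy [dataset.table] syntax
        "useQueryCache": use_cache,
    }


body = build_query_request(
    "SELECT table_id FROM [dataset.__TABLES_SUMMARY__] "
    "WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1"
)
print(json.dumps(body, indent=2))
```

Setting the flag to false on every run of the check is what `queryRequest.setUseQueryCache(false)` does in the Java client mentioned in the question.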
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need a cron job, as the missing tables are created automatically.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables
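As a rough illustration of the `templateSuffix` parameter, the body of an `insertAll` request could be assembled like this. The helper `build_insert_all_request`, the suffix value and the way `insertId` is derived are assumptions for the example; only the field names `templateSuffix`, `rows`, `insertId` and `json` come from the API.

```python
def build_insert_all_request(rows, suffix):
    """Build an insertAll body that targets <template table name><suffix>."""
    return {
        "templateSuffix": suffix,
        "rows": [
            # insertId here is a simple illustrative value, not a recommendation.
            {"insertId": f"{suffix}-{index}", "json": row}
            for index, row in enumerate(rows)
        ],
    }


request_body = build_insert_all_request(
    [{"user_id": 1, "action": "click"}, {"user_id": 2, "action": "view"}],
    "_20240101",
)
print(request_body["templateSuffix"], len(request_body["rows"]))
```

The request is sent against the template table, and BigQuery creates the suffixed table from the template's schema if it does not exist yet.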

Add attribute to cube and reprocess without original database

Every year we keep a historical copy of one of our cubes. This year someone decided they wanted to pay us money to add an attribute to the cube which did not previously exist. Fine, I like money, but the issue is that we don't have a backup of the database that we built this cube from.
So a question arises: do we need that original database to add a new attribute to this cube? Is it possible to add a new attribute to the cube and process only this attribute without having the cube's original data source?
Not having a great understanding of what happens under the hood when I add an attribute to an SSAS cube and process it, I can't say whether this is possible. I could imagine that the cube has a snapshot in memory of the data source that it can work from. I can also imagine that this would be ridiculously inefficient, so there is a chance this is not possible at all.
EDIT: It at least would seem feasible to add a calculated member that makes use of existing data in the cube.
I also should mention that I tried to add an attribute to such a cube and received an error:
"Dimension [Partner] cannot be saved File system error failed to copy file C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore to C:\\MYSQLSERVER\OLAP\DATA\2013_Cube.db\\.dim\.dstore file exists"
Sorry I faked those filepaths a little.
This task is very difficult. The only way I can imagine would be to manually reconstruct the original database based on the Data Source View (it has cached metadata), and then try to generate the data to populate it using a SSAS query tool (e.g. Excel, SSRS, OLE DB Provider for Analysis Services).
If you want to add one attribute in a dimension, you might be able to limit that effort to the source data for the dimension in question.
First, let me explain, step by step, how a cube stores its data.
Get the data source. That is, get access to the original databases, files, etc. At this point all the data is at the primary source and is normalized one way or another.
Construct a data warehouse (ELT process). At this point you combine all your data in a denormalized warehouse, without foreign keys or any constraints. All data is now in an intermediate state in a denormalized SQL database and ready to be used in the cube.
Construct the OLAP cube. The data warehouse is now your data source. All data is now aggregated in rows inside the cube with the corresponding values. The redundancy is enormous and the data is 100% denormalized; it hardly follows a pattern (of course it does, but the pattern is not always easy to understand).
An example at this stage would be a row like this:
Company -> Department -> Room | Value(Employees)
ET LTD -> IT -> Room 4 | 4
The exact same row would exist for Value(Revenue).
So in essence all data exist inside the SSAS Database (The cube).
Reconstructing the database would mean a great deal of reverse engineering.
You could write a new C# program using MDX connectors and queries to get the data, and MS SQL connectors to save it in an OLTP database. MDX has a steep learning curve and few resources online, so this method is not advisable.
There is no way that I know of to get the data from Excel, as Excel gets the pivot table data dynamically from the data connection.