Any way to monitor the status of materialized views in dbt?

Is there any way to know the refresh status of materialized views? I want to figure out how to track whether the materialized view refresh was successful.

Views, by definition, are not refreshed as such: they always contain the latest data available in the source. For example, if you were to query a staging model that is materialised as a view and looks like the following:
-- This is your staging model, materialised as a view
{{ config(materialized='view') }}
select * from {{ source('your_crm', 'orders') }}
You would get the freshest data from the source, even if it is an order (for the sake of the example) that was created five minutes ago, as long as this order already appears in your source table.
So, long story short, you can always confirm this by querying any of your materialized views and checking what data is available in their source.
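As a minimal sketch of such a check, assuming both the staging view and its source expose an updated_at timestamp (the model and source names below are placeholders, not from the question):

-- Compare the freshest row visible through the view against the source.
-- Matching timestamps confirm the view is serving the latest data.
select
    (select max(updated_at) from analytics.stg_orders) as view_max_updated_at,
    (select max(updated_at) from your_crm.orders)      as source_max_updated_at;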

Related

Why is a query on a VIEW executing faster than a query on a MATERIALIZED VIEW?

I have created two views of a table in a Snowflake database with the same SELECT statement; one is a normal view and the other is a materialized view, as below:
create view view1 as (
    select *
    from customer
    where name ilike 'a%'
);

create materialized view view2 as (
    select *
    from customer
    where name ilike 'a%'
);
Then I queried the views as below:
select * from view1;  -- normal view
select * from view2;  -- materialized view
(I suspended and resumed the warehouse to remove any cache before executing the above queries individually. I repeated the execution many times in the same manner.)
Against expectations, the materialized view always takes longer than the normal view.
Why is this?
It could be a number of things. Here is what I would suggest:
Ensure that the result cache is turned off
ALTER SESSION SET USE_CACHED_RESULT = FALSE
Run them in a warehouse that has been suspended for hours. In my experience, restarting the virtual warehouse does not completely delete cached data. Do not run the query while the warehouse is suspended; manually resume it before running the query so that provisioning the warehouse does not delay the query.
Run them and check the following columns in the QUERY_HISTORY view to get a better idea of what happened:
PERCENTAGE_SCANNED_FROM_CACHE
COMPILATION_TIME
QUEUED_REPAIR_TIME
TRANSACTION_BLOCKED_TIME
EXECUTION_TIME - I believe this holds the actual execution time, which excludes the time spent in compilation, as opposed to TOTAL_ELAPSED_TIME
QUEUED_OVERLOAD_TIME
See the QUERY_HISTORY documentation for more details; a query along the lines of the sketch below can pull these columns for the two test queries.
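As a minimal sketch, assuming the two test queries ran recently (SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and these columns are documented Snowflake features, though the view can lag behind real time; the ILIKE filter is just one way to find the two queries):

-- Pull the timing breakdown for recent queries touching the two views.
select query_text,
       total_elapsed_time,
       compilation_time,
       execution_time,
       queued_overload_time,
       queued_repair_time,
       transaction_blocked_time,
       percentage_scanned_from_cache
from snowflake.account_usage.query_history
where query_text ilike '%view1%'
   or query_text ilike '%view2%'
order by start_time desc;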
You might also want to check the Query Profile. I think the query using the materialized view would show a straightforward single-step retrieval, but it would still be worth checking in order to compare and understand both queries.

In Google BigQuery, how to denormalize tables when the data is from different 3rd party sources?

I have data about contacts in Salesforce. I also have data about the contacts in Intercom/Zendesk. I want to create a denormalized table where the data from Salesforce and Intercom is merged into a single table so I can query everything about a contact. Imagine I dumped the Salesforce data into a BigQuery table. The problem is that we might not dump Intercom/Zendesk data until later. So we may only add Salesforce data into a BigQuery table now, and later we may add Intercom data. My question is: how do we merge these (existing data in the Salesforce BQ table and new data from Intercom)? Assume that Email is the primary key in both 3rd party sources and we can join on it.
Do we need to take the Salesforce data out of the BQ table and run it through some tool to merge both tables and create a new table in BQ?
What will happen if we keep getting new data in both Salesforce and Intercom?
Your case seems to be a good use case for Views.
A view is basically a virtual table that points to a query. You can define a view based on a query (let's call it query_1) and then you will be able to see that view as a table. However, every time you run a query (let's call it query_2) using that view as a source, internally BigQuery will execute query_1 and then execute your query_2 against the results of query_1.
In your case, you could create a query that uses a JOIN to merge your tables, and save this query as a view. You can create a view by clicking Save view in the BigQuery console and then filling in some required fields before saving.
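As a hedged sketch of such a view, assuming one table per source keyed by email (all dataset, table, and column names here are placeholders, not from the question):

-- Merge Salesforce and Intercom contacts into one queryable view.
-- FULL OUTER JOIN keeps contacts that exist in only one source.
CREATE VIEW my_dataset.contacts_merged AS
SELECT
  COALESCE(sf.email, ic.email) AS email,
  sf.account_name              AS salesforce_account,
  ic.last_seen_at              AS intercom_last_seen
FROM my_dataset.salesforce_contacts AS sf
FULL OUTER JOIN my_dataset.intercom_contacts AS ic
  ON sf.email = ic.email;

Because it is a view, the Intercom side can simply be empty until that data is loaded; no re-merge step is needed when new rows arrive in either source.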
In BigQuery there are also Materialized Views, which implement some caching technology in order to make the view behave more like a table.
Some benefits of materialized views are:
Reduction in the execution time and cost for queries with aggregate functions. The largest benefit is gained when a query's computation cost is high and the resulting data set is small.
Automatic and transparent BigQuery optimization, because the optimizer uses a materialized view, if available, to improve the query execution plan. This optimization does not require any changes to the queries.
The same resilience and high availability as BigQuery tables.
To create a materialized view you have to run the command below (note the backticks, which BigQuery requires when the project ID contains a hyphen):
CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
AS <my-query>
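As a concrete hedged example that stays within materialized view restrictions by using a simple aggregation (all names are placeholders):

-- Pre-aggregate contact counts per source system; BigQuery keeps
-- this result incrementally up to date with the base table.
CREATE MATERIALIZED VIEW `project-id.my_dataset.contact_counts` AS
SELECT source_system, COUNT(*) AS contact_count
FROM `project-id.my_dataset.contacts`
GROUP BY source_system;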
Finally, I would like to point you to the reference documentation for both views and materialized views in BigQuery. I suggest that you take a look at it and decide which one fits your use case.
You can read more about querying Google Cloud Storage at https://cloud.google.com/bigquery/external-data-cloud-storage.
You can take the extracts and place them into Google Cloud Storage under buckets i.e. Salesforce bucket and Zendesk bucket.
Once the files are available, you can create external tables on those buckets (one table for each bucket) so that you can query them independently.
Once you can query them, you can perform joins as with normal tables.
You can replace the files in the buckets when new data arrives.
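As a hedged sketch of the external-table step (bucket name, dataset, format, and columns are all assumptions):

-- One external table per bucket; BigQuery reads the files at query
-- time, so replacing the files in the bucket refreshes the data.
CREATE EXTERNAL TABLE my_dataset.salesforce_contacts (
  email STRING,
  account_name STRING
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://salesforce-bucket/*.json']
);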

Google BigQuery - sync tables

I have 14 tables in BQ, which are updated several times a day.
Via a JOIN of three of them, I have created a new one.
My question is: will this new table be updated each time new data is pushed into the BQ tables it is based on? If not, is there a way to make this JOIN "live" so that the newly created table is updated automatically?
Thank you!
BigQuery also supports views, virtual tables defined by a SQL query.
BigQuery's views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.
BigQuery supports up to eight levels of nested views.
You can create a view or materialized view so that your required data set is always up to date, but a view queries the underlying tables each time, so beware of joining massive tables.
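As a hedged sketch, assuming the three tables share an id key (all names here are placeholders):

-- A logical view over the three-table join; every query against it
-- re-runs the join, so results always reflect the latest loaded data.
CREATE VIEW my_dataset.combined AS
SELECT a.id, a.col_a, b.col_b, c.col_c
FROM my_dataset.table_a AS a
JOIN my_dataset.table_b AS b ON a.id = b.id
JOIN my_dataset.table_c AS c ON a.id = c.id;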
For more complex table sync from/to BQ and other apps (two-way sync), I finally used https://www.stacksync.cloud/
It offers real-time updates and optionally two-way sync. Check it out too, for the less technical folks!

Replicate table from external database to internal

I need to replicate a table from an external DB to an internal DB for performance reasons. Several apps will use this local DB to do joins and compare data. I only need to replicate every hour or so, but if there is a performant solution, I would prefer to replicate every 5 to 10 minutes.
What would be the best way to replicate? The first thing that comes to mind is DROP and then CREATE:
DROP TABLE clonedTable;
CREATE TABLE clonedTable AS SELECT * from foo.extern#data.sourceTable;
There has to be a better way, right? Hopefully an atomic solution, to avoid the fraction of a second where the table doesn't exist but someone might try to query it.
The simplest possible solution would be a materialized view that is set to refresh every hour.
CREATE MATERIALIZED VIEW mv_cloned_table
REFRESH COMPLETE
START WITH sysdate + interval '1' minute
NEXT sysdate + interval '1' hour
AS
SELECT *
FROM foo.external_table@database_link;
This will delete all the data currently in mv_cloned_table, insert all the data from the table in the external database, and then schedule itself to run again an hour after it finishes (so the gap between refreshes will actually be 1 hour plus however long the refresh takes).
There are lots of ways to optimize this.
If the folks that own the source database are amenable to it, you can ask them to create a materialized view log on the source table. That would allow your materialized view to replicate just the changes, which should be much more efficient and would allow you to schedule refreshes much more frequently.
If you have the cooperation of the folks that own the source database, you could also use Streams instead of materialized views, which would let you replicate the changes in near real time (a lag of a few seconds would be common). That also tends to be more efficient on the source system than maintaining the materialized view logs would be. But it tends to take more admin time to get everything working properly; materialized views are much less flexible and less efficient but pretty easy to configure.
If you don't mind the table being empty during a refresh (it would exist, it would just have no data), you can do a non-atomic refresh on the materialized view, which does a TRUNCATE followed by a direct-path INSERT rather than a DELETE and a conventional-path INSERT. The former is much more efficient, but it means the table appears empty while you're doing joins and data comparisons on the local server, which seems unlikely to be appropriate in this situation.
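For reference, such a non-atomic complete refresh can be triggered manually with the standard DBMS_MVIEW package (a sketch; method => 'C' requests a complete refresh):

BEGIN
  DBMS_MVIEW.REFRESH(
    list           => 'MV_CLONED_TABLE',
    method         => 'C',
    atomic_refresh => FALSE  -- TRUNCATE + direct-path INSERT
  );
END;
/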
If you want to go down the path of having the source side create a materialized view log so that you can do an incremental refresh, then on the source side, assuming the source table has a primary key, you'd ask them to run:
CREATE MATERIALIZED VIEW LOG ON foo.external_table
WITH PRIMARY KEY
INCLUDING NEW VALUES;
The materialized view that you would create would then be:
CREATE MATERIALIZED VIEW mv_cloned_table
REFRESH FAST
START WITH sysdate + interval '1' minute
NEXT sysdate + interval '1' hour
WITH PRIMARY KEY
AS
SELECT *
FROM foo.external_table@database_link;

ORACLE - Materialized View LOG

I have a table with a materialized view log, and I would like to know if it is suspicious to have:
SELECT count(*) FROM Table;        -- 8036132 rows
SELECT count(*) FROM MLOG$_Table;  -- 81657998 rows
I'm asking this question because I get an error when trying to refresh my materialized view:
ORA-30036: unable to extend segment by 4 in undo tablespace 'UNDOTBS1'
I would like to know if something can be done other than extending the undo tablespace.
Thanks in advance.
Yes, that is suspicious.
You need materialized view logs to be able to do a fast refresh. A fast refresh is really an incremental refresh: a refresh that applies only the recent changes, to avoid having to do a complete refresh, which could be time-consuming. If your materialized view log contains ten times as many rows as your original table, that defeats the purpose of a fast refresh.
I'd first look into why this materialized view log contains this many rows. If you can avoid that, then your other problem, the ORA-30036, will likely disappear as well.
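One common cause worth checking (an assumption about your setup, not something the question confirms): log rows are only purged once every registered materialized view has refreshed past them, so a registered view that has stopped refreshing makes the log grow without bound. A sketch using standard Oracle dictionary views and the DBMS_MVIEW package:

-- Which materialized views are registered against this master table?
SELECT owner, name, snapshot_site
FROM   dba_registered_snapshots;

-- Purge log rows needed only by the least recently refreshed view.
BEGIN
  DBMS_MVIEW.PURGE_LOG(master => 'YOUR_TABLE', num => 1, flag => 'DELETE');
END;
/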