I have 14 tables in BQ, which are updated several times a day.
Via a JOIN of three of them, I have created a new table.
My question is: will this new table be updated each time new data is pushed into the BQ tables it is based on? If not, is there a way to make this JOIN "live" so that the newly created table is updated automatically?
Thank you!
BigQuery also supports views, virtual tables defined by a SQL query.
BigQuery's views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.
BigQuery supports up to eight levels of nested views.
You can create a view or materialized view so that the joined result always reflects the latest data. Keep in mind that a logical view re-queries the underlying tables every time it is read, so beware of joining massive tables.
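As a minimal sketch (the project, dataset, and table names below are hypothetical), a view over the three-table join could look like this; any query against the view sees the current contents of the base tables:

CREATE VIEW `my_project.my_dataset.joined_view` AS
SELECT a.id, a.col_a, b.col_b, c.col_c
FROM `my_project.my_dataset.table_a` a
JOIN `my_project.my_dataset.table_b` b ON b.id = a.id
JOIN `my_project.my_dataset.table_c` c ON c.id = a.id;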
For more complex table sync from/to BQ and other apps (two-way sync), I ended up using https://www.stacksync.cloud/
It offers real-time updates and even two-way sync. It may be worth a look for the less technical folks!
I have data about contacts in Salesforce. I also have data about the contacts in Intercom/Zendesk. I want to create a denormalized table where the data from Salesforce and Intercom is merged into a single table so I can query the contact. Imagine I dumped the Salesforce data into a BigQuery table. The problem is that we might not dump Intercom/Zendesk until later. So we may only add Salesforce data into a BigQuery table now, and later we may add Intercom data. My question is how to merge these (existing data in the Salesforce BQ table and new data from Intercom)? Assume that Email is the primary key in both 3rd party sources and we can join on it.
Do we need to take the Salesforce data out of the BQ table and run it through some tool to merge both tables and create a new table in BQ?
What will happen if we keep getting new data in both Salesforce and Intercom?
Your case seems to be a good use case for Views.
A view is basically a virtual table that points to a query. You can define a view based on a query (let's call it query_1) and you will then be able to see that view as a table. However, every time you run a query (let's call it query_2) using that view as a source, internally BigQuery will execute query_1 and then run query_2 against the results of query_1.
In your case, you could create a query that uses a JOIN to merge your tables and save this query as a view. You can create a view by clicking Save view in the BigQuery console and filling in a few required fields before saving.
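As a rough sketch (dataset, table, and column names are hypothetical), a view that merges the two sources on Email could use a FULL OUTER JOIN so that contacts present in only one source still appear:

CREATE VIEW `my_project.crm.contacts_merged` AS
SELECT
  COALESCE(s.email, i.email) AS email,
  s.account_name,   -- example Salesforce field
  i.last_seen_at    -- example Intercom field
FROM `my_project.crm.salesforce_contacts` s
FULL OUTER JOIN `my_project.crm.intercom_contacts` i
  ON i.email = s.email;

If the Intercom table does not exist yet, you can create the view later, once both sources have been loaded.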
In BigQuery there are also Materialized Views, that implements some cache technologies in order to make the view more similar to a table.
Some benefits of materialized views are:
- Reduction in the execution time and cost for queries with aggregate functions. The largest benefit is gained when a query's computation cost is high and the resulting data set is small.
- Automatic and transparent BigQuery optimization, because the optimizer uses a materialized view, if available, to improve the query execution plan. This optimization does not require any changes to the queries.
- The same resilience and high availability as BigQuery tables.
To create a materialized view you can run a command like the one below (the backticks around the table path are required when the project ID contains a hyphen):
CREATE MATERIALIZED VIEW `project-id.my_dataset.my_mv_table`
AS <my-query>
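As an illustration of what <my-query> could be (hypothetical dataset/table/column names; materialized views only support a restricted set of query shapes, essentially aggregations such as COUNT/SUM/MIN/MAX with GROUP BY):

CREATE MATERIALIZED VIEW `my_project.my_dataset.daily_contact_counts` AS
SELECT
  source,
  DATE(created_at) AS signup_day,
  COUNT(*) AS contact_count
FROM `my_project.my_dataset.contacts`
GROUP BY source, signup_day;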
Finally, both views and materialized views are covered in the official BigQuery documentation. I suggest that you take a look at both and decide which one fits your use case.
You can read more about querying Google Cloud Storage at https://cloud.google.com/bigquery/external-data-cloud-storage.
You can take the extracts and place them into Google Cloud Storage under separate buckets, i.e. a Salesforce bucket and a Zendesk bucket.
Once the files are available, you can create external tables on those buckets (one table per bucket) so that you can query them independently.
Once you can query them, you can perform joins just as with normal tables.
You can replace the files in the buckets when new data comes in.
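As a sketch (the bucket name, file format, and columns are assumptions), an external table over one of the buckets could be defined like this, and the second bucket would get a similar table:

CREATE EXTERNAL TABLE `my_project.crm.salesforce_ext` (
  email STRING,
  account_name STRING,
  updated_at TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-salesforce-bucket/*.csv'],
  skip_leading_rows = 1
);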
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're going to eventually result in tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, where each BigQuery table will just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes will perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
- you can use a partitioned table
- you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partitioning column and clustering columns as well (BigQuery allows up to four clustering columns). You could even use yearly suffixed tables. This way you have several dimensions to filter on, so only a limited number of rows are scanned when you rematerialize.
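A minimal sketch of such a table (names are hypothetical; the JSON payload is stored as a STRING here, since a native JSON type may not be available):

CREATE TABLE `my_project.events.raw_events` (
  event_ts TIMESTAMP,
  event_type STRING,
  user_id STRING,
  page STRING,
  payload STRING  -- JSON stored as a string
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id, page;

Queries that filter on the partition date and the clustering columns then scan only the matching partitions and blocks instead of the whole table.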
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with a new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns via rematerialization: you can rewrite the same table with a query that selects only the columns you want to keep. You can rematerialize to remove duplicate rows as well.
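For example (the column name is hypothetical), dropping a deprecated column by rewriting the table in place might look like this; note that the rewrite is billed as a full scan of the table:

CREATE OR REPLACE TABLE `my_project.events.raw_events` AS
SELECT * EXCEPT (deprecated_form_field)
FROM `my_project.events.raw_events`;

If the table is partitioned or clustered, re-declare PARTITION BY / CLUSTER BY in the CREATE statement, since they are not carried over automatically.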
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the pipeline would be:
- Read the event/JSON from Pub/Sub.
- Flatten the events and filter down to the columns you want to insert into the BQ table.
- With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
- Get the failed insert records from the Dynamic Destinations step and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
- Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
When I try to create a view which queries more than 600 tables, BigQuery runs for a long time and the response is:
BigQuery error in mk operation: Backend Error.
the query itself is like:
'select col1,col2,col3 from t1,t2,t3......t600'
I suspect the operation is timing out. The limit here is whether validating the view query can be completed within the deadline limits for a single synchronous request like view creation. This many tables may just be too many.
A potential work-around might be to shard this view: create smaller views over subsets of the tables, then a single view over the set of smaller views.
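A rough sketch of the sharding idea (dataset and table names are hypothetical; in legacy SQL the comma between tables means UNION ALL, which standard SQL spells out explicitly):

CREATE VIEW my_dataset.v_shard_1 AS
SELECT col1, col2, col3 FROM my_dataset.t1
UNION ALL SELECT col1, col2, col3 FROM my_dataset.t2;
-- ...more shard views, each covering a manageable subset of the 600 tables...

CREATE VIEW my_dataset.v_all AS
SELECT * FROM my_dataset.v_shard_1
UNION ALL SELECT * FROM my_dataset.v_shard_2;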
An alternate solution would be to explore your data layout. Perhaps you don't need 600 tables to hold your data? The BigQuery team announced at GCP Next 2016 that table partitioning by date will be coming soon, so if you are sharding your tables by day and need to reference years of data, then there will be a single-table solution for you soon.
I have two tables, Users and Transactions. In both tables, rows have a timestamp.
I am running into performance issues when running complex queries where the transaction dates are being normalized to reflect at what point of a user experience they happened (i.e. how many days after a user joined was the transaction processed).
This "normalized day" measure is calculated as
ceil(extract(day from T.tdate - U.created_at)) + 1
Is there any way to index this such that I can increase the speed of the query?
There is no way to create such an index directly.
You can add another column to the T table which stores the value of your formula, and you can add an index on that column. You can maintain the value of that column using triggers.
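A minimal sketch of that approach (table and column names such as transactions.user_id are assumptions); a similar trigger on Users would be needed if created_at can change:

ALTER TABLE transactions ADD COLUMN normalized_day integer;

CREATE OR REPLACE FUNCTION set_normalized_day() RETURNS trigger AS $$
BEGIN
  -- precompute the "normalized day" so it can be indexed
  SELECT ceil(extract(day from NEW.tdate - u.created_at)) + 1
    INTO NEW.normalized_day
    FROM users u
   WHERE u.id = NEW.user_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_normalized_day
BEFORE INSERT OR UPDATE OF tdate, user_id ON transactions
FOR EACH ROW EXECUTE PROCEDURE set_normalized_day();

CREATE INDEX transactions_normalized_day_idx ON transactions (normalized_day);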
No: PostgreSQL doesn't currently allow the creation of an index which references more than one table.
Alternatively you could:
Follow maniek's suggestion of using triggers to compute timestamp differences on insert/updates to both of the tables.
Use PostgreSQL 9.3's CREATE MATERIALIZED VIEW, which lets you define a view that takes a snapshot of the view's result set and stores it as a table (a sketch follows below). Note that any changes made to the source tables after this snapshot are not automatically included. You can use REFRESH MATERIALIZED VIEW, e.g. just before generating reports, to update the view to a fresh snapshot.
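A rough sketch under the same assumed table and column names as above:

CREATE MATERIALIZED VIEW user_transaction_days AS
SELECT t.id AS transaction_id,
       t.user_id,
       ceil(extract(day from t.tdate - u.created_at)) + 1 AS normalized_day
FROM transactions t
JOIN users u ON u.id = t.user_id;

CREATE INDEX user_transaction_days_idx ON user_transaction_days (normalized_day);

-- re-run before generating reports to pick up new rows
REFRESH MATERIALIZED VIEW user_transaction_days;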
So let's say I have a few million records to pull from in order to generate some reports, and instead of running my reports off the live table, I create a temp table where I can then create my indexes and use it for further data extraction.
I know cached tables tend to be faster since the data is stored in memory, but I'm curious to know if there are instances where using a physical temp table is better than a Global Temporary Table, and why. What kind of scenario would make one better than the other when dealing with larger volumes of data?
Global Temporary Tables in Oracle are not like temporary tables in SQL Server. They are not cached in memory, they are written to the temporary tablespace.
If you are handling a large amount of data and retaining it for a reasonable amount of time - which seems likely as you want to build additional indexes - I think you should use a regular table. This is even more the case if your scenario has a single session, perhaps a background job, working with the data.
I use Subquery Factoring before I consider temp tables. If there's a need for reuse in various functions or procedures, I turn it into a view (which can turn into a materialized view depending on the data returned).
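Subquery factoring is Oracle's WITH clause; a minimal sketch with hypothetical tables and columns:

WITH recent_orders AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  WHERE order_date >= DATE '2024-01-01'
  GROUP BY customer_id
)
SELECT c.customer_name, r.total_amount
FROM customers c
JOIN recent_orders r ON r.customer_id = c.customer_id;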
According to asktom:
...temp table and global temp table are synonymous in Oracle.
For reporting, temporary tables are helpful in that data can only be seen by the session that created it, meaning that you shouldn't have to worry about any concurrency issues.
With a non-temporary table you need to add a session handle/identifier to the table in order to distinguish between sessions.
The primary difference between ordinary (heap) tables and global temp tables in Oracle is their visibility and volatility:
Once rows are committed to an ordinary table they are visible to other sessions and are retained until deleted.
Rows in a global temp table are never visible to other sessions, and are not retained after the session ends.
So the choice should primarily be down to what your application design needs, rather than just about performance (not to say performance isn't important).
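For reference, a global temporary table whose rows survive commits but disappear at session end is declared like this (the name and columns are hypothetical):

CREATE GLOBAL TEMPORARY TABLE report_staging (
  customer_id  NUMBER,
  total_amount NUMBER,
  report_date  DATE
) ON COMMIT PRESERVE ROWS;
-- ON COMMIT DELETE ROWS would instead empty the table at every commit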
The contents of an Oracle temporary table are only visible within the session that created the data and will disappear when the session ends. So you will have to copy the data for every report.
Is this report a one-time operation, or will it be run periodically? Copying large quantities of data just to run a report does not seem like a good solution to me. Why not run the report on the original data?
If you can't use the original tables, you may be able to create a materialized view so the latest data is available when you need it.