I'm currently working on a data warehousing project with BigQuery.
BQ used to have this quota:
Daily destination table update limit — 1,000 updates per table per day
While this quota is still in the documentation, I understand that it has been removed according to this blog post:
https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
In our project we need live data, which requires a lot of updates. Before this blog post I would have gathered the records, e.g. on GCS, and pushed them into BQ every ~14 minutes.
With the removal of the table update limit, we could now stream all data into BQ immediately, which would actually be vital for our solution since live data is required.
Question: Would you recommend now to stream data directly to BQ? Any objections?
I'm asking because I think the removal of the quota doesn't automatically make this a best practice. How do you handle the requirement for live data? Another option until now has been external data sources, with their known limitations.
Thank you for your answers!
This quota never applied to streaming. The quota mentioned in the blog post applied only to updates via DML queries, i.e. SQL INSERT, UPDATE, MERGE and DELETE statements.
Streaming inserts (via the tabledata.insertAll API, not a SQL statement) have different limits:
Maximum rows per second: 1,000,000
Maximum bytes per second: 1 GB
If you do need live data, definitely go with streaming. Note that it is costlier than batch loading from GCS, but if you need fresh data, this is the way to go.
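To illustrate, here is a minimal sketch of a streaming insert with the Python client library (google-cloud-bigquery); the table name and fields are placeholders, not anything from the question:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder table; replace with your own project.dataset.table.
    table_id = "my-project.analytics.live_events"

    rows = [
        {"event_id": "e-001", "user_id": "u-42", "ts": "2024-01-01T12:00:00Z"},
        {"event_id": "e-002", "user_id": "u-43", "ts": "2024-01-01T12:00:01Z"},
    ]

    # insert_rows_json uses the streaming API (tabledata.insertAll) under the
    # hood, so the rows are queryable within seconds and are not subject to
    # DML or load-job quotas.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Streaming insert errors:", errors)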
Hello members,
Currently we synchronise sales data into BigQuery, which allows us to make fast, detailed, practically real-time reports of all kinds of stats that would otherwise not be available to us. We want a website that can use these reports and present this information to its users.
Some specs:
Users are using the data as 'readonly'
We want to do the analysis 'on request', so as soon as a user opens the page, we would query BigQuery and the user would see their stats depending on the query
The stats could change due to external sources, but often the result will be the same; I expect BigQuery would cache the query results.
The average query processes about 100 MB of data and currently takes more than 2 seconds for the whole backend to respond (user request, query, return result set), so performance is what we are after.
Why I have doubts:
BigQuery might not be advised for this kind of workload
Could it get 'out of hand'?
The dataset will grow bigger, but we will need to keep using all historical data in any case
It would be an option to put aggregated data into another database for the main calls, but that would not give me a 'realtime' experience.
I would love to hear your thoughts.
For your requirements you can consider BigQuery as an option: it is fully managed and supports analytics over petabyte-scale data, so it will be able to handle large amounts of data. BigQuery is specifically designed for OLAP workloads, so analysis can be performed on request. BigQuery also caches query results, so an identical repeated query is served from cache and returns quickly.
If your dataset is very large and keeps growing, you can create partitioned tables to store, manage and query your data efficiently. Since BigQuery is a fully managed service, it will automatically handle that load for you. Historical data can be kept and queried indefinitely; if you ever do want to drop old data you can set a table or partition expiration, and storage that has not been modified for 90 days is automatically billed at the cheaper long-term storage rate.
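To make the partitioning and caching points concrete, here is a rough sketch with the Python client; the table name, schema and query are assumptions made up for the example:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names; replace with your own project, dataset and table.
    table_id = "my-project.sales.daily_stats"

    schema = [
        bigquery.SchemaField("sale_date", "DATE"),
        bigquery.SchemaField("store_id", "STRING"),
        bigquery.SchemaField("revenue", "NUMERIC"),
    ]
    table = bigquery.Table(table_id, schema=schema)
    # Partition by day on sale_date so date-filtered reports only scan the
    # partitions they need instead of the whole (growing) table.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="sale_date"
    )
    client.create_table(table, exists_ok=True)

    # Query results are cached by default (use_query_cache=True), so a page
    # that repeats the identical query shortly afterwards is served from cache.
    query = """
        SELECT store_id, SUM(revenue) AS total_revenue
        FROM `my-project.sales.daily_stats`
        WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY store_id
    """
    job_config = bigquery.QueryJobConfig(use_query_cache=True)
    rows = client.query(query, job_config=job_config).result()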
I am new to Stack Overflow. I use Google BigQuery to connect data from multiple sources together. I have set up a connection to Google Ads (using the BigQuery Data Transfer Service) and this works well. But when I run a backfill of older data, it takes more than 3 days to get 180 days of data into BigQuery. Google advises 180 days as the maximum per backfill, but it takes so long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do it in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request and simultaneous backfill requests are not supported [1].
BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis [2], and other BigQuery tasks in the project may limit the amount of resources available to the transfer. Load jobs created by transfers are included in BigQuery's quotas on load jobs, so it's important to consider how many transfers you enable in each project to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create additional projects.
Since you want to speed up the backfills for all your clients and this is a significant number of transfers, you could split the clients across several projects.
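If it helps, backfilling a longer period can also be scripted as a sequence of at-most-180-day requests against the Data Transfer Service API. This is only a sketch under the assumption that you already have a Google Ads transfer config (the config name below is a placeholder), and each chunk still has to finish before the next one is requested, since simultaneous backfills are not supported:

    import datetime

    from google.cloud import bigquery_datatransfer_v1

    client = bigquery_datatransfer_v1.DataTransferServiceClient()

    # Placeholder; take the name from your existing Google Ads transfer config.
    transfer_config_name = "projects/1234567890/locations/us/transferConfigs/abcd1234"

    end = datetime.datetime.now(datetime.timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    oldest = end - datetime.timedelta(days=730)  # roughly two years back

    while end > oldest:
        start = max(end - datetime.timedelta(days=180), oldest)
        request = bigquery_datatransfer_v1.StartManualTransferRunsRequest(
            parent=transfer_config_name,
            requested_time_range={"start_time": start, "end_time": end},
        )
        client.start_manual_transfer_runs(request=request)
        # Wait for this chunk's runs to finish before requesting the next one,
        # because simultaneous backfill requests are not supported.
        # ... poll client.list_transfer_runs(parent=transfer_config_name) here ...
        end = start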
I am facing some issues regarding latency in updating BigQuery schema.
I have a table that receives streaming inserts and whose schema is updated automatically whenever needed. The issue is that a schema update doesn't seem to take effect for some time, and inserts made during that window drop the values of the new columns.
I found this answer from 2016 that says there could be delays of up to 5 minutes before changes take effect.
Is this still the case and how do you work around this? If a timeout is the answer, then how long should you wait before writing to the new columns?
To get a more complete picture of the subject, I would encourage you to check out this well-written article covering the life cycle of BigQuery streaming inserts via the tabledata.insertAll REST API method.
Indeed, as the documentation says, data availability and consistency are the most important considerations when ingesting data for real-time analysis:
Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata exhibit eventual consistency when interacting with the streaming system. In most cases metadata changes are propagated within minutes, but during this period API responses may reflect the inconsistent state of the table.
In other words, while metadata changes are sometimes required alongside streaming ingests, the documentation confirms there is a delay before they take effect, and even the caching mechanism that gathers table metadata does not guarantee that changes (for example, streaming into a just-created table or into just-added columns) are visible immediately. Due to the complexity of the serverless BigQuery platform, originally built on top of the Dremel model, it is hard to estimate the latency for a particular high-throughput streaming task, and it is not documented in the GCP knowledge base.
Meanwhile, in this Stack Overflow thread, @Sean Chen recommended applying BigQuery metadata (schema) changes before launching the streaming ingest.
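As a possible workaround (a sketch, not an official recipe): apply the schema change first, then retry inserts that use the new column until the change has propagated to the streaming backend. Table and column names below are placeholders:

    import time

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder table and column names.
    table_id = "my-project.analytics.events"

    # 1. Add the new column to the table schema.
    table = client.get_table(table_id)
    table.schema = list(table.schema) + [bigquery.SchemaField("new_metric", "FLOAT")]
    client.update_table(table, ["schema"])

    # 2. The streaming backend sees the schema change with eventual consistency,
    # so retry rows that reference the new column until the insert succeeds
    # (usually within a few minutes).
    row = {"event_id": "e-123", "new_metric": 1.25}
    for _ in range(20):
        errors = client.insert_rows_json(table_id, [row])
        if not errors:
            break
        time.sleep(30)  # back off while the metadata change propagates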
Our Data Warehouse team is evaluating BigQuery as a Data Warehouse column store solution and had some questions regarding its features and best use. Our existing etl pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to on occasion replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We had a couple questions and would appreciate answers to any of them. Any additional advice on using BigQuery in ETL architecture is also appreciated.
Is there a common implementation for de-duplication of real-time streaming beyond the use of the tableId?
If we attempt a delsert (via a delete followed by an insert using the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column-store ETL technologies, but we have seen no documentation suggesting its use in BigQuery.
We let the duplication happen and write our logic and queries in such a way that every entity is streamed data. E.g. a user profile is streamed data, so there are many rows placed over time, and when we need the latest state we use the most recent row.
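A "most recent row per entity" read can then look roughly like the following sketch; the table and column names are assumptions for the example:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Table and column names are made up for the example.
    query = """
        SELECT * EXCEPT(rn)
        FROM (
          SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
          FROM `my-project.crm.user_profiles`
        )
        WHERE rn = 1
    """
    latest_profiles = list(client.query(query).result())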
A delsert is not suitable in my opinion, as you are limited to 96 DML statements per table per day. This means you would need to temporarily store batches in a staging table and later issue a single DML statement that processes a whole batch of rows and updates the live table from the staging table.
If you are considering a delsert, it is probably easier to just write a query that reads only the most recent row.
Streaming followed by a scheduled merge is possible. You can even rewrite data in the same table, e.g. to remove duplicates, or run a scheduled query that takes the batched content from the staging table and writes it to the live table. This is essentially the same as letting the duplication happen and dealing with it later in a query; when you write back to the same table it is also called re-materialization.
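A sketch of the staging-table-plus-scheduled-merge variant, again with made-up table and column names, could look like this:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Staging and live table names are placeholders.
    merge_sql = """
        MERGE `my-project.crm.user_profiles_live` AS live
        USING (
          -- collapse streamed duplicates to one row per key
          SELECT * EXCEPT(rn)
          FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
            FROM `my-project.crm.user_profiles_staging`
          )
          WHERE rn = 1
        ) AS staged
        ON live.user_id = staged.user_id
        WHEN MATCHED THEN
          UPDATE SET live.payload = staged.payload, live.updated_at = staged.updated_at
        WHEN NOT MATCHED THEN
          INSERT (user_id, payload, updated_at)
          VALUES (staged.user_id, staged.payload, staged.updated_at)
    """
    # Run on a schedule (scheduled query, Cloud Scheduler, Airflow, ...),
    # which keeps you at one DML statement per batch instead of per row.
    client.query(merge_sql).result()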
I'm currently using Google Cloud SQL for my needs.
I'm collecting data from user activities. Every day the number of rows in my table increases by around 9-15 million, and the table is updated every second. The data includes several main parameters like user location (latitude/longitude), timestamp, user activities, conversations and more.
I constantly need to derive a lot of insights from these user activities, like "how many users between latitude/longitude A and latitude/longitude B used my app per hour over the last 30 days?".
Because my table gets bigger every day, it's hard to keep the performance of SELECT queries on my table under control. (I have already added indexes, especially on the most commonly used columns.)
All my inserts, selects, updates and so on are executed from an API that I wrote in PHP.
So my question is: would I get a much better result if I used Google BigQuery for my needs?
If yes, how can I do this? Isn't Google BigQuery (forgive me if I'm wrong) designed to be used for static data rather than constantly updated data? And how can I get my Cloud SQL data into BigQuery in real time?
Which one is better: optimizing my table in Cloud SQL to speed up the SELECT queries, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions to optimize my Cloud SQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good approach:
Migrate existing data from CloudSQL to BigQuery.
Stream new events directly into BigQuery (e.g. via an async queue).
Use time-partitioned tables in BigQuery.
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
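A minimal sketch of those three steps with the Python client could look like the following; the table name, schema and GCS export path are assumptions, not details from the question:

    from google.cloud import bigquery

    client = bigquery.Client()

    # All names and the GCS path are placeholders for the sketch.
    table_id = "my-project.analytics.user_activity"

    schema = [
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("lat", "FLOAT"),
        bigquery.SchemaField("lng", "FLOAT"),
        bigquery.SchemaField("activity", "STRING"),
        bigquery.SchemaField("ts", "TIMESTAMP"),
    ]

    # Time-partitioned table: "last 30 days" queries scan 30 daily partitions
    # instead of the whole history.
    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="ts"
    )
    client.create_table(table, exists_ok=True)

    # One-off migration of the existing Cloud SQL data, e.g. from a CSV export
    # that was placed on GCS.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/cloudsql-export/*.csv",
        table_id,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1
        ),
    )
    load_job.result()

    # Workers consuming the async queue stream new events as they arrive.
    events = [{"user_id": "u-1", "lat": 52.37, "lng": 4.90,
               "activity": "open_app", "ts": "2024-01-01T10:00:00Z"}]
    errors = client.insert_rows_json(table_id, events)
    if errors:
        print("Insert errors:", errors)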