I am new to Stack Overflow. I use Google BigQuery to connect data from multiple sources together. I have set up a connection to Google Ads (using the BigQuery Data Transfer Service) and this works well. But when I run a backfill of older data, it takes more than 3 days to get 180 days of data into BigQuery. Google advises 180 days as the maximum per backfill, but it takes so long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do this in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request and simultaneous backfill requests are not supported [1].
BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis [2], and other BigQuery tasks in the project may be limiting the amount of resources available to the transfer. Load jobs created by transfers are included in BigQuery's quotas on load jobs, so it's important to consider how many transfers you enable in each project to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create other projects.
If you want to speed up the transfers for all your clients, you could split them across several projects, since that is a large number of transfers to run.
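If you end up scripting those 180-day chunks per client anyway, a minimal sketch with the Python client for the Data Transfer Service could look like the following (the project, location, transfer config ID, and date range are placeholders; each chunk still respects the 180-day limit, and the runs for one config are processed one after another):

```python
# Sketch: queue Google Ads backfills in 180-day chunks covering roughly the
# past two years for one transfer config. The project, location, config ID,
# and date range below are placeholders.
import datetime

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Format: projects/{project}/locations/{location}/transferConfigs/{config_id}
transfer_config_name = "projects/my-project/locations/eu/transferConfigs/my-ads-config"

chunk = datetime.timedelta(days=180)  # documented maximum per backfill request
start = datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc)

current = start
while current < end:
    chunk_end = min(current + chunk, end)
    # Each call schedules one backfill window; the service processes runs for
    # a given config sequentially, since simultaneous backfills are not supported.
    response = client.schedule_transfer_runs(
        parent=transfer_config_name,
        start_time=current,
        end_time=chunk_end,
    )
    print(f"Scheduled {len(response.runs)} runs for "
          f"{current:%Y-%m-%d} .. {chunk_end:%Y-%m-%d}")
    current = chunk_end
```

You would repeat this per client/transfer config, possibly spread over separate projects as suggested above so the load jobs of one client do not eat into another client's quota.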
Related
I have a Python application writing Pub/Sub messages into BigQuery. The Python code uses the google-cloud-bigquery library, and the TableData.insertAll() method quota is 10,000 requests per second per table (see the quotas documentation).
Cloud Run container auto scaling is set to 100 instances with 1000 requests per container. So technically, I should be able to reach 10,000 requests/sec, right? With the BQ insert API being the biggest bottleneck.
I only reach a few hundred requests per second at the moment, with multiple services running at the same time.
CPU and RAM are at 50%.
Having confirmed your project structure and the details given in the comments, I would review the Pub/Sub quotas and limits, especially the Quota and Resource limits tables, where you can check this information depending on message size; the Throughput quota units section tells you how to calculate quota usage.
I would answer your question with a yes: you should be able to reach 10,000 req/sec. And, as in this question, depending on the byte size you can send up to 10,000 rows per insert request, although the recommendation is 500.
The concurrency in Cloud Run can be modified in case you need to change it.
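To make better use of each insert call, one option is to batch rows on the application side instead of issuing one insert per message. A minimal sketch with google-cloud-bigquery (the table ID, schema, and message source are assumptions):

```python
# Sketch: batch incoming payloads into groups of up to 500 rows before
# streaming them into BigQuery, instead of one insert call per message.
# The table ID and row shape are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder

BATCH_SIZE = 500  # recommended number of rows per insertAll request


def flush(rows):
    """Stream one batch; insert_rows_json returns per-row error lists."""
    if not rows:
        return
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        # In a real service you would retry or dead-letter the failed rows.
        print(f"{len(errors)} rows failed, first error: {errors[0]}")


def handle_messages(messages):
    """Group message payloads and stream them in batches."""
    buffer = []
    for payload in messages:
        buffer.append({"payload": payload})  # assumed single-column schema
        if len(buffer) >= BATCH_SIZE:
            flush(buffer)
            buffer = []
    flush(buffer)


handle_messages(f"message-{i}" for i in range(1200))  # stand-in for Pub/Sub input
```

Batching this way keeps you well under the request-per-second quota for the same row throughput, at the cost of slightly higher latency per row.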
The new Google Sheets API v4 currently has an unlimited read/write quota per day (which is fantastic), but it is restricted to 500 reads/writes per account per 100 seconds, and 100 reads/writes per key per 100 seconds (or, I have found, multiple keys coming from the same IP). This is probably plenty for most use cases, but I have an edge case that requires bringing a frequently updated Google Sheet with 70 tabs down to a Node.js server that distributes the data to users' clients every ~30-60 seconds or so (the users are data annotators who are student research assistants). This wasn't so bad early in the project when there were only 20-30 tabs, but now that the data is large, the server blows through the 100-call quota and returns errors every 10-15 minutes.
The problem is such that:
Frequent data updates: Only data on 1-5 of the 70 tabs is likely to be updated on any given minute, but which tabs have new data is random (so I am pulling down the whole sheet of 70 = 70 reads).
Update interval: The need for updates happens randomly at about 30 second to 5-minute intervals (so some within the quota, some about 3-5x the quota).
Throttling: I have tried throttling the updates to stay within the 100 calls/100 seconds (my previous solution), but this significantly decreases usability, productivity, and work quality.
Quota increase: The sheets API does not currently appear to include a way to pay to increase the quota. It does allow filling out a form to request an increase in the quota, but I'm not sure what the mean response time is on this (my request is only a few days old).
Multiple service accounts: I have tried using multiple service accounts to get the full 500 requests/100 seconds quota (rather than the per-user quota), since this is a server, but Google Sheets appears to rate-limit to 100 requests/100 seconds from a given IP.
Alternatives: I have considered that this project may have just grown beyond the size that Sheets is easily able to handle, but there do not appear to be any good, usable, self-hosted, collaborative spreadsheets with easy-to-interface-to APIs out there.
Are there settings/methods suggested to achieve the full 500 calls/100 seconds for a server?
You can request a quota increase in Google Cloud Platform, and it will be raised to 2,500 per account and 500 per user (regarding your #4).
You can use spreadsheets.get to read the entire spreadsheet (all tabs) in a single call, rather than one call per tab. Alternatively, you can use spreadsheets.values.batchGet to read multiple different ranges in a single call, if all you need are the values.
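For example, a minimal Python sketch of the batchGet approach (the spreadsheet ID, tab names, and ranges are assumptions), which counts all 70 tabs as a single read:

```python
# Sketch: read values from several tabs with one batchGet call instead of
# one call per tab. Spreadsheet ID and tab names are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
service = build("sheets", "v4", credentials=creds)

spreadsheet_id = "your-spreadsheet-id"               # placeholder
ranges = [f"Tab{i}!A1:Z1000" for i in range(1, 71)]  # one range per tab

# One API call returns the values for all requested ranges.
result = (
    service.spreadsheets()
    .values()
    .batchGet(spreadsheetId=spreadsheet_id, ranges=ranges)
    .execute()
)
for value_range in result.get("valueRanges", []):
    print(value_range["range"], len(value_range.get("values", [])))
```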
The Drive API offers "push notifications", so you can get notified when changes occur and react to those, instead of polling for them. The latency of the notifications is a little on the slow side, but it gets the job done.
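A hedged sketch of registering such a watch channel with the Drive API (the file ID, channel ID, and HTTPS endpoint are placeholders; your server must expose that endpoint to receive the notifications):

```python
# Sketch: register a Drive push-notification channel for the spreadsheet so
# the server is told when it changes, instead of polling. All identifiers
# and the webhook URL are placeholders.
import uuid

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)
drive = build("drive", "v3", credentials=creds)

channel = {
    "id": str(uuid.uuid4()),                           # your channel identifier
    "type": "web_hook",
    "address": "https://example.com/sheets-changed",   # your HTTPS endpoint
}
response = drive.files().watch(fileId="your-spreadsheet-id", body=channel).execute()
print("Watching until", response.get("expiration"))
```

Combined with the batchGet call above, you would only pull the sheet when a notification arrives, rather than on a fixed 30-60 second poll.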
I am thinking to use Google Big Query to store realtime call records involving around 3 million rows per day inserted and never updated.
I have signed up for a trial account and run some tests.
I have a few concerns before I can go ahead with development:
When streaming data via PHP, it sometimes takes around 10-20 minutes for the data to show up in my tables, and this is a show-stopper for us because network support engineers need this data updated in real time to troubleshoot quality issues.
Partitions: we can store data in partitions divided per day, but one partition then holds about 2.5 GB on any given day, which pushes my cost to query the data into the thousands per month. Is there any other way to bring down the cost here? We could store data partitioned per hour, but there is no such support available.
If not BigQuery, what other solutions are out there in the market that can deliver similar performance and solve these problems?
You have the "streaming insert" option, which makes the records searchable within a few seconds (it has its price).
See: streaming-data-into-bigquery
Check table-decorators for limiting the query scan.
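As a sketch of how a table decorator limits the scan (legacy SQL only; the dataset, table, and column names are placeholders), a relative decorator such as @-3600000- only reads data added in the last hour, so the per-query cost stays proportional to the window you actually need:

```python
# Sketch: query only the last hour of a table using a legacy-SQL relative
# table decorator, so the scan (and cost) is limited to recent data.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# @-3600000- = from one hour ago (in milliseconds) until now.
query = """
SELECT caller, callee, duration_ms
FROM [my-project:calls.call_records@-3600000-]
WHERE duration_ms > 0
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=True)  # decorators need legacy SQL
for row in client.query(query, job_config=job_config).result():
    print(row.caller, row.duration_ms)
```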
Is there any method of obtaining reliable throughput with BigQuery? I've been using it since May or March, and we have always seen wild swings in latency, generally not lasting longer than a few days; however, now we are constantly slow.
It seems that this is due to the shared resource pool style of the service, however they do not make it easy to purchase private resources.
How does one obtain consistent, reliable results from BigQuery?
Details of my usage:
Streaming inserts (1000/s)
Queries on the streaming dataset (up to 250 GB table size)
Querying via the API using the PHP library
My query times vary from 2 seconds to 10 minutes. Ideally, a query should never exceed the 15-30 second range. Any information on what might improve my response times would be helpful.
I am looking for options for reliable (and speedy) storage for small amounts of sensor data that would be coming in from (getting optimistic here) millions of endpoints.
The scale I'm talking about is 1M endpoints, each sending 100 bytes every minute. This data needs to be available for analysis shortly after it arrives.
Additionally, this data will be kept for a few years and may exceed 100TB of total storage.
Is S3 the solution to this, or would I be better off hosting my own NoSQL cluster like Cassandra/MongoDB etc?
Please let me know if I have not specified any information.
Yes, you could. But S3 has no query mechanism, nor any way to read multiple objects in one request. You would also have no way to inspect the data before it's written.
This might be a better idea:
Have clients write sensor data onto an SQS queue
Your application reads messages off the queue and writes the data into SimpleDB or another data store (see the sketch after this list)
It would de-couple receipt of the data, with any data load/storage phase.
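A minimal sketch of that consumer side with boto3 (the queue URL, table name, and message format are assumptions, and DynamoDB stands in here as the "other data store"):

```python
# Sketch: decouple ingestion from storage by draining an SQS queue and
# writing each sensor reading to a data store (DynamoDB as an example of
# "another data store"; queue URL, table name, and message format are
# placeholders).
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("sensor_readings")  # placeholder table

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/sensor-data"

while True:
    # Long-poll for up to 10 messages at a time to cut per-request costs.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        # Body is assumed to be "endpoint_id,timestamp,payload".
        endpoint_id, timestamp, payload = msg["Body"].split(",", 2)
        table.put_item(
            Item={"endpoint_id": endpoint_id, "ts": timestamp, "payload": payload}
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```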
Note that many Amazon services have a per-request charge. For SQS it's $0.01/10,000 requests. If you want 1 million clients writing one message each minute, the request charges alone would be over $40,000 a month, doubling once you take reading the messages into account.
(((1000000 * (60*24*30)) / 10000) * $0.01) * 2 = $86,400
For S3, it's $0.01/1000 for POSTs (client writes), and $0.01/10000 GETs (reads). For 1 million clients your per-request charges alone could easily reach $500,000 per month.
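For reference, a quick sketch of that arithmetic (the pricing figures are the ones quoted above; actual prices vary by region and may differ today):

```python
# Sketch: per-request cost arithmetic for 1M endpoints sending one message
# per minute, using the per-request prices quoted above.
clients = 1_000_000
writes_per_month = clients * 60 * 24 * 30            # one message per minute

sqs_cost = writes_per_month / 10_000 * 0.01 * 2       # writes + reads
s3_post_cost = writes_per_month / 1_000 * 0.01        # client write requests
s3_get_cost = writes_per_month / 10_000 * 0.01        # one read per object

print(f"SQS: ${sqs_cost:,.0f}/month")                      # ~$86,400
print(f"S3:  ${s3_post_cost + s3_get_cost:,.0f}/month")    # ~$475,200
```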
Ultimately, at 1 million clients, you likely need to run your own receiving endpoints simply due to economic factors.