BigQuery jobs will get queued if there are no available slots. This is clearly stated at https://cloud.google.com/bigquery/docs/slots#query_execution_under_slot_resource_economy
If a query requests more slots than currently available, BigQuery queues up individual units of work and waits for slots to become available. As progress on query execution is made, and as slots free up, these queued up units of work get dynamically picked up for execution.
BigQuery can request any number of slots for a particular stage of a query. The number of slots requested is not related to the amount of capacity you purchase, but rather an indication of the most optimal parallelization factor chosen by BigQuery for that stage. Units of work queue up and get executed as slots become available.
When query demands exceed slots you committed to, you are not charged for additional slots, and you are not charged for additional on-demand rates. Your individual units of work simply queue up.
Is it possible to discover if a job stage was queued and if so, how long for?
I've taken a look at BigQuery > Documentation > Reference > JOBS view but didn't spot anything there that might answer these questions.
I am new to Stack Overflow. I use Google BigQuery to bring together data from multiple sources. I have set up a connection to Google Ads (using the BigQuery Data Transfer Service) and this works well. But when I run a backfill of older data, it takes more than 3 days to load 180 days of data into BigQuery. Google advises 180 days as the maximum per backfill, but it takes very long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do it in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request and simultaneous backfill requests are not supported [1].
BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis [2] and other BigQuery tasks in the project may be limiting the amount of resources used by the Transfer. Load jobs created by transfers are included in BigQuery's quotas on load jobs. It's important to consider how many transfers you enable in each project to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create other projects.
If you want to speed up the transfers for all your clients, you could split them across several projects, because it seems you are going to run a significant number of transfers.
I configured the Capacity Scheduler and schedule jobs in specific queues. However, there are times when jobs in some queues complete faster while other queues have jobs waiting on the previous ones to complete. This creates a scenario where half of my capacity is idle and the other half is busy, with jobs waiting to get resources.
Is there any config that I can tweak to maximize my utilization? I want to route waiting jobs to other queues where resources are available. Attached is a screenshot -
This seems to be an issue with the Capacity Scheduler. I switched to the Fair Scheduler and definitely see huge improvements in cluster utilization: ~75%, which is far better than the 40s (percent) I got with the Capacity Scheduler.
The reason is that when multiple users submit jobs to the same queue, together they can consume up to the queue's maximum capacity, but a single user can't consume more than the queue's configured capacity, even when the maximum capacity is greater than that.
So if you specify yarn.scheduler.capacity.root.QUEUE-1.capacity: 20 in capacity-scheduler.xml, one user can't take more than 20% of the resources for the QUEUE-1 queue, even if your cluster has free resources.
By default, user-limit-factor is set to 1. If you set it to 2, your job can use 40% of the resources, provided the queue's maximum capacity is greater than or equal to 40.
yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor: 2
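To make the arithmetic concrete, here is a minimal sketch of the rule described above. It is a simplified model, not the scheduler's actual code: the function name is made up for illustration, and it ignores other settings such as minimum-user-limit-percent.

```python
def max_single_user_share(capacity_pct, user_limit_factor=1.0, maximum_capacity_pct=100.0):
    """Rough upper bound (percent of the cluster) that one user can take in a queue.

    Simplified model: configured capacity times user-limit-factor,
    capped by the queue's maximum capacity.
    """
    return min(capacity_pct * user_limit_factor, maximum_capacity_pct)


# Default factor of 1: a single user is held to the queue's configured 20%.
default_share = max_single_user_share(20, user_limit_factor=1, maximum_capacity_pct=40)

# Factor of 2 with maximum capacity 40: the user may grow to 40%.
doubled_share = max_single_user_share(20, user_limit_factor=2, maximum_capacity_pct=40)

# Factor of 2 but maximum capacity only 30: the cap wins.
capped_share = max_single_user_share(20, user_limit_factor=2, maximum_capacity_pct=30)
```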
Please follow this blog
Is it possible to show an alert or message popup every time I run a query in the BQ GUI? I am afraid of spending too much on query costs.
I hope BQMate has this function.
Sometimes the cost of the query can only be determined when the query is finished, e.g., for federated tables and the newly released clustered tables. If you're concerned about the cost, the best option is to set the Maximum Bytes Billed option; then you can be sure you'll never be charged for more than that. You can set a default value for this option in your project, but right now you have to contact support to set it for your project.
A fast way to get a query cost estimation is checking the amount of data processed on the right side of the screen in the query validator, by performing a dry-run. Check here a "query validator" example. You have two options to calculate the cost:
Manually: query pricing is described here on GB units, so you can sum and multiply: 1 free TB per month, $5 per extra TB. If you expect to query more than 1TB of data per month, you should sum queries' used data to know when to start calculating costs.
Automatically: using the online pricing calculator, which is available for all Google Cloud Platform products.
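The manual calculation can be sketched in a few lines. This is only an illustration using the figures quoted above (1 free TB per month, $5 per extra TB); the function name is hypothetical, 1 TB is assumed to be 2**40 bytes, and current pricing should be checked before relying on the numbers.

```python
BYTES_PER_TB = 2 ** 40          # assumption: binary terabyte
FREE_TB_PER_MONTH = 1.0         # figure quoted in the answer above
USD_PER_EXTRA_TB = 5.0          # figure quoted in the answer above


def estimate_monthly_cost(bytes_processed_per_query):
    """Sum the bytes reported by each dry run and price what exceeds the free tier."""
    total_tb = sum(bytes_processed_per_query) / BYTES_PER_TB
    billable_tb = max(0.0, total_tb - FREE_TB_PER_MONTH)
    return billable_tb * USD_PER_EXTRA_TB


# Two queries in a month: 1 TB and 0.5 TB. Only 0.5 TB is above the free tier.
example_cost = estimate_monthly_cost([BYTES_PER_TB, BYTES_PER_TB // 2])
```

The input is the "bytes processed" figure the query validator shows for each query, collected over the month.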
If you want to set custom cost controls, have a look at this page, since custom quotas are not enabled by default. Cost controls can be applied at project level or user level by restricting the number of bytes billed. Currently you have to submit a request from the Google Cloud Platform Console to ask for them to be set, in 10 TB increments. If the usage exceeds a set quota, the error message is quite clear, and differs depending on whether the project or the user quota was exceeded. For project quota:
Custom quota exceeded: Your usage exceeded the custom quota for
QueryUsagePerDay, which is set by your administrator. For more information,
see https://cloud.google.com/bigquery/cost-controls
With no remaining quota, BigQuery stops working for everyone in that project.
If you want to constantly monitor billing data for BigQuery, have a look at this tutorial, which explains how to create a billing dashboard using Data Studio.
I don't know about BQMate since this is from Vaint Inc.
During load testing of our module we found that BigQuery insert calls take a long time (3-4 s). I am not sure if this is OK. We are using the Java BigQuery client library and on average we push 500 records per API call. We expect traffic of a million records per second to our module, so BigQuery inserts are the bottleneck in handling this traffic. Currently it takes hours to push the data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size (see Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200 and 2500 ms for each streaming request, and this was consistent over the last month, as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option is exponential backoff with retry; even support told us to do so, which personally doesn't make me happy.
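For reference, exponential backoff with retry can be sketched as follows. This is a generic illustration, not the BigQuery client's API: send stands for whatever hypothetical callable performs one streaming request, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


def retry_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=32.0, sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure.

    send is a hypothetical zero-argument callable that raises on failure.
    Real code should catch only the transient error types, not every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Because a timeout may fail only some rows of a payload, a caller would normally retry just the rows reported as failed rather than the whole batch.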
If the approach you've chosen takes hours, it does not scale and won't scale. You need to rethink it around asynchronous processing. To finish sooner, run multiple workers in parallel; the streaming performance per request will stay the same. With just 10 workers in parallel, the total time will be roughly 10 times less.
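As a rough illustration of the parallel-worker idea, the sketch below splits rows into batches of 500 and hands them to a pool of 10 threads. insert_batch is a hypothetical callable standing in for one streaming insert request; the batch size and worker count simply mirror the numbers mentioned in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    """Yield consecutive slices of rows, each at most size long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_parallel(rows, insert_batch, batch_size=500, workers=10):
    """Send batches concurrently; returns one result per batch, in batch order.

    insert_batch is a hypothetical function that performs one insert request.
    """
    batches = list(chunked(rows, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each request is I/O bound, so threads overlap the network wait time.
        return list(pool.map(insert_batch, batches))
```

Each request still takes as long as before; the gain comes only from having several requests in flight at once, within whatever quota applies to the project.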
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can also release a job; Beanstalkd then pushes the job back to the tube and makes it available to another client.
Beanstalkd clients can be found for most common languages, and a web interface can be useful for debugging.
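The job lifecycle described above (put, reserve, delete, bury, release) can be modelled with a small in-memory stand-in. This is not a Beanstalkd client and does not talk to a server; the class and function names are made up purely to show how a producer and a consumer interact with a tube.

```python
from collections import deque


class InMemoryTube:
    """Toy model of one tube: ready jobs, reserved jobs, and buried jobs."""

    def __init__(self):
        self.ready = deque()
        self.reserved = {}
        self.buried = {}
        self._next_id = 1

    def put(self, body):
        """Producer side: queue a job and return its id."""
        job_id = self._next_id
        self._next_id += 1
        self.ready.append((job_id, body))
        return job_id

    def reserve(self):
        """Consumer side: take the oldest ready job (raises IndexError if empty)."""
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        """The job succeeded: remove it for good."""
        del self.reserved[job_id]

    def release(self, job_id):
        """Give the job back so another consumer can reserve it."""
        self.ready.append((job_id, self.reserved.pop(job_id)))

    def bury(self, job_id):
        """The job failed: set it aside for later inspection."""
        self.buried[job_id] = self.reserved.pop(job_id)


def consume_one(tube, handler):
    """Reserve one job, run handler on it, then delete or bury it."""
    job_id, body = tube.reserve()
    try:
        handler(body)
    except Exception:
        tube.bury(job_id)
    else:
        tube.delete(job_id)
```

In the setup described in this answer, the API would call put with a JSON representation of each row, and the handler would perform the actual insert.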
We have an Analysis Services cube that needs to be as real-time as possible. It's a relatively small cube that currently takes a couple of seconds to process.
Are there any guidelines for this? I'm curious what other folks are doing.
Also, what would be the impact of processing the cube too frequently? Would the main concern be the load on the SSAS server and the source DB? In our case it would be fairly nominal. How would SSAS clients be affected? Current SSAS consumers are Excel, PerformancePoint, and Sharepoint/Excel Services.
I would say the first issue you'd have to consider is how much is this cube going to grow over time? If it is constantly updated and processed that couple seconds could quickly turn into 20 minutes.
For example, we currently have a cube that has 20 million rows (probably more now hehe) with financial data related to hospital billing and charges that takes about 20 mins to process and we do it once a day in the morning. Depending on the time of the year we sometimes do process during the day again but there have been no complaints as long as we notify people we are doing this.
Have you considered a real-time (ROLAP) partition to store the current day's data? This way, you get the performance of MOLAP for all your data prior to the current day, which you can process nightly, but have ROLAP's low latency for the data collected since the last cube process.
If your cube is small enough, you could even stretch that to be the current week's data, or more.
As far as the disadvantages of processing frequently, check out the below article, which says: "If the processing job succeeds, an exclusive lock is put on the object when changes are being committed, which means the object is temporarily unavailable for query or processing. During the commit phase of the transaction, queries can still be sent to the object, but they will be queued until the commit is completed."
http://technet.microsoft.com/en-us/library/ms174860.aspx
So your users will see an impact in query performance.
It may be that you have to 'put it out there' and track how it performs.
Once you can see how people are using the cube, you can determine if constant reprocessing is really necessary and if it is, you may have to optimise how this occurs.
Specifically, using "usage based optimisation" as described here:
http://www.databasejournal.com/features/mssql/article.php/3575751/Usage-Based-Optimization-in-Analysis-Services-2005.htm