Abort a table import stuck in 'pending' - google-bigquery

Similar questions have been asked but not exactly what I am looking for.
The problem: on some occasions importing a table from Google Cloud to Big Query gets stuck in a 'pending' state for hours if not days. Tables that get stuck in this state never seem to come out of it, or at least we didn't bother waiting that long. I know it's not a queue issue since in the mean time we can import other tables just fine. No errors are returned by Big Query.
My question: in this situation, and in general, how can we safely abort/cancel an import to Big Query without having the table quietly import on us without us knowing. This would actually apply to any table regardless of its state, as long as it hasn't finished importing.
Thanks.

You may be hitting job load rate limits. For example, if you try to start more than two load jobs per minute for the same table, the load jobs against that table will be defferred, while other load jobs against other tables may continue at normal speed.
There are per-project limits on rate at which load jobs will be started and limits on the number of load jobs that can be running per project at any one time. If you send jobs faster than this, we'll queue, but as you've noticed, our queueing is not a fair queue, and can start newer jobs before older ones.
Aborting pending jobs is a commonly requested feature. If you file a feature request here that will help us prioritize it.

Related

Bigquery streaming inserts taking time

During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.

Google BigQuery: Slow streaming inserts performance

We are using BigQuery as event logging platform.
The problem we faced was very slow insertAll post requests (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/insertAll).
It does not matter where they are fired - from server or client side.
Minimum is 900ms, average is 1500s, where nearly 1000ms is connection time.
Even if there is 1 request per second (so no throttling here).
We use Google Analytics measurement protocol and timings from the same machines are 50-150ms.
The solution described in BigQuery streaming 'insertAll' performance with PHP suugested to use queues, but it seems to be overkill because we send no more than 10 requests per second.
The question is if 1500ms is normal for streaming inserts and if not, how to make them faster.
Addtional information:
If we send malformed JSON, response arrives in 50-100ms.
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
Also the failure rate fits the 99.9% uptime we have in the SLA, so there is no reason for objection.
There's something to keep in mind in regards to the SLA, it's a very strictly defined structure, the details are here. The 99.9% is uptime not directly translated into fail rate. What this means is that if BQ has a 30 minute downtime one month, and then you do 10,000 inserts within that period but didn't do any inserts in other times of the month, it will cause the numbers to be skewered. This is why we suggest a exponential backoff algorithm. The SLA is explicitly based on uptime and not error rate, but logically the two correlates closely if you do streaming inserts throughout the month at different times with backoff-retry setup. Technically, you should experience on average about 1/1000 failed insert if you are doing inserts through out the month if you have setup the proper retry mechanism.
You can check out this chart about your project health:
https://console.developers.google.com/project/YOUR-APP-ID/apiui/apiview/bigquery?tabId=usage&duration=P1D
It happens that my response is on the linked other article, and I proposed the queues, because it made our exponential-backoff with retry very easy, and working with queues is very easy. We use Beanstalkd.
To my experience any request to bigquery will take long. We've tried using it as a database for performance data but eventually are moving out due to slow response times. As far as I can see. BQ is built for handling big requests within a 1 - 10 second response time. These are the requests BQ categorizes as interactive. BQ doesn't get faster by doing less. We stream quite some records to BQ but always make sure we batch them up (per table). And run all requests asynchronously (or if you have to in another theat).
PS. I can confirm what Pentium10 sais about faillures in BQ. Make sure you retry the stuff that fails and if it fails again log it to file for retrying it another time.

identifying idle events in google Big Query.existing procedure

How can we identify idle events in Google Big Query.
The events that have been pushed by a user but not being queried.
I need to know if there is any existing procedure that i can follow.
I'm not sure I understand your question. Are you asking about query jobs that have been created but have not yet started running? If so, you can call jobs.list() and specify the PENDING state as a filter. Query jobs are generally only in the pending state for a very brief time however, unless you have specified batch priority.

Fault tolerant / high availability producer

I'm working on a distributed producer/consumer system using a messaging queue. The part that I'm interested in parallelising is the consumer side of it, and I'm happy with what I have for that.
However, I'm not sure what to do about the producer. I only need one producer running at a time since the load of the producing part of my system is not too high, but I want a reliable way of managing it, as in starting, stopping, restarting, and mainly, monitor it so that if the producer host fails another one can pick up.
If it helps, I'm happy with my consumer algorithm, the one that queues jobs, since it's fault tolerant to be down for a period of time and pick up the stuff that happened during the time it was down.
I'm sure there are tools or at least known patterns to do this and not reinvent the wheel.
I'm using rabbitmq but can use activemq, or even refactor into storm or something like that if needed, my code is not complex so far.
After a couple of weeks thinking about it, the simplest solution came to mind, and I'm actually very pleased with it, so I'll share it in case you find it useful, or point out if you think of any downsides, it seems to be working fine so far.
I have created the simplest table in my DB, called heartbeat, with a single timestamp field called ts, and is meant to have a single row all the time.
I start all of my potential producers every 5 minutes (quartz), and they do an update of the table if the ts fields is older than now() - 5 minutes. Because the update call is blocking, I'll have no db threading issues. Now, if the update returns > 0 it means that it actually modified the value of ts, and then I execute the actual producing code (queue jobs). If the update returns 0, it did not modify the table, because someone else did less than 5 minutes ago, and therefore this producer won't do anything until it checks again in 5 minutes.
Obviously the 5 minutes value is configurable, and this allows for a very neat upgrade with small changes to be able to execute several producers at the same time, if I ever had that need.

BigQuery Double Imports

I am using Google BigQuery from app engine. I have a cron job that runs every 15 minutes to do an export to BigQuery. Randomly though, the import runs twice. However, the appengine logs do not reflect this. I have a set of blobs I maintain to write data to bigquery with, and duplicate data is not being written to them. Has anyone else had bigquery problems with duplicate imports? Again, my appengine logs show the imports happening only one time, and I'm kind of at a loss how to troubleshoot.
One way to troubleshoot is to look at your import jobs. You can do this using the bq tool by running bq ls -j to list the jobs you've run, and bq show -j <job_id> to show details about particular jobs.
We've not heard of any other cases of duplicate loads. One idea to prevent this is to give your import jobs an id ... by default one gets created for you (it will look like job_). Job ids are enforced to be unique within a project, so if you generate an id per import you intend to do, if a double import is triggered, the second one will fail immediately because the job id will already exist.
I am facing the same problem where the jobs seem to have imported twice even though our logs shows it was submitted only once.
Also looked into the jobs and the above command shows it was successfully processed only once.
Note that since the job was only submitted once, not sure how controlling the job_id will help in this case. It seems to be something internal to BigQuery that might have caused the jobs to duplicate?
Let me know if you need anything from my end to investigate.
Thanks,