BigQuery Job State stuck in "RUNNING" - google-bigquery

I have a batch process that runs a JSON load job, usually once per hour. My process checks whether a previous job is still pending or running before it continues. It's not uncommon for jobs to sit in the PENDING state for some time before completing; I have seen anywhere from an hour to a few hours.
In this case I have a job that has been in the "RUNNING" state for about 4 hours now, and I've never seen it take quite this long. I have checked other posts here to see if anyone else had the issue, and I did find one chap whose job took about 4 hours to complete.
My job ID for the job in question is: #job_Canl6-TBi4F6sWDOjB6ng7PxoZA. I know that jobs can be stuck in the PENDING state due to queue times, but I was not aware this could happen in the RUNNING state too. Can anyone confirm from their experience that this is not abnormal? In my experience (I have been running this process for over a year) it is. Can anyone also confirm that this is not a current back-end issue with BigQuery?
Many thanks in advance.
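The pre-flight check described above could be extended so that a job stuck in RUNNING is flagged rather than waited on indefinitely. The following is only a sketch: the two-hour threshold, the function name decide and its return values are assumptions for illustration, and fetching the job state and timestamp from BigQuery is left out. The state strings PENDING, RUNNING and DONE are the ones BigQuery reports for jobs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long a job may stay in RUNNING before we escalate.
MAX_RUNNING = timedelta(hours=2)


def decide(state, running_since, now, max_running=MAX_RUNNING):
    """Decide what the hourly batch should do about the previous load job.

    state         -- job state string: "PENDING", "RUNNING" or "DONE"
    running_since -- datetime when the job started running, or None if unknown
    now           -- current datetime (passed in so the logic is easy to test)

    Returns "start" (safe to submit the next job), "wait" (skip this run),
    or "escalate" (the job has been RUNNING longer than max_running).
    """
    if state == "DONE":
        return "start"
    if (
        state == "RUNNING"
        and running_since is not None
        and now - running_since > max_running
    ):
        return "escalate"
    # PENDING, or RUNNING within the allowed window: leave it alone for now.
    return "wait"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(decide("RUNNING", now - timedelta(hours=4), now))
```

What "escalate" means is up to the process: sending an alert or filing an issue are both reasonable, and neither is implied by the original post.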

This issue should be resolved now. See comments #9 and #10 here:
http://code.google.com/p/google-bigquery/issues/detail?id=103
In the future, I'd recommend filing a bug via BigQuery's public issue tracker to notify us of bugs or other issues with the service. Your issue will likely get more prompt attention from the team that way. Also, StackOverflow is more appropriate for questions about BigQuery usage that will be relevant long-term; I suspect the moderators would prefer we not use StackOverflow as a bug tracker.
The public issue tracker is here:
https://code.google.com/p/google-bigquery/issues/list
Thanks, and apologies for the trouble today!

Related

SQL Agent job failure universal handling

I'm in a situation where I have a server running SQL Server 2012 with roughly two hundred scheduled jobs (all are SSIS package executions). I'm facing a directive from management where I need to run some custom software to create a bug report ticket whenever a job fails. Right now half the jobs notify an operator on failure, while the other half use a "go to step X: send failure email" action for each step on failure, where "step X" is some SQL that queries the DB and sends out an email saying which job failed at which step.
So what I'm looking for is some universal solution where I can have every job do the same thing when it fails (in this case, run some program that creates a bug tracking ticket). I am trying to avoid the situation where I manually go into every single job and add a new step at the end, with all previous steps changing to "go to step Y on failure" where step Y is this thing that creates the bug report.
My first thought was to create a new job that queries the execution history tables and looks for unhandled failures and then does the bug report creation itself. However, I already made the mistake of presenting this idea to the manager and was told it's not a viable solution because it's "reactive and not proactive" and also not creating tickets in real-time. I should know better than to brainstorm with non-programming management but it's too late, so that option is off the table and I haven't been able to uncover any other methods.
Any suggestions?
I'm proposing this as an answer, though it's not a technical solution. Present the possible solutions and let the manager decide:
Update all the Agent Jobs - This will take a lot of time and every job will need to be tested, which will also take a lot of time. I'd guess 2-8 weeks depending on how it's done.
Create an error handler job that monitors the logs and creates tickets based on those errors. This has two drawbacks: it is not "real-time" (as desired by the manager), and something will need to be put into place to ensure errors are only reported once. This has the upside of being one change to manage. It can also be made near real-time if it is run every minute.
A third option, which would be more a preliminary step, is to create an error report based off of the logs. This will help to understand the quantity and types of failures. This may help to shape the ultimate solution - do we want all these tickets, can they be broken up into different categories, do we want tickets for errors that are self-healing (i.e. connection errors which have built-in retries)?
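The "report each error only once" requirement of the second option can be sketched as below. This is a minimal illustration, not a finished handler: the dictionary keys (instance_id, job_name, step) and the function name are assumptions, the rows are presumed to come from a query against the Agent job history, and the set of already-reported ids would need to be persisted between runs (for example in a small table).

```python
def unreported_failures(failed_runs, reported_ids):
    """Return failures that have not been ticketed yet and mark them as reported.

    failed_runs  -- iterable of dicts with at least an "instance_id" key
                    (a unique id for one history row; key names are hypothetical)
    reported_ids -- set of ids that already produced a ticket; updated in place
    """
    fresh = []
    for run in failed_runs:
        run_id = run["instance_id"]
        if run_id in reported_ids:
            continue  # a ticket already exists for this failure
        reported_ids.add(run_id)
        fresh.append(run)
    return fresh


if __name__ == "__main__":
    seen = set()
    batch = [
        {"instance_id": 101, "job_name": "LoadSales", "step": 2},
        {"instance_id": 102, "job_name": "LoadStock", "step": 1},
    ]
    for failure in unreported_failures(batch, seen):
        # Placeholder for the call into the custom ticketing software.
        print("create ticket:", failure["job_name"], "step", failure["step"])
```

Run every minute, a loop like this gives the near real-time behaviour mentioned in the second option while keeping the change in one place.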

CXSYNC_PORT wait type in Azure Sql Database

I'm now facing this issue intermittently: a query (called from a stored procedure) enters the CXSYNC_PORT wait type and remains in it for a long time (sometimes 8 hours at a stretch). I have to kill the process and then rerun the procedure. The procedure is called every 2 hours from an ADF pipeline.
What's the reason for this behavior and how do I fix the issue?
I searched a lot and found no Microsoft documentation that discusses the wait type CXSYNC_PORT. Others have asked the same question, but still without getting more details.
Most suggestions are to ask the same question in more forums, or to ask a professional engineer for help, who will handle your problem separately and confidentially.
Ask Azure support for detailed help: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request
And here is the same question, where a Microsoft engineer gave more details about the issue:
As part of a fix CXPACKET waits were further broken down into
CXSYNC_CONSUMER and CXSYNC_PORT (and data transfer waits still
reported as CXPACKET) as to distinguish between different wait times
for correct diagnose of the problem.
Basically, CXPACKET is divided into 3: CXPACKET, CXSYNC_PORT,
CXSYNC_CONSUMER. CXPACKET is used for data transfer sync, while
CXSYNC_* are used for other synchronizations. CXSYNC_PORT is used for
synchronizing opening/closing of exchange port between consuming
thread and producing thread. Long waits here may indicate server load
and lack of available threads. Plans containing sort may contribute
this wait type because complete sorting may occur before port is
synchronized.
Please refer to the linked question "What is causing wait type CXSYNC_PORT and what to do about it?" for more information. For now, though, there isn't an exact solution.
Use the query hint OPTION (MAXDOP 1).
This will run your long-running query on a single thread, so you won't get the CX-type waits. In my experience this can give a massive 10-20x decrease in execution time and will free up CPU for other tasks, as there will be no context switching or thread coordination activity.

BigQuery job history in UI has disappeared

I experienced a weird problem with the BigQuery UI this morning: for a specific project, all Job History, both Personal and Project, has disappeared. Load jobs from the last month are still showing up when I use the bq ls command.
Has anyone seen this before, any advice? I've raised a call with the service desk but wondered what you guys think.
best wishes
Dave
It seems to be a bug in the BigQuery UI; there is a public issue reporting this scenario [1]. As a workaround you can list all the jobs by using the CLI [2].
[1] https://issuetracker.google.com/118569383
[2] https://cloud.google.com/bigquery/docs/managing-jobs#listing_jobs_in_a_project
It turns out there is a shared maximum of 1000 queries/jobs in the UI list. As we had begun to run many hundreds of queries per day, this was 'pushing' the load jobs out of the list. I've submitted a feature request to have this limit increased.
best wishes
Dave

Abort a table import stuck in 'pending'

Similar questions have been asked but not exactly what I am looking for.
The problem: on some occasions importing a table from Google Cloud to Big Query gets stuck in a 'pending' state for hours if not days. Tables that get stuck in this state never seem to come out of it, or at least we didn't bother waiting that long. I know it's not a queue issue since in the meantime we can import other tables just fine. No errors are returned by Big Query.
My question: in this situation, and in general, how can we safely abort/cancel an import to Big Query without having the table quietly import on us without us knowing? This would actually apply to any table regardless of its state, as long as it hasn't finished importing.
Thanks.
You may be hitting load job rate limits. For example, if you try to start more than two load jobs per minute for the same table, the load jobs against that table will be deferred, while load jobs against other tables may continue at normal speed.
There are per-project limits on rate at which load jobs will be started and limits on the number of load jobs that can be running per project at any one time. If you send jobs faster than this, we'll queue, but as you've noticed, our queueing is not a fair queue, and can start newer jobs before older ones.
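One way to stay under a per-table rate like the one in the example above is to throttle submissions on the client side. The sketch below is an illustration only: the class name and parameters are hypothetical, the default of two jobs per 60 seconds simply mirrors the example figure in this answer (it is not an authoritative quota), and actually submitting the load job is left to the caller.

```python
from collections import defaultdict, deque


class TableLoadThrottle:
    """Sliding-window limiter: at most max_jobs starts per table per window."""

    def __init__(self, max_jobs=2, window_seconds=60.0):
        self.max_jobs = max_jobs
        self.window_seconds = window_seconds
        # table name -> timestamps (seconds) of recent job starts, oldest first
        self._starts = defaultdict(deque)

    def try_start(self, table, now):
        """Return True and record the start if a job may be submitted now."""
        starts = self._starts[table]
        # Forget starts that have fallen out of the window.
        while starts and now - starts[0] >= self.window_seconds:
            starts.popleft()
        if len(starts) >= self.max_jobs:
            return False  # caller should retry later instead of submitting
        starts.append(now)
        return True


if __name__ == "__main__":
    import time

    throttle = TableLoadThrottle()
    if throttle.try_start("mydataset.mytable", time.monotonic()):
        print("ok to submit load job")
```

Because the limiter keys on the table name, loads against other tables are unaffected, which matches the behaviour described above.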
Aborting pending jobs is a commonly requested feature. If you file a feature request here that will help us prioritize it.

SQL Azure - One session locking entire DB for Update and Insert

SQL Azure issue.
I've got an issue that manifests as the following exception on our (asp.net) site:
Timeout expired. The timeout period elapsed prior to completion of
the operation or the server is not responding. The statement has been
terminated.
It also results in update and insert statements never completing in SSMS. There aren't any X or IX locks present when querying sys.dm_tran_locks, and there are no transactions when querying sys.dm_tran_active_transactions or sys.dm_tran_database_transactions.
The problem is present for every table in the database but other databases on the same instance don't cause the problem. The duration of the issue can be anywhere from 2 minutes to 2 hours and doesn't happen at any specific times of day.
The database is not full.
At one point this issue didn't resolve itself, but I was able to resolve it by querying sys.dm_exec_connections, finding the longest-running session, and then killing it. The odd thing is that the connection was 15 minutes old, but the lock issue had been present for over 3 hours.
Is there anything else I can check?
EDIT
This is as per Paul's answer below, although I'd actually tracked down the problem before he answered. I will post the steps I used to figure this out below, in case they help anyone else.
The following queries were run when a "timeout period" was present.
select * from sys.dm_exec_requests
As we can see, all the WAIT requests are waiting on session 1021 which is the replication request! The TM Request indicates a DTC transaction and we don't use distributed transactions. You can also see the wait_type of SE_REPL_COMMIT_ACK which again implicates replication.
select * from sys.dm_tran_locks
Again waiting on session 1021
SELECT * FROM sys.dm_db_wait_stats ORDER BY wait_time_ms desc
And yes, SE_REPL_CATCHUP_THROTTLE has a total wait time of 8094034 ms, that is 134.9 minutes!!!
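The conversion above is just milliseconds divided by 60,000. As a small sketch, the same arithmetic can be applied to every SE_REPL* row returned by the sys.dm_db_wait_stats query. The function name and the (wait_type, wait_time_ms) tuple shape are assumptions for illustration, and the second sample row is made up.

```python
def repl_wait_minutes(rows, prefix="SE_REPL"):
    """Map each matching wait type to its total wait time in minutes.

    rows -- iterable of (wait_type, wait_time_ms) pairs, e.g. two columns
            selected from sys.dm_db_wait_stats
    """
    return {
        wait_type: round(wait_time_ms / 60000, 1)  # 60,000 ms per minute
        for wait_type, wait_time_ms in rows
        if wait_type.startswith(prefix)
    }


if __name__ == "__main__":
    sample = [
        ("SE_REPL_CATCHUP_THROTTLE", 8094034),  # figure quoted above
        ("SOS_SCHEDULER_YIELD", 1200),          # made-up row, filtered out
    ]
    print(repl_wait_minutes(sample))
```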
Also see the following forum for details on this issue.
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I've been given the following answer in my communication with Microsoft (we've seen this issue with 4 of our 15 databases in the EU data center):
Question: Have there been changes to these soft throttling limits in the last three weeks, i.e. since my problems started?
Answer: No, there has not.
Question: Are there ways we can prevent or be warned we are approaching a limit?
Answer: No. The issue may not be caused by your application but can be caused by other tenants relying on the same physical hardware. In other words, your application can have very little load and still run into the problem. In other words, your own traffic may be a cause of this problem, but it can just as well be caused by other tenants relying on the same physical hardware. There's no way to know beforehand that the issue will soon occur - it can occur at any time without warning. The SQL Azure operations team does not monitor this type of error, so they won't automatically try to solve the problem for you. So if you run into it you have two options:
1. Create a copy of your db and use that and hope the db is placed on another server with less load.
2. Contact Windows Azure Support and inform them about the problem and let them do Option 1 for you.
You might be running into the SE_REPL* issues that are currently plaguing a lot of folks using Sql Azure (my company included).
When you experience the timeouts, try checking your wait requests for wait types of:
SE_REPL_SLOW_SECONDARY_THROTTLE
SE_REPL_COMMIT_ACK
Run the following to check your wait types on current connections:
SELECT TOP 10 r.session_id, r.plan_handle,
r.sql_handle, r.request_id,
r.start_time, r.status,
r.command, r.database_id,
r.user_id, r.wait_type,
r.wait_time, r.last_wait_type,
r.wait_resource, r.total_elapsed_time,
r.cpu_time, r.transaction_isolation_level,
r.row_count
FROM sys.dm_exec_requests r
You can also check a history of sorts for this by running:
SELECT * FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms desc
If you're seeing a lot of SE_REPL* wait types and these are staying set on your connections for any length of time, then basically you're screwed.
Microsoft are aware of the problem, but I've had a support ticket open for a week with them now and they're still working on it apparently.
The SE_REPL* waits happen when the Sql Azure replication slaves fall behind.
Basically the whole db suspends queries while replication catches up :/
So essentially the aspect that makes Sql Azure highly available is causing databases to become randomly unavailable.
I'd laugh at the irony if it wasn't killing us.
Have a look at this thread for details:
http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8