BigQuery API Java client intermittently returning bad results - google-bigquery

I am executing some long-running queries using the BigQuery Java client.
I construct a BigQuery job and execute it like this:
val queryRequest = new QueryRequest().setQuery(query)
val queryJob = client.jobs().query(ProjectId, queryRequest)
queryJob.execute()
The problem I am facing is that for the same query, the client sometimes returns before the job is complete, i.e. the number of rows in the result is zero.
I tried printing the response and it shows:
{"jobComplete":false,"jobReference":{"jobId":"job_bTLRGrw5_xR26i9Li3a9EQvuA6c","projectId":"analytics-production"},"kind":"bigquery#queryResponse"}
From that I can see that the job is not complete. Then why did the client return before the job was complete?
While building the client, I use an HttpRequestInitializer, and in the initialize method I provide the timeout parameters:
override def initialize(request: HttpRequest): Unit = {
  request.setConnectTimeout(...)
  request.setReadTimeout(...)
}
I tried giving high values for the timeout, like 240 seconds, but no luck. The behavior is still the same: it fails intermittently.

Make sure you set the timeout on the BigQuery request body, and not on the HTTP object.
val queryRequest = new QueryRequest().setQuery(query).setTimeoutMs(10000) //10 seconds
The param is timeoutMs. This is documented here: https://cloud.google.com/bigquery/docs/reference/v2/jobs/query
Please also read the docs regarding this field:
"How long to wait for the query to complete, in milliseconds, before the request times out and returns. Note that this is only a timeout for the request, not the query. If the query takes longer to run than the timeout value, the call returns without any results and with the 'jobComplete' flag set to false. You can call GetQueryResults() to wait for the query to complete and read the results. The default value is 10000 milliseconds (10 seconds)."
More about Synchronous queries here
https://cloud.google.com/bigquery/querying-data#syncqueries
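To make that concrete, here is a minimal Java sketch of the recommended pattern, assuming an already-built Bigquery service object (client, projectId, and query are placeholders): issue the query with a bounded timeoutMs, then poll jobs().getQueryResults() until jobComplete is true.

import com.google.api.services.bigquery.model.GetQueryResultsResponse;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

// `client`, `projectId` and `query` are assumed to exist in your code.
QueryResponse queryResponse = client.jobs()
    .query(projectId, new QueryRequest().setQuery(query).setTimeoutMs(10000L))
    .execute();
String jobId = queryResponse.getJobReference().getJobId();

// If the job didn't finish within timeoutMs, keep long-polling until it does.
GetQueryResultsResponse results;
do {
  results = client.jobs()
      .getQueryResults(projectId, jobId)
      .setTimeoutMs(10000L)   // each call waits up to 10 s server-side
      .execute();
} while (!Boolean.TRUE.equals(results.getJobComplete()));
// results.getRows() now holds the first page of the result set.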

Related

Python BigQuery client - setting query result timeout

Consider the following script (adapted from the Google Cloud Python documentation: https://google-cloud-python.readthedocs.io/en/0.32.0/bigquery/usage.html#querying-data), which runs a BigQuery query with a timeout of 30 seconds:
import logging
from google.cloud import bigquery
# Set logging level to DEBUG in order to see the HTTP requests
# being made by urllib3
logging.basicConfig(level=logging.DEBUG)
PROJECT_ID = "project_id"  # replace with your actual project ID
client = bigquery.Client(project=PROJECT_ID)
QUERY = ('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
         'WHERE state = "TX" '
         'LIMIT 100')
TIMEOUT = 30 # in seconds
query_job = client.query(QUERY) # API request - starts the query
assert query_job.state == 'RUNNING'
# Waits for the query to finish
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
assert query_job.state == 'DONE'
assert len(rows) == 100
row = rows[0]
assert row[0] == row.name == row['name']
The linked documentation says:
Use of the timeout parameter is optional. The query will continue to
run in the background even if it takes longer than the timeout allowed.
When I run it with google-cloud-bigquery version 1.23.1, the logging output seems to indicate that "timeoutMs" is 10 seconds.
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/5ceceaeb-e17c-4a86-8a27-574ad561b856?maxResults=0&timeoutMs=10000&location=US HTTP/1.1" 200 None
Notice the timeoutMs=10000 in the output above.
This seems to happen whenever I call result with a timeout value higher than 10. On the other hand, if I use a value lower than 10 as the timeout, the timeoutMs value looks correct. For example, if I change TIMEOUT = 30 to TIMEOUT = 5 in the script above, the log shows:
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/71a28435-cbcb-4d73-b932-22e58e20d994?maxResults=0&timeoutMs=4900&location=US HTTP/1.1" 200 None
Is this behavior expected?
Thank you in advance and best regards.
The timeout parameter works in a best-effort manner: the method tries to complete all of its API calls within the indicated timeframe. Internally, the result() method can perform more than one request, and the getQueryResults request in the log:
DEBUG:urllib3.connectionpool:https://bigquery.googleapis.com:443 "GET /bigquery/v2/projects/project_id/queries/5ceceaeb-e17c-4a86-8a27-574ad561b856?maxResults=0&timeoutMs=10000&location=US HTTP/1.1" 200 None
is executed inside the done() method. You can look at the source code to understand how the timeout for each request is calculated, but basically it is the minimum of 10 seconds and the user-supplied timeout. If the operation has not completed, the request is retried until the overall timeout has been reached.
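In pseudocode form (a Java-flavored sketch of the behavior described above, not the library's actual code; pollGetQueryResults is a hypothetical helper):

static void waitForJob(long userTimeoutMs) {
  long deadline = System.currentTimeMillis() + userTimeoutMs;
  boolean jobComplete = false;
  while (!jobComplete && System.currentTimeMillis() < deadline) {
    long remainingMs = deadline - System.currentTimeMillis();
    // With timeout=30 this caps at 10000; with timeout=5 it is ~4900,
    // matching the timeoutMs values seen in the logs above.
    long requestTimeoutMs = Math.min(10000L, remainingMs);
    jobComplete = pollGetQueryResults(requestTimeoutMs);  // hypothetical helper
  }
}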

Change Google Cloud Dataflow BigQuery Priority

I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job it takes minutes for it to start reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);
  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH")              // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");
  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query runs in BATCH mode, then one workaround could be (sketched below):
1. Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
2. Write the results of step 1 to a temp table.
3. In your Beam pipeline, read the temp table using BigQueryIO.Read.from().
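A hedged sketch of step 1 using the same API objects as the Beam code above (bigquery, projectId, query, and the temp table names are placeholders; the dispositions are illustrative):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;

// Submit the query yourself with INTERACTIVE priority and a temp
// destination table, then let Beam read the temp table.
JobConfigurationQuery queryConfig = new JobConfigurationQuery()
    .setQuery(query)
    .setPriority("INTERACTIVE")            // the value Beam hard-codes to "BATCH"
    .setAllowLargeResults(true)
    .setDestinationTable(new TableReference()
        .setProjectId(projectId)
        .setDatasetId("temp_dataset")
        .setTableId("temp_table"))
    .setCreateDisposition("CREATE_IF_NEEDED")
    .setWriteDisposition("WRITE_TRUNCATE");
bigquery.jobs()
    .insert(projectId, new Job().setConfiguration(
        new JobConfiguration().setQuery(queryConfig)))
    .execute();
// After the job completes, point the pipeline at the temp table (step 3):
// p.apply(BigQueryIO.Read.from(projectId + ":temp_dataset.temp_table"));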
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
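For reference, a hedged sketch of what that can look like on a newer Beam SDK (assuming a release whose BigQueryIO exposes a QueryPriority option on query reads; pipeline and the query text are placeholders):

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.QueryPriority;

// Assumes a Beam version that exposes QueryPriority on query reads.
pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT ...")    // your query here
        .withQueryPriority(QueryPriority.INTERACTIVE));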
Please note that you might be reaching one of the BigQuery limits and quotas: with batch priority, if you ever hit a rate limit, the query is queued and retried later, whereas an interactive query fails immediately when those limits are hit. This is because BigQuery assumes that an interactive query is something you need to run immediately.

Does Selenium implicit wait always take the entire wait time or can it finish sooner?

Does Selenium implicit wait always take the entire wait time or can it finish sooner? If I set the implicit wait to 10 seconds, could a call to .findElement finish in a few seconds or would it always take the entire 10 seconds?
This page implies that it waits the full 10 seconds, which is very confusing because it's not what the Javadoc implies.
The following code comment from WebDriver.java implies that it's a polling action which can finish sooner than the defined implicit timeout. BUT, the last sentence in the comment really throws a wrench into that belief and makes me not totally sure about it. If it is actually polling, how would it have "an adverse effect on test run time", since it wouldn't always run for the entire implicit wait duration?
/**
 * from WebDriver.java
 * Specifies the amount of time the driver should wait when searching for an element if
 * it is not immediately present.
 * <p/>
 * When searching for a single element, the driver should poll the page until the
 * element has been found, or this timeout expires before throwing a
 * {@link NoSuchElementException}. When searching for multiple elements, the driver
 * should poll the page until at least one element has been found or this timeout has
 * expired.
 * <p/>
 * Increasing the implicit wait timeout should be used judiciously as it will have an
 * adverse effect on test run time, especially when used with slower location
 * strategies like XPath.
 *
 * @param time The amount of time to wait.
 * @param unit The unit of measure for {@code time}.
 * @return A self reference.
 */
Timeouts implicitlyWait(long time, TimeUnit unit);
Also, can anyone provide information on how often the default "polling" occurs?
It can finish as soon as it is able to find the element. If not, it throws the error and stops. The poll time is again very specific to the driver implementation (not the Java bindings, but the driver part, e.g. the Firefox extension, Safari extension, etc.).
As I have mentioned here, these are very specific to the driver implementation. All driver-related calls go via the execute method.
Here is the gist of the execute method (you can find the full source here):
protected Response execute(String driverCommand, Map<String, ?> parameters) {
  Command command = new Command(sessionId, driverCommand, parameters);
  Response response;
  long start = System.currentTimeMillis();
  String currentName = Thread.currentThread().getName();
  Thread.currentThread().setName(
      String.format("Forwarding %s on session %s to remote", driverCommand, sessionId));
  try {
    log(sessionId, command.getName(), command, When.BEFORE);
    response = executor.execute(command);
    log(sessionId, command.getName(), command, When.AFTER);
    if (response == null) {
      return null;
    }
    // other code omitted
}
The line:
response = executor.execute(command);
says the whole story. executor is of type CommandExecutor, so all calls go to the specific driver class, like ChromeCommandExecutor or SafariDriverCommandExecutor, each of which has its own handling.
So the polling is up to the driver implementation.
If you want to specify the polling time, then you should probably start using Explicit Waits.
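For example, a minimal explicit-wait sketch (assuming a Selenium version with the Duration-based FluentWait API; driver and the locator are placeholders):

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

// Explicit wait: you choose both the timeout and the polling interval.
// Returns as soon as the element is found; fails after 10 s at the latest.
Wait<WebDriver> wait = new FluentWait<>(driver)
    .withTimeout(Duration.ofSeconds(10))
    .pollingEvery(Duration.ofMillis(500))
    .ignoring(NoSuchElementException.class);
WebElement element = wait.until(d -> d.findElement(By.id("myElement")));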
As mentioned in the code comment:
* When searching for a single element, the driver should poll the page until the
* element has been found, or this timeout expires before throwing a
* {@link NoSuchElementException}.
It's going to wait until that element is present, or the timeout occurs.
For example, if you set the implicit wait to 10 seconds, .findElement will wait a maximum of 10 seconds for that element. Suppose the element becomes available in the DOM within 5 seconds; then it will come out of the "wait" and start executing the next step.
Hope this clarifies.
To my knowledge, the polling period is not 0.5 seconds with implicit wait; that is the case with explicit wait, which polls the DOM every 500 ms. With implicit wait, if the element is not found on page load, the driver waits for the specified time and then checks again after the time has run out. If the element is still not found, it throws an error.

Read timed out: synchronous query via BigQuery Java API

We are using the BigQuery Java API to retrieve results for our analytics reporting frontend. We are trying to retrieve the results synchronously. A lot of the time we get a Read timed out error, even before the query timeout specified in the parameters. Here's the stack trace for a sample failure:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:331)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:830)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:787)
at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
I am not able to retrieve the job id of the resulting job, as the error occurs before I can retrieve a JobReference object. The timeout specified in this case was 300 sec; the query failed well before it. The query contains three JOINs and several GROUP EACH BY clauses. Can you suggest a possible way to debug this?
Adding the code snippet:
QueryRequest queryInfo = new QueryRequest().setQuery(sql)
    .setTimeoutMs(timeOutInSec * 1000);
// get project id
BQGameConnectionDetails details = Config
    .getBQConnectionDetails(gameId);
String projectId = details.getProjectId();
Bigquery.Jobs.Query queryRequest = getInstance(gameId).jobs()
    .query(projectId, queryInfo);
QueryResponse response = queryRequest.execute();
There are two timeouts involved. The first is the timeout on the HTTP request you send to BigQuery; the second is the BigQuery request timeout (the timeoutMs in the request body). It sounds like you've set the latter to a large value, but the former is likely the timeout that you're hitting. If the HTTP request times out before the BigQuery timeout, the connection will be closed and BigQuery won't have a chance to respond.
There are two options: the first is to increase the HTTP request timeout (which depends on the libraries you're using, but this page here may be helpful). The second is to decrease the BigQuery timeout. This means you'll have to use jobs.getQueryResults() to read the actual results, but this is a more robust method because it doesn't matter how long the query takes; you can just call getQueryResults() in a loop. I would post a link to a good Java sample that does this, but I don't know that one exists, unfortunately.
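For illustration, a hedged sketch of option 1 (the exact values are examples, not recommendations):

import java.io.IOException;
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestInitializer;

// Option 1, sketched: make the HTTP read timeout comfortably larger than
// the timeoutMs you set on the QueryRequest body.
HttpRequestInitializer initializer = new HttpRequestInitializer() {
  @Override
  public void initialize(HttpRequest request) throws IOException {
    // credential.initialize(request);  // keep auth if wrapping an existing initializer
    request.setConnectTimeout(60 * 1000);    // 1 minute to connect
    request.setReadTimeout(6 * 60 * 1000);   // 6 minutes, above the 300 s query timeout
  }
};

For option 2, set a small timeoutMs on the QueryRequest and poll jobs().getQueryResults() in a loop, as in the sketch in the first answer above.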

How to avoid hitting the 10-sec limit per user

We run multiple short queries in parallel, and hit the 10-sec limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per second, per user, per project.
We send a "start query job", and then we call getQueryResults() with a timeoutMs of 60,000. However, we get a response after ~1 sec; we look for jobComplete in the JSON response, and since it is not there, we need to send getQueryResults() again many times and hit the threshold. That causes an error, not a slowdown. The sample code is below.
our questions are as such:
1. What is a "user"? Is it an App Engine user, or a user-id that we can put in the connection string or in the query itself?
2. Is it really per BigQuery API project?
3. What is the behavior? We got an error: "Exceeded rate limits: too many user/method api request limit for this user_method", and not the throttling behavior the docs describe, and our whole process fails.
4. As seen in the code below, why do we get the response after ~1 sec and not according to our timeout? Are we doing something wrong?
Thanks a lot
Here is a code sample:
while res is None or 'jobComplete' not in res or not res['jobComplete']:
    try:
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
            jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
    except HTTPException:
        if independent:
            raise
Are you saying that even though you specify timeoutMs=60000, it returns within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is that we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout... if that is really what is happening, it may be the root of your problems.
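In the meantime, a hedged sketch (in Java, for brevity) of one way to stay under the per-user request-rate limit while polling, using capped exponential backoff (jobs, projectId, and jobId are placeholders):

import com.google.api.services.bigquery.model.GetQueryResultsResponse;

// `jobs` is a Bigquery.Jobs instance; back off between polls so parallel
// queries stay under the per-user request-rate limit.
long backoffMs = 500;
GetQueryResultsResponse res = jobs.getQueryResults(projectId, jobId)
    .setTimeoutMs(60000L)
    .execute();
while (!Boolean.TRUE.equals(res.getJobComplete())) {
  Thread.sleep(backoffMs);                      // pause before re-polling
  backoffMs = Math.min(backoffMs * 2, 8000L);   // 0.5 s, 1 s, 2 s, ... capped at 8 s
  res = jobs.getQueryResults(projectId, jobId).setTimeoutMs(60000L).execute();
}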
Following up, here is the full loop I use, with the request time logged:
def query_results_long(self, jobId, maxResults, res=None):
    start_time = query_time = None
    while res is None or 'jobComplete' not in res or not res['jobComplete']:
        if start_time:
            logging.info('requested for query results ended after %s', query_time)
            time.sleep(2)
        start_time = datetime.now()
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
            jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
        query_time = datetime.now() - start_time
    return res
Then in the App Engine log I had this:
requested for query results ended after 0:00:04.959110