BigQuery-Java: difference between QueryResponse and GetQueryResultsResponse - google-bigquery

In the sample code provided by Google, two classes are used to fetch results: QueryResponse and GetQueryResultsResponse.
I am not able to understand the purpose of these two classes. Do we have to use both of them?
We are getting data from both: queryResponse.getRows() and queryResults.getRows().
I have gone through the docs but could not figure it out. What is the difference between these two classes, and which is better to use?

Those two results are virtually identical (in fact, they are identical at the raw HTTP level). The difference is how you get them.
QueryResponse is returned by jobs.query(). This method can be used to run a query, but has only limited configuration options. It is intended as a convenience function. For more query options (such as setting a destination table, allowing large results, etc), use jobs.insert(). Another limitation of jobs.query() is that it may time out before the query has completed. Partly, this is because many clients (such as in AppEngine) require all HTTP requests to finish within 30 seconds or so. If jobs.query() times out, it will still report a job id that can be used to fetch the results with jobs.get_query_results().
GetQueryResultsResponse is returned by jobs.get_query_results(). This can be used to get the results of a query started by either jobs.query() or jobs.insert(). Query results (if you don't specify a destination table) are available for 24 hours after the query completes. jobs.get_query_results() allows you to fetch these results at any time. jobs.query() only gives you the query results once.
There is a further difference between the two, which is that jobs.query() just returns the first page of results. jobs.get_query_results() can be used to get multiple pages of results.
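As a rough illustration of the flow described above, here is a minimal Python sketch. It does not call the real BigQuery client: `query_fn` and `get_results_fn` are hypothetical stand-ins for jobs.query() and jobs.get_query_results(), and the field names (`jobComplete`, `jobReference`, `rows`, `pageToken`) are assumed to follow the shape of the REST responses.

```python
from typing import Any, Callable, Dict, List, Optional

Response = Dict[str, Any]


def fetch_all_rows(
    query_fn: Callable[[], Response],
    get_results_fn: Callable[[str, Optional[str]], Response],
    max_polls: int = 10,
) -> List[Any]:
    """Run a query once, then poll and page through its results.

    query_fn stands in for jobs.query(); get_results_fn(job_id, page_token)
    stands in for jobs.get_query_results(). Both are hypothetical callables.
    """
    response = query_fn()
    job_id = response["jobReference"]["jobId"]

    # jobs.query() may return before the query finishes; poll by job id.
    polls = 0
    while not response.get("jobComplete", False):
        if polls >= max_polls:
            raise TimeoutError(f"job {job_id} did not complete")
        response = get_results_fn(job_id, None)
        polls += 1

    # Collect the first page, then follow page tokens for the rest.
    rows = list(response.get("rows", []))
    token = response.get("pageToken")
    while token:
        page = get_results_fn(job_id, token)
        rows.extend(page.get("rows", []))
        token = page.get("pageToken")
    return rows
```

The same loop works whether the job was started by jobs.query() or jobs.insert(), since only the job id is needed after the first call.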
Hopefully this clarifies things a bit.

Related

How to know the first ran jobId of a cached query in big-query?

When we run a query in BigQuery, the results are cached in a temporary table. When we run the same query again, subsequent runs fetch the results from the cache for the next 24 hours, with some exceptions. My use case is this: on a subsequent run, I want to know which jobId the cached results came from, that is, the first run of the query that actually executed.
I have checked all the Java docs related to queries and didn't find that info. There is a cacheHit variable, which tells you whether the results were fetched from the cache or not. I want to go one step further and know from which jobId the results were fetched. I expected that maybe this method would give me that info, but I am always getting a null value for it. I also want to know what parentJob means in the BigQuery context.
It's unclear why you'd even care about this other than as a technical exercise. If you want to build your own application caching layer that's a different concern. More details about query caching can be found on https://cloud.google.com/bigquery/docs/cached-results.
The easiest way to do this would probably be by traversing jobs.list until you find a job that has the same destination table (it'll be prefaced with an anon prefix), and where the cacheHit stat is false/not present.
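A minimal Python sketch of that traversal, assuming the jobs have already been fetched from jobs.list as plain dictionaries with enough detail to include their configuration. The function name is hypothetical, and the field paths (`configuration.query.destinationTable`, `statistics.query.cacheHit`, `jobReference.jobId`) are assumed to match the job resource.

```python
from typing import Any, Dict, Iterable, Optional

Job = Dict[str, Any]


def find_original_job(jobs: Iterable[Job], cached_job: Job) -> Optional[str]:
    """Return the id of the job that first populated the cached result table."""
    target = cached_job["configuration"]["query"]["destinationTable"]
    for job in jobs:
        query_config = job.get("configuration", {}).get("query", {})
        if query_config.get("destinationTable") != target:
            continue
        # cacheHit is false or absent on the run that actually executed.
        if job.get("statistics", {}).get("query", {}).get("cacheHit"):
            continue
        return job["jobReference"]["jobId"]
    return None
```

It returns None when no matching job is found in the listed jobs.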
Your inquiry about parentJob is unrelated to this exercise. It's for finding all the child jobs created as part of a script or multi-statement execution. More information about this can be found on https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting-concepts.

How to count rows of finished BigQuery job using node.js client library

I would like to get the row count of job that was run using:
bigquery.startQuery(options)
The naive way of doing this, would be to stream the result (e.g. using):
job.getQueryResultsStream()
And count one by one. This obviously isn't very efficient, especially for large results. Another way I thought of is using the metadata of the job:
job.on('complete', function(metadata) {...})
Where I could kind of "reverse engineer" the response, to get the query plan, and see the number of written rows in the last step. I could find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a "hack", and it's difficult to say how robust it will be. Seems like I might need to handle different cases (failed queries, etc.)
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to BQ service? Perhaps a field I missed / misinterpreted, or a function in the job object that returns this?
Every query job has a destination table. Even when you do not set one explicitly, the result is still saved in a so-called anonymous table, which you can in turn query to get the count of output rows. So the simple extra query below will work (note: the names are just an example):
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use can be retrieved from the configuration.query.destinationTable property of the job.
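To make the step from job metadata to the counting query concrete, here is a small Python sketch. The function name is hypothetical, and it assumes the metadata is a dictionary whose destinationTable entry carries `projectId`, `datasetId` and `tableId` keys.

```python
from typing import Any, Dict


def build_count_query(job_metadata: Dict[str, Any]) -> str:
    """Build a COUNT query against the job's (possibly anonymous) destination table."""
    table = job_metadata["configuration"]["query"]["destinationTable"]
    full_name = "{projectId}.{datasetId}.{tableId}".format(**table)
    return f"SELECT COUNT(1) FROM `{full_name}`"
```

The resulting string still has to be submitted as a separate query, which is the extra roundtrip mentioned in the question.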
job.on('complete', function(metadata) {
  // numDmlAffectedRows is only populated for DML statements (INSERT, UPDATE, DELETE).
  console.log(metadata.statistics.query.numDmlAffectedRows);
});

Listing BigQuery Tables in `huge/big` Datasets - 30K-40K+ tables

The task is to programmatically list all the tables within a given dataset that has more than 30-40K tables.
The initial option we explored was the tables.list API (as we always do for normal datasets with a reasonable number of tables in them).
It looks like this API returns at most 1000 entries per call (even if we try to set maxResults to a bigger value).
To get the next 1000 we need to wait for the response to the previous request, extract the pageToken, repeat the call, and so on.
For datasets with 30K-40K+ tables this can take 10-15 seconds or more (in good conditions).
So the timing is a problem for us that we want to address!
In the above-mentioned calls we are getting back only nextPageToken and tables/tableReference/tableId, so the size of the response is extremely small!
Question:
Is there a way to somehow increase maxResults, so as to get all tables in one (or very few) call(s) (assuming it will be much faster than doing 30-40 calls)?
The workaround we tried so far is to use __TABLES_SUMMARY__ with the jobs.insert or jobs.query API.
This way the whole result is returned within seconds, but in our particular case using the BigQuery jobs API is not an option for multiple reasons. We want to be able to use the list API.
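For reference, the sequential paging described in the question looks roughly like the Python sketch below. `fetch_page` is a hypothetical stand-in for one tables.list call that takes the previous page token (or None) and returns a dictionary with `tables` and, when more pages exist, `nextPageToken`. The sketch also counts round trips, since that is what drives the latency.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def list_all_table_ids(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
) -> Tuple[List[str], int]:
    """Follow nextPageToken until exhausted; return table ids and call count."""
    table_ids: List[str] = []
    calls = 0
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        calls += 1
        for table in page.get("tables", []):
            table_ids.append(table["tableReference"]["tableId"])
        token = page.get("nextPageToken")
        if not token:
            return table_ids, calls
```

Because each call needs the token from the previous response, the calls cannot be issued in parallel.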

BigQuery paging issues with tableData.list()

We're trying to load 120,000 rows from a BigQuery table using tableData.list. Our first call,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000
returns a pageToken as expected and the first 22482 rows (1 to 22482). We assume this is due to the 10 MB serialized JSON limit. Our second call, however,
https://www.googleapis.com/bigquery/v2/projects/{myProject}/datasets/{myDataSet}/tables/{myTable}/data?maxResults=90000&pageToken=CIDBB777777QOGQIBCIL6BIQSC7QK===
returns not the next rows, but 22137 rows, from row 90001 to row 112137, without a pageToken.
Strangely, if we change maxResults to 100,000, we get rows starting from row 100,000.
We are able to work around this by using startRowIndex to page. In this case, we start with the first call being startRowIndex=0 and we never get a pageToken in the response. We keep making calls until all rows are retrieved. However, the concern is that without pageTokens, if the row order changes while the calls are being made, the data could be invalid.
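The index-based workaround can be sketched in Python as follows. `fetch(start_index, max_results)` is a hypothetical stand-in for one tableData.list call, and the response is assumed to carry `rows` and `totalRows`. The key detail is that the next start index advances by the number of rows actually received, not by maxResults, because a response may be cut short by the size limit.

```python
from typing import Any, Callable, Dict, List


def fetch_rows_by_index(
    fetch: Callable[[int, int], Dict[str, Any]],
    max_results: int,
) -> List[Any]:
    """Page through a table by row index instead of page tokens.

    fetch(start_index, max_results) is a hypothetical stand-in for a
    tableData.list call; it may return fewer rows than requested.
    """
    rows: List[Any] = []
    while True:
        page = fetch(len(rows), max_results)
        batch = page.get("rows", [])
        rows.extend(batch)
        total = int(page.get("totalRows", 0))
        # Stop when everything is read, or when a page comes back empty.
        if not batch or len(rows) >= total:
            return rows
```

This does nothing about the consistency concern raised above: if rows are reordered between calls, index-based paging can still skip or repeat rows.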
We are able to reproduce this behavior on multiple tables with different sizes and structures.
Is there any issue with paging, or should we be structuring our calls differently?
This was a known, high-priority bug that has since been fixed.

Youtube API problem - when searching for playlists, start-index does not work past 100

I have been trying to get the full list of playlists matching a certain keyword. I have discovered however that using start-index past 100 brings the same set of results as using start-index=1. It does not matter what the max-results parameter is - still the same results. The total results returned however is way above 100, thus it cannot be that the query returned only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example - the queries bring the same result set, whether you use start-index=1, or start-index=101, or start-index = 201 etc:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo
I made an interface for my site, and the way I avoided this problem is to do a query for a large number, then store the results. Let your web page then break up the results and present them however is needed.
For example, if someone wants to do a search of over 100 videos, do the search and collect the results, but only present them with the first group, say 10. Then when the person wants to see the next ten, you get them from the list you stored, rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
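A minimal Python sketch of that idea: run one large search, keep the results, and serve pages from the stored list. The class name and the `search` callable are hypothetical; `search` stands in for whatever single query fetches the full result set.

```python
from typing import Any, Callable, List, Optional, Sequence


class CachedSearch:
    """Run one large search, keep the results, and page through them locally."""

    def __init__(self, search: Callable[[], Sequence[Any]], page_size: int = 10):
        self._search = search
        self._page_size = page_size
        self._results: Optional[List[Any]] = None  # filled on first access

    def page(self, number: int) -> List[Any]:
        """Return the 1-based page `number` from the stored results."""
        if self._results is None:
            self._results = list(self._search())  # single remote query
        start = (number - 1) * self._page_size
        return self._results[start:start + self._page_size]
```

Asking for a page past the end simply returns an empty list, and no further remote query is made after the first one.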
Hope this makes sense and helps.