How to know the jobId of the first run of a cached query in BigQuery? - google-bigquery

When we run a query in the BigQuery environment, the results are cached in a temporary table. For the next 24 hours (with some exceptions), subsequent runs of the same query fetch their results from that cache. My use case: on a subsequent run, I want to know the jobId of the original, first run of the query that produced the cached results.
I have checked all the Java docs related to query and didn't find that info. There is a cacheHit variable, which tells you whether the query results were fetched from the cache or not. Here I want to go one step further and learn which jobId the results were fetched from. I expected that maybe this method would give me the info, but I always get a null value from it. I would also like to know what parentJob means in the BigQuery context.

It's unclear why you'd care about this other than as a technical exercise; if you want to build your own application caching layer, that's a different concern. More details about query caching can be found at https://cloud.google.com/bigquery/docs/cached-results.
The easiest way to do this would probably be to traverse jobs.list until you find a job that has the same destination table (it will carry an anonymous-table prefix) and whose cacheHit statistic is false or not present.
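A rough sketch of that traversal with the google-cloud-bigquery Java client; the job id is a placeholder, and listJobs defaults to the current project:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class FindOriginalJob {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // "my-cached-job-id" is a placeholder for the run whose cacheHit was true.
    Job cachedJob = bigquery.getJob("my-cached-job-id");
    QueryJobConfiguration cachedConfig = cachedJob.getConfiguration();
    TableId destination = cachedConfig.getDestinationTable(); // the anonymous table

    // Walk the job history looking for a query job that wrote to the same
    // anonymous destination table without itself being a cache hit.
    for (Job job : bigquery.listJobs(BigQuery.JobListOption.allUsers()).iterateAll()) {
      if (!(job.getConfiguration() instanceof QueryJobConfiguration)) {
        continue;
      }
      QueryJobConfiguration config = job.getConfiguration();
      JobStatistics.QueryStatistics stats = job.getStatistics();
      boolean cacheHit = Boolean.TRUE.equals(stats.getCacheHit());
      if (destination.equals(config.getDestinationTable()) && !cacheHit) {
        System.out.println("Originating job: " + job.getJobId().getJob());
        break; // most recent non-cached run that produced this table
      }
    }
  }
}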
Your question about parentJob is unrelated to this exercise. It identifies the parent of child jobs created as part of a script or multi-statement execution. More information about this can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting-concepts.

Related

Airflow Operator BigQueryTablePartitionExistenceSensor Question

I'm trying to use the BigQueryTablePartitionExistenceSensor operator in Airflow, and I was wondering whether this operator checks that the partition is fully loaded, or whether it can potentially mark success even if the data isn't complete yet.
For example, if my table is partitioned on DAY and the load for 20220420 has started but isn't complete, would this sensor trigger? Or, would it wait until that load step has been completed before marking the sensor to success?
Thanks
The operator will not wait until your data has loaded; it just checks for the existence of the partition value at that moment in time. So if a single row gets inserted into that partition, this sensor returns True. See the sensor code that gets called by this operator.
An idea I've used in the past for similar problems is a sentinel label on the partitioned table that marks a load as "in-progress" or "done".
As has already been answered, it does not await anything except the existence of the partition.
If your data is streamed into partitions, and you have ordered delivery, you can probably add a sensor for the next-day partition — on the assumption that the previous day is complete when events have started streaming into the next.
If the load is managed by the same Airflow instance, I'd suggest using an ExternalTaskSensor on the load job. If not, you might be able to use the more generic SqlSensor and run a custom SQL query on metadata tables to determine whether a partition is complete; perhaps you can add a label or something to the load job that you then query for. A sketch of the metadata-table idea follows below.
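For instance, a query against INFORMATION_SCHEMA.PARTITIONS reports how many rows a partition currently holds; here is a minimal sketch with the BigQuery Java client, where the project, dataset, table, and partition id are all hypothetical placeholders and the completeness criterion is still up to you:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class PartitionRowCount {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // All names below are hypothetical placeholders.
    String sql =
        "SELECT total_rows "
      + "FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS` "
      + "WHERE table_name = 'my_table' AND partition_id = '20220420'";

    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    for (FieldValueList row : result.iterateAll()) {
      // A SqlSensor-style check would compare this count against an expected value.
      System.out.println("rows loaded so far: " + row.get("total_rows").getLongValue());
    }
  }
}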

Is there a way to check if a query hit a cached result or not in BigQuery?

We are performance tuning some of our queries, both in terms of cost and speed, and the results we get are a little bit weird. First, we had one query that did an overwrite of an existing table; we stopped that one after 4 hours. Running the same query into an entirely new table took only 5 minutes. I wonder if the 5-minute query maybe used a cached result from the first run. Is that possible to check? Is it possible to force BigQuery to not use the cache?
If you run the query in the UI, expand Options and make sure Use Cached Results is set the way you want.
Also in the UI, you can check Job Details to see whether a cached result was used.
If you run your query programmatically, use the respective attributes: configuration.query.useQueryCache and statistics.query.cacheHit.
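With the google-cloud-bigquery Java client, those two knobs look roughly like this minimal sketch (the public Shakespeare sample is just a stand-in query):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class CacheCheck {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder("SELECT COUNT(1) FROM `bigquery-public-data.samples.shakespeare`")
        .setUseQueryCache(false) // force a fresh run; true (the default) allows the cache
        .build();

    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    JobStatistics.QueryStatistics stats = job.getStatistics();
    System.out.println("cacheHit: " + stats.getCacheHit()); // null/false on a fresh run
  }
}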

How to count rows of finished BigQuery job using node.js client library

I would like to get the row count of job that was run using:
bigquery.startQuery(options)
The naive way of doing this would be to stream the result, e.g. using:
job.getQueryResultsStream()
And count one by one. This obviously isn't very efficient, especially for large results. Another way I thought of is using the metadata of the job:
job.on('complete', function(metadata) {...});
Where I could kind of "reverse engineer" the response, to get the query plan, and see the number of written rows in the last step. I could find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a hack, and it's difficult to say how robust it will be. It seems I might need to handle different cases (failed queries, etc.).
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to BQ service? Perhaps a field I missed / misinterpreted, or a function in the job object that returns this?
Any job has a destination table, even when you do not set one explicitly; the result is still saved in a so-called anonymous table, which you can in turn query to get the count of output rows. So the simple extra query below will work (note: the names are just an example):
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use can be retrieved from the configuration.query.destinationTable property of the job.
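The thread is about node.js, but the flow is identical in any client; here is the same lookup-and-COUNT roundtrip sketched with the BigQuery Java client, where the job id is a hypothetical placeholder:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

public class CountJobOutputRows {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // "my-finished-query-job" is a placeholder for the completed job's id.
    Job job = bigquery.getJob("my-finished-query-job");
    QueryJobConfiguration config = job.getConfiguration();
    TableId dest = config.getDestinationTable(); // the anonymous destination table

    String countSql = String.format(
        "SELECT COUNT(1) AS n FROM `%s.%s.%s`",
        dest.getProject(), dest.getDataset(), dest.getTable());
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(countSql).build());
    System.out.println(result.iterateAll().iterator().next().get("n").getLongValue());
  }
}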
job.on('complete', function(metadata) {
  // Note: numDmlAffectedRows is only populated for DML statements
  // (INSERT / UPDATE / DELETE / MERGE), not for plain SELECT queries.
  console.log(metadata.statistics.query.numDmlAffectedRows);
});

How can I fetch (via GET) all JIRA issues? Do I go to the Search node?

It looks like /api/2/project easily returns all projects in a JIRA instance in JSON format.
I'd like to do the same for issues, but this does not appear to exist.
Is /api/2/search the standard way to do a mass dump like this? And what is the best way to regularly sync this into a database? Would I do something like search (updated date > [last entry in database]) and then go through the pagination? Surely I can't be the first person attempting this, though I see no similar guide anywhere online (I checked Jira's own docs; there's no real mass-issue-export guide).
EDIT: Okay, it looks like search really is the "issue dump", and not the issue node, which, contrary to their documentation, does not default to a collection but really is for creating issues or listing them one at a time. I'll probably go the route of updated > [whatever the last date in the DB is].
Unless you have very few issues, you can't fetch all of them at once.
What you can do is to execute the search step by step.
For example, let's say you have 1324 JIRA issues. In order to retrieve all of them, you have to execute a search similar to this one several times:
/rest/api/2/search?&maxResults=100&startAt=0
This will retrieve the first 100 JIRA issues, starting from 0.
How to get the others?
When you execute the search, a field named total is returned. That field holds the total number of JIRA issues in your system (1324 in this example).
The next query will be:
/rest/api/2/search?&maxResults=100&startAt=100
Repeat this operation, incrementing the value of startAt by 100 every time, until all the issues are returned.
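Put together, the paging loop could look like this minimal Java sketch; the host is a placeholder, authentication is omitted, and the crude regex stands in for real JSON parsing of the response:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JiraIssueDump {
  private static final Pattern TOTAL = Pattern.compile("\"total\"\\s*:\\s*(\\d+)");

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    String base = "https://jira.example.com/rest/api/2/search"; // hypothetical instance
    final int maxResults = 100;
    int startAt = 0;
    int total;
    do {
      HttpRequest request = HttpRequest.newBuilder()
          .uri(URI.create(base + "?maxResults=" + maxResults + "&startAt=" + startAt))
          .GET()
          .build();
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      // Crude extraction of the "total" field; a real client would use a JSON parser
      // and also pull the "issues" array out of each page before moving on.
      Matcher m = TOTAL.matcher(body);
      total = m.find() ? Integer.parseInt(m.group(1)) : 0;
      startAt += maxResults;
    } while (startAt < total);
  }
}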

BigQuery-Java: difference between QueryResponse and GetQueryResultsResponse

In the sample code provided by Google, two classes are used to fetch results: QueryResponse and GetQueryResultsResponse.
I am not able to understand the purpose of these two classes. Do we have to use both of them?
We are getting data from both queryResponse.getRows() and queryResults.getRows().
I have gone through the docs but could not figure out what the difference between these two classes is, and which one is better to use.
Those two results are virtually identical (in fact, they are identical in the raw HTTP request). The difference is how you get them.
QueryResponse is returned by jobs.query(). This method can be used to run a query, but has only limited configuration options. It is intended as a convenience function. For more query options (such as setting a destination table, allowing large results, etc), use jobs.insert(). Another limitation of jobs.query() is that it may time out before the query has completed. Partly, this is because many clients (such as in AppEngine) require all HTTP requests to finish within 30 seconds or so. If jobs.query() times out, it will still report a job id that can be used to fetch the results with jobs.get_query_results().
GetQueryResultsResponse is returned by jobs.get_query_results(). This can be used to get the results of a query started by either jobs.query() or jobs.insert(). Query results (if you don't specify a destination table) are available for 24 hours after the query completes. jobs.get_query_results() allows you to fetch these results at any time. jobs.query() only gives you the query results once.
There is a further difference between the two, which is that jobs.query() just returns the first page of results. jobs.get_query_results() can be used to get multiple pages of results.
Hopefully this clarifies things a bit.
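To make the flow concrete, here is a minimal sketch against the legacy google-api-services-bigquery client; it assumes an already-authorized Bigquery client, and "my-project" and the query are placeholders:

import java.io.IOException;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.GetQueryResultsResponse;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

public class QueryVsGetQueryResults {
  // `bigquery` is an already-authorized legacy API client.
  static void run(Bigquery bigquery) throws IOException {
    QueryRequest request = new QueryRequest().setQuery("SELECT 17");
    QueryResponse response = bigquery.jobs().query("my-project", request).execute();

    if (Boolean.TRUE.equals(response.getJobComplete())) {
      System.out.println(response.getRows()); // jobs.query() only returns this first page
    } else {
      // The query outlived the HTTP call: fetch results later by job id.
      // (GetQueryResultsResponse also has getJobComplete(), so poll until it is true.)
      String jobId = response.getJobReference().getJobId();
      GetQueryResultsResponse results =
          bigquery.jobs().getQueryResults("my-project", jobId).execute();
      System.out.println(results.getRows()); // page through with getPageToken()
    }
  }
}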