How to count the rows of a finished BigQuery job using the Node.js client library

I would like to get the row count of a job that was run using:
bigquery.startQuery(options)
The naive way of doing this would be to stream the results, e.g. using:
job.getQueryResultsStream()
and count the rows one by one. This obviously isn't very efficient, especially for large result sets. Another way I thought of is to use the metadata of the job:
job.on('complete', function(metadata) {...});
where I could kind of "reverse engineer" the response to get the query plan and read the number of rows written in the last step. I can find that in:
statistics.query.queryPlan[statistics.query.queryPlan.length - 1].recordsWritten
While a sample of different queries convinced me that this might work, it feels like a "hack", and it's difficult to say how robust it will be. It also seems I might need to handle different cases (failed queries, etc.).
EDIT: Another option suggested below is "SELECT COUNT"ing the temp table created by the original query (available in the job metadata). While this absolutely is a straightforward way to get the result I'm looking for, it has the disadvantage of requiring another roundtrip to query the BigQuery service, which costs several seconds. It is a 0 "bytes billed" query (counting a full table uses table metadata only), but it seems redundant when the job "knows" how many rows it has written to the output.
Is there a straightforward and "correct" way to get this count from the job object, without a roundtrip to the BQ service? Perhaps a field I missed or misinterpreted, or a function on the job object that returns this?

Any query job has a destination table - even when you do not explicitly set one, the result is still saved in a so-called anonymous table that you can in turn query to get the count of output rows. So the simple extra query below will work (note - the names are just an example):
SELECT COUNT(1)
FROM `yourProject._0511743a77ca76c1b55482d7cb1f8e91ac5c7b36.anon17286defe54b5c07ba6810a71abfdba6388ac4e0`
The actual destination table to use can be retrieved from the configuration.query.destinationTable property of the job.
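Putting the two pieces together, a rough sketch in the same callback style as the question (here bigquery is assumed to be the client instance that started the job, and exact property paths and query options may vary slightly between versions of the Node.js library):

job.on('complete', function(metadata) {
  // The anonymous table the query wrote its result to.
  var dest = metadata.configuration.query.destinationTable;

  // One extra COUNT query against that table; bytes billed should be 0.
  bigquery.query({
    query: 'SELECT COUNT(1) AS row_count FROM `' +
      dest.projectId + '.' + dest.datasetId + '.' + dest.tableId + '`',
    useLegacySql: false
  }, function(err, rows) {
    if (err) throw err;
    console.log('Output rows:', rows[0].row_count);
  });
});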

job.on('complete', function(metadata) {
  console.log(metadata.statistics.query.numDmlAffectedRows);
});

Related

How to know the first-run jobId of a cached query in big-query?

When we run a query in the BigQuery environment, the results are cached in a temporary table. When we run the same query again, the subsequent runs fetch the results from that cache for the next 24 hours, with some exceptions. My use case is: on a subsequent run, I want to know which jobId (i.e. the first, uncached run of the query) the cached results came from.
I have checked all the Java docs related to queries and didn't find that info. There is a cacheHit variable, which tells you whether the query results were fetched from the cache or not. I want to go one step further and know which jobId the results were fetched from. I expected that maybe this method would give me that info, but I always get a null value for it. I also want to know what parentJob means in the BigQuery context.
It's unclear why you'd even care about this other than as a technical exercise. If you want to build your own application caching layer, that's a different concern. More details about query caching can be found at https://cloud.google.com/bigquery/docs/cached-results.
The easiest way to do this would probably be to traverse jobs.list until you find a job that has the same destination table (it will have an anon prefix) and whose cacheHit stat is false or not present.
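A rough sketch of that traversal with the @google-cloud/bigquery Node.js client discussed elsewhere on this page (the question itself concerns the Java client, and the exact list options and metadata fields may differ between client versions):

const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// destinationTable is the {projectId, datasetId, tableId} of the cached (anon)
// table, taken from the cache-hit job's metadata.
async function findOriginalJob(destinationTable) {
  const [jobs] = await bigquery.getJobs({ projection: 'full', maxResults: 1000 });
  return jobs.find(function(job) {
    const config = job.metadata.configuration && job.metadata.configuration.query;
    const stats = job.metadata.statistics && job.metadata.statistics.query;
    return config && config.destinationTable &&
      config.destinationTable.tableId === destinationTable.tableId &&
      !(stats && stats.cacheHit); // the run that actually produced the table
  });
}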
Your inquiry about parentJob is unrelated to this exercise. It's for finding all the child jobs created as part of a script or multi-statement execution. More information about this can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting-concepts.

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits in Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas in BigQuery, so depending on the operator's inputs there should be a way to check whether the conditions are met, and otherwise wait until they are.
It seems like this would be a composition of sensors and operators, querying against a database such as Redis, for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the sensor/operator approach: how can I encapsulate all of it under a single operator, to avoid repetition of code? Something like a single QuotaBigQueryOperator.
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via the API. You can post there about the specific case you would like to see implemented, and follow the request to track it and ask for updates.
Meanwhile, as a workaround, you can try using the PythonOperator. With it you can define your own custom code and implement retries for the queries that get a quotaExceeded error (or whichever specific error you are getting). That way you wouldn't have to explicitly check the quota levels; you just run the queries and retry until they get executed. This is simplified code for the strategy I am thinking of:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # Quota error: jump to the next cycle of the while loop and retry.
        break  # Query ran successfully: move on to the next one.

Is there a way to concatenate the results of multiple mongodb queries together in one statement?

I have a MongoDB database that contains a large amount of data without a highly consistent schema. It is used for doing Google Analytics-style interaction tracking with our applications. I need to gather some output covering a whole month, but I'm struggling with the performance of the query, and I don't really know MongoDB very well at all.
The only way I can get results out is by restricting the timespan I query to one day at a time, using the _timestamp field, which I believe is indexed by default (I might be wrong).
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-01T00:00:00.000Z"),$lte:ISODate("2019-09-02T00:00:00.000Z")}}); // Day 1..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-03T00:00:00.000Z"),$lte:ISODate("2019-09-04T00:00:00.000Z")}}); // Day 2..
db.myCollection.find({internalId:"XYZ",username:"Demo",_timestamp:{$gte:ISODate("2019-09-05T00:00:00.000Z"),$lte:ISODate("2019-09-06T00:00:00.000Z")}}); // Day 3..
This works 'fine', but I'd rather be able to union those separate queries together, SQL-style; then again, I guess I'd still end up timing out.
Ideally each of those queries would execute separately, with the result set being appended to each time and returned at the end.
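For reference, the kind of single statement I have in mind would be something like this untested sketch, just an $or over the per-day ranges with the same fields as above (I don't know whether it would still time out):

db.myCollection.find({
  internalId: "XYZ",
  username: "Demo",
  $or: [
    { _timestamp: { $gte: ISODate("2019-09-01T00:00:00.000Z"), $lte: ISODate("2019-09-02T00:00:00.000Z") } },
    { _timestamp: { $gte: ISODate("2019-09-03T00:00:00.000Z"), $lte: ISODate("2019-09-04T00:00:00.000Z") } },
    { _timestamp: { $gte: ISODate("2019-09-05T00:00:00.000Z"), $lte: ISODate("2019-09-06T00:00:00.000Z") } }
  ]
});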
I might be better off writing a simple application to do this.
Help me Obi-Wan Kenobi, you're my only hope.

Need for a long and dynamic select query/view in sqlite

I have a need to generate a long select query with potentially thousands of WHERE conditions, like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND ...
I initially started building a class to make this more bearable, but have since stopped to wonder whether it will work well. This query is going to be hammering a table of potentially tens of millions of rows, joined with two more tables of thousands of rows each.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily transfer over the existing code base. The point here is that I want to filter the data I have down for analysis, based on parameters selected in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that calls next() until a row is encountered that satisfies all my conditions? My concern here is that the application generates graphs procedurally on scroll, so waiting to draw while calling query.next() 100,000 times might cause an annoying delay. Ideally I don't want to wait more than 30 ms at a time for the next row that satisfies everything.
Edit:
New issue: it has come to my attention that sqlite3 is limited to 999 bind values (host parameters) at compile time.
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenation (my biggest concern being that I don't know how slow parsing all of that data inside sqlite3 will be)
or
2.) Do the blanket-query method (select * from * where index > ? limit ?) and call next() until I hit valid data in my compiled code (including updating the index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that uses index > variable and a limit to "walk" the table.
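For illustration only, the walking pattern amounts to something like the sketch below; it uses the better-sqlite3 Node module and made-up table/column names rather than the actual QSqlQuery code, so treat it as pseudocode for the idea:

// Keyset-style walk: fetch a page of rows whose id is greater than the last one seen.
const Database = require('better-sqlite3');

const db = new Database('analysis.db', { readonly: true });
const page = db.prepare(
  'SELECT id, value FROM samples WHERE id > ? ORDER BY id LIMIT ?'
);

function* walkTable(pageSize) {
  let lastId = 0;
  while (true) {
    const rows = page.all(lastId, pageSize);
    if (rows.length === 0) return;      // no more rows to walk
    for (const row of rows) {
      // Filtering against the in-memory hash sets would happen here.
      yield row;
    }
    lastId = rows[rows.length - 1].id;  // advance the "index > ?" cursor
  }
}

for (const row of walkTable(1000)) {
  // Consume rows incrementally, e.g. feed the graph as data arrives.
}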
Consider dumping the joined results without filters (denormalized) into a flat file and indexing it with FastBit, a bitmap index engine.

Long-running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all of the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can share concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data column data_value.
If unique_id is numeric, then in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing and query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding with has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987%, to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - as might happen for queries with GROUP BY, ORDER BY, or HAVING clauses - then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface you have to use an unbuffered version of the function that executes the query.
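For what it's worth, the Node mysql package exposes roughly the same idea through its row-event interface, which hands you rows as the server sends them instead of buffering the whole result first. A rough sketch (connection details and the query are placeholders):

const mysql = require('mysql');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'analyst',
  password: 'secret',
  database: 'analytics'
});

connection.connect();

connection
  .query('SELECT user_id, event_type, created_at FROM events WHERE created_at > NOW() - INTERVAL 30 DAY')
  .on('error', function(err) { console.error(err); })
  .on('result', function(row) {
    // Each row arrives here as it is read from the socket, so partial results
    // are visible long before the full query finishes.
    console.log(row);
  })
  .on('end', function() { connection.end(); });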