BigQuery streaming data not available instantly - google-bigquery

For the past couple of days, some of the data I am streaming to BigQuery has not been available in the BigQuery web UI immediately after a successful insert, as it normally is.
My use case consists of inserting thousands of rows using:
bigquery.tabledata().insertAll(...)
The results of the streaming inserts into the table are as follows (I am also checking insertErrors to be sure, as described here):
BigQuery insert status : {"kind":"bigquery#tableDataInsertAllResponse"}
BigQuery insert errors : null
The total number of rows shown in the BigQuery web UI is different from the total number inserted.
I would be grateful for any help.
BigQuery project details:
Project ID: favorable-beach-87616
Table: mtp_UA_xxxx_1_20150410
Project dependencies on Google libraries:
compile 'com.google.api-client:google-api-client:1.19.0'
compile 'com.google.http-client:google-http-client:1.19.0'
compile 'com.google.http-client:google-http-client-jackson2:1.19.0'
compile 'com.google.oauth-client:google-oauth-client:1.19.0'
compile 'com.google.oauth-client:google-oauth-client-servlet:1.19.0'
compile 'com.google.apis:google-api-services-bigquery:v2-rev171-1.19.0'
compile 'com.google.api-client:google-api-client:1.17.0-rc'
Many thanks in advance for your help!

When you say the total number of lines available in the web UI, do you mean the number of rows that show up in the 'details' pane on the table, or the number of rows that are returned if you do a SELECT COUNT(*) query?
If the former, that is expected, since that counter only returns the number of rows that have been flushed to long-term storage (as opposed to the short-term storage buffers the streaming data originally gets written to). This is admittedly confusing, and we are working on a fix.
If the latter, and the rows don't show up in a query, that is more concerning. If that is the case, please let us know and we'll investigate.
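To tell the two cases apart, it can help to keep a client-side tally of how many rows the streaming calls actually accepted and compare it with the result of SELECT COUNT(*). The following is only a minimal sketch: StreamingInsertTally is a hypothetical helper, not part of the BigQuery client library, and it models the "index" values reported under insertErrors as plain integers so that it runs without any Google dependency.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (an assumption for illustration, not a BigQuery API).
class StreamingInsertTally {
    private long accepted = 0;
    private long rejected = 0;

    // rowsSent: number of rows in one insertAll request.
    // failedIndexes: the row indexes reported under insertErrors for that
    // request, or null when the response carried no insertErrors at all.
    void record(int rowsSent, List<Integer> failedIndexes) {
        Set<Integer> distinctFailures = new HashSet<>();
        if (failedIndexes != null) {
            distinctFailures.addAll(failedIndexes);
        }
        rejected += distinctFailures.size();
        accepted += rowsSent - distinctFailures.size();
    }

    long accepted() {
        return accepted;
    }

    long rejected() {
        return rejected;
    }

    // True when a SELECT COUNT(*) result is lower than the number of rows
    // the streaming responses reported as accepted.
    boolean rowsMissing(long countFromQuery) {
        return countFromQuery < accepted;
    }
}
```

Call record(...) once per insertAll response, then pass the COUNT(*) result to rowsMissing(...). If it returns true, the rows are missing from queries as well, which is the more concerning case described above.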

Related

Processing trace_info.screen_info.frozen_frame_ratio in BigQuery

I'm trying to do some post-processing on trace_info.screen_info.frozen_frame_ratio, but I'm seeing something I cannot explain: every time I compare my query results with the data in the Firebase Console, the console values are different (bigger).
I am applying the same kind of filtering (event_type = "SCREEN_TRACE" and an event_name that starts with "_st_" followed by my screen name). It almost looks as if I should exclude 0 values of trace_info.screen_info.frozen_frame_ratio from my query, but even then the results are significantly different from Firebase (for the same period, of course). Why is that happening? Is it because of some automatic sampling during the data export to BigQuery? How can I fix it? I need the query to return the same figures the Firebase Console shows.

Superset with Impala - Invalid Session id/No protocol version header

I have Superset using Impala as the main data source. Most of the time, every query runs smoothly and I can build charts and dashboards with ease. I need to generate a Table Chart containing around 100k records and 30+ columns, but I am having some issues. The query is basically a SELECT * with no aggregation, filtering, or ordering.
When the data is relatively big, Superset just throws a bunch of errors (they appear to be coming from Impala), but I cannot find any information about those errors. I have tried paginating the results, but it did not work. Also, when I run the query on the Superset Chart page, it doesn't take long; it just displays the error. The only way to get some data displayed in the Table Chart is to set the "Row limit" option to 10 records, but that will not work for me.
These are the errors that keep occurring:
impala error: Invalid session id: f344bf1aa2a42e2b:ad1df0047d7f909c
impala error: No protocol version header
When I use the Oracle connection that I also have, I can generate a table chart from a large amount of records with no problem.
My setup is the following:
Impala v3.2.0-cdh6.3.3
Superset v0.36.0
So, is this a problem with Superset or with Impala? Could it have something to do with the Superset configuration?

BigQuery Count Appears to be Processing Data

I noticed that running a SELECT count(*) FROM myTable on my larger BQ tables yields long running times, upwards of 30/40 seconds despite the validator claiming the query processes 0 bytes. This doesn't seem quite right when 500 GB queries run faster. Additionally, total row counts are listed under details -> Table Info. Am I doing something wrong? Is there a way to get total row counts instantly?
When you run a count, BigQuery still needs to allocate resources (such as slot units, shards, etc.). You might be reaching some limits, which causes a delay. For example, the default number of slots per project is 2,000.
The BigQuery execution plan provides very detailed information about the process, which can help you better understand the source of the delay.
One way to overcome this is to use an approximate method described in this link
This Slide by Google might also help you
For more details see this video about how to understand the execution plan

How to use BigQuery Slots

Hi there.
Recently I wanted to run a query in the BigQuery web UI using GROUP BY over some tables (the table names follow the pattern xxx_mst_yyyymmdd). There will be over 10 million rows. Unfortunately, the query failed with this error:
Query Failed
Error: Resources exceeded during query execution.
I made some improvements to my query, so the error may not happen this time. But as my data grows, the error will appear again in the future. So I checked the latest BigQuery release notes, and there may be two ways to solve this:
1. After 2016/01/01, BigQuery will change the query pricing tiers to support "High Compute Tiers", so that the resourcesExceeded error will not happen again.
2. BigQuery Slots.
I checked some Google documentation and didn't find out how to use BigQuery Slots. Is there any sample or use case for BigQuery Slots? Or do I have to contact the BigQuery team to enable the feature?
I hope someone can help me answer this question. Thanks very much!
A couple of points:
I'm surprised that a GROUP BY with a cardinality of 10M failed with resources exceeded. Can you provide a job id of the failed query so we can investigate? You mention that you're concerned about hitting these errors more often as your data size increases; you should likely be able to increase your data size by a few more orders of magnitude without seeing this; likely you've encountered either a bug or something was strange with either your query or your data.
"High Compute Tiers" won't necessarily get rid of resourcesExceeded. For the most part, resourcesExceeded means that BigQuery ran into memory limitations; high compute tiers only address CPU usage. (and note, they haven't been enabled yet).
BigQuery slots enable you to process data faster and with more reliable performance. For the most part, they also wouldn't help prevent resourcesExceeded errors.
There is currently (as of Nov 5) a bug where you may need to provide an EACH keyword with a GROUP BY. Recent changes should enable BigQuery to automatically select the execution strategy, so EACH shouldn't be needed, but there are a couple of cases where it doesn't pick the right one. When in doubt, add an EACH to your JOIN and GROUP BY operations.
To make your project eligible for slots, you need to contact support.

How can I trigger an email or other notification based on a BigQuery query?

I would like to receive a notification, ideally via email, when some threshold is met in Google BigQuery. For example, if the query is:
SELECT name, count(id) FROM terrible_things
WHERE date(terrible_thing) < -1d
Then I would want to get an alert when there were greater than 0 results, and I would want that alert to contain the name of each object and how many there were.
BigQuery does not provide the kinds of services you'd need to build this without involving other technologies. However, you should be able to use something like App Engine (which does have a task scheduling mechanism) to periodically issue your monitoring query, check the results of the job, and alert if there are nonzero rows in the results. Alternately, you could do this locally using some scripting and the BQ command line tool.
You could also refine things by using BQ's table decorators to only scan the data that's arrived since you last ran your monitoring query, if you retain knowledge of the last probe's execution in the calling system.
In short: Something else needs to issue the queries and react based on the outcome, but BQ can certainly evaluate the data.
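As a rough illustration of the calling side, here is a minimal sketch of the two pieces that live outside BigQuery: building a legacy SQL range decorator from the time of the last probe, and turning the query results into an alert body. ThresholdProbe, tableSince, and buildAlert are hypothetical names, and the dataset name in the usage note is made up. Running the query and actually sending the email are deliberately left out, since they depend on whichever scheduler and mail service you choose.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (an assumption for illustration, not a BigQuery API).
class ThresholdProbe {

    // Builds a legacy SQL table reference with a range decorator,
    // [dataset.table@<start>-], where start is in milliseconds since the
    // epoch and the open end means "up to now".
    static String tableSince(String dataset, String table, long lastProbeMillis) {
        return "[" + dataset + "." + table + "@" + lastProbeMillis + "-]";
    }

    // Returns an alert body listing each name and its count, or an empty
    // Optional when the monitoring query returned no rows.
    static Optional<String> buildAlert(Map<String, Long> countsByName) {
        if (countsByName.isEmpty()) {
            return Optional.empty();
        }
        StringBuilder body = new StringBuilder("Threshold met:\n");
        for (Map.Entry<String, Long> entry : countsByName.entrySet()) {
            body.append(entry.getKey())
                .append(": ")
                .append(entry.getValue())
                .append('\n');
        }
        return Optional.of(body.toString());
    }
}
```

The scheduled task would substitute something like tableSince("mydataset", "terrible_things", lastProbeMillis) into the FROM clause of the monitoring query, collect the returned name/count pairs into a map, and send a notification only when buildAlert(...) returns a value.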