BigQuery: result of my daily query is NULL since March 15th - google-bigquery

This question/bug is mainly for the Google BigQuery team.
I have a daily report in Tableau that connects to Google BigQuery over a live connection. This report had been running for over a year without problems. Since March 15th, however, the report is no longer working, and the BigQuery queries generated by Tableau now return 'null'.
Note: the versions of Tableau and of the BigQuery driver have not changed for over a month, so nothing has changed on our side. I have also checked the Query History, and the generated queries have been identical over the last few weeks.
One simple query that is generated by Tableau and that now returns 'null' looks like this:
SELECT (CASE WHEN 1000000 = 0 THEN NULL ELSE FLOAT([log_time]) / 1000000 END)
AS [none_Calculation_0500516094317075_ok]
FROM [GDT.MissingItems] [sqlproxy]
GROUP BY 1
This query comes from a simple calculated field created in Tableau that divides [log_time] by 1000000 and casts it to an INT. The job_id is job_ydTIq1c_ydnyua4s4SW3zJj00fs.
It looks to me like something changed recently that is causing the query to return 'null' instead of the expected result. This is a big problem for us, as we use this report for operational purposes.
I posted my question/problem on Stack Overflow as directed by the Google BigQuery support page:
https://developers.google.com/bigquery/support

This was a bug caused by the incorrect application of an optimization in the query execution engine. It has been fixed, and we expect to release the fix today (though it is possible the fix won't go live until Monday, because we try to avoid making production changes right before the weekend).
The workaround in the meantime is to use 0.0 rather than NULL in the CASE statement.
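Applied to the generated query above, the workaround looks like this (only the NULL branch changes):
SELECT (CASE WHEN 1000000 = 0 THEN 0.0 ELSE FLOAT([log_time]) / 1000000 END)
AS [none_Calculation_0500516094317075_ok]
FROM [GDT.MissingItems] [sqlproxy]
GROUP BY 1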

Related

Processing trace_info.screen_info.frozen_frame_ratio in BigQuery

I'm trying to do some post-processing on trace_info.screen_info.frozen_frame_ratio, but I'm seeing something I cannot explain: the figures shown in the Firebase Console are consistently different (bigger) than my query results.
I apply the same kind of filtering (event_type = "SCREEN_TRACE", and event_name starting with "_st_" followed by my screen name), as sketched below. It almost looks as if I should exclude rows where trace_info.screen_info.frozen_frame_ratio is 0, but even then the results differ significantly from Firebase (same period assumed, of course). Why is that happening? Is it because of some mysterious auto-sampling during the export to BigQuery? How do I make the query return the same figures the Firebase Console shows?
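For reference, a minimal sketch of the filtering described above, in standard SQL (the project, dataset, and table names are placeholders, and "_st_MyScreen" stands for my actual screen name):
SELECT
  event_name,
  trace_info.screen_info.frozen_frame_ratio AS frozen_frame_ratio
FROM `my_project.my_performance_dataset.my_table`
WHERE event_type = "SCREEN_TRACE"
  AND STARTS_WITH(event_name, "_st_MyScreen")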

Failed to load FileDescriptorProto for '_CLOUD_QUERY_METADATA_SCHEMA_'

My Firebase project is integrated with BigQuery, so all raw Google Analytics events are exported daily and streamed into a dedicated dataset.
As of today, even simple queries on those events fail with an error:
Error running query
Failed to load FileDescriptorProto for
'_CLOUD_QUERY_METADATA_SCHEMA_': ;Field number 23 has already been
used in "Msg_0_CLOUD_QUERY_TABLE" by field "items".
An example query which is failing:
SELECT * FROM `project.analytics_184030700.events_*` WHERE event_name IN ("share")
As I mentioned, those (and more advanced) queries ran fine until yesterday. I did not change the schema or any other configuration in the meantime. I've also noticed that BigQuery was updated yesterday.
Looking at the error description, it seems my table schema does indeed contain a field called items (the very last one, number 23), but it was added automatically by Google Analytics.
My suspicions:
Something went wrong with the recent BigQuery release
Something went wrong during daily sync Google Analytics -> BigQuery
Some old job or cache is getting in the way of new queries
At this point I have no idea what to try next. Does anyone have any insight into what could be causing this error?
EDIT:
I noticed that this problem was also just reported in the Google Issue Tracker here: https://issuetracker.google.com/issues/192325507.
I have the same issue and haven't solved it yet, but as you said, I suspect the cause is Firebase. There is an extra-field problem that is limited to three days (June 26th, 27th, and 28th).
I checked all data older than June 26th, and there was no privacy_info field; as you can see, the field is also absent again for June 29th. I think Firebase introduced this new field and then changed their mind for some reason, but it causes a big problem for us.
Update:
I changed this part:
SELECT * FROM `project.analytics_184030700.events_*`
to this:
SELECT * FROM `project.analytics_184030700.events_2*`
Interestingly, this worked for me.
You can work around the issue; it seems there is a problem with the field
privacy_info
If you query across multiple table partitions, just make sure you select only the fields you need, and omit the field privacy_info.
Not using "SELECT *" resolved this error for me; see the example below.

Incremental load of a full API call

I have an API from which I need to pull signup data into my database and aggregate it daily. Every time I call the API, I get a full copy of the data. Sometimes old accounts get deleted, so the historical data can change.
The API returns the raw account data, including an accountCreateOn date and an accountActivateOn date for each account. I want to aggregate it to see the daily account creations and activations.
Now, what I could do is a daily import of the full data and then aggregate like this:
SELECT
  CURRENT_DATE() AS snapshot_date,
  SUM(CASE WHEN accountCreateOn = CURRENT_DATE() THEN 1 ELSE 0 END) AS accountCreateOn,
  SUM(CASE WHEN accountActivateOn = CURRENT_DATE() THEN 1 ELSE 0 END) AS accountActivateOn
FROM full_data
But this doesn't seem very failure-resistant. What happens if the pipeline fails for a couple of days? What would be the right way to solve such a problem?
The easiest and most fault-tolerant way is to store the data you receive completely and in as much detail as it arrives. You can't get better information later, and discarding information (which includes aggregating it) always carries the danger that you will one day want to answer a question that could have been answered from the complete dataset but can't be answered from the reduced one.
The only reason to leave this path would be datasets so huge that storing and processing them isn't feasible. With a modern DBMS on modern hardware, it's rather unlikely that you will run into that problem. So I would create synthetic test data of the maximum size I expect for the business, say ten times the account activations per year that I dream of. If the database can handle that, you have one less problem to worry about.
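A sketch of what that could look like here (the table name raw_daily_snapshots and the snapshot_date column are assumptions; the other columns come from the question): load each day's full API copy into a raw table stamped with the snapshot date, and derive the daily aggregates from that table. Because the aggregation is keyed on snapshot_date rather than CURRENT_DATE(), a day missed by the pipeline can simply be re-imported and re-aggregated later:
SELECT
  snapshot_date,
  SUM(CASE WHEN accountCreateOn = snapshot_date THEN 1 ELSE 0 END) AS accountCreateOn,
  SUM(CASE WHEN accountActivateOn = snapshot_date THEN 1 ELSE 0 END) AS accountActivateOn
FROM raw_daily_snapshots
GROUP BY snapshot_date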

How to improve performance of a query with a new year of data

I have a SQL database with 2 million rows of data. The query I have been using has worked fine, but after I added data for January 2019, something changed: the query runs quickly for any other month, but extracting just January 2019 takes about an hour. Can anyone explain why it would perform worse for a new year, and maybe provide a way to fix it? That would be incredibly helpful. I filter with:
Where Month([Posting Date]) = 1 and Year([Posting Date]) = 2019
You may be a victim of parameter sniffing: SQL Server thinks it is using an appropriate index when it isn't. Try ending the query with OPTION (RECOMPILE), as sketched below.
To really see what is going on, you would have to show us the query plan.
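As a sketch (the table name dbo.Postings is a placeholder for whatever you are actually querying):
SELECT *
FROM dbo.Postings
WHERE Month([Posting Date]) = 1 AND Year([Posting Date]) = 2019
OPTION (RECOMPILE)
Independent of parameter sniffing, note that wrapping [Posting Date] in Month() and Year() prevents an index seek on that column; a sargable range predicate such as [Posting Date] >= '20190101' AND [Posting Date] < '20190201' would let the optimizer use an index on [Posting Date].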

Changes in query behaviour

I have some queries that have run every day for several months with no problem, and I haven't changed anything in them for a long while.
In the past few days some of them have started to fail with an error message about missing fields: "Field 'myfield' not found." These queries usually involve sub-queries and window functions.
Example for the BQ guys:
On 2015-08-03 the job with ID job_EUWyK5DIFSxJxGAEC4En2Q_hNO8 ran successfully.
On the following days the same query failed (job IDs: job_A9KYJLbQJQvHjh1g7Fc0Abd2qsc, job__15ff66aYseR-YjYnPqWmSJ30N8).
In addition, for some other queries the running times have grown from minutes to hours, and they sometimes return "timeout".
My questions:
Has something changed in the BigQuery engine?
What should I do to make my queries run again?
Thanks
So the problem could be two-fold:
An update to the query execution engine was rolled out during the week of August 3, 2015, as documented in the announcement.
If this is the case, you need to update your queries accordingly.
Some performance issues have also been detected lately, but to actually identify whether something is wrong with your project, you need to file an issue request. I did the same in the past, and it was fixed.