Cannot read non-required field as required STRUCT Field in BigQuery

We are running our ETL incrementally on the ga_sessions table, partitioned by date, and for one particular day the ETL fails with this error: **Cannot read non-required field as required STRUCT Field: trafficSource.adwordsClickInfo.targetingCriteria**.
The failure happens only for that particular day; the query works for all other days. I am not sure what this error means. Is there a way to ignore those records, or to find them?
Attaching the schema information along with a sample query.
The query below works for other days.
The failure occurs when 16th November is included.
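For reference, a rough sketch of the kind of check I have in mind (standard SQL; the project, dataset and date suffix are placeholders), counting how many rows on the failing day actually have that nested struct populated:

```sql
-- Sketch only: count rows on the failing day where the nested struct is
-- populated vs. NULL (project, dataset and date suffix are placeholders).
SELECT
  COUNT(1) AS total_rows,
  COUNTIF(trafficSource.adwordsClickInfo.targetingCriteria IS NOT NULL) AS with_targeting_criteria
FROM `my-project.my_dataset.ga_sessions_YYYYMMDD`;
```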

Related

Data disappears when moving from events_intraday_ to events_

I am using BigQuery to analyze Firebase Analytics events. I use events_intraday_ for real-time analysis and events_ for daily analysis. The data is automatically transferred from events_intraday_ to events_ after a certain time, but some data disappears at that point. The table exists, but the row count is clearly reduced; about 2 days out of a week's data are lost. Please tell me why this happens.
Thanks.
Data should not be lost when moved from events_intraday_ to events_.
A common and easy-to-fix problem is in how "today" is defined: because the intraday table collects the data from "today" in realtime, you first need to agree with Google BigQuery on what "today" refers to. BigQuery can't guess which timezone you want to query, which is why the event_timestamp column, a UNIX timestamp, is always in UTC. This post explains it clearly: Firebase BigQuery server offset time
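As a hedged illustration (the dataset name, table suffix, and timezone below are assumptions, not from your setup), converting event_timestamp to a local calendar date makes the "today" boundary explicit:

```sql
-- event_timestamp is a UNIX timestamp in microseconds, UTC.
-- Convert it to a date in your reporting timezone before grouping by day.
SELECT
  DATE(TIMESTAMP_MICROS(event_timestamp), 'Europe/London') AS event_date_local,
  COUNT(*) AS events
FROM `my-project.analytics_123456789.events_intraday_20240101`
GROUP BY event_date_local
ORDER BY event_date_local;
```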
Also, I am not sure your last statement is correct: "events_intraday_" and "events_" are not quite the same thing. An "events_intraday_" table contains raw, unsampled event data for the current day, while the "events_" table contains processed and aggregated event data.
This processing happens after the data is collected but before it is exported to BigQuery, so you would expect some data to be lost. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.); if these are the areas you are looking at, it is probably a GA4 processing issue.

How do I create a backup for a table which will be used for a full-refresh?

I have an incremental model A where each day is calculated using the previous day's value. Running a full-refresh means that this table needs to be recalculated from the beginning of time, which is very inefficient and takes too long.
I have tried to create a backup table that takes a copy of the table's values each month, and to have model A refer to the backup table during a full-refresh, so that only the values after the backup need to be recalculated and I can arrive at today's value much more quickly. However, this gives me an error:
Encountered an error:
Found a cycle: model.model_A --> model.backup --> model.model_A
This is because the backup refers to the model to get the value each month, while model A also refers to the backup to build from in the case of a full-refresh.
Is there a way around this problem, avoiding rebuilding the entire model from the beginning of time every time I do a full-refresh?
Correct, you can't have 'circular loops' or cycles in your build process.
If there is an application that calculates the values for each day, you could perhaps store the new values back in the same source table(s), adding an 'updated_at' column or something similar. If I understand your use case correctly, you could then use this column whenever you need to query only the last day's information.
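A minimal sketch of that idea, assuming the application writes the daily values back with an updated_at column (all table and column names below are assumptions, using BigQuery-style SQL):

```sql
-- Read only the rows written in the last day instead of rebuilding the
-- whole history; daily_values and updated_at are hypothetical names.
SELECT *
FROM `my-project.my_dataset.daily_values`
WHERE updated_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
```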

SSAS: How to handle date dimension when date is null

I'm trying to add a new column to my SSAS cube. The column is a date field, and links to my DimDate table (a Date dimension). This date represents the project completion date.
However, not all of the projects have a project completion date, because old projects were never assigned this value. And this is expected. We don't want to put bogus dates into the field just to get SSAS to work.
When processing the cube, it crashes with:
Errors in the OLAP storage engine: The attribute key cannot be found when
processing: Table: 'dbo_FactMyTable', Column: 'MyDate_id', Value: '0'.
The attribute is 'Date Id'.
I can't disable "missing values" for the entire project because in most cases, this really is an error. How can I disable missing values for this dimension?
Or is there a better way to handle missing dates/values like this?
Small correction: based on your question, you need to change the processing error handling for your special Measure Group, not the Dimension. You can do it for all dimensions linked to a given measure group, but not for a specific dimension.
You can first process the individual measure group with _Table: 'dbo_FactMyTable'_ using the necessary missing-value settings, and then process the rest of your cube with the default settings.
The main problem here is how to process the rest of the cube. You might have a sophisticated system that creates processing XMLA scripts dynamically based on knowledge of data updates (I do it with SSIS); in that case you would not be asking this question. Suppose your environment is simpler: you update the cube and would like to process it as a whole. In such a scenario I would suggest the following workflow:
1. Process Default on all Dimensions (does the initial processing, or handles structure changes)
2. Process Update on all Dimensions
3. Process the Cube with Unprocess, invalidating it
4. Process your special measure group
5. Process the Cube with Process Default
This will first update the Dimensions and then clear the processing status flag from all measure groups in the Cube. After that you process your measure group with the special settings; this sets the processing status for that MG. Then, during Process Default on the Cube, only unprocessed MGs will be covered, which excludes your special MG from the processing scope.
The answer is a bit complicated, but this article did a great job of explaining it, including screen shots for the SSAS-challenged like me.
http://msbusinessintelligence.blogspot.com/2015/06/handling-null-dates-in-sql-server.html?m=1
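For what it's worth, one common pattern in this area (not necessarily the exact approach the article takes) is to add a dedicated 'Unknown' member to DimDate and map missing keys to it in the view feeding the measure group, so no bogus real dates are needed. A rough T-SQL sketch, with all object and column names assumed:

```sql
-- Add an "Unknown" member to the date dimension (key value -1 is an assumption).
INSERT INTO dbo.DimDate (DateId, DateLabel)
VALUES (-1, 'Unknown');

-- In the view feeding the measure group, map NULL (or 0) date keys to that
-- member so every fact row resolves to a valid dimension key.
SELECT
    f.ProjectKey,
    COALESCE(NULLIF(f.MyDate_id, 0), -1) AS MyDate_id,
    f.SomeMeasure
FROM dbo.FactMyTable AS f;
```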

Big Query Table Last Modified Timestamp does not correspond to time of last table insertion

I have a table, rising-ocean-426:metrics_bucket.metrics_2015_05_09
According to the Node.js API, retrieving the metadata for this table:
Table was created Sat, 09 May 2015 00:12:36 GMT (epoch 1431130356251)
Table was last modified Sun, 10 May 2015 02:09:43 GMT (epoch 1431223783125)
According to my records, the last batch insertion to this table was actually on:
Sun, 10 May 2015 00:09:36 GMT (epoch 1431216576000).
This is two hours earlier than the reported last modification time. Using table decorators, I can show that no records were inserted into the table after epoch 1431216576000, proving that nothing was inserted in the two hours between my last batch insertion and the last modification time reported in the metadata:
The query

SELECT count(1) as count
FROM [metrics_bucket.metrics_2015_05_09@1431216577000-1431223783125];

returns a zero count, whereas the query

SELECT count(1) as count
FROM [metrics_bucket.metrics_2015_05_09@1431216576000-1431216577000];

returns count: 222,891.
This shows that the correct last modification time was Sun, 10 May 2015 00:09:36 GMT, and not 02:09:43 GMT as the metadata asserts.
I am trying to programmatically generate a FROM clause that spans multiple tables with decorators, so I need accurate creation and last modification times for the tables in order to determine when decorators can be omitted because the time range spans the entire table. However, due to this time discrepancy, I cannot eliminate the table decorators.
The question is, am I looking at the right metadata to obtain correct creation and last modification information?
Short answer: You are indeed looking at the correct metadata.
Long answer:
The last modification time includes the time of some internal compaction of data, unrelated to data change. Executing a query against your table with a decorator ending at either 1431223783125 or 1431216576000 produces the same results, just like your experiments show, but executing at the later time includes our storage efficiency improvements that may slightly improve execution time and efficiency. We consider this a bug, and will soon update the API to return the last user-modification time instead.
In the meantime, there is no harm in including table decorators that aren't really needed, other than the added query text. Neither query cost nor performance will change.
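As a side note, the same creation and last-modification metadata can also be read from the dataset's __TABLES__ meta-table; a quick sketch in legacy SQL (bearing in mind it currently reflects the same compaction-inclusive value the API returns):

```sql
-- creation_time and last_modified_time are epoch milliseconds.
SELECT table_id, creation_time, last_modified_time
FROM [metrics_bucket.__TABLES__]
WHERE table_id = 'metrics_2015_05_09';
```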

BigQuery: Why does Table Range Decorators return wrong result sometimes?

I've been using the Table Range Decorators feature daily since May in order to only query the data from the last 7 days in some of my tables.
For the last 2 weeks, I've noticed that some data is sometimes missing when I use that feature. For example, if I run a query to get the results for the last 7 days (by adding "@-604800000--1" to the table), some data is missing, as opposed to when I query the whole table (without a table decorator).
I wonder what could explain this and if there is a fix coming soon to address this?
In case this helps the BigQuery team: I've noticed that when using table decorators, some data was missing for us on October 16th between around 16:00 and 20:00 UTC.
For the BigQuery team, here are 2 job IDs where some data is missing: job_-xtL4PlIYhNjQ5weMnssvqDmd6U , job_9ASNxqq_swjCd1eMmiQ6SmPpxlQ
and 1 job ID where the data is correct (without decorators): job_QbcRwYGbQv0BZdHreQEvRlYh-mM
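For context, this is roughly the comparison we run to spot the gap (legacy SQL; the table name and timestamp column are placeholders, and the decorator filters on insertion time rather than on a data column, so the counts are only approximately comparable):

```sql
-- Rows visible through a 7-day range decorator.
SELECT COUNT(1) AS decorated_count
FROM [mydataset.mytable@-604800000--1];

-- Rows in the full table with an explicit 7-day filter on a timestamp column.
SELECT COUNT(1) AS filtered_count
FROM [mydataset.mytable]
WHERE insert_timestamp >= DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY');
```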
This is a known issue with table decorators containing a time range. Due to a bug in BigQuery, it is possible for certain time ranges to omit data that should be included within the time range.
We're working on a fix and plan to have it released next week. After the fix is deployed, time range decorators should again work as expected.