Google Cloud BigQuery is not refreshing? - google-bigquery

Excuse me if this is not a very precise question, but I need to check whether I am missing something or whether it really is some kind of problem with Google Cloud (GC) BigQuery.
I've got a Java program that reads from a website and publishes the data to a GC Pub/Sub topic; a pipeline is up, pulling the messages from Pub/Sub and sending them to BigQuery via the template job offered in GC Dataflow. In the end, a Data Studio dashboard reads the data from the BigQuery table and builds its charts.
The thing is, the whole process is working fine: I can see the resulting dashboard being populated correctly, BUT I cannot see the data in the BigQuery table, even after refreshing the whole page. Sometimes the results only show up on the following day (!).
Am I forgetting something, or is this GC BigQuery being incomplete in a beta release?

As @Pentium10 said, the GUI preview is just for quick checks and does take some time to update itself. If you want to check whether the data is in the table, run a query against it.
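For example, a quick row-count query shows what has actually been committed to the table, regardless of what the preview pane displays; a minimal sketch using the Python client library (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Count the rows actually committed to the table; unlike the GUI preview,
# a query always reflects the current table contents.
query = """
    SELECT COUNT(*) AS row_count
    FROM `my_project.my_dataset.my_table`
"""
for row in client.query(query).result():
    print("rows in table:", row.row_count)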

Related

linking GA4 project to bigquery - streaming vs daily

I recently linked my GA4 property to BigQuery to get a better look at the analytics data. The export was initially daily, so every day the data was exported from Google Analytics to BigQuery. However, I decided that streaming was necessary, so I switched from daily to streaming in the BigQuery Linking section of GA4's Admin tab. That streaming data is still not showing up after a few hours. I'm wondering if anyone has done this and run into similar problems. Do I need to recreate the entire BigQuery project?
If you look at your configuration options for GA to BigQuery, you will see a message under the streaming option.
This option will take effect after the next date boundary (tomorrow) for this property.
This property = the "Data exported continuously" option (streaming)
You will probably see your data tomorrow.
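Once the switch takes effect, the streamed events should land in daily events_intraday_YYYYMMDD tables alongside the regular events_ tables. A rough sketch for checking whether those intraday tables have appeared, using the Python client (the dataset name analytics_123456789 is a placeholder for your GA4 export dataset):

from google.cloud import bigquery

client = bigquery.Client()

# GA4's streaming export writes to events_intraday_YYYYMMDD tables;
# listing the dataset shows whether the streaming export has started.
for table in client.list_tables("analytics_123456789"):
    if table.table_id.startswith("events_intraday_"):
        print("found intraday table:", table.table_id)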

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload; I would like to ignore pretty much everything else.
Since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess the data coming from the sink so that only what matters is stored?
If the answer is no, is there a way to run a cron job each day that copies yesterday's data into a separate table and then removes it? (The tables are named using timestamps, which makes it possible to query them by day.)
As far as I know, both of the options you mention are currently not possible on the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to filter on the jsonPayload alone.
I would therefore suggest creating a feature request for this on the public issue tracker. Note that feature requests do not have an ETA for when they will be processed, or any guarantee that they will be implemented.

BigQuery python client library dropping data on insert_rows

I'm using the Python API to write to BigQuery -- I've had success previously, but I'm pretty novice with the BigQuery platform.
I recently updated a table schema to include some new nested records. After creating this new table, I'm seeing significant portions of data not making it to BigQuery.
However, some of the data is coming through. In a single write statement, my code will try to send through a handful of rows. Some of the rows make it and some do not, but no errors are being thrown from the BigQuery endpoint.
I have access to the stackdriver logs for this project and there are no errors or warnings indicating that a write would have failed. I'm not streaming the data -- using the BigQuery client library to call the API endpoint (I saw other answers that state issues with streaming data to a newly created table).
Has anyone else had issues with the BigQuery API? I haven't found any documentation mentioning a delay before data becomes available (I found the opposite -- it's supposed to be near real-time, right?), and I'm not sure what's causing the issue at this point.
Any help or reference would be greatly appreciated.
Edit: Apparently the API is the streaming API -- missed on my part.
Edit2: This issue is related. Though I've been writing to the table every 5 minutes for about 24 hours, I'm still seeing missing data. I'm curious whether writing to a BigQuery table within 10 minutes of its creation puts you in a permanent state of losing data, or whether it would be expected to catch everything after the initial 10 minutes from creation.
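For what it's worth, insert_rows in the Python client reports per-row failures through its return value rather than raising an exception, so silently rejected rows can go unnoticed unless the result is checked. A minimal sketch (the table name and rows are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_project.my_dataset.my_table")  # placeholder table

rows = [
    {"name": "example", "value": 1},  # placeholder rows matching the table schema
    {"name": "example2", "value": 2},
]

# insert_rows returns a list of per-row error mappings; an empty list means
# every row was accepted, anything else says which rows were rejected and why.
errors = client.insert_rows(table, rows)
if errors:
    for err in errors:
        print("row rejected:", err)
else:
    print("all rows accepted")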

BigQuery streaming - our data is not showing up anymore

We're using the BigQuery streaming API, and we have been for some time now. We noticed that at about 4:05 AM UTC (June 18th) BigQuery stopped reporting any new data being streamed in. We checked all our logs and everything looks good; we're even getting back 200s from the insertAll() request.
As a test, we created a table and used the online insertAll() 'Test it!' webpage. Again, everything looks good, but the data is not showing up in BigQuery. We know that data might not be visible for a while, but we've never seen it take more than 5 minutes to become available.
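In case it helps with debugging, rows accepted by insertAll sit in the table's streaming buffer before they are committed, and the table metadata exposes buffer statistics. A rough sketch for inspecting them with the Python client (the table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# streaming_buffer is None when no streamed rows are pending;
# otherwise it reports estimated rows/bytes and the oldest entry time.
table = client.get_table("my_project.my_dataset.my_table")
buf = table.streaming_buffer
if buf is not None:
    print("estimated rows in streaming buffer:", buf.estimated_rows)
    print("oldest entry time:", buf.oldest_entry_time)
else:
    print("no rows currently in the streaming buffer")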
Is there any known issue with BigQuery streaming currently?
The issue is under investigation, see:
https://code.google.com/p/google-bigquery/issues/detail?id=263
More information is available in this forum: https://groups.google.com/forum/#!forum/bigquery-downtime-notify

Using BigQuery for logs analysis

I'm trying to do log analysis with BigQuery. Specifically, I have an App Engine app and a JavaScript client that will be sending log data to BigQuery. In BigQuery, I'll store the full log text in one column but also extract important fields into other columns. I then want to be able to run ad hoc queries over those columns.
Two questions:
1) Is BigQuery particularly good or particularly bad at this use case?
2) How do I set up revolving logs? That is, I want to store only the last N logs or the last X GB of log data. I see that delete is not supported.
Just so you know, there is an excellent demo of moving App Engine log data to BigQuery via App Engine MapReduce called log2bq (http://code.google.com/p/log2bq/).
Re: "use case" - Stack Overflow is not a good place for judgements about best or worst, but BigQuery is used internally at Google to analyse really really big log data.
I don't see the advantage of storing full log text in a single column. If you decide that you must set up revolving "logs," you could ingest daily log dumps by creating separate BigQuery tables, perhaps one per day, and then delete the tables when they become old. See https://developers.google.com/bigquery/docs/reference/v2/tables/delete for more information on the Table.delete method.
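To illustrate the per-day table approach, here is a rough sketch that deletes day-named tables older than a retention window using the Python client; the dataset name and the applog_YYYYMMDD naming convention are assumptions for the example:

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()
DATASET = "my_logs_dataset"   # placeholder dataset
PREFIX = "applog_"            # assumed daily-table prefix, e.g. applog_20240101
KEEP_DAYS = 30                # retention window in days

cutoff = datetime.now(timezone.utc) - timedelta(days=KEEP_DAYS)

for table in client.list_tables(DATASET):
    name = table.table_id
    if not name.startswith(PREFIX):
        continue
    # Parse the YYYYMMDD suffix and drop tables older than the cutoff.
    try:
        day = datetime.strptime(name[len(PREFIX):], "%Y%m%d").replace(tzinfo=timezone.utc)
    except ValueError:
        continue
    if day < cutoff:
        client.delete_table(table.reference)
        print("deleted", name)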
After implementing this - we decided to open source the framework we built for it. You can see the details of the framework here: http://blog.streak.com/2012/07/export-your-google-app-engine-logs-to.html
If you want your Google App Engine (Google Cloud) project's logs to be in BigQuery, Google has added this functionality built in to the new Cloud Logging system. It is a beta feature known as "Logs Export"
https://cloud.google.com/logging/docs/install/logs_export
They summarize it as:
Export your Google Compute Engine logs and your Google App Engine logs to a Google Cloud Storage bucket, a Google BigQuery dataset, a Google Cloud Pub/Sub topic, or any combination of the three.
We use the "Stream App Engine Logs to BigQuery" feature in our Python GAE projects. This sends our app's logs directly to BigQuery as they are occurring to provide near real-time log records in a BigQuery dataset.
There is also a page describing how to use the exported logs.
https://cloud.google.com/logging/docs/export/using_exported_logs
When we want to query logs exported to BigQuery over multiple days (e.g. the last week), we use a SQL query with a FROM clause like this:
FROM
  (TABLE_DATE_RANGE(my_bq_dataset.myapplog_,
                    DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
                    CURRENT_TIMESTAMP()))