I'm using the Python API to write to BigQuery -- I've had success previously, but I'm pretty novice with the BigQuery platform.
I recently updated a table schema to include some new nested records. After creating this new table, I'm seeing significant portions of data not making it to BigQuery.
However, some of the data is coming through. In a single write statement, my code will try to send through a handful of rows. Some of the rows make it and some do not, but no errors are being thrown from the BigQuery endpoint.
I have access to the stackdriver logs for this project and there are no errors or warnings indicating that a write would have failed. I'm not streaming the data -- using the BigQuery client library to call the API endpoint (I saw other answers that state issues with streaming data to a newly created table).
Has anyone else had issues with the BigQuery API? I haven't found any documentation stating about a delay to access the data (I found the opposite -- supposed to be near real-time, right?) and I'm not sure what's causing the issue at this point.
Any help or reference would be greatly appreciated.
Edit: Apparently the API is the streaming API -- missed on my part.
Edit2: This issue is related. Though, I've been writing to the table every 5 minutes for about 24 hours, and I'm still seeing missing data. I'm curious if writing to a BigQuery table within 10 minutes of it's creation puts you in a permanent state of losing data or if it would be expected to catching everything after the initial 10 minutes from creation.
Related
I am currently exporting logs from Stackdriver to BigQuery using sinks. But i am only interessted in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happens automatically, i could not do this.
Is there a way to preprocess data coming from sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday data into a seperate table and then remove it? (knowing that the tables are named using timestamps which makes it possible to query them by day)
As far as I know both options mentioned are currently not possible in the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request in regards to your ask on the following public issue tracker link. Note that feature requests do not have an ETA as to when they'll processed or if they'll be implemented.
I noticed that BigQuery no longer cache the same query even I have chosen to use cache in the GUI (both Alpha and Classic). I didn't edit the query at all, just keep clicking run query button and every time GUI executed the query without using cache results.
It happens to my PHP script as well. Before, it was enable to use cache and came back with results very quick and now it executes the query every time even the same query has been executed minutes ago. I can confirm the behaviour in the logs.
I am wondering if there is anything changed in the last few weeks? Or some kind of account level settings control this? Because it was working fine for me.
As per official docs here cache is disable when:
...any of the tables referenced by the query have recently received
streaming inserts...
Even if you are streaming to one partition, and then querying to another, this will invalidate caching functionality for the whole table. There is this feature request opened where it is requested to be able to hit cache when doing streaming inserts to one partition but querying a different partition.
EDIT***:
After some investigation I've found out that some months ago there was an issue going on which was allowing to hit the cache even streaming inserts were being made. This was not expected behavior, and therefore it got solved in May. I guess this is the change you have experienced and what you are talking about.
Docs have not changed related to this, and they aren't/weren't incorrect. Just the previous behavior was the incorrect one.
Excuse me for maybe a not very precise question, but I just need to check if I am missing something or it really is some kind of problem with Google Cloud (GC) BigQuery.
I've got this Java program that reads from a website and publish the data into a GC Pub/Sub Topic; a pipeline is conveniently up, pulling the message from Pub/Sub and sending it to BigQuery via the template job offered in GC Dataflow. In the end, a DataStudio dashboard is getting the data from the BigQuery table and building up its charts and all...
The thing is, all the process is working fine: I can see the resulting dashboard being populated correctly, BUT I cannot see the data in the table in BigQuery, even after refreshing the whole page. Sometimes the results show on the following day (!).
Is it me forgetting something, or is it GC BigQuery in a beta release being incomplete?
As #Pentium10 said, the GUI is just for quick previews. It does take some time to update itself. If you want to check if the data is in the table do a query.
Project number :149410454586
We are facing the same issue mentioned below
BigQuery streaming insert data availability delay
We are getting the success response from BigQuery , but the data is not visible
while querying it.It is strange that it seemed to happen over a night, till then
it was working fine.
We were supposed to go to deployment for UAT tomorrow and we never expected this
heppen and now helpless, kindly if you could help out on this.
Face the same issue one hour ago. Any streamed in data are not available for queries and it is not a warm-up issue.
Project Number: 272433143339
It seems to be related to https://status.cloud.google.com/incident/bigquery/18010
We're using the BigQuery streaming API, and we have been for some time now. We noticed that about 4:05am UTC (June 18th) BigQuery was no longer reporting any new data being streamed in. We checked all our logs, and everything looks good, and we're even getting back 200's from the insertAll() request.
As a test, we created a table, and used the online insertAll() 'Test it!' webpage. Again, everything looks good, but the data is not showing up in BigQuery. We know that data might not be visible for a while, but we've never seen it take more than 5 minutes max to be available.
Is there any known issue with BigQuery streaming currently?
The issue is under investigation, see:
https://code.google.com/p/google-bigquery/issues/detail?id=263
More information in this forum https://groups.google.com/forum/#!forum/bigquery-downtime-notify