Why google analytics dataset delete my old data in bigquery? - google-bigquery

I have configured a GA4 bigquery linking, but the dataset only retains 60 days of data and I can't find a way to prevent it from deleting the old days.
In the configuration of the dataset I have the expiration in never as well as the partitions see screens below.
I would appreciate it if someone help me to solve the problem.
See Image

Once you set expiration date to never, only tables created after that will be affected.
You need to change the flag on older tables by yourself. See here

Related

BigQuery seems to be deleting my data piped from Segment every two months

I use Data Studio for visualizations and graphs of my stripe data.
I use Segment to pipe data from Stripe to BigQuery. It loaded historical data of over a year and continually allowed this data to stay in the warehouse. Then, without warning I noticed one day that the data before May of this year was gone. I contacted Segment and they said it wasn't them.
Of course that was not a satisfactory answer. BigQuery data is not going to just delete its self. I decided it must be a fluke and just reconnected the pipe.
Sure enough I checked today and May and June are gone and all that remains is end of June thru today.
I want to figure out what is causing this. The data is deleting its self. Ive tried manual sql queries and they pull up nothing either. Segment claims it isnt them. But it isn't me. I'm not sure what to do.
Please help. I hate dealing with data loss and I'm not excited about having to import the data by CSV.. especially if it will just delete its self too.
BigQuery does allow you to set table expiration. I think that can happen at the table-level and the dataset-level. Can you check your configurations to see if that is enabled? cloud.google.com/bigquery/docs/best-practices-storage

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks. But i am only interessted in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happens automatically, i could not do this.
Is there a way to preprocess data coming from sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday data into a seperate table and then remove it? (knowing that the tables are named using timestamps which makes it possible to query them by day)
As far as I know both options mentioned are currently not possible in the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request in regards to your ask on the following public issue tracker link. Note that feature requests do not have an ETA as to when they'll processed or if they'll be implemented.

Export Billing Data to BigQuery not Working

To track BQ usage we created a new dataset and configured it in Billing export. But after waiting for a day also the dataset seems to be empty as no new tables is created.
Is there any other setup needs to be done for this to work.
Refer this link,
https://cloud.google.com/billing/docs/how-to/export-data-bigquery
Thanks and regards,
Gour
You just need to follow the How to enable billing export to BigQuery steps to begin using this functionality. Keep in mind that you have to wait certain time to start seeing your data, as mentioned in the Export Billing Data to BigQuery documentation.
After you enable BigQuery export, it might take a few hours to start seeing your data. Billing data automatically exports your data to BigQuery in regular intervals, but the frequency of updates in BigQuery varies depending on the services you're using.
In case you continue having this issue, I recommend you to take a look the Issue Tracker tool that you can use to raise a BigQuery ticket in order to verify this scenario with the Google Technical Support Team. Since this is an automated process, you might need some of their help to review your Project's internal configuration.

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables advance for 1 year or on some project for 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, by the developers. We do a deploy and we run a script that takes care of the schema changes for old/existing tables, and the script deletes all those empty tables from the future and simply recreates them. We didn't complicated our life with a cron, as we know the exact moment the schema changes, that's the deploy and there is no disadvantage to create tables in advance for such a long period. We do this based on tenants too on SaaS based system when the user is created or they close their accounts.
This way we don't need a cron, we just to know that the deploy needs to do this additional step when the schema changed.
As regarding don't lose streaming inserts while I do some maintenance on your tables, you need to address in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd to queue all the rows into a tube and later a worker pushes to BigQuery. You may have this to cover the issue when BigQuery API responds with error and you need to retry. It's easy to do this with a simple message queue. So you would relly on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming insert eg: have been temporary renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not just helps retrying for BigQuery failed calls, but also allows you do have some maintenance window.
you've already solved it by partitioning. if table creation is an issue have an hourly cron in appengine that verifies today and tomorrow tables are always created.
very likely the appengine wont go over the free quotas and it has 99.95% SLO for uptime. the cron will never go down.

What is this vague accusation of RRD data loss about?

I want to use CollectD to gather some statistics (about storage) and have Graphite display them nicely. Apparently this can be done either by
having CollectD store the data as RRD files and pointing Graphite at
those, or
using a CollectD plugin to push the data to Graphite's Carbon API, which will store the data in a Whisper database (which is similar to RRD but not compatible).
I think I want to go with RRDs, but I found this statement in the Whisper docs that concerns me:
In many cases (depending on configuration) if an update is made to an
RRD series but is not followed up by another update soon, the original
update will be lost.
Hmmm. That's a bit scary, but the accusation is so vague that I don't know what to make of it. What is the configuration they are talking about, and the situation in which it causes data loss?
My situation is that the metrics data I am gathering will be available in chunks -- periodically I will go get the latest data and make as many entries into the database as there are new samples available. So, for example, I might grab some data and update the database with the values from 3 minutes ago, 2 minutes ago, and 1 minute ago, one right after the other. In fact, I might have dozens of new samples to put in the database at once. Does using RRD this way have anything to do with the Whisper accusation?
NOTE: I do not need to back-fill data; I will always be adding newer data than what has already been stored.
One scenario I see this happening would be if you have an AVERAGE RRA setup, and have the xxf value set to a low percentage. When the data is compressed over time, you could receive an unknown value and 'loose' all the data that was averaged. If you are using a RRD for what it was designed for, and have it setup with the proper type and settings, I wouldn't think you will run into a problem.
I would recommend taking an in depth look at the RRD documentation found HERE to answer questions about how RRD's and RRA's handle the data, and the different storage techniques that are available to you.