I get daily reports from a third-party API and want to store the data in a BigQuery table. Each report covers the last 90 days, so each new report adds records for the newest day but drops the records for day 91. My task is to keep data in BigQuery for longer than 90 days.
I tried to set up a BigQuery Data Transfer from Cloud Storage with the "Write preference" option set to "Mirror", and it seems that it just overwrites my old data with the new data. If I change it to "Append", it adds the data from the new report to the old data, producing duplicates.
Are there any ideas on how I can append only the new records to my table using BigQuery functionality? I can't believe that it's impossible.
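One way to picture the behaviour being asked for, as a minimal sketch rather than a BigQuery feature: keep every stored record, and let each new report add or replace records by key. The `record_id` key and the row layout below are hypothetical, since the question does not say what uniquely identifies a record. Inside BigQuery, the same idea could be expressed by loading each report into a staging table and running a `MERGE` statement against the main table.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def merge_report(stored: Dict[str, Row], report: Iterable[Row],
                 key: str = "record_id") -> Dict[str, Row]:
    """Return the stored rows plus the report's rows, with one row per key."""
    merged = dict(stored)  # rows that fell out of the 90-day window are kept
    for row in report:
        # Unknown key: the row is appended. Known key: the row is replaced,
        # so no duplicate is created. "record_id" is a hypothetical key.
        merged[str(row[key])] = row
    return merged


stored = {
    "a": {"record_id": "a", "day": "2023-01-01", "value": 1},  # now older than 90 days
    "b": {"record_id": "b", "day": "2023-03-31", "value": 2},
}
report = [
    {"record_id": "b", "day": "2023-03-31", "value": 20},  # already stored
    {"record_id": "c", "day": "2023-04-01", "value": 3},   # new day
]
merged = merge_report(stored, report)
```

After the call, `merged` still holds the old record `a`, holds `b` once, and has gained `c`.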
Related
I am using BigQuery to analyze Firebase Analytics events. I use events_intraday_ for real-time analysis and events_ for daily analysis. The data is automatically transferred from events_intraday_ to events_ after a certain time, but some data disappears at that point. The table exists, but the data is clearly reduced: about 2 days out of a week's data is lost. Please tell me why this happens.
Thanks.
Data should not be lost when moved from events_intraday_ to events_.
A common problem that is easy to fix lies in how “today” is defined. The intraday table collects the data from “today” in real time, so you first need to agree with BigQuery on what “today” refers to. BigQuery can’t guess which timezone you want to query in, which is why the event_timestamp column is a UNIX timestamp that is always in UTC. This post explains it clearly: Firebase BigQuery server offset time
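To make the timezone point concrete, here is a minimal sketch of how the same event lands on different calendar days depending on the offset applied. It assumes event_timestamp is a count of microseconds since the UNIX epoch in UTC; the function name and the fixed `utc_offset_hours` parameter are hypothetical, and a fixed offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def event_date(event_timestamp_micros: int, utc_offset_hours: float = 0.0) -> str:
    """Return the YYYYMMDD day an event falls on at the given UTC offset."""
    # Assumption: the timestamp is microseconds since the epoch, in UTC.
    utc_time = EPOCH + timedelta(microseconds=event_timestamp_micros)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone).strftime("%Y%m%d")


# 2023-01-01 23:30:00 UTC, expressed in microseconds.
late_evening_utc = 1_672_615_800_000_000

day_in_utc = event_date(late_evening_utc)        # "20230101"
day_at_plus_9 = event_date(late_evening_utc, 9)  # "20230102"
```

An event logged late in the evening UTC belongs to the next day at UTC+9, so a query that filters on the wrong notion of “today” can appear to be missing data.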
Also, I am not sure your last statement is correct. "events_intraday_" and "events_" are not quite the same thing: an "events_intraday_" table contains raw, unsampled event data for the current day, while the "events_" table contains processed and aggregated event data.
This processing happens after the data is collected but before it is exported to BigQuery, which means you would expect some data to be lost. Generally, the affected fields are traffic sources and linked marketing products (AdWords, Campaign Manager, etc.); if these are the areas you are looking at, it's probably a GA4 processing issue.
Thanks in advance for reading this; I hope I have explained my problem clearly.
In one of our domains, we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same. The data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), then moved into AWS S3, and then bulk loaded into Snowflake. Due to limitations on the data from the source, there often isn't any filter on the data itself, so the staging table is effectively loaded with the raw CSV every single day, with a file date column added to the data. The result is that we have tables that have been loaded with the same data every single day since 2009.
However, the data does change: from day to day a column value will change, so the file date is very useful for tracking changed attributes and is something that I want to exploit. Further, if the data were cleansed, we would need approximately 1% of it.
These tables are huge; some contain around 16 trillion rows, but they can be quite narrow.
I would like to loop through each day's worth of data efficiently and load only new data into the staging tables, as opposed to loading everything each day.
I have tried the following:
A query that windows over the entire set, compares the hashed value of each row (minus the file date), and returns a row only if it did not appear in the previous date's data set. This works, but not for the larger tables, as the warehouse starts to write to disk and then it takes hours.
A day-by-day loop that looks at each file date's data set, compares it to the previous day, and loads only the difference. This takes too long on the initial clean of the tables, but it is what I am doing once the data has been cleaned, and it will form the initial load procedure.
The current solution is one where I dynamically create multiple MINUS set statements, each taking one day minus the day before, and then batch these into blocks of 10-20 based on the average daily row size. As an example:
INSERT INTO TEMP_TABLE
(SELECT * FROM TABLE_A WHERE FILE_DATE = 040123
MINUS
SELECT * FROM TABLE_A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE_A WHERE FILE_DATE = 030123
MINUS
SELECT * FROM TABLE_A WHERE FILE_DATE = 020123)
etc...
This is not pretty, though it does work; however, it's taking me around 12 hours to process 70-odd tables.
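For readers who want to see the batching step spelled out, here is a minimal sketch of how such statements could be generated. It is not the poster's actual code: the function name, the parameters, and the fixed `batch_size` are hypothetical (the approach described above sizes batches from the average daily row count), and the date literals are embedded exactly as passed in.

```python
from typing import List


def build_minus_batches(table: str, file_dates: List[str], target: str,
                        batch_size: int = 10, select_list: str = "*") -> List[str]:
    """Build INSERT statements, each covering up to batch_size day-over-day diffs.

    file_dates must be sorted oldest first.
    """
    # Pair every day with the day before it: (day, previous_day).
    pairs = list(zip(file_dates[1:], file_dates[:-1]))
    statements = []
    for start in range(0, len(pairs), batch_size):
        blocks = [
            # Caveat: if select_list includes FILE_DATE, the two sides of a
            # MINUS can never match, so the compared columns must leave it out.
            f"(SELECT {select_list} FROM {table} WHERE FILE_DATE = {day}\n"
            f"MINUS\n"
            f"SELECT {select_list} FROM {table} WHERE FILE_DATE = {previous})"
            for day, previous in pairs[start:start + batch_size]
        ]
        statements.append(f"INSERT INTO {target}\n" + "\nUNION ALL\n".join(blocks))
    return statements


statements = build_minus_batches(
    "TABLE_A", ["020123", "030123", "040123"], target="TEMP_TABLE"
)
```

With three file dates and the default batch size, this yields a single statement containing two MINUS blocks joined by one UNION ALL.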
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using Snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards
I'm trying to store a daily snapshot of employee data for an FTE analysis project: analytics on how many FTEs are in various positions on any given day.
I can call a REST API, which gives me data for all the active and terminated employees as of the API call time. Is it prudent to call this API every single day and store a daily snapshot, or to store only the records that changed from the previous version? What is the common design principle for this use case? Thanks!
I ended up using Delta tables on Databricks.
Sample code to create the table and load data
# Read the day's API export into a DataFrame (spark is the Databricks session)
df = spark.read.format("json").option("multiline", "true").load("/mnt/datalake/path/to/file/employee.json")
# Overwrite the managed Delta table; each overwrite is kept as a new table version
df.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("raw.DimEmployee")
Now you can query the data as of any given day using the query below:
select count(*) from raw.DimEmployee timestamp as of '2022-09-14'
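For comparison, the alternative raised in the question (storing only the records that changed since the previous snapshot) can be sketched in plain Python. This is an illustration under stated assumptions, not part of the Delta approach above: the `employee_id` keys, the attribute names, and both function names are hypothetical, and it relies on the API continuing to return terminated employees, so no record ever needs to be deleted.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, dict]  # hypothetical layout: employee_id -> attributes


def changed_records(previous: Snapshot, current: Snapshot) -> Snapshot:
    """Keep only employees that are new or whose attributes differ."""
    return {emp_id: attrs for emp_id, attrs in current.items()
            if previous.get(emp_id) != attrs}


def snapshot_as_of(changes: List[Tuple[str, Snapshot]], as_of: str) -> Snapshot:
    """Rebuild the full snapshot for a day by replaying the stored changes."""
    state: Snapshot = {}
    for day, delta in sorted(changes, key=lambda change: change[0]):
        if day > as_of:  # ISO dates compare correctly as strings
            break
        state.update(delta)
    return state


day1 = {"e1": {"position": "Analyst", "status": "active"},
        "e2": {"position": "Engineer", "status": "active"}}
day2 = {"e1": {"position": "Analyst", "status": "active"},
        "e2": {"position": "Manager", "status": "active"},
        "e3": {"position": "Engineer", "status": "active"}}

changes = [("2022-09-13", changed_records({}, day1)),
           ("2022-09-14", changed_records(day1, day2))]
```

Only `e2` and `e3` are stored for the second day, yet `snapshot_as_of(changes, "2022-09-14")` reproduces the full second-day snapshot. The trade-off is less storage in exchange for a replay step at query time.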
Can you please help me with the timestamp of a summary index?
We are having a disk space issue and are clearing out the old logs, but we want to keep some field data. If we schedule an SI, will it add the data from the last month all at once? If so, why do we need to schedule it? I have gone through the Splunk documentation but am unable to understand the steps and logic.
The idea of a summary index is to store the results of a search until they are needed for a later search. The classic example is the end-of-month report. Rather than run a huge search over thirty days to crunch the thousands of events of each day into a final report, a daily search crunches the events of that day into an SI. The monthly report then runs on day 30, reads the 30 summary events from the SI, and finishes quickly. The same SI can then be used for end-of-week reports and to populate a dashboard with the daily sales (or whatever) figures.
The key is to make the summary smaller than the original data. One cannot dump 1 month of data into an SI and hope to save space - it won't happen.
A summary index can help save disk space by retaining a smaller set of summary data long after the original events have been discarded.
Summaries do not have to be scheduled, but that is the most common way to produce them. It means no one has to remember to run the daily sales reports every day to be able to get the monthly sales report. That said, one can write events to a summary index in an ad-hoc search using the collect command.
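The daily-then-monthly pattern described above can be illustrated outside Splunk with a small sketch. This is plain Python, not SPL, and the event layout (a day plus a sales amount) and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def daily_summary(events: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse raw (day, amount) events into one total per day."""
    totals: Dict[str, float] = defaultdict(float)
    for day, amount in events:
        totals[day] += amount
    return dict(totals)


def monthly_report(summary: Dict[str, float], month: str) -> float:
    """Total for a month (YYYY-MM), read from the summary rows only."""
    return sum(total for day, total in summary.items() if day.startswith(month))


raw_events = [("2023-01-01", 10.0), ("2023-01-01", 20.0), ("2023-01-02", 5.0)]
summary = daily_summary(raw_events)  # two rows instead of three events
january_total = monthly_report(summary, "2023-01")
```

Once the summary rows exist, the raw events can be discarded and the monthly figure is still available. The space saving comes from the summary holding one row per day rather than one per event.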
I am trying to upload data into a BigQuery partitioned table using Dataflow. I have successfully uploaded data on a daily basis and fetched it on a monthly basis using BigQuery, but my goal is to upload data on a monthly or yearly basis. Is there any way to do that using Dataflow?
You can have "monthly" partitions by using the date for the start of each month. For August, for example, you would store everything in the yourtable$20170801 partition. You would need to have some application-side logic to determine the appropriate $YYYYmmdd suffix for the table into which you are writing using Dataflow.
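The application-side logic for picking the suffix can be very small. Here is a minimal sketch; the function name is hypothetical, and it only builds the table name with its partition decorator, leaving the actual Dataflow write aside.

```python
from datetime import date


def monthly_partition(table: str, day: date) -> str:
    """Return the table name with the partition for the first day of day's month."""
    first_of_month = day.replace(day=1)
    return f"{table}${first_of_month:%Y%m%d}"


partition = monthly_partition("yourtable", date(2017, 8, 17))  # "yourtable$20170801"
```

By the same reasoning, a yearly bucket could use `day.replace(month=1, day=1)`, though that variant is an extrapolation and is not covered by the answer above.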