Data Architecture Question - Storing a Daily Snapshot of Employee Master Data

I'm trying to store a daily snapshot of employee data for an FTE analysis project - analytics on how many FTEs are in various positions on any given day.
I can call a REST API which gives me data for all active and terminated employees as of the time of the call. Is it prudent to call this API every single day and store a daily snapshot, or should I store only the records that changed from the previous version? What is the common design principle for this use case? Thanks!

I ended up using Delta tables on Databricks.
Sample code to create the table and load the data:
# Read the employee JSON extract into a DataFrame
df = spark.read.format("json").option("multiline", "true").load("/mnt/datalake/path/to/file/employee.json")
# Load the data into a managed Delta table
df.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("raw.DimEmployee")
Now you can query the data as of any given day using the query below:
SELECT count(*) FROM raw.DimEmployee TIMESTAMP AS OF '2022-09-14'
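For the FTE-by-position analysis itself, a time-travel query along these lines should work; note that position and status are assumed column names here, so adjust them to the real employee schema:
-- Sketch only: per-position headcount as of a given day (column names are assumptions)
SELECT position, COUNT(*) AS fte_count
FROM raw.DimEmployee TIMESTAMP AS OF '2022-09-14'
WHERE status = 'Active'
GROUP BY position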


What is the best approach for bulk cleaning a database table that has a large amount of duplicated data loaded every day (Snowflake DB)

Thanks in advance for reading this; I hope I explain my problem clearly.
In one of our domains, we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same: the data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), landed in AWS S3, and then bulk loaded into Snowflake. Due to limitations on the data from the source, there often isn't any filter on the data itself, so effectively the staging table is loaded with the raw CSV every single day; a file date column is added to the data. The result is that we have tables that have been loaded with the same data every single day since 2009.
However, the data does change: from day to day a column value will change, so the file date is very useful for tracking changed attributes and something I want to exploit. Furthermore, if the data were cleansed we would need only approximately 1% of it.
These tables are huge (some contain around 16 trillion rows) but they can be quite narrow.
I would like to loop through each day's worth of data and load only the new data into the staging tables, as opposed to just loading everything each day.
I have tried the following:
A query that windows over the entire set and compares the hashed value of each row (minus the file date), returning a row only if it did not appear in the previous date's data set (see the sketch after the current solution below). This works, but not for the larger tables, where the warehouse starts spilling to disk and the query takes hours.
A day-by-day loop that looks at each file date's data set, compares it to the previous day and loads only the difference. This takes too long for the initial clean of the tables, but it is what I am doing once the data has been cleaned, and it will form the load procedure going forward.
My current solution dynamically creates multiple MINUS set statements, where I take each day minus the day before, and then batches these into blocks of 10-20 based on the average daily row size. As an example:
INSERT INTO TEMP_TABLE
(SELECT * FROM TABLE_A WHERE FILE_DATE = 040123
MINUS
SELECT * FROM TABLE_A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE_A WHERE FILE_DATE = 030123
MINUS
SELECT * FROM TABLE_A WHERE FILE_DATE = 020123)
etc...
This is not pretty, but it does work; however, it's taking me around 12 hours to process 70-odd tables.
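For reference, the windowed hash comparison from the first attempt looks roughly like this (BUSINESS_KEY and COL1..COLN are placeholders for the real key and for every column except FILE_DATE):
-- Sketch only: keep a row when its hash differs from the same key's hash on the previous file date
INSERT INTO STG_TABLE_CLEAN
SELECT *
FROM STG_TABLE
QUALIFY HASH(COL1, COL2, COLN) IS DISTINCT FROM
        LAG(HASH(COL1, COL2, COLN)) OVER (PARTITION BY BUSINESS_KEY ORDER BY FILE_DATE)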
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using Snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards

How to update/insert a new record with an updated value from a staging table in Azure Data Explorer

I have a requirement where data is ingested from Azure IoT Hub. Sample incoming data:
{
  "message": {
    "deviceId": "abc-123",
    "timestamp": "2022-05-08T00:00:00+00:00",
    "kWh": 234.2
  }
}
I have the same column mapping in the Azure Data Explorer table. kWh always comes as an incremental value, not as a delta between two timestamps. Now I need another table which holds the difference between the last inserted kWh value and the current kWh.
It would be a great help if anyone has a suggestion or solution here.
I'm able to calculate the difference on the fly using prev(), but I need to update the table while inserting the data into it.
As far as I know, there is no way to perform data manipulation on the fly while ingesting Azure IoT data into Azure Data Explorer through JSON mapping. However, I found a couple of approaches you can take to get the calculations you need. Both approaches involve creating a secondary table to store the calculated data.
Approach 1
This is the closest approach I found to on-the-fly data manipulation. For this to work, you would need to create a function that calculates the difference of the kWh field for the latest entry. Once you have created the function, you can bind it to the secondary (target) table using an update policy so that it triggers for every new entry on your source table.
Refer to the following resource, Ingest JSON records, which explains with an example how to create a function and bind it to the target table.
Note that you would have to create your own custom function that calculates the difference in kWh.
Approach 2
If you do not need real-time data manipulation and your business can tolerate a 1-minute delay, you can create a query similar to the one below, which calculates the temperature difference from the source table (jsondata in my scenario) and writes it to the target table (jsondiffdata):
.set-or-append jsondiffdata <| jsondata | serialize
| extend temperature = temperature - prev(temperature,1), humidity, timesent
Refer to the following resource for more information on how to ingest from a query. You can use Microsoft Power Automate to schedule this query to run every minute.
Please be cautious if you decide to go with the second approach, as it uses the serialize operator, which might prevent query parallelism in many scenarios. Please review this resource on window functions and identify a query approach that is better optimized for your business needs.

How to append only new records to a BigQuery table?

I get reports from a 3rd-party API on a daily basis and plan to store the data in a BigQuery table. Each report includes data for the last 90 days, so each new report has new records for the new day but drops the records for day 91. My task is to keep data in BigQuery for a period > 90 days.
I tried to set up a BigQuery data transfer from Cloud Storage with the "Write preference" option set to "Mirror", and it seems that it just overwrites my old data with the new data. If I change it to "Append", it will add the data from each new report to the old data, creating duplicates.
Are there any ideas on how I can append only new records to my table using BigQuery functionality? I can't believe that it's impossible.
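To make the goal concrete, what I'm after is roughly the following, written as SQL (reports, reports_staging, record_id and report_date are placeholder names, and I'm assuming each new report is first loaded into a staging table):
-- Sketch only: append rows from the freshly loaded report that are not already in the main table
MERGE `myproject.mydataset.reports` AS target
USING `myproject.mydataset.reports_staging` AS source
ON target.record_id = source.record_id
AND target.report_date = source.report_date
WHEN NOT MATCHED THEN
  INSERT ROW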

Store Report Data in a SQL table

I have a report that is run every quarter. The report is based on current values and creates a score card. We do this for about 50 locations and then have to manually create a report comparing the previous run to the current run. I'd like to automate this by taking the report data and saving it to a table for each location and each quarter; then we can run reports that show how the data changes over time.
Data Sample:
Employees Active
Employees with ref checks
Clients Active
Clients with careplans
The reports are fairly complex, pulling data from many different tables, so creating this via a query may not work or may be just as complex. Any ideas on how to get the report data into a table without having to export each report to a CSV or Excel file and then import it manually?
If each score card has some dimensions (or metric names) and aggregate values (or metric values) then you can just add a time series table with columns for:
date
location or business unit
or instead of date and location, a scorecard ID (linking to another table with scorecard metadata)
dimension grouping
scores/values/metrics
Then, assuming you're creating the reports with a stored procedure, you can add a flag parameter to the stored procedure that also updates this table while generating a specific report on a given date. This might be less work and/or faster than importing from CSVs if you store the intermediate report data in a temporary table that you can select from when additionally writing the data into the time series table described above.
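A minimal sketch of that time series table and the insert at the end of the report procedure might look like this (SQL Server-style syntax; scorecard_history, #report_metrics and the column names are all placeholders):
-- Sketch only: one row per location, quarter and metric
CREATE TABLE scorecard_history (
    report_date  DATE,          -- quarter end (or run) date
    location     VARCHAR(100),  -- location / business unit
    metric_name  VARCHAR(100),  -- e.g. 'Employees Active', 'Clients with careplans'
    metric_value DECIMAL(18, 2)
);

-- Inside the report procedure, guarded by the flag parameter:
INSERT INTO scorecard_history (report_date, location, metric_name, metric_value)
SELECT report_date, location, metric_name, metric_value
FROM #report_metrics;  -- temp table already populated for the report output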

How to create partitions on a monthly basis in BigQuery

I am trying to upload data into a BigQuery partitioned table using Dataflow. I have successfully uploaded data on a daily basis and fetched it on a monthly basis using BigQuery, but my goal is to upload data on a monthly or yearly basis. Is there any way to do that using Dataflow?
You can have "monthly" partitions by using the date for the start of each month. For August, for example, you would store everything in the yourtable$20170801 partition. You would need to have some application-side logic to determine the appropriate $YYYYmmdd suffix for the table into which you are writing using Dataflow.
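For illustration, the suffix logic the application side would need boils down to truncating each record's date to the first of its month. A BigQuery SQL sketch of that calculation follows (event_date is a placeholder; in Dataflow you would do the same truncation in your pipeline code):
-- Sketch: every date in August 2017 maps to the partition suffix 20170801
SELECT FORMAT_DATE('%Y%m%d', DATE_TRUNC(event_date, MONTH)) AS partition_suffix
FROM UNNEST([DATE '2017-08-15', DATE '2017-08-31']) AS event_date
-- Write those rows to yourtable$20170801 (i.e. yourtable$ + partition_suffix).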