ADO.NET Performing a partial Update / Insert - sql

I have been working with ORMs for the last couple of years and, on a personal project, I am frustratingly finding myself struggling with plain ADO.NET.
I have a database with tables storing both transactional and slowly changing data. Data to update / insert is sourced via the network.
I am trying to use the disconnected Data Adapter paradigm in ADO.NET, in relatively generic DB classes to allow for many / all ADO.NET database implementations.
My problem is that, due to the potential size of the database tables, I don't want to perform an Adapter.Fill of the whole table into memory (as pretty much every reference and tutorial demonstrates), but rather use a delta DataSet to push new / modified data back to the database.
If I perform a DbDataAdapter.FillSchema on a DataSet, I get a schema and data tables I can populate; however, all data, regardless of what I put in my key fields, is treated as a new row when I update the table using Adapter.Update.
Am I using the correct ADO.NET classes to perform such a batch UPDATE / INSERT ("batch" in the sense of not having to do it in a loop, rather than whatever any given database may actually be doing under the hood)?

The issue here turned out to be the RowState of the data row.
When a DataSet is filled through the DataAdapter, changes to existing rows will set the row's RowState to Modified.
Data that is added to the DataSet is seen as a new row, and its RowState will be set to Added.
RowState is a read-only property, so it cannot be set to Modified manually.
Therefore, all data received from a client that updates existing rows should be added as follows:
// ... where dataTransferObject.IsNew == false
DataRow row = table.NewRow();
row["Id"] = dataTransferObject.Id;            // set the key field(s) on the row
table.Rows.Add(row);                          // add the row to the table (RowState becomes Added)
row.AcceptChanges();                          // reset RowState to Unchanged
row["MyData1"] = dataTransferObject.MyData1;  // setting the remaining fields now
row["MyData2"] = dataTransferObject.MyData2;  // marks the row as Modified, so
row["MyData3"] = dataTransferObject.MyData3;  // adapter.Update will issue an UPDATE
I am also ignoring the CommandBuilder and hand-crafting my INSERT, UPDATE and SELECT statements.
This data can now be persisted to the database using adapter.Update.
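For reference, a rough sketch of what the hand-crafted commands can look like, assuming a SQL Server provider and a hypothetical MyTable with the Id / MyData1-3 columns used above (column types and sizes are guesses); `table` is the same DataTable as in the snippet above, and any other DbDataAdapter implementation follows the same pattern:
using System.Data;
using System.Data.SqlClient;

using var connection = new SqlConnection("<connection string>"); // placeholder
var adapter = new SqlDataAdapter();

adapter.SelectCommand = new SqlCommand(
    "SELECT Id, MyData1, MyData2, MyData3 FROM MyTable", connection);

adapter.UpdateCommand = new SqlCommand(
    "UPDATE MyTable SET MyData1 = @MyData1, MyData2 = @MyData2, MyData3 = @MyData3 WHERE Id = @Id",
    connection);
adapter.UpdateCommand.Parameters.Add("@MyData1", SqlDbType.NVarChar, 255, "MyData1");
adapter.UpdateCommand.Parameters.Add("@MyData2", SqlDbType.NVarChar, 255, "MyData2");
adapter.UpdateCommand.Parameters.Add("@MyData3", SqlDbType.NVarChar, 255, "MyData3");
adapter.UpdateCommand.Parameters.Add("@Id", SqlDbType.Int, 0, "Id");

adapter.InsertCommand = new SqlCommand(
    "INSERT INTO MyTable (Id, MyData1, MyData2, MyData3) VALUES (@Id, @MyData1, @MyData2, @MyData3)",
    connection);
adapter.InsertCommand.Parameters.Add("@Id", SqlDbType.Int, 0, "Id");
adapter.InsertCommand.Parameters.Add("@MyData1", SqlDbType.NVarChar, 255, "MyData1");
adapter.InsertCommand.Parameters.Add("@MyData2", SqlDbType.NVarChar, 255, "MyData2");
adapter.InsertCommand.Parameters.Add("@MyData3", SqlDbType.NVarChar, 255, "MyData3");

// Rows with RowState == Added go through InsertCommand,
// rows with RowState == Modified go through UpdateCommand.
adapter.Update(table);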
I acknowledge that this is far from an elegant solution, but it is a working one. I will update this answer if I find a nicer method.

Related

Advice on changing the partition field for dynamic BigQuery tables

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via AirByte with _airbyte_emitted_at as the default setting for partition field.
As this default choice of partition field is not very useful, the need to change the partition field naturally presents itself. I am aware of the method for changing the partitioning of an existing table by means of a CREATE TABLE ... AS SELECT * statement; however, the new tables created this way (essentially copies of the originals with modified partition fields) will be mere static snapshots and will no longer update dynamically, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.
I can think of 2 approaches to overcome this.
Approach 1:
You can use scheduled queries to copy the newly inserted rows to your 2nd table. Write the query so that it always selects only the latest rows from your main table, and then use an INSERT INTO statement to append those rows to your 2nd table.
Since scheduled queries run at specific times, the only drawback is that the 2nd table will not be updated immediately whenever there is a new row in the main table; it will only get the latest data whenever the scheduled query runs.
If you do not need the 2nd table to always hold the very latest data, this approach is the easier one to implement.
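As a rough illustration (project, dataset, table and column types below are made up apart from _airbyte_emitted_at), the body of such a scheduled query could look like the statement below. It is shown here being executed through the .NET BigQuery client, but the same SQL can be pasted directly into the scheduled query definition:
using Google.Cloud.BigQuery.V2;

var client = BigQueryClient.Create("my-project"); // hypothetical project id

// Append only the rows emitted since the last run, based on the Airbyte timestamp.
string sql = @"
INSERT INTO `my-project.my_dataset.table_copy`
SELECT *
FROM `my-project.my_dataset.main_table`
WHERE _airbyte_emitted_at > (
  SELECT IFNULL(MAX(_airbyte_emitted_at), TIMESTAMP '1970-01-01')
  FROM `my-project.my_dataset.table_copy`)";

client.ExecuteQuery(sql, parameters: null);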
Approach 2:
You can trigger Cloud Run actions on BigQuery events such as insert, delete, update, etc. Whenever a new row gets inserted into your main table, a Cloud Run action can insert that new data into your 2nd table.
You can follow this article, where a detailed solution is given.
If you want the 2nd table to always have the latest data, this would be a good way to do so.

The best way to Update the database table through a pyspark job

I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
"One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in"
IMHO, comparing the entire data set just to load the new data is not performant.
Option 1:
Instead, you can create a partitioned BigQuery table, with a partition column that you use when loading the data; while loading new data you can then check whether it falls into an existing partition.
Hitting partition-level data in Hive or BigQuery is far more efficient than selecting the entire data set and comparing it in Spark.
See Creating partitioned tables or Creating and using integer range partitioned tables.
Option 2:
Another alternative: with Google BigQuery we have the MERGE statement. If your requirement is to merge the data without doing the comparison yourself, then you can go ahead with MERGE; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass, and we do not need to write individual statements to apply changes to the target table.
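For reference, a minimal sketch of what such a MERGE can look like, with made-up project, table and column names and `id` as the key. The statement itself is the important part; it is only shown here being run through the .NET BigQuery client:
using Google.Cloud.BigQuery.V2;

var client = BigQueryClient.Create("my-project"); // hypothetical project id

// New rows are inserted, existing rows are updated, atomically and in one pass.
string merge = @"
MERGE `my-project.my_dataset.target` T
USING `my-project.my_dataset.incoming` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET value = S.value, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (S.id, S.value, S.updated_at)";

client.ExecuteQuery(merge, parameters: null);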
There are many ways this problem can be solved; one of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the 1st time, write it to 2 places: the database and a file (say in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
From the 2nd time onwards, check whether incoming data is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep its data within that time range.
You can also bucket and partition this data, and you can use Delta Lake to maintain it.
One downside is that whenever the database is updated, this file may need to be updated as well, depending on whether the relevant data changed. You can maintain a marker column on the database table to signify the sync date (and index that column too), then read changed records based on this column and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
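To make the classification step concrete, here is a small sketch of the idea (shown in C#, with a hypothetical key-to-hash store kept in a local file); in a real Spark job the same logic would typically be expressed as a join against the S3/Delta Lake store rather than a loop:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical key -> content-hash store (stand-in for the S3/Delta Lake file).
var seen = File.Exists("seen_keys.tsv")
    ? File.ReadAllLines("seen_keys.tsv")
        .Select(l => l.Split('\t'))
        .ToDictionary(p => p[0], p => p[1])
    : new Dictionary<string, string>();

// Incoming batch: (key, contentHash) pairs produced upstream.
var incoming = new[] { ("row-1", "hashA"), ("row-2", "hashB") };

foreach (var (key, hash) in incoming)
{
    if (!seen.TryGetValue(key, out var oldHash))
        Console.WriteLine($"{key}: new -> INSERT");      // never seen before: insert
    else if (oldHash != hash)
        Console.WriteLine($"{key}: changed -> UPDATE");  // seen but different: update
    // unchanged rows are skipped entirely
    seen[key] = hash;
}

File.WriteAllLines("seen_keys.tsv", seen.Select(kv => $"{kv.Key}\t{kv.Value}"));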
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.

PowerQuery (or PowerPivot): Updated Rows Only?

I'm trying to explore PowerPivot, and as far as I can tell it always wants to work with local data.
If the data-download were a one-time hit, then I could work with that - but whenever I want to refresh it appears to go and refetch all the data again.
Is there any way to use PowerBI (or the underlying PowerQuery) so that it fetches only new or modified rows and adds them to its current dataset? For example, would an OData feed behave this way?
The backend DB in my case will be MSSQL or SSAS. I control the DB and could add change-tracking columns etc...if need be.
You could conceivably load the bulk of your data up to a point in time and use that as a table named Static, to which you then append another table named Dynamic that fetches only the rows (using a filter on a change-tracking column) added after that point in time.
The problem with this is that nothing from the Dynamic table ever makes its way into the Static table and you end up reloading stuff unnecessarily.
It's probably possible to implement a solution to pass rows into the Static table, but at that point, you're basically using the PowerPivot data model as a database, which probably isn't best practice.

How can I present the changes for updated data in Tableau

I am working on some data sets which get updated daily. By "update", I mean that three things happen:
1. New rows get added.
2. Some rows get deleted.
3. Some existing rows get replaced with new values.
Now I have prepared dashboards in Tableau to analyze the daily data, but I would also like to compare how things are changing from day to day (i.e. whether we are progressing or losing ground compared to the previous day).
I am aware that we can take extracts from the data set. But if I go this way, I am not sure how to use all the extracts in one worksheet and compare the info given by all of them.
Tableau is simply a mechanism that builds a SQL query in the background and then builds tables and charts from the fetched results. This means that if you delete a row from the table, it no longer exists, so how can Tableau read it? If anything, your DB architecture should be creating new records and giving each a create timestamp, rather than deleting a record and replacing it with a new one; otherwise you will only ever have the latest record in that table. This sounds like a design issue.

Synchronising SQL database through ADO.Net

The problem I'm having is how to synchronise the datasets in my VS 2008 project with any changes in the database.
As you know, we read data from the DB into the dataset, which is disconnected; now let's say 1 minute later something updates the data in the database. What I want to do is check the DB for any updates after a set time interval. I already have a column in my tables that shows when each row was last updated, so I can check this column and return into my dataset all rows updated after the time of my last retrieval.
Now the actual issue is how to implement this. I was thinking of having some sort of loop that runs every so often and fetches a new dataset containing only the rows that have been updated, but then how do I add those rows to my existing dataset so that the existing dataset replaces any matching rows with the new data and adds any rows that are in the new dataset but not in the existing one?
I did look at the Sync Framework from Microsoft and the Local Cache, but from what I can tell it only works on tables, whereas I am hooking into a stored proc, unless I am wrong?
PS. I am coding in C#.
Can anyone help?
Use DataTable.Merge.
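A minimal sketch of that approach, assuming a hypothetical MyTable keyed on Id with a LastUpdated column; `existingTable` is the already-filled DataTable and `lastRetrievalTime` the timestamp of the previous fetch (an inline SELECT is shown, but a stored proc as the SelectCommand works the same way):
using System;
using System.Data;
using System.Data.SqlClient;

// Fetch only the rows changed since the last retrieval.
var changes = new DataTable("MyTable");
using (var connection = new SqlConnection("<connection string>"))
using (var adapter = new SqlDataAdapter(
    "SELECT Id, MyData, LastUpdated FROM MyTable WHERE LastUpdated > @since", connection))
{
    adapter.SelectCommand.Parameters.AddWithValue("@since", lastRetrievalTime);
    adapter.Fill(changes);
}

// Merge matches rows on the primary key, so make sure it is set on both tables.
changes.PrimaryKey = new[] { changes.Columns["Id"] };

// Rows with a matching key overwrite the existing ones; rows that only exist
// in `changes` are added to the existing dataset's table.
existingTable.Merge(changes);
existingTable.AcceptChanges();
lastRetrievalTime = DateTime.UtcNow;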