The problem I'm having is how to synchronise the DataSets in my VS 2008 project with any changes in the database.
As you know, we read data from the database into the DataSet, which is then disconnected. Now let's say that a minute later something updates the data in the database. What I want to do is check the database for updates after a set time interval. I already have a column in my tables that shows when each row was last updated, so I can check that column and pull every row changed since the time of my last retrieval into my DataSet.
The actual issue is how to implement this. I was thinking of having some sort of loop that runs every so often and fetches a new DataSet containing only the rows that have been updated. But how do I then apply those rows to my existing DataSet, so that rows that already exist are replaced with the new versions and rows that exist only in the new DataSet are added?
I did look at Microsoft's Sync Framework and its Local Cache, but from what I can tell it only works against tables, whereas I am hooking into a stored procedure. Unless I am wrong?
PS. I am coding in C#.
Can anyone help?
Use DataTable.Merge.
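A minimal sketch of that idea, assuming a stored procedure that takes a "changed since" parameter; the procedure, table and column names below are made up. The key point is that both tables need a primary key set, so Merge overwrites matching rows instead of appending duplicates:

using System;
using System.Data;
using System.Data.SqlClient;

// Pulls only the rows changed since the last retrieval (via a hypothetical
// stored procedure) and merges them into the DataTable already held in memory.
static void SyncChanges(DataTable existingTable, string connectionString, DateTime lastRetrievalUtc)
{
    existingTable.PrimaryKey = new[] { existingTable.Columns["Id"] };

    var delta = new DataTable(existingTable.TableName);
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.GetChangedRows", conn))   // hypothetical proc
    using (var adapter = new SqlDataAdapter(cmd))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@since", lastRetrievalUtc);   // hypothetical parameter
        adapter.Fill(delta);                                       // Fill opens/closes the connection itself
    }
    delta.PrimaryKey = new[] { delta.Columns["Id"] };

    // Rows with matching keys are overwritten with the new values;
    // rows that exist only in the delta are added.
    existingTable.Merge(delta);
}

You can call this from a Timer at whatever interval suits you, recording the time of each successful pull for the next call.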
I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table and compare it with the new data that comes in. The comparison would happen in the Spark layer.
I was wondering if there is a better way to compare that could improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in
IMHO, comparing the entire data set just to load the new data is not performant.
Option 1:
Instead, you can create a BigQuery partitioned table with a partition column, load the data by partition, and, while loading new data, check whether the new data falls into the same partition.
Hitting partition-level data in BigQuery is more efficient than selecting the entire data set and comparing it in Spark. The same is applicable for Hive as well.
See Creating partitioned tables
or
Creating and using integer range partitioned tables
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the data without doing the comparison yourself, you can go ahead with a MERGE statement; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write individual statements to apply the changes to the target table.
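Purely to illustrate the shape of the statement, here is a sketch that runs a MERGE from C# with the Google.Cloud.BigQuery.V2 client library; the project, dataset, table and column names are all placeholders, and the same SQL can just as well be run from the BigQuery console, the bq CLI, or your Spark job's connector layer:

using Google.Cloud.BigQuery.V2;

// Upsert freshly aggregated rows from a staging table into the target table.
// All identifiers below are made up for illustration.
var client = BigQueryClient.Create("my-project-id");

const string sql = @"
MERGE `my_dataset.target_table` AS t
USING `my_dataset.staging_table` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at)";

// INSERT and UPDATE (and DELETE, if you add the corresponding clause) are
// applied atomically in a single pass over the target table.
client.ExecuteQuery(sql, parameters: null);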
There are many ways this problem can be solved; one of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
As data comes in for the first time, write it to two places: the database and a file (say, in S3). If data is already in the database, then you need to initialize the local/S3 file with the table data.
From the second time onwards, check whether incoming data is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won't be coming, and regularly truncate the file to keep only data within that range.
You can also bucket and partition this data, and you can use Delta Lake to maintain it.
One downside is that whenever the database is updated, this file may need to be updated as well, depending on whether the relevant data changed. You can maintain a marker column on the database table to signify the sync date (and index that column), then read the changed records based on it and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
Shouldn't you have a last-update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.
I am working on some data sets which get updated daily. By updating, I mean that three things happen:
1. New rows get added.
2. Some rows get deleted.
3. Some existing rows get replaced with new values.
I have prepared dashboards in Tableau to analyze the daily data, but I would also like to compare how things are changing from day to day (i.e. are we progressing or falling behind compared to the previous day?).
I am aware that we can take extracts from the data set, but if I go this way I am not sure how to use all the extracts in one worksheet and compare the information given by each of them.
Tableau is simply a mechanism that builds an SQL query in the background and then builds tables, charts and the like from the fetched results. This means that if you delete a row from the table, it no longer exists, so how can Tableau read it? If anything, your DB architecture should be creating new records and giving each one a create timestamp; you would NOT delete a record and insert a replacement, because then you would only ever have one record in that table. This sounds like a design issue.
I have been working with ORMs for the last couple of years and, on a personal project, I am frustratingly finding myself struggling with simple ADO.NET.
I have a database with tables storing both transactional and slowly changing data. Data to update / insert is sourced via the network.
I am trying to use the disconnected Data Adapter paradigm in ADO.NET, in relatively generic DB classes to allow for many / all ADO.NET database implementations.
My problem is that, due to the potential size of the database tables, I don't want to perform an Adapter.Fill of the whole table into memory (as pretty much every reference and tutorial demonstrates); rather, I want to use a delta DataSet to push new / modified data back to the database.
If I perform a DbDataAdapter.FillSchema on a DataSet, I get a schema and data tables I can populate; however, all data, regardless of what I pass to my key fields, is treated as a new row when I update the table using Adapter.Update.
Am I using the correct ADO.NET classes to perform such a batch UPDATE / INSERT ("batch" in the sense that I don't have to do it in a loop, rather than whatever any given database may actually be doing under the hood)?
The issue here turned out to be the RowState of the data row.
When a DataSet is filled through the DataAdapter, the loaded rows start out as Unchanged, and any subsequent change to an existing row sets its RowState to Modified.
Data that is added to the DataSet is seen as a new row, and its RowState will be set to Added.
RowState is a read-only property, so it cannot be set to Modified directly.
Therefore, data received from a client that represents an update to an existing row should be added as follows:
// ... where dataTransferObject.IsNew == false
DataRow row = table.NewRow();
row["Id"] = dataTransferObject.Id;           // set the key fields on the row
table.Rows.Add(row);                         // add the row to the table
row.AcceptChanges();                         // change RowState to Unchanged
row["MyData1"] = dataTransferObject.MyData1; // setting the remaining fields
row["MyData2"] = dataTransferObject.MyData2; // after AcceptChanges marks the
row["MyData3"] = dataTransferObject.MyData3; // row as Modified
I am also ignoring the CommandBuilder and hand-crafting my INSERT, UPDATE and SELECT statements.
This data can now be persisted to the database using adapter.Update.
I acknowledge that this is far from an elegant solution, but it is a working one. I will update this answer if I find a nicer method.
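For completeness, this is roughly what the hand-crafted command side looks like (the table and column names are invented). The important details are the SourceColumn argument on each parameter, which ties it back to the DataRow, and the fact that adapter.Update issues the UpdateCommand for Modified rows and the InsertCommand for Added rows:

using System.Data;
using System.Data.SqlClient;

static void Persist(DataTable table, string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter())
    {
        // Hand-crafted UPDATE, used for rows whose RowState is Modified.
        var update = new SqlCommand(
            "UPDATE MyTable SET MyData1 = @MyData1, MyData2 = @MyData2, MyData3 = @MyData3 WHERE Id = @Id",
            conn);
        update.Parameters.Add("@MyData1", SqlDbType.NVarChar, 100, "MyData1");
        update.Parameters.Add("@MyData2", SqlDbType.NVarChar, 100, "MyData2");
        update.Parameters.Add("@MyData3", SqlDbType.NVarChar, 100, "MyData3");
        update.Parameters.Add("@Id", SqlDbType.Int, 4, "Id");
        adapter.UpdateCommand = update;

        // Hand-crafted INSERT, used for rows whose RowState is Added.
        var insert = new SqlCommand(
            "INSERT INTO MyTable (Id, MyData1, MyData2, MyData3) VALUES (@Id, @MyData1, @MyData2, @MyData3)",
            conn);
        insert.Parameters.Add("@Id", SqlDbType.Int, 4, "Id");
        insert.Parameters.Add("@MyData1", SqlDbType.NVarChar, 100, "MyData1");
        insert.Parameters.Add("@MyData2", SqlDbType.NVarChar, 100, "MyData2");
        insert.Parameters.Add("@MyData3", SqlDbType.NVarChar, 100, "MyData3");
        adapter.InsertCommand = insert;

        // Update opens the connection itself and picks the right command
        // for each row based on its RowState.
        adapter.Update(table);
    }
}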
I have half a million records in a DataSet, of which 50,000 are updated. Now I need to commit the updated records back to the SQL Server 2005 database.
What is the best and most efficient way to do this, considering that such updates could be frequent? (Concurrency is not an issue, but performance is.)
I would use a Batch Update.
Also documented here.
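A hedged sketch of what a batched update looks like with SqlDataAdapter (the table and column names are placeholders). The two settings that matter are UpdateBatchSize, and UpdatedRowSource = None on the command, which ADO.NET requires when batching:

using System.Data;
using System.Data.SqlClient;

static void CommitChanges(DataTable table, string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter("SELECT Id, Value FROM MyTable", conn))
    {
        adapter.UpdateCommand = new SqlCommand(
            "UPDATE MyTable SET Value = @Value WHERE Id = @Id", conn);
        adapter.UpdateCommand.Parameters.Add("@Value", SqlDbType.NVarChar, 200, "Value");
        adapter.UpdateCommand.Parameters.Add("@Id", SqlDbType.Int, 4, "Id");

        // Required for batching: don't try to refresh each row from the command results.
        adapter.UpdateCommand.UpdatedRowSource = UpdateRowSource.None;

        // Group 500 statements per round trip instead of sending one per row.
        adapter.UpdateBatchSize = 500;

        // Only rows whose RowState is Modified (or Added/Deleted) are sent.
        adapter.Update(table);
    }
}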
I agree with David's answer, as that's what I use. However, there is an alternative approach you could take which is worth considering (all situations are different after all) - it's something I would consider in the future if I had another similar requirement.
You could bulk insert the updated records into a new table in the DB using SqlBulkCopy, which is an extremely fast way of loading data into the database (example). Then run an UPDATE statement on your main table to pull in the updated values from this new table, which you would drop at the end.
The batched-update approach using SqlDataAdapter lets you deal easily with errors on specific rows (e.g. you can tell it to continue if a specific updated row fails, so it doesn't stop the whole process).
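A rough sketch of that alternative, assuming a staging table you can truncate (or drop) once the set-based UPDATE has run; all of the object names are made up:

using System.Data;
using System.Data.SqlClient;

static void BulkUpsert(DataTable changedRows, string connectionString)
{
    // changedRows would typically be dataSet.Tables["MyTable"].GetChanges()
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // 1. Load the ~50,000 changed rows into a staging table very quickly.
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.MyTable_Staging";
            bulk.WriteToServer(changedRows);
        }

        // 2. Pull the new values into the main table with one set-based UPDATE,
        //    then clear the staging table for the next run.
        var update = new SqlCommand(@"
            UPDATE t
            SET    Value = s.Value
            FROM   dbo.MyTable t
            JOIN   dbo.MyTable_Staging s ON s.Id = t.Id;
            TRUNCATE TABLE dbo.MyTable_Staging;", conn);
        update.ExecuteNonQuery();
    }
}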
I am using SQL Server 2000. I need to get only the updated records from a remote server and insert them into my local server on a daily basis, but the table does not have a created-date or modified-date field.
Use Transactional Replication.
Update
If you cannot perform administrative operations on the source, then you're going to have to read all the data every day. Since you cannot detect changes (and keep in mind that even with a timestamp column you still couldn't detect all changes, because there is no way to detect deletes with a timestamp), you have to read every row every time you sync. And if you read every row, the simplest solution is to just replace all the data you have with the new snapshot.
You need one of the following:
a column in the table which flags new or updated records in one fashion or another (a lastupdate_timestamp, an incremental update counter...)
a trigger on INSERT and UPDATE on the table which produces some side effect, such as adding the corresponding row id to a separate table
You can also compare, row by row, the data from the remote server against that of the production server to get the list of new or updated rows. Such a differential update can also be produced by comparing a hash value per row, computed from the values of all columns in that row.
Barring one of the above, and barring some MS SQL built-in replication setup, the only other possibility I can think of is [not pretty]:
parsing the SQL log to identify updates and additions to the table. This requires specialized software; I'm not even sure the log file format is published/documented, though I have seen these types of tools. Frankly, this approach is more for forensic-type situations.
If you can't change the remote server's database, your best option may be to come up with some sort of hash function on the values of a given row, compare the old and new tables, and pull only the ones where function(oldrow) != function(newrow).
You can also just do a direct comparison of the columns in question, and copy a record over whenever those columns are not all the same between the old and new versions.
This means that you cannot modify values in the new table, or they'll get overwritten daily from the old. If this is an issue, you'll need another table in which to cache the old table's values from the day before; then you'll be able to tell whether old, new, or both were modified in the interim.
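A small C# sketch of the per-row hash idea (the key column name is hypothetical); it flags the rows of the new snapshot that are either missing from the old snapshot or whose column values hash differently:

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hash all column values of a row into a single comparable value.
// (GetHashCode is fine for an in-memory old-vs-new comparison; use a stable
// hash such as SHA-1 over the concatenated values if you persist the hashes.)
static int RowHash(DataRow row)
{
    return string.Join("|", row.ItemArray.Select(v => Convert.ToString(v)).ToArray())
                 .GetHashCode();
}

// Returns the rows of the new snapshot that are new or updated compared to
// the old snapshot, matching rows on a key column.
static IEnumerable<DataRow> NewOrUpdatedRows(DataTable oldTable, DataTable newTable, string keyColumn)
{
    var oldHashes = oldTable.Rows.Cast<DataRow>()
                                 .ToDictionary(r => r[keyColumn], RowHash);
    foreach (DataRow row in newTable.Rows)
    {
        int oldHash;
        if (!oldHashes.TryGetValue(row[keyColumn], out oldHash) || oldHash != RowHash(row))
            yield return row;   // brand new row, or at least one column changed
    }
}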
I solved this by using the tablediff utility, which compares the data in two tables for non-convergence and is particularly useful for troubleshooting non-convergence in a replication topology.
See the link.
tablediff utility
To sum up:
You have an older remote DB server on which you can't modify anything (tables, triggers, etc.).
You can't use replication.
The data itself has no indication of the date/time it was last modified.
You don't want to pull the entire table down each time.
That leaves us with an impossible situation.
Your only option, if the first three items above are true, is to pull the entire table. Even if it did have a modified date/time column, you still wouldn't detect deletes, which leaves us back at square one.
Go talk to your boss and ask for better requirements. Maybe something that can be done this time.