Synchronization between 2 applications polling a SQL table - sql

I have 2 instances of a VB.NET application, each running on its own dedicated server. The application runs a While True loop with a 5-second sleep when idle (idle meaning the table has no ProcessQuery to be treated). On each iteration, the application queries a table in the SQL database to see whether there is anything it could process.
The problem is that both instances sometimes "take" the same ProcessQuery.
I'm using Entity Framework 6. I have looked into EntityState, but I don't think it does exactly what I'm trying to accomplish.
I was wondering what my solution would be to get perfectly parallel instances. It's not impossible that at some point I'll have 12 instances running on 12 machines.
Thanks!
Dim conn As New Info_IndusEntities()

' Pick the oldest request for this site that is still waiting to be processed
Dim DemandeWilma As WilmaDemandes = conn.WilmaDemandes.
    Where(Function(x) x.Site = "LONDON" AndAlso x.Statut = "toProcess").
    OrderBy(Function(x) x.RequestDate).
    FirstOrDefault()

If Not IsNothing(DemandeWilma) Then
    ' Mark it as being processed by this machine
    DemandeWilma.Statut = Statuts.EnTraitement.ToString()
    DemandeWilma.ServerName = Environment.MachineName
    DemandeWilma.ProcessDate = DateTime.Now
    conn.SaveChanges()
    Return DemandeWilma
End If
UPDATE (21/06/19)
I found an article that I find interesting.
I started by adding a column to my table:
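Roughly, the added column can be sketched like this (assuming SQL Server's rowversion type, whose value the engine bumps automatically on every write to the row):

ALTER TABLE dbo.WilmaDemandes
ADD RowVersion rowversion;
-- SQL Server maintains this value; it changes on every update to the row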
UPDATED (21/06/19)
I then refreshed my model and changed the Concurrency Mode property of the RowVersion column in my ORM:
When I tested the update, here's the EF6 log:
UPDATE [dbo].[WilmaDemandes] SET [Statut] = @0, [ServerName] = @1,
[DateDebut] = @2 WHERE (([ID] = @3) AND ([RowVersion] = @4)) SELECT
[RowVersion] FROM [dbo].[WilmaDemandes] WHERE @@ROWCOUNT > 0 AND [ID]
= @3
-- @0: 'EnTraitement' (Type = String, Size = 20)
-- @1: 'TRB5995' (Type = String, Size = 20)
-- @2: '2019-06-25 7:31:01 AM' (Type = DateTime2)
-- @3: '124373' (Type = Int32)
-- @4: 'System.Byte[]' (Type = Binary, Size = 8)
-- Executing at 2019-06-25 7:31:24 AM -04:00
-- Completed in 95 ms with result: SqlDataReader
Closed connection at 2019-06-25 7:31:24 AM -04:00
Exception thrown:
'System.Data.Entity.Infrastructure.DbUpdateConcurrencyException' in
EntityFramework.dll
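What the exception means, roughly: the UPDATE only matches a row whose RowVersion is still the value that was read with the entity. If the other instance has already claimed the row, its RowVersion has changed, zero rows are affected, and EF6 surfaces that as a DbUpdateConcurrencyException. A sketch in plain SQL, with made-up values:

-- @originalRowVersion stands for the rowversion value read together with the entity
UPDATE dbo.WilmaDemandes
SET Statut = 'EnTraitement', ServerName = 'TRB5995', DateDebut = SYSDATETIME()
WHERE ID = 124373
AND RowVersion = @originalRowVersion;  -- affects 0 rows if another instance got there first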
UPDATED (25/06/19)
The problem, as explained in this post, starts when you are using DB-First instead of Code-First: the concurrency setting on your property gets silently overwritten as soon as you update the model. Some people back then coded a console-app workaround that they run on pre-build. I'm not sure I'm quite ready to take that as the final solution.
Interesting tutorial on how to test optimistic concurrency and ways to resolve such an exception.

Add an "owner" column to your queue table
Each application updates one record (TOP 1) and sets the owner value to its own identifier (WHERE Owner IS NULL)
The application then goes back, reads the rows it owns, and processes them
It's a simple pattern and it works great. If any processes happen to take ownership 'simultaneously', only one will actually get the reservation.
I'm not very good at LINQ, so here's a brute-force method, multiline for clarity:

' First try reserving a row
conn.Database.ExecuteSqlCommand(
    "WITH UpdateTop1 AS
     (SELECT TOP 1 * FROM WilmaDemandes
      WHERE Owner IS NULL
      AND Site = 'LONDON'
      ORDER BY RequestDate)
     UPDATE UpdateTop1 SET Owner = 'ThisApplication'")

' See if we got one
Dim DemandeWilma As WilmaDemandes =
    conn.WilmaDemandes.
    Where(Function(x) x.Owner = "ThisApplication").
    FirstOrDefault()

' If we got a row, process it. Otherwise idle and repeat.
There's also no reason you must reserve only one row. You could reserve all the free rows and work your way through them; meanwhile, other processes will pick up any subsequently arriving rows.
Personally I would refactor your status column: make it NULL for new records that are ready to be processed, and otherwise have it hold the ID of the worker that reserved the row.
It also helps to add things like a timestamp column recording when the row was reserved, and so on.
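Putting those pieces together, the reservation could be sketched like this (assuming SQL Server; Owner and ReservedAt are illustrative column names; OUTPUT returns the claimed row in the same round trip, and READPAST lets competing workers skip rows that are already locked by someone else):

WITH NextFree AS
(SELECT TOP (1) * FROM WilmaDemandes WITH (UPDLOCK, READPAST, ROWLOCK)
 WHERE Owner IS NULL
 AND Site = 'LONDON'
 ORDER BY RequestDate)
UPDATE NextFree
SET Owner = 'ThisApplication', ReservedAt = SYSDATETIME()
OUTPUT inserted.*;  -- the claimed row, if any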

Related

Using multiple threads for DB updates results in higher write time per update

So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150m rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time to an update is much much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new
read_query = query = DB["
SELECT id, extra_fields
FROM objects
WHERE XYZ IS FALSE
"]
read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
queue.push(row)
end
Up until this point, IMO it shouldn't matter because we're just reading stuff from the DB and it has nothing to do with writing. From here, I've tried two approaches. Single-threaded and Multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute, it's just a pseudo one for demonstration purposes. The actual query is a lot longer and plays with JSON and stuff so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
photo = queue.shift
id = photo[:id]
update_query = DB["
UPDATE objects
SET XYZ = TRUE
WHERE id = #{id}
"]
result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes less than 0.01 seconds:
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []
until queue.empty?
if num_threads < MAX_THREADS
photo = queue.shift
num_threads += 1
all_threads << Thread.new {
id = photo[:id]
update_query = DB["
UPDATE objects
SET XYZ = TRUE
WHERE id = #{id}
"]
result = update_query.update
num_threads -= 1
Thread.exit
}
end
end
all_threads.each do |thread|
thread.join
end
Now, in theory it should be faster, right? But each update takes about 0.5 seconds. I'm surprised that this is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498826
Any ideas on:
Why is this happening?
How can I increase the update speed of the multi-threaded approach?
Have you configured Sequel so that it has a connection pool of 5 connections?
Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over fewer than N connections, which equates to resource starvation, a classic concurrency issue.
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
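For illustration, a single statement that flips the flag for a whole batch of ids at once could look like this (a sketch using the pseudo table and column names from the question):

-- One round trip updates many rows instead of issuing one UPDATE per id
UPDATE objects
SET XYZ = TRUE
WHERE id IN (84395179, 84395181, 84395182);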
I went through something similar on a project ("import all history from a legacy database into a new one with completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have 2 basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions:
database IO: use DB transactions and update 1000 records per transaction (you can tweak the exact number, but 1000 is usually good). A huge DB table usually means a lot of indexes too; every couple of update actions triggers REINDEX and AUTOVACUUM work within the DB, which causes a significant drop in update speed. A transaction basically allows you to push 1000 updated records without REINDEX and AUTOVACUUM and then perform both actions once, and the result is MUCH faster (something like an order of magnitude); see the sketch after this list.
database IO: change indexes; drop every index you can live without during the update process. Ideally you will have only one very streamlined index that allows unique row lookups for update purposes.
ruby CPU: unless you are using JRuby or Rubinius, or REALLY paying the price of network latency to your DB, threads will not give you much benefit; use fork/processes instead (see GIL). You did a great job choosing Sequel over AR for this.
ruby CPU: if you decide to go with threads + JRuby for this, don't forget to try to plug in JProfiler; it's amazing at tracing bottlenecks in Java, and the author of Sidekiq swears it is amazing for JRuby too. Unfortunately, AFAIK, there is no equivalent of JProfiler for C Ruby (there are profiling tools, but nowhere near as useful).
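To illustrate the first suggestion, here is a sketch (again with the question's pseudo names) of one batch wrapped in a transaction, so the commit overhead is paid once per batch rather than once per row:

BEGIN;
UPDATE objects SET XYZ = TRUE WHERE id = 84395179;
UPDATE objects SET XYZ = TRUE WHERE id = 84395181;
-- ... up to ~1000 single-row updates per batch ...
COMMIT;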
After you implement these suggestions you know you did all you could when:
all of the CPUs on the Ruby box are on 100% load
the hard disk IO of the DB is on 100% throughput
Find this sweet spot, then don't add additional Ruby update threads/processes after that (or add more hardware), and that's that.
PS check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization lib

GetGroups() is very slow

I am using the function GetGroups() to read all of the groups of a user in Active Directory.
I'm not sure if I'm doing something wrong, but it is very, very slow: each time it reaches this point, it takes several seconds. I'm also accessing the rest of Active Directory using the integrated functions of the AccountManagement namespace, and those execute instantly.
Here's the code:
For y As Integer = 0 To AccountCount - 1
Dim UserGroupArray As PrincipalSearchResult(Of Principal) = UserResult(y).GetGroups()
UserInfoGroup(y) = New String(UserGroupArray.Count - 1) {}
For i As Integer = 0 To UserGroupArray.Count - 1
UserInfoGroup(y)(i) = UserGroupArray(i).ToString()
Next
Next
Later on...:
AccountChecker_Listview.Groups.Add(New ListViewGroup(Items(y, 0), HorizontalAlignment.Left))
For i As Integer = 0 To UserInfoGroup(y).Count - 1
AccountChecker_Listview.Items.Add(UserInfoGroup(y)(i)).Group = AccountChecker_Listview.Groups(y)
Next
Items(,) contains my normal Active Directory data that I display; Items(y, 0) contains the username.
y is the index of the user account in AD. I also have some other code for the other information in this loop, but it's not the issue here.
Does anyone know how to make this go faster, or is there another solution?
I'd recommend trying to find out where the time is spent. One option is to use a profiler, either the one built into Visual Studio or a third-party profiler such as Redgate's ANTS Profiler or the YourKit .NET Profiler.
Another is to measure the time taken using the System.Diagnostics.Stopwatch class and use the results to guide your optimization efforts. For example, time the function that retrieves data from Active Directory and, separately, the code that populates the view, to narrow down where the bottleneck is.
If the bottleneck is in the Active Directory lookup, you may want to consider running the operation asynchronously so that the window is not blocked and populates as new data is retrieved. If it's in the ListView, you may want to consider, for example, inserting the data in a batch operation.

Data is not properly stored to HSQLDB when using a pooled data source from DBCP

I'm using hsqldb to create cached tables and indexed tables.
The data being stored arrives at a pretty high frequency, so I need to use a connection pool.
Also, because there is a lot of data, I do not call CHECKPOINT on every commit, but rather expect the data to be flushed after 50,000 rows are inserted.
The thing is, I can see the .data file growing, but when I connect with an HSQLDB client I don't see the tables or the data.
So I ran 2 simple tests: one inserted a single row and one inserted 60,000 rows into a new table. In both cases I couldn't see the result in any HSQLDB client.
(Note that I use shutdown=true.)
When I add a CHECKPOINT after each commit, it solves the problem.
Specifying in the connection string to use the log also solves the problem (though I don't want the log in production). Not using a pooled connection also solved it, and so did using the pooled data source but explicitly closing it before shutdown.
So I guess some connections in the connection pool are not being closed, which somehow prevents the DB from committing the changes and making them available to the client. But then, why couldn't I see the result even with 60,000 rows?
I also would expect the pool to be closed automatically...
What am I doing wrong? What is happening behind the scene?
The code to get the data source looks like this:
Class.forName("org.hsqldb.jdbcDriver");
String url = "jdbc:hsqldb:" + m_dbRoot + dbName + "/db" + ";hsqldb.log_data=false;shutdown=true;hsqldb.nio_data_file=false";
ConnectionFactory connectionFactory = new DriverManagerConnectionFactory(url, user, password);
GenericObjectPool connectionPool = new GenericObjectPool();
KeyedObjectPoolFactory stmtPool = new GenericKeyedObjectPoolFactory(null);
new PoolableConnectionFactory(connectionFactory, connectionPool, stmtPool, null, false, true);
DataSource ds = new PoolingDataSource(connectionPool);
And I'm using this Pooled data source to create table:
Connection c = m_dataSource.getConnection();
Statement st = c.createStatement();
String script = String.format("CREATE CACHED TABLE IF NOT EXISTS %s (id %s NOT NULL, entity %s NOT NULL, PRIMARY KEY (id));", m_tableName, m_idGenerator.getIdType(), TABLE_ENTITY_TYPE);
st.execute(script);
st.close();
c.close();
And insert rows:
Connection c = m_dataSource.getConnection();
c.setAutoCommit(false);
PreparedStatement stmt = c.prepareStatement(m_sqlInsert);
stmt.setObject(1, id);
stmt.setBinaryStream(2, Serializer.Helper.serialize(m_serializer, entity));
stmt.executeUpdate();
stmt.close();
c.commit();
c.close();
So the above seems to add the data, but it cannot be seen.
When I explicitly called
connectionPool.close();
Then, and only then, could I see the result.
I also tried to use JDBCDataSource and it worked as well.
So what is going on? And what is the right way to do this?
Your method of accessing the database from outside your application process is simply wrong.
Only one Java process is supposed to connect to a file: database.
In order to achieve your aim, launch an HSQLDB server within your application, using exactly the same JDBC URL. Then connect to this server from the external client.
See the Guide:
http://www.hsqldb.org/doc/2.0/guide/listeners-chapt.html#lsc_app_start
Update: The OP commented that the external client was used after the application had stopped. Because you have turned the log off with hsqldb.log_data=false, nothing is persisted permanently. You need to perform an explicit CHECKPOINT or SHUTDOWN when your application completes its work. You cannot rely on shutdown=true at all, even without connection pooling.
See the Guide:
http://www.hsqldb.org/doc/2.0/guide/deployment-chapt.html#dec_bulk_operations
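For reference, a sketch of the statements that force the data to be persisted, issued over an ordinary JDBC connection before the application exits:

-- Write all changes to the database files without closing the database
CHECKPOINT;
-- Or, when the application is completely finished with the database
SHUTDOWN;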

Linq to SQL Transaction Insert then Select really, really slow

I'm developing a piece of a system that basically migrates data from one set of tables to another. Everything works fine, but I've decided to employ transactions instead of just failing on things that are partially completed. (That is, if some exception occurs, I want to roll back instead of having partial data migrated.)
I have a service (in the 3-tier architecture way, not web) which begins a transaction on the data access layer. The data context is shared in the data access class which contains many methods. Those methods use various LINQ-to-SQL techniques to update/insert/delete. All the LINQ-to-SQL "selects" are within CompiledQueries.
The "BeginTransaction" method starts a transaction like this:
Public Sub BeginTransaction() Implements ITransactionalQueriesBase.BeginTransaction
Me.Context.Connection.Open()
Me.Context.Transaction = Context.Connection.BeginTransaction()
IsInTransaction = True
End Sub
Basically, I have written a test which starts a transaction, inserts into a table, and then attempts to retrieve the value that was just inserted, all during the transaction. I did this because I wanted to assert that the insert method actually tries to insert. Then, during the test, I roll back and check that the newly inserted value is not actually committed to the table. The test looks something like this:
<TestMethod()>
Public Sub FacilityService_Can_Rollback_A_Transaction()
faciService.BeginTransaction()
Dim devApp = UnitTestHelper.CreateDevelopmentApplication(devService.GetDevelopmentType("NEWFACI").ID, 1, 1, 1, 1)
Dim devInsertRes = devService.InsertDevelopmentApplication(devApp)
Assert.IsTrue(devInsertRes.ReturnValue > 0)
For Each dir1 In devInsertRes.Messages
Assert.Fail(dir1)
Next
Dim migrationResult = faciService.ProcessNewFacilityDevelopment(devInsertRes.ReturnValue)
Assert.IsTrue(migrationResult.ReturnValue.InsertResult)
Dim faciRetrieval1 = faciService.GetFacilityByID(migrationResult.ReturnValue.FacilityID)
Assert.IsNotNull(faciRetrieval1.ReturnValue)
faciService.Rollback()
Dim faciRetrieval2 = faciService.GetFacilityByID(migrationResult.ReturnValue.FacilityID)
Assert.IsNull(faciRetrieval2.ReturnValue)
End Sub
So, to my problem...
When the test gets to the "faciRetrieval1" step, it stays there for about 30-60 seconds before moving on. I'm not sure why this is happening. If I run the same queries in a transaction within SSMS it happens instantly. Does anyone have any ideas? The database is a SQL Server 2008 SP1 (R2?).
I figured out that if one data context is using a transaction, any other data context of the same type appears to be unable to select.
I ended up fixing it by using the same context for every select/update/delete while a transaction was in progress.
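That behavior is consistent with ordinary lock blocking: the insert made on the transaction's connection holds locks, and a SELECT issued from a second connection (a second data context) waits for them under the default READ COMMITTED isolation level. A rough way to reproduce it in SSMS, with hypothetical table and column names:

-- Session 1: insert inside an open transaction and leave it uncommitted
BEGIN TRANSACTION;
INSERT INTO Facility (FacilityID, Name) VALUES (12345, 'New facility');

-- Session 2: this SELECT blocks until session 1 commits or rolls back
SELECT * FROM Facility WHERE FacilityID = 12345;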

SELECT through OleDbCommand in VB.NET not picking up recent changes

I'm using the following code to work out the next unique order number in an Access database. serverDB is a System.Data.OleDb.OleDbConnection:
Dim command As New OleDb.OleDbCommand("", serverDB)
command.CommandText = "SELECT max (ORDERNO) FROM WORKORDR"
iOrder = command.ExecuteScalar()
NewOrderNo = (iOrder + 1)
If I subsequently create a WORKORDR (using a different DB connection), the code will not pick up the new "next order number."
e.g.
iFoo = NewOrderNo
CreateNewWorkOrderWithNumber(iFoo)
iFoo2 = NewOrderNo
will return the same value to both iFoo and iFoo2.
If I Close and then reopen serverDB, as part of the "NewOrderNo" function, then it works. iFoo and iFoo2 will be correct.
Is there any way to force a System.Data.OleDb.OleDbConnection to refresh the database in this situation without closing and reopening the connection?
e.g. is there anything equivalent to serverDB.Refresh or serverDB.FlushCache?
How I create the order.
I wondered if this could be caused by not updating my transactions after creating the order. I'm using an XSD for the order creation, and the code I use to create the record is ...
Sub CreateNewWorkOrderWithNumber(ByVal iNewOrder As Integer)
Dim OrderDS As New CNC
Dim OrderAdapter As New CNCTableAdapters.WORKORDRTableAdapter
Dim NewWorkOrder As CNC.WORKORDRRow = OrderDS.WORKORDR.NewWORKORDRRow
NewWorkOrder.ORDERNO = iNewOrder
NewWorkOrder.name = "lots of fields filled in here."
OrderDS.WORKORDR.AddWORKORDRRow(NewWorkOrder)
OrderAdapter.Update(NewWorkOrder)
OrderDS.AcceptChanges()
End Sub
From MSDN:
Microsoft Jet has a read-cache that is updated every PageTimeout milliseconds (default is 5000ms = 5 seconds). It also has a lazy-write mechanism that operates on a separate thread to main processing and thus writes changes to disk asynchronously. These two mechanisms help boost performance, but in certain situations that require high concurrency, they may create problems.
If you possibly can, just use one connection.
Back in VB6 you could force the connection to refresh itself using ADO. I don't know whether it's possible with VB.NET. My Google-fu seems to be weak today.
You can change the PageTimeout value in the registry, but that will affect all programs on the computer that use the Jet engine (i.e. programmatic use of Access databases).
I always throw away a connection object after I have used it. Due to connection pooling, getting a new connection is cheap.