How to control the batch size when using MyBatis ExecutorType.BATCH for batch insert operations

I am trying to use MyBatis batch execution (ExecutorType.BATCH) support. I want to batch-insert a few records into the database for performance and scalability reasons, and I want to override the default MyBatis batch size. I have not found any way to configure the batch size programmatically. Is there a way to override the default batch size? The following is the code for your reference:
public static void BatchUsingMyBatis() throws Exception
{
    Contact contact = new Contact();
    contact.setName("someone");
    contact.setPhone("somephone");
    contact.setEmail("someone@somedomain.com");

    ClassPathXmlApplicationContext appContext =
            new ClassPathXmlApplicationContext("BeanConfiguration.xml");
    SqlSessionFactoryBean factoryBean = appContext.getBean(org.mybatis.spring.SqlSessionFactoryBean.class);
    SqlSessionFactory factory = factoryBean.getObject();

    SqlSession session = factory.openSession(ExecutorType.BATCH, false);
    session.insert("ins", contact);
    session.insert("ins", contact);
    session.insert("ins", contact);
    session.insert("ins", contact);
    session.insert("ins", contact);
    session.commit();
}
Thanks.

If I understand your needs correctly, you would like to configure the session so that one commit generates multiple batch inserts. For instance, if it were possible to set a maximum of three statements per batch, the code you supplied would generate one batch insert (three rows at once) and another batch insert (two rows at once).
I couldn't find documentation for such functionality and I believe it does not exist, but to implement this behavior you would need to override the BatchExecutor doUpdate method and the Configuration newExecutor method so that the configuration is aware of your new Executor class.
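That said, a lighter-weight workaround (not from the original answer; just a sketch using standard MyBatis APIs) is to flush the pending batch yourself every N inserts with SqlSession.flushStatements(), which effectively caps the JDBC batch size without writing a custom Executor. The "ins" statement and the Contact class are taken from the question; the class name, method name and batch size here are illustrative:

import java.util.List;
import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class BatchInsertSketch
{
    // Flushes the pending batch every 'batchSize' inserts, which caps how many
    // statements are sent to the driver in a single JDBC batch.
    public static void insertInBatches(SqlSessionFactory factory, List<Contact> contacts, int batchSize)
    {
        SqlSession session = factory.openSession(ExecutorType.BATCH, false);
        try {
            int count = 0;
            for (Contact contact : contacts) {
                session.insert("ins", contact); // "ins" is the mapped insert statement from the question
                if (++count % batchSize == 0) {
                    session.flushStatements(); // send the accumulated batch to the database now
                }
            }
            session.flushStatements(); // flush any remaining statements
            session.commit();
        } finally {
            session.close();
        }
    }
}

Each flushStatements() call executes the statements queued so far as one JDBC batch, so batchSize acts as an upper bound on the number of rows sent per round trip.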

Related

Quartz scheduler for long running tasks skips jobs

This is my job. It takes about 3 to 5 minutes to complete each time:
[DisallowConcurrentExecution]
[PersistJobDataAfterExecution]
public class UploadNumberData : IJob
{
    private readonly IServiceProvider serviceProvider;

    public UploadNumberData(IServiceProvider serviceProvider)
    {
        this.serviceProvider = serviceProvider;
    }

    public async Task Execute(IJobExecutionContext context)
    {
        var jobDataMap = context.MergedJobDataMap;
        string flattenedInput = jobDataMap.GetString("FlattenedInput");
        string applicationName = jobDataMap.GetString("ApplicationName");
        var parsedFlattenedInput = JsonSerializer.Deserialize<List<NumberDataUploadViewModel>>(flattenedInput);
        var parsedApplicationName = JsonSerializer.Deserialize<string>(applicationName);

        using (var scope = serviceProvider.CreateScope())
        {
            //Run Process
        }
    }
}
This is the function that calls the job:
try
{
    var flattenedInput = JsonSerializer.Serialize(Input.NumData);
    var triggerKey = Guid.NewGuid().ToString();

    IJobDetail job = JobBuilder.Create<UploadNumberData>()
        .UsingJobData("FlattenedInput", flattenedInput)
        .UsingJobData("ApplicationName", flattenedApplicationName)
        .StoreDurably()
        .WithIdentity("BatchNumberDataJob", $"GP_BatchNumberDataJob")
        .Build();

    await scheduler.AddJob(job, true);

    ITrigger trigger = TriggerBuilder.Create()
        .ForJob(job)
        .WithIdentity(triggerKey, $"GP_BatchNumberDataJob")
        .WithSimpleSchedule(x => x.WithMisfireHandlingInstructionFireNow())
        .StartNow()
        .Build();

    await scheduler.ScheduleJob(trigger);
}
catch (Exception e)
{
    //log
}
Each job consists of 300 rows of data with the total count being about 14000 rows divided into 47 jobs.
This is the configuration:
NameValueCollection quartzProperties = new NameValueCollection
{
    {"quartz.serializer.type", "json" },
    {"quartz.jobStore.type", "Quartz.Impl.AdoJobStore.JobStoreTX, Quartz" },
    {"quartz.jobStore.dataSource", "default" },
    {"quartz.dataSource.default.provider", "MySql" },
    {"quartz.dataSource.default.connectionString", "connectionstring"},
    {"quartz.jobStore.driverDelegateType", "Quartz.Impl.AdoJobStore.MySQLDelegate, Quartz" },
    {"quartz.jobStore.misfireThreshold", "3600000" }
};
The problem now is that when I hit the function/API, only the first and last jobs get inserted into the database. Strangely, the last job also repeats itself multiple times.
I tried changing the job identity name to something different, but then I get foreign key errors as my data is being inserted into the database.
Example sequence should be:
300,300,300,...,102
However, the sequence ends up being:
300,102,102,102
EDIT:
When I set the threads to 1 and changed the Job Identity to be dynamic, it works. However, does this defeat the purpose of DisallowConcurrentExecution?
I reproduced your problem and found how you should rewrite your code to get the expected behaviour, as I understand it.
Make job identity unique
First of all, I see you use the same identity for every job you execute. The duplication happens because you use the same identity together with the 'replace' flag set to 'true' in the AddJob method call.
You are on the right track in deciding to use dynamic identity generation for each job; it could be a new GUID or an incrementing int counter for each identity. Something like this:
// 'i' variable is a job counter (0, 1, 2 ...)
.WithIdentity($"BatchNumberDataJob-{i}", $"GP_BatchNumberDataJob")
// or
.WithIdentity(Guid.NewGuid().ToString(), $"GP_BatchNumberDataJob")
// You may also want to set the 'replace' flag to 'false'
// so that an 'already exists' error is raised if a collision occurs.
// You may want to handle such cases.
await scheduler.AddJob(job, false);
After that you can remove the [DisallowConcurrentExecution] attribute from the job: it is based on the job key, so with such dynamic identities it no longer has any effect.
Concurrency
Basically, you have a few options for how to execute your jobs; it really depends on what you are trying to achieve.
Parallel execution
The fastest way to execute your code. Each job is completely separate from the others.
To do so you should prepare your database for this case (because, as you said, you get foreign key errors when you try to achieve this behaviour).
It is hard to say exactly what you should change in the database to support this behaviour, because you say nothing about your database schema.
If your jobs need to have an execution order, this method is not for you.
Ordered execution
The other way is to use ordered execution. If (for some reason) you are not able to prepare your database to handle parallel job execution, you can use this method. It is considerably slower than parallel execution, but the order in which jobs execute is deterministic.
You can achieve this behaviour in two ways:
use job chaining (see this question; a sketch follows after this list), or
set up the scheduler's max concurrency:
var quartzProperties = new NameValueCollection
{
    {"quartz.threadPool.maxConcurrency", "1" },
};
With this, jobs will be executed in exactly the order you trigger them, completely without parallelism.
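For the chaining option mentioned above, here is a minimal sketch against the Java Quartz API; it is not from the original answer, and the question uses Quartz.NET, which exposes an equivalent JobChainingJobListener. The job and group names mirror the question, while the class, method and listener names are made up for illustration:

import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.matchers.GroupMatcher;
import org.quartz.listeners.JobChainingJobListener;

public class JobChainingSketch
{
    // Chains the jobs so that "BatchNumberDataJob-(i+1)" only starts after
    // "BatchNumberDataJob-i" has finished; only the first job needs a trigger.
    public static void chainJobs(Scheduler scheduler, int jobCount) throws SchedulerException
    {
        JobChainingJobListener chain = new JobChainingJobListener("batchChain"); // arbitrary listener name
        for (int i = 0; i < jobCount - 1; i++) {
            chain.addJobChainLink(
                    JobKey.jobKey("BatchNumberDataJob-" + i, "GP_BatchNumberDataJob"),
                    JobKey.jobKey("BatchNumberDataJob-" + (i + 1), "GP_BatchNumberDataJob"));
        }
        // The listener watches the whole job group and fires each successor job
        // when its predecessor completes.
        scheduler.getListenerManager().addJobListener(
                chain, GroupMatcher.jobGroupEquals("GP_BatchNumberDataJob"));
    }
}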
Summary
It really depends on what you are trying to achieve. If your priority is speed, then you should rework your database and your job to support completely separate job execution, regardless of the order in which jobs run. If your priority is ordering, you should use a non-parallel method of job execution. It is up to you.

Java - Insert a single row at a time into Google BigQuery?

I am creating an application where, every time a user clicks on an article, I need to capture the article data and the user data to calculate the reach of every article and be able to run analytics on the reach data.
My application is on App Engine.
When I check the documentation for inserts into BigQuery, most of it points towards bulk inserts in the form of jobs or streams.
Question:
Is it even good practice to insert into BigQuery one row at a time every time a user action is initiated? If so, could you point me to some Java code to effectively do this?
There are limits on the number of load jobs and DML queries (1,000 per day), so you'll need to use streaming inserts for this kind of application. Note that streaming inserts are different from loading data from a Java stream.
TableId tableId = TableId.of(datasetName, tableName);
// Values of the row to insert
Map<String, Object> rowContent = new HashMap<>();
rowContent.put("booleanField", true);
// Bytes are passed in base64
rowContent.put("bytesField", "Cg0NDg0="); // 0xA, 0xD, 0xD, 0xE, 0xD in base64
// Records are passed as a map
Map<String, Object> recordsContent = new HashMap<>();
recordsContent.put("stringField", "Hello, World!");
rowContent.put("recordField", recordsContent);

InsertAllResponse response =
    bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId)
            .addRow("rowId", rowContent)
            // More rows can be added in the same RPC by invoking .addRow() on the builder
            .build());
if (response.hasErrors()) {
    // If any of the insertions failed, this lets you inspect the errors
    for (Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
        // inspect row error
    }
}
(From the example at https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery-stream-data-java)
Note especially that a failed insert does not always throw an exception. You must also check the response object for errors.
Is it even good practice to insert into BigQuery one row at a time every time a user action is initiated?
Yes, it's pretty typical to stream event streams to BigQuery for analytics. You could get better performance if you buffer multiple events into the same streaming insert request to BigQuery, but one row at a time is definitely supported.
A simplified version of Google's example.
Map<String, Object> row1Data = new HashMap<>();
row1Data.put("booleanField", true);
row1Data.put("stringField", "myString");

Map<String, Object> row2Data = new HashMap<>();
row2Data.put("booleanField", false);
row2Data.put("stringField", "myOtherString");

TableId tableId = TableId.of("myDatasetName", "myTableName");
InsertAllResponse response =
    bigQuery.insertAll(
        InsertAllRequest.newBuilder(tableId)
            .addRow("row1Id", row1Data)
            .addRow("row2Id", row2Data)
            .build());
if (response.hasErrors()) {
    // If any of the insertions failed, this lets you inspect the errors
    for (Map.Entry<Long, List<BigQueryError>> entry : response.getInsertErrors().entrySet()) {
        // inspect row error
    }
}
You can use the Cloud Logging API to write one row at a time.
https://cloud.google.com/logging/docs/reference/libraries
Sample code from the documentation:
public class QuickstartSample {
    /** Expects a new or existing Cloud log name as the first argument. */
    public static void main(String... args) throws Exception {
        // Instantiates a client
        Logging logging = LoggingOptions.getDefaultInstance().getService();

        // The name of the log to write to
        String logName = args[0]; // "my-log";

        // The data to write to the log
        String text = "Hello, world!";

        LogEntry entry =
            LogEntry.newBuilder(StringPayload.of(text))
                .setSeverity(Severity.ERROR)
                .setLogName(logName)
                .setResource(MonitoredResource.newBuilder("global").build())
                .build();

        // Writes the log entry asynchronously
        logging.write(Collections.singleton(entry));

        System.out.printf("Logged: %s%n", text);
    }
}
In this case you need to create a sink for the logs; the messages will then be redirected to the BigQuery table.
https://cloud.google.com/logging/docs/export/configure_export_v2

RavenDB fails with ConcurrencyException when using new transaction

This code always fails with a ConcurrencyException:
[Test]
public void EventOrderingCode_Fails_WithConcurrencyException()
{
    Guid id = Guid.NewGuid();

    using (var scope1 = new TransactionScope())
    using (var session = DataAccess.NewOpenSession)
    {
        session.Advanced.UseOptimisticConcurrency = true;
        session.Advanced.AllowNonAuthoritativeInformation = false;

        var ent1 = new CTEntity
        {
            Id = id,
            Name = "George"
        };

        using (var scope2 = new TransactionScope(TransactionScopeOption.RequiresNew))
        {
            session.Store(ent1);
            session.SaveChanges();
            scope2.Complete();
        }

        var ent2 = session.Load<CTEntity>(id);
        ent2.Name = "Gina";
        session.SaveChanges();

        scope1.Complete();
    }
}
It fails at the last session.SaveChanges, stating that it is using a non-current etag. If I use Required instead of RequiresNew for scope2 (i.e. the same transaction), it works.
Now, since I load the entity (ent2), it should be using the newest etag, unless what I get is some cached value attached to scope1 (but I have disabled caching). So I do not understand why this fails.
I really need this setup. In the production code the outer TransactionScope is created by NServiceBus, and the inner is for controlling an aspect of event ordering. It cannot be the same Transaction.
And I need the optimistic concurrency too, in case other threads use the entity at the same time.
BTW: This is using Raven 2.0.3.0
Since no one else has answered, I had better give it a go myself.
It turns out this was a human error. Due to a bad configuration of our IoC container, DataAccess.NewOpenSession gave me the same session all the time (even across other tests). In other words, Raven works as expected :)
Before I found out about this, I also experimented with using TransactionScopeOption.Suppress instead of RequiresNew. That also worked; then I just had to make sure that whatever I did in the suppressed scope could not fail, which was a valid option in my case.

RavenDB Catch 22 - Optimistic Concurrency AND Seeing Changes from Other Clients

With RavenDB, creating an IDocumentSession upon app start-up (and never closing it until the app is closed) allows me to use optimistic concurrency by doing this:
public class GenericData : DataAccessLayerBase, IGenericData
{
    public void Save<T>(T objectToSave)
    {
        Guid eTag = (Guid)Session.Advanced.GetEtagFor(objectToSave);
        Session.Store(objectToSave, eTag);
        Session.SaveChanges();
    }
}
If another user has changed that object, then the save will correctly fail.
But what I can't do, when using one session for the lifetime of the app, is see changes made to documents by other instances of the app (say, Joe, five cubicles away). When I do this, I don't see Joe's changes:
public class CustomVariableGroupData : DataAccessLayerBase, ICustomVariableGroupData
{
    public IEnumerable<CustomVariableGroup> GetAll()
    {
        return Session.Query<CustomVariableGroup>();
    }
}
Note: I've also tried this, but it didn't display Joe's changes either:
return Session.Query<CustomVariableGroup>().Customize(x => x.WaitForNonStaleResults());
Now, if I go the other way, and create an IDocumentSession within every method that accesses the database, then I have the opposite problem. Because I have a new session, I can see Joe's changes. Buuuuuuut... then I lose optimistic concurrency. When I create a new session before saving, this line produces an empty GUID, and therefore fails:
Guid eTag = (Guid)Session.Advanced.GetEtagFor(objectToSave);
What am I missing? If a Session shouldn't be created within each method, nor at the app level, then what is the correct scope? How can I get the benefits of optimistic concurrency and the ability to see others' changes when doing a Session.Query()?
You won't see the changes, because you use the same session. See my other replies for more details.
Disclaimer: I know this can't be the long-term approach, and therefore won't be an accepted answer here. However, I simply need to get something working now, and I can refactor later. I also know some folks will be disgusted with this approach, lol, but so be it. It seems to be working. I get new data with every query (new session), and I get optimistic concurrency working as well.
The bottom line is that I went back to one session per data access method. And whenever a data access method does some type of get/load/query, I store the eTags in a static dictionary:
public IEnumerable<CustomVariableGroup> GetAll()
{
    using (IDocumentSession session = Database.OpenSession())
    {
        IEnumerable<CustomVariableGroup> groups = session.Query<CustomVariableGroup>();
        CacheEtags(groups, session);
        return groups;
    }
}
Then, when I'm saving data, I grab the eTag from the cache. This causes a concurrency exception if another instance has modified the data, which is what I want.
public void Save(EntityBase objectToSave)
{
    if (objectToSave == null) { throw new ArgumentNullException("objectToSave"); }

    Guid eTag = Guid.Empty;

    if (objectToSave.Id != null)
    {
        eTag = RetrieveEtagFromCache(objectToSave);
    }

    using (IDocumentSession session = Database.OpenSession())
    {
        session.Advanced.UseOptimisticConcurrency = true;
        session.Store(objectToSave, eTag);
        session.SaveChanges();
        CacheEtag(objectToSave, session); // We have a new eTag after saving.
    }
}
I absolutely want to do this the right way in the long run, but I don't know what that way is yet.
Edit: I'm going to make this the accepted answer until I find a better way.
Bob, why don't you just open up a new Session every time you want to refresh your data?
Opening a new session for every request has its trade-offs, and your solution for optimistic concurrency (managing etags in your own singleton dictionary) shows that the API was never intended to be used that way.
You said you have a WPF application. Alright: open a new session on startup. Load and query whatever you want, but don't close the session until you want to refresh your data (e.g. a list of orders, customers, whatever). Then, when you want to refresh it (after a user clicks a button, a timer event fires, or whatever), dispose the session and open a new one. Does that work for you?

NHibernate - flush before querying?

I have a repository class that uses an NHibernate session to persist objects to the database. By default, the repository doesn't use an explicit transaction - that's up to the caller to manage. I have the following unit test to test my NHibernate plumbing:
[Test]
public void NHibernate_BaseRepositoryProvidesRequiredMethods()
{
    using (var unitOfWork = UnitOfWork.Create())
    {
        // test the add method
        TestRepo.Add(new TestObject() { Id = 1, Name = "Testerson" });
        TestRepo.Add(new TestObject() { Id = 2, Name = "Testerson2" });
        TestRepo.Add(new TestObject() { Id = 3, Name = "Testerson3" });

        // test the getall method
        var objects = TestRepo.GetAll();
        Assert.AreEqual(3, objects.Length);

        // test the remove method
        TestRepo.Remove(objects[1]);
        objects = TestRepo.GetAll();
        Assert.AreEqual(2, objects.Length);

        // test the get method
        var obj = TestRepo.Get(objects[1].Id);
        Assert.AreSame(objects[1], obj);
    }
}
The problem is that the line
Assert.AreEqual(3, objects.Length);
fails the test because the object list returned from the GetAll method is empty. If I manually flush the session right after inserting the three objects, that part of the test passes. I'm using the default FlushMode on the session, and according to the documentation, it's supposed to flush before running the query to retrieve all the objects, but it's obviously not. What am I missing?
Edit: I'm using Sqlite for the unit test scenario, if that makes any difference.
You state that
according to the documentation, it's supposed to flush before running the query to retrieve all the objects
But the documentation at https://www.hibernate.org/hib_docs/v3/api/org/hibernate/FlushMode.html states that in AUTO flush mode (emphasis mine):
The Session is sometimes flushed before query execution in order to ensure that queries never return stale state. This is the default flush mode.
So yes, you need to do a flush to save those values before expecting them to show up in your select.
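Since the quote above is from the Java Hibernate FlushMode javadoc, here is a minimal sketch in Java Hibernate of the fix being suggested, namely flushing explicitly before querying; NHibernate's ISession exposes the equivalent Flush() method. A mapped TestObject entity with an (id, name) constructor and an existing SessionFactory are assumed for illustration:

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;

public class FlushBeforeQuerySketch
{
    // Saves a few objects and flushes explicitly so the query that follows
    // sees them within the same unit of work, before any commit happens.
    @SuppressWarnings("unchecked")
    public static List<TestObject> saveAndQuery(SessionFactory sessionFactory)
    {
        Session session = sessionFactory.openSession();
        try {
            session.save(new TestObject(1, "Testerson"));
            session.save(new TestObject(2, "Testerson2"));
            session.save(new TestObject(3, "Testerson3"));

            session.flush(); // push the pending INSERTs to the database now

            return (List<TestObject>) session.createQuery("from TestObject").list();
        } finally {
            session.close();
        }
    }
}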