Aerospike: how to write records in batches to avoid inconsistency

How can we write a batch function for Aerospike that will help us store 10 million records without any data loss? Moreover, if the function is called concurrently, the data should still be stored correctly: no record should be overridden and no data should be lost.

Currently there is no batch-write API in Aerospike; you have to write each record individually. The only way to guarantee that committed data is never lost is to use Strong Consistency mode, which covers all sorts of corner cases and ensures committed writes are never lost.
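A minimal sketch of that approach with the Aerospike Java client is below; the host, namespace, set, and bin names are placeholders. The create-only write policy makes the server reject a write for a key that already exists, so two loaders calling the function concurrently cannot silently override each other's records.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class RecordLoader {

    public static void main(String[] args) {
        // Placeholder cluster address, namespace ("test") and set ("records").
        try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
            WritePolicy policy = new WritePolicy();
            // Fail instead of overwriting if the record already exists, so
            // concurrent loaders cannot silently override each other's data.
            policy.recordExistsAction = RecordExistsAction.CREATE_ONLY;
            policy.sendKey = true;

            for (long i = 0; i < 1000; i++) {          // loop over your source rows
                Key key = new Key("test", "records", "id-" + i);
                Bin value = new Bin("payload", "row-" + i);
                try {
                    client.put(policy, key, value);
                } catch (AerospikeException ae) {
                    if (ae.getResultCode() == ResultCode.KEY_EXISTS_ERROR) {
                        // Another caller already wrote this record; skip it.
                        continue;
                    }
                    throw ae;                          // retry or queue in a real loader
                }
            }
        }
    }
}
```

Strong Consistency itself is enabled per namespace in the server configuration rather than in client code, so the write loop stays the same either way.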

Related

Efficient way to store a 100GB dataset

I have a 100GB dataset with rows in the format seen below.
cookie,iplong1,iplong2..,iplongN
I am currently trying to fit this data into Redis as a sorted set data structure. I would also need to set a TTL for each of those IPs. Since Redis does not support a TTL per set member, I was thinking of implementing it myself by giving each element a score, where the score is the epoch time, and then writing a separate script that parses the scores and removes expired IPs as applicable. With that said, I am also noticing that it takes almost 100GB of memory to hold this 100GB dataset. I was wondering if there is any other way of efficiently packing this data into Redis with a minimal memory footprint.
I would also be happy to hear about any other tech stack out there that can handle this better. The dataset would be updated frequently based on hourly logs, and the expectation is that we should be able to read from it concurrently at a fast rate.
Thanks in advance.
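For reference, a minimal sketch of the sorted-set idea described in the question, assuming Jedis as the client and one sorted set per cookie (the key layout and TTL are made up). Here the score is the expiry time rather than the insert time, which lets the cleanup job be a single ZREMRANGEBYSCORE call.

```java
import java.time.Instant;
import redis.clients.jedis.Jedis;

public class CookieIpStore {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String cookie = "cookie:abc123";            // assumed layout: one sorted set per cookie
            long ttlSeconds = 24 * 3600;                // placeholder TTL
            long now = Instant.now().getEpochSecond();

            // Store each IP with its expiry epoch as the score.
            jedis.zadd(cookie, now + ttlSeconds, "167772161");   // iplong1
            jedis.zadd(cookie, now + ttlSeconds, "167772162");   // iplong2

            // The "separate script": drop every member whose expiry has passed.
            jedis.zremrangeByScore(cookie, 0, now);

            // Read the IPs that are still live.
            System.out.println(jedis.zrangeByScore(cookie, now, Double.POSITIVE_INFINITY));
        }
    }
}
```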

Cassandra Batch

I have just started with Cassandra, and I have one common question:
Suppose I need to insert roughly 2000+ records. Most people say not to use a batch here, but on the other side I have also heard that "the closest feature to a stored procedure will be a batch, as it will allow you to 'bundle' different DML statements associated with an insert, update or delete."
So can anyone suggest the best way to create something once, store it, and call it several times whenever it is required, so that it executes as fast as stored procedures do in SQL?
Batches in Cassandra have very specific uses:
to apply multiple changes at once, often to multiple tables, to provide consistency in the update of the data, guaranteeing that they will all be applied or all fail. This is often called a "logged batch" - in this case, Cassandra copies the batch to multiple servers before applying the changes, and deletes it after the batch operations are applied successfully. As a result, such batches are much slower than normal operations.
to apply multiple operations inside a single partition - often called an "unlogged batch" - in this case, all operations are treated as one mutation, and as a result this is very fast compared to multiple individual operations.
So batches should be used only for multiple inserts/updates/deletes inside a single partition (otherwise you'll get worse performance than with individual statements), or when you need consistency of data between several tables. The fastest way to insert a lot of data is to issue multiple async operations (see the sketch below). Also, if you want to load data from files, then it may be better to look at tools like DSBulk that are heavily optimized for high-performance load & unload of data.
You can read more about good & bad uses of batches in the documentation and the DSE Architecture guide.
P.S. Technically speaking, Cassandra classifies batches as either multi-partition - in which case they are always logged - or single-partition, in which case they aren't logged.
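To make the two recommendations concrete, here is a rough sketch with the Java driver 4.x; the keyspace, table, and column names are made up. The first part applies several rows of the same partition as one unlogged batch; the second fires plain asynchronous inserts for bulk loading across partitions.

```java
import java.util.List;
import java.util.concurrent.CompletionStage;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class CassandraInserts {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO ks.events (user_id, ts, payload) VALUES (?, ?, ?)");

            // 1. Unlogged batch: rows of the SAME partition (same user_id)
            //    are applied as a single mutation.
            BatchStatement batch = BatchStatement.builder(DefaultBatchType.UNLOGGED)
                .addStatement(insert.bind("user-1", 1L, "a"))
                .addStatement(insert.bind("user-1", 2L, "b"))
                .build();
            session.execute(batch);

            // 2. Bulk load across partitions: individual async inserts, no batch.
            List<CompletionStage<AsyncResultSet>> futures = List.of(
                session.executeAsync(insert.bind("user-2", 1L, "c")),
                session.executeAsync(insert.bind("user-3", 1L, "d")));
            futures.forEach(f -> f.toCompletableFuture().join());
        }
    }
}
```

In a real loader you would cap the number of in-flight async requests instead of collecting unbounded futures.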

Multiple step Pandas processing with Airflow

I have a multi-stage ETL transform using pandas. Basically, I load almost 2 GB of data from MongoDB and then apply several functions to the columns. My question is whether there's any way to break those transformations into multiple Airflow tasks.
The options I have considered are:
Creating a temporary collection in MongoDB and loading/storing the transformed data frame between steps. I find this cumbersome and prone to unnecessary overhead from disk I/O.
Passing data among the tasks using XCom. I think this is a nice solution but I worry about the sheer size of the data. The docs explicitly state
Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
Using in-memory storage between steps, maybe saving the data in a Redis server or something, but I'm not really sure whether that would be any better than just using XCom.
So, do any of you have tips on how to handle this situation? Thanks!

Best practices of batch insert in Hibernate (large insertions)

I have a job that parses JSON and inserts over 20,000 records; the whole application is connected to an Oracle DB using Hibernate. It takes around an hour because it also involves JSON calls and parsing, whereas just printing the parsed fields to the logs takes only a minute or two. My question is: is there a way to optimize the insertion process using Hibernate?
I tried the suggestions from Hibernate batch size confusion, but it still feels very slow.
I tried increasing the batch size.
I tried disabling the second-level cache.
I also flushed and cleared my session depending on the batch size.
I am planning to move to JDBC batch inserts, but want to give optimizing with Hibernate a try first.
I hope this gives a general overview that helps amateur programmers with best practices.
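A minimal sketch of the flush/clear pattern mentioned above, assuming a hypothetical ParsedRecord entity and hibernate.jdbc.batch_size set to 50 in the configuration:

```java
import java.util.List;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BatchInsertJob {

    private static final int BATCH_SIZE = 50;   // should match hibernate.jdbc.batch_size

    // ParsedRecord is a hypothetical mapped entity built from the parsed JSON.
    public void insertAll(SessionFactory sessionFactory, List<ParsedRecord> records) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            for (int i = 0; i < records.size(); i++) {
                session.persist(records.get(i));
                if (i > 0 && i % BATCH_SIZE == 0) {
                    // Push the pending inserts as one JDBC batch and release
                    // the persistence context so memory use stays flat.
                    session.flush();
                    session.clear();
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}
```

One thing worth checking: Hibernate silently disables JDBC batching when the entity id uses the IDENTITY generator, so with Oracle a sequence-based generator is the safer choice.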

NHibernate: Batching and StatelessSession

I was experimenting with just setting the batch size in the config file, and I can see a visible benefit in using it: inserting 25,000 entries takes less time than without batching. My question is, what are the counter-indications, or the dangers, of using batching? As far as I can tell, there are only benefits to setting a batch size and activating it.
Another question is about StatelessSession. I was also testing this and noticed that a scope.Insert takes more time than a scope.Save on a regular Session, but when I commit it's lightning fast. Is there any reason for an Insert from a StatelessSession to take more time than a Save from a regular Session?
Thanks in advance
I can only speak to the first issue. A possible downside of a large batch size is the amount of SQL being sent across the wire in one go.