Migrating records to another set on Aerospike - migration

Current Setup:
I have a single namespace running on the Aerospike cluster. The namespace has a few sets in it.
My Use-case:
I want to copy all the records from one set (has ~100 records) to another new set (keeping the schema same) under the same namespace.
My Finding:
I did a deep dive and found a few solutions using aql:
List all the records from the first set and insert them one by one into the new set.
Pros: Simple to implement.
Cons: Time-consuming and prone to manual error.
Using asbackup/asrestore command.
Pros: It is immune to manual error.
Cons: It doesn't allow changing the set name during restoration, which I can't afford. Aerospike's FAQ does provide a workaround, but again it is risky.
Help Needed:
Is there an efficient way to migrate data from one set to another, with less effort and proper validation? I did think of writing some Java code that would scan the entire set and write those records into another set, but again that falls under the first category I explained earlier.
Thanks!

A record in Aerospike is stored under the hash of your key and your set name. The set name is "stored" with the record in Aerospike purely as metadata on that record. So you can scan the entire namespace, return the records belonging to that set, and in the scan callback write each of them back as a new record (under the different set name). You will have to know "your key" for each record that comes back from the scan. By default, Aerospike only stores the 20-byte hash digest as the key of the record. So unless you stored the key explicitly, either with send-key set to true or in a bin, I don't see how you would identify "your key". Storing "your key" in a bin is easiest. You may have to first update all your 100 records and add a bin that holds "your key". Then, in the scan callback, where records come in no particular order, you will be able to compose a new Key from "your key" and the new set name. You will have to write your own Java code for it. (If you have "your key" in the original records, it's easy to do.)
I have not tested this, but something along these lines should work, assuming the original records have your key in the "mykey" bin:
client.scanAll(null, "test", "set1", new ScanCallback() {
    public void scanCallback(Key key, Record record) throws AerospikeException {
        String mykey = (String) record.getValue("mykey");
        String bin1data = record.getString("bin1"); // alternate way to get a string bin
        Key reckey = new Key("test", "set2", mykey);
        client.put(null, reckey, new Bin("bin1", bin1data));
    }
});

Related

Ignite/GridGain putAllIfAbsent

I use GridGain as my persistence database. I have the following requirements:
Insert multiple records only if their keys do not already exist.
Put multiple records regardless of whether their keys exist.
For 1, I found the cache.putIfAbsent(key, value) method to insert a single record if it does not exist, but I didn't find a cache.putAllIfAbsent(Map<key, value>)-like method. I could loop to insert multiple records one by one; will that cause a performance issue?
For 2, I think I can use the cache.putAll(Map<key, value>) method. Is that the proper way?
I run the servers in Google Cloud Kubernetes Engine, with thick clients.
putAll always overwrites existing records.
putIfAbsent in a loop will be slower than putAll; measure your specific use case to see by how much.
If there is no requirement for ordering and atomicity, DataStreamer is a good choice. When allowOverwrite flag is false (default), you get putIfAbsent behavior, and good performance.
try (IgniteDataStreamer<Integer, String> stmr = ignite.dataStreamer("myCache")) {
    stmr.allowOverwrite(false); // don't overwrite existing data
    Map<Integer, String> entries = getMydata();
    stmr.addData(entries);
}
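The semantic difference between the two operations mirrors plain java.util.Map, which makes it easy to see why putAll alone cannot satisfy requirement 1. A minimal stand-in sketch (plain JDK collections, not Ignite itself):

```java
import java.util.HashMap;
import java.util.Map;

public class PutSemantics {
    public static void main(String[] args) {
        Map<Integer, String> cache = new HashMap<>();
        cache.put(1, "original");

        // putIfAbsent keeps the existing value; only absent keys are inserted
        cache.putIfAbsent(1, "ignored");
        cache.putIfAbsent(2, "inserted");

        // putAll overwrites unconditionally
        cache.putAll(Map.of(1, "overwritten"));

        System.out.println(cache.get(1)); // overwritten
        System.out.println(cache.get(2)); // inserted
    }
}
```

IgniteCache follows the same contract for putIfAbsent and putAll, which is why the DataStreamer with allowOverwrite(false) is the batch-friendly way to get "insert only if absent" behavior.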

How to generate an incremental item number in BOPF

I created a custom table to store reasons for modifying an object. I'm doing a POC with BOPF in order to learn, even if it may not make sense to use it here.
This is what the persistent structure looks like (simplified):
define type zobject_modifications {
  object_id : zobject_id;
  @EndUserText.label : 'Modification Number'
  mod_num : abap.numc(4);
  reason_id : zreason_id;
  @EndUserText.label : 'Modification Comments'
  comments : abap.string(256);
}
The alternative key consists of object_id + mod_num. The mod_num should be an auto-generated counter, always adding 1 to the last modification for that object_id.
I created a determination before_save to generate it, checking the MAX mod_num from the database BOs and from the current instantiated BOs and increasing by 1.
But when I try to create 2 BOs for the same object in a single transaction, I get an error because of the duplicated alternative key, since the field MOD_NUM is still initial and before_save is only triggered later. I tried changing the determination to "After Modify", but I still get the same problem.
The question is: When and how should I generate the next MOD_NUM to be able to create multiple nodes for the same object ID safely?
This must be a very common problem so there must be a best practice way to do it, but I was not able to find it.
Use a number range to produce sequential identifiers. They ensure that you won't get duplicates if there are ongoing and concurrent transactions.
If you want to insist on determining the next identifier on your own, use the io_read input parameter of the determination to retrieve the biggest mod_num:
The database contains only those nodes that have already been committed. But your new nodes are not committed, yet, such that you won't get them.
io_read, in contrast, accesses BOPF's temporary buffer, which also contains the nodes you just created, and therefore sees the more current data.

How do I delete all keys matching a specified key pattern using StackExchange.Redis?

I've got about 150,000 keys in a Redis cache, and need to delete > 95% of them - all keys matching a specific key prefix - as part of a cache rebuild. As I see it, there are three ways to achieve this:
Use server.Keys(pattern) to pull out the entire key list matching my prefix pattern, and iterate through the keys calling KeyDelete for each one.
Maintain a list of keys in a Redis set - each time I insert a value, I also insert the key in the corresponding key set, and then retrieve these sets rather than using Keys. This would avoid the expensive Keys() call, but still relies on deleting tens of thousands of records one by one.
Isolate all of my volatile data in a specific numbered database, and just flush it completely at the start of a cache rebuild.
I'm using .NET and the StackExchange.Redis client - I've seen solutions elsewhere that use the CLI or rely on Lua scripting, but nothing that seems to address this particular use case - have I missed a trick, or is this just something you're not supposed to do with Redis?
(Background: Redis is acting as a view model in front of the Microsoft Dynamics CRM API, so the cache is populated on first run by pulling around 100K records out of CRM, and then kept in sync by publishing notifications from within CRM whenever an entity is modified. Data is cached in Redis indefinitely and we're dealing with a specific scenario here where the CRM plugins fail to fire for a period of time, which causes cache drift and eventually requires us to flush and rebuild the cache.)
Both options 2 & 3 are reasonable.
Steer clear of option 1. KEYS really is slow and only gets slower as your keyspace grows.
I'd normally go for 2 (without Lua; adding Lua would increase the learning curve for supporting the solution, which is of course fine when justified, assuming its existence is clearly documented). But 3 could definitely be a contender: fast and simple, as long as you can be sure you won't exceed the configured DB limit.
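The bookkeeping behind option 2 is easy to prototype outside Redis. Here is a minimal in-memory stand-in (plain Java collections, not StackExchange.Redis; TrackedCache and its method names are illustrative inventions) showing the insert-and-track plus bulk-delete pattern:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the "key set" pattern: every volatile write also
// records its key in a tracking set, so a rebuild can delete exactly
// those keys without ever scanning the whole keyspace.
public class TrackedCache {
    private final Map<String, String> store = new HashMap<>();
    private final Set<String> trackedKeys = new HashSet<>();

    public void putTracked(String key, String value) {
        store.put(key, value);  // plays the role of SET key value
        trackedKeys.add(key);   // plays the role of SADD tracking-set key
    }

    // Plays the role of SMEMBERS (or SSCAN) followed by batched DEL/UNLINK.
    public int flushTracked() {
        int removed = 0;
        for (String key : trackedKeys) {
            if (store.remove(key) != null) removed++;
        }
        trackedKeys.clear();
        return removed;
    }
}
```

In Redis itself, the same shape is a SET plus SADD on every write, then SMEMBERS (or SSCAN) and batched deletes on rebuild; the tracking set bounds the work to exactly the volatile keys.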
Use scanStream instead of keys and it will work like a charm.
Docs - https://redis.io/commands/scan
The code below gets you an array of keys starting with LOGIN::; you can then loop through the array and execute the Redis DEL command to delete the corresponding keys.
Example code in Node.js:
const Redis = require('ioredis');
const redis = new Redis();

const keysArray = [];
const stream = redis.scanStream({
  match: 'LOGIN::*'
});
stream.on('data', (keys = []) => {
  for (const key of keys) {
    if (!keysArray.includes(key)) {
      keysArray.push(key);
    }
  }
});
stream.on('end', () => {
  res(keysArray); // res: the resolve callback of the surrounding Promise
});

DynamoDB: Have sequencing within Items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure if using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds/seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?
You can't use DynamoDB to auto-populate dates. You can use other services to provide you with auto-generating numbers or use DynamoDB's atomic increment to create your own UUID.
This can become a bottleneck if your forum is very successful (needs lots of numbers per second). I think you should start with a timestamp and later add complexity to your ID generation (concatenate timestamp+uuid or timestamp+atomiccounter).
It is always a best practice to sync your servers' clocks (ntpd).
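A sketch of the timestamp+uuid concatenation suggested above; the fixed-width zero-padding is an assumption I've added so that lexicographic order matches time order:

```java
import java.util.UUID;

public class PostIds {
    // Zero-padded millis keep string comparison consistent with time order;
    // the random UUID suffix breaks ties between posts created in the
    // same millisecond (though it gives no ordering within that millisecond).
    public static String newPostId(long epochMillis) {
        return String.format("%013d-%s", epochMillis, UUID.randomUUID());
    }
}
```

IDs generated later sort after earlier ones when compared as plain strings, which makes them usable directly as a DynamoDB range key.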
Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's extra cost and effort of managing another table, but this is the best solution I know by far and haven't seen any one else mentioning it.
The table should have a key attribute as primary partition key, and a numeric value attribute with initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is one single atomic operation. DynamoDB handles the auto-incrementing, so there's no concurrency issues.
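The return-the-old-value-while-incrementing semantics described here behave like Java's AtomicLong.getAndIncrement, which can serve as a local stand-in for reasoning about the sequence table (SequenceStandIn is an illustrative name, not a DynamoDB API):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequenceStandIn {
    // Plays the role of the single "PostId" row in the Sequences table,
    // starting at 1 as described above.
    private final AtomicLong value = new AtomicLong(1);

    // One atomic step: increment and hand back the pre-increment value,
    // just like UpdateItem with ADD +1 and ReturnValue.ALL_OLD.
    public long nextId() {
        return value.getAndIncrement();
    }
}
```

Because the read and the increment are a single atomic operation, two concurrent callers can never receive the same ID.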
In code, there's more than one way of implementing this. Here's one example:
Map<String, AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));

Map<String, AttributeValueUpdate> item = new HashMap<>();
item.put("value",
    new AttributeValueUpdate()
        .withAction(AttributeAction.ADD)
        .withValue(new AttributeValue().withN("1")));

UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item)
    .withReturnValues(ReturnValue.ALL_OLD);
UpdateItemResult result = dynamoDBClient.updateItem(request);

// <- this is the sequential ID you want to assign to your post
Integer postId = Integer.parseInt(result.getAttributes().get("value").getN());
Another variation of Chen's suggestion is to have strict ordering of posts within a given Forum Thread, as opposed to globally across all Threads. One way to do this is to have a Reply table with the Hash key of ThreadId, and a range key of ReplyId. The ReplyId would be a Number type attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the one most recent reply on that thread (ScanIndexForward: false, Limit: 1, ThreadId: ). To insert your new reply use the ReplyId of the one returned in the Query, + 1. Then use PutItem, using a Conditional Write, so that if someone else replies at the same time, an error will be returned, and your app can start again with the query.
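The retry loop in that scheme can be sketched with a ConcurrentSkipListMap standing in for one thread's Reply table, where putIfAbsent plays the role of the conditional PutItem (ReplySequencer is an illustrative name, not a DynamoDB API):

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class ReplySequencer {
    // One thread's replies, keyed by ReplyId (sorted, like a range key),
    // with ReplyId starting at 0 as described above.
    private final ConcurrentSkipListMap<Long, String> replies = new ConcurrentSkipListMap<>();

    public long addReply(String text) {
        while (true) {
            // "Query the newest reply": ScanIndexForward=false, Limit=1
            Long last = replies.isEmpty() ? -1L : replies.lastKey();
            long candidate = last + 1;
            // "Conditional PutItem": fails if someone replied concurrently
            if (replies.putIfAbsent(candidate, text) == null) {
                return candidate;
            }
            // Lost the race: loop, re-query, and try the next ID.
        }
    }
}
```

The key property carried over from DynamoDB is that the conditional write either succeeds atomically or tells you to retry, so two concurrent replies can never claim the same ReplyId.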
If you want the simplest initial solution possible, then the timestamp+uuid concatenation Chen suggests is the simplest approach. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.

Fastest way to query for object existence in NHibernate

I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple: imagine a directory tool that reads the current hard drive. When a directory is found, it should either be created or, if already present, updated.
First lets only focus on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    var result = DatabaseController.Session
        .CreateCriteria(typeof (DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>()
        .FirstOrDefault();

    if (result == null)
    {
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }

    return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up that way. On the other hand, this may not be suitable for large sets of data, like files (which will number 600,000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (for other reasons). So the real question is: what is the fastest way to check for the existence of an object where the ID is not known, just a property of that object?
And batching might be possible, by selecting a big group with something like "starts with C:\Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check in this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important question is whether buffering really affects performance this much. If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which should be OK even if there are 1,000,000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
Dictionary<string, DirectoryInfo> directories;

session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
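The batched existence check reduces to a set difference per batch. A minimal stand-in (here in Java rather than C#, with queryExisting as a hypothetical placeholder for the DTO query above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Math;
import java.util.Set;
import java.util.function.Function;

public class ExistenceCheck {
    // Returns the inputs NOT present in the database, querying in fixed-size
    // batches. queryExisting stands in for the "FullName in (:ids)" DTO query.
    public static List<String> findMissing(List<String> paths, int batchSize,
                                           Function<List<String>, Set<String>> queryExisting) {
        List<String> missing = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += batchSize) {
            List<String> batch = paths.subList(i, Math.min(i + batchSize, paths.size()));
            Set<String> existing = queryExisting.apply(batch);
            for (String p : batch) {
                if (!existing.contains(p)) missing.add(p);
            }
        }
        return missing;
    }
}
```

The batch size bounds both the SQL IN-list length and memory use, so the same shape works for the 600,000-file case without loading everything at once.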
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.