Materialized view to use different Serde - serialization

Versions used: Kafka 3.1.1, Confluent 7.1.0, Avro 1.11.0
I'm creating a REST controller which "searches" for AVRO objects in a topic. The objects in the topic are serialized using SpecificAvroSerde<>. Each topic has two AVRO schemas assigned: one for the key (several fields of various types) and one for the value (multiple fields and types).
I've done this several times: I consume the topic into a KTable and then materialize it. There is only one pair of serdes involved, and the serialized format is the same for both the topic and the materialized view (RocksDB). The REST controller can then look up the store and either perform a get with a key or do a range scan between two keys. This all works as expected.
private final static String TOPIC_NAME = "input-topic";
private final static String VIEW_NAME = "materialized-view";

private final SpecificAvroSerde<ProductXrefKey> productXrefKeySerde = new SpecificAvroSerde<>();
private final SpecificAvroSerde<ProductXref> productXrefSerde = new SpecificAvroSerde<>();

final Map<String, Object> props = this.kafkaProperties.buildStreamsProperties();
productXrefKeySerde.configure(props, true);
productXrefSerde.configure(props, false);

KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .table(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde),
                Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                        .withKeySerde(productXrefKeySerde)
                        .withValueSerde(productXrefSerde));
<…>
final ReadOnlyKeyValueStore<ProductXrefKey, ProductXref> store =
        streamsBuilderFactoryBean.getKafkaStreams().store(fromNameAndType(VIEW_NAME, keyValueStore()));

try (KeyValueIterator<ProductXrefKey, ProductXref> range = store.range(fromKey, toKey)) {
    if (range != null) {
        range.forEachRemaining(kv -> {
            <…>
        });
    } else {
        log.info("Could not find {} in local ReadOnlyKeyValueStore {}", fromKey, VIEW_NAME);
    }
}
I now want to change this to use a prefix scan instead. Since the key contains multiple fields and there is no way to serialize only the first part of the key (i.e. the first few fields), I need a specialized serializer. This also means I have to use a different serializer for the materialized view itself (SpecificAvroSerde puts the magic byte and schema ID at the beginning of the byte array), as otherwise the serialized prefix and the serialized keys in the materialized view can't be compared. Hence I created a specialized Serde which serializes the key using the same logic as for the prefix, but omits the fields not required for the scan (i.e. the last field). The above code now looks like this:
private final static String TOPIC_NAME = "input-topic";
private final static String VIEW_NAME = "materialized-view";

private final SpecificAvroSerde<ProductXrefKey> productXrefKeySerde = new SpecificAvroSerde<>();
private final SpecificAvroSerde<ProductXref> productXrefSerde = new SpecificAvroSerde<>();
private final Serde<ProductXrefKey> materializedProductXrefKeySerde = new ProductXrefKeySerde();
// for the value part we can still use the standard serde as no change in serialization logic is needed
private final SpecificAvroSerde<ProductXref> materializedProductXrefSerde = new SpecificAvroSerde<>();
// telling the serializer to cut off the last field
private final Serde<ProductXrefKey> prefixScanProductXrefSerde = new ProductXrefKeySerde(true);

final Map<String, Object> props = this.kafkaProperties.buildStreamsProperties();
productXrefKeySerde.configure(props, true);
productXrefSerde.configure(props, false);

KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .table(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde),
                Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                        .withKeySerde(materializedProductXrefKeySerde)
                        .withValueSerde(materializedProductXrefSerde));
<…>
final ReadOnlyKeyValueStore<ProductXrefKey, ProductXref> store =
        streamsBuilderFactoryBean.getKafkaStreams().store(fromNameAndType(VIEW_NAME, keyValueStore()));

try (KeyValueIterator<ProductXrefKey, ProductXref> range = store.prefixScan(prefixKey, prefixScanProductXrefSerde.serializer())) {
    if (range != null) {
        range.forEachRemaining(kv -> {
            <…>
        });
    } else {
        log.info("Could not find {} in local ReadOnlyKeyValueStore {}", prefixKey, VIEW_NAME);
    }
}
My assumption was that the topic gets deserialized using the SpecificAvroSerde and then gets serialized for the view using my ProductXrefKeySerde. The problem is that the content in the materialized view is still serialized with the same logic as in the original topic. It appears that my serializer is never invoked while the topic is processed and stored in the materialized view. I can verify that on the file system as well: the keys in the RocksDB files are serialized with the magic byte and schema ID, and hence prefixScan won't be able to find anything.
How can I change the serialization format for the materialized view?
Or is there a better way of serializing an AVRO prefix object?
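For reference, the ProductXrefKeySerde itself is not shown above. A minimal sketch of the idea could look like the following; this is not the actual implementation from the question: the field names plantId, productId and revision are made up, and the fixed-order, length-prefixed encoding is just one possible choice that keeps the prefix and the full key byte-comparable.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

// Sketch only: writes the key fields in a fixed order without the Confluent
// magic byte / schema ID, optionally omitting the last field for prefix scans.
public class ProductXrefKeySerde implements Serde<ProductXrefKey> {

    private final boolean prefixOnly;

    public ProductXrefKeySerde() {
        this(false);
    }

    public ProductXrefKeySerde(boolean prefixOnly) {
        this.prefixOnly = prefixOnly;
    }

    @Override
    public Serializer<ProductXrefKey> serializer() {
        return (topic, key) -> {
            if (key == null) {
                return null;
            }
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(key.getPlantId().toString());      // hypothetical field
                out.writeUTF(key.getProductId().toString());    // hypothetical field
                if (!prefixOnly) {
                    out.writeUTF(key.getRevision().toString()); // hypothetical last field
                }
            } catch (IOException e) {
                throw new SerializationException("Failed to serialize ProductXrefKey", e);
            }
            return bytes.toByteArray();
        };
    }

    @Override
    public Deserializer<ProductXrefKey> deserializer() {
        // Reading back would mirror the writes above; omitted in this sketch.
        return (topic, data) -> {
            throw new UnsupportedOperationException("sketch only");
        };
    }
}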

It appears that there is some optimization happening which avoids the deserialization/serialization step when a KTable is materialized directly from the topic. I've changed the logic so that the topic is consumed as a KStream, which is then converted into a KTable (toTable(...)):
KTable<ProductXrefKey, ProductXref> productXrefTable = builder
        .stream(TOPIC_NAME, Consumed.with(productXrefKeySerde, productXrefSerde))
        .toTable(Materialized.<ProductXrefKey, ProductXref, KeyValueStore<Bytes, byte[]>>as(VIEW_NAME)
                .withKeySerde(materializedProductXrefKeySerde)
                .withValueSerde(materializedProductXrefSerde));
With this small change, the data now gets deserialized (using SpecificAvroSerde<>) and serialized again using the provided ProductXrefKeySerde. The prefix scan now also works and returns the records as expected.

Related

Graph traversal name to graph name mapping

Is there any API with which I can get the graphTraversalName to graphName mapping defined in the script?
I am using the messy code below, but it's error-prone if both graphs are using the same underlying storage.
Map<String, String> graphNameTraversalMap = new ConcurrentHashMap<String, String>();
Map<String, String> graphTraversalToNameMap = new ConcurrentHashMap<String, String>();

Iterator<String> traversalSourceIterator = graphManager.getTraversalSourceNames().iterator();
while (traversalSourceIterator.hasNext()) {
    String traversalSource = traversalSourceIterator.next();
    String currentGraphString = ((GraphTraversalSource) graphManager.getAsBindings().get(traversalSource)).getGraph().toString();
    graphNameTraversalMap.put(currentGraphString, traversalSource);
}

Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
while (graphNamesIterator.hasNext()) {
    String graphName = graphNamesIterator.next();
    String currentGraphString = graphManager.getGraph(graphName).toString();
    String traversalSource = graphNameTraversalMap.get(currentGraphString);
    graphTraversalToNameMap.put(traversalSource, graphName);
}
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide an ordering guarantee? If so, I could iterate over it and populate my map.
Is there any API with which I can get the graphTraversalName to graphName mapping defined in the script?
No. They share the same namespace in Gremlin Server so the relationship gets lost programmatically. You would need to do something like what you are doing but I wouldn't rely on toString() of a Graph for equality. Perhaps use the Graph instance itself? Although that might not work either depending on your situation and what you want for equality as you could have two different Graph configurations pointed at the same data and want to resolve those as the same graph. I'm also not sure that any approach will work generally for all graph systems. Anyway, I think I'd experiment with using Map<Graph, String> graphTraversalToNameMap for your case and see how that goes.
Does gremlinExecutor.getScriptEngineManager().getBindings().entrySet() provide an ordering guarantee?
No as it is backed by a ConcurrentHashMap. You would have to provide your own order.
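A rough sketch of that Map<Graph, String> suggestion, keyed on the Graph instance rather than its toString(); this is a sketch only, assuming the GraphManager instance from the question and that reference equality of the Graph instances is good enough for your case:

// Intermediate map keyed by the Graph instance instead of toString()
Map<Graph, String> graphToTraversalSourceMap = new HashMap<>();
for (String traversalSourceName : graphManager.getTraversalSourceNames()) {
    Graph graph = graphManager.getTraversalSource(traversalSourceName).getGraph();
    graphToTraversalSourceMap.put(graph, traversalSourceName);
}

// Resolve traversal source name -> graph name via the shared Graph instance
Map<String, String> graphTraversalToNameMap = new HashMap<>();
for (String graphName : graphManager.getGraphNames()) {
    String traversalSourceName = graphToTraversalSourceMap.get(graphManager.getGraph(graphName));
    if (traversalSourceName != null) {
        graphTraversalToNameMap.put(traversalSourceName, graphName);
    }
}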
Underlying storage details can be obtained from the configuration object and used for the mapping; sample code:
public class GraphTraversalMappingUtil {

    private static Map<String, String> graphTraversalToNameMap = new HashMap<>();

    public static void populateGraphTraversalToNameMapping(GraphManager graphManager) {
        if (graphTraversalToNameMap.size() != 0) {
            return;
        }

        Iterator<String> traversalSourceIterator = graphManager.getTraversalSourceNames().iterator();
        Map<StorageBackendKey, String> storageKeyToTraversalMap = new HashMap<StorageBackendKey, String>();

        while (traversalSourceIterator.hasNext()) {
            String traversalSource = traversalSourceIterator.next();
            StorageBackendKey key = new StorageBackendKey(
                    graphManager.getTraversalSource(traversalSource).getGraph().configuration());
            storageKeyToTraversalMap.put(key, traversalSource);
        }

        Iterator<String> graphNamesIterator = graphManager.getGraphNames().iterator();
        while (graphNamesIterator.hasNext()) {
            String graphName = graphNamesIterator.next();
            StorageBackendKey key = new StorageBackendKey(
                    graphManager.getGraph(graphName).configuration());
            graphTraversalToNameMap.put(storageKeyToTraversalMap.get(key), graphName);
        }
    }
}
For full code, refer: https://pastebin.com/7m8hi53p
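The StorageBackendKey class referenced above is not shown here (the full version is behind the pastebin link). The idea is roughly something like the following hypothetical sketch, where equality is derived from the storage-related configuration values so that two graphs backed by the same storage map to the same key; the property names used are provider-specific assumptions, not something defined by TinkerPop itself:

import java.util.Objects;
import org.apache.commons.configuration.Configuration;

// Hypothetical sketch, not the pastebin version: equality is based on the
// storage-related configuration so graphs sharing a backend compare equal.
public class StorageBackendKey {

    private final String signature;

    public StorageBackendKey(Configuration config) {
        // Property names are assumptions (e.g. JanusGraph-style keys).
        this.signature = config.getString("storage.backend", "")
                + "|" + config.getString("storage.hostname", "")
                + "|" + config.getString("storage.directory", "");
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StorageBackendKey
                && signature.equals(((StorageBackendKey) o).signature);
    }

    @Override
    public int hashCode() {
        return Objects.hash(signature);
    }
}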

Update typed object in CosmosDB

I want to update a document in CosmosDB. To do this, I first retrieve the document and transform it into a typed object, like this:
public async Task<T> ReadRawAsync(Expression<Func<T, bool>> predicate)
{
    var query = _client
        .CreateDocumentQuery<T>(_uri)
        .Where(predicate)
        .AsDocumentQuery();

    var results = new List<T>();
    while (query.HasMoreResults)
        results.AddRange(await query.ExecuteNextAsync<T>());

    return results.FirstOrDefault();
}
After some transformations I want to update (replace) that document:
await _client.ReplaceDocumentAsync(_uri, document);
I am not completely sure this can be done in this way. Could you please point to what is required to do such an update?
ReplaceDocumentAsync will work only if the document has a property named id (notice the lower case: it needs to serialize to a lowercase id, so you might need a JsonProperty("id") attribute) and a document with this id already exists in the collection.
Also, if your collection is partitioned, you need to pass the next parameter of this method, a RequestOptions object, and set the PartitionKey value of this document in it.

Apache Ignite Continuous Queries : How to get the field names and field values in the listener updates when there are dynamic fields?

I am working on a POC on whether or not we should go ahead with Apache Ignite for both commercial and enterprise use. There is a use case, though, that we are trying to find an answer for.
Preconditions
Dynamic creation of tables, i.e. there may be new fields that need to be put into the cache. This means there is no precompiled POJO (model) defining the attributes of the table/cache.
Use case
I would like to write a SELECT continuous query that gives me the results that were modified. I wrote that query, but the problem is that when the listener gets a notification, there is no method call from which I can find all the field names that were modified. I would like to get all the field names and field values in some sort of Map, which I can then use and submit to other systems.
You can track all modified field values using binary objects and a continuous query:
IgniteCache<Integer, BinaryObject> cache = ignite.cache("person").withKeepBinary();

ContinuousQuery<Integer, BinaryObject> query = new ContinuousQuery<>();
query.setLocalListener(events -> {
    for (CacheEntryEvent<? extends Integer, ? extends BinaryObject> event : events) {
        BinaryType type = ignite.binary().type("Person");
        if (event.getOldValue() != null && event.getValue() != null) {
            HashMap<String, Object> oldProps = new HashMap<>();
            HashMap<String, Object> newProps = new HashMap<>();
            for (String field : type.fieldNames()) {
                oldProps.put(field, event.getOldValue().field(field));
                newProps.put(field, event.getValue().field(field));
            }
            com.google.common.collect.MapDifference<Object, Object> diff =
                    com.google.common.collect.Maps.difference(oldProps, newProps);
            System.out.println(diff.entriesDiffering());
        }
    }
});
cache.query(query);

cache.put(1, ignite.binary().builder("Person").setField("name", "Alice").build());
cache.put(1, ignite.binary().builder("Person").setField("name", "Bob").build());
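With the two put calls above, the listener should print something like {name=(Alice, Bob)} for the second update, i.e. the old and new value of every field that changed.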

Google diff-match-patch : How to unpatch to get Original String?

I am using the Google diff-match-patch Java library to create a patch between two JSON strings and store the patch in a database.
diff_match_patch dmp = new diff_match_patch();
LinkedList<Patch> diffs = dmp.patch_make(latestString, originalString);
String patch = dmp.patch_toText(diffs); // Store patch to DB
Now is there any way to use this patch to re-create the originalString by passing the latestString?
I googled this and found a very old comment in the Google diff-match-patch wiki saying:
Unpatching can be done by just looping through the diff, swapping
DIFF_INSERT with DIFF_DELETE, then applying the patch.
But I did not find any useful code that demonstrates this. How could I achieve this with my existing code? Any pointers or code references would be appreciated.
Edit:
The problem I am facing is that in the front end I am showing a revisions module that lists all the transactions of a particular fragment (for example, an employee's details), such as which user updated which details, etc. I am recreating the fragment JSON by reverse-applying each patch to get the data of the current transaction and showing it as a table (using http://marianoguerra.github.io/json.human.js/). But some of the JSON data is not valid JSON, and I am getting a JSON.parse error.
I was looking to do something similar (in C#), and what is working for me with a relatively simple object is the patch_apply method. This use case seems to be somewhat missing from the documentation, so I'm answering here. The code is C#, but the API is cross-language:
static void Main(string[] args)
{
    var dmp = new diff_match_patch();

    string v1 = "My Json Object";
    string v2 = "My Mutated Json Object";
    var v2ToV1Patch = dmp.patch_make(v2, v1);
    var v2ToV1PatchText = dmp.patch_toText(v2ToV1Patch); // Persist text to db

    string v3 = "Latest version of JSON object";
    var v3ToV2Patch = dmp.patch_make(v3, v2);
    var v3ToV2PatchTxt = dmp.patch_toText(v3ToV2Patch); // Persist text to db

    // Time to re-hydrate the objects
    var altV3ToV2Patch = dmp.patch_fromText(v3ToV2PatchTxt);
    var altV2 = dmp.patch_apply(altV3ToV2Patch, v3)[0].ToString(); // .get(0) in Java I think

    var altV2ToV1Patch = dmp.patch_fromText(v2ToV1PatchText);
    var altV1 = dmp.patch_apply(altV2ToV1Patch, altV2)[0].ToString();
}
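For the Java library used in the question, the same round trip might look roughly like the sketch below. This is an illustration only: it assumes the original name.fraser.neil.plaintext package of diff_match_patch, and that, as in C#, patch_apply returns an Object[] whose first element is the patched text and whose second element is a boolean[] of per-patch results.

import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class UnpatchExample {
    public static void main(String[] args) {
        diff_match_patch dmp = new diff_match_patch();

        String originalString = "{\"name\":\"Alice\",\"role\":\"admin\"}";
        String latestString = "{\"name\":\"Alice\",\"role\":\"user\"}";

        // Patch that turns latestString back into originalString (as in the question)
        LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(latestString, originalString);
        String patchText = dmp.patch_toText(patches); // store this in the DB

        // Later: re-create originalString from latestString plus the stored patch
        LinkedList<diff_match_patch.Patch> restored =
                new LinkedList<>(dmp.patch_fromText(patchText));
        Object[] result = dmp.patch_apply(restored, latestString);

        String recovered = (String) result[0];
        boolean[] applied = (boolean[]) result[1]; // every entry should be true
        System.out.println(originalString.equals(recovered));
        System.out.println(java.util.Arrays.toString(applied));
    }
}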
I am attempting to retrofit this as an audit log, where previously the entire JSON object was saved. As the audited objects have become more complex, the storage requirements have increased dramatically. I haven't yet applied this to the large, complex objects, but it is possible to check whether the patch was successful by inspecting the second element of the array returned by the patch_apply method. This is an array of boolean values, all of which should be true if the patch worked correctly. You could write some code to check this, which would help verify that the object can be successfully re-hydrated from the JSON rather than just getting a parsing error. My prototype C# method looks like this:
private static bool ValidatePatch(object[] patchResult, out string patchedString)
{
    patchedString = patchResult[0] as string;
    var successArray = patchResult[1] as bool[];

    foreach (var b in successArray)
    {
        if (!b)
            return false;
    }
    return true;
}

Does this saving/loading pattern have a name?

There's a variable persistence concept I have integrated multiple times:
// Standard initialization
boolean save = true;
Map<String, Object> dataHolder;

// Variables to persist
int number = 10;
String text = "I'm saved";

// Use the variables in various ways in the project
void useVariables() { ... number ... text ... }

// Function to save the variables into a data structure and, for example, write them to a file
public Map<String, Object> getVariables()
{
    Map<String, Object> data = new LinkedHashMap<String, Object>();
    persist(data);
    return(data);
}

// Function to load the variables from the data structure
public void setVariables(Map<String, Object> data)
{
    persist(data);
}

void persist(Map<String, Object> data)
{
    // If the given data structure is empty, it means data should be saved
    save = (data.isEmpty());
    dataHolder = data;

    number = (int) handleVariable("theNumber", number);
    text = (String) handleVariable("theText", text);
    ...
}

private Object handleVariable(String name, Object value)
{
    // If currently saving
    if(save)
        dataHolder.put(name, value);   // Just add to the data structure
    else                               // If currently loading
        return(dataHolder.get(name));  // Read and return from the data structure

    return(value);                     // Return the given variable (no change)
}
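For illustration, a possible round trip with this pattern might look like the following; the class name Settings is hypothetical and just stands for whatever class holds the fields and methods above:

// Hypothetical usage: Settings is assumed to contain the fields and methods above.
Settings settings = new Settings();

// Saving: persist() receives an empty map, so every variable is written into it
Map<String, Object> data = settings.getVariables();
// ... write `data` to a file, database, etc. ...

// Loading: persist() receives a non-empty map, so every variable is read back from it
settings.setVariables(data);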
The main benefit of this principle is that there is only a single place where you have to mention new variables you add during development, and it's one simple line per variable.
Of course you can move the handleVariable() function to a different class which also contains the "save" and "dataHolder" variables, so they won't be in the main application.
Additionally, you could pass meta-information etc. for each variable (required for persisting the data structure to a file or similar) by saving a custom class which contains this information plus the variable, instead of the object itself.
Performance could be improved by keeping track of the order (in another data structure, when running through the persist() function for the first time) and using a "dataHolder" based on an array instead of a search-based map (i.e. using an index instead of a name string).
However, this is the first time I have to document it, and so I wondered whether this function-reuse principle has a name.
Does someone recognize this idea?
Thank you very much!