Sending data using Avro objects: is there an advantage to using a schema registry?

I have an application where I generate Avro objects from an AVSC file (using the Avro plugin) and then produce messages using those objects. I can consume them with the same schema in another application if I wish, by generating the POJOs there. The thing I noticed is that the schema doesn't exist in the schema registry.
I think if I change my producer type/settings it might create it there (I am using Spring Kafka). Is there any advantage to having it there? Is what I am doing at the minute just serialization of data - is it the same as, say, just creating GSON objects from data and producing them?
Is it bad practice not having the schema in the registry?

To answer the question "is there an advantage": yes. At the very least, it allows other applications to discover what is contained in the topic, whether that's another Java application using Spring, or not.
You also don't need the schemas to be contained within the consumer codebase.
And assuming you're using the Confluent serializers, there's no way to "skip" schema registration, so the schemas should be in the Registry by default under "your_topic-value".
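For reference, here is a minimal sketch of wiring this up in Spring Kafka with the Confluent Avro serializer (the URLs and class/bean names are placeholders, not your actual config) - once the value serializer is KafkaAvroSerializer and schema.registry.url is set, the schema gets registered on the first send:

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;

import java.util.HashMap;
import java.util.Map;

public class AvroProducerConfigSketch {
    // e.g. exposed as a @Bean in a @Configuration class
    public KafkaTemplate<String, Object> kafkaTemplate() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        // where the serializer registers and looks up schemas
        props.put("schema.registry.url", "http://localhost:8081");
        DefaultKafkaProducerFactory<String, Object> factory = new DefaultKafkaProducerFactory<>(props);
        return new KafkaTemplate<>(factory);
    }
}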

Related

Data exchange format to use with Apache Kafka that provides schema validation

What is the best message format to use with Apache Kafka so that producers and consumers can define a contract, validate data, and serialize/deserialize data? For example, in XML we have XSD, but in JSON there is no universal schema. I read about using Apache Avro, but I'm not sure how fast it will be, as I can't afford more than 5 to 6 ms for schema validation and deserialization. Any inputs please?
We will be processing thousands of transactions per second, and the SLA for each transaction is 150 ms, so I am looking for something that's very fast.
Avro is often quoted as being slow(er) and as adding overhead compared to other binary formats, but I believe that applies to the case where no Schema Registry is used; with a registry, the schema is excluded from the actual payload.
Alternatively, you can use Protobuf or Thrift if you absolutely want a schema; however, from what I've seen, serializers for those formats aren't as readily available. Plus, the schemas need to be passed between your clients if they're not otherwise committed to a central location.
I can confidently say that Avro should be fine for starting out, though, and the Registry is definitely useful, and not just for Kafka use cases.
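As a side note on why the registry keeps payloads small: the Confluent Avro serializers prepend only a small header (a magic byte plus a 4-byte schema id) to the Avro-encoded body rather than embedding the schema itself. A rough sketch of reading that header (my own illustration, not code from the question):

import java.nio.ByteBuffer;

public class WireFormatSketch {
    // returns the registry schema id from a Confluent-framed Avro payload
    static int schemaIdOf(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        byte magic = buf.get();          // 0 for the Confluent wire format
        if (magic != 0) {
            throw new IllegalArgumentException("not a Confluent-framed payload");
        }
        return buf.getInt();             // 4-byte schema id; the Avro binary body follows
    }
}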

Schema in Avro message

I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't the schema embedding an overhead?
So does that mean it is always important for the producer to batch up messages and then write, so that multiple messages written into one Avro file carry just one schema?
On a different note, is there an option to eliminate the schema embedding while serializing using the Generic/SpecificDatum writers?
I am reading the following points from the Avro specification:
Apache Avro is a data serialization system.
Avro relies on schemas.
When Avro data is read, the schema used when writing it is always present.
The goal of serialization is to avoid per-value overheads, to make serialization both fast and small.
When Avro data is stored in a file, its schema is stored with it.
You are not supposed to use a data serialization system if you want to write one new file for each new message; that is opposed to the goal of serialization. In that case, you want to separate metadata and data.
There is no option to eliminate the schema while writing an Avro file; it would be against the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be broken down to improve I/O efficiency; in the case of HDFS, the block size would be the ideal Avro file size.
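As a rough illustration of that batching point (the schema and file name below are made up for the example): writing many records through one DataFileWriter stores the schema once in the file header, and each append adds only the binary-encoded record.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroBatchWriteSketch {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"body\",\"type\":\"string\"}]}");

        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, new File("events.avro")); // schema written once, in the header
            for (long i = 0; i < 1000; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", i);
                record.put("body", "payload-" + i);
                fileWriter.append(record); // only binary-encoded data is appended
            }
        }
    }
}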
You are correct, there is an overhead if you write a single record, with the schema. This may seem wasteful, but in some scenarios the ability to construct a record from the data using this schema is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
And finally, frameworks like Kafka can plug into a Schema Registry, where rather than storing the schema with each record, they store a pointer to the schema.
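And to the question about the Generic/SpecificDatum writers: when you are not writing an Avro container file, you can encode just the binary body with a BinaryEncoder and no schema attached - which is essentially what registry-backed serializers rely on, since the reader then has to obtain the writer's schema some other way. A small sketch with a made-up schema:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SchemaLessEncodingSketch {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // only the encoded field values end up in the payload; the reader must already
        // know the writer schema (from a file header, a registry lookup, etc.)
        System.out.println("payload bytes: " + out.size());
    }
}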

JSON schema validation at Datapower

As per our current architecture, we have Datapower acting as a gatekeeper that validates each incoming request (in JSON) against JSON schemas.
We have a lot of RESTful services whose corresponding JSON schemas reside on Datapower itself. However, every time there is a change in a service definition, the corresponding schema has to be changed too, which results in a Datapower deployment of the affected schema.
Now we are planning to have a RESTful service that Datapower will call for every incoming request; it will return the JSON schema for the service to be invoked, and that schema will live alongside the service code itself rather than on Datapower. That way, even if there are changes in the service definition, we can change the schema in the same place and deploy the service, saving us an unnecessary Datapower deployment.
Is there any better approach to validating the schema? All I want is to not need a Datapower deployment for every schema change.
Just FYI, we get schema changes on a frequent basis.
Keep your current solution as is, since pulling in new JSON schemas for every request will affect performance. Instead, when you deploy the schema in the backend system, have an RMI (REST management interface) or SOMA call that uploads the new schema, or simply an XML Firewall where you add a GatewayScript that writes the JSON data to a file in the directory (requires 7.5 or higher).
Note that you have to clear the cache as well through the call!
A better approach is to have a push system based on subscription to changes. You can store schemas in etcd, Redis, Postgres or any other system that has notification channels for data changes, so you can update schemas in the validating service without doing it on every request. If your validating service uses a validator that compiles schemas to code (ajv - I am the author - is-my-json-valid, jsen), it is an even bigger performance gain if you only compile on change.
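A sketch of that subscription idea (my own illustration, not part of the original answer): the validating service keeps schemas in an in-memory map and only refreshes an entry when the owning service publishes a change notification. Redis pub/sub via the Jedis client is used here purely as an example channel, and the message format is assumed.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SchemaCacheRefresher {
    private final Map<String, String> schemasByService = new ConcurrentHashMap<>();

    // called on every request; no I/O, just a map lookup
    public String schemaFor(String service) {
        return schemasByService.get(service);
    }

    // blocks, so run it on its own thread
    public void listen() {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.subscribe(new JedisPubSub() {
                @Override
                public void onMessage(String channel, String message) {
                    // assumed message format: "<serviceName>\n<schemaJson>"
                    int split = message.indexOf('\n');
                    schemasByService.put(message.substring(0, split), message.substring(split + 1));
                }
            }, "schema-updates");
        }
    }
}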

How to model the domain with Amazon S3

Let's say I have an Illustration entity which is an aggregate root. That entity contains some data about the artwork and is persisted in a SQL database, while the artwork itself is persisted on Amazon S3. I would also like to save some scaled-down or thumbnail versions of the artwork, so I introduced a Blob entity in a many-to-one relationship with Illustration to represent the binary data of the artwork in various versions.
Now I wonder how I should design the persistence of Blobs. Amazon S3 is a kind of database (please don't start a flame war about what a true database is ;) ), just a different one than SQL, and I think it should be abstracted accordingly, that is, by a Repository. So I would have a BlobRepository where I could store artwork data. On the other hand, in this domain a Blob is definitely not an aggregate root - it is always used as part of the Illustration aggregate - so it shouldn't have its own repository.
So maybe Amazon S3 should not be treated as a persistence technology, but rather as a generic external service, next to EmailSender, CurrencyConverter, etc.? If so, where should I inject this service? Into Illustration entity methods, IllustrationsRepository, or the application service layer?
First of all, when dealing with DDD there are no many-to-one or other RDBMS concepts, because in DDD the database does not exist; everything is sent to a Repository. If you're using an ORM, know that the ORM entities are NOT Domain entities; they are persistence objects.
That being said, I think the Illustration Repository should abstract both the RDBMS and S3. These are persistence implementation details, and the repo should deal with them. Basically the repo will receive the Illustration AR, which will be saved partly in the RDBMS and partly as blobs in S3.
So the Domain doesn't know, and shouldn't know, about Amazon S3 - maybe tomorrow you'll want to use Azure Db, so why should the Domain care about it? It's the Repository's responsibility to deal with it.
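A rough sketch of what that repository boundary could look like (all the types here - BlobStore, Blob, Illustration - are made up for illustration, and the actual SQL and S3 calls are omitted):

import java.util.List;

interface BlobStore {            // e.g. backed by the AWS S3 SDK in the implementation
    void put(String key, byte[] data);
    byte[] get(String key);
}

record Blob(String name, byte[] data) {}
record Illustration(long id, String title, List<Blob> renditions) {}

interface IllustrationRepository {   // this is all the Domain ever sees
    void save(Illustration illustration);
    Illustration findById(long id);
}

class SqlAndS3IllustrationRepository implements IllustrationRepository {
    private final BlobStore blobStore;

    SqlAndS3IllustrationRepository(BlobStore blobStore) {
        this.blobStore = blobStore;
    }

    @Override
    public void save(Illustration illustration) {
        // 1. persist the metadata row(s) via JDBC/ORM (omitted)
        // 2. push each rendition to the blob store under a key derived from the aggregate id
        for (Blob blob : illustration.renditions()) {
            blobStore.put(illustration.id() + "/" + blob.name(), blob.data());
        }
    }

    @Override
    public Illustration findById(long id) {
        // load the metadata, then rehydrate the blobs from the store (omitted in this sketch)
        throw new UnsupportedOperationException("sketch only");
    }
}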

Is this a good use-case for Redis on a ServiceStack REST API?

I'm creating a mobile app and it requires an API service backend to get/put information for each user. I'll be developing the web service on ServiceStack, but I was wondering about the storage. I love the idea of a fast in-memory caching system like Redis, but I have a few questions:
1. I created a sample schema of what my data store should look like. Does this seem like a good case for using Redis as opposed to a MySQL DB or something like that?
Schema image: http://www.miles3.com/uploads/redis.png
2. How difficult is the setup for persisting the Redis store to disk, or is it kind of built in when you do writes to the store? (I'm a newbie on this NoSQL stuff.)
3. I currently have my setup on AWS using a Linux micro instance (because it's free for a year). I know many factors go into this answer, but in general will this be enough for my web service and Redis? Since Redis is in-memory, will that be enough? I guess if my mobile app skyrockets (hey, we can dream, right?) then I'll start hitting the ceiling of the instance.
What to think about when designing a NoSQL Redis application
1) To develop correctly in Redis you should be thinking more about how you would structure the relationships in your C# program, i.e. with the C# collection classes, rather than a relational model meant for an RDBMS. The better mindset is to think of data storage like a document database rather than RDBMS tables. Essentially everything gets blobbed in Redis via a key (index), so you just need to work out which of your entities are primary entities (i.e. aggregate roots), which get kept in their own 'key namespace', and which are non-primary entities, i.e. simply metadata that should just get persisted with its parent entity.
Examples of Redis as a primary Data Store
Here is a good article that walks through creating a simple blogging application using Redis:
http://www.servicestack.net/docs/redis-client/designing-nosql-database
You can also look at the source code of RedisStackOverflow for another real world example using Redis.
Basically you would need to store and fetch the items of each type separately.
var redisUsers = redis.As<User>();                                     // typed client for the User entity
var user = redisUsers.GetById(1);                                      // fetch the User with Id == 1
var userIsWatching = redisUsers.GetRelatedEntities<Watching>(user.Id); // fetch the Watching entities related to this user
The way you store relationships between entities is by making use of Redis Sets; e.g. you can store the Users/Watchers relationship conceptually with:
SET["ids:User>Watcher:{UserId}"] = [{watcherId1},{watcherId2},...]
Redis is schema-less and idempotent
Storing ids in Redis sets is idempotent, i.e. you can add watcherId1 to the same set multiple times and it will only ever have one occurrence of it. This is nice because it means you don't ever need to check whether the relationship exists and can freely keep adding related ids as if they've never existed.
Related: writing to or reading from a Redis collection (e.g. a List) that does not exist is the same as writing to an empty collection, i.e. a list gets created on the fly when you add an item to it, whilst accessing a non-existent list will simply return 0 results. This is a friction-free productivity win since you don't have to define your schemas up front in order to use them. Although, should you need to, Redis provides the EXISTS operation to determine whether a key exists and a TYPE operation so you can determine its type.
Create your relationships/indexes on your writes
One thing to remember is that because there are no implicit indexes in Redis, you will generally need to set up the indexes/relationships needed for reading yourself during your writes. Basically you need to think about all your query requirements up front and ensure you set up the necessary relationships at write time. The RedisStackOverflow source code mentioned above is a good example that shows this.
Note: the ServiceStack.Redis C# provider assumes you have a unique field called Id that is its primary key. You can configure it to use a different field with the ModelConfig.Id() config mapping.
Redis Persistence
2) Redis supports two persistence modes out of the box: RDB and Append Only File (AOF). RDB writes routine snapshots whilst the Append Only File acts like a transaction journal recording all the changes in between snapshots - I recommend enabling both until you're comfortable with what each does and what your application needs. You can read all about Redis persistence at http://redis.io/topics/persistence.
Note: Redis also supports trivial replication, which you can read more about at: http://redis.io/topics/replication
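For reference, a sketch of what enabling both persistence modes looks like in redis.conf (the directives are standard; the thresholds below are just illustrative, not recommendations):

# RDB: snapshot if at least 1 key changed in 900s, or 10 keys in 300s
save 900 1
save 300 10
# AOF: log every write between snapshots, fsync roughly once per second
appendonly yes
appendfsync everysec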
Redis loves RAM
3) Since Redis operates predominantly in memory, the most important resource is having enough RAM to hold your entire dataset in memory, plus a buffer for when it snapshots to disk. Redis is very efficient, so even a small AWS instance will be able to handle a lot of load - what you want to look for is having enough RAM.
Visualizing your data with the Redis Admin UI
Finally, if you're using the ServiceStack C# Redis Client, I recommend installing the Redis Admin UI, which provides a nice visual view of your entities. You can see a live demo of it at:
http://servicestack.net/RedisAdminUI/AjaxClient/