Bijection - Java Avro Serialization

I am looking for an example of applying Bijection to an Avro SpecificRecordBase object, similar to the GenericRecord example below, or for a simpler way to use the AvroSerializer class as the Kafka key and value serializer.
Injection<GenericRecord, byte[]> genericRecordInjection =
GenericAvroCodecs.toBinary(schema);
byte[] bytes = genericRecordInjection.apply(type);

https://github.com/miguno/kafka-storm-starter provides such example code.
See, for instance, AvroDecoderBolt. From its javadocs:
This bolt expects incoming data in Avro-encoded binary format, serialized according to the Avro schema of T. It will deserialize the incoming data into a T pojo, and emit this pojo to downstream consumers. As such this bolt can be considered the Storm equivalent of Twitter Bijection's Injection.invert[T, Array[Byte]](bytes) for Avro data.
where
T: The type of the Avro record (e.g. a Tweet) based on the underlying Avro schema being used. Must be a subclass of Avro's SpecificRecordBase.
The key part of the code is (I collapsed the code into this snippet):
// With T <: SpecificRecordBase
implicit val specificAvroBinaryInjection: Injection[T, Array[Byte]] =
SpecificAvroCodecs.toBinary[T]
val bytes: Array[Byte] = ...; // the Avro-encoded data
val decodeTry: Try[T] = Injection.invert(bytes)
decodeTry match {
case Success(pojo) =>
System.out.println("Binary data decoded into pojo: " + pojo)
case Failure(e) => log.error("Could not decode binary data: " + Throwables.getStackTraceAsString(e))
}

Alternatively, in Java you can build the Injection for a specific record directly from its schema:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(new File("/Users/.../schema.avsc"));
Injection<Command, byte[]> objectInjection = SpecificAvroCodecs.toBinary(schema);
byte[] bytes = objectInjection.apply(c);
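For the Kafka serializer part of the question, here is a rough sketch (not from the answers above) of wrapping such an Injection in a Kafka Serializer, written in Kotlin. It assumes bijection-avro and kafka-clients are on the classpath, and Command (or any other generated SpecificRecordBase subclass) as the record type:
// Sketch only: expose a Bijection Injection as a Kafka Serializer.
import com.twitter.bijection.Injection
import com.twitter.bijection.avro.SpecificAvroCodecs
import org.apache.avro.Schema
import org.apache.avro.specific.SpecificRecordBase
import org.apache.kafka.common.serialization.Serializer

class BijectionAvroSerializer<T : SpecificRecordBase>(schema: Schema) : Serializer<T> {
    // The same Injection as in the answer above, built once per serializer instance.
    private val injection: Injection<T, ByteArray> = SpecificAvroCodecs.toBinary<T>(schema)

    override fun configure(configs: MutableMap<String, *>?, isKey: Boolean) { /* no-op */ }

    override fun serialize(topic: String?, data: T?): ByteArray? = data?.let { injection.apply(it) }

    override fun close() { /* no-op */ }
}
Because this sketch takes the Schema as a constructor argument, you would hand an instance directly to the KafkaProducer constructor rather than configuring it by class name (which would require a no-arg constructor).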

Related

How to read Key Value pair in spark SQL?

How do I get this output using Spark SQL or Scala? I have a table with columns storing such values and need to split them into separate columns.
It pretty much depends on what libraries you want to use (as you mentioned, Scala or Spark).
Using Spark:
import spark.implicits._ // needed for .toDS below
val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""
spark.read.json(Seq(rawJson).toDS)
Using common json libraries:
// play
Json.parse(rawJson) match {
case obj: JsObject =>
val values = obj.values
val keys = obj.keys
// construct dataframe having keys and values
case other => // handle other types (like JsArray, etc.)
}
// circe
import io.circe._, io.circe.parser._
parse(rawJson) match {
case Right(json) => // fetch key values, construct df, much like above
case Left(parseError) => ...
}
You can use almost any JSON library to parse your JSON object, and then convert it to a Spark DataFrame very easily.
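To illustrate the "almost any JSON library" point, here is a minimal sketch in Kotlin using Jackson (which also appears further down this page); it assumes jackson-module-kotlin is on the classpath and stops short of the DataFrame step:
// Sketch only: pull the keys and values out of the JSON string with Jackson.
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import com.fasterxml.jackson.module.kotlin.readValue

fun main() {
    val rawJson = """{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}"""
    // Keys become the column names, values become the single row.
    val row: Map<String, String> = jacksonObjectMapper().readValue(rawJson)
    println(row.keys)   // [Name, UploaddedById, UploadedByName]
    println(row.values) // [ABC.txt, xxxxx1123, James]
}
From there the keys and values can be turned into DataFrame columns with whichever Spark API you prefer.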

how do i store image in room database

How do I store an image in a Room database, from JSON?
When I tried to store the image in a Byte field it gave me this error: java.lang.NumberFormatException: For input string: "https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg"
Data class:
@Entity(tableName = "actor", indices = arrayOf(Index(value = arrayOf("id"), unique = true)))
data class ActorItem(
    @PrimaryKey(autoGenerate = true)
    val id_: Int,
    @SerializedName("age")
    @ColumnInfo(name = "age")
    val age: String,
    @SerializedName("id")
    @ColumnInfo(name = "id")
    val id: Int,
    @SerializedName("image")
    @ColumnInfo(name = "image")
    val image: Byte,
    @SerializedName("name")
    @ColumnInfo(name = "name")
    val name: String
)
Here is the JSON:
[
{"id": 1,
"name": "Hero",
"image": "https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg",
"age": "23"}
]
Are you trying to store the path to an image in the DB, or the image itself?
If the path, then image would be of type String (not Byte). Storing a path (or name) in the DB, and not the image contents, is generally considered best practice. It requires downloading and saving the image in the filesystem outside of the DB.
If you are trying to store the image contents in the DB though, then the field type would be ByteArray (not Byte), the column type would be Blob, and you will need to download the image and write the bytes in to the DB yourself.
Related:
How insert image in room persistence library?
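As a rough sketch of those two options (a trimmed-down, hypothetical entity, not the full class from the question):
// Sketch only: the two shapes the "image" field could take, per the answer above.
import androidx.room.ColumnInfo
import androidx.room.Entity
import androidx.room.PrimaryKey

// Option 1 (generally preferred): store only a path or URL; the column affinity is TEXT.
@Entity(tableName = "actor")
data class ActorPathItem(
    @PrimaryKey(autoGenerate = true) val id_: Int,
    @ColumnInfo(name = "image") val imagePath: String,
    @ColumnInfo(name = "name") val name: String
)

// Option 2: store the downloaded bytes themselves; the column affinity is BLOB.
@Entity(tableName = "actor_blob")
data class ActorBlobItem(
    @PrimaryKey(autoGenerate = true) val id_: Int,
    @ColumnInfo(name = "image") val image: ByteArray,
    @ColumnInfo(name = "name") val name: String
)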
A simple solution is to store the image file in an internal directory (app-specific storage) and store the internal private path in your Room database column.
Internal app storage docs:
https://developer.android.com/training/data-storage/app-specific
Or make a Base64 encoding of the image file and store the Base64 string in your database column.
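A rough sketch of both suggestions (the function names here are made up for illustration, and the download must run off the main thread):
// Sketch only: save the downloaded image into app-specific storage and keep the path,
// or Base64-encode the bytes instead.
import android.content.Context
import java.io.File
import java.net.URL
import java.util.Base64 // API 26+; use android.util.Base64 on older API levels

// Option 1: write the bytes to app-specific storage and store only the returned path.
fun saveImageToInternalStorage(context: Context, url: String, fileName: String): String {
    val bytes = URL(url).readBytes()            // network: run on a background thread/coroutine
    val file = File(context.filesDir, fileName) // app-specific storage, no permissions needed
    file.writeBytes(bytes)
    return file.absolutePath                    // store this String in the Room column
}

// Option 2: store a Base64 String column instead of the raw bytes.
fun toBase64(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes)
fun fromBase64(encoded: String): ByteArray = Base64.getDecoder().decode(encoded)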
From Json. as i was trying to store with Byte it gives me error - java.lang.NumberFormatException: For input string:
What number between 0 and 255 does https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg resolve to? 0? 1? 2? ... 255? (Rhetorical) Why? (Rhetorical)
how do i store image in room database
You VERY PROBABLY SHOULD NOT; instead store the image as a file in the app's storage (there would be no real difference in storage space, but probably a very noticeable improvement in response times).
The value https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg is, to many humans, obviously a link to a file that can be downloaded (an image). It does not equate to a byte or any number.
However, the file stored at that location is a series (aka stream) of bytes.
Thus in Room you would use a ByteArray (to which Room will assign a type affinity of BLOB for the column). So the field type should either be ByteArray or a type with a type converter, where the type converter returns a ByteArray.
So instead of val image: Byte, you would probably have val image: ByteArray.
To get the ByteArray you could (assuming permissions etc. are all set up) use something like (but not restricted to):-
return URL(url).readBytes()
where url, in your case, would be the String https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg
IMPORTANT
However, at 2.7MB that image is very likely to cause issues: not due to SQLite limitations but due to limitations of the Android API, which retrieves data from the SQLite database via a Cursor, a buffer that is limited in size (4MB). As such any image that is close to 4MB may be stored, but it couldn't be retrieved without complications AND highly inefficient/slow processing.
Demonstration of why NOT to store images, like the one mentioned in the question, in the database
As a demonstration, consider the following, which stores the equivalent of images (not actual images) in the image column of your table
(i.e. ByteArrays; unless the image is actually displayed the content is irrelevant, as neither Room nor SQLite knows the difference between an image and any other BLOB value)
using a slightly modified version of your ActorItem class, as :-
@Entity(tableName = "actor", indices = arrayOf(Index(value = arrayOf("id"), unique = true)))
data class ActorItem(
    @PrimaryKey(autoGenerate = true)
    val id_: Int,
    @ColumnInfo(name = "age")
    val age: String,
    @ColumnInfo(name = "id")
    val id: Int,
    @ColumnInfo(name = "image")
    val image: ByteArray,
    @ColumnInfo(name = "name")
    val name: String
) {
    @androidx.room.Dao
    interface Dao {
        @Insert
        fun insert(actorItem: ActorItem)
    }
}
i.e. the important difference is the type ByteArray, as opposed to Byte, for the image column
for brevity/convenience the DAO class has been included (it is sufficient just to insert some rows to demonstrate why saving the image is not a very good idea)
Accompanying it is a @Database class, TheDatabase :-
@Database(entities = [ActorItem::class], version = 1, exportSchema = false)
abstract class TheDatabase : RoomDatabase() {
    abstract fun getActorItemDao(): ActorItem.Dao

    companion object {
        private var instance: TheDatabase? = null
        fun getInstance(context: Context): TheDatabase {
            if (instance == null) {
                instance = Room.databaseBuilder(context, TheDatabase::class.java, "thedatabase.db")
                    .allowMainThreadQueries()
                    .build()
            }
            return instance as TheDatabase
        }
    }
}
allowMainThreadQueries included for brevity and convenience
Finally putting the above into action via an activity is MainActivity :-
class MainActivity : AppCompatActivity() {
    lateinit var db: TheDatabase
    lateinit var dao: ActorItem.Dao

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        db = TheDatabase.getInstance(this)
        dao = db.getActorItemDao()
        val url = "https://upload.wikimedia.org/wikipedia/commons/4/41/Sunflower_from_Silesia2.jpg"
        try {
            for (i in 1..100) {
                dao.insert(
                    ActorItem(
                        0, "The age", i,
                        getBitmapFromURLAsString(
                            url,
                            /* The ByteArray (bitmap) */
                            /* BUT to demonstrate the size issues:
                               start at 1MB + 8KB, incrementing by 8KB per row;
                               the last row would be 1.8MB (1MB + 100 * 8KB)
                            */
                            i * (4096 * 2) + (1024 * 1024)),
                        getNameFromURLAsString(url)
                    )
                )
            }
        } catch (e: Exception) {
            e.printStackTrace()
        }
    }

    fun getBitmapFromURLAsString(url: String, size: Int): ByteArray {
        /* Fake the byte array, allowing the size of the ByteArray to be specified */
        val ba = ByteArray(size)
        var byte: Byte = 8
        for (i in 0 until size) {
            ba[i] = byte++.mod(Byte.MAX_VALUE)
        }
        return ba
        /* The real thing would be something like the following */
        /* WARNING: the image Sunflower_from_Silesia2.jpg is 2.7MB and will likely cause issues */
        //return URL(url).readBytes()
    }

    fun getNameFromURLAsString(url: String): String {
        val split = url.split("/")
        return split.get(split.size - 1)
    }
}
So the activity will try to insert 100 rows with a ByteArray in the image column (which answers how to store an image, in principle). For each row the size of the ByteArray is increased by 8KB (the first row is 1MB plus 8KB in size). The name column is set to the file name taken from the URL.
The above runs successfully without any trapped exceptions, and all 100 rows are inserted.
Using a query to extract the length of each image column shows the sizes (of the last rows).
First warning sign that things are perhaps not that good
Running the query takes hardly any time at all, but refreshing and moving from start to end of the table view takes quite a bit of time (about a minute, depending upon the PC/laptop used).
Second warning sign
Running the App takes a few seconds.
Third warning sign
Use i * (4096 * 2) + (1024 * 1024 * 2) (i.e. start at just over 2MB, up to 2.8MB), run the App and try to view the data via App Inspection.
The rows exist and have the expected data in them, but when trying to look at the Actor table, the Database Inspector doesn't show the contents.
Run the query SELECT substr(image, 1024 * 1024) FROM actor (i.e. about 1MB for the first row up to 1.8MB for the 100th row), WAIT (for a minute or so), scroll to the last row, and WAIT (for a minute or so) again.
You should use ByteArray (in Java this corresponds to byte[]). SQLite supports saving byte arrays like this: How to store image in SQLite database
And in a Room database you can just use the ByteArray type for your image, and Room will take care of the rest.
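If you would rather have the entity expose a Bitmap than a raw ByteArray, a small Room TypeConverter pair can bridge the two. This is a sketch of that idea (an addition, not from the answer above), and the CursorWindow size caveat from the earlier answer still applies to large images:
// Sketch only: convert Bitmap <-> ByteArray so Room can persist the image as a BLOB.
import android.graphics.Bitmap
import android.graphics.BitmapFactory
import androidx.room.TypeConverter
import java.io.ByteArrayOutputStream

class BitmapConverters {
    @TypeConverter
    fun bitmapToBytes(bitmap: Bitmap): ByteArray {
        val stream = ByteArrayOutputStream()
        // PNG is lossless; the quality argument is ignored for PNG.
        bitmap.compress(Bitmap.CompressFormat.PNG, 100, stream)
        return stream.toByteArray()
    }

    @TypeConverter
    fun bytesToBitmap(bytes: ByteArray): Bitmap =
        BitmapFactory.decodeByteArray(bytes, 0, bytes.size)
}
Register the class with @TypeConverters(BitmapConverters::class) on the @Database class.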

Retrieving data from CBOR ByteArray

I am trying to serialize a map into CBOR in Kotlin with the Jackson CBOR Dataformats library. This works fine if the key is a String: I can retrieve the value of that key easily. But when the key is an Int, every get returns null; if I print the output of values(), it gives me all values from all keys.
Code looks like this :
val mapper = CBORMapper()
val map = HashMap<Any,Any>()
map[123] = intArrayOf(22,67,2)
map[456] = intArrayOf(34,12,1)
val cborData = mapper.writeValueAsBytes(map)
println(cborData.toHex())
val deserialized = mapper.readValue(cborData, HashMap<Any,Any>().javaClass)
println(deserialized.get(123)) // returns null
println(deserialized.values) // returns all values
Try to iterate over the keys and check their type:
deserialized.keys.forEach { key -> println("$key - ${key.javaClass}") }
The above, in your case, should print:
123 - class java.lang.String
456 - class java.lang.String
And:
println(deserialized.get("123"))
prints:
[22, 67, 2]
Take a look at the documentation:
Module extends standard Jackson streaming API (JsonFactory, JsonParser, JsonGenerator), and as such works seamlessly with all the higher level data abstractions (data binding, tree model, and pluggable extensions).
You can force type using Kotlin's readValue method:
import com.fasterxml.jackson.module.kotlin.readValue
and use it like this:
val deserialized = mapper.readValue<Map<Int, IntArray>>(cborData)
deserialized.keys.forEach { key -> println("$key - ${key.javaClass}") }
println(Arrays.toString(deserialized[123]))
Above code prints:
456 - int
123 - int
[22, 67, 2]
See also:
How to use jackson to deserialize to Kotlin collections

What are the advantages of using tf.train.SequenceExample over tf.train.Example for variable length features?

Recently I read this guide on undocumented features in TensorFlow, as I needed to pass variable-length sequences as input. However, I found the protocol for tf.train.SequenceExample relatively confusing (especially due to the lack of documentation), and managed to build an input pipeline using tf.train.Example just fine instead.
Are there any advantages to using tf.train.SequenceExample? Using the standard Example protocol when there is a dedicated one for variable-length sequences seems like a cheat, but does it bear any consequence?
Here are the definitions of the Example and SequenceExample protocol buffers, and all the protos they may contain:
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
oneof kind {
BytesList bytes_list = 1;
FloatList float_list = 2;
Int64List int64_list = 3;
}
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
Features context = 1;
FeatureLists feature_lists = 2;
};
An Example contains a Features, which contains a mapping from feature name to Feature, which contains either a bytes list, or a float list or an int64 list.
A SequenceExample also contains a Features, but it also contains a FeatureLists, which contains a mapping from list name to FeatureList, which contains a list of Feature. So it can do everything an Example can do, and more. But do you really need that extra functionality? What does it do?
Since each Feature contains a list of values, a FeatureList is a list of lists. And that's the key: if you need lists of lists of values, then you need SequenceExample.
For example, if you handle text, you can represent it as one big string:
from tensorflow.train import BytesList
BytesList(value=[b"This is the first sentence. And here's another."])
Or you could represent it as a list of words and tokens:
BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b".", b"And", b"here",
b"'s", b"another", b"."])
Or you could represent each sentence separately. That's where you would need a list of lists:
from tensorflow.train import BytesList, Feature, FeatureList
s1 = BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b"."])
s2 = BytesList(value=[b"And", b"here", b"'s", b"another", b"."])
fl = FeatureList(feature=[Feature(bytes_list=s1), Feature(bytes_list=s2)])
Then create the SequenceExample:
from tensorflow.train import SequenceExample, FeatureLists
seq = SequenceExample(feature_lists=FeatureLists(feature_list={
"sentences": fl
}))
And you can serialize it and perhaps save it to a TFRecord file.
data = seq.SerializeToString()
Later, when you read the data, you can parse it using tf.io.parse_single_sequence_example().
The link you provided lists some benefits. You can see how parse_single_sequence_example is used here https://github.com/tensorflow/magenta/blob/master/magenta/common/sequence_example_lib.py
If you managed to get the data into your model with Example, it should be fine. SequenceExample just gives a little more structure to your data and some utilities for working with it.

Avro specific vs generic record types - which is best or can I convert between?

We’re trying to decide between providing generic vs specific record formats for consumption by our clients
with an eye to providing an online schema registry clients can access when the schemas are updated.
We expect to send out serialized blobs prefixed with a few bytes denoting the version number so schema
retrieval from our registry can be automated.
Now, we’ve come across code examples illustrating the relative adaptability of the generic format for
schema changes but we’re reluctant to give up the type safety and ease-of-use provided by the specific
format.
Is there a way to obtain the best of both worlds? I.e. could we work with and manipulate the specific generated
classes internally, and then have them converted to generic records automatically just before serialization?
Clients would then deserialize the generic records (after looking up the schema).
Also, could clients convert these generic records they received to specific ones at a later time? Some small code examples would be helpful!
Or are we looking at this all the wrong way?
What you are looking for is the Confluent Schema Registry service and the libraries that help integrate with it.
Here is a sample showing how to serialize and deserialize Avro data with an evolving schema. Please note the sample is from Kafka.
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import java.util.HashMap;
import java.util.Map;

public class ConfluentSchemaService {

    public static final String TOPIC = "DUMMYTOPIC";

    private KafkaAvroSerializer avroSerializer;
    private KafkaAvroDeserializer avroDeserializer;

    public ConfluentSchemaService(String conFluentSchemaRigistryURL) {
        //PropertiesMap
        Map<String, String> propMap = new HashMap<>();
        propMap.put("schema.registry.url", conFluentSchemaRigistryURL);
        // Output afterDeserialize should be a specific Record and not Generic Record
        propMap.put("specific.avro.reader", "true");

        avroSerializer = new KafkaAvroSerializer();
        avroSerializer.configure(propMap, true);

        avroDeserializer = new KafkaAvroDeserializer();
        avroDeserializer.configure(propMap, true);
    }

    public String hexBytesToString(byte[] inputBytes) {
        return Hex.encodeHexString(inputBytes);
    }

    public byte[] hexStringToBytes(String hexEncodedString) throws DecoderException {
        return Hex.decodeHex(hexEncodedString.toCharArray());
    }

    public byte[] serializeAvroPOJOToBytes(GenericRecord avroRecord) {
        return avroSerializer.serialize(TOPIC, avroRecord);
    }

    public Object deserializeBytesToAvroPOJO(byte[] avroBytearray) {
        return avroDeserializer.deserialize(TOPIC, avroBytearray);
    }
}
The following classes have all the code you are looking for:
io.confluent.kafka.serializers.KafkaAvroDeserializer
io.confluent.kafka.serializers.KafkaAvroSerializer
Please follow this link for more details:
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
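A quick usage sketch of the class above, written in Kotlin. The registry URL and MyRecord (a class generated from your Avro schema, with a hypothetical name field) are placeholders, and it relies on generated classes extending SpecificRecordBase, which also implements GenericRecord in current Avro versions:
// Hypothetical usage of the ConfluentSchemaService shown above.
fun main() {
    val service = ConfluentSchemaService("http://localhost:8081") // placeholder registry URL

    // MyRecord is a placeholder for your Avro-generated class; its builder/fields are assumptions.
    val record = MyRecord.newBuilder().setName("example").build()

    val bytes = service.serializeAvroPOJOToBytes(record) // schema is registered/looked up for TOPIC
    println(service.hexBytesToString(bytes))

    // Because specific.avro.reader=true, this returns the specific class, not a GenericRecord.
    val roundTripped = service.deserializeBytesToAvroPOJO(bytes) as MyRecord
    println(roundTripped)
}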
Can I convert between them?
I wrote the following Kotlin code to convert from a SpecificRecord to a GenericRecord and back - via JSON.
PositionReport is an object generated from Avro with the Avro plugin for Gradle - it is:
@org.apache.avro.specific.AvroGenerated
public class PositionReport extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
...
The functions used are below
/**
 * Encodes a record in AVRO Compatible JSON, meaning union types
 * are wrapped. For prettier JSON just use the Object Mapper
 * @param pos PositionReport
 * @return String
 */
private fun PositionReport.toAvroJson(): String {
    val writer = SpecificDatumWriter(PositionReport::class.java)
    val baos = ByteArrayOutputStream()
    val jsonEncoder = EncoderFactory.get().jsonEncoder(this.schema, baos)
    writer.write(this, jsonEncoder)
    jsonEncoder.flush()
    return baos.toString("UTF-8")
}
/**
 * Converts from Generic Record into JSON - Seems smarter, however,
 * to unify this function and the one above but whatevs
 * @param record GenericRecord
 * @param schema Schema
 */
private fun GenericRecord.toAvroJson(): String {
    val writer = GenericDatumWriter<Any>(this.schema)
    val baos = ByteArrayOutputStream()
    val jsonEncoder = EncoderFactory.get().jsonEncoder(this.schema, baos)
    writer.write(this, jsonEncoder)
    jsonEncoder.flush()
    return baos.toString("UTF-8")
}
/**
 * Takes a Generic Record of a position report and hopefully turns
 * it into a position report... maybe it will work
 * @param gen GenericRecord
 * @return PositionReport
 */
private fun toPosition(gen: GenericRecord): PositionReport {
    if (gen.schema != PositionReport.getClassSchema()) {
        throw Exception("Cannot convert GenericRecord to PositionReport as the Schemas do not match")
    }
    // We will convert into JSON - and use that to then convert back to the SpecificRecord
    // Probably there is a better way
    val json = gen.toAvroJson()
    val reader: DatumReader<PositionReport> = SpecificDatumReader(PositionReport::class.java)
    val decoder: Decoder = DecoderFactory.get().jsonDecoder(PositionReport.getClassSchema(), json)
    val pos = reader.read(null, decoder)
    return pos
}
/**
 * Converts a Specific Record to a Generic Record (I think)
 * @param pos PositionReport
 * @return GenericData.Record
 */
private fun toGenericRecord(pos: PositionReport): GenericData.Record {
    val json = pos.toAvroJson()
    val reader: DatumReader<GenericData.Record> = GenericDatumReader(pos.schema)
    val decoder: Decoder = DecoderFactory.get().jsonDecoder(pos.schema, json)
    val datum = reader.read(null, decoder)
    return datum
}
There are a couple of differences, however, between the two:
Fields in the SpecificRecord that are of Instant type will be encoded in the GenericRecord as long, and enums are handled slightly differently.
So, for example, in my unit test of this function time fields are tested like this:
val gen = toGenericRecord(basePosition)
assertEquals(basePosition.getIgtd().toEpochMilli(), gen.get("igtd"))
And enums are validated by string
val gen = toGenericRecord(basePosition)
assertEquals(basePosition.getSource().toString(), gen.get("source").toString())
So to convert between you can do:
val gen = toGenericRecord(basePosition)
val newPos = toPosition(gen)
assertEquals(newPos, basePosition)
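As a side note (not part of the answer above), the same two conversions can also be sketched without going through JSON, by round-tripping through Avro binary; this assumes the generic record was written with exactly PositionReport's schema, and the same caveats about Instant/enum fields apply:
// Sketch only: SpecificRecord <-> GenericRecord via Avro binary instead of JSON.
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericDatumWriter
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DecoderFactory
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.SpecificDatumReader
import org.apache.avro.specific.SpecificDatumWriter
import java.io.ByteArrayOutputStream

private fun PositionReport.toGenericViaBinary(): GenericData.Record {
    val baos = ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(baos, null)
    SpecificDatumWriter(PositionReport::class.java).write(this, encoder)
    encoder.flush()
    val decoder = DecoderFactory.get().binaryDecoder(baos.toByteArray(), null)
    return GenericDatumReader<GenericData.Record>(schema).read(null, decoder)
}

private fun GenericRecord.toPositionViaBinary(): PositionReport {
    val baos = ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(baos, null)
    GenericDatumWriter<GenericRecord>(schema).write(this, encoder)
    encoder.flush()
    val decoder = DecoderFactory.get().binaryDecoder(baos.toByteArray(), null)
    return SpecificDatumReader(PositionReport::class.java).read(null, decoder)
}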